Identifying causal effects is an integral part of scientific inquiry, spanning a wide range of questions such as understanding behavior in online systems, the effects of social policies, or risk factors for disease. However, current data mining and machine learning methods focus on prediction, often ignoring the goal of causal inference. This is partly because inferring causality from observed data is hard unless we make strong assumptions about the data-generating process.
In this talk, I will show that we can use properties of the observed data to
test many of these strong assumptions, thus enabling a data mining framework for estimating causal effects. The key idea is to look for naturally occurring variations in the data---"natural experiments"---that resemble an actual experiment. I will present two such methods. The first utilizes auxiliary data from large-scale systems to automate the search for natural experiments. Applying it to estimate the additional activity caused by Amazon's recommendation system, I find over 20,000 natural experiments, an order of magnitude more than in past work. These experiments indicate that less than half of the click-throughs typically attributed to the recommendation system are causal; the rest would have happened anyway. The second method proposes a general test for validating natural experiments in observed data, a problem widely considered impossible. Results from the test show that many natural experiments used in recent studies from a premier economics journal are likely invalid. More generally, the proposed framework presents a viable way of doing causal inference on large-scale datasets with minimal assumptions.
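To make the natural-experiment idea concrete, here is a minimal sketch with entirely hypothetical numbers (the function name, counts, and scenario are illustrative, not from the talk's actual data or method). Suppose an exogenous shock suddenly raises traffic to a focal product; the extra visits flowing to its recommended product during the shock, per extra focal visit, give a causal estimate of the recommender's click-through effect, which can be compared with the naive attribution that counts every logged recommendation click:

```python
def causal_ctr(focal_before, focal_after, rec_before, rec_after):
    """Natural-experiment estimate: extra recommendation-driven visits
    per extra focal-product visit induced by an exogenous shock."""
    return (rec_after - rec_before) / (focal_after - focal_before)

# Hypothetical daily visit counts around a demand shock to the focal product.
focal_before, focal_after = 1000, 3000  # visits to the focal product page
rec_before, rec_after = 200, 500        # visits arriving via its recommendation

naive_ctr = rec_after / focal_after  # treats every rec click as causal
causal_est = causal_ctr(focal_before, focal_after, rec_before, rec_after)
print(round(naive_ctr, 3), round(causal_est, 3))
```

In this toy scenario the causal estimate (0.15) is lower than the naive one (0.167), mirroring the abstract's finding that naive attribution overstates the recommender's effect: some visitors would have arrived anyway.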