The Impact of Machine Learning on Economics

I believe that machine learning (ML) will have a dramatic impact on the field of economics within a short time frame. Indeed, the impact of ML on economics is already well underway, and so it is perhaps not too difficult to predict some of the effects.

The chapter begins by stating the definition of ML that I will use in this chapter, describing its strengths and weaknesses, and contrasting ML with traditional econometrics tools for causal inference, which is a primary focus of the empirical economics literature.

Next, I review some applications of ML in economics where ML can be used off the shelf: the use case in economics is essentially the same use case that the ML tools were designed and optimized for. I then review “prediction policy” problems (Kleinberg et al. 2015), where prediction tools have been embedded in the context of economic decision-making.

Then, I provide an overview of the questions considered and early themes of the emerging literature in econometrics and statistics combining machine learning and causal inference, a literature that is providing insights and theoretical results that are novel from the perspective of both ML and statistics/econometrics.

Finally, I step back and describe the implications for the field of economics as a whole. Throughout, I make reference to the literature broadly, but do not attempt to conduct a comprehensive survey or reference every application in economics.

The chapter highlights several themes. A first theme is that ML does not add much to questions about identification, which concerns when the object of interest, for example, a causal effect, can be estimated with infinite data, but rather yields great improvements when the goal is semiparametric estimation or when there are a large number of covariates relative to the number of observations.

Machine learning has great strengths in using data to select functional forms flexibly. A second theme is that a key advantage of ML is that ML views empirical analysis as “algorithms” that estimate and compare many alternative models.

This approach contrasts with economics, where (in principle, though rarely in reality) the researcher picks a model based on principles and estimates it once. Instead, ML algorithms build in “tuning” as part of the algorithm.

The tuning is essentially model selection, and in an ML algorithm it is data driven. There are a whole host of advantages of this approach, including improved performance as well as enabling researchers to be systematic and fully describe the process by which their model was selected.

Of course, cross-validation has also been used historically in economics, for example, for selecting the bandwidth for a kernel regression, but it is viewed as a fundamental part of an algorithm in ML.

A third, closely related theme is that “outsourcing” model selection to algorithms works very well when the problem is “simple,” for example, in prediction and classification tasks, where performance of a model can be evaluated by looking at goodness of fit in a held-out test set.

Those are typically not the problems of greatest interest for empirical researchers in economics, who instead are concerned with causal inference, where there is typically not an unbiased estimate of the ground truth available for comparison.

Thus, more work is required to apply an algorithmic approach to economic problems. The recent literature at the intersection of ML and causal inference, reviewed in this chapter, has focused on providing the conceptual framework and specific proposals for algorithms that are tailored for causal inference.

A fourth theme is that the algorithms also have to be modified to provide valid confidence intervals for estimated effects when the data is used to select the model. Many recent papers make use of techniques such as sample splitting, leave-one-out estimation, and other similar techniques to provide confidence intervals that work both in theory and in practice.

The upside is that using ML can provide the best of both worlds: the model selection is data driven, systematic, and a wide range of models are considered; yet, the model-selection process is fully documented, and confidence intervals take into account the entire algorithm.

Finally, the combination of ML and newly available data sets will change economics in fairly fundamental ways ranging from new questions, to new approaches, to collaboration (larger teams and interdisciplinary interaction), to a change in how involved economists are in the engineering and implementation of policies.

What Is Machine Learning and What Are Early Use Cases?

It is harder than one might think to come up with an operational definition of ML. The term can be (and has been) used broadly or narrowly; it can refer to a collection of subfields of computer science, but also to a set of topics that are developed and used across computer science, engineering, statistics, and increasingly the social sciences.

Indeed, one could devote an entire article to the definition of ML, or to the question of whether the thing called ML really needed a new name other than statistics, the distinction between ML and AI, and so on.

However, I will leave this debate to others and focus on a narrow, practical definition that will make it easier to distinguish ML from the approaches most commonly used in applied econometrics until very recently.

For readers coming from a machine-learning background, it is also important to note that applied statistics and econometrics have developed a body of insights on topics ranging from causal inference to efficiency that have not yet been incorporated in mainstream machine learning, while other parts of machine learning have overlap with methods that have been used in applied statistics and social sciences for many decades.

Starting from a relatively narrow definition of machine learning, machine learning is a field that develops algorithms designed to be applied to data sets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks.

These tasks are divided into two main branches, supervised and unsupervised ML. Unsupervised ML involves finding clusters of observations that are similar in terms of their covariates, and thus can be interpreted as “dimensionality reduction”; it is commonly used for video, images, and text.

There are a variety of techniques available for unsupervised learning, including k-means clustering, topic modeling, community detection methods for networks, and many more. For example, the Latent Dirichlet Allocation model (Blei, Ng, and Jordan 2003) has frequently been applied to find “topics” in textual data.

The output of a typical unsupervised ML model is a partition of the set of observations, where observations within each element of the partition are similar according to some metric, or a vector of probabilities or weights that describe a mixture of topics or groups that an observation might belong to.
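To make the idea of a partition concrete, the following is a minimal sketch of k-means clustering (Lloyd's algorithm) using only the Python standard library. The simulated two-group data, the function names `nearest` and `kmeans`, and the choice of starting centers are all illustrative assumptions, not part of any particular application discussed in this chapter.

```python
import random

def nearest(pt, centers):
    """Index of the center closest to pt (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers]
    return dists.index(min(dists))

def kmeans(points, centers, n_iter=20):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest center.
        labels = [nearest(pt, centers) for pt in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:              # converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of unlabeled observations.
random.seed(4)
group_a = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)]
group_b = [(random.gauss(6, 0.5), random.gauss(6, 0.5)) for _ in range(50)]
points = group_a + group_b

# Starting centers are an arbitrary choice made for this illustration.
labels, centers = kmeans(points, centers=[points[0], points[-1]])
```

The returned `labels` list is the partition: the algorithm never sees which group generated each point, and it is up to the analyst to inspect each cluster and decide what it represents.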

If you read in the newspaper that a computer scientist “discovered cats on YouTube,” that might mean that they used an unsupervised ML method to partition a set of videos into groups, and when a human watches the largest group, they observe that most of the videos in that group contain cats.

This is referred to as “unsupervised” because there were no “labels” on any of the images in the input data; only after examining the items in each group does an observer determine that the algorithm found cats or dogs.

Not all dimensionality reduction methods involve creating clusters; older methods such as principal components analysis can be used to reduce dimensionality, while modern methods include matrix factorization (finding two low-dimensional matrices whose product well approximates a larger matrix), regularization on the norm of a matrix, hierarchical Poisson factorization (in a Bayesian framework) (Gopalan, Hofman, and Blei 2015), and neural networks. In my view, these tools are very useful as an intermediate step in empirical work in economics.

They provide a data-driven way to find similar newspaper articles, restaurant reviews, and so forth, and thus create variables that can be used in economic analyses. These variables might be part of the construction of either outcome variables or explanatory variables, depending on the context.

For example, if an analyst wishes to estimate a model of consumer demand for different items, it is common to model consumer preferences over characteristics of the items. Many items are associated with text descriptions as well as online reviews.

Unsupervised learning could be used to discover items with similar product descriptions in an initial phase of finding potentially related products, and it could also be used to find subgroups of similar products. Unsupervised learning could further be used to categorize the reviews into types.

An indicator for the review group could be used in subsequent analysis without the analyst having to use human judgement about the review content; the data would reveal whether a certain type of review was associated with higher consumer perceived quality, or not.

An advantage of using unsupervised learning to create covariates is that the outcome data is not used at all; thus, concerns about spurious correlation between constructed covariates and the observed outcome are less problematic.

Despite this, Egami et al. (2016) have argued that researchers may be tempted to fine-tune their construction of covariates by testing how they perform in terms of predicting outcomes, thus leading to spurious relationships between covariates and outcomes.

They recommend the approach of sample splitting, whereby the model tuning takes place on one sample of data, and then the selected model is applied on a fresh sample of data. Unsupervised learning can also be used to create outcome variables.

For example, Athey, Mobius, and Pál (2017) examine the impact of Google’s shutdown of Google News in Spain on the types of news consumers read. In this case, the share of news in different categories is an outcome of interest.

Unsupervised learning can be used to categorize news in this type of analysis; that paper uses community detection techniques from network theory. In the absence of dimensionality reduction, it would be difficult to meaningfully summarize the impact of the shutdown on all of the different news articles consumed in the relevant time frame.

Supervised machine learning typically entails using a set of features or covariates (X) to predict an outcome (Y).

When using the term prediction, it is important to emphasize that the framework focuses not on forecasting, but rather on a setting where there are some labeled observations where both X and Y are observed (the training data), and the goal is to predict outcomes (Y) in an independent test set based on the realized values of X for each unit in the test set.

In other words, the goal is to construct μ̂(x), which is an estimator of μ(x) = E[Y | X = x], in order to do a good job predicting the true values of Y in an independent data set. The observations are assumed to be independent, and the joint distribution of X and Y in the training set is the same as that in the test set.

These assumptions are the only substantive assumptions required for most machine-learning methods to work. In the case of classification, the goal is to accurately classify observations. For example, the outcome could be the animal depicted in an image, the “features” or covariates are the pixels in the image, and the goal is to assign each image to the animal it depicts.

A related but distinct estimation problem is to estimate Pr(Y = k | X = x) for each of k = 1, ..., K possible realizations of Y. It is important to emphasize that the ML literature does not frame itself as solving estimation problems, so estimating μ(x) or Pr(Y = k | X = x) is not the primary goal.

Instead, the goal is to achieve goodness of fit in an independent test set by minimizing deviations between actual outcomes and predicted outcomes. In applied econometrics, we often wish to understand an object like μ(x) in order to perform exercises like evaluating the impact of changing one covariate while holding others constant.

This is not an explicit aim of ML modeling. There are a variety of ML methods for supervised learning, such as regularized regression (LASSO, ridge and elastic net), random forest, regression trees, support vector machines, neural nets, matrix factorization, and many others, such as model averaging.

See Varian (2014) for an overview of some of the most popular methods and Mullainathan and Spiess (2017) for more details.

(Also note that White [1992] attempted to popularize neural nets in economics in the early 1990s, but at the time they did not lead to substantial performance improvements and did not become popular in economics.)

What leads us to categorize these methods as ML methods rather than traditional econometric or statistical methods? First is simply an observation: until recently, these methods were neither used in published social science research, nor taught in social science courses, while they were widely studied in the self-described ML and/or “statistical learning” literatures.

One exception is ridge regression, which received some attention in economics, and LASSO had also received some attention. But from a more functional perspective, one common feature of many ML methods is that they use data-driven model selection.

That is, the analyst provides the list of covariates or features, but the functional form is at least in part determined as a function of the data. Rather than performing a single estimation (as is done, at least in theory, in econometrics), the method is better described as an algorithm that might estimate many alternative models and then select among them to maximize a criterion.

There is typically a trade-off between expressiveness of the model (e.g., more covariates included in a linear regression) and risk of overfitting, which occurs when the model is too rich relative to the sample size. (See Mullainathan and Spiess [2017] for more discussion of this.)

In the latter case, the goodness of fit of the model when measured on the sample where the model is estimated is expected to be much better than the goodness of fit of the model when evaluated on an independent test set.

The ML literature uses a variety of techniques to balance expressiveness against overfitting. The most common approach is cross-validation, whereby the analyst repeatedly estimates a model on part of the data (a “training fold”) and then evaluates it on the complement (the “test fold”).

The complexity of the model is selected to minimize the average of the mean-squared error of the prediction (the squared difference between the model prediction and the actual outcome) on the test folds.
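The cross-validation procedure just described can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are simulated, the model is a k-nearest-neighbor regression (chosen because it needs no linear algebra), and the complexity parameter being tuned is the number of neighbors `k`, where a smaller `k` means a more expressive model. The helper names `knn_predict` and `cv_mse` are hypothetical.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the outcomes of the k training points nearest to x."""
    nearest = sorted(train, key=lambda obs: abs(obs[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def cv_mse(data, k, n_folds=5):
    """Average mean-squared prediction error across held-out test folds."""
    fold_errors = []
    for f in range(n_folds):
        test = data[f::n_folds]  # every n_folds-th observation is held out
        train = [obs for i, obs in enumerate(data) if i % n_folds != f]
        errors = [(y - knn_predict(train, x, k)) ** 2 for x, y in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds

# Hypothetical data: a smooth signal plus noise.
random.seed(0)
data = []
for _ in range(300):
    x = random.uniform(0, 6)
    data.append((x, math.sin(x) + random.gauss(0, 0.5)))

# Estimate each candidate model on the training folds, score it on the test
# folds, and keep the complexity level with the lowest average error.
candidates = [1, 5, 15, 50, 150]
scores = {k: cv_mse(data, k) for k in candidates}
best_k = min(scores, key=scores.get)
```

In this design the most flexible model (`k = 1`) fits its own training sample perfectly yet predicts poorly out of sample, while the most rigid model (`k = 150`) smooths away the signal; cross-validation picks an intermediate value without the analyst specifying it in advance.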

Other approaches used to control overfitting include averaging many different models, sometimes estimating each model on a subsample of the data (one can interpret the random forest in this way).

In contrast, in much of cross-sectional econometrics and empirical work in economics, the tradition has been that the researcher specifies one model, estimates the model on the full data set, and relies on statistical theory to estimate confidence intervals for estimated parameters.

The focus is on the estimated effects rather than the goodness of fit of the model. For much empirical work in economics, the primary interest is in the estimate of a causal effect, such as the effect of a training program, a minimum wage increase, or a price increase.

The researcher might check robustness of this parameter estimate by reporting two or three alternative specifications.

Researchers often check dozens or even hundreds of alternative specifications behind the scenes, but rarely report this practice because it would invalidate the confidence intervals reported (due to concerns about multiple testing and searching for specifications with the desired results).

There are many disadvantages to the traditional approach, including but not limited to the fact that researchers would find it difficult to be systematic or comprehensive in checking alternative specifications, and further because researchers were not honest about the practice, given that they did not have a way to correct for the specification search process.

I believe that regularization and systematic model selection have many advantages over traditional approaches, and for this reason will become a standard part of empirical practice in economics. This will particularly be true as we more frequently encounter data sets with many covariates, and also as we see the advantages of being systematic about model selection.

As I discuss later, however, this practice must be modified from traditional ML and in general “handled with care” when the researcher’s ultimate goal is to estimate a causal effect rather than maximize goodness of fit in a test set.

To build some intuition about the difference between causal effect estimation and prediction, it can be useful to consider the widely used method of instrumental variables. Instrumental variables are used by economists when they wish to learn a causal effect, for example, the effect of a price on a firm’s sales, but they only have access to observational (nonexperimental) data.

An instrument in this case might be an input cost for the firm that shifts over time and is unrelated to factors that shift consumers’ demand for the product (such demand shifters can be referred to as “confounders” because they affect both the optimal price set by the firm and the sales of the product).

The instrumental variables method essentially projects the observed prices onto the input costs, thus only making use of the variation in price that is explained by changes in input costs when estimating the impact of price on sales.

It is very common to see that a predictive model (e.g., least squares regression) might have very high explanatory power (e.g., high R²), while the causal model (e.g., instrumental variables regression) might have very low explanatory power (in terms of predicting outcomes).
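A small simulation can illustrate the contrast between the two approaches. The sketch below is built entirely on assumptions chosen for illustration: a true price effect of -1, an input cost that shifts price but not demand, and an unobserved demand shifter that raises both price and sales. With a single instrument, the instrumental variables estimate reduces to a ratio of covariances, which is what the code computes; the helper names are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def r_squared(y, x, slope):
    """Share of the variance of y explained by a line with the given slope."""
    my = mean(y)
    intercept = my - slope * mean(x)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(1)
true_effect = -1.0           # assumed causal effect of price on sales
cost, price, sales = [], [], []
for _ in range(50_000):
    z = random.gauss(0, 1)   # input cost: the instrument
    u = random.gauss(0, 1)   # unobserved demand shifter: the confounder
    p = 0.5 * z + u          # price responds to both cost and demand
    y = true_effect * p + 3.0 * u + random.gauss(0, 0.5)
    cost.append(z)
    price.append(p)
    sales.append(y)

# Least squares uses all variation in price, including the confounded part.
ols_slope = cov(price, sales) / cov(price, price)
# IV uses only the variation in price that is explained by input costs.
iv_slope = cov(cost, sales) / cov(cost, price)

r2_ols = r_squared(sales, price, ols_slope)
r2_iv = r_squared(sales, price, iv_slope)
```

Under this design the least squares slope is positive (its population value is 1.4), so it gets even the sign of the price effect wrong, although it fits sales better than any other line. The instrumental variables slope is close to the assumed effect of -1, but its fit to the outcome data is far worse and can even produce a negative R².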

In other words, economists typically abandon the goal of accurate prediction of outcomes in pursuit of an unbiased estimate of a causal parameter of interest. Another difference derives from the key concerns in different approaches, and how those concerns are addressed.

In predictive models, the key concern is the trade-off between expressiveness and overfitting, and this trade-off can be evaluated by looking at goodness of fit in an independent test set. In contrast, there are several distinct concerns for causal models.

The first is whether the parameter estimates from a particular sample are spurious, that is, whether estimates arise due to sampling variation so that if a new random sample of the same size was drawn from the population, the parameter estimate would be substantially different.

The typical approach to this problem in econometrics and statistics is to prove theorems about the consistency and asymptotic normality of the parameter estimates, propose approaches to estimating the variance of parameter estimates, and finally to use those results to estimate standard errors that reflect the sampling uncertainty (under the conditions of the theory).

A more data-driven approach is to use bootstrapping and estimate the empirical distribution of parameter estimates across bootstrap samples. The typical ML approach of evaluating performance in a test set does not directly handle the issue of the uncertainty over parameter estimates, since the parameter of interest is not actually observed in any test set.
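The bootstrap idea mentioned above can be sketched as follows. This is a minimal illustration on simulated data, assuming a simple nonparametric (pairs) bootstrap of a least squares slope; the sample size, the number of bootstrap draws, and the percentile interval are choices made only for the example.

```python
import random

def ols_slope(pairs):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

# Hypothetical sample in which the true slope is 2.
rng = random.Random(2)
data = []
for _ in range(200):
    x = rng.gauss(0, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

estimate = ols_slope(data)

# Resample observations with replacement and re-estimate the slope each time.
draws = sorted(ols_slope(rng.choices(data, k=len(data))) for _ in range(1000))

boot_mean = sum(draws) / len(draws)
boot_se = (sum((d - boot_mean) ** 2 for d in draws) / (len(draws) - 1)) ** 0.5
ci = (draws[24], draws[974])  # approximate 2.5th and 97.5th percentiles
```

The spread of `draws` stands in for the sampling distribution of the estimator. Note that nothing here resembles a test-set comparison: the true slope is never observed, so uncertainty has to be inferred from the variation of the estimate across resamples.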

The researcher would need to estimate the parameter again in the test set. A second concern is whether the assumptions required to “identify” a causal effect are satisfied, where in econometrics we say that a parameter is identified if we can learn it eventually with infinite data (where even in the limit, the data has the same structure as in the sample considered).

It is well known that the causal effect of a treatment is not identified without making assumptions, assumptions that are generally not testable (that is, they cannot be rejected by looking at the data). Examples of identifying assumptions include the assumption that the treatment is randomly assigned, or that treatment assignment is “unconfounded.”

In some settings, these assumptions require the analyst to observe all potential “confounders” and control for them adequately; in other settings, the assumptions require that an instrumental variable is uncorrelated with the unobserved component of outcomes.

In many cases it can be proven that even with a data set of infinite size, the assumptions are not testable (they cannot be rejected by looking at the data) and instead must be evaluated on substantive grounds.

Justifying assumptions is one of the primary components of an observational study in applied economics. If the “identifying” assumptions are violated, estimates may be biased (in the same way) in both training data and test data.

Testing assumptions usually requires additional information, like multiple experiments (designed or natural) in the data. Thus, the ML approach of evaluating performance in a test set does not address this concern at all.

Instead, ML is likely to help make estimation methods more credible, while maintaining the identifying assumptions: in practice, coming up with estimation methods that give unbiased estimates of treatment effects requires flexibly modeling a variety of empirical relationships, such as the relationship between the treatment assignment and covariates.

Since ML excels at data-driven model selection, it can be useful in systematizing the search for the best functional forms when implementing an estimation technique.

Economists also build more complex models that incorporate both behavioral and statistical assumptions in order to estimate the impact of counterfactual policies that have never been used before. A classic example is McFadden’s methodological work in the early 1970s (e.g., McFadden 1973) analyzing transportation choices.

By imposing the behavioral assumption that consumers maximize utility when making choices, it is possible to estimate parameters of the consumer’s utility function and estimate the welfare effects and market share changes that would occur when a choice is added or removed (e.g., extending the BART transportation system), or when the characteristics of the good (e.g., price) are changed. Another example with more complicated behavioral assumptions is the case of auctions.

For a data set with bids from procurement auctions, the “structural” approach involves estimating a probability distribution over bidder values, and then evaluating the counterfactual effect of changing auction design (e.g., Laffont, Ossard, and Vuong 1995; Athey, Levin, and Seira 2011; Athey, Coey, and Levin 2013; or the review by Athey and Haile 2007).

For further discussions of the contrast between prediction and parameter estimation, see the recent review by Mullainathan and Spiess (2017).
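Returning to the discrete-choice example above, the counterfactual logic can be sketched in code. The sketch assumes the standard multinomial (conditional) logit form, in which taste shocks are independent type I extreme value draws, so that choice probabilities are proportional to the exponential of mean utility and expected maximum utility is, up to a constant, the log of the sum of those exponentials. The utility values and option names are hypothetical; in an application they would be computed from estimated parameters.

```python
import math

def logit_shares(utilities):
    """Multinomial logit choice probabilities for a dict {option: utility}."""
    denom = sum(math.exp(v) for v in utilities.values())
    return {name: math.exp(v) / denom for name, v in utilities.items()}

def inclusive_value(utilities):
    """Expected maximum utility, up to an additive constant (the log-sum)."""
    return math.log(sum(math.exp(v) for v in utilities.values()))

# Hypothetical mean utilities before and after a new option is introduced.
before = {"car": 1.0, "bus": 0.2}
after = dict(before, rail=0.6)

shares_before = logit_shares(before)
shares_after = logit_shares(after)

# Welfare change from adding the option, measured in utility units.
welfare_gain = inclusive_value(after) - inclusive_value(before)
```

The predicted shares for an option that has never been observed come entirely from the behavioral model. One consequence of this particular functional form is that the new option draws proportionally from the existing ones, so the ratio of car to bus shares is unchanged, which is a restriction imposed by the assumptions rather than something learned from the data.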

There is a small literature in ML referred to as “inverse reinforcement learning” (Ng and Russell 2000) that has a similar approach to the structural estimation literature in economics; this ML literature has mostly operated independently without much reference to the earlier econometric literature.

The literature attempts to learn “reward functions” (utility functions) from observed behavior in dynamic settings.

There are also other categories of ML models; for example, anomaly detection focuses on looking for outliers or unusual behavior and is used, for example, to detect network intrusion, fraud, or system failures.

Other categories that I will return to are reinforcement learning (roughly, approximate dynamic programming) and multiarmed bandit experimentation (dynamic experimentation where the probability of selecting an arm is chosen to balance exploration and exploitation).

These literatures often take a more explicitly causal perspective and thus are somewhat easier to relate to economic models, and so my general statements about the lack of focus on causal inference in ML must be qualified when discussing the literature on bandits.
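As one illustration of how a bandit balances exploration and exploitation, the sketch below implements Thompson sampling for arms with binary outcomes. Thompson sampling is only one of several bandit algorithms, and it is used here as an assumption for the example; the success rates, the uniform Beta(1, 1) priors, and the function name are likewise hypothetical.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Run a Bernoulli bandit; return how often each arm was selected."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible success rate for each arm from its Beta posterior.
        draws = [rng.betavariate(1 + s, 1 + f)
                 for s, f in zip(successes, failures)]
        # Play the arm whose draw is highest. An arm is therefore selected
        # with the posterior probability that it is the best arm.
        arm = draws.index(max(draws))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

# Hypothetical experiment with two treatments whose success rates differ.
pulls = thompson_sampling([0.3, 0.7], n_rounds=2000)
```

Early on, both arms are tried because the posteriors are wide; as evidence accumulates, assignment concentrates on the better arm. Because the algorithm itself controls the assignment probabilities, the link to experimental design and causal effects is more direct than in a pure prediction problem.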

Before proceeding, it is useful to highlight one other contribution of the ML literature. The contribution is computational rather than conceptual, but it has had such a large impact that it merits a short discussion.

The technique is called stochastic gradient descent (SGD), and it is used in many different types of models, including the estimation of neural networks as well as large scale Bayesian models (e.g., Ruiz, Athey, and Blei [2017], discussed in more detail below).

In short, stochastic gradient descent is a method for optimizing an objective function, such as a likelihood function or a generalized method of moments objective function, with respect to parameters.

When the objective function is expensive to compute (e.g., because it requires numerical integration), stochastic gradient descent can be used.

The main idea is that if the objective is the sum of terms, each term corresponding to a single observation, the gradient can be approximated by picking a single data point and using the gradient evaluated at that observation as an approximation to the average (over observations) of the gradient. This estimate of the gradient will be very noisy, but unbiased.

The idea is that it is more effective to “climb a hill” taking lots of steps in a direction that is noisy but unbiased, than it is to take a small number of steps, each in the right direction, which is what happens if computational resources are focused on getting very precise estimates of the gradient of the objective at each step.

Stochastic gradient descent can lead to dramatic performance improvements, and thus enable the estimation of very complex models that would be intractable using traditional approaches.
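The mechanics can be seen in a deliberately simple case. The sketch below applies stochastic gradient descent to a least squares objective with an intercept and one slope, using a single randomly chosen observation at each step. The simulated data, the decaying step-size schedule, and the number of steps are assumptions made for illustration; the models for which the method matters in practice are far larger, and there it is minimizing a loss rather than climbing a hill, which is the same idea with the sign reversed.

```python
import random

# Hypothetical data: y = 1 + 2x + noise.
data_rng = random.Random(3)
data = []
for _ in range(5000):
    x = data_rng.gauss(0, 1)
    data.append((x, 1.0 + 2.0 * x + data_rng.gauss(0, 0.5)))

def sgd_least_squares(data, n_steps=20000, seed=0):
    """Minimize the average of 0.5 * (y - a - b*x)**2 by SGD."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for t in range(n_steps):
        # One observation gives a noisy but unbiased estimate of the gradient
        # of the average loss over the whole data set.
        x, y = rng.choice(data)
        err = y - (a + b * x)
        step = 0.1 / (1 + 0.01 * t)  # assumed decaying step size
        a += step * err              # negative gradient with respect to a
        b += step * err * x          # negative gradient with respect to b
    return a, b

a_hat, b_hat = sgd_least_squares(data)
```

Each update touches one observation, so the cost per step does not grow with the size of the data set. The step size shrinks over time so that the noise in the individual gradients averages out as the estimates approach the minimizer.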