This post is inspired by the famous paper “Statistical Modeling: The Two Cultures” by the late Leo Breiman. While I do not feel that I can entirely agree with his arguments, they are, without a doubt, worth considering.
A number of facts about data that are by now obvious were anything but obvious to me when I started. Most of the statistics that I learned at school dealt with conditional means and, sometimes, variances. The focus on the first two moments makes sense for bell-shaped distributions such as the normal. It is a lot easier to prove limit theorems about means, so their properties are well-documented. Unfortunately, the vast majority of real data that I encounter are anything but bell-shaped.
Instead, most variables have a very different distribution.
There is almost always a mass point at zero, a relatively well-behaved
lognormal-shaped piece with positive values, and a very long uniform-shaped tail.
Sometimes the tail is so long that the mean can be above the
The outliers also inflate the variance, so most standard inference procedures
will tell a very misleading story about what is going on in the data.
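To make the shape described above concrete, here is a minimal simulation sketch. The mixture weights and parameters (30% zeros, a standard lognormal body, a 1% uniform tail between 100 and 10,000) are made up purely for illustration and do not come from any real dataset; `simulate_variable` and `summarize` are hypothetical helper names.

```python
import math
import random


def simulate_variable(n, seed=0):
    """Draw n values from an illustrative mixture: zeros, lognormal body, long tail."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        u = rng.random()
        if u < 0.30:
            values.append(0.0)                            # mass point at zero
        elif u < 0.99:
            values.append(rng.lognormvariate(0.0, 1.0))   # well-behaved positive part
        else:
            values.append(rng.uniform(100.0, 10_000.0))   # rare, very long tail
    return values


def summarize(values):
    """Compare moment-based summaries with quantile-based ones."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        return ordered[min(n - 1, int(q * n))]

    mean = sum(ordered) / n
    stdev = math.sqrt(sum((v - mean) ** 2 for v in ordered) / n)
    return {
        "mean": mean,
        "median": quantile(0.50),
        "p90": quantile(0.90),
        "stdev": stdev,
        "share_below_mean": sum(v < mean for v in ordered) / n,
    }


summary = summarize(simulate_variable(100_000))
for name, value in summary.items():
    print(f"{name:>17}: {value:,.3f}")
```

In this simulated sample the mean sits far above the median and even above the 90th percentile, and the standard deviation is several times larger than the mean, so a summary built on the first two moments says very little about a typical observation.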
In addition, the tails of the distribution are where the most interesting things happen. If we are looking at model errors, the tails are the cases where the model fit is worst. If we are looking at sales, the tails are the bestsellers. The most important users are also in the tails of the distribution. So not only do the tails mess up the usual inference procedures, they are also of primary interest in themselves.
People who work with data generally come from one of two camps: statisticians and machine learners. In economics, we are much closer to statisticians than to machine learners. The differences between the camps are numerous, and I would not hope to cover them all in a single post. But if I were asked to put a finger on what I think the machine learners do right and economists do wrong, the handling of overfitting would be the answer.
As most of you are well aware, overfitting happens when one tailors the model to mimic the existing features of the data too well. Such a model, when used for predictions, usually yields wildly inaccurate projections. Nate Silver in his book “The Signal and the Noise” claims that overfitting is probably the biggest sin of which most practitioners of statistics are guilty. His intuitive explanation for the phenomenon is that “an overfit model fits not only the signal, but also the noise in the data”, and I believe this is an excellent summary. On a side note, I highly recommend the book to anyone who is interested in applied data work, though, like many of my non-US friends, I found the baseball chapter somewhat boring.
To be fair, time-series econometricians sometimes discuss “out-of-sample” model evaluation. I suspect this happens because time-series models are the ones most likely to be used for forecasting. But I believe it would be very instructive to see an “out-of-sample” evaluation for a structural econometric model. I suspect this could lend these models some additional credibility – something that some of these models deeply lack. In the past, it would have been possible to argue that the data are so limited it is impossible to spare some for a holdout sample. These days, however, such arguments appear unsubstantiated.
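The difference between in-sample and out-of-sample fit can be sketched in a few lines. The example below is only an illustration under assumed conditions: the data-generating process (y = 2x plus Gaussian noise) is invented, and a one-nearest-neighbor rule stands in for "a model tailored to the data too well"; it is not a method discussed in this post. All function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Illustrative data: y = 2x + Gaussian noise (an assumption, not real data)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Simple least-squares line: captures the signal only."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest_neighbor(xs, ys):
    """Memorizes the training sample: reproduces signal and noise alike."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared prediction error of a model on a sample."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(500, rng)
test_x, test_y = make_data(500, rng)   # holdout sample, never used for fitting

line = fit_line(train_x, train_y)
memorizer = fit_nearest_neighbor(train_x, train_y)

line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
memo_train, memo_test = mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)

print(f"line:      in-sample {line_train:.2f}, holdout {line_test:.2f}")
print(f"memorizer: in-sample {memo_train:.2f}, holdout {memo_test:.2f}")
```

The memorizing model looks perfect in-sample because it reproduces every noisy training point, yet it does worse than the plain line on the holdout sample. Only the holdout comparison reveals this, which is the argument for setting some data aside.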
Another thing that tends to set machine learners apart from statisticians is that many machine learners are fundamentally less willing to make assumptions. This may sound a bit abstract and probably warrants an explanation.
At a very high level of abstraction, a large portion of statistics can be reduced to constructing a model that defines a relationship between a dependent variable and a set of independent variables. It is in defining this relationship that one can choose to make assumptions. It may sound like I am drawing a distinction between parametric and nonparametric statistics, but in fact, the point is more general.
In economics, we construct models to understand specific phenomena. Every little piece of our models usually has a clear and fairly straightforward interpretation. The famous model-building process of Alfred Marshall – write the question in English, translate into math, use math to obtain the answer, translate the answer back into English, burn the math – is still a good rule of thumb. But we cannot make this work without imposing a myriad of assumptions, many of which are so deeply ingrained in our thought process that we no longer consider them to be assumptions per se; rather, we implicitly treat them like axioms.
In contrast, machine learners become deeply suspicious the moment someone starts making assumptions. They raise a valid point: any result that is obtained under an assumption goes out the window once the assumption is violated. Many find such a predicament to be completely unacceptable, and refuse to go down that path. Strangely enough, the ability to make informed assumptions that are mostly true is still highly valued in the field, and has a special term reserved for it: “domain knowledge”.
I feel that in economics we had to acquire “domain knowledge” early because our data are fundamentally very noisy. Without making assumptions, it is impossible to learn anything from, say, macroeconomic data – there are simply too many moving parts and too few degrees of freedom. Imposing a structure on the data can sometimes let us ignore a subset of the explanatory variables, and make the remaining variables more informative. I feel that this idea has not yet made it to the machine learning mainstream.
I went to graduate school in Minnesota, a fountainhead of structural economic modeling. My particular field of economics – Empirical Industrial Organization (IO) – is primarily built around bringing complicated models of firm and consumer behavior to the data. The classic structural approach to modeling goes as follows: describe a set of agents, specify preferences and objectives for every agent, and write down a set of equations that determines how agents interact in equilibrium.
If I had to put a finger on a single commonality shared by virtually all Empirical IO papers, I would have to say that they all complain about insufficient richness of the data. If only some extra source of variation were available, the story goes, the paper would have been so much more insightful. Needless to say, I was quite enthusiastic about joining Amazon: in all likelihood, we have the most detailed and comprehensive data on the dynamics of multiple market segments.
Unfortunately, this enthusiasm proved to be slightly hasty, for a simple reason: structural models do not scale to our data. One cannot use a BLP-style demand estimation procedure when consumers have to choose between tens of millions of products. Computing value functions becomes prohibitively expensive when you have to do it for thousands of agents over multiple years of data at daily frequency. But even if you somehow magically address these engineering aspects, the fact remains that structural estimates are numerically very unstable. There are multiple anecdotes of people trying to reproduce published results to no avail. I would rather not point fingers at anyone, but it should be clear that for business purposes, this instability is unacceptable.
Steve Berry once mentioned that he thinks about IO as the econometrics of moderate-sized datasets, and I suspect that anything with over a hundred thousand observations is probably not “moderate” in his book. When hundreds of millions of observations occur routinely, you need to look elsewhere tool-wise. This brings me to the “chicken-and-egg” problem of empirical work: which is more important, data or algorithms (models)?
Though not everyone agrees, in my experience, simpler models with lots of data usually trump more sophisticated models that were fit to smaller datasets. There is an interesting note on this question written by Google Research with a self-explanatory title “The Unreasonable Effectiveness of Data”. There is also an hour-long video by Peter Norvig on the same topic.
This has major implications for the modeling workflow. First, simpler models are easier to inspect for errors. To be clear, even simple models leave plenty of room for mistakes. My favorite example is detecting outliers in explanatory variables by inspecting model residuals. A large outlier can exert so much leverage on a regression line that the residual that corresponds to it ends up being quite small. In such a situation, trimming observations with large residuals can actually hurt the model even more. But in general it is easier to detect anomalies when fitting a simpler and more transparent model.
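The residual-trimming pitfall is easy to reproduce on toy data. The sketch below uses an invented dataset (ten points lying exactly on y = x, plus one outlier in the explanatory variable at x = 50, y = 0) and a hypothetical trimming rule that drops the two observations with the largest absolute residuals. Neither comes from a real application.

```python
def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def abs_residuals(points, intercept, slope):
    """Absolute residual of each observation under the fitted line."""
    return [abs(y - (intercept + slope * x)) for x, y in points]


clean = [(float(x), float(x)) for x in range(1, 11)]  # true relationship: y = x
outlier = (50.0, 0.0)                                  # extreme value of x
data = clean + [outlier]

# Fit on everything: the outlier drags the line toward itself.
intercept, slope_full = fit_line(data)
resid = abs_residuals(data, intercept, slope_full)
outlier_resid = resid[-1]
largest_clean_resid = max(resid[:-1])

# Hypothetical trimming rule: drop the two largest absolute residuals, refit.
keep = sorted(range(len(data)), key=lambda i: resid[i])[:-2]
trimmed = [data[i] for i in keep]
_, slope_trimmed = fit_line(trimmed)

# Reference fit with the outlier removed by hand.
_, slope_clean = fit_line(clean)

print(f"outlier residual {outlier_resid:.2f} vs largest clean residual {largest_clean_resid:.2f}")
print(f"slope: full {slope_full:.3f}, after trimming {slope_trimmed:.3f}, without outlier {slope_clean:.3f}")
```

On this toy data the outlier's residual is smaller than that of several legitimate points, so the trimming rule discards good observations, keeps the outlier, and moves the slope even further from the true value of one.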
Second, the sad truth about empirical work is that clever usage of explanatory variables or their transformations almost always yields significant payoffs in terms of results. Machine learners refer to this as “feature selection and engineering”, but the main idea remains unchanged. There are no reliable recommendations on what data representation will end up working best in a given application – this process involves a lot of trial and error. A simpler model enables me to iterate on data representations more quickly, and to discard ideas that do not work faster.
Finally, simpler models are easier to maintain. Unlike models in academic papers, which rarely, if ever, get reused, the models I build in practice serve a specific, ongoing purpose. Suppose that I build a convoluted model that captures customer behavior with impressive accuracy. Such a model would almost invariably include a lot of tuning parameters that I would need to calibrate carefully. Should I, at some point, switch jobs, whoever gets tasked with maintaining the model is going to have a hard time delivering results. Most likely, the model would cease being useful soon after my departure. This is not a particularly appealing scenario from an employer’s point of view.