Overfitting

According to the late Leo Breiman, people who work with data generally come from one of two camps: statisticians and machine learners. In economics, we are much closer to the statisticians than to the machine learners. The differences between the camps are numerous, and I could not hope to cover them all in a single post. But if I were asked to put a finger on the one thing machine learners do right and economists do wrong, overfitting would be my answer.

As most of you are well aware, overfitting happens when one tailors a model to mimic the idiosyncratic features of the existing data too closely. Such a model, when used for prediction, usually yields wildly inaccurate projections. In his book The Signal and the Noise, Nate Silver claims that overfitting is probably the biggest sin of which most practitioners of statistics are guilty. His intuitive explanation of the phenomenon is that "an overfit model fits not only the signal, but also the noise in the data", and I believe this is an excellent summary. On a side note, I highly recommend the book to anyone interested in applied data work, though, like many of my non-US friends, I found the baseball chapter somewhat boring.
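To make the signal-versus-noise point concrete, here is a minimal Python sketch. The quadratic signal, the noise level, and the polynomial degrees are all illustrative choices of mine, not anything from Silver's book: a flexible degree-15 polynomial hugs the noisy sample more tightly than the true degree-2 model, yet does worse on a fresh draw from the same process.

```python
# Minimal sketch of overfitting: fit polynomials of increasing degree to
# noisy data and compare in-sample fit with fit on fresh data.
import numpy as np

rng = np.random.default_rng(0)

# True signal is quadratic; the noise is what an overfit model ends up chasing.
x = np.linspace(0, 1, 30)
signal = 1.0 + 2.0 * x - 3.0 * x**2
y = signal + rng.normal(scale=0.2, size=x.size)

# A fresh draw from the same process, standing in for "future" data.
y_new = signal + rng.normal(scale=0.2, size=x.size)

for degree in (2, 15):
    coefs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefs, x)
    mse_in = np.mean((y - fitted) ** 2)
    mse_out = np.mean((y_new - fitted) ** 2)
    print(f"degree {degree:2d}: in-sample MSE {mse_in:.4f}, "
          f"out-of-sample MSE {mse_out:.4f}")
```

The high-degree fit wins in sample and loses out of sample, which is exactly the pattern one should look for when diagnosing overfitting.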

To be fair, time-series econometricians do sometimes discuss "out-of-sample" model evaluation. I suspect this is because time-series models are the ones most likely to be used for forecasting. But I believe it would be very instructive to see an out-of-sample evaluation of a structural econometric model. Such an exercise could lend these models some additional credibility, which some of them sorely lack. In the past, one could have argued that the data were so limited that none could be spared for a holdout sample. These days, however, such arguments appear unsubstantiated.
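For the record, the holdout exercise itself is trivial to set up. The sketch below uses ordinary least squares on simulated data purely as a stand-in for whatever model one actually estimates; the 80/20 split and all the parameter values are my own arbitrary choices. The point is only the mechanics: estimate on one part of the sample, report fit on the part the estimator never saw.

```python
# Minimal sketch of a holdout evaluation: estimate on a training sample,
# then measure predictive fit on observations withheld from estimation.
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; in practice this would be the actual estimation sample.
n, k = 500, 3
X = rng.normal(size=(n, k))
beta = np.array([1.5, -0.5, 2.0])
y = X @ beta + rng.normal(size=n)

# Reserve, say, 20% of observations as the holdout sample.
split = int(0.8 * n)
X_train, X_hold = X[:split], X[split:]
y_train, y_hold = y[:split], y[split:]

# Estimate on the training sample only (OLS via least squares here).
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Report fit on both samples; a large gap is a red flag for overfitting.
mse_train = np.mean((y_train - X_train @ beta_hat) ** 2)
mse_hold = np.mean((y_hold - X_hold @ beta_hat) ** 2)
print(f"training MSE: {mse_train:.3f}, holdout MSE: {mse_hold:.3f}")
```

If the holdout error is dramatically worse than the training error, the model is telling you it memorized the sample rather than learned the structure, and no in-sample fit statistic will reveal that.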