konstantin's blog

Software: Stata

In the previous post, I discussed SAS in detail. Now I will turn to my favorite package, Stata. Five years ago, nobody would have seriously considered it as an option for dealing with large datasets. Today, it is probably one of the most effective tools of which I am aware.

Let’s get the bad news out of the way first: Stata is, by design, memory-bound. This means that if your machine has 16 GB of RAM, you realistically cannot work with datasets larger than about 15 GB. (Stata itself needs very little RAM, but your operating system needs some as well.) To make matters worse, Stata does not support distributed computation. It was conceived and developed in the 1980s and 1990s, when “big data” challenges could be addressed by getting a more powerful host. Third, Stata can only operate on a single dataset at any given moment: think of a single Excel sheet whose size is limited only by the available RAM. Finally, not unlike SAS, Stata has its own unique programming language, which is quite quirky and can be tricky to master.

Given these formidable drawbacks, why would anyone even bother considering Stata? Because it is otherwise made of pure awesomeness, that’s why. (Ok, I may be exaggerating a bit here, but bear with me.) Most skilled empiricists know the sad truth of working with real data: roughly eighty percent of the time on any project is spent manipulating data and wrestling it into a format suitable for analysis. I have yet to find a more efficient tool than Stata for these tasks. Assuming the relevant data fit into RAM, once they are loaded, Stata is blazingly fast at slicing and dicing them. Whenever I open a new dataset, it typically takes me less than five minutes to identify any problems that crept in during the data construction phase. This makes Stata ideal for prototyping new solutions: at the early stages of any project it is important to fail quickly. Whatever takes hours to do in SAS usually takes minutes in Stata.
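
Concretely, those first few minutes often amount to a handful of one-liners. The sketch below is only an illustration; the dataset and variable names (sales.dta, customer_id, revenue) are made up:

    use sales.dta, clear
    describe                          // variable names, types, and labels at a glance
    codebook, compact                 // ranges, missing counts, distinct values
    duplicates report customer_id     // is the supposed key actually unique?
    count if missing(revenue)         // how many missing values crept in?
    summarize revenue, detail         // percentiles quickly expose suspicious outliers
    assert revenue >= 0               // abort immediately if a "cannot happen" value shows up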

I already mentioned that Stata’s internal language is quirky. If you stick with it long enough to get over these quirks, however, you will come to appreciate its flexibility and pithiness. In particular, it provides an uncanny level of versatility when it comes to writing loops: there is effectively no difference between looping over variable values, variable names, strings, or numbers, which makes for exceedingly compact code. Whenever I use R, SAS, Matlab, or virtually anything other than Stata, I resent not having this kind of versatility at my disposal.
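
To illustrate (with made-up variable names rather than anything from a real project), the same handful of constructs covers variable names, numbers, arbitrary strings, and the distinct values of a variable:

    * loop over variable names
    foreach v of varlist price quantity revenue {
        summarize `v'
    }

    * loop over numbers
    forvalues year = 2001/2010 {
        display "processing year `year'"
    }

    * loop over arbitrary strings
    foreach region in north south east west {
        display "region: `region'"
    }

    * loop over the distinct values of a variable (assuming group is numeric)
    levelsof group, local(groups)
    foreach g of local groups {
        count if group == `g'
    }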

In addition, Stata was written by economists and, primarily, for economists. A number of estimation methods and routines specific to economics have been implemented only in Stata, mostly methods related to panel data analysis. People have conflicting opinions on whether it is best to “code up” all the methods you use from scratch. My take on this is simple: speed matters, so whatever enables rapid exploration is good. Once you find something that you think is working, re-implementing the method from scratch is a perfectly reasonable way to add extra robustness.

I will conclude by noting that a perpetual license for a very powerful implementation of Stata costs under ten thousand dollars, and one can get a less powerful version for about a thousand. Stata is my default tool for exploratory data analysis, and I personally find it indispensable.

Software: SAS

In the previous posts I talked primarily about what to do and what not to do when working with data. Now I want to switch gears a bit and discuss the tools of the trade, i.e. the packages and programming environments one could use to this end. I have had a chance to work with the vast majority of existing analytic solutions and, as such, have formed an opinion on each. Before I start, however, I would like to share a nice summary of what your statistical language of choice reveals about you. As usual with these kinds of lists, about eighty percent of the information in it is pretty accurate.

I will start with SAS. There are multiple companies out there that depend heavily on SAS for their data analytics, and understandably so: for a long time, SAS was the only statistical package that could deal with large datasets. Unlike most other high-level packages such as R or Stata, SAS does not keep the data in RAM, so disk space is effectively the only limit on the amount of data it can process. This probably explains why most data analysts with over ten years of experience are solidly proficient in SAS: when they started their careers, other tools were not an option.

Personally, I am not a huge fan of SAS, for three major reasons. First, it is slow relative to packages that keep data in memory. These days getting a host with 60+ GB of RAM is trivial, and hosts with over 200 GB of RAM are not unheard of. While raw data sizes have grown accordingly, most datasets shrink to tens of gigabytes once they are processed and ready for analysis, so a host with 64 GB of RAM works out fine nine times out of ten.

Second, SAS has its own peculiar programming language that is unlike anything else out there. This implies a steep learning curve for anyone new to the environment. The problem is not unique to SAS (Stata’s syntax is perhaps even less intuitive), but it is still a downside: it takes time to get new people up to speed on the team’s current work, which is never a good thing.

Finally, SAS is outrageously expensive. A relatively modest annual license can easily cost in the low six digits for the first year and in the mid-five digits every year thereafter. While SAS offers an amazingly rich array of modules designed to take care of grunt work, such as loading data from databases directly into SAS, these price tags are still hard to justify.

For a while, SAS benefited from the fact that CPU speed grew at roughly the same rate as the size of the average dataset. As long as this was true, one could solve the “big data” problem by allocating a larger host for the analysis. In the early 2000s, however, datasets started growing much faster, and reading from disk became the hardware bottleneck. This gave birth to distributed computing frameworks (read: MapReduce), and the people at SAS Institute were for some reason remarkably oblivious to this structural change. As a result, SAS is now hopelessly behind when it comes to reading data from a distributed storage system such as HDFS. It will be interesting to see what they make of the situation at hand.

Data vs. Algorithms

Last time I mentioned that applying structural econometric models to datasets with millions of records is prohibitively time-consuming in most cases. This brings me to the «chicken-and-egg» problem of empirical work: which is more important, data or algorithms (models)?

Though not everyone agrees, in my experience simpler models estimated on lots of data usually trump more sophisticated models fit to smaller datasets. There is an interesting note on this question by researchers at Google with the self-explanatory title «The Unreasonable Effectiveness of Data». There is also an hour-long video by Peter Norvig on the same topic.

This has major implications for the modeling workflow. First, simpler models are easier to inspect for errors. Do not get me wrong: there are still plenty of ways to get this wrong. My favorite example is trying to detect outliers in explanatory variables by inspecting model residuals. A large outlier can exert so much pull on a regression line that its own residual ends up being quite small. In such a situation, trimming observations with large residuals can actually hurt the model even more. But in general it is easier to detect anomalies when fitting a simpler and more transparent model.
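
A toy simulation makes the point concrete (a sketch with simulated data, not anything from a real project): a single high-leverage observation drags the fitted line towards itself, so its own residual ends up unremarkable, and residual-based trimming would tend to flag ordinary observations before it.

    clear
    set obs 100
    set seed 42
    generate x = rnormal()
    generate y = 2*x + rnormal()
    replace x = 100 in 100            // one extreme value of the explanatory variable...
    replace y = 0   in 100            // ...that does not follow the true relationship
    regress y x                       // the slope is dragged far away from 2
    predict resid, residuals
    list x y resid in 100             // the outlier's own residual is deceptively modest
    summarize resid                   // several ordinary points have larger residuals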

Second, the sad truth about empirical work is that clever use of explanatory variables or their transformations almost always yields significant payoffs in terms of results. Machine learners refer to this as “feature selection and engineering”, but the main idea remains unchanged. There are no reliable recommendations on which data representation will end up working best in a given application; the process involves a lot of trial and error. A simpler model lets me iterate on data representations more quickly and discard non-working ideas faster.

Finally, simpler models are easier to maintain. Unlike models in academic papers, which rarely, if ever, get reused, the models I build in practice usually serve a specific, ongoing purpose. Suppose I build a convoluted model that captures customer behavior with impressive accuracy. Such a model would almost invariably include a lot of tuning parameters that I would need to calibrate carefully. Should I, at some point, switch jobs, whoever gets tasked with maintaining the model is going to have a hard time delivering results. Most likely, the model would cease being useful soon after my departure. This is not a particularly appealing scenario from an employer’s point of view.

Structural Econometric Modeling

I went to graduate school in Minnesota, a fountainhead of structural economic modeling. My particular field of economics – Empirical Industrial Organization (IO) – is primarily built around bringing complicated models of firm and consumer behavior to the data. The classic structural approach to modeling goes as follows: describe a set of agents, specify preferences and objectives for every agent, and write down a set of equations that determines how agents interact in equilibrium.

If I had to put a finger on a single commonality shared by virtually all Empirical IO papers, I would have to say that they all complain about the insufficient richness of the data. If only some extra source of variation were available, the story goes, the paper would have been so much more insightful. Needless to say, I was quite enthusiastic about joining Amazon: in all likelihood, we have the most detailed and comprehensive data on the dynamics of multiple market segments.

Unfortunately, this enthusiasm proved to be slightly hasty, for a simple reason: structural models do not scale to our data. One cannot use a BLP-style demand estimation procedure when consumers have to choose between tens of millions of products. Computing value functions becomes prohibitively expensive when you have to do it for thousands of agents over multiple years of data at daily frequency. But even if you somehow magically address these engineering aspects, the fact remains that structural estimates are numerically very unstable. There are multiple anecdotes of people trying to reproduce published results to no avail. I would rather not point fingers at anyone, but it should be clear that for business purposes, this instability is unacceptable.

Steve Berry once mentioned that he thinks of IO as the econometrics of moderate-sized datasets, and I suspect that anything over a hundred thousand observations is probably not «moderate» in his book. When hundreds of millions of observations are routine, you need to look elsewhere for tools.

The Economics of Amazon.com

Recently I have come across a number of comments online that share a similar set of arguments. In a nutshell, they go as follows: Apple posts the fourth most impressive quarter ever – its stock price falls; Facebook beats revenue and profit forecasts – its stock price declines; Amazon misses the forecasts by a wide margin – its stock price keeps going up.

I know very little about stock markets, but I know enough to know that predicting the prices of individual stocks is, to say the least, very difficult. That being said, there are a few things about Amazon and how it operates that may not be entirely obvious. A couple of weeks ago I found a blog post by an ex-Amazonian that summarizes these things succinctly. It is not very short, but it is quite insightful.

Two Cultures of Statistical Modeling

My two previous blog posts were, to a large extent, inspired by this paper by Leo Breiman, whom I already mentioned. While I do not feel that I can entirely agree with his arguments, they are, without a doubt, worth considering.

Structure

Previously I mentioned what machine learners do better than most economists. Now I want to bring up a slightly more controversial distinction: machine learners are fundamentally less willing to make assumptions. This may sound a bit abstract and probably warrants an explanation.

At a very high level of abstraction, a large portion of statistics can be reduced to constructing a model that defines a relationship between a dependent variable and a set of independent variables. It is in defining this relationship that one can choose to make assumptions. It may sound like I am drawing a distinction between parametric and nonparametric statistics, but the point is in fact more general.

In economics, we construct models to understand specific phenomena. Every little piece of our models usually has a clear and fairly straightforward interpretation. The famous model-building recipe of Alfred Marshall – write the question in English, translate it into math, use the math to obtain the answer, translate the answer back into English, then burn the math – is still a good rule of thumb. But we cannot make this work without imposing a myriad of assumptions, many of which are so deeply ingrained in our thought process that we no longer consider them assumptions at all; we implicitly treat them as axioms.

In contrast, machine learners become deeply suspicious the moment someone starts making assumptions. They raise a valid point: any result obtained under an assumption goes out the window once that assumption is violated. Many find such a predicament completely unacceptable and refuse to go down that path. Strangely enough, the ability to make informed assumptions that are mostly true is still highly valued in the field, and even has a special term reserved for it: «domain knowledge».

I feel that in economics we had to acquire «domain knowledge» early because our data are fundamentally very noisy. Without making assumptions, it is impossible to learn anything from, say, macroeconomic data – there are simply too many moving parts and too few degrees of freedom. Imposing structure on the data can sometimes let us ignore a subset of the explanatory variables and make the remaining ones more informative. I feel that this idea has not yet made it into the machine learning mainstream.

Overfitting

According to the late Leo Breiman, people who work with data generally come from one of two camps: statisticians and machine learners. In economics, we are much closer to the statisticians than to the machine learners. The differences between the camps are numerous, and I would not hope to cover them all in a single post. But if I were asked to put a finger on what I think machine learners do right and economists do wrong, overfitting would be my answer.

As most of you are well aware, overfitting happens when one tailors a model to mimic the features of the existing data too closely. Such a model, when used for prediction, usually yields wildly inaccurate projections. Nate Silver claims in his book that overfitting is probably the biggest sin of which most practitioners of statistics are guilty. His intuitive explanation of the phenomenon is that «an overfit model fits not only the signal, but also the noise in the data», and I believe this is an excellent summary. On a side note, I highly recommend the book to anyone interested in applied data work, though, like many of my non-US friends, I found the baseball chapter somewhat boring.

To be fair, time-series econometricians sometimes discuss «out-of-sample» model evaluation, presumably because time-series models are the most likely to be used for forecasting. But I believe it would be very instructive to see an «out-of-sample» evaluation of a structural econometric model. I suspect it could lend these models some additional credibility – something that some of them sorely lack. In the past, one could argue that the data were so limited that none could be spared for a holdout sample. These days, however, such arguments appear unsubstantiated.
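
For what it is worth, a crude holdout evaluation takes only a few lines in Stata. The sketch below uses hypothetical variable names (y, x1, x2) and an 80/20 split; the same idea applies to any fitted model:

    set seed 1
    generate train = runiform() < 0.8          // hold out roughly 20% of observations
    regress y x1 x2 if train                   // fit on the training portion only
    predict yhat                               // predictions for all observations
    generate sqerr = (y - yhat)^2
    tabstat sqerr, by(train) statistics(mean)  // compare in-sample vs out-of-sample error

An overfit model shows a large gap between the two means; a model that generalizes does not.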

The Tails

A number of data facts that are by now obvious to me were anything but obvious when I started. Most of the statistics I learned at school dealt with conditional means and, sometimes, variances. The focus on the first two moments makes sense for bell-shaped distributions such as the normal, and it is a lot easier to prove limit theorems about means, so their properties are well documented. Unfortunately, the vast majority of real data I encounter is anything but bell-shaped.

Instead, most variables share a very different shape: there is almost always a mass point at zero, a relatively well-behaved lognormal-looking piece over the positive values, and a very long, nearly flat tail. Sometimes the tail is so long that the mean ends up above the 90th percentile. The outliers also inflate the variance, so most standard inference procedures will tell a very misleading story about what is going on in the data.
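
To see how extreme this can get, here is a toy simulation (entirely made-up numbers, not real data) of a zero-inflated, heavy-tailed variable:

    clear
    set obs 100000
    set seed 7
    generate sales = 0                                       // mass point at zero
    replace sales = exp(rnormal(2, 1)) if runiform() < 0.6   // lognormal-looking body
    replace sales = sales * 1000 if runiform() < 0.01        // a thin but very long tail
    summarize sales, detail
    display "mean = " r(mean) "   90th percentile = " r(p90)

In runs like this, a handful of extreme observations is enough to push the mean past the 90th percentile and to blow up the variance.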

In addition, the tails of the distribution are where the most interesting things happen. If we are looking at model errors, the tails are the cases where the fit is worst. If we are looking at sales, the tails are the bestsellers. The most important users also live in the tails. So not only do the tails mess up the usual inference procedures, they are also of primary interest in their own right.

First Entry

I have decided to resume blogging, but this time I will host the blog on my personal website. Entries will be cross-posted to my old blog automatically unless I choose otherwise.

I intend to write mostly on professional and related issues. This includes economics, statistical computing, machine learning, and data science.
