
Statistics: The Missing Puzzle Piece

I don’t think we are generally doing a good job of teaching statistics to students. Because of this, many people possess a very spotty understanding of the field and treat it with suspicion. For them, statistics is a bunch of tricks that let you determine whether something is “significant” or not, and, if it is not, it might as well be zero. Somehow the big picture of statistics is missing from most introductory courses, and it took me a while to piece it back together. What follows is my attempt to write it down for posterity (and others).

A statistician’s worldview starts with the concept of a data generating process, or DGP for short. The DGP is an unverifiable assumption of how the world works. For an admittedly silly example, we might be interested in understanding how a person’s height affects their wage. A possible DGP for this problem would look like this:

\[ w = \alpha + \beta \cdot h + \varepsilon, \] where \(w\) is the person’s wage, \(h\) is their height, \(\varepsilon\) represents all factors other than height that affect the wage, and \(\alpha\) and \(\beta\) are the only two unknown parameters in this DGP equation. I am deliberately keeping this simple: one can easily postulate vastly more complex DGPs that attempt to relate height to wages. For our purposes we just need to have a DGP, no matter how silly.

The DGP is fundamentally unfalsifiable and is often assumed to be linear, especially at the early stages of analysis. Standard arguments about local approximations via Taylor expansions are usually brought up to justify the linearity, and often fitting linear models well is hard enough. But the important aspect is that by assuming the DGP we have reduced the dimensionality of the problem dramatically: now we only have the two unknown parameter values \(\alpha\) and \(\beta\) to worry about.
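To make this concrete, here is a minimal simulation sketch of the DGP above. The parameter values, the noise scale, and the height distribution are all made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 20.0, 0.3            # unknown in reality; invented here for illustration
I = 500                            # number of people in the sample

h = rng.normal(170, 10, size=I)    # heights in cm
eps = rng.normal(0, 5, size=I)     # everything else that affects wages
w = alpha + beta * h + eps         # wages produced by the assumed DGP
```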

Next, we need a dataset to which we will apply statistical techniques. Only Bayesians can work without data, and even they prefer to have at least some of it to update their priors. For our example we would need a bunch of measurements of people’s wages and heights, and we would need both measurements for every person in our dataset. Imagine, for example, that you asked everyone in your office building how tall they are and how much they get paid, and somehow, miraculously, a bunch of people answered your questions.

Statisticians would think of the dataset as a sample that was drawn from the population of all humans: everyone has a height and most people have wages (and we can assume zero wages for those who are not employed). We will sidestep the issues of how representative our sample is for now, and will only note that a representative sample is generally much more useful for what we hope to achieve.

Once we have a sample of \((w_i, h_i)\) for \(I\) people, we typically rely on our DGP assumption to estimate \(\alpha\) and \(\beta\). Let’s focus on the simple case of using ordinary least squares, or OLS, to obtain estimates, usually denoted \(\hat{\alpha}\) and \(\hat{\beta}\). (Many alternative estimation procedures exist, each resulting in its own estimator and estimates, but they are out of scope for this writeup.) For our purposes it is important to grok that \(\hat{\alpha}\) and \(\hat{\beta}\) are random functions of the sample values of \(w_i\) and \(h_i\). A different sample, even of the same size, will almost certainly result in different values of \(\hat{\alpha}\) and \(\hat{\beta}\), even though we use the same estimator to obtain them. Imagine one sample taken over a group of professional basketball players, while another uses the janitors in the team’s home arena.
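Here is a rough sketch of the estimation step on simulated data, using the closed-form OLS formulas for a simple regression (again, all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
I = 500
h = rng.normal(170, 10, size=I)                     # heights, invented distribution
w = 20.0 + 0.3 * h + rng.normal(0, 5, size=I)       # wages from the assumed DGP

# Closed-form OLS estimates for a simple regression
beta_hat = np.cov(h, w, ddof=1)[0, 1] / np.var(h, ddof=1)
alpha_hat = w.mean() - beta_hat * h.mean()

print(alpha_hat, beta_hat)   # a different sample would give different estimates
```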

Because we used a well-understood estimator to obtain \(\hat{\alpha}\) and \(\hat{\beta}\), we know a lot about its statistical properties. For example, we know that as \(I\) approaches infinity, both \(\hat{\alpha}\) and \(\hat{\beta}\) have sampling distributions that become indistinguishable from normal, or, as statisticians like to say, these estimators have asymptotic normal distributions. We use these distributions to perform inference, which is a fancy word that means “testing hypotheses about the population using the sample at hand”.
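A quick way to build intuition for the sampling distribution is to simulate it: draw many samples from the assumed DGP, re-estimate \(\hat{\beta}\) on each, and look at the spread of the estimates. A sketch, with the same invented parameters as before:

```python
import numpy as np

rng = np.random.default_rng(2)
I, n_reps = 500, 2000
beta_hats = np.empty(n_reps)

for r in range(n_reps):
    h = rng.normal(170, 10, size=I)                  # fresh sample from the assumed DGP
    w = 20.0 + 0.3 * h + rng.normal(0, 5, size=I)
    beta_hats[r] = np.cov(h, w, ddof=1)[0, 1] / np.var(h, ddof=1)

# The estimates cluster around the true 0.3 and look approximately normal
print(beta_hats.mean(), beta_hats.std(ddof=1))
```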

Perhaps the most common inference would be to test a hypothesis that \(\beta = 0\), which would imply no relationship between height and wages. This hypothesis can be tested by computing the t-statistic via the well-known formula:

\[ \frac{\hat{\beta} - \beta_0}{\mathtt{se}(\hat{\beta}) },\] where \(\mathtt{se}(\hat{\beta})\) is the standard error, or our estimate of the standard deviation of \(\hat{\beta}\)’s sampling distribution. With \(\beta_0 = 0\) the formula simplifies to the ratio of the estimate to its standard error. Recall that \(\hat{\beta}\) has an asymptotic normal distribution. Under the assumption that the null hypothesis \(\beta = \beta_0\) is true, we have \(\hat{\beta} \sim_a \mathcal{N}(0,\,\mathtt{se}(\hat{\beta})^2)\), and hence the t-statistic will have an asymptotic standard normal distribution \(\mathcal{N}(0,\,1)\).
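As a sketch, here is the t-statistic computed on simulated data, using the textbook standard-error formula for a simple regression (all numbers are, again, made up):

```python
import numpy as np

rng = np.random.default_rng(3)
I = 500
h = rng.normal(170, 10, size=I)
w = 20.0 + 0.3 * h + rng.normal(0, 5, size=I)   # same invented DGP as before

beta_hat = np.cov(h, w, ddof=1)[0, 1] / np.var(h, ddof=1)
alpha_hat = w.mean() - beta_hat * h.mean()

resid = w - (alpha_hat + beta_hat * h)
sigma2_hat = resid @ resid / (I - 2)                         # residual variance estimate
se_beta = np.sqrt(sigma2_hat / ((h - h.mean()) ** 2).sum())  # standard error of beta_hat

t_stat = (beta_hat - 0.0) / se_beta   # testing the null beta_0 = 0
print(t_stat)
```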

Up until this point, we have simply been computing numbers from the sample, but now comes the actual inference part. We can make a judgment call about how probable it would be to observe our particular value of the t-statistic if it really were a realization of a standard normal random variable. Usually, if its absolute value is less than 1.96, we conclude that at the 95% confidence level we cannot reject the hypothesis that \(\beta = 0\), even if our sample estimate \(\hat{\beta}\) is far from zero numerically.
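The decision step itself is mechanical; a sketch, with a placeholder t-statistic standing in for the value you would compute from your own sample:

```python
from scipy.stats import norm

t_stat = 6.7                          # placeholder: use the t-statistic from your own sample
p_value = 2 * norm.sf(abs(t_stat))    # two-sided p-value under the standard normal

if abs(t_stat) > 1.96:
    print(f"reject beta = 0 at the 5% level (p = {p_value:.2g})")
else:
    print(f"cannot reject beta = 0 at the 5% level (p = {p_value:.2g})")
```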

Because every sample is in practice finite, all the confidence intervals and p-values derived from asymptotic inference are at least a little bit wrong. A nominal asymptotic 95% confidence interval may actually cover the truth 92% or 97% of the time in our particular setting, or its coverage could be even further off if our \(I\) is very far from infinity. But the part of the statistical inference process to which many machine learning practitioners tend to object is the DGP assumption that cannot be verified. More specifically, they correctly point out that all of the above machinery hinges on us knowing the functional form of the DGP, and the moment this assumption is violated, all our inferences go out the window. According to them, if we cannot reject the hypothesis that \(\beta = 0\), it may simply be because the “true” relationship between \(w\) and \(h\) is not linear but considerably more complicated. In our example it’s easy to propose improvements to the DGP: for example, taking people’s gender into account will likely result in smaller estimates of \(\hat{\beta}\), simply because men are on average taller and happen to get paid more than women.
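As a sketch of that last point, here is the same regression with a hypothetical gender dummy added. The data are simulated so that the dummy is correlated with both height and wages, which is exactly the situation where the simple-regression slope gets inflated:

```python
import numpy as np

rng = np.random.default_rng(4)
I = 500
male = rng.integers(0, 2, size=I)                 # hypothetical 0/1 gender dummy
h = rng.normal(163, 8, size=I) + 13 * male        # men taller on average (invented numbers)
w = 15.0 + 0.1 * h + 8.0 * male + rng.normal(0, 5, size=I)

# Slope from the simple regression of wage on height alone
beta_simple = np.cov(h, w, ddof=1)[0, 1] / np.var(h, ddof=1)

# Multiple regression on [1, height, male] via least squares
X = np.column_stack([np.ones(I), h, male])
coef, *_ = np.linalg.lstsq(X, w, rcond=None)

print(beta_simple, coef[1])   # height slope shrinks once the gender dummy is included
```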

Is there a solution to a misspecified DGP? There is, and it is called the bootstrap, although it requires a computer for all practical purposes, which is not really a big constraint these days. The essence of the bootstrap approach is as follows: you admit that you likely do not know the correct distribution of your estimator, so you approximate it with an empirical distribution, obtained by drawing a large number, e.g. 1000, of bootstrap samples of size \(I\) from your sample, with replacement. (If you draw without replacement, all bootstrap samples will be identical.) For each bootstrap sample, you compute the value of your test statistic, such as the t-statistic we used above. Once you have 1000 values of the t-statistic, you can compare the value computed from the original sample with this distribution and infer how likely the observed value is under the null hypothesis. The bootstrap is especially useful when your statistic of interest is a complicated function of the data and deriving its asymptotic properties is hard. Tim Hesterberg from Google wrote an excellent paper about the bootstrap that I highly recommend reading, but it is more technical than the above writeup.
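Here is a sketch of a simple percentile-style bootstrap for the slope estimate, a plainer variant than re-computing the t-statistic on each bootstrap sample as described above, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
I, n_boot = 500, 1000
h = rng.normal(170, 10, size=I)
w = 20.0 + 0.3 * h + rng.normal(0, 5, size=I)    # simulated original sample

def slope(hh, ww):
    """OLS slope of ww on hh for a simple regression."""
    return np.cov(hh, ww, ddof=1)[0, 1] / np.var(hh, ddof=1)

beta_hat = slope(h, w)

boot_betas = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, I, size=I)             # resample (w_i, h_i) pairs with replacement
    boot_betas[b] = slope(h[idx], w[idx])

se_boot = boot_betas.std(ddof=1)                 # bootstrap standard error
ci = np.percentile(boot_betas, [2.5, 97.5])      # percentile 95% confidence interval
print(beta_hat, se_boot, ci)
```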