What is Bootstrapping?¶
Have you ever paused to reflect on the meaning of those standard errors you see in a regression? Sure, you may have a vague sense that they tell you something about how confident you can be in your regression coefficients, but beyond that?
The data we work with in political science is almost always a random sample drawn from a larger population. Our data might be a random sample of voters (drawn from the population of all eligible voters in a state or country), or a random sample of speech (drawn from a population of all political speeches given in a given country in a given period of time).
As such, when we run a regression to calculate the partial correlation between an independent variable and a dependent variable we care about, we aren’t actually able to calculate that relationship using data from the full population for whom we want to know this partial correlation. And it’s always possible that the relationship we find in the random sample of the population for whom we have data will be different from the relationship in the population as a whole!
Standard errors, therefore, can be thought of as estimates of what the distribution of our estimated coefficient (our \(\beta\), the partial correlation between our independent variable and our dependent variable of interest) would look like if we kept (a) drawing new random samples from our population of interest, and (b) re-running our regression.
Now, obviously, when we run a regression we aren’t actually drawing lots of new random samples from our population of interest. Instead, we leverage the Central Limit Theorem and some other assumptions to conclude that if we were to do this “re-sampling” exercise, the resulting estimates of \(\beta\) would be Normally distributed, with a mean equal to the value we actually estimated for \(\beta\) and a standard deviation equal to the standard error reported by a normal linear regression.
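To make that thought experiment concrete, here’s a small simulation (all data here is made up purely for illustration): we create a synthetic population where the true \(\beta\) is known, repeatedly draw fresh random samples from it, fit the same regression on each, and look at the spread of the resulting estimates.

```python
# Simulating the "re-sampling" thought experiment: draw many random
# samples from a synthetic population, fit the same regression on
# each, and inspect the spread of the estimated betas.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# A synthetic population where the true beta is 2.
N_POP = 100_000
x_pop = rng.normal(size=N_POP)
y_pop = 2 * x_pop + rng.normal(scale=3, size=N_POP)

betas = []
for _ in range(500):
    # Draw a fresh random sample of 1,000 from the population.
    idx = rng.choice(N_POP, size=1_000, replace=False)
    X = sm.add_constant(x_pop[idx])
    betas.append(sm.OLS(y_pop[idx], X).fit().params[1])

# The standard deviation of these estimates is what a regression's
# reported standard error is trying to approximate.
print(np.std(betas))
```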
BUT: for that to be true, there are some assumptions that have to be met. You’ll get lots of opportunities to discuss all those assumptions in 630 this semester, so we won’t go into them now, but they certainly exist!
And so, what if there was another way to predict how our estimates of \(\beta\) might be distributed if we were to keep drawing different samples from our population of interest?
Enter bootstrapping!
How To Bootstrap¶
Bootstrapping is, in my view, one of the coolest and most magical tricks in statistics. Most of us can’t afford to keep collecting data from different samples from our population of interest, but it turns out that you can approximate what would happen if you did by just re-sampling the data you already have (with replacement) and running your regression on each of those re-samples.
Let’s be more concrete. Suppose you had data from the American Community Survey on incomes in the United States, and you’re interested in the relationship between gender and income. To be even more concrete, suppose your data contains information from 1,000 survey respondents.
In a normal analysis, you would regress income on gender and get back a partial correlation between being a woman and income (along with an estimate of the standard error of that partial correlation, generated using the Central Limit Theorem and the assumptions we alluded to above).
But what if, rather than using those standard errors, we instead:
Re-sampled our data with replacement: we take our actual 1,000 observations and build a new “re-sampled” dataset by randomly picking observations from the original data one at a time, 1,000 times, allowing the same observation to be picked as many times as chance allows. For example, Respondent 1 might not end up in the “new” dataset at all, while Respondent 2 might end up in it 4 times. This gives us a new dataset that looks a lot like what we would have gotten if we’d actually gone back to our population of interest and collected 1,000 entirely new survey responses.
Run the regression: we regress income on gender with this new, re-sampled dataset. This will give us a new estimate of the partial correlation between gender and income (\(\beta\)) that should be similar to but distinct from the estimate we got with our original dataset (since some observations didn’t make it into this new dataset and some have been repeated).
Record the new value of \(\beta\): Just, you know, write it down or store it somewhere.
Then we just repeat this process a few hundred times until we have a big collection of estimates of \(\beta\) that we can think of as having been estimated using different samples from the population of interest! (A code sketch of this loop follows below.)
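Here’s a minimal sketch of that loop in Python. The DataFrame `acs`, with an `income` column and a 0/1 `female` indicator, is a hypothetical stand-in for the survey data described above (the numbers used to build it are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the survey data: 1,000 respondents
# with a 0/1 `female` indicator and an income.
rng = np.random.default_rng(0)
female = rng.integers(0, 2, size=1_000)
acs = pd.DataFrame({
    "female": female,
    "income": 50_000 - 8_000 * female
              + rng.normal(scale=20_000, size=1_000),
})

boot_betas = []
for _ in range(500):
    # Step 1: re-sample WITH replacement, keeping the same
    # number of rows as the original dataset.
    resampled = acs.sample(n=len(acs), replace=True)

    # Step 2: re-run the same regression on the re-sample.
    fit = smf.ols("income ~ female", data=resampled).fit()

    # Step 3: record the new estimate of beta.
    boot_betas.append(fit.params["female"])
```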
Why would we do this? Well, in short, the assumptions that must hold for this to be a good estimate of the standard error of \(\beta\) are usually easier to satisfy than the assumptions required for the standard errors generated by regression packages to be accurate.
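And once you have that collection of estimates, summarizing it is straightforward. Continuing from the hypothetical sketch above, the standard deviation of the bootstrapped \(\beta\)s serves as the bootstrap standard error, and percentiles give a simple confidence interval:

```python
import numpy as np

# The standard deviation of the bootstrapped betas is the
# bootstrap estimate of beta's standard error.
boot_se = np.std(boot_betas)

# And the middle 95% of the bootstrapped betas gives a simple
# (percentile) 95% confidence interval.
ci_low, ci_high = np.percentile(boot_betas, [2.5, 97.5])

print(f"Bootstrap SE: {boot_se:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```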
If this feels a little hand-wavy, don’t worry, it is. The goal here isn’t to learn all the nuances of bootstrapping, but rather to get the general idea of what it entails and how it’s implemented, because our next exercise will be to… bootstrap!