Bootstrapping Exercise¶
In this exercise, we will bring together a range of different tools and skills we’ve learned — including vector, matrix, and data.frame manipulation, loops, and functions — for a single purpose: calculating bootstrapped standard errors.
To review, to create a bootstrapped estimate of a coefficient and its standard error, we need to:
Re-sample our data with replacement: we take our actual 1,000 observations and create a new “re-sampled” dataset by picking one random observation at a time to put in our “new” data, 1,000 times, allowing the same observation to be picked as many times as chance allows. For example, Respondent 1 might not end up in the “new” dataset at all, while Respondent 2 might end up in it 4 times. This gives us a new dataset that looks a lot like what we would have gotten if we’d actually gone back to our population of interest and collected 1,000 entirely new survey responses.
Run the regression: we regress income on gender with this new, re-sampled dataset. This will give us a new estimate of the partial correlation between gender and income (\(\beta\)) that should be similar to but distinct from the estimate we got with our original dataset (since some observations didn’t make it into this new dataset and some have been repeated).
Record the new value of \(\beta\): Just, you know, write it down or store it somewhere.
But don’t worry, we aren’t going to jump straight to trying to do all that at once. When attempting something complicated (like bootstrapping), it is never a good idea to try and “just write it out.”
Rather, the way we will approach this is by building up a script little by little from running a simple regression to taking on the task of doing a full bootstrap.
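For orientation, here is a minimal sketch of a single pass through the three steps reviewed above. It uses simulated toy data rather than any real survey: the data frame `survey`, its columns `income` and `gender`, and all the numbers are made up for illustration. The exercises below build up each piece separately.

```r
# Hypothetical toy data standing in for a 1,000-person survey.
set.seed(42)
n <- 1000
survey <- data.frame(gender = rbinom(n, 1, 0.5))
survey$income <- 30000 + 5000 * survey$gender + rnorm(n, sd = 8000)

# 1. Re-sample: draw n row numbers with replacement, then pull those rows.
rows <- sample(n, size = n, replace = TRUE)
resampled <- survey[rows, ]

# 2. Run the regression on the re-sampled data.
fit <- lm(income ~ gender, data = resampled)

# 3. Record the new estimate of the coefficient on gender.
beta_hat <- coef(fit)["gender"]
beta_hat
```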
Resume Experiment Analysis¶
How much harder is it to get a job in the United States if you are Black than if you are White? Or, expressed differently, what is the effect of race on the difficulty of getting a job in the US?
In this exercise, we will be analyzing data from a real world experiment designed to help answer this question. Namely, we will be analyzing data from a randomized experiment in which 4,870 fictitious resumes were sent out to employers in response to job adverts in Boston and Chicago in 2001. The resumes differ in various attributes including the names of the applicants, and different resumes were randomly allocated to job openings.
The “experiment” part of the experiment is that resumes were randomly assigned Black- or White-sounding names, and the researchers then watched to see whether employers called the “applicants” with Black-sounding names at the same rate as the applicants with White-sounding names.
(Which names constituted “Black-sounding names” and “White-sounding names” was determined by analyzing names on Massachusetts birth certificates to determine which names were most associated with Black and White children, and then surveys were used to validate that the names were perceived as being associated with individuals of one racial category or the other).
You can get access to the original article here.
Note to Duke students: if you are on the Duke campus network, you’ll be able to access almost any academic journal articles directly; if you are off campus and want access, you can just go to the Duke Library website and search for the article title. Once you find it, you’ll be asked to log in, after which you’ll have full access to the article. You will also find this pattern holds true at nearly any major University in the US.
Step 1: Data Prep¶
Exercise 1¶
Download the data set resume_experiment.dta from github here, or by going to www.github.com/nickeubank/MIDS_Data and opening the resume_experiment folder.
Then, IN A NEW ``.R`` FILE, use read_dta from the haven library to load the data.
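As a rough sketch of the loading step: the snippet below assumes the haven package is installed and that the downloaded file sits in your current working directory. Both are assumptions, so adjust the path to wherever you saved the file. The name `resumes` is just the name used for the data frame in later examples.

```r
# Assumed location of the downloaded file; change this to match your setup.
data_path <- "resume_experiment.dta"

if (requireNamespace("haven", quietly = TRUE) && file.exists(data_path)) {
  # read_dta() reads Stata .dta files into a data frame (a tibble).
  resumes <- haven::read_dta(data_path)
  print(dim(resumes))  # number of rows and columns actually loaded
} else {
  message("Install haven and/or download the file before running this.")
}
```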
Exercise 2¶
black is the treatment variable in the data set (whether the resume has a Black-sounding name). call is the dependent variable of interest (whether the employer called the fictitious applicant for an interview).
In addition, the data include a number of variables to describe the other features in each fictitious resume, including applicants’ education level (education), years of experience (yearsexp), gender (female), computer skills (computerskills), and number of previous jobs (ofjobs). Each resume has a random selection of these attributes, so on average the Black-named fictitious applicant resumes have the same qualifications as the White-named applicant resumes.
For this analysis, we will focus our attention on less educated job applicants. In this dataset, education is a categorical variable coded as follows:
0: Education not reported
1: High school dropout
2: High school graduate
3: Some college
4: College graduate or higher
Please subset your data to only include applications that did not report any college education.
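A minimal sketch of row subsetting on a toy data frame follows. The data and the choice of `keep_codes` are purely illustrative; deciding which education codes count as “no college” in the real data (and how to treat code 0) is part of the exercise.

```r
# Hypothetical toy data with an education code like the one described above.
toy <- data.frame(
  education = c(0, 1, 2, 3, 4, 2),
  call      = c(0, 0, 1, 0, 1, 0)
)

# Illustrative set of codes to keep; not necessarily the right set for the exercise.
keep_codes <- c(1, 2)

# A logical vector in the row position keeps only the rows where it is TRUE.
toy_subset <- toy[toy$education %in% keep_codes, ]
nrow(toy_subset)
```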
Exercise 3¶
Now that we have the dataset we are interested in analyzing, let’s begin by running a normal linear regression. If you aren’t familiar with R, this can be accomplished with the lm() function (“lm” for “linear model”). To regress whether an applicant got a call back on whether their resume had a Black-sounding name, whether they were female, and the number of past jobs listed, you would run code that looks something like:
lm("call ~ black + female + ofjobs", resumes)
where you may replace resumes with whatever you’ve been calling your dataframe.
What is the coefficient on having a Black-sounding name, and what is the associated standard error?
(You may have to call summary() on the result of lm() to see the standard error. I recommend assigning the result of lm() to a new variable, then calling summary() on that new variable).
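To illustrate the lm() and summary() pattern without touching the resume data, here is a sketch on simulated data. The data frame `toy` and the variables `x`, `z`, and `y` are hypothetical.

```r
# Simulated toy data: y depends on x and z plus noise.
set.seed(1)
toy <- data.frame(x = rnorm(200), z = rbinom(200, 1, 0.5))
toy$y <- 1 + 2 * toy$x - 0.5 * toy$z + rnorm(200)

# Fit the model and keep the result in a variable.
fit <- lm(y ~ x + z, data = toy)

# summary() adds standard errors, t-statistics, and p-values.
fit_summary <- summary(fit)
print(fit_summary)

# coef() on the summary returns the coefficient table as a matrix,
# so individual cells can be pulled out by row and column name.
coef_table <- coef(fit_summary)
coef_table["x", "Estimate"]
coef_table["x", "Std. Error"]
```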
Step 2: Do A Single Bootstrap¶
Because the idea of bootstrapping is to do something over and over and over, your mind is hopefully thinking “I’m gonna use a loop!” And you’d be right! But before we do that, let’s try and write a little bit of code that does, just once, the thing we eventually want to do over and over.
Exercise 4¶
The first step in bootstrapping is to re-sample our data with replacement. Use the sample() function to create a new data.frame that contains a random sample (with replacement) of the rows from the original data.
You will probably need to check the documentation of sample().
Hint: Remember if you wanted to get rows 1, 1, 3, and 9 from an existing dataframe as the first four rows of a new dataframe, you could do it with resumes[c(1, 1, 3, 9), ]. You might want to think about how to build on that example using the output of sample().
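One way to combine sample() with row indexing is sketched below on a tiny made-up data frame (`toy`, with hypothetical columns `id` and `score`), so the repeated and missing rows are easy to spot by eye.

```r
set.seed(7)
toy <- data.frame(id = 1:6, score = c(10, 20, 30, 40, 50, 60))

# With replace = TRUE, the same row number can be drawn more than once.
picked <- sample(nrow(toy), size = nrow(toy), replace = TRUE)
picked

# Use the drawn row numbers to build the re-sampled data frame.
toy_boot <- toy[picked, ]
toy_boot

# Same number of rows as the original.
nrow(toy_boot) == nrow(toy)
```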
Exercise 5¶
Verify that your new dataset has the same number of observations as your original dataset! (Always good to check your work).
Exercise 6¶
Now re-run the regression you ran above using this bootstrapped sample. You should get a coefficient on black that is similar to but distinct from the one you got before.
Exercise 7¶
Now the slightly tricky part. Because we’re going to do this over and over, we need to find a way to “save” the coefficient on black, which means we need to figure out how to get just the coefficient on black out of the object we got back from lm().
Objects like what you get back from lm() are kind of like lists, and you can see their contents with the str() command (for “structure”). So to see what’s in the result of lm(), you can do something like:
my_model <- lm("call ~ black + female + ofjobs", resumes)
str(my_model)
That output should tell you that my_model is basically just a list, and you can get entries out of a list with the $ operator. So in this case, you can get the coefficients from a regression with my_model$coefficients.
I will leave it to you to figure out how to get out JUST the coefficient on black.
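As a nudge rather than an answer: the coefficients come back as a named numeric vector, and named vectors can be indexed by name. The sketch below uses a toy model and a made-up vector `prices` to show the mechanics. All names here are hypothetical.

```r
# Toy model on simulated data, just to inspect the structure of an lm object.
set.seed(3)
toy <- data.frame(x = rnorm(50), z = rnorm(50))
toy$y <- 2 * toy$x + toy$z + rnorm(50)
toy_model <- lm(y ~ x + z, data = toy)

names(toy_model)        # the list entries available via $
toy_model$coefficients  # a named numeric vector

# Named vectors can be indexed by name.
prices <- c(apple = 0.5, pear = 0.75, plum = 0.2)
prices["pear"]    # keeps the name attached
prices[["pear"]]  # returns the bare value
```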
Exercise 8¶
Now, if you’ve been doing all this one step at a time, you may have a bit of a mess in your .R file, so this is a good time to clean things up and leave some comments. At the top, you should have a block that reads in this data and subsets it for less educated applicants.
Then you should have a section with the first regression you ran.
Finally, you should have a section where you are building your bootstrap. It starts by re-sampling your data, running your regression using the new data, and extracting the coefficient on black.
Take a minute to add some comments and clean up your file to reflect this organization.
Step 3: Build the Loop¶
Now it’s time to build a loop! Initially, let’s just write a loop that will run 5 times (it’s easy to make that more later, but we don’t want it to take a long time running while we’re troubleshooting!)
Exercise 9¶
First, write a loop that will run 5 times, increasing the value of a variable i each time. Don’t put any of our bootstrap code in there yet; just add a print(i) statement so you can see that it’s running five times and incrementing i.
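For reference, the general shape of a counting loop in R looks like the sketch below. The variable `n_iterations` and the printed message are illustrative only.

```r
# seq_len(n) produces 1, 2, ..., n, so the body runs n times.
n_iterations <- 3
for (i in seq_len(n_iterations)) {
  print(paste("pass", i))
}
```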
Exercise 10¶
Now put the bootstrap code inside the loop. Now when you run this code, you should see five regressions being run, each a little different. You can add a print statement if needed (e.g., print(bootstrapped_black_coef)) to ensure you can see what’s happening.
Exercise 11¶
To make the bootstrap work, we’ll need to store the value of our coefficient each time the loop runs. So above the loop, create a vector that’s 5 entries long (e.g., bootstrap_coefficients <- rep(0, 5), which repeats the first argument (0) 5 times).
Then, each time the loop goes around, store the value of the coefficient on black in the vector. Put the first estimate in the first spot, the second in the second, etc.
Hint: you’ll want to use that i variable we created above.
When you’re done, the vector you created (I called it bootstrap_coefficients above) should have no entries that are the 0s that were there when it was created, and instead all five entries should contain different estimates of the coefficient on black.
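The pre-allocate-then-fill pattern can be sketched on a simpler statistic than a regression coefficient. Here the loop stores the mean of each re-sample of a made-up vector `values`. The names `values`, `n_draws`, and `draw_means` are hypothetical.

```r
set.seed(11)
values <- rnorm(100, mean = 10)

# Create the storage vector before the loop starts.
n_draws <- 4
draw_means <- rep(NA_real_, n_draws)

for (i in seq_len(n_draws)) {
  draw <- sample(values, size = length(values), replace = TRUE)
  draw_means[i] <- mean(draw)  # slot i holds the result of pass i
}

draw_means
```

This sketch fills the vector with NA rather than 0 so that any slot the loop failed to fill stands out. Starting from rep(0, 5) as suggested above works just as well.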
Step 4: Scale It Up¶
OK, we have it working, let’s go big!
Exercise 12¶
We’re going to do our bootstrap 500 times, so make the vector in which you are storing coefficients bigger (rep(0, 500)) and set your loop to run 500 times. Then let ‘er rip!
Exercise 13¶
Great! Now let’s get that final estimate we wanted. Using mean() and sd(), get the mean and standard deviation of your vector of coefficient estimates. How do they compare to the original estimate and standard error you got straight from the lm() command?
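To see why the standard deviation of the bootstrap estimates plays the role of a standard error, here is a sketch on a case where the analytic answer is known: the standard error of a sample mean is sd(x) / sqrt(n). The data are simulated and the names (`x`, `n_boot`, `boot_means`) are hypothetical.

```r
set.seed(2024)
x <- rnorm(500, mean = 5, sd = 2)

n_boot <- 1000
boot_means <- rep(NA_real_, n_boot)
for (i in seq_len(n_boot)) {
  boot_means[i] <- mean(sample(x, size = length(x), replace = TRUE))
}

mean(boot_means)          # should sit close to mean(x)
sd(boot_means)            # bootstrapped standard error of the mean
sd(x) / sqrt(length(x))   # analytic standard error, for comparison
```

The last two numbers should be close but not identical, which is the same kind of comparison the exercise asks you to make against the standard error reported by lm().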
(Extra Credit) Step 5: Generalize It¶
Want more practice? Here are four things to do:
Extra Credit 1: Generalize The Coefficient Storage¶
Rather than just storing the coefficient on black, store all the coefficients on each pass. Hint: to do this, you won’t be able to store the results in a vector any more…
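One possible container is a matrix with one row per bootstrap pass and one column per coefficient. This is a sketch on simulated data; a list or data.frame would also work, and every name here is hypothetical.

```r
set.seed(5)
toy <- data.frame(x = rnorm(100), z = rnorm(100))
toy$y <- 1 + 2 * toy$x - toy$z + rnorm(100)

n_boot <- 5

# Fit once on the full data to learn the coefficient names.
coef_names <- names(coef(lm(y ~ x + z, data = toy)))

# One row per pass, one named column per coefficient.
boot_coefs <- matrix(NA_real_, nrow = n_boot, ncol = length(coef_names),
                     dimnames = list(NULL, coef_names))

for (i in seq_len(n_boot)) {
  rows <- sample(nrow(toy), size = nrow(toy), replace = TRUE)
  boot_coefs[i, ] <- coef(lm(y ~ x + z, data = toy[rows, ]))
}

boot_coefs
```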
Extra Credit 2: Put It In A Function¶
Take the code that runs your regression and extracts the coefficients and put it in a stand-alone function. This is a little gratuitous — it’s not that complicated a piece of code — but it’s fun practice for writing functions!
Extra Credit 3: Generalize The Function¶
Take that function and parameterize it so that users can specify a list of coefficients to save.
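A sketch of what such a function might look like, tried out on simulated data. The function name `fit_and_extract` and its arguments `spec`, `data`, and `keep` are made up for illustration; your own design may differ.

```r
# Fit a regression from a string specification and return coefficients.
# If keep is NULL, return all of them; otherwise only the named ones.
fit_and_extract <- function(spec, data, keep = NULL) {
  fit <- lm(as.formula(spec), data = data)
  estimates <- coef(fit)
  if (is.null(keep)) {
    return(estimates)
  }
  estimates[keep]
}

# Try it on toy data.
set.seed(9)
toy <- data.frame(x = rnorm(80), z = rnorm(80))
toy$y <- 1 + 2 * toy$x - toy$z + rnorm(80)

all_coefs <- fit_and_extract("y ~ x + z", toy)
only_x <- fit_and_extract("y ~ x + z", toy, keep = "x")
only_x
```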
Extra Credit 4: Generalize It Even More¶
Pull this all together by wrapping all this up in a bootstrap regression function that takes:
a regression specification (as a string),
a dataset,
a number of iterations, and
a list of coefficients to retain
And returns the point estimate and standard errors for those coefficients!
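One possible shape for the finished function is sketched below on simulated data. The name `bootstrap_regression` and its argument names are hypothetical. The sketch also makes an assumption the text leaves open: it reports the point estimate from the original data and uses the standard deviation of the bootstrap estimates as the standard error.

```r
bootstrap_regression <- function(spec, data, n_iter, keep) {
  model_formula <- as.formula(spec)

  # Point estimates from the original (not re-sampled) data.
  point_estimates <- coef(lm(model_formula, data = data))[keep]

  # One row per bootstrap pass, one column per retained coefficient.
  boot_coefs <- matrix(NA_real_, nrow = n_iter, ncol = length(keep),
                       dimnames = list(NULL, keep))
  for (i in seq_len(n_iter)) {
    rows <- sample(nrow(data), size = nrow(data), replace = TRUE)
    boot_coefs[i, ] <- coef(lm(model_formula, data = data[rows, ]))[keep]
  }

  data.frame(
    coefficient = keep,
    estimate    = unname(point_estimates),
    std_error   = unname(apply(boot_coefs, 2, sd))
  )
}

# Usage on toy data.
set.seed(21)
toy <- data.frame(x = rnorm(150), z = rnorm(150))
toy$y <- 1 + 2 * toy$x - toy$z + rnorm(150)
boot_result <- bootstrap_regression("y ~ x + z", toy, n_iter = 200,
                                    keep = c("x", "z"))
boot_result
```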