Dataframe Exercises

Manipulating Real Data

  1. Download the world-small.csv dataset from here by clicking the raw button, then right-clicking, selecting “Save As…”, and saving the file as world-small.csv.

  2. Read the world-small.csv data into R and store it in an object called world. (Set your working directory using code first.)

  3. (Conceptual) What is the unit of analysis in the dataset? In other words, what does each row represent?

  4. How many observations does world have? How many variables? Use an R command to find out.

  5. Use brackets and a logical statement to inspect all the values for Nigeria and United States. That is, your code should return two entire rows of the dataset.

  6. Use R to return China’s Polity IV score (this is a measure of how Democratic its political institutions are). As in question 4, use a logical statement and brackets, but don’t return the entire row. Rather, return a single value with the Polity IV score.

  7. What is the lowest GDP per capita in the dataset? (Use R to return only the value.)

  8. What country has the lowest GDP per capita? (Your code should return the country name and be general enough so that if the observations in the dataset — or their order — change, your code should still return the country with the lowest GDP per capita.)

Estimating Labor Market Returns to Education

In this exercise, we’re going to use data from the American Communities Survey (ACS) to study the relationship between educational attainment and wages. The ACS is a survey conducted by the United States Census Bureau (though it is not “The Census,” which is a counting of every person in the United States that takes place every 10 years) to measure numerous features of the US population. The data we will be working with includes about 100 variables from the 2017 ACS survey, and is a 10% sample of the ACS (which itself is a 1% sample of the US population, so we’re working with about a 0.1% sample of the United States).

This data comes from IPUMS, which provides a very useful tool for getting subsets of major survey datasets, not just from the US, but from government statistical agencies the world over.

This is real data, meaning that you are being provided the data as it is provided by IPUMS. Documentation for all variables used in this data can be found here (you can either search by variable name to figure out the meaning of a variable in this data, or search for something you want to see if a variable with the right name is in this data).

Within this data is information on both the educational background and current earnings of a representative sample of Americans. We will now use this data to estimate the labor-market returns to graduating high school and college, and to learn something about the meaning of an educational degree.

1) Data for these exercises can be found here. First, download US_ACS_2017_10pct_sample.dta.

2) Now import US_ACS_2017_10pct_sample.dta. Because this file is a Stata datafile (ends with a .dta suffix), you need to install and load the foreign package, then read in the file with the command read.dta.

2b) Interestingly, you don’t always have to download data to load it! R can also load data directly from a URL if you get the right URL. To illustrate, head over to this location, right click on the “Download” button, and select “Copy Link”. Then in R, instead of writing the path to your file, just paste the URL (e.g. acs <- read.dta("url_here").)

Getting to Know Your Data

When you get a new dataset like this, it’s good to start by trying to get a feel for its contents and organization. Toy datasets you sometimes get in classes are often very small, and easy to look at, but this is a pretty large dataset, so you can’t just open it up and get a good sense of it. Here are some ways to get to know your data.

3) How many rows are in your data?

4) How many columns are in your data?

5) Let’s see what variables are in this dataset. Use names(acs) to get a list of all of the column names.

6) That’s a lot of variables, and definitely more than we need. In general, life is easier when working with these kinds of huge datasets if you can narrow down the number of variables a little. In this exercise, we will be looking at the relationship between education and wages, we need variables for:

  • Age

  • Income

  • Education

  • Employment status (is the person actually working)

These quantities of interest correspond to the following variables in our data: age, inctot, educ, and empstat.

Subset your data to just those variables.

7) Now that we have a more manageable number of variables, it’s often very useful to look at a handful of rows of your data. The easiest way to do this is probably the head() method (which will show you the first six rows), or the tail() method, which will show you the last six rows.

But to get a good sense of your data, it’s often better to use sample() command, which returns a random set of rows. As the first and last rows are sometimes not representative, a random set of rows can be very helpful. Try looking at a random sample of 20 rows with a command like: df[sample(nrow(df), 20), ].

(sample(nrow(df), 20) will create a list of 20 indices from between 1 and nrow(df). Then, because this is in the first position of our array index, we’ll get the rows associated with those indices). This is a great example of how we get sophisticated functionality by composing more basic operations!)

8) Do you see any immediate problems? Write them down with your partner.

9) One problem is that many people seem to have incomes of $9,999,999. Moreover, people with those incomes seem to be very young children.

What you are seeing is one method (a relatively old one) for representing missing data. In this case, the value 9999999 is used to denote observations for which there is no data (Or more specifically, in this case observations where the person is too young to work, so their income value is missing).

So let’s begin by dropping anyone who has inctot equal to 9999999.

10) OK, the other potential problem is that our data includes lots of people who are unemployed and people who are not in the labor force (this means they not only don’t have a job, but also aren’t looking for a job). For this analysis, we want to focus on the wages of people who are currently employed. So subset the dataset for the people for whom empstat is equal to “employed”.

Note that our decision to only look at people who are employed impacts how we should interpret the relationship we estimate between education and income. Because we are only looking at employed people, we will be estimating the relationship between education and income for people who are employed. That means that if education affects the likelihood someone is employed, we won’t capture that in this analysis. (Economists all this the “intensive margin”, while looking at whether people get jobs in the first place is called the “extensive margin”.)

11) Now let’s turn to education. The educ variable seems to have a lot of discrete values. Let’s see what values exist, and their distribution, using the table() function. This is an extremely useful tool you’ll use a lot! Try the following code (modified for the name of your dataset, of course):

table(acs[,'educ'])

12) There are a lot of values in here, so let’s just check a couple. What is the average value of inctot for people whose highest grade level is “grade 12” (in the US, that is someone who has graduated high school)?

13) What is the average income of someone who graduated college (“4 years of college”)? What does that suggest is the value of getting a college degree after graduating high school?

14) What is the average income for someone who has not finished high school? What does that suggest is the value of a high school diploma?

15) Complete the following table:

  • Average income for someone who has not finished high school: _________

  • Average income for someone who only completed 9th grade: _________

  • Average income for someone who only completed 10th grade: _________

  • Average income for someone who only completed 11th grade: _________

  • Average income for someone who finished high school (12th grade) but never started college: _________

  • Average income for someone who completed 4 year of college (in the US, this means graduating from college): _________

16) Why do you think there is no benefit from moving from grade 9 to grade 10, or grade 10 to grade 11, but there is a huge benefit to moving from grade 11 to graduating high school (grade 12)?

Take-aways

Congratulations! You just discovered “the sheepskin effect!”: people with degrees tend to earn substantially more than people who have almost as much education, but don’t have an actual degree.

In economics, this is viewed as evidence that the reason employers pay people with high school degrees more than those without degree is not that they think those who graduated high school have learned specific, useful skills. If that were the case, we would expect employee earnings to rise with every year of high school, since in each year of high school we learn more.

Instead, this suggests employees pay high school graduates more because they think the kind of people who can finish high school are the kind of people who are likely to succeed at their jobs. Finishing high school, in other words, isn’t about accumulating specific knowledge; it’s about showing that you are the kind of person who can rise to the challenge of finishing high school, also suggesting you are also the kind of person who can succeed as an employee.

(Obviously, this does not tell us whether that is an accurate inference, just that that seems to be how employeers think.)

In other words, in the eyes of employers, a high school degree is a signal about the kind of person you are, not certification that you’ve learned a specific set of skills (an idea that earned Michael Spence a Nobel Prize in Economics).

Want More Practice?

(1) The US Census isn’t the only organization that published surveys of their populations – most countries in the world conduct regular surveys, and most publish versions of their data online.

In addition to making US Census data available to researchers, IPUMS (through “IPUMS International”) also makes survey data available from a wide range of countries.

So if you’d like more practice with basic data manipulations, go to IPUMS International and download information on income and education from the country of your choice:

  1. Click “Select Samples” to pick the country and year from which you would like data, then click “Submit Sample Selections”. Note that most countries don’t include income as a variable! Asking income is very sensitive, so is often left off surveys. You can get income directly from surveys of Mexico, Israel, Canada, India (in 2004), Panama, and Venezuela. But you can get “income proxies” (variables that are very correlated with income) like household construction materials (people with unfinished dirt floors tend to be poorer than people with concrete floors) for other countries like most African and Latin American surveys. Also: note there are non-IPUMS surveys which often do provide this data, they just aren’t here.

  2. Select variables (most likely under the “Person” tab) for education and income, or under “Dwelling Characteristics” if you’re looking for wealth proxies.

  3. Click “View Cart”

  4. Click “Select Data Extract”

  5. Change the data format to .dta or to .csv.

  6. Click “Submit Extract”

  7. Login with the IPUMS account you created during week 1 (if you didn’t do that (as assigned!!), register for IPUMS now).

  8. When extract is ready, download!

(2) Repeat these exercises for your new dataset!

Data Citation

The ACS data used in this exercise are a subsample of the IPUMS USA data available from usa.ipums.org.

Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0

These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.

Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.