Vector Exercises¶
Warmup: Family and Friends¶
Create a vector that represents the age of at least four different family members or friends. You can name it whatever you want.
What is the mean age of the people in your vector? Find out in two ways, with and without using the
mean()
command.How old is the youngest person in your vector? (Use an R command to find out.)
What is the age gap between the youngest person and the oldest person in your vector?
(Again use R to find out, and try to be as general as possible in the sense that your code should work even if the elements in your vector, or their order, change.)
How many people in your vector are above age 25? (Again, try to make your code work even in the case that your vector changes.)
Replace the age of the oldest person in your vector with the age of someone else you know.
Create a new vector that indicates how old each person in your vector will be in 10 years.
Create a new vector that indicates what year each person in your vector will turn 100 years old.
Create a new vector with a random sample of 3 individuals from your original vector. What is the mean age of the people in this new vector?
Time for Real Data: Understanding the US Income Distribution¶
In these exercises, we will load and work with a vector that contains estimates of the total income (from all sources) of a random sample of American households collected by the U.S. Census Bureau between 2015 and 2019 as part of the American Community Survey (ACS).
(Apologies to people who are not from the United States – our modal student is from the US, and the US has good data, so picking the United States seemed like the least bad option. However, if you are interested in completing these same exercises for your own country, head over to IPUMS International to see if analogous income data has been made available by your country’s Census Bureau. Simply click on the “Browse Data” button, then “Select Sample” in the top left to find the most recent data available for your country. Then see if you can find income data under the “Select Harmonized Variables” “PERSON” or “HOUSEHOLD” drop-down menus. Note that income data is hard to collect, so it’s probably not available for most countries.)
Estimate your own household’s gross (pretax) income and write it down (we will be working with household gross income, so if multiple people in your household work add up their incomes.)
If you aren’t an American, head over to this OECD website to find the Purchasing Power Parity (PPP) exchange-rate between your currency and the US dollar in 2019. Note that this exchange rate may be quite different from the official exchange rate – PPP is a method of calculating exchange rates that is meant to take into account differences in cost-of-living and labor across countries to generate an exchange rate that accurately reflects buying power.
Now make a guess about what share of American households earn less money than your household (in other words, what is your household’s income percentile). For example, if I thought my household earned more than 90% of households, I would say that my household is in the 90th income percentile.
Load the Data¶
We are now going to load a vector of total incomes for a representative sample of US households between 2015 and 2019. You can do so with the following command to load the vector and assign it to
incomes
:
incomes <- scan(paste0("https://raw.githubusercontent.com/nickeubank/",
"computational_methods_boot_camp/main/source/",
"data/us_household_incomes.txt"))
(The function paste0()
takes a set of strings and glues them back together into a single string (it concatenates them). Here we don’t have to use it, but it helps break up this long string so it doesn’t just run off the right side of everyone’s screens.)
Getting to Know Our Data¶
When you get a dataset, it’s good to start by “getting to know” your data!
How many observations are in this dataset?
What was the mean (average) household income in the United States between 2015 and 2019?
What was the median household income in the United States during this period? (hint: if
mean()
returns the mean of a vector, you can probably guess how to get the median… :) ). The median income is the income of the household that earned more than 50% of American households, and earned less than 50% of American households.What do you learn from the fact that the mean and median are so different?
Use the
hist()
function to create a histogram of US incomes. Does this comport with your answer from 7?
Income Distribution¶
Now let’s see if your impression of where your household sat in the US income distribution was correct! To answer this, we’ll want to use the
quantile(v)
function. By default, it will provide you with the 0th, 25th, 50th, 75th, and 100th percentiles in the vectorv
. But if you also pass a vectorp
with values between 0 and 1 (e.g.,quantile(v, p)
), it will provide you with the values corresponding to those percentiles as well. So, for example,quantile(incomes, c(0.25))
will return the household income (if I named the vector that I got in question 3incomes
) that is greater than 25% of the values inincomes
(and implicitly the income that is less than the income of the other ~75% of households).
Plug in the percentile that you guessed in question 2—does that look like your household’s gross income? - Odds are that you guessed that you are somewhere near the middle of the US income distribution, but that you are not – this is actually a well-documented sociological phenomenon!.
Try out different percentiles until you find where your household fits into the US income distribution!
In the recent Occupy Wall Street movement in the United States, protesters could often be seen holding signs that read “We Are The 99%!” and “Tax the 1%!”. Without checking the data, what household income cut off do you think defines people who are in the 1%?
Now check your data – using
quantile()
, what is the income cut off that puts a household in the 1%? Is that more or less than what you expected? Then check out this article!
Note: This data cannot be used to reliably measure income cutoffs much above the top 1% given the number of observations in the data.
Bonus Exercise¶
While quantile
makes this easy to do, we don’t really need that function. Can you find the 75% percentile income in our data without quantile
? - Hint: you’ll want to start by sorting the data.
Data Citation¶
The ACS data used in this exercise are a subsample of the IPUMS USA data available from usa.ipums.org.
Please cite use of the data as follows: Steven Ruggles, Sarah Flood, Sophia Foster, Ronald Goeken, Jose Pacas, Megan Schouweiler and Matthew Sobek. IPUMS USA: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2021. https://doi.org/10.18128/D010.V11.0
These data are intended for this exercise only. Individuals analyzing the data for other purposes must submit a separate data extract request directly via IPUMS USA.
Individuals are not to redistribute the data without permission. Contact ipums@umn.edu for redistribution requests.