Matrix Exercises

For the following matrix manipulation exercises, begin by building the following matrix (you’ll probably want to just copy-paste this code), which we can imagine is a survey of age, education and income:

income <- c(22000, 65000, 19000, 110000, 14000, 0, 35000)
age <- c(20, 35, 55, 35, 21, 56, 42)
education <- c(12, 16, 11, 22, 12, 8, 12)

svy <- cbind(income, age, education)
svy

# Delete vectors we used to make matrix—
# the goal here is to only work with the matrix.
# Vectors were just to build it up. :)
rm(income, age, education)

A matrix: 7 × 3 of type dbl

income  age  education
 22000   20         12
 65000   35         16
 19000   55         11
110000   35         22
 14000   21         12
     0   56          8
 35000   42         12

Exercise 1: Summarizing Data

  1. What is the average age of all respondents?

  2. What is the average income of respondents over 30?

  3. What is the average education of respondents with incomes above the average income for all respondents?
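If logical subsetting feels rusty, here is a minimal sketch of the pattern on a made-up toy matrix (the names `toy`, `height`, and `weight` are hypothetical and have nothing to do with the survey). It is only meant to illustrate the idea, not to be the one right way to do it:

```r
# Hypothetical toy matrix (not the survey data), built the same way as svy
height <- c(150, 160, 170, 180)
weight <- c(50, 60, 80, 90)
toy <- cbind(height, weight)

# Mean of a single column, selected by name
mean_height <- mean(toy[, "height"])

# Logical vector: TRUE for rows where height is over 155
tall <- toy[, "height"] > 155

# Mean weight among only those rows
mean_weight_tall <- mean(toy[tall, "weight"])

# The threshold can itself be computed from the data
above_avg <- toy[, "height"] > mean(toy[, "height"])
mean_weight_above_avg <- mean(toy[above_avg, "weight"])
```

The key move is that a comparison on one column produces a vector of TRUE/FALSE values, which you can then use to pick rows when pulling out a different column.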

Exercise 2: Editing Data

The US government is thinking about offering a $1,500 tax credit to anyone making less than $20,000 a year.

  1. Using the data from svy, create a new vector by subsetting and editing the original income column with the incomes respondents will receive after this tax credit.

    • Do so by subsetting and editing values programmatically, not just typing values by hand. (Yes, writing out a new vector by hand is easy to do in this example, but you couldn’t do it with a large, real dataset!)

    • Do not change the original income column in the process of creating this vector.

  2. What will the average after-tax income be for all respondents?

  3. Add your new column with updated, post-refund incomes as a fourth column in your matrix.

To solve this problem, you’ll want to use the cbind function, short for “column bind.” As detailed in the R documentation (seriously, go take a look; R documentation is really good and helpful!), cbind concatenates (glues together) matrices horizontally to make new matrices.
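As a rough sketch of the copy, edit, and bind pattern, here it is on a made-up toy matrix (the names `toy`, `score`, `hours`, and `curved_score` are hypothetical, and the "add 5 points below 60" rule is invented for illustration):

```r
# Hypothetical toy matrix (not the survey data)
score <- c(40, 75, 55, 90)
hours <- c(2, 6, 3, 8)
toy <- cbind(score, hours)

# Copy the column into a new vector; edits to the copy leave the matrix alone
curved_score <- toy[, "score"]

# Add 5 points only where the ORIGINAL score is below 60
low <- toy[, "score"] < 60
curved_score[low] <- curved_score[low] + 5

# Glue the new vector on as a third column;
# cbind names the new column after the variable
toy <- cbind(toy, curved_score)
```

Because the edit happens on the copy, the original `score` column in `toy` still holds the unmodified values after the new column is attached.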

Exercise 3: Income Inequality (with Real Data!)

In this exercise, we’ll be working with data from the US Current Population Survey, provided by the National Bureau of Economic Research (NBER). This is a regular survey conducted for the US Bureau of Labor Statistics and used to calculate the US unemployment rate.

In this exercise, we’ll use this data to study gender and racial wage inequality in the US.

  1. Load data from the 2018 CPS survey with the following command:

cps <- as.matrix(read.table(paste0("https://raw.githubusercontent.com/nickeubank/",
                                   "computational_methods_boot_camp/main/source/",
                                   "data/cps.txt")
                            )
                )

(Users with more R experience may find it odd we’re doing what we’re doing with the as.matrix command here, but please just go with it! We’re practicing matrix manipulations here. We’ll get to other data structures soon.)

This data is a subset of the full CPS survey, and contains only data on employed respondents working at least 35 hours a week (i.e., full-time).

  1. Using dim, evaluate the size of this matrix. How many rows and how many columns does this matrix have?

  2. The five columns (yes, I know I’m giving away the answer to the question above) of this matrix correspond to:

    • Column 1: Weekly income in dollars.

    • Column 2: Usual hours respondent works per week.

    • Column 3: Gender. 2 is “Female”, 1 is “Male”

    • Column 4: Race. This can take on a lot of values for those who identify as mixed race, but for simplicity in this exercise we’ll just focus on a couple of values. For those interested, the full set of codes can be found on page 19 of the CPS codebook.

      • 1: White

      • 2: Black

      • 3: American Indian

      • 4: Asian only

      • 5: Hawaiian/Pacific Islander only

    • Column 5: Survey weights.

Note that race does not break out Hispanic / non-Hispanic identities. In US government surveys, Hispanic / non-Hispanic is usually recorded in a separate ethnicity variable, so many people who identify as Hispanic are identified as White or Black in the race variable.
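When columns are identified by position and categories are stored as numeric codes, as they are here, the basic moves look something like this. This is a minimal sketch on a made-up two-column matrix (`toy` and its contents are hypothetical, not the CPS data):

```r
# Hypothetical toy matrix without meaningful column names (not the CPS data).
# matrix() fills column by column, so the first four numbers form column 1.
toy <- matrix(c(10, 20, 30, 40,   # column 1: some measurement
                1,  2,  2,  1),   # column 2: a category code
              ncol = 2)

# dim() returns c(number of rows, number of columns)
toy_dim <- dim(toy)
n_rows <- nrow(toy)
n_cols <- ncol(toy)

# Select a column by position, then keep rows where the code equals 2
coded_two <- toy[, 2] == 2
values_for_code_two <- toy[coded_two, 1]
```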

For the moment, let’s ignore survey weights—they don’t impact results here significantly.

Calculate the average hourly wage for workers. Note this will require more than just using mean on a single column!

Hint: If you create a new column in the process of answering this question, you might want to just add it to your matrix with cbind… odds are you’ll use it again. :)
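To see why a single call to mean on one column isn't enough when the quantity you care about is a per-row ratio, here is a small sketch on made-up data (distance and time rather than income and hours; all names are hypothetical):

```r
# Hypothetical toy matrix: column 1 = distance in km, column 2 = time in hours
toy <- cbind(c(100, 30, 45), c(2, 1, 3))

# Dividing two columns works element by element, giving one result per row
speed <- toy[, 1] / toy[, 2]

# Append it as a new column so later steps can reuse it
toy <- cbind(toy, speed)

# Average of the per-row ratios
mean_speed <- mean(toy[, 3])

# For comparison: the ratio of the two column means is a different number
ratio_of_means <- mean(toy[, 1]) / mean(toy[, 2])
```

In this toy example the average of the per-row speeds and the ratio of the column averages do not agree, so it is worth being deliberate about which one you are computing.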

  1. Now calculate the average hourly wage of working men and the average hourly wage of working women.

  2. Calculate the wage gap as: (men’s average hourly wage minus women’s average hourly wage) / (men’s average hourly wage).
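The same "difference divided by the reference group's average" arithmetic can be sketched on made-up numbers. Here `toy`, the group codes, and the values are all hypothetical:

```r
# Hypothetical toy matrix: column 1 = a value, column 2 = a group code (1 or 2)
toy <- cbind(c(20, 30, 16, 24), c(1, 1, 2, 2))

# Group averages, using the code column to pick rows
mean_group1 <- mean(toy[toy[, 2] == 1, 1])   # (20 + 30) / 2
mean_group2 <- mean(toy[toy[, 2] == 2, 1])   # (16 + 24) / 2

# Relative gap, expressed as a share of group 1's average
gap <- (mean_group1 - mean_group2) / mean_group1
```

With these toy numbers the gap works out to (25 - 20) / 25 = 0.2, meaning group 2's average is 20% below group 1's.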

  1. Now calculate the average hourly wage for White respondents and Black respondents.

This will only be an approximation: one would normally also include all mixed-race respondents in non-mutually exclusive categories like “Any Part Black” or “Any Part White”, and we would also break out Hispanic and non-Hispanic respondents. But as most respondents only pick one racial category, this will still give us a reasonable approximation.

Bonus Exercises!

  1. Now let’s take our sample weights seriously! Calculate the average hourly wage in the US taking into account weights. How?

Well, a normal average is calculated by taking each value, multiplying it by \(1/N\) where \(N\) is the number of observations, and then adding up all the results.

For a weighted average, we take the value for each observation \(i\) and multiply it by

\[\frac{weight_i}{\sum_{j=1}^N weight_j}\]

where \(weight_i\) is the observation’s weight, and \(\sum_{j=1}^N weight_j\) is the total of all the weights in the population being averaged.

Then we just add those up!
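Here is a minimal sketch of that recipe on three made-up observations (`values` and `wts` are hypothetical names and numbers). It also compares the result against base R's `weighted.mean` function as a sanity check:

```r
# Hypothetical toy data: three observations and their weights
values <- c(10, 20, 30)
wts    <- c(1, 1, 2)

# Each observation's share of the total weight; the shares sum to 1
shares <- wts / sum(wts)

# Multiply each value by its share, then add everything up
weighted_avg <- sum(values * shares)   # 10*0.25 + 20*0.25 + 30*0.5

# Base R's weighted.mean() should agree with the by-hand version
check <- weighted.mean(values, wts)

# With equal weights, every share is 1/N and we recover the ordinary mean
equal_shares <- rep(1, length(values)) / length(values)
plain_avg <- sum(values * equal_shares)
```

In this toy example the third observation counts double, so the weighted average (22.5) sits above the ordinary average (20).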