Manipulating Dataframes

In our previous lessons, we learn how to create dataframes from vectors, and how to load them from files. In this lesson, we will learn how to work with our dataframe once we have it loaded up!

Being able to quickly modify datasets – often referred to as “data wrangling” – is critical to being a social scientist. Indeed, most social scientists and data scientists spend a huge proportion of their time of their time cleaning and organizing their data. (about 80 percent in surveys). So this is probably one of the most important readings of the course!

Be aware that there will be a lot of syntax in this reading. However, the goal of the reading is not to have you memorize all the syntax, but rather to understand the logic of how dataframes work. To help with that, I’ve provided a recap section at the end of the reading with examples of all the commands we cover in one place.

So as you read, try and focus on the logic of how dataframes work, not the exact syntax. Syntax is something you can always look up later so long as you understand enough about the logic of what’s going on to realize what you need to look up!

To begin, let’s start by recreating the dataframe we had in our last exercise as an example to work with:

[1]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries
A data.frame: 9 × 3
countryyeargdp_pc
<chr><dbl><dbl>
USA 199416166
China199413804
Sudan1994 5507
USA 199514966
China199519778
Sudan199510592
USA 1996 6852
China1996 5114
Sudan1996 7966

Dataframes Are Like Matrices-Plus

Dataframes, like matrices, are just two dimensional grids of data. And so everything we learned about matrices also applies to dataframes (hooray!).

For example, if we want to get the GDP of China in 1994 from our dataframe, we can subset our dataframe using square brackets and logical / name vectors, just like named matrices:

[2]:
countries[countries[, "gdp_pc"] < 10000, "year"]
  1. 1994
  2. 1996
  3. 1996
  4. 1996

So remember that: if you could do it with a matrix, you can do it with a dataframe!

As we’ll see, though, dataframes do have few augmentations designed to make the life of the R users a little easier, and so can kinda be thought of as being like matrices+.

Columns Operations

Unlike matrices, which could have column names, dataframe columns always have names. As a result, we will basically always address columns using their names for reasons will discuss in more detail below.

It is also convention that columns in most datasets correspond to variables, so you we’ll often want to do things like take the average of a single column (e.g. the average of a single variable), or edit the values in a single column.

In fact, accessing a single column is so common with dataframes that there are two exactly-identical ways to get a single column, and one nearly equivalent way. The equivalent ways are:

[3]:
# What we're used to from matrices
countries[, "gdp_pc"]
  1. 16166
  2. 13804
  3. 5507
  4. 14966
  5. 19778
  6. 10592
  7. 6852
  8. 5114
  9. 7966
[4]:
# The shortcut for a single dataframe column
countries$gdp_pc
  1. 16166
  2. 13804
  3. 5507
  4. 14966
  5. 19778
  6. 10592
  7. 6852
  8. 5114
  9. 7966

These are exactly equivalent for single columns! But note that you can’t always use this trick – for example, it doesn’t work for trying to get several columns from a dataframe. Most of the time, though, it’s a very convenient shorthand for single-column manipulations.

And the nearly equivalent way is the following, which is slightly different from those above in that instead of returning a vector, it returns a dataframe with one column:

[5]:
# And if you don't have a comma, R assumes you're accessing columns
countries["gdp_pc"]
A data.frame: 9 × 1
gdp_pc
<dbl>
16166
13804
5507
14966
19778
10592
6852
5114
7966

Modifying Columns

As with matrices, we can use subsetting to make modifications to columns. For example, suppose, as with our matrix version, we wanted to multiple GDP per capita by 1.02 to adjust for inflation. We could either do:

[6]:
# re-create with original gdp_pc
countries <- data.frame(country, year, gdp_pc)
countries
A data.frame: 9 × 3
countryyeargdp_pc
<chr><dbl><dbl>
USA 199416166
China199413804
Sudan1994 5507
USA 199514966
China199519778
Sudan199510592
USA 1996 6852
China1996 5114
Sudan1996 7966
[7]:
countries[, "gdp_pc"] <- countries[,"gdp_pc"] * 1.02

Or

[8]:
countries$gdp_pc <- countries$gdp_pc * 1.02

Creating New Columns

If we wanted to keep both the original gdp_pc column and add a new column with the inflation adjusted values, we can do so just by using a new column name when we assign our values back into the dataframe:

[9]:
# re-create with original gdp_pc
countries <- data.frame(country, year, gdp_pc)
[10]:
# Add new column
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02
countries
A data.frame: 9 × 4
countryyeargdp_pcadjusted_gdp_pc
<chr><dbl><dbl><dbl>
USA 19941616616489.32
China19941380414080.08
Sudan1994 5507 5617.14
USA 19951496615265.32
China19951977820173.56
Sudan19951059210803.84
USA 1996 6852 6989.04
China1996 5114 5216.28
Sudan1996 7966 8125.32

Analyzing Columns

Finally, as long as we’re talking about columns, it’s worth emphasizing that once you pull a column out of your dataframe, you can analyze it like any other vector (since it is just a vector!). For example:

[11]:
mean(countries$gdp_pc)
11193.8888888889

But two summary functions are worth noting here: table(), to get the number of observations that have a given value in a vector, and the combination prop.table(table()), to get the share of observations with a given value in a vector:

[12]:
# Number of observations by country
table(countries$country)

China Sudan   USA
    3     3     3
[13]:
# Proportion of observations by country
prop.table(table(countries$country))

    China     Sudan       USA
0.3333333 0.3333333 0.3333333

Dropping Columns

Dropping columns can be done in a couple ways. The easiest is to just list the columns one wishes to keep:

[14]:
countries[, c("gdp_pc", "year")]
A data.frame: 9 × 2
gdp_pcyear
<dbl><dbl>
161661994
138041994
55071994
149661995
197781995
105921995
68521996
51141996
79661996

But in big dataframes, we sometimes have lots of columns, and don’t want to list all the columns except the one we want to drop. For that there are two solutions. The first is like this:

[15]:
# Drop columns gdp_pc and year
countries[, !(names(countries) %in% c("gdp_pc", "year"))]

A data.frame: 9 × 2
countryadjusted_gdp_pc
<chr><dbl>
USA 16489.32
China14080.08
Sudan 5617.14
USA 15265.32
China20173.56
Sudan10803.84
USA 6989.04
China 5216.28
Sudan 8125.32

This is a little weird looking, so it’s worth breaking down.

First, names(countries) returns all the column names of countries.

[16]:
names(countries)
  1. 'country'
  2. 'year'
  3. 'gdp_pc'
  4. 'adjusted_gdp_pc'

Then names(countries) %in% c("gdp_pc", "year") returns a logical vector the length of the column names of countries that’s TRUE if the name is in the list, and FALSE otherwise:

[17]:
names(countries) %in% c("gdp_pc", "year")
  1. FALSE
  2. TRUE
  3. TRUE
  4. FALSE

Then finally the ! before that expression is the logical NOT, meaning that it makes all TRUE values into FALSE and vice-versa. So in the end !(names(countries) %in% c("gdp_pc", "year")) returns a logical vector that is TRUE for all values not in the list, and FALSE for those in the list. That is then interpreted as a logical subsetting vector, and all columns not in the list are kept, and those not in the list are dropped.

I know, it’s kinda a lot… but it is a good example of how you can compose simple building blocks to do complicated things in R!

Finally, if you’re dropping a single columns, you can also assign the value of NULL to the column:

[18]:
countries$gdp_pc <- NULL
countries
A data.frame: 9 × 3
countryyearadjusted_gdp_pc
<chr><dbl><dbl>
USA 199416489.32
China199414080.08
Sudan1994 5617.14
USA 199515265.32
China199520173.56
Sudan199510803.84
USA 1996 6989.04
China1996 5216.28
Sudan1996 8125.32

Which… well, just works! :)

Row Operations

In most datasets you work with, each row will correspond to a single observation in your data. Given that, we often manipulate rows as a way of manipulating the sample in our analyses.

Subsetting

Subsetting with logicals is exactly the same with dataframes as it was with matrices, except that we can access column names with the $ notation:

[19]:
countries[countries$year == 1995 & countries$country == "USA", ]
A data.frame: 1 × 3
countryyearadjusted_gdp_pc
<chr><dbl><dbl>
4USA199515265.32

Sorting Dataframes

Often, we’ll want to sort the rows of our dataframe by the values in one of our columns. To do so, we use the order command:

[20]:
# Sort by GDP
countries[order(countries$adjusted_gdp_pc),]
A data.frame: 9 × 3
countryyearadjusted_gdp_pc
<chr><dbl><dbl>
8China1996 5216.28
3Sudan1994 5617.14
7USA 1996 6989.04
9Sudan1996 8125.32
6Sudan199510803.84
2China199414080.08
4USA 199515265.32
1USA 199416489.32
5China199520173.56

What’s happening? order() returns a vector with the indices of the rows of the dataset in sorted order:

[21]:
order(countries$adjusted_gdp_pc)
  1. 8
  2. 3
  3. 7
  4. 9
  5. 6
  6. 2
  7. 4
  8. 1
  9. 5

And then, because it’s a vector of indices being passed in the first position of our square brackets, we get all the rows of countries “subset” by index (though obviously it’s not really a subset, since all row indices appear in the vector – just a re-ordering)!

We can also sort by multiple columns:

[22]:
countries[order(countries$year, countries$country),]
A data.frame: 9 × 3
countryyearadjusted_gdp_pc
<chr><dbl><dbl>
2China199414080.08
3Sudan1994 5617.14
1USA 199416489.32
5China199520173.56
6Sudan199510803.84
4USA 199515265.32
8China1996 5216.28
9Sudan1996 8125.32
7USA 1996 6989.04

And we can use - to sort any variable in descending order rather than ascending order:

[23]:
countries[order(-countries$adjusted_gdp_pc), ]
A data.frame: 9 × 3
countryyearadjusted_gdp_pc
<chr><dbl><dbl>
5China199520173.56
1USA 199416489.32
4USA 199515265.32
2China199414080.08
6Sudan199510803.84
9Sudan1996 8125.32
7USA 1996 6989.04
3Sudan1994 5617.14
8China1996 5216.28

Avoiding Subsetting by Index

As you’ve seen, we almost always access columns by name rather than by using their index numbers. That’s because when working with real data, there’s always a possibility that the order of columns gets jumbled up – maybe you get an updated version of the data set you’re working with that has the columns in different orders, or maybe in a large research project one of your collaborators has modified the order of columns in some of the code that runs before your code.

In these situations, trying to extract a column using its index may give you the wrong answer, while pulling out a column by name will and sure you’re always getting the variable that you intended!

The same logic also applies to subsetting by rows. While subsetting by row index works, we generally avoid using indices for the same reason we avoid subsetting columns by index – if the order or our data changes (say, it gets sorted unexpectedly), we can’t predict how our index subsets will change! That’s why in nearly all the examples above we subset with a logical vector.

Obviously there are exceptions to this rule – order() and sample() are both implicitly subsetting by index. But those functions generate the indices they use from the values of row immediately before they use them, so there is no opportunity for the order of row to change between when those indices are generated and when they are used.

Recap

Phew. OK, I know this reading covered a lot, so here’s a quick recap and a summary table for reference.

  • Dataframes really are just like matrices. The main difference is that each column can be a different type, and dataframes always have column names.

  • We subset single dataframe columns using $, but that’s just a shorthand for the syntax we learned before (df[, "colname"]).

  • The columns of a dataframe are just vectors.

  • We usually subset dataframes with logicals (for rows) or by name (columns) for safety.

And now a reference table, written with a toy dataset called df with columns col1, col2, and col3 in mind:

Looking at your dataframe:

  • Number of rows: nrow(df)

  • Number of columns: ncol(df)

  • First six rows: head(df)

  • Last six rows: tail(df)

  • Quick summary of all data: summary(df)

Row Operations

  • Subset rows by logical: df[df$col1 < 42, ] or df[df[, col1] < 42, ]

  • Random sample of N rows: df[sample(nrow(df), N), ]

  • Sort rows (ascending, one column): df[order(df$col1), ]

  • Sort rows (descending, one column): df[order(-df$col1), ]

  • Sort rows (multiple columns): df[order(df$col1, df$col2), ]

Column Operations

  • Subset one column by name: df$col1 or df[, "col1"]

  • Subset multiple columns by name: df[ , c("col1", "col2")]

  • Drop one column: df$col1 <- NULL

  • Drop set of columns: df[ , !(names(df) %in% c("col1", "col2"))]

  • Editing a single column: df$col1 <- df$col1 * 42 or df[, "col1"] <- df[, "col1"] * 42

  • Create new column: df$newcol <- df$col1 * 42 or df[, "newcol"] <- df[, "col1"] * 42

Learn About a Column:

  • Tabulate number of observations of each value: table(df$col1)

  • Share of observations of each value: prop.table(table(df$col1))

  • Quick summary of one column: summary(df$col1)

Exercises for Class

And now it’s time to put these new skills into action with some exercises! As usual, if you are a synchronous student, please don’t start these before class!