Manipulating Dataframes¶

In our previous lessons, we learned how to create dataframes from vectors, and how to load them from files. In this lesson, we will learn how to work with our dataframe once we have it loaded up!

Being able to quickly modify datasets, often referred to as “data wrangling”, is critical to being a social scientist. Indeed, most social scientists and data scientists spend a huge proportion of their time cleaning and organizing their data (about 80 percent, according to surveys). So this is probably one of the most important readings of the course!

Be aware that there will be a lot of syntax in this reading. However, the goal of the reading is not to have you memorize all the syntax, but rather to understand the logic of how dataframes work. To help with that, I’ve provided a recap section at the end of the reading with examples of all the commands we cover in one place.

So as you read, try and focus on the logic of how dataframes work, not the exact syntax. Syntax is something you can always look up later so long as you understand enough about the logic of what’s going on to realize what you need to look up!

To begin, let’s start by recreating the dataframe we had in our last exercise as an example to work with:

[1]:

country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries

A data.frame: 9 × 3
country	year	gdp_pc
<chr>	<dbl>	<dbl>
USA	1994	19306
China	1994	16645
Sudan	1994	14156
USA	1995	14467
China	1995	6374
Sudan	1995	10276
USA	1996	9667
China	1996	16309
Sudan	1996	11971
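One thing to keep in mind: because runif() draws random numbers, the gdp_pc values change every time this cell is run, which is why the numbers printed in later outputs do not match the table above. A minimal sketch of how to make the draw reproducible, assuming an arbitrary seed value of 42 (my choice for illustration, not something the lesson requires):

```r
# Fixing the seed makes runif() return the same draws on every run.
# The seed value 42 is arbitrary and purely illustrative.
set.seed(42)
first_draw <- round(runif(9, 1000, 20000))

# Resetting the same seed reproduces exactly the same "random" numbers
set.seed(42)
second_draw <- round(runif(9, 1000, 20000))

identical(first_draw, second_draw)
```

If you want your own results to stay stable between runs, calling set.seed() once before creating gdp_pc is enough.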

Dataframes Are Like Matrices-Plus¶

Dataframes, like matrices, are just two-dimensional grids of data. And so everything we learned about matrices also applies to dataframes (hooray!).

For example, if we want to get the years of all the observations with a GDP per capita below 10,000 from our dataframe, we can subset our dataframe using square brackets and logical / name vectors, just like named matrices:

[2]:

countries[countries[, "gdp_pc"] < 10000, "year"]

1994
1996
1996
1996

So remember that: if you could do it with a matrix, you can do it with a dataframe!
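The same bracket pattern works with more than one condition. For instance, here is a small sketch that pulls out the GDP per capita of China in 1994. The gdp_pc values are hard-coded from one of the tables printed in this reading so the example is reproducible; your random draw will differ.

```r
# Rebuild the example dataframe with fixed (non-random) gdp_pc values
country <- rep(c("USA", "China", "Sudan"), 3)
year <- rep(c(1994, 1995, 1996), each = 3)
gdp_pc <- c(16166, 13804, 5507, 14966, 19778, 10592, 6852, 5114, 7966)
countries <- data.frame(country, year, gdp_pc)

# Rows: country is China AND year is 1994; column: gdp_pc
china_1994 <- countries[countries[, "country"] == "China" &
                          countries[, "year"] == 1994, "gdp_pc"]
china_1994
```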

As we’ll see, though, dataframes do have a few augmentations designed to make the lives of R users a little easier, and so can kinda be thought of as being like matrices+.

Column Operations¶

Unlike matrices, which could have column names, dataframe columns always have names. As a result, we will basically always address columns using their names, for reasons we will discuss in more detail below.

It is also convention that columns in most datasets correspond to variables, so you’ll often want to do things like take the average of a single column (i.e. the average of a single variable), or edit the values in a single column.

In fact, accessing a single column is so common with dataframes that there are two exactly-identical ways to get a single column, and one nearly equivalent way. The equivalent ways are:

[3]:

# What we're used to from matrices
countries[, "gdp_pc"]

16166
13804
5507
14966
19778
10592
6852
5114
7966

[4]:

# The shortcut for a single dataframe column
countries$gdp_pc

16166
13804
5507
14966
19778
10592
6852
5114
7966

These are exactly equivalent for single columns! But note that you can’t always use this trick – for example, it doesn’t work for trying to get several columns from a dataframe. Most of the time, though, it’s a very convenient shorthand for single-column manipulations.

And the nearly equivalent way is the following, which is slightly different from those above in that instead of returning a vector, it returns a dataframe with one column:

[5]:

# And if you don't have a comma, R assumes you're accessing columns
countries["gdp_pc"]

A data.frame: 9 × 1
gdp_pc
<dbl>
16166
13804
5507
14966
19778
10592
6852
5114
7966
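To make the vector-versus-dataframe distinction concrete, here is a short sketch on a three-row toy version of countries (values taken from the tables above). It also shows the bracket form needed for several columns, and the drop = FALSE argument, which asks the comma form to keep its result as a dataframe.

```r
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  gdp_pc = c(16166, 13804, 5507)
)

as_vector <- countries$gdp_pc        # plain numeric vector
as_dataframe <- countries["gdp_pc"]  # dataframe with one column

is.vector(as_vector)
is.data.frame(as_dataframe)

# Both forms hold exactly the same numbers
identical(as_vector, as_dataframe$gdp_pc)

# `$` takes a single name, so several columns need square brackets
two_cols <- countries[, c("country", "gdp_pc")]

# drop = FALSE stops the comma form from simplifying to a vector
still_dataframe <- countries[, "gdp_pc", drop = FALSE]
is.data.frame(still_dataframe)
```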

Modifying Columns¶

As with matrices, we can use subsetting to make modifications to columns. For example, suppose, as with our matrix version, we wanted to multiply GDP per capita by 1.02 to adjust for inflation. We could either do:

[6]:

# re-create with original gdp_pc
countries <- data.frame(country, year, gdp_pc)
countries

A data.frame: 9 × 3
country	year	gdp_pc
<chr>	<dbl>	<dbl>
USA	1994	16166
China	1994	13804
Sudan	1994	5507
USA	1995	14966
China	1995	19778
Sudan	1995	10592
USA	1996	6852
China	1996	5114
Sudan	1996	7966

[7]:

countries[, "gdp_pc"] <- countries[, "gdp_pc"] * 1.02

Or

[8]:

countries$gdp_pc <- countries$gdp_pc * 1.02
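These two forms are alternatives, not steps: running both one after the other would apply the 1.02 adjustment twice. A quick sketch, using two separate toy copies (names here are just for illustration), that checks both forms give the same result:

```r
gdp_pc <- c(16166, 13804, 5507)

# Two independent copies so each form is applied exactly once
bracket_version <- data.frame(gdp_pc)
dollar_version <- data.frame(gdp_pc)

bracket_version[, "gdp_pc"] <- bracket_version[, "gdp_pc"] * 1.02
dollar_version$gdp_pc <- dollar_version$gdp_pc * 1.02

# Same numbers either way
identical(bracket_version$gdp_pc, dollar_version$gdp_pc)
```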

Creating New Columns¶

If we wanted to keep both the original gdp_pc column and add a new column with the inflation adjusted values, we can do so just by using a new column name when we assign our values back into the dataframe:

[9]:

# re-create with original gdp_pc
countries <- data.frame(country, year, gdp_pc)

[10]:

# Add new column
countries$adjusted_gdp_pc <- countries$gdp_pc * 1.02
countries

A data.frame: 9 × 4
country	year	gdp_pc	adjusted_gdp_pc
<chr>	<dbl>	<dbl>	<dbl>
USA	1994	16166	16489.32
China	1994	13804	14080.08
Sudan	1994	5507	5617.14
USA	1995	14966	15265.32
China	1995	19778	20173.56
Sudan	1995	10592	10803.84
USA	1996	6852	6989.04
China	1996	5114	5216.28
Sudan	1996	7966	8125.32

Analyzing Columns¶

Finally, as long as we’re talking about columns, it’s worth emphasizing that once you pull a column out of your dataframe, you can analyze it like any other vector (since it is just a vector!). For example:

[11]:

mean(countries$gdp_pc)

11193.8888888889

But two summary functions are worth noting here: table(), to get the number of observations that have a given value in a vector, and the combination prop.table(table()), to get the share of observations with a given value in a vector:

[12]:

# Number of observations by country
table(countries$country)


China Sudan   USA
    3     3     3

[13]:

# Proportion of observations by country
prop.table(table(countries$country))


    China     Sudan       USA
0.3333333 0.3333333 0.3333333
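Since table() works on any vector, it also combines nicely with logical tests. As a sketch (again with gdp_pc values hard-coded from the tables above), here is how one might count, and get the share of, observations with a GDP per capita below 10,000:

```r
gdp_pc <- c(16166, 13804, 5507, 14966, 19778, 10592, 6852, 5114, 7966)

# A logical vector: TRUE where gdp_pc is below 10,000
below_10k <- gdp_pc < 10000

counts <- table(below_10k)     # number of FALSE and TRUE values
shares <- prop.table(counts)   # same thing as shares of all observations

counts
shares
```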

Dropping Columns¶

Dropping columns can be done in a couple ways. The easiest is to just list the columns one wishes to keep:

[14]:

countries[, c("gdp_pc", "year")]

A data.frame: 9 × 2
gdp_pc	year
<dbl>	<dbl>
16166	1994
13804	1994
5507	1994
14966	1995
19778	1995
10592	1995
6852	1996
5114	1996
7966	1996

But in big dataframes, we sometimes have lots of columns, and don’t want to list all the columns except the one we want to drop. For that there are two solutions. The first is like this:

[15]:

# Drop columns gdp_pc and year
countries[, !(names(countries) %in% c("gdp_pc", "year"))]

A data.frame: 9 × 2
country	adjusted_gdp_pc
<chr>	<dbl>
USA	16489.32
China	14080.08
Sudan	5617.14
USA	15265.32
China	20173.56
Sudan	10803.84
USA	6989.04
China	5216.28
Sudan	8125.32

This is a little weird looking, so it’s worth breaking down.

First, names(countries) returns all the column names of countries.

[16]:

names(countries)

'country'
'year'
'gdp_pc'
'adjusted_gdp_pc'

Then names(countries) %in% c("gdp_pc", "year") returns a logical vector with one entry per column name of countries, which is TRUE if that name is in the vector we supplied, and FALSE otherwise:

[17]:

names(countries) %in% c("gdp_pc", "year")

FALSE
TRUE
TRUE
FALSE

Then finally the ! before that expression is the logical NOT, meaning that it makes all TRUE values into FALSE and vice-versa. So in the end !(names(countries) %in% c("gdp_pc", "year")) returns a logical vector that is TRUE for all column names not in the list, and FALSE for those in the list. That is then interpreted as a logical subsetting vector, so all columns not in the list are kept, and those in the list are dropped.

I know, it’s kinda a lot… but it is a good example of how you can compose simple building blocks to do complicated things in R!
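One way to make this pattern easier to read is to give the intermediate pieces names. The sketch below does that on a three-row toy version of countries (to_drop and keep are illustrative names I’ve made up), and also shows a pitfall: if only one column survives, the comma form simplifies to a vector unless you add drop = FALSE.

```r
countries <- data.frame(
  country = c("USA", "China", "Sudan"),
  year = c(1994, 1994, 1994),
  gdp_pc = c(16166, 13804, 5507),
  adjusted_gdp_pc = c(16489.32, 14080.08, 5617.14)
)

to_drop <- c("gdp_pc", "year")
keep <- !(names(countries) %in% to_drop)  # TRUE for columns we keep
trimmed <- countries[, keep]
names(trimmed)

# Pitfall: keeping a single column returns a vector by default.
# (This `drop` argument is about simplifying, not about dropping columns.)
one_left <- countries[, names(countries) == "country", drop = FALSE]
is.data.frame(one_left)
```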

Finally, if you’re dropping a single column, you can also assign the value of NULL to the column:

[18]:

countries$gdp_pc <- NULL
countries

A data.frame: 9 × 3
country	year	adjusted_gdp_pc
<chr>	<dbl>	<dbl>
USA	1994	16489.32
China	1994	14080.08
Sudan	1994	5617.14
USA	1995	15265.32
China	1995	20173.56
Sudan	1995	10803.84
USA	1996	6989.04
China	1996	5216.28
Sudan	1996	8125.32

Which… well, just works! :)

Row Operations¶

In most datasets you work with, each row will correspond to a single observation in your data. Given that, we often manipulate rows as a way of manipulating the sample in our analyses.

Subsetting¶

Subsetting with logicals is exactly the same with dataframes as it was with matrices, except that we can also access columns by name with the $ notation:

[19]:

countries[countries$year == 1995 & countries$country == "USA", ]

A data.frame: 1 × 3
	country	year	adjusted_gdp_pc
	<chr>	<dbl>	<dbl>
4	USA	1995	15265.32
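Logical tests can be combined in other ways too. As a sketch (with gdp_pc values hard-coded from the tables above), here are two more row subsets: one using %in% to match several countries at once, and one using | (logical OR):

```r
country <- rep(c("USA", "China", "Sudan"), 3)
year <- rep(c(1994, 1995, 1996), each = 3)
gdp_pc <- c(16166, 13804, 5507, 14966, 19778, 10592, 6852, 5114, 7966)
countries <- data.frame(country, year, gdp_pc)

# 1996 observations for either the USA or China
usa_or_china_1996 <- countries[countries$year == 1996 &
                                 countries$country %in% c("USA", "China"), ]

# Observations from 1994 OR with GDP per capita above 15,000
early_or_rich <- countries[countries$year == 1994 |
                             countries$gdp_pc > 15000, ]

usa_or_china_1996
early_or_rich
```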

Sorting Dataframes¶

Often, we’ll want to sort the rows of our dataframe by the values in one of our columns. To do so, we use the order command:

[20]:

# Sort by GDP
countries[order(countries$adjusted_gdp_pc), ]

A data.frame: 9 × 3
	country	year	adjusted_gdp_pc
	<chr>	<dbl>	<dbl>
8	China	1996	5216.28
3	Sudan	1994	5617.14
7	USA	1996	6989.04
9	Sudan	1996	8125.32
6	Sudan	1995	10803.84
2	China	1994	14080.08
4	USA	1995	15265.32
1	USA	1994	16489.32
5	China	1995	20173.56

What’s happening? order() returns a vector with the indices of the rows of the dataset in sorted order:

[21]:

order(countries$adjusted_gdp_pc)

8
3
7
9
6
2
4
1
5

And then, because it’s a vector of indices being passed in the first position of our square brackets, we get all the rows of countries “subset” by index (though obviously it’s not really a subset, since all row indices appear in the vector – just a re-ordering)!

We can also sort by multiple columns:

[22]:

countries[order(countries$year, countries$country), ]

A data.frame: 9 × 3
	country	year	adjusted_gdp_pc
	<chr>	<dbl>	<dbl>
2	China	1994	14080.08
3	Sudan	1994	5617.14
1	USA	1994	16489.32
5	China	1995	20173.56
6	Sudan	1995	10803.84
4	USA	1995	15265.32
8	China	1996	5216.28
9	Sudan	1996	8125.32
7	USA	1996	6989.04

And we can use - to sort any variable in descending order rather than ascending order:

[23]:

countries[order(-countries$adjusted_gdp_pc), ]

A data.frame: 9 × 3
	country	year	adjusted_gdp_pc
	<chr>	<dbl>	<dbl>
5	China	1995	20173.56
1	USA	1994	16489.32
4	USA	1995	15265.32
2	China	1994	14080.08
6	Sudan	1995	10803.84
9	Sudan	1996	8125.32
7	USA	1996	6989.04
3	Sudan	1994	5617.14
8	China	1996	5216.28
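One caveat: the - trick only works for numeric columns, since you can’t negate text. For a character column like country, order() has a decreasing argument instead. A short sketch (gdp_pc values hard-coded from the tables above) showing that, plus a mixed sort with year ascending and gdp_pc descending:

```r
country <- rep(c("USA", "China", "Sudan"), 3)
year <- rep(c(1994, 1995, 1996), each = 3)
gdp_pc <- c(16166, 13804, 5507, 14966, 19778, 10592, 6852, 5114, 7966)
countries <- data.frame(country, year, gdp_pc)

# Character column in descending (reverse alphabetical) order
by_country_desc <- countries[order(countries$country, decreasing = TRUE), ]

# Year ascending, then GDP per capita descending within each year
mixed_order <- order(countries$year, -countries$gdp_pc)
by_year_then_gdp <- countries[mixed_order, ]

by_country_desc
by_year_then_gdp
```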

Avoiding Subsetting by Index¶

As you’ve seen, we almost always access columns by name rather than by using their index numbers. That’s because when working with real data, there’s always a possibility that the order of columns gets jumbled up – maybe you get an updated version of the data set you’re working with that has the columns in different orders, or maybe in a large research project one of your collaborators has modified the order of columns in some of the code that runs before your code.

In these situations, trying to extract a column using its index may give you the wrong answer, while pulling out a column by name will ensure you’re always getting the variable that you intended!

The same logic also applies to subsetting by rows. While subsetting by row index works, we generally avoid using indices for the same reason we avoid subsetting columns by index: if the order of our data changes (say, it gets sorted unexpectedly), we can’t predict how our index subsets will change! That’s why in nearly all the examples above we subset with a logical vector.

Obviously there are exceptions to this rule: order() and sample() are both implicitly subsetting by index. But those functions generate their indices from the dataframe as it currently stands, immediately before the indices are used, so there is no opportunity for the order of the rows to change between when those indices are generated and when they are used.
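To see how this can go wrong, here is a small sketch of both failure modes: a hypothetical collaborator reorders the columns, and the rows get sorted. The gdp_pc values are hard-coded from the tables above.

```r
country <- rep(c("USA", "China", "Sudan"), 3)
year <- rep(c(1994, 1995, 1996), each = 3)
gdp_pc <- c(16166, 13804, 5507, 14966, 19778, 10592, 6852, 5114, 7966)
countries <- data.frame(country, year, gdp_pc)

# Columns: someone upstream changes the column order
reordered <- countries[, c("gdp_pc", "country", "year")]
by_index_before <- countries[, 3]   # gdp_pc
by_index_after <- reordered[, 3]    # now year!
by_name_same <- identical(countries$gdp_pc, reordered$gdp_pc)

# Rows: the data gets sorted by GDP per capita
sorted <- countries[order(countries$gdp_pc), ]
first_row_before <- countries[1, "country"]   # "USA"
first_row_after <- sorted[1, "country"]       # a different observation

# A logical subset finds the same observation either way
usa_1994_before <- countries[countries$country == "USA" &
                               countries$year == 1994, "gdp_pc"]
usa_1994_after <- sorted[sorted$country == "USA" &
                           sorted$year == 1994, "gdp_pc"]
```

The index-based lines run without any error or warning in both cases, which is exactly what makes this kind of bug hard to spot.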

Recap¶

Phew. OK, I know this reading covered a lot, so here’s a quick recap and a summary table for reference.

Dataframes really are just like matrices. The main difference is that each column can be a different type, and dataframes always have column names.
We subset single dataframe columns using $, but that’s just a shorthand for the syntax we learned before (df[, "colname"]).
The columns of a dataframe are just vectors.
We usually subset dataframes with logicals (for rows) or by name (columns) for safety.

And now a reference table, written with a toy dataset called df with columns col1, col2, and col3 in mind:

Looking at your dataframe:

Number of rows: nrow(df)
Number of columns: ncol(df)
First six rows: head(df)
Last six rows: tail(df)
Quick summary of all data: summary(df)
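As a quick sketch of these in action, here is a made-up nine-row df (the values are arbitrary, just something to look at). head() and tail() also accept a second argument for the number of rows to show.

```r
# A toy dataset; the contents are arbitrary placeholders
df <- data.frame(
  col1 = 1:9,
  col2 = letters[1:9],
  col3 = seq(10, 90, by = 10)
)

nrow(df)       # number of rows
ncol(df)       # number of columns
head(df)       # first six rows
tail(df, 3)    # last three rows instead of the default six
summary(df)    # per-column summaries
```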

Row Operations

Subset rows by logical: df[df$col1 < 42, ] or df[df[, "col1"] < 42, ]
Random sample of N rows: df[sample(nrow(df), N), ]
Sort rows (ascending, one column): df[order(df$col1), ]
Sort rows (descending, one column): df[order(-df$col1), ]
Sort rows (multiple columns): df[order(df$col1, df$col2), ]
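The random sample line is the only one here that wasn’t demonstrated earlier, so here is a minimal sketch of it on a made-up df. The seed value 123 and N of 4 are arbitrary choices for illustration; set.seed() just makes the sample repeatable.

```r
# A toy dataset; the contents are arbitrary placeholders
df <- data.frame(col1 = 1:9, col2 = letters[1:9])

N <- 4
set.seed(123)  # arbitrary seed so the same rows are drawn each run

# sample(nrow(df), N) picks N distinct row indices out of 1..nrow(df)
sampled <- df[sample(nrow(df), N), ]
sampled
```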

Column Operations

Subset one column by name: df$col1 or df[, "col1"]
Subset multiple columns by name: df[ , c("col1", "col2")]
Drop one column: df$col1 <- NULL
Drop set of columns: df[ , !(names(df) %in% c("col1", "col2"))]
Editing a single column: df$col1 <- df$col1 * 42 or df[, "col1"] <- df[, "col1"] * 42
Create new column: df$newcol <- df$col1 * 42 or df[, "newcol"] <- df[, "col1"] * 42

Learn About a Column:

Tabulate number of observations of each value: table(df$col1)
Share of observations of each value: prop.table(table(df$col1))
Quick summary of one column: summary(df$col1)

Exercises for Class¶

And now it’s time to put these new skills into action with some exercises! As usual, if you are a synchronous student, please don’t start these before class!