dplyr and the Tidyverse

In our previous readings, we learned how to accomplish tasks like subsetting and modifying variables using what’s called “array indexing” (using those [] square brackets).

There is, however, another approach to manipulating dataframes in R that is very popular: using a set of functions provided by the dplyr package.

dplyr is one of a collection of libraries known collectively as the tidyverse, which is designed to provide a way of working with data in R that’s a little different from the array-indexing approach we’ve been focused on up to this point.

To be clear, dplyr doesn’t allow you to do anything you couldn’t do with array indexing; it just provides different ways to write your commands. But the way it allows you to write commands is something that many people find quite compelling.

The Philosophy of dplyr

Before we get into dplyr, however, a quick word on the philosophy of dplyr and the tidyverse more generally. The tidyverse is very popular, but it also has some detractors. The basic concern people have with dplyr is that it provides a big library of specialized commands for doing specific dataframe manipulations.

As a result, learning the tidyverse amounts to learning lots of specific functions rather than generalized concepts. In most cases, tidyverse packages don’t embrace generalized abstractions, like array indexing.

As we’ve seen in our past readings, in regular R the logic that dictates how vectors work informs how matrices work, which in turn informs how dataframes work. And if you move into three- or four-dimensional arrays for modeling time series or real-world volumes at some point, what you know about vectors and matrices will also apply there. Indeed, the concept of an array and the idea of array indexing is such a fundamental abstraction in data science that you’ll also find it in languages like Python, Matlab, and Julia that you may someday end up using.

As such, over-reliance on the tidyverse may limit students’ opportunity to learn to combine basic building blocks to accomplish sophisticated tasks. If you only want to do things for which the tidyverse provides an explicit function, that’s not a problem, but it limits one’s understanding of how to get R to do things that aren’t covered by a specific function, which is often what is required when doing social science research.

(If you want to read a more eloquent version of this critique, you can find one here.)

None of that is to suggest you should avoid dplyr or the rest of the tidyverse entirely. To the contrary, I think the tidyverse plotting library (ggplot) is the best plotting library around, and I’m a fan of several dplyr functions (especially rename, which makes an otherwise tedious task quite simple).

But as you use it, be mindful of the different philosophy of programming it embodies, and how using it shapes the way you think about using R.

Mapping Array Indexing onto dplyr

The easiest way to introduce dplyr, I think, is just to show how the things we did in our last reading are done in dplyr:

Row Operations

  • Subset rows by logical:

    • Base R: df[df$col1 < 42, ] or df[df[, "col1"] < 42, ]

    • dplyr: filter(df, col1 < 42)

  • Random sample of N rows:

    • Base R: df[sample(nrow(df), N), ]

    • dplyr: slice_sample(df, n = N)

  • Sort rows (ascending, one column):

    • Base R: df[order(df$col1), ]

    • dplyr: arrange(df, col1)

  • Sort rows (descending, one column):

    • Base R: df[order(-df$col1), ]

    • dplyr: arrange(df, desc(col1))

  • Sort rows (multiple columns):

    • Base R: df[order(df$col1, df$col2), ]

    • dplyr: arrange(df, col1, col2)
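To check that the paired commands above really do the same thing, here is a minimal sketch on a made-up data frame. The data frame df and its values are invented purely for illustration, and it assumes dplyr is already installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
df <- data.frame(col1 = c(50, 10, 42, 7),
                 col2 = c("d", "b", "c", "a"))

# Subset rows by a logical condition
base_subset  <- df[df$col1 < 42, ]
dplyr_subset <- filter(df, col1 < 42)

# Sort rows in descending order of col1
base_sorted  <- df[order(-df$col1), ]
dplyr_sorted <- arrange(df, desc(col1))

# Row names can differ between the two approaches, so compare
# the column values rather than the whole data frames
all(base_subset$col1 == dplyr_subset$col1)
all(base_sorted$col1 == dplyr_sorted$col1)

# Random sample of 2 rows (the n argument has to be named)
sampled <- slice_sample(df, n = 2)
nrow(sampled)
```

Note that neither approach changes df itself; each command returns a new data frame that you have to assign to a name if you want to keep it.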

Column Operations

  • Subset one column by name:

    • Base R: df$col1 or df[, "col1"]

    • dplyr: select(df, col1)

  • Subset multiple columns by name:

    • Base R: df[ , c("col1", "col2")]

    • dplyr: select(df, col1, col2)

  • Drop one column:

    • Base R: df$col1 <- NULL

    • dplyr: select(df, -col1)

  • Drop set of columns:

    • Base R: df[ , !(names(df) %in% c("col1", "col2"))]

    • dplyr: select(df, -col1, -col2)

  • Editing a single column:

    • Base R: df$col1 <- df$col1 * 42 or df[, "col1"] <- df[, "col1"] * 42

    • dplyr: mutate(df, col1 = col1 * 42)

  • Create new column:

    • Base R: df$newcol <- df$col1 * 42 or df[, "newcol"] <- df[, "col1"] * 42

    • dplyr: mutate(df, newcol = col1 * 42)
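The column operations can be compared the same way. This is only a sketch: the data frame df is again invented for illustration, and the drop = FALSE argument in the base R version is my own addition so that the result stays a data frame even when only one column is left.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
df <- data.frame(col1 = c(1, 2, 3),
                 col2 = c("a", "b", "c"),
                 col3 = c(TRUE, FALSE, TRUE))

# Subset multiple columns by name
base_cols  <- df[, c("col1", "col2")]
dplyr_cols <- select(df, col1, col2)

# Drop a set of columns (drop = FALSE keeps the result a data frame)
base_drop  <- df[, !(names(df) %in% c("col1", "col2")), drop = FALSE]
dplyr_drop <- select(df, -col1, -col2)

# Create a new column; the base R version works on a copy so that
# df itself is left untouched for the comparison
base_new <- df
base_new$newcol <- base_new$col1 * 42
dplyr_new <- mutate(df, newcol = col1 * 42)

# Both approaches end up with the same columns and values
identical(names(base_cols), names(dplyr_cols))
identical(names(base_drop), names(dplyr_drop))
identical(base_new$newcol, dplyr_new$newcol)
```

One difference worth noticing: df$newcol <- ... modifies df in place, while mutate() returns a modified copy and leaves df alone unless you assign the result back to df.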

Installing dplyr

If you want to play with dplyr, you need to:

  • Install dplyr with the command install.packages("dplyr"). You only have to do this once on a given computer.

  • Load it into your R session with library(dplyr). This you have to run every time you open R and want to use dplyr.
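If you would rather not re-run the install command by accident, one common pattern (a sketch, not the only way to do it) is to install only when the package is missing and then load it:

```r
# Install dplyr only if it is not already available on this computer
if (!requireNamespace("dplyr", quietly = TRUE)) {
    install.packages("dplyr")
}

# Load dplyr for the current R session
library(dplyr)
```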

Chaining

The last feature of dplyr to be aware of is chaining. Chaining is a way of combining commands to make code more concise. Basically, you use the %>% operator (often called the “pipe”) to tell R to take the result of one function and make it the first argument in the next.

In dplyr, if we wanted to change country into all lower case text, we could do:

mutate(countries, country = tolower(country))

But alternatively, we could use chaining to do:

countries %>% mutate(country = tolower(country))

Where countries is understood to be the first argument for mutate.
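To convince yourself that the two forms are equivalent, here is a small sketch. The data frame countries_toy is a made-up stand-in, invented just for this check.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
countries_toy <- data.frame(country = c("USA", "China"),
                            gdp_pc = c(12000, 9000))

# The same command written as a regular call and as a chain
nested  <- mutate(countries_toy, country = tolower(country))
chained <- countries_toy %>% mutate(country = tolower(country))

# Both produce exactly the same data frame
identical(nested, chained)

# The pipe is not limited to dplyr functions: this is sum(sqrt(c(1, 4, 9)))
total <- c(1, 4, 9) %>% sqrt() %>% sum()
total
```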

Obviously this doesn’t save us much with only one command, but it pays off with a long series of commands. To illustrate, let’s return to our old countries example dataset:

Suppose we wanted to use countries to create a new data frame called countries_new, which should have observations from years 1995 and 1996 (dropping 1994), should be sorted by country name (in lower case), and should have a new variable equal to GDP per capita in 1000s.

Here’s how we could do this with dplyr commands, but without chaining. The first cell recreates the countries data, and the second does the manipulation:

[1]:
library(dplyr)
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries

Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


A data.frame: 9 x 3
country  year   gdp_pc
<chr>    <dbl>  <dbl>
USA      1994   11593
China    1994   13098
Sudan    1994    2524
USA      1995   12150
China    1995   14010
Sudan    1995    2759
USA      1996    7592
China    1996    4026
Sudan    1996    5230
[2]:
countries_new <- filter(countries, year != 1994) #drop year 1994
countries_new <- arrange(countries_new, country) #sort by country names
countries_new <- mutate(countries_new, country = tolower(country), #convert name to lower-case
                        gdppc_1k = gdp_pc / 1000) #create GDP pc in 1000s
countries_new
A data.frame: 6 x 4
country  year   gdp_pc  gdppc_1k
<chr>    <dbl>  <dbl>   <dbl>
china    1995   14010   14.010
china    1996    4026    4.026
sudan    1995    2759    2.759
sudan    1996    5230    5.230
usa      1995   12150   12.150
usa      1996    7592    7.592

And now here’s the same thing using chaining:

[3]:
countries_new <- countries %>%
    filter(year != 1994) %>%
    arrange(country) %>%
    mutate(country = tolower(country), gdppc_1k = gdp_pc / 1000)
countries_new
A data.frame: 6 x 4
country  year   gdp_pc  gdppc_1k
<chr>    <dbl>  <dbl>   <dbl>
china    1995   14010   14.010
china    1996    4026    4.026
sudan    1995    2759    2.759
sudan    1996    5230    5.230
usa      1995   12150   12.150
usa      1996    7592    7.592

Chaining always begins with specifying the data frame we want to operate on (e.g., countries). Every subsequent statement will then operate on this data frame, starting with the function that comes right after the data frame and working its way down. In our case, the first thing we’ll do to countries is to subset it. We’ll then sort it by country name. Lastly, we’ll overwrite the country name to be lower-case and create a new variable representing GDP per capita in 1000s.

Is chaining better? Some people find chaining makes code more readable. It certainly makes it more concise.

Personally, my preference is actually to break down long chains of manipulations into a series of distinct commands. Why? Because it allows me to look at each intermediate step and make sure I didn’t mess something up. And as we’ll discuss in a later reading, I think you should always assume you’ve messed something up, because humans are bad at programming! And if you chain a bunch of manipulations, you never see the intermediate outputs unless you break the chain apart, which makes it harder to check for errors.

But again, chaining is definitely the more popular approach to R these days, so it’s important to introduce!

WARNING: Non-standard Evaluation in the Tidyverse

One feature of the tidyverse syntax that may not immediately jump out at you but which is important to bear in mind is that when you type column names in tidyverse functions, you don’t have to put them in double-quotes:

# To select the column named country:
select(countries, country)

Whereas that is something you have to do with array indexing:

countries[, "country"]

That’s because dplyr and other tidyverse functions make use of something called “non-standard evaluation.”

Normally in a programming language like R, text sitting on its own is interpreted as a variable, and the first thing the language does is replace the variable with the data assigned to that variable. For example, in the second line of the code below, R sees a, interprets it as a variable, and so replaces it with 7 when evaluating the expression.

[4]:
a <- 7
6 * a
42

When you have text you want the computer to think of as data and not as a variable, you normally have to put it in quotes. e.g.:

[5]:
a <- 7

# a without quotes interpreted as a variable:
print(a)
[1] 7
[6]:
# a with double quotes interpreted
# as data -- namely, the character a
print("a")
[1] "a"

But tidyverse functions interpret text on its own as data (specifically, as the name of a column in the data frame you handed them), not as variables.

We’ll talk later about some of the reasons that this can cause problems, but for the moment I just want to emphasize this because “When do I use double quotes?” is a common question from students, and the fact that the rules change between array indexing and tidyverse functions is a common source of confusion.

So:

  • Normally, text without quotes is interpreted as the name of a variable, but

  • In tidyverse functions, text without quotes is interpreted as data.
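Here is a minimal sketch of that difference in action. The data frame df and the variable wanted are invented for illustration. The last step uses all_of(), a helper dplyr provides for the case where the column name is stored in a variable.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
df <- data.frame(col1 = c(1, 2, 3),
                 col2 = c("a", "b", "c"))

# dplyr: the bare name col1 is understood as a column of df,
# even though no variable called col1 has been created
from_dplyr <- select(df, col1)
names(from_dplyr)

# Array indexing: the bare name is looked up as a variable, which fails
# here; tryCatch() captures the error message instead of stopping the script
result <- tryCatch(df[, col1], error = function(e) conditionMessage(e))
result

# With quotes, array indexing works as expected
df[, "col1"]

# When the column name is stored in a variable, all_of() tells
# select() to use the variable's contents as the column name
wanted <- "col2"
from_variable <- select(df, all_of(wanted))
names(from_variable)
```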

Summing Up

In conclusion, dplyr allows you to write more concise commands with more familiar terminology – select and filter rather than array notation. Chaining, similarly, can definitely make code more concise. As a result, many people are drawn to dplyr, and you may be too!

So should you use it? Well, first of all, this isn’t a yes or no question – you may decide there are a couple dplyr functions you really love, but that you don’t like all of them. And more generally, at this point you know a lot about these different approaches to dataframe manipulation, and how the tidyverse modifies how R works. That means you can make your own educated decision based on your own preferences.

As for me, I use elements of the tidyverse – especially ggplot for plotting, which we’ll cover in a later lesson – with some frequency. But I’m comfortable doing so because I know that anytime I need to do something that doesn’t feel natural within the tidyverse framework, or where I run into problems (we’ll discuss how easily that happens when you’re writing loops in a later lesson), I’m not reliant on the tidyverse and can turn to other tools when necessary.

Want to Learn More?

If there’s anything the tidyverse is good at, it’s documentation! Here are the docs for dplyr.