Intro to Merging

This tutorial explains how to combine different datasets through merging. Merging is basically a way of combining the columns from one dataset with those of another.

To illustrate the kind of situation where you may end up wanting to merge data, suppose that you are interested in how economic development affects democracy across countries, but all you have are two separate datasets, each with data on one of these two outcomes – say, economic data from the World Bank and political data from the Polity Project. In order to study how these things relate to one another, we need to combine these two data sets.

(If you don’t care about democracy, just imagine we’re talking about something you care about, like urban pollution levels instead of democracy scores! The logic is the same.)

In a previous reading, I argued that subsiding data is probably the most important skill you will learn as a data scientist. Being able to subset data carefully makes it possible to study how all sorts of outcomes vary across different groups.

Merging, I would argue, is the second most important skill for generating insights in data science. It is hard to overstate the value that comes from bringing together two disparate sources of information and studying how they relate to one another. This may surprise you coming from someone who teaches data science – most people would assume a data scientist believes value comes from fancy models – but in my experience, most insights come from documenting simple empirical regularities about the world that nobody has noticed before. We often use more sophisticated statistical modeling in order to evaluate the strength of relationships, and to help us understand what is driving those empirical regularities, but all of that work is secondary to just being able to measure basic features of the world. And merging is key to that process.

A Merging Example

To illustrate merging, let’s suppose that we want to do the analysis of the relationship between political outcomes and economic development we described above. Below we create to toy data sets, one with democracy scores (analogous to those you might get from a group like the Polity Project), and one with economic information (like the type of information you might get from the World Developing Indicators):

[1]:
# Generate toy dataset 1:
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo
A data.frame: 5 × 2
countrydemocracy_score
<chr><dbl>
USA 19
Albania16
Turkey 16
China 5
Sudan 10
[2]:
# Generate toy dataset 2:
econ <- data.frame(
    country = c("China", "Albania", "Turkey", "USA", "Sudan"),
    gdp_pc = c(12000, 10000, 9000, 20000, 500)
    )
econ
A data.frame: 5 × 2
countrygdp_pc
<chr><dbl>
China 12000
Albania10000
Turkey 9000
USA 20000
Sudan 500

To merge two data sets, they must have an identifier in common – one or more columns whose values allow us to understand which rows are identifying the same entities in the two datasets. In the case of these two datasets, the common identifier is country name, stored in the column "country" in both datasets.

To do the actual merge, we will use the join() command from a package called statar. This isn’t the most popular package for merging and R, but for reasons we’ll discuss enter next reading, I definitely think it’s the best.

join() requires at least four arguments: the two datasets you want to merge, and the name of the column(s) containing the common identifier, and the type of merge (which, for now, we’ll just specify as "inner", but we’ll talk more about what that means below). In this case, our call to join would look like this:

[3]:
library(statar)
df <- join(demo, econ, on = "country", kind = "inner")
df
Joining, by = "country"

A data.frame: 5 × 3
countrydemocracy_scoregdp_pc
<chr><dbl><dbl>
USA 1920000
Albania1610000
Turkey 16 9000
China 512000
Sudan 10 500

This is called a 1-to-1 merge because each row in the first dataset matches up to exactly one row in the second dataset – i.e. each country appears only once in the first dataset, and matches up to a single row in the second dataset (the row with the same country identifier).

Your common identifier may be spread across multiple columns. Consider the following example:

[4]:
# Generate toy dataset 1:
country <- rep(c("USA", "China", "Sudan", "India"), 2)
year <- c(1994, 1994, 1994, 1994, 1995, 1995, 1995, 1995)
democracy <- round(runif(8, 0, 20), 0)

demo <- data.frame(country, year, democracy)
demo
A data.frame: 8 × 3
countryyeardemocracy
<chr><dbl><dbl>
USA 1994 8
China1994 7
Sudan199419
India199415
USA 199516
China1995 8
Sudan1995 6
India199515
[5]:
# Generate toy dataset 2:
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ
A data.frame: 9 × 3
yearcountrygdp_pc
<int><fct><dbl>
1994USA 10368
1995USA 9604
1996USA 14462
1994China 5764
1995China16988
1996China 1835
1994Sudan 8100
1995Sudan17933
1996Sudan16246

In this situation, "country" does not fully identify which row is supposed to match up with which row across the two datasets, since we clearly want to match up the row for USA in 1994 in demo with the same row in econ.

To accommodate that, you can pass multiple columns to join’s on argument. e.g.:

[6]:
join(demo, econ, on = c("country", "year"), kind = "inner")
Joining, by = c("country", "year")

A data.frame: 6 × 4
countryyeardemocracygdp_pc
<chr><dbl><dbl><dbl>
USA 1994 810368
China1994 7 5764
Sudan199419 8100
USA 199516 9604
China1995 816988
Sudan1995 617933

What Rows Do You Keep?

As you may have noticed in the example above, our demo dataset had more countries than our econ dataset (it included India, while econ did not), and econ included more years than demo (namely: 1996). And when we merged the two datasets, only the country-year pairs that were common to both data sets were kept.

This is what’s referred to is an inner join. In an inner join, only rows whose identifiers are present in both datasets are kept in the final dataset. But inner is just one of four types of merges:

  • keep all rows (whether the match or not): full join (also sometimes called an outer join)

  • keep matching rows + unmatched rows from your “left” dataset (the first dataframe you pass to join): left join

  • keep matching rows + unmatched rows from your “right” dataset (the second dataframe you pass to join): right join

  • keep only matching rows: inner join

These four types of merges are nicely illustrated in this figure from datacomy:

types-of-joins

We can specify the type of join we want with the kind argument:

[7]:
# keep all rows, matching or not

join(demo, econ, on = c("country", "year"), kind = "full")
Joining, by = c("country", "year")

A data.frame: 11 × 4
countryyeardemocracygdp_pc
<chr><dbl><dbl><dbl>
USA 1994 810368
China1994 7 5764
Sudan199419 8100
India199415 NA
USA 199516 9604
China1995 816988
Sudan1995 617933
India199515 NA
USA 1996NA14462
China1996NA 1835
Sudan1996NA16246
[8]:
# keep matching + unmatched in demo observations
# (Left join)
join(demo, econ, on = c("country", "year"), kind = "left")

Joining, by = c("country", "year")

A data.frame: 8 × 4
countryyeardemocracygdp_pc
<chr><dbl><dbl><dbl>
USA 1994 810368
China1994 7 5764
Sudan199419 8100
India199415 NA
USA 199516 9604
China1995 816988
Sudan1995 617933
India199515 NA

As you can see, when you keep unmatched rows, R places NA values in entries that didn’t match, and so don’t have data for certain columns. We’ll talk about NAs in our missing data reading.

1-to-Many / Many-to-1 Merging

In our previous examples, the identifier we used clearly told R how to match each row of one dataset to exactly one row of the other dataset. We call this a 1-to-1 merge.

Sometimes, though, each row from one dataset may match up to multiple rows of the other. In these situations, we do a 1-to-many merge. Say, for example, we have the following two data frames:

[9]:
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo
A data.frame: 9 × 3
countryyeardemocracy_score
<fct><int><dbl>
USA 1994 0
China1994 8
Sudan199413
USA 199517
China1995 6
Sudan199514
USA 199619
China1996 1
Sudan1996 2
[10]:
region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa")
    )
region
A data.frame: 3 × 2
countryregion
<chr><chr>
USA America
ChinaAsia
SudanAfrica

Because region doesn’t vary over time, we have only 1 entry per country in the region dataset, while we have many entries for each county in demo (one per country-year pair). And when we merge the datasets, we want all the rows in demo to get copies of the matching row in region.

This can be done with join with no actual modifications to our code:

[11]:
join(demo, region, on = "country", kind = "inner")
Joining, by = "country"

A data.frame: 9 × 4
countryyeardemocracy_scoreregion
<chr><int><dbl><chr>
USA 1994 0America
China1994 8Asia
Sudan199413Africa
USA 199517America
China1995 6Asia
Sudan199514Africa
USA 199619America
China1996 1Asia
Sudan1996 2Africa

Many-to-Many Merges

There is such thing as a many-to-many merge, but it is incredibly rare you would ever want to do it. It results in the new dataset having all possible combinations of matching rows, which (if done incorrectly) can lead to a combinatorial explosion. So… I’d say don’t worry about it? :)

Recap

  • Merging is a method of bringing information from different datasets together.

  • Merging requires specifying the column(s) whose values tell R which rows to match up.

  • There are four types of merge, each of which will keep different data:

    • Full Join: Keep all rows.

    • Left Join: Keep rows that match + unmatched rows from the left (first) data set

    • Right Join: Keep rows that match + unmatched rows from the right (second) data set

    • Inner Join: Keep only rows that match.

Next Up

Let’s read more about the problems that can occur in merging, and how to address them!