Intro to Merging¶

This tutorial explains how to combine different datasets through merging. Merging is basically a way of combining the columns from one dataset with those of another.

To illustrate the kind of situation where you may end up wanting to merge data, suppose that you are interested in how economic development affects democracy across countries, but all you have are two separate datasets, each with data on one of these two outcomes – say, economic data from the World Bank and political data from the Polity Project. In order to study how these things relate to one another, we need to combine these two data sets.

(If you don’t care about democracy, just imagine we’re talking about something you care about, like urban pollution levels instead of democracy scores! The logic is the same.)

In a previous reading, I argued that subsiding data is probably the most important skill you will learn as a data scientist. Being able to subset data carefully makes it possible to study how all sorts of outcomes vary across different groups.

Merging, I would argue, is the second most important skill for generating insights in data science. It is hard to overstate the value that comes from bringing together two disparate sources of information and studying how they relate to one another. This may surprise you coming from someone who teaches data science – most people would assume a data scientist believes value comes from fancy models – but in my experience, most insights come from documenting simple empirical regularities about the world that nobody has noticed before. We often use more sophisticated statistical modeling in order to evaluate the strength of relationships, and to help us understand what is driving those empirical regularities, but all of that work is secondary to just being able to measure basic features of the world. And merging is key to that process.

A Merging Example¶

To illustrate merging, let’s suppose that we want to do the analysis of the relationship between political outcomes and economic development we described above. Below we create to toy data sets, one with democracy scores (analogous to those you might get from a group like the Polity Project), and one with economic information (like the type of information you might get from the World Developing Indicators):

[1]:

# Generate toy dataset 1:
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo

A data.frame: 5 × 2
country	democracy_score
<chr>	<dbl>
USA	19
Albania	16
Turkey	16
China	5
Sudan	10

[2]:

# Generate toy dataset 2:
econ <- data.frame(
    country = c("China", "Albania", "Turkey", "USA", "Sudan"),
    gdp_pc = c(12000, 10000, 9000, 20000, 500)
    )
econ

A data.frame: 5 × 2
country	gdp_pc
<chr>	<dbl>
China	12000
Albania	10000
Turkey	9000
USA	20000
Sudan	500

To merge two data sets, they must have an identifier in common – one or more columns whose values allow us to understand which rows are identifying the same entities in the two datasets. In the case of these two datasets, the common identifier is country name, stored in the column "country" in both datasets.

To do the actual merge, we will use the join() command from a package called statar. This isn’t the most popular package for merging and R, but for reasons we’ll discuss enter next reading, I definitely think it’s the best.

join() requires at least four arguments: the two datasets you want to merge, and the name of the column(s) containing the common identifier, and the type of merge (which, for now, we’ll just specify as "inner", but we’ll talk more about what that means below). In this case, our call to join would look like this:

[3]:

library(statar)
df <- join(demo, econ, on = "country", kind = "inner")
df

Joining, by = "country"

A data.frame: 5 × 3
country	democracy_score	gdp_pc
<chr>	<dbl>	<dbl>
USA	19	20000
Albania	16	10000
Turkey	16	9000
China	5	12000
Sudan	10	500

This is called a 1-to-1 merge because each row in the first dataset matches up to exactly one row in the second dataset – i.e. each country appears only once in the first dataset, and matches up to a single row in the second dataset (the row with the same country identifier).

Your common identifier may be spread across multiple columns. Consider the following example:

[4]:

# Generate toy dataset 1:
country <- rep(c("USA", "China", "Sudan", "India"), 2)
year <- c(1994, 1994, 1994, 1994, 1995, 1995, 1995, 1995)
democracy <- round(runif(8, 0, 20), 0)

demo <- data.frame(country, year, democracy)
demo

A data.frame: 8 × 3
country	year	democracy
<chr>	<dbl>	<dbl>
USA	1994	8
China	1994	7
Sudan	1994	19
India	1994	15
USA	1995	16
China	1995	8
Sudan	1995	6
India	1995	15

[5]:

# Generate toy dataset 2:
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("USA", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ

A data.frame: 9 × 3
year	country	gdp_pc
<int>	<fct>	<dbl>
1994	USA	10368
1995	USA	9604
1996	USA	14462
1994	China	5764
1995	China	16988
1996	China	1835
1994	Sudan	8100
1995	Sudan	17933
1996	Sudan	16246

In this situation, "country" does not fully identify which row is supposed to match up with which row across the two datasets, since we clearly want to match up the row for USA in 1994 in demo with the same row in econ.

To accommodate that, you can pass multiple columns to join’s on argument. e.g.:

[6]:

join(demo, econ, on = c("country", "year"), kind = "inner")

Joining, by = c("country", "year")

A data.frame: 6 × 4
country	year	democracy	gdp_pc
<chr>	<dbl>	<dbl>	<dbl>
USA	1994	8	10368
China	1994	7	5764
Sudan	1994	19	8100
USA	1995	16	9604
China	1995	8	16988
Sudan	1995	6	17933

What Rows Do You Keep?¶

As you may have noticed in the example above, our demo dataset had more countries than our econ dataset (it included India, while econ did not), and econ included more years than demo (namely: 1996). And when we merged the two datasets, only the country-year pairs that were common to both data sets were kept.

This is what’s referred to is an inner join. In an inner join, only rows whose identifiers are present in both datasets are kept in the final dataset. But inner is just one of four types of merges:

keep all rows (whether the match or not): full join (also sometimes called an outer join)
keep matching rows + unmatched rows from your “left” dataset (the first dataframe you pass to join): left join
keep matching rows + unmatched rows from your “right” dataset (the second dataframe you pass to join): right join
keep only matching rows: inner join

These four types of merges are nicely illustrated in this figure from datacomy:

types-of-joins

We can specify the type of join we want with the kind argument:

[7]:

# keep all rows, matching or not

join(demo, econ, on = c("country", "year"), kind = "full")

Joining, by = c("country", "year")

A data.frame: 11 × 4
country	year	democracy	gdp_pc
<chr>	<dbl>	<dbl>	<dbl>
USA	1994	8	10368
China	1994	7	5764
Sudan	1994	19	8100
India	1994	15	NA
USA	1995	16	9604
China	1995	8	16988
Sudan	1995	6	17933
India	1995	15	NA
USA	1996	NA	14462
China	1996	NA	1835
Sudan	1996	NA	16246

[8]:

# keep matching + unmatched in demo observations
# (Left join)
join(demo, econ, on = c("country", "year"), kind = "left")

Joining, by = c("country", "year")

A data.frame: 8 × 4
country	year	democracy	gdp_pc
<chr>	<dbl>	<dbl>	<dbl>
USA	1994	8	10368
China	1994	7	5764
Sudan	1994	19	8100
India	1994	15	NA
USA	1995	16	9604
China	1995	8	16988
Sudan	1995	6	17933
India	1995	15	NA

As you can see, when you keep unmatched rows, R places NA values in entries that didn’t match, and so don’t have data for certain columns. We’ll talk about NAs in our missing data reading.

1-to-Many / Many-to-1 Merging¶

In our previous examples, the identifier we used clearly told R how to match each row of one dataset to exactly one row of the other dataset. We call this a 1-to-1 merge.

Sometimes, though, each row from one dataset may match up to multiple rows of the other. In these situations, we do a 1-to-many merge. Say, for example, we have the following two data frames:

[9]:

demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo

A data.frame: 9 × 3
country	year	democracy_score
<fct>	<int>	<dbl>
USA	1994	0
China	1994	8
Sudan	1994	13
USA	1995	17
China	1995	6
Sudan	1995	14
USA	1996	19
China	1996	1
Sudan	1996	2

[10]:

region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa")
    )
region

A data.frame: 3 × 2
country	region
<chr>	<chr>
USA	America
China	Asia
Sudan	Africa

Because region doesn’t vary over time, we have only 1 entry per country in the region dataset, while we have many entries for each county in demo (one per country-year pair). And when we merge the datasets, we want all the rows in demo to get copies of the matching row in region.

This can be done with join with no actual modifications to our code:

[11]:

join(demo, region, on = "country", kind = "inner")

Joining, by = "country"

A data.frame: 9 × 4
country	year	democracy_score	region
<chr>	<int>	<dbl>	<chr>
USA	1994	0	America
China	1994	8	Asia
Sudan	1994	13	Africa
USA	1995	17	America
China	1995	6	Asia
Sudan	1995	14	Africa
USA	1996	19	America
China	1996	1	Asia
Sudan	1996	2	Africa

Many-to-Many Merges¶

There is such thing as a many-to-many merge, but it is incredibly rare you would ever want to do it. It results in the new dataset having all possible combinations of matching rows, which (if done incorrectly) can lead to a combinatorial explosion. So… I’d say don’t worry about it? :)

Recap¶

Merging is a method of bringing information from different datasets together.
Merging requires specifying the column(s) whose values tell R which rows to match up.
There are four types of merge, each of which will keep different data:
- Full Join: Keep all rows.
- Left Join: Keep rows that match + unmatched rows from the left (first) data set
- Right Join: Keep rows that match + unmatched rows from the right (second) data set
- Inner Join: Keep only rows that match.

Next Up¶

Let’s read more about the problems that can occur in merging, and how to address them!