Defensive Merging

As you saw in the last reading, merging is a little tricky to think about, by very straightforward to implement – specify your on columns and you’re off to the races. But, in my view, the fact that merging data is so easy is actually horrifyingly dangerous.

As discussed in our defensive programming reading, when working with real datasets we always have a model in our head about how our data is structured.

(Remember that in these exercises we play with little toy datasets that you can print out in their entirety to make it easy to understand what’s going on, but you can never actually look at all of your data we are working with thousands or hundreds of thousands of observations and tens or hundreds of variables.)

But it’s important always bear in mind that what we have in our heads is just that: a model, which may or may not be accurate.

I bring this up because merging is a great opportunity to test your model, and also a place where, if your model is wrong, you can get into a lot of trouble.

Suppose, for example, that we’re working with the econ and demo datasets from the last reading, and we want to merge on country. As before we’re assuming that country is a common identifier. But now let’s add a small problem – in demo we’ll denote the US as "USA", and in econ we’re write it as "United States" (this is not an uncommon happening). Suppose we don’t notice, and merge our data on country:

[1]:
# Generate toy dataset 1:
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo
A data.frame: 5 × 2
countrydemocracy_score
<chr><dbl>
USA 19
Albania16
Turkey 16
China 5
Sudan 10
[2]:
# Generate toy dataset 2:
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("United States", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ
A data.frame: 9 × 3
yearcountrygdp_pc
<int><fct><dbl>
1994United States17337
1995United States 3513
1996United States 8776
1994China 14028
1995China 6810
1996China 1139
1994Sudan 6393
1995Sudan 4983
1996Sudan 7583
[3]:
library(statar)
join(econ, demo, on = c("country", "year"), kind = "inner")
Joining, by = "country"

A data.frame: 6 × 4
yearcountrygdp_pcdemocracy_score
<int><chr><dbl><dbl>
1994China14028 5
1995China 6810 5
1996China 1139 5
1994Sudan 639310
1995Sudan 498310
1996Sudan 758310

… oops. We just dropped the US from our analysis. No error messages, no warnings; it’s just gone.

Again, this is obvious here, but what if you had 180 countries in your data, and 10 years instead of 3 – that’d be 1,800 observations. You wouldn’t just see it.

Similarly, because it doesn’t seem like regions would change over time, you assume that your region dataset has one observation per country, but in actuality at some point in the past the United Nations changed region designations, and so the dataset includes one entry per country-year:

[4]:
region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa"),
    year = c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
    )
region
A data.frame: 9 × 3
countryregionyear
<chr><chr><dbl>
USA America1994
ChinaAsia 1994
SudanAfrica 1994
USA America1995
ChinaAsia 1995
SudanAfrica 1995
USA America1996
ChinaAsia 1996
SudanAfrica 1996

Now if you merged this dataset with demo on "country" like we did above, this is what would happen:

[5]:
join(demo, region, on = "country", kind = "inner")
Joining, by = "country"

A data.frame: 9 × 4
countrydemocracy_scoreregionyear
<chr><dbl><chr><dbl>
USA 19America1994
USA 19America1995
USA 19America1996
China 5Asia 1994
China 5Asia 1995
China 5Asia 1996
Sudan10Africa 1994
Sudan10Africa 1995
Sudan10Africa 1996

…oops. Now each country-year appears three times in our data! If we try to analyze this statistically, these repeat observations would make our results meaningless.

And again, no warnings, no error messages.

Merging with Checks

And now we get to the reason that I am a big advocate of using the statar package for merging: unlike other R packages, statar implements a number of safety features that are default in most other languages (in this case, the author is inspired by Stata, thus the name):

  • Check Merge Type: If you specify whether you you’re doing a 1-to-1, many-to-1, 1-to-many, or many-to-many merge, statar will check the data to see if you are correct. If not, it will raise an error.

  • Provide Merge Diagnostics: statar provides a diagnostic tool for evaluating whether your merges were successful to catch situations like the USA / United States situation described above.

Checking Merge Type

To illustrate, let’s try our region merge again with statar. We think we’re doing a many-to-1 merge (we expect several country-year pairs in demo to each match to one row in region), so we use the argument check = m~1:

join(demo, region, kind = "inner", on = "country", check = m~1)

> ERROR: Error: Variable(s) country don't uniquely identify observations in y
> Error: Variable(s) country don't uniquely identify observations in y
> Traceback:
>
> 1. join(demo, region, kind = "inner", on = "country", check = m ~
> .     1)
> 2. stop(paste0("Variable(s) ", paste(vars, collapse = " "), " don't uniquely identify observations in y"),
> .     call. = FALSE)

Ta-da! It points out that country doesn’t uniquely identify rows in our region dataset! Now we know we have a problem, and we won’t make a dataset full of duplicate entries.

Providing Merge Diagnostics

statar provides merge diagnostic information when you use the gen argument – you pass it a character, and it creates a new column with that name that tells you about whether each row is the result of a successful merge:

[6]:
# Do our merge with US problem
# as an outer ("full") merge.
# `gen="_merge"` will now add a column
# to our data with diagnostic info.

# Update demo to have year again.
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo

merged_data <- join(econ, demo, on = c("country", "year"),
                    kind = "full", gen = "merge_check",
                    check = 1~1)
merged_data
A data.frame: 9 × 3
countryyeardemocracy_score
<fct><int><dbl>
USA 1994 5
China1994 9
Sudan199414
USA 199516
China1995 9
Sudan1995 2
USA 199616
China1996 2
Sudan199611
Joining, by = c("year", "country")

A data.frame: 12 × 5
yearcountrygdp_pcdemocracy_scoremerge_check
<int><fct><dbl><dbl><int>
1994United States17337NA1
1995United States 3513NA1
1996United States 8776NA1
1994China 14028 93
1995China 6810 93
1996China 1139 23
1994Sudan 6393143
1995Sudan 4983 23
1996Sudan 7583113
1994USA NA 52
1995USA NA162
1996USA NA162

As you can see, join has added a column called merge_check with values of 1, 2, and 3. The meaning of these is:

  • 1: Row in left dataset, no match in right dataset.

  • 2: Row in right dataset, no match in left dataset.

  • 3: Row merged successfully.

We can now check these values with table():

[7]:
table(merged_data$merge_check)

1 2 3
3 3 6

And we can add tests here too! For example, if we expected everything to merge correctly, we can add the following test (again, see defensive programming reading for discussion of tests):

[8]:
library(assertthat)
assert_that(all(merged_data$merge_check == 3))

> ERROR: Error: Elements 1, 2, 3, 10, 11, 12 of merged_data$merge_check == 3 are not true
> Error: Elements 1, 2, 3, 10, 11, 12 of merged_data$merge_check == 3 are not true
> Traceback:
>
> 1. assert_that(all(merged_data$merge_check == 3))

And now we know something went wrong. What went wrong? Let’s look at the rows that didn’t merge!

[9]:
merged_data[merged_data$merge_check != 3, ]
A data.frame: 6 × 5
yearcountrygdp_pcdemocracy_scoremerge_check
<int><fct><dbl><dbl><int>
11994United States17337NA1
21995United States 3513NA1
31996United States 8776NA1
101994USA NA 52
111995USA NA162
121996USA NA162

Well there you go then! We can immediately see the problem.

As a rule, I basically never do inner joins, because with an inner join any problems with the merge – e.g. rows you expected to merge that didn’t – get dropped, masking any problems.

Instead, I strongly recommend always doing a full join, checking the diagnostic information provided by join, and then dropping any rows you no longer want.

But what if I want to do an inner, left, or right join?

Simple! Do the outer join, then drop the rows you don’t want using your merge_check variable. e.g. for an inner join, you can do merged_data = merged_data[merged_data$merge_check == 3,], for a left join you can do merged_data = merged_data[merged_data$merge_check != 2,], etc. But crucially, you can now do these after you’ve made sure nothing crazy happened in your merge by looking at the merge_check values.