Defensive Merging¶

As you saw in the last reading, merging is a little tricky to think about, by very straightforward to implement – specify your on columns and you’re off to the races. But, in my view, the fact that merging data is so easy is actually horrifyingly dangerous.

As discussed in our defensive programming reading, when working with real datasets we always have a model in our head about how our data is structured.

(Remember that in these exercises we play with little toy datasets that you can print out in their entirety to make it easy to understand what’s going on, but you can never actually look at all of your data we are working with thousands or hundreds of thousands of observations and tens or hundreds of variables.)

But it’s important always bear in mind that what we have in our heads is just that: a model, which may or may not be accurate.

I bring this up because merging is a great opportunity to test your model, and also a place where, if your model is wrong, you can get into a lot of trouble.

Suppose, for example, that we’re working with the econ and demo datasets from the last reading, and we want to merge on country. As before we’re assuming that country is a common identifier. But now let’s add a small problem – in demo we’ll denote the US as "USA", and in econ we’re write it as "United States" (this is not an uncommon happening). Suppose we don’t notice, and merge our data on country:

[1]:

# Generate toy dataset 1:
demo <- data.frame(
    country = c("USA", "Albania", "Turkey", "China", "Sudan"),
    democracy_score = c(19, 16, 16, 5, 10)
    )
demo

A data.frame: 5 × 2
country	democracy_score
<chr>	<dbl>
USA	19
Albania	16
Turkey	16
China	5
Sudan	10

[2]:

# Generate toy dataset 2:
econ <- data.frame(
    expand.grid(year = 1994:1996, country = c("United States", "China", "Sudan")),
    gdp_pc = round(runif(9, 1000, 20000), 0)
    )
econ

A data.frame: 9 × 3
year	country	gdp_pc
<int>	<fct>	<dbl>
1994	United States	17337
1995	United States	3513
1996	United States	8776
1994	China	14028
1995	China	6810
1996	China	1139
1994	Sudan	6393
1995	Sudan	4983
1996	Sudan	7583

[3]:

library(statar)
join(econ, demo, on = c("country", "year"), kind = "inner")

Joining, by = "country"

A data.frame: 6 × 4
year	country	gdp_pc	democracy_score
<int>	<chr>	<dbl>	<dbl>
1994	China	14028	5
1995	China	6810	5
1996	China	1139	5
1994	Sudan	6393	10
1995	Sudan	4983	10
1996	Sudan	7583	10

… oops. We just dropped the US from our analysis. No error messages, no warnings; it’s just gone.

Again, this is obvious here, but what if you had 180 countries in your data, and 10 years instead of 3 – that’d be 1,800 observations. You wouldn’t just see it.

Similarly, because it doesn’t seem like regions would change over time, you assume that your region dataset has one observation per country, but in actuality at some point in the past the United Nations changed region designations, and so the dataset includes one entry per country-year:

[4]:

region <- data.frame(
    country = c("USA", "China", "Sudan"),
    region = c("America", "Asia", "Africa"),
    year = c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
    )
region

A data.frame: 9 × 3
country	region	year
<chr>	<chr>	<dbl>
USA	America	1994
China	Asia	1994
Sudan	Africa	1994
USA	America	1995
China	Asia	1995
Sudan	Africa	1995
USA	America	1996
China	Asia	1996
Sudan	Africa	1996

Now if you merged this dataset with demo on "country" like we did above, this is what would happen:

[5]:

join(demo, region, on = "country", kind = "inner")

Joining, by = "country"

A data.frame: 9 × 4
country	democracy_score	region	year
<chr>	<dbl>	<chr>	<dbl>
USA	19	America	1994
USA	19	America	1995
USA	19	America	1996
China	5	Asia	1994
China	5	Asia	1995
China	5	Asia	1996
Sudan	10	Africa	1994
Sudan	10	Africa	1995
Sudan	10	Africa	1996

…oops. Now each country-year appears three times in our data! If we try to analyze this statistically, these repeat observations would make our results meaningless.

And again, no warnings, no error messages.

Merging with Checks¶

And now we get to the reason that I am a big advocate of using the statar package for merging: unlike other R packages, statar implements a number of safety features that are default in most other languages (in this case, the author is inspired by Stata, thus the name):

Check Merge Type: If you specify whether you you’re doing a 1-to-1, many-to-1, 1-to-many, or many-to-many merge, statar will check the data to see if you are correct. If not, it will raise an error.
Provide Merge Diagnostics: statar provides a diagnostic tool for evaluating whether your merges were successful to catch situations like the USA / United States situation described above.

Checking Merge Type¶

To illustrate, let’s try our region merge again with statar. We think we’re doing a many-to-1 merge (we expect several country-year pairs in demo to each match to one row in region), so we use the argument check = m~1:

join(demo, region, kind = "inner", on = "country", check = m~1)

> ERROR: Error: Variable(s) country don't uniquely identify observations in y
> Error: Variable(s) country don't uniquely identify observations in y
> Traceback:
>
> 1. join(demo, region, kind = "inner", on = "country", check = m ~
> .     1)
> 2. stop(paste0("Variable(s) ", paste(vars, collapse = " "), " don't uniquely identify observations in y"),
> .     call. = FALSE)

Ta-da! It points out that country doesn’t uniquely identify rows in our region dataset! Now we know we have a problem, and we won’t make a dataset full of duplicate entries.

Providing Merge Diagnostics¶

statar provides merge diagnostic information when you use the gen argument – you pass it a character, and it creates a new column with that name that tells you about whether each row is the result of a successful merge:

[6]:

# Do our merge with US problem
# as an outer ("full") merge.
# `gen="_merge"` will now add a column
# to our data with diagnostic info.

# Update demo to have year again.
demo <- data.frame(
    expand.grid(country = c("USA", "China", "Sudan"), year = 1994:1996),
    democracy_score = round(runif(9, 0, 20), 0)
    )
demo

merged_data <- join(econ, demo, on = c("country", "year"),
                    kind = "full", gen = "merge_check",
                    check = 1~1)
merged_data

A data.frame: 9 × 3
country	year	democracy_score
<fct>	<int>	<dbl>
USA	1994	5
China	1994	9
Sudan	1994	14
USA	1995	16
China	1995	9
Sudan	1995	2
USA	1996	16
China	1996	2
Sudan	1996	11

Joining, by = c("year", "country")

A data.frame: 12 × 5
year	country	gdp_pc	democracy_score	merge_check
<int>	<fct>	<dbl>	<dbl>	<int>
1994	United States	17337	NA	1
1995	United States	3513	NA	1
1996	United States	8776	NA	1
1994	China	14028	9	3
1995	China	6810	9	3
1996	China	1139	2	3
1994	Sudan	6393	14	3
1995	Sudan	4983	2	3
1996	Sudan	7583	11	3
1994	USA	NA	5	2
1995	USA	NA	16	2
1996	USA	NA	16	2

As you can see, join has added a column called merge_check with values of 1, 2, and 3. The meaning of these is:

1: Row in left dataset, no match in right dataset.
2: Row in right dataset, no match in left dataset.
3: Row merged successfully.

We can now check these values with table():

[7]:

table(merged_data$merge_check)


1 2 3
3 3 6

And we can add tests here too! For example, if we expected everything to merge correctly, we can add the following test (again, see defensive programming reading for discussion of tests):

[8]:

library(assertthat)

assert_that(all(merged_data$merge_check == 3))

> ERROR: Error: Elements 1, 2, 3, 10, 11, 12 of merged_data$merge_check == 3 are not true
> Error: Elements 1, 2, 3, 10, 11, 12 of merged_data$merge_check == 3 are not true
> Traceback:
>
> 1. assert_that(all(merged_data$merge_check == 3))

And now we know something went wrong. What went wrong? Let’s look at the rows that didn’t merge!

[9]:

merged_data[merged_data$merge_check != 3, ]

A data.frame: 6 × 5
	year	country	gdp_pc	democracy_score	merge_check
	<int>	<fct>	<dbl>	<dbl>	<int>
1	1994	United States	17337	NA	1
2	1995	United States	3513	NA	1
3	1996	United States	8776	NA	1
10	1994	USA	NA	5	2
11	1995	USA	NA	16	2
12	1996	USA	NA	16	2

Well there you go then! We can immediately see the problem.

As a rule, I basically never do inner joins, because with an inner join any problems with the merge – e.g. rows you expected to merge that didn’t – get dropped, masking any problems.

Instead, I strongly recommend always doing a full join, checking the diagnostic information provided by join, and then dropping any rows you no longer want.

But what if I want to do an inner, left, or right join?

Simple! Do the outer join, then drop the rows you don’t want using your merge_check variable. e.g. for an inner join, you can do merged_data = merged_data[merged_data$merge_check == 3,], for a left join you can do merged_data = merged_data[merged_data$merge_check != 2,], etc. But crucially, you can now do these after you’ve made sure nothing crazy happened in your merge by looking at the merge_check values.