Missing Data

One of the big differences between real data and toy datasets is that real data is almost always incomplete – governments may not have reported all the data they were supposed to, survey respondents may have hung up before the survey ended, data may have been corrupted, etc.

To accommodate this, R has a few very special data types / data values we’ll discuss here: NA, NaN, NULL, and Inf/-Inf. You’ve seen these pop up briefly in the past – when I tried to coerce a character to a number with as.numeric("Nick"), we got back an NA, and we’ve used df$col1 <- NULL to delete a column – but I’ve put off discussing these in detail because they have very special behavior that’s a little unintuitive.

NA

NA is the data value you’ll come across most in R to represent missing data. It basically means “I have no meaningful data to put here”.

An important feature of NA is that it is said to poison anything it touches, meaning that if an operation encounters an NA, it will almost always return an NA. This is considered desirable because, almost by definition, R can’t guess what it should do when it encounters an NA; how to handle missing data is something only a human can decide.

Here are some examples of NA poisoning:

[3]:
4 + NA
<NA>
[2]:
NA == 4
<NA>
[1]:
mean(c(1, 2, NA))
<NA>

Of course, this is sometimes a problem, so many R functions (especially summary functions like mean() and sum()) accept an na.rm (remove NA) keyword argument, which calculates the result while ignoring NAs. But for the reason described above, this is almost never the default behavior:

[6]:
mean(c(1, 2, NA), na.rm = TRUE)
1.5
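The same na.rm argument appears in several other base R summary functions. Here is a minimal sketch using a made-up vector called ages (the name and the numbers are purely illustrative, not from the examples above):

```r
# Hypothetical survey responses; the third respondent did not answer
ages <- c(34, 51, NA, 27)

# Without na.rm, the single NA turns every summary into NA
total_default <- sum(ages)
oldest_default <- max(ages)

# With na.rm = TRUE, the NA is dropped before the calculation
total_clean <- sum(ages, na.rm = TRUE)     # 34 + 51 + 27 = 112
oldest_clean <- max(ages, na.rm = TRUE)    # 51
average_clean <- mean(ages, na.rm = TRUE)  # 112 / 3

print(c(total_default, oldest_default))
print(c(total_clean, oldest_clean, average_clean))
```

Note that the average is computed over three values, not four: na.rm drops the missing entry rather than treating it as a zero.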

Checking for NAs

Because NA poisons any operation that touches it, you can’t check whether something is NA the way you would with any other value, because you’ll just get back an NA:

[7]:
NA == NA
<NA>
[3]:
x <- NA
x == NA
<NA>

Instead, you have to use a special function, is.na():

[4]:
is.na(x)
TRUE
[8]:
!is.na(7)
TRUE
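is.na() is vectorized, so it also works on whole vectors, which is where it tends to be most useful in practice. A short sketch, with an illustrative vector named scores (an assumption for the example, not data from this reading):

```r
# Illustrative data with two missing entries
scores <- c(90, NA, 75, NA, 60)

# One TRUE/FALSE per element
missing <- is.na(scores)

# TRUE counts as 1, so sum() gives the number of missing values
n_missing <- sum(missing)

# Keep only the entries that are not missing
observed <- scores[!missing]

print(missing)
print(n_missing)
print(observed)
```

Counting missing values this way before any analysis is a cheap check on how incomplete a column is.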

NaN, Inf, -Inf

NaN stands for “Not a Number”, and behaves much like NA. The difference is that NaN is defined by an international standard (IEEE 754) that dictates how computers handle a certain kind of number (what R calls numeric, and most people call “floats” or “floating-point numbers”). As a result, you may occasionally get NaN back from certain numerical computations when you might otherwise expect an NA:

[11]:
0/0
NaN

As with NA, NaN will poison any operation it touches:

[12]:
mean(c(1, 2, NaN))
NaN

And you can’t test for it with ==:

[14]:
NaN == NaN
<NA>

Thankfully, while there is a function called is.nan() that tests whether something is a NaN (and returns FALSE for an NA), is.na() returns TRUE for a NaN too, so in general in R you can just use is.na() to check for missing data:

[17]:
is.na(NaN)
TRUE
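To make the asymmetry between the two functions concrete, here is a small comparison on a vector that contains both kinds of value (the variable names are just for illustration):

```r
# A numeric vector holding an ordinary number, an NA, and a NaN
values <- c(1, NA, NaN)

# is.na() flags both NA and NaN
na_flags <- is.na(values)

# is.nan() flags only the NaN
nan_flags <- is.nan(values)

print(na_flags)
print(nan_flags)
```

So is.na() is the broader check, and is.nan() is only needed when you specifically care that a value came out of an undefined numerical computation.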

Inf and -Inf

Inf and -Inf are also defined in the international specification for floating-point numbers, and are used when a number gets so large (Inf) or so negative (-Inf) that it can’t be represented as a numeric. Since a numeric can hold values up to roughly 1.8e308, this is unbelievably hard to do with anything but truly pathological operations, like dividing by zero:

[19]:
7 / 0
Inf
[20]:
-7 / 0
-Inf

If you get an Inf, it doesn’t mean you’re working with really big numbers and R has failed you; it means you did something wrong. :)

Like NA and NaN, Inf and -Inf also tend to poison the calculations they touch:

[21]:
Inf - 5
Inf
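One thing to watch for: is.na() does not flag infinite values, so they need a separate check. Base R provides is.finite() and is.infinite() for this. A sketch with an illustrative vector named ratios (an assumption for the example):

```r
# Illustrative vector mixing ordinary, infinite, and missing values
ratios <- c(2.5, Inf, -Inf, NA, NaN)

na_flags <- is.na(ratios)              # TRUE only for NA and NaN
infinite_flags <- is.infinite(ratios)  # TRUE only for Inf and -Inf
finite_flags <- is.finite(ratios)      # TRUE only for ordinary numbers

# Keep just the values that are safe to do arithmetic on
usable <- ratios[finite_flags]

# Inf does not survive every operation unchanged: this gives NaN
difference <- Inf - Inf

print(usable)
print(difference)
```

Filtering with is.finite() removes NA, NaN, Inf, and -Inf in one step, which can be convenient when all four should be treated as unusable.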

NULL

NULL is a little different from these other values in that it isn’t meant as a stand-in for missing data, and so maybe doesn’t belong in this reading (but I have no idea where else to put it). NULL is… programming oblivion? Where a function returns NA if it has been asked to run a computation that doesn’t make sense (e.g. as.numeric("Nick")), it returns NULL if it doesn’t have anything to return at all.

If you try to put NULL into a vector, it isn’t kept as a record of bad or missing data; R just views that entry as nonexistent:

[23]:
c(1, NULL, 3)
  1. 1
  2. 3
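The difference between NULL and NA shows up clearly in length(): an NA occupies a slot in a vector, while NULL occupies none. To test for NULL, base R provides is.null(). A brief sketch (variable names are illustrative):

```r
# NA is kept as a placeholder; NULL simply disappears
with_na <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

len_na <- length(with_na)      # 3
len_null <- length(with_null)  # 2

# NULL itself has length zero
len_of_null <- length(NULL)    # 0

# is.null() is the check for NULL; NA is not NULL
null_check <- is.null(NULL)    # TRUE
na_check <- is.null(NA)        # FALSE

print(c(len_na, len_null, len_of_null))
print(c(null_check, na_check))
```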

So yeah, NULL is a programmer’s tool more than a data scientist’s tool.