Intro to Dataframes

In our previous lessons, we’ve talked about how vectors are often used to store lots of different observations of a given measurement (e.g. the answers of different survey respondents to a given question), and how matrices can be used to collect lots of different measurements in columns (e.g. each column can hold the answers to a different question).

But matrices have one major limitation when it comes to social science workflows, which is that all the entries in a matrix have to be of the same type. In reality, however, we often have datasets with lots of different data types. For example, we might have numeric data on age and income, but character data for people’s names, preferred political candidate, etc. Or we might have data on power plants across the US that includes numeric data on capacity, age, and pollution alongside character data on the power plant’s fuel and the company that owns the plant.

To deal with this kind of heterogeneous tabular data, we turn to the data.frame.

Dataframes are basically just a collection of vectors, where each vector corresponds to a different column, and each column has a single type. Since they’re two-dimensional data structures like matrices, we can actually subset them in the same way as matrices, but they are more flexible in terms of the types of data they can store.
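
To make the contrast with matrices concrete, here is a small sketch. The vectors `ages` and `first_names` are made up for illustration. Binding them into a matrix forces everything to a single type, while a dataframe lets each column keep its own type:

```r
# Two illustrative vectors of different types (hypothetical data)
ages <- c(34, 51, 27)
first_names <- c("ana", "ben", "cy")

# A matrix can hold only one type, so the numbers get coerced to character
m <- cbind(ages, first_names)
class(m[, "ages"])

# A dataframe stores one type per column, so ages stays numeric
d <- data.frame(ages, first_names)
class(d[, "ages"])
class(d[, "first_names"])
```

The first `class()` call should report a character vector and the second a numeric one, which is exactly the flexibility that matrices lack.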

In this reading, we’ll discuss how to create dataframes from vectors and how to load dataframes from files. Then in our next reading we will discuss how to manipulate dataframes (which, as we will discover, is very similar to how we learned to manipulate matrices).

Creating Dataframes from Vectors

Let’s start by learning how to create a dataframe from vectors in R. This turns out to be very simple — just combine a set of vectors of the same length using the data.frame() command.

[1]:
# Create three vectors
name <- c("al", "bea", "carol")
age <- c(6, 7, 4)
hair <- c("brown", "green", "blond")

# Create data frame
children <- data.frame(name, age, hair)
children
A data.frame: 3 × 3
name   age    hair
<chr>  <dbl>  <chr>
al     6      brown
bea    7      green
carol  4      blond
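
The "same length" requirement is worth seeing in action. The sketch below deliberately makes `age` one element short and uses `tryCatch()` to capture the resulting error message instead of halting. It also shows one wrinkle: as far as I know, `data.frame()` will recycle a shorter vector when its length divides evenly into the longer one, so it is safest not to rely on that and to check your lengths yourself.

```r
name <- c("al", "bea", "carol")
age <- c(6, 7)  # deliberately one element short

# data.frame() raises an error when the lengths cannot be reconciled;
# tryCatch() stores the error message so the script keeps running
result <- tryCatch(
    data.frame(name, age),
    error = function(e) conditionMessage(e)
)
result

# A length-2 vector divides evenly into length 4, so it is recycled silently
recycled <- data.frame(x = 1:4, y = c("a", "b"))
nrow(recycled)
```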

Or we can create our data frame by inserting our vectors as keyword arguments:

[2]:
# Create data frame
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)
children
A data.frame: 3 × 3
name   age    hair
<chr>  <dbl>  <chr>
al     6      brown
bea    7      green
carol  4      blond

Note that unlike matrices and vectors – which can have names – dataframe columns always have names, and you’ll usually see columns accessed by name for reasons we’ll discuss below:

[3]:
children[, "hair"]
  1. 'brown'
  2. 'green'
  3. 'blond'

And as we discussed before, the columns of a dataframe are just our old friend, the vector!

[4]:
class(children[, "hair"])
'character'
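
As a quick sketch of what access by name looks like in practice, the example below rebuilds the `children` dataframe and pulls out the `age` column three ways. The `$` and `[[ ]]` forms are standard R alternatives to the matrix-style `[, "age"]` shown above; they are included here only as a preview, not as something this reading has covered yet.

```r
children <- data.frame(
    name = c("al", "bea", "carol"),
    age = c(6, 7, 4),
    hair = c("brown", "green", "blond")
)

# List all of the column names
names(children)

# Three ways to pull out the same column as a vector
by_bracket <- children[, "age"]
by_dollar <- children$age
by_double <- children[["age"]]

# All three return the same numeric vector
identical(by_bracket, by_dollar)
identical(by_dollar, by_double)

# Because a column is just a vector, vector functions work on it directly
mean(children$age)
```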

Getting to Know Your Dataframe

To better understand the proper structure of datasets, let’s create a second data frame that has a more realistic data structure:

[5]:
country <- rep(c("USA", "China", "Sudan"), 3)
year <- c(1994, 1994, 1994, 1995, 1995, 1995, 1996, 1996, 1996)
gdp_pc <- round(runif(9, 1000, 20000))

countries <- data.frame(country, year, gdp_pc)
countries
A data.frame: 9 × 3
country  year   gdp_pc
<chr>    <dbl>  <dbl>
USA      1994    2887
China    1994   15273
Sudan    1994    5232
USA      1995   14511
China    1995   12811
Sudan    1995    7416
USA      1996   13420
China    1996   11140
Sudan    1996   14712

Here we can pretend that gdp_pc is a measure of a country’s GDP per capita in a given year.

(A quick aside: rep(), as you may recall, creates a vector that repeats the first input the number of times specified by the second input. runif() creates, in this case, 9 random values uniformly distributed between 1000 and 20000.)
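
Because runif() draws random numbers, the gdp_pc values you get when you run this code will differ from the ones printed here. If you want the same draws every time, one common approach (an addition here, not part of the example above) is to call `set.seed()` first. The sketch below also contrasts the default behavior of `rep()` with its `each` argument:

```r
# Repeat the whole vector three times (the form used above)
whole <- rep(c("USA", "China", "Sudan"), 3)
whole

# Repeat each element three times before moving to the next one
by_element <- rep(c("USA", "China", "Sudan"), each = 3)
by_element

# Setting the same seed before each call makes the random draws repeatable
# (the seed value 42 is arbitrary)
set.seed(42)
draws_a <- round(runif(9, 1000, 20000))
set.seed(42)
draws_b <- round(runif(9, 1000, 20000))
identical(draws_a, draws_b)
```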

Now let’s explore some common functions for getting to know your dataframe!

Use nrow() and ncol() to get the number of rows (observations) and columns (variables):

[6]:
# Num rows (i.e. observations)
nrow(countries)
9
[7]:
# Num columns (i.e. variables)
ncol(countries)
3

Use head() and tail() to look at the first and last few rows of a dataset, respectively. Obviously this is more useful when we have datasets with hundreds or thousands of observations and you can’t just look at the whole thing at once! :)

[8]:
head(countries)
A data.frame: 6 × 3
   country  year   gdp_pc
   <chr>    <dbl>  <dbl>
1  USA      1994    2887
2  China    1994   15273
3  Sudan    1994    5232
4  USA      1995   14511
5  China    1995   12811
6  Sudan    1995    7416
[9]:
tail(countries)
A data.frame: 6 × 3
   country  year   gdp_pc
   <chr>    <dbl>  <dbl>
4  USA      1995   14511
5  China    1995   12811
6  Sudan    1995    7416
7  USA      1996   13420
8  China    1996   11140
9  Sudan    1996   14712
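
By default head() and tail() show six rows, but both accept an `n` argument if you want more or fewer. Here is a small sketch that rebuilds `countries` with the gdp_pc values printed above typed in by hand (so it doesn't depend on random draws), and also uses `dim()` to get the row and column counts in a single call:

```r
# Rebuild the example dataframe with fixed values copied from the output above
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2887, 15273, 5232, 14511, 12811, 7416, 13420, 11140, 14712)
)

# Rows and columns together: number of rows first, then number of columns
dim(countries)

# Ask for a specific number of rows instead of the default six
first_three <- head(countries, n = 3)
last_two <- tail(countries, n = 2)
first_three
last_two
```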

Finally, summary() is a great tool for getting a quick sense of the variables in the data:

[10]:
# Get some summary information about each variable
summary(countries)
   country               year          gdp_pc
 Length:9           Min.   :1994   Min.   : 2887
 Class :character   1st Qu.:1994   1st Qu.: 7416
 Mode  :character   Median :1995   Median :12811
                    Mean   :1995   Mean   :10822
                    3rd Qu.:1996   3rd Qu.:14511
                    Max.   :1996   Max.   :15273
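
If what you mostly want to know is the type of each column, `str()` and `sapply(..., class)` are two other standard tools that complement summary(). A minimal sketch, again using hand-typed values for gdp_pc:

```r
countries <- data.frame(
    country = rep(c("USA", "China", "Sudan"), 3),
    year = rep(c(1994, 1995, 1996), each = 3),
    gdp_pc = c(2887, 15273, 5232, 14511, 12811, 7416, 13420, 11140, 14712)
)

# Compact overview: dimensions, column names, types, and the first few values
str(countries)

# Apply class() to every column to get a named vector of column types
col_types <- sapply(countries, class)
col_types
```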

Reading Data from Files

Yup, it’s finally time for REAL DATA! HOORAY!

While it’s often useful to construct a dataframe manually from vectors for illustrative purposes, in reality the dataframes that you work with will almost always be coming from a file that somebody has provided you (an Excel spreadsheet, a csv spreadsheet, a Stata data set, etc.). To read these kinds of files, you need to:

  1. Tell R where to find the file you want to read by setting your working directory.

  2. Execute a command that will read the file from the specified location.

Working Directory

The first step to loading a file is to understand the concept of a working directory.

The working directory is the location on your file system that R thinks of as being “open” in your current session of R. The working directory specifies where R will do things like save or look for files by default. For example, if your working directory is currently set to your desktop and you save a dataframe called my_data using a command like write.csv(my_data, "my_data.csv"), the file my_data.csv will be saved to your desktop. Similarly, if you try to open a file with a command like read.csv("my_data.csv"), R will look for a file called “my_data.csv” on your desktop and try to load it; if it can’t find that file there, it will report that it was unable to find the file you asked for.

You can see the current working directory of your R session with the command getwd(). On my system (macOS), the output will look something like this:

[11]:
getwd()
'/Users/Nick/github/computational_methods_boot_camp/source'

What that is saying is that my current working directory is the folder source in the folder computational_methods_boot_camp in the folder github located in my user directory. The exact result you see will vary depending on how you opened R on your own computer, and if you’re working on a Windows computer you can expect that your working directory will start with a drive letter like C:/ instead of a simple /.

To change your working directory, you can use the command setwd("[new working directory]"). For example, if I wanted to move my working directory to my Downloads folder, I’d type:

[12]:
setwd("/users/nick/downloads")

And if you want to see what’s in your working directory (as a sanity check to ensure you’re in the right place), run dir():

[13]:
setwd("/Users/Nick/github/computational_methods_boot_camp/source/data")
dir()
  1. 'Datasaurus.tsv'
  2. 'Frank_All_v97.csv'
  3. 'RatesDeaths_AllIndicators.xlsx'
  4. 'State_FIPS.csv'
  5. 'anes-1948-2012.zip'
  6. 'democracy.csv'
  7. 'gdp.csv'
  8. 'nyc-311-sample'
  9. 'nyc-311-sample.zip'
  10. 'region.csv'
  11. 'states.csv'
  12. 'states_codebook.csv'
  13. 'trump_tweets.json'
  14. 'unicef-all.csv'
  15. 'unicef-u5mr.csv'
  16. 'world-small.csv'
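
To see how these pieces fit together without touching your own files, here is a small round trip. It is a sketch only: the dataframe `toy` and the file name `toy_data.csv` are made up, and it works in the temporary folder returned by `tempdir()` rather than a real project folder.

```r
# Remember where we started so we can go back at the end
old_wd <- getwd()

# tempdir() returns a throwaway folder that R creates for this session
setwd(tempdir())

# Save a small made-up dataframe into the (new) working directory
toy <- data.frame(id = 1:3, score = c(90, 85, 77))
write.csv(toy, "toy_data.csv", row.names = FALSE)

# Sanity check: the file should now show up in the directory listing
file_found <- "toy_data.csv" %in% dir()
file_found

# Reading with just the file name works because R looks in the working directory
toy_again <- read.csv("toy_data.csv")
toy_again

# Restore the original working directory
setwd(old_wd)
```

Saving the old working directory and restoring it at the end is a habit worth keeping, since a changed working directory otherwise sticks around for the rest of the session.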

NOTE: Setting Working Directory in RStudio

There are a few different ways to specify file paths, and they can get a little tricky. In the long run learning how file paths work is an important skill, but for the moment, if you are ever unable to set your working directory to the folder you want, in RStudio you can set the working directory by going to the Session menu, then Set Working Directory, and selecting Choose Directory....

Once you pick a folder, you will see RStudio insert the correct path into the setwd() function in your console, changing your working directory and showing you what the filepath actually is.

(Note that on macOS, RStudio will often insert a path that starts with ~/. That’s just shorthand on Macs for the current user’s home directory, and is the same as typing out /Users/[your user name]/.)

If you want to learn more about file paths, you can read about them here.
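
If you're curious what the ~ shorthand stands for on your own machine, or want to build a path without worrying about slashes, base R has a few helpers. The sketch below assumes a hypothetical subfolder called `data`; the file doesn't need to exist for the code to run.

```r
# Show what "~" expands to on this computer
path.expand("~")

# Build a path from its pieces; "data" is a hypothetical folder name
data_path <- file.path("data", "world-small.csv")
data_path

# Check whether anything actually exists at that path
# (relative paths are resolved against the working directory)
file.exists(data_path)

# Pull just the file name back out of a longer path
basename(data_path)
```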

Reading the File

Now that we’ve told R where to look for our file, it’s time to read it in.

Datasets come in many formats, usually identified by their file suffix. A file ending in .csv, for example (e.g. file.csv) is a kind of spreadsheet with data stored as comma-separated values. A file ending in .dta (e.g. file.dta) is a dataset created by the program Stata, and so on. Thankfully, R can read almost any standard data format you may get.

To illustrate, here are a handful of commands for reading different types of files. Don’t try to memorize these! I’m only providing this list so you know that these commands exist; if you ever need one in the future, it will occur to you to google it.

Note: The ability to load some formats requires third party packages.

You can install a third party package onto your computer using the command install.packages() (e.g. install.packages("foreign")), after which you load it into any R session where you want to use the package with the library() command (e.g. library(foreign)). Note the use of quotes around the package name in install.packages(), but not library(). (library() will also accept a quoted name, but leaving the quotes off is the usual convention.)

# Available by default
df <- read.csv("file.csv")           # Comma separated values
df <- read.csv("file.txt", sep="\t") # tab separated values

# Using the `foreign` library
library(foreign) #load foreign
df <- read.dta("file.dta")   # Stata data (Stata versions 5-12 only)
df <- read.spss("file.sav", to.data.frame = TRUE) # SPSS data (returns a list unless to.data.frame = TRUE)

# Using the `readxl` library
library(readxl)
df <- read_excel("file.xls")  # Excel xls spreadsheet
df <- read_excel("file.xlsx") # Excel xlsx spreadsheet
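
Since some of the commands above only work once a package is installed, it can be handy to check for a package before trying to load it. One way to do that (a sketch, not the only approach) is `requireNamespace()`, which returns TRUE or FALSE instead of stopping with an error:

```r
# Name of the package we want to check for
pkg <- "readxl"

# requireNamespace() reports whether the package is available,
# without attaching it the way library() does
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
    message(pkg, " is installed; load it with library(", pkg, ")")
} else {
    message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}
```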

For the exercises we’ll be doing next, we’ll work with the world-small.csv dataset, which you can download here by clicking the raw button, then right-clicking, selecting “Save As…”, and saving the file as world-small.csv.

As noted, different commands are used to read different types of files. This is the syntax used for reading a .csv file:

[14]:
# Set working directory to this file's directory
# (this will be different for you!)
setwd("/Users/Nick/github/computational_methods_boot_camp/source/data")

# Read in the data and assign to a variable.
world <- read.csv("world-small.csv")

As we can see, the return value from read.csv is a data.frame!

[15]:
class(world)
'data.frame'
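
If you don't have the file handy, you can still try read.csv() by handing it the contents of a csv directly through its `text` argument. The sketch below uses three rows and three columns typed in from world-small.csv; the name `toy_world` is just for illustration.

```r
# The raw contents of a tiny csv file: a header row, then one line per observation
csv_text <- "country,region,gdppcap08
Albania,C&E Europe,7715
Algeria,Africa,8033
Angola,Africa,5899"

# The text argument lets read.csv() parse a string instead of opening a file
toy_world <- read.csv(text = csv_text)
toy_world

# The result is a data.frame, and read.csv() has guessed a type for each column
class(toy_world)
sapply(toy_world, class)
```

Notice that R worked out on its own that gdppcap08 holds whole numbers while the other two columns hold text.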

Using some of the tools we saw above, let’s take a quick look at our dataset!

[16]:
# Number of observations:
nrow(world)
145
[17]:
# Number of variables
ncol(world)
4
[18]:
# Let's look at the first few rows
head(world)
A data.frame: 6 × 4
   country    region        gdppcap08  polityIV
   <chr>      <chr>         <int>      <dbl>
1  Albania    C&E Europe     7715      17.8
2  Algeria    Africa         8033      10.0
3  Angola     Africa         5899       8.0
4  Argentina  S. America    14333      18.0
5  Armenia    C&E Europe     6070      15.0
6  Australia  Asia-Pacific  35677      20.0
[19]:
# And get a sense of the variables in the data
summary(world)
   country             region            gdppcap08        polityIV
 Length:145         Length:145         Min.   :  188   Min.   : 0.000
 Class :character   Class :character   1st Qu.: 2153   1st Qu.: 7.667
 Mode  :character   Mode  :character   Median : 7271   Median :16.000
                                       Mean   :13252   Mean   :13.408
                                       3rd Qu.:19330   3rd Qu.:19.000
                                       Max.   :85868   Max.   :20.000

Next Steps

Now that we know how to create or load dataframes, we can turn to learning how to work with them!