Intro to Vectors
Now that we have a solid grasp of how R interprets code, how to assign values to variables, and some of the main types of data in R, we can turn to the fundamental building block of R: vectors.
Vectors are data structures designed to store collections of individual measurements. For example, a vector can be used to hold the heights of everyone in a classroom, or a series of measurements of one’s heart rate taken over time.
In this reading, we’ll learn about how we can use vectors to manage data about things like age, height, eye color, GDP per capita, and war initiation. By the end of this tutorial, you’ll know how to create vectors, how to subset them, how to modify them, and how to summarize them.
Why Do We Need Vectors?
It is rarely the case in science that we work with singular values (e.g. the number 7, or a person’s name "Jill"). Most of the time, we’re working with a collection of observations, such as the ages of everyone who responded to a survey, or the GDP of all the countries in the world. Indeed, one can argue that insofar as social science is the study of empirical regularities (patterns that are consistently present throughout social contexts), it is only through the study of multiple measurements that we can engage in the search for regularities, and thus do social science!
To accommodate this need, one of the objects you’ll use most in R is a vector. A vector is a collection of values, all of the same data type. For example, we might have a numeric vector full of the ages of all survey respondents, or a character vector full of the names of countries. Because every entry shares one type, we say that vectors are homogeneously typed.
In fact… OK, it’s time to come clean: you’ve been working with vectors this whole time! Vectors are so fundamental to R that all data in R is stored as vectors. Even something simple like the number 7, in R, is actually stored as a numeric vector of length 1:
[1]:
a <- 7
length(a)
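As a quick check of this claim, here is a minimal sketch that asks R about a few single values. The variable names are made up for illustration.

```r
# Single values of each basic type are all vectors of length 1.
a_number <- 7
a_name <- "Jill"
a_flag <- TRUE

is.vector(a_number)  # TRUE
length(a_name)       # 1: one entry, even though the name has four letters
length(a_flag)       # 1
```

Note that length() counts entries in the vector, not characters in a string.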
Creating vectors
In our previous exercises, you learned how to create single-entry vectors using the assignment operator (e.g. a <- 6).
But usually we want vectors with more than one entry (otherwise, why have vectors?). There are a number of ways to create vectors in R, but the most fundamental is by using the c() function to concatenate data into a vector:
[2]:
# Numeric vectors
a_numeric_vector <- c(20, 25, 60, 55)
a_numeric_vector
- 20
- 25
- 60
- 55
[3]:
# Character vectors
a_character_vector <- c("Red", "Green", "Purple")
a_character_vector
- 'Red'
- 'Green'
- 'Purple'
[4]:
# Logical vectors
a_logical_vector <- c(TRUE, FALSE, TRUE)
c() doesn’t just work with values you write by hand, though. You can also use c() with variables to combine existing vectors into longer ones:
[5]:
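a <- c(1, 2, 3)
b <- c(4, 5, 6)
# Named `combined` rather than `c` so the variable doesn't share a name with the c() function
combined <- c(a, b)
combined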
- 1
- 2
- 3
- 4
- 5
- 6
There are also a lot of other convenience functions for creating commonly used vectors. For example, it’s very common to want a vector of sequential numbers, so if you type 1:20 you get a vector of all of the counting numbers from 1 to 20:
[6]:
1:20
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
And if you want a more unusual sequence, you can use seq(), which takes a starting point, an ending point, and a step size to create any sequence you may want. Here are all the even numbers from 2 to 20:
[7]:
seq(2, 20, 2)
- 2
- 4
- 6
- 8
- 10
- 12
- 14
- 16
- 18
- 20
(This is the first time we’ve seen a function in practice that takes multiple arguments. Recall that these work just like functions that take only one argument, except that the arguments are separated by commas.)
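To make the multiple-argument call less cryptic, the sketch below writes the same seq() call with its arguments named. The names from, to, by, and length.out are real arguments of seq(); the variable names are made up for illustration.

```r
# Positional arguments: starting point, ending point, step size.
evens_positional <- seq(2, 20, 2)

# The same call with each argument named explicitly.
evens_named <- seq(from = 2, to = 20, by = 2)

identical(evens_positional, evens_named)  # TRUE

# Instead of a step size, you can ask for a fixed number of evenly spaced values.
quarter_steps <- seq(from = 0, to = 1, length.out = 5)
quarter_steps  # 0.00 0.25 0.50 0.75 1.00
```

Naming the arguments is optional here, but it can make a call easier to read later.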
And if you want a vector of length N with the same value repeated over and over, you can use rep():
[8]:
# Create a vector with the value 42
# repeated 10 times.
rep(42, 10)
- 42
- 42
- 42
- 42
- 42
- 42
- 42
- 42
- 42
- 42
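rep() is not limited to repeating a single value. As a small sketch (the colors vector is made up for illustration), its times and each arguments control whether a whole vector is repeated or each entry is repeated in turn:

```r
colors <- c("Red", "Green")

# Repeat the whole vector three times.
repeated_whole <- rep(colors, times = 3)
repeated_whole  # "Red" "Green" "Red" "Green" "Red" "Green"

# Repeat each entry three times before moving to the next.
repeated_each <- rep(colors, each = 3)
repeated_each   # "Red" "Red" "Red" "Green" "Green" "Green"
```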
Vector Math
One of the great things about vectors is that we can do all sorts of mathematical operations to vectors efficiently.
If you do math with two vectors and one of them has length 1, the operation is applied to every entry of the longer vector.
[9]:
# Here's what we'll start with
numbers <- 1:10
numbers
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
[10]:
# You can modify all values in a vector
# by doing math with a vector of length 1
numbers / 10
- 0.1
- 0.2
- 0.3
- 0.4
- 0.5
- 0.6
- 0.7
- 0.8
- 0.9
- 1
[11]:
numbers + 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
The same thing happens with mathematical functions – the function gets applied to each entry:
[12]:
# Modify a vector using a function
sqrt(numbers) #square root
- 1
- 1.4142135623731
- 1.73205080756888
- 2
- 2.23606797749979
- 2.44948974278318
- 2.64575131106459
- 2.82842712474619
- 3
- 3.16227766016838
[13]:
exp(numbers) #exponentiate
- 2.71828182845905
- 7.38905609893065
- 20.0855369231877
- 54.5981500331442
- 148.413159102577
- 403.428793492735
- 1096.63315842846
- 2980.95798704173
- 8103.08392757538
- 22026.4657948067
If you have two vectors of the same length, mathematical operations will occur “element-wise”, meaning the mathematical operation will be applied to the two 1st entries, then the two 2nd entries, then the two 3rd entries, etc. For example, if we were to add our vector of the values 1 through 10 to a vector with five 0s, then five 1s, R would do the following:
 1 + 0 =  1
 2 + 0 =  2
 3 + 0 =  3
 4 + 0 =  4
 5 + 0 =  5
 6 + 1 =  7
 7 + 1 =  8
 8 + 1 =  9
 9 + 1 = 10
10 + 1 = 11
(R likes to print out vectors sideways, but personally I think of them as column vectors, so I have written them out like that here.)
[14]:
# Two vectors with the same number of elements
numbers2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
numbers3 <- numbers2 + numbers
numbers3
- 1
- 2
- 3
- 4
- 5
- 7
- 8
- 9
- 10
- 11
Vector arithmetic can also be carried out in R on two multi-value vectors of different lengths using the recycling rule, under which R repeats the shorter vector until it matches the length of the longer one. But… that’s a thing you probably don’t want to do anyway. That’s a weird behavior! :)
I’d stick to combining your vector with either a length-1 vector or another vector of the same length to avoid confusion.
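If you are curious what recycling actually does, here is a minimal sketch with made-up vectors. It is meant to show why the behavior is easy to trip over, not to recommend it.

```r
long_vector <- c(1, 2, 3, 4)
short_vector <- c(10, 20)

# The shorter vector is reused as 10, 20, 10, 20.
recycled <- long_vector + short_vector
recycled  # 11 22 13 24

# When the longer length is not a multiple of the shorter one, R still
# recycles (10, 20, 10) but normally prints a warning. The warning is
# suppressed here only so the example runs quietly.
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
uneven  # 11 22 13
```

Notice that the first case runs without any warning at all, which is how a mismatch in lengths can slip by unnoticed.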
Summarizing vectors
We often want to get summary statistics from a vector, that is, to learn something general about it by looking beyond its constituent elements. If we have a vector in which each element represents a person’s height, for example, we may want to know who the shortest or tallest person is, what the median or mean height is, or what the standard deviation is.
So for example, we can use summary(numbers) to get a lot of summary stats at once:
[42]:
summary(numbers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.25    5.50    5.50    7.75   10.00
Or we can use any one of a handful of other helper functions!
class(numbers) # check the class
length(numbers) # number of elements
max(numbers) # maximum value
min(numbers) # minimum value
sum(numbers) # sum of all values in the vector
mean(numbers) # mean
median(numbers) # median
var(numbers) # variance
sd(numbers) # standard deviation
quantile(numbers) # percentiles in intervals of 0.25
quantile(numbers, probs = seq(0, 1, 0.1)) # percentiles in intervals of 0.1
Don’t worry about memorizing these or anything. Basically, you just need to have a sense of the kinds of things you can do with functions, and if you ever need one and can’t remember its name, you can google it to get the specific function name.
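To see how a few of these helpers relate to one another, here is a small sketch using a made-up vector of heights (the values and the name heights_cm are invented for illustration):

```r
heights_cm <- c(150, 160, 165, 170, 180)  # made-up data

mean(heights_cm)                      # 165
sum(heights_cm) / length(heights_cm)  # 165, the mean computed by hand
median(heights_cm)                    # 165
max(heights_cm) - min(heights_cm)     # 30, the spread between tallest and shortest

# The standard deviation is the square root of the variance.
all.equal(sd(heights_cm), sqrt(var(heights_cm)))  # TRUE
```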
Type Promotion
There’s one last lesson that’s worth learning about vectors, because it can get you in trouble.
As noted above, vectors can only contain one type of data. If you try to use c() to combine vectors of different types, R will try to be clever and find a way to combine them by doing something called “Type Promotion”, which is a way of converting all the data you give it to the same type. For example, if I tried to create a vector by combining a character vector and a numeric vector, R would convert the numeric vector to a character vector so all the data could fit in a single character vector:
[13]:
a <- c("Nick", 42)
a
- 'Nick'
- '42'
Why did R convert 42 to "42" and not convert "Nick" to a numeric type? Well, because "Nick" can’t be represented as a numeric type in any meaningful sense, while any number (like 42) can always be represented as a character in a meaningful way.
Indeed, there’s a hierarchy of data types, where a type lower on the hierarchy can always be converted into something higher in the order, but not the other way around. That hierarchy is:
logical -> numeric -> character
When R is asked to combine vectors of different data types, it will try to move things up this hierarchy by the smallest amount possible in order to make everything the same type.
(Note there are individual cases that can move backwards: the character "5" could logically be turned into 5, but you can’t always convert a character to a numeric, so for consistency R only moves in directions that are always possible.)
For example, if you combine logical and numeric vectors, R will convert all of the data into numeric (remember from our previous lesson that R thinks of TRUE as being like 1, and FALSE as being like 0).
[14]:
c(1, 2, TRUE)
- 1
- 2
- 1
But it doesn’t convert that data into a character vector (even though it could!) because it’s trying to make the smallest movement up that hierarchy that it can. If we try to combine logical, numeric, and character vectors, however, R is forced to convert everything into a character vector:
[16]:
c(TRUE, 42, "Julio")
- 'TRUE'
- '42'
- 'Julio'
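One way to confirm where a combined vector landed on the hierarchy is to ask for its class. The sketch below does that for the combinations discussed above, and also shows that moving back down the hierarchy only happens when you request it with a conversion function such as as.numeric(). The variable names are made up for illustration.

```r
class(c(TRUE, FALSE))        # "logical"
class(c(1, 2, TRUE))         # "numeric"
class(c(TRUE, 42, "Julio"))  # "character"

# Moving down the hierarchy has to be requested explicitly.
five <- as.numeric("5")
five  # 5

# A string that is not a number becomes NA. R normally prints a warning
# here; it is suppressed only so the example runs quietly.
not_a_number <- suppressWarnings(as.numeric("Nick"))
not_a_number  # NA
```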
Recap
- Vectors are one of the fundamental building blocks of R. Even a number like 7 is just a vector of length 1.
- Vectors are collections of data of the same type.
- Vectors can be created with the c() function.
- You can easily do math between any vector and a vector of length 1, or a vector of the same length.
- You can do math between vectors of different lengths that aren’t length 1, but… the way it works is weird, so don’t.
- If data of different types are passed to the c() function, it will type promote them to the lowest type that can store all the input types.
Next Steps
Now that we’re familiar with vectors, it’s time to learn to manipulate them!