Arrays¶
Note: this reading is kind of a “bonus” reading: the contents of this reading aren’t something you’re likely to use in your first year, but as noted in the course introduction, part of the goal of this bootcamp is to set you up with a strong understanding of the fundamentals of computational data science, and as part of that, I want to make you familiar with this set of concepts. So please read this with a spirit of curiosity and don’t worry if you don’t end up getting all of it!
As we saw in our last reading, generalizing the way we subset vectors to subsetting matrices was quite easy – we just subset our matrices by doing what we did with vectors, but with two terms in our square brackets separated by a comma (e.g. [1, 1]
instead of just [1]
).
But guess what? Just as it was easy to generalize from one dimension to two, it turns out that we can also generalize from two dimensions to N dimensions the same way! In fact, rather than thinking of vectors and matrices as to different thing, we can think of them as special cases (the case of N=1 and N=2) of a more general data structure: Arrays.
Arrays are collections of data of the same type with a regular structure organized into N dimensions. When N=1, an array is the equivalent of a vector:
[1]:
array(1:9)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
And when N=2, an array is just a matrix:
[2]:
array(1:6, dim = c(2, 3))
1 | 3 | 5 |
2 | 4 | 6 |
But when N > 2, things start to get a little harder to visualize, but also a lot more powerful.
But WHY, I hear you asking, would I ever want more than two dimensions?!
While a large portion of data science is done with two-dimensional data—since it can be used to represent both tabular data and also images—we live in a world with three spatial dimensions. As a result, many data science applications require data about three-dimensional objects, otherwise known as volumetric data.
To illustrate, consider the case of an MRI. While we’ve all seen two-dimensional MRI images, but a real MRI scan consists of a series of these two-dimensional slices stacked on top of one another to form a three-dimensional picture of patients. And this data can very naturally be represented in a three-dimensional array!
To illustrate, consider the following MRI scan which has been saved as a .npy
file (a format created by numpy for storing arrays of arbitrary shape):
[3]:
library(reticulate)
np <- import("numpy")
mri <- np$load("data/mri_neck_vertical_slices_rorientation.npy")
Warning message:
"package 'reticulate' was built under R version 4.1.2"
As we can see, this MRI is represented by a three-dimensional array. More specifically, it consists of 15 stacked images, each of which is 512 x 512 pixels.
[4]:
class(mri)
[5]:
dim(mri)
- 15
- 512
- 512
And since this is just a set of stacked images, we can easily pull out one of these slices and visualize it:
[6]:
image(mri[8, , ], col = grey(seq(0, 1, length = 256)))
Voila! Here we can clearly see a patient’s neck from a side view with the spinal cord running up through to the base of the patient’s brain in the top right, and the patients front teeth on the left.
This is the 8th slice of 15, so it’s showing us the approximate middle of the patients neck. But if we wanted to move outward from the center of the patient’s body, we could pick a slice closer to 1 or 15:
[7]:
image(mri[15, , ], col = grey(seq(0, 1, length = 256)))
Here, as we move out to the side of the patient’s neck, we can see the MRI catching their side teeth in the bottom left, as well as more of the patient’s neck muscles.
Finally, because this data is three-dimensional, we also have the option of cutting our own slices along a different axis. In the images above, we’ve specified a specific value for the first dimension of our array and plotted the second and third dimensions along the x and y-axes of our image. This has given us vertical slices of the patient’s neck.
But if, instead, we fix a value of the second dimension, we can actually get a horizontal slice of the patients neck. Now, because we only have 15 slices along the first dimension of our array, the resolution along this axis is not nearly as good (a full MRI study includes taking detailed slices along all axes, and CT scans can be used to generate volumetric data with consistent resolution along all axes), but it does illustrate how working with volumetric data allows us to really explore three-dimensional objects:
[8]:
image(mri[, , 200],
col = grey(seq(0, 1, length = 256)),
asp = 4, xlim = c(0, 1))
Here we can see the patient’s wind-pipe as a dark circle on the left side of the image, and a cross-section of their spine in the middle. Or if we wanted to move up a little ways:
[9]:
image(mri[, , 400],
col = grey(seq(0, 1, length = 256)),
asp = 4, xlim = c(0, 1))
We can see the patients upper jaw on the left, the top of their spine in the middle, and the lower portion of their brain (their cerebellum) on the right.
If that feels a little abstract, consider this illustration of what we just did. We start with our image that shows the patient’s full skull and neck from the side (mri[8, , ]
), then overlay it with the horizontal slices we pulled above:
Of course, scientists don’t just use this type of volumetric data to organize a collection of two-dimensional images—they also use it to analyze inherently three-dimensional spatial patterns in data, such as the paths that blood vessels or nerves follow through the body, or spatial correlations in different tissues properties.
Other Uses for Arrays¶
As this example hopefully illustrates, the fact that arrays can organize data along an arbitrary number of dimensions is a really powerful idea. Here are just a few other applications for their use!
Color Images: In our last reading, we saw that black and white images can be thought of as matrices, as each cell can contain the greyscale value for a given pixel. When working with color images, however, we often work with three-dimensional arrays, where the first two dimensions are the x-y coordinates of each pixel, and the third dimension is used to differentiate between the value for red in a pixel, the value of green in a pixel, the value of blue in a pixel, and the luminance (brightness) of the pixel.
Repeated measurements over time: Just as a single survey is easily represented as a matrix, so too can repeated surveys be easily represented by making time a third dimension. This makes it easy to pull out a single wave of the survey, or to pick out all the responses for a given person over time.
Brain Scans: fMRI scan the entire volume of the brain, and that volumetric data is most naturally stored in the three-dimensional array, just like the temperature data above. Obviously not all social scientists will end up working with brain scans, but there’s certainly a lot of cutting-edge work in this area!
3D measurements that evolve over time: just as we can model survey data that evolves over time in three dimensions, we can also model three-dimensional volumetric data that evolves over time (e.g. not just a slice of a climate model, but its evoluation) in a four-dimensional array!
Satellite Data: Satellite image data usually comes in the form of sets of 2-dimensional images, where each image includes information about light intensity at a given wavelength. When these images are stacked to, say, generate a color image, or identify wavelength combinations common to certain types of pollution, flood waters, or specific crops, you get a three-dimensional array.
So yeah, high dimensional arrays are a SUPER powerful data structure, and one you shouldn’t shy away.
And crucially, they aren’t that special! All the tools we’ve already learned can allow us to do these powerful manipulations!