Loop Exercise

Exercise 1

Create a dataframe from the “Datasaurus.tsv” file with the following code:

data <- read.csv(
    paste0(
        "https://raw.githubusercontent.com/",
        "nickeubank/computational_methods_boot_camp/",
        "main/source/data/Datasaurus.tsv"
    ),
    sep = "\t"
)

Datasaurus.tsv is a tab-separated-value file, so it’s like a CSV, but it uses tabs rather than commas to separate columns. We can read it with read.csv, but we have to tell read.csv that the column separator is a tab by passing the argument sep="\t". Take a look at the first few rows:

head(data)

A data.frame: 6 x 26

   example1_x  example1_y  example2_x  example2_y  example3_x  example3_y  example4_x  example4_y  example5_x  example5_y
        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1    32.33111    61.41110    51.20389    83.33978    55.99303    79.27726     55.3846     97.1795    51.14792    90.86741
2    53.42146    26.18688    58.97447    85.49982    50.03225    79.01307     51.5385     96.0256    50.51713    89.10239
3    63.92020    30.83219    51.87207    85.82974    51.28846    82.43594     46.1538     94.4872    50.20748    85.46005
4    70.28951    82.53365    48.17993    85.04512    51.17054    79.16529     42.8205     91.4103    50.06948    83.05767
5    34.11883    45.73455    41.68320    84.01794    44.37791    78.16463     40.7692     88.3333    50.56285    82.93782
6    67.67072    37.11095    37.89042    82.56749    45.01027    77.88086     38.7179     84.8718    50.28853    82.97525

... (middle columns omitted)

   example9_x  example9_y example10_x example10_y example11_x example11_y example12_x example12_y example13_x example13_y
        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1    47.69520    95.24119    58.21361    91.88189    50.48151    93.22270    65.81554    95.58837    38.33776    92.47272
2    44.60998    93.07584    58.19605    92.21499    50.28241    97.60998    65.67227    91.93340    35.75187    94.11677
3    43.85638    94.08587    58.71823    90.31053    50.18670    99.69468    39.00272    92.26184    32.76722    88.51829
4    41.57893    90.30357    57.27837    89.90761    50.32691    90.02205    37.79530    93.53246    33.72961    88.62227
5    49.17742    96.61053    58.08202    92.00815    50.45621    89.98741    35.51390    89.59919    37.23825    83.72493
6    42.65225    90.56064    57.48945    88.08529    30.46485    82.08923    39.21945    83.54348    36.02720    82.04078

Exercise 2

This dataset actually contains 13 separate example datasets, each with two variables named example[number]_x and example[number]_y.

To get a better sense of what these datasets look like, our first goal will be to write a loop that iterates over each example dataset (numbered 1 to 13) and prints out the mean and standard deviation of example[number]_x and example[number]_y, plus the correlation between the two variables, for each dataset.

For example, the first iteration of this loop might return something like (note I’m also rounding my answers for readability):

Example Dataset 1:
Mean x: 23.123,
Mean y: 98.242,
Std Dev x: 21.247,
Std Dev y: 32.243,
Correlation: -0.742

(Though you shouldn’t get those specific values)

But as we discussed in our reading, we want to approach this in stages. So first, write out the skeleton of your loop. Don’t put anything in it except to print out the value of the variable that is iterating in the loop for now.

Note: the function in R for the correlation between two variables may not be what you think it is…

Exercise 3

Now sketch out what the inside of your loop will look like. As discussed in the readings, begin by creating a variable that takes on the first value you’ll loop over, then, using that variable, write code that does what you want your first pass of the loop to do.

You’ll probably need to make use of print(), paste0(), and round().

Round your answers to three decimal places. Note these numbers should look very similar across datasets.
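
As a rough illustration of how print(), paste0(), and round() fit together, here is a minimal sketch on a made-up data frame. The data frame toy, its column names, and its values are all hypothetical and have nothing to do with the Datasaurus data; the point is only to show one way of building a column name from a number and reporting a rounded result.

```r
# Hypothetical toy data; the column names are invented for this illustration.
toy <- data.frame(
    group1_height = c(1.5, 2.25, 3.125),
    group1_weight = c(10, 20, 35)
)

# Start with a variable holding the first value a loop would take on.
i <- 1

# paste0() glues its arguments together with no separator.
height_name <- paste0("group", i, "_height")

# [[ ]] pulls out a column using its name stored as a string.
height_mean <- mean(toy[[height_name]])

# round(x, 3) keeps three decimal places; print() displays the result.
summary_line <- paste0("Mean height for group ", i, ": ", round(height_mean, 3))
print(summary_line)
```

Once code like this works for a single value of i, it can be moved inside a loop without changes.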

Exercise 4

Now put them together! Can you print out these summary statistics for all 13 pairs of variables?

Exercise 5

Based only on these results, discuss with your partner what you might conclude about these example datasets. Write down your thoughts.

Exercise 6

Now we’re going to write a loop that plots each of these datasets as a scatter plot. To do this, we’ll be using ggplot2, a library from the tidyverse that is unequivocally the best tool for plotting in R.

We aren’t going to do a full lesson on ggplot because you’re sure to encounter it in the future, but the basic syntax to plot the first dataset would look like this:

library(ggplot2)

# First is the dataset, then within `aes()` you specify
# which column from the data is the x-axis and which column is the y-axis,
# then `+ geom_point()` tells ggplot to plot a point for each row of the data.

ggplot(data, aes(x = .data[["example1_x"]], y = .data[["example1_y"]])) +
    geom_point()

[Figure: scatter plot of the first dataset, with example1_x on the x-axis and example1_y on the y-axis]

Now build a loop that plots a scatter plot of all 13 of these data pairs. As before, approach it first by writing a skeleton loop that specifies how the loop will work, then write the inside separately, then integrate the two.

Note: if you’ve used ggplot before, notice that we’re passing the name of each variable ("example1_x") as a string inside .data[[]]. This sidesteps the tidyverse’s unusual evaluation rules, so things behave as they normally do in R: text in double quotes is treated as a string, and text without quotes is treated as a variable.

Note: your ggplot may only show up if you wrap it in a print() when it’s inside your loop!
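
To illustrate the print() point without giving away the exercise, here is a small sketch that uses mtcars, a dataset that ships with R, instead of the Datasaurus data. The choice of mtcars and of the columns wt, mpg, and hp is an assumption made purely for illustration.

```r
library(ggplot2)

# mtcars ships with R; these columns are stand-ins for the exercise data.
y_columns <- c("mpg", "hp")

for (y_col in y_columns) {
    # The column name stored in y_col is passed as a string via .data[[ ]].
    p <- ggplot(mtcars, aes(x = .data[["wt"]], y = .data[[y_col]])) +
        geom_point()

    # At the top level R prints a plot automatically, but inside a loop
    # the plot is only drawn when it is printed explicitly.
    print(p)
}
```

Without the print(p) line, the loop would run without error but typically would not display any plots.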

Exercise 7

Review your plots. How does your impression of these datasets differ from what you wrote down in Exercise 5?

Solutions

Here’s the deal with programming: the only way to learn to program is to wrestle with solving your own problems. The best way to learn is to do so actively – if you just look at answers as soon as you get stuck, your process is more passive. You may feel like you’re learning more, but research shows that that’s an illusion – students who learn passively think they’re learning more than they are (a summary of work in this area is here).

Moreover, in coding in particular, the process of debugging your code is a critical skill in and of itself, and the only way you will learn to do it is by literally spending hours working through your own problems. And once you’ve seen an answer, there’s no way to unsee it, so proceed only if you absolutely have to.

OK, if you still want to proceed, you can find solutions here.