Function Example
Functions are nice tools, both for creating generalizable code and for organizing your work. Because a function is a bundle of code that does one thing, putting part of your work into a function helps make clear what that chunk of code is for.
To illustrate one use of a function, we’ll write a function that reads and manipulates a .csv file. We can then put this in a for loop to iterate over several files with a similar structure and combine the resulting data frames into one data frame.
As you’ll see, one could cram all the code we’re going to write directly into the for loop at the end, but breaking part of it out into a function splits the problem into smaller, more manageable pieces.
What follows is an example of how you might use functions in a real data science workflow. The goal is not for you to be comfortable with everything you see (for example, the code uses a couple of tricks for working with dates), but rather just to see what this kind of workflow looks like!
Reading several files
Begin by downloading a .zip file with service request data from NYC (in the code below, we read the individual .csv files directly from URLs, so no download is strictly required). The zip file contains six files for the years 2004-2009, each with 10,000 observations. The data are originally from NYC’s Open Data portal, which hosts datasets with millions of service requests filed by residents through the city’s 311 program. For the purpose of this example, I have taken a random sample of 10,000 requests for each year.
Here’s what the 2004 file looks like (the other years have the same structure).
[1]:
url2004 <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-2004-sample.csv"
nyc04 <- read.csv(url2004)
head(nyc04)
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location |
|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | |
The variables in the data are as follows:
- `Unique.Key`: An id number unique to each request.
- `Created.Date`: The date the request was filed in the 311 system.
- `Closed.Date`: The date the request was resolved by city workers (`NA` implies that it was never resolved).
- `Complaint.Type`: The subject of the complaint.
- `Location`: Coordinates that give the location of the service issue.
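Before cleaning anything, it can help to poke at the raw columns. Here is a quick sketch (using the nyc04 data frame read in above) of the most common complaint types, plus a check that the date columns are currently stored as plain text:

# Most frequent complaint types in the 2004 sample
head(sort(table(nyc04$Complaint.Type), decreasing = TRUE))
# The date columns are read in as character strings, not dates
class(nyc04$Created.Date)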
Our goal with the function is to read the file and clean it. In particular, we want to convert the `Created.Date` and `Closed.Date` variables so that R recognizes them as dates. From these variables, we can then calculate measures of government responsiveness: (1) how many days it took city workers to resolve a request, and (2) whether or not a request was resolved within a week.
[2]:
library(lubridate) # to work with dates
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
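Before bundling everything into a function, here is a minimal sketch of the date logic we will rely on, applied to a made-up pair of dates (it uses `mdy()` from the lubridate package loaded above):

# substring(..., 1, 10) keeps just the "month/day/year" part of the text,
# mdy() parses it into a Date, and difftime() measures the gap in days.
opened <- mdy(substring("01/23/2004 12:00 AM", 1, 10))
closed <- mdy(substring("02/02/2004 12:00 AM", 1, 10))
as.numeric(difftime(closed, opened, units = "days"))  # 10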
[3]:
# Create a function that reads and cleans a service request file.
# The input is the name of a service request file and the
# output is a data frame with cleaned variables.
clean_dta <- function(file_name) {
# Read the file and save it to an object called 'dta'
dta <- read.csv(file_name)
# Clean the dates in the dta file and generate responsiveness measures
# mdy(substring(dta$Created.Date, 1, 10)) pulls just the month-day-year
# from our columns with dates, then `mdy` tells R to read it as a date
# in month-day-year format.
dta$opened <- mdy(substring(dta$Created.Date, 1, 10))
dta$closed <- mdy(substring(dta$Closed.Date, 1, 10))
    # Number of days between when an issue is opened and when it is resolved.
dta$resptime <- as.numeric(difftime(dta$closed, dta$opened, units = "days"))
    # Negative response times indicate bad data, so set them (and any
    # missing values) to NA.
    dta[dta$resptime < 0 | is.na(dta$resptime), "resptime"] <- NA
    # Create an indicator of whether the request was solved within 7 days.
    dta$solvedin7 <- as.numeric(dta$resptime <= 7)
# Return the cleaned data
return(dta)
}
Let’s test the function on the 2004 data:
[4]:
# Execute function on the 2004 data
nyc04 <- clean_dta(url2004)
head(nyc04)
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;date&gt; | &lt;date&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) | 2004-01-23 | 2004-02-02 | 10 | 0 |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) | 2004-06-04 | 2004-06-09 | 5 | 1 |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) | 2004-08-04 | 2004-08-06 | 2 | 1 |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) | 2004-11-26 | 2004-12-10 | 14 | 0 |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) | 2004-08-22 | 2004-08-22 | 0 | 1 |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | | 2004-12-02 | 2004-12-15 | 13 | 0 |
The cleaned dataset has four new variables:
- `opened`: The date the request was filed, in date format.
- `closed`: The date the request was resolved, in date format.
- `resptime`: The number of days it took to resolve the request (`closed` - `opened`).
- `solvedin7`: A dummy variable equal to 1 if the request was solved within a week and 0 otherwise.
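With these in hand, a quick sanity check on the cleaned 2004 data might look something like this (a sketch; the exact numbers depend on the sample):

# Distribution of response times and the share of requests solved within a week
summary(nyc04$resptime)
mean(nyc04$solvedin7, na.rm = TRUE)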
We can now use this function on all six files, either with a for loop or with something called `lapply()`. We'll use a for loop here, with a sketch of the `lapply()` approach after it.
[5]:
# Loop over the files and stack the cleaned data frames together
url_stem <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-"
url_suffix <- "-sample.csv"
# Get first one so we can append others to the bottom
nyc_all <- clean_dta(paste0(url_stem, 2004, url_suffix))
for (year in 2005:2009) {
url <- paste0(url_stem, year, url_suffix)
new_data <- clean_dta(url)
nyc_all <- rbind(nyc_all, new_data)
}
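For comparison, the `lapply()` version mentioned above might look like this (a sketch that reuses `url_stem`, `url_suffix`, and `clean_dta()` from above):

# Build all six URLs, clean each file, then stack the resulting
# data frames with do.call(rbind, ...)
urls <- paste0(url_stem, 2004:2009, url_suffix)
nyc_all <- do.call(rbind, lapply(urls, clean_dta))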
[6]:
# 10 random rows
nyc_all[sample(nrow(nyc_all), 10), ]
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;date&gt; | &lt;date&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 30235 | 9119502 | 08/16/2007 12:00:00 AM | 08/18/2007 12:00:00 AM | Industrial Waste | (40.6924338169897, -73.76791481654355) | 2007-08-16 | 2007-08-18 | 2 | 1 |
| 35420 | 10017391 | 12/07/2007 12:00:00 AM | 12/13/2007 12:00:00 AM | HEATING | (40.691388690509996, -73.93827993449325) | 2007-12-07 | 2007-12-13 | 6 | 1 |
| 57117 | 14167032 | 06/12/2009 12:00:00 AM | 07/09/2009 12:00:00 AM | GENERAL CONSTRUCTION | (40.59682173951916, -73.74713511250768) | 2009-06-12 | 2009-07-09 | 27 | 0 |
| 21730 | 7708695 | 10/22/2006 12:00 AM | 10/23/2006 12:00 AM | HEATING | (40.68886730172827, -73.82833980080076) | 2006-10-22 | 2006-10-23 | 1 | 1 |
| 15500 | 8649578 | 12/04/2005 12:00 AM | 01/05/2007 12:00 AM | GENERAL CONSTRUCTION | (40.82673859886853, -73.94034406281531) | 2005-12-04 | 2007-01-05 | 397 | 0 |
| 12478 | 347498 | 08/18/2005 12:00 AM | 08/18/2005 12:00 AM | Derelict Vehicle | (40.614853534290404, -73.91703853432448) | 2005-08-18 | 2005-08-18 | 0 | 1 |
| 32767 | 8706092 | 01/31/2007 12:00:00 AM | 02/08/2007 12:00:00 AM | HEATING | (40.83353655524185, -73.91701881580157) | 2007-01-31 | 2007-02-08 | 8 | 0 |
| 39613 | 3834490 | 03/23/2007 12:00:00 AM | | Rodent | (40.825096598119764, -73.91334384257658) | 2007-03-23 | NA | NA | NA |
| 57025 | 14331411 | 07/06/2009 12:00:00 AM | 07/29/2009 12:00:00 AM | Air Quality | (40.825431183120024, -73.89044261279176) | 2009-07-06 | 2009-07-29 | 23 | 0 |
| 3339 | 7550682 | 05/29/2004 12:00 AM | | PAINT - PLASTER | (40.63718891788018, -73.90757435445164) | 2004-05-29 | NA | NA | NA |
Ta-da! We cleaned a whole bunch of different datasets and combined them all into one data frame in only a handful of lines of code!
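And because everything now lives in one data frame, follow-up questions are easy to ask. For example, here is a sketch of comparing responsiveness across years, using lubridate's `year()` on the `opened` column:

# Share of requests resolved within a week, by the year the request was opened
aggregate(solvedin7 ~ year(opened), data = nyc_all, FUN = mean)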