Function Example

Functions are nice tools, both for creating generalizable code and for organizing your work. Because a function is a bundle of code that does one thing, putting parts of your work into functions can help define the goal of each chunk of code.

To illustrate one use of a function, we’ll write a function that reads and manipulates a .csv file. We can then put this in a for loop to iterate over several files with a similar structure and combine the resulting data frames into one data frame.

As you’ll see, one could cram all the code we’re going to write directly into the for loop at the end, but by breaking part of it out into a function, the problem is more easily divided into smaller pieces.

What follows is an example of how you might use functions in a real data science workflow. The goal is not for you to be comfortable with everything that you see – for example, the code uses a couple of tricks for manipulating data on dates – but rather just to get to see what some of this stuff looks like!

Reading several files

Begin by downloading a .zip file with service request data from NYC. The zip file contains six files for years 2004-2009, each with 10,000 observations. The data are originally from NYC’s Open Data portal, which hosts datasets with millions of service requests filed by residents through the city’s 311 program. For the purpose of this example, I have taken a random sample of 10,000 for each year.

Here’s what the 2004 file looks like (the other years have the same structure).

[1]:
url2004 <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-2004-sample.csv"
nyc04 <- read.csv(url2004)
head(nyc04)
A data.frame: 6 × 5

      Unique.Key Created.Date        Closed.Date         Complaint.Type                Location
      <int>      <chr>               <chr>               <chr>                         <chr>
    1 4735434    01/23/2004 12:00 AM 02/02/2004 12:00 AM Boilers                       (40.71511134258793, -73.98998982667266)
    2 7547062    06/04/2004 12:00 AM 06/09/2004 12:00 AM HEATING                       (40.871781348425515, -73.88238262118011)
    3 5050661    08/04/2004 12:00 AM 08/06/2004 12:00 AM General Construction/Plumbing (40.59418801428136, -73.80082145383885)
    4 7281795    11/26/2004 12:00 AM 12/10/2004 12:00 AM PLUMBING                      (40.85911979460089, -73.90605127158484)
    5 1443894    08/22/2004 12:00 AM 08/22/2004 12:00 AM Noise - Street/Sidewalk       (40.54800892371052, -74.17041676351323)
    6 3244577    12/02/2004 12:00 AM 12/15/2004 12:00 AM Noise

The variables in the data are as follows:

  • Unique.Key: An id number unique to each request.

  • Created.Date: The date the request was filed in the 311 system.

  • Closed.Date: The date the request was resolved by city workers (NA implies that it was never resolved).

  • Complaint.Type: The subject of the complaint.

  • Location: Coordinates that give the location of the service issue.

Our goal with the function is to read the file and clean it. In particular, we want to convert the Created.Date and Closed.Date variables so that R recognizes them as dates. From these variables, we can then calculate measures of government responsiveness: (1) how many days it took city workers to resolve a request, and (2) whether or not a request was resolved within a week.

[2]:
library(lubridate) # to work with dates

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union
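
If you haven't worked with lubridate before, here's a quick illustration of what mdy() does before we use it inside the function; the two date strings below are just made-up examples:

```r
library(lubridate)

# mdy() parses strings in month/day/year order into proper Date objects
opened <- mdy("01/23/2004")
closed <- mdy("02/02/2004")
class(opened) # "Date"

# Once parsed, dates support arithmetic
as.numeric(difftime(closed, opened, units = "days")) # 10
```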


[3]:
# Create a function that reads and cleans a service request file.
# The input is the name of a service request file and the
# output is a data frame with cleaned variables.
clean_dta <- function(file_name) {

    # Read the file and save it to an object called 'dta'
    dta <- read.csv(file_name)

    # Clean the dates in the dta file and generate responsiveness measures
    # mdy(substring(dta$Created.Date, 1, 10)) pulls just the month-day-year
    # from our columns with dates, then `mdy` tells R to read it as a date
    # in month-day-year format.

    dta$opened <- mdy(substring(dta$Created.Date, 1, 10))
    dta$closed <- mdy(substring(dta$Closed.Date, 1, 10))

    # Number of days between when an issue is opened and when it is resolved.
    dta$resptime <- as.numeric(difftime(dta$closed, dta$opened, units = "days"))

    # Create an indicator of whether the request was solved within 7 days.
    # Negative response times indicate bad data, so recode them
    # (along with existing missing values) to NA.
    dta[dta$resptime < 0 | is.na(dta$resptime), "resptime"] <- NA
    dta$solvedin7 <- as.numeric(dta$resptime <= 7)

    # Return the cleaned data
    return(dta)
}

Let’s test the function on the 2004 data:

[4]:
# Execute function on the 2004 data
nyc04 <- clean_dta(url2004)
head(nyc04)
A data.frame: 6 × 9

      Unique.Key Created.Date        Closed.Date         Complaint.Type                Location                                 opened     closed     resptime solvedin7
      <int>      <chr>               <chr>               <chr>                         <chr>                                    <date>     <date>     <dbl>    <dbl>
    1 4735434    01/23/2004 12:00 AM 02/02/2004 12:00 AM Boilers                       (40.71511134258793, -73.98998982667266)  2004-01-23 2004-02-02 10       0
    2 7547062    06/04/2004 12:00 AM 06/09/2004 12:00 AM HEATING                       (40.871781348425515, -73.88238262118011) 2004-06-04 2004-06-09  5       1
    3 5050661    08/04/2004 12:00 AM 08/06/2004 12:00 AM General Construction/Plumbing (40.59418801428136, -73.80082145383885)  2004-08-04 2004-08-06  2       1
    4 7281795    11/26/2004 12:00 AM 12/10/2004 12:00 AM PLUMBING                      (40.85911979460089, -73.90605127158484)  2004-11-26 2004-12-10 14       0
    5 1443894    08/22/2004 12:00 AM 08/22/2004 12:00 AM Noise - Street/Sidewalk       (40.54800892371052, -74.17041676351323)  2004-08-22 2004-08-22  0       1
    6 3244577    12/02/2004 12:00 AM 12/15/2004 12:00 AM Noise                                                                  2004-12-02 2004-12-15 13       0

The cleaned dataset has four new variables:

  • opened: The date the request was filed in date format.

  • closed: The date the request was resolved in date format.

  • resptime: The number of days it took to resolve the request (closed - opened).

  • solvedin7: A dummy variable equal to 1 if the request was solved within a week and 0 otherwise.
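
One subtlety worth seeing in miniature: because comparisons with NA return NA, solvedin7 automatically inherits the missing values in resptime. A tiny illustration (the response times here are made up):

```r
# A request taking 10 days, one taking 5, one unresolved (NA), one same-day
resptime <- c(10, 5, NA, 0)

# as.numeric() turns TRUE/FALSE into 1/0, and NA stays NA
as.numeric(resptime <= 7) # 0 1 NA 1
```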

We can now run this function on all six files using a for loop, or with something called lapply() (see ?lapply for details).

[5]:
# loop over and collect in a list!

url_stem <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-"
url_suffix <- "-sample.csv"

# Get first one so we can append others to the bottom
nyc_all <- clean_dta(paste0(url_stem, 2004, url_suffix))

for (year in 2005:2009) {
    url <- paste0(url_stem, year, url_suffix)
    new_data <- clean_dta(url)
    nyc_all <- rbind(nyc_all, new_data)
}
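
As an aside, here is what the lapply() route mentioned above might look like; it builds all six URLs at once, cleans each file, and stacks the results with a single rbind. This is a sketch that assumes clean_dta is defined as above:

```r
url_stem <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-"
url_suffix <- "-sample.csv"

# paste0() is vectorized, so this makes all six URLs in one call
urls <- paste0(url_stem, 2004:2009, url_suffix)

# lapply() returns a list of six cleaned data frames;
# do.call(rbind, ...) then binds them into a single data frame
nyc_all <- do.call(rbind, lapply(urls, clean_dta))
```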
[6]:
# 10 random rows
nyc_all[sample(nrow(nyc_all), 10), ]
A data.frame: 10 × 9

          Unique.Key Created.Date           Closed.Date            Complaint.Type       Location                                  opened     closed     resptime solvedin7
          <int>      <chr>                  <chr>                  <chr>                <chr>                                     <date>     <date>     <dbl>    <dbl>
    30235  9119502   08/16/2007 12:00:00 AM 08/18/2007 12:00:00 AM Industrial Waste     (40.6924338169897, -73.76791481654355)    2007-08-16 2007-08-18   2      1
    35420 10017391   12/07/2007 12:00:00 AM 12/13/2007 12:00:00 AM HEATING              (40.691388690509996, -73.93827993449325)  2007-12-07 2007-12-13   6      1
    57117 14167032   06/12/2009 12:00:00 AM 07/09/2009 12:00:00 AM GENERAL CONSTRUCTION (40.59682173951916, -73.74713511250768)   2009-06-12 2009-07-09  27      0
    21730  7708695   10/22/2006 12:00 AM    10/23/2006 12:00 AM    HEATING              (40.68886730172827, -73.82833980080076)   2006-10-22 2006-10-23   1      1
    15500  8649578   12/04/2005 12:00 AM    01/05/2007 12:00 AM    GENERAL CONSTRUCTION (40.82673859886853, -73.94034406281531)   2005-12-04 2007-01-05 397      0
    12478   347498   08/18/2005 12:00 AM    08/18/2005 12:00 AM    Derelict Vehicle     (40.614853534290404, -73.91703853432448)  2005-08-18 2005-08-18   0      1
    32767  8706092   01/31/2007 12:00:00 AM 02/08/2007 12:00:00 AM HEATING              (40.83353655524185, -73.91701881580157)   2007-01-31 2007-02-08   8      0
    39613  3834490   03/23/2007 12:00:00 AM                        Rodent               (40.825096598119764, -73.91334384257658)  2007-03-23 NA          NA     NA
    57025 14331411   07/06/2009 12:00:00 AM 07/29/2009 12:00:00 AM Air Quality          (40.825431183120024, -73.89044261279176)  2009-07-06 2009-07-29  23      0
     3339  7550682   05/29/2004 12:00 AM                           PAINT - PLASTER      (40.63718891788018, -73.90757435445164)   2004-05-29 NA          NA     NA

Ta-da! We cleaned a whole bunch of different datasets and merged them all together in only a handful of lines of code!
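
With everything in one data frame, follow-up analyses take only a line or two each. For instance, a hypothetical next step, using lubridate's year() to group requests by the year they were opened:

```r
library(lubridate)

# Share of requests resolved within a week, by year the request was opened.
# tapply() splits solvedin7 by year and applies mean() to each group;
# na.rm = TRUE skips requests that were never resolved.
tapply(nyc_all$solvedin7, year(nyc_all$opened), mean, na.rm = TRUE)
```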