Function Example
Functions are nice tools, both for creating generalizable code and for organizing your work. Because a function is a bundle of code that does one thing, putting part of your work into a function helps make clear what that chunk of code is for.
To illustrate one use of a function, we’ll write a function that reads and manipulates a .csv file. We can then put this in a for loop to iterate over several files with a similar structure and combine the resulting data frames into one data frame.
As you’ll see, one could cram all the code we’re going to write directly into the for loop at the end, but breaking part of it out into a function splits the problem into smaller, more manageable pieces.
What follows is an example of how you might use functions in a real data science workflow. The goal is not for you to be comfortable with everything you see (for example, the code uses a couple of tricks for working with dates), but rather just to see what this kind of workflow looks like!
Reading several files
Begin by downloading a .zip file with service request data from NYC (in the code below, we read the individual .csv files directly from URLs, so no download is strictly required). The zip file contains six files for the years 2004-2009, each with 10,000 observations. The data are originally from NYC’s Open Data portal, which hosts datasets with millions of service requests filed by residents through the city’s 311 program. For the purpose of this example, I have taken a random sample of 10,000 requests for each year.
Here’s what the 2004 file looks like (the other years have the same structure).
[1]:
url2004 <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-2004-sample.csv"
nyc04 <- read.csv(url2004)
head(nyc04)
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location |
|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | |
The variables in the data are as follows:
- `Unique.Key`: An id number unique to each request.
- `Created.Date`: The date the request was filed in the 311 system.
- `Closed.Date`: The date the request was resolved by city workers (`NA` implies that it was never resolved).
- `Complaint.Type`: The subject of the complaint.
- `Location`: Coordinates that give the location of the service issue.
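Before cleaning anything, it can help to poke at the raw columns. Here is a quick sketch (using the nyc04 data frame read in above) of the most common complaint types, plus a check that the date columns are currently stored as plain text:

# Most frequent complaint types in the 2004 sample
head(sort(table(nyc04$Complaint.Type), decreasing = TRUE))
# The date columns are read in as character strings, not dates
class(nyc04$Created.Date)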
Our goal with the function is to read the file and clean it. In particular, we want to convert the `Created.Date` and `Closed.Date` variables so that R recognizes them as dates. From these variables, we can then calculate measures of government responsiveness: (1) how many days it took city workers to resolve a request, and (2) whether or not a request was resolved within a week.
[2]:
library(lubridate) # to work with dates
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
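Before bundling everything into a function, here is a minimal sketch of the date logic we will rely on, applied to a made-up pair of dates (it uses `mdy()` from the lubridate package loaded above):

# substring(..., 1, 10) keeps just the "month/day/year" part of the text,
# mdy() parses it into a Date, and difftime() measures the gap in days.
opened <- mdy(substring("01/23/2004 12:00 AM", 1, 10))
closed <- mdy(substring("02/02/2004 12:00 AM", 1, 10))
as.numeric(difftime(closed, opened, units = "days"))  # 10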
[3]:
# Create a function that reads and cleans a service request file.
# The input is the name of a service request file and the
# output is a data frame with cleaned variables.
clean_dta <- function(file_name) {
# Read the file and save it to an object called 'dta'
dta <- read.csv(file_name)
# Clean the dates in the dta file and generate responsiveness measures
# mdy(substring(dta$Created.Date, 1, 10)) pulls just the month-day-year
# from our columns with dates, then `mdy` tells R to read it as a date
# in month-day-year format.
dta$opened <- mdy(substring(dta$Created.Date, 1, 10))
dta$closed <- mdy(substring(dta$Closed.Date, 1, 10))
    # Number of days between when an issue is opened and when it is resolved.
dta$resptime <- as.numeric(difftime(dta$closed, dta$opened, units = "days"))
    # Negative response times indicate bad data, so set them (and any
    # missing values) to NA.
    dta[dta$resptime < 0 | is.na(dta$resptime), "resptime"] <- NA
    # Create an indicator of whether the request was solved within 7 days.
    dta$solvedin7 <- as.numeric(dta$resptime <= 7)
# Return the cleaned data
return(dta)
}
Let’s test the function on the 2004 data:
[4]:
# Execute function on the 2004 data
nyc04 <- clean_dta(url2004)
head(nyc04)
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;date&gt; | &lt;date&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 1 | 4735434 | 01/23/2004 12:00 AM | 02/02/2004 12:00 AM | Boilers | (40.71511134258793, -73.98998982667266) | 2004-01-23 | 2004-02-02 | 10 | 0 |
| 2 | 7547062 | 06/04/2004 12:00 AM | 06/09/2004 12:00 AM | HEATING | (40.871781348425515, -73.88238262118011) | 2004-06-04 | 2004-06-09 | 5 | 1 |
| 3 | 5050661 | 08/04/2004 12:00 AM | 08/06/2004 12:00 AM | General Construction/Plumbing | (40.59418801428136, -73.80082145383885) | 2004-08-04 | 2004-08-06 | 2 | 1 |
| 4 | 7281795 | 11/26/2004 12:00 AM | 12/10/2004 12:00 AM | PLUMBING | (40.85911979460089, -73.90605127158484) | 2004-11-26 | 2004-12-10 | 14 | 0 |
| 5 | 1443894 | 08/22/2004 12:00 AM | 08/22/2004 12:00 AM | Noise - Street/Sidewalk | (40.54800892371052, -74.17041676351323) | 2004-08-22 | 2004-08-22 | 0 | 1 |
| 6 | 3244577 | 12/02/2004 12:00 AM | 12/15/2004 12:00 AM | Noise | | 2004-12-02 | 2004-12-15 | 13 | 0 |
The cleaned dataset has four new variables:
- `opened`: The date the request was filed, in date format.
- `closed`: The date the request was resolved, in date format.
- `resptime`: The number of days it took to resolve the request (`closed` - `opened`).
- `solvedin7`: A dummy variable equal to 1 if the request was solved within a week and 0 otherwise.
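With these in hand, a quick sanity check on the cleaned 2004 data might look something like this (a sketch; the exact numbers depend on the sample):

# Distribution of response times and the share of requests solved within a week
summary(nyc04$resptime)
mean(nyc04$solvedin7, na.rm = TRUE)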
We can now use this function on all six files, either with a for loop or with something called `lapply()`. We'll use a for loop here, with a sketch of the `lapply()` approach after it.
[5]:
# Loop over the files and stack the cleaned data frames together
url_stem <- "https://raw.githubusercontent.com/nickeubank/computational_methods_boot_camp/main/source/data/nyc-311-sample/nyc-311-"
url_suffix <- "-sample.csv"
# Get first one so we can append others to the bottom
nyc_all <- clean_dta(paste0(url_stem, 2004, url_suffix))
for (year in 2005:2009) {
url <- paste0(url_stem, year, url_suffix)
new_data <- clean_dta(url)
nyc_all <- rbind(nyc_all, new_data)
}
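For comparison, the `lapply()` version mentioned above might look like this (a sketch that reuses `url_stem`, `url_suffix`, and `clean_dta()` from above):

# Build all six URLs, clean each file, then stack the resulting
# data frames with do.call(rbind, ...)
urls <- paste0(url_stem, 2004:2009, url_suffix)
nyc_all <- do.call(rbind, lapply(urls, clean_dta))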
[6]:
# 10 random rows
nyc_all[sample(nrow(nyc_all), 10), ]
| | Unique.Key | Created.Date | Closed.Date | Complaint.Type | Location | opened | closed | resptime | solvedin7 |
|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;date&gt; | &lt;date&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| 30235 | 9119502 | 08/16/2007 12:00:00 AM | 08/18/2007 12:00:00 AM | Industrial Waste | (40.6924338169897, -73.76791481654355) | 2007-08-16 | 2007-08-18 | 2 | 1 |
| 35420 | 10017391 | 12/07/2007 12:00:00 AM | 12/13/2007 12:00:00 AM | HEATING | (40.691388690509996, -73.93827993449325) | 2007-12-07 | 2007-12-13 | 6 | 1 |
| 57117 | 14167032 | 06/12/2009 12:00:00 AM | 07/09/2009 12:00:00 AM | GENERAL CONSTRUCTION | (40.59682173951916, -73.74713511250768) | 2009-06-12 | 2009-07-09 | 27 | 0 |
| 21730 | 7708695 | 10/22/2006 12:00 AM | 10/23/2006 12:00 AM | HEATING | (40.68886730172827, -73.82833980080076) | 2006-10-22 | 2006-10-23 | 1 | 1 |
| 15500 | 8649578 | 12/04/2005 12:00 AM | 01/05/2007 12:00 AM | GENERAL CONSTRUCTION | (40.82673859886853, -73.94034406281531) | 2005-12-04 | 2007-01-05 | 397 | 0 |
| 12478 | 347498 | 08/18/2005 12:00 AM | 08/18/2005 12:00 AM | Derelict Vehicle | (40.614853534290404, -73.91703853432448) | 2005-08-18 | 2005-08-18 | 0 | 1 |
| 32767 | 8706092 | 01/31/2007 12:00:00 AM | 02/08/2007 12:00:00 AM | HEATING | (40.83353655524185, -73.91701881580157) | 2007-01-31 | 2007-02-08 | 8 | 0 |
| 39613 | 3834490 | 03/23/2007 12:00:00 AM | | Rodent | (40.825096598119764, -73.91334384257658) | 2007-03-23 | NA | NA | NA |
| 57025 | 14331411 | 07/06/2009 12:00:00 AM | 07/29/2009 12:00:00 AM | Air Quality | (40.825431183120024, -73.89044261279176) | 2009-07-06 | 2009-07-29 | 23 | 0 |
| 3339 | 7550682 | 05/29/2004 12:00 AM | | PAINT - PLASTER | (40.63718891788018, -73.90757435445164) | 2004-05-29 | NA | NA | NA |
Ta-da! We cleaned a whole bunch of different datasets and combined them all into one data frame in only a handful of lines of code!
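And because everything now lives in one data frame, follow-up questions are easy to ask. For example, here is a sketch of comparing responsiveness across years, using lubridate's `year()` on the `opened` column:

# Share of requests resolved within a week, by the year the request was opened
aggregate(solvedin7 ~ year(opened), data = nyc_all, FUN = mean)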