Chapter 4 Data Manipulation

4.1 Objectives

4.2 Motivation

To follow along with the code in this chapter you’ll need to download and import the dog_licenses data. Run the following code in the Console:

library(tidyverse)
dog_licenses <- readr::read_csv("https://github.com/merely-useful/novice-r/raw/master/data/nyc-dog-licenses.csv.gz") %>% 
  drop_na(animal_name)

This data comes from the [New York City OpenData Portal][nyc-opendata-dogs] where the data is described as:

All dog owners residing in NYC are required by law to license their dogs. The data is sourced from the DOHMH Dog Licensing System (https://a816-healthpsi.nyc.gov/DogLicense), where owners can apply for and renew dog licenses. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. Each record stands as a unique license period for the dog over the course of the yearlong time frame.

Let’s get started by taking a “look” at the data. The RStudio Viewer has an interface much like other spreadsheet programs you might have used. You can use this Viewer to look at the dog_licenses tibble with the View() function:

View(dog_licenses)

This viewer has some basic data manipulation features:

  • Arrange You can change the order of the rows in the data based on the values in a column by clicking the up/down arrow next to the column name.

  • Filter You can filter to include only rows which have a certain value in a column by first clicking the small funnel icon labelled “Filter,” then typing a desired value in the appropriate column.

Arrange and filter are known as data manipulation verbs. Individually, they describe a single simple manipulation of a dataset. It’s surprising how many questions you can answer using just these two basic verbs:

  • How old is the oldest dog in this data? To answer we can arrange the animal_birth_month column in increasing order and see Jack, a Pug from Queens, was born in January 1999 (this license was issued in May 2015, making Jack at least 16 at the time). You’ll notice that there are other dogs with this same birthday.

  • What range of license issue dates are in this data? Arrange license_issued_date once in increasing order and once in decreasing order, to find the issue dates range from 12th September 2015 to 31st December 2016.

  • How many dogs licenses belong to dogs named Fido? Filter the animal_name column with Fido, and see “Showing … of 12 entries” - so 12!

  • and many more…

While these verbs are powerful in their own right, their real power comes from combining them. For example, we can answer the more complicated question “Which dog named Fido is the oldest?” by first filtering then arranging.

The Viewer in RStudio, however, has two huge limitations:

  • It’s a point and click interface. This means to repeat the same operation again you need to remember exactly the steps of point, clicking and typing you performed to get to your answer. Consequently, it’s hard to share those steps unambiguously with someone else, and it’s hard to save your results for future.

  • The manipulation verbs in the viewer are limited. There is no way to rearrange the columns, add new variables or calculate summaries like counts or averages.

You’ll start this chapter by overcoming this first limitation. You won’t use the Viewer to arrange and filter, you’ll learn to write code to do the same operations. Then you’ll increase your vocabulary of data manipulation verbs to include:

  • selecting variables,
  • adding variables,
  • summarizing rows, and
  • performing these operations on subsets of the data.

Combining these verbs you’ll be able to answers questions like:

  • How long are licences issued for?

  • What are the most popular breeds?

  • What names are most popular for licensed dogs in New York? Does this vary geographically?

  • When are dogs born?

FIXME: update these question to reflect things that are actually done in this and later chapters.

To master data manipulation you need to master two pieces:

  • How to describe the action you want with the data manipulation verbs individually. This is a language specific skill - in this chapter, you’ll use the functions in the dplyr package.

  • Identifying which verbs, and in which order to apply them, to answer a question of interest. This skill will translate across all technologies, but it takes a little longer to master.

4.2.1 Exercise: Point and click data manipulation

Using the RStudio Viewer answer the following questions:

  • How many dog licenses belong to dogs named “Queen” that live in “Queens?”

4.3 Exploring data in the console

Let’s take a look at the data in the console:

dog_licenses
## # A tibble: 118,542 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1        533 BONITA      F             2013-05-01        Unknown        Queens 
##  2        548 ROCKY       M             2014-05-01        Labrador Retr… Queens 
##  3        622 BULLY       M             2010-07-01        American Pit … Queens 
##  4        633 COCO        M             2005-02-01        Labrador Retr… Queens 
##  5        655 SKI         F             2012-09-01        American Pit … Queens 
##  6        872 CHASE       M             2013-11-01        Shih Tzu       Queens 
##  7        874 CHEWY       M             2014-09-01        Shih Tzu       Queens 
##  8        875 CHASE       M             2008-08-01        Labrador Retr… Queens 
##  9        893 MILEY       F             2008-07-01        Boxer          Queens 
## 10        919 KENZI       F             2010-05-01        Schnauzer, Mi… Queens 
## # … with 118,532 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

Notice that in contrast to the Viewer you only see the first 10 rows of the dataset, and just the first few columns. The number of columns you see depends on the width of your console, so you may see more or fewer than displayed here. You should also note some of the contents of the columns have been abbreviated. The at the end of some values in breed_name indicates these values have been truncated for display purposes.

You’ll be using the dplyr package for data manipulation. Since it is part of the tidyverse, you’ll need to load the tidyverse package to begin:

library(tidyverse)

4.3.1 Re-arranging rows

You can reorder rows of data with the dplyr function arrange(). The arrange function takes a tibble as its first argument and column names as the remaining arguments. The result will have the rows ordered in increasing value of the specified column. For example, to find the licenses belonging to the oldest dogs we arrange dog_licenses using the animal_birth_month column:

arrange(dog_licenses, animal_birth_month)
## # A tibble: 118,542 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1      15568 JACK        M             1999-01-01        Pug            Queens 
##  2      23695 KATTY       F             1999-01-01        Chihuahua      Queens 
##  3     101309 TOMMY       M             1999-01-01        Unknown        Queens 
##  4       1598 SARAH       F             1999-01-01        West High Whi… Queens 
##  5       8628 DOMINO      M             1999-01-01        Labrador Retr… Queens 
##  6      15733 BRINKS      M             1999-01-01        Yorkshire Ter… Queens 
##  7      30419 LUCKY       M             1999-01-01        Unknown        Queens 
##  8      31348 MAGGIE      F             1999-01-01        German Shephe… Queens 
##  9      34685 COOKIE      M             1999-01-01        Pomeranian     Queens 
## 10      37194 DARCY       F             1999-01-01        Cocker Spaniel Queens 
## # … with 118,532 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

You’ll see Jack the Pug that lives in Queens, just like you did in the Viewer.

To arrange the rows by decreasing value, you need to wrap the column name in desc() (short for descending order). For instance to find the youngest dogs:

arrange(dog_licenses, desc(animal_birth_month))
## # A tibble: 118,542 x 15
##    row_number animal_name  animal_gender animal_birth_mo… breed_name    borough 
##         <dbl> <chr>        <chr>         <date>           <chr>         <chr>   
##  1     120352 MARLEY       M             2016-12-01       Cocker Spani… Manhatt…
##  2     121981 MR.          M             2016-12-01       Chihuahua Cr… Brooklyn
##  3     120501 RORY         M             2016-12-01       Unknown       Brooklyn
##  4     122028 TAQUITO      M             2016-12-01       Papillon      Staten …
##  5     115820 REX          M             2016-11-01       Maltese       Queens  
##  6     121727 CHANDERBALLI M             2016-11-01       Havanese      Queens  
##  7     115777 ANGEL        F             2016-11-01       Poodle, Mini… Queens  
##  8     118175 MASON        M             2016-11-01       American Sta… Queens  
##  9     121601 TEDDY        M             2016-11-01       Havanese      Brooklyn
## 10     120995 LOLA         F             2016-11-01       Morkie        Staten …
## # … with 118,532 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

As another example, to find the earliest issue date we can order by increasing license_issued_date:

arrange(dog_licenses, license_issued_date)
## # A tibble: 118,542 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1          1 QUEEN       F             2013-04-01        Akita Crossbr… Manhat…
##  2          2 CHEWBACCA   F             2012-06-01        Labrador Retr… Manhat…
##  3          3 IAN         M             2006-01-01        Unknown        Manhat…
##  4          7 LOLA        F             2009-06-01        Maltese        Manhat…
##  5          4 PAIGE       F             2014-07-01        American Pit … Manhat…
##  6          5 BUDDY       M             2008-04-01        Unknown        Manhat…
##  7          8 YOGI        M             2010-09-01        Boxer          Bronx  
##  8         10 MUNECA      F             2013-05-01        Beagle         Brookl…
##  9         27 BESS        F             2010-09-01        Beagle         Brookl…
## 10         26 BIGS        M             2004-12-01        American Pit … Brookl…
## # … with 118,532 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

The first row is the record with the earliest issue date, but we can’t actually see that date because the column license_issued_date isn’t being displayed due to space. One solution is to extract only the columns we are interested in, a manipulation known as selecting columns.

4.3.2 Exercise: Arranging character strings

Use arrange() to order the dog licenses by animal_name in increasing order. What does this tell you about the way R treats punctuation and numbers when dealing with alphabetical order?

4.4 How can I select subsets of my data?

Two verbs are used to subset data:

  • select() to select columns
  • filter() to select rows

You’ll learn about these two functions in this section, along with learning about a way to chain together multiple operations on a dataset.

4.4.1 Selecting columns

The select() function in dplyr is used to extract some subset of columns (but keep all the rows) from a tibble. Just like arrange(), it takes a tibble as its first argument and column names as the remaining arguments. For example, to keep only the animal_name column:

select(dog_licenses, animal_name)
## # A tibble: 118,542 x 1
##    animal_name
##    <chr>      
##  1 BONITA     
##  2 ROCKY      
##  3 BULLY      
##  4 COCO       
##  5 SKI        
##  6 CHASE      
##  7 CHEWY      
##  8 CHASE      
##  9 MILEY      
## 10 KENZI      
## # … with 118,532 more rows

You can provide additional column names as arguments to keep additional specified columns, for example to keep animal_name and breed_name:

select(dog_licenses, animal_name, breed_name)
## # A tibble: 118,542 x 2
##    animal_name breed_name                        
##    <chr>       <chr>                             
##  1 BONITA      Unknown                           
##  2 ROCKY       Labrador Retriever Crossbreed     
##  3 BULLY       American Pit Bull Terrier/Pit Bull
##  4 COCO        Labrador Retriever                
##  5 SKI         American Pit Bull Terrier/Pit Bull
##  6 CHASE       Shih Tzu                          
##  7 CHEWY       Shih Tzu                          
##  8 CHASE       Labrador Retriever                
##  9 MILEY       Boxer                             
## 10 KENZI       Schnauzer, Miniature              
## # … with 118,532 more rows

To return to finding the earliest issue date, you need to first arrange by increasing license_issued_date and then select the license_issued_date column. One approach is to store the result of the arrange step,

dog_by_date <- arrange(dog_licenses, license_issued_date)

Then apply the select step to this object:

select(dog_by_date, license_issued_date)
## # A tibble: 118,542 x 1
##    license_issued_date
##    <date>             
##  1 2014-09-12         
##  2 2014-09-12         
##  3 2014-09-12         
##  4 2014-09-12         
##  5 2014-09-12         
##  6 2014-09-12         
##  7 2014-09-12         
##  8 2014-09-13         
##  9 2014-09-13         
## 10 2014-09-13         
## # … with 118,532 more rows

There are lots of shortcuts you can use with select() to avoid having to type out all the variables you want to keep. For example, you can ask for all the columns that start with a certain string:

select(dog_licenses, starts_with("Animal"))
## # A tibble: 118,542 x 3
##    animal_name animal_gender animal_birth_month
##    <chr>       <chr>         <date>            
##  1 BONITA      F             2013-05-01        
##  2 ROCKY       M             2014-05-01        
##  3 BULLY       M             2010-07-01        
##  4 COCO        M             2005-02-01        
##  5 SKI         F             2012-09-01        
##  6 CHASE       M             2013-11-01        
##  7 CHEWY       M             2014-09-01        
##  8 CHASE       M             2008-08-01        
##  9 MILEY       F             2008-07-01        
## 10 KENZI       F             2010-05-01        
## # … with 118,532 more rows

Take a look in the [“Useful functions” section of the select() help page][select-docs] for a complete list:

?dplyr::select

4.4.2 Exercise: Find the latest license issue date

Combine arrange() and select() to confirm the last issue date in this dataset is 31st December 2016.

4.4.3 Combining operations with the pipe %>%

You’ve seen you can combine data manipulation steps to do more complicated tasks, but so far you’ve done so by saving an intermediate object, in our previous example the object dog_by_date:

dog_by_date <- arrange(dog_licenses, license_issued_date)
select(dog_by_date, license_issued_date)

The pipe, %>%, is an operator that allows you to chain together operations without intermediate objects and maintain readability. The name, “pipe,” comes from the plumbing kind of pipe, not the smoking kind, and references the idea of objects flowing out of one function and into another. Let’s just look at the first step in our manipulation:

arrange(dog_licenses, license_issued_date)

With the pipe this can be rewritten as:

dog_licenses %>% arrange(license_issued_date)

The pipe takes the object on the left hand side and passes it as the first argument to the function on the right hand side. So, here the dog_licenses dataset is passed to the first argument of arrange(). Inside arrange() we can then list any additional arguments as we normally would.

The pipe works very nicely with the data manipulation verbs because every verb expects a tibble as its first argument and returns a tibble. This means the result of one operation is easily piped into the next operation, allowing you to chain together multiple steps. For instance, piping the result of the arrange step above into the select() function:

dog_licenses %>%
  arrange(license_issued_date) %>%
  select(license_issued_date)

When you see the pipe, read it as “and then.” So, the above code would be read:

Take the dog_licenses data, and then arrange the rows by the license_issued_date, and then select the column license_issued_date.

The result is code that matches very closely how we might describe the steps we performed in natural language. It’s so natural that for the remainder of the chapter we’ll use the pipe when combining data manipulation steps.

4.4.4 Exercise: Reading aloud

Read the following code aloud to your neighbor (or cat, dog, or [rubber duck][rubber-duck-debugging]). Remember to pronounce %>% as “and then.”

dog_licenses %>%
  arrange(license_issued_date) %>%
  select(license_expired_date)

What question might it answer?

4.4.5 Exercise: Using the pipe

Re-write this snippet of code to use the pipe:

select(dog_licenses, animal_name, breed_name)

Use the pipe to re-write this snippet of code to avoid the intermediate variable:

name_and_breed <- select(dog_licenses, animal_name, breed_name)
arrange(name_and_breed, breed_name)

4.4.6 Filtering to keep a subset of rows

The function to filter rows of a tibble is filter(). Like arrange() and select(), its first argument is a tibble. The remaining arguments describe which rows to keep. The rows to keep are specified with a logical expression - something that is either TRUE or FALSE. The rows where this expression is TRUE will be returned.

One of the simplest kinds of logical expression is a test for equality with the == operator. For example, to keep the rows where animal_name is BRUNO you could do:

dog_licenses %>% filter(animal_name == "BRUNO")
## # A tibble: 272 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1      12001 BRUNO       M             2010-05-01        American Pit … Queens 
##  2      27228 BRUNO       M             2013-07-01        Jack Russell … Queens 
##  3      68192 BRUNO       M             2002-01-01        Shih Tzu       Queens 
##  4      70175 BRUNO       M             2015-12-01        Chihuahua Cro… Queens 
##  5     120562 BRUNO       M             2016-05-01        Labrador Retr… Queens 
##  6       3915 BRUNO       M             2014-03-01        Doberman Pins… Queens 
##  7       9614 BRUNO       M             2014-12-01        Boxer          Queens 
##  8      15606 BRUNO       M             2015-02-01        French Bulldog Queens 
##  9      32742 BRUNO       M             2014-03-01        Maltipoo       Queens 
## 10      34299 BRUNO       M             2003-01-01        Unknown        Queens 
## # … with 262 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

You could read this code aloud as:

Take the dog_licenses data, and then filter for only rows when animal_name is equal to “BRUNO”

The logical expression, animal_name == "BRUNO", will be TRUE when the value of the animal_name column is exactly equal to the character string "BRUNO" - any differences in characters, case, or whitespace will result in FALSE.

Above, each value within a column was compared against the same fixed string ("BRUNO"). You can also compare the values from two columns against each other by including a column name on each side of the == sign. In programming terms this is known as element-wise comparison. For example, license_issued_date == animal_birth_month will return TRUE for a row only if for that row the date the license was issued is the exact same date as the birth month for the dog. If you take a look:

dog_licenses %>%
  filter(license_issued_date == animal_birth_month)
## # A tibble: 0 x 15
## # … with 15 variables: row_number <dbl>, animal_name <chr>,
## #   animal_gender <chr>, animal_birth_month <date>, breed_name <chr>,
## #   borough <chr>, zip_code <dbl>, community_district <dbl>,
## #   census_tract_2010 <dbl>, neighborhood_tabulation_area <chr>,
## #   city_council_district <dbl>, congressional_district <dbl>,
## #   state_senatorial_district <dbl>, license_issued_date <date>,
## #   license_expired_date <date>

There is no output (apart from the column names), which means that no rows satisfy this criteria.

Remember that a string is surrounded by quotes while a column name is not. When you are reading code, look for the quotes to figure out if the comparison is to a string, or (by the absence of quotes) to the strings within a column. Use the same strategy to figure out where the quotes should be in your own code, but beware: misplacing quotes often won’t result in an error, but instead a result that you weren’t expecting. For example, I might be interested in the licenses issued to male dogs and try:

dog_licenses %>%
  filter("animal_gender" == "M")
## # A tibble: 0 x 15
## # … with 15 variables: row_number <dbl>, animal_name <chr>,
## #   animal_gender <chr>, animal_birth_month <date>, breed_name <chr>,
## #   borough <chr>, zip_code <dbl>, community_district <dbl>,
## #   census_tract_2010 <dbl>, neighborhood_tabulation_area <chr>,
## #   city_council_district <dbl>, congressional_district <dbl>,
## #   state_senatorial_district <dbl>, license_issued_date <date>,
## #   license_expired_date <date>

The result has zero rows, which would suggest there are no such licenses, but in fact this is the answer to a different question. Can you see what is wrong with the code? By surrounding animal_gender in quotes, R has interpreted the comparison as: is the string "animal_gender" equal to the string "M"?. The answer is FALSE, and no rows are returned. I actually wanted to compare the animal_gender column to the string "M", so animal_gender should have no quotes around it:

dog_licenses %>%
  filter(animal_gender == "M")
## # A tibble: 64,770 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1        548 ROCKY       M             2014-05-01        Labrador Retr… Queens 
##  2        622 BULLY       M             2010-07-01        American Pit … Queens 
##  3        633 COCO        M             2005-02-01        Labrador Retr… Queens 
##  4        872 CHASE       M             2013-11-01        Shih Tzu       Queens 
##  5        874 CHEWY       M             2014-09-01        Shih Tzu       Queens 
##  6        875 CHASE       M             2008-08-01        Labrador Retr… Queens 
##  7        976 APOLLO      M             2014-10-01        American Pit … Queens 
##  8       1297 JERRY       M             2009-06-01        Labrador Retr… Queens 
##  9       2133 SIMON       M             2010-12-01        Havanese       Queens 
## 10       2289 BUDDY       M             2012-06-01        Labrador Retr… Queens 
## # … with 64,760 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

The operator, == (you can read as “is equal to,” or simply “equals”), is a specific kind of comparison. Other comparisons include:

Operator Meaning
< less than
> greater than
<= less than or equal to
>= greater than or equal to
!= not equal to

4.4.7 Exercise: Enterprising dogs

Are there any dogs called “Spock,” “Picard,” or “Janeway?”

These are Star Trek characters. Can you find any dogs with the same name as your favorite character from a book, TV show, or movie?

4.4.8 Exercise: Expired Licenses

This code creates a variable that contains the date for the start of the year 2016:

start_of_2016 <- as.Date("2016-01-01")

Use filter() with this variable to find:

  • The dog licenses that were issued before 2016
  • The dog licenses that expire before 2016

4.4.9 More complicated expressions

Logical expressions can be combined with logical operators to construct more complicated expressions. For example, the AND operator, &, returns TRUE only if the expressions on both sides of it are TRUE. For example, you’ve seen, you can find the licenses to dogs called "BRUNO":

dog_licenses %>%
  filter(animal_name == "BRUNO")
## # A tibble: 272 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1      12001 BRUNO       M             2010-05-01        American Pit … Queens 
##  2      27228 BRUNO       M             2013-07-01        Jack Russell … Queens 
##  3      68192 BRUNO       M             2002-01-01        Shih Tzu       Queens 
##  4      70175 BRUNO       M             2015-12-01        Chihuahua Cro… Queens 
##  5     120562 BRUNO       M             2016-05-01        Labrador Retr… Queens 
##  6       3915 BRUNO       M             2014-03-01        Doberman Pins… Queens 
##  7       9614 BRUNO       M             2014-12-01        Boxer          Queens 
##  8      15606 BRUNO       M             2015-02-01        French Bulldog Queens 
##  9      32742 BRUNO       M             2014-03-01        Maltipoo       Queens 
## 10      34299 BRUNO       M             2003-01-01        Unknown        Queens 
## # … with 262 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

And you could find the dog licenses issued to dogs that live in Brooklyn:

dog_licenses %>%
  filter(borough == "Brooklyn")
## # A tibble: 29,334 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1       2895 FUDGE       M             2014-07-01        American Pit … Brookl…
##  2       4057 STAR        F             2011-01-01        Poodle         Brookl…
##  3      74463 MUNECA      F             2011-09-01        Chihuahua Cro… Brookl…
##  4      76232 KATTY       F             2002-06-01        Chihuahua      Brookl…
##  5      85113 SHADOW      M             2015-03-01        American Pit … Brookl…
##  6      85997 SPARKIE     M             2013-08-01        Maltese Cross… Brookl…
##  7      92451 SNOW        M             2014-07-01        Maltese        Brookl…
##  8     104256 BELLA       F             2016-07-01        Maltese        Brookl…
##  9      10389 SPARKLE     F             2006-01-01        Schnauzer, St… Brookl…
## 10      82492 UNKNOWN     F             2015-11-01        Pomeranian     Brookl…
## # … with 29,324 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

If you want to find the licenses that are to dogs named Bruno in Brooklyn, you could combine the two logical statements with &:

dog_licenses %>%
  filter((animal_name == "BRUNO") & (borough == "Brooklyn"))
## # A tibble: 55 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1       8496 BRUNO       M             2005-04-01        American Pit … Brookl…
##  2      65466 BRUNO       M             2012-11-01        Golden Retrie… Brookl…
##  3     115999 BRUNO       M             2009-01-01        Pug            Brookl…
##  4     118963 BRUNO       M             2005-01-01        Unknown        Brookl…
##  5      10505 BRUNO       M             2006-04-01        Bull Dog, Eng… Brookl…
##  6      14444 BRUNO       M             2007-01-01        Cocker Spaniel Brookl…
##  7      14690 BRUNO       M             2014-04-01        Shih Tzu       Brookl…
##  8      47454 BRUNO       M             2011-06-01        Pug            Brookl…
##  9      58918 BRUNO       M             2006-04-01        Bull Dog, Eng… Brookl…
## 10     105964 BRUNO       M             2016-04-01        Bull Dog, Fre… Brookl…
## # … with 45 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

The parentheses around each logical expression are optional, but can help visually to group the components to &, especially if those logical expressions get more complicated.

If you want to create an “or” type expression, like “licenses to dogs named BRUNO or dogs named BRUCE,” you need to combine two comparisons with the OR operator, |.

dog_licenses %>%
  filter((animal_name == "BRUNO") | (animal_name == "BRUCE"))

If you find yourself combining lots of comparisons on the same column with |, like dogs named BRUNO, BRUCE or BRADY:

dog_licenses %>%
  filter((animal_name == "BRUNO") | (animal_name == "BRUCE") | (animal_name == "BRADY"))

You can save a lot of typing with %in%:

dog_licenses %>%
  filter(animal_name %in% c("BRUNO", "BRUCE", "BRADY"))

On the right hand side of %in% the function c(), combines many single values into a vector. %in% will return TRUE for an element of the left hand side if it is contained in the vector on the right hand side.

4.4.10 Exercise: Expired Licenses

This code creates two variables that contain the dates for the start and end of the year 2016:

start_of_2016 <- as.Date("2016-01-01")
endt_of_2016 <- as.Date("2016-12-31")

Use filter() with these variables to find the dog licenses that expire during 2016.

4.5 How can I calculate new values?

So far, you’ve been manipulating the columns that already exist in a dataset, but what if you want to add new ones? The function mutate() handles this kind of operation.

To see how this works let’s start with a logical expression: animal_name == "CHASE". If you used this with filter(), you would get all the rows back where the license was issued to a dog named CHASE. Let’s say instead of subsetting the data, you want to add a column called called_chase that contained the TRUE and FALSE result. You might do this for example, if you are interested in comparing the two groups of dogs, rather than just keeping one of them. With mutate() after passing in the data, you pass named arguments, where the name is the name you desire for the new column, and its value is the way to calculate it:

dog_licenses %>%
  mutate(called_chase = animal_name == "CHASE") %>%
  select(animal_name, called_chase)
## # A tibble: 118,542 x 2
##    animal_name called_chase
##    <chr>       <lgl>       
##  1 BONITA      FALSE       
##  2 ROCKY       FALSE       
##  3 BULLY       FALSE       
##  4 COCO        FALSE       
##  5 SKI         FALSE       
##  6 CHASE       TRUE        
##  7 CHEWY       FALSE       
##  8 CHASE       TRUE        
##  9 MILEY       FALSE       
## 10 KENZI       FALSE       
## # … with 118,532 more rows

Take dog_licences, and then, mutate to add a column called called_chase which is the result of testing whether animal_name is exactly "CHASE", and then, select the columns animal_name and called_chase.

The select statement isn’t crucial to the calculation here, but it does help me draw your attention to the columns that were involved in this step.

This can be a useful intermediate step in filtering, since it gives you a chance to examine the logical statement before using it to filter:

dog_licenses %>%
  mutate(called_chase = animal_name == "CHASE") %>%
  filter(called_chase)
## # A tibble: 126 x 16
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1        872 CHASE       M             2013-11-01        Shih Tzu       Queens 
##  2        875 CHASE       M             2008-08-01        Labrador Retr… Queens 
##  3      32652 CHASE       M             2013-09-01        Yorkshire Ter… Queens 
##  4      42125 CHASE       M             2015-08-01        Chihuahua      Queens 
##  5     109847 CHASE       M             2013-04-01        Terrier mix    Queens 
##  6     114557 CHASE       M             2009-07-01        Siberian Husky Queens 
##  7      33434 CHASE       M             2014-08-01        Schnauzer, St… Queens 
##  8      45142 CHASE       M             2005-01-01        Unknown        Queens 
##  9       2528 CHASE       M             2011-10-01        Lhasa Apso     Queens 
## 10       3426 CHASE       M             2014-03-01        Yorkshire Ter… Queens 
## # … with 116 more rows, and 10 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>, called_chase <lgl>

Take a closer look at the argument to mutate:

called_chase = animal_name == "CHASE"

On the left of the = is the argument name, called_chase. This is a name you choose - it should be descriptive and follow good style. On the right of the = is an expression that must either return as many values as there are rows, or a single value. Here, the logical expression involves the column animal_name so it returns as many values as there are rows.

If you would prefer the values to be something other than TRUE or FALSE, instead maybe you want them to be something like "called chase" or "not called chase", we would need to use the ifelse().

A call to ifelse() takes the form:

ifelse(test, yes, no)

Where test is a logical expression, yes the value for the elements that return TRUE, and no the value for the elements that return FALSE (Figure 4.1). (Both yes and no could be other column names, in which case the corresponding element of yes would be returned for TRUE elements).

dog_licenses %>%
  mutate(
    called_chase = ifelse(animal_name == "CHASE", "called chase", "not called chase")
    ) %>%
  select(animal_name, called_chase)
## # A tibble: 118,542 x 2
##    animal_name called_chase    
##    <chr>       <chr>           
##  1 BONITA      not called chase
##  2 ROCKY       not called chase
##  3 BULLY       not called chase
##  4 COCO        not called chase
##  5 SKI         not called chase
##  6 CHASE       called chase    
##  7 CHEWY       not called chase
##  8 CHASE       called chase    
##  9 MILEY       not called chase
## 10 KENZI       not called chase
## # … with 118,532 more rows
Schematic of `ifelse()`.

Figure 4.1: Schematic of ifelse().

You can perform multiple mutate steps at once by passing more arguments to mutate(), so an alternative way of writing the above code (with more keystrokes, but with lines that are shorter) would be:

dog_licenses %>%
  mutate(
    is_chase = animal_name == "CHASE",
    called_chase = ifelse(is_chase, "called chase", "not called chase")
    ) %>%
  select(animal_name, is_chase, called_chase)
## # A tibble: 118,542 x 3
##    animal_name is_chase called_chase    
##    <chr>       <lgl>    <chr>           
##  1 BONITA      FALSE    not called chase
##  2 ROCKY       FALSE    not called chase
##  3 BULLY       FALSE    not called chase
##  4 COCO        FALSE    not called chase
##  5 SKI         FALSE    not called chase
##  6 CHASE       TRUE     called chase    
##  7 CHEWY       FALSE    not called chase
##  8 CHASE       TRUE     called chase    
##  9 MILEY       FALSE    not called chase
## 10 KENZI       FALSE    not called chase
## # … with 118,532 more rows

Notice that the computation for the called_chase column refers to the is_chase column. The arguments to mutate are computed in order, so columns created later in the same mutate() can refer to columns created earlier.

Arithmetic is another common operation that returns as many elements as there are rows. For instance we could see how long licenses are issued for:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  select(license_duration)
## # A tibble: 118,542 x 1
##    license_duration
##    <drtn>          
##  1 1118 days       
##  2 1826 days       
##  3  697 days       
##  4 1096 days       
##  5 1826 days       
##  6  731 days       
##  7  731 days       
##  8 1097 days       
##  9  421 days       
## 10  402 days       
## # … with 118,532 more rows

The output shows us licenses aren’t issued for a standard time period. In these first rows, there are some licenses issued for a whole number of years: 2 (731 days), 3 (1097 days) and 5 (1826 days). However, others seem to be for fractions of years like 421 days.

You can find other functions that are useful in combination with mutate() in the [“Useful functions” section of the mutate() documentation][mutate-useful].

You can use mutate() with operations that give one number based on all the rows:

dog_licenses %>%
  mutate(
    license_duration = license_expired_date - license_issued_date,
    avg_duration = mean(license_duration)) %>%
  select(license_duration, avg_duration)
## # A tibble: 118,542 x 2
##    license_duration avg_duration
##    <drtn>           <drtn>      
##  1 1118 days        467.321 days
##  2 1826 days        467.321 days
##  3  697 days        467.321 days
##  4 1096 days        467.321 days
##  5 1826 days        467.321 days
##  6  731 days        467.321 days
##  7  731 days        467.321 days
##  8 1097 days        467.321 days
##  9  421 days        467.321 days
## 10  402 days        467.321 days
## # … with 118,532 more rows

You’ll get back the original number of rows, but the single value will be repeated in all of them - on average licenses are issued for 467.321 days. You’ll see a different verb, summarise() that collapses many rows into one later in this chapter.

4.5.1 Exercise: Ages of dogs

Use mutate() to add a column age_at_issue that contains the dogs approximate age on the day the license was issued.

4.5.2 Exercise: Unknown breeds

Use mutate() along with ifelse() to create a column breed that takes values "unknown" if breed_name is "unknown" and "known" otherwise.

Extra challenge Can you figure out if the licenses issued to unknown breed dogs are of longer or shorter duration on average than known breed dogs?

4.5.3 Exercise: Name length

The function str_length() in the stringr package finds the length of character strings. Replace the ___ in following code to add a column called name_length that contains the length of the dog’s name:

dog_licenses %>%
  ___(___ = stringr::str_length(animal_name))

Now add an arrange() step to find the licenses issued to dogs with the longest names?

4.6 How can I tell what’s gone wrong in my programs?

FIXME: The errors shown in the markdown are way more informative than those in the Console. Try to get the error displaying in the book like they do for someone in the console?

Let me share an interaction my (CVW’s) husband had at a Trader Joes (a small specialized supermarket) soon after we arrived in the USA from New Zealand.

Josh: “Do you sell bat-trees?”

Store-person: “What?”

Josh: “Do you sell bat-trees?”

Store person: “Huh? Bat…trees?”

Josh: "Do you sell bat-er-ries?

Store-person: “Oh…you mean batteries! No. We don’t sell batteries.”

This is a pretty accurate analogy for what it feels like when you are learning R. You know what you want, but you have to ask for it in a way R understands. When R doesn’t understand you, or when R can’t give you what you want, you’ll get an error. You’ll know when you get one in R because the only output you see will start with Error.

buy_batteries(store = "Trader Joes")
## Error in buy_batteries(store = "Trader Joes"): could not find function "buy_batteries"

It’s also a good illustration of the two kinds of problems that occur: syntax errors and runtime errors. Syntax errors are like the “What? Huh?” moments. R doesn’t understand what you are asking it to do, because something about the way you are asking doesn’t conform to what R expects and it will not even try to run your code. Runtime errors are more like the “No, I can’t help you” moments. R understands what you are asking and runs your code, but during the run something goes wrong and R has to stop running the code.

You generally want to make sure you’ve ruled out syntax errors before assuming it’s a runtime error. Unfortunately, R doesn’t distinguish these two types of error in its output, so we’ll discuss some of the most common examples in the following sections.

This analogy is also a reminder that sometimes you’ll have to repeat yourself. With R, giving the exact same instruction should always result in the exact same error, but it’s not uncommon, even for longtime R users, to run and edit a line of code multiple times before it runs without error.

There is one flaw in this analogy: R isn’t a real person. So, if you need to vent your frustration by cursing it, insulting its parentage or storming off, it’s fine, no one’s feelings will be hurt. Of course, it won’t change R’s response…

4.6.1 Common syntax errors

Syntax errors occur when your code can’t be broken into its component pieces by R. For example, R expects the arguments to a function to start after the opening parenthesis ((), be separated by a comma (,) and finish at the closing parenthesis ()). When R sees this code:

filter(dog_licenses breed_name == "Finnish Lapphund")
## Error: <text>:1:21: unexpected symbol
## 1: filter(dog_licenses breed_name
##                         ^

The missing comma means R can’t figure out where the first argument to filter() ends: is it after dog_licenses, after breed_name, or after ==?

Take a closer look at the error message. Error messages always begin with Error, then optionally the name of the function that returned an error (not in this example), followed by a :, and some description of the error that occurred. The message unexpected symbol is one common kind of syntax error - in this case R encountered some code where it was expecting a comma or closing parenthesis. We can fix it by putting in the missing comma:

filter(dog_licenses, breed_name == "Finnish Lapphund")
## # A tibble: 1 x 15
##   row_number animal_name animal_gender animal_birth_month breed_name    borough 
##        <dbl> <chr>       <chr>         <date>             <chr>         <chr>   
## 1     118408 FREDDIE     M             2011-11-01         Finnish Lapp… Manhatt…
## # … with 9 more variables: zip_code <dbl>, community_district <dbl>,
## #   census_tract_2010 <dbl>, neighborhood_tabulation_area <chr>,
## #   city_council_district <dbl>, congressional_district <dbl>,
## #   state_senatorial_district <dbl>, license_issued_date <date>,
## #   license_expired_date <date>

Syntax errors are usually the result of typos. Some things to keep an eye out for:

  • If you are modelling your code on an example, pay very close attention to the punctuation: commas ,, parenthesis (, ), brackets, [, ], and quotes ", '. Every opening parenthesis, bracket, or quote needs a matching closing one, and they must be closed in the reverse order they were opened.

  • New lines don’t matter if they happen between arguments, or after pipe operators, but they can be problematic in other locations.

  • R ignores other whitespace (spaces or tabs) unless it’s inside a character string (i.e. inside quotes), so different spacing shouldn’t be the cause of an error, but it is a good idea to follow good style for spacing. FIXME: link to style guide section.

4.6.2 Exercise: Syntax errors

Fix these syntax errors.

  • dog_licenses %>%
      arrange(desc(license_expired_date)))
    ## Error: <text>:2:38: unexpected ')'
    ## 1: dog_licenses %>%
    ## 2:   arrange(desc(license_expired_date)))
    ##                                         ^
  • dog_licenses %>%
      mutate(
        month_born = lubridate::month(animal_birth_month)
        year_born = lubridate::year(animal_birth_month))
    ## Error: <text>:4:5: unexpected symbol
    ## 3:     month_born = lubridate::month(animal_birth_month)
    ## 4:     year_born
    ##        ^
  • dog_licenses %>%
      filter(animal_name == "BRUNO)
    ## Error: <text>:2:25: unexpected INCOMPLETE_STRING
    ## 1: dog_licenses %>%
    ## 2:   filter(animal_name == "BRUNO)
    ##                            ^

    (If you run this code in the Console, you might not get an error, but you should see a + on a new line, a signal that R is waiting for more input and a clue that there is something missing in this code).

  • dog_licenses
      %>% filter(animal_gender == "M")
    ## Error: <text>:2:3: unexpected SPECIAL
    ## 1: dog_licenses
    ## 2:   %>%
    ##      ^

4.6.3 Common runtime errors

Runtime errors come in an infinite number of flavors because there are so many ways that you ask for that might be impossible to do. For example you might try to do arithmetic with character strings:

"apple" + "banana"
## Error in "apple" + "banana": non-numeric argument to binary operator

Or try to filter with something that isn’t a logical:

dog_licenses %>% filter(animal_name)
## Error: Problem with `filter()` input `..1`.
## ✖ Input `..1` must be a logical vector, not a character.
## ℹ Input `..1` is `animal_name`.

Perhaps the most common runtime error is of the form Error: object not found:

an_object_i_dont_have
## Error in eval(expr, envir, enclos): object 'an_object_i_dont_have' not found

This is R complaining that you’ve asked it to operate on an object that it doesn’t know about. Often this is actually a typo in disguise, for example you’ve misspelled the name of the object,

my_object <- 12
my_objet
## Error in eval(expr, envir, enclos): object 'my_objet' not found

you’ve used the wrong case,

My_object
## Error in eval(expr, envir, enclos): object 'My_object' not found

or you’ve forgotten that you’ve used a separator

myobject
## Error in eval(expr, envir, enclos): object 'myobject' not found

This error also often arises when you forgot quotes around strings. For example, if we want all the dogs that are male, and try

dog_licenses %>% filter(animal_gender == M)
## Error: Problem with `filter()` input `..1`.
## ✖ object 'M' not found
## ℹ Input `..1` is `animal_gender == M`.

you get an error because R is looking for an object called M to compare to the values in the column called animal_gender. What we really wanted to do was compare the values in animal_gender to the string "M":

dog_licenses %>% filter(animal_gender == "M")
## # A tibble: 64,770 x 15
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1        548 ROCKY       M             2014-05-01        Labrador Retr… Queens 
##  2        622 BULLY       M             2010-07-01        American Pit … Queens 
##  3        633 COCO        M             2005-02-01        Labrador Retr… Queens 
##  4        872 CHASE       M             2013-11-01        Shih Tzu       Queens 
##  5        874 CHEWY       M             2014-09-01        Shih Tzu       Queens 
##  6        875 CHASE       M             2008-08-01        Labrador Retr… Queens 
##  7        976 APOLLO      M             2014-10-01        American Pit … Queens 
##  8       1297 JERRY       M             2009-06-01        Labrador Retr… Queens 
##  9       2133 SIMON       M             2010-12-01        Havanese       Queens 
## 10       2289 BUDDY       M             2012-06-01        Labrador Retr… Queens 
## # … with 64,760 more rows, and 9 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>

4.6.4 Exercise: object not found

Fix these object not found errors. (Hint: the names of the objects or variables being created should give you a clue to the intent of the code)

  • dog_licenses %>%
      mutate(year_issued = lubridate::year(Liscenceissuedate))
    ## Error: Problem with `mutate()` input `year_issued`.
    ## ✖ object 'Liscenceissuedate' not found
    ## ℹ Input `year_issued` is `lubridate::year(Liscenceissuedate)`.
  • dogs_named_bruno <- dog_licenses %>%
      filter(animal_name == BRUNO)
    ## Error: Problem with `filter()` input `..1`.
    ## ✖ object 'BRUNO' not found
    ## ℹ Input `..1` is `animal_name == BRUNO`.

4.6.5 Warnings and messages

There are two other kinds of alerts R can give: warnings and messages. These can both appear in the console with the same color as an error, but they are informational as opposed to fatal.

Messages are purely informational, for example when you read data in with read_csv() you get a message that describes the columns and their data types as parsed by the function:

sites <- read_csv("site.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   site_id = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

Warnings generally alert you that something was slightly unexpected but that R recovered and gave you a result anyway. Poorly formatted CSV files will often result in warnings from read_csv():

bad_csv <- "
id,
1, 2
2, 1
3, 5
"
bad <- read_csv(bad_csv)
## Warning: Missing column names filled in: 'X2' [2]

Here there was a missing column name. read_csv() still returns an object but the warning alerts you that it made some assumption to get it, i.e. that it made up a column name:

bad
## # A tibble: 3 x 2
##      id    X2
##   <dbl> <dbl>
## 1     1     2
## 2     2     1
## 3     3     5

Warnings don’t stop you from proceeding, but they should alert you to question whether you should be proceeding.

4.6.6 What do I do when I get an Error I can’t fix?

  • Check that the error is reproducible. Restart R with a clean slate and re-run your code up to and including the code that gives the error. This is the R version of the classic tech advice to “turn it off, then turn it on again.” FIXME: link to reproducibility chapter.

  • Try searching for it online: for example, searching for “R unexpected string constant” lead me to the question [“Error: unexpected symbol/input/string constant/numeric constant/SPECIAL in my code” on StackOverflow][error-unexpected] which gives some great examples of ways this error might arise.

  • Ask for help. You are most likely to get help when you can provide a reproducible example. Stack Overflow has detailed instruction on how to create a [minimal reproducible example][so-mre] when asking a question to increase the chances that the question receives a specific and helpful answer. The key principles listed on the website recommends that an answer follows these guidelines:

    • Minimal – Use as little code as possible that still produces the same problem
    • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
    • Reproducible – Test the code you’re about to provide to make sure it reproduces the problem

4.7 How can I operate on subsets of my data?

The syntax for summarise() is the same as mutate() but it expects operations that reduce all rows down to one row. Recall from mutate() that this code added the average license duration to every row of the data:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  mutate(avg_duration = mean(license_duration))

You actually saw this in one mutate() statement, but I’ve separated out the line that calculates the average so it’s easier to see the difference with summarise(). See what happens when you switch out the final mutate() with summarise():

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  summarise(avg_duration = mean(license_duration))
## # A tibble: 1 x 1
##   avg_duration
##   <drtn>      
## 1 467.321 days

Instead of the the one value repeated on every row, we get a new tibble with only one row, and a single column that corresponds to our requested summary.

Any function that takes many values and reduces them to one is a good candidate for summarise(), for example we could find the shortest licence duration by swapping in min() instead of mean():

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  summarise(shortest_duration = min(license_duration))
## # A tibble: 1 x 1
##   shortest_duration
##   <drtn>           
## 1 1 days

Like mutate() you can also create multiple summary columns at once:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration)
  )
## # A tibble: 1 x 3
##   avg_duration shortest_duration longest_duration
##   <drtn>       <drtn>            <drtn>          
## 1 467.321 days 1 days            2191 days

Lot’s of statistical operations produce one numbers summaries and are appropriate for use with summarise(): sd(), min(), max(), mean(), median(), quantile() (with a single argument). You can see more examples in the [“Useful functions” section of the summarise() documentation][summarise-useful]. Whenever you are summarizing many rows, it’s a good idea to keep track of how many rows were summarized. This is so common, dplyr provides a special function, n(), that simply counts the number of rows. To add it to your summary:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )
## # A tibble: 1 x 4
##   avg_duration shortest_duration longest_duration n_licenses
##   <drtn>       <drtn>            <drtn>                <int>
## 1 467.321 days 1 days            2191 days            118542

Now, imagine you want this summary just for licenses issued to dogs in the Bronx. You might do something like:

Take the dog_licenses data, and then, mutate to add a column called license_duration, and then filter to keep rows where the borough is "Bronx", and then summarise to find the mean, min and max duration along with the number of rows.

In code:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  filter(borough == "Bronx") %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )
## # A tibble: 1 x 4
##   avg_duration  shortest_duration longest_duration n_licenses
##   <drtn>        <drtn>            <drtn>                <int>
## 1 435.9884 days 2 days            1919 days             12043

But how does this compare to Brooklyn? You could do the same operation again, but now for Brooklyn:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  filter(borough == "Brooklyn") %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )
## # A tibble: 1 x 4
##   avg_duration  shortest_duration longest_duration n_licenses
##   <drtn>        <drtn>            <drtn>                <int>
## 1 465.8845 days 2 days            2191 days             29334

What about Queens? This kind of operation—summarising different subsets of the same data—is so common there is a much easier way to do it: combining summarise() with group_by().

The only difference in the code, is that instead of filtering for a specific borough we’ll group_by() the column borough.

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  group_by(borough) %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )
## # A tibble: 57 x 5
##    borough      avg_duration  shortest_duration longest_duration n_licenses
##    <chr>        <drtn>        <drtn>            <drtn>                <int>
##  1 ARVERNE      334.0000 days 334 days           334 days                 1
##  2 Astoria      498.0000 days 239 days           757 days                 2
##  3 ASTORIA      387.6667 days 366 days           405 days                 3
##  4 B            347.0000 days 347 days           347 days                 1
##  5 Bayside      410.0000 days 410 days           410 days                 1
##  6 BELLE HARBOR 309.0000 days 309 days           309 days                 1
##  7 Briarwood    540.5000 days 358 days           723 days                 2
##  8 Bronx        435.9884 days   2 days          1919 days             12043
##  9 BRONX        370.3333 days  72 days           418 days               102
## 10 Brooklyn     465.8845 days   2 days          2191 days             29334
## # … with 47 more rows

The group_by() verb doesn’t perform any changes to the data except to add a signal that this data is now grouped. Subsequent operations will then happen within these groups. In the case of summarise() we now get one row per group, and these are all stacked together in our result.

You might have been a little surprised by the result above. I thought there were only five boroughs in New York (at least that’s what the Beastie Boys told me). Notice some boroughs are represented more than once by variations in case or spelling: Bronx, BRONX. As far as group_by() is concerned these are distinct values of this variable. There also seem to be smaller designations than Borough in this data. You’ll get a chance to try and resolve this in an exercise below.

4.7.1 Exercise: Dog birth months

The following code creates a new column month_born that holds the name of the month the licensed dog was born:

dog_licenses %>%
  mutate(month_born = lubridate::month(animal_birth_month, label = TRUE))

Use a group_by() step and a summarise() step to find the number of dogs born in each month. Which month stands out? Can you guess why?

4.7.2 Exercise: Order matters?

In the example above for dogs licensed in Brooklyn:

dog_licenses %>%
  mutate(license_duration = license_expired_date - license_issued_date) %>%
  filter(borough == "Brooklyn") %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )

the filter() step came after the mutate() step. Does this matter?

  • Swap the order in the code and see if you get the same results.

  • Write out how you might describe the steps. Is it obvious you can swap the filter and mutate step and get the same results?

  • Which steps can’t you swap the order of? Why?

  • Despite giving the same results, some orderings of the data manipulation steps will take longer to compute. Can you guess why?

4.7.3 Exercise: the five boroughs

The column neighborhood_tabulation_area is a code for the “Neighborhood Tabulation Areas,” and has been geo-coded from the licensee’s address (as opposed to self reported). The first two characters correspond to the Borough.

This code creates a new variable called borough_code that contains just these two characters:

dog_licenses %>%
  mutate(borough_code = stringr::str_sub(neighborhood_tabulation_area, 1, 2))
## # A tibble: 118,542 x 16
##    row_number animal_name animal_gender animal_birth_mon… breed_name     borough
##         <dbl> <chr>       <chr>         <date>            <chr>          <chr>  
##  1        533 BONITA      F             2013-05-01        Unknown        Queens 
##  2        548 ROCKY       M             2014-05-01        Labrador Retr… Queens 
##  3        622 BULLY       M             2010-07-01        American Pit … Queens 
##  4        633 COCO        M             2005-02-01        Labrador Retr… Queens 
##  5        655 SKI         F             2012-09-01        American Pit … Queens 
##  6        872 CHASE       M             2013-11-01        Shih Tzu       Queens 
##  7        874 CHEWY       M             2014-09-01        Shih Tzu       Queens 
##  8        875 CHASE       M             2008-08-01        Labrador Retr… Queens 
##  9        893 MILEY       F             2008-07-01        Boxer          Queens 
## 10        919 KENZI       F             2010-05-01        Schnauzer, Mi… Queens 
## # … with 118,532 more rows, and 10 more variables: zip_code <dbl>,
## #   community_district <dbl>, census_tract_2010 <dbl>,
## #   neighborhood_tabulation_area <chr>, city_council_district <dbl>,
## #   congressional_district <dbl>, state_senatorial_district <dbl>,
## #   license_issued_date <date>, license_expired_date <date>, borough_code <chr>

Which borough has the longest average licence duration?

4.8 How can I read my own tabular data into R?

4.8.1 What is tabular data?

Tabular data describes data that is in the form of a table: values arranged in rows each of the same length, or equivalently values arranged in columns each of the same length. Here’s a small example of some tabular data:

site_id latitude longitude
DR-1 -49.85 -128.57
DR-3 -47.15 -126.72
MSK-4 -48.87 -123.40

Each row records information on a site at which water measurements are taken. There are three columns: a site identification code, and the location of the site in latitude and longitude.

This is an incredibly common way of displaying data, but when storing tabular data in a file, we need a way to communicate when records and values begin and end. A very popular format for doing this is CSV. CSV, is short for comma separated values, and like the name suggests, a comma, ,, is used to separate the values for each column, while each record goes on a new line. The file is plain text, but we use the extension .csv to indicate that is follows the CSV format conventions.

Here’s how the table above would look inside the CSV file site.csv:

site_id,latitude,longitude
DR-1,-49.85,-128.57
DR-3,-47.15,-126.72
MSK-4,-48.87,-123.4

In this case the first line has the column names, this is common (and recommended!), but not universal.

Why is CSV so popular?

  • It’s human readable. CSV isn’t a special file type, it is a simple plain text file that follows some conventions. This means you don’t need any special software to look at the contents—you can open it up in anything that can examine text and take a look inside.
  • It’s computer readable. Because CSV files all have the same structure it’s easy to write computer programs to read them. This means in almost any program designed to work with data, which is basically all the common programming languages, you’ll find functions that will import CSV files. This also means it’s easy to create CSV files—you can export them from Excel, write them from R, or even write one from scratch in a text editor.

If you want to look inside a CSV file in RStudio you can navigate to its location in the “Files” pane and click on its name. Selecting “View File,” will open it in the Source pane.

However, if you want to work with CSV data in R, it isn’t enough to look inside the file. You need to read the contents of the file and store it in R’s memory. This process is known as data import.

4.8.2 Importing CSV data into R

To work with data in R you need to have it in R’s memory. The read_csv() function in the readr package will import a CSV file, and represent it as a tibble, if you give it the location of the CSV file. For example, to read the site.csv data and store it in an object called sites:

library(tidyverse)
sites <- read_csv(here("data", "site.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   site_id = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
sites
## # A tibble: 3 x 3
##   site_id latitude longitude
##   <chr>      <dbl>     <dbl>
## 1 DR-1       -49.8     -129.
## 2 DR-3       -47.2     -127.
## 3 MSK-4      -48.9     -123.

FIXME: talk about files paths, point reader to the place where file paths are talked about, or assume file is in their working directory.

Notice that read_csv() gave us a message about what it did: it parsed our data file and found three columns site_id, latitude and longitude. It also mentions what kind of data it assumed was in each column. FIXME: point to further discussion of data types.

The object sites is now R’s representation of the data from the site.csv file.

4.8.3 Exercise: Import visited.csv

Use read_csv() to read visited.csv into R. How does R indicate a cell with a missing value?

visited <- read_csv(here("data", "visited.csv"))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   visit_id = col_double(),
##   site_id = col_character(),
##   visit_date = col_date(format = "")
## )
visited
## # A tibble: 8 x 3
##   visit_id site_id visit_date
##      <dbl> <chr>   <date>    
## 1      619 DR-1    1927-02-08
## 2      622 DR-1    1927-02-10
## 3      734 DR-3    1930-01-07
## 4      735 DR-3    1930-01-12
## 5      751 DR-3    1930-02-26
## 6      752 DR-3    NA        
## 7      837 MSK-4   1932-01-14
## 8      844 DR-1    1932-03-22

4.8.4 Exercise: Import IRS tax return data for New York City

Use read_csv() to import the CSV file nyc-tax-returns.csv.

FIXME: Should we introduce the “common things that go wrong” / “the most common additional arguments,” e.g. skip, na, col_names, col_types? My feeling is not now, but sometime later.

4.9 How can I save my results?

Say, you’ve now got a summary of the license durations by borough:

dog_licenses %>%
  mutate(
    license_duration = license_expired_date - license_issued_date,
    borough_code = stringr::str_sub(neighborhood_tabulation_area, 1, 2)) %>%
  group_by(borough_code) %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )
## # A tibble: 5 x 5
##   borough_code avg_duration  shortest_duration longest_duration n_licenses
##   <chr>        <drtn>        <drtn>            <drtn>                <int>
## 1 BK           465.2768 days 2 days            2191 days             29558
## 2 BX           435.8062 days 2 days            1919 days             12050
## 3 MN           494.0276 days 1 days            2189 days             41668
## 4 QN           452.1386 days 2 days            2186 days             24420
## 5 SI           439.4875 days 4 days            2164 days             10846

How do you save this result for future use?

If you just need this tibble later in your code you can assign it to a variable:

duration_by_borough <- dog_licenses %>%
  mutate(
    license_duration = license_expired_date - license_issued_date,
    borough_code = stringr::str_sub(neighborhood_tabulation_area, 1, 2)) %>%
  group_by(borough_code) %>%
  summarise(
    avg_duration = mean(license_duration),
    shortest_duration = min(license_duration),
    longest_duration = max(license_duration),
    n_licenses = n()
  )

Then you can access it whenever you need it:

duration_by_borough
## # A tibble: 5 x 5
##   borough_code avg_duration  shortest_duration longest_duration n_licenses
##   <chr>        <drtn>        <drtn>            <drtn>                <int>
## 1 BK           465.2768 days 2 days            2191 days             29558
## 2 BX           435.8062 days 2 days            1919 days             12050
## 3 MN           494.0276 days 1 days            2189 days             41668
## 4 QN           452.1386 days 2 days            2186 days             24420
## 5 SI           439.4875 days 4 days            2164 days             10846

This keeps our result around in memory, but often you also want to preserve the data in a file on disk. There are two common choices for format: CSV and RDS.

You’ve already seen CSV files. Saving your results in this format gives you all the benefits of that format: plain text files easily shared and opened. You can save a tibble as a CSV file with the readr function write_csv(), where all you need to specify is the path:

duration_by_borough %>% write_csv("duration-by-borough.csv")

You can then read this file in any project or R session with read_csv():

duration_by_borough <- read_csv("duration-by-borough.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   borough_code = col_character(),
##   avg_duration = col_double(),
##   shortest_duration = col_double(),
##   longest_duration = col_double(),
##   n_licenses = col_double()
## )

RDS files are a special R format. They are binary files as opposed to plain text files, which means you can’t just open them up and look inside. If you are sharing them you’d also need your collaborators to have R. These are downsides, but the advantage of this format is that it can be much quicker to load, it will preserve special R data types (for example factors, or nested structures), and can save any R object not just tabular data structures.

To save the tibble as an RDS, use write_rds():

duration_by_borough %>% write_rds("duration-by-borough.rds")

To read it back in, use read_rds():

duration_by_borough <- read_rds("duration-by-borough.rds")

Often you’ll save your data in both formats to make sure you get the best of both worlds. You’ll talk more about where to save your data in FIXME: add link.

4.9.1 Exercise

Save the tibble with our extra columns:

dog_licenses %>%
  mutate(
    license_duration = license_expired_date - license_issued_date,
    borough_code = stringr::str_sub(neighborhood_tabulation_area, 1, 2))

into a CSV file called dog-licenses-extra.csv.