It is rare that we get the data in exactly the right form we need.
We'll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.
We'll discuss how to transform data using the dplyr
package and a dataset on flights departing New York City in 2013.
library(tidyverse)library(nycflights13)head(flights)View(flights) ?flights
dplyr
basics
dplyr
functions to solve various data manipulation challenges:filter()
).arrange()
).select()
).mutate()
).summarise()
). dplyr
basics
filter(<DATA.FRAME>, LOGICAL_STATEMENT, !is.na(VARIABLE))
arrange(<DATA.FRAME>, VARIABLE1, desc(VARIABLE2), -VARIABLE3)
select(<DATA.FRAME>, VARIABLE, -VARIABLE2, starts_with("abc"), contains("time"), A:Z)
mutate(<DATA.FRAME>, ...)
summarise(<DATA.FRAME>, ...)
mutate()
mutate()
mutate()
is useful to add new columns that are functions of existing columns.flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time )mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 )mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours )
mutate()
transmute()
.transmute(flights, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours)
mutate()
Useful creation functions
There are many functions for creating new variables that we can use with mutate()
.
Arithmetic operators: +
, -
, *
, /
, ^
%/%
(integer division) and %%
(remainder).transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100)
mutate()
There are many functions for creating new variables that we can use with mutate()
.
Offsets: lead()
and lag()
allow us to refer to leading or lagging values.
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum()
, cumprod()
, cummin()
, cummax()
, and cummean()
(x <- 1:10)lag(x)lead(x)cumsum(x)cummean(x)
mutate()
Useful creation functions
mutate()
.min_rank()
, dense_rank()
, row_number()
, , percent_rank()
, and cume_dist()
.rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )rank_me <- mutate(rank_me, id_x = row_number(x), x_min_rank = min_rank(x), x_dense_rank = dense_rank(x) )
row_number()
assigns each element a unique value.
min_rank()
and dense_rank()
assign tied values the same rank.
min_rank()
function assigns a rank equal to the number of values less than that tied value plus one.
dense_rank()
function assigns a rank equal to the number of distinct values less than that tied value plus one.
mutate()
Useful creation functions
if_else()
and ifelse()
functions. if_else(<condition>, <if TRUE>, <else>)
ifelse(<condition>, <if TRUE>, <else>)
flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))flight_season <-mutate(flights, # ifelse() is less restrictive summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))
mutate()
Exercises
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Compare air_time
with arr_time - dep_time
. What do you expect to see? What do you see? What do you need to do to fix it?
Compare dep_time
, sched_dep_time
, and dep_delay.
How would you expect those three numbers to be related?
Find the 10 most delayed flights using a ranking function. How do you want to handle ties?
summarize()
summarize()
summarize()
collapses a data frame to a single row.na.rm
argument removes the missing values prior to computation when using summarize()
or other aggregate functions.summarize()
is not terribly useful unless we pair it with group_by()
.summarize(flights, delay = mean(dep_delay, na.rm = TRUE))by_day <- group_by(flights, year, month, day)summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
summarize()
Example
distance
and the mean level arr_delay
for each value of dest
.by_dest <- group_by(flights, dest)delay <- summarize(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))delay <- filter(delay, count > 20, dest != "HNL")ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
summarize()
Combining multiple operations with the pipe (%>%
)
%>%
).delays <- flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL")
summarize()
Missing values
na.rm
?flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay))flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay, na.rm = TRUE))
summarize()
Missing values
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))not_cancelled %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay))
summarize()
Counts
tailnum
)) that have the highest average delaysdelays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay) )ggplot(data = delays, mapping = aes(x = delay)) + geom_freqpoly(binwidth = 10)
summarize()
Counts
n()
), or a count of non-missing values (sum(!is.na(x))
).delays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay, na.rm = TRUE), n = n() )ggplot(data = delays, mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)
summarize()
Useful summary functions
Measures of location: mean(x)
and median(x)
.
Measures of spread: sd(x)
, IQR(x)
.
not_cancelled %>% group_by(year, month, day) %>% summarize( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) ) # the average positive delay# Brackets[] lets us filter (or subset) data from a column# Why is distance to some destinations more variable than to others?not_cancelled %>% group_by(dest) %>% summarize(distance_sd = sd(distance)) %>% arrange(desc(distance_sd))
summarize()
Useful summary functions
min(x)
, quantile(x, 0.25)
, max(x)
, range(x)
.first(x)
, nth(x, 2)
, last(x)
.not_cancelled %>% group_by(year, month, day) %>% summarize(first = min(dep_time), last = max(dep_time) )
not_cancelled %>% group_by(year, month, day) %>% summarize(first_dep = first(dep_time),last_dep = last(dep_time) )
not_cancelled %>% group_by(year, month, day) %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r) )
summarize()
Useful summary functions
sum(!is.na(x))
, n_distinct(x)
, count(x)
, and count(x, wt = VARIABLE)
.# Which destinations have the most carriers?not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers))not_cancelled %>% count(dest)not_cancelled %>% count(tailnum, wt = distance)
summarize()
Useful summary functions
dplyr::count()
and ggplot2::cut_width()
:ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)diamonds %>% count(cut_width(carat, 0.5))
summarize()
Useful summary functions
sum(x > 10)
, mean(y == 0)
.# How many flights left before 5am? # (these usually indicate delayed flights from the previous day)not_cancelled %>% group_by(year, month, day) %>% summarize(n_early = sum(dep_time < 500))# What proportion of flights are delayed by more than an hour?not_cancelled %>% group_by(year, month, day) %>% summarize(hour_prop = mean(arr_delay > 60))
summarize()
Grouping by multiple variables
daily <- group_by(flights, year, month, day)(per_day <- summarize(daily, flights = n()))(per_month <- summarize(per_day, flights = sum(flights)))(per_year <- summarize(per_month, flights = sum(flights)))
summarize()
Ungrouping
ungroup()
.daily %>% ungroup() %>% # no longer grouped by date summarize(flights = n()) # all flights
summarize()
Exercises
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Q. Which is more important: arrival delay or departure delay?
summarize()
Exercises
Come up with another approach that will give you the same output as not_cancelled
%>%
count(dest)
and not_cancelled %>% count(tailnum, wt = distance)
(without using count()
).
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
summarize()
Exercises
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarize(n())
)
What does the sort argument to count()
do. When might you use it?
summarize()
, but you can also do convenient operations with mutate()
and filter()
.flights %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10)
popular_dests <- flights %>% group_by(dest) %>% filter(n() > 17250)popular_dests
popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay)
Exercises
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum
) has the worst on-time record?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
Exercises
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag()
, explore how the delay of a flight is related to the delay of the immediately preceding flight.
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Exercises
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
It is rare that we get the data in exactly the right form we need.
We'll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.
We'll discuss how to transform data using the dplyr
package and a dataset on flights departing New York City in 2013.
library(tidyverse)library(nycflights13)head(flights)View(flights) ?flights
dplyr
basics
dplyr
functions to solve various data manipulation challenges:filter()
).arrange()
).select()
).mutate()
).summarise()
). dplyr
basics
filter(<DATA.FRAME>, LOGICAL_STATEMENT, !is.na(VARIABLE))
arrange(<DATA.FRAME>, VARIABLE1, desc(VARIABLE2), -VARIABLE3)
select(<DATA.FRAME>, VARIABLE, -VARIABLE2, starts_with("abc"), contains("time"), A:Z)
mutate(<DATA.FRAME>, ...)
summarise(<DATA.FRAME>, ...)
mutate()
mutate()
mutate()
is useful to add new columns that are functions of existing columns.flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time )mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 )mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours )
mutate()
transmute()
.transmute(flights, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours)
mutate()
Useful creation functions
There are many functions for creating new variables that we can use with mutate()
.
Arithmetic operators: +
, -
, *
, /
, ^
%/%
(integer division) and %%
(remainder).transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100)
mutate()
There are many functions for creating new variables that we can use with mutate()
.
Offsets: lead()
and lag()
allow us to refer to leading or lagging values.
Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum()
, cumprod()
, cummin()
, cummax()
, and cummean()
(x <- 1:10)lag(x)lead(x)cumsum(x)cummean(x)
mutate()
Useful creation functions
mutate()
.min_rank()
, dense_rank()
, row_number()
, , percent_rank()
, and cume_dist()
.rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )rank_me <- mutate(rank_me, id_x = row_number(x), x_min_rank = min_rank(x), x_dense_rank = dense_rank(x) )
row_number()
assigns each element a unique value.
min_rank()
and dense_rank()
assign tied values the same rank.
min_rank()
function assigns a rank equal to the number of values less than that tied value plus one.
dense_rank()
function assigns a rank equal to the number of distinct values less than that tied value plus one.
mutate()
Useful creation functions
if_else()
and ifelse()
functions. if_else(<condition>, <if TRUE>, <else>)
ifelse(<condition>, <if TRUE>, <else>)
flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))flight_season <-mutate(flights, # ifelse() is less restrictive summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))
mutate()
Exercises
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Compare air_time
with arr_time - dep_time
. What do you expect to see? What do you see? What do you need to do to fix it?
Compare dep_time
, sched_dep_time
, and dep_delay.
How would you expect those three numbers to be related?
Find the 10 most delayed flights using a ranking function. How do you want to handle ties?
summarize()
summarize()
summarize()
collapses a data frame to a single row.na.rm
argument removes the missing values prior to computation when using summarize()
or other aggregate functions.summarize()
is not terribly useful unless we pair it with group_by()
.summarize(flights, delay = mean(dep_delay, na.rm = TRUE))by_day <- group_by(flights, year, month, day)summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
summarize()
Example
distance
and the mean level arr_delay
for each value of dest
.by_dest <- group_by(flights, dest)delay <- summarize(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE))delay <- filter(delay, count > 20, dest != "HNL")ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
summarize()
Combining multiple operations with the pipe (%>%
)
%>%
).delays <- flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL")
summarize()
Missing values
na.rm
?flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay))flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay, na.rm = TRUE))
summarize()
Missing values
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))not_cancelled %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay))
summarize()
Counts
tailnum
)) that have the highest average delaysdelays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay) )ggplot(data = delays, mapping = aes(x = delay)) + geom_freqpoly(binwidth = 10)
summarize()
Counts
n()
), or a count of non-missing values (sum(!is.na(x))
).delays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay, na.rm = TRUE), n = n() )ggplot(data = delays, mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)
summarize()
Useful summary functions
Measures of location: mean(x)
and median(x)
.
Measures of spread: sd(x)
, IQR(x)
.
not_cancelled %>% group_by(year, month, day) %>% summarize( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) ) # the average positive delay# Brackets[] lets us filter (or subset) data from a column# Why is distance to some destinations more variable than to others?not_cancelled %>% group_by(dest) %>% summarize(distance_sd = sd(distance)) %>% arrange(desc(distance_sd))
summarize()
Useful summary functions
min(x)
, quantile(x, 0.25)
, max(x)
, range(x)
.first(x)
, nth(x, 2)
, last(x)
.not_cancelled %>% group_by(year, month, day) %>% summarize(first = min(dep_time), last = max(dep_time) )
not_cancelled %>% group_by(year, month, day) %>% summarize(first_dep = first(dep_time),last_dep = last(dep_time) )
not_cancelled %>% group_by(year, month, day) %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r) )
summarize()
Useful summary functions
sum(!is.na(x))
, n_distinct(x)
, count(x)
, and count(x, wt = VARIABLE)
.# Which destinations have the most carriers?not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers))not_cancelled %>% count(dest)not_cancelled %>% count(tailnum, wt = distance)
summarize()
Useful summary functions
dplyr::count()
and ggplot2::cut_width()
:ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)diamonds %>% count(cut_width(carat, 0.5))
summarize()
Useful summary functions
sum(x > 10)
, mean(y == 0)
.# How many flights left before 5am? # (these usually indicate delayed flights from the previous day)not_cancelled %>% group_by(year, month, day) %>% summarize(n_early = sum(dep_time < 500))# What proportion of flights are delayed by more than an hour?not_cancelled %>% group_by(year, month, day) %>% summarize(hour_prop = mean(arr_delay > 60))
summarize()
Grouping by multiple variables
daily <- group_by(flights, year, month, day)(per_day <- summarize(daily, flights = n()))(per_month <- summarize(per_day, flights = sum(flights)))(per_year <- summarize(per_month, flights = sum(flights)))
summarize()
Ungrouping
ungroup()
.daily %>% ungroup() %>% # no longer grouped by date summarize(flights = n()) # all flights
summarize()
Exercises
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Q. Which is more important: arrival delay or departure delay?
summarize()
Exercises
Come up with another approach that will give you the same output as not_cancelled
%>%
count(dest)
and not_cancelled %>% count(tailnum, wt = distance)
(without using count()
).
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
summarize()
Exercises
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarize(n())
)
What does the sort argument to count()
do. When might you use it?
summarize()
, but you can also do convenient operations with mutate()
and filter()
.flights %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10)
popular_dests <- flights %>% group_by(dest) %>% filter(n() > 17250)popular_dests
popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay)
Exercises
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum
) has the worst on-time record?
What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
Exercises
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag()
, explore how the delay of a flight is related to the delay of the immediately preceding flight.
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
Exercises
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
For each plane, count the number of flights before the first delay of greater than 1 hour.