+ - 0:00:00
Notes for current slide
Notes for next slide

Lecture 14


DANL 200: Introduction to Data Analytics

Byeong-Hak Choe

October 18, 2022

1 / 36

Data Transformation


2 / 36

Data Transformation

  • It is rare that we get the data in exactly the right form we need.

  • We'll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.

  • We'll discuss how to transform data using the dplyr package and a dataset on flights departing New York City in 2013.

library(tidyverse)
library(nycflights13)
head(flights)
View(flights)
?flights
3 / 36

Data Transformation

dplyr basics

  • We are going to discuss the five key dplyr functions to solve various data manipulation challenges:
    • Pick observations by their values (filter()).
    • Reorder the rows (arrange()).
    • Pick variables by their names (select()).
    • Create new variables with functions of existing variables (mutate()).
    • Collapse many values down to a single summary (summarise()).
4 / 36

Data Transformation

dplyr basics

  • filter(<DATA.FRAME>, LOGICAL_STATEMENT, !is.na(VARIABLE))
  • arrange(<DATA.FRAME>, VARIABLE1, desc(VARIABLE2), -VARIABLE3)
  • select(<DATA.FRAME>, VARIABLE, -VARIABLE2, starts_with("abc"), contains("time"), A:Z)
  • mutate(<DATA.FRAME>, ...)
  • summarise(<DATA.FRAME>, ...)
  • The first argument is a data frame.
  • The subsequent arguments describe what to do with the data frame, using the variable names.
  • The result is a new data frame.
5 / 36

mutate()


6 / 36

Add new variables with mutate()

  • mutate() is useful to add new columns that are functions of existing columns.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time )
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60 )
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours )
7 / 36

Add new variables with mutate()

  • If we only want to keep the new variables, use transmute().
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
8 / 36

Add new variables with mutate()

Useful creation functions

  • There are many functions for creating new variables that we can use with mutate().

  • Arithmetic operators: +, -, *, /, ^

  • Modular arithmetic: %/% (integer division) and %% (remainder).
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
9 / 36

Add new variables with mutate()

  • There are many functions for creating new variables that we can use with mutate().

  • Offsets: lead() and lag() allow us to refer to leading or lagging values.

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(), and cummean()

(x <- 1:10)
lag(x)
lead(x)
cumsum(x)
cummean(x)
10 / 36

Add new variables with mutate()

Useful creation functions

  • There are many functions for creating new variables that we can use with mutate().
  • Ranking: min_rank(), dense_rank(), row_number(), , percent_rank(), and cume_dist().
rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )
rank_me <- mutate(rank_me,
id_x = row_number(x),
x_min_rank = min_rank(x),
x_dense_rank = dense_rank(x) )
  • row_number() assigns each element a unique value.

  • min_rank() and dense_rank() assign tied values the same rank.

    • min_rank() function assigns a rank equal to the number of values less than that tied value plus one.

    • dense_rank() function assigns a rank equal to the number of distinct values less than that tied value plus one.

11 / 36

Add new variables with mutate()

Useful creation functions

  • To create new variables based on a condition, we can use the if_else() and ifelse() functions.
    • if_else(<condition>, <if TRUE>, <else>)
    • ifelse(<condition>, <if TRUE>, <else>)
flight_season <-mutate(flights,
summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))
flight_season <-mutate(flights,
summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))
flight_season <-mutate(flights, # ifelse() is less restrictive
summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))
12 / 36

Add new variables with mutate()

Exercises

  • Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

  • Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

  • Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

  • Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

13 / 36

summarize()


14 / 36

Grouped summaries with summarize()

  • summarize() collapses a data frame to a single row.
  • na.rm argument removes the missing values prior to computation when using summarize() or other aggregate functions.
  • summarize() is not terribly useful unless we pair it with group_by().
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
15 / 36

Grouped summaries with summarize()

Example

  • Explore the relationship between the mean level distance and the mean level arr_delay for each value of dest.
by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
16 / 36

Grouped summaries with summarize()

Combining multiple operations with the pipe (%>%)

  • Explore the relationship between the distance and average delay for each location using the pipe (%>%).
delays <- flights %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE) ) %>%
filter(count > 20, dest != "HNL")
  • A good way to pronounce %>% when reading code is “then.”
  • A shortcut for %>% is Ctrl+Shift+M (Command+Shift+M for Mac).
17 / 36

Grouped summaries with summarize()

Missing values

  • What happens if we don’t set na.rm?
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm = TRUE))
18 / 36

Grouped summaries with summarize()

Missing values

  • Suppose missing values represent cancelled flights.
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
19 / 36

Grouped summaries with summarize()

Counts

  • Let’s look at the planes (identified by their tail number (tailnum)) that have the highest average delays
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay)
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
20 / 36

Grouped summaries with summarize()

Counts

  • When doing any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize( delay = mean(arr_delay, na.rm = TRUE),
n = n() )
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
21 / 36

Grouped summaries with summarize()

Useful summary functions

  • Measures of location: mean(x) and median(x).

  • Measures of spread: sd(x), IQR(x).

not_cancelled %>%
group_by(year, month, day) %>%
summarize( avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) ) # the average positive delay
# Brackets[] lets us filter (or subset) data from a column
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarize(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
22 / 36

Grouped summaries with summarize()

Useful summary functions

  • Measures of rank: min(x), quantile(x, 0.25), max(x), range(x).
  • Measures of position: first(x), nth(x, 2), last(x).
  • When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarize(first = min(dep_time),
last = max(dep_time) )
not_cancelled %>%
group_by(year, month, day) %>%
summarize(first_dep = first(dep_time),last_dep = last(dep_time) )
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r) )
23 / 36

Grouped summaries with summarize()

Useful summary functions

  • Measures of count: sum(!is.na(x)), n_distinct(x), count(x), and count(x, wt = VARIABLE).
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
not_cancelled %>%
count(dest)
not_cancelled %>%
count(tailnum, wt = distance)
24 / 36

Grouped summaries with summarize()

Useful summary functions

  • We can compute the height of bars in a histogram by combining dplyr::count() and ggplot2::cut_width():
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
diamonds %>%
count(cut_width(carat, 0.5))
25 / 36

Grouped summaries with summarize()

Useful summary functions

  • Counts and proportions of logical values: sum(x > 10), mean(y == 0).
# How many flights left before 5am?
# (these usually indicate delayed flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500))
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarize(hour_prop = mean(arr_delay > 60))
26 / 36

Grouped summaries with summarize()

Grouping by multiple variables

  • When you group by multiple variables, each summary peels off one level of the grouping.
daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
(per_month <- summarize(per_day, flights = sum(flights)))
(per_year <- summarize(per_month, flights = sum(flights)))
27 / 36

Grouped summaries with summarize()

Ungrouping

  • If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights = n()) # all flights
28 / 36

Grouped summaries with summarize()

Exercises

  • Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

    • A flight is always 10 minutes late.

    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Q. Which is more important: arrival delay or departure delay?

29 / 36

Grouped summaries with summarize()

Exercises

  • Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

  • Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

  • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

30 / 36

Grouped summaries with summarize()

Exercises

  • Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarize(n()))

  • What does the sort argument to count() do. When might you use it?

31 / 36

Grouped mutates


32 / 36

Grouped mutates (and filters)

  • Grouping is most useful in conjunction with summarize(), but you can also do convenient operations with mutate() and filter().
  • Find the worst members of each group:
flights %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
  • Find all groups bigger than a threshold:
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 17250)
popular_dests
  • Standardize to compute per group metrics:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
33 / 36

Grouped mutates (and filters)

Exercises

  • Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  • Which plane (tailnum) has the worst on-time record?

  • What time of day should you fly if you want to avoid delays as much as possible?

  • For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

34 / 36

Grouped mutates (and filters)

Exercises

  • Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.

  • Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

35 / 36

Grouped mutates (and filters)

Exercises

  • Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

  • For each plane, count the number of flights before the first delay of greater than 1 hour.

36 / 36

Data Transformation


2 / 36
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow