Lecture 13

class: title-slide, left, bottom

# Lecture 13
----
## **DANL 200: Introduction to Data Analytics**
### Byeong-Hak Choe
### October 13, 2022

---
# Announcement
### <p style="color:#00449E"> Schedules up to Midterm </p>

.panelset[
.panel[.panel-name[Online Class on Oct. 18]

- On Tuesday, October 18, 2022, we will have an **asynchronous** online class.

- On Canvas, I will leave the web-link for the class recording on Monday, October 17, 2022.

- Feel free to email me if you have any questions from the online class on Tuesday, October 18, 2022.

- You can also ask me any questions in person on Thursday, October 20, 2022 during our class time.

]

.panel[.panel-name[Office Hours]

- I will not have office hours on Monday and Wednesday, October 17 and 19, 2022.
  
  - Instead, I will have office hours on **Friday, October 21, 2022**, from 1:30 P.M. to 5:30 P.M at my office, South Hall 117B.

]

.panel[.panel-name[Homework 1 & 2]

- Evaluation criteria for Homework Assignment 1 include ...
  - Whether or not your R code works;
  - Visualization quality---how much informative and clear it is; 
  - Comments based on your visualization results.

- Do not leave your answer empty:
  - There is always a plenty of partial credits for homeworks and exams!

- Homework Assignment 2 will be posted during this weekend.
  - Due is October 24, 11:59 P.M., Eastern Time.

]

.panel[.panel-name[Midterm]
- The Midterm Exam will be on Thursday, October 27, 2022, class time.

- Please email me as soon as possible if you need to re-schedule the midterm exam.
  
]

]

---
# Workflow
### <p style="color:#00449E"> Shortcuts for Switching Apps</p>

.pull-left[
**Mac**
- **command + Tab** allows us to switch between apps quickly.

- If we use the full-screen mode of an app, the trackpad gesture of "three finger drag" switches across the screens.

]

.pull-right[
**Windows**

- **Alt + Tab** allows us to switch between apps quickly.

- Trackpad gestures might be supported depending on laptops or versions of Windows.

]

---
# Workflow
### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p>

.pull-left[
**Mac**

- **command + shift + N** opens a new RScript.
- **command + return** runs a current line or selected lines.
- **command + shift + C** is the shortcut for # (commenting).
- **option + - ** is the shortcut for `<-`.
]

.pull-right[
**Windows**

- **Ctrl + Shift + N** opens a new RScript.
- **Ctrl + return** runs a current line or selected lines.
- **Ctrl + Shift + C** is the shortcut for # (commenting).
- **Alt + - ** is the shortcut for `<-`.
]

---
# Workflow

- **Home/End** moves the blinking cursor bar to the beginning/End of the line.
  - **Ctrl** (**command/fn** for Mac Users) **+** <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M447.1 256C447.1 273.7 433.7 288 416 288H109.3l105.4 105.4c12.5 12.5 12.5 32.75 0 45.25C208.4 444.9 200.2 448 192 448s-16.38-3.125-22.62-9.375l-160-160c-12.5-12.5-12.5-32.75 0-45.25l160-160c12.5-12.5 32.75-12.5 45.25 0s12.5 32.75 0 45.25L109.3 224H416C433.7 224 447.1 238.3 447.1 256z"/></svg> / <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M438.6 278.6l-160 160C272.4 444.9 264.2 448 256 448s-16.38-3.125-22.62-9.375c-12.5-12.5-12.5-32.75 0-45.25L338.8 288H32C14.33 288 .0016 273.7 .0016 256S14.33 224 32 224h306.8l-105.4-105.4c-12.5-12.5-12.5-32.75 0-45.25s32.75-12.5 45.25 0l160 160C451.1 245.9 451.1 266.1 438.6 278.6z"/></svg> works too.

- **PgUp/PgDn** moves the blinking cursor bar to the top/bottom line of the script on the screen. 
  - **Fn + ** <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M374.6 246.6C368.4 252.9 360.2 256 352 256s-16.38-3.125-22.62-9.375L224 141.3V448c0 17.69-14.33 31.1-31.1 31.1S160 465.7 160 448V141.3L54.63 246.6c-12.5 12.5-32.75 12.5-45.25 0s-12.5-32.75 0-45.25l160-160c12.5-12.5 32.75-12.5 45.25 0l160 160C387.1 213.9 387.1 234.1 374.6 246.6z"/></svg> / <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M374.6 310.6l-160 160C208.4 476.9 200.2 480 192 480s-16.38-3.125-22.62-9.375l-160-160c-12.5-12.5-12.5-32.75 0-45.25s32.75-12.5 45.25 0L160 370.8V64c0-17.69 14.33-31.1 31.1-31.1S224 46.31 224 64v306.8l105.4-105.4c12.5-12.5 32.75-12.5 45.25 0S387.1 298.1 374.6 310.6z"/></svg> works too.

- **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action.
- **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes when undo is executed.

- **Ctrl** (**command** for Mac Users) **+ F** is useful when finding a phrase (and replace the phrase) in the RScript.

- **Ctrl** (**command** for Mac Users) **+ D** deletes a current line.

---
class: inverse, center, middle

# Data Transformation
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Data Transformation

- It is rare that we get the data in exactly the right form we need.

- We'll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with.

- We'll discuss how to transform data using the `dplyr` package and a dataset on flights departing New York City in 2013.

```r
library(tidyverse)
library(nycflights13)
head(flights)
View(flights) 
?flights
```

---
# Data Transformation
### <p style="color:#00449E"> `dplyr` basics </p>

- We are going to discuss the five key `dplyr` functions to solve various data manipulation challenges:
  - Pick observations by their values (`filter()`).
  - Reorder the rows (`arrange()`).
  - Pick variables by their names (`select()`).
  - Create new variables with functions of existing variables (`mutate()`).
  - Collapse many values down to a single summary (`summarise()`).

---
# Data Transformation
### <p style="color:#00449E"> `dplyr` basics </p>

- `filter(<DATA.FRAME>, LOGICAL_STATEMENT)`
- `arrange(<DATA.FRAME>, ...)`
- `select(<DATA.FRAME>, ...)`
- `mutate(<DATA.FRAME>, ...)`
- `summarise(<DATA.FRAME>, ...)`

- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame, using the variable names.
- The result is a new data frame.

---
class: inverse, center, middle

# `filter()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Filter rows with `filter()`

- `filter()` allows us to subset observations based on their values.

```r
filter(flights, month == 1, day == 1)

jan1 <- filter(flights, month == 1, day == 1)

(dec25 <- filter(flights, month == 12, day == 25))
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Comparisons

- R provides the standard comparison operators: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).

```r
filter(flights, month = 1)

sqrt(2) ^ 2 == 2
1 / 49 * 49 == 1

near(sqrt(2) ^ 2,  2)
near(1 / 49 * 49, 1)
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Logical Operators

- Multiple arguments to `filter()` are often combined with logical operators.

<div class="figure" style="text-align: center">
<img src="../lec_figs/transform-logical.png" alt="Complete set of boolean operations in `R`" width="66%" />
<p class="caption">Complete set of boolean operations in `R`</p>
</div>

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Logical Operators

- De Morgan’s law:

- `!(x & y)` is the same as `!x | !y`. 
- `!(x | y)` is the same as `!x & !y`.

- For example, comparison and logical operators can be used with `filter()` as follows:

```r
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120 & dep_delay <= 120)
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Missing values

- Almost any operation involving an unknown value will also be unknown.

```r
NA > 5
10 == NA
NA + 10
NA / 2

NA == NA
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Missing values

- Let x be Mary's age. We don't know how old she is.
- Let y be John's age. We don't know how old he is.
- Are John and Mary the same age?

```r
x <- NA
y <- NA
x == y
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Missing values

- If we want to determine if a value is missing, use `is.na()`.
- If we want to preserve missing values, ask `filter()` for them explicitly.

```r
x <- NA
is.na(x)

df <- tibble(y = c(1, NA, 3))
filter(df, y > 1)
filter(df, is.na(y) | y > 1)
```

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Exercises

- Find all flights that
  - Had an arrival delay of two or more hours
  - Flew to Houston (`IAH` or `HOU`)
  - Were operated by United, American, or Delta
  - Departed in summer (July, August, and September)
  - Arrived more than two hours late, but didn’t leave late
  - Were delayed by at least an hour, but made up over 30 minutes in flight
  - Departed between midnight and 6am (inclusive)

---
# Filter rows with `filter()`
### <p style="color:#00449E"> Exercises

- Another useful `dplyr` filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?

- How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?

- Why is `NA^0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)

---
class: inverse, center, middle

# `arrange()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Arrange rows with `arrange()`

- `arrange()` changes their order.

- If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

- Use `desc()` to re-order by a column in descending order.

```r
arrange(flights, year, month, day)

# re-order observations by dep_delay in descending order.
arrange(flights, [?])
```

---
# Arrange rows with `arrange()`
### <p style="color:#00449E"> Exercises

- How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`).

- Sort flights to find the most delayed flights. Find the flights that left earliest.

- Sort flights to find the fastest (highest speed) flights.

- Which flights traveled the farthest? Which traveled the shortest?

---
class: inverse, center, middle

# `select()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Select columns with `select()`

- It’s not uncommon to get datasets with hundreds or even thousands of variables.
- `select()` allows us to narrow in on the variables we're actually interested in.

```r
select(flights, year, month, day)

select(flights, year:day)

select(flights, -(year:day))
```

---
# Select columns with `select()`

- There are a number of helper functions we can use within `select()`:

- `starts_with("abc")`: matches names that begin with "abc".

- `ends_with("xyz")`: matches names that end with "xyz".

- `contains("ijk")`: matches names that contain "ijk".

- `num_range("x", 1:3)`: matches x1, x2 and x3.

---
# Select columns with `select()`
### <p style="color:#00449E"> Rename variables with `rename()`

- `rename()` can be used to rename variables:

- `rename(<DATA>, <NEW_VARIABLE> = <EXISTING_VARIABLE>)`

```r
rename(flights, tail_num = tailnum)
```

---
# Select columns with `select()`
### <p style="color:#00449E"> Change the order of variables

- The function, `everything()`, can be used to move variables to the start of the data frame:

```r
select(flights, time_hour, air_time, [?])
```

---
# Select columns with `select()`
### <p style="color:#00449E"> Exercises

- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

- What happens if we include the name of a variable multiple times in a `select()` call?

- What does the `any_of()` function do? Why might it be helpful in conjunction with this vector?

```r
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```

---
# Select columns with `select()`
### <p style="color:#00449E"> Exercises

- Does the result of running the following code surprise us? How do the select helpers deal with case by default? How can we change that default?

```r
select(flights, contains("TIME"))
```

---
class: inverse, center, middle

# `mutate()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Add new variables with `mutate()`

- `mutate()` is useful to add new columns that are functions of existing columns.

```r
flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time )

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60 )

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours )
```

---
# Add new variables with `mutate()`
- If we only want to keep the new variables, use `transmute()`.

```r
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

---
# Add new variables with `mutate()`
### <p style="color:#00449E"> Useful creation functions

- There are many functions for creating new variables that we can use with `mutate()`.

- Arithmetic operators: `+`, `-`, `*`, `/`, `^`
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder).

```r
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
```

---
# Add new variables with `mutate()`

- There are many functions for creating new variables that we can use with `mutate()`.

- Offsets: `lead()` and `lag()` allow us to refer to leading or lagging values.

- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, and `cummean()`

```r
(x <- 1:10)
lag(x)
lead(x)
cumsum(x)
cummean(x)
```

---
# Add new variables with `mutate()`
### <p style="color:#00449E"> Useful creation functions

- There are many functions for creating new variables that we can use with `mutate()`.

- Ranking: `min_rank()`, `dense_rank()`, `row_number()`, , `percent_rank()`, and `cume_dist()`.

```r
rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) )
rank_me <- mutate(rank_me,
                  id_x = row_number(x),
                  x_min_rank = min_rank(x),
                  x_dense_rank = dense_rank(x) )
```

---
# Add new variables with `mutate()`
### <p style="color:#00449E"> Useful creation functions

- To create new variables based on a condition, we can use the `if_else()` and `ifelse()` functions. 
  - `if_else(<condition>, <if TRUE>, <else>)`
  - `ifelse(<condition>, <if TRUE>, <else>)`

```r
flight_season <-mutate(flights,
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE))

flight_season <-mutate(flights,
    summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0))

flight_season <-mutate(flights,   # ifelse() is less restrictive
    summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0))
```

---
# Add new variables with `mutate()`
### <p style="color:#00449E"> Exercises
 
- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

- Compare `air_time` with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it?

- Compare `dep_time`, `sched_dep_time`, and `dep_delay.` How would you expect those three numbers to be related?

- Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

---
class: inverse, center, middle

# `summarize()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E">

- `summarize()` collapses a data frame to a single row.
- `na.rm` argument removes the missing values prior to computation when using `summarize()` or other aggregate functions.
- `summarize()` is not terribly useful unless we pair it with `group_by()`.

```r
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Example

- Explore the relationship between the mean level `distance` and the mean level `arr_delay` for each value of `dest`.

```r
by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)

delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Combining multiple operations with the pipe (`%>%`)

- Explore the relationship between the distance and average delay for each location using the pipe (`%>%`).

```r
delays <- flights %>% 
  group_by(dest) %>% 
  summarize( count = n(),
             dist = mean(distance, na.rm = TRUE),
             delay = mean(arr_delay, na.rm = TRUE) ) %>% 
  filter(count > 20, dest != "HNL")
```

- A good way to pronounce %>% when reading code is “then.”
- A shortcut for %>%  is Ctrl+Shift+M  (Command+Shift+M for Mac).

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Missing values

- What happens if we don’t set `na.rm`?

```r
flights %>% 
  group_by(year, month, day) %>% 
  summarize(mean = mean(dep_delay))
  
flights %>% 
  group_by(year, month, day) %>% 
  summarize(mean = mean(dep_delay, na.rm = TRUE))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Missing values

- Suppose missing values represent cancelled flights.

```r
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize(mean = mean(dep_delay))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Counts

- Let’s look at the planes (identified by their tail number (`tailnum`)) that have the highest average delays

```r
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarize(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Counts

- When doing any aggregation, it’s always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).

```r
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarize( delay = mean(arr_delay, na.rm = TRUE),
             n = n() )

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

delays %>% 
  filter(n > 25) %>% 
  ggplot(mapping = aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Useful summary functions

- Measures of location: `mean(x)` and `median(x)`.

- Measures of spread: `sd(x)`, `IQR(x)`.

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize( avg_delay1 = mean(arr_delay),
             avg_delay2 = mean(arr_delay[arr_delay > 0]) ) # the average positive delay
# Brackets[] lets us filter (or subset) data from a column
# Why is distance to some destinations more variable than to others?
not_cancelled %>% 
  group_by(dest) %>% 
  summarize(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Useful summary functions

- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`, `range(x)`.
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.

- When do the first and last flights leave each day?

.panelset[

.panel[.panel-name[(1)]

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize(first = min(dep_time), 
            last = max(dep_time) )
```
]

.panel[.panel-name[(2)]

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize(first_dep = first(dep_time),last_dep = last(dep_time) )
```
]

.panel[.panel-name[(3)]

```r
not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r) )
```
]

]

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Useful summary functions

- Measures of count: `sum(!is.na(x))`, `n_distinct(x)`,  `count(x)`, and `count(x, wt = VARIABLE)`.

```r
# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
  
  
not_cancelled %>% 
  count(dest)

not_cancelled %>% 
  count(tailnum, wt = distance)
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Useful summary functions

- We can compute the height of bars in a histogram by combining `dplyr::count()` and `ggplot2::cut_width()`:

```r
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

diamonds %>% 
  count(cut_width(carat, 0.5))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Useful summary functions

- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.

```r
# How many flights left before 5am? 
# (these usually indicate delayed flights from the previous day)
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize(n_early = sum(dep_time < 500))

# What proportion of flights are delayed by more than an hour?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarize(hour_prop = mean(arr_delay > 60))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Grouping by multiple variables

- When you group by multiple variables, each summary peels off one level of the grouping.

```r
daily <- group_by(flights, year, month, day)

(per_day   <- summarize(daily, flights = n()))

(per_month <- summarize(per_day, flights = sum(flights)))

(per_year  <- summarize(per_month, flights = sum(flights)))
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Ungrouping

- If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.

```r
daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarize(flights = n())  # all flights
```

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Exercises

- Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

- A flight is always 10 minutes late.

- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

Q. Which is more important: arrival delay or departure delay?

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Exercises

- Come up with another approach that will give you the same output as `not_cancelled` `%>%` `count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`).

- Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal. Why? Which is the most important column?

- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

---
# Grouped summaries with `summarize()`
### <p style="color:#00449E"> Exercises

- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarize(n())`)

- What does the sort argument to `count()` do. When might you use it?

---
class: inverse, center, middle

# Grouped mutates
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Grouped mutates (and filters)
### <p style="color:#00449E">

- Grouping is most useful in conjunction with `summarize()`, but you can also do convenient operations with `mutate()` and `filter()`.

.panelset[

.panel[.panel-name[worst]
- Find the worst members of each group:

```r
flights %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
```
]

.panel[.panel-name[threshold]
- Find all groups bigger than a threshold:

```r
popular_dests <- flights %>% 
  group_by(dest) %>% 
  filter(n() > 17250)
popular_dests
```
]

.panel[.panel-name[standardization]
- Standardize to compute per group metrics:

```r
popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, dest, arr_delay, prop_delay)
```
]

]

---
# Grouped mutates (and filters)
### <p style="color:#00449E"> Exercises

- Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

- Which plane (`tailnum`) has the worst on-time record?

- What time of day should you fly if you want to avoid delays as much as possible?

- For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

---
# Grouped mutates (and filters)
### <p style="color:#00449E"> Exercises

- Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.

- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

---
# Grouped mutates (and filters)
### <p style="color:#00449E"> Exercises

- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

- For each plane, count the number of flights before the first delay of greater than 1 hour.