class: title-slide, left, bottom # Lecture 13 ---- ## **DANL 200: Introduction to Data Analytics** ### Byeong-Hak Choe ### October 13, 2022 --- # Announcement ### <p style="color:#00449E"> Schedules up to Midterm </p> .panelset[ .panel[.panel-name[Online Class on Oct. 18] - On Tuesday, October 18, 2022, we will have an **asynchronous** online class. - On Canvas, I will leave the web-link for the class recording on Monday, October 17, 2022. - Feel free to email me if you have any questions from the online class on Tuesday, October 18, 2022. - You can also ask me any questions in person on Thursday, October 20, 2022 during our class time. ] .panel[.panel-name[Office Hours] - I will not have office hours on Monday and Wednesday, October 17 and 19, 2022. - Instead, I will have office hours on **Friday, October 21, 2022**, from 1:30 P.M. to 5:30 P.M at my office, South Hall 117B. ] .panel[.panel-name[Homework 1 & 2] - Evaluation criteria for Homework Assignment 1 include ... - Whether or not your R code works; - Visualization quality---how much informative and clear it is; - Comments based on your visualization results. - Do not leave your answer empty: - There is always a plenty of partial credits for homeworks and exams! - Homework Assignment 2 will be posted during this weekend. - Due is October 24, 11:59 P.M., Eastern Time. ] .panel[.panel-name[Midterm] - The Midterm Exam will be on Thursday, October 27, 2022, class time. - Please email me as soon as possible if you need to re-schedule the midterm exam. ] ] --- # Workflow ### <p style="color:#00449E"> Shortcuts for Switching Apps</p> .pull-left[ **Mac** - **command + Tab** allows us to switch between apps quickly. - If we use the full-screen mode of an app, the trackpad gesture of "three finger drag" switches across the screens. ] .pull-right[ **Windows** - **Alt + Tab** allows us to switch between apps quickly. - Trackpad gestures might be supported depending on laptops or versions of Windows. ] --- # Workflow ### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p> .pull-left[ **Mac** - **command + shift + N** opens a new RScript. - **command + return** runs a current line or selected lines. - **command + shift + C** is the shortcut for # (commenting). - **option + - ** is the shortcut for `<-`. ] .pull-right[ **Windows** - **Ctrl + Shift + N** opens a new RScript. - **Ctrl + return** runs a current line or selected lines. - **Ctrl + Shift + C** is the shortcut for # (commenting). - **Alt + - ** is the shortcut for `<-`. ] --- # Workflow - **Home/End** moves the blinking cursor bar to the beginning/End of the line. - **Ctrl** (**command/fn** for Mac Users) **+**
/
works too. - **PgUp/PgDn** moves the blinking cursor bar to the top/bottom line of the script on the screen. - **Fn + **
/
works too. - **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action. - **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes when undo is executed. - **Ctrl** (**command** for Mac Users) **+ F** is useful when finding a phrase (and replace the phrase) in the RScript. - **Ctrl** (**command** for Mac Users) **+ D** deletes a current line. --- class: inverse, center, middle # Data Transformation <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Data Transformation - It is rare that we get the data in exactly the right form we need. - We'll need to create some new variables, summaries, rename the variables or reorder the observations in order to make the data a little easier to work with. - We'll discuss how to transform data using the `dplyr` package and a dataset on flights departing New York City in 2013. ```r library(tidyverse) library(nycflights13) head(flights) View(flights) ?flights ``` --- # Data Transformation ### <p style="color:#00449E"> `dplyr` basics </p> - We are going to discuss the five key `dplyr` functions to solve various data manipulation challenges: - Pick observations by their values (`filter()`). - Reorder the rows (`arrange()`). - Pick variables by their names (`select()`). - Create new variables with functions of existing variables (`mutate()`). - Collapse many values down to a single summary (`summarise()`). --- # Data Transformation ### <p style="color:#00449E"> `dplyr` basics </p> - `filter(<DATA.FRAME>, LOGICAL_STATEMENT)` - `arrange(<DATA.FRAME>, ...)` - `select(<DATA.FRAME>, ...)` - `mutate(<DATA.FRAME>, ...)` - `summarise(<DATA.FRAME>, ...)` - The first argument is a data frame. - The subsequent arguments describe what to do with the data frame, using the variable names. - The result is a new data frame. --- class: inverse, center, middle # `filter()` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Filter rows with `filter()` - `filter()` allows us to subset observations based on their values. ```r filter(flights, month == 1, day == 1) jan1 <- filter(flights, month == 1, day == 1) (dec25 <- filter(flights, month == 12, day == 25)) ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Comparisons - R provides the standard comparison operators: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). ```r filter(flights, month = 1) sqrt(2) ^ 2 == 2 1 / 49 * 49 == 1 near(sqrt(2) ^ 2, 2) near(1 / 49 * 49, 1) ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Logical Operators - Multiple arguments to `filter()` are often combined with logical operators. <div class="figure" style="text-align: center"> <img src="../lec_figs/transform-logical.png" alt="Complete set of boolean operations in `R`" width="66%" /> <p class="caption">Complete set of boolean operations in `R`</p> </div> --- # Filter rows with `filter()` ### <p style="color:#00449E"> Logical Operators - De Morgan’s law: - `!(x & y)` is the same as `!x | !y`. - `!(x | y)` is the same as `!x & !y`. - For example, comparison and logical operators can be used with `filter()` as follows: ```r filter(flights, month == 11 | month == 12) nov_dec <- filter(flights, month %in% c(11, 12)) filter(flights, !(arr_delay > 120 | dep_delay > 120)) filter(flights, arr_delay <= 120 & dep_delay <= 120) ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Missing values - Almost any operation involving an unknown value will also be unknown. ```r NA > 5 10 == NA NA + 10 NA / 2 NA == NA ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Missing values - Let x be Mary's age. We don't know how old she is. - Let y be John's age. We don't know how old he is. - Are John and Mary the same age? ```r x <- NA y <- NA x == y ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Missing values - If we want to determine if a value is missing, use `is.na()`. - If we want to preserve missing values, ask `filter()` for them explicitly. ```r x <- NA is.na(x) df <- tibble(y = c(1, NA, 3)) filter(df, y > 1) filter(df, is.na(y) | y > 1) ``` --- # Filter rows with `filter()` ### <p style="color:#00449E"> Exercises - Find all flights that - Had an arrival delay of two or more hours - Flew to Houston (`IAH` or `HOU`) - Were operated by United, American, or Delta - Departed in summer (July, August, and September) - Arrived more than two hours late, but didn’t leave late - Were delayed by at least an hour, but made up over 30 minutes in flight - Departed between midnight and 6am (inclusive) --- # Filter rows with `filter()` ### <p style="color:#00449E"> Exercises - Another useful `dplyr` filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges? - How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent? - Why is `NA^0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!) --- class: inverse, center, middle # `arrange()` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Arrange rows with `arrange()` - `arrange()` changes their order. - If we provide more than one column name, each additional column will be used to break ties in the values of preceding columns. - Use `desc()` to re-order by a column in descending order. ```r arrange(flights, year, month, day) # re-order observations by dep_delay in descending order. arrange(flights, [?]) ``` --- # Arrange rows with `arrange()` ### <p style="color:#00449E"> Exercises - How could you use `arrange()` to sort all missing values to the start? (Hint: use `is.na()`). - Sort flights to find the most delayed flights. Find the flights that left earliest. - Sort flights to find the fastest (highest speed) flights. - Which flights traveled the farthest? Which traveled the shortest? --- class: inverse, center, middle # `select()` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Select columns with `select()` - It’s not uncommon to get datasets with hundreds or even thousands of variables. - `select()` allows us to narrow in on the variables we're actually interested in. ```r select(flights, year, month, day) select(flights, year:day) select(flights, -(year:day)) ``` --- # Select columns with `select()` - There are a number of helper functions we can use within `select()`: - `starts_with("abc")`: matches names that begin with "abc". - `ends_with("xyz")`: matches names that end with "xyz". - `contains("ijk")`: matches names that contain "ijk". - `num_range("x", 1:3)`: matches x1, x2 and x3. --- # Select columns with `select()` ### <p style="color:#00449E"> Rename variables with `rename()` - `rename()` can be used to rename variables: - `rename(<DATA>, <NEW_VARIABLE> = <EXISTING_VARIABLE>)` ```r rename(flights, tail_num = tailnum) ``` --- # Select columns with `select()` ### <p style="color:#00449E"> Change the order of variables - The function, `everything()`, can be used to move variables to the start of the data frame: ```r select(flights, time_hour, air_time, [?]) ``` --- # Select columns with `select()` ### <p style="color:#00449E"> Exercises - Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. - What happens if we include the name of a variable multiple times in a `select()` call? - What does the `any_of()` function do? Why might it be helpful in conjunction with this vector? ```r vars <- c("year", "month", "day", "dep_delay", "arr_delay") ``` --- # Select columns with `select()` ### <p style="color:#00449E"> Exercises - Does the result of running the following code surprise us? How do the select helpers deal with case by default? How can we change that default? ```r select(flights, contains("TIME")) ``` --- class: inverse, center, middle # `mutate()` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Add new variables with `mutate()` - `mutate()` is useful to add new columns that are functions of existing columns. ```r flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time ) mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 ) mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` --- # Add new variables with `mutate()` - If we only want to keep the new variables, use `transmute()`. ```r transmute(flights, gain = dep_delay - arr_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` --- # Add new variables with `mutate()` ### <p style="color:#00449E"> Useful creation functions - There are many functions for creating new variables that we can use with `mutate()`. - Arithmetic operators: `+`, `-`, `*`, `/`, `^` - Modular arithmetic: `%/%` (integer division) and `%%` (remainder). ```r transmute(flights, dep_time, hour = dep_time %/% 100, minute = dep_time %% 100 ) ``` --- # Add new variables with `mutate()` - There are many functions for creating new variables that we can use with `mutate()`. - Offsets: `lead()` and `lag()` allow us to refer to leading or lagging values. - Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, and `cummean()` ```r (x <- 1:10) lag(x) lead(x) cumsum(x) cummean(x) ``` --- # Add new variables with `mutate()` ### <p style="color:#00449E"> Useful creation functions - There are many functions for creating new variables that we can use with `mutate()`. - Ranking: `min_rank()`, `dense_rank()`, `row_number()`, , `percent_rank()`, and `cume_dist()`. ```r rank_me <- tibble( x = c(10, 5, 1, 5, 5, NA) ) rank_me <- mutate(rank_me, id_x = row_number(x), x_min_rank = min_rank(x), x_dense_rank = dense_rank(x) ) ``` --- # Add new variables with `mutate()` ### <p style="color:#00449E"> Useful creation functions - To create new variables based on a condition, we can use the `if_else()` and `ifelse()` functions. - `if_else(<condition>, <if TRUE>, <else>)` - `ifelse(<condition>, <if TRUE>, <else>)` ```r flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, FALSE)) flight_season <-mutate(flights, summer_month = if_else(month %in% c(6, 7, 8), TRUE, 0)) flight_season <-mutate(flights, # ifelse() is less restrictive summer_month = ifelse(month %in% c(6, 7, 8), TRUE, 0)) ``` --- # Add new variables with `mutate()` ### <p style="color:#00449E"> Exercises - Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight. - Compare `air_time` with `arr_time - dep_time`. What do you expect to see? What do you see? What do you need to do to fix it? - Compare `dep_time`, `sched_dep_time`, and `dep_delay.` How would you expect those three numbers to be related? - Find the 10 most delayed flights using a ranking function. How do you want to handle ties? --- class: inverse, center, middle # `summarize()` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> - `summarize()` collapses a data frame to a single row. - `na.rm` argument removes the missing values prior to computation when using `summarize()` or other aggregate functions. - `summarize()` is not terribly useful unless we pair it with `group_by()`. ```r summarize(flights, delay = mean(dep_delay, na.rm = TRUE)) by_day <- group_by(flights, year, month, day) summarize(by_day, delay = mean(dep_delay, na.rm = TRUE)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Example - Explore the relationship between the mean level `distance` and the mean level `arr_delay` for each value of `dest`. ```r by_dest <- group_by(flights, dest) delay <- summarize(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay <- filter(delay, count > 20, dest != "HNL") ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Combining multiple operations with the pipe (`%>%`) - Explore the relationship between the distance and average delay for each location using the pipe (`%>%`). ```r delays <- flights %>% group_by(dest) %>% summarize( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL") ``` - A good way to pronounce %>% when reading code is “then.” - A shortcut for %>% is Ctrl+Shift+M (Command+Shift+M for Mac). --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Missing values - What happens if we don’t set `na.rm`? ```r flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay)) flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay, na.rm = TRUE)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Missing values - Suppose missing values represent cancelled flights. ```r not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Counts - Let’s look at the planes (identified by their tail number (`tailnum`)) that have the highest average delays ```r delays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay) ) ggplot(data = delays, mapping = aes(x = delay)) + geom_freqpoly(binwidth = 10) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Counts - When doing any aggregation, it’s always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`). ```r delays <- not_cancelled %>% group_by(tailnum) %>% summarize( delay = mean(arr_delay, na.rm = TRUE), n = n() ) ggplot(data = delays, mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10) delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Useful summary functions - Measures of location: `mean(x)` and `median(x)`. - Measures of spread: `sd(x)`, `IQR(x)`. ```r not_cancelled %>% group_by(year, month, day) %>% summarize( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]) ) # the average positive delay # Brackets[] lets us filter (or subset) data from a column # Why is distance to some destinations more variable than to others? not_cancelled %>% group_by(dest) %>% summarize(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Useful summary functions - Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`, `range(x)`. - Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. - When do the first and last flights leave each day? .panelset[ .panel[.panel-name[(1)] ```r not_cancelled %>% group_by(year, month, day) %>% summarize(first = min(dep_time), last = max(dep_time) ) ``` ] .panel[.panel-name[(2)] ```r not_cancelled %>% group_by(year, month, day) %>% summarize(first_dep = first(dep_time),last_dep = last(dep_time) ) ``` ] .panel[.panel-name[(3)] ```r not_cancelled %>% group_by(year, month, day) %>% mutate(r = min_rank(desc(dep_time))) %>% filter(r %in% range(r) ) ``` ] ] --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Useful summary functions - Measures of count: `sum(!is.na(x))`, `n_distinct(x)`, `count(x)`, and `count(x, wt = VARIABLE)`. ```r # Which destinations have the most carriers? not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) not_cancelled %>% count(dest) not_cancelled %>% count(tailnum, wt = distance) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Useful summary functions - We can compute the height of bars in a histogram by combining `dplyr::count()` and `ggplot2::cut_width()`: ```r ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) diamonds %>% count(cut_width(carat, 0.5)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Useful summary functions - Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`. ```r # How many flights left before 5am? # (these usually indicate delayed flights from the previous day) not_cancelled %>% group_by(year, month, day) %>% summarize(n_early = sum(dep_time < 500)) # What proportion of flights are delayed by more than an hour? not_cancelled %>% group_by(year, month, day) %>% summarize(hour_prop = mean(arr_delay > 60)) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Grouping by multiple variables - When you group by multiple variables, each summary peels off one level of the grouping. ```r daily <- group_by(flights, year, month, day) (per_day <- summarize(daily, flights = n())) (per_month <- summarize(per_day, flights = sum(flights))) (per_year <- summarize(per_month, flights = sum(flights))) ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Ungrouping - If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`. ```r daily %>% ungroup() %>% # no longer grouped by date summarize(flights = n()) # all flights ``` --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Exercises - Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios: - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time. - A flight is always 10 minutes late. - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time. - 99% of the time a flight is on time. 1% of the time it’s 2 hours late. Q. Which is more important: arrival delay or departure delay? --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Exercises - Come up with another approach that will give you the same output as `not_cancelled` `%>%` `count(dest)` and `not_cancelled %>% count(tailnum, wt = distance)` (without using `count()`). - Our definition of cancelled flights (`is.na(dep_delay) | is.na(arr_delay)` ) is slightly suboptimal. Why? Which is the most important column? - Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? --- # Grouped summaries with `summarize()` ### <p style="color:#00449E"> Exercises - Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarize(n())`) - What does the sort argument to `count()` do. When might you use it? --- class: inverse, center, middle # Grouped mutates <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Grouped mutates (and filters) ### <p style="color:#00449E"> - Grouping is most useful in conjunction with `summarize()`, but you can also do convenient operations with `mutate()` and `filter()`. .panelset[ .panel[.panel-name[worst] - Find the worst members of each group: ```r flights %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10) ``` ] .panel[.panel-name[threshold] - Find all groups bigger than a threshold: ```r popular_dests <- flights %>% group_by(dest) %>% filter(n() > 17250) popular_dests ``` ] .panel[.panel-name[standardization] - Standardize to compute per group metrics: ```r popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay) ``` ] ] --- # Grouped mutates (and filters) ### <p style="color:#00449E"> Exercises - Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping. - Which plane (`tailnum`) has the worst on-time record? - What time of day should you fly if you want to avoid delays as much as possible? - For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination. --- # Grouped mutates (and filters) ### <p style="color:#00449E"> Exercises - Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight. - Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air? --- # Grouped mutates (and filters) ### <p style="color:#00449E"> Exercises - Find all destinations that are flown by at least two carriers. Use that information to rank the carriers. - For each plane, count the number of flights before the first delay of greater than 1 hour.