class: title-slide, left, bottom

# Lecture 5
----
## **DANL 200: Introduction to Data Analytics**
### Byeong-Hak Choe
### September 13, 2022

---
# Announcement
### <p style="color:#00449E">Accounting Expo</p>

- Are you interested in learning about Accounting, Consulting, Audit or Tax?
- Stop in for pizza and meet 20+ Firms (alumni and employers) from the Big 4, National Players and Regional firms!

- When? September 15th, 5:00 PM-7:00 PM
- Where? Ballroom
- Dress code? Business Casual
- Practice, prep or questions? In-person drop-ins!
  - South 110 (or 112) on September 13, 9:00 AM-Noon
  - South 110 (or 112) on September 14, 9:00 AM-4:00 PM

---
class: inverse, center, middle

# Workflow
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Workflow
### <p style="color:#00449E"> Paths, Directories, and RStudio Projects </p>

- **Absolute paths** are paths that point to the same place regardless of your working directory.
  - Mac: `/Users/byeong-hakchoe/Desktop/DANL/tvshows.csv`
  - Windows: `C:\Users\bchoe\Desktop\DANL\tvshows.csv`

- In a path, Mac uses slashes (e.g. `plots/diamonds.csv`) and Windows uses backslashes (e.g. `plots\diamonds.csv`).
- When using absolute paths, I recommend that Windows users replace backslashes (`\`) with slashes (`/`).
- Backslashes mean something special to R, so to get a single backslash in a path, you need to type two backslashes.

---
# Workflow
### <p style="color:#00449E"> Paths, Directories, and RStudio Projects </p>

- I keep all the files associated with a project together: input data, R scripts, analytical results, and figures.
- This is such a wise and common practice that RStudio has built-in support for it via **projects**.
- Let's make a **project** in RStudio.
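---
# Workflow
### <p style="color:#00449E"> Paths, Directories, and RStudio Projects </p>

- A minimal sketch of the backslash rule, reusing the Windows example path from the earlier slide. The `data` folder passed to `file.path()` is a hypothetical name used only for illustration.

```r
# In an R string, "\\" stands for ONE backslash.
win_path <- "C:\\Users\\bchoe\\Desktop\\DANL\\tvshows.csv"

# The same location written with slashes needs no escaping.
fwd_path <- "C:/Users/bchoe/Desktop/DANL/tvshows.csv"

# Replacing every backslash with a slash gives the same string.
converted <- gsub("\\", "/", win_path, fixed = TRUE)
identical(converted, fwd_path)

# file.path() joins pieces with "/" on every platform.
# "data" is a hypothetical folder name.
rel_path <- file.path("data", "tvshows.csv")
rel_path
```

- When an RStudio project is open, the working directory is the project folder, so a relative path such as `rel_path` is looked up from there.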
---
# Making RStudio Projects - Windows 11
<img src="../lec_figs/rstudio-project_windows.gif" width="90%" style="display: block; margin: auto;" />

---
# Making RStudio Projects - Windows 10
<img src="../lec_figs/rstudio-project_windows10.gif" width="90%" style="display: block; margin: auto;" />

---
# Making RStudio Projects - Mac
<img src="../lec_figs/rstudio-project_mac.gif" width="90%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Data Visualization with `ggplot()`
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Exploratory Data Analysis
<img src="../lec_figs/data-science-explore.png" width="40%" style="display: block; margin: auto;" />

- In data visualization, you'll learn the basic structure of a `ggplot` plot. It turns data into plots.
- In data transformation, you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- In **exploratory data analysis**, you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.

---
# Data Visualization - First Steps

```r
library(tidyverse) # library(ggplot2) if tidyverse is not available
mpg
?mpg
```

- The `mpg` data frame, provided by `ggplot2`, contains observations collected by the US Environmental Protection Agency on 38 models of car.
- Q. Do cars with big engines use more fuel than cars with small engines?
  - `displ`: a car's engine size, in liters.
  - `hwy`: a car's fuel efficiency on the highway, in miles per gallon (mpg).

---
# Data Visualization - First Steps
### <p style="color:#00449E"> Creating a `ggplot` </p>

- What does the relationship between engine size and fuel efficiency look like?
- To plot `mpg`, run the following code to put `displ` on the `x`-axis and `hwy` on the `y`-axis:

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```

---
# Data Visualization - First Steps
### <p style="color:#00449E"> Graphing Template </p>

- To make a ggplot plot, replace the bracketed sections in the code below with a `data.frame`, a `geom` function, or a collection of mappings such as `x = VAR_1` and `y = VAR_2`.

```r
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```

---
class: inverse, center, middle

# Aesthetic Mappings
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Aesthetic Mappings

- In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend.

<img src="../lec_figs/r4s_330_1.png" width="50%" style="display: block; margin: auto;" />

- How can you explain these cars? Are those hybrids?

---
# Aesthetic Mappings

- An aesthetic is a visual property (e.g., `size`, `shape`, `color`) of the objects (e.g., points) in your plot.
- You can display a point in different ways by changing the values of its aesthetic properties.
<img src="../lec_figs/r4s_330_2.png" width="50%" style="display: block; margin: auto;" />

---
# Aesthetic Mappings
### <p style="color:#00449E"> Adding a `color` to the plot </p>

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
*                          color = class) )
```

--

### <p style="color:#00449E"> Adding a `shape` to the plot </p>

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
*                          shape = class) )
```

---
# Aesthetic Mappings
### <p style="color:#00449E"> Adding a `size` to the plot </p>

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
*                          size = class) )
```

--

### <p style="color:#00449E"> Adding an `alpha` (transparency) to the plot </p>

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
*                          alpha = class) )
```

---
# Aesthetic Mappings
### <p style="color:#00449E"> Discrete vs. Continuous Variables </p>

<!-- - A **variable** is a quantity whose value changes. -->

- A **discrete variable** is a variable whose value is obtained by *counting*.
  - Number of students present
  - Number of red marbles in a jar
  - Number of heads when flipping three coins
  - Students’ grade level

- A **continuous variable** is a variable whose value is obtained by *measuring*.
  - Height of students in class
  - Weight of students in class
  - Time it takes to get to school
  - Distance traveled between classes

---
# Aesthetic Mappings
### <p style="color:#00449E"> Specifying a `color` to the plot </p>

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
*            color = "blue")
```

---
# Aesthetic Mappings

- To set an aesthetic manually, set the aesthetic by name as an argument of your `geom_*()` function; i.e., it goes outside of `aes()`.
- You'll need to pick a value that makes sense for that aesthetic:
  - The name of a `color` as a *character string*.
  - The `size` of a point in *mm*.
  - The `shape` of a point as a *number*, as shown below.
<img src="../lec_figs/r4s_330_3.png" width="60%" style="display: block; margin: auto;" />

---
# Aesthetic Mappings
### <p style="color:#00449E"> Specifying a `color` to the plot? </p>

```r
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy,
*                 color = "blue") )
```

---
# Common problems in ggplot()

- One common problem when creating `ggplot2` graphics is to put the `+` in the wrong place.

```r
ggplot(data = mpg)
*+ geom_point( mapping = aes(x = displ, y = hwy) )
```

---
# Aesthetic Mappings
### <p style="color:#00449E"> Exercises </p>

- Which variables in `mpg` are categorical? Which variables are continuous? (Hint: type `?mpg` to read the documentation for the dataset). How can you see this information when you run `mpg`?
- Map a continuous variable to `color`, `size`, and `shape`. How do these aesthetics behave differently for categorical vs. continuous variables?

---
# Aesthetic Mappings
### <p style="color:#00449E"> Exercises </p>

- What happens if you map the same variable to multiple aesthetics?
- What does the `stroke` aesthetic do? What shapes does it work with? (Hint: use `?geom_point`)
- What happens if you map an aesthetic to something other than a variable name, like `aes(color = displ < 5)`? (Note, you'll also need to specify `x` and `y`).

---
class: inverse, center, middle

# Facets
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Facets

- One way to add a variable, particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data.
- To facet your plot by a single variable, use `facet_wrap()`.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
* facet_wrap(~ class, nrow = 2)
```

---
# Facets

- To facet your plot on the combination of two variables, add `facet_grid()` to your plot call.
- The first argument of `facet_grid()` is also a formula.
This time the formula should contain two variable names separated by a `~`.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
* facet_grid(drv ~ cyl)
```

---
# Facets
### <p style="color:#00449E"> Exercises </p>

- What happens if you facet on a continuous variable?
- What do the empty cells in the plot with `facet_grid(drv ~ cyl)` mean? How do they relate to this plot?

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))
```

- What plots does the following code make? What does `.` do?

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
* facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
* facet_grid(. ~ cyl)
```

---
# Facets
### <p style="color:#00449E"> Exercises </p>

- Take the first faceted plot in this section:

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```

- What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

---
# Facets
### <p style="color:#00449E"> Exercises </p>

- Read `?facet_wrap`. What does `nrow` do? What does `ncol` do? What other options control the layout of the individual panels? Why doesn’t `facet_grid()` have `nrow` and `ncol` arguments?
- When using `facet_grid()`, you should usually put the variable with more unique levels in the columns. Why?

---
# Aesthetic Mappings and Facets
### <p style="color:#00449E"> Exercises </p>

- Use the following data.frame.

```r
tvshows_web <- read_csv(
  'https://bcdanl.github.io/data/tvshows.csv')
```

- Describe the relationship between audience size (`GRP`) and audience engagement (`PE`) using `ggplot`. Explain the relationship in words.
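---
# Aesthetic Mappings
### <p style="color:#00449E"> Setting vs. mapping `color`: a sketch </p>

- The two `color = "blue"` slides differ only in whether `color` sits inside `aes()`. The sketch below is one way to inspect the difference; the object names `p_set` and `p_map` are made up for illustration, and `layer_data()` is used only to look at the colors `ggplot2` ends up drawing.

```r
library(ggplot2)

# Setting: color is an argument of geom_point(), outside aes().
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: inside aes(), "blue" is treated as a constant data value,
# so ggplot2 assigns a color from its default scale and adds a legend.
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# layer_data() returns the computed aesthetics for the first layer.
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

- If the points should all share one fixed color, keep `color` outside `aes()`.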
---
class: inverse, center, middle

# Geometric Objects
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Geometric Objects

How are these two plots similar?

.pull-left[
<img src="../lec_figs/r4s_360_1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="../lec_figs/r4s_360_2.png" width="100%" style="display: block; margin: auto;" />
]

---
# Geometric Objects

- A **geom** is the geometric object that a plot uses to represent data; you add one with a `geom_*()` function.
  - Bar charts use `geom_bar()`;
  - Line charts use `geom_line()`;
  - Boxplots use `geom_boxplot()`;
  - Scatterplots use `geom_point()`;
  - Fitted lines use `geom_smooth()`;
  - and many more!

- We can use different `geom_*()` functions to plot the same data.

---
# Geometric Objects

- To change the geom in your plot, change the geom function that you add to `ggplot()`.

.panelset[
.panel[.panel-name[Scatterplot]
.pull-left[
```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```
]
.pull-right[
<!-- -->
]
]
<!---->
.panel[.panel-name[Fitted lines]
.pull-left[
```r
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```
]
.pull-right[
<!-- -->
]
]
<!---->
]

---
# Geometric Objects

- Every geom function in `ggplot2` takes a `mapping` argument.
- However, not every aesthetic works with every `geom`.
  - You can set the `shape` of a point, but you cannot set the `shape` of a line;
  - You can, however, set the `linetype` of a line.

```r
ggplot( data = mpg ) +
  geom_smooth( mapping = aes( x = displ, y = hwy,
*                             linetype = drv) )
```

---
# Geometric Objects

- You can set the `group` aesthetic to a *categorical variable* to draw multiple objects.
- `ggplot2` will draw a separate object for each unique value of the grouping variable.
```r
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy,
*                           group = drv))
```

---
# Geometric Objects

- In practice, `ggplot2` will automatically group the data for these `geoms` whenever you map an aesthetic to a discrete variable (as in the `linetype` example).

```r
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy,
*                 color = drv),
    show.legend = FALSE
  )
```

---
# Geometric Objects

- To display multiple geometric objects in the same plot, add multiple `geom_*()` functions to `ggplot()`:

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
```

---
# Geometric Objects

- If you pass mappings to `ggplot()`, `ggplot2` treats them as global mappings that apply to every layer.
- If you place mappings in a geom function, `ggplot2` treats them as local mappings for that layer only; they extend or overwrite the global mappings.

```r
ggplot(data = mpg,
*      mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
```

---
# Geometric Objects

- You can use the same idea to specify different data for each layer.
  - Here, our smooth line displays just a subset of the `mpg` dataset, the `subcompact` cars.
  - The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for that layer only.

```r
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
```

---
class: inverse, center, middle

# Statistical Transformations
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Statistical Transformations

- Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
- Consider a basic bar chart, as drawn with `geom_bar()`.
- The following bar chart displays the total number of diamonds in the `ggplot2::diamonds` dataset, grouped by `cut`.
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```

- The `diamonds` dataset ships with `ggplot2` and contains information about ~54,000 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond.

---
# Statistical Transformations

- Many graphs, including bar charts, calculate new values to plot:
  - `geom_bar()`, `geom_histogram()`, and `geom_freqpoly()` bin your data and then plot bin counts, the number of observations that fall in each bin.
  - `geom_smooth()` fits a model to your data and then plots predictions from the model.
  - `geom_boxplot()` computes a summary of the distribution and then displays a specially formatted box.

---
# Statistical Transformations

- The algorithm used to calculate new values for a graph is called a `stat`, short for statistical transformation.
- The figure below describes how this process works with `geom_bar()`.

<img src="../lec_figs/r4s_370_1.png" width="100%" style="display: block; margin: auto;" />

---
# Statistical Transformations
### <p style="color:#00449E"> Observed Value vs. Number of Observations </p>

- There are three reasons you might need to use a `stat` explicitly:
  - *1*. You might want to override the default stat.

```r
demo <- tribble(         # for a simple data.frame
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq),
*          stat = "identity")
```

---
# Statistical Transformations
### <p style="color:#00449E"> Count vs. Proportion </p>

- There are three reasons you might need to use a `stat` explicitly:
  - *2*. You might want to override the default mapping from transformed variables to aesthetics.

```r
# after_stat() replaces the deprecated stat() helper
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop),
*                        group = 1))
```

---
# Statistical Transformations
### <p style="color:#00449E"> Stat summary </p>

- There are three reasons you might need to use a `stat` explicitly:
  - *3*.
You might want to draw greater attention to the statistical transformation in your code.

```r
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
```

---
# Statistical Transformations
### <p style="color:#00449E"> Exercises </p>

- What is the default geom associated with `stat_summary()`? How could you rewrite the previous plot to use that geom function instead of the stat function?
- What does `geom_col()` do? How is it different from `geom_bar()`?
- Most `geoms` and `stats` come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
- What variables does `stat_smooth()` compute? What parameters control its behavior?

---
# Statistical Transformations
### <p style="color:#00449E"> Exercises </p>

- In our proportion bar chart, we need to set `group = 1`. Why? In other words, what is the problem with these two graphs?

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop) ) )

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop),
*                        fill = color ) )
```

---
class: inverse, center, middle

# Position Adjustments
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Position Adjustments

- You can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`:

.panelset[
.panel[.panel-name[`color`]
.pull-left[
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
*                        [?] = cut))
```
]
.pull-right[
<!-- -->
]
]

.panel[.panel-name[`fill`]
.pull-left[
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
*                        [?] = cut))
```
]
.pull-right[
<!-- -->
]
]
]

---
# Position Adjustments

- Note that the bars are automatically stacked if you map the `fill` aesthetic to another variable.
.pull-left[
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut,
*                        fill = clarity) )
```
]
.pull-right[
<!-- -->
]

---
# Position Adjustments

- Stacking is performed automatically by the **position adjustment** specified by the `position` argument.

.pull-left[
```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity),
*          position = "stack")
```
]
.pull-right[
<!-- -->
]

---
# Position Adjustments

- If you don't want a stacked bar chart with counts, you can use one of two other `position` options: `fill` or `dodge`.

.panelset[
.panel[.panel-name[`position = "fill"`]

- `position = "fill"` works like stacking, but makes each set of stacked bars the same height.
- This makes it easier to compare proportions across groups.

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity),
           position = [?])
```
]
<!---->
.panel[.panel-name[`position = "dodge"`]

- `position = "dodge"` places overlapping objects directly beside one another.

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity),
           position = [?])
```
]
<!---->
]

---
# Position Adjustments
### <p style="color:#00449E"> Overplotting </p>

- The values of `hwy` and `displ` are rounded, so the points appear on a grid and many points overlap each other.
  - This problem is known as **overplotting**.
- You can avoid the overlapping problem by setting the position adjustment to `jitter`.
  - `position = "jitter"` adds a small amount of random noise to each point.

```r
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             position = [?])
```

---
# Position Adjustments
### <p style="color:#00449E"> Exercises </p>

- What is the problem with this plot? How could you improve it?

```r
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()
```

- What parameters to `geom_jitter()` control the amount of jittering?
- Compare and contrast `geom_jitter()` with `geom_count()`.
- What’s the default position adjustment for `geom_boxplot()`?
Create a visualization of the `mpg` dataset that demonstrates it.

---
class: inverse, center, middle

# Coordinate Systems
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Coordinate Systems

- The default coordinate system is the Cartesian coordinate system, where the `x` and `y` positions act independently to determine the location of each point.
- There are a number of other coordinate systems that are occasionally helpful.

---
# Coordinate Systems

- `coord_flip()` switches the `x` and `y` axes.
- This is useful, for example, if you want horizontal boxplots.
- It's also useful for long labels: it's hard to get them to fit without overlapping on the `x`-axis.

```r
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
* coord_flip()
```

---
# Coordinate Systems

- `coord_quickmap()` sets the aspect ratio correctly for maps.

```r
nz <- map_data("nz")  # New Zealand map; requires the maps package

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
```

---
# Coordinate Systems

- `coord_polar()` uses polar coordinates.
- Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

```r
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()
```

---
# Coordinate Systems
### <p style="color:#00449E"> Exercises </p>

- Turn a stacked bar chart into a pie chart using `coord_polar()`.
- What does `labs()` do? Read the documentation.
- What does the plot below tell you about the relationship between city and highway mpg? Why is `coord_fixed()` important? What does `geom_abline()` do?
```r
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
```

---
class: inverse, center, middle

# `ggplot` grammar
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# The Layered Grammar of Graphics

- Let's add position adjustments, stats, coordinate systems, and faceting to our code template.

```r
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>,
     position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
```

- The seven parameters in the template are (1) a dataset, (2) a geom, (3) a set of mappings, (4) a stat, (5) a position adjustment, (6) a coordinate system, and (7) a faceting scheme. Together they compose the grammar of graphics, a formal system for building plots.
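---
# The Layered Grammar of Graphics
### <p style="color:#00449E"> A filled-in template: a sketch </p>

- As one possible illustration, the sketch below fills every slot of the template using the `diamonds` data. The particular choices (`position = "fill"`, `coord_flip()`, faceting by `color`) are arbitrary picks for demonstration, not a recommended design.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # (1) dataset
  geom_bar(                                    # (2) geom
    mapping = aes(x = cut, fill = clarity),    # (3) mappings
    stat = "count",                            # (4) stat (the default for geom_bar)
    position = "fill"                          # (5) position adjustment
  ) +
  coord_flip() +                               # (6) coordinate system
  facet_wrap(~ color)                          # (7) faceting scheme

p
```

- Most plots leave several of these at their defaults; writing them all out just shows where each piece of the grammar goes.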