class: title-slide, left, bottom # Lecture 9 ---- ## **DANL 200: Introduction to Data Analytics** ### Byeong-Hak Choe ### September 27, 2022 --- # Announcement ### <p style="color:#00449E"> Tutoring/TA-ing at Data Analytics Lab (South 321) </p> - Marcie Hogan (Tutor for Data Analytics 100): 1. Sunday, 2:00 PM--5:00 PM 2. Wednesday, 12:30 PM--1:30 PM - Andrew Mosbo (Tutor for Data Analytics): 1. Mondays, 4:00 PM--5:00 PM 2. Wednesdays, 11:00 A.M.--noon 3. Thursdays, 5:00 PM--6:00 PM - Jason Rappazzo (Tutor for Data Analytics; TA for Prof. Yazdani) 1. Tuesdays and Thursdays, 9:30 AM--10:45 AM 2. Friday, 9:00 AM--10:15 AM 3. Friday, 9:00 PM--10:15 PM --- # Announcement ### <p style="color:#00449E"> Tutoring/TA-ing at Data Analytics Lab (South 321) </p> - Emine Morris (TA for Byeong-Hak): 1. Mondays and Wednesdays, 5:00 PM--6:30 PM 2. Tuesdays and Thursdays, 3:00 PM--4:45 PM --- # Announcement <img src="../lec_figs/gold-pd-schedule-fall-2022.png" width="75%" style="display: block; margin: auto;" /> - Prerequisite required: `*` Intro to Interviewing; `**` Intro to Resume --- # Announcement ### <p style="color:#00449E"> Career Design Center </p> - You can access registration for the events at [https://www.geneseo.edu/gold](https://www.geneseo.edu/gold). - You can check the Career Design Event Calender from [https://www.geneseo.edu/career-design/events/calendar](https://www.geneseo.edu/career-design/events/calendar). - They encourage the first point of contact with their office to be through Quick Chats (Virtual Drop-Ins). - If you, 2nd - 5th year students, are interested in interning/working for a public accounting firm (open to all SoB majors, variety of positions,Tax, Audit, consulting) please have them stop in to see Mary Cannon today or tomorrow (no appt. needed) in South 112. --- # Announcement ### <p style="color:#00449E"> Homework Assignment 1 </p> - Here is one additional description for the data file, `bikeshare_cleaned.csv`. - The data set, `bikeshare_cleaned.csv`, includes 17376 observations of hourly counts from 2011 to 2012 for bike rides (rentals) in Washington D.C. - Variable `cnt` is a hourly count of total bikes rented out. --- # Style of Coding and Commenting for Homework Assignment ```r #### # Q2b. Provide both (1) `ggplot` codes and # (2) a couple of sentences # to describe the distribution of `cnt`. #### #### # Answer for Q2b #### ggplot(data = ... ) + geom_*(mapping = aes( ... )) # *** *** *** # The peak value ... # The distribution is .... skewed # Variable `cnt` is concentrated in an interval ... ``` --- # Style of Coding and Commenting for Homework Assignment ```r #### # Q2g. Provide both (1) `ggplot` codes and # (2) a couple of sentences # to describe the relationship between `temp` and `cnt`. #### #### # Answer for Q2g #### ggplot(data = ... ) + geom_*(mapping = aes( ... )) # *** *** *** # `temp` is .... associated with .... # .... ``` --- # Workflow ### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p> .pull-left[ **Mac** - **command + shift + N** opens a new RScript. - **command + return** runs a current line or selected lines. - **command + shift + C** is the shortcut for # (commenting). - **option + - ** is the shortcut for `<-`. ] .pull-right[ **Windows** - **Ctrl + Shift + N** opens a new RS-cript. - **Ctrl + return** runs a current line or selected lines. - **Ctrl + Shift + C** is the shortcut for # (commenting). - **Alt + - ** is the shortcut for `<-`. ] --- # Workflow - **Home/End** moves the blinking cursor bar to the beginning/End of the line. - **Ctrl** (**command/fn** for Mac Users) **+**
/
works too. - **PgUp/PgDn** moves the blinking cursor bar to the top/bottom line of the script on the screen. - **Fn + **
/
works too. - **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action. - **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes when undo is executed. - **Ctrl** (**command** for Mac Users) **+ F** is useful when finding a phrase (and replace the phrase) in the RScript. - **Ctrl** (**command** for Mac Users) **+ D** deletes a current line. --- class: inverse, center, middle # Statistical Transformation <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Statistical Transformations - The algorithm used to calculate new values for a *ggplot* is called a `stat`, short for statistical transformation. - The figure below describes how this process works with `geom_bar()`. <img src="../lec_figs/r4s_370_1.png" width="96%" style="display: block; margin: auto;" /> --- # Statistical Transformations - Many graphs, including bar charts, calculate new values to plot: .panelset[ .panel[.panel-name[histogram] , `geom_histogram()` and `geom_freqpoly()` bin your data and then plot bin counts, the number of observations that fall in each bin. .pull-left[ ```r ggplot(data = mpg, mapping = aes(x = hwy)) + geom_histogram(binwidth = 1, fill = NA, color = "blue") + geom_freqpoly(binwidth = 1, color = "red") ``` ] .pull-right[ <!-- --> ] ] .panel[.panel-name[density function] - `geom_density()` plots a probability density function. - The area under the density plot is re-scaled to equal one. - We can think of a density plot as a continuous histogram of a variable. .pull-left[ ```r ggplot(data = mpg, mapping = aes(x = hwy)) + geom_density() ``` ] .pull-right[ <!-- --> ] ] .panel[.panel-name[bar chart] `geom_bar()` is a histogram for discrete data. .pull-left[ ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) ``` ] .pull-right[ <!-- --> ] ] .panel[.panel-name[boxplot] - `geom_boxplot()` compute a summary of the distribution and then display a specially formatted box. .pull-left[ ```r ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = cut, y = price)) ``` ] .pull-right[ <!-- --> ] ] .panel[.panel-name[fitted line] - `geom_smooth()` fits a model to your data and then plot predicted values of `y` from the model. .pull-left[ ```r ggplot(data = mpg, mapping = aes(x = hwy, y = cty)) + geom_point(alpha = .25) + geom_smooth(method = lm, color = 'red') + geom_smooth(linetype = 2, se = FALSE) ``` ] .pull-right[ <!-- --> ] ] ] --- # Statistical Transformations ### <p style="color:#00449E"> Observed Value vs. Number of Observations - There are three reasons we might need to use a `stat` explicitly: - *1*. We might want to override the default `stat`. ```r demo <- tribble( # for a simple data.frame ~cut, ~freq, "Fair", 1610, "Good", 4906, "Very Good", 12082, "Premium", 13791, "Ideal", 21551 ) ggplot(data = demo) + geom_bar(mapping = aes(x = cut, y = freq), * stat = "identity") ``` --- # Statistical Transformations ### <p style="color:#00449E"> Count vs. Proportion - There are three reasons we might need to use a `stat` explicitly: - *2*. We might want to override the default mapping from transformed variables to aesthetics. ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, * y = after_stat(prop), group = 1)) ``` --- # Statistical Transformations ### <p style="color:#00449E"> Stat summary - There are three reasons we might need to use a `stat` explicitly: - *3*. We might want to draw greater attention to the statistical transformation in our code. ```r ggplot(data = diamonds) + * stat_summary( mapping = aes(x = cut, y = depth), fun = median, fun.min = min, fun.max = max ) ``` --- # Statistical Transformations ### <p style="color:#00449E"> Exercises - What is the default geom associated with `stat_summary()`? How could you rewrite the previous plot to use that geom function instead of the stat function? - What does `geom_col()` do? How is it different to `geom_bar()`? - Most `geoms` and `stats` come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common? - What variables does `stat_smooth()` compute? What parameters control its behavior? --- # Statistical Transformations ### <p style="color:#00449E"> Exercises - In our proportion bar chart, we need to set `group = 1`. Why? In other words what is the problem with these two graphs? ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = stat(prop) ) ) ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = stat(prop), * fill = color ) ) ``` --- class: inverse, center, middle # Position Adjustment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Position Adjustments ### <p style="color:#00449E"> `color` and `fill` aesthetic - We can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`: .panelset[ .panel[.panel-name[`color`] .pull-left[ ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, * [?] = cut)) ``` ] .pull-right[ <!-- --> ] ] .panel[.panel-name[`fill`] .pull-left[ ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, * [?] = cut)) ``` ] .pull-right[ <!-- --> ] ] ] --- # Position Adjustments ### <p style="color:#00449E"> Stacked bar charts with `fill` aesthetic - Note that the bars are automatically stacked if we map the `fill` aesthetic to another variable. .pull-left[ ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, * fill = clarity) ) ``` ] .pull-right[ <!-- --> ] --- # Position Adjustments ### <p style="color:#00449E"> Stacked bar charts with `fill` aesthetic - The `stack`ing is performed automatically by the **position adjustment** specified by the `position` argument. .pull-left[ ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), * position = "stack") ``` ] .pull-right[ <!-- --> ] --- # Position Adjustments ### <p style="color:#00449E"> `position = "fill"` and `position = "dodge"` - If we don't want a stacked bar chart with counts, we can use one of two other `position` options: `fill` or `dodge`. .panelset[ .panel[.panel-name[`position = "fill"`] - `position = "fill"` works like stacking, but makes each set of stacked bars the same height. - This makes it easier to compare proportions across groups. ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = [?]) ``` ] <!----> .panel[.panel-name[`position = "dodge"`] - `position = "dodge"` places overlapping objects directly beside one another. ```r ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = [?]) ``` ] <!----> ] --- # Position Adjustments ### <p style="color:#00449E"> Overplotting and `position = "jitter"` - The values of `hwy` and `displ` are rounded so the points appear on a grid and many points overlap each other. - This problem is known as **overplotting**. - We can avoid the overlapping problem by setting the position adjustment to `jitter`. - `position = "jitter"` adds a small amount of random noise to each point. ```r ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = [?]) ``` --- # Position Adjustments ### <p style="color:#00449E"> Exercises - What is the problem with this plot? How could you improve it? ```r ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() ``` - What parameters to `geom_jitter()` control the amount of jittering? - Compare and contrast `geom_jitter()` with `geom_count()`. - What’s the default position adjustment for `geom_boxplot()`? Create a visualization of the `mpg` dataset that demonstrates it. --- class: inverse, center, middle # Coordinate <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Coordinate Systems - The default coordinate system is the Cartesian coordinate system where the `x` and `y` positions act independently to determine the location of each point. - There are a number of other coordinate systems that are occasionally helpful. --- # Coordinate Systems ### <p style="color:#00449E"> `coord_flip()` - `coord_flip()` switches the `x` and `y` axes. - This is useful (for example), if we want horizontal boxplots. - It's also useful for long labels: it's hard to get them to fit without overlapping on the `x`-axis. ```r ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() + * coord_flip() ``` --- # Coordinate Systems ### <p style="color:#00449E"> `coord_quickmap()` - `coord_quickmap()` sets the aspect ratio correctly for maps. ```r county <- map_data("county") # Map data for US Counties ny <- filter(county, # We will discuss 'filter()' in the next chapter region == "new york") ggplot(ny, aes(long, lat, group = group)) + geom_polygon(fill = "white", color = "black") ggplot(ny, aes(long, lat, group = group)) + geom_polygon(fill = "white", color = "black") + coord_quickmap() ``` --- # Coordinate Systems ### <p style="color:#00449E"> Exercises - What does `labs()` do? Read the documentation. - What does the plot below tell you about the relationship between city and highway mpg? Why is `coord_fixed()` important? What does `geom_abline()` do? ```r ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() + geom_abline() + coord_fixed() ``` --- class: inverse, center, middle # `ggplot` Grammar <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # The Layered Grammar of Graphics - Let's add position adjustments, stats, coordinate systems, and faceting to our code template. ```r ggplot(data = <DATA>) + <GEOM_FUNCTION>( mapping = aes(<MAPPINGS>), stat = <STAT>, position = <POSITION>) + <COORDINATE_FUNCTION> + <FACET_FUNCTION> ``` - The seven parameters---(1) a dataset, (2) a geom, (3) a set of mappings, (4) a stat, (5) a position adjustment, (6) a coordinate system, and (7) a faceting scheme---in the template compose the **grammar of graphics**, a formal system for building plots.