class: title-slide, left, bottom # Lecture 27 ---- ## **DANL 100: Programming for Data Analytics** ### Byeong-Hak Choe ### December 8, 2022 --- # Announcement ### <p style="color:#00449E"> Student Course Experience (SCE) Survey - Effective Fall 2022, the Student Course Experience (SCE) survey replaces the Student Observation of Faculty Instruction (SOFI) survey. - In a web browser, students should visit their [myGeneseo](https://my.geneseo.edu/dashboard) portal, then select KnightWeb, Surveys, then SCE (formerly SOFI) Surveys. --- # Announcement ### <p style="color:#00449E"> Final Exam - The Final Exam is scheduled on Wednesday, December 14, noon - 2 P.M. - The Final Exam covers: - Python basics in programming (if-elif-else chain, for-loops, while-loops, functions with `def`) - R basics (variable and data types, vectors, and data.frame) - Loading CSV files in Python and R with `read_csv()` - Python pandas DataFrame and Series - Data visualization with Python `seaborn` and R `ggplot2` --- class: inverse, center, middle # Starting with R and RStudio <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Installing the Tools ### <p style="color:#00449E"> R programming </p> The R language is available as a free download from the R Project website at: - Windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/) - Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/) - Download the file of R that corresponds to your Mac OS (Big Sur, Apple silicon arm64, High Sierra, El Capitan, Mavericks, etc.) --- # Installing the Tools ### <p style="color:#00449E"> RStudio </p> - **RStudio** offers a graphical interface to assist in creating R code: - The RStudio Desktop is available as a free download from the following webpage: - [https://www.rstudio.com/products/rstudio/download/#download](https://www.rstudio.com/products/rstudio/download/#download) --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Script Pane** is where you write R commands in a script file that you can save. - An R script is simply a text file containing R commands. - RStudio will color-code different elements of your code to make it easier to read. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Console Pane** allows you to interact directly with the R interpreter and type commands where R will immediately execute them. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Environment Pane** is where you can see the values of variables, data frames, and other objects that are currently stored in memory. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Plots Pane** contains any graphics that you generate from your R code. ] --- # Installing the Tools ### <p style="color:#00449E"> R Packages </p> ```r pkgs <- c("ggplot2", "readr", "dplyr") install.packages(pkgs) ``` - While running the above codes, I recommend you to answer "no" to the following question: .pull-left[ **Mac**: *"Do you want to install from sources the packages which need compilation?"* from Console Pane. ] .pull-right[ **Windows**: *"Would you like to use a personal library instead?"* from Pop-up message. ] --- # Installing the Tools ### <p style="color:#00449E"> R Packages </p> - Check whether `ggplot2` is installed well: ```r library(ggplot2) # loading the package tidyverse mpg # data.frame provided by the package ggplot2 # ggplot2 is included in tidyverse ``` - Let me know if you have an error from the above code. --- class: inverse, center, middle # Workflow <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Workflow ### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p> .pull-left[ **Mac** - **command + shift + N** opens a new RScript. - **command + return** runs a current line or selected lines. - **command + shift + C** is the shortcut for # (commenting). - **option + - ** is the shortcut for `<-`. ] .pull-right[ **Windows** - **Ctrl + Shift + N** opens a new RS-cript. - **Ctrl + return** runs a current line or selected lines. - **Ctrl + Shift + C** is the shortcut for # (commenting). - **Alt + - ** is the shortcut for `<-`. ] --- # Workflow - **Home/End** moves the blinking cursor bar to the beginning/End of the line. - **Ctrl** (**command** for Mac Users) **+**
/
works too. - **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action. - **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes when undo is executed. - **Ctrl** (**command** for Mac Users) **+ F** is useful when finding a phrase (and replace the phrase) in the RScript. - Auto-completion of command is useful. - Type `libr` in the RScript in RStudio and wait for a second. .pull-left[ ```r libr ``` ] .pull-right[ <img src="../lec_figs/auto-completionRStudio.png" width="100%" style="display: block; margin: auto;" /> ] --- # Workflow - To install R package `PACKAGE`, use `install.packages("PACKAGE")`. ```r install.packages("ggplot2") # installing package "ggplot2" ``` - When the code is running, RStudio shows the STOP icon (
) at the top right corner in the Console Pane. - Do not click it unless if you want to stop running the code. <img src="../lec_figs/console-running.png" width="90%" style="display: block; margin: auto;" /> --- # Workflow ### <p style="color:#00449E"> Quotation marks, parentheses, and `+` </p> - Quotation marks and parentheses must always come in a pair. - If not, Console Pane will show you the continuation character `+`: ```r > x <- "hello ``` - The `+` tells you that R is waiting for more input; it doesn’t think you’re done yet. --- class: inverse, center, middle # Starting with R <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Starting with R ### <p style="color:#00449E"> Assignment </p> - R has many assignment operators (e.g., `<-`, `=`, `->` ). - The preferred one is `<-`. ```r x <- 2 x < - 3 print(x) x <- 5 x = 5 5 -> x ``` --- # Starting with R ### <p style="color:#00449E"> R variables and data types </p> - **Variables** can be thought of as a labelled container used to store information. - Variables allow us to recall saved information to later use in calculations. - Variables can store many different things in RStudio, from single values, data frames, to graphs. --- # Starting with R ### <p style="color:#00449E"> R variables and data types </p> .panelset[ .panel[.panel-name[variable types] .pull-left[ <img src="../lec_figs/r_variable_types.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Logical**: TRUE or FALSE. - **Numeric**: Decimal numbers - **Integer**: Integers - **Character**: Text strings - **Factor**: Categorical values. Each possible value of a factor is known as a *level*. ] ] .panel[.panel-name[data types] .pull-left[ <img src="../lec_figs/r_data_types.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ - **vector**: 1D collection of variables of the same type - **matrix**: 2D collection of variables of the same type - **data.frame**: 2D collection of variables of multiple types ] ] ] --- # R variable and data types ### <p style="color:#00449E"> </p> .panelset[ .panel[.panel-name[Character] - Strings are known as “character” in R. - Use the double quotes " or single quotes ' to wrap around the string ```r myname <- "my_name" class(myname) ``` - `class()` function returns the data type of an object. ] .panel[.panel-name[Numbers] - Numbers have different classes. - The most common two are integer and numeric. Integers are whole numbers: ```r favourite.integer <- as.integer(2) print(favourite.integer) class(favourite.integer) favourite.numeric <- as.numeric(8.8) print(favourite.numeric) class(favourite.numeric) pvalue.threshold <- 0.05 ``` ] .panel[.panel-name[Logical (TRUE/FALSE)] - We use the `==` to test for equality in R ```r class(TRUE) favourite.numeric == 8.8 favourite.numeric == 9.9 ``` ] .panel[.panel-name[Vectors] - We can create 1D data structures called “vectors”. ```r 1:10 2*(1:10) seq(0, 10, 2) myvector <- 1:10 myvector b <- c(3,4,5) b^2 beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", "MILLER LITE", "NATURAL LIGHT") beers ``` ] .panel[.panel-name[Factors] - Factors store categorical data. - Under the hood, factors are actually integers that have a string label attached to each unique integer. - For example, if we have a long list of Male/Female labels for each of our patients, this will be stored a “row” of zeros and ones by R. ```r beers <- as.factor(beers) class(beers) levels(beers) nlevels(beers) ``` ] ] --- # Starting with R ### <p style="color:#00449E"> NULL and NA values </p> - `NULL` is just an alias for `c()`, the empty vector. - `NA` indicates missing or unavailable data. ```r c(c(), 1, NULL) c("a", NA, "c") ``` --- # Starting with R ### <p style="color:#00449E"> NULL and NA values </p> - `NULL` is just an alias for `c()`, the empty vector. - `NA` indicates missing or unavailable data. ```r c(c(), 1, NULL) c("a", NA, "c") ``` --- class: inverse, center, middle # Working with Data from Files <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Working with Data from Files - Step 0. Download the zip file, 'car_data.zip' from the Files section in our Canvas. - Step 1. Find the path name for the file, `car.data.csv`. - Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, `car.data.csv`. - Step 3. Run the following R code: ```r # install.packages("readr") library(readr) uciCar <- read_csv( 'PATH_NAME_FOR_THE_FILE_car.data.csv') View(uciCar) ``` --- # Working with Data from Files ### <p style="color:#00449E"> Examining data frame </p> - `class()` tells you what kind of R object you have. - `dim()` shows how many rows and columns are in the data for `data.frame`. - `head()` shows the top few rows of the data. - `help()` provides the documentation for a class. - Try `help(class(uciCar))`. - `str()` gives us the structure for an object. --- # Working with Data from Files ### <p style="color:#00449E"> Examining data frame </p> - `summary()` provides a summary of almost any R object. - `skimr::skim()` provides a more detailed summary. - `skimr` is the package that provides the function `skim()`. - `print()` prints all the data. - Note: for large datasets, this can take a very long time and is something you want to avoid. - `View()` displays the data in a simple spreadsheet-like grid viewer. - `dplyr::glimpse()` displays brief information about the data. --- # Working with Data from Files ### <p style="color:#00449E"> Examining data frame </p> ```r print(uciCar) class(uciCar) dim(uciCar) head(uciCar) help(class(uciCar)) str(uciCar) summary(uciCar) library(skimr) skim(uciCar) library(tidyverse) glimpse(uciCar) ``` --- # Working with Data from Files ### <p style="color:#00449E"> Reading data from an URL </p> - We can import the data file from the web. ```r # install.packages("readr") # library(readr) tvshows <- read_csv( 'https://bcdanl.github.io/data/tvshows.csv') ``` --- # Working with Data from Files ### <p style="color:#00449E"> Accessing Subsets </p> .panelset[ .panel[.panel-name[head() & tail()] - `head()` returns the first N rows of our data frame. - `tail()` returns the last N rows of our data frame. ```r head(tvshows, n = 3) head(tvshows, 3) tail(tvshows, 3) ``` ] .panel[.panel-name[row] - As in Python, we can use the same slicing methods in R. - Starting index in R is 1, unlike Python. ```r tvshows[ 1:3, ] tvshows[ c(1, 2, 3), ] tvshows[ c(1, 2, 3), 1] ``` ] .panel[.panel-name[column] - Return the “Network” column in the data set: ```r tvshows$Network tvshows[, 2] tvshows[, "Network"] ``` - Return the columns named “Show” and “GRP” ```r tvshows[ , c("Show", "GRP")] ``` ] .panel[.panel-name[row & column] - Return only the first 3 rows and columns 2 and 5 of the data set ```r tvshows[1:3, c(2,5)] ``` ] .panel[.panel-name[filtering] - Return only the shows whose `Genre` is `Reality`. ```r tvshows[ tvshows$Genre == "Reality", ] ``` - Another way to subset the shows is with the `which()` function. - This returns the TRUE indices of a logical object. ```r reality <- which(tvshows$Genre == "Reality") reality tvshows[ reality, ] ``` ] .panel[.panel-name[filtering] - What if we want all shows whose PE is greater than 80? ```r tvshows[tvshows$PE > 80, ] ``` - Another way to subset the shows is with the `which()` function. - This returns the TRUE indices of a logical object. ```r reality <- which(tvshows$Genre == "Reality") reality tvshows[ reality, ] ``` ] ] --- # Working with Data from Files ### <p style="color:#00449E"> Class Exercises 2 </p> 1. Return those shows whose `Duration` values are `30`. 2. Return those shows whose `GRP` values are greater than the mean value of `GRP`. 2. Return the data.frame with only three variables---`Show`, `PE`, and `GRP`---for which `PE` values are greater than the mean value of `PE`. --- class: inverse, center, middle # Data Visualization with `ggplot2` <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # `ggplot2` <img src="../lec_figs/ggplot2-logo.png" width="20%" style="display: block; margin: auto;" /> - `ggplot2` is a R data visualization package based on [The Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.pdf). - `ggplot2` is the most elegant and most versatile visualization tools in R. - We provide the data, tell `ggplot2` how to map variables to `aes`thetics, what graphical primitives to use, and it takes care of the details. ```r library(ggplot2) ``` --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Types of plots - We will consider the following types of visualization: - Bar chart - Histogram - Scatter plot - Scatter plot with Fitted line - Line chart --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Getting started with `ggplot2` - Let's use the `titanic` and `tips` data.frames: ```r df_titanic <- read_csv('https://bcdanl.github.io/data/titanic_cleaned.csv') df_tips <- read_csv('https://bcdanl.github.io/data/tips_seaborn.csv') ``` --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Bar Chart - A bar chart is used to plot the frequency of the different categories. - It is useful to visualize how values of a **categorical variable** are distributed. - A variable is **categorical** if it can only take one of a small set of values. - We use `geom_bar()` function to plot a bar chart: .pull-left[ ```r ggplot( data = df_titanic ) + geom_bar( aes(x = sex) ) ``` ] .pull-right[ - Mapping - `data`: data.frame - `x`: Name of a categorical variable (column) in data.frame ] --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Bar Chart - We can further break up the bars in the bar chart based on another categorical variable. - This is useful to visualize the relationship between the two categorical variables. .pull-left[ ```r ggplot( data = df_titanic ) + geom_bar( aes( x = sex, fill = survived ) ) ``` ] .pull-right[ - Mapping - `fill`: Name of a categorical variable ] --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Histogram - A histogram is a **continuous** version of a bar chart. - It is used to plot the frequency of the different values. - It is useful to visualize how values of a **continuous variable** are distributed. - A variable is **continuous** if it can take any of an infinite set of ordered values. - We use `geom_histogram()` function to plot a histogram: .pull-left[ ```r ggplot(data = df_titanic) + geom_histogram( aes( x = age ), bins = 5 ) ``` ] .pull-right[ - Mapping - `bins`: Number of bins ] --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Scatter plot - A scatter plot is used to display the relationship between two continuous variables. - We can see co-variation as a pattern in the scattered points. - We use `geom_point()` function to plot a scatter plot: .pull-left[ ```r ggplot( data = df_tips ) + geom_point( aes( x = total_bill, y = tip ) ) ``` ] .pull-right[ - Mapping - `x`: Name of a continuous variable on the horizontal axis - `y`: Name of a continuous variable on the vertical axis ] --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Scatter plot - To the scatter plot, we can add a `color`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`. - Suppose we are interested in the following question: - **Q**. Does a smoker and a non-smoker have a difference in tipping behavior? ```r ggplot( data = df_tips ) + geom_point( aes( x = total_bill, y = tip, color = smoker ) ) ``` --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Fitted line - From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables. - `geom_smooth( method = lm )` adds a line that fits well into the scattered points. - On average, the fitted line describes the relationship between two continuous variables. ```r ggplot( data = df_tips ) + geom_point( aes( x = total_bill, y = tip, color = smoker ) ) + geom_smooth( aes( x = total_bill, y = tip, color = smoker ), method = lm ) ``` --- # Data Visualization with `ggplot2` ### <p style="color:#00449E"> Line cahrt - A line chart is used to display the trend in a continuous variable or the change in a continuous variable over other variable. - It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur. - We use `geom_line()` function to plot a line plot: ```r path_csv <- 'THE_PATHNAME_FOR_THE_FILE__dji.csv' dow <- read_csv(path_csv) ggplot( data = dow ) + geom_line( aes( x = Date, y = Close ) ) ```