Lecture 27


DANL 100: Programming for Data Analytics

Byeong-Hak Choe

December 8, 2022

1 / 43

Announcement

Student Course Experience (SCE) Survey

  • Effective Fall 2022, the Student Course Experience (SCE) survey replaces the Student Observation of Faculty Instruction (SOFI) survey.

  • In a web browser, students should visit their myGeneseo portal, then select KnightWeb, Surveys, then SCE (formerly SOFI) Surveys.

2 / 43

Announcement

Final Exam

  • The Final Exam is scheduled for Wednesday, December 14, from noon to 2 P.M.

  • The Final Exam covers:

    • Python basics in programming (if-elif-else chain, for-loops, while-loops, functions with def)
    • R basics (variables and data types, vectors, and data.frame)
    • Loading CSV files in Python and R with read_csv()
    • Python pandas DataFrame and Series
    • Data visualization with Python seaborn and R ggplot2
3 / 43

Starting with R and RStudio


4 / 43

Installing the Tools

R programming

The R language is available as a free download from the R Project website.

5 / 43

Installing the Tools

RStudio

6 / 43

Installing the Tools

RStudio Environment

  • Script Pane is where you write R commands in a script file that you can save.
    • An R script is simply a text file containing R commands.
    • RStudio will color-code different elements of your code to make it easier to read.
7 / 43

Installing the Tools

RStudio Environment

  • Console Pane allows you to interact directly with the R interpreter and type commands where R will immediately execute them.
8 / 43

Installing the Tools

RStudio Environment

  • Environment Pane is where you can see the values of variables, data frames, and other objects that are currently stored in memory.
9 / 43

Installing the Tools

RStudio Environment

  • Plots Pane contains any graphics that you generate from your R code.
10 / 43

Installing the Tools

R Packages

pkgs <- c("ggplot2", "readr", "dplyr")
install.packages(pkgs)
  • While the code above is running, I recommend answering "no" if you are asked either of the following questions:

Mac: "Do you want to install from sources the packages which need compilation?" in the Console Pane.

Windows: "Would you like to use a personal library instead?" in a pop-up message.

11 / 43

Installing the Tools

R Packages

  • Check whether ggplot2 is installed correctly:
library(ggplot2) # load the package ggplot2
mpg # data.frame provided by the package ggplot2
# ggplot2 is included in tidyverse
  • Let me know if the code above gives you an error.
12 / 43

Workflow


13 / 43

Workflow

Shortcuts for RStudio and RScript

Mac

  • command + shift + N opens a new R script.
  • command + return runs the current line or the selected lines.
  • command + shift + C is the shortcut for # (commenting).
  • option + - is the shortcut for <-.

Windows

  • Ctrl + Shift + N opens a new R script.
  • Ctrl + Enter runs the current line or the selected lines.
  • Ctrl + Shift + C is the shortcut for # (commenting).
  • Alt + - is the shortcut for <-.
14 / 43

Workflow

  • Home/End moves the blinking cursor to the beginning/end of the line.
    • Ctrl (command for Mac users) + ←/→ works too.
  • Ctrl (command for Mac users) + Z undoes the previous action.
  • Ctrl (command for Mac users) + Shift + Z redoes an action that was just undone.
  • Ctrl (command for Mac users) + F finds (and optionally replaces) a phrase in the R script.

  • Auto-completion of commands is useful.

    • Type libr in an R script in RStudio and wait for a second.
libr

15 / 43

Workflow

  • To install an R package named PACKAGE, use install.packages("PACKAGE").
install.packages("ggplot2") # installing package "ggplot2"
  • While code is running, RStudio shows the STOP icon at the top right corner of the Console Pane.
    • Do not click it unless you want to stop running the code.

16 / 43

Workflow

Quotation marks, parentheses, and +

  • Quotation marks and parentheses must always come in a pair.
    • If not, Console Pane will show you the continuation character +:
> x <- "hello
  • The + tells you that R is waiting for more input; it doesn’t think you’re done yet.
17 / 43

Starting with R


18 / 43

Starting with R

Assignment

  • R has many assignment operators (e.g., <-, =, -> ).
  • The preferred one is <-.
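x <- 2
x < - 3    # note the space: this compares x with -3 (FALSE), it does not assign
print(x)   # x is still 2
x <- 5     # the next two lines also assign 5 to x
x = 5
5 -> x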
19 / 43

Starting with R

R variables and data types

  • A variable can be thought of as a labelled container used to store information.

  • Variables let us recall saved information and use it later in calculations.

  • Variables can store many different things in R, from single values and data frames to graphs.

20 / 43

Starting with R

R variables and data types

  • Basic data types
    • Logical: TRUE or FALSE
    • Numeric: Decimal numbers
    • Integer: Integers
    • Character: Text strings
    • Factor: Categorical values. Each possible value of a factor is known as a level.

  • Data structures
    • vector: 1D collection of values of the same type
    • matrix: 2D collection of values of the same type
    • data.frame: 2D collection of variables (columns) that can be of different types
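
The three data structures above can be sketched with a few made-up values. The names scores, mixed, m, and df are illustrative assumptions, not part of the course data:

```r
# vector: 1D, all elements share one type
scores <- c(90, 85, 77)
class(scores)            # "numeric"

# mixing types in a vector coerces every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"

# matrix: 2D, all elements share one type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)                   # 2 3

# data.frame: 2D, each column can have its own type
df <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  score  = scores,
  passed = scores >= 80
)
sapply(df, class)        # one class per column
```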
21 / 43

R variables and data types

  • Strings are known as “character” in R.
  • Wrap a string in double quotes " or single quotes '.
myname <- "my_name"
class(myname)
  • class() function returns the data type of an object.
  • Numbers have different classes.
    • The most common two are integer and numeric. Integers are whole numbers:
favourite.integer <- as.integer(2)
print(favourite.integer)
class(favourite.integer)
favourite.numeric <- as.numeric(8.8)
print(favourite.numeric)
class(favourite.numeric)
pvalue.threshold <- 0.05
  • Logical values are TRUE and FALSE. We use == to test for equality in R:
class(TRUE)
favourite.numeric == 8.8
favourite.numeric == 9.9
  • We can create 1D data structures called “vectors”.
1:10
2*(1:10)
seq(0, 10, 2)
myvector <- 1:10
myvector
b <- c(3,4,5)
b^2
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", "MILLER LITE", "NATURAL LIGHT")
beers
  • Factors store categorical data.

  • Under the hood, factors are actually integers that have a string label attached to each unique integer.

    • For example, if we have a long list of Male/Female labels for each of our patients, R stores it as a vector of integer codes (ones and twos), with a label attached to each code.
beers <- as.factor(beers)
class(beers)
levels(beers)
nlevels(beers)
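
The integer codes behind a factor can be inspected with as.integer(). Below is a small sketch with hypothetical patient labels (the vector sex is made up for illustration). Note that R's codes start at 1 and follow the sorted order of the levels:

```r
# hypothetical patient labels, for illustration only
sex <- factor(c("Male", "Female", "Female", "Male"))

levels(sex)       # "Female" "Male"  (sorted alphabetically by default)
as.integer(sex)   # 2 1 1 2  (integer codes start at 1)
table(sex)        # number of observations per level
```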
22 / 43

Starting with R

NULL and NA values

  • NULL is just an alias for c(), the empty vector.
  • NA indicates missing or unavailable data.
c(c(), 1, NULL)
c("a", NA, "c")
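
A minimal sketch of how the two behave differently in practice: NULL contributes no element to a vector, while NA occupies a position and propagates through calculations. The vector z is a made-up example:

```r
x <- c(c(), 1, NULL)
length(x)                # 1: NULL and c() contribute no elements
is.null(c())             # TRUE

y <- c("a", NA, "c")
length(y)                # 3: NA still occupies a position
is.na(y)                 # FALSE TRUE FALSE

z <- c(2, NA, 4)
mean(z)                  # NA: missing values propagate
mean(z, na.rm = TRUE)    # 3
```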
23 / 43


Working with Data from Files


25 / 43

Working with Data from Files

  • Step 0. Download the zip file, 'car_data.zip' from the Files section in our Canvas.

  • Step 1. Find the path name for the file, car.data.csv.

  • Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, car.data.csv.

  • Step 3. Run the following R code:

# install.packages("readr")
library(readr)
uciCar <- read_csv(
'PATH_NAME_FOR_THE_FILE_car.data.csv')
View(uciCar)
26 / 43

Working with Data from Files

Examining data frame

  • class() tells you what kind of R object you have.
  • dim() shows how many rows and columns are in the data for data.frame.
  • head() shows the top few rows of the data.
  • help() provides the documentation for a class.
    • Try help(class(uciCar)).
  • str() gives us the structure for an object.
27 / 43

Working with Data from Files

Examining data frame

  • summary() provides a summary of almost any R object.
  • skimr::skim() provides a more detailed summary.
    • skimr is the package that provides the function skim().
  • print() prints all the data.
    • Note: for large datasets, this can take a very long time and is something you want to avoid.
  • View() displays the data in a simple spreadsheet-like grid viewer.
  • dplyr::glimpse() displays brief information about the data.
28 / 43

Working with Data from Files

Examining data frame

print(uciCar)
class(uciCar)
dim(uciCar)
head(uciCar)
help(class(uciCar))
str(uciCar)
summary(uciCar)
# install.packages("skimr")
library(skimr)
skim(uciCar)
library(dplyr) # glimpse() is provided by dplyr
glimpse(uciCar)
29 / 43

Working with Data from Files

Reading data from a URL

  • We can import the data file from the web.
# install.packages("readr")
# library(readr)
tvshows <- read_csv(
'https://bcdanl.github.io/data/tvshows.csv')
30 / 43

Working with Data from Files

Accessing Subsets

  • head() returns the first N rows of our data frame.
  • tail() returns the last N rows of our data frame.
head(tvshows, n = 3)
head(tvshows, 3)
tail(tvshows, 3)
  • As in Python, we can slice rows and columns by index in R.
    • Indexing in R starts at 1, unlike Python, where it starts at 0.
tvshows[ 1:3, ]
tvshows[ c(1, 2, 3), ]
tvshows[ c(1, 2, 3), 1]
  • Return the “Network” column in the data set:
tvshows$Network
tvshows[, 2]
tvshows[, "Network"]
  • Return the columns named “Show” and “GRP”
tvshows[ , c("Show", "GRP")]
  • Return only the first 3 rows and columns 2 and 5 of the data set
tvshows[1:3, c(2,5)]
  • Return only the shows whose Genre is Reality.
tvshows[ tvshows$Genre == "Reality", ]
  • Another way to subset the shows is with the which() function.
    • It returns the indices at which a logical vector is TRUE.
reality <- which(tvshows$Genre == "Reality")
reality
tvshows[ reality, ]
  • What if we want all shows whose PE is greater than 80?
tvshows[tvshows$PE > 80, ]
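
Logical subsetting and which() can give different results when the column contains missing values. The sketch below uses a small made-up data frame (shows is hypothetical, not the tvshows data):

```r
# hypothetical data, only to illustrate the two approaches
shows <- data.frame(
  Show = c("A", "B", "C", "D"),
  PE   = c(85, NA, 70, 92)
)

# logical subsetting: the comparison with NA yields NA,
# which produces an extra row filled with NAs
shows[shows$PE > 80, ]

# which() keeps only the positions where the condition is TRUE
idx <- which(shows$PE > 80)
idx                      # 1 4
shows[idx, ]
```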
31 / 43

Working with Data from Files

Class Exercises 2

  1. Return those shows whose Duration values are 30.

  2. Return those shows whose GRP values are greater than the mean value of GRP.

  3. Return the data.frame with only three variables (Show, PE, and GRP) for the shows whose PE values are greater than the mean value of PE.

32 / 43

Data Visualization with ggplot2


33 / 43

ggplot2

  • ggplot2 is an R data visualization package based on The Grammar of Graphics.
    • ggplot2 is one of the most elegant and versatile visualization tools in R.
    • We provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
library(ggplot2)
34 / 43

Data Visualization with ggplot2

Types of plots

  • We will consider the following types of visualization:

    • Bar chart

    • Histogram

    • Scatter plot

    • Scatter plot with Fitted line

    • Line chart

35 / 43

Data Visualization with ggplot2

Getting started with ggplot2

  • Let's use the titanic and tips data.frames:
df_titanic <- read_csv('https://bcdanl.github.io/data/titanic_cleaned.csv')
df_tips <- read_csv('https://bcdanl.github.io/data/tips_seaborn.csv')
36 / 43

Data Visualization with ggplot2

Bar Chart

  • A bar chart is used to plot the frequency of the different categories.
    • It is useful to visualize how values of a categorical variable are distributed.
    • A variable is categorical if it can only take one of a small set of values.
  • We use the geom_bar() function to plot a bar chart:
ggplot( data = df_titanic ) +
geom_bar( aes(x = sex) )
  • Mapping
    • data: data.frame
    • x: Name of a categorical variable (column) in data.frame
37 / 43

Data Visualization with ggplot2

Bar Chart

  • We can further break up the bars in the bar chart based on another categorical variable.

    • This is useful to visualize the relationship between the two categorical variables.
ggplot( data = df_titanic ) +
geom_bar( aes( x = sex,
fill = survived ) )
  • Mapping
    • fill: Name of a categorical variable
38 / 43

Data Visualization with ggplot2

Histogram

  • A histogram is a continuous version of a bar chart.
    • It is used to plot the frequency of the different values.
    • It is useful to visualize how values of a continuous variable are distributed.
    • A variable is continuous if it can take any of an infinite set of ordered values.
  • We use the geom_histogram() function to plot a histogram:
ggplot( data = df_titanic ) +
geom_histogram( aes( x = age ),
bins = 5 )
  • Mapping
    • bins: Number of bins
39 / 43

Data Visualization with ggplot2

Scatter plot

  • A scatter plot is used to display the relationship between two continuous variables.

    • We can see co-variation as a pattern in the scattered points.
  • We use the geom_point() function to plot a scatter plot:

ggplot( data = df_tips ) +
geom_point( aes( x = total_bill,
y = tip ) )
  • Mapping
    • x: Name of a continuous variable on the horizontal axis
    • y: Name of a continuous variable on the vertical axis
40 / 43

Data Visualization with ggplot2

Scatter plot

  • To the scatter plot, we can add a color-VARIABLE mapping to display how the relationship between two continuous variables varies by VARIABLE.

  • Suppose we are interested in the following question:

    • Q. Do smokers and non-smokers differ in tipping behavior?
ggplot( data = df_tips ) +
geom_point( aes( x = total_bill, y = tip,
color = smoker ) )
41 / 43

Data Visualization with ggplot2

Fitted line

  • From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.
    • geom_smooth( method = lm ) adds a straight line fitted to the scattered points.
    • On average, the fitted line describes the relationship between two continuous variables.
ggplot( data = df_tips ) +
geom_point( aes( x = total_bill, y = tip,
color = smoker ) ) +
geom_smooth( aes( x = total_bill, y = tip,
color = smoker ),
method = lm )
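
With its default settings, geom_smooth( method = lm ) draws the least-squares line of y on x, which is the same line that lm() estimates. The sketch below inspects such a line numerically. It uses the mtcars data frame that ships with R as a stand-in (an assumption for illustration, not the tips data):

```r
# mtcars ships with R; it stands in for df_tips here
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                # intercept and slope of the fitted line

# predicted mpg for a car with wt = 3 (weight in 1000 lbs)
predict(fit, newdata = data.frame(wt = 3))
```

A negative slope here means that, on average, heavier cars have lower mpg.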
42 / 43

Data Visualization with ggplot2

Line chart

  • A line chart is used to display the trend in a continuous variable or the change in a continuous variable over another variable.

    • It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.

  • We use the geom_line() function to plot a line chart:

path_csv <- 'THE_PATHNAME_FOR_THE_FILE__dji.csv'
dow <- read_csv(path_csv)
ggplot( data = dow ) +
geom_line( aes( x = Date, y = Close ) )
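
Since the path to dji.csv above is a placeholder, here is a self-contained sketch of the same pattern. It assumes the economics data frame bundled with ggplot2 as a substitute for the Dow Jones data, with date and unemploy playing the roles of Date and Close:

```r
library(ggplot2)

# economics is a data.frame bundled with ggplot2;
# its date column is already of class Date
class(economics$date)

# connect the observations in order of date
p <- ggplot( data = economics ) +
  geom_line( aes( x = date, y = unemploy ) )
p
```

If a date column were read in as text instead, it would likely need converting (for example with as.Date()) before geom_line() could order the points in time.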
43 / 43
