
Lecture 26


DANL 100: Programming for Data Analytics

Byeong-Hak Choe

December 6, 2022

1 / 40

Announcement

Student Course Experience (SCE) Survey

  • Effective Fall 2022, the Student Course Experience (SCE) survey replaces the Student Observation of Faculty Instruction (SOFI) survey.

  • In a web browser, students should visit their myGeneseo portal, then select KnightWeb, Surveys, then SCE (formerly SOFI) Surveys.

2 / 40

Data Visualization with seaborn

Transparency with alpha

  • In a scatter plot, adding transparency with alpha helps with overplotting, where many data points fall on the same location.
    • We can set alpha to a number between 0 (fully transparent) and 1 (fully opaque).
import seaborn as sns
df_tips = sns.load_dataset('tips')
sns.scatterplot(x = 'total_bill',
                y = 'tip',
                hue = 'smoker',
                alpha = .25,
                data = df_tips)
sns.lmplot(x = 'total_bill',
           y = 'tip',
           scatter_kws = {'alpha' : 0.2},
           data = df_tips)
3 / 40

Starting with R and RStudio


4 / 40

Installing the Tools

R programming

The R language is available as a free download from the R Project website at:

5 / 40

Installing the Tools

RStudio

6 / 40

Installing the Tools

RStudio Environment

  • Script Pane is where you write R commands in a script file that you can save.
    • An R script is simply a text file containing R commands.
    • RStudio will color-code different elements of your code to make it easier to read.
7 / 40

Installing the Tools

RStudio Environment

  • Console Pane allows you to interact directly with the R interpreter and type commands where R will immediately execute them.
8 / 40

Installing the Tools

RStudio Environment

  • Environment Pane is where you can see the values of variables, data frames, and other objects that are currently stored in memory.
9 / 40

Installing the Tools

RStudio Environment

  • Plots Pane contains any graphics that you generate from your R code.
10 / 40

Installing the Tools

R Packages

pkgs <- c("ggplot2", "readr", "dplyr")
install.packages(pkgs)
  • While running the code above, I recommend answering "no" to the following questions:

Mac: "Do you want to install from sources the packages which need compilation?" in the Console Pane.

Windows: "Would you like to use a personal library instead?" in a pop-up message.

11 / 40

Installing the Tools

R Packages

  • Check whether ggplot2 was installed correctly:
library(ggplot2) # load the package ggplot2
mpg # data.frame provided by the package ggplot2
# ggplot2 is also included in tidyverse
  • Let me know if the code above produces an error.
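  • A minimal sketch of a programmatic check, assuming we only want to know which packages can be loaded. The helper name is_installed() is hypothetical; requireNamespace() is base R and returns TRUE or FALSE instead of stopping with an error.

```r
# Hypothetical helper: TRUE if the package is installed and can be loaded
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkgs <- c("ggplot2", "readr", "dplyr")

# Named logical vector: one TRUE/FALSE per package
installed <- vapply(pkgs, is_installed, logical(1))
installed

# Names of the packages that still need install.packages()
names(installed)[!installed]
```

  • An empty result from the last line suggests that all three packages are available.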
12 / 40

Workflow


13 / 40

Workflow

Shortcuts for RStudio and RScript

Mac

  • command + shift + N opens a new R script.
  • command + return runs the current line or the selected lines.
  • command + shift + C comments or uncomments the selected lines with #.
  • option + - inserts the assignment operator <-.

Windows

  • Ctrl + Shift + N opens a new R script.
  • Ctrl + Enter runs the current line or the selected lines.
  • Ctrl + Shift + C comments or uncomments the selected lines with #.
  • Alt + - inserts the assignment operator <-.
14 / 40

Workflow

  • Home/End moves the blinking cursor to the beginning/end of the line.
    • Ctrl (command for Mac users) + ←/→ works too.
  • Ctrl (command for Mac users) + Z undoes the previous action.
  • Ctrl (command for Mac users) + Shift + Z redoes an action that was undone.
  • Ctrl (command for Mac users) + F finds a phrase (and optionally replaces it) in the R script.

  • Auto-completion of commands is useful.

    • Type libr in the R script in RStudio and wait for a second.
libr

15 / 40

Workflow

  • To install an R package named PACKAGE, use install.packages("PACKAGE").
install.packages("ggplot2") # installing package "ggplot2"
  • While the code is running, RStudio shows the STOP icon at the top right corner of the Console Pane.
    • Do not click it unless you want to stop the running code.

16 / 40

Workflow

Quotation marks, parentheses, and +

  • Quotation marks and parentheses must always come in a pair.
    • If not, the Console Pane shows the continuation character +:
> x <- "hello
+
  • The + tells you that R is waiting for more input; it doesn’t think you’re done yet.
17 / 40

Workflow

RStudio Options Setting

  • Open the options menu as follows:

    • Mac: RStudio > Preferences
    • Windows: Tools > Global Options
  • Set the options as shown in the picture.

  • Choose "Never" for "Save workspace to .RData on exit:".
18 / 40

Starting with R


19 / 40

Starting with R

  • Let's try a few commands to help you become familiar with R and its basic data types.

  • In R, vectors are arrays of same-typed values.

    • They can be built with the c() notation.
1
1/2
'Joe'
"Joe"
"Joe"=='Joe'
c()
is.null(c())
is.null(5)
c(1)
c(1, 2)
c("Apple", 'Orange')
length(c(1, 2))
vec <- c(1, 2)
vec
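  • Because a vector holds values of a single type, c() converts mixed inputs to one common type. A small sketch (the variable names nums and mixed are only for illustration):

```r
# A vector of numbers stays numeric
nums <- c(1, 2, 3)
class(nums)    # "numeric"

# Mixing a number, a string, and a logical coerces everything to character
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"

# Arithmetic applies element by element
nums * 2       # 2 4 6
length(mixed)  # 3
```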
20 / 40

Starting with R

Assignment

  • R has many assignment operators (e.g., <-, =, -> ).
  • The preferred one is <-.
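x <- 2
x < - 3   # not an assignment: the space makes this the comparison x < -3
print(x)  # x is still 2
x <- 5    # preferred
x = 5     # also assigns 5 to x
5 -> x    # also assigns 5 to x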
21 / 40

Starting with R

Class Exercise 1

  1. Create a new R script.

  2. Enter the following code into your script.

variable2 <- c(12, 1, 10, 2, 18, 3)
variable2
mean(variable2)
variable2 + 2
  3. Run your code (Ctrl + Enter for Windows users; command + return for Mac users).

  4. Save your code.

22 / 40

Starting with R

R variables and data types

  • A variable can be thought of as a labelled container used to store information.

  • Variables allow us to recall saved information for later use in calculations.

  • Variables can store many different things in R, from single values to data frames to graphs.

23 / 40

Starting with R

R variables and data types

  • Logical: TRUE or FALSE
  • Numeric: decimal numbers
  • Integer: whole numbers
  • Character: text strings
  • Factor: categorical values. Each possible value of a factor is known as a level.

  • vector: 1D collection of values of the same type
  • matrix: 2D collection of values of the same type
  • data.frame: 2D collection of columns that may have different types
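  • A short sketch of the three structures, using made-up values. The names v, m, and df are hypothetical, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
v <- c(10, 20, 30)                    # vector: one type, one dimension
m <- matrix(1:6, nrow = 2, ncol = 3)  # matrix: one type, rows and columns
df <- data.frame(                     # data.frame: columns may differ in type
  name  = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

class(v)           # "numeric"
dim(m)             # 2 3
sapply(df, class)  # name: "character", value: "numeric", flag: "logical"
```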
24 / 40

R variables and data types

  • Strings are known as “character” in R.
  • Wrap a string in double quotes " or single quotes '.
myname <- "my_name"
class(myname)
  • The class() function returns the data type of an object.
  • Numbers have different classes.
    • The two most common are integer and numeric. Integers are whole numbers:
favourite.integer <- as.integer(2)
print(favourite.integer)
class(favourite.integer)
favourite.numeric <- as.numeric(8.8)
print(favourite.numeric)
class(favourite.numeric)
pvalue.threshold <- 0.05
  • We use == to test for equality in R; the result is a logical value (TRUE or FALSE).
class(TRUE)
favourite.numeric == 8.8
favourite.numeric == 9.9
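  • One caveat when using == on numeric values: results of decimal arithmetic are stored approximately, so an exact comparison can be FALSE even when the numbers look equal. A minimal sketch using base R's all.equal(), which compares within a small tolerance:

```r
0.1 + 0.2 == 0.3                   # FALSE: the sum is not exactly 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: equal within a small tolerance

# Exact comparison is fine for a value stored as typed
x <- 8.8
x == 8.8                           # TRUE
```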
  • We can create 1D data structures called “vectors”.
1:10
2*(1:10)
seq(0, 10, 2)
myvector <- 1:10
myvector
b <- c(3,4,5)
b^2
beers <- c("BUD LIGHT", "BUSCH LIGHT", "COORS LIGHT", "MILLER LITE", "NATURAL LIGHT")
beers
  • Factors store categorical data.

  • Under the hood, factors are integers with a string label attached to each unique integer.

    • For example, if we have a long list of Male/Female labels for each of our patients, R stores it as a vector of integer codes (ones and twos), with one label per code.
beers <- as.factor(beers)
class(beers)
levels(beers)
nlevels(beers)
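  • To look at the integer codes behind a factor, we can convert it with as.integer(). A small sketch with hypothetical patient labels; note that the codes start at 1 and follow the alphabetical order of the levels by default.

```r
# Hypothetical labels, for illustration only
sex <- factor(c("Male", "Female", "Female", "Male"))

levels(sex)      # "Female" "Male"  (sorted alphabetically by default)
as.integer(sex)  # 2 1 1 2          (integer codes start at 1)
table(sex)       # number of observations per level
```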
25 / 40

Starting with R

NULL and NA values

  • NULL is just an alias for c(), the empty vector.
  • NA indicates missing or unavailable data.
c(c(), 1, NULL)
c("a", NA, "c")
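  • A brief sketch of how NULL and NA behave differently: NULL contributes nothing to a vector, while NA occupies a position. The scores vector is hypothetical.

```r
length(c(c(), 1, NULL))     # 1: NULL and c() contribute nothing
length(c("a", NA, "c"))     # 3: NA occupies a position

is.null(NULL)               # TRUE
is.na(c("a", NA, "c"))      # FALSE TRUE FALSE

scores <- c(90, NA, 70)     # hypothetical values
mean(scores)                # NA: a missing value makes the mean unknown
mean(scores, na.rm = TRUE)  # 80: NA is dropped before averaging
```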
26 / 40


Management of Files, Directories, and Scripts


28 / 40

Management of Files, Directories, and Scripts

Code and comment style

  • The two main principles for coding and managing data are:

    • Make things easier for your future self.
    • Don't trust your future self.
  • So we comment our code.

29 / 40

Management of Files, Directories, and Scripts

Code and comment style

  • The # mark is R's comment character.

    • # indicates that the rest of the line is to be ignored.
    • Write comments before the line that you want the comment to apply to.
  • Consider using block commenting for separating code sections.

    • ##### defines a coding block.
  • Break down long lines and long algebraic expressions.
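  • One possible layout that follows these guidelines. All variable names and values are hypothetical and serve only to illustrate the style.

```r
##### Setup #####
# Hypothetical inputs used only for this illustration
price    <- c(10.5, 20.0, 7.25)
quantity <- c(3, 1, 4)
tax_rate <- 0.08

##### Computation #####
# Revenue per item, before tax
revenue <- price * quantity

# A long expression broken across lines; ending a line with an
# operator tells R that the expression continues on the next line
total <- sum(revenue) +
  sum(revenue) * tax_rate

total
```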

30 / 40

Management of Files, Directories, and Scripts

Finding the path name of the file

Windows 11

  • Step 1. Go to your folder using the File Explorer.
  • Step 2. Right-click the file.
  • Step 3. Click "Copy as path".
  • Step 4. Paste the path name of the file to the R script (Ctrl+V).
  • Step 5.
    • Option 1. Replace backslash(\) with double-backslash(\\) in the path name.
    • Option 2. Replace backslash(\) with slash(/) in the path name.

Windows 10

  • Step 1. Go to your folder using the File Explorer.
  • Step 2. Hold down the "Shift" key.
  • Step 3. Right-click the file.
  • Step 4. Click "Copy as path".
  • Step 5. Paste the path name of the file to the R script (Ctrl+V).
  • Step 6.
    • Option 1. Replace backslash(\) with double-backslash(\\) in the path name.
    • Option 2. Replace backslash(\) with slash(/) in the path name.

Mac

  • Step 1. Go to your folder using the Finder.
  • Step 2. Right-click the file in the folder.
  • Step 3. Hold down the "option" key.
  • Step 4. Click "Copy 'PATH_FOR_YOUR_FILE' as Pathname" from the menu.
  • Step 5. Paste it to the R script (command+V).
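  • A sketch of the two Windows options in R. The path below is hypothetical and used only for illustration; inside an R string a single backslash starts an escape sequence, which is why it has to be doubled or replaced with a slash.

```r
# Hypothetical Windows path, for illustration only
path_double <- "C:\\Users\\student\\Downloads\\car.data.csv"  # Option 1
path_slash  <- "C:/Users/student/Downloads/car.data.csv"      # Option 2

# Converting the backslashes of Option 1 gives exactly the Option 2 string
identical(gsub("\\", "/", path_double, fixed = TRUE), path_slash)  # TRUE

# The file name at the end of the path
basename(path_slash)  # "car.data.csv"
```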
31 / 40

Working with Data from Files


32 / 40

Working with Data from Files

  • Step 0. Download the zip file 'car_data.zip' from the Files section in our Canvas, and unzip it.

  • Step 1. Find the path name for the file, car.data.csv.

  • Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, car.data.csv.

  • Step 3. Run the following R code:

# install.packages("readr")
library(readr)
uciCar <- read_csv(
'PATH_NAME_FOR_THE_FILE_car.data.csv')
View(uciCar)
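  • If reading fails, the path name is the usual suspect. A minimal sketch that checks the path before reading; read_if_exists() is a hypothetical helper, and base R's read.csv() is swapped in for read_csv() so the sketch runs without extra packages. A temporary file stands in for car.data.csv.

```r
# Hypothetical helper: read a CSV only if the path points to an existing file
read_if_exists <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, call. = FALSE)
  }
  utils::read.csv(path)
}

# Demonstration with a temporary file instead of car.data.csv
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, label = c("a", "b", "c")),
          tmp, row.names = FALSE)

demo <- read_if_exists(tmp)
dim(demo)  # 3 rows, 2 columns
```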
33 / 40

Working with Data from Files

Examining data frame

  • class() tells you what kind of R object you have.
  • dim() shows how many rows and columns are in the data for data.frame.
  • head() shows the top few rows of the data.
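  • help() opens the documentation for a topic, such as a class or a function.
    • Try help("data.frame"). Note that help(class(uciCar)) works only when class(uciCar) returns a single class name; a data frame read with read_csv() has several.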
  • str() gives us the structure for an object.
34 / 40

Working with Data from Files

Examining data frame

  • summary() provides a summary of almost any R object.
  • skimr::skim() provides a more detailed summary.
    • skimr is the package that provides the function skim().
  • print() prints all the data.
    • Note: for large datasets, this can take a very long time and is something you want to avoid.
  • View() displays the data in a simple spreadsheet-like grid viewer.
  • dplyr::glimpse() displays brief information about the data.
35 / 40

Working with Data from Files

Examining data frame

print(uciCar)
class(uciCar)
dim(uciCar)
head(uciCar)
help("data.frame") # help(class(uciCar)) fails when class() returns several names
str(uciCar)
summary(uciCar)
library(skimr)
skim(uciCar)
library(tidyverse)
glimpse(uciCar)
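  • The base R functions above can be tried without any file by using mtcars, a data frame that ships with R. This is only a stand-in for uciCar.

```r
class(mtcars)    # "data.frame"
dim(mtcars)      # 32 11 (rows, columns)
head(mtcars, 3)  # first three rows
str(mtcars)      # column types and a preview of the values
summary(mtcars)  # summary statistics for each column
```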
36 / 40

Working with Data from Files

Reading data from a URL

  • We can import the data file from the web.
# install.packages("readr")
library(readr)
tvshows <- read_csv(
  'https://bcdanl.github.io/data/tvshows.csv')
37 / 40

Working with Data from Files

Accessing Subsets

  • head() returns the first N rows of our data frame.
  • tail() returns the last N rows of our data frame.
head(tvshows, n = 3)
head(tvshows, 3)
tail(tvshows, 3)
  • As in Python, we can select rows and columns by index in R.
    • Indexing in R starts at 1, unlike Python, where it starts at 0.
tvshows[ 1:3, ]
tvshows[ c(1, 2, 3), ]
tvshows[ c(1, 2, 3), 1]
  • Return the “Network” column in the data set:
tvshows$Network
tvshows[, 2]
tvshows[, "Network"]
  • Return the columns named “Show” and “GRP”
tvshows[ , c("Show", "GRP")]
  • Return only the first 3 rows and columns 2 and 5 of the data set
tvshows[1:3, c(2,5)]
  • Return only the shows whose Genre is Reality.
tvshows[ tvshows$Genre == "Reality", ]
  • Another way to subset the shows is with the which() function.
    • It returns the indices at which a logical vector is TRUE.
reality <- which(tvshows$Genre == "Reality")
reality
tvshows[ reality, ]
  • What if we want all shows whose PE is greater than 80?
tvshows[tvshows$PE > 80, ]
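  • One practical difference between the two approaches appears when the column has missing values: which() drops NA, while a logical index keeps an all-NA row for it. A sketch with a small hypothetical data frame (these are not the tvshows values):

```r
# Hypothetical data, for illustration only
shows <- data.frame(
  Show  = c("A", "B", "C", "D"),
  Genre = c("Reality", "Drama", "Reality", NA),
  PE    = c(85, 78, 91, 60),
  stringsAsFactors = FALSE
)

is_reality <- shows$Genre == "Reality"
is_reality                  # TRUE FALSE TRUE NA

shows[which(is_reality), ]  # rows 1 and 3
shows[is_reality, ]         # rows 1 and 3, plus an all-NA row for the NA

# Conditions can be combined with & (and) or | (or)
shows[which(is_reality & shows$PE > 90), c("Show", "PE")]  # show "C" only
```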
38 / 40

Working with Data from Files

Class Exercises 2

  1. Return those shows whose Duration values are 30.

  2. Return those shows whose GRP values are greater than the mean value of GRP.

  3. Return the data.frame with only three variables (Show, PE, and GRP) for the rows whose PE values are greater than the mean value of PE.

39 / 40

Working with Data from Files

Data visualization

  • Let's try some data visualization using ggplot():
# install.packages("ggplot2")
library(ggplot2)
ggplot( data = tvshows ) +
  geom_point( aes( x = GRP, y = PE,
                   color = Genre ) )
ggplot( data = tvshows ) +
  geom_point( aes( x = GRP, y = PE,
                   color = Genre ) ) +
  geom_smooth( aes( x = GRP, y = PE,
                    color = Genre ),
               method = lm )
  • What is the relationship between audience size (GRP) and audience engagement (PE)?
40 / 40
