Lecture 2


DANL 200: Introduction to Data Analytics

Byeong-Hak Choe

September 1, 2022

1 / 37

Announcement

Changes in Office Hours

  • Office: South Hall 117B.
  • Office Hours:
    • Mondays 3:30 PM-5:30 PM
    • Wednesdays 1:30 PM-3:30 PM.
2 / 37

Installing the Tools


3 / 37

Installing the Tools

R programming

The R language is available as a free download from the R Project website at:

4 / 37

Installing the Tools

RStudio

5 / 37

Installing the Tools

RStudio Environment

  • Script Pane is where you write R commands in a script file that you can save.
    • An R script is simply a text file containing R commands.
    • RStudio will color-code different elements of your code to make it easier to read.
6 / 37

Installing the Tools

RStudio Environment

  • Console Pane allows you to interact directly with the R interpreter and type commands where R will immediately execute them.
7 / 37

Installing the Tools

RStudio Environment

  • Environment Pane is where you can see the values of variables, data frames, and other objects that are currently stored in memory.
8 / 37

Installing the Tools

RStudio Environment

  • Plots Pane contains any graphics that you generate from your R code.
9 / 37

Installing the Tools

RStudio Environment

# Answer "no" to:
# Do you want to install from sources the packages which need compilation?
update.packages(ask = FALSE, checkBuilt = TRUE)
pkgs <- c("tidyverse", "nycflights13", "gapminder", "skimr")
install.packages(pkgs,
dependencies = c("Depends", "Imports", "LinkingTo"))
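A quick way to check that the installation worked might look like the sketch below. It uses only base R; the variable name `installed` is introduced here for illustration.

```r
# Packages the course asks for (same vector as above)
pkgs <- c("tidyverse", "nycflights13", "gapminder", "skimr")

# TRUE for each package that is already installed, FALSE otherwise
installed <- pkgs %in% rownames(installed.packages())
names(installed) <- pkgs
print(installed)

# Names of the packages that still need to be installed (empty if none)
pkgs[!installed]
```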
10 / 37

Management of Files, Directories, and Scripts


11 / 37

Management of Files, Directories, and Scripts

Materials for the book, Practical Data Science with R

Windows

  • Step 1. Go to your Downloads folder (or the folder where the downloaded file is saved) using the File Explorer.
  • Step 2. Click the ZIP file once.
  • Step 3. Cut the file by using the shortcut (Ctrl+X).
  • Step 4. Go to your working folder for the course using the File Explorer.
  • Step 5. Paste the file to your working folder by using Ctrl+V.
  • Step 6. Right-click the ZIP file and click "Extract ..."

Mac

  • Step 1. Go to your Downloads folder (or the folder where the downloaded file is saved) using the Finder.
  • Step 2. Click the ZIP file (or the folder if the ZIP file is extracted) once.
  • Step 3. Copy the file (or the folder) by using the shortcut (command+C).
  • Step 4. Go to your working folder for the course using the Finder.
  • Step 5. Paste the file to your working folder by using command+option+V.
  • Step 6. Right-click the ZIP file and click "Extract ..."
12 / 37

Management of Files, Directories, and Scripts

Finding the path name of the file

Windows

  • Step 1. Go to your folder using the File Explorer.
  • Step 2. Right-click the file.
  • Step 3. Click "Copy as path".
  • Step 4. Paste the path name of the file to the R script (Ctrl+V).
  • Step 5.
    • Option 1. Replace each backslash (\) with a double backslash (\\) in the path name.
    • Option 2. Write the path name as a raw string (R 4.0.0 or later):
    • r"(PATH_FOR_YOUR_FILE)".

Windows (if "Copy as path" does not appear in the right-click menu)

  • Step 1. Go to your folder using the File Explorer.
  • Step 2. Hold down the "Shift" key.
  • Step 3. Right-click the file.
  • Step 4. Click "Copy as path".
  • Step 5. Paste the path name of the file to the R script (Ctrl+V).
  • Step 6.
    • Option 1. Replace each backslash (\) with a double backslash (\\) in the path name.
    • Option 2. Write the path name as a raw string (R 4.0.0 or later):
    • r"(PATH_FOR_YOUR_FILE)".

Mac

  • Step 1. Go to your folder using the Finder.
  • Step 2. Right-click the file in the folder.
  • Step 3. Hold down "option".
  • Step 4. Click "Copy 'PATH_FOR_YOUR_FILE' as Pathname" from the menu.
  • Step 5. Paste it to the R script (command+V).
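The two Windows options can be compared side by side. This is a minimal sketch; the path below is made up for illustration and does not refer to a real course folder.

```r
# A made-up Windows path, written two equivalent ways
path_escaped <- "C:\\Users\\student\\DANL200\\car.data.csv"  # doubled backslashes
path_raw     <- r"(C:\Users\student\DANL200\car.data.csv)"   # raw string, R 4.0.0 or later

# Both spellings produce exactly the same character string
identical(path_escaped, path_raw)

# Forward slashes also work on Windows and need no escaping
path_forward <- "C:/Users/student/DANL200/car.data.csv"

# file.exists() reports whether R can find a file at that path
# (FALSE unless such a file happens to exist on your computer)
file.exists(path_forward)
```

Running file.exists() on a pasted path is a cheap way to catch a typo before trying to read the file.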
13 / 37

Management of Files, Directories, and Scripts

Code and comment style

  • The two main principles for coding and managing data are:

    • Make things easier for your future self.
    • Don't trust your future self.
  • The # mark is R's comment character.

    • # indicates that the rest of the line is to be ignored.
    • Write comments before the line that you want the comment to apply to.
  • Consider using block commenting for separating code sections.

    • #### defines a coding block.
  • Break down long lines and long algebraic expressions.

14 / 37

Starting with R


15 / 37

Starting with R

RStudio Options Setting

  • Open the options menu as follows:
    • Mac: RStudio > Preferences
    • Windows: Tools > Global Options
16 / 37

Starting with R

  • Let's try a few commands to help you become familiar with R and its basic data types.
1
1/2
'Joe'
"Joe"
"Joe"=='Joe'
c()
is.null(c())
is.null(5)
  • In R, vectors are arrays of same-typed values.
    • They can be built with the c() notation.
c(1)
c(1, 2)
c("Apple", 'Orange')
length(c(1, 2))
vec <- c(1, 2)
vec
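Because every value in a vector must have the same type, c() converts mixed inputs to a common type. The sketch below illustrates this; the variable names are chosen here for illustration.

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed <- c(1, "Apple", TRUE)
mixed
class(mixed)   # "character"

# Mixing logical and numeric: TRUE becomes 1
nums <- c(1.5, TRUE)
nums
class(nums)    # "numeric"

# Vectors are indexed starting from 1
vec <- c(10, 20, 30)
vec[1]
length(vec)
```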
17 / 37

Starting with R

Code and comment style

  • R has many assignment operators (e.g., <-, =, -> ).
  • The preferred one is <-.
x <- 2
x < - 3  # the space makes this a comparison (is x less than -3?), not an assignment
print(x)
x <- 5
x = 5
5 -> x
18 / 37

Starting with R

Shortcuts

Mac

  • command + return runs the current line (where the blinking cursor is) or the selected lines.
  • command + shift + C is the shortcut for #.
  • option + - is the shortcut for <-.

Windows

  • Ctrl + Enter runs the current line (where the blinking cursor is) or the selected lines.
  • Ctrl + Shift + C is the shortcut for #.
  • Alt + - is the shortcut for <-.
19 / 37

Starting with R

R data types

  • Primary data types in R are as follows:
    • Logical: A simple binary variable that may have only two values---TRUE or FALSE.
    • Numeric: Decimal numbers
    • Integer: Integers
    • Character: Text strings
    • Factor: Categorical values. Each possible value of a factor is known as a level.
    • Ordered Factor: A special factor data type where the order of the levels is significant. E.g., Low, Medium, and High
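Each of these types can be inspected with class(). The following is a small sketch; the names `sizes`, `size_factor`, and `size_ordered` are made up for illustration.

```r
class(TRUE)      # "logical"
class(3.14)      # "numeric"
class(2L)        # "integer" (the L suffix marks an integer)
class("text")    # "character"

sizes <- c("Low", "High", "Medium", "Low")

# A plain factor: the levels are the distinct values
size_factor <- factor(sizes)
class(size_factor)
levels(size_factor)

# An ordered factor: the levels argument fixes the order Low < Medium < High
size_ordered <- factor(sizes, levels = c("Low", "Medium", "High"), ordered = TRUE)
class(size_ordered)
size_ordered[1] < size_ordered[2]   # is Low below High?
```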
20 / 37

Starting with R

R data types

  • Test the data types.
    x <- TRUE
    y <- 1
    z <- 'Data Analytics'
    productCategory <- c('fruit', 'vegetable', 'dry goods', 'fruit',
    'vegetable', 'dry goods')
    productCategoryFactor <- factor(productCategory)
  • The class() function returns the data type of an object.
    • What are classes for x, y, z, productCategory, and productCategoryFactor?
21 / 37

Starting with R

R data types

  • Most R data types are mutable, in that we're allowed to change them. Changing one variable does not change a copy that was made earlier.
a <- c(1, 2)
b <- a
print(b)
# Alters a
a[[1]] <- 5
print(a)
print(b)
22 / 37

Starting with R

Lists

  • Lists, unlike vectors, can store more than one type of object.
    • The ways to access items in lists are the $ operator and the [[]] operator.
x <- list('a' = 6, b = 'fred')
names(x)
x$a
x$b
x[['a']]
x[c('a', 'a', 'b', 'b')]
23 / 37

Starting with R

R data types

  • Here are examples of a vector and a list.
example_vector <- c(10, 20, 30)
example_list <- list(a = 10, b = 20, c = 30)
example_vector[1]
example_list[1]
example_vector[[2]]
example_list[[2]]
example_vector[c(FALSE, TRUE, TRUE)]
example_list[c(FALSE, TRUE, TRUE)]
example_list$b
example_list[["b"]]
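The difference between [ ] and [[ ]] is easiest to see by checking what each one returns. This is a minimal sketch reusing the objects above; `sub_list` and `item` are names introduced here.

```r
example_vector <- c(10, 20, 30)
example_list <- list(a = 10, b = 20, c = 30)

# [ ] on a list returns a shorter list
sub_list <- example_list[1]
class(sub_list)      # "list"

# [[ ]] (like $) returns the item itself
item <- example_list[[1]]
class(item)          # "numeric"

# Arithmetic works on the item, but example_list[1] + 1 would stop with an error
item + 1

# For a vector of numbers, both forms return a numeric value
class(example_vector[1])
class(example_vector[[1]])
```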
24 / 37

Starting with R

Errors

  • Errors are just R's way of saying it safely refused to complete an ill-formed operation.

  • Fear of errors should not limit experiments.

x <- 1:5
print(x)
x <- meanMISSPELLED(x)
print(x)
x <- mean(x)
print(x)
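One way to see that a failed command leaves existing objects untouched is to catch the error with tryCatch(). This is a sketch under the same setup as above; `result` is a name introduced here, and meanMISSPELLED is deliberately not a real function.

```r
x <- 1:5

# tryCatch() runs the expression and, if it fails, runs the error handler instead
result <- tryCatch(
  {
    x <- meanMISSPELLED(x)   # deliberately misspelled; no such function exists
    "succeeded"
  },
  error = function(e) {
    conditionMessage(e)      # return the error message as text
  }
)

print(result)  # the error message
print(x)       # still 1 2 3 4 5: the failed assignment changed nothing
```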
25 / 37

Starting with R

Data Frames

  • R’s central data structure is the data frame.
  • A data frame is organized into rows and columns.
  • Data frames are essentially lists of columns.
  • Data frames can have columns of different types.
d <- data.frame(x=c(1,2),
y=c('a','b'))
d[['x']]
d$x
d[[1]]
d
d[1,]
d[,1]
d[1,1]
d[1, 'x']
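The claim that a data frame is essentially a list of columns can be checked directly. The sketch below uses only base R functions on the same small data frame.

```r
d <- data.frame(x = c(1, 2), y = c("a", "b"))

is.list(d)        # TRUE: a data frame is built on a list
length(d)         # the number of columns
names(d)          # the column names
sapply(d, class)  # one type per column

# d[1, ] keeps a data frame (one row); d[, 1] drops to a plain vector
class(d[1, ])
class(d[, 1])
```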
26 / 37

Starting with R

Data Frames

  • The R data.frame class is designed to store data in a very good "ready for analysis" format.
d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
print(d)
d$col3 <- d$col1 + d$col2
print(d)
27 / 37

Starting with R

NULL and NA values

  • NULL is just an alias for c(), the empty vector.
  • NA indicates missing or unavailable data.
c(c(), 1, NULL)
c("a", NA, "c")
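The practical difference is that NULL disappears when combined, while NA keeps its place. A brief sketch follows; the `scores` vector is made up for illustration.

```r
# NULL has length 0 and vanishes when combined with other values
v <- c(c(), 1, NULL)
length(v)            # 1

# NA occupies a position in the vector
w <- c("a", NA, "c")
length(w)            # 3
is.na(w)             # FALSE TRUE FALSE

# Many functions return NA when an input is missing, unless told to drop NAs
scores <- c(90, NA, 70)
mean(scores)                 # NA
mean(scores, na.rm = TRUE)   # 80
```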
28 / 37

Starting with R

NULL and NA values

  • Most R data types are mutable, in that we're allowed to change them. Changing a data frame does not change a copy that was made earlier.
d <- data.frame(x = 1, y = 2)
d2 <- d
d$x <- 5
print(d)
print(d2)
30 / 37

Working with Data from Files


31 / 37

Working with Data from Files

  • Step 1. Find the path name for the file, car.data.csv, from the sub-folder, 'UCICar' in the folder, 'PDSwR2-main'.

  • Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, car.data.csv.

  • Step 3. Run the following R code:

uciCar <- read.table(
  'PATH_NAME_FOR_THE_FILE_car.data.csv',
  sep = ',',
  header = TRUE,
  stringsAsFactors = TRUE)
View(uciCar)
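To try the same call without the course files, one option is to write a tiny CSV to a temporary file first. This is only a sketch: the three columns and two rows below are made up and are not the contents of car.data.csv.

```r
# Write a tiny CSV to a temporary file (made-up data, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("buying,doors,rating",
             "vhigh,2,unacc",
             "low,4,good"),
           tmp)

# Same call pattern as above, pointed at the temporary file
toy <- read.table(
  tmp,
  sep = ",",
  header = TRUE,
  stringsAsFactors = TRUE)

dim(toy)            # 2 rows, 3 columns
class(toy$buying)   # "factor", because stringsAsFactors = TRUE
class(toy$doors)    # "integer"
```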
32 / 37

Working with Data from Files

Examining data frame

  • class() tells you what kind of R object you have.
  • dim() shows how many rows and columns a data frame has.
  • head() shows the top few rows of the data.
  • help() provides the documentation for a class.
    • Try help(class(uciCar)).
  • str() gives us the structure for an object.
33 / 37

Working with Data from Files

Examining data frame

  • summary() provides a summary of almost any R object.
  • skimr::skim() provides a more detailed summary.
  • print() prints all the data.
    • Note: for large datasets, this can take a very long time and is something you want to avoid.
  • View() displays the data in a simple spreadsheet-like grid viewer.
  • dplyr::glimpse() displays brief information about the data.
34 / 37

Working with Data from Files

Examining data frame

print(uciCar)
class(uciCar)
dim(uciCar)
head(uciCar)
help(class(uciCar))
str(uciCar)
summary(uciCar)
library(skimr)
skim(uciCar)
library(tidyverse)
glimpse(uciCar)
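The base R functions above can also be tried on a data frame that ships with R, so no file is needed. The sketch below uses the built-in iris data set in place of uciCar.

```r
# iris is a data frame included with R (datasets package)
class(iris)     # "data.frame"
dim(iris)       # 150 rows, 5 columns
head(iris, 3)   # the first three rows
str(iris)       # column names, types, and the first few values
summary(iris)   # per-column summary statistics

# The number of rows on its own
n_rows <- nrow(iris)
n_rows
```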
35 / 37

Working with Data from Files

Reading from a URL

  • We can import the data file from the web.
tvshows <- read.table(
  'https://bcdanl.github.io/data/tvshows.csv',
  sep = ',',
  header = TRUE,
  stringsAsFactors = TRUE)
36 / 37

Working with Data from Files

ggplot

  • Let's try some data visualization using ggplot():
ggplot(tvshows) +
geom_point(aes(x=GRP, y=PE, color=Genre))
ggplot(tvshows) +
geom_point(aes(x=GRP, y=PE)) +
facet_wrap(~Genre)
  • What is the nature of the relationship between audience size (GRP) and audience engagement (PE)?
37 / 37
