Lecture 24

class: title-slide, left, bottom

# Lecture 24
----
## **DANL 100: Programming for Data Analytics**
### Byeong-Hak Choe
### November 29, 2022

---
# Announcement
### <p style="color:#00449E"> Geneseo Alumni's Talk on Data Analytics Career

- Name: Lauren Kopac (Class of 2015)

- When: December 1, 2022, Thursday 11:00 A.M.

- Where: Zoom (I will leave the Zoom link on Canvas soon.)

- We will watch the recording of her Zoom talk next Tuesday.

- She will talk about careers in data analytics, drawing on her own experience.

- As a data analyst, she has worked at (1) Columbia University, New York, NY and (2) Neuberger Berman, New York, NY.

---
# Announcement
### <p style="color:#00449E"> Grading

- My apologies for the grading delay.

- I will finish grading by Thursday or Friday.

---
# Summary Statistics
### <p style="color:#00449E"> Percentage Grades in Choe's DANLs, Fall 2021 & Spring 2022

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Statistics </th>
   <th style="text-align:left;"> Values </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Average </td>
   <td style="text-align:left;width: 10em; "> 83.00 - 87.50 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Standard Deviation </td>
   <td style="text-align:left;width: 10em; "> 7.37 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Minimum </td>
   <td style="text-align:left;width: 10em; "> 63.12 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Maximum </td>
   <td style="text-align:left;width: 10em; "> 99.07 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Number of Students </td>
   <td style="text-align:left;width: 10em; "> 97 </td>
  </tr>
</tbody>
</table>

$$ \begin{cases}
100  \;\geq\; A \;\geq\; 93 \;>\; A- \;\geq\; 90 \\\\
90  \;>\; B+ \;\geq\; 87 \;>\; B \;\geq\; 83 > B- \;\geq\; 80\\\\
80  \;>\; C+ \;\geq\; 77 \;>\; C \;\geq\; 73 > C- \;\geq\; 70\\\\
70 \;>\; D \;\geq\; 60 \;>\; E
\end{cases} $$

---
# Tips for using Presentation Slides

- To go to a previous/next page, use keyboard arrows, <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M447.1 256C447.1 273.7 433.7 288 416 288H109.3l105.4 105.4c12.5 12.5 12.5 32.75 0 45.25C208.4 444.9 200.2 448 192 448s-16.38-3.125-22.62-9.375l-160-160c-12.5-12.5-12.5-32.75 0-45.25l160-160c12.5-12.5 32.75-12.5 45.25 0s12.5 32.75 0 45.25L109.3 224H416C433.7 224 447.1 238.3 447.1 256z"/></svg> and <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M438.6 278.6l-160 160C272.4 444.9 264.2 448 256 448s-16.38-3.125-22.62-9.375c-12.5-12.5-12.5-32.75 0-45.25L338.8 288H32C14.33 288 .0016 273.7 .0016 256S14.33 224 32 224h306.8l-105.4-105.4c-12.5-12.5-12.5-32.75 0-45.25s32.75-12.5 45.25 0l160 160C451.1 245.9 451.1 266.1 438.6 278.6z"/></svg>.

- To see a tile view of the lecture slides, use the alphabet key, `o`.

- If you hover the mouse cursor over a code block in the lecture slides, you can see and click *"Copy Code"* in the top-right corner of the code block.
  - If you click *"Copy Code"*, the code in the block is copied, so you can paste it into your script.

- If the presentation slides do not respond, refresh the web page of the slides with the shortcut **Ctrl** (or **command** for Mac users) **+ R**.

---
# Getting started with `pandas`
### <p style="color:#00449E"> Boolean Indexing

- Boolean indexing of DataFrames works like boolean indexing an `np.array`.
  - `DataFrame[ DataFrame['VARIABLE_NAME'] > VALUE  ]`
  - `DataFrame[ DataFrame['VARIABLE_NAME'] == VALUE  ]`
  - `DataFrame[ DataFrame['VARIABLE_NAME'] < VALUE  ]`

```python
import pandas as pd

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}

companies = pd.DataFrame(data)
companies_daimler = companies[ companies['company'] ==  "Daimler" ]
```
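
Conditions can also be combined. The sketch below reuses the `companies` data from this slide; the thresholds (80, 3000000, 10) are arbitrary values chosen only for illustration. Each condition needs its own parentheses, and `&` / `|` take the place of `and` / `or`.

```python
import pandas as pd

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
        "price": [69.2, 8.11, 110.92, 87.28, 87.81],
        "volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)

# A comparison returns a boolean Series (True/False for each row)
is_expensive = companies["price"] > 80

# Combine conditions with & (and) or | (or); wrap each one in parentheses
expensive_and_liquid = companies[(companies["price"] > 80) &
                                 (companies["volume"] > 3000000)]
cheap_or_bmw = companies[(companies["price"] < 10) |
                         (companies["company"] == "BMW")]
```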

---
# Getting started with `pandas`
### <p style="color:#00449E"> Append the two `DataFrame`s

- `pd.concat([DataFrame1, DataFrame2])` appends the rows of `DataFrame2` to the end of `DataFrame1`, returning a new DataFrame object.
  - Older code uses `DataFrame1.append(DataFrame2)` for this; that method was deprecated in pandas 1.4 and removed in pandas 2.0.

```python
df1 = pd.DataFrame({"a":[1, 2, 3, 4],
                    "b":[5, 6, 7, 8]})
df2 = pd.DataFrame({"a":[1, 2, 3],
                    "b":[5, 6, 7]})

# to append df2 at the end of df1 dataframe
pd.concat([df1, df2])
```
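
Appending keeps each DataFrame's original row labels, so labels can repeat in the result. A minimal sketch with the same two DataFrames, using `pd.concat()`; the names `stacked` and `stacked_renumbered` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})

# Stack df2 below df1; the original row labels are kept (0-3, then 0-2)
stacked = pd.concat([df1, df2])

# ignore_index=True relabels the rows 0, 1, ..., 6
stacked_renumbered = pd.concat([df1, df2], ignore_index=True)
```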

---
# Getting started with `pandas`
### <p style="color:#00449E"> Import data

- `pd.read_csv("PATH_NAME_OF_*.csv")` reads the csv file into a `DataFrame`.
  - `header=None` does not read the top row of the csv file as column names.
  - We can set column names with `names`, for example, `names=["a", "b", "c", "d", "e"]`.
  
- `DataFrame.head()` and `DataFrame.tail()` show the first and last five rows on the Console, respectively.

```python
nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv")
# `GRP`: audience size; `PE`: audience engagement.
nbc_show.head()   # showing the first five rows
nbc_show.tail()   # showing the last five rows
```
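
To see what `header=None` and `names` do without downloading a file, here is a small sketch that reads CSV text from memory. The CSV content and the column names are invented for illustration; `io.StringIO` only makes the text behave like a file.

```python
import io
import pandas as pd

# A tiny CSV with no header row (made-up values for illustration)
csv_text = "1,2,3\n4,5,6\n"

# header=None: the first row is data, not column names
# names=[...]: supply the column names ourselves
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b", "c"])
```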

---
# Getting started with `pandas`
### <p style="color:#00449E"> Export data
- `DataFrame.to_csv("filename")` writes the `DataFrame` to a csv file.
  - `index=False` and `header=False` leave the row index and the column names, respectively, out of the csv file.
  - We can set column names with `header`, for example, `header=["a", "b", "c", "d", "e"]`.

```python
nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE")
```
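
A quick way to check what these options do is to call `to_csv()` without a file name, in which case it returns the CSV text instead of writing a file. The small DataFrame below is invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no file name, to_csv() returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)
without_header = df.to_csv(index=False, header=False)
```

Printing `with_index` and `without_index` side by side shows the extra first column that the row index adds.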

---
# Getting started with `pandas`
### <p style="color:#00449E"> Summarizing `DataFrame`

- `DataFrame.count()` returns a Series containing the number of non-missing values for each column.
- `DataFrame.sum()` returns a Series containing the sum of values for each column.
- `DataFrame.mean()` returns a Series containing the mean of values for each column.
- Passing `axis="columns"` or `axis=1` summarizes across the columns of each row instead.
- Passing `numeric_only=True` restricts `sum()` and `mean()` to the numeric columns; pandas 2.0 and later raise an error when `mean()` meets a text column.

```python
nbc_count = nbc_show.count()
nbc_sum = nbc_show.sum(numeric_only=True)
nbc_sum_c = nbc_show.sum(axis="columns", numeric_only=True)
nbc_mean = nbc_show.mean(numeric_only=True)
```
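
The sketch below runs the same summaries on a tiny DataFrame so the results can be checked by hand. The `scores` data is made up; note how the missing value is skipped rather than counted as zero.

```python
import numpy as np
import pandas as pd

# Made-up numbers; np.nan marks a missing value
scores = pd.DataFrame({"quiz": [8.0, 6.0, np.nan],
                       "exam": [90.0, 70.0, 80.0]})

n_by_col = scores.count()                  # non-missing values per column
sum_by_col = scores.sum()                  # column totals (NaN is skipped)
mean_by_col = scores.mean()                # column means (NaN is skipped)
sum_by_row = scores.sum(axis="columns")    # one total per row
```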

---
# Getting started with `pandas`
### <p style="color:#00449E"> Grouping `DataFrame`

- `DataFrame.groupby([col1, col2])` groups `DataFrame` by the listed columns (grouping by one column, or by more than two, is also possible!).

- Adding `count()`, `sum()`, or `mean()` after `groupby()` returns the count, the sum, or the mean within each group.

```python
nbc_genre_count = nbc_show.groupby(["Genre"]).count()
# numeric_only=True keeps text columns out of the sum and the mean
nbc_genre_sum = nbc_show.groupby(["Genre"]).sum(numeric_only=True)
nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean(numeric_only=True)
```
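
Here is a small sketch of grouping on data that is easy to verify by hand. The `shows` DataFrame is hypothetical: its column names mirror `nbc_show`, but the values are invented.

```python
import pandas as pd

# Hypothetical shows; the column names mirror nbc_show, the values are invented
shows = pd.DataFrame({"Network": ["A", "A", "B", "B"],
                      "Genre": ["Drama", "Reality", "Drama", "Drama"],
                      "GRP": [100.0, 50.0, 300.0, 200.0]})

# One row per genre: mean GRP within each genre
genre_mean = shows.groupby("Genre")["GRP"].mean()

# One row per (Network, Genre) pair
network_genre_sum = shows.groupby(["Network", "Genre"])["GRP"].sum()

# reset_index() turns the group labels back into ordinary columns
genre_mean_df = genre_mean.reset_index()
```

Selecting the column (`["GRP"]`) before summarizing is one way to keep the result focused on a single variable.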

---
# Getting started with `pandas`
### <p style="color:#00449E"> Sorting `DataFrame`

- `DataFrame.sort_index()` sorts DataFrame by its index; by default, it sorts by the row index in ascending order.

- `DataFrame.sort_index(axis="columns")` sorts DataFrame by column index.

- `DataFrame.sort_index(ascending=False)` sorts DataFrame by index in descending order.

```python
nbc_show.sort_index()
nbc_show.sort_index(ascending = False)
nbc_show.sort_index(axis = "columns")
```

---
# Getting started with `pandas`
### <p style="color:#00449E"> Sorting `DataFrame`

- `DataFrame.sort_values("SOME_VARIABLE")` sorts DataFrame by values of SOME_VARIABLE.

- A Series holds a single column of values, so `Series.sort_values()` does not need `"SOME_VARIABLE"`.

- `DataFrame.sort_values("SOME_VARIABLE", ascending = False)` sorts DataFrame by values of SOME_VARIABLE in descending order.

```python
import numpy as np

nbc_show.sort_values("GRP")
nbc_show.sort_values("GRP", ascending = False)

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
```
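
Two details are worth trying out: where missing values end up, and how to sort by more than one variable. A minimal sketch; the `shows` rows are invented, and the list form of `ascending` gives one direction per sort variable.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_obj = obj.sort_values()   # missing values are placed at the end

# Invented rows: sort by Genre (A to Z), then by GRP (largest first)
shows = pd.DataFrame({"Genre": ["Drama", "Reality", "Drama"],
                      "GRP": [100.0, 50.0, 300.0]})
by_genre_then_grp = shows.sort_values(["Genre", "GRP"],
                                      ascending=[True, False])
```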

---
# Getting started with `pandas`
### <p style="color:#00449E"> Class Exercise

Use the `nbc_show_na.csv` file to answer the following questions:

1. Find the top show in terms of the value of `PE` for each Genre.

2. Find the top show in terms of the value of `GRP` for each Network.

3. Which genre has the largest `GRP` on average?

---
# Workflow
### <p style="color:#00449E"> Installing Python modules

- Spyder also allows us to install Python modules from within Spyder.

.panelset[

.panel[.panel-name[Windows]
- Step 1. Type the following command on Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

.panel[.panel-name[Mac]
- Step 1. Type the following command on Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

]

---
# Workflow
### <p style="color:#00449E"> Installing Python modules

- Let's install the Python visualization library `seaborn` on Anaconda Spyder.

.panelset[

.panel[.panel-name[Windows]
- Step 1. Open "Anaconda Prompt".
- Step 2. Type the following:

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

.panel[.panel-name[Mac]
- Step 1. Open "Terminal".
- Step 2. Type the following:

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

]

---
class: inverse, center, middle

# Data Visualization with `seaborn`

---
# Data Visualization

.pull-left[

]
.pull-right[
- Graphs and charts let us explore and learn about the structure of the information we have in DataFrame.

- Good data visualizations make it easier to communicate our ideas and findings to other people.

]

---
# Exploratory Data Analysis (EDA)

- We use visualization and summary statistics (e.g., mean, median, minimum, maximum) to explore our data in a systematic way.

- EDA is an iterative cycle. We:

  - Generate questions about our data.

  - Search for answers by visualizing, transforming, and modelling our data.

  - Use what we learn to refine our questions and/or generate new questions.

---
#  `seaborn`

- `seaborn` is a Python data visualization library based on `matplotlib`. 
  - It allows us to easily create beautiful but complex graphics using a simple interface.
  - It also provides a general improvement in the default appearance of `matplotlib`-produced plots, and so I recommend using it by default.

```python
import seaborn as sns
```

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Types of plots

- We will consider the following types of visualization:

- Bar chart

- Histogram

- Scatter plot
  
  - Scatter plot with Fitted line

- Line chart

---
# Getting started with `pandas`
### <p style="color:#00449E"> What is *tidy* `DataFrame`? </p>

- There are three rules which make a dataset tidy:

1. Each **variable** has its own *column*.
2. Each **observation** has its own *row*.
3. Each **value** has its own *cell*.
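
As an illustration of the three rules, the sketch below reshapes a small table that is not tidy. The `wide` DataFrame and its numbers are made up; `DataFrame.melt()` is one pandas tool for this kind of reshaping.

```python
import pandas as pd

# Not tidy: the year is a variable, but it is spread across column names
# (the numbers are made up)
wide = pd.DataFrame({"country": ["X", "Y"],
                     "2020": [10, 20],
                     "2021": [11, 21]})

# melt() gathers the year columns into one 'year' column and one 'cases' column
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
```

After reshaping, each row is one observation (a country in a year) and each column is one variable.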

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Getting started with `seaborn`

- Let's get the names of `DataFrame`s provided by the `seaborn` library:

```python
import seaborn as sns
print( sns.get_dataset_names() )
```

- Let's use the `titanic` and `tips` DataFrames:

```python
df_titanic = sns.load_dataset('titanic')
df_titanic.head()
df_tips = sns.load_dataset('tips')
df_tips.head()
```

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Bar Chart

- A bar chart is used to plot the frequency of the different categories.
  - It is useful to visualize how values of a **categorical variable** are distributed.
  - A variable is **categorical** if it can only take one of a small set of values.
  
  
- We use the `sns.countplot()` function to plot a bar chart:

.pull-left[

```python
sns.countplot(x =  'sex', 
              data = df_titanic)
```
]

.pull-right[

- Mapping
  - `data`: DataFrame.
  - `x`:  Name of a categorical variable (column) in DataFrame

]

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Bar Chart

- We can further break up the bars in the bar chart based on another categorical variable.

- This is useful to visualize the relationship between the two categorical variables.

.pull-left[

```python
sns.countplot(x = 'sex', 
              hue = 'survived', 
              data = df_titanic)
```
]

.pull-right[

- Mapping
  - `hue`:  Name of a categorical variable

]
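
The bar heights in these charts are simply counts, which can be computed directly. A small sketch on invented data: `passengers` borrows the column names of `df_titanic`, but the rows are made up.

```python
import pandas as pd

# A few invented passengers with the same column names as df_titanic
passengers = pd.DataFrame({"sex": ["male", "female", "female", "male", "male"],
                           "survived": [0, 1, 1, 0, 1]})

# Bar heights of countplot(x='sex'): number of rows in each category
counts = passengers["sex"].value_counts()

# Bar heights of countplot(x='sex', hue='survived'): counts for each pair
counts_by_survival = pd.crosstab(passengers["sex"], passengers["survived"])
```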

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Histogram

- A histogram is a **continuous** version of a bar chart.
  - It is used to plot the frequency of the different values.
  - It is useful to visualize how values of a **continuous variable** are distributed.
  - A variable is **continuous** if it can take any of an infinite set of ordered values.
  
  
- We use the `sns.displot()` function to plot a histogram:
.pull-left[

```python
sns.displot(x =  'age', 
            bins = 5 ,
            data = df_titanic)
```
]

.pull-right[
- Mapping
  - `bins`:  Number of bins

]
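
To see what the number of bins means, the sketch below counts values per bin with `np.histogram()`. It assumes equal-width bins over the range of the data, and the `ages` values are invented.

```python
import numpy as np

# Invented ages
ages = np.array([2, 15, 22, 25, 31, 38, 40, 59, 80])

# bins=4 splits the range [2, 80] into 4 equal-width intervals
counts, edges = np.histogram(ages, bins=4)
```

Here `edges` holds the five bin boundaries and `counts` holds the number of ages that fall in each of the four bins, which is what the height of each histogram bar represents.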

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Scatter plot

- A scatter plot is used to display the relationship between two continuous variables.

-  We can see co-variation as a pattern in the scattered points.

- We use the `sns.scatterplot()` function to plot a scatter plot:

.pull-left[

```python
sns.scatterplot(x = 'total_bill', 
                y = 'tip',
                data = df_tips)
```
]

.pull-right[
- Mapping
  - `x`:  Name of a continuous variable on the horizontal axis
  - `y`:  Name of a continuous variable on the vertical axis
]

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Scatter plot

- To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.

- Suppose we are interested in the following question:
  - **Q**. Do smokers and non-smokers differ in tipping behavior?

```python
sns.scatterplot(x = 'total_bill', 
                y = 'tip',
                hue = 'smoker',
                data = df_tips)
```

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Fitted line

- From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.

- `sns.lmplot()` draws the scatter plot together with a straight line fitted to the scattered points.
  
  - On average, the fitted line describes the relationship between two continuous variables.

```python
sns.lmplot(x = 'total_bill', 
           y = 'tip',
           data = df_tips)
```
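
The fitted line has a slope and an intercept, which we can also compute ourselves. A minimal sketch, assuming an ordinary least-squares straight line and using invented bills and tips that lie exactly on one line, so the answer is known in advance:

```python
import numpy as np

# Invented bills and tips that lie exactly on tip = 1 + 0.1 * total_bill
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = np.array([2.0, 3.0, 4.0, 5.0])

# Degree-1 polynomial fit = least-squares straight line
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Predicted tip for a $25 bill
predicted_tip = intercept + slope * 25.0
```

The slope says how much the tip changes, on average, when the bill goes up by one dollar.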

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Fitted line

- To the fitted-line plot, we can also add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.

- Using the fitted lines, let's answer the following question:
  - **Q**. Do smokers and non-smokers differ in tipping behavior?

```python
sns.lmplot(x = 'total_bill', 
           y = 'tip',
           hue = 'smoker',
           data = df_tips)
```

---
# Data Visualization with `seaborn`
### <p style="color:#00449E"> Line chart

- A line chart is used to display the trend in a continuous variable or the change in a continuous variable over another variable.
  - It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.
- We use the `sns.lineplot()` function to plot a line chart:
.pull-left[

```python
path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv'
dow = pd.read_csv(path_csv, index_col=0, parse_dates=True)
sns.lineplot(x =  'Date', 
             y =  'Close', 
             data = dow)
```
]

.pull-right[
- Mapping
  - `x`:  Name of a continuous variable (often time variable) on the horizontal axis 
  - `y`:  Name of a continuous variable on the vertical axis
]

---
class: inverse, center, middle

# Starting with R and RStudio
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Installing the Tools
### <p style="color:#00449E"> R programming </p>

The R language is available as a free download from the R Project website at:

- Windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/)
- Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/)
  -  Download the file of R that corresponds to your Mac OS (Big Sur, Apple silicon arm64, High Sierra, El Capitan, Mavericks, etc.)

---
# Installing the Tools
### <p style="color:#00449E"> RStudio </p>

- **RStudio** offers a graphical interface to assist in creating R code:

- The RStudio Desktop is available as a free download from the following webpage:
    - [https://www.rstudio.com/products/rstudio/download/#download](https://www.rstudio.com/products/rstudio/download/#download)

---
# Installing the Tools
### <p style="color:#00449E"> RStudio Environment </p>
.pull-left[
<img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Script Pane** is where you write R commands in a script file that you can save. 
  - An R script is simply a text file containing R commands. 
  - RStudio will color-code different elements of your code to make it easier to read.

]
---
# Installing the Tools
### <p style="color:#00449E"> RStudio Environment </p>
.pull-left[
<img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Console Pane** allows you to interact directly with the R interpreter and type commands where R will immediately execute them.

]
---
# Installing the Tools
### <p style="color:#00449E"> RStudio Environment </p>
.pull-left[
<img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Environment Pane** is where you can see the values of variables, data frames, and other objects that are currently stored in memory.

]
---
# Installing the Tools
### <p style="color:#00449E"> RStudio Environment </p>
.pull-left[
<img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Plots Pane** contains any graphics that you generate from your R code.

]

---
# Installing the Tools
### <p style="color:#00449E"> R Packages </p>

```r
pkgs <- c("ggplot2", "readr", "dplyr")
install.packages(pkgs)
```

- While running the code above, I recommend answering "no" to the following question:

.pull-left[

**Mac**: *"Do you want to install from sources the packages which need compilation?"* from Console Pane.
]

.pull-right[

**Windows**: *"Would you like to use a personal library instead?"* from Pop-up message.
]

---
# Installing the Tools
### <p style="color:#00449E"> R Packages </p>

- Check whether `ggplot2` is installed well:

```r
library(ggplot2)   # loading the package ggplot2
mpg  # data.frame provided by the package ggplot2
     # ggplot2 is included in tidyverse
```

- Let me know if you get an error from the code above.

---
class: inverse, center, middle

# Workflow
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Workflow
### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p>

.pull-left[
**Mac**

- **command + shift + N** opens a new RScript.
- **command + return** runs the current line or the selected lines.
- **command + shift + C** is the shortcut for # (commenting).
- **option + -** is the shortcut for `<-`.
]

.pull-right[
**Windows**

- **Ctrl + Shift + N** opens a new RScript.
- **Ctrl + Enter** runs the current line or the selected lines.
- **Ctrl + Shift + C** is the shortcut for # (commenting).
- **Alt + -** is the shortcut for `<-`.
]

---
# Workflow

- **Home/End** moves the blinking cursor bar to the beginning/end of the line.
  - **Ctrl** (**command** for Mac Users) **+** <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M447.1 256C447.1 273.7 433.7 288 416 288H109.3l105.4 105.4c12.5 12.5 12.5 32.75 0 45.25C208.4 444.9 200.2 448 192 448s-16.38-3.125-22.62-9.375l-160-160c-12.5-12.5-12.5-32.75 0-45.25l160-160c12.5-12.5 32.75-12.5 45.25 0s12.5 32.75 0 45.25L109.3 224H416C433.7 224 447.1 238.3 447.1 256z"/></svg> / <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M438.6 278.6l-160 160C272.4 444.9 264.2 448 256 448s-16.38-3.125-22.62-9.375c-12.5-12.5-12.5-32.75 0-45.25L338.8 288H32C14.33 288 .0016 273.7 .0016 256S14.33 224 32 224h306.8l-105.4-105.4c-12.5-12.5-12.5-32.75 0-45.25s32.75-12.5 45.25 0l160 160C451.1 245.9 451.1 266.1 438.6 278.6z"/></svg> works too.

- **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action.
- **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes the action that was just undone.

- **Ctrl** (**command** for Mac Users) **+ F** finds a phrase (and can replace it) in the RScript.

- Auto-completion of commands is useful.
  - Type `libr` in the RScript in RStudio and wait for a second.
  
.pull-left[

```r
libr
```
]
.pull-right[
<img src="../lec_figs/auto-completionRStudio.png" width="100%" style="display: block; margin: auto;" />

]

---
# Workflow

- To install R package `PACKAGE`, use `install.packages("PACKAGE")`.

```r
install.packages("ggplot2")  # installing package "ggplot2"
```

- When the code is running, RStudio shows the STOP icon (<svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:red;overflow:visible;position:relative;"><path d="M384 128v255.1c0 35.35-28.65 64-64 64H64c-35.35 0-64-28.65-64-64V128c0-35.35 28.65-64 64-64H320C355.3 64 384 92.65 384 128z"/></svg>) at the top right corner in the Console Pane.
  - Do not click it unless you want to stop running the code.

---
# Workflow
### <p style="color:#00449E"> Quotation marks, parentheses, and `+` </p>

- Quotation marks and parentheses must always come in a pair.
  - If not, Console Pane will show you the continuation character `+`:

```r
> x <- "hello
```

- The `+` tells you that R is waiting for more input; it doesn’t think you’re done yet.

---
# Workflow
### <p style="color:#00449E"> RStudio Options Setting </p>
.pull-left[
<img src="../lec_figs/RStudio_options.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- This options menu is found through the following menus:
  - *Mac*: RStudio > Preferences 
  - *Windows*: Tools > Global Options
  
- Check <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M438.6 105.4C451.1 117.9 451.1 138.1 438.6 150.6L182.6 406.6C170.1 419.1 149.9 419.1 137.4 406.6L9.372 278.6C-3.124 266.1-3.124 245.9 9.372 233.4C21.87 220.9 42.13 220.9 54.63 233.4L159.1 338.7L393.4 105.4C405.9 92.88 426.1 92.88 438.6 105.4H438.6z"/></svg> as in the picture.
- Choose "Never" on "Save workspace to .RData on exit:".
]

---
class: inverse, center, middle

# Starting with R
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Starting with R
- Let's try a few commands to help you become familiar with R and its basic data types.

- In R, **vectors** are arrays of same-typed values.
  - They can be built with the `c()` notation.
  
.pull-left[

```r
1
1/2
'Joe'
"Joe"
"Joe"=='Joe'
c()
is.null(c())
is.null(5)
```
]
.pull-right[

```r
c(1)
c(1, 2)
c("Apple", 'Orange')
length(c(1, 2))
vec <- c(1, 2)
vec
```
]

---
# Starting with R
### <p style="color:#00449E"> Assignment </p>
- R has many assignment operators (e.g., `<-`, `=`, `->` ).
- The preferred one is `<-`.

```r
x <- 2
x < - 3    # a comparison (is x less than -3?), not an assignment
print(x)   # x is still 2

x <- 5
x = 5
5 -> x
```

---
# Starting with R
### <p style="color:#00449E"> R data types </p>

- Primary data types in R are as follows:
  - **Logical**: A simple binary variable that may have only two values---TRUE or FALSE.
  - **Numeric**: Decimal numbers
  - **Integer**: Integers
  - **Character**: Text strings
  - **Factor**: Categorical values. Each possible value of a factor is known as a *level*.
  - **Ordered Factor**: A special factor data type where the order of the levels is significant. E.g., Low, Medium, and High

---
# Starting with R
### <p style="color:#00449E"> R data types </p>

- Test the data types.

```r
x <- TRUE
y <- 1
z <- 'Data Analytics'
productCategory <- c('fruit', 'vegetable', 'dry goods', 'fruit',
                     'vegetable', 'dry goods')
productCategoryFactor <- factor(productCategory)
```
- The `class()` function returns the data type of an object.
  - What are classes for `x`, `y`, `z`, `productCategory`, and `productCategoryFactor`?

---
# Starting with R
### <p style="color:#00449E"> R data types </p>

- Most R data types are *mutable*, in that we're allowed to change them.
  - Assignment copies the value, so altering `a` below leaves `b` unchanged.

```r
a <- c(1, 2)
b <- a

print(b)

# Alters a
a[[1]] <- 5

print(a)
print(b)
```

---
# Starting with R
### <p style="color:#00449E"> Lists </p>

- **Lists**, unlike *vectors*, can store more than one type of object.
  - The ways to access items in lists are the `$` operator and the `[[]]` operator.

```r
x <- list('a' = 6, b = 'fred')
names(x)

x$a
x$b

x[['a']]

x[c('a', 'a', 'b', 'b')]
```

---
# Starting with R
### <p style="color:#00449E"> R data types </p>

- Here are examples of a vector and a list.

```r
example_vector <- c(10, 20, 30)
example_list <- list(a = 10, b = 20, c = 30)

example_vector[1]
example_list[1]

example_vector[[2]]
example_list[[2]]

example_vector[c(FALSE, TRUE, TRUE)]
example_list[c(FALSE, TRUE, TRUE)]

example_list$b
example_list[["b"]]
```

---
# Starting with R
### <p style="color:#00449E"> Errors </p>

- Errors are just R's way of saying it safely refused to complete an ill-formed operation

- Fear of errors should not limit experiments.

```r
x <- 1:5
print(x)

x <- meanMISSPELLED(x)  
print(x)

x <- mean(x)                      
print(x)
```

---
# Starting with R
### <p style="color:#00449E"> Data Frames </p>

- R’s central data structure is the data frame. 
- A data frame is organized into rows and columns. 
- Data frames are essentially lists of columns.
- Data frames can have columns of different types.

.pull-left[

```r
d <- data.frame(x=c(1,2),
              y=c('a','b'))

d[['x']]
d$x
d[[1]]
```
]

.pull-right[

```r
d
d[1,]
d[,1]

d[1,1]
d[1, 'x']
```

]

---
# Starting with R
### <p style="color:#00449E"> Data Frames </p>

- The R **data.frame** class is designed to store data in a very good "ready for analysis" format.

```r
d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
print(d)
d$col3 <- d$col1 + d$col2
print(d)
```

---
# Starting with R
### <p style="color:#00449E"> NULL and NA values </p>

- `NULL` is just an alias for `c()`, the empty vector.
- `NA` indicates missing or unavailable data.

```r
c(c(), 1, NULL)
c("a", NA, "c")
```

---
# Starting with R
### <p style="color:#00449E"> NULL and NA values </p>

- `NULL` is just an alias for `c()`, the empty vector.
- `NA` indicates missing or unavailable data.

```r
c(c(), 1, NULL)
c("a", NA, "c")
```

---
# Starting with R
### <p style="color:#00449E"> Data Frames </p>

- Data frames are *mutable* too, and assignment copies the value: changing `d` below leaves `d2` unchanged.

```r
d <- data.frame(x = 1, y = 2)
d2 <- d
d$x <- 5

print(d)
print(d2)
```

---
class: inverse, center, middle

# Management of Files, Directories, and Scripts
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Code and comment style </p>
- The two main principles for coding and managing data are:
  - Make things easier for your future self.
  - Don't trust your future self.

- So we write comments in our code.

---
# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Code and comment style </p>

- The `#` mark is R's comment character.
  - `#` indicates that the rest of the line is to be ignored.
  - Write comments before the line that you want the comment to apply to.

- Consider using block commenting for separating code sections.
  - `#####` defines a coding block.

- Break down long lines and long algebraic expressions.

---
# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Materials for the book, Practical Data Science with R </p>

- Click the green "Code" button and download the ZIP file from the following GitHub page: [https://github.com/WinVector/PDSwR2](https://github.com/WinVector/PDSwR2).

.panelset[

.panel[.panel-name[Windows]

- **Step 1.** Go to your Download folder (or the folder where the downloaded file is saved) using the File Explorer.
- **Step 2.** Click the ZIP file one time.
- **Step 3.** Cut the file by using the shortcut (**Ctrl+X**).  
- **Step 4.** Go to your working folder for the course using the File Explorer.
- **Step 5.** Paste the file to your working folder by using **Ctrl+V**.
- **Step 6.** Right-click the ZIP file and click "Extract ..."
]

.panel[.panel-name[Mac]

- **Step 1.** Go to your Download folder (or the folder where the downloaded file is saved) using the Finder.
- **Step 2.** Click the ZIP file (or the folder if the ZIP file is extracted) one time.
- **Step 3.** Copy the file (or the folder) by using the shortcut (**command+C**).  
- **Step 4.** Go to your working folder for the course using the Finder.
- **Step 5.** Paste the file to your working folder by using **command+option+V**.
- **Step 6.** Right-click the ZIP file and click "Extract ..."
]

]

---
# Management of Files and Directories
### <p style="color:#00449E"> Finding the path name of the file </p>

.panelset[

.panel[.panel-name[Windows 11]

- **Step 1.** Go to your folder using the File Explorer.
- **Step 2.** Right-click the file.
- **Step 3.** Click "Copy as path".
- **Step 4.** Paste the path name of the file to the R script (Ctrl+V).
- **Step 5.** 
  - *Option 1.* Replace backslash(`\`) with double-backslash(`\\`) in the path name.
  - *Option 2.* Replace backslash(`\`) with slash(`/`) in the path name.
]

.panel[.panel-name[Windows 10]

- **Step 1.** Go to your folder using the File Explorer.
- **Step 2.** Keep pressing the "Shift" key
- **Step 3.** Right-click the file.
- **Step 4.** Click "Copy as path".
- **Step 5.** Paste the path name of the file to the R script (Ctrl+V).
- **Step 6.** 
  - *Option 1.* Replace backslash(`\`) with double-backslash(`\\`) in the path name.
  - *Option 2.* Replace backslash(`\`) with slash(`/`) in the path name.
]

.panel[.panel-name[Mac]

- **Step 1.** Go to your folder using the Finder.
- **Step 2.** Right-click the file in the folder
- **Step 3.** Keep pressing "option"
- **Step 4.** Click "Copy 'PATH\_FOR\_YOUR\_FILE' as Pathname" from the menu.
- **Step 5.** Paste it to the R script (command+V).

]

]

---
class: inverse, center, middle

# Working with Data from Files
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Working with Data from Files

- Step 1. Find the path name for the file, `car.data.csv`, from the sub-folder, 'UCICar' in the folder, 'PDSwR2-main'.

- Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, `car.data.csv`.

- Step 3. Run the following R code:

```r
# install.packages("readr")
library(readr)

uciCar <- read_csv(
		'PATH_NAME_FOR_THE_FILE_car.data.csv')
View(uciCar)
```

---
# Working with Data from Files
### <p style="color:#00449E"> Examining data frame </p>

- `class()` tells you what kind of R object you have. 
- `dim()` shows how many rows and columns are in the data for `data.frame`.
- `head()` shows the top few rows of the data.
- `help()` provides the documentation for a class. 
  - Try `help(class(uciCar))`.
- `str()` gives us the structure for an object.

---
# Working with Data from Files
### <p style="color:#00449E"> Examining data frame </p>

- `summary()` provides a summary of almost any R object. 
- `skimr::skim()` provides a more detailed summary.
  - `skimr` is the package that provides the function `skim()`.
- `print()` prints all the data. 
  - Note: for large datasets, this can take a very long time and is something you want to avoid.
- `View()` displays the data in a simple spreadsheet-like grid viewer.
- `dplyr::glimpse()` displays brief information about the data.

---
# Working with Data from Files
### <p style="color:#00449E"> Examining data frame </p>

```r
print(uciCar)
class(uciCar)
dim(uciCar)
head(uciCar)
help(class(uciCar))
str(uciCar)
summary(uciCar)

# install.packages("skimr")
library(skimr)
skim(uciCar)
library(dplyr)     # glimpse() comes from dplyr
glimpse(uciCar)
```

---
# Working with Data from Files
### <p style="color:#00449E"> Reading data from a URL </p>

- We can import the data file from the web.

```r
# install.packages("readr")
# library(readr)

tvshows <- read_csv(
		'https://bcdanl.github.io/data/tvshows.csv')
```

---
# Working with Data from Files
### <p style="color:#00449E"> Data visualization </p>

- Let's try some data visualization using `ggplot()`:

```r
# install.packages("ggplot2")
library(ggplot2)

ggplot(tvshows) + 
  geom_point(aes(x=GRP, y=PE, color=Genre))
  
ggplot(tvshows) + 
  geom_point(aes(x=GRP, y=PE)) + 
  facet_wrap(~Genre)
```
-  What is the relationship between audience size (`GRP`) and audience engagement (`PE`)?