class: title-slide, left, bottom

# Lecture 24
----
## **DANL 100: Programming for Data Analytics**
### Byeong-Hak Choe
### November 29, 2022

---

# Announcement
### <p style="color:#00449E"> Geneseo Alumni's Talk on Data Analytics Career

- Name: Lauren Kopac (Class of 2015)
- When: December 1, 2022, Thursday 11:00 A.M.
- Where: Zoom (I will leave the Zoom link on Canvas soon.)
- We will watch her Zoom recording next Tuesday.
- She will talk about careers in data analytics, drawing on her own experience.
- As a data analyst, she has worked at (1) Columbia University, New York, NY and (2) Neuberger Berman, New York, NY.

---

# Announcement
### <p style="color:#00449E"> Grading

- My apologies for the grading delay.
- I will finish grading by Thursday or Friday.

---

# Summary Statistics
### <p style="color:#00449E"> Percentage Grades in Choe's DANLs, Fall 2021 & Spring 2022

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;"> Statistics </th>
   <th style="text-align:left;"> Values </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Average </td>
   <td style="text-align:left;width: 10em; "> 83.00 - 87.50 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Standard Deviation </td>
   <td style="text-align:left;width: 10em; "> 7.37 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Minimum </td>
   <td style="text-align:left;width: 10em; "> 63.12 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Maximum </td>
   <td style="text-align:left;width: 10em; "> 99.07 </td>
  </tr>
  <tr>
   <td style="text-align:left;font-weight: bold;border-right:1px solid;"> Number of Students </td>
   <td style="text-align:left;width: 10em; "> 97 </td>
  </tr>
</tbody>
</table>

$$ \begin{cases} 100 \;\geq\;
A \;\geq\; 93 \;>\; A- \;\geq\; 90 \\\\ 90 \;>\; B+ \;\geq\; 87 \;>\; B \;\geq\; 83 > B- \;\geq\; 80\\\\ 80 \;>\; C+ \;\geq\; 77 \;>\; C \;\geq\; 73 > C- \;\geq\; 70\\\\ 70 \;>\; D \;\geq\; 60 \;>\; E \end{cases} $$ --- # Tips for using Presentation Slides <!-- ### <p style="color:#00449E"></p> --> - To go to a previous/next page, use keyboard arrows,
**←** and **→**.
- To see a tile view of the lecture slides, use the alphabet key, `o`.
- If you hover the mouse cursor over a code block in the lecture slides, you can see and click *"Copy Code"* in the top-right corner of the code block.
- Clicking *"Copy Code"* copies the code in the block so that you can paste it into your script.
- If the presentation slides do not respond, refresh the web page of the slides with the shortcut **Ctrl** (or **command** for Mac users) **+ R**.

---

# Getting started with `pandas`
### <p style="color:#00449E"> Boolean Indexing

- Boolean indexing of DataFrames works like boolean indexing of an `np.array`.
- `DataFrame[ DataFrame['VARIABLE_NAME'] > VALUE ]`
- `DataFrame[ DataFrame['VARIABLE_NAME'] == VALUE ]`
- `DataFrame[ DataFrame['VARIABLE_NAME'] < VALUE ]`

```python
import pandas as pd

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
        "price": [69.2, 8.11, 110.92, 87.28, 87.81],
        "volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)

# keep only the rows where the company is "Daimler"
companies_daimler = companies[ companies['company'] == "Daimler" ]
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Append the two `DataFrame`s

- `DataFrame.append(DATAFRAME)` appends the rows of `DATAFRAME` to the end of the given `DataFrame` and returns a new DataFrame object.
- `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat([df1, df2])` gives the same result.

```python
df1 = pd.DataFrame({"a":[1, 2, 3, 4], "b":[5, 6, 7, 8]})
df2 = pd.DataFrame({"a":[1, 2, 3], "b":[5, 6, 7]})

# to append df2 at the end of df1 dataframe
df1.append(df2)          # pandas < 2.0 only
pd.concat([df1, df2])    # same result; also works in pandas 2.0 and later
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Import data

`pd.read_csv("PATH_NAME_OF_*.csv")` reads a csv file into a `DataFrame`.

- With `header=None`, the top row of the csv file is not read as column names.
- We can set column names with `names`, for example, `names=["a", "b", "c", "d", "e"]`.
- `DataFrame.head()` and `DataFrame.tail()` print the first and last five rows on the Console, respectively.
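
As a minimal sketch of the `header` and `names` options described above, the snippet below reads a small csv that has no header row. The csv text and the column names `a`, `b`, `c` are made up for illustration, and `io.StringIO` stands in for a file path so that the example runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical csv content with no header row (made up for illustration).
csv_text = "1,2,3\n4,5,6\n7,8,9\n"

# header=None: do not treat the first row as column names.
# names=[...]: supply the column names ourselves.
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b", "c"])

print(df.head())   # the first rows (all three rows here)
print(df.shape)    # (number of rows, number of columns)
```

With a real file, the path name of the csv file would replace `io.StringIO(csv_text)`.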
```python
nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv")
# `GRP`: audience size; `PE`: audience engagement.
nbc_show.head()  # showing the first five rows
nbc_show.tail()  # showing the last five rows
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Export data

`DataFrame.to_csv("filename")` writes the `DataFrame` to a csv file.

- `index=False` and `header=False` omit the row index and the column names, respectively, from the csv file.
- We can set column names with `header`, for example, `header=["a", "b", "c", "d", "e"]`.

```python
nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE")
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Summarizing `DataFrame`

- `DataFrame.count()` returns a Series containing the number of non-missing values for each column.
- `DataFrame.sum()` returns a Series containing the sum of values for each column.
- `DataFrame.mean()` returns a Series containing the mean of values for each column.
- Passing `axis="columns"` or `axis=1` sums across the columns instead.
- In pandas 2.0 and later, pass `numeric_only=True` so that non-numeric columns such as `Genre` are skipped.

```python
nbc_count = nbc_show.count()
nbc_sum = nbc_show.sum(numeric_only=True)
nbc_sum_c = nbc_show.sum(axis="columns", numeric_only=True)
nbc_mean = nbc_show.mean(numeric_only=True)
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Grouping `DataFrame`

- `DataFrame.groupby([col1, col2])` groups `DataFrame` by columns (grouping by one column or by more than two columns is also possible!).
- Adding `count()`, `sum()`, or `mean()` after `groupby()` returns the count, the sum, or the mean within each group.

```python
nbc_genre_count = nbc_show.groupby(["Genre"]).count()
nbc_genre_sum = nbc_show.groupby(["Genre"]).sum(numeric_only=True)
nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean(numeric_only=True)
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Sorting `DataFrame`

- `DataFrame.sort_index()` sorts DataFrame by index on either axis.
- `DataFrame.sort_index(axis="columns")` sorts DataFrame by column index.
- `DataFrame.sort_index(ascending=False)` sorts DataFrame by index in descending order.

```python
nbc_show.sort_index()
nbc_show.sort_index(ascending = False)
nbc_show.sort_index(axis = "columns")
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Sorting `DataFrame`

- `DataFrame.sort_values("SOME_VARIABLE")` sorts DataFrame by values of SOME_VARIABLE.
- For `Series.sort_values()`, we do not need to provide `"SOME_VARIABLE"` in the `sort_values()` function.
- `DataFrame.sort_values("SOME_VARIABLE", ascending = False)` sorts DataFrame by values of SOME_VARIABLE in descending order.

```python
import numpy as np

nbc_show.sort_values("GRP")
nbc_show.sort_values("GRP", ascending = False)

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
```

---

# Getting started with `pandas`
### <p style="color:#00449E"> Class Exercise

Use the `nbc_show_na.csv` file to answer the following questions:

1. Find the top show in terms of the value of `PE` for each Genre.
2. Find the top show in terms of the value of `GRP` for each Network.
3. Which genre has the largest `GRP` on average?

---

# Workflow
### <p style="color:#00449E"> Installing Python modules

- Anaconda Spyder also allows installation from within Spyder.

.panelset[
.panel[.panel-name[Windows]

- Step 1. Type the following command in a Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or
```python
pip install seaborn
```
]
.panel[.panel-name[Mac]

- Step 1. Type the following command in a Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or
```python
pip install seaborn
```
]
]

---

# Workflow
### <p style="color:#00449E"> Installing Python modules

- Let's install the Python visualization library `seaborn` on Anaconda Spyder.

.panelset[
.panel[.panel-name[Windows]

- Step 1. Open "Anaconda Prompt".
- Step 2. Type the following:

```python
conda install seaborn
```
or
```python
pip install seaborn
```
]
.panel[.panel-name[Mac]

- Step 1.
Open "Terminal". - Step 2. Type the following: ```python conda install seaborn ``` or ```python pip install seaborn ``` ] ] --- class: inverse, center, middle # Data Visualization with `seaborn` --- # Data Visualization .pull-left[ <img src="../lec_figs/lego.png" width="67%" style="display: block; margin: auto;" /> ] .pull-right[ - Graphs and charts let us explore and learn about the structure of the information we have in DataFrame. - Good data visualizations make it easier to communicate our ideas and findings to other people. ] --- # Exploratory Data Analysis (EDA) - We use visualization and summary statistics (e.g., mean, median, minimum, maximum) to explore our data in a systematic way. - EDA is an iterative cycle. We: - Generate questions about our data. - Search for answers by visualizing, transforming, and modelling our data. - Use what we learn to refine our questions and/or generate new questions. --- # `seaborn` <img src="../lec_figs/seaborn-logo.png" width="20%" style="display: block; margin: auto;" /> - `seaborn` is a Python data visualization library based on `matplotlib`. - It allows us to easily create beautiful but complex graphics using a simple interface. - It also provides a general improvement in the default appearance of `matplotlib`-produced plots, and so I recommend using it by default. ```python import seaborn as sns ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Types of plots - We will consider the following types of visualization: - Bar chart - Histogram - Scatter plot - Scatter plot with Fitted line - Line chart --- # Getting started with `pandas` ### <p style="color:#00449E"> What is *tidy* `DataFrame`? </p> - There are three rules which make a dataset tidy: 1. Each **variable** has its own *column*. 2. Each **observation** has its own *row*. 3. Each **value** has its own *cell*. 
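
To make the three rules concrete, here is a small sketch that reshapes a wide table into a tidy one with `pd.melt()`. The table and its column names (`country`, `2021`, `2022`, `year`, `cases`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide table: the year is stored in the column names
# instead of in its own variable, so the table is not tidy.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2021": [10, 30],
    "2022": [20, 40],
})

# Tidy version: one column per variable (country, year, cases)
# and one row per observation (a country in a given year).
tidy = pd.melt(wide, id_vars="country", var_name="year", value_name="cases")

print(tidy)
```

Each row of `tidy` is now a single country-year observation, which is the layout that `seaborn` plotting functions expect when they are given column names through `x`, `y`, and `hue`.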
<img src="../lec_figs/tidy-1.png" width="75%" style="display: block; margin: auto;" /> --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Getting started with `seaborn` - Let's get the names of `DataFrame`s provided by the `seaborn` library: ```python import seaborn as sns print( sns.get_dataset_names() ) ``` - Let's use the `titanic` and `tips` DataFrames: ```python df_titanic = sns.load_dataset('titanic') df_titanic.head() df_tips = sns.load_dataset('tips') df_tips.head() ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Bar Chart - A bar chart is used to plot the frequency of the different categories. - It is useful to visualize how values of a **categorical variable** are distributed. - A variable is **categorical** if it can only take one of a small set of values. - We use `sns.countplot()` function to plot a bar chart: .pull-left[ ```python sns.countplot(x = 'sex', data = df_titanic) ``` ] .pull-right[ - Mapping - `data`: DataFrame. - `x`: Name of a categorical variable (column) in DataFrame ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Bar Chart - We can further break up the bars in the bar chart based on another categorical variable. - This is useful to visualize the relationship between the two categorical variables. .pull-left[ ```python sns.countplot(x = 'sex', hue = 'survived', data = df_titanic) ``` ] .pull-right[ - Mapping - `hue`: Name of a categorical variable ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Histogram - A histogram is a **continuous** version of a bar chart. - It is used to plot the frequency of the different values. - It is useful to visualize how values of a **continuous variable** are distributed. - A variable is **continuous** if it can take any of an infinite set of ordered values. 
- We use the `sns.displot()` function to plot a histogram:

.pull-left[
```python
sns.displot(x = 'age', bins = 5 , data = df_titanic)
```
]
.pull-right[
- Mapping
  - `bins`: Number of bins
]

---

# Data Visualization with `seaborn`
### <p style="color:#00449E"> Scatter plot

- A scatter plot is used to display the relationship between two continuous variables.
- We can see co-variation as a pattern in the scattered points.
- We use the `sns.scatterplot()` function to plot a scatter plot:

.pull-left[
```python
sns.scatterplot(x = 'total_bill', y = 'tip', data = df_tips)
```
]
.pull-right[
- Mapping
  - `x`: Name of a continuous variable on the horizontal axis
  - `y`: Name of a continuous variable on the vertical axis
]

---

# Data Visualization with `seaborn`
### <p style="color:#00449E"> Scatter plot

- To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.
- Suppose we are interested in the following question:
  - **Q**. Do smokers and non-smokers differ in tipping behavior?

```python
sns.scatterplot(x = 'total_bill', y = 'tip', hue = 'smoker', data = df_tips)
```

---

# Data Visualization with `seaborn`
### <p style="color:#00449E"> Fitted line

- From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.
- `sns.lmplot()` adds a line that fits well into the scattered points.
- On average, the fitted line describes the relationship between two continuous variables.

```python
sns.lmplot(x = 'total_bill', y = 'tip', data = df_tips)
```

---

# Data Visualization with `seaborn`
### <p style="color:#00449E"> Fitted line

- To the fitted-line plot, we can also add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.
- Using the fitted lines, let's answer the following question:
  - **Q**. Do smokers and non-smokers differ in tipping behavior?
```python
sns.lmplot(x = 'total_bill', y = 'tip', hue = 'smoker', data = df_tips)
```

---

# Data Visualization with `seaborn`
### <p style="color:#00449E"> Line chart

- A line chart is used to display the trend in a continuous variable or the change in a continuous variable over another variable.
- It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.
- We use the `sns.lineplot()` function to plot a line chart:

.pull-left[
```python
path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv'
dow = pd.read_csv(path_csv, index_col=0, parse_dates=True)
sns.lineplot(x = 'Date', y = 'Close', data = dow)
```
]
.pull-right[
- Mapping
  - `x`: Name of a continuous variable (often a time variable) on the horizontal axis
  - `y`: Name of a continuous variable on the vertical axis
]

---
class: inverse, center, middle

# Starting with R and RStudio

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Installing the Tools
### <p style="color:#00449E"> R programming </p>

The R language is available as a free download from the R Project website at:

- Windows: [https://cran.r-project.org/bin/windows/base/](https://cran.r-project.org/bin/windows/base/)
- Mac: [https://cran.r-project.org/bin/macosx/](https://cran.r-project.org/bin/macosx/)
  - Download the file of R that corresponds to your Mac OS (Big Sur, Apple silicon arm64, High Sierra, El Capitan, Mavericks, etc.)
--- # Installing the Tools ### <p style="color:#00449E"> RStudio </p> - **RStudio** offers a graphical interface to assist in creating R code: - The RStudio Desktop is available as a free download from the following webpage: - [https://www.rstudio.com/products/rstudio/download/#download](https://www.rstudio.com/products/rstudio/download/#download) --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Script Pane** is where you write R commands in a script file that you can save. - An R script is simply a text file containing R commands. - RStudio will color-code different elements of your code to make it easier to read. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Console Pane** allows you to interact directly with the R interpreter and type commands where R will immediately execute them. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Environment Pane** is where you can see the values of variables, data frames, and other objects that are currently stored in memory. ] --- # Installing the Tools ### <p style="color:#00449E"> RStudio Environment </p> .pull-left[ <img src="../lec_figs/rstudio_env.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ - **Plots Pane** contains any graphics that you generate from your R code. 
]

---

# Installing the Tools
### <p style="color:#00449E"> R Packages </p>

```r
pkgs <- c("ggplot2", "readr", "dplyr")
install.packages(pkgs)
```

- While running the above code, I recommend answering "no" to the following question:

.pull-left[
**Mac**: *"Do you want to install from sources the packages which need compilation?"* from Console Pane.
]
.pull-right[
**Windows**: *"Would you like to use a personal library instead?"* from Pop-up message.
]

---

# Installing the Tools
### <p style="color:#00449E"> R Packages </p>

- Check whether `ggplot2` is installed well:

```r
library(ggplot2)  # loading the package ggplot2
mpg  # data.frame provided by the package ggplot2
# ggplot2 is included in tidyverse
```

- Let me know if you get an error from the above code.

---
class: inverse, center, middle

# Workflow

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Workflow
### <p style="color:#00449E"> Shortcuts for RStudio and RScript </p>

.pull-left[
**Mac**
- **command + shift + N** opens a new RScript.
- **command + return** runs a current line or selected lines.
- **command + shift + C** is the shortcut for # (commenting).
- **option + -** is the shortcut for `<-`.
]
.pull-right[
**Windows**
- **Ctrl + Shift + N** opens a new RScript.
- **Ctrl + Enter** runs a current line or selected lines.
- **Ctrl + Shift + C** is the shortcut for # (commenting).
- **Alt + -** is the shortcut for `<-`.
]

---

# Workflow

- **Home/End** moves the blinking cursor bar to the beginning/end of the line.
  - **Ctrl** (**command** for Mac Users) **+**
**←**/**→** works too.
- **Ctrl** (**command** for Mac Users) **+ Z** undoes the previous action.
- **Ctrl** (**command** for Mac Users) **+ Shift + Z** redoes an action that was undone.
- **Ctrl** (**command** for Mac Users) **+ F** is useful for finding (and replacing) a phrase in the RScript.
- Auto-completion of commands is useful.
  - Type `libr` in the RScript in RStudio and wait for a second.

.pull-left[
```r
libr
```
]
.pull-right[
<img src="../lec_figs/auto-completionRStudio.png" width="100%" style="display: block; margin: auto;" />
]

---

# Workflow

- To install R package `PACKAGE`, use `install.packages("PACKAGE")`.

```r
install.packages("ggplot2")  # installing package "ggplot2"
```

- When the code is running, RStudio shows the STOP icon at the top right corner of the Console Pane.
  - Do not click it unless you want to stop running the code.

<img src="../lec_figs/console-running.png" width="90%" style="display: block; margin: auto;" />

---

# Workflow
### <p style="color:#00449E"> Quotation marks, parentheses, and `+` </p>

- Quotation marks and parentheses must always come in a pair.
  - If not, Console Pane will show you the continuation character `+`:

```r
> x <- "hello
```

- The `+` tells you that R is waiting for more input; it doesn’t think you’re done yet.

---

# Workflow
### <p style="color:#00449E"> RStudio Options Setting </p>

.pull-left[
<img src="../lec_figs/RStudio_options.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- This option menu is found by menus as follows:
  - *Mac*: RStudio `\(>\)` Preferences
  - *Windows*: Tools `\(>\)` Global Options
- Check
the options as shown in the picture.
- Choose "Never" on "Save workspace to .RData on exit:".
]

---
class: inverse, center, middle

# Starting with R

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Starting with R

- Let's try a few commands to help you become familiar with R and its basic data types.
- In R, **vectors** are arrays of same-typed values.
  - They can be built with the `c()` notation.

.pull-left[
```r
1
1/2
'Joe'
"Joe"
"Joe"=='Joe'
c()
is.null(c())
is.null(5)
```
]
.pull-right[
```r
c(1)
c(1, 2)
c("Apple", 'Orange')
length(c(1, 2))
vec <- c(1, 2)
vec
```
]

---

# Starting with R
### <p style="color:#00449E"> Assignment </p>

- R has many assignment operators (e.g., `<-`, `=`, `->` ).
- The preferred one is `<-`.

```r
x <- 2
x < - 3   # with the space, this compares x with -3 instead of assigning
print(x)

x <- 5
x = 5
5 -> x
```

---

# Starting with R
### <p style="color:#00449E"> R data types </p>

- Primary data types in R are as follows:
  - **Logical**: A simple binary variable that may have only two values---TRUE or FALSE.
  - **Numeric**: Decimal numbers
  - **Integer**: Integers
  - **Character**: Text strings
  - **Factor**: Categorical values. Each possible value of a factor is known as a *level*.
  - **Ordered Factor**: A special factor data type where the order of the levels is significant. E.g., Low, Medium, and High

---

# Starting with R
### <p style="color:#00449E"> R data types </p>

- Test the data types.

```r
x <- TRUE
y <- 1
z <- 'Data Analytics'
productCategory <- c('fruit', 'vegetable', 'dry goods',
                     'fruit', 'vegetable', 'dry goods')
productCategoryFactor <- factor(productCategory)
```

- The `class()` function returns the data type of an object.
  - What are the classes of `x`, `y`, `z`, `productCategory`, and `productCategoryFactor`?

---

# Starting with R
### <p style="color:#00449E"> R data types </p>

- Most R data types are *mutable*, in that we're allowed to change them.
```r a <- c(1, 2) b <- a print(b) # Alters a a[[1]] <- 5 print(a) print(b) ``` --- # Starting with R ### <p style="color:#00449E"> Lists </p> - **Lists**, unlike *vectors*, can store more than one type of object. - The ways to access items in lists are the `$` operator and the `[[]]` operator. ```r x <- list('a' = 6, b = 'fred') names(x) x$a x$b x[['a']] x[c('a', 'a', 'b', 'b')] ``` --- # Starting with R ### <p style="color:#00449E"> R data types </p> - Here are examples of a vector and a list. ```r example_vector <- c(10, 20, 30) example_list <- list(a = 10, b = 20, c = 30) example_vector[1] example_list[1] example_vector[[2]] example_list[[2]] example_vector[c(FALSE, TRUE, TRUE)] example_list[c(FALSE, TRUE, TRUE)] example_list$b example_list[["b"]] ``` --- # Starting with R ### <p style="color:#00449E"> Errors </p> - Errors are just R's way of saying it safely refused to complete an ill-formed operation - Fear of errors should not limit experiments. ```r x <- 1:5 print(x) x <- meanMISSPELLED(x) print(x) x <- mean(x) print(x) ``` --- # Starting with R ### <p style="color:#00449E"> Data Frames </p> - R’s central data structure is the data frame. - A data frame is organized into rows and columns. - Data frames are essentially lists of columns. - Data frames can have columns of different types. .pull-left[ ```r d <- data.frame(x=c(1,2), y=c('a','b')) d[['x']] d$x d[[1]] ``` ] .pull-right[ ```r d d[1,] d[,1] d[1,1] d[1, 'x'] ``` ] --- # Starting with R ### <p style="color:#00449E"> Data Frames </p> - The R **data.frame** class is designed to store data in a very good "ready for analysis" format. ```r d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1)) print(d) d$col3 <- d$col1 + d$col2 print(d) ``` --- # Starting with R ### <p style="color:#00449E"> NULL and NA values </p> - `NULL` is just an alias for `c()`, the empty vector. - `NA` indicates missing or unavailable data. 
```r
c(c(), 1, NULL)
c("a", NA, "c")
```

---

# Starting with R
### <p style="color:#00449E"> NULL and NA values </p>

- Most R data types are *mutable*, in that we're allowed to change them.

```r
d <- data.frame(x = 1, y = 2)
d2 <- d
d$x <- 5
print(d)
print(d2)
```

---
class: inverse, center, middle

# Management of Files, Directories, and Scripts

<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---

# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Code and comment style </p>

- The two main principles for coding and managing data are:
  - Make things easier for your future self.
  - Don't trust your future self.
- So we comment our code.

---

# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Code and comment style </p>

- The `#` mark is R's comment character.
  - `#` indicates that the rest of the line is to be ignored.
  - Write comments before the line that you want the comment to apply to.
- Consider using block commenting for separating code sections.
  - `#####` defines a coding block.
- Break down long lines and long algebraic expressions.

---

# Management of Files, Directories, and Scripts
### <p style="color:#00449E"> Materials for the book, Practical Data Science with R </p>

- Click the green "Code" button and download the ZIP file from the following GitHub page: [https://github.com/WinVector/PDSwR2](https://github.com/WinVector/PDSwR2).

.panelset[
.panel[.panel-name[Windows]
- **Step 1.** Go to your Download folder (or the folder where the downloaded file is saved) using the File Explorer.
- **Step 2.** Click the ZIP file one time.
- **Step 3.** Cut the file by using the shortcut (**Ctrl+X**).
- **Step 4.** Go to your working folder for the course using the File Explorer. - **Step 5.** Paste the file to your working folder by using **Ctrl+V**. - **Step 6.** Right-click the ZIP file and click "Extract ..." ] <!----> .panel[.panel-name[Mac] - **Step 1.** Go to your Download folder (or the folder where the downloaded file is saved) using the Finder. - **Step 2.** Click the ZIP file (or the folder if the ZIP file is extracted) one time. - **Step 3.** Copy the file (or the folder) by using the shortcut (**command+C**). - **Step 4.** Go to your working folder for the course using the Finder. - **Step 5.** Paste the file to your working folder by using **command+option+V**. - **Step 6.** Right-click the ZIP file and click "Extract ..." ] <!----> ] <!--end of panelset--> --- # Management of Files and Directories ### <p style="color:#00449E"> Finding the path name of the file </p> .panelset[ .panel[.panel-name[Windows 11] - **Step 1.** Go to your folder using the File Explorer. - **Step 2.** Right-click the file. - **Step 3.** Click "Copy as path". - **Step 4.** Paste the path name of the file to the R script (Ctrl+V). - **Step 5.** - *Option 1.* Replace backslash(`\`) with double-backslash(`\\`) in the path name. - *Option 2.* Replace backslash(`\`) with slash(`/`) in the path name. ] <!----> .panel[.panel-name[Windows 10] - **Step 1.** Go to your folder using the File Explorer. - **Step 2.** Keep pressing the "Shift" key - **Step 3.** Right-click the file. - **Step 4.** Click "Copy as path". - **Step 5.** Paste the path name of the file to the R script (Ctrl+V). - **Step 6.** - *Option 1.* Replace backslash(`\`) with double-backslash(`\\`) in the path name. - *Option 2.* Replace backslash(`\`) with slash(`/`) in the path name. ] <!----> .panel[.panel-name[Mac] - **Step 1.** Go to your folder using the Finder. 
- **Step 2.** Right-click the file in the folder - **Step 3.** Keep pressing "option" - **Step 4.** Click "Copy 'PATH\_FOR\_YOUR\_FILE' as Pathname" from the menu. - **Step 5.** Paste it to the R script (command+V). ] <!----> ] <!--end of panelset--> --- class: inverse, center, middle # Working with Data from Files <html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html> --- # Working with Data from Files - Step 1. Find the path name for the file, `car.data.csv`, from the sub-folder, 'UCICar' in the folder, 'PDSwR2-main'. - Step 2. In the code below, replace 'PATH_NAME_FOR_THE_FILE_car.data.csv' with the path name for the file, `car.data.csv`. - Step 3. Run the following R code: ```r # install.packages("readr") library(readr) uciCar <- read_csv( 'PATH_NAME_FOR_THE_FILE_car.data.csv') View(uciCar) ``` --- # Working with Data from Files ### <p style="color:#00449E"> Examining data frame </p> - `class()` tells you what kind of R object you have. - `dim()` shows how many rows and columns are in the data for `data.frame`. - `head()` shows the top few rows of the data. - `help()` provides the documentation for a class. - Try `help(class(uciCar))`. - `str()` gives us the structure for an object. --- # Working with Data from Files ### <p style="color:#00449E"> Examining data frame </p> - `summary()` provides a summary of almost any R object. - `skimr::skim()` provides a more detailed summary. - `skimr` is the package that provides the function `skim()`. - `print()` prints all the data. - Note: for large datasets, this can take a very long time and is something you want to avoid. - `View()` displays the data in a simple spreadsheet-like grid viewer. - `dplyr::glimpse()` displays brief information about the data. 
---

# Working with Data from Files
### <p style="color:#00449E"> Examining data frame </p>

```r
print(uciCar)
class(uciCar)
dim(uciCar)
head(uciCar)
help(class(uciCar))
str(uciCar)
summary(uciCar)

library(skimr)
skim(uciCar)

library(tidyverse)
glimpse(uciCar)
```

---

# Working with Data from Files
### <p style="color:#00449E"> Reading data from a URL </p>

- We can import the data file from the web.

```r
# install.packages("readr")
# library(readr)
tvshows <- read_csv(
  'https://bcdanl.github.io/data/tvshows.csv')
```

---

# Working with Data from Files
### <p style="color:#00449E"> Data visualization </p>

- Let's try some data visualization using `ggplot()`:

```r
# install.packages("ggplot2")
library(ggplot2)

ggplot(tvshows) +
  geom_point(aes(x=GRP, y=PE, color=Genre))

ggplot(tvshows) +
  geom_point(aes(x=GRP, y=PE)) +
  facet_wrap(~Genre)
```

- What is the relationship between audience size (`GRP`) and audience engagement (`PE`)?