class: title-slide, left, bottom

# Lecture 25
----
## **DANL 200: Introduction to Data Analytics**
### Byeong-Hak Choe
### December 1, 2022

---
name: about-me
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

### Data Analyst, Human Capital Management (Compensation & Analytics)

.fade[Neuberger Berman<br>New York, NY<br>June 2022 – Present]

[@laurenkopac](https://github.com/laurenkopac)
---
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

### Assistant Director of Data, Law School in the Office of Academic Affairs
### Institutional Research Analyst, School of Engineering in the Office of the Dean

.fade[Columbia University<br>New York, NY<br>October 2018 – June 2022]

---
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

.pull-left[
### MS Computer Science, *Machine Learning*

.fade[Columbia University<br>New York, NY<br>Expected December 2023]
]

.pull-right[
### BA Mathematics

.fade[SUNY Geneseo<br>Geneseo, NY<br>August 2012 – December 2015]
]

<!-- <html><hr color='#EB811B'></html> -->

---
# Announcement
### <p style="color:#00449E"> Job Opportunities

.panelset[

.panel[.panel-name[M&T]
- **M&T guests**, Leah Froebel and Emily Scheck
  - Dec. 2nd (11:30 am - 12:30 pm) in South 340.
  - All are welcome! They will share their career stories along with an introduction to their F/T Management Development program (5 states) and prestigious internships in multiple functional areas (HR, Marketing, Operations, etc.).
]

.panel[.panel-name[McKinsey]
- **McKinsey Consulting**
  - (F/T grad. Dec. '22 and May '23) 2-year paid rotational fellowship, NYC.
  - If interested, send an email to cannonm@geneseo.edu to meet with one of our McKinsey alumni for prep this week.
  - [Business Insights Fellow](https://www.mckinsey.com/careers/search-jobs/jobs/businessinsightsfellow-73311)
]

.panel[.panel-name[Blackstone]
- **Blackstone**
  - (Econ. or Finance) Sophomores & Juniors.
  - Apply to attend a Blackstone networking event in NYC, Jan. 13th, 12:30 - 3:30 pm; application due 12/8.
  - [Blackstone Networking Event](https://app.joinhandshake.com/edu/jobs/7242483)
]

]

---
class: inverse, center, middle

# Linear Regression using **R**
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- In the 2016 US Census PUMS dataset, recorded personal data includes occupation, level of education, personal income, and many other demographic variables:
  - `SCHL`: level of education

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- Suppose we also want to assess how personal income (`PINCP`) varies with (1) age (`AGEP`), (2) gender (`SEX`), and (3) level of education (`SCHL`).

1. Conduct exploratory data analysis.
2. Based on the visualization, set a hypothesis regarding the relationship between having a bachelor's degree and `PINCP`.
3. Train the linear regression model (a sketch of the factor setup follows on the next slide).
4. Interpret the beta coefficients from the linear regression result.
5. Calculate the predicted `PINCP` using the testing data.
6. Draw the actual vs. predicted outcome plot and the residual plot.
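
---
# Linear Regression using **R**
### <p style="color:#00449E"> Setting Up Factor Variables

- Categorical explanatory variables such as `SEX` and `SCHL` should be factors so that `lm()` dummy-encodes them. Below is a minimal sketch, usable once `psub` is loaded (see the Data panel on the following slide). The level label `"Regular high school diploma"` is an assumption about how the PUMS data is coded; check the actual labels first.

```r
# Check the actual category labels first:
# psub %>% count(SCHL)

psub <- psub %>%
  mutate( SEX  = factor(SEX),
          SCHL = factor(SCHL) )

# Optionally set the baseline (reference) level explicitly; the label
# "Regular high school diploma" is an assumption about the data's coding:
# psub <- psub %>%
#   mutate( SCHL = relevel(SCHL, ref = "Regular high school diploma") )
```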
---
# Linear Regression using **R**
### <p style="color:#00449E"> R commands for EDA and linear regression analysis

.panelset[

.panel[.panel-name[Data]

```r
library(tidyverse)
psub <- readRDS(
  url('https://bcdanl.github.io/data/psub.RDS')
)

set.seed(54321)
gp <- runif( nrow(psub) )  # one uniform random number per observation

# Set up factor variables if needed.

dtrain <- filter(psub, gp >= .5)  # training data (about 50%)
dtest  <- filter(psub, gp < .5)   # testing data (about 50%)
```
]

.panel[.panel-name[EDA]

```r
library(skimr)
sum_dtrain <- skim(
  select(dtrain, PINCP, AGEP, SEX, SCHL)
)

library(GGally)
ggpairs(
  select(dtrain, PINCP, AGEP, SEX, SCHL)
)
# MORE VISUALIZATIONS ARE RECOMMENDED
```
]

.panel[.panel-name[Training]

```r
model_1 <- lm( PINCP ~ AGEP + SEX,
               data = dtrain )

model_2 <- lm( PINCP ~ AGEP + SEX + SCHL,
               data = dtrain )
```
]

.panel[.panel-name[Summary 1]

- Summary with base R:

```r
summary(model_1)
summary(model_2)

coef(model_1)
coef(model_2)

# Using the model.matrix() function on our linear model object,
# we can get the data matrix that underlies our regression.
df_model_1 <- as_tibble( model.matrix(model_1) )
df_model_2 <- as_tibble( model.matrix(model_2) )
```
]

.panel[.panel-name[Summary 2]

- Summary with R packages:

```r
# install.packages(c("stargazer", "broom"))
library(stargazer)
library(broom)

stargazer(model_1, model_2, type = 'text')  # from the stargazer package

sum_model_2 <- tidy(model_2)  # from the broom package
# Consider filter() to keep statistically significant beta estimates
```
]

.panel[.panel-name[Betas in plot]

```r
ggplot(sum_model_2) +
  geom_pointrange(
    aes(x = term,
        y = estimate,
        ymin = estimate - 2*std.error,  # roughly a 95% interval
        ymax = estimate + 2*std.error
    )
  ) +
  coord_flip()
```
]

.panel[.panel-name[Prediction]

```r
dtest <- dtest %>%
  mutate(
    pred_1 = predict(model_1, newdata = dtest),
    pred_2 = predict(model_2, newdata = dtest)
  )
```
]

.panel[.panel-name[Actual vs. Prediction Plot]

```r
ggplot( data = dtest,
        aes(x = pred_2, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, perfect prediction line
```
]

.panel[.panel-name[Residual Plot]

```r
ggplot( data = dtest,
        aes(x = pred_2, y = PINCP - pred_2) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes( yintercept = 0 ),  # zero residual = perfect prediction
              color = "red", linetype = 2 )
```
]

]

---
# Linear Regression using **R**
### <p style="color:#00449E"> The model equation

`$$\begin{aligned}
\texttt{PINCP[i]} \;=\;\; &\texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \\
&+\, \texttt{b3*SCHL.no high school diploma[i]} \\
&+\, \texttt{b4*SCHL.GED or alternative credential[i]} \\
&+\, \texttt{b5*SCHL.some college credit, no degree[i]} \\
&+\, \texttt{b6*SCHL.Associate's degree[i]} \\
&+\, \texttt{b7*SCHL.Bachelor's degree[i]} \\
&+\, \texttt{b8*SCHL.Master's degree[i]} \\
&+\, \texttt{b9*SCHL.Professional degree[i]} \\
&+\, \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.
\end{aligned}$$`
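
---
# Linear Regression using **R**
### <p style="color:#00449E"> The model equation

- As a sanity check, the fitted equation above can be reproduced by hand. This is a minimal sketch: it multiplies the estimated betas by the dummy-encoded first training row and compares the result with `predict()`.

```r
b <- coef(model_2)                 # b0, b1, ..., b10
x <- model.matrix(model_2)[1, ]    # first row, with SEX and SCHL dummy-encoded

sum( b * x )          # b0 + b1*AGEP[1] + b2*SEX.Male[1] + ... computed by hand
predict(model_2)[1]   # the same fitted value from lm()
```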
<img src="../lec_figs/logarithm_plots.png" width="42%" style="display: block; margin: auto;" /> ] .panel[.panel-name[log examples] - `\(\log_{e}\,(\,x\,)\)`: the base `\(e\)` logarithm is called the natural log, where `\(e = 2.718\cdots\)` is the mathematical constant, the Euler's number. - `\(\log\,(\,x\,)\)` or `\(\ln\,(\,x\,)\)`: the natural log of `\(x\)` . - `\(\log_{e}\,(\,7.389\cdots\,)\)`: the natural log of `\(7.389\cdots\)` is `\(2\)`, because `\(e^{2} = 7.389\cdots\)`. ] ] --- # Linear Regression with Log-transformtion - We should use a logarithmic scale when **percent change**, or change in orders of magnitude, is more important than changes in absolute units. - For small changes in variable `\(x\)` from `\(x_{0}\)` to `\(x_{1}\)`, the following equation holds: `$$\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.$$` - A change in income of $5,000 means something very different across people with different income levels. - A percentage change in income, e.g., 5% of income, may mean somewhat more similar across people with different income levels. - We can also consider using a log scale to reduce a variance of residuals when a variable is heavily skewed. --- # Linear Regression with Log-transformtion - The log transformation makes the skewed distribution of income more normal. ```r ggplot(dtrain, aes( x = PINCP ) ) + geom_density() ggplot(dtrain, aes( x = log(PINCP) ) ) + geom_density() ``` --- # Linear Regression with Log-transformtion ### <p style="color:#00449E"> A Few Algebras for Logarithm and Exponential Functions - Rule 1: $$\texttt{y} \,=\, \texttt{log(x)}\qquad\Leftrightarrow\qquad \texttt{exp(y)} \,=\, \texttt{x}. $$ - Rule 2: `$$\texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{log}\,\left(\,\frac{\texttt{x}}{\texttt{z}}\,\right).$$` - By the rules above, $$ \texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{b}\qquad\Leftrightarrow\qquad \frac{\texttt{x}}{\texttt{z}} \,=\,\texttt{exp(b)}. 
---
# Linear Regression with Log-transformation

- Let's consider the following linear regression model:

`$$\begin{aligned}
\log(\texttt{PINCP[i]}) \;=\;\; &\texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \\
&+\, \texttt{b3*SCHL.no high school diploma[i]} \\
&+\, \texttt{b4*SCHL.GED or alternative credential[i]} \\
&+\, \texttt{b5*SCHL.some college credit, no degree[i]} \\
&+\, \texttt{b6*SCHL.Associate's degree[i]} \\
&+\, \texttt{b7*SCHL.Bachelor's degree[i]} \\
&+\, \texttt{b8*SCHL.Master's degree[i]} \\
&+\, \texttt{b9*SCHL.Professional degree[i]} \\
&+\, \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.
\end{aligned}$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- Suppose `\(\texttt{Ben}\)` (age 51) and `\(\texttt{Bob}\)` (age 50) have the same gender and level of education. If we apply the rules above to their predicted incomes,

`$$\begin{aligned}
\widehat{\texttt{log(PINCP[Ben])}} \,-\, \widehat{\texttt{log(PINCP[Bob])}} \;&=\; \hat{\texttt{b1}}\texttt{ * }(\texttt{AGEP[Ben]} - \texttt{AGEP[Bob]}) \\
&=\; \hat{\texttt{b1}}\texttt{ * (51 - 50)} \\
&=\; \hat{\texttt{b1}}.
\end{aligned}$$`

- So we have the following:

`$$\frac{\widehat{\texttt{PINCP[Ben]}}}{\widehat{\texttt{PINCP[Bob]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)}
\quad\Leftrightarrow\quad
\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Bob]}}\texttt{ * exp(}\hat{\texttt{b1}}\texttt{)}$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- Now suppose `\(\texttt{Ben}\)` and `\(\texttt{Linda}\)` are identical except that Ben is male and Linda is female. Applying the same rules to their predicted incomes,

`$$\frac{\widehat{\texttt{PINCP[Ben]}}}{\widehat{\texttt{PINCP[Linda]}}} \;=\; \texttt{exp(}\hat{\texttt{b2}}\texttt{)}
\quad\Leftrightarrow\quad
\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Linda]}}\texttt{ * exp(}\hat{\texttt{b2}}\texttt{)}$$`

- Suppose `\(\texttt{exp(}\hat{\texttt{b2}}\texttt{)} = 1.18\)`.
  - Then `\(\widehat{\texttt{PINCP[Ben]}}\)` is 1.18 times `\(\widehat{\texttt{PINCP[Linda]}}\)`.
  - That is, being male is associated with an 18% higher income relative to being female.

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `log(PINCP)` by `\(\hat{\texttt{b1}}\)`.

- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `PINCP` by `\((\texttt{exp(}\hat{\texttt{b1}}\texttt{)} - 1)\times 100\)` percent.

- All else being equal, being male is associated with an increase in `log(PINCP)` by `\(\hat{\texttt{b2}}\)` relative to being female.

- All else being equal, being male is associated with an increase in `PINCP` by `\((\texttt{exp(}\hat{\texttt{b2}}\texttt{)} - 1)\times 100\)` percent relative to being female.
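
---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- A minimal sketch of this interpretation in R, assuming a log-income model trained on `dtrain` (the object name `model_log` is ours):

```r
model_log <- lm( log(PINCP) ~ AGEP + SEX + SCHL,
                 data = dtrain )

b <- coef(model_log)

exp( b["AGEP"] )              # multiplicative change in PINCP per year of age
100 * ( exp(b["AGEP"]) - 1 )  # the same effect, as a percent change
```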
---
class: inverse, center, middle

# Linear Regression with Interaction Terms
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Motivation

- Does the relationship between education and income vary by gender?

- Suppose we are interested in knowing whether women are compensated unequally despite having the same levels of education and preparation as men.

- How can linear regression address the question above?

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Model

- The linear regression with an interaction between explanatory variables `\(X_{1}\)` and `\(X_{2}\)` is:

`$$Y_{i} \,=\, b_{0} \,+\, b_{1}\,X_{1,i} \,+\, b_{2}\,X_{2,i} \,+\, b_{3}\,X_{1,i}\times X_{2,i} \,+\, e_{i},$$`

- where
  - `\(i\)`: the `\(i\)`-th observation in the training data.frame, `\(i = 1, 2, 3, \cdots\)`;
  - `\(Y_{i}\)`: the `\(i\)`-th observation of the outcome variable `\(Y\)`;
  - `\(X_{p,i}\)`: the `\(i\)`-th observation of the `\(p\)`-th explanatory variable `\(X_{p}\)`;
  - `\(e_{i}\)`: the `\(i\)`-th observation of the statistical error variable.

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Model

- The linear regression with an interaction between explanatory variables `\(X_{1}\)` and `\(X_{2}\)` is:

`$$Y_{i} \,=\, b_{0} \,+\, b_{1}\,X_{1,i} \,+\, b_{2}\,X_{2,i} \,+\, b_{3}\,X_{1,i}\times X_{2,i} \,+\, e_{i}.$$`

- The relationship between `\(X_{1}\)` and `\(Y\)` is now described not only by `\(b_{1}\)` but also by `\(b_{3}\,X_{2}\)`:

`$$\frac{\Delta Y}{\Delta X_{1}} \,=\, b_{1} \,+\, b_{3}\,X_{2}.$$`

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Motivation

- Is education related to income?

```r
model <- lm( log(PINCP) ~ AGEP + SCHL + SEX,
             data = dtrain )
```

- Does the relationship between education and income vary by gender? (A sketch for inspecting the interaction estimates follows on the next slide.)

```r
model_int <- lm( log(PINCP) ~ AGEP + SCHL + SEX + SCHL * SEX,
                 data = dtrain )

# Equivalently, since SCHL * SEX expands to SCHL + SEX + SCHL:SEX:
model_int <- lm( log(PINCP) ~ AGEP + SCHL * SEX,  # Use this one
                 data = dtrain )
```
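
---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Inspecting the Interaction Estimates

- A minimal sketch for pulling out just the interaction estimates from `model_int` (with the tidyverse and broom loaded). The exact term names, such as `SCHL...:SEXMale`, depend on how R dummy-encodes the factors:

```r
library(broom)

tidy(model_int) %>%
  filter( str_detect(term, ":") )  # interaction terms contain ":"
```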
---
class: inverse, center, middle

# Log-Log Linear Regression
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- To estimate the price elasticity of orange juice (OJ), we will use sales data for OJ from Dominick's grocery stores in the 1990s.
  - Weekly `price` and `sales` (number of cartons sold) for three OJ brands: Tropicana, Minute Maid, and Dominick's.
  - An indicator, `feat`, showing whether each `brand` was advertised (in store or flyer) that week.

Variable | Description
---------|-------------------------------
`sales`  | Quantity of OJ cartons sold
`price`  | Price per carton of OJ
`brand`  | Brand of OJ
`feat`   | Indicator for whether the brand was advertised that week

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- Let's prepare the OJ data:

```r
oj <- read_csv('https://bcdanl.github.io/data/dominick_oj.csv')

# Split 70-30 into training and testing data.frames
set.seed(14454)
gp <- runif( nrow(oj) )
dtrain <- filter(oj, gp < 0.7)   # about 70% for training
dtest  <- filter(oj, gp >= 0.7)  # about 30% for testing
```

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- The following model estimates the price elasticity of demand for a carton of OJ:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}},$$`

- where

`$$\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,=\,
\begin{cases}
\texttt{1} & \text{if orange juice } \texttt{i} \text{ is Tropicana;}\\
\texttt{0} & \text{otherwise;}
\end{cases}$$`

`$$\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,=\,
\begin{cases}
\texttt{1} & \text{if orange juice } \texttt{i} \text{ is Minute Maid;}\\
\texttt{0} & \text{otherwise.}
\end{cases}$$`

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- The following model estimates the price elasticity of demand for a carton of OJ:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}$$`

- When `\(\texttt{brand}_{\,\texttt{tr}, \texttt{i}}\,=\,0\)` and `\(\texttt{brand}_{\,\texttt{mm}, \texttt{i}}\,=\,0\)`, the beta coefficient for the intercept, `\(b_{0}\)`, gives Dominick's log sales at a log price of zero.

- The beta coefficient `\(b_{1}\)` is the price elasticity of demand.
  - It measures how sensitive the quantity demanded is to price.

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- For small changes in variable `\(x\)` from `\(x_{0}\)` to `\(x_{1}\)`, the following approximation holds:

`$$\Delta \log(x) \,=\, \log(x_{1}) \,-\, \log(x_{0}) \,\approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.$$`

- The coefficient on `\(\log(\texttt{price}_{\texttt{i}})\)`, `\(b_{1}\)`, is therefore

`$$b_{1} \,=\, \frac{\Delta \log(\texttt{sales}_{\texttt{i}})}{\Delta \log(\texttt{price}_{\texttt{i}})} \,=\, \frac{\frac{\Delta\, \texttt{sales}_{\texttt{i}}}{\texttt{sales}_{\texttt{i}}}}{\frac{\Delta\, \texttt{price}_{\texttt{i}}}{\texttt{price}_{\texttt{i}}}},$$`

the percentage change in `\(\texttt{sales}\)` when `\(\texttt{price}\)` increases by 1%.

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- Let's train the model:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}$$`

```r
# brand is categorical; lm() dummy-encodes it, with Dominick's
# (the first factor level) as the baseline.
reg_1 <- lm( log(sales) ~ brand + log(price),
             data = dtrain )
```
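
---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- A minimal sketch for reading off the estimated elasticity and checking the trained model on the held-out testing data:

```r
coef(reg_1)["log(price)"]   # b1: % change in sales per 1% increase in price

dtest <- dtest %>%
  mutate( pred = predict(reg_1, newdata = dtest) )

ggplot( data = dtest,
        aes(x = pred, y = log(sales)) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, perfect prediction line
```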