class: title-slide, left, bottom

# Lecture 25
----
## **DANL 200: Introduction to Data Analytics**
### Byeong-Hak Choe
### December 1, 2022

---
name: about-me
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

### Data Analyst, Human Capital Management (Compensation & Analytics)

.fade[Neuberger Berman<br>New York, NY<br>June 2022 – Present]

[@laurenkopac](https://github.com/laurenkopac)
---
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

### Assistant Director of Data, Law School in the Office of Academic Affairs
### Institutional Research Analyst, School of Engineering in the Office of the Dean

.fade[Columbia University<br>New York, NY<br>October 2018 – June 2022]

---
layout: false
class: about-me-slide, inverse, middle, center
background-image: url(img/hex-xaringan.png), url(img/frame-art.png)
background-position: 90% 75%, 75% 75%
background-size: 8%, cover

# Talk on Data Analytics Career

<img style="border-radius: 50%;" src="https://avatars.githubusercontent.com/u/55219713?v=4" width="150px"/>

## Lauren Kopac

.pull-left[
### MS Computer Science, *Machine Learning*

.fade[Columbia University<br>New York, NY<br>Expected December 2023]
]

.pull-right[
### BA Mathematics

.fade[SUNY Geneseo<br>Geneseo, NY<br>August 2012 – December 2015]
]

<!-- <html><hr color='#EB811B'></html> -->

---
# Announcement
### <p style="color:#00449E"> Job Opportunities

.panelset[

.panel[.panel-name[M&T]
- **M&T guests**, Leah Froebel and Emily Scheck
  - Dec. 2nd (11:30 am - 12:30 pm) in South 340.
  - All are welcome! They will share their career stories along with an introduction to their F/T Management Development program (5 states) and prestigious internships in multiple functional areas (HR, Marketing, Operations, etc.).
]

.panel[.panel-name[McKinsey]
- **McKinsey Consulting**
  - (F/T grad. Dec. '22 and May '23) 2-year paid rotational fellowship, NYC.
  - If interested, send an email to cannonm@geneseo.edu to meet with one of our McKinsey alumni for prep this week.
  - [Business Insights Fellow](https://www.mckinsey.com/careers/search-jobs/jobs/businessinsightsfellow-73311)
]

.panel[.panel-name[Blackstone]
- **Blackstone**
  - (Econ. or Finance) Sophomores & Juniors.
  - Apply to attend a Blackstone networking event in NYC, Jan. 13th, 12:30 - 3:30 pm; application due 12/8.
  - [Blackstone Networking Event](https://app.joinhandshake.com/edu/jobs/7242483)
]

]

---
class: inverse, center, middle

# Linear Regression using **R**
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- In the 2016 US Census PUMS dataset, recorded personal data includes occupation, level of education, personal income, and many other demographic variables:
  - `SCHL`: level of education

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- Suppose we also want to assess how personal income (`PINCP`) varies with (1) age (`AGEP`), (2) gender (`SEX`), and (3) level of education (`SCHL`).

1. Conduct exploratory data analysis.
2. Based on the visualization, set a hypothesis regarding the relationship between having a bachelor's degree and `PINCP`.
3. Train the linear regression model (a sketch of the factor setup follows on the next slide).
4. Interpret the beta coefficients from the linear regression result.
5. Calculate the predicted `PINCP` using the testing data.
6. Draw the actual vs. predicted outcome plot and the residual plot.
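
---
# Linear Regression using **R**
### <p style="color:#00449E"> Setting Up Factor Variables

- Categorical explanatory variables such as `SEX` and `SCHL` should be factors so that `lm()` dummy-encodes them. Below is a minimal sketch, usable once `psub` is loaded (see the Data panel on the following slide). The level label `"Regular high school diploma"` is an assumption about how the PUMS data is coded; check the actual labels first.

```r
# Check the actual category labels first:
# psub %>% count(SCHL)

psub <- psub %>%
  mutate( SEX  = factor(SEX),
          SCHL = factor(SCHL) )

# Optionally set the baseline (reference) level explicitly; the label
# "Regular high school diploma" is an assumption about the data's coding:
# psub <- psub %>%
#   mutate( SCHL = relevel(SCHL, ref = "Regular high school diploma") )
```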
---
# Linear Regression using **R**
### <p style="color:#00449E"> R commands for EDA and linear regression analysis

.panelset[

.panel[.panel-name[Data]

```r
library(tidyverse)
psub <- readRDS(
  url('https://bcdanl.github.io/data/psub.RDS')
)

set.seed(54321)
gp <- runif( nrow(psub) )  # one uniform random number per observation

# Set up factor variables if needed.

dtrain <- filter(psub, gp >= .5)  # training data (about 50%)
dtest  <- filter(psub, gp < .5)   # testing data (about 50%)
```
]

.panel[.panel-name[EDA]

```r
library(skimr)
sum_dtrain <- skim(
  select(dtrain, PINCP, AGEP, SEX, SCHL)
)

library(GGally)
ggpairs(
  select(dtrain, PINCP, AGEP, SEX, SCHL)
)
# MORE VISUALIZATIONS ARE RECOMMENDED
```
]

.panel[.panel-name[Training]

```r
model_1 <- lm( PINCP ~ AGEP + SEX,
               data = dtrain )

model_2 <- lm( PINCP ~ AGEP + SEX + SCHL,
               data = dtrain )
```
]

.panel[.panel-name[Summary 1]

- Summary with base R:

```r
summary(model_1)
summary(model_2)

coef(model_1)
coef(model_2)

# Using the model.matrix() function on our linear model object,
# we can get the data matrix that underlies our regression.
df_model_1 <- as_tibble( model.matrix(model_1) )
df_model_2 <- as_tibble( model.matrix(model_2) )
```
]

.panel[.panel-name[Summary 2]

- Summary with R packages:

```r
# install.packages(c("stargazer", "broom"))
library(stargazer)
library(broom)

stargazer(model_1, model_2, type = 'text')  # from the stargazer package

sum_model_2 <- tidy(model_2)  # from the broom package
# Consider filter() to keep statistically significant beta estimates
```
]

.panel[.panel-name[Betas in plot]

```r
ggplot(sum_model_2) +
  geom_pointrange(
    aes(x = term,
        y = estimate,
        ymin = estimate - 2*std.error,  # roughly a 95% interval
        ymax = estimate + 2*std.error
    )
  ) +
  coord_flip()
```
]

.panel[.panel-name[Prediction]

```r
dtest <- dtest %>%
  mutate(
    pred_1 = predict(model_1, newdata = dtest),
    pred_2 = predict(model_2, newdata = dtest)
  )
```
]

.panel[.panel-name[Actual vs. Prediction Plot]

```r
ggplot( data = dtest,
        aes(x = pred_2, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, perfect prediction line
```
]

.panel[.panel-name[Residual Plot]

```r
ggplot( data = dtest,
        aes(x = pred_2, y = PINCP - pred_2) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes( yintercept = 0 ),  # zero residual = perfect prediction
              color = "red", linetype = 2 )
```
]

]

---
# Linear Regression using **R**
### <p style="color:#00449E"> The model equation

`$$\begin{aligned}
\texttt{PINCP[i]} \;=\;\; &\texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \\
&+\, \texttt{b3*SCHL.no high school diploma[i]} \\
&+\, \texttt{b4*SCHL.GED or alternative credential[i]} \\
&+\, \texttt{b5*SCHL.some college credit, no degree[i]} \\
&+\, \texttt{b6*SCHL.Associate's degree[i]} \\
&+\, \texttt{b7*SCHL.Bachelor's degree[i]} \\
&+\, \texttt{b8*SCHL.Master's degree[i]} \\
&+\, \texttt{b9*SCHL.Professional degree[i]} \\
&+\, \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.
\end{aligned}$$`
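
---
# Linear Regression using **R**
### <p style="color:#00449E"> The model equation

- As a sanity check, the fitted equation above can be reproduced by hand. This is a minimal sketch: it multiplies the estimated betas by the dummy-encoded first training row and compares the result with `predict()`.

```r
b <- coef(model_2)                 # b0, b1, ..., b10
x <- model.matrix(model_2)[1, ]    # first row, with SEX and SCHL dummy-encoded

sum( b * x )          # b0 + b1*AGEP[1] + b2*SEX.Male[1] + ... computed by hand
predict(model_2)[1]   # the same fitted value from lm()
```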
<img src="../lec_figs/logarithm_plots.png" width="42%" style="display: block; margin: auto;" /> ] .panel[.panel-name[log examples] - `\(\log_{e}\,(\,x\,)\)`: the base `\(e\)` logarithm is called the natural log, where `\(e = 2.718\cdots\)` is the mathematical constant, the Euler's number. - `\(\log\,(\,x\,)\)` or `\(\ln\,(\,x\,)\)`: the natural log of `\(x\)` . - `\(\log_{e}\,(\,7.389\cdots\,)\)`: the natural log of `\(7.389\cdots\)` is `\(2\)`, because `\(e^{2} = 7.389\cdots\)`. ] ] --- # Linear Regression with Log-transformtion - We should use a logarithmic scale when **percent change**, or change in orders of magnitude, is more important than changes in absolute units. - For small changes in variable `\(x\)` from `\(x_{0}\)` to `\(x_{1}\)`, the following equation holds: `$$\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.$$` - A change in income of $5,000 means something very different across people with different income levels. - A percentage change in income, e.g., 5% of income, may mean somewhat more similar across people with different income levels. - We can also consider using a log scale to reduce a variance of residuals when a variable is heavily skewed. --- # Linear Regression with Log-transformtion - The log transformation makes the skewed distribution of income more normal. ```r ggplot(dtrain, aes( x = PINCP ) ) + geom_density() ggplot(dtrain, aes( x = log(PINCP) ) ) + geom_density() ``` --- # Linear Regression with Log-transformtion ### <p style="color:#00449E"> A Few Algebras for Logarithm and Exponential Functions - Rule 1: $$\texttt{y} \,=\, \texttt{log(x)}\qquad\Leftrightarrow\qquad \texttt{exp(y)} \,=\, \texttt{x}. $$ - Rule 2: `$$\texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{log}\,\left(\,\frac{\texttt{x}}{\texttt{z}}\,\right).$$` - By the rules above, $$ \texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{b}\qquad\Leftrightarrow\qquad \frac{\texttt{x}}{\texttt{z}} \,=\,\texttt{exp(b)}. 
---
# Linear Regression with Log-transformation

- Let's consider the following linear regression model:

`$$\begin{aligned}
\log(\texttt{PINCP[i]}) \;=\;\; &\texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \\
&+\, \texttt{b3*SCHL.no high school diploma[i]} \\
&+\, \texttt{b4*SCHL.GED or alternative credential[i]} \\
&+\, \texttt{b5*SCHL.some college credit, no degree[i]} \\
&+\, \texttt{b6*SCHL.Associate's degree[i]} \\
&+\, \texttt{b7*SCHL.Bachelor's degree[i]} \\
&+\, \texttt{b8*SCHL.Master's degree[i]} \\
&+\, \texttt{b9*SCHL.Professional degree[i]} \\
&+\, \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.
\end{aligned}$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- Suppose `\(\texttt{Ben}\)` (age 51) and `\(\texttt{Bob}\)` (age 50) have the same gender and level of education. If we apply the rules above to their predicted incomes,

`$$\begin{aligned}
\widehat{\texttt{log(PINCP[Ben])}} \,-\, \widehat{\texttt{log(PINCP[Bob])}} \;&=\; \hat{\texttt{b1}}\texttt{ * }(\texttt{AGEP[Ben]} - \texttt{AGEP[Bob]}) \\
&=\; \hat{\texttt{b1}}\texttt{ * (51 - 50)} \\
&=\; \hat{\texttt{b1}}.
\end{aligned}$$`

- So we have the following:

`$$\frac{\widehat{\texttt{PINCP[Ben]}}}{\widehat{\texttt{PINCP[Bob]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)}
\quad\Leftrightarrow\quad
\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Bob]}}\texttt{ * exp(}\hat{\texttt{b1}}\texttt{)}$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- Now suppose `\(\texttt{Ben}\)` and `\(\texttt{Linda}\)` are identical except that Ben is male and Linda is female. Applying the same rules to their predicted incomes,

`$$\frac{\widehat{\texttt{PINCP[Ben]}}}{\widehat{\texttt{PINCP[Linda]}}} \;=\; \texttt{exp(}\hat{\texttt{b2}}\texttt{)}
\quad\Leftrightarrow\quad
\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Linda]}}\texttt{ * exp(}\hat{\texttt{b2}}\texttt{)}$$`

- Suppose `\(\texttt{exp(}\hat{\texttt{b2}}\texttt{)} = 1.18\)`.
  - Then `\(\widehat{\texttt{PINCP[Ben]}}\)` is 1.18 times `\(\widehat{\texttt{PINCP[Linda]}}\)`.
  - That is, being male is associated with an 18% higher income relative to being female.

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `log(PINCP)` by `\(\hat{\texttt{b1}}\)`.

- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `PINCP` by `\((\texttt{exp(}\hat{\texttt{b1}}\texttt{)} - 1)\times 100\)` percent.

- All else being equal, being male is associated with an increase in `log(PINCP)` by `\(\hat{\texttt{b2}}\)` relative to being female.

- All else being equal, being male is associated with an increase in `PINCP` by `\((\texttt{exp(}\hat{\texttt{b2}}\texttt{)} - 1)\times 100\)` percent relative to being female.
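
---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- A minimal sketch of this interpretation in R, assuming a log-income model trained on `dtrain` (the object name `model_log` is ours):

```r
model_log <- lm( log(PINCP) ~ AGEP + SEX + SCHL,
                 data = dtrain )

b <- coef(model_log)

exp( b["AGEP"] )              # multiplicative change in PINCP per year of age
100 * ( exp(b["AGEP"]) - 1 )  # the same effect, as a percent change
```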
---
class: inverse, center, middle

# Linear Regression with Interaction Terms
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Motivation

- Does the relationship between education and income vary by gender?

- Suppose we are interested in knowing whether women are compensated unequally despite having the same levels of education and preparation as men.

- How can linear regression address the question above?

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Model

- The linear regression with an interaction between explanatory variables `\(X_{1}\)` and `\(X_{2}\)` is:

`$$Y_{i} \,=\, b_{0} \,+\, b_{1}\,X_{1,i} \,+\, b_{2}\,X_{2,i} \,+\, b_{3}\,X_{1,i}\times X_{2,i} \,+\, e_{i},$$`

- where
  - `\(i\)`: the `\(i\)`-th observation in the training data.frame, `\(i = 1, 2, 3, \cdots\)`;
  - `\(Y_{i}\)`: the `\(i\)`-th observation of the outcome variable `\(Y\)`;
  - `\(X_{p,i}\)`: the `\(i\)`-th observation of the `\(p\)`-th explanatory variable `\(X_{p}\)`;
  - `\(e_{i}\)`: the `\(i\)`-th observation of the statistical error variable.

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Model

- The linear regression with an interaction between explanatory variables `\(X_{1}\)` and `\(X_{2}\)` is:

`$$Y_{i} \,=\, b_{0} \,+\, b_{1}\,X_{1,i} \,+\, b_{2}\,X_{2,i} \,+\, b_{3}\,X_{1,i}\times X_{2,i} \,+\, e_{i}.$$`

- The relationship between `\(X_{1}\)` and `\(Y\)` is now described not only by `\(b_{1}\)` but also by `\(b_{3}\,X_{2}\)`:

`$$\frac{\Delta Y}{\Delta X_{1}} \,=\, b_{1} \,+\, b_{3}\,X_{2}.$$`

---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Motivation

- Is education related to income?

```r
model <- lm( log(PINCP) ~ AGEP + SCHL + SEX,
             data = dtrain )
```

- Does the relationship between education and income vary by gender? (A sketch for inspecting the interaction estimates follows on the next slide.)

```r
model_int <- lm( log(PINCP) ~ AGEP + SCHL + SEX + SCHL * SEX,
                 data = dtrain )

# Equivalently, since SCHL * SEX expands to SCHL + SEX + SCHL:SEX:
model_int <- lm( log(PINCP) ~ AGEP + SCHL * SEX,  # Use this one
                 data = dtrain )
```
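
---
# Linear Regression with Interaction Terms
### <p style="color:#00449E"> Inspecting the Interaction Estimates

- A minimal sketch for pulling out just the interaction estimates from `model_int` (with the tidyverse and broom loaded). The exact term names, such as `SCHL...:SEXMale`, depend on how R dummy-encodes the factors:

```r
library(broom)

tidy(model_int) %>%
  filter( str_detect(term, ":") )  # interaction terms contain ":"
```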
---
class: inverse, center, middle

# Log-Log Linear Regression
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- To estimate the price elasticity of orange juice (OJ), we will use sales data for OJ from Dominick's grocery stores in the 1990s.
  - Weekly `price` and `sales` (number of cartons sold) for three OJ brands: Tropicana, Minute Maid, and Dominick's.
  - An indicator, `feat`, showing whether each `brand` was advertised (in store or flyer) that week.

Variable | Description
---------|-------------------------------
`sales`  | Quantity of OJ cartons sold
`price`  | Price per carton of OJ
`brand`  | Brand of OJ
`feat`   | Indicator for whether the brand was advertised that week

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- Let's prepare the OJ data:

```r
oj <- read_csv('https://bcdanl.github.io/data/dominick_oj.csv')

# Split 70-30 into training and testing data.frames
set.seed(14454)
gp <- runif( nrow(oj) )
dtrain <- filter(oj, gp < 0.7)   # about 70% for training
dtest  <- filter(oj, gp >= 0.7)  # about 30% for testing
```

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- The following model estimates the price elasticity of demand for a carton of OJ:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}},$$`

- where

`$$\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,=\,
\begin{cases}
\texttt{1} & \text{if orange juice } \texttt{i} \text{ is Tropicana;}\\
\texttt{0} & \text{otherwise;}
\end{cases}$$`

`$$\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,=\,
\begin{cases}
\texttt{1} & \text{if orange juice } \texttt{i} \text{ is Minute Maid;}\\
\texttt{0} & \text{otherwise.}
\end{cases}$$`

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- The following model estimates the price elasticity of demand for a carton of OJ:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}$$`

- When `\(\texttt{brand}_{\,\texttt{tr}, \texttt{i}}\,=\,0\)` and `\(\texttt{brand}_{\,\texttt{mm}, \texttt{i}}\,=\,0\)`, the beta coefficient for the intercept, `\(b_{0}\)`, gives Dominick's log sales at a log price of zero.

- The beta coefficient `\(b_{1}\)` is the price elasticity of demand.
  - It measures how sensitive the quantity demanded is to price.

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- For small changes in variable `\(x\)` from `\(x_{0}\)` to `\(x_{1}\)`, the following approximation holds:

`$$\Delta \log(x) \,=\, \log(x_{1}) \,-\, \log(x_{0}) \,\approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.$$`

- The coefficient on `\(\log(\texttt{price}_{\texttt{i}})\)`, `\(b_{1}\)`, is therefore

`$$b_{1} \,=\, \frac{\Delta \log(\texttt{sales}_{\texttt{i}})}{\Delta \log(\texttt{price}_{\texttt{i}})} \,=\, \frac{\frac{\Delta\, \texttt{sales}_{\texttt{i}}}{\texttt{sales}_{\texttt{i}}}}{\frac{\Delta\, \texttt{price}_{\texttt{i}}}{\texttt{price}_{\texttt{i}}}},$$`

the percentage change in `\(\texttt{sales}\)` when `\(\texttt{price}\)` increases by 1%.

---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- Let's train the model:

`$$\log(\texttt{sales}_{\texttt{i}}) \,=\, b_{0} \,+\, b_{\,\texttt{tr}}\,\texttt{brand}_{\,\texttt{tr}, \texttt{i}} \,+\, b_{\,\texttt{mm}}\,\texttt{brand}_{\,\texttt{mm}, \texttt{i}} \,+\, b_{1}\,\log(\texttt{price}_{\texttt{i}}) \,+\, e_{\texttt{i}}$$`

```r
# brand is categorical; lm() dummy-encodes it, with Dominick's
# (the first factor level) as the baseline.
reg_1 <- lm( log(sales) ~ brand + log(price),
             data = dtrain )
```
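
---
# Log-Log Linear Regression
### <p style="color:#00449E"> Estimating Price Elasticity

- A minimal sketch for reading off the estimated elasticity and checking the trained model on the held-out testing data:

```r
coef(reg_1)["log(price)"]   # b1: % change in sales per 1% increase in price

dtest <- dtest %>%
  mutate( pred = predict(reg_1, newdata = dtest) )

ggplot( data = dtest,
        aes(x = pred, y = log(sales)) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, perfect prediction line
```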