Neuberger Berman
New York, NY June 2022 – Present
Columbia University
New York, NY October 2018 – June 2022
Columbia University
New York, NY Expected December 2023
SUNY Geneseo
Geneseo, NY August 2012 - December 2015
Job Opportunities
More Explanatory Variables in the Model
In the 2016 US Census PUMS dataset, recorded personal data include occupation, level of education, personal income, and many other demographic variables:
PINCP: personal income
AGEP: age
SEX: gender
SCHL: level of education

More Explanatory Variables in the Model
Suppose we also want to assess how personal income (PINCP) varies with (1) age (AGEP), (2) gender (SEX), and (3) level of education (SCHL).
We fit models that predict PINCP using the training data, and then evaluate the predictions of PINCP using the testing data.

R commands to do EDA and linear regression analysis
library(tidyverse)
psub <- readRDS( url('https://bcdanl.github.io/data/psub.RDS') )

set.seed(54321)
gp <- runif( nrow(psub) )
# Set up factor variables if needed.
dtrain <- filter(psub, gp >= .5)
dtest <- filter(psub, gp < .5)

library(skimr)
sum_dtrain <- skim( select(dtrain, PINCP, AGEP, SEX, SCHL) )

library(GGally)
ggpairs( select(dtrain, PINCP, AGEP, SEX, SCHL) )
# MORE VISUALIZATIONS ARE RECOMMENDED

model_1 <- lm( PINCP ~ AGEP + SEX, data = dtrain )
model_2 <- lm( PINCP ~ AGEP + SEX + SCHL, data = dtrain )
summary(model_1)
summary(model_2)
coef(model_1)
coef(model_2)

# Using the model.matrix() function on our linear model object,
# we can get the data matrix that underlies our regression.
df_model_1 <- as_tibble( model.matrix(model_1) )
df_model_2 <- as_tibble( model.matrix(model_2) )

# install.packages(c("stargazer", "broom"))
library(stargazer)
library(broom)
stargazer(model_1, model_2, type = 'text')  # from the stargazer package
sum_model_2 <- tidy(model_2)                # from the broom package

# Consider filter() to keep statistically significant beta estimates
ggplot(sum_model_2) +
  geom_pointrange( aes(x = term, y = estimate,
                       ymin = estimate - 2*std.error,
                       ymax = estimate + 2*std.error) ) +
  coord_flip()

dtest <- dtest %>%
  mutate( pred_1 = predict(model_1, newdata = dtest),
          pred_2 = predict(model_2, newdata = dtest) )

ggplot( data = dtest, aes(x = pred_2, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, perfect prediction line

ggplot( data = dtest, aes(x = pred_2, y = PINCP - pred_2) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes(yintercept = 0),  # perfect prediction
              color = "red", linetype = 2 )

The model equation
PINCP[i] = b0 + b1*AGEP[i] + b2*SEX.Male[i] + b3*SCHL.no high school diploma[i] + b4*SCHL.GED or alternative credential[i] + b5*SCHL.some college credit, no degree[i] + b6*SCHL.Associate's degree[i] + b7*SCHL.Bachelor's degree[i] + b8*SCHL.Master's degree[i] + b9*SCHL.Professional degree[i] + b10*SCHL.Doctorate degree[i] + e[i].
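The SEX and SCHL terms in the equation are dummy variables that R creates automatically from factor variables. A minimal sketch of this expansion using model.matrix() on toy data (not the PUMS sample):

```r
# Toy data (not the PUMS sample) showing how a factor becomes dummy columns
df <- data.frame(SEX = factor(c("Female", "Male", "Male")))
m  <- model.matrix(~ SEX, data = df)
m
# The "SEXMale" column is 1 for males and 0 for females;
# "Female" is the omitted baseline level, absorbed into the intercept.
```

This is why the model equation above has no SEX.Female or SCHL.high school diploma term: one level of each factor serves as the baseline.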
A Little Bit of Math for Logarithm

log_e(x): the base-e logarithm is called the natural log, where e = 2.718⋯ is the mathematical constant known as Euler's number.
log(x) or ln(x): the natural log of x .
log_e(7.389⋯): the natural log of 7.389⋯ is 2, because e^2 = 7.389⋯.
We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
Δlog(x) = log(x1) − log(x0) ≈ (x1 − x0)/x0 = Δx/x0.
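A quick numerical check of this approximation, with values chosen purely for illustration:

```r
# Percent-change approximation: Δlog(x) ≈ Δx / x0 for small changes
x0 <- 100
x1 <- 110
exact_change  <- log(x1) - log(x0)  # about 0.0953
approx_change <- (x1 - x0) / x0     # 0.10
# The two are close for a 10% change, and closer still for smaller changes.
```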
ggplot(dtrain, aes( x = PINCP ) ) + geom_density()
ggplot(dtrain, aes( x = log(PINCP) ) ) + geom_density()

A Few Algebraic Rules for Logarithm and Exponential Functions
log(x) − log(z) = log(x/z).
Interpreting Beta Estimates
Suppose Ben and Bob have the same gender and level of education, but Ben is 51 years old and Bob is 50. Then, from the log-scale model:
^log(PINCP[Ben]) − ^log(PINCP[Bob]) = ^b1 * (AGEP[Ben] − AGEP[Bob]) = ^b1 * (51 − 50) = ^b1
So we can have the following: ^PINCP[Ben] / ^PINCP[Bob] = exp(^b1) ⇔ ^PINCP[Ben] = ^PINCP[Bob] * exp(^b1)
Interpreting Beta Estimates
Now suppose Ben and Linda have the same age and level of education, but Ben is male and Linda is female. Then:
^PINCP[Ben] / ^PINCP[Linda] = exp(^b2) ⇔ ^PINCP[Ben] = ^PINCP[Linda] * exp(^b2)
Suppose exp(^b2)=1.18.
Then ^PINCP[Ben] is 1.18 times ^PINCP[Linda].
This means that, all else being equal, being male is associated with 18% higher income relative to being female.
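The arithmetic behind the 18% figure, as a quick check (exp(^b2) = 1.18 is the assumed value from above, not an estimate from the data):

```r
# If exp(b2_hat) = 1.18, then b2_hat = log(1.18), and the implied
# percentage difference in income is (exp(b2_hat) - 1) * 100 = 18%
b2_hat  <- log(1.18)          # about 0.166 on the log(PINCP) scale
pct_dif <- (exp(b2_hat) - 1) * 100
```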
Interpreting Beta Estimates
All else being equal, an increase in AGEP by one unit is associated with an increase in log(PINCP) by ^b1.
Equivalently, an increase in AGEP by one unit is associated with an increase in PINCP by (exp(^b1) − 1) × 100%.
All else being equal, being a male is associated with an increase in log(PINCP) by ^b2 relative to being a female.
Equivalently, being a male is associated with an increase in PINCP by (exp(^b2) − 1) × 100% relative to being a female.

Motivation
Does the relationship between education and income vary by gender?
Suppose we are interested in whether women are compensated unequally despite having the same levels of education and preparation as men.
How can linear regression address the question above?
Model
Yi = b0 + b1·X1,i + b2·X2,i + b3·X1,i×X2,i + ei,
Model
Yi = b0 + b1·X1,i + b2·X2,i + b3·X1,i×X2,i + ei

ΔY/ΔX1 = b1 + b3·X2
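With a binary X2, the interaction simply gives X1 two different slopes: b1 for the X2 = 0 group and b1 + b3 for the X2 = 1 group. A simulated sketch (the data-generating process and coefficient values are assumed for illustration, not taken from the PUMS data):

```r
# Simulate data where the slope of X1 is 0.08 when X2 = 0
# and 0.08 + (-0.02) = 0.06 when X2 = 1
set.seed(1)
n  <- 1000
X1 <- runif(n, 0, 10)
X2 <- rbinom(n, 1, 0.5)
Y  <- 1 + 0.08 * X1 + 0.5 * X2 - 0.02 * X1 * X2 + rnorm(n, sd = 0.1)

fit <- lm(Y ~ X1 + X2 + X1:X2)
coef(fit)
# The X1 coefficient recovers roughly 0.08 and the X1:X2 coefficient
# roughly -0.02, so the X1 slope in the X2 = 1 group is about 0.06.
```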
Motivation
model <- lm( log(PINCP) ~ AGEP + SCHL + SEX, data = dtrain )
model_int <- lm( log(PINCP) ~ AGEP + SCHL + SEX + SCHL * SEX, data = dtrain )

# Equivalently,
model_int <- lm( log(PINCP) ~ AGEP + SCHL * SEX,  # Use this one
                 data = dtrain )

Estimating Price Elasticity
We have data on price and sales (in number of cartons "sold") for three OJ brands (Tropicana, Minute Maid, and Dominick's), plus feat, showing whether each brand was advertised (in store or flyer) that week.

| Variable | Description |
|---|---|
| sales | Quantity of OJ cartons sold |
| price | Price of OJ |
| brand | Brand of OJ |
| feat | Advertisement status |
Estimating Price Elasticity
oj <- read_csv('https://bcdanl.github.io/data/dominick_oj.csv')

# Split 70-30 into training and testing data.frames
set.seed(14454)
gp <- runif( nrow(oj) )
dtrain <- filter(oj, [?])
dtest <- filter(oj, [?])

Estimating Price Elasticity
log(salesi) = b0 + btr·brandtr,i + bmm·brandmm,i + b1·log(pricei) + ei
brandtr,i = 1 if orange juice i is Tropicana; 0 otherwise.
brandmm,i = 1 if orange juice i is Minute Maid; 0 otherwise.
Estimating Price Elasticity
log(salesi) = b0 + btr·brandtr,i + bmm·brandmm,i + b1·log(pricei) + ei
When brandtr,i = 0 and brandmm,i = 0, the intercept b0 gives the value of Dominick's log sales at a log price of zero (i.e., at a price of 1).
The beta coefficient b1 is the price elasticity of demand.
Estimating Price Elasticity
Δlog(x) = log(x1) − log(x0) ≈ (x1 − x0)/x0 = Δx/x0.
b1 = Δlog(salesi) / Δlog(pricei) = (Δsalesi/salesi) / (Δpricei/pricei),
the percentage change in sales when price increases by 1%.
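A numerical illustration of what an elasticity estimate implies (b1 = -2 is an assumed value for illustration, not the estimate from the Dominick's data):

```r
# Under the log-log model, sales1/sales0 = (price1/price0)^b1 exactly
b1 <- -2                        # assumed elasticity, for illustration only
price_ratio <- 1.01             # a 1% price increase
sales_ratio <- price_ratio^b1
pct_change_sales <- (sales_ratio - 1) * 100
# About -1.97, i.e., roughly a 2% drop in sales for a 1% price increase,
# matching the approximation b1 ≈ %Δsales / %Δprice
```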
Estimating Price Elasticity
log(salesi)=b0+btrbrandtr,i+bmmbrandmm,i+b1log(pricei)+ei
reg_1 <- lm([?], data = dtrain)