Example
PINCP[i] is linearly related to each of the inputs AGEP[i] and SEX[i]:

PINCP[i] = f(AGEP[i], SEX[i]) + e[i] = b0 + b1*AGEP[i] + b2*SEX[i] + e[i]
Goals of Linear Regression
The goals of linear regression are ...
Find the estimated values of b0, b1, and b2: ^b0, ^b1, and ^b2.
Make a prediction of PINCP[i] for each person i: ^PINCP[i].

^PINCP[i] = ^b0 + ^b1*AGEP[i] + ^b2*SEX[i]
Assumptions on Linear Regression
Assumptions on the linear regression model are that ...
The outcome variable is a linear combination of the explanatory variables.
Errors have a mean value of 0.
Errors are uncorrelated with explanatory variables.
Beta estimates
Linear regression finds the beta coefficients (b[0],...,b[P]) such that ...
The linear function f(x[i, ]) is as near as possible to y[i] for all (x[i, ], y[i]) pairs in the data.
In other words, the estimator for the beta coefficients is chosen to minimize the sum of squares of the residual errors (SSR):
Residual_Error[i] = y[i] − ^y[i]
SSR = Residual_Error[1]² + ⋯ + Residual_Error[N]²
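For instance, here is a minimal sketch of this computation in R (the x and y values below are made up purely for illustration):

# lm() chooses the beta estimates that minimize the SSR
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

ssr <- sum( residuals(fit)^2 )    # sum of squared residual errors
all.equal( ssr, deviance(fit) )   # TRUE: deviance() reports the SSR for lm fits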
Evaluating Models
Training data: when building a model to make predictions or to identify relationships, we need data to build the model.
Testing data: We also need data to test whether the model works well on new data.

Example of Linear Regression using R
Let's use the 2016 US Census PUMS dataset.
Personal data recorded includes personal income and demographic variables:
PINCP: personal income
AGEP: age
SEX: sex

Splitting Data into Training and Testing Data
# Importing the cleaned small sample of data
library(tidyverse)   # provides filter(), select(), and ggplot() used below
psub <- readRDS( url('https://bcdanl.github.io/data/psub.RDS') )

# Making the random sampling reproducible by setting the random seed.
# set.seed() sets the starting number used to generate a sequence of random
# numbers. If we start with the same seed each time, we run the same random
# process, so we can replicate the same random numbers.
set.seed(3454351)   # 3454351 is just any number.

# How many random numbers do we need? One per row of psub:
gp <- runif( nrow(psub) )   # draws from a random variable that follows Unif(0, 1)

# Splits roughly 50-50 into training and test sets using filter() and gp.
# A vector can be used for CONDITION in filter(data.frame, CONDITION)
# if the length of the vector is the same as the number of rows of the data.frame.
dtrain <- filter(psub, gp >= .5)
dtest  <- filter(psub, gp < .5)
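As a quick sanity check (a sketch; the exact sizes vary with the seed), the two subsets should partition psub:

nrow(dtrain) + nrow(dtest) == nrow(psub)   # TRUE: every row lands in exactly one set
nrow(dtrain) / nrow(psub)                  # roughly 0.5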
Exploratory Data Analysis (EDA)

Use summary statistics and visualization to explore the data, particularly for the following variables:
PINCP: personal income
AGEP: age
SEX: sex

# install.packages("GGally")   # to use GGally::ggpairs()
library(GGally)
ggpairs( select(dtrain, PINCP, AGEP, SEX) )   # draws a correlogram / correlation matrix

Building a linear regression model using lm()
model <- lm(formula = PINCP ~ AGEP + SEX, data = dtrain)
In the above line of R commands, ...
model: R object that stores the estimation result of the linear regression
lm(): linear regression modeling function
PINCP ~ AGEP + SEX: formula for the linear regression
PINCP: outcome/dependent variable
AGEP, SEX: input/independent/explanatory variables
dtrain: data frame to use for training

Making predictions with a linear regression model using predict()
dtest$pred <- predict(model, newdata = dtest)
dtest$pred: adds a new column pred to the dtest data frame (mutate() also works)
predict(): function to get the predicted outcome using model and dtest
model: R object that stores the estimation result of the linear regression
dtest: data frame to use in prediction; we can make predictions with the dtrain data frame too

Summary of the regression result
summary(model) # This produces the output of the linear regression.

Getting Estimates of Beta Coefficients
coef() returns the beta estimates:

coef(model)
coef(model)['AGEP']

Indicator variables
lm() encodes a factor variable with m possible levels by converting it to m − 1 indicator variables; the remaining category, the first level of the factor variable, becomes the reference level.
The value of any indicator variable is either 0 or 1.
E.g., the indicator variable SEXFemale is defined as follows:

SEXFemale[i] = 1 if person i is female; 0 otherwise.
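A minimal sketch of inspecting this coding in R (model.matrix() builds the design matrix that lm() uses internally; the SEXFemale column name assumes Male is the first level of the factor, as stated below):

head( model.matrix(PINCP ~ AGEP + SEX, data = dtrain) )
# Shows an (Intercept) column, AGEP, and a 0/1 SEXFemale indicator column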
Male becomes the reference level when interpreting the beta estimate for SEXFemale.

Setting a reference level
To set the reference level of a factor variable, use relevel(VARIABLE, ref = "LEVEL"):

dtrain$SEX <- relevel(dtrain$SEX, ref = "Female")
model <- lm(PINCP ~ AGEP + SEX, data = dtrain)
summary(model)

The indicator variable is now SEXMale, defined as follows:

SEXMale[i] = 1 if person i is male; 0 otherwise.
The level Female is now the reference level.
Note: changing the reference level does not change the fit of the regression; only the presentation of the coefficient estimates changes.
Interpreting Estimated Coefficients
The model is ...
PINCP[i] = b0 + b1*AGEP[i] + b2*SEX.Male[i] + e[i]
All else being equal, ...

an increase in AGEP by one unit is associated with an increase in PINCP by b1;
an increase in SEX.Male by one unit is associated with an increase in PINCP by b2.
All else being equal, being a male relative to being a female is associated with an increase in PINCP by b2.
Interpreting Estimated Coefficients
Consider the predicted incomes of two males, Ben and Bob, whose ages are 51 and 50, respectively.

^PINCP[Ben] = ^b0 + ^b1*AGEP[Ben] + ^b2*SEX.Male[Ben]
^PINCP[Bob] = ^b0 + ^b1*AGEP[Bob] + ^b2*SEX.Male[Bob]

⇔ ^PINCP[Ben] − ^PINCP[Bob] = ^b1*(AGEP[Ben] − AGEP[Bob]) = ^b1*(51 − 50) = ^b1
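We can verify this numerically with predict() (a sketch; ben and bob are hypothetical rows, and the "Male"/"Female" labels assume those are the levels of SEX in psub):

ben <- data.frame(AGEP = 51, SEX = factor("Male", levels = levels(dtrain$SEX)))
bob <- data.frame(AGEP = 50, SEX = factor("Male", levels = levels(dtrain$SEX)))
predict(model, newdata = ben) - predict(model, newdata = bob)   # equals coef(model)['AGEP']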
Interpreting Estimated Coefficients
Consider the predicted incomes of two persons, Ben and Linda, who are both 50 years old. Ben is male and Linda is female.

^PINCP[Ben] = ^b0 + ^b1*AGEP[Ben] + ^b2*SEX.Male[Ben]
^PINCP[Linda] = ^b0 + ^b1*AGEP[Linda] + ^b2*SEX.Male[Linda]

⇔ ^PINCP[Ben] − ^PINCP[Linda] = ^b2*(SEX.Male[Ben] − SEX.Male[Linda]) = ^b2*(1 − 0) = ^b2
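The same check for the gender difference (again a sketch under the same assumptions about the factor labels):

ben   <- data.frame(AGEP = 50, SEX = factor("Male",   levels = levels(dtrain$SEX)))
linda <- data.frame(AGEP = 50, SEX = factor("Female", levels = levels(dtrain$SEX)))
predict(model, newdata = ben) - predict(model, newdata = linda)   # equals the estimate on the SEX indicator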
Interpreting Estimated Coefficients
What does it mean for a beta estimate ^b to be statistically significant at the 5% level?
It means that the null hypothesis H0: b = 0 is rejected at the 5% significance level.
"2 standard error rule" of thumb: the true value of b is 95% likely to be in the confidence interval (^b − 2*Std.Error, ^b + 2*Std.Error).
The standard error tells us how uncertain our estimate of the coefficient b is.
We should look for the stars in the summary() output!
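In R, a sketch of both ideas (confint() computes exact t-based intervals, which closely match the 2-standard-error rule of thumb):

confint(model, level = 0.95)   # 95% confidence intervals for the beta coefficients
summary(model)                 # the stars next to each estimate mark significance levels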
Interpreting Estimated Coefficients
Using the "2 standard error rule" of thumb, we could refine our earlier interpretation of beta estimates as follows:
All else being equal, an increase in AGEP by one unit is associated with an increase in PINCP by b1 ± 2*Std.Err.b1.
All else being equal, being a male relative to being a female is associated with an increase in PINCP by b2 ± 2*Std.Err.b2.
R-squared
R-squared is the fraction of y's variation that is explained by the independent variables.
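As a sketch, R-squared can be read off the training fit or computed by hand on the testing data (using the pred column created earlier):

summary(model)$r.squared   # R-squared on the training data

# R-squared on the testing data: 1 minus SSR over the total sum of squares
1 - sum( (dtest$PINCP - dtest$pred)^2 ) / sum( (dtest$PINCP - mean(dtest$PINCP))^2 )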
Visualizations to diagnose the quality of modeling results

The following two visualizations are useful for diagnosing the quality of the linear regression:
# Plot of actual outcomes vs. predicted outcomes
ggplot( data = dtest, aes(x = pred, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )   # y = x, the perfect prediction line

The following is the residual plot, where Residual[i] = y[i] − Predicted_y[i]:

ggplot( data = dtest, aes(x = pred, y = PINCP - pred) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes( yintercept = 0 ),           # perfect prediction
              color = "red", linetype = 2 ) +
  labs( x = 'Predicted PINCP', y = "Residual error" )

From the plot of actual vs. predicted outcomes and the plot of residuals, we should ask ourselves whether the predictions and the residual errors look random or systematic:
A well-behaved plot will bounce randomly and form a cloud roughly around the perfect prediction line.

An example of systematic errors in model predictions
Practical considerations in linear regression
Correlation does not imply causation:
Just because a coefficient is statistically significant does not mean that our explanatory variable causes the response in our outcome variable.
In order to test cause-and-effect relationships through regression, we would often need data from (quasi-)experiments to remove selection bias.
There is a difference between practical significance and statistical significance:
Whether an association between x and y is practically significant depends heavily on the unit of measurement.
E.g., we regressed income (measured in $) on height, and got a statistically significant beta estimate of 100 with a standard error of 20.
Q. Is 100 a large effect? It depends on the unit of measurement: if height is measured in centimeters, for instance, the estimate says one extra centimeter is associated with $100 more income; whether that is large is a practical judgment, not a statistical one.
More Explanatory Variables in the Model
In the 2016 US Census PUMS dataset, personal data recorded includes occupation, level of education, personal income, and many other demographic variables:
COW: class of worker
SCHL: level of education

More Explanatory Variables in the Model
Suppose we also want to assess how personal income (PINCP) varies with (1) age (AGEP), (2) gender (SEX), and (3) level of education (SCHL).
Using the training data, we fit a linear regression model of PINCP.
We then make predictions on PINCP using the testing data.

The model equation
PINCP[i] = b0 + b1*AGEP[i] + b2*SEX.Male[i]
         + b3*SCHL.no high school diploma[i]
         + b4*SCHL.GED or alternative credential[i]
         + b5*SCHL.some college credit, no degree[i]
         + b6*SCHL.Associate's degree[i]
         + b7*SCHL.Bachelor's degree[i]
         + b8*SCHL.Master's degree[i]
         + b9*SCHL.Professional degree[i]
         + b10*SCHL.Doctorate degree[i]
         + e[i]
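A minimal sketch of estimating this model in R (lm() expands the factor SCHL into the indicator variables above on its own; the omitted SCHL category serves as the reference level, and model2 and pred2 are new object names chosen here):

model2 <- lm(PINCP ~ AGEP + SEX + SCHL, data = dtrain)
summary(model2)
dtest$pred2 <- predict(model2, newdata = dtest)   # predictions on the testing data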
A Little Bit of Math for Logarithms

log_e(x): the base-e logarithm, called the natural log, where e = 2.718⋯ is the mathematical constant known as Euler's number.
log(x) or ln(x): notations for the natural log of x.
log_e(7.389⋯): the natural log of 7.389⋯ is 2, because e² = 7.389⋯.
We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
Δlog(x) = log(x1) − log(x0) ≈ (x1 − x0)/x0 = Δx/x0.
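For example, a quick check of the approximation in R:

log(105) - log(100)   # 0.0488..., close to (105 - 100)/100 = 0.05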
A difference in income of $5,000 means something very different across people with different income levels.
ggplot(dtrain, aes( x = PINCP ) ) +
  geom_density()

ggplot(dtrain, aes( x = log(PINCP) ) ) +
  geom_density()

With the log-transformed outcome, the model equation becomes:

log(PINCP[i]) = b0 + b1*AGEP[i] + b2*SEX.Male[i]
              + b3*SCHL.no high school diploma[i]
              + b4*SCHL.GED or alternative credential[i]
              + b5*SCHL.some college credit, no degree[i]
              + b6*SCHL.Associate's degree[i]
              + b7*SCHL.Bachelor's degree[i]
              + b8*SCHL.Master's degree[i]
              + b9*SCHL.Professional degree[i]
              + b10*SCHL.Doctorate degree[i]
              + e[i]
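A minimal sketch of estimating the log-outcome model in R (log() is the natural log; this assumes PINCP is positive in psub, since log() is undefined otherwise, and model_log is a new object name chosen here):

model_log <- lm(log(PINCP) ~ AGEP + SEX + SCHL, data = dtrain)
summary(model_log)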
A Little Algebra for Logarithmic and Exponential Functions
log(x) − log(z) = log(x/z), and exp(log(x)) = x; together these give exp( log(x) − log(z) ) = x/z.
Interpreting Beta Estimates
For the two males Ben (age 51) and Bob (age 50) from before, the log-outcome model gives:

^log(PINCP[Ben]) − ^log(PINCP[Bob]) = ^b1*(AGEP[Ben] − AGEP[Bob]) = ^b1*(51 − 50) = ^b1

So we can have the following:

⇔ ^PINCP[Ben] / ^PINCP[Bob] = exp(^b1)
⇔ ^PINCP[Ben] = ^PINCP[Bob] * exp(^b1)
Interpreting Beta Estimates
For Ben and Linda from before (both age 50; Ben is male, Linda female), the same steps give:

^PINCP[Ben] / ^PINCP[Linda] = exp(^b2)
⇔ ^PINCP[Ben] = ^PINCP[Linda] * exp(^b2)
Suppose exp(^b2)=1.18.
Then ^PINCP[Ben] is 1.18 times ^PINCP[Linda].
It means that being a male is associated with an 18% higher income relative to being a female.
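In R, a sketch of recovering this multiplicative effect (the coefficient name SEXMale assumes Female was set as the reference level earlier; model_log is the hypothetical log-outcome fit sketched above):

exp( coef(model_log)['SEXMale'] )   # e.g., 1.18 means incomes 18% higher for males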
Interpreting Beta Estimates
All else being equal, an increase in AGEP by one unit is associated with an increase in log(PINCP) by b1.
All else being equal, an increase in AGEP by one unit is associated with an increase in PINCP by (exp(b1) − 1)×100%.
All else being equal, an increase in SEXMale by one unit is associated with an increase in log(PINCP) by b2.
All else being equal, being a male is associated with an increase in PINCP by (exp(b2) − 1)×100% relative to being a female.