class: title-slide, left, bottom

# Lecture 24

----

## **DANL 200: Introduction to Data Analytics**

### Byeong-Hak Choe

### November 29, 2022

---
# Announcement
### <p style="color:#00449E"> Geneseo Alumna's Talk on Data Analytics Career

- Name: Lauren Kopac (Class of 2015)
- When: December 1, 2022, Thursday 11:00 A.M.
- Where: Zoom (I will leave the Zoom link on Canvas soon.)
- We will watch her Zoom recording next Tuesday.
- She will give us a talk on careers in data analytics, drawing on her own experience.
- As a data analyst, she has worked at (1) Columbia University, New York, NY and (2) Neuberger Berman, New York, NY.

---
# Tips for Using Presentation Slides

- To go to the previous/next page, use the keyboard arrow keys, **←** and **→**.
- To see a tile view of the lecture slides, press the letter key `o`.
- If you hover the mouse cursor over a code block in a lecture slide, you can see and click *"Copy Code"* in the top-right corner of the code block.
  - If you click *"Copy Code"*, the code in the block is copied, so that you can paste it into an R script.
- If the presentation slides do not respond, refresh the web page of the slides with the shortcut **Ctrl + R** (or **Cmd + R** for Mac users).

---
class: inverse, center, middle

# Modeling Methods - Linear Regression
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression
### <p style="color:#00449E"> Example

- Suppose we also want to estimate how gender, in addition to age, affects personal income.

- Linear regression assumes that ...
  - The outcome `PINCP[i]` is linearly related to each of the inputs `AGEP[i]` and `SEX[i]`:

`$$\texttt{PINCP[i]} \;=\quad \texttt{f(AGEP[i], SEX[i])} \,+\, \texttt{e[i]} \qquad\qquad\qquad\qquad\\ \;=\quad \texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX[i]}\,+\, \texttt{e[i]}$$`

- The variable on the left-hand side is called the outcome variable or the dependent variable.

- The variables on the right-hand side are called explanatory variables, independent variables, or input variables.

- The coefficients `\(\texttt{b[1]}, ... , \texttt{b[P]}\;\)` on the right-hand side are called beta coefficients.

---
# Linear Regression
### <p style="color:#00449E"> Goals of Linear Regression

- The goals of linear regression are ...

1. Find the estimated values of `b1` and `b2`: `\(\quad \hat{\texttt{b1}}\)` and `\(\hat{\texttt{b2}}\)`.

2. Make a prediction of `PINCP[i]` for each person `i`: `\(\quad \widehat{\texttt{PINCP}}\texttt{[i]}\)`.

`$$\widehat{\texttt{PINCP}}\texttt{[i]} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{*AGEP[i]} \,+\, \hat{\texttt{b2}}\texttt{*SEX[i]}$$`

- We will use the hat notation `\((\,\hat{\texttt{ }}\,)\)` to distinguish *estimated* beta coefficients and *predicted* outcomes from the *true* values of the beta coefficients and the *true* values of the outcome variable, respectively.

---
# Linear Regression
### <p style="color:#00449E"> Assumptions of Linear Regression

- The assumptions of the linear regression model are that ...
  - The outcome variable is a linear combination of the explanatory variables.
  - Errors have a mean value of 0.
  - Errors are *uncorrelated* with the explanatory variables.

---
# Linear Regression
### <p style="color:#00449E"> Beta Estimates

- Linear regression finds the beta coefficients `\(( \texttt{b[0]}, ... , \texttt{b[P]} )\)` such that ...
  - The linear function `\(\texttt{f(x[i, ])}\)` is as near as possible to `\(\texttt{y[i]}\)` for all `\(\texttt{(x[i, ], y[i])}\)` pairs in the data.

- In other words, the estimator of the beta coefficients is chosen to minimize the sum of squared *residual errors* (SSR):
  - `\(\texttt{Residual_Error[i] = y[i] - } \hat{\texttt{y}}\,\texttt{[i]}\)`.
  - `\(\texttt{SSR} = \texttt{Residual_Error[1]}^{2} + \cdots + \texttt{Residual_Error[N]}^{2}\)`.

---
# Linear Regression
### <p style="color:#00449E"> Evaluating Models

- **Training data**: When we're building a model to make predictions or to identify relationships, we need *data* to build the model.

- **Testing data**: We also need data to test whether the model works well on *new data*.

.pull-left[
- So, we split the data into training and test sets when building a linear regression model.
]

.pull-right[
<img src="../lec_figs/pds_fig4_12.png" width="100%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Linear Regression using **R**
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression
### <p style="color:#00449E"> Example of Linear Regression using **R**

- Let's use the 2016 US Census PUMS dataset.
  - Full-time employees between 20 and 50 years of age with income between $1,000 and $250,000.
  - Personal data recorded includes personal income and demographic variables:
      - `PINCP`: personal income
      - `AGEP`: age
      - `SEX`: sex

---
# Linear Regression
### <p style="color:#00449E"> Splitting Data into Training and Testing Data

.panelset[

.panel[.panel-name[Step 1. set.seed()]
```r
# Importing the cleaned small sample of data
psub <- readRDS( url('https://bcdanl.github.io/data/psub.RDS') )

# Making the random sampling reproducible by setting the random seed.
set.seed(3454351)  # 3454351 is just any number.

# The set.seed() function sets the starting number
# used to generate a sequence of random numbers.
# With set.seed(), we can replicate the random number generation:
# if we start with the same seed number in set.seed() each time,
# we run the same random process,
# so that we can replicate the same random numbers.
```
]

.panel[.panel-name[Step 2. runif()]
```r
# Generates nrow(psub) numbers
# from a random variable that follows Unif(0, 1).
gp <- runif( nrow(psub) )

# Splits 50-50 into training and test sets
# using filter() and gp.
dtrain <- filter(psub, gp >= .5)
dtest <- filter(psub, gp < .5)

# A logical vector can be used as CONDITION in filter(data.frame, CONDITION)
# if the length of the vector equals the number of rows of the data.frame.
```
]

]

---
# Linear Regression
### <p style="color:#00449E"> Exploratory Data Analysis (EDA)

- Use summary statistics and visualization to explore the data, particularly for the following variables:
  - `PINCP`: personal income
  - `AGEP`: age
  - `SEX`: sex

- It is often a good idea to get some sense of how the data behaves through EDA before doing any statistical analysis.

```r
# install.packages("GGally")   # to use GGally::ggpairs()
library(GGally)
ggpairs( select(dtrain, PINCP, AGEP, SEX) )  # for a correlogram, or correlation matrix
```

---
# Linear Regression
### <p style="color:#00449E"> Building a linear regression model using `lm()`

```r
model <- lm(formula = PINCP ~ AGEP + SEX, data = dtrain)
```

In the above line of R commands, ...

- `model`: R object to save the estimation result of the linear regression
- `lm()`: Linear regression modeling function
- `PINCP ~ AGEP + SEX`: Formula for the linear regression
  - `PINCP`: Outcome/Dependent variable
  - `AGEP, SEX`: Input/Independent/Explanatory variables
- `dtrain`: Data frame to use for training

---
# Linear Regression using **R**
### <p style="color:#00449E"> Making predictions with a linear regression model using `predict()`

```r
dtest$pred <- predict(model, newdata = dtest)
```

- In the above line of R commands, ...
  - `dtest$pred`: Adding a new column `pred` to the `dtest` data frame. `mutate()` also works.
  - `predict()`: Function to get the predicted outcomes using `model` and `dtest`
      - `model`: R object that saves the estimation result of the linear regression
      - `dtest`: Data frame to use in prediction

- We can make predictions using the `dtrain` data frame, too.
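
- To check how well the model predicts on both data frames, a quick measure is the root mean squared error (RMSE). The sketch below is a minimal example, assuming the `model`, `dtrain`, and `dtest` objects defined above.

```r
# RMSE of the predictions: smaller is better, and similar values
# on the training and test data suggest the model generalizes well.
dtrain$pred <- predict(model, newdata = dtrain)
rmse_train <- sqrt( mean( (dtrain$PINCP - dtrain$pred)^2 ) )
rmse_test  <- sqrt( mean( (dtest$PINCP - dtest$pred)^2 ) )
```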
---
# Linear Regression using **R**
### <p style="color:#00449E"> Summary of the regression result

```r
summary(model)  # This produces the output of the linear regression.
```

<img src="../lec_figs/pds_fig7_10a.png" width="60%" style="display: block; margin: auto;" />

---
# Linear Regression using **R**
### <p style="color:#00449E"> Getting Estimates of Beta Coefficients

- `coef()` returns the beta estimates:

```r
coef(model)
coef(model)['AGEP']
```

---
# Linear Regression using **R**
### <p style="color:#00449E"> Indicator variables

- Linear regression handles a factor variable with `m` possible levels by converting it to `m-1` indicator variables; the remaining category, the first level of the factor variable, becomes the reference level.

- The value of any indicator variable is either 0 or 1.

- E.g., the indicator variable, `SEXFemale`, is as follows:

$$
\texttt{SEXFemale[i] }\\
= \begin{cases}
\texttt{1} & \text{if a person } \texttt{i} \text{ is } \texttt{female};\\\\
\texttt{0} & \text{otherwise}.\qquad\qquad\quad\,
\end{cases}
$$

- The level `Male` becomes the reference level when interpreting the beta estimate for `SEXFemale`.

---
# Linear Regression using **R**
### <p style="color:#00449E"> Setting a reference level

- If the independent variables include factor variables, we can set a reference level for each factor variable using `relevel(VARIABLE, ref = "LEVEL")`.

.panelset[

.panel[.panel-name[code]
```r
dtrain$SEX <- relevel(dtrain$SEX, ref = "Female")
model <- lm(PINCP ~ AGEP + SEX, data = dtrain)
summary(model)
```
]

.panel[.panel-name[variable]
- E.g., the indicator variable, `SEXMale`, is as follows:

$$
\texttt{SEXMale[i] }\\
= \begin{cases}
\texttt{1} & \text{if a person } \texttt{i} \text{ is } \texttt{male};\\\\
\texttt{0} & \text{otherwise}.\qquad\qquad\quad
\end{cases}
$$

- The level `Female` now becomes the reference level.

- Note: Changing the reference level does not change the model's fit or predictions; only the interpretation of the beta estimates changes.
]

]

---
# Linear Regression using **R**
### <p style="color:#00449E"> Interpreting Estimated Coefficients

The model is ...

`$$\texttt{PINCP[i]} \;=\quad \texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\,\texttt{b2*SEX.Male[i]}\,+\, \texttt{e[i]}$$`

All else being equal, ...

.panelset[

.panel[.panel-name[`AGEP`]
- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `PINCP` by `b1`.
]

.panel[.panel-name[`SEX.Male`]
- All else being equal, an increase in `SEX.Male` by one unit is associated with an increase in `PINCP` by `b2`.

- All else being equal, being a male relative to being a female is associated with an increase in `PINCP` by `b2`.
]

]
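
- A minimal numerical sketch of this interpretation, assuming the `model` and `dtrain` objects from earlier; the two persons below are hypothetical:

```r
# Two hypothetical male persons who differ only in age by one year.
new_people <- data.frame( AGEP = c(51, 50),
                          SEX  = factor(c("Male", "Male"),
                                        levels = levels(dtrain$SEX)) )
preds <- predict(model, newdata = new_people)
preds[1] - preds[2]   # equals the beta estimate on AGEP, coef(model)['AGEP']
```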
---
# Linear Regression using **R**
### <p style="color:#00449E"> Interpreting Estimated Coefficients

Consider the predicted incomes of two male persons, `Ben` and `Bob`, whose ages are 51 and 50, respectively.

`$$\widehat{\texttt{PINCP[Ben]}} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{ * AGEP[Ben]} \,+\, \hat{\texttt{b2}}\texttt{ * SEX.Male[Ben]}\\ \widehat{\texttt{PINCP[Bob]}} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{ * AGEP[Bob]} \,+\, \hat{\texttt{b2}}\texttt{ * SEX.Male[Bob]}$$`

`$$\Leftrightarrow\qquad\widehat{\texttt{PINCP[Ben]}} \,-\, \widehat{\texttt{PINCP[Bob]}}\qquad \\ \;=\quad \hat{\texttt{b1}}\texttt{ * }(\texttt{AGEP[Ben]} - \texttt{AGEP[Bob]})\\ \;=\quad \hat{\texttt{b1}}\texttt{ * }\texttt{(51 - 50)}\qquad\qquad\quad\;\;\\ \;=\quad \hat{\texttt{b1}}\qquad\qquad\qquad\qquad\quad\;\;\;\,$$`

---
# Linear Regression using **R**
### <p style="color:#00449E"> Interpreting Estimated Coefficients

Consider the predicted incomes of two persons, `Ben` and `Linda`, whose ages are both 50. `Ben` is `male` and `Linda` is `female`.

`$$\widehat{\texttt{PINCP[Ben]}} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{ * AGEP[Ben]} \,+\, \hat{\texttt{b2}}\texttt{ * SEX.Male[Ben]}\;\;\,\\ \widehat{\texttt{PINCP[Linda]}} \;=\quad \hat{\texttt{b0}} \,+\, \hat{\texttt{b1}}\texttt{ * AGEP[Linda]} \,+\, \hat{\texttt{b2}}\texttt{ * SEX.Male[Linda]}$$`

`$$\Leftrightarrow\qquad\widehat{\texttt{PINCP[Ben]}} \,-\, \widehat{\texttt{PINCP[Linda]}}\qquad\qquad\qquad \\ \;=\quad \hat{\texttt{b2}}\texttt{ * }(\texttt{SEX.Male[Ben]} - \texttt{SEX.Male[Linda]})\\ \;=\quad \hat{\texttt{b2}}\texttt{ * }\texttt{(1 - 0)}\qquad\qquad\quad\qquad\qquad\quad\;\;\\ \;=\quad \hat{\texttt{b2}}\qquad\qquad\qquad\qquad\quad\qquad\qquad\quad\;$$`

---
# Linear Regression using **R**
### <p style="color:#00449E"> Interpreting Estimated Coefficients

- What does it mean for a beta estimate `\(\hat{\texttt{b}}\)` to be statistically significant at the 5% level?
  - It means that the null hypothesis `\(H_{0}: \texttt{b} = 0\)` is rejected at the 5% significance level.

- "2 standard error rule" of thumb: the confidence interval `\((\, \hat{\texttt{b}} - 2 * \texttt{Std. Error}\;,\; \hat{\texttt{b}} + 2 * \texttt{Std. Error} \,)\)` covers the true value of `\(\texttt{b}\)` with roughly 95% confidence.
  - The standard error tells us how uncertain our estimate of the coefficient `b` is.

- We should look for the stars!

---
# Linear Regression using **R**
### <p style="color:#00449E"> Interpreting Estimated Coefficients

- Using the "2 standard error rule" of thumb, we can refine our earlier interpretation of the beta estimates as follows:

  - All else being equal, an increase in `AGEP` by one unit is associated with an increase in `PINCP` by `b1` `\(\pm\)` `2*Std.Err.b1`.

  - All else being equal, being a male relative to being a female is associated with an increase in `PINCP` by `b2` `\(\pm\)` `2*Std.Err.b2`.

---
# Linear Regression using **R**
### <p style="color:#00449E"> **R-squared**

- **R-squared** is a measure of how well the model "fits" the data, or its "goodness of fit."

- **R-squared** can be thought of as *the fraction of the `y`'s variation that is explained by the independent variables*.

- **R-squared** will be higher for models with more explanatory variables, regardless of whether the additional explanatory variables actually improve the model or not.

- We want **R-squared** to be *fairly* large, and the **R-squareds** on the training and testing data to be similar.

- The adjusted **R-squared** is the multiple **R-squared** penalized for the number of input variables.
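
- As a quick check in R (a minimal sketch, assuming the `model` and `dtest` objects from earlier):

```r
# Approximate 95% confidence intervals for the beta estimates,
# in the spirit of the "2 standard error rule" of thumb.
confint(model, level = 0.95)

# R-squared on the test data, computed by hand:
# 1 - (sum of squared residuals) / (total sum of squares).
1 - sum( (dtest$PINCP - dtest$pred)^2 ) /
    sum( (dtest$PINCP - mean(dtest$PINCP))^2 )
```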
---
# Linear Regression using **R**
### <p style="color:#00449E"> Visualizations to diagnose the quality of modeling results

.panelset[

.panel[.panel-name[Quality?]
- The following two visualizations are useful for diagnosing the quality of a linear regression:
  1. Actual vs. predicted outcome plot;
  2. Residual plot.
]

.panel[.panel-name[Actual vs. Predicted]
- The following is **the actual vs. predicted outcome plot**.

```r
ggplot( data = dtest,
        aes(x = pred, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, the perfect prediction line
```
]

.panel[.panel-name[Residuals]
- The following is **the residual plot**.
  - `Residual[i] = y[i] - Predicted_y[i]`.

```r
ggplot(data = dtest,
       aes(x = pred, y = PINCP - pred)) +
  geom_point(alpha = 0.2, color = "darkgray") +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes( yintercept = 0 ),  # perfect prediction
              color = "red", linetype = 2) +
  labs(x = 'Predicted PINCP', y = "Residual error")
```
]

.panel[.panel-name[Q & A]
- From the plot of actual vs. predicted outcomes and the plot of residuals, we should ask ourselves the following two questions:
  - On average, are the predictions correct?
  - Are there systematic errors?

- A well-behaved plot will bounce *randomly* and form a cloud roughly around the perfect prediction line.
]

.panel[.panel-name[Sys. Errors]
.pull-left[
<img src="../lec_figs/pds_fig7_8.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="../lec_figs/residual-hetero.png" alt="An example of systematic errors in model predictions" width="90%" />
<p class="caption">An example of systematic errors in model predictions</p>
</div>
]
]

]

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- In the 2016 US Census PUMS dataset, personal data recorded includes occupation, level of education, personal income, and many other demographic variables:
  - `SCHL`: level of education

---
# Linear Regression using **R**
### <p style="color:#00449E"> More Explanatory Variables in the Model

- Suppose we also want to assess how personal income (`PINCP`) varies with (1) age (`AGEP`), (2) gender (`SEX`), and (3) level of education (`SCHL`).

  1. Conduct the exploratory data analysis.
  2. Based on the visualization, set a hypothesis regarding the relationship between having a bachelor's degree and `PINCP`.
  3. Train the linear regression model.
  4. Interpret the beta coefficients from the linear regression result.
  5. Calculate the predicted `PINCP` using the testing data.
  6. Draw the actual vs. predicted outcome plot and the residual plot.

---
# Linear Regression in R
### <p style="color:#00449E"> R commands to do EDA and linear regression analysis

.panelset[

.panel[.panel-name[Data]
```r
library(tidyverse)
psub <- readRDS( url('https://bcdanl.github.io/data/psub.RDS') )

set.seed(54321)
gp <- runif( nrow(psub) )

# Set up factor variables if needed.
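# A minimal sketch of that setup (an assumption for illustration;
# skip this step if SEX and SCHL are already factors in psub):
psub$SEX  <- as.factor(psub$SEX)
psub$SCHL <- as.factor(psub$SCHL)
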
dtrain <- filter(psub, gp >= .5)
dtest <- filter(psub, gp < .5)
```
]

.panel[.panel-name[EDA]
```r
library(skimr)
sum_dtrain <- skim( select(dtrain, PINCP, AGEP, SEX, SCHL) )

library(GGally)
ggpairs( select(dtrain, PINCP, AGEP, SEX, SCHL) )

# MORE VISUALIZATIONS ARE RECOMMENDED
```
]

.panel[.panel-name[Training]
```r
model_1 <- lm( PINCP ~ AGEP + SEX, data = dtrain )
model_2 <- lm( PINCP ~ AGEP + SEX + SCHL, data = dtrain )
```
]

.panel[.panel-name[Summary 1]
- Summary with base R:

```r
summary(model_1)
summary(model_2)

coef(model_1)
coef(model_2)

# Using the model.matrix() function on our linear model object,
# we can get the data matrix that underlies our regression.
df_model_1 <- as_tibble( model.matrix(model_1) )
df_model_2 <- as_tibble( model.matrix(model_2) )
```
]

.panel[.panel-name[Summary 2]
- Summary with R packages:

```r
# install.packages(c("stargazer", "broom"))
library(stargazer)
library(broom)

stargazer(model_1, model_2, type = 'text')  # from the stargazer package

sum_model <- tidy(model_2)  # from the broom package
# Consider filter() to keep statistically significant beta estimates.
```
]

.panel[.panel-name[Betas in plot]
```r
ggplot(sum_model) +
  geom_pointrange( aes(x = term, y = estimate,
                       ymin = estimate - 2*std.error,
                       ymax = estimate + 2*std.error ) ) +
  coord_flip()
```
]

.panel[.panel-name[Prediction]
```r
dtest <- dtest %>%
  mutate( pred_1 = predict(model_1, newdata = dtest),
          pred_2 = predict(model_2, newdata = dtest) )
```
]

.panel[.panel-name[Actual vs. Prediction Plot]
```r
ggplot( data = dtest,
        aes(x = pred_2, y = PINCP) ) +
  geom_point( alpha = 0.2, color = "darkgray" ) +
  geom_smooth( color = "darkblue" ) +
  geom_abline( color = "red", linetype = 2 )  # y = x, the perfect prediction line
```
]

.panel[.panel-name[Residual Plot]
```r
ggplot(data = dtest,
       aes(x = pred_2, y = PINCP - pred_2)) +
  geom_point(alpha = 0.2, color = "darkgray") +
  geom_smooth( color = "darkblue" ) +
  geom_hline( aes( yintercept = 0 ),  # perfect prediction
              color = "red", linetype = 2)
```
]

]

---
# Linear Regression in R
### <p style="color:#00449E"> The model equation

`$$\texttt{PINCP[i]} \;=\; \texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \,+\, \\ \texttt{b3*SCHL.no high school diploma[i]} \,+\, \\ \texttt{b4*SCHL.GED or alternative credential[i]} \,+\, \\ \texttt{b5*SCHL.some college credit, no degree[i]} \,+\, \\ \texttt{b6*SCHL.Associate's degree[i]} \,+\, \\ \texttt{b7*SCHL.Bachelor's degree[i]} \,+\, \\ \texttt{b8*SCHL.Master's degree[i]} \,+\, \\ \texttt{b9*SCHL.Professional degree[i]} \,+\, \\ \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.$$`

---
class: inverse, center, middle

# Linear Regression with Log-transformed Variables
<html><div style='float:left'></div><hr color='#EB811B' size=1px width=796px></html>

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> A Little Bit of Math for Logarithms

.panelset[

.panel[.panel-name[log functions]
- The logarithm function, `\(y = \log_{b}\,(\,x\,)\)`, looks like ...

<img src="../lec_figs/logarithm_plots.png" width="42%" style="display: block; margin: auto;" />
]

.panel[.panel-name[log examples]
- `\(\log_{e}\,(\,x\,)\)`: the base `\(e\)` logarithm is called the natural log, where `\(e = 2.718\cdots\)` is the mathematical constant, Euler's number.

- `\(\log\,(\,x\,)\)` or `\(\ln\,(\,x\,)\)`: the natural log of `\(x\)`.
- `\(\log_{e}\,(\,7.389\cdots\,)\)`: the natural log of `\(7.389\cdots\)` is `\(2\)`, because `\(e^{2} = 7.389\cdots\)`.
]

]

---
# Linear Regression with Log-transformation

- We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.

- For small changes in a variable `\(x\)` from `\(x_{0}\)` to `\(x_{1}\)`, the following equation holds:

`$$\Delta \log(x) \,= \, \log(x_{1}) \,-\, \log(x_{0}) \approx\, \frac{x_{1} \,-\, x_{0}}{x_{0}} \,=\, \frac{\Delta\, x}{x_{0}}.$$`

- A difference in income of $5,000 means something very different across people with different income levels.

- We should also consider using a log scale to reduce the variance of the residuals when a variable is heavily skewed.

---
# Linear Regression with Log-transformation

- The log transformation makes the skewed distribution of income more normal.

```r
ggplot(dtrain, aes( x = PINCP ) ) +
  geom_density()

ggplot(dtrain, aes( x = log(PINCP) ) ) +
  geom_density()
```

---
# Linear Regression with Log-transformation

`$$\log(\texttt{PINCP[i]}) \;=\; \texttt{b0} \,+\, \texttt{b1*AGEP[i]} \,+\, \texttt{b2*SEX.Male[i]} \,+\, \\ \texttt{b3*SCHL.no high school diploma[i]} \,+\, \\ \texttt{b4*SCHL.GED or alternative credential[i]} \,+\, \\ \texttt{b5*SCHL.some college credit, no degree[i]} \,+\, \\ \texttt{b6*SCHL.Associate's degree[i]} \,+\, \\ \texttt{b7*SCHL.Bachelor's degree[i]} \,+\, \\ \texttt{b8*SCHL.Master's degree[i]} \,+\, \\ \texttt{b9*SCHL.Professional degree[i]} \,+\, \\ \texttt{b10*SCHL.Doctorate degree[i]} \,+\, \texttt{e[i]}.$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> A Few Rules for Logarithm and Exponential Functions

- Rule 1:

$$\texttt{y} \,=\, \texttt{log(x)}\qquad\Leftrightarrow\qquad \texttt{exp(y)} \,=\, \texttt{x}.$$

- Rule 2:

`$$\texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{log}\,\left(\,\frac{\texttt{x}}{\texttt{z}}\,\right).$$`

- By the rules above,

$$\texttt{log(x)} \,-\, \texttt{log(z)} \,=\, \texttt{b}\qquad\Leftrightarrow\qquad \frac{\texttt{x}}{\texttt{z}} \,=\,\texttt{exp(b)}.$$
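
- As a quick numerical check of these rules in R (a minimal sketch with arbitrary values of `x` and `z`):

```r
x <- 200
z <- 100

b <- log(x) - log(z)  # Rule 2: log(x) - log(z) = log(x/z)
exp(b)                # Rule 1: exp(b) recovers x/z, which is 2 here
```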
---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- If we apply the rule above to `\(\texttt{Bob}\)` and `\(\texttt{Ben}\)`'s predicted incomes,

`$$\widehat{\texttt{log(PINCP[Ben])}} \,-\, \widehat{\texttt{log(PINCP[Bob])}}\qquad \\ \;=\quad \hat{\texttt{b1}}\texttt{ * }(\texttt{AGEP[Ben]} - \texttt{AGEP[Bob]})\qquad\\ \;=\quad \hat{\texttt{b1}}\texttt{ * }\texttt{(51 - 50)}\qquad\qquad\qquad\qquad\;\\ \;=\quad \hat{\texttt{b1}}\qquad\qquad\qquad\qquad\qquad\qquad\;\;\;\,$$`

So we have the following:

`$$\Leftrightarrow\qquad\frac{\widehat{\texttt{PINCP[Ben]}}}{ \widehat{\texttt{PINCP[Bob]}}} \;=\; \texttt{exp(}\hat{\texttt{b1}}\texttt{)} \quad\Leftrightarrow\quad\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Bob]}} * \texttt{exp(}\hat{\texttt{b1}}\texttt{)}$$`

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- If we apply the rule above to `\(\texttt{Ben}\)` and `\(\texttt{Linda}\)`'s predicted incomes,

`$$\frac{\widehat{\texttt{PINCP[Ben]}}}{ \widehat{\texttt{PINCP[Linda]}}} \;=\; \texttt{exp(}\hat{\texttt{b2}}\texttt{)} \quad\Leftrightarrow\quad\widehat{\texttt{PINCP[Ben]}} \;=\; \widehat{\texttt{PINCP[Linda]}} * \texttt{exp(}\hat{\texttt{b2}}\texttt{)}$$`

- Suppose `\(\texttt{exp(}\hat{\texttt{b2}}\texttt{)} = 1.18\)`.
  - Then `\(\widehat{\texttt{PINCP[Ben]}}\)` is 1.18 times `\(\widehat{\texttt{PINCP[Linda]}}\)`.
  - It means that being a male is associated with an 18% increase in income relative to being a female.

---
# Linear Regression with Log-transformation
### <p style="color:#00449E"> Interpreting Beta Estimates

- All else being equal, an increase in `AGEP` by one unit is associated with an increase in `log(PINCP)` by `b1`.
  - All else being equal, an increase in `AGEP` by one unit is associated with an increase in `PINCP` by `100*(exp(b1) - 1)`%.

- All else being equal, an increase in `SEXMale` by one unit is associated with an increase in `log(PINCP)` by `b2`.
  - All else being equal, being a male is associated with an increase in `PINCP` by `100*(exp(b2) - 1)`% relative to being a female.
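
- A minimal sketch of this interpretation in R, assuming the `dtrain` data frame from earlier (the coefficient name `SEXMale` is an assumption that depends on the reference level of `SEX`):

```r
# Fit the model with a log-transformed outcome.
model_log <- lm( log(PINCP) ~ AGEP + SEX + SCHL, data = dtrain )

# Percent change in PINCP associated with being male,
# all else being equal: 100 * (exp(b2) - 1).
b2 <- coef(model_log)['SEXMale']
100 * ( exp(b2) - 1 )
```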