Lecture 7
DANL 200: Introduction to Data Analytics
Byeong-Hak Choe
September 20, 2022

1 / 69

Announcement

Tutoring and TA-ing Schedules

Marcie Hogan (Tutor for DANL 100):
1. Sunday, 2:00 PM--5:00 PM
2. Wednesday, 12:30 PM--1:30 PM

Andrew Mosbo (Tutor):
1. Mondays, 4:00 PM--5:00 PM
2. Wednesdays, 11:00 A.M.--noon
3. Thursdays, 5:00 PM--6:00 PM

Emine Morris (TA):
1. Mondays and Wednesdays, 5:00 PM--6:30 PM
2. Tuesdays and Thursdays, 3:00 PM--4:45 PM

2 / 69

Workflow

Shortcuts for RStudio and RScript

Mac

command + shift + N opens a new RScript.
command + return runs the current line or the selected lines.
command + shift + C is the shortcut for # (commenting).
option + - is the shortcut for <-.

Windows

Ctrl + Shift + N opens a new RScript.
Ctrl + return runs the current line or the selected lines.
Ctrl + Shift + C is the shortcut for # (commenting).
Alt + - is the shortcut for <-.

3 / 69

Workflow

Home/End moves the blinking cursor to the beginning/end of the line. Ctrl (command or fn for Mac users) + ←/→ works too.

PgUp/PgDn moves the blinking cursor to the top/bottom line of the script on the screen. Fn + ↑/↓ works too.

Ctrl (command for Mac users) + Z undoes the previous action.
Ctrl (command for Mac users) + Shift + Z redoes an action that was undone.
Ctrl (command for Mac users) + F finds (and optionally replaces) a phrase in the RScript.
Ctrl (command for Mac users) + D deletes the current line.

4 / 69

About the dataset for Question 3 in Homework Assignment 1

The geographic and time units of observation (row) in the dataset, NY_school_enrollment_socioecon.csv, are New York county and year.

FIPS	year	county_name	pincp	c01_001	c02_002
36001	2015	Albany	55793	84463	4.7

For example, the observation above means that in Albany county in year 2015 ...

Personal income of people is $55,793.
Population 3 years and over enrolled in school is 84,463.
Percent of population 3 years and over enrolled in nursery school and preschool is 4.7%.

5 / 69

Data Visualization - First Steps

Graphing Template

To make a ggplot plot, replace the bracketed sections in the code below with a data.frame, a geom function, or a collection of mappings such as x = VAR_1 and y = VAR_2.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

6 / 69

Class Exercises

Use the following data.frame.

library(tidyverse)   # provides read_csv() and ggplot()
tvshows_web <- read_csv(
        'https://bcdanl.github.io/data/tvshows.csv')

Describe the relationship between audience size (GRP) and audience engagement (PE) using ggplot. Explain the relationship in words.
- What aesthetic property would you consider?
- Would you do faceting?

7 / 69

Geometric Objects
8 / 69

Geometric Objects

How are these two plots similar?

9 / 69

Geometric Objects

A geom_*() is the geometrical object that a plot uses to represent data.
- Bar charts use geom_bar();
- Line charts use geom_line();
- Boxplots use geom_boxplot();
- Scatterplots use geom_point();
- Fitted lines use geom_smooth();
- and many more!
We can use different geom_*() to plot the same data.

10 / 69

Geometric Objects

To change the geom in your plot, change the geom function that you add to ggplot().

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy))

ggplot(data = mpg) + 
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

11 / 69

Geometric Objects

`geom_*()` Functions and Aesthetic mappings

Every geom_*() function takes specific mapping arguments.
- Not every aesthetic property works with every geom_*() function.
- For example, you can set the shape of a geom_point(), but you cannot set the shape of a geom_smooth();
- You could set the linetype of a geom_smooth().

ggplot( data = mpg ) + 
  geom_smooth( mapping = aes( x = displ, y = hwy),
               linetype = 3)

12 / 69

Geometric Objects

`geom_*()` functions and `group` aesthetic

You can set the group aesthetic to a categorical variable to draw multiple objects.
ggplot2 will draw a separate object for each unique value of the grouping variable.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, 
                            group = drv))

13 / 69

Geometric Objects

`geom_*()` functions and `group` aesthetic

In practice, ggplot2 automatically groups the data for these geoms whenever you map an aesthetic, such as color or linetype, to a discrete variable.

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, 
                  color = drv),
    show.legend = FALSE
  )

14 / 69

Geometric Objects

Multiple `geom_*()` functions

To display multiple geometric objects in the same plot, add multiple geom_*() functions to ggplot():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

15 / 69

Geometric Objects

Multiple `geom_*()` functions

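Mappings passed to ggplot() are global: they apply to every layer in the plot.
If you place mappings in a geom_*() function, ggplot2 will treat them as local mappings for that layer only, extending or overwriting the global mappings.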

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

16 / 69

Geometric Objects

Multiple `geom_*()` functions

You can use the same idea to specify different data for each layer.
Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), 
              se = FALSE)

17 / 69

Statistical Transformation
18 / 69

Statistical Transformations

Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with geom_bar().
The following bar chart displays the total number of diamonds in the ggplot2::diamonds dataset, grouped by cut.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

19 / 69

Statistical Transformations

Many graphs, including bar charts, calculate new values to plot:
- geom_bar(), geom_histogram(), and geom_freqpoly() bin your data and then plot bin counts, the number of observations that fall in each bin.
- geom_smooth() fits a model to your data and then plots predictions from the model.
- geom_boxplot() computes a summary of the distribution and then displays a specially formatted box.

20 / 69

Statistical Transformations

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with geom_bar().
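To make the stat idea concrete, the sketch below counts the diamonds in each cut by hand and compares the result with the values ggplot2 computes for the bar layer. It assumes ggplot2 is installed; layer_data() is used here only to inspect the computed values, and the object names are illustrative.

```r
library(ggplot2)

# Count the rows in each cut category by hand.
manual_counts <- as.data.frame(table(cut = diamonds$cut))

# Ask ggplot2 for the data it actually draws in the bar layer.
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
drawn <- layer_data(p)

manual_counts$Freq   # counts computed by hand
drawn$count          # counts computed by the stat; the two should agree
```

The raw data has no count column. The stat creates it, and geom_bar() maps it to the bar height.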

21 / 69

Statistical Transformations

Observed Value vs. Number of Observations

There are three reasons you might need to use a stat explicitly:
- 1. You might want to override the default stat.

demo <- tribble(         # for a simple data.frame
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551 )
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), 
           stat = "identity")

22 / 69

Statistical Transformations

Count vs. Proportion

There are three reasons you might need to use a stat explicitly:
- 2. You might want to override the default mapping from transformed variables to aesthetics.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, 
                         y = after_stat(prop),  # after_stat() supersedes stat() (ggplot2 3.3.0+)
                         group = 1))
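As a rough check on what the proportion bar chart displays, the sketch below compares the proportions ggplot2 computes with proportions calculated by hand. It assumes ggplot2 3.3.0 or later, where after_stat(prop) is the current spelling of stat(prop); the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

by_hand <- as.numeric(prop.table(table(diamonds$cut)))  # share of rows per cut
drawn   <- layer_data(p)$prop                           # share computed by the stat

round(by_hand, 3)
round(drawn, 3)
```

With group = 1, all rows belong to a single group, so each proportion is a share of the whole dataset and the five bars sum to one.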

23 / 69

Statistical Transformations

Stat summary

There are three reasons you might need to use a stat explicitly:
- 3. You might want to draw greater attention to the statistical transformation in your code.

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

24 / 69

Statistical Transformations

Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
What does geom_col() do? How is it different from geom_bar()?
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
What variables does stat_smooth() compute? What parameters control its behavior?

25 / 69

Statistical Transformations

Exercises

In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop) ) )
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), 
                         fill = color ) )

26 / 69

Position Adjustment
27 / 69

Position Adjustments

`color` and `fill` aesthetic

You can color a bar chart using either the color aesthetic, or, more usefully, fill:

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 [?] = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 [?] = cut))
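One plausible way to fill in the two blanks, going by the sentence above, is color for the first plot and fill for the second. This is a sketch under that assumption, not necessarily the in-class answer; the object names are illustrative.

```r
library(ggplot2)

# Assumed fill-in for the first blank: color changes only the bar outlines.
p_color <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

# Assumed fill-in for the second blank: fill changes the bar interiors.
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

print(p_color)
print(p_fill)
```

Printing both plots side by side shows why fill is usually the more useful choice for bars: the outline is a thin border, while the interior covers the whole bar.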

28 / 69

Position Adjustments

Stacked bar charts with `fill` aesthetic

Note that the bars are automatically stacked if you map the fill aesthetic to another variable.

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity) )

29 / 69

Position Adjustments

Stacked bar charts with `fill` aesthetic

The stacking is performed automatically by the position adjustment specified by the position argument.

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity),
           position = "stack")

30 / 69

Position Adjustments

`position = "fill"` and `position = "dodge"`

If you don't want a stacked bar chart with counts, you can use one of two other position options: fill or dodge.

position = "fill" works like stacking, but makes each set of stacked bars the same height.
- This makes it easier to compare proportions across groups.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])

position = "dodge" places overlapping objects directly beside one another.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])
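Reading the two descriptions above, the blanks are presumably "fill" and "dodge". The sketch below uses those values and inspects the computed bar coordinates to show the difference; the object names are illustrative.

```r
library(ggplot2)

# Assumed fill-in: "fill" rescales every stack to height 1 (proportions).
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# Assumed fill-in: "dodge" puts the clarity bars next to each other.
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

d_fill  <- layer_data(p_fill)
d_dodge <- layer_data(p_dodge)

# Top of each stack under "fill", one value per cut.
stack_tops <- tapply(d_fill$ymax, as.numeric(d_fill$x), max)
stack_tops
```

Under "fill" every stack tops out at one, which is what makes proportions comparable across cuts. Under "dodge" every bar starts at zero, so individual counts are easier to compare.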

31 / 69

Position Adjustments

Overplotting and `position = "jitter"`

The values of hwy and displ are rounded so the points appear on a grid and many points overlap each other.
- This problem is known as overplotting.
You can avoid the overlapping problem by setting the position adjustment to jitter.
- position = "jitter" adds a small amount of random noise to each point.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = [?])
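The blank here is presumably "jitter". As a sketch under that assumption, the code below counts how many distinct positions are drawn with and without jittering; the object names are illustrative.

```r
library(ggplot2)

p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Assumed fill-in for the blank: "jitter" nudges each point by a little noise.
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Distinct (x, y) positions actually drawn in each plot.
n_grid   <- nrow(unique(layer_data(p_grid)[, c("x", "y")]))
n_jitter <- nrow(unique(layer_data(p_jitter)[, c("x", "y")]))

c(rows = nrow(mpg), grid = n_grid, jitter = n_jitter)
```

The unjittered plot draws fewer distinct positions than there are rows, which is the overplotting problem in numbers. Jittering trades a little positional accuracy for a clearer view of where the data is dense.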

32 / 69

Position Adjustments

Exercises

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

What parameters to geom_jitter() control the amount of jittering?
Compare and contrast geom_jitter() with geom_count().
What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

33 / 69

Coordinate
34 / 69

Coordinate Systems

The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.
There are a number of other coordinate systems that are occasionally helpful.

35 / 69

Coordinate Systems

`coord_flip()`

coord_flip() switches the x and y axes.
This is useful, for example, if you want horizontal boxplots.
It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

36 / 69

Coordinate Systems

`coord_quickmap()`

coord_quickmap() sets the aspect ratio correctly for maps.

county <- map_data("county")   # Map data for US counties; needs the 'maps' package installed
ny <- filter(county,      # We will discuss 'filter()' in the next chapter
             region == "new york")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

37 / 69

Coordinate Systems

Exercises

What does labs() do? Read the documentation.
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

38 / 69

ggplot Grammar
39 / 69

The Layered Grammar of Graphics

Let's add position adjustments, stats, coordinate systems, and faceting to our code template.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

The seven parameters---(1) a dataset, (2) a geom, (3) a set of mappings, (4) a stat, (5) a position adjustment, (6) a coordinate system, and (7) a faceting scheme---in the template compose the grammar of graphics, a formal system for building plots.
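As an illustration, here is one possible way to fill in all seven slots of the template with the diamonds data. The particular choices (dodged bars, flipped coordinates, facets by color) are arbitrary and only meant to show where each parameter goes.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                 # (1) dataset
  geom_bar(                                    # (2) geom
    mapping = aes(x = cut, fill = clarity),    # (3) mappings
    stat = "count",                            # (4) stat
    position = "dodge"                         # (5) position adjustment
  ) +
  coord_flip() +                               # (6) coordinate system
  facet_wrap(~ color)                          # (7) faceting scheme

print(p)
```

In everyday use most of these slots are left at their defaults; writing them all out is mainly useful for seeing how the pieces fit together.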

40 / 69

Exploratory Data Analysis I
41 / 69

Exploratory Data Analysis

Get to know data before modeling

We need to explore the data before building the model.
- No dataset is perfect.
- You'll have a more specific idea of what information most accurately predicts the probability of insurance coverage.

Data exploration uses a combination of ...
- Summary statistics
- Visualization
- Data transformation

42 / 69

Exploratory Data Analysis

Example

Suppose our goal is to build a model to predict which of our customers don't have health insurance.
We've collected a dataset of customers whose health insurance status we know.
We've also identified some customer properties that we believe help predict the probability of insurance coverage:
- age
- employment status
- income
- information about residence and vehicles, and so on

43 / 69

Summary Statistics
44 / 69

Summary Statistics

Use the summary() or skimr::skim() command to take your first look at the data.
- They report a variety of summary statistics on the numerical variables of the data frame, and count statistics on any categorical variables.

library(tidyverse)
library(skimr)
# Option 1: read a local copy. Replace the placeholder with your own file path.
path <- "PATH_NAME_FOR_THE_FILE_custdata.RDS"
customer_data <- readRDS(path)
# Option 2: read the same data file from my website.
path_web <- "https://bcdanl.github.io/data/custdata.csv"
customer_data <- read.table(path_web, 
                            sep = ',', 
                            header = TRUE)
skim(customer_data)

45 / 69

Summary Statistics

Typical problems revealed by data summaries

At this stage, we're looking for several common issues:
- Missing values
- Invalid values and outliers
- Data ranges that are too wide or too narrow
- The units of the data
Generally, the goal of modeling is to make good predictions on typical cases, or to identify causal relationships.
A model that is highly skewed to predict a rare case correctly may not always be the best model overall.

46 / 69

Summary Statistics

Missing values

The variable is_employed is missing for more than a third of the data.
- Why?

## is_employed
## FALSE: 2321
## TRUE :44887
## NA's :24333
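The "more than a third" figure can be checked directly. The sketch below rebuilds a logical vector with the same three counts as the summary above (rather than loading the data file) and computes the missing share; the vector name mirrors the column, but the vector itself is reconstructed for illustration.

```r
# Rebuild a vector with the same counts as the summary above.
is_employed <- rep(c(FALSE, TRUE, NA), times = c(2321, 44887, 24333))

n_missing     <- sum(is.na(is_employed))    # number of NA values
share_missing <- mean(is.na(is_employed))   # fraction of rows that are NA

n_missing
round(share_missing, 3)
```

The same two lines work on the real column, for example with customer_data$is_employed in place of the reconstructed vector.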

47 / 69

Summary Statistics

Data range and variation

We should pay attention to how much the values in the data vary.
- Outliers are data points that fall well out of the range of where you expect the data to be.

skim(customer_data$income)
skim(customer_data$age)

48 / 69

Summary Statistics

Units

We may not know that variable IncomeK is defined as

$\text{IncomeK} = \texttt{customer\_data\$income} / 1000.$

Looking only at the summary, the values could plausibly be interpreted to mean either "hourly wage" or "yearly income in units of $1,000."

IncomeK <- customer_data$income/1000
skim(IncomeK)

This is actually something that we’ll catch by checking data definitions in data dictionaries or documentation, rather than in the summary statistics.

49 / 69

Visualization
50 / 69

Key Points in Visualization

A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
Strive for clarity. Make the data stand out. Specific tips for increasing clarity include these:
- Avoid too many superimposed elements, such as too many curves in the same graphing space.
- Find the right aspect ratio and scaling to properly bring out the details of the data.
- Avoid having the data all skewed to one side or the other of your graph.
Visualization is an iterative process. Its purpose is to answer questions about the data.

51 / 69

Visualization

Visually checking distributions for a single variable

We will look at histograms, density plots, bar charts, and dot plots.
The above visualizations help us answer questions like these:
- What is the peak value of the distribution?
- How many peaks are there in the distribution (unimodality versus bimodality)?
- How normal is the data?
- How much does the data vary? Is it concentrated in a certain interval or in a certain category?

52 / 69

Visualization

Visually checking distributions for a single variable

One of the things that’s easy to grasp visually is the shape of the distribution of a variable.

ggplot(data = customer_data) + 
  geom_density( mapping = aes(x = age) )

The graph here is somewhat flattish between the ages of about 25 and about 60, falling off slowly after 60.
There seems to be a peak at around the late-20s to early 30s range, and another in the early 50s.
This data has multiple peaks: it is not unimodal.
- Distribution peaks around mid/late 20s. Peaks again in early 50s.

53 / 69

Visualization

Visually checking distributions for a single variable

54 / 69

Visualization

Histograms

A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket as a height.
A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.

ggplot( data = customer_data, aes(x=gas_usage) ) +
  geom_histogram( binwidth=10, fill="gray" )
skim(customer_data$gas_usage)

55 / 69

Visualization

Data dictionary entry for `gas_usage`

Treating the values 001, 002, and 003 as numerical values could lead to incorrect conclusions in our analysis.
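A minimal sketch of one way to handle this, assuming the data dictionary marks 001, 002, and 003 as special codes rather than bill amounts. The toy vector below is hypothetical and only stands in for the real gas_usage column.

```r
# Hypothetical toy values: 1, 2, and 3 stand in for the special codes,
# and the remaining numbers are ordinary bill amounts.
gas_usage <- c(1, 2, 3, 40, 60, 120, 3, 80)

# Flag the coded rows, then blank them out before summarizing.
is_code <- gas_usage %in% c(1, 2, 3)
gas_usage_clean <- ifelse(is_code, NA, gas_usage)

mean(gas_usage)                        # pulled down by the codes
mean(gas_usage_clean, na.rm = TRUE)    # mean of the actual amounts only
```

Whether the coded rows should become NA, or be kept as separate indicator variables, depends on what each code means in the data dictionary.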

56 / 69

Visualization

Density plots

We can think of a density plot as a continuous histogram of a variable.
- The area under the density plot is re-scaled to equal one.

library(scales)   # to denote the dollar sign in axes
ggplot(customer_data, aes(x=income)) + 
    geom_density() +
    scale_x_continuous(labels=dollar)

57 / 69

Visualization

A Little Bit of Math for Logarithm

The logarithm function, $y = \log_{b} (x)$ , looks like ....

$\log_{10} (100)$ : the base 10 logarithm of 100 is 2, because $10^{2} = 100$
$\log_{e} (x)$ : the base $e$ logarithm is called the natural log, where $e = 2.718\cdots$ is the mathematical constant known as Euler's number.
$\log (x)$ or $\ln (x)$ : the natural log of $x$ .
$\log_{e} (7.389 \dots)$ : the natural log of $7.389 \dots$ is 2, because $e^{2} = 7.389 \dots$ .
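These identities can be checked in the R console with base functions only. In R, log() is the natural log and log10() is the base 10 log.

```r
log10(100)            # base 10 log of 100
log(100, base = 10)   # the same value, with the base given explicitly
exp(1)                # Euler's number e, about 2.718
exp(2)                # e squared, about 7.389
log(exp(2))           # natural log of e squared
```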

58 / 69

Visualization

Log Transformation

We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
A difference in income of $5,000 means something very different across people with different income levels.
- We should also consider using a log scale to reduce the variance of the residuals when a variable is heavily skewed.
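A small numeric sketch of the percent-change point, using hypothetical income values: doubling from $10,000 to $20,000 and doubling from $100,000 to $200,000 are very different absolute changes, but the same distance on a log scale.

```r
# Hypothetical incomes: two pairs, each pair is a doubling.
income <- c(10000, 20000, 100000, 200000)

abs_change <- diff(income)[c(1, 3)]          # change within each pair, in dollars
log_change <- diff(log10(income))[c(1, 3)]   # change within each pair, in log10 units

abs_change
log_change
```

On the log scale both doublings are the same size, which is why a log axis suits questions about percent change.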

59 / 69

Visualization

Log Transformation

The log transformation makes the skewed distribution of income more normal.

ggplot(customer_data, aes(x=income)) +
  geom_density() +
  scale_x_log10(breaks = c(10, 100, 1000, 10000, 100000, 1000000),
                labels=dollar)

60 / 69

Visualization

Bar Charts and Dotplots

A bar chart is a histogram for discrete data.
- It records the frequency of every value of a categorical variable.

ggplot( data = customer_data, 
            mapping =  aes( x = marital_status )  ) + 
  geom_bar( fill="gray" )

61 / 69

Visualization

Bar Charts and Dotplots

Bar charts are most useful when the number of possible values is fairly large, like state of residence.

ggplot(customer_data, aes(x=state_of_res)) +
  geom_bar(fill="gray") +
  coord_flip()

A horizontal bar chart can be easier to read when there are several categories with long names.

62 / 69

Visualization

Bar Charts and Dotplots

Sometimes it is better to sort the data when plotting a bar chart or dot plot.

library(WVPlots)    # install.packages("WVPlots") if you have not
ClevelandDotPlot(customer_data, "state_of_res",
                 sort = 1, title="Customers by state") +
  coord_flip()

A sorted bar chart or dot plot lets us extract insight from the data more efficiently.

63 / 69

Visualization

Visually checking relationships between two variables

We'll often want to look at the relationship between two variables.
- Is there a relationship between the two inputs---age and income---in my data?
- If so, what kind of relationship, and how strong?
- Is there a relationship between the input, marital status, and the output, health insurance? How strong?

64 / 69

Visualization

A relationship between age and income

Reasonable age and income values can be selected.
- We'll discuss the filter() function soon.

customer_data2 <- filter(customer_data,
                         0 < age & age < 100 &
                         0 < income & income < 200000)
cor(customer_data2$age, customer_data2$income)   # correlation on the filtered data

65 / 69

Visualization

A relationship between age and income

ggplot( data = customer_data2 ) +
  geom_smooth( mapping = aes(x = age, y = income) )
ggplot(customer_data2, aes(x=age, y=income)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Income as a function of age")
library(hexbin)    # install.packages("hexbin") if you have not.
ggplot(customer_data2, aes(x=age, y=income)) +
  geom_hex() +
  geom_smooth(color = "red", se = FALSE) +
  ggtitle("Income as a function of age")

66 / 69

Visualization

A relationship between marital status and health insurance

Bar charts can be used to describe a relationship between two categorical variables.

ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar()
# side-by-side bar chart
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar([?])
# stacked bar chart
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar([?])
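The blanks above are presumably position arguments. The sketch below shows two candidates on a small hypothetical data frame that reuses the column names marital_status and health_ins: "dodge" for the side-by-side chart, and "fill" as one option for a stacked chart rescaled to proportions (the default "stack" would give stacked counts instead). Which value the second blank is meant to take is an assumption here.

```r
library(ggplot2)

# Hypothetical toy data with the same two column names as customer_data.
toy <- data.frame(
  marital_status = c("Married", "Married", "Married", "Never married",
                     "Never married", "Widowed"),
  health_ins     = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Side-by-side bars: assumed to be position = "dodge".
p_dodge <- ggplot(toy, aes(x = marital_status, fill = health_ins)) +
  geom_bar(position = "dodge")

# Stacked bars rescaled to proportions: position = "fill" is one candidate.
p_fill <- ggplot(toy, aes(x = marital_status, fill = health_ins)) +
  geom_bar(position = "fill")

print(p_dodge)
print(p_fill)
```

The dodged chart makes counts easy to compare, while the filled chart makes the insured share within each marital status easy to compare.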

67 / 69

Visualization

The Distribution of Marriage Status across Housing Types

cdata <- filter(customer_data, !is.na(housing_type))
ggplot(cdata, aes(x=housing_type, fill=marital_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()
ggplot(cdata, aes(x=marital_status)) +
  geom_bar(fill="darkgray") +
  facet_wrap(~housing_type, scales="free_x") +
  coord_flip()

68 / 69

Visualization

Visually checking relationships between two variables

Overlaying, faceting, and several aesthetics should always be considered with the following geometric objects:
- Scatter plot
- Smoothing curve
- Bar chart
- Stacked bar chart
- Side-by-side bar chart
- Density plot
- Histogram
- Frequency polygon
- Hexbin plot

69 / 69

Lecture 7DANL 200: Introduction to Data AnalyticsByeong-Hak ChoeSeptember 20, 20221 / 69

Announcement

Tutoring and TA-ing Schedules

Marcie Hogan (Tutor for DANL 100):
1. Sunday, 2:00 PM--5:00 PM
2. Wednesday, 12:30 PM--1:30 PM

Andrew Mosbo (Tutor):
1. Mondays, 4:00 PM--5:00 PM
2. Wednesdays, 11:00 A.M.--noon
3. Thursdays, 5:00 PM--6:00 PM

Emine Morris (TA):
1. Mondays and Wednesdays, 5:00 PM--6:30 PM
2. Tuesdays and Thursdays, 3:00 PM--4:45 PM

2 / 69

Workflow

Shortcuts for RStudio and RScript

Mac

command + shift + N opens a new RScript.
command + return runs a current line or selected lines.
command + shift + C is the shortcut for # (commenting).
option + - is the shortcut for <-.

Windows

Ctrl + Shift + N opens a new RS-cript.
Ctrl + return runs a current line or selected lines.
Ctrl + Shift + C is the shortcut for # (commenting).
Alt + - is the shortcut for <-.

3 / 69

WorkflowHome/End moves the blinking cursor bar to the beginning/End of the line.Ctrl (command/fn for Mac Users) +  /  works too.

PgUp/PgDn moves the blinking cursor bar to the top/bottom line of the script on the screen. Fn +   /  works too.

Ctrl (command for Mac Users) + Z undoes the previous action.
Ctrl (command for Mac Users) + Shift + Z redoes when undo is executed.
Ctrl (command for Mac Users) + F is useful when finding a phrase (and replace the phrase) in the RScript.
Ctrl (command for Mac Users) + D deletes a current line.
4 / 69

About the dataset for Question 3 in Homework Assignment 1

The geographic and time units of observation (row) in the dataset, NY_school_enrollment_socioecon.csv, are New York county and year.

FIPS	year	county_name	pincp	c01_001	c02_002
36001	2015	Albany	55793	84463	4.7

For example, the observation above means that in Albany county in year 2015 ...

Personal income of people is $55,793.
Population 3 years and over enrolled in school is 84,463.
Percent of population 3 years and over enrolled in nursery school and preschool is 4.7%.

5 / 69

Data Visualization - First Steps

Graphing Template

To make a ggplot plot, replace the bracketed sections in the code below with a data.frame, a geom function, or a collection of mappings such as x = VAR_1 and y = VAR_2.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

6 / 69

Class Exercises

Use the following data.frame.

tvshows_web <- read_csv(
        'https://bcdanl.github.io/data/tvshows.csv')

Describe the relationship between audience size (GRP) and audience engagement (PE) using ggplot. Explain the relationship in words.
- What aesthetic property would you consider?
- Would you do faceting?

7 / 69

Geometric Objects
8 / 69

Geometric Objects

How are these two plots similar?

9 / 69

Geometric Objects

A geom_*() is the geometrical object that a plot uses to represent data.
- Bar charts use geom_bar();
- Line charts use geom_line();
- Boxplots use the geom_boxplot();
- Scatterplots use the geom_point();
- Fitted lines use the geom_smooth();
- and many more!
We can use different geom_*() to plot the same data.

10 / 69

Geometric Objects

To change the geom in your plot, change the geom function that you add to ggplot().

ggplot(data = mpg) + 
  geom_point(mapping = 
               aes(x = displ, 
                   y = hwy))

ggplot(data = mpg) + 
  geom_smooth(mapping = 
                aes(x = displ, 
                    y = hwy))

11 / 69

Geometric Objects

`geom_*()` Functions and Aesthetic mappings

Every geom_*() function takes specific mapping arguments.
- Not every aesthetic property works with every geom_*() function.
- For example, you can set the shape of a geom_point(), but you cannot set the shape of a geom_smooth();
- You could set the linetype of a geom_smooth().

ggplot( data = mpg ) + 
  geom_smooth( mapping = aes( x = displ, y = hwy),
               linetype = 3)

12 / 69

Geometric Objects

`geom_*()` functions and `group` aesthetic

You can set the group aesthetic to a categorical variable to draw multiple objects.
ggplot2 will draw a separate object for each unique value of the grouping variable.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, 
                            group = drv))

13 / 69

Geometric Objects

`geom_*()` functions and `group` aesthetic

In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example).

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, 
                  color = drv),
    show.legend = FALSE
  )

14 / 69

Geometric Objects

Multiple `geom_*()` functions

To display multiple geometric objects in the same plot, add multiple geom_*() functions to ggplot():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

15 / 69

Geometric Objects

Multiple `geom_*()` functions

If you place mappings in a geom_*() function, ggplot2 will treat them as local mappings for the layer.

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

16 / 69

Geometric Objects

Multiple `geom_*()` functions

You can use the same idea to specify different data for each layer.
Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, 
       mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), 
              se = FALSE)

17 / 69

Statistical Transformation
18 / 69

Statistical Transformations

Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with geom_bar().
The following bar chart displays the total number of diamonds in the ggplot2::diamonds dataset, grouped by cut.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

19 / 69

Statistical Transformations

Many graphs, including bar charts, calculate new values to plot:
- geom_bar(), geom_histogram(), and geom_freqpoly() bin your data and then plot bin counts, the number of observations that fall in each bin.
- geom_smooth() fits a model to your data and then plot predictions from the model.
- geom_boxplot() compute a summary of the distribution and then display a specially formatted box.

20 / 69

Statistical Transformations

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with geom_bar().

21 / 69

Statistical Transformations

Observed Value vs. Number of Observations

There are three reasons you might need to use a stat explicitly:
- 1. You might want to override the default stat.

demo <- tribble(         # for a simple data.frame
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551 )
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), 
           stat = "identity")

22 / 69

Statistical Transformations

Count vs. Proportion

There are three reasons you might need to use a stat explicitly:
- 2. You might want to override the default mapping from transformed variables to aesthetics.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, 
                         y = stat(prop), 
                         group = 1))

23 / 69

Statistical Transformations

Stat summary

There are three reasons you might need to use a stat explicitly:
- 3. You might want to draw greater attention to the statistical transformation in your code.

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

24 / 69

Statistical Transformations

Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
What does geom_col() do? How is it different to geom_bar()?
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
What variables does stat_smooth() compute? What parameters control its behavior?

25 / 69

Statistical Transformations

Exercises

In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop) ) )
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), 
                         fill = color ) )

26 / 69

Position Adjustment
27 / 69

Position Adjustments

`color` and `fill` aesthetic

You can color a bar chart using either the color aesthetic, or, more usefully, fill:

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 [?] = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 [?] = cut))
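A possible completion of the two blanks above, assuming the first stands for the color aesthetic and the second for fill, as the sentence before the code suggests. The names p_color and p_fill are illustrative.

```r
library(ggplot2)

# Assumption: the first blank is color, which colors the bar outlines
p_color <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

# Assumption: the second blank is fill, which colors the bar interiors
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

p_color
p_fill
```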

28 / 69

Position Adjustments

Stacked bar charts with `fill` aesthetic

Note that the bars are automatically stacked if you map the fill aesthetic to another variable.

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity) )

29 / 69

Position Adjustments

Stacked bar charts with `fill` aesthetic

The stacking is performed automatically by the position adjustment specified by the position argument.

ggplot(data = diamonds) + 
  geom_bar(mapping = 
             aes(x = cut, 
                 fill = clarity),
           position = "stack")

30 / 69

Position Adjustments

`position = "fill"` and `position = "dodge"`

If you don't want a stacked bar chart with counts, you can use one of two other position options: fill or dodge.

position = "fill" works like stacking, but makes each set of stacked bars the same height.
- This makes it easier to compare proportions across groups.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])

position = "dodge" places overlapping objects directly beside one another.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])
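A sketch of the two plots with the blanks filled in, assuming they stand for "fill" and "dodge" in the order the text introduces them. The names p_fill and p_dodge are illustrative.

```r
library(ggplot2)

# Assumption: the first blank is "fill"; each stack is rescaled to height 1
p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# Assumption: the second blank is "dodge"; bars are placed side by side
p_dodge <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

p_fill
p_dodge
```

With "fill" the y-axis reads as a proportion within each cut, while "dodge" keeps the raw counts and makes individual clarity levels easier to compare.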

31 / 69

Position Adjustments

Overplotting and `position = "jitter"`

The values of hwy and displ are rounded so the points appear on a grid and many points overlap each other.
- This problem is known as overplotting.
You can avoid the overlapping problem by setting the position adjustment to jitter.
- position = "jitter" adds a small amount of random noise to each point.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = [?])
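A sketch of the plot with the blank filled in, assuming it stands for "jitter" as named in the text. The name p_jitter is illustrative.

```r
library(ggplot2)

# Assumption: the blank is "jitter"
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

p_jitter
```

Because the noise is random, the points land in slightly different places each time the plot is drawn.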

32 / 69

Position Adjustments

Exercises

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

What parameters to geom_jitter() control the amount of jittering?
Compare and contrast geom_jitter() with geom_count().
What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

33 / 69

Coordinate
34 / 69

Coordinate Systems

The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.
There are a number of other coordinate systems that are occasionally helpful.

35 / 69

Coordinate Systems

`coord_flip()`

coord_flip() switches the x and y axes.
This is useful, for example, if you want horizontal boxplots.
It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

36 / 69

Coordinate Systems

`coord_quickmap()`

coord_quickmap() sets the aspect ratio correctly for maps.

county <- map_data("county")   # Map data for US counties; needs the 'maps' package installed
ny <- filter(county,      # filter() is from dplyr; we will discuss it in the next chapter
             region == "new york")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

37 / 69

Coordinate Systems

Exercises

What does labs() do? Read the documentation.
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

38 / 69

ggplot Grammar
39 / 69

The Layered Grammar of Graphics

Let's add position adjustments, stats, coordinate systems, and faceting to our code template.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

The seven parameters in the template, (1) a dataset, (2) a geom, (3) a set of mappings, (4) a stat, (5) a position adjustment, (6) a coordinate system, and (7) a faceting scheme, compose the grammar of graphics, a formal system for building plots.
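To make the template concrete, here is one way to fill in all seven slots using the diamonds data. The particular choices (bars, dodge, flipped coordinates, facets by color) are only an illustration, not a recommended design.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +                  # (1) dataset
  geom_bar(                                     # (2) geom
    mapping = aes(x = cut, fill = clarity),     # (3) mappings
    stat = "count",                             # (4) stat (the default for geom_bar)
    position = "dodge") +                       # (5) position adjustment
  coord_flip() +                                # (6) coordinate system
  facet_wrap(~ color)                           # (7) faceting scheme
p
```

Most plots spell out only a few of these, since every geom comes with a default stat and position.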

40 / 69

Exploratory Data Analysis I
41 / 69

Exploratory Data Analysis

Get to know data before modeling

We need to explore the data before building the model.
- No dataset is perfect.
- You'll have a more specific idea of what information most accurately predicts the probability of insurance coverage.

Data exploration uses a combination of ...
- Summary statistics
- Visualization
- Data transformation

42 / 69

Exploratory Data Analysis

Example

Suppose your goal is to build a model to predict which of your customers don't have health insurance.
You've collected a dataset of customers whose health insurance status you know.
You've also identified some customer properties that you believe help predict the probability of insurance coverage:
- age
- employment status
- income
- information about residence and vehicles, and so on

43 / 69

Summary Statistics
44 / 69

Summary Statistics

Use the summary() or skimr::skim() command to take your first look at the data.
- They report a variety of summary statistics on the numerical variables of the data frame, and count statistics on any categorical variables.

library(tidyverse)
library(skimr)
# Option 1: a local copy; replace the placeholder with your own file path.
path <- "PATH_NAME_FOR_THE_FILE_custdata.RDS"
customer_data <- readRDS(path)
# Option 2: the same data file on my website.
path_web <- "https://bcdanl.github.io/data/custdata.csv"
customer_data <- read.table(path_web, 
                            sep = ',', 
                            header = TRUE)
skim(customer_data)

45 / 69

Summary Statistics

Typical problems revealed by data summaries

At this stage, we're looking for several common issues:
- Missing values
- Invalid values and outliers
- Data ranges that are too wide or too narrow
- The units of the data
Generally, the goal of modeling is to make good predictions on typical cases, or to identify causal relationships.
A model that is highly skewed to predict a rare case correctly may not always be the best model overall.

46 / 69

Summary Statistics

Missing values

The variable is_employed is missing for more than a third of the data.
- Why?

## is_employed
## FALSE: 2321
## TRUE :44887
## NA's :24333
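The "more than a third" figure follows from the counts in the summary output. The arithmetic is sketched below; the variable names are illustrative.

```r
# Counts copied from the summary output above
n_false <- 2321
n_true  <- 44887
n_na    <- 24333

# Share of rows where is_employed is missing
share_na <- n_na / (n_false + n_true + n_na)
share_na

# On the data frame itself, the same share could be computed with
# mean(is.na(customer_data$is_employed))
```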

47 / 69

Summary Statistics

Data range and variation

We should pay attention to how much the values in the data vary.
- Outliers are data points that fall well out of the range of where you expect the data to be.

skim(customer_data$income)
skim(customer_data$age)

48 / 69

Summary Statistics

Units

We may not know that variable IncomeK is defined as

$\texttt{IncomeK} = \texttt{customer\_data\$income} / 1000.$

Looking only at the summary, the values could plausibly be interpreted to mean either "hourly wage" or "yearly income in units of $1,000."

IncomeK <- customer_data$income/1000
skim(IncomeK)

This is actually something that we’ll catch by checking data definitions in data dictionaries or documentation, rather than in the summary statistics.

49 / 69

Visualization
50 / 69

Key Points in Visualization

A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
Strive for clarity. Make the data stand out. Specific tips for increasing clarity include these:
- Avoid too many superimposed elements, such as too many curves in the same graphing space.
- Find the right aspect ratio and scaling to properly bring out the details of the data.
- Avoid having the data all skewed to one side or the other of your graph.
Visualization is an iterative process. Its purpose is to answer questions about the data.

51 / 69

Visualization

Visually checking distributions for a single variable

We will look at histograms, density plots, bar charts, and dot plots.
The above visualizations help us answer questions like these:
- What is the peak value of the distribution?
- How many peaks are there in the distribution (unimodality versus bimodality)?
- How normal is the data?
- How much does the data vary? Is it concentrated in a certain interval or in a certain category?

52 / 69

Visualization

Visually checking distributions for a single variable

One of the things that’s easy to grasp visually is the shape of the distribution of a variable.

ggplot(data = customer_data) + 
  geom_density( mapping = aes(x = age) )

The graph here is somewhat flattish between the ages of about 25 and about 60, falling off slowly after 60.
There seems to be a peak at around the late-20s to early 30s range, and another in the early 50s.
This data has multiple peaks: it is not unimodal.
- Distribution peaks around mid/late 20s. Peaks again in early 50s.

53 / 69

Visualization

Visually checking distributions for a single variable

54 / 69

Visualization

Histograms

A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket as a height.
A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.

ggplot( data = customer_data, aes(x=gas_usage) ) +
  geom_histogram( binwidth=10, fill="gray" )
skim(customer_data$gas_usage)
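The binning step can be sketched by hand in base R. The values in x are hypothetical, and ggplot2 may place its bin boundaries differently from the ones chosen here.

```r
# Hypothetical values to be binned
x <- c(3, 7, 12, 18, 25, 25, 31, 44)

# Fixed-width buckets of width 10: [0,10), [10,20), ..., [40,50)
breaks <- seq(0, 50, by = 10)
bins <- cut(x, breaks = breaks, right = FALSE)

# Number of data points in each bucket; these would be the bar heights
counts <- table(bins)
counts
```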

55 / 69

Visualization

Data dictionary entry for `gas_usage`

Treating the values 001, 002, and 003 as numerical values could lead to incorrect conclusions in our analysis.
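One common way to handle such values is to set them to missing before summarizing. The sketch below uses a small hypothetical vector and assumes that 1, 2, and 3 are special codes rather than actual amounts, as the data dictionary entry indicates.

```r
# Hypothetical gas_usage values; 1, 2, and 3 are assumed to be codes
gas_usage <- c(1, 3, 40, 120, 2, 60, 210, 3)

# Replace the coded values with NA so they are not treated as amounts
gas_usage_clean <- ifelse(gas_usage %in% c(1, 2, 3), NA, gas_usage)

mean(gas_usage)                      # pulled down by the coded values
mean(gas_usage_clean, na.rm = TRUE)  # average of the actual amounts only
```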

56 / 69

Visualization

Density plots

We can think of a density plot as a continuous histogram of a variable.
- The area under the density plot is re-scaled to equal one.

library(scales)   # to denote the dollar sign in axes
ggplot(customer_data, aes(x=income)) + 
    geom_density() +
    scale_x_continuous(labels=dollar)
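The claim that the area under a density plot equals one can be checked numerically with base R's density(). The data below are simulated, and the names x, d, and area are illustrative.

```r
set.seed(1)                            # make the simulated data reproducible
x <- rnorm(1000, mean = 50, sd = 10)   # hypothetical variable

d <- density(x)                        # kernel density estimate on an even grid

# Approximate the area under the curve: sum of heights times grid spacing
area <- sum(d$y) * diff(d$x[1:2])
area
```

The result should be very close to one, up to the error of the numerical approximation.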

57 / 69

Visualization

A Little Bit of Math for Logarithm

The logarithm function, $y = \log_{b} (x)$ , looks like ....

$\log_{10} (100)$: the base 10 logarithm of 100 is 2, because $10^{2} = 100$.
$\log_{e} (x)$: the base $e$ logarithm is called the natural log, where $e = 2.718\cdots$ is the mathematical constant known as Euler's number.
$\log (x)$ or $\ln (x)$: the natural log of $x$.
$\log_{e} (7.389 \cdots)$: the natural log of $7.389 \cdots$ is 2, because $e^{2} = 7.389 \cdots$.
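These identities can be checked in R, where log10() is the base 10 logarithm and log() is the natural log by default.

```r
log10(100)            # base 10 logarithm of 100
exp(1)                # Euler's number e
exp(2)                # e squared
log(exp(2))           # natural log undoes exp()
log(100, base = 10)   # the base can also be given explicitly
```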

58 / 69

Visualization

Log Transformation

We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
A difference in income of $5,000 means something very different across people with different income levels.
- We should also consider using a log scale to reduce the variance of residuals when a variable is heavily skewed.
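A small numeric sketch of why the same $5,000 gap looks different on a log scale. The income values are hypothetical.

```r
# Two hypothetical pairs of incomes, each $5,000 apart
low  <- c(10000, 15000)
high <- c(100000, 105000)

diff(low)    # absolute gap at the low end
diff(high)   # absolute gap at the high end

# On a log10 scale the gap is larger at the low end, because
# 15000 / 10000 is a 50% rise while 105000 / 100000 is only 5%
gap_low  <- diff(log10(low))
gap_high <- diff(log10(high))
c(gap_low, gap_high)
```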

59 / 69

Visualization

Log Transformation

The log transformation makes the skewed distribution of income more normal.

ggplot(customer_data, aes(x=income)) +
  geom_density() +
  scale_x_log10(breaks = c(10, 100, 1000, 10000, 100000, 1000000),
                labels=dollar)

60 / 69

Visualization

Bar Charts and Dotplots

A bar chart is a histogram for discrete data.
- It records the frequency of every value of a categorical variable.

ggplot( data = customer_data, 
            mapping =  aes( x = marital_status )  ) + 
  geom_bar( fill="gray" )

61 / 69

Visualization

Bar Charts and Dotplots

Bar charts are most useful when the number of possible values is fairly large, like state of residence.

ggplot(customer_data, aes(x=state_of_res)) +
  geom_bar(fill="gray") +
  coord_flip()

A horizontal bar chart can be easier to read when there are several categories with long names.

62 / 69

Visualization

Bar Charts and Dotplots

Sometimes it is better to sort the data when plotting a bar chart or dot plot.

library(WVPlots)    # install.packages("WVPlots") if you have not
ClevelandDotPlot(customer_data, "state_of_res",
                 sort = 1, title="Customers by state") +
  coord_flip()

A sorted bar chart or dot plot can allow us to extract insight from the data more efficiently.

63 / 69

Visualization

Visually checking relationships between two variables

We'll often want to look at the relationship between two variables.
- Is there a relationship between the two inputs, age and income, in my data?
- If so, what kind of relationship, and how strong?
- Is there a relationship between the input, marital status, and the output, health insurance? How strong?

64 / 69

Visualization

A relationship between age and income

Reasonable age and income values can be selected.
- We'll discuss the filter() function soon.

customer_data2 <- filter(customer_data,
                         0 < age & age < 100 &
                         0 < income & income < 200000)
# correlation between age and income in the filtered data
cor(customer_data2$age, customer_data2$income)

65 / 69

Visualization

A relationship between age and income

ggplot( data = customer_data2 ) +
  geom_smooth( mapping = aes(x = age, y = income) )
ggplot(customer_data2, aes(x=age, y=income)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Income as a function of age")
library(hexbin)    # install.packages("hexbin") if you have not.
ggplot(customer_data2, aes(x=age, y=income)) +
  geom_hex() +
  geom_smooth(color = "red", se = FALSE) +
  ggtitle("Income as a function of age")

66 / 69

Visualization

A relationship between marital status and health insurance

Bar charts can be used to describe a relationship between two categorical variables.

ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar()
# side-by-side bar chart
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar([?])
# stacked bar chart
ggplot(customer_data, aes(x=marital_status, fill=health_ins)) +
  geom_bar([?])

67 / 69

Visualization

The Distribution of Marriage Status across Housing Types

cdata <- filter(customer_data, !is.na(housing_type))
ggplot(cdata, aes(x=housing_type, fill=marital_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()
ggplot(cdata, aes(x=marital_status)) +
  geom_bar(fill="darkgray") +
  facet_wrap(~housing_type, scales="free_x") +
  coord_flip()

68 / 69

Visualization

Visually checking relationships between two variables

Overlaying, faceting, and several aesthetics should always be considered with the following geometric objects:
- Scatter plot
- Smoothing curve
- Bar chart
- Stacked bar chart
- Side-by-side bar chart
- Density plot
- Histogram
- Frequency polygon
- Hexbin plot

69 / 69
