Tutoring and TA-ing Schedules
Shortcuts for RStudio and RScript
Mac: <-
Windows: <-
NY_school_enrollment_socioecon.csv: the units of observation are New York county and year.
FIPS | year | county_name | pincp | c01_001 | c02_002 |
---|---|---|---|---|---|
36001 | 2015 | Albany | 55793 | 84463 | 4.7 |
For example, the observation above means that in Albany county in year 2015 ...
Graphing Template
To make a graph, replace the bracketed sections in the template below with a data.frame, a geom function, or a collection of mappings such as x = VAR_1 and y = VAR_2.
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Class Exercises
tvshows_web <- read_csv('https://bcdanl.github.io/data/tvshows.csv')
Describe the relationship between audience size (GRP) and audience engagement (PE) using ggplot. Explain the relationship in words.
What aesthetic property would you consider?
Would you do faceting?
How are these two plots similar?
A geom_*() is the geometrical object that a plot uses to represent data.
Examples: geom_bar(); geom_line(); geom_boxplot(); geom_point(); geom_smooth().
We can use different geom_*() functions to plot the same data.
To change the geom in your plot, change the geom function that you add to ggplot().
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
geom_*() Functions and Aesthetic Mappings
Every geom_*() function takes specific mapping arguments.
Not every aesthetic works with every geom_*() function. You can set the shape of a geom_point(), but you cannot set the shape of a geom_smooth(); you can, however, set the linetype of a geom_smooth().
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy), linetype = 3)
geom_*() functions and group aesthetic
You can set the group aesthetic to a categorical variable to draw multiple objects.
ggplot2 will draw a separate object for each unique value of the grouping variable.
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
geom_*() functions and group aesthetic
ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example).
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )
Multiple geom_*() functions
To display multiple geoms in the same plot, add multiple geom_*() functions to ggplot():
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
Multiple geom_*() functions
If you place mappings in a geom_*() function, ggplot2 will treat them as local mappings for the layer.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
Multiple geom_*() functions
Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
Consider a basic bar chart, as drawn with geom_bar().
The following bar chart displays the total number of diamonds in the ggplot2::diamonds dataset, grouped by cut.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.
Many graphs, including bar charts, calculate new values to plot:
geom_bar(), geom_histogram(), and geom_freqpoly() bin your data and then plot bin counts, the number of observations that fall in each bin.
geom_smooth() fits a model to your data and then plots predictions from the model.
geom_boxplot() computes a summary of the distribution and then displays a specially formatted box.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
The figure below describes how this process works with geom_bar().
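To make the idea of a stat more concrete, the sketch below inspects the values that geom_bar() computes before drawing and compares them with counts tabulated by hand. It is only an illustration, not part of the original slides; it assumes ggplot2 is installed and uses its layer_data() helper, and the object names p, computed, and manual are made up for this example.

```r
library(ggplot2)

# Build the bar chart from the slides without drawing it yet.
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() returns the data frame after the stat has run;
# for geom_bar() the default stat adds a `count` column.
computed <- layer_data(p)

# The same counts, tabulated by hand in base R.
manual <- table(diamonds$cut)

print(computed[, c("x", "count")])
print(manual)
```

The two printouts should list the same five counts, one per cut, which is the new variable that the bar heights are drawn from.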
Observed Value vs. Number of Observations
There are three reasons you might need to use a stat explicitly:
You might want to override the default stat.
demo <- tribble(   # for a simple data.frame
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
Count vs. Proportion
There are three reasons you might need to use a stat explicitly:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
Stat summary
There are three reasons you might need to use a stat explicitly:
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
Exercises
What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
What does geom_col() do? How is it different from geom_bar()?
Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
What variables does stat_smooth() compute? What parameters control its behavior?
Exercises
In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), fill = color))
Stacked bar charts with fill
aesthetic
Map the fill aesthetic to another variable.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))
Stacked bar charts with fill
aesthetic
The stacking is performed automatically by the position adjustment specified by the position argument.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
position = "fill" and position = "dodge"
Other position options: fill or dodge.
position = "fill" works like stacking, but makes each set of stacked bars the same height.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])
position = "dodge" places overlapping objects directly beside one another.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = [?])
Overplotting and position = "jitter"
The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each other.
You can avoid the overlapping problem by setting the position adjustment to jitter.
position = "jitter" adds a small amount of random noise to each point.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = [?])
Exercises
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()
What parameters to geom_jitter() control the amount of jittering?
Compare and contrast geom_jitter() with geom_count().
What's the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point.
There are a number of other coordinate systems that are occasionally helpful.
coord_flip()
coord_flip() switches the x and y axes.
This is useful, for example, if you want horizontal boxplots.
It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
coord_quickmap()
coord_quickmap() sets the aspect ratio correctly for maps.
county <- map_data("county")   # Map data for US Counties
ny <- filter(county,           # We will discuss 'filter()' in the next chapter
             region == "new york")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
ggplot(ny, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
Exercises
What does labs() do? Read the documentation.
What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
ggplot Grammar
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
Get to know data before modeling
Example
Suppose your goal is to build a model to predict which of our customers don't have health insurance.
We've collected a dataset of customers whose health insurance status you know.
We've also identified some customer properties that you believe help predict the probability of insurance coverage:
Use the summary() or skimr::skim() command to take your first look at the data.
library(tidyverse)
library(skimr)
path <- "PATH_NAME_FOR_THE_FILE_custdata.RDS"
customer_data <- readRDS(path)
# The following is the same data file on my website.
path_web <- "https://bcdanl.github.io/data/custdata.csv"
customer_data <- read.table(path_web, sep = ',', header = TRUE)
skim(customer_data)
Typical problems revealed by data summaries
At this stage, we're looking for several common issues:
Generally, the goal of modeling is to make good predictions on typical cases, or to identify causal relationships.
A model that is highly skewed to predict a rare case correctly may not always be the best model overall.
Missing values
The variable is_employed is missing for more than a third of the data.
## is_employed
## FALSE: 2321
## TRUE :44887
## NA's :24333
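A quick way to check for missing values across every column at once is to count NAs per column. The sketch below uses a small made-up data frame named toy rather than the real customer_data; the column values are hypothetical and only there to make the example self-contained.

```r
# Hypothetical toy data standing in for customer_data.
toy <- data.frame(
  is_employed = c(TRUE, NA, FALSE, NA, TRUE, TRUE),
  income      = c(52000, 18000, NA, 40000, 61000, 23000)
)

# is.na() gives a logical matrix; column sums count the NAs,
# and column means give the share of rows that are missing.
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

print(na_count)
print(round(na_share, 2))
```

Applied to the course data, the same two calls on customer_data would report the number and share of missing values for is_employed alongside every other variable.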
Data range and variation
We should pay attention to how much the values in the data vary.
skim(customer_data$income)
skim(customer_data$age)
Units
IncomeK is defined as IncomeK = customer_data$income/1000.
IncomeK <- customer_data$income/1000
skim(IncomeK)
A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
Strive for clarity. Make the data stand out. Specific tips for increasing clarity include these:
Visualization is an iterative process. Its purpose is to answer questions about the data.
Visually checking distributions for a single variable
The above visualizations help us answer questions like these:
What is the peak value of the distribution?
How many peaks are there in the distribution (unimodality versus bimodality)?
How normal is the data?
How much does the data vary? Is it concentrated in a certain interval or in a certain category?
Visually checking distributions for a single variable
ggplot(data = customer_data) + geom_density( mapping = aes(x = age) )
Visually checking distributions for a single variable
Histograms
A basic histogram bins a variable into fixed-width buckets and returns the number of data points that fall into each bucket as a height.
A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.
ggplot(data = customer_data, aes(x = gas_usage)) +
  geom_histogram(binwidth = 10, fill = "gray")
skim(customer_data$gas_usage)
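The binning step behind a histogram can be sketched by hand in base R. The vector x below is made up for illustration (it is not the gas_usage column), and the bucket edges are an assumption; geom_histogram() chooses its own bin boundaries, so this mirrors the idea rather than its exact output.

```r
# Hypothetical values standing in for a numeric column.
x <- c(3, 12, 18, 25, 27, 41, 44, 47, 49, 95)

# Fixed-width buckets of width 10: [0,10), [10,20), ..., [90,100).
breaks <- seq(0, 100, by = 10)
bins   <- cut(x, breaks = breaks, right = FALSE)

# The height of each histogram bar is the number of points in its bucket.
counts <- table(bins)
print(counts)
```

Most of the made-up values fall between 0 and 50, while the single value of 95 sits alone in the last bucket, which is how a histogram makes an outlier visible.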
Data dictionary entry for gas_usage
Treating 001, 002, and 003 as numerical values could potentially lead to incorrect conclusions in our analysis.
Density plots
We can think of a density plot as a continuous histogram of a variable.
library(scales)   # to denote the dollar sign in axes
ggplot(customer_data, aes(x = income)) +
  geom_density() +
  scale_x_continuous(labels = dollar)
A Little Bit of Math for Logarithm
$\log_{10}(100)$: the base-10 logarithm of 100 is 2, because $10^{2} = 100$.
$\log_{e}(x)$: the base-$e$ logarithm is called the natural log, where $e = 2.718\cdots$ is the mathematical constant known as Euler's number.
$\log(x)$ or $\ln(x)$: the natural log of $x$.
$\log_{e}(7.389\cdots)$: the natural log of $7.389\cdots$ is 2, because $e^{2} = 7.389\cdots$.
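These identities can be checked directly in R. The short sketch below uses only base R functions; the variable names are chosen for this illustration.

```r
# Base-10 logarithm: 10^2 = 100, so log10(100) is 2.
log_base10 <- log10(100)

# In R, log() with no base argument is the natural log (base e).
e_squared   <- exp(2)          # e^2, about 7.389
log_natural <- log(e_squared)  # recovers 2

# Any other base can be given explicitly.
log_general <- log(100, base = 10)

print(c(log_base10, log_natural, log_general))
```

Note that R has no ln() function; log() plays that role.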
Log Transformation
We should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units.
A difference in income of $5,000 means something very different across people with different income levels.
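A small worked example may help show why the log scale fits this situation. The income figures below are made up for illustration: the same $5,000 gap is compared at a low and a high income level, first in dollars and then on the log10 scale.

```r
# Two hypothetical pairs of incomes, each $5,000 apart.
low  <- c(10000, 15000)
high <- c(100000, 105000)

# On the absolute scale the two gaps are identical.
abs_gap_low  <- diff(low)
abs_gap_high <- diff(high)

# On the log10 scale a gap reflects the ratio of the two incomes,
# so the gap at the low level is much larger.
log_gap_low  <- diff(log10(low))
log_gap_high <- diff(log10(high))

print(c(abs_gap_low, abs_gap_high))
print(c(log_gap_low, log_gap_high))
```

The first pair is a 50% increase and the second a 5% increase, and the log-scale gaps preserve that ordering while the dollar gaps hide it.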
Log Transformation
ggplot(customer_data, aes(x=income)) + geom_density() + scale_x_log10(breaks = c(10, 100, 1000, 10000, 100000, 1000000), labels=dollar)
Bar Charts and Dotplots
ggplot( data = customer_data, mapping = aes( x = marital_status ) ) + geom_bar( fill="gray" )
Bar Charts and Dotplots
ggplot(customer_data, aes(x=state_of_res)) + geom_bar(fill="gray") + coord_flip()
Bar Charts and Dotplots
library(WVPlots)   # install.packages("WVPlots") if you have not
ClevelandDotPlot(customer_data, "state_of_res", sort = 1, title = "Customers by state") +
  coord_flip()
Visually checking relationships between two variables
We'll often want to look at the relationship between two variables.
Is there a relationship between the two inputs, age and income, in my data?
If so, what kind of relationship, and how strong?
Is there a relationship between the input, marital status, and the output, health insurance? How strong?
A relationship between age and income
We will discuss the filter() function soon.
customer_data2 <- filter(customer_data,
                         0 < age & age < 100 &
                           0 < income & income < 200000)
# Compute the correlation on the filtered data
cor(customer_data2$age, customer_data2$income)
A relationship between age and income
ggplot(data = customer_data2) +
  geom_smooth(mapping = aes(x = age, y = income))
ggplot(customer_data2, aes(x = age, y = income)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Income as a function of age")
library(hexbin)   # install.packages("hexbin") if you have not
ggplot(customer_data2, aes(x = age, y = income)) +
  geom_hex() +
  geom_smooth(color = "red", se = FALSE) +
  ggtitle("Income as a function of age")
A relationship between marital status and health insurance
ggplot(customer_data, aes(x = marital_status, fill = health_ins)) +
  geom_bar()
# side-by-side bar chart
ggplot(customer_data, aes(x = marital_status, fill = health_ins)) +
  geom_bar([?])
# stacked bar chart
ggplot(customer_data, aes(x = marital_status, fill = health_ins)) +
  geom_bar([?])
The Distribution of Marriage Status across Housing Types
cdata <- filter(customer_data, !is.na(housing_type))
ggplot(cdata, aes(x = housing_type, fill = marital_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip()
ggplot(cdata, aes(x = marital_status)) +
  geom_bar(fill = "darkgray") +
  facet_wrap(~housing_type, scales = "free_x") +
  coord_flip()
Visually checking relationships between two variables
Overlaying, faceting, and several aesthetics should always be considered with the following geometric objects: