Byeong-Hak Choe
February 15, 2022
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)   # a better summary of data frames
library(scales)  # scales for ggplot
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(gapminder)  # gapminder data
library(socviz)     # data for visualization practice
p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = gdpPercap))
p + geom_line(color = "gray70", aes(group = country)) +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Year",
       y = "GDP per capita",
       title = "GDP per capita on Five Continents") +
  theme(axis.text.x = element_text(angle = 45),
        axis.title.x = element_text(margin = margin(t = 20)))
The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable.
The socviz package includes the gss_sm data frame.
gss_sm is a dataset containing an extract from the 2016 General Social Survey.
?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)
p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(sex ~ race)
The facet_grid() function is best used when you cross-classify some data by two categorical variables.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar()
The count statistic is the one geom_bar() uses by default.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))
If we want a chart of relative frequencies rather than counts, we need to use the prop statistic instead.
A call to a computed statistic inside the aes() function generically looks like this: <mapping> = <..statistic..>.
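One way to see what these computed statistics contain is to inspect the data behind a layer. The sketch below is only an illustration: toy and its region column are made-up data, not part of gss_sm. It uses ggplot2's layer_data() helper and after_stat(prop), which ggplot2 3.3.0 and later accept as the newer spelling of ..prop...

```r
library(ggplot2)

# Hypothetical toy data: six observations in three regions.
toy <- data.frame(
  region = c("North", "North", "South", "West", "West", "West")
)

# Same idea as aes(y = ..prop..); after_stat() is the newer spelling.
p <- ggplot(data = toy, mapping = aes(x = region))
p_prop <- p + geom_bar(mapping = aes(y = after_stat(prop)))

# layer_data() returns the values stat_count() computed for each bar.
stats <- layer_data(p_prop)
stats[, c("x", "count", "prop")]
```

With no grouping specified, each x category is treated as its own group, so prop comes out as 1 for every bar even though the counts differ.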
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
We need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion, and to use the total number of observations instead.
To do so, we specify group = 1 inside the aes() call.
table(gss_sm$religion)
## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159
p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, color = religion))
p + geom_bar()
If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.
p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = religion))
p + geom_bar() +
  guides(fill = "none")
If the gray bars look boring and we want to fill them with color instead, we can map the religion variable to fill in addition to mapping it to x.
The default legend describes the fill variable, which is redundant because the x-axis already labels each religion.
If we set guides(fill = "none"), the legend is removed.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
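A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables.
The default output of such a geom_bar() is a stacked bar chart, with counts on the y-axis.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
An alternative choice is to set the position argument to "fill", which rescales every bar to the same height so that the segments show proportions within each region.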
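To make the difference between the default stacking and position = "fill" concrete, the following sketch compares the top of each stack under both settings. The data frame toy and its columns region and faith are hypothetical stand-ins for bigregion and religion, and layer_data() is used only to read back what ggplot2 would draw.

```r
library(ggplot2)

# Hypothetical toy data: two regions cross-classified by two groups.
toy <- data.frame(
  region = c("North", "North", "North", "South", "South"),
  faith  = c("A", "A", "B", "A", "B")
)

p <- ggplot(data = toy, mapping = aes(x = region, fill = faith))
stacked <- layer_data(p + geom_bar())                   # default position = "stack"
filled  <- layer_data(p + geom_bar(position = "fill"))  # rescaled stacks

# Height of the whole stack in each region under each position.
stack_top <- tapply(stacked$ymax, as.numeric(stacked$x), max)
fill_top  <- tapply(filled$ymax, as.numeric(filled$x), max)
stack_top
fill_top
```

Under the default, the stack heights are the region totals; under "fill", every stack tops out at 1, so region sizes are no longer visible but within-region shares are easy to compare.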
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop..))
We can use position = "dodge" to make the bars within each region of the country appear side by side.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion))
In this case, our grouping variable is religion, so we might try mapping that to the group aesthetic.
p <- ggplot(data = gss_sm,
            mapping = aes(x = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = bigregion)) +
  facet_wrap(~ bigregion, ncol = 1)
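As a rough check on what the faceted version computes, the sketch below rebuilds the same layer on made-up data and sums the proportions within each panel. The data frame toy and its columns region and faith are hypothetical stand-ins for bigregion and religion; after_stat(prop) is the newer spelling of ..prop.. in ggplot2 3.3.0 and later.

```r
library(ggplot2)

# Hypothetical toy data: two regions cross-classified by two groups.
toy <- data.frame(
  region = c("North", "North", "North", "South", "South"),
  faith  = c("A", "A", "B", "A", "B")
)

p <- ggplot(data = toy, mapping = aes(x = faith))
p_facet <- p +
  geom_bar(position = "dodge",
           mapping = aes(y = after_stat(prop), group = region)) +
  facet_wrap(~ region, ncol = 1)

# Sum the computed proportions within each facet panel.
facet_stats  <- layer_data(p_facet)
panel_totals <- tapply(facet_stats$prop, facet_stats$PANEL, sum)
panel_totals
```

Because each panel contains a single group, the proportions are taken over that region only, so the bars in every panel sum to 1.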
?midwest
glimpse(midwest)
skim(midwest)
view(midwest)
The ggplot2 package comes with a dataset, midwest, containing information on counties in several midwestern states of the USA.
p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram()
By default, the geom_histogram() function will choose a bin size for us based on a rule of thumb (30 bins unless we specify otherwise).
p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram(bins = 10)
When drawing histograms, it is worth experimenting with bins and also optionally the origin of the x-axis.
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
We can compare histograms in the same plot using the fill mapping.
The %in% operator is a convenient way to filter on more than one term in a variable when using subset().
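The filtering step can be tried on its own, outside the plot call. This small sketch reuses the oh_wi vector from the code above and the midwest data frame that ships with ggplot2; the name two_states is introduced here only for illustration.

```r
library(ggplot2)  # provides the midwest data frame

oh_wi <- c("OH", "WI")

# %in% returns TRUE for each element on the left that appears on the right.
c("IL", "OH", "WI", "MI") %in% oh_wi

# Keep only the counties whose state is one of the two codes.
two_states <- subset(midwest, subset = state %in% oh_wi)
table(two_states$state)
```

Writing the condition as state == "OH" | state == "WI" would give the same rows, but %in% stays readable as the list of values grows.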
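p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_density()
The geom_density() function plots a kernel density estimate of a continuous variable, an alternative to binning the data in a histogram.
p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)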
We can use color (for the lines) and fill (for the body of the density curve).
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3,
                 mapping = aes(y = ..scaled..))
For geom_density(), the stat_density() function can return its default ..density.. statistic, or ..scaled.., which will give a proportional density estimate.
It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
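The relationship between these three statistics can be checked by reading back the values stat_density() computes. The sketch below assumes a single ungrouped density of area from the midwest data, so that the number of points is simply nrow(midwest); the name dens is introduced only for this example.

```r
library(ggplot2)  # provides the midwest data frame

p <- ggplot(data = midwest, mapping = aes(x = area))
dens <- layer_data(p + geom_density())

# Columns computed by stat_density() along the x grid.
head(dens[, c("x", "density", "scaled", "count")])

# scaled is the density divided by its maximum, so it peaks at 1.
max(dens$scaled)

# count is the density multiplied by the number of points in the group.
all.equal(dens$count, dens$density * nrow(midwest))
```

With several groups, as in the fill = state examples, the same relationships hold within each group rather than across the whole data frame.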
The socviz package includes the titanic data frame.
?titanic
titanic
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6
p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge", stat = "identity") +
  theme(legend.position = "top")
We can tell geom_bar() not to do any work on the variable before plotting it.
stat = "identity" means “don’t do any summary calculations”.
The socviz package also includes the oecd_sum data frame.
The oecd_sum table contains information on average life expectancy at birth within the United States, and across other OECD countries.
?oecd_sum
oecd_sum
## # A tibble: 57 × 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70   0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.5   Below
## # … with 47 more rows
p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() +
  guides(fill = "none") +
  labs(x = NULL,
       y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")
For convenience, ggplot2 also provides a related geom, geom_col(), which has exactly the same effect as geom_bar() but uses stat = "identity".
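The equivalence can be checked by comparing what the two geoms would draw. The sketch below retypes the four rows of the titanic table printed earlier into a data frame named titanic_tbl (a name chosen here for illustration, so that the example runs without socviz) and reads the bar coordinates back with layer_data().

```r
library(ggplot2)

# The four rows of the titanic table shown earlier, retyped by hand.
titanic_tbl <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic_tbl,
            mapping = aes(x = fate, y = percent, fill = sex))

bar_data <- layer_data(p + geom_bar(position = "dodge", stat = "identity"))
col_data <- layer_data(p + geom_col(position = "dodge"))

# Compare the positions and heights of the bars from both layers.
cols <- c("x", "y", "xmin", "xmax", "ymin", "ymax")
all.equal(bar_data[, cols], col_data[, cols])
```

In practice this means geom_col() is the shorter choice whenever the data frame already holds the values to be plotted.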
The position argument in geom_bar() and geom_col() can also take the value of "identity".
position = "identity" means “just plot the values as given”.
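A short sketch may help show what "as given" means in practice. It retypes the four rows of the titanic table printed earlier into a data frame named titanic_tbl (a name chosen here for illustration) and compares the default stacking with position = "identity"; the alpha value is only there so that overlapping bars would remain visible if the plot were drawn.

```r
library(ggplot2)

# The four rows of the titanic table shown earlier, retyped by hand.
titanic_tbl <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic_tbl,
            mapping = aes(x = fate, y = percent, fill = sex))

stacked  <- layer_data(p + geom_col())  # default position = "stack"
overlaid <- layer_data(p + geom_col(position = "identity", alpha = 0.5))

# Where each bar starts and ends on the y-axis under each position.
stacked[, c("x", "fill", "ymin", "ymax")]
overlaid[, c("x", "fill", "ymin", "ymax")]
```

With "identity", every bar starts at zero and ends at its own value, so bars that share an x position are drawn on top of one another rather than stacked or dodged.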