Byeong-Hak Choe
February 15, 2022
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)   # a better summary of data frame
library(scales)  # scales for ggplot
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(gapminder)  # gapminder data
library(socviz)     # data for visualization practice
p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color = "gray70", aes(group = country)) +
  geom_smooth(size = 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels = scales::dollar) +
  facet_wrap(~ continent, ncol = 5) +
  labs(x = "Year",
       y = "GDP per capita",
       title = "GDP per capita on Five Continents") +
  theme(axis.text.x = element_text(angle = 45),
        axis.title.x = element_text(margin = margin(t = 20)))
The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable.
The socviz package includes the gss_sm data frame.
gss_sm is a dataset containing an extract from the 2016 General Social Survey.
?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)
p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_grid(sex ~ race)
The facet_grid() function is best used when you cross-classify some data by two categorical variables.
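As a rough sketch of the difference between the two faceting functions, the example below builds one plot of each kind and counts the panels. It uses the midwest data that ships with ggplot2 rather than gss_sm, purely so that it runs without socviz; the choice of variables (percollege, state, inmetro) is illustrative.

```r
library(ggplot2)

# Base plot: percent college-educated by county (midwest ships with ggplot2)
p <- ggplot(data = midwest, mapping = aes(x = percollege))

# One categorical variable: facet_wrap() lays the panels out in sequence
p_wrap <- p + geom_histogram(bins = 20) + facet_wrap(~ state, ncol = 5)

# Two categorical variables: facet_grid() crosses rows (inmetro) with columns (state)
p_grid <- p + geom_histogram(bins = 20) + facet_grid(inmetro ~ state)

# Count the distinct panels that contain data in each layout
n_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_grid <- length(unique(layer_data(p_grid)$PANEL))
```

With five states and two values of inmetro, the wrapped layout should have 5 panels and the grid 10.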
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()
The count statistic is the one geom_bar() uses by default.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))
If we want a chart of relative frequencies rather than counts, we will need to get the prop statistic instead.
Our call to a statistic from within the aes() function generically looks like this: <mapping> = <..statistic..>.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
We need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion, and to use the total number of observations instead.
We do this by specifying group = 1 inside the aes() call.
table(gss_sm$religion)
## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159
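To see what group = 1 changes in the proportion calculation, here is a minimal sketch on a made-up four-row data frame (toy and its region column are hypothetical, not part of gss_sm). It uses after_stat(prop), the newer spelling of ..prop.. available from ggplot2 3.3.0, and layer_data() to read back the values ggplot computed.

```r
library(ggplot2)

# Hypothetical toy data: three observations in region "A", one in region "B"
toy <- data.frame(region = c("A", "A", "A", "B"))

# Without group = 1, each bar is its own group, so every proportion is 1
p_each <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop)))

# With group = 1, the denominator is the total number of observations
p_all <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

# layer_data() returns the values ggplot computed for the first layer
prop_each <- as.numeric(layer_data(p_each)$y)
prop_all  <- as.numeric(layer_data(p_all)$y)
```

Here prop_each should be 1 for both bars, while prop_all should be 0.75 and 0.25, which is the relative-frequency chart we actually want.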
p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()
If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.
p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() +
  guides(fill = "none")
If the gray bars look boring and we want to fill them with color instead, we can map the religion variable to fill in addition to mapping it to x.
The default legend describes the fill variable, which is redundant because the same categories already label the x-axis.
If we set guides(fill = "none"), the legend is removed.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables.
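When two categorical variables are cross-classified this way, the position argument decides how the pieces within each bar are arranged. The sketch below, on a made-up data frame (toy, region and faith are hypothetical names), compares the default stacking of counts with position = "fill", which rescales every stack to a total height of 1.

```r
library(ggplot2)

# Hypothetical toy data: region "A" has three rows, region "B" has one
toy <- data.frame(
  region = c("A", "A", "A", "B"),
  faith  = c("x", "x", "y", "x")
)

p <- ggplot(toy, aes(x = region, fill = faith))

# Default position ("stack"): bar heights are counts
stacked <- layer_data(p + geom_bar())

# position = "fill": every stack is rescaled to a total height of 1
filled <- layer_data(p + geom_bar(position = "fill"))

top_stacked <- max(stacked$ymax)  # tallest stack, in counts
top_filled  <- max(filled$ymax)   # tallest stack, as a proportion
```

The stacked version makes it easy to compare totals across regions; the filled version gives up the totals in exchange for comparable proportions.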
The default output of geom_bar() is a stacked bar chart, with counts on the y-axis.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
An alternative is to set the position argument to "fill".
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..))
We can use position = "dodge" to make the bars within each region of the country appear side by side.
p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = religion))
Our grouping variable is religion, so we might try mapping that to the group aesthetic.
p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = bigregion)) +
  facet_wrap(~ bigregion, ncol = 1)
?midwest
glimpse(midwest)
skim(midwest)
view(midwest)
The ggplot2 package comes with a dataset, midwest, containing information on counties in several midwestern states of the USA.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram()
By default, the geom_histogram() function will choose a bin size for us based on a rule of thumb.
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_histogram(bins = 10)
When drawing histograms, it is worth experimenting with bins and also optionally the origin of the x-axis.
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
We can compare histograms in the same plot using the fill mapping.
The %in% operator is a convenient way to filter on more than one term in a variable when using subset().
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density()
For a continuous variable, we can draw a kernel density estimate instead of a histogram with the geom_density() function.
p <- ggplot(data = midwest, mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)
We can use color (for the lines) and fill (for the body of the density curve).
oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
For geom_density(), the stat_density() function can return its default ..density.. statistic, or ..scaled.., which will give a proportional density estimate.
It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
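The relationship between these computed statistics can be checked directly. The following sketch uses layer_data() to pull out what stat_density() computed for the midwest area variable; the helper names (dens, peak_scaled, count_check) are only for illustration.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area)) + geom_density()

# Computed variables from stat_density() for the first layer
dens <- layer_data(p)

# scaled is the density rescaled so that its maximum is 1
peak_scaled <- max(dens$scaled)

# count is the density multiplied by the number of observations
count_check <- all.equal(dens$count, dens$density * nrow(midwest))
```

Because scaled always peaks at 1, it is handy for comparing the shapes of groups of very different sizes, whereas count keeps the group sizes visible.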
The socviz package includes the titanic data frame.
?titanic
titanic
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6
p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge", stat = "identity") +
  theme(legend.position = "top")
We can tell geom_bar() not to do any work on the variable before plotting it.
stat = "identity" means “don’t do any summary calculations”.
The socviz package also includes the oecd_sum data frame.
The oecd_sum table contains information on average life expectancy at birth within the United States, and across other OECD countries.
?oecd_sum
oecd_sum
## # A tibble: 57 × 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70   0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.5   Below
## # … with 47 more rows
p <- ggplot(data = oecd_sum, mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() +
  guides(fill = "none") +
  labs(x = NULL,
       y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")
For convenience, ggplot also provides a related geom, geom_col(), which behaves like geom_bar() except that its default stat is stat = "identity".
The position argument in geom_bar() and geom_col() can also take the value of "identity".
position = "identity" means “just plot the values as given”.
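As a small check on the relationship between geom_bar(stat = "identity") and geom_col(), the sketch below rebuilds the titanic summary table by hand from the output printed earlier in these notes (titanic_tbl is a hypothetical name, used so the example runs without socviz) and compares the bar heights the two geoms produce.

```r
library(ggplot2)

# Summary table copied from the titanic output shown earlier in these notes
titanic_tbl <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(titanic_tbl, aes(x = fate, y = percent, fill = sex))

# geom_bar() must be told to skip its default counting step
bars <- layer_data(p + geom_bar(stat = "identity", position = "dodge"))

# geom_col() uses stat = "identity" by default
cols <- layer_data(p + geom_col(position = "dodge"))

# Both layers should plot the percent values exactly as given
same_heights <- isTRUE(all.equal(bars$y, cols$y))
```

If same_heights is TRUE, the two calls are interchangeable for pre-summarized data like this, and geom_col() is simply the shorter way to write it.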