DANL 310 Lecture 06

Byeong-Hak Choe

February 15, 2022

Show me the right number

Loading the R packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)   # a better summary of data frame
library(scales)  # scales for ggplot
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(gapminder) # gapminder data
library(socviz)  # data for visualization practice

Facet to make small multiples

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))

p + geom_line(color="gray70", aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels=scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year", y = "GDP per capita",
         title = "GDP per capita on Five Continents") + 
    theme(axis.text.x = element_text(angle = 45),
          axis.title.x = element_text(margin = margin(t = 20)))
  • The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable.

The 2016 General Social Survey data

  • The socviz package includes the gss_sm data frame.
    • gss_sm is a dataset containing an extract from the 2016 General Social Survey.
?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)

Facet to make small multiples

p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) +
    geom_smooth() +
    facet_grid(sex ~ race)
  • The facet_grid() function is best used when you cross-classify some data by two categorical variables.

    • e.g., the relationship between the age and the number of children by sex and race

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar()
  • The count statistic is the one geom_bar() uses by default.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))
  • If we want a chart of relative frequencies rather than counts, we will need to get the prop statistic instead.

  • The general form for mapping an aesthetic to a computed statistic inside aes() is <mapping> = <..statistic..>, e.g., y = ..prop..

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1)) 
  • We need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion, and to use the total number of observations instead.

    • To do so we specify group = 1 inside the aes() call.
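
The effect of group = 1 can be checked on a small example. The sketch below uses a hypothetical toy data frame (not gss_sm) and after_stat(prop), which ggplot2 3.3.0 and later accept as the newer spelling of ..prop.. It compares the proportions ggplot computes with and without group = 1 against proportions computed by hand.

```r
library(ggplot2)

# Hypothetical toy data standing in for gss_sm$bigregion
toy <- data.frame(region = c("Northeast", "Midwest", "Midwest", "South",
                             "South", "South", "West", "West"))

# Proportions by hand: each count divided by the total number of rows
by_hand <- prop.table(table(toy$region))

# Without group = 1, each x-category is its own group,
# so each proportion is computed against its own count
p_each <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop)))

# With group = 1, all rows share one group,
# so the denominator is the total number of observations
p_all <- ggplot(toy, aes(x = region)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

# layer_data() returns the values ggplot computed for the layer
prop_each <- layer_data(p_each)$prop
prop_all  <- layer_data(p_all)$prop

print(prop_each)
print(prop_all)
print(by_hand)
```

Only the grouped version should agree with the hand-computed proportions.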

table(gss_sm$religion)
## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, color = religion))
p + geom_bar()
  • If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides( fill = "none" )
  • If the gray bars look boring and we want to fill them with color instead, we can map the religion variable to fill in addition to mapping it to x.

  • The default legend describes the fill variable, which is redundant here because the x-axis already labels each religion.

    • If we set guides(fill = "none"), the legend is removed.

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar()
  • A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables.

    • The default output of such geom_bar() is a stacked bar chart, with counts on the y-axis.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
  • An alternative choice is to set the position argument to "fill".
    • Each bar is rescaled to the same height, which makes it easier to compare proportions across groups.
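
As a rough check on what position = "fill" displays, the within-group proportions can be computed by hand. This is a minimal base R sketch on a hypothetical toy data frame (the region and religion values are made up, not taken from gss_sm).

```r
# Hypothetical toy data standing in for gss_sm
toy <- data.frame(
  region   = c("East", "East", "East", "East",
               "West", "West", "West", "West", "West"),
  religion = c("A", "A", "B", "B",
               "A", "B", "B", "B", "B")
)

# Cross-classify the two categorical variables (rows: region, columns: religion)
counts <- table(toy$region, toy$religion)

# margin = 1 divides each row by its row total,
# which gives the share of each religion within a region
within_region <- prop.table(counts, margin = 1)

print(counts)
print(within_region)
```

Each row of within_region sums to 1, which corresponds to every filled bar reaching the same height.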

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop..))
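  • We can use position = "dodge" to make the bars within each region of the country appear side by side.
    • With y = ..prop.. and no group aesthetic, however, each bar is its own group, so every proportion equals 1 and all bars have the same height.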

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion))
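  • In this case our grouping variable is religion, so we might try mapping that to the group aesthetic.
    • The proportions now sum to 1 within each religion across regions, not within each region, which is still not the breakdown we want.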
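
To see which denominator group = religion produces, the computed proportions can be inspected with layer_data(). The sketch below assumes a hypothetical toy data frame and uses after_stat(prop), the newer spelling of ..prop.. in ggplot2 3.3.0 and later.

```r
library(ggplot2)

# Hypothetical toy data standing in for gss_sm
toy <- data.frame(
  region   = c("East", "East", "East", "West", "West", "West", "West", "East"),
  religion = c("A", "A", "B", "A", "B", "B", "B", "B")
)

p <- ggplot(toy, aes(x = region, fill = religion)) +
  geom_bar(position = "dodge",
           aes(y = after_stat(prop), group = religion))

# One row per bar; `group` identifies the religion each bar belongs to
d <- layer_data(p)

# Proportions are computed within each group, so they sum to 1 per religion
sums_by_religion <- tapply(d$prop, d$group, sum)
print(sums_by_religion)

# For comparison: the within-region shares, computed by hand
within_region <- prop.table(table(toy$region, toy$religion), margin = 1)
print(within_region)
```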

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = bigregion)) +
    facet_wrap(~ bigregion, ncol = 1)
  • Instead, we can ask ggplot to give us a proportional bar chart of religious affiliation, and then facet that by region.
    • The proportions are calculated within each panel, which is the breakdown we wanted.

Midwest demographics

?midwest
glimpse(midwest)
skim(midwest)
view(midwest)
  • The ggplot2 package comes with a dataset, midwest, containing information on counties in several midwestern states of the USA.

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram()
  • By default, the geom_histogram() function uses 30 bins and prints a message suggesting that we choose a better value with bins or binwidth.

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram(bins = 10)
  • When drawing histograms, it is worth experimenting with the bins argument and, optionally, the origin of the x-axis.

oh_wi <- c("OH", "WI")

p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)
  • We can facet histograms by some variable of interest, or as here we can compare them in the same plot using the fill mapping.
    • We subset the data here to pick out just two states.
    • The %in% operator is a convenient way to filter on more than one term in a variable when using subset().
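
The behavior of %in% inside subset() can be seen on a small data frame. This is a minimal base R sketch; the states data frame and its percollege values are made up for illustration.

```r
# Hypothetical toy data; the percollege values are made up
states <- data.frame(
  state      = c("IL", "OH", "WI", "OH", "MI", "WI"),
  percollege = c(19.6, 21.1, 18.3, 25.0, 17.2, 22.4)
)

oh_wi <- c("OH", "WI")

# %in% returns one TRUE/FALSE per row: is this state in the set?
in_set <- states$state %in% oh_wi
print(in_set)

# subset() keeps the rows where the condition is TRUE
picked <- subset(states, subset = state %in% oh_wi)

# The same filter written with == and |, which grows clumsy as the set gets longer
picked_long <- subset(states, subset = state == "OH" | state == "WI")

print(picked)
```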

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_density()
  • When working with a continuous variable, an alternative to binning the data and making a histogram is using the geom_density() function.
    • It calculates a kernel density estimate of the underlying distribution.
    • With the default ..density.. statistic, the area under the whole curve is 1.
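
The claim about the area under the curve can be checked numerically. The sketch below uses base R's density() function on simulated data (an assumption for illustration, not the midwest data) and approximates the area with the trapezoid rule.

```r
set.seed(310)

# Simulated data standing in for a continuous variable such as area
x <- rnorm(500)

# Kernel density estimate: a grid of x values and estimated densities
kde <- density(x)

# Trapezoid rule: width of each grid step times the average height over it
step_widths <- diff(kde$x)
mid_heights <- (head(kde$y, -1) + tail(kde$y, -1)) / 2
area <- sum(step_widths * mid_heights)

print(area)
```

The result is only approximately 1 because the estimate is evaluated on a finite grid.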

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)
  • We can use color (for the lines) and fill (for the body of the density curve).
    • These figures often look quite nice.
    • When there are several filled areas on the plot, as in this case, the overlap can become hard to read.

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
  • For geom_density(), the underlying stat_density() function returns the ..density.. statistic by default. It can instead return ..scaled.., the density estimate rescaled so that its maximum is 1.

  • It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
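
The relationship between the three statistics can be inspected with layer_data(). This is a sketch on simulated data (an assumption for illustration), using after_stat(scaled) as the newer spelling of ..scaled.. in ggplot2 3.3.0 and later.

```r
library(ggplot2)

set.seed(310)

# Simulated data standing in for a continuous variable; one group only
toy <- data.frame(x = rnorm(200))

p <- ggplot(toy, aes(x = x)) +
  geom_density(aes(y = after_stat(scaled)))

# The computed layer data carries density, scaled, and count side by side
d <- layer_data(p)

# scaled: the density divided by its maximum, so the peak is 1
scaled_by_hand <- d$density / max(d$density)

# count: the density times the number of points
count_by_hand <- d$density * nrow(toy)

print(head(d[, c("x", "density", "scaled", "count")]))
```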

Titanic data

  • The socviz package includes the titanic data frame.
    • It contains aggregated data on who survived the Titanic disaster, by sex.
?titanic
titanic
##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6

Avoid transformations when necessary

p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))

p + geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "top")
  • We can tell geom_bar() not to do any work on the variable before plotting it.
    • stat = "identity" means “don’t do any summary calculations”.

OECD data

  • The socviz package also includes the oecd_sum data frame.
    • The oecd_sum table contains information on average life expectancy at birth in the United States and across other OECD countries.
?oecd_sum
oecd_sum
## # A tibble: 57 × 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70   0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.5   Below
## # … with 47 more rows

Avoid transformations when necessary

p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides( fill = "none" ) + 
  labs(x = NULL, y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD\naverage life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham,\nWashington Post, December 27th 2017.")
  • For convenience, ggplot also provides a related geom, geom_col(), which works like geom_bar() except that its default stat is "identity".

  • The position argument in geom_bar() and geom_col() can also take the value of "identity".

    • position = "identity" means “just plot the values as given”.
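
The equivalence between geom_col() and geom_bar(stat = "identity") can be checked directly. The sketch below retypes the titanic table printed earlier in the lecture, so that it does not depend on socviz, and compares the bar heights from both geoms.

```r
library(ggplot2)

# The titanic table as printed in the lecture, retyped by hand
titanic <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

base <- ggplot(titanic, aes(x = fate, y = percent, fill = sex))

# geom_bar() told not to summarize the data
p_bar <- base + geom_bar(position = "dodge", stat = "identity")

# geom_col() uses the values as given by default
p_col <- base + geom_col(position = "dodge")

# Bar heights drawn by each layer
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y

print(heights_bar)
print(heights_col)
```

Both layers draw the percent values exactly as they appear in the table, with no counting step in between.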