DANL 310 Lecture 06

Byeong-Hak Choe

February 15, 2022

Show me the right number

Loading the R packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(skimr)   # a better summary of data frame
library(scales)  # scales for ggplot

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(gapminder) # gapminder data
library(socviz)  # data for visualization practice

Facet to make small multiples

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))

p + geom_line(color="gray70", aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels=scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year", y = "GDP per capita",
         title = "GDP per capita on Five Continents") + 
    theme(axis.text.x = element_text(angle = 45),
          axis.title.x = element_text(margin = margin(t = 20)))

The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable.
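To see how facet_wrap() arranges its panels, the sketch below uses a small hypothetical data frame (toy and grp are illustrative names, not lecture data) and inspects the panel layout that ggplot2 computes. ggplot_build() is used here only to look inside the plot object.

```r
library(ggplot2)

# Hypothetical toy data: five groups, three observations each
toy <- data.frame(
  grp = rep(c("a", "b", "c", "d", "e"), each = 3),
  x   = rep(1:3, times = 5),
  y   = c(1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6, 7)
)

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~ grp, ncol = 2)   # wrap five panels into two columns

# One row per panel, with its ROW and COL position in the grid
panel_layout <- ggplot_build(p)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL", "grp")]
```

With five panels and ncol = 2, the panels wrap onto three rows, which is why ncol = 5 in the gapminder plot puts all continents on one row.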

The 2016 General Social Survey data

The socviz package includes the gss_sm data frame.
- gss_sm is a dataset containing an extract from the 2016 General Social Survey.

?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)

Facet to make small multiples

p <- ggplot(data = gss_sm,
            mapping = aes(x = age, y = childs))

p + geom_point(alpha = 0.2) +
    geom_smooth() +
    facet_grid(sex ~ race)

The facet_grid() function is best used when you cross-classify some data by two categorical variables.
- e.g., the relationship between age and the number of children, broken down by sex and race
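The formula in facet_grid() reads as rows ~ columns. A minimal sketch with hypothetical toy data (the values below are made up for illustration and are not from gss_sm) shows the resulting grid:

```r
library(ggplot2)

# Hypothetical toy data with two categorical variables
toy <- data.frame(
  sex  = rep(c("Female", "Male"), each = 6),
  race = rep(c("A", "B", "C"), times = 4),
  age  = c(25, 30, 35, 40, 45, 50, 28, 33, 38, 43, 48, 53),
  kids = c(0, 1, 1, 2, 2, 3, 0, 0, 1, 2, 3, 3)
)

p <- ggplot(data = toy, mapping = aes(x = age, y = kids)) +
  geom_point() +
  facet_grid(sex ~ race)   # sex defines the rows, race the columns

# Inspect the grid: 2 rows (sex) by 3 columns (race)
panel_layout <- ggplot_build(p)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL", "sex", "race")]
```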

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar()

The count statistic is the one geom_bar() uses by default.
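One way to check what geom_bar() computed is to pull the layer's data back out of the plot with layer_data(). The sketch below uses a hypothetical toy data frame rather than gss_sm and compares the computed counts with a base R tally.

```r
library(ggplot2)

# Hypothetical toy data: one row per respondent
toy <- data.frame(
  region = c("North", "North", "North", "South", "West", "West")
)

p <- ggplot(data = toy, mapping = aes(x = region)) +
  geom_bar()

# layer_data() returns the values the stat computed for the first layer
bar_data <- layer_data(p)
bar_data[, c("x", "count")]

# The same tally with base R
table(toy$region)
```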

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))

If we want a chart of relative frequencies rather than counts, we will need to get the prop statistic instead.
Inside aes(), we refer to a computed statistic by wrapping its name in double dots, so the call generically looks like this: <mapping> = <..statistic..>.

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))

We need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion, and to use the total number of observations instead.
- To do so we specify group = 1 inside the aes() call.
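The effect of group = 1 can be checked numerically. The sketch below uses hypothetical toy data and writes the statistic as after_stat(prop), which is the newer spelling of ..prop.. (available from ggplot2 3.3.0); both refer to the same computed value.

```r
library(ggplot2)

# Hypothetical toy data: 3 North, 1 South, 2 West
toy <- data.frame(
  region = c("North", "North", "North", "South", "West", "West")
)

# Without a group, each x category is its own group
p_nogroup <- ggplot(toy, aes(x = region)) +
  geom_bar(mapping = aes(y = after_stat(prop)))

# With group = 1, all rows form a single group
p_group <- ggplot(toy, aes(x = region)) +
  geom_bar(mapping = aes(y = after_stat(prop), group = 1))

prop_nogroup <- layer_data(p_nogroup)$prop   # every bar is 1
prop_group   <- layer_data(p_group)$prop     # 3/6, 1/6, 2/6
```

Because the proportion is computed within each group, the ungrouped version divides every count by itself.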

Geoms can transform data

table(gss_sm$religion)

## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, color = religion))
p + geom_bar()

If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.

Geoms can transform data

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides( fill = "none" )

If the gray bars look boring and we want to fill them with color instead, we can map the religion variable to fill in addition to mapping it to x.
The default legend describes the fill variable, which is redundant here because the x-axis already labels each religion.
- If we set guides(fill = "none"), the legend is removed.

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

A more appropriate use of the fill aesthetic with geom_bar() is to cross-classify two categorical variables.
- The default output of such geom_bar() is a stacked bar chart, with counts on the y-axis.

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

An alternative choice is to set the position argument to "fill".
- This stretches every bar to the same height, which makes it easier to compare proportions across groups.
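As a rough check of what position = "fill" does, the sketch below (hypothetical toy data, not gss_sm) measures the height of each stacked segment and confirms that the segments within a bar add up to 1.

```r
library(ggplot2)

# Hypothetical toy data: two regions, two religions
toy <- data.frame(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

p <- ggplot(data = toy, mapping = aes(x = region, fill = religion)) +
  geom_bar(position = "fill")

seg <- layer_data(p)
seg$height <- seg$ymax - seg$ymin   # height of each stacked segment

# Total height per bar (x position 1 = North, 2 = South)
bar_totals <- tapply(seg$height, as.integer(seg$x), sum)
bar_totals
```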

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop..))

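We can use position = "dodge" to make the bars within each region of the country appear side by side. However, with y mapped to ..prop.. and no explicit grouping, each proportion is computed within its own region-religion combination, so every bar has a height of 1.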

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion))

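In this case our grouping variable is religion, so we might try mapping that to the group aesthetic. The bars now differ in height, but the proportions sum to 1 within each religion across regions, not within each region.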
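To see which denominator group = religion produces, this sketch (hypothetical toy data; after_stat(prop) is the newer spelling of ..prop..) sums the computed proportions by group.

```r
library(ggplot2)

# Hypothetical toy data: two regions, two religions
toy <- data.frame(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

p <- ggplot(data = toy, mapping = aes(x = region, fill = religion)) +
  geom_bar(position = "dodge",
           mapping = aes(y = after_stat(prop), group = religion))

d <- layer_data(p)

# Proportions add up to 1 within each religion (group 1 = A, group 2 = B)
sum_by_religion <- tapply(d$prop, d$group, sum)
sum_by_religion
```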

Frequency plots the slightly awkward way

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = bigregion)) +
    facet_wrap(~ bigregion, ncol = 1)

Instead, we can ask ggplot to give us a proportional bar chart of religious affiliation, and then facet that by region.
- The proportions are calculated within each panel, which is the breakdown we wanted.
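The within-panel calculation can be verified the same way. In this sketch (hypothetical toy data, with after_stat(prop) standing in for ..prop..), the proportions in each facet panel sum to 1.

```r
library(ggplot2)

# Hypothetical toy data: two regions, two religions
toy <- data.frame(
  region   = c("North", "North", "North", "South", "South"),
  religion = c("A", "A", "B", "A", "B")
)

p <- ggplot(data = toy, mapping = aes(x = religion)) +
  geom_bar(position = "dodge",
           mapping = aes(y = after_stat(prop), group = region)) +
  facet_wrap(~ region, ncol = 1)

d <- layer_data(p)

# Proportions add up to 1 within each panel (one panel per region)
sum_by_panel <- tapply(d$prop, d$PANEL, sum)
sum_by_panel
```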

Midwest demographics

?midwest
glimpse(midwest)
skim(midwest)
view(midwest)

The ggplot2 package comes with a dataset, midwest, containing information on counties in several midwestern states of the USA.

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram()

By default, the geom_histogram() function uses 30 bins and prints a message suggesting that we pick a better value with bins or binwidth.

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram(bins = 10)

When drawing histograms it is worth experimenting with the number of bins and, optionally, the origin of the x-axis.
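Besides bins, geom_histogram() also accepts binwidth and boundary, which set the bin size and where the bin edges are aligned. The values 0.01 and 0 below are arbitrary choices for illustration, not recommendations from the lecture.

```r
library(ggplot2)

# Fix the number of bins
p_bins <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_histogram(bins = 10)

# Fix the bin width and align bin edges to start at 0
p_width <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_histogram(binwidth = 0.01, boundary = 0)

d_bins  <- layer_data(p_bins)
d_width <- layer_data(p_width)

# Every county lands in exactly one bin under either choice
sum(d_bins$count)
sum(d_width$count)
nrow(midwest)
```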

Histograms and density plots

oh_wi <- c("OH", "WI")

p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)

We can facet histograms by some variable of interest, or as here we can compare them in the same plot using the fill mapping.
- We subset the data here to pick out just two states.
- The %in% operator is a convenient way to filter on more than one term in a variable when using subset().
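A short sketch of how %in% behaves: it returns one TRUE or FALSE per element on its left-hand side, which subset() then uses to keep rows. The vector c("OH", "IL", "WI") is just an illustrative input.

```r
library(ggplot2)  # provides the midwest data

oh_wi <- c("OH", "WI")

# One logical value per element on the left-hand side
is_member <- c("OH", "IL", "WI") %in% oh_wi
is_member

# Keep only the rows whose state is in oh_wi
two_states <- subset(midwest, subset = state %in% oh_wi)
states_kept <- sort(unique(two_states$state))
states_kept
```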

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_density()

When working with a continuous variable, an alternative to binning the data and making a histogram is using the geom_density() function.
- It calculates a kernel density estimate of the underlying distribution.
- With the default ..density.. statistic, the area under the whole curve is approximately 1.
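The claim about the area can be checked roughly by integrating the computed curve with the trapezoid rule. This is only an approximation: the curve is evaluated on a finite grid over the range of the data, so the result falls slightly short of 1.

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area)) +
  geom_density()

d <- layer_data(p)   # grid of x values and the estimated density

# Trapezoid rule: sum of (step width) * (average of neighboring heights)
step    <- diff(d$x)
heights <- (head(d$density, -1) + tail(d$density, -1)) / 2
area_under_curve <- sum(step * heights)
area_under_curve
```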

Histograms and density plots

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)

We can use color (for the lines) and fill (for the body of the density curve).
- These figures often look quite nice.
- When there are several filled areas on the plot, as in this case, the overlap can become hard to read.

Histograms and density plots

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))

For geom_density(), the stat_density() function can return its default ..density.. statistic, or ..scaled.., which rescales the density estimate of each group so that its maximum is 1.
It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.
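The relationship between these statistics can be inspected with layer_data(). The sketch below reuses the Ohio and Wisconsin subset and writes the statistic as after_stat(scaled), the newer spelling of ..scaled...

```r
library(ggplot2)

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3, mapping = aes(y = after_stat(scaled)))

d <- layer_data(p)

# scaled peaks at 1 within each group (one group per state)
max_scaled <- tapply(d$scaled, d$group, max)
max_scaled

# count is the density multiplied by the number of points in the group
count_matches <- isTRUE(all.equal(d$count, d$density * d$n))
count_matches
```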

Titanic data

The socviz package includes the titanic data frame.
- It contains aggregated data on who survived the Titanic disaster, by sex.

?titanic
titanic

##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6

Avoid transformations when necessary

p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))

p + geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "top")

We can tell geom_bar() not to do any work on the variable before plotting it.
- stat = "identity" means “don’t do any summary calculations”.
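To confirm that stat = "identity" passes the values through untouched, the sketch below rebuilds the four printed rows as a data frame (named titanic_toy here so it does not mask the socviz object) and compares the plotted heights with the percent column.

```r
library(ggplot2)

# Copy of the four rows printed above
titanic_toy <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  n       = c(1364, 126, 367, 344),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

p <- ggplot(data = titanic_toy,
            mapping = aes(x = fate, y = percent, fill = sex)) +
  geom_bar(position = "dodge", stat = "identity")

# Bar heights are exactly the percent values we supplied
bar_heights <- sort(layer_data(p)$y)
bar_heights
```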

OECD data

The socviz package also includes the oecd_sum data frame.
- The oecd_sum table contains information on average life expectancy at birth within the United States, and across other OECD countries.

?oecd_sum
oecd_sum

## # A tibble: 57 × 5
## # Groups:   year [57]
##     year other   usa  diff hi_lo
##    <int> <dbl> <dbl> <dbl> <chr>
##  1  1960  68.6  69.9 1.30  Below
##  2  1961  69.2  70.4 1.20  Below
##  3  1962  68.9  70.2 1.30  Below
##  4  1963  69.1  70   0.900 Below
##  5  1964  69.5  70.3 0.800 Below
##  6  1965  69.6  70.3 0.700 Below
##  7  1966  69.9  70.3 0.400 Below
##  8  1967  70.1  70.7 0.600 Below
##  9  1968  70.1  70.4 0.300 Below
## 10  1969  70.1  70.6 0.5   Below
## # … with 47 more rows

Avoid transformations when necessary

p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides( fill = "none" ) + 
  labs(x = NULL, y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD\naverage life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham,\nWashington Post, December 27th 2017.")

For convenience ggplot also provides a related geom, geom_col(), which works like geom_bar() except that its default stat is "identity".
The position argument in geom_bar() and geom_col() can also take the value of "identity".
- position = "identity" means “just plot the values as given”.
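A brief comparison of these options, using a hand-typed copy of the Titanic table (titanic_toy is an illustrative name). With position = "dodge" each bar gets its own x position. With position = "identity" the bars for both sexes are drawn at the same x position and overlap, so alpha is lowered to keep both visible.

```r
library(ggplot2)

titanic_toy <- data.frame(
  fate    = c("perished", "perished", "survived", "survived"),
  sex     = c("male", "female", "male", "female"),
  percent = c(62.0, 5.7, 16.7, 15.6)
)

base <- ggplot(data = titanic_toy,
               mapping = aes(x = fate, y = percent, fill = sex))

# geom_col() and geom_bar(stat = "identity") compute the same heights
d_bar <- layer_data(base + geom_bar(stat = "identity", position = "dodge"))
d_col <- layer_data(base + geom_col(position = "dodge"))

# position = "identity" leaves bars where they are, so they overlap
d_overlap <- layer_data(base + geom_col(position = "identity", alpha = 0.5))

n_x_dodge   <- length(unique(as.numeric(d_col$x)))       # four separate bars
n_x_overlap <- length(unique(as.numeric(d_overlap$x)))   # two shared positions
```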