DANL 310 Lecture 07

Byeong-Hak Choe

February 17, 2022

Graph tables, add labels, make notes

We will learn how to transform data before we send it to ggplot to be turned into a figure.
We will expand the number of geoms we know about, and learn more about how to choose between them.
- How to reorder the variables displayed in our figures;
- How to subset the data we use before we display it.
We will learn a little more about the scale, guide, and theme functions.

Loading the R packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(skimr)   # a better summary of data frame
library(scales)  # scales for ggplot

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(gapminder) # gapminder data
library(socviz)  # data for visualization practice
library(ggrepel)  # for text on plot, geom_text_repel(), geom_label_repel()

The 2016 General Social Survey data

The socviz package includes the gss_sm data frame.
- gss_sm is a dataset containing an extract from the 2016 General Social Survey.

?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)

Graph tables

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

Setting the position argument to "fill" in geom_bar() rescales every bar to the same height, so that we can compare proportions across groups.

Graph tables

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion ))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion)) +
    facet_wrap(~ religion, ncol = 1)

Instead, we can ask ggplot to give us a proportional bar chart of region, and then facet that by religion.
- The proportions are calculated within each panel, which is the breakdown we wanted.
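
The groupwise proportion can also be computed by hand, which helps when checking what the plot shows. This is a minimal sketch on a made-up table: the data frame toy and its values are hypothetical and only stand in for gss_sm.

```r
library(dplyr)

# Hypothetical toy data standing in for gss_sm (made-up values)
toy <- tibble(
  region   = c("East", "East", "West", "East", "West", "West", "West"),
  religion = c("A", "A", "A", "B", "B", "B", "B")
)

# Count each region within each religion, then divide by the religion total;
# this mirrors a groupwise proportion computed with group = religion
toy_prop <- toy %>%
  count(religion, region) %>%
  group_by(religion) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_prop
```

Within each religion the prop values sum to 1, which is the breakdown shown in each panel.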

Use pipes to summarize data

We will use the %>% (pipe) operator when tidying data.
- The point of the pipe is to help you write code in a way that is easier to read and understand.
- A good way to pronounce %>% when reading code is “then”.

Use pipes to summarize data

A pipeline is typically a series of operations that do one or more of four things:
- group_by(): Group the data into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”.
- filter() rows or select() columns: Filter or select pieces of the data by row, column, or both.
- mutate(): Mutate the data by creating new variables at the current level of grouping.
- summarize(): Summarize or aggregate the grouped data.
  - For example, we might calculate means with mean() or counts with n().
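
The four steps can be sketched on a small made-up table. The data frame toy and its columns are hypothetical and serve only to show the verbs working together.

```r
library(dplyr)

# Hypothetical toy data: one row per respondent (made-up values)
toy <- tibble(
  region   = c("East", "East", "East", "West", "West", "West"),
  religion = c("A", "A", "B", "A", "B", "B"),
  age      = c(25, 40, 31, 58, 19, 47)
)

toy_summary <- toy %>%
  filter(age >= 20) %>%                 # keep a subset of the rows
  group_by(region, religion) %>%        # nest religion within region
  summarize(N = n(),                    # count the rows in each group
            mean_age = mean(age),       # average age in each group
            .groups = "drop_last") %>%  # result stays grouped by region
  mutate(share = N / sum(N))            # share of each religion within its region

toy_summary
```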

Use pipes to summarize data

rel_by_region <- gss_sm %>%
    group_by( bigregion, religion ) %>%
    summarize( N = n() ) %>%
    mutate( freq = N / sum(N),
            pct = round((freq*100), 0) )
rel_by_region

## # A tibble: 24 × 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## # … with 14 more rows

Use pipes to summarize data

Reading from the left, the code says this:
- rel_by_region <- gss_sm %>%: Create a new object, rel_by_region. Start with the gss_sm data, and then …
- group_by(bigregion, religion) %>%: Group the rows by bigregion and, within that, by religion.

Use pipes to summarize data

summarize(N = n()) %>%: Summarize this table to create a new, much smaller table, with three columns:
- bigregion
- religion
- N: a count of the number of observations within each religious group for each region.

Use pipes to summarize data

mutate(freq = N / sum(N), pct = round((freq*100), 0)): With this new table, use the N variable to calculate two new columns:
- the relative proportion (freq)
- the percentage (pct) for each religious category, still grouped by region.
  - summarize() drops the innermost grouping level (religion), so sum(N) is the total within each region.
  - Round the results to the nearest percentage point.

Use pipes to summarize data

rel_by_region %>% group_by(bigregion) %>%
    summarize(total = sum(pct))

## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast   100
## 2 Midwest     101
## 3 South       100
## 4 West        101
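
The totals of 101 come from rounding each percentage before summing. A small sketch with made-up counts (the vector n is hypothetical, not taken from gss_sm) shows the same effect:

```r
# Hypothetical counts for four categories within one group (made-up values)
n <- c(2, 2, 2, 1)

freq <- n / sum(n)            # exact proportions
pct  <- round(freq * 100, 0)  # percentages rounded to whole numbers

sum(freq * 100)  # the unrounded percentages add up to 100
sum(pct)         # the rounded percentages need not add up to 100
```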

Use pipes to summarize data

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))

p + geom_col(position = "dodge2") +
    labs(x = "Region", y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

We use a different position argument here, dodge2 instead of dodge, which places the bars side by side.

Use pipes to summarize data

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))

p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = "none") + 
    coord_flip() + 
    facet_grid(~ bigregion)

The coord_flip() function switches the x and y axes after the plot is made.

The `organdata` data

The socviz package includes the organdata data frame.
- organdata contains a little more than a decade’s worth of information on the donation of organs for transplants in seventeen OECD countries.

?organdata
glimpse(organdata)
skim(organdata)
view(organdata)

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_point()

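Is this scatterplot informative? Each point is one country in one year, so the plot does not show which observations belong to which country.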

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))

We could use geom_line() to plot each country’s time series.
We can also facet the figure by country.

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()

We can use geom_boxplot() to get a picture of variation by year across countries.

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors))
p + geom_boxplot() +
    labs(x=NULL) +
    coord_flip()

We generally want our plots to present data in some meaningful order.
The reorder() function will do this for us.
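- reorder(country, donors, na.rm=TRUE): Reorder the levels of country by the mean of donors (the default summary function), dropping missing values when the mean is computed.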
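
A minimal sketch of what reorder() does to factor levels, using made-up vectors (grp and val are hypothetical names):

```r
# Hypothetical toy data: three groups with two values each, one value missing
grp <- factor(c("a", "a", "b", "b", "c", "c"))
val <- c(10, NA, 1, 3, 5, 7)

levels(grp)  # default order is alphabetical

# Order the levels by the group mean of val; na.rm = TRUE is passed on to mean()
grp_sorted <- reorder(grp, val, na.rm = TRUE)
levels(grp_sorted)

# Another summary function can be supplied through FUN, for example the median
grp_by_median <- reorder(grp, val, FUN = median, na.rm = TRUE)
levels(grp_by_median)
```

The group means here are 10, 2, and 6, so the reordered levels run from the group with the smallest mean to the group with the largest.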

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, fill = world))
p + geom_boxplot() + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

Boxplots can also take color and fill aesthetic mappings like other geoms.
Boxplots are compact, so a relatively large number of cases can fit in one figure, one case per row.

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_point(alpha = .5) + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

When we use geom_point() like this, there is some overplotting of observations.

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width=0.15)) +
    labs(x=NULL) + coord_flip() + theme(legend.position = "top")

geom_jitter() can be useful to perturb the data just a little bit in order to get a better sense of how many observations there are at different values.

Continuous variables by group or category

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width=0.15)) +
    labs(x=NULL) + coord_flip() + theme(legend.position = "top")

We can control the amount of jitter using the height and width arguments to position_jitter() within the geom.
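
A minimal sketch of these arguments on made-up data (the data frame toy is hypothetical). Setting height = 0 jitters only along the categorical axis, and the seed argument is an optional extra that makes the random offsets reproducible.

```r
library(ggplot2)

# Hypothetical toy data: many repeated values within two categories
toy <- data.frame(
  group = rep(c("a", "b"), each = 20),
  value = rep(c(1, 2, 3, 4), times = 10)
)

p <- ggplot(toy, aes(x = group, y = value))

# Spread points sideways by up to 0.15 and leave the plotted values unchanged
p_jit <- p + geom_jitter(position = position_jitter(width = 0.15,
                                                    height = 0,
                                                    seed = 1))
p_jit
```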

Continuous variables by group or category

When we want to summarize a categorical variable that just has one point per category, we should use this approach as well.

Continuous variables by group or category

by_country <- organdata %>% group_by(consent_law, country) %>%
    summarize(donors_mean= mean(donors, na.rm = TRUE),
              donors_sd = sd(donors, na.rm = TRUE),
              gdp_mean = mean(gdp, na.rm = TRUE),
              health_mean = mean(health, na.rm = TRUE),
              roads_mean = mean(roads, na.rm = TRUE),
              cerebvas_mean = mean(cerebvas, na.rm = TRUE))

by_country

## # A tibble: 17 × 8
## # Groups:   consent_law [2]
##    consent_law country     donors_mean donors_sd gdp_mean health_mean roads_mean
##    <chr>       <chr>             <dbl>     <dbl>    <dbl>       <dbl>      <dbl>
##  1 Informed    Australia          10.6     1.14    22179.       1958.      105. 
##  2 Informed    Canada             14.0     0.751   23711.       2272.      109. 
##  3 Informed    Denmark            13.1     1.47    23722.       2054.      102. 
##  4 Informed    Germany            13.0     0.611   22163.       2349.      113. 
##  5 Informed    Ireland            19.8     2.48    20824.       1480.      118. 
##  6 Informed    Netherlands        13.7     1.55    23013.       1993.       76.1
##  7 Informed    United Kin…        13.5     0.775   21359.       1561.       67.9
##  8 Informed    United Sta…        20.0     1.33    29212.       3988.      155. 
##  9 Presumed    Austria            23.5     2.42    23876.       1875.      150. 
## 10 Presumed    Belgium            21.9     1.94    22500.       1958.      155. 
## 11 Presumed    Finland            18.4     1.53    21019.       1615.       93.6
## 12 Presumed    France             16.8     1.60    22603.       2160.      156. 
## 13 Presumed    Italy              11.1     4.28    21554.       1757       122. 
## 14 Presumed    Norway             15.4     1.11    26448.       2217.       70.0
## 15 Presumed    Spain              28.1     4.96    16933        1289.      161. 
## 16 Presumed    Sweden             13.1     1.75    22415.       1951.       72.3
## 17 Presumed    Switzerland        14.2     1.71    27233        2776.       96.4
## # … with 1 more variable: cerebvas_mean <dbl>

Continuous variables by group or category

What we would like to do is apply the mean() and sd() functions to every numerical variable in organdata, but only the numerical ones.
summarize_if() examines each column in our data and applies a test to it.
- summarize_if() only summarizes if the test is passed, that is, if it returns a value of TRUE.
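
How the test works can be sketched on a small made-up table (toy is hypothetical). The second pipeline shows one way to write the same idea with across() and where(), assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: one character column and two numeric columns
toy <- tibble(
  country = c("X", "X", "Y", "Y"),
  donors  = c(10, 12, 20, NA),
  gdp     = c(100, 110, 200, 220)
)

# is.numeric() is applied to each non-grouping column;
# only donors and gdp pass the test, so only they are summarized
toy %>%
  group_by(country) %>%
  summarize_if(is.numeric, mean, na.rm = TRUE)

# The same summary written with across() and where()
toy_means <- toy %>%
  group_by(country) %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

toy_means
```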

Continuous variables by group or category

by_country <- organdata %>% group_by(consent_law, country) %>%
    summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>%
    ungroup()

by_country

## # A tibble: 17 × 28
##    consent_law country  donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean
##    <chr>       <chr>          <dbl>    <dbl>         <dbl>    <dbl>        <dbl>
##  1 Informed    Austral…        10.6   18318.         0.237   22179.       21779.
##  2 Informed    Canada          14.0   29608.         0.297   23711.       23353.
##  3 Informed    Denmark         13.1    5257.        12.2     23722.       23275 
##  4 Informed    Germany         13.0   80255.        22.5     22163.       21938.
##  5 Informed    Ireland         19.8    3674.         5.23    20824.       20154.
##  6 Informed    Netherl…        13.7   15548.        37.4     23013.       22554.
##  7 Informed    United …        13.5   58187.        24.0     21359.       20962.
##  8 Informed    United …        20.0  269330.         2.80    29212.       28699.
##  9 Presumed    Austria         23.5    7927.         9.45    23876.       23415.
## 10 Presumed    Belgium         21.9   10153.        30.7     22500.       22096.
## 11 Presumed    Finland         18.4    5112.         1.51    21019.       20763 
## 12 Presumed    France          16.8   58056.        10.5     22603.       22211.
## 13 Presumed    Italy           11.1   57360.        19.0     21554.       21195.
## 14 Presumed    Norway          15.4    4386.         1.35    26448.       25769.
## 15 Presumed    Spain           28.1   39666.         7.84    16933        16584.
## 16 Presumed    Sweden          13.1    8789.         1.95    22415.       22094 
## 17 Presumed    Switzer…        14.2    7037.        17.0     27233        26931.
## # … with 21 more variables: health_mean <dbl>, health_lag_mean <dbl>,
## #   pubhealth_mean <dbl>, roads_mean <dbl>, cerebvas_mean <dbl>,
## #   assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## #   donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>,
## #   gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>, pubhealth_sd <dbl>,
## #   roads_sd <dbl>, cerebvas_sd <dbl>, assault_sd <dbl>, external_sd <dbl>,
## #   txp_pop_sd <dbl>

Cleveland dotplot

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean, y = reorder(country, donors_mean),
                          color = consent_law))
p + geom_point(size=3) +
    labs(x = "Donor Procurement Rate",
         y = "", color = "Consent Law") +
    theme(legend.position="top")

A Cleveland dotplot is a simple and extremely effective way of presenting data, and it is usually better than either a bar chart or a table.

Cleveland dotplot

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))

p + geom_point(size=3) +
    facet_wrap(~ consent_law, scales = "free_y", ncol = 1) +
    labs(x= "Donor Procurement Rate",
         y= "")

We could use a facet instead of coloring the points.
In the facet_wrap() here, …
- scales = "free_y" lets each panel have its own y-axis, so a panel lists only the countries that have dots in it.
- ncol = 1 stacks the panels in a single column, so the shared x-axis makes them easy to compare.

Cleveland dotplot

p <- ggplot(data = by_country, mapping = aes(x = reorder(country,
              donors_mean), y = donors_mean))

p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
       ymax = donors_mean + donors_sd)) +
     labs(x= "", y= "Donor Procurement Rate") + coord_flip()

The Cleveland-style dotplot can be extended to cases where we want to include some information about variance or error in the plot.
Using geom_pointrange(), we can tell ggplot to show us a point estimate and a range around it.

Plot text directly

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))

It can sometimes be useful to plot the labels along with the points in a scatterplot, or just plot informative labels directly.
- We can do this with geom_text().

Plot text directly

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))

p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)

We can left- or right-justify the labels using the hjust argument to geom_text().
- Setting hjust=0 will left justify the label, and hjust=1 will right justify it.
- Trying different values of hjust by hand is not a robust approach: a setting that suits one label can leave another overlapping its point or its neighbors.
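
One small adjustment that is sometimes combined with hjust is the nudge_x argument to geom_text(), which shifts every label by a fixed amount in data units. This is only a sketch on made-up data (toy and the offset 0.05 are hypothetical), and it shares the weakness above: a single offset rarely suits every label.

```r
library(ggplot2)

# Hypothetical toy data: three labeled points (made-up values)
toy <- data.frame(
  x    = c(1, 2, 3),
  y    = c(2, 1, 3),
  name = c("Alpha", "Beta", "Gamma")
)

p <- ggplot(toy, aes(x = x, y = y))

# Left-justify each label and shift it 0.05 units to the right of its point
p_lab <- p + geom_point() +
  geom_text(mapping = aes(label = name), hjust = 0, nudge_x = 0.05)

p_lab
```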

Historical U.S. presidential election data

The socviz package includes the elections_historic data frame.
- elections_historic provides historical U.S. presidential election data.

?elections_historic
glimpse(elections_historic)
skim(elections_historic)
view(elections_historic)

Plot text directly

p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct, label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") + 
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() + geom_text_repel() +  
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle,
         caption = p_caption)

We use geom_hline() and geom_vline() to draw the horizontal and vertical reference lines at 50%.
geom_text_repel() makes sure the labels do not overlap with each other, or obscure other points.
