Byeong-Hak Choe
February 17, 2022
We will learn how to transform data before we send it to ggplot to be turned into a figure.
We will expand the number of geoms we know about, and learn more about how to choose between them.
We will learn a little more about the scale, guide, and theme functions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)   # a better summary of data frame
library(scales)  # scales for ggplot
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
##     discard
## The following object is masked from 'package:readr':
##
##     col_factor
library(gapminder)  # gapminder data
library(socviz)     # data for visualization practice
library(ggrepel)    # for text on plot, geom_text_repel(), geom_label_repel()
The socviz package includes the gss_sm data frame.
gss_sm is a dataset containing an extract from the 2016 General Social Survey.
?gss_sm
glimpse(gss_sm)
skim(gss_sm)
view(gss_sm)
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")
The "fill" position in geom_bar() is used to compare proportions across groups.
p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion)) +
  facet_wrap(~ religion, ncol = 1)
The %>% (pipe) operator is useful when tidying data.
A good way to read %>% in code is as “then”.
group_by(): Group the data into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”.
filter() rows or select() columns: Filter or select pieces of the data by row, column, or both.
mutate(): Mutate the data by creating new variables at the current level of grouping.
summarize(): Summarize or aggregate the grouped data, for example with means from mean() or counts with n().
rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq*100), 0))
rel_by_region
## # A tibble: 24 × 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## # … with 14 more rows
Reading from the left, the code says this:
rel_by_region <- gss_sm %>%: Create a new object, rel_by_region. Start with the gss_sm data, and then …
group_by(bigregion, religion) %>%: Group the rows by bigregion and, within that, by religion.
summarize(N = n()) %>%: Summarize this table to create a new, much smaller table, with three columns: bigregion, religion, and N, a count of the number of observations within each religious group for each region.
mutate(freq = N / sum(N), pct = round((freq*100), 0)): With this new table, use the N variable to calculate two new columns, the relative proportion (freq) and the percentage (pct) for each religious category, still grouped by region.
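The phrase "still grouped by region" matters because summarize() drops only the innermost grouping variable, so sum(N) in the following mutate() is computed within each region. A minimal sketch on a made-up toy data frame (the names toy and toy_summary and their values are hypothetical, not from gss_sm) may make this easier to see:

```r
library(dplyr)

# Hypothetical toy data: two regions, two religions
toy <- tibble(
  region   = c("A", "A", "A", "B", "B"),
  religion = c("x", "x", "y", "x", "y")
)

toy_summary <- toy %>%
  group_by(region, religion) %>%
  # "drop_last" is what summarize() does by default here; stated explicitly
  summarize(N = n(), .groups = "drop_last") %>%
  # The data are still grouped by region, so sum(N) is a per-region total
  mutate(freq = N / sum(N))

group_vars(toy_summary)  # the remaining grouping variable
toy_summary
```

Within each region the freq values sum to one, which is why the percentages in rel_by_region can be compared across regions.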
rel_by_region %>%
  group_by(bigregion) %>%
  summarize(total = sum(pct))
## # A tibble: 4 × 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast   100
## 2 Midwest     101
## 3 South       100
## 4 West        101
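Totals of 101 rather than 100 are a side effect of rounding each percentage before summing. A small sketch with made-up proportions (the values below are hypothetical and chosen only to show the effect) illustrates how this can happen:

```r
# Hypothetical proportions that sum to 1
freq <- c(0.326, 0.326, 0.348)

# Round each percentage to a whole number, as in the pct column
pct <- round(freq * 100, 0)

sum(freq)  # the unrounded proportions sum to 1 (up to floating point)
pct        # each share is rounded up
sum(pct)   # so the rounded percentages need not sum to 100
```

If exact totals matter, one option is to sum freq rather than pct and round only for display.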
p <- ggplot(rel_by_region,
            aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge") +
  labs(x = "Region", y = "Percent", fill = "Religion") +
  theme(legend.position = "top")
We can use "dodge2" instead of "dodge" as the position.
p <- ggplot(rel_by_region,
            aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = NULL, y = "Percent", fill = "Religion") +
  guides(fill = "none") +
  coord_flip() +
  facet_grid(~ bigregion)
The coord_flip() function switches the x and y axes after the plot is made.
organdata data
The socviz package includes the organdata data frame.
organdata contains a little more than a decade’s worth of information on the donation of organs for transplants in seventeen OECD countries.
?organdata
glimpse(organdata)
skim(organdata)
view(organdata)
p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_point()
p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45))
We can use geom_line() to plot each country’s time series.
p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot() +
  coord_flip()
We can use geom_boxplot() to get a picture of variation by year across countries.
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors))
p + geom_boxplot() +
  labs(x=NULL) +
  coord_flip()
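To list the countries in order of their mean donation rate rather than alphabetically, the reorder() function will do this for us.
reorder(country, donors): Reorder country by donors. By default the ordering uses the mean of donors within each country, and na.rm=TRUE is passed on to mean().
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, fill = world))
p + geom_boxplot() +
  labs(x=NULL) +
  coord_flip() +
  theme(legend.position = "top")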
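As a minimal sketch of what reorder() does to the factor levels, the example below uses a small made-up vector rather than organdata (the country names and donor values here are hypothetical). It relies only on base R, where reorder() applies mean() by default and passes extra arguments such as na.rm on to it:

```r
# Hypothetical values, two observations per country, one of them missing
country <- c("Spain", "Spain", "Italy", "Italy", "Norway", "Norway")
donors  <- c(28, 30, 10, NA, 15, 16)

# Default factor levels are alphabetical
levels(factor(country))

# Reorder levels by the mean of donors within each country;
# na.rm = TRUE is passed through to mean()
country_by_donors <- reorder(country, donors, na.rm = TRUE)
levels(country_by_donors)
```

Because ggplot draws discrete axes in level order, the reordered factor is what places the countries from lowest to highest mean on the axis.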
Boxplots can also take color and fill aesthetic mappings like other geoms.
With the categorical variable on the flipped axis, the plots can be quite compact and fit a relatively large number of cases, one per row.
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_point(alpha = .5) +
  labs(x=NULL) +
  coord_flip() +
  theme(legend.position = "top")
When we use geom_point() like this, there is some overplotting of observations.
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width=0.15)) +
  labs(x=NULL) +
  coord_flip() +
  theme(legend.position = "top")
geom_jitter() can be useful to perturb the data just a little bit in order to get a better sense of how many observations there are at different values.
p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width=0.15)) +
  labs(x=NULL) +
  coord_flip() +
  theme(legend.position = "top")
We can control the amount of jitter using the height and width arguments to a position_jitter() function within the geom.
by_country <- organdata %>%
  group_by(consent_law, country) %>%
  summarize(donors_mean = mean(donors, na.rm = TRUE),
            donors_sd = sd(donors, na.rm = TRUE),
            gdp_mean = mean(gdp, na.rm = TRUE),
            health_mean = mean(health, na.rm = TRUE),
            roads_mean = mean(roads, na.rm = TRUE),
            cerebvas_mean = mean(cerebvas, na.rm = TRUE))
by_country
## # A tibble: 17 × 8
## # Groups:   consent_law [2]
##    consent_law country     donors_mean donors_sd gdp_mean health_mean roads_mean
##    <chr>       <chr>             <dbl>     <dbl>    <dbl>       <dbl>      <dbl>
##  1 Informed    Australia          10.6     1.14    22179.       1958.      105.
##  2 Informed    Canada             14.0     0.751   23711.       2272.      109.
##  3 Informed    Denmark            13.1     1.47    23722.       2054.      102.
##  4 Informed    Germany            13.0     0.611   22163.       2349.      113.
##  5 Informed    Ireland            19.8     2.48    20824.       1480.      118.
##  6 Informed    Netherlands        13.7     1.55    23013.       1993.       76.1
##  7 Informed    United Kin…        13.5     0.775   21359.       1561.       67.9
##  8 Informed    United Sta…        20.0     1.33    29212.       3988.      155.
##  9 Presumed    Austria            23.5     2.42    23876.       1875.      150.
## 10 Presumed    Belgium            21.9     1.94    22500.       1958.      155.
## 11 Presumed    Finland            18.4     1.53    21019.       1615.       93.6
## 12 Presumed    France             16.8     1.60    22603.       2160.      156.
## 13 Presumed    Italy              11.1     4.28    21554.       1757       122.
## 14 Presumed    Norway             15.4     1.11    26448.       2217.       70.0
## 15 Presumed    Spain              28.1     4.96    16933        1289.      161.
## 16 Presumed    Sweden             13.1     1.75    22415.       1951.       72.3
## 17 Presumed    Switzerland        14.2     1.71    27233        2776.       96.4
## # … with 1 more variable: cerebvas_mean <dbl>
What we would like to do is apply the mean() and sd() functions to every numerical variable in organdata, but only the numerical ones.
summarize_if() examines each column in our data and applies a test to it.
summarize_if() only summarizes if the test is passed, that is, if it returns a value of TRUE.
by_country <- organdata %>%
  group_by(consent_law, country) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list gives the same column names
  summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>%
  ungroup()
by_country
## # A tibble: 17 × 28
##    consent_law country  donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean
##    <chr>       <chr>          <dbl>    <dbl>         <dbl>    <dbl>        <dbl>
##  1 Informed    Austral…        10.6   18318.         0.237   22179.       21779.
##  2 Informed    Canada          14.0   29608.         0.297   23711.       23353.
##  3 Informed    Denmark         13.1    5257.        12.2     23722.       23275
##  4 Informed    Germany         13.0   80255.        22.5     22163.       21938.
##  5 Informed    Ireland         19.8    3674.         5.23    20824.       20154.
##  6 Informed    Netherl…        13.7   15548.        37.4     23013.       22554.
##  7 Informed    United …        13.5   58187.        24.0     21359.       20962.
##  8 Informed    United …        20.0  269330.         2.80    29212.       28699.
##  9 Presumed    Austria         23.5    7927.         9.45    23876.       23415.
## 10 Presumed    Belgium         21.9   10153.        30.7     22500.       22096.
## 11 Presumed    Finland         18.4    5112.         1.51    21019.       20763
## 12 Presumed    France          16.8   58056.        10.5     22603.       22211.
## 13 Presumed    Italy           11.1   57360.        19.0     21554.       21195.
## 14 Presumed    Norway          15.4    4386.         1.35    26448.       25769.
## 15 Presumed    Spain           28.1   39666.         7.84    16933        16584.
## 16 Presumed    Sweden          13.1    8789.         1.95    22415.       22094
## 17 Presumed    Switzer…        14.2    7037.        17.0     27233        26931.
## # … with 21 more variables: health_mean <dbl>, health_lag_mean <dbl>,
## #   pubhealth_mean <dbl>, roads_mean <dbl>, cerebvas_mean <dbl>,
## #   assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## #   donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>,
## #   gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>, pubhealth_sd <dbl>,
## #   roads_sd <dbl>, cerebvas_sd <dbl>, assault_sd <dbl>, external_sd <dbl>,
## #   txp_pop_sd <dbl>
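In dplyr 1.0.0 and later, the same kind of conditional summary can also be written with across() and where(). The sketch below is an alternative to summarize_if(), not the code used for the table above, and it runs on a made-up toy data frame (the names toy and toy_summary and all values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one non-numeric column besides the grouping variable
toy <- tibble(
  country = c("A", "A", "B", "B"),
  donors  = c(10, 12, 20, NA),
  gdp     = c(100, 110, 200, 220),
  note    = c("a", "b", "c", "d")
)

toy_summary <- toy %>%
  group_by(country) %>%
  summarize(
    # Apply mean and sd to every numeric column, and only to those
    across(where(is.numeric),
           list(mean = ~ mean(.x, na.rm = TRUE),
                sd   = ~ sd(.x, na.rm = TRUE))),
    .groups = "drop"
  )

names(toy_summary)
toy_summary
```

One difference worth noting is column order: across() produces the mean and sd for each variable side by side, whereas the summarize_if() output above lists all the means first and then all the standard deviations.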
p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean),
                          color = consent_law))
p + geom_point(size=3) +
  labs(x = "Donor Procurement Rate", y = "", color = "Consent Law") +
  theme(legend.position="top")
p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))
p + geom_point(size=3) +
  facet_wrap(~ consent_law, scales = "free_y", ncol = 1) +
  labs(x= "Donor Procurement Rate", y= "")
We could use a facet instead of coloring the points.
In the facet_wrap() here, …
scales = "free_y" allows the y-axis scale to be free in each panel, which removes countries that do not have dots in that panel.
ncol = 1 stacks the panels in a single column, which makes them easy to compare.
p <- ggplot(data = by_country,
            mapping = aes(x = reorder(country, donors_mean),
                          y = donors_mean))
p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
                                  ymax = donors_mean + donors_sd)) +
  labs(x= "", y= "Donor Procurement Rate") +
  coord_flip()
The Cleveland-style dotplot can be extended to cases where we want to include some information about variance or error in the plot.
Using geom_pointrange(), we can tell ggplot to show us a point estimate and a range around it.
p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() +
  geom_text(mapping = aes(label = country))
Labels can be added to the points with geom_text().
p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() +
  geom_text(mapping = aes(label = country), hjust = 0)
We can adjust the position of the labels using the hjust argument to geom_text().
hjust=0 will left justify the label, and hjust=1 will right justify it.
However, adjusting hjust is not a robust approach.
The socviz package includes the elections_historic data frame.
elections_historic provides historical U.S. presidential election data.
?elections_historic
glimpse(elections_historic)
skim(elections_historic)
view(elections_historic)
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"
p <- ggplot(elections_historic,
            aes(x = popular_pct, y = ec_pct, label = winner_label))
p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
  geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
  geom_point() +
  geom_text_repel() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = x_label, y = y_label,
       title = p_title, subtitle = p_subtitle, caption = p_caption)
We use geom_hline() and geom_vline() to make the reference lines at 50 percent.
geom_text_repel() makes sure the labels do not overlap with each other, or obscure other points.