library(tidyverse)
library(skimr)
# install.packages("hexbin") # if you do not have the "hexbin" package
library(hexbin)
Download the compressed file, bikeshare-2011-01-01.zip,
from the Files section of our Canvas website. Extract
bikeshare-2011-01-01.zip so that you can use the file
bikeshare-2011-01-01.csv. Read the data file,
bikeshare-2011-01-01.csv, as a data.frame object named
bikeshare2011_01_01, using (1) the
read_csv() function and (2) the absolute path name
of the file bikeshare_2011_01_01.csv on the local hard
disk drive of your laptop.
path <- '/Users/byeong-hakchoe/Google Drive/suny-geneseo/fall2022/bikeshare_2011_01_01.csv'
bikeshare2011_01_01 <- read_csv(path)
# View `bikeshare2011_01_01`
bikeshare2011_01_01
Report the mean, median, minimum, maximum, and standard deviation for
each numeric variable in the data.frame
bikeshare2011_01_01.
Use the skim() function to get summary
statistics:
library(skimr)
summary(bikeshare2011_01_01)
skim(bikeshare2011_01_01)
| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| hr | 24 | 11.50 | 7.07 | 0.00 | 5.50 | 11.50 | 17.50 | 23.00 |
| holiday | 24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| temp | 24 | -0.79 | 0.50 | -1.54 | -1.33 | -0.56 | -0.40 | -0.19 |
| hum | 24 | 0.93 | 0.31 | 0.48 | 0.66 | 0.90 | 1.23 | 1.62 |
| windspeed | 24 | -0.24 | 1.10 | -1.55 | -1.55 | 0.40 | 0.76 | 0.89 |
| year | 24 | 2011.00 | 0.00 | 2011.00 | 2011.00 | 2011.00 | 2011.00 | 2011.00 |
| month | 24 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| date | 24 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| cnt | 24 | 41.04 | 34.29 | 1.00 | 13.50 | 35.50 | 61.50 | 110.00 |
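The same five statistics can also be computed by hand, which helps when checking what skim() reports. The following is a minimal sketch in base R, assuming a small hypothetical data.frame named toy (its values are made up for illustration and are not taken from the bikeshare file):

```r
# Hypothetical toy data; NOT the bikeshare file
toy <- data.frame(
  hr    = c(0, 1, 2, 3),
  cnt   = c(16, 40, 32, 13),
  label = c("a", "b", "c", "d")
)

# Keep only the numeric columns
numeric_cols <- toy[sapply(toy, is.numeric)]

# One row per statistic, one column per numeric variable
stats <- sapply(numeric_cols, function(x) {
  c(mean = mean(x), median = median(x),
    min = min(x), max = max(x), sd = sd(x))
})

print(round(stats, 2))
```

Replacing toy with bikeshare2011_01_01 should give one column per numeric variable, which can be compared against the table above.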
Read the data file, bikeshare_cleaned.csv, as the
data.frame object with the name, bikeshare, using (1) the
read_csv() function and (2) its URL,
https://bcdanl.github.io/data/bikeshare_cleaned.csv.
url <- 'https://bcdanl.github.io/data/bikeshare_cleaned.csv'
bikeshare <- read_csv(url)
View(bikeshare)
summary(bikeshare)
skim(bikeshare)
table(bikeshare$wkday)
table(bikeshare$month)
table(bikeshare$seasons)
table(bikeshare$weather_cond)
prop.table( table( bikeshare$wkday ) )
prop.table( table( bikeshare$month ) )
prop.table( table( bikeshare$seasons ) )
prop.table( table( bikeshare$weather_cond ) )
Summary statistics for bikeshare:
| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| cnt | 17376 | 189.48 | 181.40 | 1.00 | 40.00 | 142.00 | 281.00 | 977.00 |
| year | 17376 | 2011.50 | 0.50 | 2011.00 | 2011.00 | 2012.00 | 2012.00 | 2012.00 |
| hr | 17376 | 11.55 | 6.91 | 0.00 | 6.00 | 12.00 | 18.00 | 23.00 |
| holiday | 17376 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| temp | 17376 | 0.00 | 1.00 | -2.48 | -0.82 | 0.02 | 0.85 | 2.61 |
| hum | 17376 | 0.00 | 1.00 | -3.25 | -0.76 | 0.01 | 0.79 | 1.93 |
| windspeed | 17376 | 0.00 | 1.00 | -1.55 | -0.70 | 0.03 | 0.52 | 5.40 |

| Variable | Level | N | % |
|---|---|---|---|
| wkday | monday | 2478 | 14.3 |
| | tuesday | 2453 | 14.1 |
| | wednesday | 2474 | 14.2 |
| | thursday | 2471 | 14.2 |
| | friday | 2487 | 14.3 |
| | saturday | 2511 | 14.5 |
| | sunday | 2502 | 14.4 |

| Variable | Level | N | % |
|---|---|---|---|
| month | 01 | 1426 | 8.2 |
| | 02 | 1341 | 7.7 |
| | 03 | 1473 | 8.5 |
| | 04 | 1437 | 8.3 |
| | 05 | 1488 | 8.6 |
| | 06 | 1440 | 8.3 |
| | 07 | 1488 | 8.6 |
| | 08 | 1475 | 8.5 |
| | 09 | 1437 | 8.3 |
| | 10 | 1451 | 8.4 |
| | 11 | 1437 | 8.3 |
| | 12 | 1483 | 8.5 |

| Variable | Level | N | % |
|---|---|---|---|
| seasons | spring | 4239 | 24.4 |
| | summer | 4409 | 25.4 |
| | fall | 4496 | 25.9 |
| | winter | 4232 | 24.4 |

| Variable | Level | N | % |
|---|---|---|---|
| weather_cond | Clear or Few Cloudy | 11413 | 65.7 |
| | Light Snow or Light Rain | 1419 | 8.2 |
| | Mist or Cloudy | 4544 | 26.2 |
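The N and % columns in the tables above correspond to table() and prop.table() applied to one categorical variable. A minimal sketch of combining the two into a single frequency table, assuming a hypothetical character vector wkday_toy (made-up values, not the bikeshare data):

```r
# Hypothetical toy values; NOT the bikeshare data
wkday_toy <- c("monday", "monday", "tuesday", "sunday")

# Counts per level (levels are sorted alphabetically by table())
counts <- table(wkday_toy)

# Shares per level, expressed in percent and rounded to one decimal
pct <- round(100 * prop.table(counts), 1)

# Combine counts and percentages into one data.frame
freq <- data.frame(
  level = names(counts),
  n     = as.vector(counts),
  pct   = as.vector(pct)
)

print(freq)
```

The same pattern should work with bikeshare$wkday, bikeshare$month, bikeshare$seasons, or bikeshare$weather_cond in place of wkday_toy.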
Use the data.frame bikeshare for the remaining parts
of Question 2.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the distribution of cnt.
ggplot(bikeshare) +
geom_histogram(aes(x = cnt),
binwidth = 5) 
The distribution of cnt is right-skewed.
The most frequent values of cnt range from 0 to 50.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the distribution of cnt by
year and month.
# density plot
ggplot(bikeshare) +
geom_density( aes(x = cnt, fill = month),
color = NA,
show.legend = F) +
facet_grid(month~year)
# histogram
ggplot(bikeshare) +
geom_histogram( aes(x = cnt, fill = month),
binwidth = 10,
color = NA,
show.legend = F) +
facet_grid(month~year)
# boxplot
ggplot(bikeshare) +
geom_boxplot( aes(x = cnt, y = month,
fill = month),
show.legend = F) 


Provide both (1) ggplot codes and (2) a couple of
sentences to describe the distribution of temp by
year and month.
ggplot(bikeshare) +
geom_histogram(aes(x = temp, fill = month),
binwidth = .1,
show.legend = F) +
geom_vline(xintercept = 0, color = 'red') +
facet_grid(month ~ year) 
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the distribution of hum by
year and month.
ggplot(bikeshare) +
geom_histogram(aes(x = hum, fill = month),
binwidth = .1,
show.legend = F) +
geom_vline(xintercept = 0, color = 'red') +
facet_grid(month ~ year) 
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the distribution of windspeed by
year and month.
ggplot(bikeshare) +
geom_density(aes(x = windspeed)) +
geom_vline(xintercept = 0, color = 'red') +
facet_grid(month ~ year) 
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between temp and
cnt.
ggplot(bikeshare,
aes(x = temp, y = cnt)) +
geom_hex() +
geom_smooth(color = 'red') +
geom_smooth(method = lm) 
temp and cnt are positively associated
with each other.
A high temp (above 1.5) may lead to lower
cnt.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between temp and
cnt by year and month.
ggplot(bikeshare, aes(x = temp, y = cnt )) +
geom_point(alpha = .1) +
geom_smooth(color = "red3",
fill = "orchid",
method = lm) +
geom_smooth(color = "royalblue",
fill = "orchid") +
facet_grid(month~year)
temp and cnt are positively
associated with each other.
The relationship between temp
and cnt switches from positive to negative where
temp is around 1.5.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between weather_cond
and cnt.
ggplot(bikeshare) +
geom_histogram(aes(x = cnt),
binwidth = 5) +
facet_grid(. ~ weather_cond)
ggplot(bikeshare) +
geom_density(aes(x = cnt)) +
facet_grid(. ~ weather_cond) 

Most observations occur when weather_cond is Clear or Few Cloudy.
The modes of cnt are around 50 across all
values of weather_cond.
When weather_cond is
Light Snow or Light Rain, people are more likely to rent
fewer bikes.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between weather_cond
and cnt by hr.
ggplot(bikeshare) +
geom_density(aes(x = cnt, fill = as.factor(hr)),
color = NA,
show.legend = F) +
facet_grid(hr ~ weather_cond, scales = "free_y")
Values of cnt during commuting hours
(hr 7:00 A.M.-8:59 A.M. and 5:00 P.M.-7:59 P.M.) are often
larger than during other hours.
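A claim like this one can be backed with a grouped average in addition to the faceted plot. A minimal sketch, assuming a hypothetical data.frame toy with made-up hr and cnt values (not the bikeshare data) and using the commuting hours stated in the text (hr 7, 8, 17, 18, 19):

```r
# Hypothetical toy data; NOT the bikeshare data
toy <- data.frame(
  hr  = c(3, 8, 12, 17, 18, 23),
  cnt = c(5, 200, 90, 300, 280, 20)
)

# Commuting hours as stated in the text: 7-8 A.M. and 5-7 P.M.
commuting_hours <- c(7, 8, 17, 18, 19)
toy$period <- ifelse(toy$hr %in% commuting_hours, "commuting", "other")

# Mean rental count within each period
mean_by_period <- tapply(toy$cnt, toy$period, mean)

print(mean_by_period)
```

Running the same steps on bikeshare, optionally within each weather_cond, would show whether the pattern seen in the density plots also holds on average.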
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between wkday and
cnt.
ggplot(bikeshare) +
geom_density(aes(x = cnt,
fill = as.factor(wkday)),
color = NA,
show.legend = F) +
facet_grid(. ~ wkday)
ggplot(bikeshare) +
geom_histogram(aes(x = cnt,
fill = as.factor(wkday)),
binwidth = 10,
color = NA,
show.legend = F) +
facet_grid(. ~ wkday)
ggplot(bikeshare,
aes(x = wkday, y = cnt)) +
geom_boxplot( aes(fill = wkday),
show.legend = F ) +
stat_summary(fun = mean, geom = "point") # mark the mean of cnt for each wkday


The distribution of cnt is right-skewed, and looks
similar across all values of wkday.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between wkday and
cnt by hr.
ggplot(bikeshare) +
geom_density(aes(x = cnt,
fill = as.factor(hr)),
color = NA,
show.legend = F) +
facet_grid(hr~wkday, scales = "free_y")
ggplot(bikeshare) +
geom_histogram(aes(x = cnt,
fill = as.factor(hr)),
binwidth = 10,
color = NA,
show.legend = F) +
facet_grid(hr~wkday, scales = "free_y")
ggplot(bikeshare) +
geom_boxplot(aes(x = cnt,
y = as.factor(hr),
fill = as.factor(hr)),
show.legend = F) +
facet_grid(.~wkday, scales = "free_y")


Read the data file, NY_school_enrollment_socioecon.csv,
as the data.frame object with the name,
NY_school_enrollment_socioecon, using (1) the
read_csv() function and (2) its URL,
https://bcdanl.github.io/data/NY_school_enrollment_socioecon.csv.
url2 <- 'https://bcdanl.github.io/data/NY_school_enrollment_socioecon.csv'
NY_school_enrollment_socioecon <- read_csv(url2)
View(NY_school_enrollment_socioecon)
For a description of the variables in
NY_school_enrollment_socioecon, refer to the file,
ny_school_enrollment_socioecon_description.zip, which is in
the Files section of our Canvas web page. (I recommend that you extract
the zip file, and then read the file,
ny_school_enrollment_socioecon_description.csv, using
Excel or Numbers.)
Here are some details about the data.frame,
NY_school_enrollment_socioecon:
The geographic and time units of observation (row) in the
data.frame, NY_school_enrollment_socioecon, are New
York county and year.
| FIPS | year | county_name | pincp | c01_001 | c02_002 |
|---|---|---|---|---|---|
| 36001 | 2015 | Albany | 55793 | 84463 | 4.7 |
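Because each row is a county-year pair, FIPS and year together should identify a row uniquely. A minimal sketch of that check, assuming a hypothetical toy panel (the FIPS codes and years below are placeholders, not read from the data file):

```r
# Hypothetical toy panel: two counties observed in two years
panel <- data.frame(
  FIPS = c(36001, 36001, 36003, 36003),
  year = c(2015, 2016, 2015, 2016)
)

# TRUE when no (FIPS, year) pair appears more than once
is_unique_key <- !any(duplicated(panel[, c("FIPS", "year")]))

# Number of distinct counties and years in the panel
n_counties <- length(unique(panel$FIPS))
n_years    <- length(unique(panel$year))

print(c(unique_key = is_unique_key, counties = n_counties, years = n_years))
```

Applying the same three lines to NY_school_enrollment_socioecon would confirm the unit of observation before plotting.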

The following describes the variables:
c01_010: Total!!Population enrolled in college
or graduate school
So, c01_010 is total population enrolled in college
or graduate school;
c02_010: Percent!!Population enrolled in college
or graduate school
So, c02_010 is a percent of total
population enrolled in college or graduate school;
In which county is a person more likely to be enrolled in a college or graduate school?
A county’s college enrollment level can be represented by the overall tendency of that county’s residents to be enrolled in college (as long as we are interested in analyzing how people behave overall).
The size of a county’s population enrolled in college or graduate
school (c01_010) may not be appropriate to
represent a county’s college enrollment level.
Consider the following example:
| County | Total.Population | Bachelor.s.Degree | High.School | Percent.of.Bachelor.s.Degree | Percent.of.High.School |
|---|---|---|---|---|---|
| A | 100,000 | 1,000 | 99,000 | 1.0% | 99.0% |
| B | 1,000 | 999 | 1 | 99.9% | 0.1% |
Although County A has a larger number of people with bachelor’s degrees than County B, it is more appropriate to say that people in County B have a higher level of college enrollment than people in County A.
This is because the overall tendency of County B’s people to attend college is stronger than that of County A’s people.
Similarly, to represent the standard of living of people in a country, we do not use the country’s gross domestic product (GDP) but its GDP per capita (GDP divided by population).
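The County A and County B example can be reproduced with a few lines of arithmetic. A minimal sketch using the numbers from the example table (the data.frame and column names are hypothetical, chosen only for this illustration):

```r
# Numbers from the County A / County B example table
counties <- data.frame(
  county           = c("A", "B"),
  total_population = c(100000, 1000),
  bachelors        = c(1000, 999)
)

# Percent of each county's population with a bachelor's degree
counties$pct_bachelors <- 100 * counties$bachelors / counties$total_population

# The county with the larger count vs. the county with the larger percent
largest_by_count   <- counties$county[which.max(counties$bachelors)]
largest_by_percent <- counties$county[which.max(counties$pct_bachelors)]

print(counties)
```

The ranking by count and the ranking by percent disagree here, which is why the percent variable (c02_010) rather than the count (c01_010) is used to represent a county’s enrollment level.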
Provide both (1) ggplot codes and (2) a couple of
sentences to describe the relationship between college enrollment and
educational attainment of population 45 to 64 years, and how such
relationship varies by the type (public or private) of colleges.
To represent a level of educational attainment of population 45
to 64 years in a county, I choose variable, 100 - d01_024,
a percent of population 45 to 64 years
without bachelor’s degree.
To represent a level of college enrollment of population in a
county, I choose variable, c02_010, a
percent of population enrolled in college or graduate
school.
To represent a level of public college’s
enrollment of population in a county, I choose variable,
c04_010, a percent of population enrolled
in public college or graduate school.
To represent a level of private college’s
enrollment of population in a county, I choose variable,
c06_010, a percent of population enrolled
in private college or graduate school.
The following ggplot describes the relationship between college enrollment and educational attainment of population 45 to 64 years:
ggplot(NY_school_enrollment_socioecon,
aes(x = 100 - d01_024, y = c02_010)) +
geom_hex() +
geom_smooth(method = lm) +
geom_smooth(color = 'red') +
coord_fixed()
The percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.
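The sign of an association read off a plot can also be checked numerically with a correlation or a linear-model slope. A minimal sketch, assuming a hypothetical toy data.frame whose values are made up for illustration (they are not taken from NY_school_enrollment_socioecon):

```r
# Hypothetical toy data; NOT the NY county data
toy <- data.frame(
  pct_no_bachelors = c(60, 65, 70, 75, 80),
  pct_enrolled     = c(9, 8, 6, 5, 4)
)

# Pearson correlation: negative values indicate a negative association
r <- cor(toy$pct_no_bachelors, toy$pct_enrolled)

# Slope of the fitted line, the same line geom_smooth(method = lm) draws
fit   <- lm(pct_enrolled ~ pct_no_bachelors, data = toy)
slope <- coef(fit)[["pct_no_bachelors"]]

print(c(correlation = r, slope = slope))
```

With the real data, the same check would use 100 - d01_024 as the predictor and c02_010, c04_010, or c06_010 as the outcome.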
The following ggplots describe the relationship between college enrollment in public and private schools and educational attainment of population 45 to 64 years:
ggplot(NY_school_enrollment_socioecon,
aes(x = 100 - d01_024, y = c04_010)) +
geom_hex() +
geom_smooth(method = lm) +
geom_smooth(color = 'red') +
coord_fixed()
ggplot(NY_school_enrollment_socioecon,
aes(x = 100 - d01_024, y = c06_010)) +
geom_hex() +
geom_smooth(method = lm) +
geom_smooth(color = 'red') +
coord_fixed()
The percentage of population 45 to 64 years without Bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.
The percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe how the relationships described in Q3b vary by
gender of population 45 to 64 years.
To represent a level of educational attainment of
male population 45 to 64 years in a county, I choose
variable, 100 - d03_024, a percent of male
population 45 to 64 years without bachelor’s
degree.
To represent a level of educational attainment of
female population 45 to 64 years in a county, I choose
variable, 100 - d05_024, a percent of
female population 45 to 64 years
without bachelor’s degree.
The following ggplot describes the relationship between college enrollment and educational attainment of male/female populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100 - d03_024, y = c02_010),
color = 'blue') +
geom_point(aes(x = 100 - d05_024, y = c02_010),
color = 'red') +
geom_smooth(aes(x = 100 - d03_024, y = c02_010),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100 - d05_024, y = c02_010),
color = 'red',
method = lm) 
Regardless of the gender of the population 45 to 64 years, the percentage of population 45 to 64 years without a bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.
Given the same percentage of population enrolled in college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than that of the female population 45 to 64 years.
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100 - d03_024, y = c04_010),
color = 'blue') +
geom_point(aes(x = 100 - d05_024, y = c04_010),
color = 'red') +
geom_smooth(aes(x = 100 - d03_024, y = c04_010),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100 - d05_024, y = c04_010),
color = 'red',
method = lm) 
Regardless of the gender of the population 45 to 64 years, the percentage of population 45 to 64 years without a bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.
At the same percentage of population 45 to 64 years without a bachelor’s degree, the educational attainment of the female population 45 to 64 years may be associated with a higher level of public college enrollment than that of the male population 45 to 64 years.
Given the same percentage of population enrolled in public college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than that of the female population 45 to 64 years.
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100 - d03_024, y = c06_010),
color = 'blue') +
geom_point(aes(x = 100 - d05_024, y = c06_010),
color = 'red') +
geom_smooth(aes(x = 100 - d03_024, y = c06_010),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100 - d05_024, y = c06_010),
color = 'red',
method = lm) 
Regardless of the gender of the population 45 to 64 years, the percentage of population 45 to 64 years without a bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.
Given the same percentage of population enrolled in private college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than that of the female population 45 to 64 years.
Provide both (1) ggplot codes and (2) a couple of
sentences to describe how the relationships described in Q3b vary by
gender of college enrollment.
To represent a level of male college enrollment
of population in a county, I choose variable, c02_011, a
percent of male population enrolled in college or
graduate school.
To represent a level of female college
enrollment of population in a county, I choose variable,
c02_012, a percent of female population
enrolled in college or graduate school.
The following ggplot describes the relationship between male/female college enrollment and educational attainment of populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100-d01_024, y = c02_011),
color = 'blue') +
geom_point(aes(x = 100-d01_024, y = c02_012),
color = 'red') +
geom_smooth(aes(x = 100-d01_024, y = c02_011),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100-d01_024, y = c02_012),
color = 'red',
method = lm) 
Regardless of the gender of those enrolled in college, the percentage of population 45 to 64 years without a bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.
At the same percentage of population 45 to 64 years without a bachelor’s degree, the educational attainment of population 45 to 64 years may be associated with a higher level of female college enrollment than male college enrollment.
Given the same percentage of population 45 to 64 years without a bachelor’s degree, the percentage of the male population enrolled in college or graduate school tends to be lower than that of the female population.
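Comparing two groups "at the same level" of a predictor amounts to comparing the two fitted lines at a common x value. A minimal sketch using predict(), assuming a hypothetical toy data.frame (all values and column names are made up for illustration, not taken from the NY county data):

```r
# Hypothetical toy data; NOT the NY county data
toy <- data.frame(
  pct_no_bachelors = c(60, 70, 80),
  male_enrolled    = c(8, 6, 4),
  female_enrolled  = c(10, 8, 6)
)

# One fitted line per group, as in the two geom_smooth(method = lm) layers
fit_male   <- lm(male_enrolled ~ pct_no_bachelors, data = toy)
fit_female <- lm(female_enrolled ~ pct_no_bachelors, data = toy)

# Evaluate both lines at the same predictor value
new_x       <- data.frame(pct_no_bachelors = 75)
pred_male   <- predict(fit_male, newdata = new_x)
pred_female <- predict(fit_female, newdata = new_x)

# Positive gap: the female line lies above the male line at this x
gap <- unname(pred_female - pred_male)

print(gap)
```

With the real data, the predictor would be 100 - d01_024 and the outcomes c02_011 and c02_012.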
To represent a level of male
public college enrollment of population in a county, I
choose variable, c04_011, a percent of
male population enrolled in public
college or graduate school.
To represent a level of female
public college enrollment of population in a county, I
choose variable, c04_012, a percent of
female population enrolled in public
college or graduate school.
The following ggplot describes the relationship between male/female public college enrollment and educational attainment of populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100-d01_024, y = c04_011),
color = 'blue') +
geom_point(aes(x = 100-d01_024, y = c04_012),
color = 'red') +
geom_smooth(aes(x = 100-d01_024, y = c04_011),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100-d01_024, y = c04_012),
color = 'red',
method = lm) 
To represent a level of male
private college enrollment of population in a county, I
choose variable, c06_011, a percent of
male population enrolled in private
college or graduate school.
To represent a level of female
private college enrollment of population in a county, I
choose variable, c06_012, a percent of
female population enrolled in private
college or graduate school.
The following ggplot describes the relationship between male/female private college enrollment and educational attainment of populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
geom_point(aes(x = 100-d01_024, y = c06_011),
color = 'blue') +
geom_point(aes(x = 100-d01_024, y = c06_012),
color = 'red') +
geom_smooth(aes(x = 100-d01_024, y = c06_011),
color = 'blue',
method = lm) +
geom_smooth(aes(x = 100-d01_024, y = c06_012),
color = 'red',
method = lm) 
Regardless of the gender of those enrolled in college, the percentage of population 45 to 64 years without a bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.
Regardless of the gender of those enrolled in college, the percentage of population 45 to 64 years without a bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.