Loading R packages for Homework Assignment 1

library(tidyverse)
library(skimr)
# install.packages("hexbin")   # if you do not have the "hexbin" package
library(hexbin)



Question 1

Q1a

Download the compressed file, bikeshare-2011-01-01.zip, from the Files section of our Canvas web-site and extract it to obtain bikeshare-2011-01-01.csv. Read the data file as a data.frame object named bikeshare2011_01_01, using (1) the read_csv() function and (2) the absolute path name of the file bikeshare_2011_01_01.csv on the local hard disk drive of your laptop.

path <- '/Users/byeong-hakchoe/Google Drive/suny-geneseo/fall2022/bikeshare_2011_01_01.csv'
bikeshare2011_01_01 <- read_csv(path)

# View `bikeshare2011_01_01`
bikeshare2011_01_01
  • We can view the data.frame bikeshare2011_01_01:
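A minimal sketch of the same read step, assuming a small stand-in file written to a temporary folder (the file name demo_bikeshare.csv and its three rows are made up for illustration). It uses base R's read.csv() so that it runs without extra packages; with the tidyverse loaded, read_csv() can be called on the same absolute path.

```r
# Write a tiny stand-in CSV (hypothetical data) to a temporary folder
demo_dir  <- tempdir()
demo_path <- file.path(demo_dir, "demo_bikeshare.csv")
write.csv(data.frame(hr = 0:2, cnt = c(16, 40, 32)),
          demo_path, row.names = FALSE)

# Convert to an absolute path and confirm the file exists before reading
abs_path <- normalizePath(demo_path)
file.exists(abs_path)

# Read the file from its absolute path
demo <- read.csv(abs_path)
demo
```

Checking file.exists() first can make a mistyped path easier to spot than the error raised by the read function itself.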



Q1b

Report the mean, median, minimum, maximum, and standard deviation for each numeric variable in the data.frame bikeshare2011_01_01.

  • We can use the skim() function to get summary statistics:
library(skimr)
summary(bikeshare2011_01_01)
skim(bikeshare2011_01_01)
Variable    N     Mean     SD      Min       Q1   Median       Q3      Max
hr         24    11.50   7.07     0.00     5.50    11.50    17.50    23.00
holiday    24     0.00   0.00     0.00     0.00     0.00     0.00     0.00
temp       24    -0.79   0.50    -1.54    -1.33    -0.56    -0.40    -0.19
hum        24     0.93   0.31     0.48     0.66     0.90     1.23     1.62
windspeed  24    -0.24   1.10    -1.55    -1.55     0.40     0.76     0.89
year       24  2011.00   0.00  2011.00  2011.00  2011.00  2011.00  2011.00
month      24     1.00   0.00     1.00     1.00     1.00     1.00     1.00
date       24     1.00   0.00     1.00     1.00     1.00     1.00     1.00
cnt        24    41.04  34.29     1.00    13.50    35.50    61.50   110.00
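The statistics requested in Q1b can also be computed column by column. The following is a sketch in base R on a made-up data.frame (the object toy and its values are hypothetical), intended only as a cross-check on what skim() and summary() report.

```r
# Toy data.frame (hypothetical values, not the bikeshare data)
toy <- data.frame(hr    = c(0, 1, 2, 3),
                  cnt   = c(10, 20, 30, 100),
                  label = c("a", "b", "c", "d"))

# Keep only the numeric columns
num_cols <- toy[sapply(toy, is.numeric)]

# One row per variable, one column per statistic
stats <- t(sapply(num_cols, function(x) {
  c(mean = mean(x), median = median(x),
    min = min(x), max = max(x), sd = sd(x))
}))
round(stats, 2)
```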




Question 2

Q2a

Read the data file, bikeshare_cleaned.csv, as the data.frame object with the name, bikeshare, using (1) the read_csv() function and (2) its URL, https://bcdanl.github.io/data/bikeshare_cleaned.csv.

url <- 'https://bcdanl.github.io/data/bikeshare_cleaned.csv'
bikeshare <- read_csv(url)

View(bikeshare)
summary(bikeshare)
skim(bikeshare)

table(bikeshare$wkday)
table(bikeshare$month)
table(bikeshare$seasons)
table(bikeshare$weather_cond)

prop.table( table( bikeshare$wkday ) )
prop.table( table( bikeshare$month ) )
prop.table( table( bikeshare$seasons ) )
prop.table( table( bikeshare$weather_cond ) )


  • We can view the data.frame bikeshare:


  • The following summarizes the data.frame bikeshare:


Variable       N     Mean      SD      Min       Q1   Median       Q3      Max
cnt        17376   189.48  181.40     1.00    40.00   142.00   281.00   977.00
year       17376  2011.50    0.50  2011.00  2011.00  2012.00  2012.00  2012.00
hr         17376    11.55    6.91     0.00     6.00    12.00    18.00    23.00
holiday    17376     0.03    0.17     0.00     0.00     0.00     0.00     1.00
temp       17376     0.00    1.00    -2.48    -0.82     0.02     0.85     2.61
hum        17376     0.00    1.00    -3.25    -0.76     0.01     0.79     1.93
windspeed  17376     0.00    1.00    -1.55    -0.70     0.03     0.52     5.40

Variable      Level                         N     %
wkday         monday                     2478  14.3
              tuesday                    2453  14.1
              wednesday                  2474  14.2
              thursday                   2471  14.2
              friday                     2487  14.3
              saturday                   2511  14.5
              sunday                     2502  14.4

month         01                         1426   8.2
              02                         1341   7.7
              03                         1473   8.5
              04                         1437   8.3
              05                         1488   8.6
              06                         1440   8.3
              07                         1488   8.6
              08                         1475   8.5
              09                         1437   8.3
              10                         1451   8.4
              11                         1437   8.3
              12                         1483   8.5

seasons       spring                     4239  24.4
              summer                     4409  25.4
              fall                       4496  25.9
              winter                     4232  24.4

weather_cond  Clear or Few Cloudy       11413  65.7
              Light Snow or Light Rain   1419   8.2
              Mist or Cloudy             4544  26.2
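The N and % columns for a categorical variable come from table() and prop.table(). A small sketch on a made-up vector (the object weather and its values are hypothetical) shows how the two fit together:

```r
# Toy categorical variable (hypothetical values)
weather <- c("Clear", "Clear", "Mist", "Clear", "Rain",
             "Mist", "Clear", "Clear", "Mist", "Clear")

counts <- table(weather)                      # N per level
pct    <- round(100 * prop.table(counts), 1)  # share per level, in percent

# Combine counts and percentages into one small table
data.frame(Level = names(counts),
           N     = as.vector(counts),
           Pct   = as.vector(pct))
```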



Use the data.frame bikeshare for the remaining parts of Question 2.

Description of variables in the data file, bikeshare_cleaned.csv

The data set, bikeshare_cleaned.csv, includes 17376 observations of hourly counts from 2011 to 2012 for bike rides (rentals) in Washington D.C.


  • cnt: count of total bikes rented out
  • year: year
  • month: month
  • date: day of the month
  • hr: hour of the day (0 to 23)
  • wkday: day of the week
  • holiday: 1 if the day is a holiday; 0 otherwise
  • seasons: season
  • weather_cond: weather condition
  • temp: temperature, measured in standard deviations from average.
  • hum: humidity, measured in standard deviations from average.
  • windspeed: wind speed, measured in standard deviations from average.
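"Measured in standard deviations from average" describes a standardized (z-score) variable. The sketch below assumes the usual formula, subtracting the mean and dividing by the standard deviation; the raw values are made up, and the source does not say exactly how temp, hum, and windspeed were standardized, although their reported mean of 0.00 and SD of 1.00 are consistent with this.

```r
# Hypothetical raw temperatures in degrees Celsius
temp_raw <- c(2, 5, 9, 14, 20, 25, 28, 27, 22, 15, 9, 4)

# Standardize: subtract the mean, divide by the standard deviation
temp_z <- (temp_raw - mean(temp_raw)) / sd(temp_raw)

round(temp_z, 2)
mean(temp_z)  # approximately 0
sd(temp_z)    # 1 up to floating-point rounding
```

Under this reading, a value such as temp = 1.5 means 1.5 standard deviations above the average temperature.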



Q2b

Provide both (1) ggplot code and (2) a couple of sentences to describe the distribution of cnt.

ggplot(bikeshare) +
  geom_histogram(aes(x = cnt),
                 binwidth = 5) 



  • The distribution of cnt is right-skewed.
  • The most common values for cnt range from 0 to 50.
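A quick numerical check for right skew is to compare the mean with the median: in the summary table for bikeshare, cnt has a mean of 189.48 and a median of 142.00. The sketch below illustrates the same comparison on made-up counts (the vector cnt_toy is hypothetical).

```r
# Hypothetical hourly counts: many small values, a few very large ones
cnt_toy <- c(3, 5, 8, 12, 20, 25, 40, 60, 150, 400, 900)

mean(cnt_toy)
median(cnt_toy)

# A mean well above the median is a common sign of right skew
mean(cnt_toy) > median(cnt_toy)
```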

Q2c

Provide both (1) ggplot code and (2) a couple of sentences to describe the distribution of cnt by year and month.

# density plot
ggplot(bikeshare) +
  geom_density( aes(x = cnt, fill = month),
                color = NA,
                show.legend = F) +
  facet_grid(month~year) 

# histogram
ggplot(bikeshare) +
  geom_histogram( aes(x = cnt, fill = month),
                  binwidth = 10,
                  color = NA,
                  show.legend = F) +
  facet_grid(month~year) 

# boxplot
ggplot(bikeshare) +
  geom_boxplot( aes(x = cnt, y = month, 
                    fill = month),
                  show.legend = F) 

  • Overall, the demand for bike rentals tends to be higher in 2012 than in 2011.
  • Overall, the demand for bike rentals tends to be lower in winter.
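The visual comparison across facets can be backed by a grouped summary. This is a sketch in base R on made-up records (the data.frame toy is hypothetical), computing the mean of cnt for each year and month cell:

```r
# Toy hourly records (hypothetical values)
toy <- data.frame(
  year  = c(2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012),
  month = c("01", "01", "07", "07", "01", "01", "07", "07"),
  cnt   = c(10, 30, 150, 250, 40, 60, 300, 400)
)

# Mean of cnt for each year-month cell
by_cell <- aggregate(cnt ~ year + month, data = toy, FUN = mean)
by_cell
```

Replacing mean with median in the FUN argument gives a summary that is less sensitive to the long right tail of cnt.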



Q2d

Provide both (1) ggplot code and (2) a couple of sentences to describe the distribution of temp by year and month.

ggplot(bikeshare) +
  geom_histogram(aes(x = temp, fill = month),
                  binwidth = .1,
                 show.legend = F) +
  geom_vline(xintercept = 0, color = 'red') + 
  facet_grid(month ~ year) 

  • The temperature distributions show a clear four-season pattern.
  • January tends to be the coldest month and July the hottest.
  • The distribution of temperature across months looks similar across years 2011-2012.



Q2e

Provide both (1) ggplot code and (2) a couple of sentences to describe the distribution of hum by year and month.

ggplot(bikeshare) +
  geom_histogram(aes(x = hum, fill = month),
                binwidth = .1,
                show.legend = F) +
  geom_vline(xintercept = 0, color = 'red') + 
  facet_grid(month ~ year) 

  • In years 2011-2012 in Washington D.C., May, August, and September tend to be more humid than other months.
  • In years 2011-2012 in Washington D.C., January and February tend to be less humid than other months.
  • Overall, the distribution of humidity across months looks similar across years 2011-2012 in Washington D.C.



Q2f

Provide both (1) ggplot code and (2) a couple of sentences to describe the distribution of windspeed by year and month.

ggplot(bikeshare) +
  geom_density(aes(x = windspeed)) +
  geom_vline(xintercept = 0, color = 'red') + 
  facet_grid(month ~ year) 

  • Overall, the monthly distribution of wind speed looks similar across years 2011-2012 in Washington D.C.
  • In years 2011-2012 in Washington D.C., a noticeably low wind speed of about -1.5 (roughly 1.5 standard deviations below the standardized mean of 0) is observed in every month.



Q2g

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between temp and cnt.

ggplot(bikeshare,
       aes(x = temp, y = cnt)) +
  geom_hex() +
  geom_smooth(color = 'red') +
  geom_smooth(method = lm) 

  • temp and cnt are positively associated with each other.
  • Very high temp (above about 1.5) may be associated with lower cnt.
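The contrast between the straight-line fit and the flexible smoother can be illustrated numerically. The sketch below uses made-up data with a built-in hump and swaps in a quadratic regression as a simple stand-in for the default geom_smooth() curve; it is not the method used in the plot above, and the coefficients 200, 80, and -25 are arbitrary choices.

```r
# Toy data with a built-in hump (hypothetical, not the bikeshare data)
temp <- seq(-2, 2.5, by = 0.1)
cnt  <- 200 + 80 * temp - 25 * temp^2

# Straight-line fit: the overall slope is positive
fit_line <- lm(cnt ~ temp)
coef(fit_line)[["temp"]]

# Quadratic fit: a negative squared term means the curve eventually turns down
fit_quad <- lm(cnt ~ temp + I(temp^2))
b <- coef(fit_quad)

# Temperature at which the fitted curve peaks
peak_temp <- -b[["temp"]] / (2 * b[["I(temp^2)"]])
peak_temp
```

A positive linear slope together with a negative squared term matches the pattern described in the two bullets: an overall positive association that reverses at high temperatures.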



Q2h

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between temp and cnt by year and month.

ggplot(bikeshare, aes(x = temp, y = cnt )) +
    geom_point(alpha = .1)  +
    geom_smooth(color = "red3",
                fill = "orchid",
                method = lm)  +
    geom_smooth(color = "royalblue",
                fill = "orchid")  +
  facet_grid(month~year)

  • Overall, temp and cnt are positively associated with each other.
  • In June, July, and August, the association between temp and cnt switches from positive to negative when temp is around 1.5.



Q2i

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between weather_cond and cnt.

ggplot(bikeshare) +
  geom_histogram(aes(x = cnt),
                 binwidth = 5) + 
  facet_grid(. ~ weather_cond) 

ggplot(bikeshare) +
  geom_density(aes(x = cnt)) + 
  facet_grid(. ~ weather_cond) 

  • In 2011-2012 in Washington D.C., bikes were rented more often when weather_cond is Clear or Few Cloudy.
  • The most common values for cnt are around 50 across all values of weather_cond.
  • When weather_cond is Light Snow or Light Rain, fewer bikes tend to be rented.



Q2j

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between weather_cond and cnt by hr.

ggplot(bikeshare) +
  geom_density(aes(x = cnt, fill = as.factor(hr)),
               color = NA,
               show.legend = F) + 
  facet_grid(hr ~ weather_cond, scales = "free_y") 

  • The values of cnt during commuting hours (7:00 A.M.-8:59 A.M. and 5:00 P.M.-7:59 P.M.) are often larger than during other hours.
    • This suggests that a shortage of rental bikes is more likely to happen during these hours.



Q2k

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between wkday and cnt.

ggplot(bikeshare) +
  geom_density(aes(x = cnt,
                   fill = as.factor(wkday)),
               color = NA,
               show.legend = F) + 
  facet_grid(. ~ wkday)

ggplot(bikeshare) +
  geom_histogram(aes(x = cnt,
                   fill = as.factor(wkday)),
                 binwidth = 10,
               color = NA,
               show.legend = F) + 
  facet_grid(. ~ wkday)


ggplot(bikeshare,
       aes(x = wkday, y = cnt)) +
  geom_boxplot( aes(fill = wkday),
                show.legend = F ) +
  stat_summary(fun = mean, geom = "point")   # mark the mean of cnt for each wkday

  • The distribution of cnt is right-skewed, and looks similar across all values of wkday.



Q2l

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between wkday and cnt by hr.

ggplot(bikeshare) +
  geom_density(aes(x = cnt,
                     fill = as.factor(hr)),
               color = NA, 
               show.legend = F) + 
  facet_grid(hr~wkday, scales = "free_y") 

ggplot(bikeshare) +
  geom_histogram(aes(x = cnt,
                     fill = as.factor(hr)),
                 binwidth = 10,
               color = NA, 
               show.legend = F) + 
  facet_grid(hr~wkday, scales = "free_y") 

ggplot(bikeshare) +
  geom_boxplot(aes(x = cnt,
                   y = as.factor(hr),
                   fill = as.factor(hr)),
               show.legend = F) + 
  facet_grid(.~wkday, scales = "free_y") 

  • During hours 10 to 15, more bikes tend to be rented on Saturday and Sunday than on weekdays.
  • During the morning commuting hours, fewer bikes tend to be rented on Saturday and Sunday than on weekdays.
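The weekday and weekend comparison by hour can also be tabulated. This is a sketch on made-up records (the data.frame toy, the weekend flag, and all values are hypothetical), using tapply() to average cnt for each hour and weekend status:

```r
# Toy records (hypothetical values)
toy <- data.frame(
  wkday = c("monday", "monday", "saturday", "saturday",
            "tuesday", "tuesday", "sunday", "sunday"),
  hr    = c(8, 12, 8, 12, 8, 12, 8, 12),
  cnt   = c(400, 150, 80, 300, 420, 170, 60, 320)
)

# Flag weekend days
toy$weekend <- toy$wkday %in% c("saturday", "sunday")

# Mean cnt for each hour (rows) and weekend flag (columns)
mean_cnt <- tapply(toy$cnt, list(hr = toy$hr, weekend = toy$weekend), mean)
mean_cnt
```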




Question 3

Q3a

Read the data file, NY_school_enrollment_socioecon.csv, as the data.frame object with the name, NY_school_enrollment_socioecon, using (1) the read_csv() function and (2) its URL, https://bcdanl.github.io/data/NY_school_enrollment_socioecon.csv.

url2 <- 'https://bcdanl.github.io/data/NY_school_enrollment_socioecon.csv'
NY_school_enrollment_socioecon <- read_csv(url2)
View(NY_school_enrollment_socioecon)
  • We can view the data.frame NY_school_enrollment_socioecon:



For a description of the variables in NY_school_enrollment_socioecon, refer to the file, ny_school_enrollment_socioecon_description.zip, in the Files section of our Canvas web-page. (I recommend extracting the zip file and then opening ny_school_enrollment_socioecon_description.csv in Excel or Numbers.)

  • Here are some details about the data.frame, NY_school_enrollment_socioecon:

  • The geographic and time units of observation (row) in the data.frame, NY_school_enrollment_socioecon, are New York county and year.

FIPS   year  county_name  pincp  c01_001  c02_002
36001  2015  Albany       55793  84463    4.7
  • For example, the observation above means that in Albany county in year 2015 …
    • Average personal income of people is $55,793.
    • Population 3 years and over enrolled in school is 84,463.
    • Percent of population 3 years and over enrolled in nursery school and preschool is 4.7%.



  • The following are sample observations from Bronx and Livingston counties:

  • The following describes the variables:

    • c01_010: Total!!Population enrolled in college or graduate school

    • So, c01_010 is total population enrolled in college or graduate school;

    • c02_010: Percent!!Population enrolled in college or graduate school

    • So, c02_010 is a percent of total population enrolled in college or graduate school;

    • In which county is a person more likely to be enrolled in a college or graduate school?



  • A county’s college enrollment level can be represented by the overall tendency of that county’s residents to be enrolled in college (as long as we are interested in analyzing how people behave overall).

  • The size of a county’s population enrolled in college or graduate school (c01_010) may not be appropriate to represent a county’s college enrollment level.

    • A larger population enrolled in college does not necessarily mean that people in that county are more likely to be enrolled in college.
  • Consider the following example:

County  Total Population  Bachelor's Degree  High School  Percent of Bachelor's Degree  Percent of High School
A       100,000           1,000              99,000       1.0%                          99.0%
B       1,000             999                1            99.9%                         0.1%
  • Although County A has a larger number of people with bachelor’s degrees than County B, it is more appropriate to say that people in County B have a higher level of college enrollment than people in County A.

  • This is because the overall tendency of County B’s people to attend college is stronger than that of County A’s people.

  • Similarly, to represent the standard of living in a country, we use not its gross domestic product (GDP) but its GDP per capita (GDP divided by population).

    • For example, China has the second-largest GDP in the world as of now. However, the World Bank still considers China a middle-income country because of its relatively low GDP per capita.
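The count-versus-share argument can be checked directly with the numbers from the County A and County B example. The sketch below rebuilds that example as a data.frame (the object and column names are chosen here for illustration) and compares the two rankings:

```r
# The two hypothetical counties from the example above
counties <- data.frame(
  county     = c("A", "B"),
  population = c(100000, 1000),
  bachelors  = c(1000, 999)
)

# Share of each county's population with a bachelor's degree, in percent
counties$pct_bachelors <- 100 * counties$bachelors / counties$population
counties

# County with the larger count versus the larger share
counties$county[which.max(counties$bachelors)]      # "A"
counties$county[which.max(counties$pct_bachelors)]  # "B"
```

This is why the analysis below relies on the percent variables (such as c02_010) rather than the totals (such as c01_010).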



Q3b

Provide both (1) ggplot code and (2) a couple of sentences to describe the relationship between college enrollment and educational attainment of population 45 to 64 years, and how such relationship varies by the type (public or private) of colleges.

  • To represent a level of educational attainment of population 45 to 64 years in a county, I choose variable, 100 - d01_024, a percent of population 45 to 64 years without bachelor’s degree.

  • To represent a level of college enrollment of population in a county, I choose variable, c02_010, a percent of population enrolled in college or graduate school.

  • To represent a level of public college’s enrollment of population in a county, I choose variable, c04_010, a percent of population enrolled in public college or graduate school.

  • To represent a level of private college’s enrollment of population in a county, I choose variable, c06_010, a percent of population enrolled in private college or graduate school.

  • The following ggplot describes the relationship between college enrollment and educational attainment of population 45 to 64 years:

ggplot(NY_school_enrollment_socioecon,
       aes(x = 100 - d01_024, y = c02_010)) +
  geom_hex()  +
  geom_smooth(method = lm)  +
  geom_smooth(color = 'red') +
  coord_fixed()

  • The percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.

  • The following ggplot describes the relationship between college enrollment in public schools and educational attainment of population 45 to 64 years:

ggplot(NY_school_enrollment_socioecon,
       aes(x = 100 - d01_024, y = c04_010)) +
  geom_hex()  +
  geom_smooth(method = lm)  +
  geom_smooth(color = 'red') +
  coord_fixed()

  • The following ggplot describes the relationship between college enrollment in private schools and educational attainment of population 45 to 64 years:
ggplot(NY_school_enrollment_socioecon,
       aes(x = 100 - d01_024, y = c06_010)) +
  geom_hex()  +
  geom_smooth(method = lm)  +
  geom_smooth(color = 'red') +
  coord_fixed()

  • The percentage of population 45 to 64 years without Bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.

  • The percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.
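Because the x variable is 100 - d01_024, every association reported here has the opposite sign of the association with d01_024 itself. The sketch below shows this on made-up percentages; it assumes d01_024 is the percent of the population 45 to 64 years with a bachelor's degree, and the vectors pct_with_ba and pct_college are hypothetical stand-ins for d01_024 and c02_010.

```r
# Hypothetical county-level percentages
pct_with_ba <- c(20, 25, 30, 35, 40, 45)  # stands in for d01_024
pct_college <- c(5, 6, 8, 9, 11, 12)      # stands in for c02_010

# Percent without a bachelor's degree
pct_without_ba <- 100 - pct_with_ba

cor(pct_with_ba, pct_college)     # positive
cor(pct_without_ba, pct_college)  # same size, opposite sign
```

So a negative association with the percent without a bachelor's degree is the same finding as a positive association with the percent holding one.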



Q3c

Provide both (1) ggplot code and (2) a couple of sentences to describe how the relationships described in Q3b vary by gender of population 45 to 64 years.

  • To represent a level of educational attainment of male population 45 to 64 years in a county, I choose variable, 100 - d03_024, a percent of male population 45 to 64 years without bachelor’s degree.

  • To represent a level of educational attainment of female population 45 to 64 years in a county, I choose variable, 100 - d05_024, a percent of female population 45 to 64 years without bachelor’s degree.

  • The following ggplot describes the relationship between college enrollment and educational attainment of male/female populations 45 to 64 years:

ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100 - d03_024, y = c02_010),
             color = 'blue')  +
  geom_point(aes(x = 100 - d05_024, y = c02_010),
             color = 'red')  +
  geom_smooth(aes(x = 100 - d03_024, y = c02_010),
              color = 'blue',  
              method = lm)  +
  geom_smooth(aes(x = 100 - d05_024, y = c02_010),
              color = 'red',
              method = lm) 

  • Regardless of a gender of population 45 to 64 years, the percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.

  • Given the same percentage of population enrolled in college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than the corresponding percentage of the female population 45 to 64 years.


  • The following ggplot describes the relationship between public college enrollment and educational attainment of male/female populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100 - d03_024, y = c04_010),
             color = 'blue')  +
  geom_point(aes(x = 100 - d05_024, y = c04_010),
             color = 'red')  +
  geom_smooth(aes(x = 100 - d03_024, y = c04_010),
              color = 'blue',  
              method = lm)  +
  geom_smooth(aes(x = 100 - d05_024, y = c04_010),
              color = 'red',
              method = lm) 

  • Regardless of a gender of population 45 to 64 years, the percentage of population 45 to 64 years without Bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.

  • For the same level of the percentage of population 45 to 64 years without Bachelor’s degree, a level of educational attainment of female population 45 to 64 years may be associated with a higher level of public college enrollment than that of male population 45 to 64 years.

  • Given the same percentage of population enrolled in public college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than the corresponding percentage of the female population 45 to 64 years.


  • The following ggplot describes the relationship between private college enrollment and educational attainment of male/female populations 45 to 64 years:
ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100 - d03_024, y = c06_010),
             color = 'blue')  +
  geom_point(aes(x = 100 - d05_024, y = c06_010),
             color = 'red')  +
  geom_smooth(aes(x = 100 - d03_024, y = c06_010),
              color = 'blue',  
              method = lm)  +
  geom_smooth(aes(x = 100 - d05_024, y = c06_010),
              color = 'red',
              method = lm) 

  • Regardless of a gender of population 45 to 64 years, the percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.

  • Given the same percentage of population enrolled in private college or graduate school, the percentage of the male population 45 to 64 years without a bachelor’s degree tends to be higher than the corresponding percentage of the female population 45 to 64 years.



Q3d

Provide both (1) ggplot code and (2) a couple of sentences to describe how the relationships described in Q3b vary by gender of college enrollment.

  • To represent a level of male college enrollment of population in a county, I choose variable, c02_011, a percent of male population enrolled in college or graduate school.

  • To represent a level of female college enrollment of population in a county, I choose variable, c02_012, a percent of female population enrolled in college or graduate school.

  • The following ggplot describes the relationship between male/female college enrollment and educational attainment of populations 45 to 64 years:

ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100-d01_024, y = c02_011), 
             color = 'blue')  +
  geom_point(aes(x = 100-d01_024, y = c02_012), 
             color = 'red')  +
  geom_smooth(aes(x = 100-d01_024, y = c02_011), 
              color = 'blue',
              method = lm) +
  geom_smooth(aes(x = 100-d01_024, y = c02_012), 
              color = 'red',
              method = lm) 

  • Regardless of a gender of college enrollment, the percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in college or graduate school.

  • For the same level of the percentage of population 45 to 64 years without Bachelor’s degree, a level of educational attainment of population 45 to 64 years may be associated with a higher level of female college enrollment than that of male college enrollment.

  • Given the same percentage of population 45 to 64 years without a bachelor’s degree, the percentage of the male population enrolled in college or graduate school tends to be lower than the percentage of the female population enrolled in college or graduate school.


  • To represent a level of male public college enrollment of population in a county, I choose variable, c04_011, a percent of male population enrolled in public college or graduate school.

  • To represent a level of female public college enrollment of population in a county, I choose variable, c04_012, a percent of female population enrolled in public college or graduate school.

  • The following ggplot describes the relationship between male/female public college enrollment and educational attainment of populations 45 to 64 years:

ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100-d01_024, y = c04_011), 
             color = 'blue')  +
  geom_point(aes(x = 100-d01_024, y = c04_012), 
             color = 'red')  +
  geom_smooth(aes(x = 100-d01_024, y = c04_011), 
              color = 'blue',
              method = lm) +
  geom_smooth(aes(x = 100-d01_024, y = c04_012), 
              color = 'red',
              method = lm) 

  • To represent a level of male private college enrollment of population in a county, I choose variable, c06_011, a percent of male population enrolled in private college or graduate school.

  • To represent a level of female private college enrollment of population in a county, I choose variable, c06_012, a percent of female population enrolled in private college or graduate school.

  • The following ggplot describes the relationship between male/female private college enrollment and educational attainment of populations 45 to 64 years:

ggplot(NY_school_enrollment_socioecon) +
  geom_point(aes(x = 100-d01_024, y = c06_011), 
             color = 'blue')  +
  geom_point(aes(x = 100-d01_024, y = c06_012), 
             color = 'red')  +
  geom_smooth(aes(x = 100-d01_024, y = c06_011), 
              color = 'blue',
              method = lm) +
  geom_smooth(aes(x = 100-d01_024, y = c06_012), 
              color = 'red',
              method = lm) 

  • Regardless of a gender of college enrollment, the percentage of population 45 to 64 years without Bachelor’s degree is positively associated with the percentage of population enrolled in public college or graduate school.

  • Regardless of a gender of college enrollment, the percentage of population 45 to 64 years without Bachelor’s degree is negatively associated with the percentage of population enrolled in private college or graduate school.
