Loading R packages for the Midterm Exam

library(tidyverse)

Question 1

Q1a.

Download dominick_oj_q1a.csv from the Midterm Exam page under Assignments, or from the Files section, in our Canvas.

Then import dominick_oj_q1a.csv using the following lines:

oj_q1a <- read_csv('ABSOLUTE_PATH_NAME_FOR_THE_FILE_dominick_oj_q1a.csv')
table(oj_q1a$brand)

To read the file, pass the absolute path of dominick_oj_q1a.csv to the read_csv() call above.

Variable Description

sales: the number of orange juice (OJ) cartons sold in a week
price: price of OJ carton
brand: OJ brand
feat: Advertisement status (1 if advertised; 0 if not advertised)

Report the (1) minimum, (2) median, (3) maximum, (4) mean, and (5) standard deviation of the variable price for the Dominick’s OJ brand.
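One possible way to compute these five statistics is a filter() followed by summarise(). The sketch below is not an official answer: it runs on a small made-up data frame (oj_toy) rather than the exam file, and the brand label "dominicks" is an assumption that should be checked against the output of table(oj_q1a$brand).

```r
library(dplyr)

# Toy data standing in for oj_q1a; the values and brand labels are made up.
oj_toy <- tibble(
  brand = c("dominicks", "dominicks", "dominicks", "tropicana", "tropicana"),
  price = c(1.50, 2.00, 2.50, 3.00, 3.50)
)

# Keep one brand, then collapse its prices into the five requested statistics.
price_summary <- oj_toy %>%
  filter(brand == "dominicks") %>%   # assumed label; confirm with table()
  summarise(
    price_min    = min(price),
    price_median = median(price),
    price_max    = max(price),
    price_mean   = mean(price),
    price_sd     = sd(price)
  )

price_summary
```

With the real data, oj_toy would be replaced by oj_q1a and the brand label by whatever value the file actually uses.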

Q1b

For Question 1b, run the following code to read the dominick_oj.csv file:

oj_q1b <- read_csv(
  'https://bcdanl.github.io/data/dominick_oj.csv'
)

The variables in oj_q1b have the same descriptions as those in oj_q1a.

Describe the relationship between the log of price and the log of sales by brand using ggplot. Make a simple comment on your ggplot figure.
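A log-log scatter plot with one fitted line per brand is one way to approach this kind of question. The sketch below uses a made-up data frame (oj_toy) in place of oj_q1b, and the choice of geom_smooth(method = "lm") is an illustrative assumption, not a required method.

```r
library(ggplot2)

# Toy data standing in for oj_q1b; the values and brand labels are made up.
oj_toy <- data.frame(
  brand = rep(c("dominicks", "tropicana"), each = 4),
  price = c(1.5, 1.8, 2.1, 2.4, 2.8, 3.1, 3.4, 3.7),
  sales = c(9000, 7000, 5200, 4100, 8000, 6500, 5000, 3900)
)

# Map log(price) and log(sales) to the axes and color by brand, so each
# brand gets its own set of points and its own straight-line fit.
p <- ggplot(oj_toy, aes(x = log(price), y = log(sales), color = brand)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "log(price)", y = "log(sales)", color = "Brand")

p
```

Replacing color = brand with facet_wrap(~ brand) would give one panel per brand instead of overlaid lines; either layout supports a short comment on how the slopes compare.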

Question 2

For Question 2, run the following R command to read the nyc_dogs.csv file.

nyc_dogs <- read_csv('https://bcdanl.github.io/data/nyc_dogs.csv')

Q2a

Describe the distribution of animal_gender using ggplot. Make a simple comment on your ggplot figure.

Q2b

Find the five most popular breeds in NYC.
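Counting rows per category and keeping the largest counts is a common pattern for "most popular" questions. The sketch below runs on a made-up data frame (dogs_toy); the column name breed_name is hypothetical and should be replaced with the actual breed column in nyc_dogs.

```r
library(dplyr)

# Toy data standing in for nyc_dogs; breed_name is a hypothetical column name.
dogs_toy <- tibble(
  breed_name = c(
    rep("Yorkshire Terrier", 6), rep("Shih Tzu", 5), rep("Chihuahua", 4),
    rep("Maltese", 3), rep("Labrador Retriever", 2), "Beagle"
  )
)

# count() tallies rows per breed into a column named n;
# slice_max() then keeps the five breeds with the largest n.
top5_breeds <- dogs_toy %>%
  count(breed_name, sort = TRUE) %>%
  slice_max(n, n = 5, with_ties = FALSE)

top5_breeds
```

Setting with_ties = FALSE guarantees exactly five rows; leaving the default (with_ties = TRUE) would keep every breed tied for fifth place.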

Q2c

Describe the relationship between the five popular breeds and borough using ggplot. Make a simple comment on your ggplot figure.

Q2d

Find the five most popular breeds for each borough in NYC.
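The per-borough version can be sketched by counting within each group and then slicing within each group. The example below uses made-up data (dogs_toy) with a hypothetical breed_name column, and it keeps the top two breeds per borough only to keep the toy data short; the exam question would use n = 5.

```r
library(dplyr)

# Toy data standing in for nyc_dogs; breed_name is a hypothetical column name.
dogs_toy <- tibble(
  borough    = c(rep("Brooklyn", 6), rep("Queens", 6)),
  breed_name = c(
    "Shih Tzu", "Shih Tzu", "Shih Tzu", "Maltese", "Maltese", "Beagle",
    "Chihuahua", "Chihuahua", "Chihuahua", "Beagle", "Beagle", "Maltese"
  )
)

# Count each borough-breed pair, then rank breeds inside each borough.
top_per_borough <- dogs_toy %>%
  count(borough, breed_name) %>%
  group_by(borough) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # use n = 5 on the real data
  ungroup()

top_per_borough
```

The same shape extends to two grouping variables (for example, gender and borough) by adding both to count() and group_by().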

Q2e

Find the five most popular dog names for each gender in NYC.

Q2f

Find the five most popular dog names for each gender for each borough in NYC.

Q2g

Assume that all dogs in the nyc_dogs data frame are alive as of today.

Describe the distribution of age for each borough using ggplot. Make a simple comment on your ggplot.

Question 3

For Question 3, run the following code to read NYC’s Citywide Payroll Data.

nyc_payroll <- read_csv(
  'https://bcdanl.github.io/data/nyc_payroll.csv'
)

Descriptions of the variables in the nyc_payroll dataset are provided at the end of the R script.

Q3a.

Create a variable, payroll, which is defined as:

\[ \texttt{payroll} = \texttt{regular_gross_paid} + \texttt{total_ot_paid} \]

where regular_gross_paid and total_ot_paid are variables in the nyc_payroll data frame.
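The definition above maps directly onto a mutate() call. The sketch below applies it to a made-up data frame (payroll_toy) that contains only the two columns named in the formula; the values are invented for illustration.

```r
library(dplyr)

# Toy data standing in for nyc_payroll; the values are made up.
payroll_toy <- tibble(
  regular_gross_paid = c(50000, 62000.5, 0),
  total_ot_paid      = c(1200, 0, 350.25)
)

# Add payroll as the row-wise sum of regular pay and overtime pay.
payroll_toy <- payroll_toy %>%
  mutate(payroll = regular_gross_paid + total_ot_paid)

payroll_toy
```

Note that if either input is NA for a row, the sum is NA for that row as well.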

Q3b.

Calculate the mean of payroll by title_description.

Q3c.

Calculate the mean of payroll by work_location_borough.
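Group means like this one are typically computed with group_by() and summarise(). The sketch below uses a made-up data frame (payroll_toy) that already has a payroll column; passing na.rm = TRUE and sorting the result are illustrative choices, not requirements of the question.

```r
library(dplyr)

# Toy data standing in for nyc_payroll after payroll has been created;
# the values are made up, and one row has a missing borough.
payroll_toy <- tibble(
  work_location_borough = c("BRONX", "BRONX", "QUEENS", NA),
  payroll               = c(40000, 60000, 55000, 30000)
)

# One row per borough, holding the mean of payroll within that borough.
mean_by_borough <- payroll_toy %>%
  group_by(work_location_borough) %>%
  summarise(payroll_mean = mean(payroll, na.rm = TRUE)) %>%
  arrange(desc(payroll_mean))

mean_by_borough
```

Rows with a missing borough form their own NA group here; whether to keep or drop that group is a judgment call worth stating in the answer. Swapping the grouping column for title_description gives the mean by job title in the same way.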

Variable Description

Fiscal Year: Fiscal Year
Payroll Number: Payroll Number
Agency Name: The Payroll agency that the employee works for
Last Name: Last name of employee
First Name: First name of employee
Mid Init: Middle initial of employee
Agency Start Date: Date on which the employee began working for their current agency
Work Location Borough: Borough of employee’s primary work location
Title Description: Civil service title description of the employee
Leave Status as of June 30: Status of employee as of the close of the relevant fiscal year: Active, Ceased, or On Leave
Base Salary: Base Salary assigned to the employee
Pay Basis: Lists whether the employee is paid on an hourly, per diem or annual basis
Regular Hours: Number of regular hours employee worked in the fiscal year
Regular Gross Paid: The amount paid to the employee for base salary during the fiscal year
OT Hours: Overtime Hours worked by employee in the fiscal year
Total OT Paid: Total overtime pay paid to the employee in the fiscal year
Total Other Pay: Includes any compensation in addition to gross salary and overtime pay, i.e., differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.

DANL 200: Introduction to Data Analytics

DANL 200 - Midterm Exam, Spring 2022

Byeong-Hak Choe

2023-02-14
