Loading R packages for the Midterm Exam

library(tidyverse)



Question 1

Q1a.

Download dominick_oj_q1a.csv from the Midterm Exam in the Assignments or the Files sections in our Canvas.

Then import the dominick_oj_q1a.csv using the following lines:

oj_q1a <- read_csv('ABSOLUTE_PATH_NAME_FOR_THE_FILE_dominick_oj_q1a.csv')
table(oj_q1a$brand)


You need to provide the absolute path name for the file, dominick_oj_q1a.csv to the above read_csv() function to read the file.

Variable Description

  • sales: the number of orange juice (OJ) cartons sold in a week

  • price: price of OJ carton

  • brand: OJ brand

  • feat: Advertisement status— 1 if advertised; 0 if not advertised.

  • Report (1) minimum, (2) median, (3) maximum, (4) mean, and (5) standard deviation of variable price for the brand, Dominick’s OJ.



Q1b

For Question 1b, run the following function to read the dominick_oj.csv file:

oj_q1b <- read_csv(
  'https://bcdanl.github.io/data/dominick_oj.csv'
)


The description of variables in oj_q1b is the same as oj_q1a.

  • Describe the relationship between the log of price and the log of sales by brand using ggplot. Make a simple comment on your ggplot figure.




Question 2

For Question 2, run the following R command to read the nyc_dogs.csv file.

nyc_dogs <- read_csv('https://bcdanl.github.io/data/nyc_dogs.csv')


Q2a

Describe the distribution of animal_gender using ggplot. Make a simple comment on your ggplot figure.



Q2b

Find the five most popular breeds in NYC.



Q2c

Describe the relationship between the five popular breeds and borough using ggplot. Make a simple comment on your ggplot figure.



Q2d

Find the five most popular breeds for each borough in NYC.



Q2e

Find the five most popular dog names for each gender in NYC.



Q2f

Find the five most popular dog names for each gender for each borough in NYC.



Q2g

Assume that all dogs in the nyc_dogs data frame are alive as of today.

Describe the distribution of age for each borough using ggplot. Make a simple comment on your ggplot.




Question 3

For Question 3, run the following function to read the NYC’s Citywide Payroll Data.

nyc_payroll <- read_csv(
  'https://bcdanl.github.io/data/nyc_payroll.csv'
)


Description of variables in the nyc_payroll dataset is provided at the end of the R script.


Q3a.

Create a variable, payroll, which is defined as:

\[ \texttt{payroll} = \texttt{regular_gross_paid} + \texttt{total_ot_paid}\] where regular_gross_paid and total_ot_paid are variables in the nyc_payroll data frame.



Q3b.

Calculate the mean of payroll by title_description.



Q3c.

Calculate the mean of payroll by work_location_borough.



Variable Description

  1. Fiscal Year: Fiscal Year

  2. Payroll Number: Payroll Number

  3. Agency Name: The Payroll agency that the employee works for

  4. Last Name: Last name of employee

  5. First Name: First name of employee

  6. Mid Init: Middle initial of employee

  7. Agency Start Date: Date which employee began working for their current agency Date & Time

  8. Work Location Borough: Borough of employee’s primary work location

  9. Title Description: Civil service title description of the employee

  10. Leave Status as of June 30: Status of employee as of the close of the relevant fiscal year: Active, Ceased, or On Leave

  11. Base Salary: Base Salary assigned to the employee

  12. Pay Basis: Lists whether the employee is paid on an hourly, per diem or annual basis

  13. Regular Hours: Number of regular hours employee worked in the fiscal year

  14. Regular Gross Paid: The amount paid to the employee for base salary during the fiscal year

  15. OT Hours: Overtime Hours worked by employee in the fiscal year

  16. Total OT Paid: Total overtime pay paid to the employee in the fiscal year

  17. Total Other Pay: Includes any compensation in addition to gross salary and overtime pay, ie Differentials, lump sums, uniform allowance, meal allowance, retroactive pay increases, settlement amounts, and bonus pay, if applicable.