The following is the Python libraries we need for this homework.

import pandas as pd
import seaborn as sns

Question 1

Load the CSV file, beer_markets_cleaned.csv, from the class website:

beer_markets = pd.read_csv('https://bcdanl.github.io/data/beer_markets_cleaned.csv')

Description of DataFrame beer_markets
- One single observation (row) corresponds to information regarding one household and its beer purchase.
- DataFrame beer_markets does not have any missing values.

Variable (column) description
- hh: an identifier of the household;
- purchase_desc: details on the purchased item;
- quantity: the number of items purchased;
- brand: Bud Light, Busch Light, Coors Light, Miller Lite, or Natural Light;
- spent: total dollar value of purchase;
- beer_floz: total volume of beer, in fluid ounces;
- price_per_floz: price per fl.oz. (i.e., beer spent/beer floz);
- container: the type of container;
- promo: Whether the item was promoted (coupon or otherwise);
- market: Scan-track market (or state if rural);
- demographic data, including gender, marital status, household income, class of work, race, education, age, the size of household, and whether or not the household has a microwave or a dishwasher.

Q1a

Sort the DataFrame beer_markets by hh in ascending order.

Answer

beer_markets = beer_markets.sort_values('hh')

Q1b

Count the number of households for each market.

Answer

# The below line counts the number of non-missing values
# for each variable for each group of "market" and "hh".
q1b = beer_markets.groupby(["market", "hh"]).count() 

# In the resulting DataFrame q1b above, 
# there is only one single observation for each group of
# "market" and "hh".
# So, the following line counts the number of households
# for each market.
q1b = q1b.groupby(["market"]).count()

q1b = q1b.loc[ :, ["brand"] ]
q1b.columns = ['n']   # rename `brand` with `n`

# For renaming, the following also works:
# q1b = q1b.rename({'brand':"n"}, axis = 'columns')

Q1c

Find the top 5 beer markets in terms of the number of households that purchased beer.

Answer

q1c = q1b.sort_values('n', ascending = False)

market	n
TAMPA	280
DETROIT	214
COLUMBUS	184
MIAMI	178
PHOENIX	168

Q1d

Sum of beer_floz for each market.

Answer

q1d = beer_markets.groupby(["market"]).sum()
q1d = q1d.loc[ :, ["beer_floz"] ]

# the following gives the Series, instead of the DataFrame
# q1d = q1d['beer_floz']

Q1e

Find the top 5 beer markets in terms of the amount of total beer consumption.

Answer

q1e = q1d.sort_values('beer_floz', ascending = False)

market	beer_floz
TAMPA	462904
PHOENIX	388824
DETROIT	309588
MIAMI	271870
COLUMBUS	266896

Q1f

Variable price_per_floz is continuous.
Variable brand is categorical.

Describe the distribution of price_per_floz for each brand using seaborn.

Make a simple comment on comparison for the distribution of price_per_floz across brands.

Answer

# distribution of price_per_floz for BUD LIGHT
sns.displot( x = 'price_per_floz', bins = 200,
             data = beer_markets[ beer_markets['brand'] == 'BUD LIGHT' ] )

# distribution of price_per_floz for COORS LIGHT
sns.displot( x = 'price_per_floz', bins = 200,
             data = beer_markets[ beer_markets['brand'] == 'COORS LIGHT' ] )

# distribution of price_per_floz for MILLER LITE
sns.displot( x = 'price_per_floz', bins = 200,
             data = beer_markets[ beer_markets['brand'] == 'MILLER LITE' ] )

# distribution of price_per_floz for BUSCH LIGHT
sns.displot( x = 'price_per_floz', bins = 200,
             data = beer_markets[ beer_markets['brand'] == 'BUSCH LIGHT' ] )

# distribution of price_per_floz for NATURAL LIGHT
sns.displot( x = 'price_per_floz', bins = 200,
             data = beer_markets[ beer_markets['brand'] == 'NATURAL LIGHT' ] )

BUD LIGHT, COORS LIGHT, and MILLER LITE have the similar distribution of price_per_floz with each other.
BUSCH LIGHT and NATURAL LIGHT have the similar distribution of price_per_floz with each other.
Overall, BUD LIGHT, COORS LIGHT, and MILLER LITE are more expensive than BUSCH LIGHT and NATURAL LIGHT.

Q1g

Both variables price_per_floz and beer_floz are continuous.
Describe the relationship between price_per_floz and beer_floz by brand using seaborn.
Make a simple comment on the visualization result regarding how the relationship between price_per_floz and beer_floz varies by brand.

sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           scatter_kws = {'alpha' : 0.2},
           data = beer_markets[ beer_markets['brand'] == 'BUD LIGHT' ] )


sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           scatter_kws = {'alpha' : 0.2},
           data = beer_markets[ beer_markets['brand'] == 'COORS LIGHT' ] )


sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           scatter_kws = {'alpha' : 0.2},
           data = beer_markets[ beer_markets['brand'] == 'MILLER LITE' ] )


sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           scatter_kws = {'alpha' : 0.2},
           data = beer_markets[ beer_markets['brand'] == 'BUSCH LIGHT' ] )


sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           scatter_kws = {'alpha' : 0.2},
           data = beer_markets[ beer_markets['brand'] == 'NATURAL LIGHT' ] )


# all the plots in one plot:
sns.lmplot(x = "beer_floz",
           y = "price_per_floz",
           hue = "brand",
           scatter_kws = {'alpha' : 0.2},   
           data = beer_markets)

The law of demand holds in the beer market:
- From the fitted straight line, we see the downward-sloping demand curve for each beer brand.
The demand curves for BUD LIGHT, COORS LIGHT, and MILLER LITE are steeper than those for BUSCH LIGHT and NATURAL LIGHT.
According to the demand curves, BUD LIGHT, COORS LIGHT, and MILLER LITE have larger demands than BUSCH LIGHT and NATURAL LIGHT given the same level of price_per_floz \(\geq\) 0.0375.

DANL 100: Programming for Data Analytics

DANL 100 - Homework Assignment 5 - Example Answers

Byeong-Hak Choe

2023-02-14

Question 1

Q1a

Q1b

Q1c

Q1d

Q1e

Q1f

Q1g