Lecture 23

class: title-slide, left, bottom

# Lecture 23
----
## **DANL 100: Programming for Data Analytics**
### Byeong-Hak Choe
### November 22, 2022

---
class: inverse, center, middle

# Getting started with `pandas`

---
# Getting started with `pandas`
### Indexing `DataFrame`

.panelset[
.panel[.panel-name[DataFrame]

- `pd.DataFrame()` creates a `DataFrame` which is a two-dimensional tabular-like structure with labeled axis (rows and columns).

```python
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data)
```

- In this example the construction of the `DataFrame` is done by passing a dictionary of equal-length lists.

- It is also possible to pass a dictionary of NumPy arrays.

]

.panel[.panel-name[Adding a column]

- We can add a new column to `DataFrame` as follows:

```python
frame2["change"] = [1.2, -3.2, 0.4, -0.12, 2.4, 0.3]
frame2["change"]
```

- Selecting the column of `DataFrame`, a `Series` is returned,
- A attribute-like access, e.g., `frame2.change`, is also possible.
- The returned `Series` has the same index as the initial `DataFrame`.

]

.panel[.panel-name[Selecting columns]

- The result of using a list of multiple columns is a DataFrame:

```python
frame2[ ["state", "population"] ]
```

]

.panel[.panel-name[Reindexing]

- `DataFrame.reindex()` creates new `DataFrame` with data conformed to a new index, while the initial `DataFrame` will not be changed:

```python
frame3 = frame.reindex([0, 2, 3, 4])
frame3
```

]

---
# Getting started with `pandas`
### Indexing `DataFrame`

.panelset[

.panel[.panel-name[Example df]

```python
data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}

companies = pd.DataFrame(data)
companies
companies[2:]
```

]

.panel[.panel-name[Reindexing & Missing values]

- Index values that are not already present will be filled with `NaN` by
default.

- The `pd.isna()` and `pd.notna()` functions detect missing data:

```python
companies3 = companies.reindex(index = [0, 2, 3, 4, 5], 
                               columns=["company", "price", "market cap"])
companies3
pd.isna(companies3)
pd.notna(companies3)
```

]

.panel[.panel-name[Dropping rows]

- Calling `drop` with a sequence of labels will drop values from the row labels (axis 0):

```python
obj = pd.Series(np.arange(5.), 
                index = ["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])
```

]

---
# Getting started with `pandas`
### Dropping columns

.panelset[

.panel[.panel-name[(1)]

- With `DataFrame`, index values can be deleted from either axis. To illustrate this, we first create an example `DataFrame`:

```python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index = ["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

data
data.drop(index = ["Colorado", "Ohio"])
```

]

.panel[.panel-name[(2)]

- To drop labels from the columns, we can use the `columns` keyword:

```python
data.drop(columns=["two"])
```
]

.panel[.panel-name[(3)]
- We can also drop values from the columns by passing `axis=1` or `axis="columns"`:

```python
data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")
```
]

.panel[.panel-name[del]
- `del DataFrame[column]` deletes column from `DataFrame`.

```python
del data["two"]
data
```

]

---
# Getting started with `pandas`
### Indexing, selecting and filtering

.panelset[

.panel[.panel-name[example df]
- Indexing of DataFrames works like indexing an `np.array`.
  - We can use the default index values:

```python
data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}

companies = pd.DataFrame(data)
companies
companies[2:]
```

]

.panel[.panel-name[example df2]

- We can also use a manually set index.

```python
companies2 = pd.DataFrame(data, index = ["a", "b", "c", "d", "e"])
companies2
companies2["b":"d"]
```

- When slicing with labels, the *end element is inclusive*.
]

.panel[.panel-name[.loc() and .iloc()]

- `DataFrame.loc()` selects a subset of rows and columns from a DataFrame using **axis labels**.

- `DataFrame.iloc()` selects a subset of rows and columns from a
DataFrame using **integers**.

```python
companies2.loc[ "c", ["company", "price"] ]
companies2.iloc[ 2, [0, 1] ]

companies2.loc[ ["c", "d", "e"], ["volume", "price", "company"] ]
companies2.iloc[ 2:, : :-1 ]
```

]

.panel[.panel-name[summary]

- `df[val]` selects single column or set of columns;
  
  
  - `df.loc[val]` selects single row or set of rows;
  - `df.loc[:, val]` selects single column or set of columns;
  - `df.loc[val1, val2]` selects row and column by label;
  
  
  - `df.iloc[where]` selects row or set of rows by integer position;
  - `df.iloc[:, where]` selects column or set of columns by integer position;
  - `df.iloc[w1, w2]` Select row and column by integer position.

]

---
# Getting started with `pandas`
### Boolean Indexing

- Boolean indexing of DataFrames works like boolean indexing an `np.array`.
 - `DataFrame[ DataFrame['VARIABLE_NAME'] > VALUE ]`
 - `DataFrame[ DataFrame['VARIABLE_NAME'] == VALUE ]`
 - `DataFrame[ DataFrame['VARIABLE_NAME'] < VALUE ]`

```python
data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}

companies = pd.DataFrame(data)
companies_daimler = companies[ companies['company'] ==  "Daimler" ]
```

---
# Getting started with `pandas`
### Import/Export data
`pd.read_csv("PATH_NAME_OF_*.csv")` reads the csv file into `DataFrame`.
 - `header=None` does not read the top row of the csv file as column names.
 - We can set column names with `names`, for example, `names=["a", "b", "c", "d", "e"]`.
 
- `DataFrame.head()` and `DataFrame.tail()` prints the first and last five rows on the Console, respectively.

```python
nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv")
# `GRP`: audience size; `PE`: audience engagement.
nbc_show.head()   # showing the first five rows
nbc_show.tail()   # showing the last five rows
```

---
# Getting started with `pandas`
### Export data
`DataFrame.to_csv("filename")` writes `DataFrame` to the csv file.
 - `index = False` and `header=False` do not write row index and column names in the csv file.
 - We can set column names with `header`, for example, `header=["a", "b", "c", "d", "e"]`.

```python
nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE")
```

---
# Getting started with `pandas`
### Summarizing `DataFrame`

- `DataFrame.count()` returns a Series containing the number of non-missing values for each column.
- `DataFrame.sum()` returns a Series containing the sum of values for each column.
- `DataFrame.mean()` returns a Series containing the mean of values for each column.
  - Passing `axis="columns"` or `axis=1` sums across the columns instead:

```python
nbc_count = nbc_show.sum()
nbc_sum = nbc_show.sum()
nbc_sum_c = nbc_show.sum( axis="columns" )
nbc_mean = nbc_show.mean()
```

---
# Getting started with `pandas`
### Grouping `DataFrame`

- `DataFrame.groupby(col1, col2)` groups `DataFrame` by columns (grouping by one or more than two columns is also possible!).

- Adding the functions `count()`, `sum()`, `mean()` to `groupby()` returns the sum or the mean of the grouped columns.

```python
nbc_genre_count = nbc_show.groupby(["Genre"]).count()
nbc_genre_sum = nbc_show.groupby(["Genre"]).sum()
nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean()
```

---
# Getting started with `pandas`
### Sorting `DataFrame`

- `DataFrame.sort_index()` sorts DataFrame by index on either axis.

- `DataFrame.sort_index(axis="columns")` sorts DataFrame by column index.
  
  - `DataFrame.sort_index(ascending=False)` sorts DataFrame by either index in descending order.

```python
nbc_show.sort_index()
nbc_show.sort_index(ascending = False)
nbc_show.sort_index(axis = "columns")
```

---
# Getting started with `pandas`
### Sorting `DataFrame`

- `DataFrame.sort_value("SOME_VARIABLE")` sorts DataFrame by values of SOME_VARIABLE.

- For `Series.sort_values()`, we do not need to provide `"SOME_VARIABLE"` in the `sort_values()` function.

- `DataFrame.sort_values("SOME_VARIABLE", ascdening = False)` sorts DataFrame by values of SOME_VARIABLE in descending order.

```python
nbc_show.sort_values("GRP")
nbc_show.sort_values("GRP", ascending = False)

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
```

---
# Getting started with `pandas`
### Class Exercise

Use the `nbc_show_na.csv` file to answer the following questions:

1. Find the top show in terms of the value of `PE` for each Genre.

2. Find the top show in terms of the value of `GRP` for each Network.

3. Which genre does have the largest `GRP` on average?

---
# Workflow
### Installing Python modules

- Anaconda Spyder also allow for installation on Spyder.

.panelset[

.panel[.panel-name[Windows]
- Step 1. Type the following command on Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

.panel[.panel-name[Mac]
- Step 1. Type the following command on Spyder Python script:
- Step 2. Run the command.

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

]

---
# Workflow
### Installing Python modules

- Let's install the Python visualization library `seaborn` on Anaconda Spyder.

.panelset[

.panel[.panel-name[Windows]
- Step 1. Open "Anaconda Prompt".
- Step 2. Type the following:

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

.panel[.panel-name[Mac]
- Step 1. Open "Terminal".
- Step 2. Type the following:

```python
conda install seaborn
```
or

```python
pip install seaborn
```
]

]

---
class: inverse, center, middle

# Data Visualization with `seaborn`

---
# Data Visualization

.pull-left[

]
.pull-right[
- Graphs and charts let us explore and learn about the structure of the information we have in DataFrame.

- Good data visualizations make it easier to communicate our ideas and findings to other people.

]

---
# Exploratory Data Analysis (EDA)

- We use visualization and summary statistics (e.g., mean, median, minimum, maximum) to explore our data in a systematic way.

- EDA is an iterative cycle. We:

- Generate questions about our data.

- Search for answers by visualizing, transforming, and modelling our data.

- Use what we learn to refine our questions and/or generate new questions.

---
#  `seaborn`

- `seaborn` is a Python data visualization library based on `matplotlib`. 
  - It allows us to easily create beautiful but complex graphics using a simple interface.
  - It also provides a general improvement in the default appearance of `matplotlib`-produced plots, and so I recommend using it by default.

```python
import seaborn as sns
```

---
# Data Visualization with `seaborn`
### Types of plots

- We will consider the following types of visualization:

- Bar chart

- Histogram

- Scatter plot
  
  - Scatter plot with Fitted line

- Line chart

---
# Getting started with `pandas`
### What is *tidy* `DataFrame`?

- There are three rules which make a dataset tidy:

1. Each **variable** has its own *column*.
  2. Each **observation** has its own *row*.
  3. Each **value** has its own *cell*.

---
# Data Visualization with `seaborn`
### Getting started with `seaborn`

- Let's get the names of `DataFrame`s provided by the `seaborn` library:

```python
import seaborn as sns
print( sns.get_dataset_names() )
```

- Let's use the `titanic` and `tips` DataFrames:

```python
df_titanic = sns.load_dataset('titanic')
df_titanic.head()
df_tips = sns.load_dataset('tips')
df_tips.head()
```

---
# Data Visualization with `seaborn`
### Bar Chart

- A bar chart is used to plot the frequency of the different categories.
  - It is useful to visualize how values of a **categorical variable** are distributed.
  - A variable is **categorical** if it can only take one of a small set of values.
  
  
- We use `sns.countplot()` function to plot a bar chart:

.pull-left[

```python
sns.countplot(x =  'sex', 
              data = df_titanic)
```
]

.pull-right[

- Mapping
  - `data`: DataFrame.
  - `x`:  Name of a categorical variable (column) in DataFrame

]

---
# Data Visualization with `seaborn`
### Bar Chart

- We can further break up the bars in the bar chart based on another categorical variable.

- This is useful to visualize the relationship between the two categorical variables.

.pull-left[

```python
sns.countplot(x = 'sex', 
              hue = 'survived', 
              data = df_titanic)
```
]

.pull-right[

- Mapping
  - `hue`:  Name of a categorical variable

]

---
# Data Visualization with `seaborn`
### Histogram

- A histogram is a **continuous** version of a bar chart.
  - It is used to plot the frequency of the different values.
  - It is useful to visualize how values of a **continuous variable** are distributed.
  - A variable is **continuous** if it can take any of an infinite set of ordered values.
  
  
- We use `sns.displot()` function to plot a histogram:
.pull-left[

```python
sns.displot(x =  'age', 
            bins = 5 ,
            data = df_titanic)
```
]

.pull-right[
- Mapping
  - `bins`:  Number of bins

]

---
# Data Visualization with `seaborn`
### Scatter plot

- A scatter plot is used to display the relationship between two continuous variables.

-  We can see co-variation as a pattern in the scattered points.

- We use `sns.scatterplot()` function to plot a scatter plot:

.pull-left[

```python
sns.scatterplot(x = 'total_bill', 
                y = 'tip',
                data = df_tips)
```
]

.pull-right[
- Mapping
  - `x`:  Name of a continuous variable on the horizontal axis
  - `y`:  Name of a continuous variable on the vertical axis
]

---
# Data Visualization with `seaborn`
### Scatter plot

- To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.

- Suppose we are interested in the following question:
  - **Q**. Does a smoker and a non-smoker have a difference in tipping behavior?

```python
sns.scatterplot(x = 'total_bill', 
                y = 'tip',
                hue = 'smoker',
                data = df)
```

---
# Data Visualization with `seaborn`
### Fitted line

- From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables.

- `sns.lmplot()` adds a line that fits well into the scattered points.
  
  - On average, the fitted line describes the relationship between two continuous variables.

```python
sns.lmplot(x = 'total_bill', 
           y = 'tip',
           data = df_tips)
```

---
# Data Visualization with `seaborn`
### Scatter plot

- To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`.

- Using the fitted lines, let's answer the following question:
  - **Q**. Does a smoker and a non-smoker have a difference in tipping behavior?

```python
sns.scatterplot(x = 'total_bill', 
                y = 'tip',
                hue = 'smoker',
                data = df_tips)
```

---
# Data Visualization with `seaborn`
### Line cahrt

- A line chart is used to display the trend in a continuous variable or the change in a continuous variable over other variable.
  - It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.
- We use `sns.lineplot()` function to plot a line plot:
.pull-left[

```python
path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv'
dow = pd.read_csv(path_csv, index_col=0, parse_dates=True)
sns.lineplot(x =  'Date', 
             y =  'Close', 
             data = dow)
```
]

.pull-right[
- Mapping
  - `x`:  Name of a continuous variable (often time variable) on the horizontal axis 
  - `y`:  Name of a continuous variable on the vertical axis
]