class: title-slide, left, bottom # Lecture 23 ---- ## **DANL 100: Programming for Data Analytics** ### Byeong-Hak Choe ### November 22, 2022 --- class: inverse, center, middle # Getting started with `pandas` --- # Getting started with `pandas` ### <p style="color:#00449E"> Indexing `DataFrame` .panelset[ .panel[.panel-name[DataFrame] - `pd.DataFrame()` creates a `DataFrame` which is a two-dimensional tabular-like structure with labeled axis (rows and columns). ```python data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"], "year": [2000, 2001, 2002, 2001, 2002, 2003], "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]} frame2 = pd.DataFrame(data) ``` - In this example the construction of the `DataFrame` is done by passing a dictionary of equal-length lists. - It is also possible to pass a dictionary of NumPy arrays. ] .panel[.panel-name[Adding a column] - We can add a new column to `DataFrame` as follows: ```python frame2["change"] = [1.2, -3.2, 0.4, -0.12, 2.4, 0.3] frame2["change"] ``` - Selecting the column of `DataFrame`, a `Series` is returned, - A attribute-like access, e.g., `frame2.change`, is also possible. - The returned `Series` has the same index as the initial `DataFrame`. ] .panel[.panel-name[Selecting columns] - The result of using a list of multiple columns is a DataFrame: ```python frame2[ ["state", "population"] ] ``` ] .panel[.panel-name[Reindexing] - `DataFrame.reindex()` creates new `DataFrame` with data conformed to a new index, while the initial `DataFrame` will not be changed: ```python frame3 = frame.reindex([0, 2, 3, 4]) frame3 ``` ] ] --- # Getting started with `pandas` ### <p style="color:#00449E"> Indexing `DataFrame` .panelset[ .panel[.panel-name[Example df] ```python data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"], "price": [69.2, 8.11, 110.92, 87.28, 87.81], "volume": [4456290, 3667975, 3669487, 1778058, 1824582]} companies = pd.DataFrame(data) companies companies[2:] ``` ] .panel[.panel-name[Reindexing & Missing values] - Index values that are not already present will be filled with `NaN` by default. - The `pd.isna()` and `pd.notna()` functions detect missing data: ```python companies3 = companies.reindex(index = [0, 2, 3, 4, 5], columns=["company", "price", "market cap"]) companies3 pd.isna(companies3) pd.notna(companies3) ``` ] .panel[.panel-name[Dropping rows] - Calling `drop` with a sequence of labels will drop values from the row labels (axis 0): ```python obj = pd.Series(np.arange(5.), index = ["a", "b", "c", "d", "e"]) obj new_obj = obj.drop("c") new_obj obj.drop(["d", "c"]) ``` ] ] --- # Getting started with `pandas` ### <p style="color:#00449E"> Dropping columns .panelset[ .panel[.panel-name[(1)] - With `DataFrame`, index values can be deleted from either axis. To illustrate this, we first create an example `DataFrame`: ```python data = pd.DataFrame(np.arange(16).reshape((4, 4)), index = ["Ohio", "Colorado", "Utah", "New York"], columns=["one", "two", "three", "four"]) data data.drop(index = ["Colorado", "Ohio"]) ``` ] .panel[.panel-name[(2)] - To drop labels from the columns, we can use the `columns` keyword: ```python data.drop(columns=["two"]) ``` ] .panel[.panel-name[(3)] - We can also drop values from the columns by passing `axis=1` or `axis="columns"`: ```python data.drop("two", axis=1) data.drop(["two", "four"], axis="columns") ``` ] .panel[.panel-name[del] - `del DataFrame[column]` deletes column from `DataFrame`. ```python del data["two"] data ``` ] ] --- # Getting started with `pandas` ### <p style="color:#00449E"> Indexing, selecting and filtering .panelset[ .panel[.panel-name[example df] - Indexing of DataFrames works like indexing an `np.array`. - We can use the default index values: ```python data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"], "price": [69.2, 8.11, 110.92, 87.28, 87.81], "volume": [4456290, 3667975, 3669487, 1778058, 1824582]} companies = pd.DataFrame(data) companies companies[2:] ``` ] .panel[.panel-name[example df2] - We can also use a manually set index. ```python companies2 = pd.DataFrame(data, index = ["a", "b", "c", "d", "e"]) companies2 companies2["b":"d"] ``` - When slicing with labels, the *end element is inclusive*. ] .panel[.panel-name[.loc() and .iloc()] - `DataFrame.loc()` selects a subset of rows and columns from a DataFrame using **axis labels**. - `DataFrame.iloc()` selects a subset of rows and columns from a DataFrame using **integers**. ```python companies2.loc[ "c", ["company", "price"] ] companies2.iloc[ 2, [0, 1] ] companies2.loc[ ["c", "d", "e"], ["volume", "price", "company"] ] companies2.iloc[ 2:, : :-1 ] ``` ] .panel[.panel-name[summary] - `df[val]` selects single column or set of columns; - `df.loc[val]` selects single row or set of rows; - `df.loc[:, val]` selects single column or set of columns; - `df.loc[val1, val2]` selects row and column by label; - `df.iloc[where]` selects row or set of rows by integer position; - `df.iloc[:, where]` selects column or set of columns by integer position; - `df.iloc[w1, w2]` Select row and column by integer position. ] ] --- # Getting started with `pandas` ### <p style="color:#00449E"> Boolean Indexing - Boolean indexing of DataFrames works like boolean indexing an `np.array`. - `DataFrame[ DataFrame['VARIABLE_NAME'] > VALUE ]` - `DataFrame[ DataFrame['VARIABLE_NAME'] == VALUE ]` - `DataFrame[ DataFrame['VARIABLE_NAME'] < VALUE ]` ```python data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"], "price": [69.2, 8.11, 110.92, 87.28, 87.81], "volume": [4456290, 3667975, 3669487, 1778058, 1824582]} companies = pd.DataFrame(data) companies_daimler = companies[ companies['company'] == "Daimler" ] ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Import/Export data `pd.read_csv("PATH_NAME_OF_*.csv")` reads the csv file into `DataFrame`. - `header=None` does not read the top row of the csv file as column names. - We can set column names with `names`, for example, `names=["a", "b", "c", "d", "e"]`. - `DataFrame.head()` and `DataFrame.tail()` prints the first and last five rows on the Console, respectively. ```python nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv") # `GRP`: audience size; `PE`: audience engagement. nbc_show.head() # showing the first five rows nbc_show.tail() # showing the last five rows ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Export data `DataFrame.to_csv("filename")` writes `DataFrame` to the csv file. - `index = False` and `header=False` do not write row index and column names in the csv file. - We can set column names with `header`, for example, `header=["a", "b", "c", "d", "e"]`. ```python nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE") ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Summarizing `DataFrame` - `DataFrame.count()` returns a Series containing the number of non-missing values for each column. - `DataFrame.sum()` returns a Series containing the sum of values for each column. - `DataFrame.mean()` returns a Series containing the mean of values for each column. - Passing `axis="columns"` or `axis=1` sums across the columns instead: ```python nbc_count = nbc_show.sum() nbc_sum = nbc_show.sum() nbc_sum_c = nbc_show.sum( axis="columns" ) nbc_mean = nbc_show.mean() ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Grouping `DataFrame` - `DataFrame.groupby(col1, col2)` groups `DataFrame` by columns (grouping by one or more than two columns is also possible!). - Adding the functions `count()`, `sum()`, `mean()` to `groupby()` returns the sum or the mean of the grouped columns. ```python nbc_genre_count = nbc_show.groupby(["Genre"]).count() nbc_genre_sum = nbc_show.groupby(["Genre"]).sum() nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean() ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Sorting `DataFrame` - `DataFrame.sort_index()` sorts DataFrame by index on either axis. - `DataFrame.sort_index(axis="columns")` sorts DataFrame by column index. - `DataFrame.sort_index(ascending=False)` sorts DataFrame by either index in descending order. ```python nbc_show.sort_index() nbc_show.sort_index(ascending = False) nbc_show.sort_index(axis = "columns") ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Sorting `DataFrame` - `DataFrame.sort_value("SOME_VARIABLE")` sorts DataFrame by values of SOME_VARIABLE. - For `Series.sort_values()`, we do not need to provide `"SOME_VARIABLE"` in the `sort_values()` function. - `DataFrame.sort_values("SOME_VARIABLE", ascdening = False)` sorts DataFrame by values of SOME_VARIABLE in descending order. ```python nbc_show.sort_values("GRP") nbc_show.sort_values("GRP", ascending = False) obj = pd.Series([4, np.nan, 7, np.nan, -3, 2]) obj.sort_values() ``` --- # Getting started with `pandas` ### <p style="color:#00449E"> Class Exercise Use the `nbc_show_na.csv` file to answer the following questions: 1. Find the top show in terms of the value of `PE` for each Genre. 2. Find the top show in terms of the value of `GRP` for each Network. 3. Which genre does have the largest `GRP` on average? --- # Workflow ### <p style="color:#00449E"> Installing Python modules - Anaconda Spyder also allow for installation on Spyder. .panelset[ .panel[.panel-name[Windows] - Step 1. Type the following command on Spyder Python script: - Step 2. Run the command. ```python conda install seaborn ``` or ```python pip install seaborn ``` ] .panel[.panel-name[Mac] - Step 1. Type the following command on Spyder Python script: - Step 2. Run the command. ```python conda install seaborn ``` or ```python pip install seaborn ``` ] ] --- # Workflow ### <p style="color:#00449E"> Installing Python modules - Let's install the Python visualization library `seaborn` on Anaconda Spyder. .panelset[ .panel[.panel-name[Windows] - Step 1. Open "Anaconda Prompt". - Step 2. Type the following: ```python conda install seaborn ``` or ```python pip install seaborn ``` ] .panel[.panel-name[Mac] - Step 1. Open "Terminal". - Step 2. Type the following: ```python conda install seaborn ``` or ```python pip install seaborn ``` ] ] --- class: inverse, center, middle # Data Visualization with `seaborn` --- # Data Visualization .pull-left[ <img src="../lec_figs/lego.png" width="67%" style="display: block; margin: auto;" /> ] .pull-right[ - Graphs and charts let us explore and learn about the structure of the information we have in DataFrame. - Good data visualizations make it easier to communicate our ideas and findings to other people. ] --- # Exploratory Data Analysis (EDA) - We use visualization and summary statistics (e.g., mean, median, minimum, maximum) to explore our data in a systematic way. - EDA is an iterative cycle. We: - Generate questions about our data. - Search for answers by visualizing, transforming, and modelling our data. - Use what we learn to refine our questions and/or generate new questions. --- # `seaborn` <img src="../lec_figs/seaborn-logo.png" width="20%" style="display: block; margin: auto;" /> - `seaborn` is a Python data visualization library based on `matplotlib`. - It allows us to easily create beautiful but complex graphics using a simple interface. - It also provides a general improvement in the default appearance of `matplotlib`-produced plots, and so I recommend using it by default. ```python import seaborn as sns ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Types of plots - We will consider the following types of visualization: - Bar chart - Histogram - Scatter plot - Scatter plot with Fitted line - Line chart --- # Getting started with `pandas` ### <p style="color:#00449E"> What is *tidy* `DataFrame`? </p> - There are three rules which make a dataset tidy: 1. Each **variable** has its own *column*. 2. Each **observation** has its own *row*. 3. Each **value** has its own *cell*. <img src="../lec_figs/tidy-1.png" width="75%" style="display: block; margin: auto;" /> --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Getting started with `seaborn` - Let's get the names of `DataFrame`s provided by the `seaborn` library: ```python import seaborn as sns print( sns.get_dataset_names() ) ``` - Let's use the `titanic` and `tips` DataFrames: ```python df_titanic = sns.load_dataset('titanic') df_titanic.head() df_tips = sns.load_dataset('tips') df_tips.head() ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Bar Chart - A bar chart is used to plot the frequency of the different categories. - It is useful to visualize how values of a **categorical variable** are distributed. - A variable is **categorical** if it can only take one of a small set of values. - We use `sns.countplot()` function to plot a bar chart: .pull-left[ ```python sns.countplot(x = 'sex', data = df_titanic) ``` ] .pull-right[ - Mapping - `data`: DataFrame. - `x`: Name of a categorical variable (column) in DataFrame ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Bar Chart - We can further break up the bars in the bar chart based on another categorical variable. - This is useful to visualize the relationship between the two categorical variables. .pull-left[ ```python sns.countplot(x = 'sex', hue = 'survived', data = df_titanic) ``` ] .pull-right[ - Mapping - `hue`: Name of a categorical variable ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Histogram - A histogram is a **continuous** version of a bar chart. - It is used to plot the frequency of the different values. - It is useful to visualize how values of a **continuous variable** are distributed. - A variable is **continuous** if it can take any of an infinite set of ordered values. - We use `sns.displot()` function to plot a histogram: .pull-left[ ```python sns.displot(x = 'age', bins = 5 , data = df_titanic) ``` ] .pull-right[ - Mapping - `bins`: Number of bins ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Scatter plot - A scatter plot is used to display the relationship between two continuous variables. - We can see co-variation as a pattern in the scattered points. - We use `sns.scatterplot()` function to plot a scatter plot: .pull-left[ ```python sns.scatterplot(x = 'total_bill', y = 'tip', data = df_tips) ``` ] .pull-right[ - Mapping - `x`: Name of a continuous variable on the horizontal axis - `y`: Name of a continuous variable on the vertical axis ] --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Scatter plot - To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`. - Suppose we are interested in the following question: - **Q**. Does a smoker and a non-smoker have a difference in tipping behavior? ```python sns.scatterplot(x = 'total_bill', y = 'tip', hue = 'smoker', data = df) ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Fitted line - From the scatter plot, it is often difficult to clearly see the relationship between two continuous variables. - `sns.lmplot()` adds a line that fits well into the scattered points. - On average, the fitted line describes the relationship between two continuous variables. ```python sns.lmplot(x = 'total_bill', y = 'tip', data = df_tips) ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Scatter plot - To the scatter plot, we can add a `hue`-`VARIABLE` mapping to display how the relationship between two continuous variables varies by `VARIABLE`. - Using the fitted lines, let's answer the following question: - **Q**. Does a smoker and a non-smoker have a difference in tipping behavior? ```python sns.scatterplot(x = 'total_bill', y = 'tip', hue = 'smoker', data = df_tips) ``` --- # Data Visualization with `seaborn` ### <p style="color:#00449E"> Line cahrt - A line chart is used to display the trend in a continuous variable or the change in a continuous variable over other variable. - It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur. - We use `sns.lineplot()` function to plot a line plot: .pull-left[ ```python path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv' dow = pd.read_csv(path_csv, index_col=0, parse_dates=True) sns.lineplot(x = 'Date', y = 'Close', data = dow) ``` ] .pull-right[ - Mapping - `x`: Name of a continuous variable (often time variable) on the horizontal axis - `y`: Name of a continuous variable on the vertical axis ]