Lecture 21
DANL 100: Programming for Data Analytics
Byeong-Hak Choe
November 15, 2022

Getting started with pandas

`pandas`

pandas is a Python library including the following features:
- Data manipulation and analysis,
- DataFrame objects and Series,
- Export and import data from files and web,
- Handling of missing data.
pandas provides high-performance data structures and data analysis tools.

import pandas as pd
import numpy as np   # used by the np.array() examples that follow

Create `Series`

pd.Series() creates a one-dimensional array-like object that holds values together with an index.

obj = pd.Series([4, 7, -5, 3])
obj

Simple Series formed only from a list.
An index is added automatically.

NumPy arrays can only be indexed by integer position, while a Series can also be indexed by manually set labels.

obj2 = pd.Series([2, -5, 9, 4], index=["a", "b", "c", "d"])
npobj = np.array([2, -5, 9, 4])
obj2
obj2["b"]
npobj[1]

Series.values returns the values of a Series.
Series.index returns the index of a Series.

obj.values
obj.index
obj2.index

The values and the index of a Series can be printed separately.
The default index, if none was explicitly specified, is a RangeIndex.
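
As a minimal sketch of the two attributes above, the snippet below rebuilds the list-based Series and inspects its values and its default index. The variable names `values` and `index` are only for illustration.

```python
import pandas as pd

# A Series built from a list alone gets a default RangeIndex.
obj = pd.Series([4, 7, -5, 3])

values = obj.values          # NumPy array holding the data
index = obj.index            # the automatically created index

print(type(index).__name__)  # RangeIndex
print(list(index))           # [0, 1, 2, 3]
print(values.tolist())       # [4, 7, -5, 3]
```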

pandas Series can be created from NumPy arrays.

npobj = np.array([2, -5, 9, 4])
obj2 = pd.Series(npobj, 
                 index=["a", "b", 
                        "c", "d"])
obj2

obj2.index
obj2["a"]
obj2["d"] = 6
obj2[ ["c", "a", "d"] ]

Here ["c", "a", "d"] is interpreted as a list of index labels.

Using NumPy functions or NumPy-like operations will preserve the index-value link.
Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)

"b" in obj2
"e" in obj2

pandas Series can be created from dictionaries as well.
- The index of the resulting Series consists of the dict’s keys.
- The index can be set manually when passing a dictionary to pd.Series().

dictdata = {"Rochester": 210_606, 
            "Buffalo": 276_807,
            "Syracuse": 146_103}
obj3 = pd.Series(dictdata)
obj3

cities = ["Niagara", "Buffalo", 
          "Syracuse"]
obj4 = pd.Series(dictdata, 
                index=cities)
obj4

NaN (not a number) marks missing values where the index and the dict do not match.

`Series` properties

Series.name returns the name of the Series.
Series.index.name returns the name of the Series's index.

obj4.name = "population"
obj4.index.name = "cities"
obj4

Assigning to the name attribute changes the name of the existing Series.
By default, neither the Series nor its index has a name.

`pd.Series` vs. `np.array`

NumPy arrays are accessed by their integer positions.

Series can be accessed by user-defined index labels, including strings and numbers.

Different Series can be aligned efficiently by the index.

Series can work with missing values, so operations do not automatically fail.
- The isna and notna functions are used to detect missing data:
```
pd.isna(obj4)
pd.notna(obj4)
```
```
obj4.isna()
obj4.notna()
```
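
To illustrate alignment by index, here is a small sketch that adds the two city Series from the earlier slide. The name `total` is introduced only for this example.

```python
import pandas as pd

dictdata = {"Rochester": 210_606, "Buffalo": 276_807, "Syracuse": 146_103}
obj3 = pd.Series(dictdata)
obj4 = pd.Series(dictdata, index=["Niagara", "Buffalo", "Syracuse"])

# Addition matches values by index label, not by position.
total = obj3 + obj4

print(total["Buffalo"])      # 553614.0
print(total.isna().sum())    # 2 (Niagara and Rochester have no match)
print(sorted(total.index))   # all four city labels
```

Labels that appear in only one Series, or whose value is already missing, come out as NaN instead of raising an error.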

`pd.DataFrame`

DataFrame is the primary structure of pandas.
DataFrame represents a table of data with an ordered collection of columns.
Each column can have a different data type.
DataFrame can be thought of as a dictionary of Series sharing the same index.

Create `DataFrame`

pd.DataFrame() creates a DataFrame, a two-dimensional tabular structure with labeled axes (rows and columns).

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In this example the construction of the DataFrame is done by passing a dictionary of equal-length lists.
It is also possible to pass a dictionary of NumPy arrays.

If we pass a column name that is not contained in the dict, that column is filled with NaN:

frame2 = pd.DataFrame(data, columns=["state", "year",
                                     "population", "income"])
frame2

The default index will be assigned automatically as with Series.

If we specify a sequence of columns, the DataFrame's columns will be arranged in that order:

frame2 = pd.DataFrame(data, columns=["year", "state",
                                     "population"])
frame2

We can pass the following types of objects to pd.DataFrame():
- 2D NumPy arrays
- Dict of lists, tuples, dicts, arrays, or Series
- List of lists, tuples, dicts, or Series
- Another DataFrame
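
As one example of the other accepted inputs, the sketch below builds a DataFrame from a list of dicts. The `records` data is made up for illustration.

```python
import pandas as pd

# Each dict is one row; keys become column names.
records = [
    {"state": "Ohio", "year": 2000, "population": 1.5},
    {"state": "Nevada", "year": 2001},          # no "population" key
]
frame_from_records = pd.DataFrame(records)

print(list(frame_from_records.columns))   # ['state', 'year', 'population']
print(frame_from_records.shape)           # (2, 3)
print(frame_from_records["population"].isna().tolist())  # [False, True]
```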

Indexing `DataFrame`

We can add a new column to DataFrame as follows:

frame2["change"] = [1.2, -3.2, 0.4, -0.12, 2.4, 0.3]
frame2["change"]

Selecting a single column of a DataFrame returns a Series.
Attribute-like access, e.g., frame2.change, is also possible.
The returned Series has the same index as the original DataFrame.

The result of using a list of multiple columns is a DataFrame:

frame2[ ["state", "population"] ]

We can label what the index and the columns represent with index.name and columns.name, respectively:

frame2.index.name = "number:"
frame2.columns.name = "variable:"
frame2

In DataFrames, there is no default name for the index or the columns.

DataFrame.reindex() creates a new DataFrame with the data conformed to a new index; the original DataFrame is not changed:

frame3 = frame.reindex([0, 2, 3, 4])
frame3

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
        "price": [69.2, 8.11, 110.92, 87.28, 87.81],
        "volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)
companies
companies[2:]

Index labels and column names that are not already present are filled with NaN by default.
The pd.isna() and pd.notna() functions detect missing data:

companies3 = companies.reindex(index=[0, 2, 3, 4, 5], 
                               columns=["company", "price", "market cap"])
companies3
pd.isna(companies3)
pd.notna(companies3)

Calling drop() with a label or a sequence of labels returns a new object with those entries removed from the row labels (axis 0):

obj = pd.Series(np.arange(5.), 
                index=["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])
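
A short sketch, reusing the Series above, to confirm that drop() returns a new object and leaves the original untouched:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
new_obj = obj.drop(["d", "c"])

print(list(new_obj.index))  # ['a', 'b', 'e']
print(list(obj.index))      # ['a', 'b', 'c', 'd', 'e'] (unchanged)
```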

Dropping columns

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data
data.drop(index=["Colorado", "Ohio"])

To drop labels from the columns, we can use the columns keyword:

data.drop(columns=["two"])

We can also drop values from the columns by passing axis=1 or axis="columns":

data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

del DataFrame[column] deletes the column from the DataFrame in place.

del data["two"]
data

Indexing, selecting and filtering

Slicing the rows of a DataFrame works like slicing an np.array.
- We can slice by position with the default index:

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
        "price": [69.2, 8.11, 110.92, 87.28, 87.81],
        "volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)
companies
companies[2:]

We can also use a manually set index.

companies2 = pd.DataFrame(data, index=["a", "b", "c", "d", "e"])
companies2
companies2["b":"d"]

When slicing with labels, the end element is inclusive.

DataFrame.loc[] selects a subset of rows and columns from a DataFrame using axis labels.
DataFrame.iloc[] selects a subset of rows and columns from a DataFrame using integer positions.
Both are indexers used with square brackets, not functions called with parentheses.

companies2.loc[ "c", ["company", "price"] ]
companies2.iloc[ 2, [0, 1] ]
companies2.loc[ ["c", "d", "e"], ["volume", "price", "company"] ]
companies2.iloc[ 2:, ::-1 ]

df[val] selects single column or set of columns;

df.loc[val] selects single row or set of rows;
df.loc[:, val] selects single column or set of columns;
df.loc[val1, val2] selects row and column by label;

df.iloc[where] selects row or set of rows by integer position;
df.iloc[:, where] selects column or set of columns by integer position;
df.iloc[w1, w2] selects row and column by integer position.
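
The sketch below contrasts label-based and position-based slicing on the companies2 DataFrame and adds a boolean row mask, which .loc also accepts. The names `by_label`, `by_position`, and `expensive` are illustrative only.

```python
import pandas as pd

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
        "price": [69.2, 8.11, 110.92, 87.28, 87.81],
        "volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies2 = pd.DataFrame(data, index=["a", "b", "c", "d", "e"])

by_label = companies2.loc["b":"d", "price"]   # label slice: "d" is included
by_position = companies2.iloc[1:3, 1]         # integer slice: position 3 is excluded

# .loc also accepts a boolean mask for the rows.
expensive = companies2.loc[companies2["price"] > 87, "company"]

print(list(by_label.index))     # ['b', 'c', 'd']
print(list(by_position.index))  # ['b', 'c']
print(expensive.tolist())       # ['Siemens', 'BASF', 'BMW']
```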

Operations between `DataFrame`s and `Series`

Here the Series is generated from the third row of the DataFrame (iloc[2], Siemens):

companies3 = companies[["price", "volume"]]
companies3.index = ["Daimler", "E.ON", "Siemens", "BASF", "BMW"]
series = companies3.iloc[2]
companies3
series

By default, arithmetic operations between DataFrames and Series match the index of the Series on the DataFrame's columns:

companies3 + series

DataFrame.add() with axis=0 instead matches the index of the Series on the DataFrame's row index and adds the Series to each column.

series2 = companies3["price"]
companies3.add(series2, axis=0)

Arithmetic between two DataFrames aligns on both the row index and the columns; labels that appear in only one of them produce NaN:

df1 = pd.DataFrame( np.arange(9.).reshape((3, 3)),
                    columns=list("bcd"),
                    index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame( np.arange(12.).reshape((4, 3)),
                    columns=list("bde"),
                    index=["Utah", "Ohio", "Texas", "Oregon"])
df1
df2
df1 + df2
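
As a hedged illustration of what df1 + df2 produces, and of one way to soften the NaN results, the sketch below compares plain addition with DataFrame.add() and its fill_value argument. The names `plain` and `filled` are for this example only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])

plain = df1 + df2                    # NaN wherever a label is missing on one side
filled = df1.add(df2, fill_value=0)  # a value missing on one side counts as 0

print(plain.shape)                   # (5, 4)
print(plain.loc["Ohio", "b"])        # 3.0  (0.0 + 3.0)
print(plain.isna().sum().sum())      # 16
print(filled.loc["Utah", "b"])       # 0.0  (only df2 has this cell)
```

Cells that are missing in both DataFrames, such as ("Colorado", "e"), stay NaN even with fill_value.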

DataFrame.T transposes DataFrame.

companies3.T

NumPy functions on `DataFrame`

DataFrame.apply(np.function, axis) applies a NumPy function along an axis of the DataFrame (by default, to each column).

companies3.apply(np.mean)
companies3.apply(np.sqrt)
companies3.apply(np.sqrt)[ :2]
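
To show the effect of the axis argument, here is a small sketch that applies one function per column and another per row. The lambda functions and the names `col_range` and `traded_value` are my own illustration, not part of the lecture code.

```python
import pandas as pd

companies3 = pd.DataFrame(
    {"price": [69.2, 8.11, 110.92, 87.28, 87.81],
     "volume": [4456290, 3667975, 3669487, 1778058, 1824582]},
    index=["Daimler", "E.ON", "Siemens", "BASF", "BMW"])

# Default axis: the function receives each column as a Series.
col_range = companies3.apply(lambda col: col.max() - col.min())

# axis="columns": the function receives each row as a Series.
traded_value = companies3.apply(lambda row: row["price"] * row["volume"],
                                axis="columns")

print(list(col_range.index))     # ['price', 'volume']
print(list(traded_value.index))  # the five company names
```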

Import/Export data

pd.read_csv("PATH_NAME_OF_*.csv") reads the csv file into a DataFrame.

header=None tells read_csv() not to treat the top row of the csv file as column names.
We can set column names with names, for example, names=["a", "b", "c", "d", "e"].

DataFrame.head() and DataFrame.tail() return the first and last five rows, respectively, which are then displayed on the Console.

nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv")
# `GRP`: audience size; `PE`: audience engagement.
nbc_show.head()   # showing the first five rows
nbc_show.tail()   # showing the last five rows
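
The header and names options can be tried without any file on disk. The sketch below reads a tiny made-up csv from an in-memory string; `raw` and `df_no_header` are hypothetical names.

```python
import io
import pandas as pd

# A small in-memory csv without a header row (made-up values).
raw = io.StringIO("1,2,3\n4,5,6\n")

df_no_header = pd.read_csv(raw, header=None, names=["a", "b", "c"])

print(list(df_no_header.columns))  # ['a', 'b', 'c']
print(df_no_header.shape)          # (2, 3)
print(df_no_header.head(1))        # first row only
```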

Export data

DataFrame.to_csv("filename") writes the DataFrame to a csv file.

index=False and header=False omit the row index and the column names from the csv file.
We can set column names with header, for example, header=["a", "b", "c", "d", "e"].

nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE")
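
To see what index=False changes, the following sketch writes a small made-up DataFrame to csv text instead of a file (to_csv() returns a string when no path is given) and reads it back.

```python
import io
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

# With no path, to_csv() returns the csv text as a string.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

print(with_index.splitlines()[0])     # ,state,year
print(without_index.splitlines()[0])  # state,year

# Reading the text back recovers the original columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))       # ['state', 'year']
print(round_trip["year"].tolist())    # [2000, 2001]
```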

Summarizing `DataFrame`

DataFrame.count() returns a Series containing the number of non-missing values for each column.
DataFrame.sum() returns a Series containing the sum of values for each column.
DataFrame.mean() returns a Series containing the mean of values for each column.
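- Passing axis="columns" or axis=1 computes across the columns instead, giving one value per row:

nbc_count = nbc_show.count()
# numeric_only=True skips text columns such as Genre
nbc_sum = nbc_show.sum(numeric_only=True)
nbc_sum_c = nbc_show.sum(axis="columns", numeric_only=True)
nbc_mean = nbc_show.mean(numeric_only=True)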
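
The sketch below shows how these summaries treat missing values, using a small hypothetical table. The `shows` data is made up, and the real nbc_show columns and values may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table; the real nbc_show columns may differ.
shows = pd.DataFrame({"Genre": ["Drama", "Drama", "Reality"],
                      "GRP": [100.0, np.nan, 50.0],
                      "PE": [80.0, 60.0, np.nan]})

print(shows.count()["GRP"])                    # 2 (NaN is not counted)
print(shows.sum(numeric_only=True)["GRP"])     # 150.0 (NaN is skipped)
print(shows.mean(numeric_only=True)["PE"])     # 70.0
row_sum = shows.sum(axis="columns", numeric_only=True)
print(row_sum.tolist())                        # [180.0, 60.0, 50.0]
```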

Grouping `DataFrame`

DataFrame.groupby([col1, col2]) groups the DataFrame by the listed columns (grouping by a single column or by more than two columns is also possible).
- Chaining count(), sum(), or mean() after groupby() returns the count, the sum, or the mean of the remaining columns within each group.

nbc_genre_count = nbc_show.groupby(["Genre"]).count()
# numeric_only=True skips text columns that cannot be summed or averaged
nbc_genre_sum = nbc_show.groupby(["Genre"]).sum(numeric_only=True)
nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean(numeric_only=True)
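
A minimal grouping sketch on made-up data, so the result can be checked by hand. The `shows` table is hypothetical and is not the real nbc_show data.

```python
import pandas as pd

# Hypothetical data; the real nbc_show columns and values may differ.
shows = pd.DataFrame({"Network": ["NBC", "NBC", "ABC", "ABC"],
                      "Genre": ["Drama", "Reality", "Drama", "Drama"],
                      "GRP": [100.0, 50.0, 30.0, 80.0]})

genre_mean = shows.groupby("Genre")["GRP"].mean()
network_genre_sum = shows.groupby(["Network", "Genre"])["GRP"].sum()

print(genre_mean["Drama"])                       # 70.0
print(network_genre_sum.loc[("ABC", "Drama")])   # 110.0
print(len(network_genre_sum))                    # 3 groups
```

Grouping by two columns yields a result indexed by (Network, Genre) pairs, which is why a tuple is used to look up one group.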

Sorting `DataFrame`

DataFrame.sort_index() sorts DataFrame by index on either axis.
- DataFrame.sort_index(axis="columns") sorts DataFrame by column index.
- DataFrame.sort_index(ascending=False) sorts DataFrame by the index in descending order.

nbc_show.sort_index()
nbc_show.sort_index(ascending = False)
nbc_show.sort_index(axis = "columns")
# The method is sort_values(); for a DataFrame it needs the column(s) to sort by.
nbc_show.sort_values("PE")
nbc_show.sort_values("PE", ascending = False)
nbc_show.sort_values(["Genre", "GRP"])

DataFrame.sort_values("SOME_VARIABLE") sorts DataFrame by values of SOME_VARIABLE.
- For Series.sort_values(), we do not need to provide "SOME_VARIABLE" in the sort_values() function.
DataFrame.sort_values("SOME_VARIABLE", ascending = False) sorts DataFrame by values of SOME_VARIABLE in descending order.

nbc_show.sort_values("GRP")
nbc_show.sort_values("GRP", ascending = False)
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
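
The Series above contains missing values, so here is a short sketch of where sort_values() places them. The na_position argument is an addition beyond what the slide shows.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

ascending = obj.sort_values()
descending = obj.sort_values(ascending=False)
nan_first = obj.sort_values(na_position="first")

print(ascending.dropna().tolist())     # [-3.0, 2.0, 4.0, 7.0]
print(descending.dropna().tolist())    # [7.0, 4.0, 2.0, -3.0]
print(ascending.isna().tolist()[-2:])  # [True, True]  (NaN sorted to the end)
print(nan_first.isna().tolist()[:2])   # [True, True]
```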

Class Exercise

Use the nbc_show_na.csv file to answer the following questions:

Find the top show in terms of the value of PE for each Genre.
Find the top show in terms of the value of GRP for each Network.
Which genre has the largest GRP on average?

Data Visualization with seaborn

`seaborn`

seaborn is a Python data visualization library based on matplotlib.
- It allows us to easily create beautiful but complex graphics using a simple interface.
- It also provides a general improvement in the default appearance of matplotlib-produced plots, and so I recommend using it by default.

import seaborn as sns

Exploratory Data Analysis (EDA)

We use visualization and summary statistics (e.g., mean, standard deviation, minimum, maximum, median) to explore our data in a systematic way.
EDA is an iterative cycle. We:
- Generate questions about our data.
- Search for answers by visualizing, transforming, and modelling our data.
- Use what we learn to refine our questions and/or generate new questions.

Types of plots

We will consider the following types of visualization:
- Bar chart
- Histogram
- Scatter plot
- Line chart

What is tidy `DataFrame`?

There are three rules which make a dataset tidy:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
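
As a rough illustration of the three rules, the sketch below reshapes a small made-up table in which the variable "year" is spread across column names. Using DataFrame.melt() here is my own choice of tool and is not covered in the lecture.

```python
import pandas as pd

# Hypothetical wide table: one column per year, so "year" is not its own column.
wide = pd.DataFrame({"state": ["Ohio", "Nevada"],
                     "2001": [1.7, 2.4],
                     "2002": [3.6, 2.9]})

# melt() stacks the year columns into (year, population) pairs.
tidy = wide.melt(id_vars="state", var_name="year", value_name="population")

print(list(tidy.columns))  # ['state', 'year', 'population']
print(tidy.shape)          # (4, 3)
```

After reshaping, each row is one observation (a state in a year) and each variable has its own column.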

Getting started with `seaborn`

Let's get the names of the example datasets provided by the seaborn library:

import seaborn as sns
print( sns.get_dataset_names() )

Let's use the titanic DataFrame:

df = sns.load_dataset('titanic')
df.head()

Bar Chart

A bar chart is used to plot the frequency of the different categories.
- It is useful to visualize how values of a categorical variable are distributed.
- A variable is categorical if it can only take one of a small set of values.

We use the sns.countplot() function to plot a bar chart:

sns.countplot(x = 'sex', 
              data = df)

Mapping
- data: DataFrame.
- x: Name of a categorical variable (column) in DataFrame
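
The bar heights in a count-based bar chart are category frequencies, which can be checked in pandas without drawing anything. The sketch below uses a made-up stand-in for the 'sex' column rather than the real titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the 'sex' column of the titanic data.
df_small = pd.DataFrame({"sex": ["male", "female", "male", "male", "female"]})

# value_counts() gives the frequency of each category.
counts = df_small["sex"].value_counts()

print(counts["male"])    # 3
print(counts["female"])  # 2
```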

We can further break up the bars in the bar chart based on another categorical variable.
- This is useful to visualize the relationship between the two categorical variables.

sns.countplot(x='sex', 
              hue = 'survived', 
              data = df)

Mapping
- hue: Name of a categorical variable

Histogram

A histogram is a continuous version of a bar chart.
- It is used to plot the frequency of the different values.
- It is useful to visualize how values of a continuous variable are distributed.
- A variable is continuous if it can take any of an infinite set of ordered values.

We use the sns.displot() function to plot a histogram:

sns.displot(x = 'age', 
            bins = 5,
            data = df)

Mapping
- bins: Number of bins
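
To make the idea of bins concrete, the sketch below counts made-up ages in 5 equal-width bins with np.histogram(). This only approximates what a histogram with bins = 5 displays, and the `ages` values are hypothetical.

```python
import numpy as np

ages = np.array([2, 15, 22, 24, 31, 38, 45, 52, 67, 80])  # made-up ages

# Split the range [min, max] into 5 equal-width bins and count values per bin.
counts, edges = np.histogram(ages, bins=5)

print(len(edges))       # 6 edges delimit 5 bins
print(counts.tolist())  # [2, 3, 2, 1, 2]
```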

Scatter plot

A scatter plot is used to display the relationship between two continuous variables.
- We can see co-variation as a pattern in the scattered points.

We use the sns.scatterplot() function to plot a scatter plot:

df = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', 
                y ='tip',
                data = df)

Mapping
- x: Name of a continuous variable on the horizontal axis
- y: Name of a continuous variable on the vertical axis

Line chart

A line chart is used to display the trend in a continuous variable or the change in a continuous variable over another variable.
- It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.

We use the sns.lineplot() function to plot a line chart:

path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv'
dow = pd.read_csv(path_csv, index_col=0, parse_dates=True)
sns.lineplot(x = 'Date', 
             y = 'Close', 
             data = dow)

Mapping
- x: Name of a continuous variable (often time variable) on the horizontal axis
- y: Name of a continuous variable on the vertical axis

35 / 35

Lecture 21DANL 100: Programming for Data AnalyticsByeong-Hak ChoeNovember 15, 20221 / 35

Getting started with pandas2 / 35

`pandas`

pandas is a Python library including the following features:
- Data manipulation and analysis,
- DataFrame objects and Series,
- Export and import data from files and web,
- Handling of missing data.
pandas provides high-performance data structures and data analysis tools.

import pandas as pd

3 / 35

Getting started with `pandas`

Create `Series`

pd.Series() creates one-dimensional array-like object including values and an index.

obj = pd.Series([4, 7, -5, 3])
obj

Simple Series formed only from a list.
An index is added automatically.

4 / 35

Getting started with `pandas`

Create `Series`

NumPy arrays can only be indexed by integers, while Series can be indexed by the manually set index.

obj2 = pd.Series([2, -5, 9, 4], index=["a", "b", "c", "d"])
npobj = np.array([2, -5, 9, 4])
obj2
obj2["b"]
npobj[1]

5 / 35

Getting started with `pandas`

Create `Series`

Series.values returns the values of a Series.
Series.index returns the index of a Series.

obj.values
obj.index
obj2.index

The values and the index of a Series can be printed separately.
The default index, if none was explicitly specified, is a RangeIndex.

6 / 35

Getting started with `pandas`

Create `Series`

pandas Series can be created from NumPy arrays.

npobj = np.array([2, -5, 9, 4])
obj2 = pd.Series(npobj, 
                 index=["a", "b", 
                        "c", "d"])
obj2

obj2.index
obj2["a"]
obj2["d"] = 6
obj2[ ["c", "a", "d"] ]

Here ["c", "a", "d"] is interpreted as a list of indices.

Using NumPy functions or NumPy-like operations will preserve the index-value link.
Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

obj2[obj2 > 0]
obj2 * 2
np.exp(obj2)

"b" in obj2
"e" in obj2

7 / 35

Getting started with `pandas`

Create `Series`

pandas Series can be created from dictionaries as well.
- The index of the resulting Series consists of the dict’s keys.
- The index can be set manually when passing a dictionary to a Series

dictdata = {"Rochester": 210_606, 
            "Buffalo": 276_807,
            "Syracuse": 146_103}
obj3 = pd.Series(dictdata)
obj3

cities = ["Niagara", "Buffalo", 
          "Syracuse"]
obj4 = pd.Series(dictdata, 
                index=cities)
obj4

NaN (not a number) marks missing values where the index and the dict do not match.

8 / 35

Getting started with `pandas`

`Series` properties

Series.name returns name of the Series.
Series.index.name returns name of the Series's index.

obj4.name = "population"
obj4.index.name = "cities"
obj4

The attribute name will change the name of the existing Series.
There is no default name of the Series or the index.

9 / 35

Getting started with `pandas`

`pd.Series` vs. `np.array`

NumPy arrays are accessed by their integer positions.

Series can be accessed by a user defined index, including letters and numbers.

Different Series can be aligned efficiently by the index.

Series can work with missing values, so operations do not automatically fail.
- The isna and notna functions are used to detect missing data:
```
pd.isna(obj4)
pd.notna(obj4)
```
```
obj4.isna()
obj4.notna()
```

10 / 35

Getting started with `pandas`

`pd.DataFrame`

DataFrame is the primary structure of pandas.
DataFrame represents a table of data with an ordered collection of columns.
Each column can have a different data type.
DataFrame can be thought of as a dictionary of Series sharing the same index.

11 / 35

Getting started with `pandas`

Create `DataFrame`

pd.DataFrame() creates a DataFrame which is a two-dimensional tabular-like structure with labeled axis (rows and columns).

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In this example the construction of the DataFrame is done by passing a dictionary of equal-length lists.
It is also possible to pass a dictionary of NumPy arrays.

Passing a column that is not contained in the dict, it will be marked with NaN:

frame2 = pd.DataFrame(data, columns=["state", "year",
"population", "income"])
frame2

The default index will be assigned automatically as with Series.

If we specify a sequence of columns, the DataFrame's columns will be arranged in that order:

frame2 = pd.DataFrame(data, columns=["year", "state",
                                     "population"])
frame2

We can pass the following types of objects to pd.DataFrame():
- 2D NumPy arrays
- Dict of lists, tuples, dicts, arrays, or Series
- List of lists, tuples, dicts, or Series
- Another DataFrame

12 / 35

Getting started with `pandas`

Indexing `DataFrame`

We can add a new column to DataFrame as follows:

frame2["change"] = [1.2, -3.2, 0.4, -0.12, 2.4, 0.3]
frame2["change"]

Selecting the column of DataFrame, a Series is returned,
A attribute-like access, e.g., frame2.change, is also possible.
The returned Series has the same index as the initial DataFrame.

The result of using a list of multiple columns is a DataFrame:

frame2[ ["state", "population"] ]

We can name what the index and the columns are representing by using index.name and columns.name respectively:

frame2.index.name = "number:"
frame2.columns.name = "variable:"
frame2

In DataFrames, there is no default name for the index or the columns.

DataFrame.reindex() creates new DataFrame with data conformed to a new index, while the initial DataFrame will not be changed:

frame3 = frame.reindex([0, 2, 3, 4])
frame3

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)
companies
companies[2:]

Index values that are not already present will be filled with NaN by default.
The pd.isna() and pd.notna() functions detect missing data:

companies3 = companies.reindex(index=[0, 2, 3, 4, 5], 
                               columns=["company", "price", "market cap"])
companies3
pd.isna(companies3)
pd.notna(companies3)

Calling drop with a sequence of labels will drop values from the row labels (axis 0):

obj = pd.Series(np.arange(5.), 
                index=["a", "b", "c", "d", "e"])
obj
new_obj = obj.drop("c")
new_obj
obj.drop(["d", "c"])

13 / 35

Getting started with `pandas`

Dropping columns

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data
data.drop(index=["Colorado", "Ohio"])

To drop labels from the columns, we can use the columns keyword:

data.drop(columns=["two"])

We can also drop values from the columns by passing axis=1 or axis="columns":

data.drop("two", axis=1)
data.drop(["two", "four"], axis="columns")

del DataFrame[column] deletes column from DataFrame.

del data["two"]
data

14 / 35

Getting started with `pandas`

Indexing, selecting and filtering

Indexing of DataFrames works like indexing an np.array.
- We can use the default index values:

data = {"company": ["Daimler", "E.ON", "Siemens", "BASF", "BMW"],
"price": [69.2, 8.11, 110.92, 87.28, 87.81],
"volume": [4456290, 3667975, 3669487, 1778058, 1824582]}
companies = pd.DataFrame(data)
companies
companies[2:]

We can also use a manually set index.

companies2 = pd.DataFrame(data, index=["a", "b", "c", "d", "e"])
companies2
companies2["b":"d"]

When slicing with labels, the end element is inclusive.

DataFrame.loc() selects a subset of rows and columns from a DataFrame using axis labels.
DataFrame.iloc() selects a subset of rows and columns from a DataFrame using integers.

companies2.loc[ "c", ["company", "price"] ]
companies2.iloc[ 2, [0, 1] ]
companies2.loc[ ["c", "d", "e"], ["volume", "price", "company"] ]
companies2.iloc[ 2:, : :-1 ]

df[val] selects single column or set of columns;

df.loc[val] selects single row or set of rows;
df.loc[:, val] selects single column or set of columns;
df.loc[val1, val2] selects row and column by label;

df.iloc[where] selects row or set of rows by integer position;
df.iloc[:, where] selects column or set of columns by integer position;
df.iloc[w1, w2] Select row and column by integer position.

15 / 35

Getting started with `pandas`

Operations between `DataFrame`s and `Series`

Here the series is generated from the first row of the DataFrame:

companies3 = companies[["price", "volume"]]
companies3.index = ["Daimler", "E.ON", "Siemens", "BASF", "BMW"]
series = companies3.iloc[2]
companies3
series

By default, arithmetic operations between DataFrames and Series match the index of the Series on the DataFrame's columns:

companies3 + series

DataFrame.add() does addition along a column matching the DataFrame's row index (axis=0).

series2 = companies3["price"]
companies3.add(series2, axis=0)

Here are the example DataFrames to work with arithmetic operations:

df1 = pd.DataFrame( np.arange(9.).reshape((3, 3)),
                    columns=list("bcd"),
                    index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame( np.arange(12.).reshape((4, 3)),
                    columns=list("bde"),
                    index=["Utah", "Ohio", "Texas", "Oregon"])
df1
df2
df1 + df2

DataFrame.T transposes DataFrame.

companies3.T

16 / 35

Getting started with `pandas`

NumPy functions on `DataFrame`

DataFrame.apply(np.function, axis) applies a NumPy function on the DataFrame axis.

companies3.apply(np.mean)
companies3.apply(np.sqrt)
companies3.apply(np.sqrt)[ :2]

17 / 35

Getting started with `pandas`

Import/Export data

pd.read_csv("PATH_NAME_OF_*.csv") reads the csv file into DataFrame.

header=None does not read the top row of the csv file as column names.
We can set column names with names, for example, names=["a", "b", "c", "d", "e"].

DataFrame.head() and DataFrame.tail() prints the first and last five rows on the Console, respectively.

nbc_show = pd.read_csv("https://bcdanl.github.io/data/nbc_show_na.csv")
# `GRP`: audience size; `PE`: audience engagement.
nbc_show.head()   # showing the first five rows
nbc_show.tail()   # showing the last five rows

18 / 35

Getting started with `pandas`

Export data

DataFrame.to_csv("filename") writes DataFrame to the csv file.

index=False and header=False do not write row index and column names in the csv file.
We can set column names with header, for example, header=["a", "b", "c", "d", "e"].

nbc_show.to_csv("PATH_NAME_OF_THE_csv_FILE")

19 / 35

Getting started with `pandas`

Summarizing `DataFrame`

DataFrame.count() returns a Series containing the number of non-missing values for each column.
DataFrame.sum() returns a Series containing the sum of values for each column.
DataFrame.mean() returns a Series containing the mean of values for each column.
- Passing axis="columns" or axis=1 sums across the columns instead:

nbc_count = nbc_show.sum()
nbc_sum = nbc_show.sum()
nbc_sum_c = nbc_show.sum( axis="columns" )
nbc_mean = nbc_show.mean()

20 / 35

Getting started with `pandas`

Grouping `DataFrame`

DataFrame.groupby(col1, col2) groups DataFrame by columns (grouping by one or more than two columns is also possible!).
- Adding the functions count(), sum(), mean() to groupby() returns the sum or the mean of the grouped columns.

nbc_genre_count = nbc_show.groupby(["Genre"]).count()
nbc_genre_sum = nbc_show.groupby(["Genre"]).sum()
nbc_network_genre_mean = nbc_show.groupby(["Network", "Genre"]).mean()

21 / 35

Getting started with `pandas`

Sorting `DataFrame`

DataFrame.sort_index() sorts DataFrame by index on either axis.
- DataFrame.sort_index(axis="columns") sorts DataFrame by column index.
- DataFrame.sort_index(ascending=False) sorts DataFrame by either index in descending order.

nbc_show.sort_index()
nbc_show.sort_index(ascending = False)
nbc_show.sort_index(axis = "columns")
nbc_show.sort_value()
nbc_show.sort_value(ascending = False)
nbc_show.sort_value(axis = "columns")

22 / 35

Getting started with `pandas`

Sorting `DataFrame`

DataFrame.sort_value("SOME_VARIABLE") sorts DataFrame by values of SOME_VARIABLE.
- For Series.sort_value(), we do not need to provide "SOME_VARIABLE" in the sort_value() function.
DataFrame.sort_value("SOME_VARIABLE", ascdening = False) sorts DataFrame by values of SOME_VARIABLE in descending order.

nbc_show.sort_value("GRP")
nbc_show.sort_value("GRP", ascending = False)
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

23 / 35

Getting started with `pandas`

Class Exercise

Use the nbc_show_na.csv file to answer the following questions:

Find the top show in terms of the value of PE for each Genre.
Find the top show in terms of the value of GRP for each Network.
Which genre does have the largest GRP on average?

24 / 35

Data Visualization with seaborn25 / 35

`seaborn`

seaborn is a Python data visualization library based on matplotlib.
- It allows us to easily create beautiful but complex graphics using a simple interface.
- It also provides a general improvement in the default appearance of matplotlib-produced plots, and so I recommend using it by default.

import seaborn as sns

26 / 35

Exploratory Data Analysis (EDA)

We use visualization and summary statistics (e.g., mean, standard deviation, minimum, maximum, median) to explore our data in a systematic way.
EDA is an iterative cycle. We:
- Generate questions about our data.
- Search for answers by visualizing, transforming, and modelling our data.
- Use what we learn to refine our questions and/or generate new questions.

27 / 35

Data Visualization with `seaborn`

Types of plots

We will consider the following types of visualization:
- Bar chart
- Histogram
- Scatter plot
- Line chart

28 / 35

Getting started with `pandas`

What is tidy `DataFrame`?

There are three rules which make a dataset tidy:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.

29 / 35

Data Visualization with `seaborn`

Getting started with `seaborn`

Let's get the names of DataFrames provided by the seaborn library:

import seaborn as sns
print( sns.get_dataset_names() )

Let's us the titanic DataFrame:

df = sns.load_dataset('titanic')
df.head()

30 / 35

Data Visualization with `seaborn`

Bar Chart

A bar chart is used to plot the frequency of the different categories.
- It is useful to visualize how values of a categorical variable are distributed.
- A variable is categorical if it can only take one of a small set of values.

We use sns.countplot() function to plot a bar chart:

sns.countplot(x = 'sex', 
              data = df)

Mapping
- data: DataFrame.
- x: Name of a categorical variable (column) in DataFrame

31 / 35

Data Visualization with `seaborn`

Bar Chart

We can further break up the bars in the bar chart based on another categorical variable.
- This is useful to visualize the relationship between two categorical variables.

sns.countplot(x='sex', 
              hue = 'survived', 
              data = df)

Mapping
- hue: Name of a categorical variable
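The counts behind a bar chart split by a second categorical variable form a two-way frequency table. The sketch below builds one with `pd.crosstab()`; the `passengers` DataFrame and its values are hypothetical and much smaller than the titanic data.

```python
import pandas as pd

# Hypothetical data: two categorical variables per row
passengers = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female", "female"],
    "survived": [0, 1, 0, 1, 1, 0],
})

# Rows are the x categories, columns are the hue categories,
# and each cell is the number of rows with that combination
table = pd.crosstab(passengers["sex"], passengers["survived"])
print(table)
```

Each cell of `table` corresponds to one bar in the grouped bar chart.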

32 / 35

Data Visualization with `seaborn`

Histogram

A histogram is a continuous version of a bar chart.
- It is used to plot the frequency of the different values.
- It is useful to visualize how values of a continuous variable are distributed.
- A variable is continuous if it can take any of an infinite set of ordered values.

We use the sns.displot() function to plot a histogram:

sns.displot(x = 'age', 
            bins = 5,
            data = df)

Mapping
- bins: Number of bins
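To see what a bin count of 5 means numerically, the sketch below applies equal-width binning with `np.histogram()`, which is how NumPy interprets an integer `bins` argument. The `ages` values are made up; a missing value is included because it has to be removed before binning here.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([2, 15, 22, 25, None, 31, 38, 40, 47, 63, 80])

# np.histogram cannot determine a range when NaN is present, so drop it first
valid = ages.dropna()

# Split the range [min, max] into 5 equal-width bins and count values per bin
counts, edges = np.histogram(valid, bins=5)
print(counts)  # number of values in each bin
print(edges)   # 6 edges delimit 5 bins
```

Changing `bins` changes the bin width, and with it the apparent shape of the distribution, so it is worth trying more than one value.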

33 / 35

Data Visualization with `seaborn`

Scatter plot

A scatter plot is used to display the relationship between two continuous variables.
- We can see co-variation as a pattern in the scattered points.

We use the sns.scatterplot() function to plot a scatter plot:

df = sns.load_dataset('tips')
sns.scatterplot(x='total_bill', 
                y ='tip',
                data = df)

Mapping
- x: Name of a continuous variable on the horizontal axis
- y: Name of a continuous variable on the vertical axis
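The co-variation visible in a scatter plot can also be summarized with a single number. As a minimal sketch, the snippet below computes the Pearson correlation with `Series.corr()`; the `bills` DataFrame is hypothetical and only borrows the column names from the tips example.

```python
import pandas as pd

# Hypothetical data, for illustration only
bills = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.0, 6.5],
})

# Pearson correlation: near +1 when the points rise together along a line,
# near -1 when one falls as the other rises, near 0 for no linear pattern
r = bills["total_bill"].corr(bills["tip"])
print(round(r, 3))
```

A correlation only captures linear patterns, so it complements the scatter plot rather than replacing it.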

34 / 35

Data Visualization with `seaborn`

Line chart

A line chart is used to display the trend in a continuous variable or the change in a continuous variable over another variable.
- It draws a line by connecting the scattered points in order of the variable on the x-axis, so that it highlights exactly when changes occur.

We use the sns.lineplot() function to plot a line chart:

path_csv = '/Users/byeong-hakchoe/Google Drive/suny-geneseo/teaching-materials/lecture-data/dji.csv'
dow = pd.read_csv(path_csv, index_col=0, parse_dates=True)
sns.lineplot(x = 'Date', 
             y = 'Close', 
             data = dow)

Mapping
- x: Name of a continuous variable (often time variable) on the horizontal axis
- y: Name of a continuous variable on the vertical axis
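The sketch below shows what `index_col=0` and `parse_dates=True` do when reading a file like the one above, using a small in-memory CSV instead of a file on disk. The prices are made up, and the `Change` column is a name introduced here for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content (made-up prices), deliberately out of date order
csv_text = """Date,Close
2022-11-03,102.0
2022-11-01,100.0
2022-11-02,101.5
"""

# index_col=0 makes the first column (Date) the index;
# parse_dates=True converts that index from strings to datetimes
prices = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Put the rows in chronological order, as a line chart connects them
prices = prices.sort_index()

# Change from the previous row; the first row has no predecessor, so it is NaN
prices["Change"] = prices["Close"].diff()
print(prices)
```

With a datetime index, the horizontal axis of a line chart is spaced by actual dates rather than by row position.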

35 / 35
