pandas library

Overview

pandas is a Python library for data analysis. For details, see the API Reference.

To install pandas, enter pip install pandas. To see the version that is installed, print pd.__version__. To see the version of pandas, all of its dependencies, and more, enter pd.show_versions() in a Python environment.

To import the pandas library, use import pandas as pd. By convention, "pd" is the name used to refer to the library in code.

Pandas provides two primary data structures. The Series class is used to represent a series of data values and can be thought of as a single-column spreadsheet. The DataFrame class is used to represent a table of data values and can be thought of as a multi-column spreadsheet. DataFrame objects hold their data in a collection of Series objects.

While Series and DataFrame objects do not store their data in a Numpy array, a NumPy array can easily be created from them using their to_numpy methods. Series and DataFrame objects can be created from a NumPy array by invoking their constructors with the data argument set to a NumPy array.

Abbreviations

In the tables that follow, some abbreviations are used for arguments.

Loading Data

A DataFrame can be created from many sources.

To create a DataFrame from a Python list of dict objects:

dogs_df = pd.DataFrame([
{'name': 'Maisey', 'breed': 'Treeing Walker Coonhound'},
{'name': 'Oscar', 'breed': 'German Shorthaired Pointer'},
{'name': 'Ramsay', 'breed': 'Native American Indian Dog'},
{'name': 'Comet', 'breed': 'Whippet'}
])

To create a DataFrame from a CSV file

# If the first line contains column names ...
df = pd.read_csv('file-name.csv') # from comma-separated
df = pd.read_csv('file-name.csv', delimiter='\t') # for tab-separate

# If the first line contains data rather than column names ...
df = pd.read_csv('file-name.csv', names=['colName1', 'colName2', ...])

# Read only specific columns ...
df = pd.read_csv('file-name.csv', usecols=['cn1', 'cn2'])

# Specify data types for some columns ...
df = pd.read_csv('file-name.csv', dtype={'cn1': 'type1', 'cn2': 'type2', ...})
# A common type is 'category'.

To create a DataFrame from an Excel spreadsheet:

df = pd.read_excel('file-name.xlsx', sheet_name='name')

To create a DataFrame from clipboard contents, such as copying from a selected Excel or Numbers range:

df = pd.read_clipboard()

To create a DataFrame from a NumPy array:

df = pd.DataFrame(data=numpy_array)

Data Types

When pandas creates Series and DataFrame objects from Python data it converts Python string and number types to its own types.

Python typepandas type
strobject
intint64
floatfloat64

For more detail, see Overview of Pandas Data Types.

Getting Data

Note that these just return data. To print it, pass the result to the print function.

OperationCode
get column namesdf.columns
get column names as a listlist(df.columns)
get column data typesdf.dtypes
get cell valuedf.at[ri, cn] or df.iloc[ri, ci]
get all rows 1df
get first n rowsdf.head(n) where n defaults to 5
get last n rowsdf.tail(n) where n defaults to 5
get one rowdf.iloc[ri]
get row range with all columnsdf[rsi:rei] or df.iloc[rsi:rei]
get row range with a subset of columns by namedf.iloc[rsi:rei, ['cn1', 'cn2', ...]]
get row range with a subset of columns by index rangedf.iloc[rsi:rei, csi:cei]
get row range with a subset of columns by index listdf.iloc[rsi:rei, [ci1, ci2, ...]]
get row index valuesdf.index
get one column as a Series objectdf.cn or df[cn]
get one column of specified rowsdf[col_name][rsi:rei]
get multiple columns in specified orderdf[[cn1, cn2, ...]]
get rows matching criteriadf.loc[criteria]
get statistics of numeric columns
including count, min, max, mean, and std
df.describe()
display DataFrame memory usagedf.info(memory_usage='deep')

1: By default the number of rows displayed is limited. To remove the limit pd.set_option('display.max_rows', None).

Criteria use logical operators like ==, !=, <, <=, >=, and >. Criteria can also use bitwise operators for and (&), or (|), and not (~).

The example below gets all rows where the name column contains the letter "a" and the breed column consists of three words.

df.loc[df.name.str.contains('a') & (df.breed.str.split(' ').apply(len) == 3)]

In the code above, df.breed.str.split(' ') returns a Series object. The Series apply method creates a new Series that contains the results of calling the len function on each value in the original Series.

The example below gets all rows where the breed column does not contain the word "Dog".

df.loc[~df.breed.str.contains('Dog')]

The example below uses a regular expression to get rows where the breed starts with "w" or "W" (case insensitive).

import re
df.loc[df.breed.str.contains('^W', regex=True, flags=re.I)]

The example below gets all rows where the breed column value is "Vizsla" or "Weimaraner".

df.loc[df.breed.isin(['Vizsla', 'Weimaraner'])

Lookups by a specific column can be made faster by setting that column as the index. This changes the column to no longer be seen as a normal column. For example:

df.set_index('name', inplace=True)
start = time.time() # to show that lookup by index is faster
dog = df.loc['Comet']
# When the "name" column is not the index, do the lookup this way.
#dog = df[df['name'] == 'Comet']
end = time.time()
print('elapsed =', end - start)

To treat Series values as strings, add .str. For example, df.name.str.

To iterate over all rows:

for index, row in df.iterrows():
# Use index and row (a Series object) here.

Modifying Data

OperationCode
set cell value by column namedf.at[ri, cn] = value
set cell value by column indexdf.iat[ri, ci] = value
conditionally set cell valuedf.loc[criteria, cn] = value
conditionally set multiple cell valuesdf.loc[criteria, [cn1, cn2]] = [value1, value2]
add column computed from othersdf['new_cn'] = expression
add column that is sum of others 2df['my sum'] = df.iloc[:, si:ei].sum(axis=1)
delete columnsdf = df.drop(columns=['cn1', 'cn2', ...])
delete columns and ignore errors if not presentdf = df.drop(columns=['cn1', 'cn2', ...], errors='ignore')
reorder and remove columnsdf = df[['cn1', 'cn2', ...]]
rename specific columnsdf = df.rename({'old-cn1': 'new-cn1', ...}, axis='columns')
rename all columns (must specify all names)df.columns = cn1, cn2, ...
convert strings to numbers in a column 3df.cn = pd.to_numeric(df.cn, errors='coerce').fillna(0, downcast='infer')
convert strings to numbers in all columnsdf = df.apply(pd.to_numeric, errors='coerce').fillna(0, downcast='infer')

2: The first slice (colon) with no number on either side means "all rows". The second slice specifies a column range. Specifying axis=1 means horizontal and axis=0 means vertical.

3: errors='coerce' changes non-numeric strings to NaN and .fillna(0) changes NaN values to zero.

Suppose we want to change the "name" column value to be uppercase for all rows where the "breed" column value is "Whippet".

df.loc[df.breed == 'Whippet', 'name'] = df.name.str.upper()

Suppose we want to add a column named "name length" that holds the length of each "name" value.

df['name length'] = df['name'].apply(len)

Suppose we have the columns width, depth, and height that represent refrigerator dimensions and we want to add a new volume column.

df['volume'] = df.width * df.depth * df.height

Creating New DataFrames

When a new DataFrame is created from an existing one the new one shares memory with original. So changes to the underlying data affect both DataFrames.

By default new DataFrames retain the same index values at the original. To reset the index values, new_df = new_df.reset_index(drop=True) or new_df.reset_index(drop=True, inplace=True). Including drop=True removes the original index column.

OperationCode
change column datatype 4df = df['cn'].astype(type)
sort on one columndf.sort_values('cn', ascending=Bool)
sort on multiple columnsdf.sort_values(['cn1', 'cn2'], ascending=[Bool1, Bool2])
transposedf.T

4: When the values of a column can be used to segregate rows into categories and the number of distinct categories is small compared to the number of rows, set the type to 'category'. This allows the data to be stored more efficiently because those column values can be replaced internally with integers. It also supports custom sort orders such as 'small' < 'medium' < 'large'.

To sort in place, add inplace=True.

Aggregating Data

The DataFrame groupby method is used to aggregate data. It automatically sorts groups alphabetically. Statistics are often computed from the aggregated data using the methods count, sum, and mean.

For example, suppose we have a DataFrame where each row represents a dog and the columns are "name", "breed", and "weight". We can find the number of dogs from each breed and the average weight of each breed as follows.

# Group data by breed and compute the
# mean of every numeric column within each breed.
df.groupby(['breed']).mean()

# Group data by breed and compute the
# mean of only the "weight" column within each breed.
df.groupby(['breed'])['weight'].mean()

df['count'] = 0 # create new column to hold counts
# Group data by breed and compute the count of each breed.
df.groupby(['breed']).count()['count']

We can aggregate across multiple columns. For example:

players = pd.DataFrame([
{'sport': 'hockey', 'team': 'Oilers', 'name': 'Wayne Gretzky'},
{'sport': 'hockey', 'team': 'Oilers', 'name': 'Marty McSorley'},
{'sport': 'hockey', 'team': 'Oilers', 'name': 'Grant Fuhr'},
{'sport': 'hockey', 'team': 'Bruins', 'name': 'Phil Esposito'},
{'sport': 'hockey', 'team': 'Bruins', 'name': 'Bobby Orr'},
{'sport': 'baseball', 'team': 'Cardinals', 'name': 'Bob Gibson'}
])
players.groupby(['sport', 'team']).count()

The output is:

   sport      team
baseball Cardinals 1
hockey Bruins 2
Oilers 3

To get the number of players for each team in the DataFrame, use players.team.value_counts() which outputs:

Oilers       3
Bruins 2
Cardinals 1
Name: team, dtype: int64

To get only the top counts, in this case the top 2, use players.team.value_counts().nlargest(2) which outputs:

Oilers    3
Bruins 2
Name: team, dtype: int64

To get all the rows for players on the teams with the highest number of players, use players[players.team.isin(counts.nlargest(2).index)]. This gives all the rows where the team is "Oilers" or "Bruins".

Writing Data

To create a CSV file from a DataFrame:

df.to_csv(file_path) # comma-separated
df.to_csv(file_path, index=False) # to omit index column
df.to_csv(file_path, sep='\t') # tab-separated

To create an Excel file from a DataFrame:

df.to_excel(file_path)
df.to_excel(file_path, index=False) # to omit index column

Creating Other Representations

OperationCode
create NumPy arraydf.to_numpy()

Processing Large Datasets

Some datasets are too large to fit in memory. One way to work around this limitation is to process the data in chunks of rows. To demonstrate this, we will use data from the TMBD 5000 Movie Dataset. This is found at the Kaggle website which is a good source for public datasets.

For example:

# Create a new, empty DataFrame.
subset_df = pd.DataFrame()

# Process a large CSV file this many rows at a time.
row_count = 100
for df in pd.read_csv('./movies/tmdb_5000_movies.csv', chunksize=row_count):
# Only keep rows for movies with a high "vote_average".
filtered_df = df[df['vote_average'] >= 8]
subset_df = pd.concat([subset_df, filtered_df])

# Now only the retained rows are available in subset_df
# and we can operate on that.

Outstanding Questions