Incomplete sheet

This sheet is incomplete and could use some attention. Please submit code snippet suggestions as an issue or PR here.

Pandas DataFrame

Opinionated words of caution:

  • Data frames force a frustrating distinction between the "data" and index columns, despite this being completely meaningless in typical exploratory data analysis applications where what constitutes an index is highly context-dependent. Also, pandas is very slow regardless, so all effort on making effective use of index columns is a waste of time.
  • Don't bother with index columns unless you want to pollute your code with many reset_index() calls, or risk surprising results.
  • On the other hand, you cannot ignore index columns, as you'll run into vague index errors whether you like it or not (e.g., assigning groupby results, shuffling rows; see the sketch after this list).
  • MultiIndex versus multiple indices. Just don't. If you cared about speed you would not be using pandas anyway.
  • Many conflicting dtypes, inconsistencies in API functions for handling them, and ad-hoc mixing with numpy functions.
  • Complex queries turn into a series of successive variable updates, which is hard to read and almost guaranteed to breed bugs during later refactoring.
  • Inconsistent APIs between data frames, columns, groupby output, and index columns.
  • Because of all the complexities above, writing good and clean pandas code is very hard, as is evident by the large volume of unreadable/anti-pattern-riddled pandas code online (e.g., stackoverflow).
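
A minimal sketch of the index-alignment surprise mentioned above (column names are made up): assigning a groupby aggregate back to a frame aligns on the index, not on row position.

import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# The aggregate is indexed by group label ('a', 'b'), not by row number (0, 1, 2),
# so this assignment silently fills the column with NaN instead of raising:
data['total'] = data.groupby('group')['value'].sum()

# transform() keeps the original row index, so this version does what you expect:
data['total'] = data.groupby('group')['value'].transform('sum')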

Code

import pandas as pd

Create

Action Code Details
Empty
pd.DataFrame()
Single column from list
pd.DataFrame([1, 2, 3])
From column lists
pd.DataFrame({'A': [1, 2], 'fruits': ['banana', 'apple']})
From dictionary with column list entries
d = {'Name': ['John', 'Sue'], 'Age': [40, 35]}
pd.DataFrame(d)
From Numpy 2D array
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pd.DataFrame(arr, columns=['a', 'b', 'c'])
From JSON string
import json
json_dict = json.loads(json_string)
pd.DataFrame(json_dict)
From JSON file
pd.read_json(path)

Test

Action Code Details
Is data frame or subclass
isinstance(data, pd.DataFrame)
Is data frame and not subclass
type(data) is pd.DataFrame
Is empty
data.empty
Does column col contain NA value(s)
data[col].isna().values.any()

Test for presence of columns

Action Code Details
Has column col
col in data
Has columns cols
pd.Index(cols).isin(data.columns).all()
Has columns cols
set(cols).issubset(data)
Only contains columns cols
set(data) == set(cols)
Has any of these columns cols
data.columns.isin(cols).any()
Has any of these columns cols
not set(cols).isdisjoint(data)
Does not have these columns cols
set(cols).isdisjoint(data)
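
For instance, a hedged sketch of a guard clause built from the set-based checks above (the required column names are made up):

required = {'subject', 'date'}
if not required.issubset(data):
    raise ValueError(f'missing columns: {required - set(data)}')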

Tests for column types

Action Code Details
Is column col boolean type
pd.api.types.is_bool_dtype(data[col])
Is column col categorical type
isinstance(data[col].dtype, pd.CategoricalDtype)
Is column col string type
pd.api.types.is_string_dtype(data[col])
Is column col numeric type
pd.api.types.is_numeric_dtype(data[col])
Is column col integer type
pd.api.types.is_integer_dtype(data[col])
Is column col datetime type
pd.api.types.is_datetime64_dtype(data[col])
Is column col datetime(nanosecond) type
import numpy as np
data.dtypes[col] == np.dtype('datetime64[ns]')

Tests for values

Action Code Details
Contains NA
data.isna().any().any()
Contains no NA
data.notna().all().all()
Does column col contain any NA values
data[col].hasnans
Does column col not contain NA values
not data[col].hasnans
Do none of the columns cols contain NA values
data[cols].notna().all().all()
Contains any infinity
import numpy as np
data.isin([np.inf, -np.inf]).values.any()
Does column col contain any infinite values
import numpy as np
data['col'].isin([np.inf, -np.inf]).any()
Does column col contain any infinite values
import numpy as np
np.isinf(data['col']).any()

Tests for duplicates

Action Code Details
No duplicated rows (all unique)
~data.duplicated().any()
Some rows are duplicated
data.duplicated().any()
Some groups in column groupCol have duplicate elements in column valueCol
data.duplicated(subset=['groupCol', 'valueCol']).any()
No groups in column groupCol have duplicated values for column valueCol
~data.duplicated(subset=['groupCol', 'valueCol']).any()
No columns are duplicated by values
~data.T.duplicated().any()
Compares via transpose; can be slow for large frames
Some columns are duplicated by values
data.T.duplicated().any()

Assertions

Action Code Details
Assert frames are equal
pd.testing.assert_frame_equal(x, y)
Assert frames are equal, ignoring column order and row order
pd.testing.assert_frame_equal(x, y, check_like=True)
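
A minimal sketch of using these assertions in a test (the frames and test function are made up):

import pandas as pd

def test_frames_match():
    x = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
    y = x[['b', 'a']]  # same content, different column order
    # check_like=True ignores row/column order, so this passes:
    pd.testing.assert_frame_equal(x, y, check_like=True)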

Extract

Snippets which return non-DataFrame output (e.g., scalar, series)

Action Code Details
Number of rows
len(data)
Number of rows
data.shape[0]
Extract row count by group group
data.groupby('group').size().reset_index(name='count')
Returns a data frame with group columns and a respective count column
Row names
data.index
Get column named name (as series)
data['name']
Column names
list(data)
Number of columns
len(data.columns)
Number of columns
data.shape[1]
Number of cells
data.size
Dimensions (as tuple)
data.shape

Extract row masks

Action Code Details
Mask for missing values (None, NA, NaN)
data.isna()
Mask for non-missing values
data.notna()
data.notnull() is an alias of notna() and gives the same result
Mask for duplicate rows
data.duplicated()
Mask for duplicates across columns in list cols
data.duplicated(subset=cols)

Query

Action Code Details
Cell at row i, col j
data.at[i, j]
First row (Series)
data.iloc[0]
Row by row number i
data.iloc[i]
Row by index i
data.loc[i]
String query
data.query('col > 0')

Update

All operations are in-place. Transformed columns can also be assigned to new frames, but doing so is generally unsafe because it relies on the index columns matching (see the sketch below).
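
A minimal sketch of the index-matching hazard (frames are made up): column assignment between frames aligns on index labels, not on row positions.

import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})                      # index 0, 1, 2
b = pd.DataFrame({'y': [10, 20, 30]}, index=[2, 3, 4])

a['y'] = b['y']             # aligns on index: rows 0 and 1 become NaN
a['y'] = b['y'].to_numpy()  # positional: copies the values in order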

Action Code Details
Set cell at row i, col j to value v
data.at[i, j] = v
Replace missing values with v
data.fillna(v, inplace=True)
Replace specific values
data.replace(old, new, inplace=True)

Transform

Transform a single column

Action Code Details
Rename column
data.rename(columns={'old': 'new'}, inplace=True)
Set datetime column date to standard unit (ns)
data['date'] = pd.to_datetime(data['date'])
Needed, for example, to avoid data corruption when saving to HDF5 store
Sample a column col per grouping in group (ignore index)
data[col] = data.groupby(group)[col].transform(
    lambda x: x.sample(frac=1).to_list()
)
to_list() is essential here: it forces pandas to ignore the index, otherwise the sampled values realign to their original rows and the sampling has no effect (demonstrated after this table)
Apply function to a list column myCol (as flattened series)
flat_data = data.explode('myCol')
flat_data.index.name = '_index'
flat_data['myCol'] = f(flat_data['myCol'])  # f: your function applied to the flat series
data['myCol'] = flat_data['myCol'].groupby('_index').agg(list)
Warning: empty lists are converted to NaN. Setting the index name is required because pandas cannot group by a nameless index.
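
Two hedged sketches of the alignment behavior noted above (the data is made up). First the sampling no-op, then the explode round trip for list columns:

import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a'], 'col': [1, 2, 3]})

# No effect: the sampled series keeps its index, so values realign to their rows.
data['col'] = data.groupby('group')['col'].transform(lambda x: x.sample(frac=1))

# Shuffles for real: to_list() drops the index, making the result positional.
data['col'] = data.groupby('group')['col'].transform(lambda x: x.sample(frac=1).to_list())

# Explode round trip: double every element of each list in a list column.
lists = pd.DataFrame({'myCol': [[1, 2], [3]]})
flat = lists.explode('myCol')
flat.index.name = '_index'
lists['myCol'] = (flat['myCol'] * 2).groupby('_index').agg(list)  # [[2, 4], [6]]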

Transform multiple columns

Action Code Details
Rename columns
data.rename(columns={'old1': 'new1', 'old2': 'new2'}, inplace=True)
Rename columns based on a list of old names and list of new names
data.rename(columns=dict(zip(oldNames, newNames)), inplace=True)
All columns to lowercase
data.rename(str.lower, axis='columns', inplace=True)

Grow

Action Code Details
Append series as column col
data['col'] = s
Overwrites existing column
Append undefined categorical column col
data['col'] = pd.Categorical([None] * len(data), categories=['a', 'b', 'c'])
Add string column col based on concatenation of two string columns x and y, with separator sep
data['col'] = data['x'] + 'sep' + data['y']
Add column with row number (ignore index)
data['num'] = range(len(data))
Starts from 0
Add column with row number per grouping in group (ignore index)
data['num'] = data.groupby('group').cumcount().add(1)
Starts from 1

Derive

All operations create a new DataFrame.
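
Because every operation returns a new frame, they chain naturally; a small sketch (column names are made up):

result = (
    data
    .rename(columns={'old': 'new'})
    .assign(double=lambda d: d['new'] * 2)
    .dropna()
)
# data itself is unchanged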

Transform

Transform columns

Action Code Details
Rename column
data.rename(columns={'old': 'new'})
Rename multiple columns
data.rename(columns={'old1': 'new1', 'old2': 'new2'})
Rename columns dynamically
data.rename(columns=dict(zip(oldNames, newNames)))
All columns to lowercase
data.rename(str.lower, axis='columns')

Grow

Action Code Details
Append column col with constant value v
data.assign(col=v)
The scalar v is broadcast to every row
Append column col from series s
data.assign(col=s)
Insert column at integer position loc
data.insert(loc, 'col', values)
Split string column col into two columns based on separator sep
data['col'].str.split('sep', n=1, expand=True)
Append columns
data.assign(col1=s1, col2=s2)
Append columns
pd.concat([data, s1, s2], axis=1)

Shrink

Action Code Details
Select columns
data[['subject', 'date']]

Reduce number of rows

Action Code Details
Keep first n rows
data.head(n)
Remove first n rows
data[n:]
Remove last n rows
data[:-n]
Remove rows list rows
data.drop(rows)
Remove duplicated rows
data.drop_duplicates()
Remove rows with a missing value in any column
data.dropna()
Remove rows with a missing value in the given columns list cols
data.dropna(subset=cols)
Keep only rows with at least n non-missing values across columns
data.dropna(thresh=n)
Remove rows with missing values in every column
data.dropna(how='all')
Keep last n rows
data.tail(n)
Remove last row
data.iloc[:-1]
Note: data.pop() removes a column, not a row
Select n largest rows according to column col
data.nlargest(n, col)
Select n largest rows per group group, according to column col
data.groupby('group').apply(lambda x: x.nlargest(n=n, columns='col')).reset_index(drop=True)
Select last row per group group
data.groupby('group').last().reset_index()
Select the row per group group with the largest value for column col
data.loc[data.groupby('group')['col'].idxmax()].reset_index(drop=True)

Reduce number of columns

Action Code Details
Select single column named subject
data[['subject']]
Note the double brackets: single brackets return a Series (see the sketch after this table)
Remove column col
data.drop(columns=col)
Remove duplicated columns
data.loc[:, ~data.T.duplicated()]
Remove column col if it exists
data.drop(columns=col, errors='ignore')
Remove columns list cols
data.drop(columns=cols)
Remove columns with a missing value in any row
data.dropna(axis='columns')
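
The double-bracket distinction referenced in the table above, as a quick sketch (the column name is made up):

s = data['subject']        # single brackets: a Series
frame = data[['subject']]  # double brackets: a one-column DataFrame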

Reshape

Action Code Details
Melt (wide to long)
data.melt(id_vars=['id'])
Dcast (long to wide)
data.pivot(index='id', columns='var', values='val')
Transpose
data.T

Combine

Action Code Details
Concatenate rows of dataframes
pd.concat([df, df2, dfN])
Consider ignore_index=True argument
Concatenate rows of dataframes, having partially overlapping columns
pd.concat([df, df2])
join='outer' is the default, so non-shared columns are kept and filled with NaN
Concatenate columns of dataframes, assuming equal index, and assuming columns don't overlap
pd.concat([data, data2], axis=1)
Unsafe: first make sure the indexes are aligned, otherwise the output can have up to twice the number of rows
Concatenate columns of dataframes, ignoring the index, and assuming columns don't overlap
pd.concat([data.reset_index(drop=True), data2.reset_index(drop=True)], axis=1)
Inner join
data.merge(data2, on=['sex', 'country'])
Left join
data.merge(data2, on=['sex', 'country'], how='left')
Right join
data.merge(data2, on=['sex', 'country'], how='right')
Outer (full) join
data.merge(data2, on=['sex', 'country'], how='outer')
Cross join
data.merge(data2, how='cross')
A cross join takes no join keys
Left anti join
outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)
outer_data[outer_data['_merge'] == 'left_only'].drop(columns='_merge')
Right anti join
outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)
outer_data[outer_data['_merge'] == 'right_only'].drop(columns='_merge')
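
A small worked sketch of the indicator-based anti joins above (frames are made up):

import pandas as pd

data = pd.DataFrame({'sex': ['m', 'f'], 'country': ['NL', 'BE']})
data2 = pd.DataFrame({'sex': ['m'], 'country': ['NL']})

outer = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)
left_only = outer[outer['_merge'] == 'left_only'].drop(columns='_merge')
# left_only holds the ('f', 'BE') row: present in data but not in data2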

Iterate

Action Code Details
Over rows (as tuples)
for row in data.itertuples():
Over rows (index, series)
for i, row in data.iterrows():
Over column names
for col in data.columns:

Convert

Action Code Details
To dict of column lists
data.to_dict('list')
To dict of column series
data.to_dict('series')
To list of dict per row
data.to_dict('records')
To Numpy 2D array
data.to_numpy()
To polars DataFrame
import polars as pl
pl.from_pandas(data)
Requires the pyarrow package
To pretty string, with at most n rows
data.to_string(index=False, max_rows=n)

To file format

Action Code Details
To CSV file
data.to_csv('file.csv', index=False)
index=False is needed to avoid polluting the CSV with a meaningless index
To TSV file
data.to_csv('file.tsv', sep='\t', index=False)
index=False is needed to avoid polluting the TSV with a meaningless index
To JSON file
data.to_json('file.json', orient='records')
See docs for other orient values
To parquet
data.to_parquet('file.parquet')

Options

Action Code Details
Set an option opt to value
pd.set_option('opt', value)
Use options within a context
from pandas import option_context
with option_context('display.max_rows', 10, 'display.max_columns', None):
    print(data)
Show all columns
pd.options.display.max_columns = None
Show all rows
pd.options.display.max_rows = None
Set max output width, in characters
pd.options.display.width = 120
Default is 80
Show floats with d decimal digits precision
pd.options.display.precision = d
Format numbers with thousand separator
pd.options.styler.format.thousands = ','