> **Incomplete sheet.** This sheet is incomplete and could use some attention. Please submit code snippet suggestions as an issue or PR here.
# Pandas DataFrame

Opinionated words of caution:

- Data frames force a frustrating distinction between "data" and index columns, even though in typical exploratory data analysis what constitutes an index is highly context-dependent, making the distinction largely meaningless. Pandas is slow regardless, so effort spent on making effective use of index columns is mostly wasted.
- Don't bother with index columns unless you want to pollute your code with many `reset_index()` calls, or risk surprising results.
- On the other hand, you cannot ignore index columns: you will run into vague index errors whether you want to or not (e.g., when assigning groupby results or shuffling rows).
- MultiIndex versus multiple indices: just don't. If you cared about speed you would not be using pandas anyway.
- Many conflicting dtypes, inconsistencies in the API functions for handling them, and ad-hoc mixing with numpy functions.
- Complex queries become a series of data variable updates, which is hard to read and likely to introduce bugs during later refactoring.
- Inconsistent APIs between data frames, columns, groupby output, and index columns.
- Because of all the complexities above, writing good and clean pandas code is very hard, as evidenced by the large volume of unreadable, anti-pattern-riddled pandas code online (e.g., on Stack Overflow).
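To make the index caveat concrete, here is a minimal sketch (frame and column names are illustrative) of the groupby-assignment surprise mentioned above: assignment aligns on index labels, so a grouped result silently produces NaN unless you use `transform`.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# groupby().sum() is indexed by the group labels ('a', 'b'),
# which no longer align with the original row index (0, 1, 2):
sums = data.groupby('group')['value'].sum()

# Naive assignment aligns on index labels, so every row gets NaN:
data['total'] = sums
print(data['total'].isna().all())  # True: nothing aligned

# transform() keeps the original row index, so values align as expected:
data['total'] = data.groupby('group')['value'].transform('sum')
print(data['total'].tolist())  # [3, 3, 3]
```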
## Code

### Create

| Action | Code | Details |
| --- | --- | --- |
| Empty | `pd.DataFrame()` | |
| Single column from list | `pd.DataFrame({'col': [1, 2, 3]})` | |
| From column lists | `pd.DataFrame({'A': [1, 2], 'fruits': ['banana', 'apple']})` | |
| From dictionary with column list entries | `d = {'Name': ['John', 'Sue'], 'Age': [40, 35]}; pd.DataFrame(d)` | Avoid naming the variable `dict`, which shadows the builtin |
| From Numpy 2D array | `arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); pd.DataFrame(arr, columns=['a', 'b', 'c'])` | Requires `import numpy as np` |
| From JSON string | `json_dict = json.loads(json_string); pd.DataFrame(json_dict)` | Requires `import json` |
| From JSON file | `pd.read_json('file.json')` | |
### Test

| Action | Code | Details |
| --- | --- | --- |
| Is data frame or subclass | `isinstance(data, pd.DataFrame)` | |
| Is data frame and not subclass | `type(data) is pd.DataFrame` | |
| Is empty | `data.empty` | |
| Does column col contain NA value(s) | `data[col].isna().any()` | |
#### Test for presence of columns

| Action | Code | Details |
| --- | --- | --- |
| Has column col | `col in data.columns` | |
| Has columns cols | `set(data).issuperset(cols)` | Iterating a data frame yields its column names |
| Has columns cols | `pd.Index(cols).isin(data.columns).all()` | |
| Only contains columns cols | `data.columns.isin(cols).all()` | |
| Has any of these columns cols | `data.columns.isin(cols).any()` | |
| Has any of these columns cols | `not set(cols).isdisjoint(data)` | |
| Does not have these columns cols | `set(cols).isdisjoint(data)` | |
#### Tests for column types

| Action | Code | Details |
| --- | --- | --- |
| Is column col boolean type | `pd.api.types.is_bool_dtype(data[col])` | |
| Is column col categorical type | `isinstance(data[col].dtype, pd.CategoricalDtype)` | Test the `.dtype`, not the series itself |
| Is column col string type | `pd.api.types.is_string_dtype(data[col])` | |
| Is column col numeric type | `pd.api.types.is_numeric_dtype(data[col])` | |
| Is column col integer type | `pd.api.types.is_integer_dtype(data[col])` | |
| Is column col datetime type | `pd.api.types.is_datetime64_dtype(data[col])` | |
| Is column col datetime (nanosecond) type | `data.dtypes[col] == np.dtype('datetime64[ns]')` | Requires `import numpy as np` |
#### Tests for values

| Action | Code | Details |
| --- | --- | --- |
| Contains NA | `data.isna().values.any()` | |
| Contains no NA | `data.notna().values.all()` | |
| Does column col contain any NA values | `data[col].isna().any()` | |
| Does column col not contain NA values | `data[col].notna().all()` | |
| Do none of the columns cols contain NA values | `data[cols].notnull().all().all()` | |
| Contains any infinity | `data.isin([np.inf, -np.inf]).values.any()` | Requires `import numpy as np` |
| Does column col contain any infinite values | `data['col'].isin([np.inf, -np.inf]).any()` | Requires `import numpy as np` |
| Does column col contain any infinite values | `np.isinf(data['col']).any()` | Numeric columns only |
#### Tests for duplicates

| Action | Code | Details |
| --- | --- | --- |
| No duplicated rows (all unique) | `not data.duplicated().any()` | |
| Some rows are duplicated | `data.duplicated().any()` | |
| Some groups in column groupCol have duplicate elements in column valueCol | `data.duplicated(subset=['groupCol', 'valueCol']).any()` | |
| No groups in column groupCol have duplicated values for column valueCol | `not data.duplicated(subset=['groupCol', 'valueCol']).any()` | |
| No columns are duplicated by values | `not data.T.duplicated().any()` | |
| Some columns are duplicated by values | `data.T.duplicated().any()` | |
#### Assertions

| Action | Code | Details |
| --- | --- | --- |
| Assert frames are equal | `pd.testing.assert_frame_equal(x, y)` | |
| Assert frames are equal, ignoring column order and row order | `pd.testing.assert_frame_equal(x, y, check_like=True)` | |
### Snippets which return non-DataFrame output (e.g., scalar, series)

| Action | Code | Details |
| --- | --- | --- |
| Number of rows | `len(data)` | |
| Number of rows | `data.shape[0]` | |
| Extract row count by group group | `data.groupby('group').size().reset_index(name='count')` | Returns a data frame with group columns and a respective count column |
| Row names | `data.index` | |
| Get column named name (as series) | `data[name]` | |
| Column names | `data.columns` | |
| Number of columns | `len(data.columns)` | |
| Number of columns | `data.shape[1]` | |
| Number of cells | `data.size` | |
| Dimensions (as tuple) | `data.shape` | |

| Action | Code | Details |
| --- | --- | --- |
| Mask for missing values (None, NA, NaN) | `data.isna()` | |
| Mask for non-missing values | `data.notna()` | `data.notnull()` is an alias and gives the same result |
| Mask for duplicate rows | `data.duplicated()` | |
| Mask for duplicates across columns in list cols | `data.duplicated(subset=cols)` | |
|
### Query

| Action | Code | Details |
| --- | --- | --- |
| Cell at row i, col j | `data.iat[i, j]` | By position; use `data.at` for label-based access |
| First row (Series) | `data.iloc[0]` | |
| Row by row number i | `data.iloc[i]` | |
| Row by index i | `data.loc[i]` | |
| String query | `data.query('age > 40')` | |
### Update

All operations are in-place. Transformed columns can be assigned to other frames, but this is generally unsafe, as it depends on the index columns matching.
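A minimal sketch of why cross-frame column assignment is unsafe (frame and column names are illustrative): assignment aligns on index labels rather than row position, so a shuffled column is silently un-shuffled.

```python
import pandas as pd

data = pd.DataFrame({'x': [1, 2, 3]})
shuffled = data.sample(frac=1, random_state=0)  # shuffle rows, index travels along

# Assigning a column from another frame aligns on index, not position,
# so the shuffle is silently undone:
data['y'] = shuffled['x']
print(data['y'].tolist())  # [1, 2, 3], the original order

# Dropping the index first makes the assignment positional:
data['z'] = shuffled['x'].reset_index(drop=True)
```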
| Action | Code | Details |
| --- | --- | --- |
| Set cell at row i, col j to value v | `data.iat[i, j] = v` | |
| Replace missing values with v | `data.fillna(v, inplace=True)` | |
| Replace specific values | `data.replace(old, new, inplace=True)` | |

| Action | Code | Details |
| --- | --- | --- |
| Rename column | `data.rename(columns={'old': 'new'}, inplace=True)` | |
| Set datetime column date to standard unit (ns) | `data['date'] = pd.to_datetime(data['date'])` | Needed, for example, to avoid data corruption when saving to an HDF5 store |
| Sample a column col per grouping in group (ignore index) | `data[col] = data.groupby(group)[col].transform(lambda x: x.sample(frac=1).to_list())` | `to_list()` is essential here to force pandas to ignore the index columns; otherwise the sampling has no effect |
| Apply function to a list column myCol (as flattened series) | `flat_data = data.explode('myCol')`<br>`flat_data.index.name = '_index'`<br>`data['myCol'] = flat_data['myCol'].groupby('_index').agg(list)` | Warning: `explode` converts empty lists to NaN. Setting the index name is required because pandas cannot group by a nameless index |

| Action | Code | Details |
| --- | --- | --- |
| Rename columns | `data.rename(columns={'old1': 'new1', 'old2': 'new2'}, inplace=True)` | |
| Rename columns based on a list of old names and list of new names | `data.rename(columns=dict(zip(oldNames, newNames)), inplace=True)` | |
| All columns to lowercase | `data.rename(str.lower, axis='columns', inplace=True)` | |
|
### Grow

| Action | Code | Details |
| --- | --- | --- |
| Append series s as column col | `data['col'] = s` | Overwrites an existing column of the same name |
| Append undefined categorical column col | `data['col'] = pd.Categorical([None] * len(data), categories=['a', 'b', 'c'])` | |
| Add string column col based on concatenation of two string columns x and y, with separator sep | `data['col'] = data['x'] + 'sep' + data['y']` | |
| Add column with row number (ignore index) | `data['num'] = range(1, len(data) + 1)` | Starts from 1 |
| Add column with row number per grouping in group (ignore index) | `data['num'] = data.groupby('group').cumcount().add(1)` | Starts from 1 |
### Derive

All operations create a new DataFrame.

| Action | Code | Details |
| --- | --- | --- |
| Rename column | `data.rename(columns={'old': 'new'})` | |
| Rename multiple columns | `data.rename(columns={'old1': 'new1', 'old2': 'new2'})` | |
| Rename columns dynamically | `data.rename(columns=dict(zip(oldNames, newNames)))` | |
| All columns to lowercase | `data.rename(str.lower, axis='columns')` | |
### Grow

| Action | Code | Details |
| --- | --- | --- |
| Append column col with constant value v | `data.assign(col=v)` | |
| Append column col from series s | `data.assign(col=s)` | |
| Insert column at position i | `data.insert(i, 'col', s)` | In-place, unlike the other operations here |
| Split string column col into two columns based on separator sep | `data['col'].str.split('sep', n=1, expand=True)` | |
| Append columns | `data.assign(col1=s1, col2=s2)` | |
| Append columns | `pd.concat([data, s1, s2], axis=1)` | Aligns on index |
### Shrink

| Action | Code | Details |
| --- | --- | --- |
| Select columns | `data[['subject', 'date']]` | |
#### Reduce number of rows

| Action | Code | Details |
| --- | --- | --- |
| Keep first n rows | `data.head(n)` | |
| Remove first n rows | `data.iloc[n:]` | |
| Remove last n rows | `data.iloc[:-n]` | |
| Remove rows list rows | `data.drop(rows)` | Matches by index label |
| Remove duplicated rows | `data.drop_duplicates()` | |
| Remove rows with a missing value in any column | `data.dropna()` | |
| Remove rows with a missing value in the given columns list cols | `data.dropna(subset=cols)` | |
| Keep only rows with at least n non-missing values across columns | `data.dropna(thresh=n)` | |
| Remove rows with missing values in every column | `data.dropna(how='all')` | |
| Keep last n rows | `data.tail(n)` | |
| Pop row | | |
| Select n largest rows according to column col | `data.nlargest(n, 'col')` | |
| Select n largest rows per group group, according to column col | `data.groupby('group').apply(lambda x: x.nlargest(n=n, columns='col')).reset_index(drop=True)` | |
| Select last row per group group | `data.groupby('group').last().reset_index()` | |
| Select the row per group group with the largest value for column col | `data.loc[data.groupby('group')['col'].idxmax()].reset_index(drop=True)` | |
#### Reduce number of columns

| Action | Code | Details |
| --- | --- | --- |
| Select single column named subject | `data[['subject']]` | Note the double brackets |
| Remove column col | `data.drop(columns='col')` | |
| Remove duplicated columns | `data.loc[:, ~data.columns.duplicated()]` | By column name |
| Remove column col if it exists | `data.drop(columns=col, errors='ignore')` | |
| Remove columns list cols | `data.drop(columns=cols)` | |
| Remove columns with a missing value in any row | `data.dropna(axis='columns')` | |
### Reshape

| Action | Code | Details |
| --- | --- | --- |
| Melt | `data.melt(id_vars=ids)` | Wide to long |
| Dcast | `data.pivot(index=i, columns=c, values=v)` | Long to wide; see also `pivot_table` for aggregation |
| Transpose | `data.T` | |
### Combine

| Action | Code | Details |
| --- | --- | --- |
| Concatenate rows of dataframes | `pd.concat([df, df2, dfN])` | Consider the `ignore_index=True` argument |
| Concatenate rows of dataframes, having partially overlapping columns | `pd.concat([df, df2])` | Non-shared columns are filled with NaN; pass `join='inner'` to keep only shared columns |
| Concatenate columns of dataframes, assuming equal index, and assuming columns don't overlap | `pd.concat([data, data2], axis=1)` | Unsafe. First make sure the indexes are aligned, or the output can have up to twice the number of rows |
| Concatenate columns of dataframes, ignoring the index, and assuming columns don't overlap | `pd.concat([data.reset_index(drop=True), data2.reset_index(drop=True)], axis=1)` | |
| Inner join | `data.merge(data2, on=['sex', 'country'])` | |
| Left join | `data.merge(data2, on=['sex', 'country'], how='left')` | |
| Right join | `data.merge(data2, on=['sex', 'country'], how='right')` | |
| Outer (full) join | `data.merge(data2, on=['sex', 'country'], how='outer')` | |
| Cross join | `data.merge(data2, how='cross')` | `how='cross'` does not take an `on` argument |
| Left anti join | `outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)`<br>`outer_data[outer_data['_merge'] == 'left_only'].drop(columns='_merge')` | |
| Right anti join | `outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)`<br>`outer_data[outer_data['_merge'] == 'right_only'].drop(columns='_merge')` | |
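A minimal worked example of the anti-join recipe above, with illustrative data: the `indicator=True` argument adds a `_merge` column that records where each row came from, which is then used to keep only the unmatched rows.

```python
import pandas as pd

data = pd.DataFrame({'sex': ['m', 'f'], 'country': ['NL', 'US'], 'v': [1, 2]})
data2 = pd.DataFrame({'sex': ['m'], 'country': ['NL'], 'w': [10]})

# Outer join with indicator, then keep rows present only in the left frame:
outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)
anti = outer_data[outer_data['_merge'] == 'left_only'].drop(columns='_merge')

print(anti[['sex', 'country']].to_dict('records'))  # [{'sex': 'f', 'country': 'US'}]
```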
### Iterate

| Action | Code | Details |
| --- | --- | --- |
| Over rows (as tuples) | `for row in data.itertuples():` | |
| Over rows (index, series) | `for i, row in data.iterrows():` | |
| Over columns (name, series) | `for name, col in data.items():` | |
### Convert

| Action | Code | Details |
| --- | --- | --- |
| To dict of column lists | `data.to_dict('list')` | |
| To dict of column series | `data.to_dict('series')` | |
| To list of dicts per row | `data.to_dict('records')` | |
| To Numpy 2D array | `data.to_numpy()` | |
| To polars DataFrame | `pl.from_pandas(data)` | Requires `import polars as pl` and the pyarrow package |
| To pretty string, with at most n rows | `data.to_string(index=False, max_rows=n)` | |
| Action | Code | Details |
| --- | --- | --- |
| To CSV file | `data.to_csv('file.csv', index=False)` | `index=False` is needed to not pollute the CSV with a meaningless index |
| To TSV file | `data.to_csv('file.tsv', sep='\t', index=False)` | `index=False` is needed to not pollute the TSV with a meaningless index |
| To JSON file | `data.to_json('file.json', orient='records')` | See the docs to determine the appropriate `orient` value |
| To parquet | `data.to_parquet('file.parquet')` | |
### Options

| Action | Code | Details |
| --- | --- | --- |
| Set an option opt to value | `pd.set_option('opt', value)` | |
| Use options within a context | `from pandas import option_context`<br>`with option_context('display.max_rows', 10, 'display.max_columns', None):`<br>`    print(data)` | |
| Show all columns | `pd.options.display.max_columns = None` | |
| Show all rows | `pd.options.display.max_rows = None` | |
| Set max output width, in characters | `pd.options.display.width = 120` | Default is 80 |
| Show floats with d decimal digits precision | `pd.options.display.precision = d` | |
| Format numbers with thousand separator | `pd.options.styler.format.thousands = ','` | Applies to Styler output |