> **Incomplete sheet.** This sheet is incomplete and could use some attention. Please submit code snippet suggestions as an issue or PR here.
# Pandas DataFrame

Opinionated words of caution:

- Data frames force a frustrating distinction between "data" and index columns, even though in typical exploratory data analysis what constitutes an index is highly context-dependent, making the distinction largely meaningless. Pandas is slow regardless, so effort spent on making effective use of index columns is mostly wasted.
- Don't bother with index columns unless you want to pollute your code with many `reset_index()` calls, or risk surprising results.
- On the other hand, you cannot ignore index columns: you will run into vague index errors whether you want to or not (e.g., when assigning groupby results or shuffling rows).
- MultiIndex versus multiple indices: just don't. If you cared about speed you would not be using pandas anyway.
- Many conflicting dtypes, inconsistencies in the API functions for handling them, and ad-hoc mixing with numpy functions.
- Complex queries become a series of data variable updates, which is hard to read and likely to introduce bugs during later refactoring.
- Inconsistent APIs between data frames, columns, groupby output, and index columns.
- Because of all the complexities above, writing good and clean pandas code is very hard, as evidenced by the large volume of unreadable, anti-pattern-riddled pandas code online (e.g., on Stack Overflow).
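To make the index caveat concrete, here is a minimal sketch (frame and column names are illustrative) of the groupby-assignment surprise mentioned above: assignment aligns on index labels, so a grouped result silently produces NaN unless you use `transform`.

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# groupby().sum() is indexed by the group labels ('a', 'b'),
# which no longer align with the original row index (0, 1, 2):
sums = data.groupby('group')['value'].sum()

# Naive assignment aligns on index labels, so every row gets NaN:
data['total'] = sums
print(data['total'].isna().all())  # True: nothing aligned

# transform() keeps the original row index, so values align as expected:
data['total'] = data.groupby('group')['value'].transform('sum')
print(data['total'].tolist())  # [3, 3, 3]
```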
## Code

### Create

| Action | Code | Details |
| --- | --- | --- |
| Empty | `pd.DataFrame()` | |
| Single column from list | `pd.DataFrame({'col': [1, 2, 3]})` | |
| From column lists | `pd.DataFrame({'A': [1, 2], 'fruits': ['banana', 'apple']})` | |
| From dictionary with column list entries | `d = {'Name': ['John', 'Sue'], 'Age': [40, 35]}; pd.DataFrame(d)` | Avoid naming the variable `dict`, which shadows the builtin |
| From Numpy 2D array | `arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); pd.DataFrame(arr, columns=['a', 'b', 'c'])` | Requires `import numpy as np` |
| From JSON string | `json_dict = json.loads(json_string); pd.DataFrame(json_dict)` | Requires `import json` |
| From JSON file | `pd.read_json('file.json')` | |
### Test

| Action | Code | Details |
| --- | --- | --- |
| Is data frame or subclass | `isinstance(data, pd.DataFrame)` | |
| Is data frame and not subclass | `type(data) is pd.DataFrame` | |
| Is empty | `data.empty` | |
| Does column col contain NA value(s) | `data[col].isna().any()` | |
#### Test for presence of columns

| Action | Code | Details |
| --- | --- | --- |
| Has column col | `col in data.columns` | |
| Has columns cols | `set(data).issuperset(cols)` | Iterating a data frame yields its column names |
| Has columns cols | `pd.Index(cols).isin(data.columns).all()` | |
| Only contains columns cols | `data.columns.isin(cols).all()` | |
| Has any of these columns cols | `data.columns.isin(cols).any()` | |
| Has any of these columns cols | `not set(cols).isdisjoint(data)` | |
| Does not have these columns cols | `set(cols).isdisjoint(data)` | |
#### Tests for column types

| Action | Code | Details |
| --- | --- | --- |
| Is column col boolean type | `pd.api.types.is_bool_dtype(data[col])` | |
| Is column col categorical type | `isinstance(data[col].dtype, pd.CategoricalDtype)` | Test the `.dtype`, not the series itself |
| Is column col string type | `pd.api.types.is_string_dtype(data[col])` | |
| Is column col numeric type | `pd.api.types.is_numeric_dtype(data[col])` | |
| Is column col integer type | `pd.api.types.is_integer_dtype(data[col])` | |
| Is column col datetime type | `pd.api.types.is_datetime64_dtype(data[col])` | |
| Is column col datetime (nanosecond) type | `data.dtypes[col] == np.dtype('datetime64[ns]')` | Requires `import numpy as np` |
#### Tests for values

| Action | Code | Details |
| --- | --- | --- |
| Contains NA | `data.isna().values.any()` | |
| Contains no NA | `data.notna().values.all()` | |
| Does column col contain any NA values | `data[col].isna().any()` | |
| Does column col not contain NA values | `data[col].notna().all()` | |
| Do none of the columns cols contain NA values | `data[cols].notnull().all().all()` | |
| Contains any infinity | `data.isin([np.inf, -np.inf]).values.any()` | Requires `import numpy as np` |
| Does column col contain any infinite values | `data['col'].isin([np.inf, -np.inf]).any()` | Requires `import numpy as np` |
| Does column col contain any infinite values | `np.isinf(data['col']).any()` | Numeric columns only |
#### Tests for duplicates

| Action | Code | Details |
| --- | --- | --- |
| No duplicated rows (all unique) | `not data.duplicated().any()` | |
| Some rows are duplicated | `data.duplicated().any()` | |
| Some groups in column groupCol have duplicate elements in column valueCol | `data.duplicated(subset=['groupCol', 'valueCol']).any()` | |
| No groups in column groupCol have duplicated values for column valueCol | `not data.duplicated(subset=['groupCol', 'valueCol']).any()` | |
| No columns are duplicated by values | `not data.T.duplicated().any()` | |
| Some columns are duplicated by values | `data.T.duplicated().any()` | |
#### Assertions

| Action | Code | Details |
| --- | --- | --- |
| Assert frames are equal | `pd.testing.assert_frame_equal(x, y)` | |
| Assert frames are equal, ignoring column order and row order | `pd.testing.assert_frame_equal(x, y, check_like=True)` | |
### Snippets which return non-DataFrame output (e.g., scalar, series)

| Action | Code | Details |
| --- | --- | --- |
| Number of rows | `len(data)` | |
| Number of rows | `data.shape[0]` | |
| Extract row count by group group | `data.groupby('group').size().reset_index(name='count')` | Returns a data frame with group columns and a respective count column |
| Row names | `data.index` | |
| Get column named name (as series) | `data[name]` | |
| Column names | `data.columns` | |
| Number of columns | `len(data.columns)` | |
| Number of columns | `data.shape[1]` | |
| Number of cells | `data.size` | |
| Dimensions (as tuple) | `data.shape` | |

| Action | Code | Details |
| --- | --- | --- |
| Mask for missing values (None, NA, NaN) | `data.isna()` | |
| Mask for non-missing values | `data.notna()` | `data.notnull()` is an alias and gives the same result |
| Mask for duplicate rows | `data.duplicated()` | |
| Mask for duplicates across columns in list cols | `data.duplicated(subset=cols)` | |
|
### Query

| Action | Code | Details |
| --- | --- | --- |
| Cell at row i, col j | `data.iat[i, j]` | By position; use `data.at` for label-based access |
| First row (Series) | `data.iloc[0]` | |
| Row by row number i | `data.iloc[i]` | |
| Row by index i | `data.loc[i]` | |
| String query | `data.query('age > 40')` | |
### Update

All operations are in-place. Transformed columns can be assigned to other frames, but this is generally unsafe, as it depends on the index columns matching.
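A minimal sketch of why cross-frame column assignment is unsafe (frame and column names are illustrative): assignment aligns on index labels rather than row position, so a shuffled column is silently un-shuffled.

```python
import pandas as pd

data = pd.DataFrame({'x': [1, 2, 3]})
shuffled = data.sample(frac=1, random_state=0)  # shuffle rows, index travels along

# Assigning a column from another frame aligns on index, not position,
# so the shuffle is silently undone:
data['y'] = shuffled['x']
print(data['y'].tolist())  # [1, 2, 3], the original order

# Dropping the index first makes the assignment positional:
data['z'] = shuffled['x'].reset_index(drop=True)
```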
| Action | Code | Details |
| --- | --- | --- |
| Set cell at row i, col j to value v | `data.iat[i, j] = v` | |
| Replace missing values with v | `data.fillna(v, inplace=True)` | |
| Replace specific values | `data.replace(old, new, inplace=True)` | |

| Action | Code | Details |
| --- | --- | --- |
| Rename column | `data.rename(columns={'old': 'new'}, inplace=True)` | |
| Set datetime column date to standard unit (ns) | `data['date'] = pd.to_datetime(data['date'])` | Needed, for example, to avoid data corruption when saving to an HDF5 store |
| Sample a column col per grouping in group (ignore index) | `data[col] = data.groupby(group)[col].transform(lambda x: x.sample(frac=1).to_list())` | `to_list()` is essential here to force pandas to ignore the index columns; otherwise the sampling has no effect |
| Apply function to a list column myCol (as flattened series) | `flat_data = data.explode('myCol')`<br>`flat_data.index.name = '_index'`<br>`data['myCol'] = flat_data['myCol'].groupby('_index').agg(list)` | Warning: `explode` converts empty lists to NaN. Setting the index name is required because pandas cannot group by a nameless index |

| Action | Code | Details |
| --- | --- | --- |
| Rename columns | `data.rename(columns={'old1': 'new1', 'old2': 'new2'}, inplace=True)` | |
| Rename columns based on a list of old names and list of new names | `data.rename(columns=dict(zip(oldNames, newNames)), inplace=True)` | |
| All columns to lowercase | `data.rename(str.lower, axis='columns', inplace=True)` | |
|
### Grow

| Action | Code | Details |
| --- | --- | --- |
| Append series s as column col | `data['col'] = s` | Overwrites an existing column of the same name |
| Append undefined categorical column col | `data['col'] = pd.Categorical([None] * len(data), categories=['a', 'b', 'c'])` | |
| Add string column col based on concatenation of two string columns x and y, with separator sep | `data['col'] = data['x'] + 'sep' + data['y']` | |
| Add column with row number (ignore index) | `data['num'] = range(1, len(data) + 1)` | Starts from 1 |
| Add column with row number per grouping in group (ignore index) | `data['num'] = data.groupby('group').cumcount().add(1)` | Starts from 1 |
### Derive

All operations create a new DataFrame.

| Action | Code | Details |
| --- | --- | --- |
| Rename column | `data.rename(columns={'old': 'new'})` | |
| Rename multiple columns | `data.rename(columns={'old1': 'new1', 'old2': 'new2'})` | |
| Rename columns dynamically | `data.rename(columns=dict(zip(oldNames, newNames)))` | |
| All columns to lowercase | `data.rename(str.lower, axis='columns')` | |
### Grow

| Action | Code | Details |
| --- | --- | --- |
| Append column col with constant value v | `data.assign(col=v)` | |
| Append column col from series s | `data.assign(col=s)` | |
| Insert column at position i | `data.insert(i, 'col', s)` | In-place, unlike the other operations here |
| Split string column col into two columns based on separator sep | `data['col'].str.split('sep', n=1, expand=True)` | |
| Append columns | `data.assign(col1=s1, col2=s2)` | |
| Append columns | `pd.concat([data, s1, s2], axis=1)` | Aligns on index |
### Shrink

| Action | Code | Details |
| --- | --- | --- |
| Select columns | `data[['subject', 'date']]` | |
#### Reduce number of rows

| Action | Code | Details |
| --- | --- | --- |
| Keep first n rows | `data.head(n)` | |
| Remove first n rows | `data.iloc[n:]` | |
| Remove last n rows | `data.iloc[:-n]` | |
| Remove rows list rows | `data.drop(rows)` | Matches by index label |
| Remove duplicated rows | `data.drop_duplicates()` | |
| Remove rows with a missing value in any column | `data.dropna()` | |
| Remove rows with a missing value in the given columns list cols | `data.dropna(subset=cols)` | |
| Keep only rows with at least n non-missing values across columns | `data.dropna(thresh=n)` | |
| Remove rows with missing values in every column | `data.dropna(how='all')` | |
| Keep last n rows | `data.tail(n)` | |
| Pop row | | |
| Select n largest rows according to column col | `data.nlargest(n, 'col')` | |
| Select n largest rows per group group, according to column col | `data.groupby('group').apply(lambda x: x.nlargest(n=n, columns='col')).reset_index(drop=True)` | |
| Select last row per group group | `data.groupby('group').last().reset_index()` | |
| Select the row per group group with the largest value for column col | `data.loc[data.groupby('group')['col'].idxmax()].reset_index(drop=True)` | |
#### Reduce number of columns

| Action | Code | Details |
| --- | --- | --- |
| Select single column named subject | `data[['subject']]` | Note the double brackets |
| Remove column col | `data.drop(columns='col')` | |
| Remove duplicated columns | `data.loc[:, ~data.columns.duplicated()]` | By column name |
| Remove column col if it exists | `data.drop(columns=col, errors='ignore')` | |
| Remove columns list cols | `data.drop(columns=cols)` | |
| Remove columns with a missing value in any row | `data.dropna(axis='columns')` | |
### Reshape

| Action | Code | Details |
| --- | --- | --- |
| Melt | `data.melt(id_vars=ids)` | Wide to long |
| Dcast | `data.pivot(index=i, columns=c, values=v)` | Long to wide; see also `pivot_table` for aggregation |
| Transpose | `data.T` | |
### Combine

| Action | Code | Details |
| --- | --- | --- |
| Concatenate rows of dataframes | `pd.concat([df, df2, dfN])` | Consider the `ignore_index=True` argument |
| Concatenate rows of dataframes, having partially overlapping columns | `pd.concat([df, df2])` | Non-shared columns are filled with NaN; pass `join='inner'` to keep only shared columns |
| Concatenate columns of dataframes, assuming equal index, and assuming columns don't overlap | `pd.concat([data, data2], axis=1)` | Unsafe. First make sure the indexes are aligned, or the output can have up to twice the number of rows |
| Concatenate columns of dataframes, ignoring the index, and assuming columns don't overlap | `pd.concat([data.reset_index(drop=True), data2.reset_index(drop=True)], axis=1)` | |
| Inner join | `data.merge(data2, on=['sex', 'country'])` | |
| Left join | `data.merge(data2, on=['sex', 'country'], how='left')` | |
| Right join | `data.merge(data2, on=['sex', 'country'], how='right')` | |
| Outer (full) join | `data.merge(data2, on=['sex', 'country'], how='outer')` | |
| Cross join | `data.merge(data2, how='cross')` | `how='cross'` does not take an `on` argument |
| Left anti join | `outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)`<br>`outer_data[outer_data['_merge'] == 'left_only'].drop(columns='_merge')` | |
| Right anti join | `outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)`<br>`outer_data[outer_data['_merge'] == 'right_only'].drop(columns='_merge')` | |
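A minimal worked example of the anti-join recipe above, with illustrative data: the `indicator=True` argument adds a `_merge` column that records where each row came from, which is then used to keep only the unmatched rows.

```python
import pandas as pd

data = pd.DataFrame({'sex': ['m', 'f'], 'country': ['NL', 'US'], 'v': [1, 2]})
data2 = pd.DataFrame({'sex': ['m'], 'country': ['NL'], 'w': [10]})

# Outer join with indicator, then keep rows present only in the left frame:
outer_data = data.merge(data2, on=['sex', 'country'], how='outer', indicator=True)
anti = outer_data[outer_data['_merge'] == 'left_only'].drop(columns='_merge')

print(anti[['sex', 'country']].to_dict('records'))  # [{'sex': 'f', 'country': 'US'}]
```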
### Iterate

| Action | Code | Details |
| --- | --- | --- |
| Over rows (as tuples) | `for row in data.itertuples():` | |
| Over rows (index, series) | `for i, row in data.iterrows():` | |
| Over columns (name, series) | `for name, col in data.items():` | |
### Convert

| Action | Code | Details |
| --- | --- | --- |
| To dict of column lists | `data.to_dict('list')` | |
| To dict of column series | `data.to_dict('series')` | |
| To list of dicts per row | `data.to_dict('records')` | |
| To Numpy 2D array | `data.to_numpy()` | |
| To polars DataFrame | `pl.from_pandas(data)` | Requires `import polars as pl` and the pyarrow package |
| To pretty string, with at most n rows | `data.to_string(index=False, max_rows=n)` | |
| Action | Code | Details |
| --- | --- | --- |
| To CSV file | `data.to_csv('file.csv', index=False)` | `index=False` is needed to not pollute the CSV with a meaningless index |
| To TSV file | `data.to_csv('file.tsv', sep='\t', index=False)` | `index=False` is needed to not pollute the TSV with a meaningless index |
| To JSON file | `data.to_json('file.json', orient='records')` | See the docs to determine the appropriate `orient` value |
| To parquet | `data.to_parquet('file.parquet')` | |
### Options

| Action | Code | Details |
| --- | --- | --- |
| Set an option opt to value | `pd.set_option('opt', value)` | |
| Use options within a context | `from pandas import option_context`<br>`with option_context('display.max_rows', 10, 'display.max_columns', None):`<br>`    print(data)` | |
| Show all columns | `pd.options.display.max_columns = None` | |
| Show all rows | `pd.options.display.max_rows = None` | |
| Set max output width, in characters | `pd.options.display.width = 120` | Default is 80 |
| Show floats with d decimal digits precision | `pd.options.display.precision = d` | |
| Format numbers with thousand separator | `pd.options.styler.format.thousands = ','` | Applies to Styler output |