Pandas (WIP) Pandas.DataFrame seems to be the MS Excel of the Python data processing universe. Slow, write-once, inconsistent API, and many ways to achieve the same thing (the true pythonic way), but useful for quick & dirty data manipulation and popular for that reason.
Words of caution:
Index columns are not usable as data. Don’t bother with index columns unless you want to polute your code with many reset_index()
calls. MultiIndex versus multiple indices. Just don’t. If you cared about speed you would not be using pandas anyway. Complex queries will be a series of data variable updates, which is hard to read, and guaranteed to lead to bugs at a later stage during refactoring. Create What How Details Empty ? Single column from list DataFrame([1, 2, 3])
Column per dictionary entry dict = {'Name': ['John', 'Sue'], 'Age': [40, 35]}
DataFrame(dict)
Numpy 2D array arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
DataFrame(arr, columns = ['a', 'b', 'c'])
Properties What How Details Column names (list) list(data)
Column names (set) set(data)
Columns (index object) data.columns
Row names data.index
Number of columns len(data.columns)
Number of columns data.shape[1]
Number of rows len(data)
Number of rows data.shape[0]
Number of cells data.size
Dimensions (tuple) data.shape
Query What How Details Select column data[['Name']]
Select columns data[['Name', 'Age']]
Cell at row i , col j data.at[i, j]
First row (Series) data.iloc[0]
First n rows data.head(n)
Last n rows data.tail(n)
Row by row number i data.iloc[i]
Row by index i data.loc[i]
String query data.query(...)
Test What How Details Empty data.empty
Has column col col in data
Has columns cols set(cols).issubset(data)
Has columns cols data.columns.isin(cols).all()
Only contains columns cols set(data) == set(cols)
Has any of these columns cols data.columns.isin(cols).any()
Has any of these columns cols set(data).issuperset(cols)
Does not have these columns cols set(cols).isdisjoint(data)
Column col is boolean type pandas.api.types.is_bool_dtype(data[col])
Column col is string type pandas.api.types.is_string_dtype(data[col])
Column col is numeric type pandas.api.types.is_numeric_dtype(data[col])
Column col is integer type pandas.api.types.is_integer_dtype(data[col])
Column col is datetime type pandas.api.types.is_datetime64_dtype(data[col])
Column col is datetime type data.dtypes[col] == numpy.dtype('datetime64[ns]'))
Column col contains missing value data[col].isna().values.any()
Column col contains no missing values data[col].notnull().values.all()
Multiple columns cols contain no missing values data[['col1', 'col2']].notnull().all().all()
Update All operations are in-place.
What How Details Rename column data.rename(columns={'old': 'new'}, inplace=True)
Rename multiple columns data.rename(columns={'old1': 'new1', 'old2': 'new2'}, inplace=True)
Rename columns dynamically data.rename(columns=dict(zip(oldNames, newNames)), inplace=True)
Ugh All columns to lowercase data.rename(str.lower, axis='columns', inplace=True)
Set cell at row i , col j to value v data.at[i, j] = v
Replace missing values for v data.fillna(v)
Replace specific values data.replace(?)
Derive All operations create a new instance.
What How Details Rename column data.rename(columns={'old': 'new'})
Rename multiple columns data.rename(columns={'old1': 'new1', 'old2': 'new2'})
Rename columns dynamically data.rename(columns=dict(zip(oldNames, newNames)))
Ugh All columns to lowercase data.rename(str.lower, axis='columns')
Set datetime column to standard unit (ns) data['Date'] = data['Date'].apply(pd.to_datetime)
Needed for example to avoid data corruption when saving to HDF5 store Mask for NaNs data.isnull()
Mask for NaNs data.isna()
Mask for non-missing values data.notnull()
Mask for non-missing values data.notna()
Mask for duplicates data.duplicated()
Shrink What How Details Pop row data.pop()
Drop rows list rows data.drop(rows)
Drop duplicated rows data.drop_duplicates()
Drop duplicated columns ? Drop column col data.drop(columns=col)
Drop columns list cols data.drop(columns=cols)
Drop rows with a missing value in any column df.dropna()
Drop rows with a missing value in the given columns list cols df.dropna(subset=cols)
Drop rows with at least n non-missing values across columns df.dropna(thresh=n)
Drop rows with missing values in every column df.dropna(how='all')
Drop columns with a missing value in any row data.dropna(axis='columns')
Grow What How Details Append column ? Append columns ? Insert column data.insert(x)
Insert columns ?
Reshape What How Details Melt data.melt(?)
Dcast data.explode(?)
Transpose data.T
Iterate What How Details Over rows (index, series) for i, row in data.iterrows():
Over rows (as tuples) for row in data.itertuples():
Over rows (col, series) for col, row_series in data.items():
Over columns (lists) for col in data.columns:
Convert What How Details Numpy 2D array data.values
Back to top
Copyright © 2023 Niek Den Teuling, distributed under an MIT license . Please submit new code snippets, ideas, or suggestions by creating an issue or pull request.