Polars

  1. Create
  2. Properties
  3. Query
    1. Columns
    2. Rows
    3. Aggregate
    4. Grouped
      1. Over time
  4. Test
    1. Row masking
  5. Update
  6. Add
    1. New columns
    2. Derive columns
    3. Rows
  7. Remove
    1. Columns
    2. Rows
  8. Reshape
  9. Merge
  10. Extract
  11. Convert

Create

What How Details
Create series from list pl.Series('name', [1, 2, 3])  
From lists pl.DataFrame({'A': [1, 2], 'fruits': ['banana', 'apple']})  
From pandas.DataFrame pl.from_pandas(data)  
From dict pl.DataFrame(dict)  
From dict with schema pl.DataFrame(dict, schema = {'col1': pl.Float32, 'col2': pl.Int64, 'col3': pl.Date})  
From array data = np.array([[1, 2], [3, 4]])
pl.DataFrame(data, schema = ['a', 'b'], orient = 'col')
 
From list of lists data = [[1, 2, 3], [4, 5, 6]]
pl.DataFrame(data, schema=['a', 'b', 'c'])
 
From CSV file pl.read_csv('derp.csv')  
From list of data frames pl.concat([data, data2, ..., dataN])  
From list of data frames with different columns pl.concat([data, data2, ..., dataN], how = 'diagonal')  
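
A minimal sketch tying a few of these constructors together (the column names are illustrative):

import polars as pl

# from a dict of lists; dtypes are inferred
data = pl.DataFrame({'age': [20, 16, 31], 'sex': ['male', 'female', 'male']})
data2 = pl.DataFrame({'age': [45], 'sex': ['female']})

# stack frames with identical schemas underneath each other
combined = pl.concat([data, data2])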

Properties

What How Details
Number of columns data.width Or len(data.columns)
Column names data.columns  
Column dtypes data.dtypes  
Column-dtype map data.schema  
Find column index by name data.find_idx_by_name('age')  
Number of rows data.height  

Query

Start a lazy query by creating a LazyFrame with data.lazy(). Operations on a LazyFrame are not executed until a result is requested by calling collect() (or fetch(), which runs the query on only a subset of rows for a quick check). Lazy queries are advised because they allow for query optimization and more parallelization.
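
A minimal lazy pipeline as a sketch, reusing the 'age' and 'sex' columns from the examples in this sheet:

result = (
    data.lazy()
    .filter(pl.col('age') >= 18)   # the optimizer pushes this filter down
    .group_by('sex')
    .agg(pl.col('age').mean())
    .collect()                     # nothing executes until here
)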

Columns

What How Details
Single column data.select('col1')  
Multiple columns data.select('col1', 'col2')  
Multiple columns, dynamically data.select(cols) cols is a list of column names

Rows

What How Details
No rows data.clear()  
i th row data[i]  
i th row from end data[-i]  
First n rows data.head(n)  
Last n rows data.tail(n)  
Slice of rows from a to b data[a:b] or data.slice(a, b - a) slice() takes an offset and a length, not an end index
By list of row numbers data[rows]  
Exclude the given row numbers data.with_row_count().filter(pl.col('row_nr').is_in(rows).not_()) Leftover row_nr column; see the sketch after this table
Exclude rows that contain null values data.drop_nulls()  
Exclude rows that contain null values in certain columns data.drop_nulls(['fruits', 'cars'])  
Conditionally on column data.filter(pl.col('age') >= 18)  
From multiple column conditions data.filter((pl.col('age') >= 18) & (pl.col('sex') == 'male'))  
Limit query to first n rows data.limit(n)  
Limit query to all but the last n rows data.limit(-n) A negative n keeps everything except the final n rows
Number of missing values data.null_count()  
Number of unique values in a column data['col1'].n_unique()  
Number of unique rows over columns data.n_unique(subset=['col1', 'col2'])  
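
A sketch of excluding rows by number and then dropping the helper column (the row numbers are illustrative):

rows = [0, 2]
kept = (
    data.with_row_count()                           # adds a 'row_nr' column
    .filter(pl.col('row_nr').is_in(rows).not_())    # keep everything not listed
    .drop('row_nr')                                 # remove the helper column
)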

Aggregate

Grouped

What How Details
Mean of column data.group_by('sex').agg(pl.col('age').mean())  

Over time

What How Details
Moving average data.group_by_dynamic('ts', every='1d', period='7d').agg(pl.col('value').mean()) Without period the windows do not overlap, giving a plain per-interval mean; the frame must be sorted by 'ts'
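
A sketch of the full pattern; the 7-day window via period is an assumption, adjust to taste:

moving = (
    data.sort('ts')                                         # group_by_dynamic requires a sorted index column
    .group_by_dynamic('ts', every = '1d', period = '7d')    # one overlapping 7-day window per day
    .agg(pl.col('value').mean().alias('value_ma'))
)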

Test

What How Details
Empty (no rows) data.is_empty()  
Contains column 'age' in data.columns  
Contains columns {'age', 'sex'}.issubset(data.columns)  
Contains columns cols set(cols).issubset(data.columns)  
Column is missing 'age' not in data.columns  
Column series are equal data['sex'].series_equal(data['sex2'].alias('sex')) Series names must match!
Columns are equal (ignoring name) data['col1'].series_equal(data['col2'].alias('col1')) Rename one side so the names match
Column has missing value data['sex'].is_null().any()  
Column has no missing values data['sex'].is_not_null().all()  
Column has no duplicate values data['col1'].is_unique().all()  
Column has duplicate values data['col1'].is_duplicated().any()  
Column is of dtype data.schema['col1'] == dtype  
Column is bool dtype data.schema['alive'] == pl.Bool  
Column is string type data.schema['sex'] == pl.Utf8  
Column is integer type data.schema['age'] in pl.datatypes.INTEGER_DTYPES  
Column is standard integer data.schema['age'] == pl.Int64  
Column is consistently pairwise greater than another column (data['col1'] > data['col2']).all()  
Column is sometimes pairwise greater than another column (data['col1'] > data['col2']).any()  
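
A few of these tests combined into assertions, as a sketch:

assert 'age' in data.columns
assert data.schema['age'] in pl.datatypes.INTEGER_DTYPES
assert data['age'].is_not_null().all()        # no missing values
assert (data['col1'] > data['col2']).all()    # pairwise comparison over all rows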

Row masking

What How Details
Duplicated rows data.is_duplicated()  
Unique rows data.is_unique()  

Update

What How Details
Cast column dtype data.with_columns(pl.col('col1').cast(pl.Float32))  
Cast columns to dtypes data.cast({'col1': pl.Float32, 'col2': pl.UInt8})  
Rename column data.rename({'old1': 'new1', 'old2': 'new2'})  
Update column values data.with_columns(pl.col('age') + 5)  
Update column values on condition data.with_columns(
pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
otherwise(pl.lit(-1)).
alias('age')
)
 
Update column values on conditions data.with_columns(
pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
when(pl.col('sex') == 'male').
then(pl.lit(4)).
otherwise(pl.lit(-1)).
alias('age')
)
 
Update column values for specific rows rows = [1, 3, 5]
data.with_row_count().with_columns(
pl.when(pl.col('row_nr').is_in(rows)).
then(pl.lit(True)).
otherwise(pl.col('alive')).
alias('alive')
).drop('row_nr')
 
Fill nulls with zero data.fill_null(strategy = 'zero')  
Fill nulls with value data.fill_null(value)  
Fill nulls with LOCF data.fill_null(strategy='forward') Wrong for grouped data; see the sketch after this table
Fill NaNs with value data.fill_nan(value)  
Replace column inplace data.replace('age', newAgeSeries)  
Sort table by column data.sort('col1')  
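
The group-aware variant of LOCF mentioned above, as a sketch (the column names 'value' and 'Subject' are illustrative):

data.with_columns(
    pl.col('value').fill_null(strategy = 'forward').over('Subject')   # fill never leaks across subjects
)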

Add

New columns

What How Details
Append constant numeric column data.with_columns(Intercept=pl.lit(1))  
Append column from series s = pl.Series("apple", [10, 20, 30])
data.hstack([s])
Note the brackets
Append column from series inplace data.hstack([s], in_place = True)  
Insert column from series inplace data.insert_at_idx(1, s)  

Derive columns

What How Details
Transform another column data.with_columns(AgeSq = pl.col('Age') ** 2)  
Multiple transformations from another column data.with_columns(
Age2 = pl.col('Age') ** 2,
Age3 = pl.col('Age') ** 3
)
 
Boolean mask based on pairwise comparison of another column data.with_columns(CanLive = pl.col('Income') * 3 > pl.col('Rent'))  
Values conditional on another column data.with_columns(
Adult = pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
otherwise(pl.lit(-1))
)
 
Map another column mapping = {1: 'a', 2: 'b', 3: 'c'}
data.with_columns(
NumCat = pl.col('Num').map_dict(mapping).cast(pl.Categorical)
)
 
Integer seconds since Unix epoch to datetime data.with_columns(
Date=pl.from_epoch(pl.col('time'))
)
 
Integer daystamp since Unix epoch to date data.with_columns(
Date=pl.from_epoch(pl.col('time'), time_unit='d')
)
 
Integer milliseconds timestamp since Unix epoch to datetime data.with_columns(
Date=pl.from_epoch(pl.col('time'), time_unit='ms')
)
 
Parse string column to date data.with_columns(
Date=pl.col('RawDate').str.to_date()
)
 
Parse string column to date with known format data.with_columns(
Date = pl.col("RawDate").str.to_date('%Y-%m-%d')
)
 
Week-day from date column data.with_columns(Weekday = pl.col('Date').dt.weekday())  
Month from date column data.with_columns(Month = pl.col('Date').dt.month())  
Year from date column data.with_columns(Year = pl.col('Date').dt.year())  
Group-wise from aggregate value data.with_columns(
DaysSinceStart = (pl.col('Date') - pl.col('Date').min().over('Subject')).dt.days() + 1
)
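
A sketch chaining several derivations; note that a column created in one with_columns call can only be referenced in the next call:

data.with_columns(
    AgeSq = pl.col('Age') ** 2,
    Date = pl.col('RawDate').str.to_date('%Y-%m-%d'),
).with_columns(
    Month = pl.col('Date').dt.month(),   # 'Date' exists only after the first call
)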
 

Rows

What How Details
Add row as tuple data.vstack(pl.DataFrame([row], schema = data.schema, orient = 'row'))  
Add list of tuples data.vstack(pl.DataFrame(rows, schema = data.schema, orient = 'row'))  
Add data frame pl.concat([data, data2])  
Add data frame inplace data.extend(data2)  
Add data frames inplace data.vstack(data2)
data.vstack(dataN)
data.rechunk()
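
A sketch of appending many frames: each vstack is cheap, so rechunk once at the end (data2 and data3 are illustrative):

for chunk in [data2, data3]:
    data.vstack(chunk, in_place = True)   # append without copying existing data
data = data.rechunk()                     # compact the chunks into contiguous memory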
 

Remove

Columns

What How Details
Remove column data.drop('Age')  
Remove column inplace data.drop_in_place('Age') Returns the dropped column
Remove columns data.drop(['Age', 'Sex'])  
Remove all numeric columns data.drop(cs.numeric()) Requires import polars.selectors as cs
Remove columns matching a selector data.drop(selector) e.g. selector = cs.starts_with('tmp')

Rows

Reshape

What How Details
From wide to long format data.melt(id_vars='sex', value_vars=['a', 'b'])  
Explode a list column into one row per element data.explode('col1')  

Merge

What How Details
Merge two data frames on a sorted key data.merge_sorted(data2, key = 'age') Both frames must be sorted by the key
Inner join data.join(data2, on = ['sex', 'country'])  
Left join data.join(data2, on = ['sex', 'country'], how = 'left')  
Right join data.join(data2, on = ['sex', 'country'], how = 'right')  
Outer join data.join(data2, on = ['sex', 'country'], how = 'outer')  
Cross join data.join(data2, how = 'cross') Takes no on keys; pairs every row with every row
Semi join (keep rows of data that have a match in data2) data.join(data2, on = ['sex', 'country'], how = 'semi')  
Anti join (keep rows of data without a match in data2) data.join(data2, on = ['sex', 'country'], how = 'anti')  
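
A join sketch; when both frames share non-key columns, the right-hand copies get a suffix (the '_2' here is an assumption, the default is '_right'):

merged = data.join(data2, on = ['sex', 'country'], how = 'left', suffix = '_2')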

Extract

What How Details
Get column (as series) data['col1']  
Get column (as list) data['col1'].to_list()  
Get i th row (as tuple) data.row(i)  
Get rows (as list of tuples) data.rows()  
First item (cell) data.item(0, 0)  
Item (cell) from row i and column index j data.item(i, j)  
Item (cell) from row i and column name name data.item(i, name)  

Convert

What How Details
To pandas.DataFrame data.to_pandas()  
To list of series data.get_columns()  
Split into list of data frames based on column data.partition_by('sex')  
Split into list of data frames based on multiple columns data.partition_by('sex', 'country')  
Split into dict of data frames based on column(s) data.partition_by('sex', 'country', as_dict = True)  
To CSV file data.write_csv('derp.csv')  
To Parquet file data.write_parquet('derp.parquet')  
To JSON data.write_json('derp.json')  
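
A round-trip sketch: Parquet preserves dtypes, while CSV does not (the file name is illustrative; frame_equal was renamed equals in newer Polars):

data.write_parquet('derp.parquet')
same = pl.read_parquet('derp.parquet')
assert data.frame_equal(same)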