Polars DataFrame
The polars package provides a fast and powerful implementation of data frames.
Details
Start a lazy query by creating a LazyFrame with data.lazy().
Operations on a LazyFrame are not executed until execution is requested by calling either collect() or fetch().
Lazy operations are advised because they allow for query optimization and more parallelization.
Code
Create
Create a new DataFrame
| Action |
Code |
Details |
|
From column lists
|
pl.DataFrame({'A': [1, 2], 'fruits': ['banana', 'apple']})
|
|
|
From list of lists
|
data = [[1, 2, 3], [4, 5, 6]]
pl.DataFrame(data, schema=['a', 'b', 'c'])
|
|
|
From dict
|
|
|
|
From dict with schema
|
pl.DataFrame(
dict,
schema = {'col1': pl.Float32, 'col2': pl.Int64, 'col3': pl.Date}
)
|
|
|
Repeat each row with an index between a row-specific range, defined by min (inclusive) and max (exclusive) columns
|
data.select(
pl.col('patient'),
pl.int_ranges(start='startDay', end='endDay').alias('day')
).explode('day')
|
|
From pandas.DataFrame
|
|
|
|
From numpy array
|
data = np.array([[1, 2], [3, 4]])
pl.DataFrame(data, schema = ['a', 'b'], orient = 'col')
|
|
| Action |
Code |
Details |
|
From CSV file
|
|
|
Test
| Action |
Code |
Details |
Is pl.DataFrame or subclass
|
isinstance(x, pl.DataFrame)
|
|
Is pl.DataFrame but not subclass
|
|
|
|
Empty (no rows)
|
|
|
|
Not empty (has rows)
|
|
|
|
Data frames are equal
|
|
|
|
Any column contains null
|
any(col.has_nulls() for col in data)
|
|
|
Any column contains null
|
data.select(pl.all().is_null().any()).to_series().any()
|
|
|
Any column is filled with nulls
|
any(col.is_null().all() for col in data)
|
|
|
Duplicates across two or more columns
|
data.select('col1', 'col2').is_duplicated().any()
|
|
Tests for specific column(s)
| Action |
Code |
Details |
|
Contains column
|
|
|
|
Contains columns
|
{'age', 'sex'}.issubset(data.columns)
|
|
|
Contains columns cols
|
set(cols).issubset(data.columns)
|
|
|
Column is missing
|
'age' not in data.columns
|
|
|
Column has missing value
|
data['sex'].is_null().any()
|
|
|
Column has no missing values
|
data['sex'].is_not_null().all()
|
|
|
Columns are equal
|
|
|
|
Columns are equal series
|
data['sex'].equals(data['sex2'].alias('sex'))
|
Series names must match! |
|
Column has no duplicate values
|
data['date'].is_unique().all()
|
|
|
Column has duplicate values
|
data['date'].is_duplicated().any()
|
|
|
Column has duplicates within groups
|
data.select(
pl.col('date').is_duplicated().any().over('patient').any()
).item()
|
Checks whether any group (patient) has duplicate values (dates) |
|
Column has duplicates within groups
|
data.select(
pl.col('date').is_duplicated().any().over('patient')
).to_series().any()
|
|
|
Column has duplicates within groups
|
data.group_by('patient').agg(
dupe=pl.col('date').is_duplicated().any()
)['dupe'].any()
|
|
|
Column is of type
|
data.schema['col1'] == dtype
|
|
|
Column is bool type
|
data.schema['alive'] == pl.Bool
|
|
|
Column is string type
|
data.schema['sex'] == pl.String
|
|
|
Columns is numeric
|
data.schema['age'].is_numeric()
|
|
|
Column is integer type
|
data.schema['age'].is_integer()
|
|
|
Column is standard integer (64-bit)
|
data.schema['age'] == pl.Int64
|
|
|
Column is float
|
data.schema['age'].is_float()
|
|
|
Column is date
|
data.schema['date'] == pl.Date
|
|
|
Column is list type
|
data.schema['keywords'] == pl.List
|
|
|
Column is of type list of strings
|
data.schema['keywords'] == pl.List(pl.String)
|
|
|
Column is consistently rowwise greater than another column
|
(data['col1'] > data['col2']).all()
|
|
|
Column is sometimes rowwise greater than another column
|
(data['col1'] > data['col2']).any()
|
|
| Action |
Code |
Details |
|
Number of rows
|
|
|
|
Number of columns
|
|
|
|
Get column name as series
|
|
|
|
Column names
|
|
|
|
Column types
|
|
|
|
Column-types mapping
|
|
|
|
Find column index by name
|
data.find_idx_by_name('age')
|
|
|
Get column name as list
|
|
|
|
Get first column as series
|
|
|
|
Get first column as series
|
data.select( pl.first() )
|
|
|
Get the j-th column as series
|
|
|
|
Get last column as series
|
|
|
|
Number of missing values per column
|
|
|
|
Number of unique values in a column
|
|
|
|
Number of rows without any missing values across columns a, b and c, by group
|
data.group_by('subject').agg([
pl.all_horizontal(
pl.col(['a', 'b', 'c']).is_not_null()
).sum()
])
|
|
|
Number of unique rows across columns
|
|
|
|
Get the i-th row as tuple
|
|
|
|
Get rows as list of tuple
|
|
? |
|
First item (cell)
|
|
|
|
Item (cell) from row i and column index j
|
|
|
|
Item (cell) from row i and column name name
|
|
|
Update
Update the data frame in-place
| Action |
Code |
Details |
|
Replace column with another series
|
data.replace('age', newAgeSeries)
|
|
|
Remove or pop column
|
data.drop_in_place('Age')
|
Returns the dropped column |
|
Append column from series
|
data.hstack(s, in_place = True)
|
|
|
Insert column from series
|
|
|
Combine
| Action |
Code |
Details |
|
Add rows of a data frame (same columns)
|
|
|
|
Add rows of several data frames (same columns)
|
data.vstack(data2)
data.vstack(dataN)
data.rechunk()
|
|
Derive
Transform the data frame, preserving shape
| Action |
Code |
Details |
|
Rename column old to new
|
data.rename({'old': 'new'})
|
|
|
Rename columns
|
data.rename({'old1': 'new1', 'old2': 'new2'})
|
|
|
Cast column type
|
data.with_columns(pl.col('col1').cast(pl.Float32))
|
|
|
Cast columns to types
|
data.cast({'col1': pl.Float32, 'col2': pl.UInt8})
|
|
|
Update column values
|
data.with_columns( pl.col('age') + 5 )
|
|
|
Update column values on condition
|
df.with_columns(
pl.when( pl.col('age') >= 18 ).
then( pl.lit(1) ).
otherwise( pl.lit(-1) )
)
|
|
|
Update column values on conditions
|
df.with_columns(
pl.when( pl.col('age') >= 18 ).
then( pl.lit(1) ).
when( pl.col('sex') == 'male' ).
then(4).
otherwise( pl.lit(-1) )
)
|
|
|
Update column values for specific rows
|
rows = [1, 3, 5]
data.with_row_count().with_columns(
pl.when( pl.col('row_nr').is_in(rows) ).
then( pl.lit(True) )
)
|
|
Fill
| Action |
Code |
Details |
|
Fill nulls with zero
|
data.fill_null(strategy = 'zero')
|
|
|
Fill nulls with value
|
|
|
|
Fill nulls with LOCF
|
data.fill_null(strategy='forward')
|
Wrong for grouped data |
|
Fill NaNs with value v
|
|
|
|
Fill infinity with value v for all columns of dtype float
|
import polars.selectors as cs
data.with_columns(
cs.float().replace(float('inf'), v)
)
|
|
Row mask
All snippets output a pl.Series object. Use .to_list() to obtain list output.
| Action |
Code |
Details |
|
Mask indicating whether rows contain null in any column
|
data.select(
pl.any_horizontal(pl.all().is_null())
).to_series()
|
|
|
Mask indicating whether rows contain null in all columns
|
data.select(
pl.all_horizontal(pl.all().is_null())
).to_series()
|
|
|
Mask indicating duplicate rows
|
|
|
|
Mask indicating unique rows
|
|
|
|
Mask based on a column value
|
|
|
|
Mask based on pairwise comparison of another column
|
data.select(
pl.col('income') * 3 > pl.col('rent')
).to_series()
|
|
Over consecutive rows
Aggregate over consecutive rows, e.g., for computing a moving average.
| Action |
Code |
Details |
|
Compute centered rolling mean with window size w
|
data.with_columns(
smoothValue=pl.col('value').
rolling_mean(window_size=w, min_periods=0, center=True)
)
|
|
|
Compute centered rolling mean per group with window size w
|
data.with_columns(
smoothValue=pl.col('value').
rolling_mean(window_size=w, min_periods=0, center=True).
over('patient')
)
|
|
|
Detrend a column by group, using a centered rolling mean with window size w
|
data.with_columns(
trend=pl.col('value').
rolling_mean(window_size=w, min_periods=0, center=True).
over('patient')
).with_columns(
detrendedValue=(pl.col('value') - pl.col('trend')).over('patient')
)
|
|
|
Detrend a column by group but preserve the mean level, using a centered rolling mean with window size w
|
data.with_columns(
trend=pl.col('value').rolling_mean(
window_size=w, min_periods=0, center=True).over('patient')
).with_columns(
detrendedValue=(
pl.col('value') - pl.col('trend') + pl.col('trend').mean()
).over('patient')
)
|
|
Over time
Aggregation is done with respect to a date/time column (e.g., hourly, daily, weekly)
Row-based
Rolling computation for each row (i.e., observation in time)
| Action |
Code |
Details |
|
Rolling computation for a single column, with right-aligned partial window of max n days and at least i rows
|
data.with_columns(
pl.col('Hospitalized').rolling_mean(
by='Date',
closed='both',
window_size=f'{n - 1}d',
min_periods=i
)
)
|
Currently only available for rolling_mean(), not any of the other rolling_ functions. |
|
Rolling computation with right-aligned partial window of max n days
|
new_data = data.rolling(
index_column='Date',
period=f'{n}d'
).agg(
pl.mean('Obs')
)
|
|
|
Rolling computation with right-aligned partial windows of max n days, keep other columns
|
new_data = data.rolling(
index_column='Date',
period=f'{n}d'
).agg(
pl.exclude(['Date', 'Obs']).last(),
pl.mean('Obs')
)
|
|
Interval-based
Rolling computation for constant (not dynamic!) intervals. This may introduce additional moments in time.
| Action |
Code |
Details |
|
Daily rolling computation with right-aligned partial window of max n days
|
new_data = data.group_by_dynamic(
index_column='Date',
every='1d',
offset=f'-{n - 1}d',
period=f'{n - 1}d',
label='right',
closed='both',
start_by='window'
).agg(
pl.mean('Obs')
)
|
To be verified |
|
Dynamic daily rolling statistic proportional to number of window observations, with right-aligned partial window of max n days
|
new_data = data.group_by_dynamic(
index_column='Date',
every='1d',
offset=f'-{n - 1}d',
period=f'{n - 1}d',
label='right',
closed='both',
start_by='window'
).agg(
pl.sum('Hospitalized').alias('DaysHospitalized'),
pl.count()
).with_columns(
Proportion=pl.col('DaysHospitalized') / pl.col('count')
)
|
To be verified |
|
Daily rolling computation with left-aligned partial window of max n days
|
new_data = data.group_by_dynamic(
index_column='Date',
every='1d',
offset=f'-{n - 1}d',
period=f'{n - 1}d',
label='left',
closed='both',
start_by='datapoint'
).agg(
pl.mean('Obs')
)
|
To be verified |
Order
| Action |
Code |
Details |
|
Reorder all columns in the given order
|
data.select(pl.col(['patient', 'sex', 'age']))
|
|
|
Reorder specific columns in a given order
|
cols = ['topic', 'parent', 'language']
data.select(pl.col(cols), pl.exclude(cols))
|
There doesn't seem to be a shorter way to do this |
|
Sort rows by column value
|
|
|
Grow
Transformations which increase the shape of the data frame
| Action |
Code |
Details |
|
Append constant numeric column
|
data.with_columns(Intercept=pl.lit(1))
|
|
|
Append series as new column
|
s = pl.Series("apple", [10, 20, 30])
data.hstack([s])
|
Note the brackets |
Add derived column(s)
| Action |
Code |
Details |
|
Transform another column
|
data.with_columns( ageSq = pl.col('age') ** 2 )
|
|
|
Transform another column
|
data.with_columns(
(pl.col('age') ** 2).alias('ageSq')
)
|
|
|
Constant values conditional on another column
|
data.with_columns(
AdultScore=pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
otherwise(pl.lit(-1))
)
|
|
|
Add multiple columns based on transformations of other columns
|
data.with_columns(
ageSq=pl.col('age') ** 2,
hospitalizationsSq=pl.col('hospitalizations') ** 2
)
|
|
|
Add multiple columns based on transformations of other columns
|
data.with_columns(
(pl.col('Age') ** 2).alias('Age2'),
pl.col('Age').alias('Age3') ** 3
)
|
alias() can be called after pl.col(), may be more readable sometimes |
|
Add multiple constant columns based on a dict
|
for k, v in dict.items():
data = data.with_columns(pl.lit(v).alias(k))
|
|
Integer column derivations
| Action |
Code |
Details |
|
Map integer values to strings, as categorical column
|
map = {1: 'first', 2: 'second', 3: 'third'}
data.with_columns(
rank=pl.col('place').replace_strict(map).cast(pl.Categorical)
)
|
|
|
Timestamp (in seconds) since Unix epoch (1970-01-01), as datetime column
|
data.with_columns(
date=pl.from_epoch('timestamp')
)
|
|
|
Milliseconds timestamp since Unix epoch, as datetime column
|
data.with_columns(
date=pl.from_epoch('timestamp_ms', time_unit='ms')
)
|
|
|
Total days since Unix epoch, as date column
|
data.with_columns(
date=pl.from_epoch('daystamp', time_unit='d')
)
|
|
|
Total days since Unix epoch, as date column
|
data.with_columns(
date=pl.col('daystamp').cast(pl.Date)
)
|
|
|
Number of days since reference date, as date column
|
from datetime import date
ref_days = (x - date.fromtimestamp(0)).days
data.with_columns(
date=pl.col('day').cast(pl.Date) + pl.duration(days=ref_days)
)
|
|
|
Number of days since reference date, as date column
|
from datetime import date
ref_days = (x - date.fromtimestamp(0)).days
data.with_columns(
date=(pl.col('day') + ref_days).cast(pl.Date)
)
|
|
String column derivations
| Action |
Code |
Details |
|
Parse string column to date
|
data.with_columns(
date=pl.col('datestring').str.to_date()
)
|
|
|
Parse string column to date with known format
|
data.with_columns(
date = pl.col('datestring').str.to_date('%Y-%m-%d')
)
|
|
Datetime column derivations
| Action |
Code |
Details |
|
Weekday from date column
|
|
|
|
Month from date column
|
data.with_columns(
month=pl.col('date').dt.month()
)
|
|
|
Year from date column
|
data.with_columns(
year=pl.col('Date').dt.year()
)
|
|
|
Elapsed days between date ranges, by group
|
data.with_columns(
elapsedDays=(
pl.col('Date') - pl.col('Date').min().over('Subject')
).dt.total_days().add(1)
)
|
|
Add rows
| Action |
Code |
Details |
|
Add row as tuple
|
|
|
|
Add list of tuples
|
|
|
|
Add data frame
|
|
|
Shrink the data frame
Remove one or more columns
| Action |
Code |
Details |
|
Remove single column
|
|
|
|
Remove columns
|
data.drop(['Age', 'Sex'])
|
|
|
Remove columns
|
data.select(pl.exclude(['Age', 'Sex']))
|
|
|
Select single column
|
|
Verbose |
|
Select single column
|
data.select(pl.col('Age'))
|
|
|
Multiple columns
|
data.select('col1', 'col2')
|
|
|
Multiple columns, specified by list
|
data.select(['col1', 'col2'])
|
|
|
Multiple columns, specified by list
|
data.select(pl.col(['Age', 'Sex']))
|
|
|
Remove all numeric columns
|
|
|
Remove one or more rows
| Action |
Code |
Details |
|
Remove all rows (clear)
|
|
|
|
Limit query to first n rows
|
|
|
|
Limit query to last n rows
|
|
|
Keep rows
| Action |
Code |
Details |
|
Select the i-th row
|
|
|
|
Select the i-th last row
|
|
|
|
Select rows by list of row numbers
|
|
|
|
First n rows (head)
|
|
|
|
Last n rows (tail)
|
|
|
|
Slice of rows from a to b
|
|
|
|
Slice of rows from a to b
|
|
|
|
Keep rows conditionally on column
|
data.filter(pl.col('age') >= 18)
|
|
|
Keep rows based on multiple column conditions
|
data.filter(
(pl.col('age') >= 18) & (pl.col('sex') == 'male')
)
|
|
|
Sample at most n rows
|
|
|
|
Sample n rows, with replacement
|
data.sample(n, with_replacement=True)
|
Replicates some rows. The result is guaranteed to have n rows, even when the input contains fewer rows. |
|
Sample n rows per group, with replacement
|
data.group_by('Patient').map_groups(
lambda df: df.sample(n, with_replacement=True)
)
|
Very slow due to the map_groups() call. |
|
Sample at most n rows per group
|
data.filter(
pl.int_range(pl.len()).shuffle().over('patient') < n
)
|
Groups with fewer than n rows keep all rows. |
Exclude rows
| Action |
Code |
Details |
|
Remove rows by row numbers
|
data.with_row_index().filter(
~pl.col('index').is_in(rows)
)
|
Output contains additional column 'index' |
|
Remove all rows except the given row numbers
|
|
|
|
Remove all rows except the given row numbers
|
data.filter(pl.arange(0, pl.count()).is_in([1, 5, 7]))
|
|
|
Remove rows that contain null values
|
|
|
|
Remove rows that contain null values in certain columns
|
data.drop_nulls(['fruits', 'cars'])
|
|
Aggregate rows
By group
| Action |
Code |
Details |
|
Mean of column by group
|
data.group_by('sex').agg(pl.col('age').mean())
|
|
Reshape
| Action |
Code |
Details |
|
From wide to long format
|
data.melt(id_vars='sex', value_vars=['a', 'b'])
|
|
|
To narrow format
|
|
? |
Combine
| Action |
Code |
Details |
|
Concatenate rows of dataframes
|
pl.concat([df, df2, dfN])
|
|
|
Concatenate rows of dataframes, having partially overlapping columns
|
pl.concat([df, df2, dfN], how='diagonal')
|
|
|
Merge two data frames on the sorted key
|
|
|
|
Inner join
|
data.join(data2, on=['sex', 'country'])
|
|
|
Left join
|
data.join(data2, on=['sex', 'country'], how='left')
|
|
|
Right join
|
data.join(data2, on=['sex', 'country'], how='right')
|
|
|
Outer (full) join
|
data.join(data2, on=['sex', 'country'], how='full')
|
|
|
Cross join
|
data.join(data2, on=['sex', 'country'], how='cross')
|
|
|
Semi join (one match per index)
|
data.join(data2, on=['sex', 'country'], how='semi')
|
|
|
Left anti join (exclude matches from table 2)
|
data.join(data2, on=['sex', 'country'], how='anti')
|
|
Convert
| Action |
Code |
Details |
|
To series (first column only)
|
|
|
To polars.LazyFrame
|
|
|
To pandas.DataFrame
|
|
|
|
To list of series
|
|
|
|
Split into list of data frames based on column
|
|
|
|
Split into list of data frames based on column tuples
|
data.partition_by('sex', 'country')
|
|
|
Split into dict of data frames based on column(s)
|
data.partition_by('sex', 'country', as_dict = True)
|
|
|
To CSV file
|
data.write_csv('derp.csv')
|
|
|
To Parquet file
|
data.write_parquet('derp.parquet')
|
|
|
To Arrow IPC or Feather
|
data.write_ipc('file.arrow')
|
|
|
To JSON
|
|
|
|
Write to clipboard, in TSV format
|
|
|
Show