Polars Create Properties Query Columns Rows Aggregate Grouped Over time Test Row masking Update Add New columns Derive columns Rows Remove Columns Rows Reshape Merge Convert Create What How Details Create series from list pl.Series('name', [1, 2, 3])
From lists pl.DataFrame('A': [1, 2], 'fruits': ['banana', 'apple'])
From pandas.DataFrame
pl.from_pandas(data)
From dict pl.DataFrame(dict)
From dict with schema pl.DataFrame(dict, schema = {'col1': pl.Float32, 'col2': pl.Int64, 'col3': pl.Date})
From array data = np.array([[1, 2], [3, 4]])
pl.DataFrame(data, schema = ['a', 'b'], orient = 'col')
From list of lists data = [[1, 2, 3], [4, 5, 6]]
pl.DataFrame(data, schema=['a', 'b', 'c'])
From CSV file pl.read_csv('derp.csv')
From list of data frames pl.concat([data, data2, ..., dataN])
From list of data frames with different columns pl.concat([data, data2, ..., dataN], how = 'diagonal')
Properties What How Details Number of columns len(data.columns)
? Column names data.columns
Column dtypes data.dtypes
Column-dtype map data.schema
Find column index by name data.find_idx_by_name('age')
Number of rows data.height
Query Start a lazy query using a LazyFrame by data.lazy()
. Operations on a LazyFrame are not executed until this is requested by either calling collect() or fetch(). Lazy operations are advised because they allow for query optimization and more parallelization.
Columns What How Details Single column data.select('col1')
Multiple columns data.select('col1', 'col2')
Multiple columns, dynamically ?
Rows What How Details No rows data.clear()
i th row data[i]
i th row from end data[-i]
First n rows data.head(n)
Last n rows data.tail(n)
Slice of rows from a to b data[a:b]
or data.slice(a, b)
By list of row numbers data[rows]
Exclude the given row numbers data.with_row_count().filter(pl.col('row_nr').is_in(rows).not_())
Leftover row_nr column Exclude rows that contain null values data.drop_nulls()
Exclude rows that contain null values in certain columns data.drop_nulls(['fruits', 'cars'])
Conditionally on column data.filter(pl.col('age') >= 18)
From multiple column conditions data.filter((pl.col('age') >= 18) & (pl.col('sex') == 'male'))
Limit query to first n rows data.limit(n)
Limit query to last n rows data.limit(-n)
Number of missing values data.null_count()
Number of unique values in a column data['col1'].n_unique()
Number of unique rows over columns ?
Aggregate Grouped What How Details Mean of column data.group_by('sex').agg(pl.col('age').mean())
Over time What How Details Moving average data.group_by_dynamic('ts', every='1d').agg(pl.col('value').mean())
?
Test What How Details Empty (no rows) data.is_empty()
Contains column 'age' in data.columns
Contains columns {'age', 'sex'}.issubset(data.columns)
Contains columns cols set(cols).issubset(data.columns)
Column is missing 'age' not in data.columns
Column series are equal data['sex'].series_equal(data['sex2'].alias('sex'))
Series names must match! Columns are equal (ignoring name) ? Column has missing value data['sex'].is_null().any()
Column has no missing values data['sex'].is_not_null().all()
Column has no duplicate values ? Column has duplicate values ? Column is of dtype data.schema['col1'] == dtype
Column is bool dtype data.schema['alive'] == pl.Bool
Column is string type data.schema['sex'] == pl.Utf8
Column is integer type data.schema['age'] in pl.datatypes.INTEGER_DTYPES
Column is standard integer data.schema['age'] == pl.Int64
Column is consistently pairwise greater than another column (data['col1'] > data['col2']).all()
Column is sometimes pairwise greater than another column (data['col1'] > data['col2']).any()
Row masking What How Details Duplicated rows data.is_duplicated()
Unique rows data.is_unique()
Update What How Details Cast column dtype data.with_columns(pl.col('col1').cast(pl.Float32))
Cast columns to dtypes data.cast({'col1': pl.Float32, 'col2': pl.UInt8})
Rename column data.rename({'old1': 'new1', 'old2': 'new2'})
Update column values data.with_columns(pl.col('age') + 5)
Update column values on condition df.with_columns(
pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
otherwise(pl.lit(-1))
)
Update column values on conditions df.with_columns(
pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
when(pl.col('Sex') == 'M').
then(4).
otherwise(pl.lit(-1))
)
Update column values for specific rows rows = [1, 3, 5]
data.with_row_count().with_columns(
pl.when(pl.col('row_nr').is_in(rows)).
then(pl.lit(True)).
otherwise(pl.lit(False))
)
Fill nulls with zero data.fill_null(strategy = 'zero')
Fill nulls with value data.fill_null(value)
Fill nulls with LOCF data.fill_null(strategy='forward')
Wrong for grouped data Fill NaNs with value data.fill_nan(value)
Replace column inplace data.replace('age', newAgeSeries)
Sort table by column data.sort('col1')
Add New columns What How Details Append constant numeric column data.with_columns(Intercept=pl.lit(1))
Append column from series s = pl.Series("apple", [10, 20, 30])
data.hstack([s])
Note the brackets Append column from series inplace data.hstack(s, in_place = True)
Insert column from series inplace data.insert_at_idx(1, s)
Derive columns What How Details Transform another column data.with_columns(AgeSq = pl.col('Age') ** 2)
Multiple transformations from another column data.with_columns(
Age2 = pl.col('Age') ** 2
Age3 = pl.col('Age') ** 3
)
Boolean mask based on pairwise comparison of another column data.with_columns(CanLive = pl.col('Income') * 3 > pl.col('Rent')
Values conditional on another column data.with_columns(
pl.when(pl.col('age') >= 18).
then(pl.lit(1)).
otherwise(pl.lit(-1))
)
Map another column map = dict(1 = 'a', 2 = 'b', 3 = 'c')
data.with_columns(
NumCat = pl.col('Num').map_dict(map).cast(pl.Categorical)
)
Integer seconds since Unix epoch to datetime data.with_columns(
Date=pl.from_epoch(pl.col('time'))
)
Integer daystamp since Unix epoch to date data.with_columns(
Date=pl.from_epoch(pl.col('time'), time_unit='d')
)
Integer milliseconds timestamp since Unix epoch to datetime data.with_columns(
Date=pl.from_epoch(pl.col('time'), time_unit='ms')
)
Parse string column to date data.with_columns(
Date=pl.col('RawDate').str.to_date()
)
Parse string column to date with known format data.with_columns(
Date = pl.col("RawDate").str.to_date('%Y-%m-%d')
)
Week-day from date column Month from date column data.with_columns(Month = pl.col('Date').dt.month())
Year from date column data.with_columns(Month = pl.col('Date').dt.year())
Group-wise from aggregate value data.with_columns(
DaysSinceStart = pl.col('Date') - pl.col('Date').min().over('Subject').cast(pl.Int) + 1
)
Rows What How Details Add row as tuple ? Add list of tuples ? Add data frame Add data frame inplace data.extend(data2)
Add data frames inplace data.vstack(data2)
data.vstack(dataN)
data.rechunk()
Remove Columns What How Details Remove column data.drop('Age')
Remove column inplace data.drop_in_place('Age')
Returns the dropped column Remove columns data.drop(['Age', 'Sex'])
Remove all numeric columns data.drop(cs.numeric())
Remove columns based on selector data.drop(cs)
Rows Reshape What How Details From wide to long format data.melt(id_vars='sex', value_vars=['a', 'b'])
To narrow format data.explode(?)
?
Merge What How Details Merge two data frames on the sorted key data.merge(data2)
Inner join data.join(data2, on = ['sex', 'country'])
Left join data.join(data2, on = ['sex', 'country'], how = 'left')
Right join data.join(data2, on = ['sex', 'country'], how = 'right')
Outer join data.join(data2, on = ['sex', 'country'], how = 'outer')
Cross join data.join(data2, on = ['sex', 'country'], how = 'cross')
Semi join (one match per index) data.join(data2, on = ['sex', 'country'], how = 'semi')
Anti join (exclude matches from table 2) data.join(data2, on = ['sex', 'country'], how = 'anti')
What How Details Get column (as series) data['col1']
Get column (as list) list(data['col1'])
? Get i th row (as tuple) data.row(i)
Get rows (as list of tuple) data.rows(...)
? First item (cell) data.item(0, 0)
Item (cell) from row i and column index j data.item(i, j)
Item (cell) from row i and column name name data.item(i, name)
Convert What How Details To pandas.DataFrame
data.to_pandas()
To list of series data.get_columns()
Split into list of data frames based on column data.partition_by('sex')
Split into list of data frames based on column tuples data.partition_by('sex', 'country')
Split into dict of data frames based on column(s) data.partition_by('sex', 'country', as_dict = True)
To CSV file data.write_csv('derp.csv')
To Parquet file data.write_parquet('derp.parquet')
To JSON ?
Back to top
Copyright © 2023 Niek Den Teuling, distributed under an MIT license . Please submit new code snippets, ideas, or suggestions by creating an issue or pull request.