datatable¶
A very fast data.frame implementation
Details
- .() is a shortcut for list(), alternative is J()
- := operator assigns by reference
- (variable) to refer to the column names stored in the variable
- get(...) evaluates expression to column name
- %>% pipe operator (magrittr package) to chain function calls in a more readable manner, e.g. dt[, mean(x), by=ID] %>% .[, mean(x)]. Calling functions without data argument is done by enclosing in a new block: dt %>% {table(.$Var)} %>% as.data.table
- %chin% faster version of %in% for character vectors
- %between% checks if value between range, e.g. a %between% c(1,9), or a %between% list(c(1,3), c(4,5)). Seems slow to use
- %inrange% checks if value in range
- %like% regular expression check
- mult='first' argument provides a useful shortcut for dropping group columns
- Perform merges prior to row subsetting to preserve the keys. Row subsetting followed by a merge drops the obsolete keys, for some reason.
Code
library(data.table)
Create¶
| Action | Code | Details |
|---|---|---|
| From data.frame |
|
|
| Change data.frame to data.table, in-place |
|
|
| From data.frame, storing row names as the first column |
|
|
| From list of named vectors |
|
|
| From statistical object |
|
|
| From statistical object to one-row frame |
|
|
| From running-length encoding (RLE) |
|
|
| From matrix to long format |
|
|
| From array to long format |
|
Ensure that all dims are named |
| From CSV file |
|
Properties¶
| Action | Code | Details |
|---|---|---|
| Column names |
|
|
| Number of columns |
|
|
| Number of rows |
|
|
| Get key(s) |
|
Test¶
| Action | Code | Details |
|---|---|---|
| Empty (no rows) |
|
Update¶
Modify the data table in-place (i.e., by reference)
| Action | Code | Details |
|---|---|---|
| Set key(s) by symbols |
|
|
| Set key(s) by names |
|
|
| Assign attribute |
|
|
| Remove attribute |
|
Map¶
Update while preservering the size of the data table
See the snippets in the Grow section for replacing complete column(s).
| Action | Code | Details |
|---|---|---|
| Update some rows with constant value, conditional on another column |
|
|
| Update rows with values conditional on group (key) |
|
|
| Replace factor column NA values by a special level |
|
Order¶
Reorder the rows of the data table
Sort¶
Sorting can be handled automatically by setting a column as one of the keys through setkey().
| Action | Code | Details |
|---|---|---|
| Sort by column symbols |
|
|
| Sort by column names |
|
Grow¶
Increase the size of the data table
| Action | Code | Details |
|---|---|---|
| Add column with constant value value |
|
|
| Add column with new vector of values (per row) |
|
values must be of same length as number of rows of the data table |
| Add columns with constant values |
|
|
| Add variable columns with constant values |
|
|
| Add columns from matrix columns |
|
|
| Add columns from matrix rows |
|
|
| Add columns from another table for matching keys |
|
Add column derived from another column¶
| Action | Code | Details |
|---|---|---|
| Add column with aggregate value by group |
|
|
| Right-aligned rolling mean of size w (partial windows) |
|
|
| Right-aligned rolling sum of size w |
|
Easily 50 times faster than zoo::rollapply |
| Right-aligned rolling mean with fixed window w (zero-filled) |
|
|
| Right-aligned rolling mean (partial windows) |
|
|
| Right-aligned rolling mean (ignoring NAs) |
|
|
| Centered rolling mean (fixed window of size $W$ (inserting zeroes)) |
|
|
| Centered rolling mean (partial windows) |
|
|
| Centered rolling mean (ignoring NAs) |
|
|
| Mark consecutive matching rows |
|
Needs a better description |
| Count column for consecutive matches |
|
Needs a better description |
Shrink¶
Reduce the size of the data table
| Action | Code | Details |
|---|---|---|
| Remove column |
|
|
| Remove columns |
|
|
| Remove variable columns |
|
Join¶
Join with another data frame
| Action | Code | Details |
|---|---|---|
| Left join |
|
Query¶
Scalar¶
Extract a scalar result.
| Action | Code | Details |
|---|---|---|
| Number of duplicated rows |
|
Vector¶
Extract a vector result.
| Action | Code | Details |
|---|---|---|
| Find row indices of key value |
|
Extract column¶
Extract a result per row (vector of length nrow(data))
| Action | Code | Details |
|---|---|---|
| Get column values as vector |
|
|
| Get column values by variable name |
|
Extract row¶
Extract a scalar result per column
Derive¶
Derive a new data frame based on the given one
Map¶
Derive a new data frame of the same shape
| Action | Code | Details |
|---|---|---|
| Replace NAs with zero |
|
|
| Replace NAs with value v |
|
|
| Replace missing values using LOCF |
|
|
| Replace missing values using NOCB |
|
Shrink¶
Derive a subset of the data frame
| Action | Code | Details |
|---|---|---|
| Select a single column |
|
|
| Select a single column by variable name |
|
|
| Select columns |
|
|
| Select all columns except |
|
|
| Select columns and apply function |
|
|
| Select rows by row number |
|
|
| Select rows with key value |
|
|
| Select rows with key value, dropping factor levels |
|
Discards factor level order! |
| Select first matching row for each specified key |
|
|
| Select rows indexed by multiple keys |
|
Notice the dot |
| Select rows of keys specified in a table |
|
|
| Select rows of groups, excluding non-existent queried rows |
|
|
| Select rows with values specified in table |
|
|
| Unique rows |
|
|
| Unique rows by columns |
|
|
| Duplicated rows |
|
|
| Sample n rows |
|
|
| Sample n rows by key groups |
|
|
| First n rows per group |
|
Not the best way |
Aggregation¶
Aggregate rows by group
| Action | Code | Details |
|---|---|---|
| By group |
|
Typically used for categorical columns, e.g., string, factor, date, or int |
| By group, from column name |
|
|
| By variable column name |
|
|
| By fixed-sized bins from a numeric column |
|
|
| By groups (columns) |
|
|
| By string column names |
|
|
| By range of keys |
|
Grow¶
| Action | Code | Details |
|---|---|---|
| Insert missing rows based on missing combination of keys |
|
Uses a cross-join on all levels of the given two columns IDa and IDb |
| Repeat each row n times |
|
|
| Repeat rows by group |
|
|
| Repeat data.table along a sequence as new index |
|
Reshape¶
| Action | Code | Details |
|---|---|---|
| To wide format |
|
... represents all variables, and . represents no variable |
| To wide format with value column prefix |
|
|
| To wide format with aggregation |
|
|
| To wide format on a single grouping factor |
|
|
| To wide format without an index |
|
|
| To long format |
|
|
| To long format for specific columns (no index) |
|
|
| Transpose |
|
Drops column names! |
Join¶
Join two or more data frames
| Action | Code | Details |
|---|---|---|
| Inner join |
|
|
| Inner join |
|
|
| Inner join with mismatching keys |
|
|
| Outer join |
|
|
| Outer join (low memory) |
|
|
| Left join |
|
|
| Left join with identical column names |
|
|
| Right join |
|
|
| Anti join |
|
|
| Union |
|
|
| Intersection |
|
|
| Set difference |
|
|
| Set equal |
|
|
Combine two tables, repeating rows for all unique pairs (expand.grid)
|
|
Convert¶
| Action | Code | Details |
|---|---|---|
| Data.frame |
|
|
| List of data.frames, by group |
|
|
| Matrix |
|