datatable¶

A very fast data.frame implementation

Details - .() is a shortcut for list(), alternative is J() - := operator assigns by reference - (variable) to refer to the column names stored in the variable - get(...) evaluates expression to column name - %>% pipe operator (magrittr package) to chain function calls in a more readable manner, e.g. dt[, mean(x), by=ID] %>% .[, mean(x)]. Calling functions without data argument is done by enclosing in a new block: dt %>% {table(.$Var)} %>% as.data.table - %chin% faster version of %in% for character vectors - %between% checks if value between range, e.g. a %between% c(1,9), or a %between% list(c(1,3), c(4,5)). Seems slow to use - %inrange% checks if value in range - %like% regular expression check - mult='first' argument provides a useful shortcut for dropping group columns - Perform merges prior to row subsetting to preserve the keys. Row subsetting followed by a merge drops the obsolete keys, for some reason.

Code

library(data.table)

Create¶

Action	Code	Details
From data.frame	`as.data.table(df)`
Change data.frame to data.table, in-place	`setDT(df)`
From data.frame, storing row names as the first column	`as.data.table(df, keep.rownames=TRUE)`
From list of named vectors	`as.data.table(do.call(rbind, veclist))`
From statistical object	`broom::tidy(lm) %>% as.data.table()`
From statistical object to one-row frame	`broom::glance(lm) %>% as.data.table()`
From running-length encoding (RLE)	`rle(x) %>% unclass() %>% as.data.table()`
From matrix to long format	`melt(data.table(x), measure = colnames(x), id=character())`
From array to long format	`as.data.table(x)`	Ensure that all dims are named
From CSV file	`fread('derp.csv')`

Properties¶

Action	Code	Details
Column names	`names(dt)`
Number of columns	`length(dt)`
Number of rows	`nrow(dt)`
Get key(s)	`key(dt)`

Test¶

Action	Code	Details
Empty (no rows)	`nrow(dt) == 0`

Update¶

Modify the data table in-place (i.e., by reference)

Action	Code	Details
Set key(s) by symbols	`setkey(dt, Age, Sex)`
Set key(s) by names	`setkeyv(dt, c('Age', 'Sex'))`
Assign attribute	`setattr(dt, 'myAttr', value)`
Remove attribute	`setattr(dt, 'myAttr', NULL)`

Map¶

Update while preservering the size of the data table

See the snippets in the Grow section for replacing complete column(s).

Action	Code	Details
Update some rows with constant value, conditional on another column	`dt[is.na(x), y:=1]`
Update rows with values conditional on group (key)	`dtg = data.table(…, x=2, key=Group) dt[dtg, x := i.x]`
Replace factor column NA values by a special level	`dt[is.na(f), f := 'N/A']`

Order¶

Reorder the rows of the data table

Sort¶

Sorting can be handled automatically by setting a column as one of the keys through setkey().

Action	Code	Details
Sort by column symbols	`setorder(dt, Age, Sex)`
Sort by column names	`setorderv(dt, c('Age', 'Sex'))`

Grow¶

Increase the size of the data table

Action	Code	Details
Add column with constant value value	`dt[, x := value]`
Add column with new vector of values (per row)	`dt[, x := values]`	values must be of same length as number of rows of the data table
Add columns with constant values	`dt[, ':='(x=1, y=2)]`
Add variable columns with constant values	`dt[, c('x', 'y') := .(1, 2)]`
Add columns from matrix columns	`mat = matrix(1:8, ncol=2) dt[, c('a' ,'b') := split(mat, col(mat))]`
Add columns from matrix rows	`mat = matrix(1:8, nrow=2) dt[, c('a' ,'b') := split(mat, row(mat))]`
Add columns from another table for matching keys	`dt2 = data.table(…) dt[dt2, (vars) := mget(vars)]`

Add column derived from another column¶

Action	Code	Details
Add column with aggregate value by group	`dt[, x := mean(y), by=ID]`
Right-aligned rolling mean of size w (partial windows)	`dt[, RMean := Reduce('+', shift(x, 1:w-1, fill=0)) / pmin(w, .I)]`
Right-aligned rolling sum of size w	`dt[, RSum := Reduce('+', shift(x, 0:(w-1), fill=0)), by=ID]`	Easily 50 times faster than zoo::rollapply
Right-aligned rolling mean with fixed window w (zero-filled)	`dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / w, by=ID]`
Right-aligned rolling mean (partial windows)	`dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / pmin(w, 1:.N), by=ID]`
Right-aligned rolling mean (ignoring NAs)	`dt[, RMean := Reduce('+', shift(ifelse(is.na(x), 0, x), 0:(W-1), fill=0)) / Reduce('+', shift(is.finite(x), 0:(W-1), fill=0)), by=ID]`
Centered rolling mean (fixed window of size $W$ (inserting zeroes))	`dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / W, by=ID]`
Centered rolling mean (partial windows)	`dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / pmin(W, c(1:ceiling(.N/2), floor(.N/2):1)), by=ID]`
Centered rolling mean (ignoring NAs)	`dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / Reduce('+', shift(is.finite(x), -floor(W/2):floor(W/2), fill=0)), by=ID]`
Mark consecutive matching rows	`dt[, CMatch := { r = rle(MATCH) r$values = replace(r$values, r$lengths < GAP_SIZE & r$values, FALSE) inverse.rle(r) }, by=ID]`	Needs a better description
Count column for consecutive matches	`dt[, MatchNr := { function(x) cumsum(x) + cummin(c(0, diff(x)) * cumsum(x)) }(MATCH), by=ID]`	Needs a better description

Shrink¶

Reduce the size of the data table

Action	Code	Details
Remove column	`dt[, x := NULL]`
Remove columns	`dt[, ':='(x=NULL, y=NULL)]`
Remove variable columns	`dt[, c('x', 'y') := NULL]`

Join¶

Join with another data frame

Action	Code	Details
Left join	`cols = c('x', 'y') dt2[dt1, (cols) := mget(paste0('i.', cols))]`

Query¶

Scalar¶

Extract a scalar result.

Action	Code	Details
Number of duplicated rows	`sum(duplicated(dt))`

Vector¶

Extract a vector result.

Action	Code	Details
Find row indices of key value	`dt['a1', which=TRUE]`

Extract column¶

Extract a result per row (vector of length nrow(data))

Action	Code	Details
Get column values as vector	`dt$Age`
Get column values by variable name	`dt[[col]]`

Extract row¶

Extract a scalar result per column

Derive¶

Derive a new data frame based on the given one

Map¶

Derive a new data frame of the same shape

Action	Code	Details
Replace NAs with zero	`dt[is.na(dt)] = 0`
Replace NAs with value v	`dt[is.na(dt)] = v`
Replace missing values using LOCF	`locf = function(x) x[cummax(c(TRUE, tail(!is.na(x) * seq_along(x), -1)))] dt[, x := locf(x), by=ID]`
Replace missing values using NOCB	`locf = function(x) x[cummax(c(TRUE, tail(!is.na(x) * seq_along(x), -1)))] nocb = function(x) rev(locf(rev(x)) dt[, x := nocb(x), by=ID]`

Shrink¶

Derive a subset of the data frame

Action	Code	Details
Select a single column	`dt[, .(Age)]`
Select a single column by variable name	`dt[, list(col)]`
Select columns	`dt[, .(Age, Sex)]`
Select all columns except	`dt[, -c('Age', 'Sex')]`
Select columns and apply function	`dt[, lapply(.SD, mean), .SDcols=c('Age', 'Sex')]`
Select rows by row number	`dt[1:10]`
Select rows with key value	`dt[dtquery[query, .(ID)]]`
Select rows with key value, dropping factor levels	`dt[c('a1', 'a2')]`	Discards factor level order!
Select first matching row for each specified key	`dt[c('a1', 'a2'), mult='first']`
Select rows indexed by multiple keys	`dt[.('a1', 'b1')]`	Notice the dot
Select rows of keys specified in a table	`dtquery = data.table(..., by=keys) dt[dtquery]`
Select rows of groups, excluding non-existent queried rows	`dtquery = data.table(..., by=keys) dt[dtquery, nomatch=0]`
Select rows with values specified in table	`dtquery = data.table(...) merge(dt, dtquery, by=COLUMNS)`
Unique rows	`unique(dt)`
Unique rows by columns	`unique(dt, by = keys(dt))`
Duplicated rows	`dt[duplicated(dt),]`
Sample n rows	`dt[sample(.N, n),]`
Sample n rows by key groups	`dtquery = data.table(sample(levels(dt$Id), n)) dt[dtquery]`
First n rows per group	`dt[dt[, .I[1:10], by=ID]]`	Not the best way

Aggregation¶

Aggregate rows by group

Action	Code	Details
By group	`dt[,, by = ID]`	Typically used for categorical columns, e.g., string, factor, date, or int
By group, from column name	`dt[,, by = 'ID']`
By variable column name	`dt[,, by=(var)]`
By fixed-sized bins from a numeric column	`dt[ , , by=findInterval(NumericColumn, seq(1, 100, by=7))]`
By groups (columns)	`dt[,, by = .(IDa, IDb)]`
By string column names	`dt[,, by = c('IDa', 'IDb')]`
By range of keys	`dt[,, by = IDa:IDc]`

Grow¶

Action	Code	Details
Insert missing rows based on missing combination of keys	`dt[CJ(unique(IDa), unique(IDb))]`	Uses a cross-join on all levels of the given two columns IDa and IDb
Repeat each row n times	`dt[rep(1:.N, n)]`
Repeat rows by group	`dt[, lapply(.SD, rep, 10), by=ID]`
Repeat data.table along a sequence as new index	`s = LETTERS[1:5] dt2 = replicate(length(s), dt, simplify=FALSE) %>% setNames(s) %>% rbindlist(idcol=TRUE)`

Reshape¶

Action	Code	Details
To wide format	`dcast(dt, IDa + IDb ~ Param, value.var = 'Value')`	`...` represents all variables, and `.` represents no variable
To wide format with value column prefix	`dcast(dt, IDa + IDb ~ paste0('prefix', Param), value.var='Value')`
To wide format with aggregation	`dcast(dt, IDa + IDb ~ Param, value.var='Value', fun.aggregate=list(first, last), fill=NA)`
To wide format on a single grouping factor	`dcast(dt, ... ~ Group, value.var = 'Value')`
To wide format without an index	`unstack(dt, Value ~ Param) %>% as.data.table()`
To long format	`melt(dt, id=c('IDa', 'IDb'), measure=c('Param1', 'Param2'))`
To long format for specific columns (no index)	`melt(dt, id=character(), measure=c('Param1', 'Param2'))`
Transpose	`t(dt) %>% as.data.table()`	Drops column names!

Join¶

Join two or more data frames

Action	Code	Details
Inner join	`merge(dt1, dt2)`
Inner join	`dt1[dt2, nomatch=0]`
Inner join with mismatching keys	`merge(dt1, dt2, by.x=c('a1', 'b1'), by.y=c('b1', 'b2'))`
Outer join	`merge(dt1, dt2, all=TRUE)`
Outer join (low memory)	`dt12 = d[dt2, nomatch=0] rbind(dt1[!dt12], dt12, dt2[!dt12])`
Left join	`merge(dt1, dt2, all.x = TRUE)`
Left join with identical column names	`dt12 = dt2[dt1, .(x1=x, x2=i.x)]`
Right join	`merge(dt1, dt2, all.y = TRUE)`
Anti join	`dt1[!dt2]`
Union	`funion(dt, dt2)`
Intersection	`fintersect(dt, dt2)`
Set difference	`fsetdiff(dt, dt2)`
Set equal	`fsetequal(dt, dt2)`
Combine two tables, repeating rows for all unique pairs (`expand.grid`)	`data.table( dt1[rep(1:.N, nrow(dt2))], dt2[rep(1:.N, each=nrow(dt1))] )`

Convert¶

Action	Code	Details
Data.frame	`as.data.frame(dt)`
List of data.frames, by group	`split(dt, by='Id')`
Matrix	`data.matrix(dt)`