Skip to content

datatable

A very fast data.frame implementation

Details - .() is a shortcut for list(), alternative is J() - := operator assigns by reference - (variable) to refer to the column names stored in the variable - get(...) evaluates expression to column name - %>% pipe operator (magrittr package) to chain function calls in a more readable manner, e.g. dt[, mean(x), by=ID] %>% .[, mean(x)]. Calling functions without data argument is done by enclosing in a new block: dt %>% {table(.$Var)} %>% as.data.table - %chin% faster version of %in% for character vectors - %between% checks if value between range, e.g. a %between% c(1,9), or a %between% list(c(1,3), c(4,5)). Seems slow to use - %inrange% checks if value in range - %like% regular expression check - mult='first' argument provides a useful shortcut for dropping group columns - Perform merges prior to row subsetting to preserve the keys. Row subsetting followed by a merge drops the obsolete keys, for some reason.

Code

library(data.table)

Create

Action Code Details
From data.frame
as.data.table(df)
Change data.frame to data.table, in-place
setDT(df)
From data.frame, storing row names as the first column
as.data.table(df, keep.rownames=TRUE)
From list of named vectors
as.data.table(do.call(rbind, veclist))
From statistical object
broom::tidy(lm) %>% as.data.table()
From statistical object to one-row frame
broom::glance(lm) %>% as.data.table()
From running-length encoding (RLE)
rle(x) %>% unclass() %>% as.data.table()
From matrix to long format
melt(data.table(x), measure = colnames(x), id=character())
From array to long format
as.data.table(x)
Ensure that all dims are named
From CSV file
fread('derp.csv')

Properties

Action Code Details
Column names
names(dt)
Number of columns
length(dt)
Number of rows
nrow(dt)
Get key(s)
key(dt)

Test

Action Code Details
Empty (no rows)
nrow(dt) == 0

Update

Modify the data table in-place (i.e., by reference)

Action Code Details
Set key(s) by symbols
setkey(dt, Age, Sex)
Set key(s) by names
setkeyv(dt, c('Age', 'Sex'))
Assign attribute
setattr(dt, 'myAttr', value)
Remove attribute
setattr(dt, 'myAttr', NULL)

Map

Update while preservering the size of the data table

See the snippets in the Grow section for replacing complete column(s).

Action Code Details
Update some rows with constant value, conditional on another column
dt[is.na(x), y:=1]
Update rows with values conditional on group (key)
dtg = data.table(…, x=2, key=Group)
dt[dtg, x := i.x]
Replace factor column NA values by a special level
dt[is.na(f), f := 'N/A']

Order

Reorder the rows of the data table

Sort

Sorting can be handled automatically by setting a column as one of the keys through setkey().

Action Code Details
Sort by column symbols
setorder(dt, Age, Sex)
Sort by column names
setorderv(dt, c('Age', 'Sex'))

Grow

Increase the size of the data table

Action Code Details
Add column with constant value value
dt[, x := value]
Add column with new vector of values (per row)
dt[, x := values]
values must be of same length as number of rows of the data table
Add columns with constant values
dt[, ':='(x=1, y=2)]
Add variable columns with constant values
dt[, c('x', 'y') := .(1, 2)]
Add columns from matrix columns
mat = matrix(1:8, ncol=2)
dt[, c('a' ,'b') := split(mat, col(mat))]
Add columns from matrix rows
mat = matrix(1:8, nrow=2)
dt[, c('a' ,'b') := split(mat, row(mat))]
Add columns from another table for matching keys
dt2 = data.table(…)
dt[dt2, (vars) := mget(vars)]
Add column derived from another column
Action Code Details
Add column with aggregate value by group
dt[, x := mean(y), by=ID]
Right-aligned rolling mean of size w (partial windows)
dt[, RMean := Reduce('+', shift(x, 1:w-1, fill=0)) / pmin(w, .I)]
Right-aligned rolling sum of size w
dt[, RSum := Reduce('+', shift(x, 0:(w-1), fill=0)), by=ID]
Easily 50 times faster than zoo::rollapply
Right-aligned rolling mean with fixed window w (zero-filled)
dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / w, by=ID]
Right-aligned rolling mean (partial windows)
dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / pmin(w, 1:.N), by=ID]
Right-aligned rolling mean (ignoring NAs)
dt[, RMean := Reduce('+', shift(ifelse(is.na(x), 0, x), 0:(W-1), fill=0)) /
    Reduce('+', shift(is.finite(x), 0:(W-1), fill=0)), by=ID]
Centered rolling mean (fixed window of size $W$ (inserting zeroes))
dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / W, by=ID]
Centered rolling mean (partial windows)
dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) /
    pmin(W, c(1:ceiling(.N/2), floor(.N/2):1)), by=ID]
Centered rolling mean (ignoring NAs)
dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) /
    Reduce('+', shift(is.finite(x), -floor(W/2):floor(W/2), fill=0)), by=ID]
Mark consecutive matching rows
dt[, CMatch := {
    r = rle(MATCH)
    r$values = replace(r$values, r$lengths < GAP_SIZE & r$values, FALSE)
    inverse.rle(r)
}, by=ID]
Needs a better description
Count column for consecutive matches
dt[, MatchNr := {
    function(x) cumsum(x) + cummin(c(0, diff(x)) * cumsum(x))
}(MATCH), by=ID]
Needs a better description

Shrink

Reduce the size of the data table

Action Code Details
Remove column
dt[, x := NULL]
Remove columns
dt[, ':='(x=NULL, y=NULL)]
Remove variable columns
dt[, c('x', 'y') := NULL]

Join

Join with another data frame

Action Code Details
Left join
cols = c('x', 'y')
dt2[dt1, (cols) := mget(paste0('i.', cols))]

Query

Scalar

Extract a scalar result.

Action Code Details
Number of duplicated rows
sum(duplicated(dt))

Vector

Extract a vector result.

Action Code Details
Find row indices of key value
dt['a1', which=TRUE]

Extract column

Extract a result per row (vector of length nrow(data))

Action Code Details
Get column values as vector
dt$Age
Get column values by variable name
dt[[col]]

Extract row

Extract a scalar result per column

Derive

Derive a new data frame based on the given one

Map

Derive a new data frame of the same shape

Action Code Details
Replace NAs with zero
dt[is.na(dt)] = 0
Replace NAs with value v
dt[is.na(dt)] = v
Replace missing values using LOCF
locf = function(x) x[cummax(c(TRUE, tail(!is.na(x) * seq_along(x), -1)))]
dt[, x := locf(x), by=ID]
Replace missing values using NOCB
locf = function(x) x[cummax(c(TRUE, tail(!is.na(x) * seq_along(x), -1)))]
nocb = function(x) rev(locf(rev(x))
dt[, x := nocb(x), by=ID]

Shrink

Derive a subset of the data frame

Action Code Details
Select a single column
dt[, .(Age)]
Select a single column by variable name
dt[, list(col)]
Select columns
dt[, .(Age, Sex)]
Select all columns except
dt[, -c('Age', 'Sex')]
Select columns and apply function
dt[, lapply(.SD, mean), .SDcols=c('Age', 'Sex')]
Select rows by row number
dt[1:10]
Select rows with key value
dt[dtquery[query, .(ID)]]
Select rows with key value, dropping factor levels
dt[c('a1', 'a2')]
Discards factor level order!
Select first matching row for each specified key
dt[c('a1', 'a2'), mult='first']
Select rows indexed by multiple keys
dt[.('a1', 'b1')]
Notice the dot
Select rows of keys specified in a table
dtquery = data.table(..., by=keys)
dt[dtquery]
Select rows of groups, excluding non-existent queried rows
dtquery = data.table(..., by=keys)
dt[dtquery, nomatch=0]
Select rows with values specified in table
dtquery = data.table(...)
merge(dt, dtquery, by=COLUMNS)
Unique rows
unique(dt)
Unique rows by columns
unique(dt, by = keys(dt))
Duplicated rows
dt[duplicated(dt),]
Sample n rows
dt[sample(.N, n),]
Sample n rows by key groups
dtquery = data.table(sample(levels(dt$Id), n))
dt[dtquery]
First n rows per group
dt[dt[, .I[1:10], by=ID]]
Not the best way

Aggregation

Aggregate rows by group

Action Code Details
By group
dt[,, by = ID]
Typically used for categorical columns, e.g., string, factor, date, or int
By group, from column name
dt[,, by = 'ID']
By variable column name
dt[,, by=(var)]
By fixed-sized bins from a numeric column
dt[ , , by=findInterval(NumericColumn, seq(1, 100, by=7))]
By groups (columns)
dt[,, by = .(IDa, IDb)]
By string column names
dt[,, by = c('IDa', 'IDb')]
By range of keys
dt[,, by = IDa:IDc]

Grow

Action Code Details
Insert missing rows based on missing combination of keys
dt[CJ(unique(IDa), unique(IDb))]
Uses a cross-join on all levels of the given two columns IDa and IDb
Repeat each row n times
dt[rep(1:.N, n)]
Repeat rows by group
dt[, lapply(.SD, rep, 10), by=ID]
Repeat data.table along a sequence as new index
s = LETTERS[1:5]
dt2 = replicate(length(s), dt, simplify=FALSE) %>%
    setNames(s) %>%
    rbindlist(idcol=TRUE)

Reshape

Action Code Details
To wide format
dcast(dt, IDa + IDb ~ Param, value.var = 'Value')
... represents all variables, and . represents no variable
To wide format with value column prefix
dcast(dt, IDa + IDb ~ paste0('prefix', Param), value.var='Value')
To wide format with aggregation
dcast(dt, IDa + IDb ~ Param, value.var='Value',
    fun.aggregate=list(first, last), fill=NA)
To wide format on a single grouping factor
dcast(dt, ... ~ Group, value.var = 'Value')
To wide format without an index
unstack(dt, Value ~ Param) %>% as.data.table()
To long format
melt(dt, id=c('IDa', 'IDb'), measure=c('Param1', 'Param2'))
To long format for specific columns (no index)
melt(dt, id=character(), measure=c('Param1', 'Param2'))
Transpose
t(dt) %>% as.data.table()
Drops column names!

Join

Join two or more data frames

Action Code Details
Inner join
merge(dt1, dt2)
Inner join
dt1[dt2, nomatch=0]
Inner join with mismatching keys
merge(dt1, dt2, by.x=c('a1', 'b1'), by.y=c('b1', 'b2'))
Outer join
merge(dt1, dt2, all=TRUE)
Outer join (low memory)
dt12 = d[dt2, nomatch=0]
rbind(dt1[!dt12], dt12, dt2[!dt12])
Left join
merge(dt1, dt2, all.x = TRUE)
Left join with identical column names
dt12 = dt2[dt1, .(x1=x, x2=i.x)]
Right join
merge(dt1, dt2, all.y = TRUE)
Anti join
dt1[!dt2]
Union
funion(dt, dt2)
Intersection
fintersect(dt, dt2)
Set difference
fsetdiff(dt, dt2)
Set equal
fsetequal(dt, dt2)
Combine two tables, repeating rows for all unique pairs (expand.grid)
data.table(
    dt1[rep(1:.N, nrow(dt2))],
    dt2[rep(1:.N, each=nrow(dt1))]
)

Convert

Action Code Details
Data.frame
as.data.frame(dt)
List of data.frames, by group
split(dt, by='Id')
Matrix
data.matrix(dt)