data.table
- Create
- Properties
- Sort
- Query
- Aggregate
- Row masking
- Test
- Update
- Append rows
- Update (byref)
- Derive columns
- Transform
- Convert
- Syntax notes
Create
What | How | Details |
---|---|---|
From data.frame | as.data.table(df) | |
From data.frame byref | setDT(df) | |
From data.frame , storing row names as the first column | as.data.table(df, keep.rownames=TRUE) | |
From list of named vectors | as.data.table(do.call(rbind, veclist)) | |
From statistical object | broom::tidy(lm) %>% as.data.table() | |
From statistical object to one-row | broom::glance(lm) %>% as.data.table() | |
From RLE | rle(x) %>% unclass() %>% as.data.table() | |
From matrix to long format | melt(data.table(x), measure = colnames(x), id=character()) | |
From array to long format | as.data.table(x) | Ensure that all dims are named |
From CSV | fread('derp.csv') |
Properties
What | How | Details |
---|---|---|
Column names | names(dt) | |
Number of columns | length(dt) | |
Number of rows | nrow(dt) | |
Get key(s) | key(dt) | |
Set key(s) by symbols byref | setkey(dt, Age, Sex) | |
Set key(s) by names byref | setkeyv(dt, c('Age', 'Sex')) | |
Assign attribute byref | setattr(dt, 'myAttr', value) | |
Remove attribute byref | setattr(dt, 'myAttr', NULL) | |
Number of duplicated rows | sum(duplicated(dt)) |
Sort
Sorting can be handled automatically by setting a column as one of the keys through setkey()
.
What | How | Details |
---|---|---|
Sort by column symbols byref | setorder(dt, Age, Sex) | |
Sort by column names byref | setorderv(dt, c('Age', 'Sex')) |
Query
What | How | Details |
---|---|---|
Select column | dt$Age | |
Select column | dt[, .(Age)] | |
Select column by variable name | dt[[x]] | |
Select columns | dt[, .(Age, Sex)] | |
Select all columns except | dt[, -c('Age', 'Sex')] | |
Select columns and apply function | dt[, lapply(.SD, mean), .SDcols=c('Age', 'Sex')] | |
Select rows by row number | dt[1:10] | |
Select rows with key value | dt[dtquery[query, .(ID)]] | |
Select rows with key value, dropping factor levels | dt[c('a1', 'a2')] | Discards factor level order! |
Select first/last row with key values | dt[c('a1', 'a2'), mult='first'] | |
Select rows with value over multiple keys | dt[.('a1', 'b1')] | Notice the dot |
Select rows of groups specified in table | dtquery = data.table(..., by=keys); dt[dtquery] | |
Select rows of groups, excluding non-existent queried rows | dtquery = data.table(..., by=keys); dt[dtquery, nomatch=0] | |
Select rows with values specified in table | dtquery = data.table(...) merge(dt, dtquery, by=COLUMNS) | |
Unique rows | unique(dt) | |
Unique rows by columns | unique(dt, by = keys(dt)) | |
Duplicated rows | dt[duplicated(dt),] | |
Sample rows | dt[sample(.N, 10),] | |
Sample by key value | dtquery = data.table(sample(levels(dt$Id), 5)); dt[dtquery] | |
Find row index of key value | dt['a1', which=TRUE] | |
Repeat rows | dt[rep(1:.N, 10)] |
Aggregate
Grouped
What | How | Details |
---|---|---|
By key | dt[,, by = ID] | |
By string key | dt[,, by = 'ID'] | |
By variable key | dt[,, by=(var)] | |
By keys | dt[,, by = .(IDa, IDb)] | |
By string keys | dt[,, by = c('IDa', 'IDb')] | |
By range of keys | dt[,, by = IDa:IDc] | |
By fixed-sized bins | dt[ , , by=findInterval(X, seq(1, 100, by=7))] | |
By key using its value | dt[, {id = first(ID); dt2[id]}, by=ID] | |
By key using its factor value | dt[, {id = levels(ID)[.GRP]; dt2[id]}, by=ID] | |
Select columns by key | dt[, .(x, y), by=ID] | |
Select rows by key | dt[dt[, .I[1:10], by=ID]] |
Rolling
What | How | Details |
---|---|---|
Right-aligned rolling sum of size w | dt[, RMean := Reduce('+', shift(x, 1:w-1, fill=0)) / pmin(w, .I)] |
Grouped rolling
What | How | Details |
---|---|---|
Right-aligned rolling sum of size w | dt[, RSum := Reduce('+', shift(x, 0:(w-1), fill=0)), by=ID] | Easily 50 times faster than zoo::rollapply |
Right-aligned rolling mean with fixed window w (zero-filled) | dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / w, by=ID] | |
Right-aligned rolling mean (partial windows) | dt[, RMean := Reduce('+', shift(x, 0:(w-1), fill=0)) / pmin(w, 1:.N), by=ID] | |
Right-aligned rolling mean (ignoring NAs) | dt[, RMean := Reduce('+', shift(ifelse(is.na(x), 0, x), 0:(W-1), fill=0)) / Reduce('+', shift(is.finite(x), 0:(W-1), fill=0)), by=ID] | |
Centered rolling mean (fixed window of size $W$ (inserting zeroes)) | dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / W, by=ID] | |
Centered rolling mean (partial windows) | dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / pmin(W, c(1:ceiling(.N/2), floor(.N/2):1)), by=ID] | |
Centered rolling mean (ignoring NAs) | dt[, CMean := Reduce('+', shift(x, -floor(W/2):floor(W/2), fill=0)) / Reduce('+', shift(is.finite(x), -floor(W/2):floor(W/2), fill=0)), by=ID] | |
Mark consecutive matching rows | dt[, CMatch := {r = rle(MATCH) r$values = replace(r$values, r$lengths < GAP_SIZE & r$values, FALSE) inverse.rle(r) }, by=ID] | |
Count column for consecutive matches | dt[, MatchNr := {function(x) cumsum(x) + cummin(c(0, diff(x)) * cumsum(x))}(MATCH), by=ID] | |
Impute NA values | dt[, X := Hmisc::impute(X, mean), by=ID] |
Row masking
Test
Update
What | How | Details |
---|---|---|
Fill NAs by zero | dt[is.na(dt)] = 0 |
Append rows
What | How | Details |
---|---|---|
Insert missing rows based on missing combination of keys | dt[CJ(unique(IDa), unique(IDb))] | |
Replace missing values using LOCF | locf = function(x) x[cummax(c(TRUE, tail(is.finite(x) * seq_along(x), -1)))] dt[, x := locf(x), by=ID] | |
Replace missing values using NOCB | locf = function(x) x[cummax(c(TRUE, tail(is.finite(x) * seq_along(x), -1)))] nocb = function(x) rev(locf(rev(x)) dt[, x := nocb(x), by=ID] |
Update (byref)
What | How | Details |
---|---|---|
Assign/update column byref | dt[, x:=1] | |
Update columns byref | dt[, ':='(x=1, y=2)] | |
Update columns dynamically byref | dt[, c('x', 'y') := .(1, 2)] | |
Update columns from matrix columns byref | mat = matrix(1:8, ncol=2) dt[, c('a' ,'b') := split(mat, col(mat))] | |
Update columns from matrix rows byref | mat = matrix(1:8, nrow=2) dt[, c('a' ,'b') := split(mat, row(mat))] | |
Update columns from another table byref | dt2 = data.table(…) dt[dt2, (vars) := mget(vars)] | |
Update column per key | dt[, x:=mean(y), by=ID] | |
Update subset | dt[is.na(x), y:=1] | |
Update value per key | dtg = data.table(…, x=2, key=Group) dt[dtg, x := i.x] | |
Update subset per key | GroupValue = data.table(..., key=Group) dt[, x:=GroupValue[.BY], by=ID] | Slow, can be done faster |
Replace factor NA by new level | dt[is.na(f), f := 'N/A'] | |
Remove column | dt[, x := NULL] | |
Remove columns | dt[, ':='(x=NULL, y=NULL)] | |
Remove columns dynamically | dt[, c('x', 'y') := NULL] |
Derive columns
Transform
What | How | Details |
---|---|---|
Inner join | merge(dt1, dt2) or dt1[dt2, nomatch=0] | |
Inner join with mismatching keys | merge(dt1, dt2, by.x=c('a1', 'b1'), by.y=c('b1', 'b2')) | |
Outer join | merge(dt1, dt2, all=TRUE) | |
Outer join (mem efficient) | dt12 = d[dt2, nomatch=0] rbind(dt1[!dt12], dt12, dt2[!dt12]) | |
Left join | merge(dt1, dt2, all.x = TRUE) | |
Left join with identical column names | dt12 = dt2[dt1, .(x1=x, x2=i.x)] | |
Left join byref | cols = c('x', 'y') dt2[dt1, (cols) := mget(paste0('i.', cols))] | |
Right join | merge(dt1, dt2, all.y = TRUE) | |
Anti join | dt1[!dt2] | |
Union | funion(dt, dt2) | |
Intersection | fintersect(dt, dt2) | |
Set difference | fsetdiff(dt, dt2) | |
Set equal | fsetequal(dt, dt2) | |
Combine two tables, repeating rows for all unique pairs (expand.grid ) | data.table(dt1[rep(1:.N, nrow(dt2))], dt2[rep(1:.N, each=nrow(dt1))]) | |
To wide format | dcast(dt, IDa + IDb ~ Param, value.var = 'Value') | ... represents all variables, and . represents no variable |
To wide format with aggregation | dcast(dt, IDa + IDb ~ Param, value.var='Value', fun.aggregate=list(first, last), fill=NA) | |
To wide format on a single grouping factor | dcast(dt, ... ~ Group, value.var = "Value") | |
To wide format with value column prefix | dcast(dt, IDa + IDb ~ paste0('prefix', Param), value.var='Value') | |
To wide format without index | unstack(dt, Value ~ Param) %>% as.data.table() | |
To long format | melt(dt, id=c('IDa', 'IDb'), measure=c('Param1', 'Param2')) | |
To long format for specific columns (no index) | melt(dt, id=character(), measure=c('Param1', 'Param2')) | |
Transpose | t(dt) %>% as.data.table() | Drops column names! |
Repeat rows by group | dt[, lapply(.SD, rep, 10), by=ID] | |
Repeat data.table along a sequence | s = LETTERS[1:5] dt2 = replicate(length(s), dt, simplify=FALSE) %>% setNames(s) %>% rbindlist(idcol=TRUE) |
Convert
What | How | Details |
---|---|---|
Data.frame | as.data.frame(dt) | |
Grouped list | split(dt, by='Id') | |
Matrix | data.matrix(dt) |
Syntax notes
.()
is a shortcut for list(), alternative isJ()
:=
operator assigns by reference(variable)
to refer to the column names stored in the variableget(...)
evaluates expression to column name%>%
pipe operator (magrittr package) to chain function calls in a more readable manner, e.g.dt[, mean(x), by=ID] %>% .[, mean(x)]
. Calling functions without data argument is done by enclosing in a new block:dt %>% {table(.$Var)} %>% as.data.table
%chin%
faster version of %in% for character vectors%between%
checks if value between range, e.g. a%between% c(1,9)
, or a%between% list(c(1,3), c(4,5))
. Seems slow to use%inrange%
checks if value in range%like%
regular expression checkmult='first'
argument provides a useful shortcut for dropping group columns- Perform merges prior to row subsetting to preserve the keys. Row subsetting followed by a merge drops the obsolete keys, for some reason.