r - Find values in a given interval without a vector scan
With the R package data.table, is it possible to find the values that are in a given interval without a full vector scan of the data? For example:
    > dt <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
    > my.data.table.function(dt, min=3, max=10)
       x
    1: 3
    2: 5
    3: 8
where dt can be a big table.
Bonus question: is it possible to do the same thing for a set of non-overlapping intervals, such as:
    > i <- data.table(i=c(1,2), min=c(3,20), max=c(10,40))
    > i
       i min max
    1: 1   3  10
    2: 2  20  40
    > my.data.table.function2(dt, i)
       i  x
    1: 1  3
    2: 1  5
    3: 1  8
    4: 2 21
    5: 2 34
where both i and dt can be big. Thanks a lot.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
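For readers who would rather not depend on an unexported function, here is a minimal base-R sketch of the same expansion. It assumes that vecseq(x, len, clamp) returns the concatenation of the integer runs x[k], x[k] + 1, ..., x[k] + len[k] - 1, which is how it is used in this answer. The helper name seq_runs is hypothetical and not part of data.table.

```r
# Hypothetical helper (not part of data.table): concatenate the integer runs
# starts[k], starts[k] + 1, ..., starts[k] + lens[k] - 1 for every k.
seq_runs <- function(starts, lens) {
  stopifnot(length(starts) == length(lens))
  # seq_len(0) is empty, so a zero length contributes no positions.
  runs <- Map(function(s, l) s + seq_len(l) - 1L, starts, lens)
  unlist(runs, use.names = FALSE)
}

seq_runs(c(4L, 8L), c(3L, 2L))
## [1] 4 5 6 8 9
```

This is likely slower than the internal function on large inputs, since it builds one small vector per run before concatenating them.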
With that out of the way, we need a bigger example to exhibit the difference from the vector scan approach:
    require(data.table)

    n <- 1e5L
    f <- 10L
    ni <- n / f

    set.seed(54321)
    dt <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
    it <- data.table(i = 1:ni,
                     min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                     max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
dt, the data table, is a not-too-random subset of 1:n. it, the interval table, is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
    system.time({
      ans.vecscan <- it[, dt[x >= min & x <= max], by = i]
    })
    ##    user  system elapsed
    ##   84.15    4.48   88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
    system.time({
      # Save time if dt is already keyed correctly
      if(!identical(key(dt), "x")) setkey(dt, x)

      # Remember each row's position in the sorted table
      dt[, row := .I]

      setkey(it, min)

      # First row of dt at or above each interval's lower endpoint
      target.low <- it[dt, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

      # Non-overlapping intervals => (sorted by min => sorted by max)
      setattr(it, "sorted", "max")

      # Last row of dt at or below each interval's upper endpoint
      target.high <- it[dt, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

      target <- target.low[target.high, nomatch = 0]
      target[, len := max - min + 1L]

      rm(target.low, target.high)

      # Expand each (start row, length) pair into row numbers and label them by interval
      ans.roll <- dt[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
      ans.roll[, row := NULL]
      setcolorder(ans.roll, c("i", "x"))
    })
    ##    user  system elapsed
    ##    0.12    0.00    0.12
Ensuring the same row order verifies the result:
    setkey(ans.vecscan, i, x)
    setkey(ans.roll, i, x)
    identical(ans.vecscan, ans.roll)
    ## [1] TRUE
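The idea behind this approach is to locate, for each interval, the first and last matching row of the sorted data by binary search, and then expand those row ranges. It can be sketched in base R on the small example from the question. This is an illustration only: findInterval stands in for the rolling joins and is not what the code above calls, and the names intervals, lo, hi, len and rows are chosen here for the sketch.

```r
# The question's data, sorted (binary search needs sorted input).
x <- sort(c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
intervals <- data.frame(i = c(1, 2), min = c(3, 20), max = c(10, 40))

# First position with x >= min: count of values strictly below min, plus one.
lo <- findInterval(intervals$min, x, left.open = TRUE) + 1L
# Last position with x <= max: count of values at or below max.
hi <- findInterval(intervals$max, x)

# An interval containing no values gives hi < lo, so clamp the length at zero.
len <- pmax(hi - lo + 1L, 0L)

# Expand each (lo, len) pair into row positions and label rows by interval.
rows <- unlist(Map(function(s, l) s + seq_len(l) - 1L, lo, len), use.names = FALSE)
ans <- data.frame(i = rep(intervals$i, times = len), x = x[rows])
ans
##   i  x
## 1 1  3
## 2 1  5
## 3 1  8
## 4 2 21
## 5 2 34
```

Each interval costs two binary searches rather than a pass over all of x, which is where the speedup over the repeated vector scan comes from.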