R - Find values in a given interval without a vector scan


With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:

> dt <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
> my.data.table.function(dt, min = 3, max = 10)
   x
1: 3
2: 5
3: 8

where dt can be a big table.

Bonus question: is it possible to do the same thing for a set of non-overlapping intervals, such as

> i <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))
> i
   i min max
1: 1   3  10
2: 2  20  40
> my.data.table.function2(dt, i)
   i  x
1: 1  3
2: 1  5
3: 1  8
4: 2 21
5: 2 34

where both i and dt can be big. Thanks a lot.

First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
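Since vecseq is internal, it may help to see what it is used for here. As called in this answer, it appears to expand a vector of start positions and a vector of lengths into one concatenated run of row numbers. A base R stand-in, offered only as a sketch (vecseq_base is a hypothetical name, and it ignores vecseq's third, clamp argument):

```r
# Hypothetical stand-in for data.table:::vecseq(start, len, NULL):
# for each pair (start[k], len[k]) emit start[k], start[k] + 1, ...,
# start[k] + len[k] - 1, and concatenate the runs.
vecseq_base <- function(start, len) {
  rep.int(start, len) + sequence(len) - 1L
}

# Two runs: three rows starting at row 4, two rows starting at row 8.
rows <- vecseq_base(c(4L, 8L), c(3L, 2L))
print(rows)
```

If the internal function ever changes, a helper along these lines could replace it, at some cost in speed.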

That out of the way, we need a bigger example to exhibit the difference from the vector scan approach:

require(data.table)

n <- 1e5L
f <- 10L
ni <- n / f

set.seed(54321)
dt <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
it <- data.table(i = 1:ni,
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))

dt, the data table, is a not-too-random subset of 1:n. it, the interval table, holds ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan over all ni intervals takes a while:

system.time({
  ans.vecscan <- it[, dt[x >= min & x <= max], by = i]
})

##  user  system elapsed 
## 84.15    4.48   88.78
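The scan is slow because each of the ni intervals is compared against all n values. On sorted data, the two endpoints of an interval can instead be located by binary search, which is roughly the idea that keyed lookups rely on. A minimal base R sketch of that idea, assuming the vector is already sorted; lower_bound, upper_bound and values_in_interval are hypothetical helper names, not data.table functions:

```r
# First position pos in sorted v with v[pos] >= value
# (length(v) + 1 if there is no such position).
lower_bound <- function(v, value) {
  lo <- 1L
  hi <- length(v) + 1L
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (v[mid] < value) lo <- mid + 1L else hi <- mid
  }
  lo
}

# Last position pos in sorted v with v[pos] <= value (0 if there is none).
upper_bound <- function(v, value) {
  lo <- 1L
  hi <- length(v) + 1L
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (v[mid] <= value) lo <- mid + 1L else hi <- mid
  }
  lo - 1L
}

# Values of sorted v inside [min, max], found with two binary searches
# instead of comparing every element.
values_in_interval <- function(v, min, max) {
  first <- lower_bound(v, min)
  last <- upper_bound(v, max)
  if (first > last) return(v[0])
  v[first:last]
}

x <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89)
res <- values_in_interval(x, min = 3, max = 10)
print(res)
```

Each lookup costs on the order of log(n) comparisons plus the size of the result, rather than n comparisons per interval.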

One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:

system.time({
  # Save time if dt is already keyed correctly
  if(!identical(key(dt), "x")) setkey(dt, x)

  dt[, row := .I]

  setkey(it, min)

  # First row of dt falling at or after each interval's min
  target.low <- it[dt, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(it, "sorted", "max")

  # Last row of dt falling at or before each interval's max
  target.high <- it[dt, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]

  rm(target.low, target.high)

  # Expand (start row, length) pairs into row numbers and tag each row with its interval
  ans.roll <- dt[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})

## user  system elapsed 
## 0.12    0.00    0.12
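To see what the two rolling joins pick out, here is a small walk-through on the toy data from the question. It is a sketch rather than the code above: it assumes a data.table version that accepts on = together with roll and nomatch = NULL, it uses on = instead of re-keying it, and the column names first_row and last_row are made up for illustration.

```r
library(data.table)

# Toy data from the question; row records each value's position in sorted order.
dt <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
setkey(dt, x)
dt[, row := .I]
it <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))

# roll = Inf: each x is matched to the last interval whose min is <= x,
# so the first matched row per interval is the first x >= min.
low <- it[dt, on = .(min = x), roll = Inf, nomatch = NULL][
  , .(first_row = row[1]), keyby = i]

# roll = -Inf: each x is matched to the next interval whose max is >= x,
# so the last matched row per interval is the last x <= max.
high <- it[dt, on = .(max = x), roll = -Inf, nomatch = NULL][
  , .(last_row = row[.N]), keyby = i]

# Combine both endpoints and expand each pair into row numbers.
target <- low[high, on = "i", nomatch = NULL]
rows <- unlist(Map(seq.int, target$first_row, target$last_row))
picked <- dt$x[rows]
print(target)
print(picked)
```

In this sketch an interval that contains no value at all could come out with first_row greater than last_row, so real code would probably want to filter such rows out of target before expanding.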

Ensuring the same row order verifies the result:

setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
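More recent data.table releases also support non-equi joins, which may offer a route that avoids the unexported helper altogether. This is a different technique from the rolling joins above, shown only as a sketch on the question's toy data and assuming a version that accepts inequalities in on =. The interval id column is called id here rather than i purely to keep the j expression unambiguous.

```r
library(data.table)

dt <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
intervals <- data.table(id = c(1, 2), min = c(3, 20), max = c(10, 40))

# For every interval, keep the rows of dt with min <= x <= max.
# In the join result the column x would hold the interval bound,
# so x.x is used to get the original value from dt.
res <- dt[intervals, on = .(x >= min, x <= max),
          .(id, x = x.x), nomatch = NULL]
print(res)
```

How this compares in speed with the rolling-join approach on large tables has not been measured here.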
