r - Recalculate column only for highest date in each category -
i want recalculate column in data.table
rows, depending on condition
, category (cat
) , date
.
a row may qualify recalculated if condition==true
. among rows condition==true
, rows highest date
respective cat
should selected.
a simplified example:
df = data.frame(cat=rep(c("a","b","c"),each=3), date=rep(c("01-08-2013","01-07-2013","01-04-2013"),3), condition=c(true, true, false, false, false, false, false, false, true), data1=c(1:9), data2=rep(c(1:3),3), result=c(1:1)) df$date = as.date(df$date , "%m-%d-%y") dt = data.table(df) dt cat date condition data1 data2 result 1: 2013-01-08 true 1 1 1 2: 2013-01-07 true 2 2 1 3: 2013-01-04 false 3 3 1 4: b 2013-01-08 false 4 1 1 5: b 2013-01-07 false 5 2 1 6: b 2013-01-04 false 6 3 1 7: c 2013-01-08 false 7 1 1 8: c 2013-01-07 false 8 2 1 9: c 2013-01-04 true 9 3 1
i found out how extract cat
's , date
's of rows, result
must recalculated:
setkey(dt, condition, cat, date) dt[j(true), max(date), by=cat] cat v1 1: 2013-01-08 2: c 2013-01-04
however, don't know how calculate new result
these rows. in simplified example, new result
should data1+data2
.
edit:
inspired eddi's answer, came 2 more possible solutions:
approach using .i
:
dt[dt[condition==true , .i[which.max(date)], by=cat][[2]], result:=data1+data2]
approach using .sd
(see eddi's note of caution):
max_dates=dt[condition==true , .sd[which.max(date)], by=cat] setkey(dt, cat, date) dt[max_dates, result:=data1 + data2]
are there recommendations solution choose regard speed / efficiency?
something work:
dt = data.table(df) max_dates = dt[condition == true, list(date = max(date), condition = true), = cat] setkey(dt, cat, date, condition) dt[max_dates, result := data1 + data2] dt # cat date condition data1 data2 result #1: 2013-01-04 false 3 3 1 #2: 2013-01-07 true 2 2 1 #3: 2013-01-08 true 1 1 2 #4: b 2013-01-04 false 6 3 1 #5: b 2013-01-07 false 5 2 1 #6: b 2013-01-08 false 4 1 1 #7: c 2013-01-04 true 9 3 12 #8: c 2013-01-07 false 8 2 1 #9: c 2013-01-08 false 7 1 1
a note of warning: above relies on max_dates
not having key - if change have key (e.g. if by
column that's part of key), you'd have either erase key, or make have same key dt
later in code merge work correctly.
and here's approach:
dt = data.table(df) dt[, result := result + (data1 + data2 - result) * condition * (date == max(date)), = list(cat, condition)] # could've used ifelse here instead, ifelse slow
Comments
Post a Comment