r - Recalculate column only for highest date in each category -
i want recalculate column in data.table rows, depending on condition, category (cat) , date.
a row may qualify recalculated if condition==true. among rows condition==true, rows highest date respective cat should selected.
a simplified example:
df = data.frame(cat=rep(c("a","b","c"),each=3), date=rep(c("01-08-2013","01-07-2013","01-04-2013"),3), condition=c(true, true, false, false, false, false, false, false, true), data1=c(1:9), data2=rep(c(1:3),3), result=c(1:1)) df$date = as.date(df$date , "%m-%d-%y") dt = data.table(df) dt cat date condition data1 data2 result 1: 2013-01-08 true 1 1 1 2: 2013-01-07 true 2 2 1 3: 2013-01-04 false 3 3 1 4: b 2013-01-08 false 4 1 1 5: b 2013-01-07 false 5 2 1 6: b 2013-01-04 false 6 3 1 7: c 2013-01-08 false 7 1 1 8: c 2013-01-07 false 8 2 1 9: c 2013-01-04 true 9 3 1 i found out how extract cat's , date's of rows, result must recalculated:
setkey(dt, condition, cat, date) dt[j(true), max(date), by=cat] cat v1 1: 2013-01-08 2: c 2013-01-04 however, don't know how calculate new result these rows. in simplified example, new result should data1+data2.
edit:
inspired eddi's answer, came 2 more possible solutions:
approach using .i:
dt[dt[condition==true , .i[which.max(date)], by=cat][[2]], result:=data1+data2] approach using .sd (see eddi's note of caution):
max_dates=dt[condition==true , .sd[which.max(date)], by=cat] setkey(dt, cat, date) dt[max_dates, result:=data1 + data2] are there recommendations solution choose regard speed / efficiency?
something work:
dt = data.table(df) max_dates = dt[condition == true, list(date = max(date), condition = true), = cat] setkey(dt, cat, date, condition) dt[max_dates, result := data1 + data2] dt # cat date condition data1 data2 result #1: 2013-01-04 false 3 3 1 #2: 2013-01-07 true 2 2 1 #3: 2013-01-08 true 1 1 2 #4: b 2013-01-04 false 6 3 1 #5: b 2013-01-07 false 5 2 1 #6: b 2013-01-08 false 4 1 1 #7: c 2013-01-04 true 9 3 12 #8: c 2013-01-07 false 8 2 1 #9: c 2013-01-08 false 7 1 1 a note of warning: above relies on max_dates not having key - if change have key (e.g. if by column that's part of key), you'd have either erase key, or make have same key dt later in code merge work correctly.
and here's approach:
dt = data.table(df) dt[, result := result + (data1 + data2 - result) * condition * (date == max(date)), = list(cat, condition)] # could've used ifelse here instead, ifelse slow
Comments
Post a Comment