In my data, I would like to adjust the start-variable in a row, if a dummy variable is repeating:
I have data as following (added the rcount variable by myself):
head(elevation_raw)
start duration value rcount
1 2021-02-20 12:41:00 [60] [0] 1
2 2021-02-20 12:49:00 [60,20,37] [0,0,0] 2
3 2021-02-20 12:57:00 [60] [0] 3
4 2021-02-20 13:02:00 [60] [0] 4
5 2021-02-20 13:09:00 [60,60,60,60,60] [0,0,0,0,0] 5
6 2021-02-20 14:19:00 [60,60] [0,0] 6
You see the start of measurement, the duration of the measurement (for how long the measurement prevails, also indicating the end of measurement) and the value (here elevation).
You see in row 1, 5, 6 that there are multiple measurements put together in one row. For example, the 2nd measurement starts after the start of the first + duration of the first.
I already put the data in nice rows with separate_rows
and left_join
, but the start of the "multiple measurement"-rows are wrong. Please note that these multiple measurement rows can be found when rcount is repeating:
durations <- separate_rows(elevation_raw, duration, sep = ",", convert=T) %>%
select( c(-value))
durations <- mutate(durations, rcount2 = as.numeric(rownames(durations)))
values <- separate_rows(elevation_raw, value, sep = ",", convert=T) %>%
select( -c(duration))
values <- mutate(values, rcount2 = as.numeric(rownames(values)))
elevation_raw <-left_join(durations, values, by=c('rcount2', 'rcount', "start"))
#result:
head(elevation_raw)
# A tibble: 6 x 6
start duration rcount rcount2 value
<dttm> <int> <dbl> <dbl> <int>
1 2021-02-20 12:41:00 60 1 1 0
2 2021-02-20 12:49:00 60 2 2 0 #"multiple measurement" row, see rcount
3 2021-02-20 12:49:00 20 2 3 0 #"multiple measurement" row
4 2021-02-20 12:49:00 37 2 4 0 #"multiple measurement" row
5 2021-02-20 12:57:00 60 3 5 0
6 2021-02-20 13:02:00 60 4 6 0
So of course, I would like to add a column with the real starts, as now the start is incorrect for "multiple measurement" rows. I tried the following:
elevation_raw<- mutate(elevation_raw, real_start=ifelse(rcount==dplyr::lag(rcount),
dplyr::lag(elevation_raw$real_start)+dplyr::lag(elevation_raw$duration), elevation_raw$start) )
If rcount is repeating (with (rcount==lag(rcount)
) then the real start of the actual row is the real start of the previous row + the duration of the previous row. Else it is the same as in the normal start.
Of course, this approach is not working: I can't use the column I'm right now building to build the column (though I thout that if the first row is not a multiple measurement row, real_start would be the normal start, and it might work as we would have a first value to start with...). (And if I would use the normal start column, it would just be correct for 2nd measurement of multiple rows.)
Maybe I should approach it rather differently, i.e., with an apply function? Thank you for your help, if you have suggestions on how to make my question more clear, please let me know.
Like this?
library(data.table)
library(stringr)
# Sample data
DT <- fread("
start duration value rcount
2021-02-20T12:41:00 [60] [0] 1
2021-02-20T12:49:00 [60,20,37] [0,0,0] 2
2021-02-20T12:57:00 [60] [0] 3
2021-02-20T13:02:00 [60] [0] 4
2021-02-20T13:09:00 [60,60,60,60,60] [0,0,0,0,0] 5
2021-02-20T14:19:00 [60,60] [0,0] 6")
DT[, start := as.POSIXct(start, format = "%Y-%m-%dT%H:%M:%S")]
# maximum number of duration/values
ncols <- length(tstrsplit(DT$duration, ","))
# split the durarion and value columns
DT[, paste0("duration", 1:ncols) := lapply(transpose(str_extract_all(duration, "\\d+")), as.numeric)]
DT[, paste0("value", 1:ncols) := lapply(transpose(str_extract_all(value, "\\d+")), as.numeric)]
# For decimal values, replace the line above with the following line
# DT[, paste0("value", 1:ncols) := lapply(transpose(str_extract_all(value, "[0-9]\\d*(\\.\\d+)?")), as.numeric)]
# Driop the original value and durarion columns
DT[, `:=`(duration = NULL, value = NULL)]
answer <- melt(DT, measure.vars = patterns(duration = "^duration[0-9]",value = "^value[0-9]"),na.rm = TRUE)
# Order
setkey(answer, rcount, variable)
# start addition = cumulative sum of durations
answer[, start_new := start + (cumsum(duration) - duration[1]), by = .(rcount)][]
# start rcount variable duration value start_new
# 1: 2021-02-20 12:41:00 1 1 60 0 2021-02-20 12:41:00
# 2: 2021-02-20 12:49:00 2 1 60 0 2021-02-20 12:49:00
# 3: 2021-02-20 12:49:00 2 2 20 0 2021-02-20 12:49:20
# 4: 2021-02-20 12:49:00 2 3 37 0 2021-02-20 12:49:57
# 5: 2021-02-20 12:57:00 3 1 60 0 2021-02-20 12:57:00
# 6: 2021-02-20 13:02:00 4 1 60 0 2021-02-20 13:02:00
# 7: 2021-02-20 13:09:00 5 1 60 0 2021-02-20 13:09:00
# 8: 2021-02-20 13:09:00 5 2 60 0 2021-02-20 13:10:00
# 9: 2021-02-20 13:09:00 5 3 60 0 2021-02-20 13:11:00
#10: 2021-02-20 13:09:00 5 4 60 0 2021-02-20 13:12:00
#11: 2021-02-20 13:09:00 5 5 60 0 2021-02-20 13:13:00
#12: 2021-02-20 14:19:00 6 1 60 0 2021-02-20 14:19:00
#13: 2021-02-20 14:19:00 6 2 60 0 2021-02-20 14:20:00