I have a large data set (a data.table with more than 100 million rows) and want to calculate the time difference between two columns. The time strings are in "%H%M%OS" format, without leading zeros. For example, the data:
ID time0 time1
1 93005220 93005320
2 93551060 93553940
3 93717740 93717740
4 94840800 94840800
5 94959510 94959510
6 101030460 101127870
7 101425010 101425010
8 104728320 104728980
9 105704050 105705890
10 135109830 135402830
11 93005220 150001330
Taking the first row as an example, I cannot use difftime and strptime directly to calculate the difference between "93005320" ("09:30:05.320") and "93005220" ("09:30:05.220"). I first have to convert "93005220" to a format like "093005.220". Then I tried:
difftime(strptime("093005.220","%H%M%OS"),strptime("093005.320","%H%M%OS"))
But this is too slow for more than 100 million rows. Is there an efficient way to do this in R?
One approach is to note that each digit encodes some amount of elapsed time, but in different units (e.g. tens of hours, minutes, milliseconds). If every digit were expressed in seconds, we could compare the two times directly using -.
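# Convert a time such as 93005220 (HMMSSmmm) into per-digit contributions in seconds.
# Each digit is isolated by dropping higher digits with %% and lower digits with %/%,
# then multiplied by the number of seconds one unit of that digit represents.
f = function(x){
  c(36000, 3600, 600, 60, 10, 1, .1, .01, .001) *
    x %% c(1e9, 1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1) %/%
    c(1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1, 1e0)
}
# Apply pairwise over two columns and sum the per-digit differences
ff = Vectorize(function(x, y) sum(f(x) - f(y)))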
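To see what f returns for a single value, here is a small sketch that repeats the definition of f from above and applies it to the first time0 value from the question. The variable name parts is only illustrative.

```r
# Same definition of f as above, repeated so this snippet runs on its own
f <- function(x) {
  c(36000, 3600, 600, 60, 10, 1, .1, .01, .001) *
    x %% c(1e9, 1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1) %/%
    c(1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1, 1e0)
}

# 93005220 is 09:30:05.220; the implicit leading zero becomes the first component
parts <- f(93005220)
print(parts)       # one entry per digit, each already in seconds
print(sum(parts))  # seconds since midnight
```

Summing the components gives the time of day in seconds, so subtracting two such sums gives the difference between the two times.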
To get the difference in seconds, here using dplyr:
df %>%
  mutate(diff = ff(time1, time0))
time0 time1 diff
<dbl> <dbl> <dbl>
1 93005220 93005320 0.1
2 93551060 93553940 2.88
3 93717740 93717740 0
4 94840800 94840800 0
5 94959510 94959510 0
6 101030460 101127870 57.4
7 101425010 101425010 0
8 104728320 104728980 0.66
9 105704050 105705890 1.84
10 135109830 135402830 173
11 93005220 150001330 19796.
Explanation: Using Anirban's approach, f transforms an input integer into a vector of its components (padding with 0 where necessary). These components are then converted to seconds by multiplication, which is a vectorized operation. That is, c(1,2) * c(3,4) evaluates to c(3, 8).
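Since Vectorize is a wrapper around mapply and so calls the function once per row, it may still be slow on 100 million rows. As a possible alternative, and not part of the approach above, the same conversion can be sketched with arithmetic on whole columns. The helper name to_ms is hypothetical, and the sketch assumes the times are always in HMMSSmmm form as in the question.

```r
# Hypothetical helper (not from the original answer): convert HMMSSmmm
# numbers to milliseconds since midnight using whole-column arithmetic.
to_ms <- function(x) {
  hours   <- x %/% 1e7            # leading one or two digits
  minutes <- (x %/% 1e5) %% 100   # next two digits
  millis  <- x %% 1e5             # last five digits: seconds and milliseconds
  hours * 3600000 + minutes * 60000 + millis
}

# A few rows taken from the question's data
time0 <- c(93005220, 93551060, 101030460, 93005220)
time1 <- c(93005320, 93553940, 101127870, 150001330)

# Subtract in milliseconds first so the intermediate values stay whole numbers,
# then convert the result to seconds
diff_sec <- (to_ms(time1) - to_ms(time0)) / 1000
print(diff_sec)
```

With a data.table named dt holding these columns (an assumption about the setup), this could be applied as dt[, diff := (to_ms(time1) - to_ms(time0)) / 1000].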