I have below code to estimate the group membership of each element of a large vector
Interval = data.frame(lowerLimit = c(0, c(13.31, 14.1, 14.52, 15.9, 17.88, 20.85, 22.14, 22.6, 23.49,
24.31, 26.54, 27.29, 32.41, 33.49, 35.08, 38.25, 41.84, 46, 47.35,
47.85, 48.13, 48.25, 48.8, 50.83, 51.55, 53.22, 54.21, 55.94,
56.31, 58.09, 58.35, 59.92, 60.78, 64.9, 68.7, 72.79, 77.27,
78.38, 79.04, 80.61, 85.52, 86.25, 86.63, 88.05, 90.07, 90.25,
95.13, 96.88, 98.47, 99.77)), upperLimit = c(13.31, 14.1, 14.52, 15.9, 17.88, 20.85, 22.14, 22.6, 23.49,
24.31, 26.54, 27.29, 32.41, 33.49, 35.08, 38.25, 41.84, 46, 47.35,
47.85, 48.13, 48.25, 48.8, 50.83, 51.55, 53.22, 54.21, 55.94,
56.31, 58.09, 58.35, 59.92, 60.78, 64.9, 68.7, 72.79, 77.27,
78.38, 79.04, 80.61, 85.52, 86.25, 86.63, 88.05, 90.07, 90.25,
95.13, 96.88, 98.47, 99.77, 100))
set.seed(1)
Num = runif(100000, 0, 100)
sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))
While above code can estimate the group membership for each element of Num
, I wonder if there is any more efficient and faster method available. For large vector this code is taking lot of time.
Many thanks for your suggestion.
Since your Interval
contains contiguous intervals, you can use findInterval
.
grp1 <- findInterval(Num, c(0, Interval[[2]]))
grp2 <- sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))
identical(grp1, grp2)
# [1] TRUE
bench::mark(
fi = findInterval(Num, c(0, Interval[[2]])),
sa = sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))
)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 fi 1.62ms 1.7ms 584. 391KB 0 292 0 500ms <int [100,000]> <Rprofmem [2 × 3]> <bench_tm [292]> <tibble [292 × 3]>
# 2 sa 207.93ms 210.9ms 4.61 101MB 7.69 3 5 650ms <int [100,000]> <Rprofmem [400,215 × 3]> <bench_tm [3]> <tibble [3 × 3]>
The findInterval
appears to be around 100x faster with the same results.