rvector

Efficient method to estimate the group membership


I have below code to estimate the group membership of each element of a large vector

Interval = data.frame(lowerLimit = c(0, c(13.31, 14.1, 14.52, 15.9, 17.88, 20.85, 22.14, 22.6, 23.49, 
24.31, 26.54, 27.29, 32.41, 33.49, 35.08, 38.25, 41.84, 46, 47.35, 
47.85, 48.13, 48.25, 48.8, 50.83, 51.55, 53.22, 54.21, 55.94, 
56.31, 58.09, 58.35, 59.92, 60.78, 64.9, 68.7, 72.79, 77.27, 
78.38, 79.04, 80.61, 85.52, 86.25, 86.63, 88.05, 90.07, 90.25, 
95.13, 96.88, 98.47, 99.77)), upperLimit = c(13.31, 14.1, 14.52, 15.9, 17.88, 20.85, 22.14, 22.6, 23.49, 
24.31, 26.54, 27.29, 32.41, 33.49, 35.08, 38.25, 41.84, 46, 47.35, 
47.85, 48.13, 48.25, 48.8, 50.83, 51.55, 53.22, 54.21, 55.94, 
56.31, 58.09, 58.35, 59.92, 60.78, 64.9, 68.7, 72.79, 77.27, 
78.38, 79.04, 80.61, 85.52, 86.25, 86.63, 88.05, 90.07, 90.25, 
95.13, 96.88, 98.47, 99.77, 100))

set.seed(1)
Num = runif(100000, 0, 100)
sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))

While above code can estimate the group membership for each element of Num, I wonder if there is any more efficient and faster method available. For large vector this code is taking lot of time.

Many thanks for your suggestion.


Solution

  • Since your Interval contains contiguous intervals, you can use findInterval.

    grp1 <- findInterval(Num, c(0, Interval[[2]]))
    grp2 <- sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))
    identical(grp1, grp2)
    # [1] TRUE
    bench::mark(
      fi = findInterval(Num, c(0, Interval[[2]])),
      sa = sapply(Num, function(i) which(Interval$lowerLimit <= i & Interval$upperLimit > i))
    )
    # Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    # # A tibble: 2 × 13
    #   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result          memory                   time             gc                
    #   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>          <list>                   <list>           <list>            
    # 1 fi           1.62ms    1.7ms    584.       391KB     0      292     0      500ms <int [100,000]> <Rprofmem [2 × 3]>       <bench_tm [292]> <tibble [292 × 3]>
    # 2 sa         207.93ms  210.9ms      4.61     101MB     7.69     3     5      650ms <int [100,000]> <Rprofmem [400,215 × 3]> <bench_tm [3]>   <tibble [3 × 3]>  
    

    The findInterval appears to be around 100x faster with the same results.