rcountdata.tablerolloverrollapply

Count numbers that are lower by constant from current number


Imagine that I have a list of numbers (i.e. numbers column in data.table/data.frame).

1
5
5
10
11
12

for each number in a list a want to count how many unique numbers are there which are lower than that particular number + 5.

The explanation for upper case, first number = 1, search range is 1+5 = 6, so three numbers are in range, less than or equal to: c(1,5,5), and then count unique is 2. This is all by assuming we've got the additional condition, that the number must not only be lower than current_number + 5, but also its index in the list must be >= that of current_number.

The result in this case would be:

2
2
2
3
2
1

Note: Is there a fast solution for huge dataset, in data.frame or data.table? My dataset is rather huge, 10+M rows.


Solution

  • The fastest way I can think of in base R (works if x is sorted):

    findInterval(x + 5, unique(x)) - cumsum(!duplicated(x)) + 1L
    #[1] 2 2 2 3 2 1
    

    edit: no problem with the sorting because with data.table, sorting integers is trivial:

    nr <- 1e7
    nn <- nr/2
    set.seed(0L)
    DT <- data.table(X=sample(nn, nr, TRUE))
    #DT <- data.table(X=c(1,5,5,10,11,12))
    
    system.time(
        DT[order(X), 
            COUNT := findInterval(X + 5L, unique(X)) - cumsum(!duplicated(X)) + 1L
        ]
    )
    #   user  system elapsed 
    #   1.73    0.17    1.53 
    

    2s for 10million rows.