rloopssequential-number

Optimize loop for index creation in R


I have a dataframe with records of trawl stations in different regions. I need to create a sequential index that changes every time the region changes. I've implemented it with a for loop, but I have about 60000 records, so it's super slow. Any idea on how to do it faster? Please note that I cannot simply group by region, because I need to keep the region in the order they were sampled.

Here is my solution:

# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

# create index column
df$region_id <- 1

# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
  if(df$region[ii]!=df$region[ii-1]) {
    df$region_id[ii] <- df$region_id[ii-1]+1
  } else {
    df$region_id[ii] <- df$region_id[ii-1]
  }
}

And I get something like this:

   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

For 56373 records this takes:

  user  system elapsed 
  36.70    0.20   36.91 

Any help will be much appreciated.

Thanks


Solution

  • Here is a base R way with rle.

    df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
                     date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))
    
    r <- rle(df$region)
    r$values <- seq_along(r$values)
    inverse.rle(r)
    #>  [1] 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 5 5 5
    
    df$region_id <- inverse.rle(r)
    df
    #>    region       date region_id
    #> 1       A 2020-03-20         1
    #> 2       A 2020-03-21         1
    #> 3       A 2020-03-23         1
    #> 4       B 2020-03-25         2
    #> 5       B 2020-03-27         2
    #> 6       B 2020-03-29         2
    #> 7       B 2020-03-31         2
    #> 8       B 2020-04-02         2
    #> 9       B 2020-04-04         2
    #> 10      B 2020-04-06         2
    #> 11      B 2020-04-08         2
    #> 12      C 2020-04-10         3
    #> 13      C 2020-04-12         3
    #> 14      A 2020-04-14         4
    #> 15      A 2020-04-16         4
    #> 16      A 2020-04-18         4
    #> 17      A 2020-04-20         4
    #> 18      A 2020-04-22         4
    #> 19      A 2020-04-24         4
    #> 20      A 2020-04-26         4
    #> 21      C 2020-04-28         5
    #> 22      C 2020-04-30         5
    #> 23      C 2020-05-02         5
    က
    #> Error in eval(expr, envir, enclos): object 'က' not found
    

    Created on 2023-03-15 with reprex v2.0.2


    Edit

    With package data.table there is also rleid. This will advantageous with bigger data.frames.

    library(data.table)
    
    dt1 <- as.data.table(df)
    dt1[, region_id := rleid(region)]
    dt1
    #>     region       date region_id
    #>  1:      A 2020-03-20         1
    #>  2:      A 2020-03-21         1
    #>  3:      A 2020-03-23         1
    #>  4:      B 2020-03-25         2
    #>  5:      B 2020-03-27         2
    #>  6:      B 2020-03-29         2
    #>  7:      B 2020-03-31         2
    #>  8:      B 2020-04-02         2
    #>  9:      B 2020-04-04         2
    #> 10:      B 2020-04-06         2
    #> 11:      B 2020-04-08         2
    #> 12:      C 2020-04-10         3
    #> 13:      C 2020-04-12         3
    #> 14:      A 2020-04-14         4
    #> 15:      A 2020-04-16         4
    #> 16:      A 2020-04-18         4
    #> 17:      A 2020-04-20         4
    #> 18:      A 2020-04-22         4
    #> 19:      A 2020-04-24         4
    #> 20:      A 2020-04-26         4
    #> 21:      C 2020-04-28         5
    #> 22:      C 2020-04-30         5
    #> 23:      C 2020-05-02         5
    #>     region       date region_id
    

    Created on 2023-03-15 with reprex v2.0.2