I have a dataframe with records of trawl stations in different regions. I need to create a sequential index that changes every time the region changes. I've implemented it with a for loop, but I have about 60000 records, so it's super slow. Any idea on how to do it faster? Please note that I cannot simply group by region, because I need to keep the region in the order they were sampled.
Here is my solution:
# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))
# create index column
df$region_id <- 1
# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
if(df$region[ii]!=df$region[ii-1]) {
df$region_id[ii] <- df$region_id[ii-1]+1
} else {
df$region_id[ii] <- df$region_id[ii-1]
}
}
And I get something like this:
region date region_id
1 A 2020-03-20 1
2 A 2020-03-21 1
3 A 2020-03-23 1
4 B 2020-03-25 2
5 B 2020-03-27 2
6 B 2020-03-29 2
7 B 2020-03-31 2
8 B 2020-04-02 2
9 B 2020-04-04 2
10 B 2020-04-06 2
11 B 2020-04-08 2
12 C 2020-04-10 3
13 C 2020-04-12 3
14 A 2020-04-14 4
15 A 2020-04-16 4
16 A 2020-04-18 4
17 A 2020-04-20 4
18 A 2020-04-22 4
19 A 2020-04-24 4
20 A 2020-04-26 4
21 C 2020-04-28 5
22 C 2020-04-30 5
23 C 2020-05-02 5
For 56373 records this takes:
user system elapsed
36.70 0.20 36.91
Any help will be much appreciated.
Thanks
Here is a base R way with rle
.
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))
r <- rle(df$region)
r$values <- seq_along(r$values)
inverse.rle(r)
#> [1] 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 5 5 5
df$region_id <- inverse.rle(r)
df
#> region date region_id
#> 1 A 2020-03-20 1
#> 2 A 2020-03-21 1
#> 3 A 2020-03-23 1
#> 4 B 2020-03-25 2
#> 5 B 2020-03-27 2
#> 6 B 2020-03-29 2
#> 7 B 2020-03-31 2
#> 8 B 2020-04-02 2
#> 9 B 2020-04-04 2
#> 10 B 2020-04-06 2
#> 11 B 2020-04-08 2
#> 12 C 2020-04-10 3
#> 13 C 2020-04-12 3
#> 14 A 2020-04-14 4
#> 15 A 2020-04-16 4
#> 16 A 2020-04-18 4
#> 17 A 2020-04-20 4
#> 18 A 2020-04-22 4
#> 19 A 2020-04-24 4
#> 20 A 2020-04-26 4
#> 21 C 2020-04-28 5
#> 22 C 2020-04-30 5
#> 23 C 2020-05-02 5
က
#> Error in eval(expr, envir, enclos): object 'က' not found
Created on 2023-03-15 with reprex v2.0.2
With package data.table
there is also rleid
. This will advantageous with bigger data.frames.
library(data.table)
dt1 <- as.data.table(df)
dt1[, region_id := rleid(region)]
dt1
#> region date region_id
#> 1: A 2020-03-20 1
#> 2: A 2020-03-21 1
#> 3: A 2020-03-23 1
#> 4: B 2020-03-25 2
#> 5: B 2020-03-27 2
#> 6: B 2020-03-29 2
#> 7: B 2020-03-31 2
#> 8: B 2020-04-02 2
#> 9: B 2020-04-04 2
#> 10: B 2020-04-06 2
#> 11: B 2020-04-08 2
#> 12: C 2020-04-10 3
#> 13: C 2020-04-12 3
#> 14: A 2020-04-14 4
#> 15: A 2020-04-16 4
#> 16: A 2020-04-18 4
#> 17: A 2020-04-20 4
#> 18: A 2020-04-22 4
#> 19: A 2020-04-24 4
#> 20: A 2020-04-26 4
#> 21: C 2020-04-28 5
#> 22: C 2020-04-30 5
#> 23: C 2020-05-02 5
#> region date region_id
Created on 2023-03-15 with reprex v2.0.2