rdataframeindexing

R get the indexes of corresponding row in a dataframe


I have two dataframe : df1 and df2. df2 is composed of rows from df1. For each row of df2, I want its index in df1.

For example :

df1 <- data.frame(animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                  country=c('Australia', 'Italy', 'Peru', 'China'),
                  avg_sleep_hours=c(21, 18, 17, 10))
df2 <- data.frame(animal=c('koala', 'sloth', 'panda', 'panda'),
                  country=c('Australia', 'Peru', 'China', 'China'), 
                  avg_sleep_hours=c(21,17,10,10))

I want to get

1 3 4 4

I searched on Internet but found no satisfactory answer, so I wrote my own code. I know that findIdxRow could return several numbers if a row of df2 is duplicates in df1 but it won't appear in my data, so I didn't take the time to deal with that.

findIdxRow <- function(row, df)
{
  n <- nrow(df)
  is_equal <- sapply(1:n, function(i) all(row==df[i,]))
  return(which(is_equal))
}

indexes <- sapply(1:nrow(df2), function(i) findIdxRow(df2[i,],df1))

This code works, but I wonder if there is a shorter way to write it.


Solution

  • You can use match on top of paste like below

    match(do.call(paste, df2), do.call(paste, df1))
    

    Benchmark

    set.seed(42)
    n <- 5E3
    library(dplyr)
    df1 <- data.frame(
      animal = ids::adjective_animal(n),
      country = ids::proquint(n, n_words = 1))
    
    df2 <- df1 |>
      slice_sample(n = n)
    
    f1 <- \() {
      a <- merge(
        df2 |> transform(index_orig = 1:nrow(df2)),
        df1 |> transform(index = 1:nrow(df1))
      )
      a[order(a$index_orig), ]$index
    }
    
    f2 <- \() {
      df2 |>
        left_join(df1 |> mutate(index = row_number()), by = join_by(animal, country)) |>
        pull(index)
    }
    
    
    f3 <- \() {
      match(do.call(paste, df2), do.call(paste, df1))
    }
    
    microbenchmark(
      f1(),
      f2(),
      f3(),
      unit = "relative",
      check = "equal"
    )
    

    which shows

    Unit: relative
     expr       min        lq      mean    median        uq      max neval
     f1() 21.351795 16.785351 15.511564 15.514184 14.906641 9.398281   100
     f2()  3.882332  3.063971  2.929195  2.893312  2.848455 2.389725   100
     f3()  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000   100