r dataframe insert

Add new row to dataframe, at specific row-index, not appended?


The following code combines a vector with a dataframe:

newrow = c(1:4)
existingDF = rbind(existingDF,newrow)

However, this code always inserts the new row at the end of the dataframe.

How can I insert the row at a specified point within the dataframe? For example, let's say the dataframe has 20 rows; how can I insert the new row between rows 10 and 11?


Solution

  • Here's a solution that avoids the (often slow) rbind call:

    existingDF <- as.data.frame(matrix(seq(20),nrow=5,ncol=4))
    r <- 3
    newrow <- seq(4)
    insertRow <- function(existingDF, newrow, r) {
      existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
      existingDF[r,] <- newrow
      existingDF
    }
    
    > insertRow(existingDF, newrow, r)
      V1 V2 V3 V4
    1  1  6 11 16
    2  2  7 12 17
    3  1  2  3  4
    4  3  8 13 18
    5  4  9 14 19
    6  5 10 15 20
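
    The shifting step can be made more explicit with a guarded variant. This is only a sketch: the name insertRowSafe and the argument checks are assumptions added for illustration, not part of the answer above. It also covers the case r = nrow(existingDF) + 1, where seq(r + 1, nrow(existingDF) + 1) would otherwise count downward.

    ```r
    # Hypothetical guarded variant of insertRow(); the name insertRowSafe and
    # the checks are assumptions for illustration only.
    insertRowSafe <- function(existingDF, newrow, r) {
      n <- nrow(existingDF)
      stopifnot(r >= 1, r <= n + 1, length(newrow) == ncol(existingDF))
      if (r <= n) {
        # Shift rows r..n down by one; assigning to row n + 1 grows the frame.
        existingDF[seq(r + 1, n + 1), ] <- existingDF[seq(r, n), ]
      }
      # Overwrite row r (or create it when r == n + 1) with the new values.
      existingDF[r, ] <- newrow
      existingDF
    }

    df <- as.data.frame(matrix(seq(20), nrow = 5, ncol = 4))
    mid <- insertRowSafe(df, seq(4), 3)             # new row becomes row 3
    end <- insertRowSafe(df, seq(4), nrow(df) + 1)  # new row is appended
    ```

    Because R passes arguments by value, df itself is unchanged; the caller has to assign the result, as in existingDF <- insertRow(existingDF, newrow, r).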
    

    If speed is less important than clarity, then @Simon's solution works well:

    existingDF <- rbind(existingDF[1:r,],newrow,existingDF[-(1:r),])
    > existingDF
       V1 V2 V3 V4
    1   1  6 11 16
    2   2  7 12 17
    3   3  8 13 18
    4   1  2  3  4
    41  4  9 14 19
    5   5 10 15 20
    

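    (Note that r is indexed differently here: insertRow puts the new row at position r, whereas this rbind version puts it after row r.)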
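
    If the two approaches should agree on what r means, the rbind version can be shifted by one so that the new row lands at position r. A minimal sketch for the r = 3 example used here; the helper names headRows, tailRows and viaRbind are made up for illustration:

    ```r
    existingDF <- as.data.frame(matrix(seq(20), nrow = 5, ncol = 4))
    newrow <- seq(4)
    r <- 3

    # Rows strictly before position r, and rows from r onwards.
    headRows <- existingDF[seq_len(r - 1), ]
    tailRows <- existingDF[seq(r, nrow(existingDF)), ]

    viaRbind <- rbind(headRows, newrow, tailRows)
    # Reset the row names so they run 1..n instead of the names rbind generates.
    rownames(viaRbind) <- NULL
    ```

    Resetting the row names is optional, but it avoids generated names such as the 41 in the output above.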

    And finally, benchmarks:

    library(microbenchmark)
    microbenchmark(
      rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
      insertRow(existingDF,newrow,r)
    )
    
    Unit: microseconds
                                                        expr     min       lq   median       uq       max
    1                       insertRow(existingDF, newrow, r) 660.131 678.3675 695.5515 725.2775   928.299
    2 rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 801.161 831.7730 854.6320 881.6560 10641.417
    

    Benchmarks

    As @MatthewDowle always points out to me, benchmarks should be checked for how they scale as the size of the problem increases. Here we go then:

    benchmarkInsertionSolutions <- function(nrow=5,ncol=4) {
      existingDF <- as.data.frame(matrix(seq(nrow*ncol),nrow=nrow,ncol=ncol))
      r <- 3 # Row to insert into
      newrow <- seq(ncol)
      m <- microbenchmark(
       rbind(existingDF[1:r,],newrow,existingDF[-(1:r),]),
       insertRow(existingDF,newrow,r),
       insertRow2(existingDF,newrow,r)
      )
      # Now return the median times
      mediansBy <- by(m$time,m$expr, FUN=median)
      res <- as.numeric(mediansBy)
      names(res) <- names(mediansBy)
      res
    }
    nrows <- 5*10^(0:5)
    benchmarks <- sapply(nrows,benchmarkInsertionSolutions)
    colnames(benchmarks) <- as.character(nrows)
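    library(ggplot2)   # ggplot, aes, geom_line, scale_*_log10
    library(reshape2)  # melt: matrix -> long format with Var1, Var2, value
    ggplot( melt(benchmarks), aes(x=Var2,y=value,colour=Var1) ) + geom_line() + scale_x_log10() + scale_y_log10()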
    

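    @Roland's solution (insertRow2) scales quite well, even with the call to rbind. The values are median times in nanoseconds, and each column is the number of rows in the dataframe: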

                                                                  5       50     500    5000    50000     5e+05
    insertRow2(existingDF, newrow, r)                      549861.5 579579.0  789452 2512926 46994560 414790214
    insertRow(existingDF, newrow, r)                       895401.0 905318.5 1168201 2603926 39765358 392904851
    rbind(existingDF[1:r, ], newrow, existingDF[-(1:r), ]) 787218.0 814979.0 1263886 5591880 63351247 829650894
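
    insertRow2 is @Roland's solution and is not reproduced in this answer. The following is an assumed reconstruction, based only on the remark that it calls rbind: append the row, then reorder so that it lands at position r. Treat the body as a sketch rather than the exact function that was benchmarked.

    ```r
    # Assumed reconstruction of insertRow2; this answer only references the
    # function by name, so the body below is an assumption.
    insertRow2 <- function(existingDF, newrow, r) {
      # Append the new row at the end...
      existingDF <- rbind(existingDF, newrow)
      n <- nrow(existingDF)
      # ...then reorder: the appended row gets sort key r - 0.5, which places
      # it just before the original row r.
      existingDF <- existingDF[order(c(seq_len(n - 1), r - 0.5)), ]
      rownames(existingDF) <- NULL
      existingDF
    }

    df <- as.data.frame(matrix(seq(20), nrow = 5, ncol = 4))
    res <- insertRow2(df, seq(4), 3)  # new row ends up at position 3
    ```

    With this definition the new row ends up at position r, the same convention as insertRow.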
    

    Plotted on a linear scale:

    [Figure: median benchmark times, linear scale]

    And on a log-log scale:

    [Figure: median benchmark times, log-log scale]