rdataframeperformanceprocessing-efficiencycoding-efficiency

Is there way to optimize the speed for changing values in a >2 million row data frame?


I have got a tibble of more than 2 million rows. One of the columns size is a value using M to represent million, k to represent thousand; it also has some <NA> values. The column type is character, like the following:

size
1.3M
5k
302
8.6M
<NA>
4.4k
21

...and so on.

I tried the following code:

for (i in 1:length(example$size)) {
  if (!is.na(example$size[i])) {
    if (str_sub(example$size[i],-1,-1) == "M") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000000
    } else if (str_sub(example$size[i],-1,-1) == "k") {
      example$size[i] = as.numeric(str_sub(example$size[i], 1,-2)) * 1000
    }
  }
}

But it took more than half hour and still running, so I interrupted that as I was not sure if my code was wrong and it's in a infinite loop. Is there anything wrong or any way of coding to improve the efficiency?


Solution

  • You could accomplish everything using str_replace_all and as.numeric:

     as.numeric(stringr::str_replace_all(size, c(M = "e6", k = "e3")))
     [1] 1300000    5000      NA      21    4400
    

    Edit:

    A faster approach is to use baseR sub function twice:

    as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE))
    

    A quick microbenchmark check shows that this method is the fastest:

    microbenchmark::microbenchmark(
       a = as.numeric(sub("k", "e3", sub("M", "e6", bigsize,fixed = TRUE), fixed = TRUE)),
       b = as.numeric(str_replace_all(bigsize, c(M = "e6", k = "e3"))),
       rep1 = rep1(bigsize),
       rep2 = rep2(bigsize), 
       rep3 = rep3(bigsize),
       rep4 = rep4(bigsize),
       rep5 = rep5(bigsize_df), times=3)
    
    Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
        a   621.1582   638.9055   664.4689   656.6529   686.1242   715.5955     3
        b  1102.8758  1108.1215  1118.1558  1113.3673  1125.7958  1138.2244     3
     rep1  1450.3998  1478.7379  1547.1752  1507.0761  1595.5629  1684.0497     3
     rep2  6144.4160  6419.0407  8411.8940  6693.6654  9545.6329 12397.6005     3
     rep3 19224.9825 19225.2984 19427.0457 19225.6143 19528.0773 19830.5402     3
     rep4  1188.0552  1310.4584  1368.6480  1432.8616  1458.9444  1485.0273     3
     rep5  3056.1525  3177.7098  3672.9781  3299.2671  3981.3909  4663.5148     3