rperformanceprocessing-efficiency

Looking for a more efficient way to implement this R code if possible


I have the following R code that works but it's quite slow. I want to create a new column based on the values of an existing column in a R dataframe. But there's a catch/complication, I need to access and change a Global environment variable which holds a comparison value across observations. I accomplish this using APPLY with a function on the rows of the dataframe. The function can write and read the external variable. This works but is slow. Is there any way of speeding up the process?

The value of drug in the first observation is the start BASE_VALUE. What I'm trying to do is label the observations (drug value) which are different from the BASE_VALUE and are not substrings of the BASE_VALUE. The drug value of this observation then become the current BASE_VALUE and the process continues. In the example below, the 2nd "applesauce" value which is in the 3rd row of the dataset should NOT be flagged, as it appeared in the 1st line. This illustrates that I need to somehow store the value of the 1st row and be able to compare it with the 3rd row. That's why using the lagged value doesn't work and why I have the BASE_VALUE variable. Indeed I originally tried to use the lagged value till I realised this.

Reproducible code is below:

base_value <- ""
char_vector <- c("applesauce", "apple", "applesauce", "orange", "orange", "banana", "applepie") 

#change char_vector to a dataframe, the lastone column isn't completely necessary
df = data.frame(drug = char_vector) %>% 
mutate(lastone = lag(drug))

test_func <- function(row, output){
if (is.na(row[2])){
   #this is the first observation - set the drug value as BASE_VALUE
   base_value <<- row[1]
   return("Y")
 }else if (!is.na(row[2]) & row[1] != base_value & !grepl(row[1], base_value, fixed = TRUE)) {
   base_value <<- row[1]
   return("Y")
 }else {
   return("N")
 }
}


switches <- apply(df, 1, test_func)
cbind(df, switches = switches)

Tried above and it works. But would like to speed it up


Solution

  • You can use accumulate from the purrr pacakge to keep track of the base_values, and then check for changes in it:

    library(purrr)
    library(dplyr)
    char_vector <- c("applesauce", "apple", "applesauce", "orange", "orange", "banana", "applepie")
        
    new_base <- function(old_base, value) {
      if (grepl(value, old_base, fixed = TRUE)) {
        return(old_base)
      } else {
        return(value)
      }
    }
    
    tibble(X = char_vector) %>%
    mutate(base_value = accumulate(X, new_base),
           changed = ifelse(base_value != lag(base_value, default = ""),
                            "Y", "N"))
    
    X          base_value   changed
    applesauce applesauce   Y
    apple      applesauce   N
    applesauce applesauce   N
    orange     orange       Y
    orange     orange       N
    banana     banana       Y
    applepie   applepie     Y
    

    With accumulate the value returned by the function is fed into the next iteration as the first argument, and is... well, accumulated, into the final result, which will be returned at the end of the recursion. Just like cumsum, e.g.