rdataframeloopsdatabrickssubset

Using loops in base R to subset a dataframe and generate a new variable across a list of new dataframes within Databricks environment


I'm operating on an R dataframe using R in the Databricks computing environment. In this environment, base R and dplyr are not compatible. If I start mixing the two, the dataframe structures change and then the code will not work. I chose to code in base R so this is what I need to stick to. I have read it will be deprecated in future releases of Databricks, but that's water under the bridge for this project.

As I am relatively new to R, I might be missing some basic principles in my coding below. After reading similar questions, I find the solutions presented use dplyr -- or do not but I cannot see the equivalency. Thank you for your help and patience!

THE QUESTION: I am trying to: 1) create a list of new dataframes subsetted from the original dataframe; and 2) calculate a within group difference for each of the new dataframes. After using "collect" on a SparkR dataframe, I now have a R dataframe within Databricks. Below is an example of my R dataframe and the code that's not working.

orig_df <- data.frame(id = c(1,2,2,2,2,3,3), period = c(1,2,2,3,3,1,3), value = c(1.2, 1.3, 2.1, 1.5, 1.7, 1.6, 2.2))
list_of_new_df <- list(1,2,3)

for(i in seq_along(list_of_new_df)) {
    list_of_new_df[[i]] <- orig_df[orig_df$period==i]
} 

# (I get the error "... undefined columns selected" for this syntax that
#  loops over orig_df.)

# After generating the list of new dataframes based on the subsetting of the original dataframe,
# I'd like to make a calculation within them the new dataframes. The calculation is the
# difference between the last entry of "value" and the first entry of "value" for each person. 
# If a person has less than two entries, the result should be NA.
>
diff_func <- func(df) {
    df$diff_value <- ave(df$diff_value, df$id, FUN = function(x) {
    x[length(x)] - x[1] * ((length(x)-1)/(length(x)-1))
    })
    return(df)
}

for(i in seq_along(list_of_new_df)) {
    list_of_new_df[[i]] <- diff_func(list_of_new_df[[i]])
}

Note for the loop immediately above using seq_along, I cannot use the R function "get". I mention this because I've read this may be needed. The function does not seem to be available in the Databricks environment. Once I have the first part of this working, I could try the command "SparkR::function" to unmask the masked R function.


Solution

  • We can use by() here.

    by(orig_df, ~period,  \(x) transform(x, diff_value = ave(value, id, 
     FUN = \(x) if (length(x) < 2L) NA else x[length(x)] - x[1L]))) 
    

    it returns

    An object of class "by", giving the results for each subset. This is always a list if simplify is false, otherwise a list or array (see tapply).

    i.e.

    > by(orig_df, ~period, function(x) transform(x, diff_value = ave(value, 
    >    id, FUN = \(x) if (length(x)<2L) NA else x[length(x)]-x[1L])))       
    period: 1
     id period value diff_value
    1  1      1   1.2         NA
    6  3      1   1.6         NA
    ------------------------------------------------------------- 
    period: 2
     id period value diff_value
    2  2      2   1.3        0.8
    3  2      2   2.1        0.8
    ------------------------------------------------------------- 
    period: 3
     id period value diff_value
    4  2      3   1.5        0.2
    5  2      3   1.7        0.2
    7  3      3   2.2         NA
    

    which we can re-combine to a data.frame with

    .. |>
     do.call(what='rbind') 
    

    Sometimes we use array2DF() instead.