I'm operating on an R dataframe using R in the Databricks computing environment. In this environment, base R and dplyr are not compatible. If I start mixing the two, the dataframe structures change and then the code will not work. I chose to code in base R so this is what I need to stick to. I have read it will be deprecated in future releases of Databricks, but that's water under the bridge for this project.
As I am relatively new to R, I might be missing some basic principles in my coding below. After reading similar questions, I find the solutions presented use dplyr -- or do not but I cannot see the equivalency. Thank you for your help and patience!
THE QUESTION: I am trying to: 1) create a list of new dataframes subsetted from the original dataframe; and 2) calculate a within group difference for each of the new dataframes. After using "collect" on a SparkR dataframe, I now have a R dataframe within Databricks. Below is an example of my R dataframe and the code that's not working.
orig_df <- data.frame(id = c(1,2,2,2,2,3,3), period = c(1,2,2,3,3,1,3), value = c(1.2, 1.3, 2.1, 1.5, 1.7, 1.6, 2.2))
list_of_new_df <- list(1,2,3)
for(i in seq_along(list_of_new_df)) {
list_of_new_df[[i]] <- orig_df[orig_df$period==i]
}
# (I get the error "... undefined columns selected" for this syntax that
# loops over orig_df.)
# After generating the list of new dataframes based on the subsetting of the original dataframe,
# I'd like to make a calculation within them the new dataframes. The calculation is the
# difference between the last entry of "value" and the first entry of "value" for each person.
# If a person has less than two entries, the result should be NA.
>
diff_func <- func(df) {
df$diff_value <- ave(df$diff_value, df$id, FUN = function(x) {
x[length(x)] - x[1] * ((length(x)-1)/(length(x)-1))
})
return(df)
}
for(i in seq_along(list_of_new_df)) {
list_of_new_df[[i]] <- diff_func(list_of_new_df[[i]])
}
Note for the loop immediately above using seq_along, I cannot use the R function "get". I mention this because I've read this may be needed. The function does not seem to be available in the Databricks environment. Once I have the first part of this working, I could try the command "SparkR::function" to unmask the masked R function.
We can use by()
here.
by(orig_df, ~period, \(x) transform(x, diff_value = ave(value, id,
FUN = \(x) if (length(x) < 2L) NA else x[length(x)] - x[1L])))
it returns
An object of class "by", giving the results for each subset. This is always a list if simplify is false, otherwise a list or array (see tapply).
i.e.
> by(orig_df, ~period, function(x) transform(x, diff_value = ave(value, > id, FUN = \(x) if (length(x)<2L) NA else x[length(x)]-x[1L]))) period: 1 id period value diff_value 1 1 1 1.2 NA 6 3 1 1.6 NA ------------------------------------------------------------- period: 2 id period value diff_value 2 2 2 1.3 0.8 3 2 2 2.1 0.8 ------------------------------------------------------------- period: 3 id period value diff_value 4 2 3 1.5 0.2 5 2 3 1.7 0.2 7 3 3 2.2 NA
which we can re-combine to a data.frame
with
.. |>
do.call(what='rbind')
Sometimes we use array2DF()
instead.