Tags: r, iteration

Iteratively decrease values in a grouped dataset, without changing the first row of each group, using group_map and returning a tibble


I am attempting to decrease the values of the value column by 0.000001 for observations that are not in the first row of each group, storing the result in a new column called lagged.values. I then want to fill the NAs resulting from the lag computation with the original values for the first rows.

Example Data:

library(dplyr)

test = 
  tibble(
    problems = c("money", "money", "money", "food", "food", "bills", "bills"),
    category_problems = c("financial insecurity", "financial insecurity", "financial insecurity", "cost of living", "cost of living", "financial insecurity", "financial insecurity"),
    value = c(3, 3, 3, 2, 2, 1, 1)
  )

Creation of Function:

lag.values = function(x) {
  if_else(row_number(x) != 1,
          lag(x) - 0.000001,
          x)
}

Attempt:

test |>
  mutate(lagged.values = value) |>
  group_by(value) |>
  group_map(~lag.values(.x$lagged.values))

Output:

[image: Output]

Desired Output:

[image: Desired Output]


Solution

  • Up Front: I should note that grouping on value assumes perfect equality, which is subject to floating-point issues, as discussed in Why are these numbers not equal?, Is floating-point math broken?, and https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f. Two numbers may look the same on the console, or be equal mathematically, and yet differ slightly because of how floating-point numbers are stored in digital computers. This affects base R, dplyr, data.table, Python, Julia, ... anything that uses standard floating-point storage. Arbitrary-precision libraries handle this much better, though they are less common. A quick illustration follows.
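
    A quick illustration (my own example, not part of the original answer): doubles cannot represent most decimal fractions exactly, so exact equality can fail even when the numbers are mathematically equal, and all.equal() compares with a tolerance instead.

    x <- 0.1 + 0.2
    x == 0.3                    # FALSE: 0.1 and 0.2 have no exact binary representation
    isTRUE(all.equal(x, 0.3))   # TRUE: tolerance-based comparison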


    No need for a function:

    library(dplyr)
    test |>
      mutate(.by = value, value = value - 0.000001 * (row_number() - 1)) |>
      as.data.frame()
    #   problems    category_problems    value
    # 1    money financial insecurity 3.000000
    # 2    money financial insecurity 2.999999
    # 3    money financial insecurity 2.999998
    # 4     food       cost of living 2.000000
    # 5     food       cost of living 1.999999
    # 6    bills financial insecurity 1.000000
    # 7    bills financial insecurity 0.999999
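
    If you would rather keep value unchanged and put the adjusted values in the lagged.values column named in the question, the same expression can simply target a new column (a small adaptation on my part, not part of the original answer):

    test |>
      mutate(.by = value, lagged.values = value - 0.000001 * (row_number() - 1)) |>
      as.data.frame()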
    

    The |> as.data.frame() is merely to circumvent tibble's tendency to hide some of the precision when printing; it is not required for anything else.
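
    An alternative (my addition, assuming a reasonably recent tibble/pillar) is to raise the number of significant digits that tibbles print via the pillar.sigfig option, so the tibble itself shows the difference:

    options(pillar.sigfig = 7)   # print up to 7 significant digits in tibbles
    test |>
      mutate(.by = value, value = value - 0.000001 * (row_number() - 1))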

    You asked about dplyr and purrr (a purrr-flavoured sketch is included after the data.table example below), but for other alternatives:

    ### base R
    ave(test$value, test$value, FUN = \(z) z - 0.000001 * (seq_along(z)-1))
    # [1] 3.000000 2.999999 2.999998 2.000000 1.999999 1.000000 0.999999
    ### assign back into `test$value`
    
    ### data.table
    library(data.table)
    as.data.table(test)[, value := value - 0.000001 * (seq_len(.N) - 1), value][]
    #    problems    category_problems    value
    #      <char>               <char>    <num>
    # 1:    money financial insecurity 3.000000
    # 2:    money financial insecurity 2.999999
    # 3:    money financial insecurity 2.999998
    # 4:     food       cost of living 2.000000
    # 5:     food       cost of living 1.999999
    # 6:    bills financial insecurity 1.000000
    # 7:    bills financial insecurity 0.999999
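
    Since the question also mentioned purrr, here is a purrr-flavoured sketch (my addition, not part of the original answer; list_rbind() needs purrr >= 1.0, and bind_rows() works in its place on older versions):

    ### purrr: split by value, adjust each piece, re-combine
    ### note: group_split() arranges the pieces by `value`, so the row order changes
    library(purrr)
    test |>
      group_split(value) |>
      map(~ mutate(.x, lagged.values = value - 0.000001 * (row_number() - 1))) |>
      list_rbind()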
    

    Data

    test <- structure(list(problems = c("money", "money", "money", "food", "food", "bills", "bills"), category_problems = c("financial insecurity", "financial insecurity", "financial insecurity", "cost of living", "cost of living", "financial insecurity", "financial insecurity"), value = c(3, 3, 3, 2, 2, 1, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))