dplyracross

Calculate lags across neighboring column with many columns in dplyr


Start with this code:

set.seed(0)

the_df <- tibble(date=seq.Date(ymd('20230101'),ymd('20230101')+days(9), by='days'),
                 lead1=rep(1:10),
                 lead2=runif(10),
                 lead3=runif(10),
                 lead4=runif(10)) 

the_df %>% 
  mutate(lag2=lead2-dplyr::lag(lead1,1),
         lag3=lead3-dplyr::lag(lead2,1),
         lag4=lead4-dplyr::lag(lead3,1),)

The output will look like:

enter image description here

Now imagine a tibble with hundreds of columns. How can this lagged difference from one column the next be replicated in vector form for all columns?

I.e., if the tibble has lead1 to lead 100, the result would be lag2 to lag100.

The statement

the_df %>%
  mutate(across(lead1:lead4, ~ dplyr::lag(.x,1), .names="d_{.col}"))

calculates the lag of each column within the column and creates a new column of data with that result. How do I calculate the difference between two columns within the across function as in the simple four-column example but for all columns?


Solution

  • Update: a one-liner:

    the_df |> mutate(across(lead2:last_col(), ~ .x - lag(get(names(the_df)[which(cur_column() == names(the_df)) - 1])), .names = "lag_{.col}"))
    

    Original:

    Make the data long, then get the name of the previous column::

    the_df |>
      tidyr::pivot_longer(-date, names_to='lead')
      dplyr::mutate(prev_col = lag(lead)) -> df
    

    Left join this intermediate df with itself, getting the previous column's value as a new column. Then calculate the value minus the previous column's lagged value. Then remove the superfluous columns

    df |>
      dplyr::left_join(df, by = c("date", "prev_col" = "lead")) |>
      dplyr::mutate(value = value.x - lag(value.y), .by = lead) |>
      dplyr::select(date, lead, value) -> output
    

    To make the data wide again:

    # make the lag data wide
    tidyr::pivot_wider(output, names_from = lead, names_glue = "{paste0('lag', lead)}", values_from = value, id_cols = date) |>
    # remove the useless lag1 column
      select(-lag1) |>
      # join it with the lead data
      full_join(the_df) |>
    # then get the columns in the right order
      select(date, starts_with("lead"), starts_with("lag"))
    

    Output:

    # A tibble: 10 × 8
       date       lead1  lead2   lead3   lead4    lag2     lag3   lag4
       <date>     <int>  <dbl>   <dbl>   <dbl>   <dbl>    <dbl>  <dbl>
     1 2023-01-01     1 0.0271 0.973   0.908   NA      NA       NA    
     2 2023-01-02     2 0.985  0.520   0.603   -0.0149  0.493   -0.369
     3 2023-01-03     3 0.838  0.0891  0.0151  -1.16   -0.896   -0.505
     4 2023-01-04     4 0.199  0.00293 0.776   -2.80   -0.835    0.687
     5 2023-01-05     5 0.891  0.846   0.283   -3.11    0.647    0.280
     6 2023-01-06     6 0.337  0.902   0.272   -4.66    0.0110  -0.574
     7 2023-01-07     7 0.787  0.939   0.303   -5.21    0.602   -0.599
     8 2023-01-08     8 0.747  0.159   0.133   -6.25   -0.628   -0.805
     9 2023-01-09     9 0.704  0.753   0.00196 -7.30    0.00534 -0.157
    10 2023-01-10    10 0.908  0.976   0.876   -8.09    0.272    0.123