rdplyrpercentage

Subtract values from following rows and calculate percent change


I have a dataframe where I wish to generate an additional column with the value of the percent change calculated as the difference from the former row divided by the original value, then multiplied by 100.

I can get the difference, but I'm unsure on how to subtract the original value at the first step in the question, and then follow up with the multiplication (the first part of the subtraction).

This is the dataframe:

gens_avg <- structure(list(hap = c("hap1", "hap1", "hap1", "hap1", "hap2", 
"hap2", "hap2", "hap2"), gen = structure(c(1L, 2L, 3L, 4L, 1L, 
2L, 3L, 4L), levels = c("G1", "G2", "G3", "G4"), class = "factor"), 
    sum = c(1924959, 895814, 2344706, 2873530, 2165959, 845385, 
    2431160, 2189315), `# of ind` = c(1720L, 860L, 3010L, 4300L, 
    1720L, 860L, 3010L, 4300L), avg = c(1119.16220930233, 1041.64418604651, 
    778.972093023256, 668.262790697674, 1259.27848837209, 983.005813953488, 
    807.694352159468, 509.143023255814)), row.names = c(NA, -8L
), class = "data.frame")

and the code I'm using to calculate differences from the previous row by haplotype:

gens_avg %>%
  group_by(hap) %>%
  mutate(`% change` = avg - lag(avg, default = avg[1]))

Intended result:

    hap gen  sum # of ind   avg % change    desired output
    <chr>   <fct>   <dbl>   <int>   <dbl>   <dbl>   <dbl>
1   hap1    G1  1924959 1720    1119    0   0
2   hap1    G2  895814  860 1042    -77.5   -6.881143878
3   hap1    G3  2344706 3010    779 -263    -25.23992322
4   hap1    G4  2873530 4300    668 -111    -14.24903723
5   hap2    G1  2165959 1720    1259    0   0
6   hap2    G2  845385  860 983 -276    -21.92216044
7   hap2    G3  2431160 3010    808 -175    -17.80264496
8   hap2    G4  2189315 4300    509 -299    -37.0049505

Solution

  • This gives the desired result

    library(dplyr)
    gens_avg %>%
      mutate(`% change` = 100 * (avg / lag(avg, default = first(avg)) - 1), .by = hap)
    
    ##    hap gen     sum # of ind       avg   % change
    ## 1 hap1  G1 1924959     1720 1119.1622   0.000000
    ## 2 hap1  G2  895814      860 1041.6442  -6.926433
    ## 3 hap1  G3 2344706     3010  778.9721 -25.217065
    ## 4 hap1  G4 2873530     4300  668.2628 -14.212230
    ##5 hap2  G1 2165959     1720 1259.2785   0.000000
    ## 6 hap2  G2  845385      860  983.0058 -21.938966
    ## 7 hap2  G3 2431160     3010  807.6944 -17.834224
    

    However, it might make more sense to use NA rather than 0.

    > library(dplyr)
    > gens_avg %>%
    +   mutate(`% change` = 100 * (avg / lag(avg) - 1), .by = hap)
       hap gen     sum # of ind       avg   % change
    1 hap1  G1 1924959     1720 1119.1622         NA
    2 hap1  G2  895814      860 1041.6442  -6.926433
    3 hap1  G3 2344706     3010  778.9721 -25.217065
    4 hap1  G4 2873530     4300  668.2628 -14.212230
    5 hap2  G1 2165959     1720 1259.2785         NA
    6 hap2  G2  845385      860  983.0058 -21.938966
    7 hap2  G3 2431160     3010  807.6944 -17.834224
    8 hap2  G4 2189315     4300  509.1430 -36.963404