I have a dataframe where I wish to generate an additional column with the value of the percent change calculated as the difference from the former row divided by the original value, then multiplied by 100.
I can get the difference, but I'm unsure on how to subtract the original value at the first step in the question, and then follow up with the multiplication (the first part of the subtraction).
This is the dataframe:
gens_avg <- structure(list(hap = c("hap1", "hap1", "hap1", "hap1", "hap2",
"hap2", "hap2", "hap2"), gen = structure(c(1L, 2L, 3L, 4L, 1L,
2L, 3L, 4L), levels = c("G1", "G2", "G3", "G4"), class = "factor"),
sum = c(1924959, 895814, 2344706, 2873530, 2165959, 845385,
2431160, 2189315), `# of ind` = c(1720L, 860L, 3010L, 4300L,
1720L, 860L, 3010L, 4300L), avg = c(1119.16220930233, 1041.64418604651,
778.972093023256, 668.262790697674, 1259.27848837209, 983.005813953488,
807.694352159468, 509.143023255814)), row.names = c(NA, -8L
), class = "data.frame")
and the code I'm using to calculate differences from the previous row by haplotype:
gens_avg %>%
group_by(hap) %>%
mutate(`% change` = avg - lag(avg, default = avg[1]))
Intended result:
hap gen sum # of ind avg % change desired output
<chr> <fct> <dbl> <int> <dbl> <dbl> <dbl>
1 hap1 G1 1924959 1720 1119 0 0
2 hap1 G2 895814 860 1042 -77.5 -6.881143878
3 hap1 G3 2344706 3010 779 -263 -25.23992322
4 hap1 G4 2873530 4300 668 -111 -14.24903723
5 hap2 G1 2165959 1720 1259 0 0
6 hap2 G2 845385 860 983 -276 -21.92216044
7 hap2 G3 2431160 3010 808 -175 -17.80264496
8 hap2 G4 2189315 4300 509 -299 -37.0049505
This gives the desired result
library(dplyr)
gens_avg %>%
mutate(`% change` = 100 * (avg / lag(avg, default = first(avg)) - 1), .by = hap)
## hap gen sum # of ind avg % change
## 1 hap1 G1 1924959 1720 1119.1622 0.000000
## 2 hap1 G2 895814 860 1041.6442 -6.926433
## 3 hap1 G3 2344706 3010 778.9721 -25.217065
## 4 hap1 G4 2873530 4300 668.2628 -14.212230
##5 hap2 G1 2165959 1720 1259.2785 0.000000
## 6 hap2 G2 845385 860 983.0058 -21.938966
## 7 hap2 G3 2431160 3010 807.6944 -17.834224
However, it might make more sense to use NA rather than 0.
> library(dplyr)
> gens_avg %>%
+ mutate(`% change` = 100 * (avg / lag(avg) - 1), .by = hap)
hap gen sum # of ind avg % change
1 hap1 G1 1924959 1720 1119.1622 NA
2 hap1 G2 895814 860 1041.6442 -6.926433
3 hap1 G3 2344706 3010 778.9721 -25.217065
4 hap1 G4 2873530 4300 668.2628 -14.212230
5 hap2 G1 2165959 1720 1259.2785 NA
6 hap2 G2 845385 860 983.0058 -21.938966
7 hap2 G3 2431160 3010 807.6944 -17.834224
8 hap2 G4 2189315 4300 509.1430 -36.963404