The result has the same p-value for every row, but the p-values differ when I calculate each row separately.
I am trying to test the difference between the baseline and endline proportions. Here is the data:
group n_Baseline n_Endline sample_Baseline sample_Endline
<chr> <int> <int> <dbl> <dbl>
1 A 164 158 305 273
2 B 89 65 131 106
3 C 59 68 118 108
4 D 52 48 90 84
5 E 141 107 224 186
I tried the following:
df$P_Values <- apply(df, 1, function(x) prop.test(x = c(df$n_Baseline, df$n_Endline), n = c(df$sample_Baseline, df$sample_Endline))$p.value)
The outcome has the same p-value for each row:
group n_Baseline n_Endline sample_Baseline sample_Endline P_Values
<chr> <int> <int> <dbl> <dbl> <dbl>
1 A 164 158 305 273 0.109
2 B 89 65 131 106 0.109
3 C 59 68 118 108 0.109
4 D 52 48 90 84 0.109
5 E 141 107 224 186 0.109
However, when I run the test separately for each row, the p-value is very different. For example, for the 1st row:
prop.test(x = c(164, 158), n = c(305, 273))
Output:
	2-sample test for equality of proportions with continuity correction

data:  c(164, 158) out of c(305, 273)
X-squared = 0.82448, df = 1, p-value = 0.3639
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.12552283  0.04342351
sample estimates:
   prop 1    prop 2
0.5377049 0.5787546
Why does this happen, and how do I get the correct p-value for each row instead of the same one?
The easiest way to do this is probably via rowwise calculations inside dplyr from the tidyverse:
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(pval = prop.test(x = c(n_Baseline, n_Endline),
                          n = c(sample_Baseline, sample_Endline))$p.value)
#> # A tibble: 5 x 6
#> # Rowwise:
#> group n_Baseline n_Endline sample_Baseline sample_Endline pval
#> <chr> <int> <int> <int> <int> <dbl>
#> 1 A 164 158 305 273 0.364
#> 2 B 89 65 131 106 0.355
#> 3 C 59 68 118 108 0.0676
#> 4 D 52 48 90 84 1.00
#> 5 E 141 107 224 186 0.310
If you want to stick to base R, then you can use apply, but your syntax for apply is not correct here. The function in apply takes each row of your data frame as a vector and calls it x. You then need to use the elements of x inside prop.test, but instead you are passing whole columns from your data frame to prop.test. Since you are passing the same ten counts on every iteration, prop.test runs one and the same test across all ten proportions each time, so you get the same (wrong) p-value in every row.
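To see why every row gets the same value, it may help to run the call from the question once, outside apply. This is only a sketch: it rebuilds the question's data with data.frame, and the name res is illustrative.

```r
# Data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  group           = c("A", "B", "C", "D", "E"),
  n_Baseline      = c(164L, 89L, 59L, 52L, 141L),
  n_Endline       = c(158L, 65L, 68L, 48L, 107L),
  sample_Baseline = c(305L, 131L, 118L, 90L, 224L),
  sample_Endline  = c(273L, 106L, 108L, 84L, 186L)
)

# This is what the original anonymous function evaluates on every iteration:
# the row argument x is never used, so the call is identical each time.
res <- prop.test(x = c(df$n_Baseline, df$n_Endline),
                 n = c(df$sample_Baseline, df$sample_Endline))

length(res$estimate)  # ten proportions are being compared at once
res$parameter         # df = 9, not the df = 1 of a two-sample test
res$p.value           # the single value that ends up in every row
```

The degrees of freedom show that this is a ten-group test of equal proportions rather than five separate baseline-versus-endline comparisons, which is presumably where the 0.109 reported in the question comes from.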
In addition, because your first column is a character vector, each row will be coerced into a character vector, so the maths won't work unless you skip the first column in your apply call by using df[-1].
The correct use of apply would be:
df$pval <- apply(df[-1], 1, \(x) prop.test(x = x[1:2], n = x[3:4])$p.value)
df
#> group n_Baseline n_Endline sample_Baseline sample_Endline pval
#> 1 A 164 158 305 273 0.36387328
#> 2 B 89 65 131 106 0.35495949
#> 3 C 59 68 118 108 0.06758474
#> 4 D 52 48 90 84 1.00000000
#> 5 E 141 107 224 186 0.30960338
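Another base R option, offered here only as a sketch, is mapply, which walks the four count columns in parallel. Because the group column is never passed in, the character coercion issue does not arise. The argument names x1, x2, n1 and n2 are arbitrary placeholders, and the data is rebuilt so the snippet runs on its own.

```r
# Data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  group           = c("A", "B", "C", "D", "E"),
  n_Baseline      = c(164L, 89L, 59L, 52L, 141L),
  n_Endline       = c(158L, 65L, 68L, 48L, 107L),
  sample_Baseline = c(305L, 131L, 118L, 90L, 224L),
  sample_Endline  = c(273L, 106L, 108L, 84L, 186L)
)

# One two-sample prop.test per row: mapply passes the i-th element of each
# column to the function, so only that row's counts reach prop.test.
df$pval <- mapply(
  function(x1, x2, n1, n2) prop.test(x = c(x1, x2), n = c(n1, n2))$p.value,
  df$n_Baseline, df$n_Endline, df$sample_Baseline, df$sample_Endline
)

df
```

This should give the same pval column as the apply version above, while referring to the columns by name instead of by position.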
Data from question in reproducible format
df <- structure(list(group = c("A", "B", "C", "D", "E"), n_Baseline = c(164L,
89L, 59L, 52L, 141L), n_Endline = c(158L, 65L, 68L, 48L, 107L
), sample_Baseline = c(305L, 131L, 118L, 90L, 224L), sample_Endline = c(273L,
106L, 108L, 84L, 186L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))