r proportions z-test

How to run a two-proportion prop.test for each row of a data frame?


The result shows the same p-value for every row, but the p-values differ when I calculate each row separately.

I am trying to test the difference between the baseline and endline proportions. Here is the data:

  group n_Baseline n_Endline sample_Baseline sample_Endline
  <chr>      <int>     <int>           <dbl>          <dbl>
1 A            164       158             305            273
2 B             89        65             131            106
3 C             59        68             118            108
4 D             52        48              90             84
5 E            141       107             224            186

I tried the following:

df$P_Values <- apply(df, 1, function(x) prop.test(x = c(df$n_Baseline, df$n_Endline), n = c(df$sample_Baseline, df$sample_Endline))$p.value)

The outcome has the same p-value for each row:

  group n_Baseline n_Endline sample_Baseline sample_Endline P_Values
  <chr>      <int>     <int>           <dbl>          <dbl>    <dbl>
1 A            164       158             305            273    0.109
2 B             89        65             131            106    0.109
3 C             59        68             118            108    0.109
4 D             52        48              90             84    0.109
5 E            141       107             224            186    0.109

However, when I run the test separately for each row, the p-value is very different. For example, for the 1st row:

prop.test(x = c(164, 158), n = c(305, 273))

Output:

	2-sample test for equality of proportions with continuity correction

data:  c(164, 158) out of c(305, 273)
X-squared = 0.82448, df = 1, p-value = 0.3639
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.12552283  0.04342351
sample estimates:
   prop 1    prop 2 
0.5377049 0.5787546 

Why does this happen, and how do I get the correct p-value for each row instead of the same one?


Solution

  • The easiest way to do this is probably a rowwise calculation with dplyr from the tidyverse:

    library(tidyverse)
    
    df %>%
      rowwise() %>%
      mutate(pval = prop.test(x = c(n_Baseline, n_Endline), 
                              n = c(sample_Baseline, sample_Endline))$p.value)
    #> # A tibble: 5 x 6
    #> # Rowwise: 
    #>   group n_Baseline n_Endline sample_Baseline sample_Endline   pval
    #>   <chr>      <int>     <int>           <int>          <int>  <dbl>
    #> 1 A            164       158             305            273 0.364 
    #> 2 B             89        65             131            106 0.355 
    #> 3 C             59        68             118            108 0.0676
    #> 4 D             52        48              90             84 1.00  
    #> 5 E            141       107             224            186 0.310 
    

    If you want to stick to base R, you can use apply, but your apply call is not correct. apply passes each row of the data frame to the function as a vector, here named x. You need to use the elements of x inside prop.test, but your function never uses x and instead passes whole columns of the data frame to prop.test. Since every call receives the same ten counts, every row gets the same (wrong) p-value.

    In addition, because your first column is a character vector, apply coerces each row to a character vector, so the maths won't work unless you drop the first column by passing df[-1] to apply.
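Both points can be checked directly. The following is a minimal sketch rather than part of the fix itself: it rebuilds the question's data with data.frame(), and the names pooled, row_classes and row_classes_numeric are only illustrative.

```r
# Data as given in the question
df <- data.frame(
  group           = c("A", "B", "C", "D", "E"),
  n_Baseline      = c(164L, 89L, 59L, 52L, 141L),
  n_Endline       = c(158L, 65L, 68L, 48L, 107L),
  sample_Baseline = c(305L, 131L, 118L, 90L, 224L),
  sample_Endline  = c(273L, 106L, 108L, 84L, 186L)
)

# 1. The call from the question does not depend on the row at all.
#    It passes all ten counts at once, so prop.test() runs a single
#    10-group test of equal proportions.
pooled <- prop.test(x = c(df$n_Baseline, df$n_Endline),
                    n = c(df$sample_Baseline, df$sample_Endline))
pooled$parameter  # degrees of freedom: 9 (ten groups minus one)
pooled$p.value    # the one value that was repeated on every row

# 2. apply() converts the data frame to a matrix first. With the
#    character column included, every row is a character vector.
row_classes <- apply(df, 1, class)
row_classes

# Dropping the first column keeps each row numeric.
row_classes_numeric <- apply(df[-1], 1, class)
row_classes_numeric
```

So the repeated 0.109 is a valid p-value, just for a different question: whether all ten proportions are equal.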

    The correct use of apply would be:

    df$pval <- apply(df[-1], 1, \(x) prop.test(x = x[1:2], n = x[3:4])$p.value)
    
    df
    #>   group n_Baseline n_Endline sample_Baseline sample_Endline       pval
    #> 1     A        164       158             305            273 0.36387328
    #> 2     B         89        65             131            106 0.35495949
    #> 3     C         59        68             118            108 0.06758474
    #> 4     D         52        48              90             84 1.00000000
    #> 5     E        141       107             224            186 0.30960338
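
Another base R option, sketched here as an alternative and not taken from the approach above, is mapply. It walks over the four numeric columns in parallel, so the character column is never involved and nothing is coerced. The helper row_pval and the column name pval_mapply are hypothetical names chosen for this example.

```r
# Data as given in the question
df <- data.frame(
  group           = c("A", "B", "C", "D", "E"),
  n_Baseline      = c(164L, 89L, 59L, 52L, 141L),
  n_Endline       = c(158L, 65L, 68L, 48L, 107L),
  sample_Baseline = c(305L, 131L, 118L, 90L, 224L),
  sample_Endline  = c(273L, 106L, 108L, 84L, 186L)
)

# Hypothetical helper: p-value of a two-sample proportion test
# for one row's success counts (x1, x2) and sample sizes (n1, n2).
row_pval <- function(x1, x2, n1, n2) {
  prop.test(x = c(x1, x2), n = c(n1, n2))$p.value
}

# mapply() calls row_pval once per row, taking the i-th element
# of each column as the arguments of the i-th call.
df$pval_mapply <- mapply(row_pval,
                         df$n_Baseline, df$n_Endline,
                         df$sample_Baseline, df$sample_Endline)

df
```

Because the columns are referenced by name, this version does not depend on column order the way x[1:2] and x[3:4] do.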
    

    Data from question in reproducible format

    df <- structure(list(group = c("A", "B", "C", "D", "E"), n_Baseline = c(164L, 
    89L, 59L, 52L, 141L), n_Endline = c(158L, 65L, 68L, 48L, 107L
    ), sample_Baseline = c(305L, 131L, 118L, 90L, 224L), sample_Endline = c(273L, 
    106L, 108L, 84L, 186L)), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5"))