I have a data frame with a large number of variables, and I want to compare each variable against every other variable with a t-test.
A sample of my data, called trust_news:
row | mean | polity2 | web | rsf | civil_liberties | freedom_of_expression | vdem_gov_censorship_effort | vdem_self_censorship_effort | vdem_freedom_of_expression | ciri_freedom_of_speech_and_press | media_integrity | vdem_critical_press | vdem_media_perspective | vdem_media_bias | vdem_media_corruption | vdem_media_freedom |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2.68 | 8 | 87.2661 | 25.69 | 0.785599008 | 0.758906967 | 0.731895466 | 0.742219428 | 1 | 1 | 0.81449235 | 0.889046047 | 0.782079459 | 0.693825991 | 0.733503755 | 1 |
2 | 2.8 | 8 | 94.8967 | 22.23 | 0.810742702 | 0.832891911 | 0.8447733 | 0.831499528 | 1 | 1 | 0.88417386 | 0.868772592 | 0.881994928 | 0.835622928 | 0.828566864 | 1 |
3 | 3.22 | 10 | 89.7391 | 14.6 | 0.821268417 | 0.83327835 | 0.883343829 | 0.805721471 | 1 | 1 | 0.829951651 | 0.917491749 | 0.725950972 | 0.709774199 | 0.874261064 | 1 |
5 | 2.96 | 10 | 74.3872 | 24.98 | 0.813949794 | 0.781986225 | 0.844615869 | 0.729330399 | 0.666666667 | 0.5 | 0.878769429 | 0.872387239 | 0.919019442 | 0.841939049 | 0.810193322 | 0.5 |
Then, I run this code on it:
trust_news_combos <- combn(trust_news, 1, t.test, simplify = TRUE)
First off, is the code correct? I have no clue what to put for m in the combn() function. Anyway, that line gives me this:
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | c(t = 85.1670166474227) | c(t = 15.9614095646055) | c(t = 29.2365516170159) | c(t = 11.0778062107689) | c(t = 30.4673329981756) | c(t = 26.8521522144486) | c(t = 23.160185720972) | c(t = 25.1063414199952) | c(t = 17.1830959329723) | c(t = 11.06502519693) | c(t = 33.0841916129404) | c(t = 29.3707961673045) | c(t = 31.2455551028106) | c(t = 39.1490231250879) | c(t = 27.6089179039943) | c(t = 14.0719508946058) | ||||||||||||||||
2 | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | c(df = 32) | ||||||||||||||||
3 | 2.69E-39 | 8.55E-17 | 1.18E-24 | 1.75E-12 | 3.29E-25 | 1.61E-23 | 1.46E-21 | 1.26E-22 | 1.03E-17 | 1.80E-12 | 2.55E-26 | 1.02E-24 | 1.51E-25 | 1.32E-28 | 6.88E-24 | 2.96E-15 | ||||||||||||||||
4 | c(3.00189912275063 | 3.14900996815846) | c(7.56066019283154 | 9.77267314050179) | c(73.5097801046279 | 84.5198259559781) | c(19.628297122971 | 28.4729149982411) | c(0.682586494865725 | 0.780396107679729) | c(0.639468676034051 | 0.744449016935646) | c(0.664192511270674 | 0.792289818305084) | c(0.665160025455844 | 0.782621785210823) | c(0.676674167771883 | 0.858679367682662) | c(0.543941635486123 | 0.78939169784721) | c(0.739756992152986 | 0.836824222392469) | c(0.730937293702635 | 0.839876930600395) | c(0.729509614919607 | 0.831257822777363) | c(0.709894349786553 | 0.787820841122538) | c(0.708427672557418 | 0.821287114048642) | c(0.647915673315896 | 0.867235841835619) |
5 | c(mean of x = 3.07545454545455) |
c(mean of x = 8.66666666666667) |
c(mean of x = 79.014803030303) |
c(mean of x = 24.0506060606061) |
c(mean of x = 0.731491301272727) |
c(mean of x = 0.691958846484849) |
c(mean of x = 0.728241164787879) |
c(mean of x = 0.723890905333333) |
c(mean of x = 0.767676767727273) |
c(mean of x = 0.666666666666667) |
c(mean of x = 0.788290607272727) |
c(mean of x = 0.785407112151515) |
c(mean of x = 0.780383718848485) |
c(mean of x = 0.748857595454545) |
c(mean of x = 0.76485739330303) |
c(mean of x = 0.757575757575758) |
||||||||||||||||
6 | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | c(mean = 0) | ||||||||||||||||
7 | 0.036110864 | 0.542976272 | 2.702603374 | 2.171062176 | 0.024009036 | 0.025769214 | 0.031443667 | 0.028832991 | 0.044676278 | 0.0602499 | 0.023826806 | 0.02674109 | 0.024975831 | 0.019128385 | 0.027703273 | 0.053835873 | ||||||||||||||||
8 | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | two.sided | ||||||||||||||||
9 | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | One Sample t-test | ||||||||||||||||
10 | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] | x[a] |
It gives me the p-values I'm looking for in row 3, but how do I check which two columns are being checked?
Any help is appreciated and will be thanked in my final code!
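The m argument of combn() is the number of elements in each combination, so for pairs of columns you want m = 2. With m = 1, each column is tested on its own, which is why your output reports a One Sample t-test against a mean of 0 for every column. You should write a small function to compute exactly what you need, and use it instead of the standard function t.test. For example: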
# get four column names
cols <- names(mtcars)[1:4] # use trust_news instead of mtcars, and keep all the names
# compute the pval for a pair of names
pval <- function(pair) {
value <- t.test(mtcars[, pair[1]], mtcars[, pair[2]])$p.value
names(value) <- paste(pair, collapse = " vs. ")
value
}
# do it for all pairs. Don't simplify, and it will keep the names
combn(cols, 2, pval, simplify = FALSE)
#> [[1]]
#> mpg vs. cyl
#> 9.507708e-15
#>
#> [[2]]
#> mpg vs. disp
#> 7.978234e-11
#>
#> [[3]]
#> mpg vs. hp
#> 1.030354e-11
#>
#> [[4]]
#> cyl vs. disp
#> 1.774454e-11
#>
#> [[5]]
#> cyl vs. hp
#> 8.321996e-13
#>
#> [[6]]
#> disp vs. hp
#> 0.001545647
Created on 2021-05-22 by the reprex package (v2.0.0)
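If you would rather have one table than a list, the same idea can be collected into a data frame with the two column names next to each p-value. This is only a minimal sketch: pairwise_pvals is a hypothetical helper name (not part of base R or the code above), and mtcars again stands in for trust_news.

```r
# Hypothetical helper (illustration only): Welch two-sample t-test for
# every pair of the given columns, returned as one row per pair.
pairwise_pvals <- function(df, cols = names(df)) {
  # combn() on a character vector gives a 2-row matrix, one pair per column
  pairs <- combn(cols, 2)
  pvals <- apply(pairs, 2, function(pair) {
    t.test(df[[pair[1]]], df[[pair[2]]])$p.value
  })
  data.frame(
    var1 = pairs[1, ],
    var2 = pairs[2, ],
    p_value = pvals,
    stringsAsFactors = FALSE
  )
}

# use trust_news instead of mtcars, and drop the cols argument to keep all names
res <- pairwise_pvals(mtcars, names(mtcars)[1:4])
res[order(res$p_value), ]
```

Each row then answers "which two columns were compared" directly, and the result can be sorted or filtered like any other data frame.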