The data for this question can be found here:
Here is the original data: https://github.com/cjy8s/data/blob/c9605876625b3aec8acb949c8bb0b6b4be3a8c41/tab_cond_Loc_phases.csv
Here is the t.test data, which is the data available right before the problem occurs: https://github.com/cjy8s/data/blob/e8e79fdc62d3d8b36a9fd842c0b0dcfa731ec2e1/ttest_compList_2.csv
I am trying to run t-tests on the conditions in the data.table tab_cond_Loc_phases. My current dataset has 82 experimental conditions, or about 3300 comparison pairs when I use combn. The t-tests work, and I store the results in a data.table called ttest_compList_2. I can add a few extra columns successfully, but when I try to filter for only the rows with values < 0.05 in the column p_value, the new data.table (named ttest_compList for problem-solving's sake) comes back with no rows and no error message.
If I use select instead of filter, then I get this error:
Error in `select()`:
! Problem while evaluating `p_value < 0.05`.
Caused by error:
! object 'p_value' not found
This doesn't seem to happen when I use a dataset for tab_cond_Loc_phases that has fewer conditions to compare. I'm not sure why my code is unable to see the p_value column of ttest_compList_2 here.
I'm sure there is a better way to do my t-tests, but this has worked so far. I would also welcome any feedback on my general approach.
Here is my MRE:
library(data.table)
library(dplyr)
controls <- c("WT+DMSO", "MUT+DMSO")
#Get the unique names of the rows in a column
condition_vec <- unique(tab_cond_Loc_phases$condition)
#get a list of all possible combinations of conditions, without duplication or replicates
col_vec <- combn(condition_vec, 2, FUN = paste)
#for combinations of condition averages to be compared with t.tests over time, grouped by condition
con_tab_2 <- list()
for (comparison in 1:ncol(col_vec)) {
  #Loop through the col_vec combinations and use each pairing as arguments for t.test comparisons
  tmp_ttest_2 <- t.test(tab_cond_Loc_phases[condition == col_vec[1, comparison], exp_sums],
                        tab_cond_Loc_phases[condition == col_vec[2, comparison], exp_sums])
  #Additional columns describing the t.tests, to be added to a data.table.
  #Each value of res_tab_2 represents one t.test comparison
  res_tab_2 <- data.table(
    condition1 = combn(condition_vec, 2)[1, comparison],
    condition2 = combn(condition_vec, 2)[2, comparison],
    t_statistic = tmp_ttest_2$statistic,
    df = tmp_ttest_2$parameter,
    p_value = tmp_ttest_2$p.value,
    mean_cond1 = tmp_ttest_2$estimate[1],
    mean_cond2 = tmp_ttest_2$estimate[2],
    method = tmp_ttest_2$method
  )
  #Add the row of t.test data from res_tab_2 from the current iteration to the growing list of lists
  #These will be added together to make one data.table
  con_tab_2[[comparison]] <- rbind(res_tab_2)
  print(paste('t.test comparison group ', comparison, '/', ncol(col_vec)))
}
#Bind all of the lists within con_tab_2 together to make one data.table, for easier referencing later
ttest_compList_2 <- rbindlist(con_tab_2)
#This filters the comparisons that contain at least one of the controls and only keeps the statistically significant comparisons
ttest_compList <- ttest_compList_2 %>%
  mutate(pair = as.numeric(factor(1:nrow(ttest_compList_2))),
         xmin = pair - 0.2,
         xmax = pair + 0.2) %>%
  dplyr::filter(p_value < 0.05,
                grepl(paste(controls, collapse = "|"), condition1) |
                  grepl(paste(controls, collapse = "|"), condition2))
UPDATE: For anyone who may be interested, I wrote a faster way to run the t-tests, prompted by r2evans's comment. On my computer, the previous version, updated with theN's answer, runs in about 33 seconds. The new version runs in about 12 seconds.
library(tidyverse)
library(data.table)
#Note to self: consider restructuring the t.test loop (more like the other one) so that
#rbind and rbindlist are not needed
#Get subsets of the data instead of just one row of the data at a time
controls <- c("WT+DMSO", "MUT+DMSO")
#Get the unique names of the rows in a column
condition_vec <- unique(tab_cond_Loc_phases$condition)
# Create an empty data.table to store the results
ttest_results <- data.table()
#See how many times each condition occurs, so that you don't get errors if any condition appears only once
occurances <- tab_cond_Loc_phases[, list(replicates = .N), by = condition]
#allow the ttests if there are more than 1 replicate of each condition
if (min(occurances[['replicates']]) > 1) {
  #get a list of all possible combinations of conditions, without duplication or replicates
  col_vec <- combn(condition_vec, 2, simplify = FALSE)
  col_vec_length <- length(col_vec)
  for (i in 1:col_vec_length) {
    #Selecting conditions to be t.tested
    condition1 <- col_vec[[i]][1]
    condition2 <- col_vec[[i]][2]
    #subset the data for the current condition combinations
    conds_subset1 <- tab_cond_Loc_phases[condition == condition1, exp_sums]
    conds_subset2 <- tab_cond_Loc_phases[condition == condition2, exp_sums]
    # Perform the t-test
    ttest <- t.test(conds_subset1, conds_subset2)
    # Store the results in the data.table
    ttest_results <- rbind(ttest_results, data.table(condition1 = condition1,
                                                     condition2 = condition2,
                                                     mean_condition1 = ttest$estimate[1],
                                                     mean_condition2 = ttest$estimate[2],
                                                     statistic = ttest$statistic,
                                                     df = ttest$parameter,
                                                     p.value = ttest$p.value,
                                                     method = ttest$method))
  }
}
#This filters the comparisons that contain at least one of the controls and only keeps the statistically significant comparisons
ttest_results <- ttest_results %>%
  mutate(pair = as.numeric(factor(1:nrow(ttest_results))),
         xmin = pair - 0.2,
         xmax = pair + 0.2) %>%
  filter(p.value < 0.05, condition1 %in% controls | condition2 %in% controls)
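One further idea I have not timed on the real data: calling rbind inside the loop copies the growing table on every iteration, so it may help to collect one row per comparison in a list and bind once at the end. Below is a minimal base R sketch of that pattern. The toy data frame is a made-up stand-in for tab_cond_Loc_phases, and by_cond, pairs, rows and results are names I chose for illustration.

```r
set.seed(1)

# Hypothetical stand-in for tab_cond_Loc_phases: 3 conditions, 4 replicates each
toy <- data.frame(
  condition = rep(c("WT+DMSO", "MUT+DMSO", "WT+drugA"), each = 4),
  exp_sums  = rnorm(12)
)

# Split the measurements by condition once, instead of subsetting inside the loop
by_cond <- split(toy$exp_sums, toy$condition)

# All unordered pairs of condition names, as a list of length-2 character vectors
pairs <- combn(names(by_cond), 2, simplify = FALSE)

# Build one single-row data.frame per comparison
rows <- lapply(pairs, function(p) {
  tt <- t.test(by_cond[[p[1]]], by_cond[[p[2]]])
  data.frame(
    condition1 = p[1],
    condition2 = p[2],
    statistic  = unname(tt$statistic),
    df         = unname(tt$parameter),
    p.value    = tt$p.value,
    method     = tt$method
  )
})

# Bind all rows in a single step
results <- do.call(rbind, rows)
print(results)
```

With data.table loaded, rbindlist(rows) could replace the do.call(rbind, rows) step.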
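From what I can see, the problem is the matching with grepl(): the + in the control names is treated as a regex quantifier rather than a literal character, so the pattern never matches names such as WT+DMSO. I am not very proficient with regex, so rather than escaping the pattern I suggest a slightly different approach using %in%, which compares the strings exactly.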
ttest_compList <- ttest_compList_2 %>%
  mutate(pair = as.numeric(factor(1:nrow(ttest_compList_2))),
         xmin = pair - 0.2,
         xmax = pair + 0.2) %>%
  dplyr::filter(p_value < 0.05, condition1 %in% controls | condition2 %in% controls)
I changed the last chunk of code as shown above; this should work.
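The difference between the two matching approaches can be checked in isolation. The sketch below uses base R only. The conditions vector is a small made-up example rather than the real data, and regex_match, fixed_match and exact_match are names chosen for illustration.

```r
controls <- c("WT+DMSO", "MUT+DMSO")

# Hypothetical condition names, for illustration only
conditions <- c("WT+DMSO", "MUT+DMSO", "WT+drugA")

# Regex matching as in the question: "+" means "one or more of the preceding
# character", so "WT+DMSO" looks for W, one or more T, then DMSO
pattern <- paste(controls, collapse = "|")
regex_match <- grepl(pattern, conditions)

# The same pattern does match a string with no literal plus sign
matches_without_plus <- grepl(pattern, "WTTDMSO")

# Literal substring matching: fixed = TRUE turns off regex interpretation,
# so each control has to be tested separately and the results combined
fixed_match <- Reduce(`|`, lapply(controls, grepl, x = conditions, fixed = TRUE))

# Exact matching with %in%, as suggested above
exact_match <- conditions %in% controls

print(regex_match)
print(matches_without_plus)
print(fixed_match)
print(exact_match)
```

Note that fixed = TRUE still matches substrings, while %in% requires the whole string to be equal, which seems closer to what the filter is meant to do.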