r data.table t-test

Filter or select on data.table causes new data.table to come up empty with correct columns but no rows


The data for this question can be found here:

Here is the original data: https://github.com/cjy8s/data/blob/c9605876625b3aec8acb949c8bb0b6b4be3a8c41/tab_cond_Loc_phases.csv

Here is the t.test data, which is the data available right before the problem occurs: https://github.com/cjy8s/data/blob/e8e79fdc62d3d8b36a9fd842c0b0dcfa731ec2e1/ttest_compList_2.csv

I am trying to run t.tests on the conditions in the data.table tab_cond_Loc_phases. My current dataset has 82 experimental conditions, which gives about 3300 comparison pairs when I use combn. The t.tests work fine, and I store the results in a data.table called ttest_compList_2. I can add a few extra columns successfully, but when I try to keep only the rows with values < 0.05 in the column p_value, the new data.table (named ttest_compList to make troubleshooting easier) comes back with no rows and no error message.

If I use select instead of filter, then I get this error:

Error in `select()`:
! Problem while evaluating `p_value < 0.05`.
Caused by error:
! object 'p_value' not found

This doesn't seem to happen when tab_cond_Loc_phases has fewer conditions to compare. I'm not sure why my code is unable to see the p_value column of ttest_compList_2 here.

I'm sure there is a better way to do my t.tests, but this has worked so far. I would also welcome feedback on my general approach, if anyone is willing to give it.

Here is my MRE:

library(data.table)
library(dplyr)

controls <- c("WT+DMSO", "MUT+DMSO")

#Get the unique names of the rows in a column
condition_vec <- unique(tab_cond_Loc_phases$condition)

#get a list of all possible combinations of conditions, without duplication or replicates
col_vec <- combn(condition_vec, 2, FUN = paste)

#for combinations of condition averages to be compared with t.tests over time, grouped by condition
con_tab_2 <- list()

for (comparison in 1:ncol(col_vec)) {
  #Loop through the col_vec combinations and use each pairing as arguments for t.test comparisons
  tmp_ttest_2 <- t.test(tab_cond_Loc_phases[condition == col_vec[1, comparison], exp_sums],
                       tab_cond_Loc_phases[condition == col_vec[2, comparison], exp_sums])

  #Additional columns describing the t.tests, to be added to a data.table.
  #Each value of res_tab_2 represents one t.test comparison
  res_tab_2 <- data.table(
    condition1 = combn(condition_vec, 2)[1, comparison],
    condition2 = combn(condition_vec, 2)[2, comparison],
    t_statistic = tmp_ttest_2$statistic,
    df = tmp_ttest_2$parameter,
    p_value = tmp_ttest_2$p.value,
    mean_cond1 = tmp_ttest_2$estimate[1],
    mean_cond2 = tmp_ttest_2$estimate[2],
    method = tmp_ttest_2$method
  )
  
  #Store the one-row data.table from the current iteration in the list
  #The rows are bound together into one data.table after the loop
  con_tab_2[[comparison]] <- res_tab_2
  print(paste('t.test comparison group ', comparison, '/', ncol(col_vec)))
}

#Bind all of the lists within con_tab_2 together to make one data.table, for easier referencing later
ttest_compList_2 <- rbindlist(con_tab_2)

#This filters the comparisons that contain at least one of the controls and only keeps the statistically significant comparisons
ttest_compList <- ttest_compList_2 %>%
  mutate(pair = as.numeric(factor(1:nrow(ttest_compList_2))),
         xmin = pair - 0.2,
         xmax = pair + 0.2) %>%
  dplyr::filter(p_value < 0.05, grepl(paste(controls, collapse = "|"), condition1) | grepl(paste(controls, collapse = "|"), condition2))

UPDATE: For anyone who may be interested, I wrote a faster way to do the t.tests, prompted by r2evans's comment. On my computer, the previous version, updated with theN's answer, runs in about 33 seconds. The new version runs in about 12 seconds.

library(tidyverse)
library(data.table)

#TODO: consider restructuring the t.test code, more like the other one, so that
#rbind and rbindlist are not needed:
#get subsets of the data instead of just one row of the data at a time

controls <- c("WT+DMSO", "MUT+DMSO")

#Get the unique names of the rows in a column
condition_vec <- unique(tab_cond_Loc_phases$condition)

# Create an empty data.table to store the results
ttest_results <- data.table()

#See how many times each condition occurs, so that you don't get errors if any condition appears only once
occurrences <- tab_cond_Loc_phases[, list(replicates = .N), by = condition]

#only run the t.tests if every condition has more than 1 replicate
if(min(occurrences[['replicates']]) > 1) {
  #get a list of all possible combinations of conditions, without duplication or replicates
  col_vec <- combn(condition_vec, 2, simplify = FALSE)
  col_vec_length <- length(col_vec)
  
  for (i in 1:col_vec_length) {
    
    #Selecting conditions to be t.tested
    condition1 <- col_vec[[i]][1]
    condition2 <- col_vec[[i]][2]
    
    #subset the data for the current condition combinations
    conds_subset1 <- tab_cond_Loc_phases[condition == condition1, exp_sums]
    conds_subset2 <- tab_cond_Loc_phases[condition == condition2, exp_sums]
    
    # Perform the t-test
    ttest <- t.test(conds_subset1, conds_subset2)
    
    # Store the results in the data.table
    ttest_results <- rbind(ttest_results, data.table(condition1 = condition1, 
                                                     condition2 = condition2,
                                                     mean_condition1 = ttest$estimate[1],
                                                     mean_condition2 = ttest$estimate[2],
                                                     statistic = ttest$statistic,
                                                     df = ttest$parameter,
                                                     p.value = ttest$p.value,
                                                     method = ttest$method))
   
  }
}

#This filters the comparisons that contain at least one of the controls and only keeps the statistically significant comparisons
ttest_results <- ttest_results %>%
  mutate(pair = as.numeric(factor(1:nrow(ttest_results))),
         xmin = pair - 0.2,
         xmax = pair + 0.2) %>%
  filter(p.value < 0.05, condition1 %in% controls | condition2 %in% controls)
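
For readers comparing the two versions, one possible variation is to compute the pairs once, collect each one-row result in a list, and bind the list a single time at the end. The sketch below is only an illustration under assumptions: it uses base R data.frames so that it runs without extra packages, and the toy table (toy), its values, and the condition WT+drugA are made up and are not the real tab_cond_Loc_phases data. With data.table loaded, rbindlist(rows) could take the place of do.call(rbind, rows).

```r
# Toy data standing in for tab_cond_Loc_phases; the values are made up.
set.seed(1)
toy <- data.frame(
  condition = rep(c("WT+DMSO", "MUT+DMSO", "WT+drugA"), each = 4),
  exp_sums  = rnorm(12)
)

# Build every unordered pair of conditions once, outside the loop.
pairs <- combn(unique(toy$condition), 2, simplify = FALSE)

# Run one t.test per pair and return a one-row data.frame each time.
rows <- lapply(pairs, function(p) {
  x <- toy$exp_sums[toy$condition == p[1]]
  y <- toy$exp_sums[toy$condition == p[2]]
  tt <- t.test(x, y)
  data.frame(
    condition1 = p[1],
    condition2 = p[2],
    statistic  = unname(tt$statistic),
    df         = unname(tt$parameter),
    p_value    = tt$p.value
  )
})

# Bind all rows in a single step instead of growing the table inside the loop.
results <- do.call(rbind, rows)
print(results)
```

Binding once avoids copying the growing table on every iteration, which may matter more as the number of pairs increases.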


Solution

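  • From what I can see, the problem is the matching done with grepl(). The control names contain +, which in a regular expression means "one or more of the preceding character", so the pattern WT+DMSO matches WTDMSO or WTTDMSO but not the literal string WT+DMSO. That would explain why the filter returns no rows and no error.

    I am not very proficient with regex, so I suggest a slightly different approach that uses %in% for exact matching instead.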

    ttest_compList <- ttest_compList_2 %>%
      mutate(pair = as.numeric(factor(1:nrow(ttest_compList_2))),
             xmin = pair - 0.2,
             xmax = pair + 0.2) %>%
      dplyr::filter(p_value < 0.05, condition1 %in% controls | condition2 %in% controls)
    

    I changed the last chunk of code as shown above; this should work.
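
To see how the grepl() pattern can miss the controls, here is a minimal sketch in base R. Only the two control names come from the question; the other condition names (WT+drugA, WTTDMSO) are hypothetical and chosen to show the difference between regex matching, exact matching with %in%, and literal matching with fixed = TRUE.

```r
# Only the two controls come from the question; the other names are hypothetical.
controls <- c("WT+DMSO", "MUT+DMSO")
conditions <- c("WT+DMSO", "MUT+DMSO", "WT+drugA", "WTTDMSO")

# As a regex, "+" means "one or more of the preceding character",
# so "WT+DMSO" describes W, one or more T, then DMSO.
pattern <- paste(controls, collapse = "|")
regex_hits <- grepl(pattern, conditions)

# Exact matching against the vector of control names.
exact_hits <- conditions %in% controls

# Literal substring matching, one control at a time, with fixed = TRUE.
fixed_hits <- Reduce(`|`, lapply(controls, grepl, x = conditions, fixed = TRUE))

print(data.frame(conditions, regex_hits, exact_hits, fixed_hits))
```

In this sketch the regex version flags only WTTDMSO, while %in% and fixed = TRUE flag the two real controls. %in% is the simpler choice when the whole condition name has to equal a control name; fixed = TRUE is an option when the control name only needs to appear somewhere inside a longer string.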