r dataframe contingency

Creating a contingency table


I have a dataframe that records which gene clusters occur in which genomes. It is read from a well-formatted tab-delimited file and looks like this (example):

Gene Cluster     Genome
-----------------------------
GCF3372      Streptomyces_hygroscopicus
GCF3450      Streptomyces_sp_Hm1069
GCF3371      Streptomyces_sp_MBT13
GCF3371      Streptomyces_xiamenensis

From this dataframe I want to build a presence/absence (contingency) table with values of 1 and 0, depending on whether a particular gene cluster is present in a genome. The goal is to measure the occurrence of each gene cluster across genomes and then run a statistical analysis on the resulting matrix.

x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'), 
                strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069', 
                           'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
dput(head(x[, c(1,2)]))

Solution

  • Here's a way to compute a contingency table from two categorical variables. For illustration I'll use sex and height, which are structurally similar to the two variables in your dataframe x:

    Data:

    set.seed(300)
    df <- data.frame(
      Height = sample(c("tall", "very tall", "small", "very small"), 20, replace = TRUE),
      Sex = sample(c("m", "f"), 20, replace = TRUE)
    )
    df
           Height Sex
    1   very tall   f
    2   very tall   m
    3   very tall   m
    4        tall   f
    5  very small   m
    6        tall   f
    7        tall   m
    8  very small   f
    9       small   f
    10       tall   m
    11 very small   f
    12       tall   m
    13 very small   m
    14      small   f
    15 very small   m
    16      small   m
    17 very small   m
    18 very small   m
    19       tall   f
    20       tall   m
    

    First, as noted in a comment already, tabulate the data using table:

    tbl <- table(df$Sex, df$Height); tbl
        small tall very small very tall
      f     2    3          2         1
      m     1    4          5         2
    

    Then you can define the first row of tbl as a new vector female and the second row as male:

    female <- tbl[1,]
    male <- tbl[2,]
    

    Finally, you row-bind the two vectors into a matrix counts, which is your contingency table:

    counts <- rbind(female, male)
    counts
           small tall very small very tall
    female     2    3          2         1
    male       1    4          5         2
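    
    As an aside, the rbind step mainly relabels the rows. Here is a minimal sketch of the same relabelling done directly on the table, with the counts typed in from the output above rather than re-sampled (the name relabelled is only illustrative):

    ```r
    # Counts copied from the printed table above; matrix() fills column by column
    tbl <- as.table(matrix(
      c(2, 1, 3, 4, 2, 5, 1, 2), nrow = 2,
      dimnames = list(c("f", "m"), c("small", "tall", "very small", "very tall"))
    ))

    # Relabel the rows in place instead of extracting and re-binding them
    relabelled <- tbl
    rownames(relabelled) <- c("female", "male")
    relabelled
    ```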
    

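    Based on the contingency table you can run your test, likely a chi-squared test. With expected cell counts this small, R also warns that the chi-squared approximation may be incorrect: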

    test <- chisq.test(counts); test
    
        Pearson's Chi-squared test
    
    data:  counts
    X-squared = 1.3492, df = 3, p-value = 0.7175
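
Applied to the dataframe x from the question, the same table call gives a genome-by-cluster count table, which can then be collapsed to 0/1. This is a minimal sketch rather than a definitive recipe; the names counts and presence and the dimension labels genome and cluster are illustrative choices, not from the original post:

```r
# Example data from the question (gene cluster vs. genome)
x <- data.frame(
  gc = c("GCF3372", "GCF3450", "GCF3371", "GCF3371", "GCF3371"),
  strain = c("Streptomyces_hygroscopicus", "Streptomyces_sp_Hm1069",
             "Streptomyces_sp_MBT13", "Streptomyces_xiamenensis",
             "Streptomyces_hygroscopicus")
)

# Cross-tabulate: rows are genomes, columns are gene clusters (raw counts)
counts <- table(genome = x$strain, cluster = x$gc)

# Collapse counts to presence (1) / absence (0)
presence <- (counts > 0) * 1L
presence
```

Comparing against 0 guards against a cluster being listed more than once for the same genome, in which case the raw count would exceed 1.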