Creating a contingency table

I have a dataframe that lists which gene clusters occur in which genomes. It is stored as a well-formatted tab-delimited file and looks like this (example):

Gene Cluster     Genome
-----------------------------
GCF3372      Streptomyces_hygroscopicus
GCF3450      Streptomyces_sp_Hm1069
GCF3371      Streptomyces_sp_MBT13
GCF3371      Streptomyces_xiamenensis

From this dataframe I want to create a presence/absence (contingency) table with values of 1 and 0, depending on whether a particular gene cluster is present in a genome. The goal is to measure the occurrence of a particular gene cluster within a genome, so that I can run a statistical analysis on the resulting matrix.

x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'), 
                strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069', 
                           'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
dput(head(x[, c(1,2)]))

Solution

Here's a way to compute a contingency table from two categorical variables. For illustration, I'll use Sex and Height (these seem to be structurally similar to the two variables in your dataframe x):

Data:

set.seed(300)
df <- data.frame(
  Height = sample(c("tall", "very tall", "small", "very small"), 20, replace = TRUE),
  Sex = sample(c("m", "f"), 20, replace = TRUE)
)
df
       Height Sex
1   very tall   f
2   very tall   m
3   very tall   m
4        tall   f
5  very small   m
6        tall   f
7        tall   m
8  very small   f
9       small   f
10       tall   m
11 very small   f
12       tall   m
13 very small   m
14      small   f
15 very small   m
16      small   m
17 very small   m
18 very small   m
19       tall   f
20       tall   m

First, as noted in a comment already, tabulate the data using table:

tbl <- table(df$Sex, df$Height); tbl
    small tall very small very tall
  f     2    3          2         1
  m     1    4          5         2
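
As a minimal sketch of how the same call might carry over to the dataframe x from the question: table() gives counts per gene cluster and genome, and comparing against zero collapses those counts to 1/0. The names counts_gc and presence are illustrative and not from the question.

```r
# Dataframe from the question
x <- data.frame(
  gc = c("GCF3372", "GCF3450", "GCF3371", "GCF3371", "GCF3371"),
  strain = c("Streptomyces_hygroscopicus", "Streptomyces_sp_Hm1069",
             "Streptomyces_sp_MBT13", "Streptomyces_xiamenensis",
             "Streptomyces_hygroscopicus")
)

# Cross-tabulate: rows are gene clusters, columns are genomes
counts_gc <- table(x$gc, x$strain)

# Collapse counts to presence (1) / absence (0);
# this matters once a cluster appears more than once in the same genome
presence <- (counts_gc > 0) * 1L
presence
```

Rows and columns come out in alphabetical order of the factor levels, so it is safer to index by name (for example presence["GCF3371", "Streptomyces_hygroscopicus"]) than by position.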

Then you can define the first row of tbl as a new vector female and the second row as male:

female <- tbl[1,]
male <- tbl[2,]

Finally, you row-bind the two into a matrix counts, which is your contingency table:

counts <- rbind(female, male)
counts
       small tall very small very tall
female     2    3          2         1
male       1    4          5         2
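
If extracting the rows one by one feels roundabout, a possible shortcut is to relabel the rows of the table directly. This is only a sketch: the counts are copied by hand from the output above so that the block runs on its own, and counts_alt is an illustrative name.

```r
# Counts copied from the table printed above
tbl <- as.table(matrix(
  c(2, 3, 2, 1,
    1, 4, 5, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("f", "m"),
                  c("small", "tall", "very small", "very tall"))
))

# Relabel the rows in place instead of extracting and re-binding them
counts_alt <- tbl
rownames(counts_alt) <- c("female", "male")
counts_alt
```

The result holds the same values as counts, just as a table object rather than a plain matrix.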

Based on the contingency table, you can run your test, most likely a chi-squared test:

test <- chisq.test(counts); test

    Pearson's Chi-squared test

data:  counts
X-squared = 1.3492, df = 3, p-value = 0.7175
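
One caveat, offered as a suggestion and not as part of the steps above: with cell counts this small, chisq.test() warns that the chi-squared approximation may be incorrect. The sketch below, again with the counts copied by hand, inspects the expected counts and runs Fisher's exact test as one possible alternative. Whether that is the right test for your gene cluster data is an assumption you would need to check.

```r
# Counts copied from the contingency table above
counts <- rbind(
  female = c(small = 2, tall = 3, `very small` = 2, `very tall` = 1),
  male   = c(small = 1, tall = 4, `very small` = 5, `very tall` = 2)
)

# Expected cell counts under independence;
# suppressWarnings() hides the small-count warning for this inspection only
expected <- suppressWarnings(chisq.test(counts))$expected
expected

# Fisher's exact test does not rely on the large-sample approximation
exact <- fisher.test(counts)
exact$p.value
```

A common rule of thumb is to be cautious with the chi-squared test when expected counts fall below 5, which is the case for every cell here.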