I have a well-formatted tab-delimited file that lists which gene clusters occur in which genomes. It looks like the dataframe below (example):
Gene Cluster Genome
-----------------------------
GCF3372 Streptomyces_hygroscopicus
GCF3450 Streptomyces_sp_Hm1069
GCF3371 Streptomyces_sp_MBT13
GCF3371 Streptomyces_xiamenensis
From this dataframe I want to create a presence/absence (contingency) table with values of 1 and 0, depending on whether a particular gene cluster is present in a genome. The goal is to measure the occurrence of each gene cluster across genomes, so that I can run a statistical analysis on the resulting matrix.
x <- data.frame(gc = c('GCF3372','GCF3450','GCF3371','GCF3371','GCF3371'),
strain = c('Streptomyces_hygroscopicus', 'Streptomyces_sp_Hm1069',
'Streptomyces_sp_MBT13', 'Streptomyces_xiamenensis','Streptomyces_hygroscopicus'))
dput(head(x[, c(1,2)]))
Here's a way to compute a contingency table from two categorical variables. For illustrative purposes, I'll use sex and height (these seem to be structurally like the two variables you have in your dataframe x):
Data:
set.seed(300)
df <- data.frame(
Height = sample(c("tall", "very tall", "small", "very small"), 20, replace = TRUE),
Sex = sample(c("m", "f"), 20, replace = TRUE)
)
df
Height Sex
1 very tall f
2 very tall m
3 very tall m
4 tall f
5 very small m
6 tall f
7 tall m
8 very small f
9 small f
10 tall m
11 very small f
12 tall m
13 very small m
14 small f
15 very small m
16 small m
17 very small m
18 very small m
19 tall f
20 tall m
First, as noted in a comment already, tabulate the data using table:
tbl <- table(df$Sex, df$Height); tbl
small tall very small very tall
f 2 3 2 1
m 1 4 5 2
Then you can define the first row of tbl as a new vector female and the second row as male:
female <- tbl[1,]
male <- tbl[2,]
Finally, you row-bind the two vectors into a matrix counts, which is your contingency table:
counts <- rbind(female, male)
counts
small tall very small very tall
female 2 3 2 1
male 1 4 5 2
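As an aside, the rbind step mainly serves to give the rows readable labels. A minimal sketch of an alternative, assuming a small made-up data frame d (hypothetical, not the df above), relabels the rows of the table in place:

```r
# Hypothetical toy data, only for illustration
d <- data.frame(
  Sex    = c("f", "m", "m", "f", "m", "f"),
  Height = c("tall", "small", "tall", "small", "tall", "tall")
)

# Cross-tabulate; rows are ordered alphabetically: "f", then "m"
tbl2 <- table(d$Sex, d$Height)

# Relabel the rows directly instead of extracting them and using rbind
rownames(tbl2) <- c("female", "male")
tbl2
```

The result holds the same counts as the original table, just with the longer row names.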
Based on the contingency table you can run your test, most likely a chi-squared test:
test <- chisq.test(counts); test
Pearson's Chi-squared test
data: counts
X-squared = 1.3492, df = 3, p-value = 0.7175
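Applied to the dataframe x from the question, the same table call could give the presence/absence matrix directly. This is a sketch under the assumption that genomes should be rows and gene clusters columns; the names tab and presence are made up for illustration:

```r
x <- data.frame(
  gc = c("GCF3372", "GCF3450", "GCF3371", "GCF3371", "GCF3371"),
  strain = c("Streptomyces_hygroscopicus", "Streptomyces_sp_Hm1069",
             "Streptomyces_sp_MBT13", "Streptomyces_xiamenensis",
             "Streptomyces_hygroscopicus")
)

# Cross-tabulate genomes (rows) against gene clusters (columns)
tab <- table(genome = x$strain, cluster = x$gc)

# Collapse counts to presence (1) / absence (0); a pair listed more
# than once in x would otherwise give a count above 1
presence <- unclass(tab)
presence[presence > 0] <- 1L
presence
```

Each cell is 1 if the gene cluster was listed for that genome and 0 otherwise, so Streptomyces_hygroscopicus gets a 1 under both GCF3371 and GCF3372.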