I'm trying to obtain the relative abundance using a merge_sample option of the Phyloseq package.
When I calculate the average of each Phylum (I will use GlobalPatterns as example) with all the samples; I mean, Globalpaters have 26 samples so I made something like
library(phyloseq)
library(plyr)
data(GlobalPatterns)
TGroup <- tax_glom(GlobalPatterns, taxrank = "Phylum")
PGroup <- transform_sample_counts(TGroup, function(x)100* x / sum(x))
OTUg <- otu_table(PGroup)
TAXg <- tax_table(PGroup)[,"Phylum"]
AverageD <- as.data.frame(rowMeans(OTUg))
names(AverageD) <- c("Mean")
GTable <- merge(TAXg, AverageD, by=0, all=TRUE)
GTable$Row.names = NULL
GTable <- GTable[order(desc(GTable$Mean)),]
head(GTable)
I get something like :
Phylum Mean
1 Proteobacteria 29.45550
2 Firmicutes 18.87905
3 Bacteroidetes 17.34374
4 Cyanobacteria 13.70639
5 Actinobacteria 8.93446
6....... More.....
What I think it is ok !!!!
However when I tray to mage merge_samples( by: SampleType):
ps <- tax_glom(GlobalPatterns, "Phylum")
ps0 <- transform_sample_counts(ps, function(x)100* x / sum(x))
ps1 <- merge_samples(ps0, "SampleType")
ps2 <- transform_sample_counts(ps1, function(x)100* x / sum(x))
ps3 <- ps2
otu_table(ps3) <- t(otu_table(ps3)) # transpose the matrix otus !!!
OTUg <- otu_table(ps3)
TAXg <- tax_table(ps3)[,"Phylum"]
GTable <- merge(TAXg, OTUg, by=0, all=TRUE)
GTable$Row.names = NULL
GTable$Mean=rowMeans(GTable[,-c(1)], na.rm=TRUE)
GTable <- GTable[order(desc(GTable$Mean)),]
head(GTable)
I get the same tax but with different percent in the Mean column :
Phylum Feces Freshwater Freshwater Mock Ocean Sediment Skin Soil Tongue Mean
1 Proteobacteria 1.58 16.71 18.61 20.10 38.00 71.03 31.98 32.66 44.49 30.57
2 Firmicutes 54.82 0.12 0.65 41.42 0.08 2.53 30.67 0.64 21.67 16.96
3 Bacteroidetes 35.23 11.92 5.07 24.97 31.17 7.01 9.09 9.90 12.28 16.29
4 Cyanobacteria 2.63 30.17 62.57 0.16 19.18 3.24 4.65 0.97 6.61 14.46
5 Actinobacteria 3.47 37.11 1.74 8.39 5.12 1.04 16.78 9.99 7.49 10.13
well at this point with the merge_samples by SampleType, each column (Sample) will glom the taxa and the percent for each Phylum in each sample will change (Feces Freshwater Freshwater... ), I understand that but the general average of each Phylum must be the same, even if I merge the samples, in this case, the Mean is different (Proteobacteria 30.57, Firmicutes 16.9, Bacteroidetes 16.29........).
Any solution or advice ????
Thanks
With the first part, you are taking the means across all samples. In the second, your are taking the mean of grouped means. These two are only equivalent when the number of observations per group is the same.
For example:
# equal n for each group
abundance = seq(0.1,0.6,by=0.1)
group = rep(letters[1:3],each=2)
mean(tapply(abundance,group,mean)) == mean(abundance)
[1] TRUE
# unequal n
abundance = seq(0.1,0.6,by=0.1)
group = rep(letters[1:3],1:3)
mean(tapply(abundance,group,mean)) == mean(abundance)
[1] FALSE
Your n per SampleType is different
TGroup <- tax_glom(GlobalPatterns, taxrank = "Phylum")
PGroup <- transform_sample_counts(TGroup, function(x)100* x / sum(x))
SampleType = sample_data(PGroup)$SampleType
table(SampleType)
SampleType
Feces Freshwater Freshwater (creek) Mock
4 2 3 3
Ocean Sediment (estuary) Skin Soil
3 3 3 3
Tongue
2
To get the same mean abundance across samples, you need to find the average abundance per SampleType, then average:
mean_PGroup = sapply(levels(SampleType),function(i){
rowMeans(otu_table(PGroup)[,SampleType==i])
})
phy = tax_table(PGroup)[rownames(mean_PGroup ),"Phylum"]
rownames(mean_PGroup) = phy
head(sort(rowMeans(mean_PGroup),decreasing=TRUE))
Proteobacteria Firmicutes Bacteroidetes Cyanobacteria Actinobacteria
30.572773 16.956254 16.293286 14.463643 10.126875
Verrucomicrobia
2.774216