I have 7 datasets, and each of them has two data frames: Metadata, which contains a very important column showing who is a responder and who is not, and a data frame of cell types.
Sample using dput: this is an example from one of the datasets. The first data frame is the cells data frame and the second is Metadata, with information about drug Benefit (Response / No Response):
cells1 <- structure(c(8.10937548981953e-20, 0.095381661829093, 0.054868371418562,
0.0523687378840825, 0.0100173293159538, 0.0332395245437795, 3.37811149975583e-20,
0.048191378909587, 0.13314908462763, 0, 0.00612878313809124,
0, 0.00117409520254045, 1.33684197233784, 0.0701023734195797,
0.290756813286141, 0.349392264371762, 0.169367429138566, 0.00209460699328093,
0.205599458004829, 0.318048653115709, 4.21796249339787e-05, 0.00844407692255898,
0, 0.00613007026042523, 0.0300024082993193, 0.0405191646567986,
0.00654087887823056, 0.0111094954094255, 1.30617589099212e-19,
0.0398730537850546, 0.0390946117756341, 0.239413780024853, 2.07521807718399e-19,
0.00116980239850497, 0), .Dim = c(6L, 6L), .Dimnames = list(c("Adipocytes",
"B-cells", "Basophils", "CD4+ memory T-cells", "CD4+ naive T-cells",
"CD4+ T-cells"), c("Pt1", "Pt10", "Pt101", "Pt103", "Pt106",
"Pt11")))
These datasets are about cancer therapy. The columns in cells1 are samples and the rows are cell types. All 7 datasets are laid out this way.
The rows are exactly the same in all of them, while the samples differ (so each dataset has a different number of samples). Some of the samples are responders and some are non-responders.
Metadata:
Metadata <- structure(list(`Mutation Load` = c("NA", "75", "10", "21", "700",
"106"), `Neo-antigen Load` = c("NA", "33", "5", "5", "219", "67"
), `Neo-peptide Load` = c("NA", "56", "6", "11", "273", "187"
), `Cytolytic Score` = c("977.86911190000001", "65.840716889999996",
"1392.1422339999999", "1108.8620289999999", "645.54163300000005",
"602.6740413"), Benefit = c("No Response", "No Response", "Response",
"No Response", "No Response", "No Response")), row.names = c("Pt1",
"Pt10", "Pt101", "Pt103", "Pt106", "Pt11"), class = "data.frame")
Goal: join the cells data frames (I did this with cbind). Now that I have one big data frame with 1000+ columns and only 38 rows, I need to build two t-SNE plots: one that colors the samples by dataset (cells1, cells2, cells6, ...) and one that colors the samples by response (Response / No Response).
My code: I tried to color by dataset. I thought a list of the sample names would be a good idea, but I got stuck there:
## Combine Cells dataframes
Total_cells = cbind(cells1, cells2, cells6, cells7, cells9, cells12, cells15)
## Color t-SNE by dataset & color by response
Mylist = list(df1 = c(colnames(cells1)), df2 = c(colnames(cells2)),
df6 = c(colnames(cells6)), df7 = c(colnames(cells7)),
df9 = c(colnames(cells9)), df12 = c(colnames(cells12)) ,df15 = c(colnames(cells15)))
tsne_out = Rtsne(t(Total_cells), perplexity = 15)
plot(tsne_out$Y, col = Mylist, pch = 15)
legend("topright",
legend=unique(Mylist), cex = 0.5,
fill =palette("default"),
border=NA,box.col=NA)
If any additional information is needed, please tell me.
I finally got it, so I will answer my own question in case anyone needs it in the future.
The first thing I did was combine the cells data frames with cbind:
Scores = cbind(cells1, cells2, cells6, cells7, cells9, cells12, cells15)
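One thing worth checking before this step: cbind on matrices binds by position, not by row name, so it only gives a sensible result if every matrix lists the cell types in the same order. A minimal sketch of such a check, using toy stand-in matrices with made-up values (cells_a, cells_b and scores_toy are hypothetical names, not the real data):

```r
# Toy stand-ins for two of the cells matrices (hypothetical values).
cell_types <- c("Adipocytes", "B-cells", "Basophils")
cells_a <- matrix(runif(6), nrow = 3,
                  dimnames = list(cell_types, c("Pt1", "Pt10")))
# Same cell types, deliberately stored in reverse order.
cells_b <- matrix(runif(9), nrow = 3,
                  dimnames = list(rev(cell_types), c("S1", "S2", "S3")))

cells_list <- list(CELLS_A = cells_a, CELLS_B = cells_b)

# Reorder every matrix to the row order of the first one, matching by name.
ref_rows <- rownames(cells_list[[1]])
aligned <- lapply(cells_list, function(m) {
  stopifnot(setequal(rownames(m), ref_rows))  # same cell types everywhere
  m[ref_rows, , drop = FALSE]
})

# Bind columns now that the rows are guaranteed to line up.
scores_toy <- do.call(cbind, unname(aligned))
dim(scores_toy)
```

If the row names really are identical and in the same order in all 7 matrices, the plain cbind call above is enough and this step changes nothing.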
The tricky part was creating the combined metadata, which contains the information needed to separate the columns. I created it with 3 columns: the name of the dataset, the samples, and the response:
dataset <- c(rep('CELLS1', ncol(cells1)), rep('CELLS2', ncol(cells2)),
             rep('CELLS6', ncol(cells6)), rep('CELLS7', ncol(cells7)),
             rep('CELLS9', ncol(cells9)), rep('CELLS12', ncol(cells12)),
             rep('CELLS15', ncol(cells15)))
samples <- c(colnames(cells1), colnames(cells2),
             colnames(cells6), colnames(cells7), colnames(cells9), colnames(cells12),
             colnames(cells15))
Response <- c(metadata1$Benefit, metadata2$Benefit2,
              metadata6$Benefit, metadata7$Benefit, metadata9$Benefit, metadata12$Benefit,
              metadata15$Benefit)
totaldata <- data.frame(dataset, samples, Response)
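Note that the code above relies on each metadata data frame listing its patients in the same order as the columns of the matching cells matrix. If that is not guaranteed, the response can be looked up by sample name instead. A rough sketch with toy data, assuming the metadata row names are the sample IDs (as in the dput example) and that the response column is called Benefit in every metadata frame; cells_list, meta_list and annot are hypothetical names:

```r
# Toy stand-ins (hypothetical values) for two datasets and their metadata.
rows <- c("B-cells", "Basophils")
cells_list <- list(
  CELLS1 = matrix(0, nrow = 2, ncol = 3,
                  dimnames = list(rows, c("Pt1", "Pt10", "Pt101"))),
  CELLS2 = matrix(0, nrow = 2, ncol = 2,
                  dimnames = list(rows, c("S1", "S2")))
)
meta_list <- list(
  # Rows deliberately shuffled relative to the columns of CELLS1.
  CELLS1 = data.frame(Benefit = c("Response", "No Response", "No Response"),
                      row.names = c("Pt101", "Pt1", "Pt10")),
  CELLS2 = data.frame(Benefit = c("No Response", "Response"),
                      row.names = c("S1", "S2"))
)

# One annotation row per sample; Response is matched on the exact sample name.
annot <- do.call(rbind, lapply(names(cells_list), function(nm) {
  ids  <- colnames(cells_list[[nm]])
  meta <- meta_list[[nm]]
  data.frame(
    dataset  = nm,
    samples  = ids,
    Response = meta$Benefit[match(ids, rownames(meta))],
    stringsAsFactors = FALSE
  )
}))
annot
```

match() is used rather than indexing the data frame by row name because data frame row indexing allows partial matching, which is risky with IDs like Pt1 and Pt10. Samples that are missing from the metadata end up with NA in Response, which makes them easy to spot.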
The next part is the t-SNE:
library(Rtsne)
library(ggplot2)
par(mar=c(5, 4, 4, 8), xpd=TRUE)  # affects base graphics only; ggplot2 ignores it
tsne = Rtsne(t(Scores), perplexity = 15)  # transpose so each sample is a row
tsnetotal <- data.frame(x = tsne$Y[,1], y = tsne$Y[,2], col = as.factor(totaldata$dataset))
ggplot(tsnetotal) + geom_point(aes(x=x, y=y, color=col))+theme_classic()
This colors the samples by dataset. To color by response, I just changed it to col = as.factor(totaldata$Response), and that got the job done.
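For anyone who wants to try the whole thing end to end, here is a small self-contained sketch on random toy data that runs t-SNE once and draws both plots from the same coordinates. The toy matrix, the annot data frame and the perplexity of 10 are my own placeholders, not the real data or settings:

```r
library(Rtsne)
library(ggplot2)

set.seed(42)  # t-SNE is stochastic; fix the seed for a repeatable layout

# Toy stand-in for Scores: 6 cell types (rows) x 40 samples (columns).
n_samples <- 40
scores_toy <- matrix(runif(6 * n_samples), nrow = 6,
                     dimnames = list(paste0("celltype", 1:6),
                                     paste0("S", seq_len(n_samples))))

# Toy annotation: which dataset each sample came from, and a made-up response.
annot <- data.frame(
  dataset  = rep(c("CELLS1", "CELLS2"), each = n_samples / 2),
  Response = sample(c("Response", "No Response"), n_samples, replace = TRUE)
)

# Rtsne expects one row per observation, hence the transpose.
# Perplexity has to be small relative to the sample count; 10 fits 40 samples.
tsne <- Rtsne(t(scores_toy), perplexity = 10)

# Keep the coordinates and both annotations in one data frame.
plot_df <- data.frame(x = tsne$Y[, 1], y = tsne$Y[, 2], annot)

p_dataset <- ggplot(plot_df, aes(x = x, y = y, color = dataset)) +
  geom_point() + theme_classic()
p_response <- ggplot(plot_df, aes(x = x, y = y, color = Response)) +
  geom_point() + theme_classic()

print(p_dataset)
print(p_response)
```

Running Rtsne once and reusing the coordinates means the two plots show the same layout, so a cluster that stands out by dataset can be compared directly with the response coloring.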