I have two data frames. The first contains a 1484 x 1484 gene-gene correlation matrix, where cell (i, j) holds the correlation between genes i and j. The second is a lookup table that maps each protein to the complex it belongs to, and it looks like this:
    Complex                                Protein_ID
1   BCL6-HDAC4 complex                     Bcl6
125 BCL6-HDAC5 complex                     Hdac5
249 BCL6-HDAC7 complex                     Bcl6
373 Multisubunit ACTR coactivator complex  Ep300
497 Condensin I complex                    Smc2
621 BLOC-3                                 Hps4
I am interested in extracting, from my matrix, the correlations between genes that belong to the same complex and storing them in a new data frame with one row per gene-gene correlation, labelled by complex. Ideally it would look like this:
# this is a simulated data.frame
Complex                                Correlation values
BCL6-HDAC4 complex                      0.64
BCL6-HDAC4 complex                     -0.25
Multisubunit ACTR coactivator complex   0.31
Multisubunit ACTR coactivator complex   0.30
Any ideas on how I can get there?
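One option with data.table: melt the correlation matrix to long format, keep each pair once, and keep only the pairs whose two genes map to the same complex. This assumes that row k of lkps describes the gene in row/column k of cors (as in the example data below), so each gene appears in lkps exactly once.
library(data.table) # >= V1.15.0
df <-
  melt(data.table(cors),                      # matrix to long data.table
       variable.name = "i",
       value.name = "cor"
  )[, let(i = as.integer(i), j = rowid(i))    # i = column index, j = row index
  ][i < j                                     # keep each distinct pair once, drop the diagonal
  ][, Complex := lkps$Complex[i]              # look up Complex for i
  ][Complex == lkps$Complex[j]]               # keep if Complex for j is the same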
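The code above looks up complexes by position. In the question, the same protein (Bcl6) is listed under more than one complex, and the real matrix is presumably indexed by gene name. A minimal base R sketch for that case follows. It assumes the dimnames of the correlation matrix match the Protein_ID values, and the toy inputs (cors_named, lkps_named, genes g1 to g5) are made up for illustration.

```r
# Toy inputs (hypothetical): a named correlation matrix and a lookup table
# in which gene "g1" belongs to two complexes.
set.seed(42)
genes <- paste0("g", 1:5)
cors_named <- cor(matrix(rnorm(5 * 30), nrow = 30, dimnames = list(NULL, genes)))
lkps_named <- data.frame(
  Complex    = c("Complex A", "Complex A", "Complex A",
                 "Complex B", "Complex B", "Complex C"),
  Protein_ID = c("g1", "g2", "g3", "g1", "g4", "g5")
)

# For each complex, subset the matrix by member names and keep the upper triangle.
members <- split(lkps_named$Protein_ID, lkps_named$Complex)
pairs_by_complex <- lapply(members, function(ids) {
  ids <- intersect(unique(ids), rownames(cors_named)) # drop genes absent from the matrix
  if (length(ids) < 2) return(NULL)                   # a single gene has no pairs
  sub <- cors_named[ids, ids, drop = FALSE]
  idx <- which(upper.tri(sub), arr.ind = TRUE)        # each pair once, no diagonal
  data.frame(gene_i = ids[idx[, "row"]],
             gene_j = ids[idx[, "col"]],
             Correlation = sub[idx])
})
pairs_by_complex <- Filter(Negate(is.null), pairs_by_complex)

# Stack the per-complex tables and label each row with its complex.
res_named <- do.call(rbind, Map(function(d, nm) data.frame(Complex = nm, d),
                                pairs_by_complex, names(pairs_by_complex)))
rownames(res_named) <- NULL
```

Here Complex A (three members) contributes three pairs, Complex B (two members) contributes one, and Complex C (one member) contributes none. For a 1484 x 1484 matrix the per-complex subsetting stays cheap, since only the small sub-matrices are ever reshaped.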
Example data (10 genes, 3 groups, only showing first 6 cols of correlation matrix):
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
  Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = ""))
)
> cors
             [,1]         [,2]         [,3]        [,4]         [,5]        [,6]
 [1,]  1.00000000 -0.039087178  0.026287227 -0.27185574  0.013674895 -0.11933102
 [2,] -0.03908718  1.000000000  0.003552006 -0.02391178  0.039833039  0.02218480
 [3,]  0.02628723  0.003552006  1.000000000  0.21648782  0.127791868  0.12197135
 [4,] -0.27185574 -0.023911775  0.216487818  1.00000000 -0.082713154 -0.24277681
 [5,]  0.01367489  0.039833039  0.127791868 -0.08271315  1.000000000  0.09888519
 [6,] -0.11933102  0.022184800  0.121971345 -0.24277681  0.098885194  1.00000000
 [7,]  0.19468192  0.006755358 -0.074116195  0.12591453  0.184806771 -0.14283941
 [8,] -0.14785348 -0.255064246 -0.054761988 -0.03252786  0.004459162  0.03851846
 [9,]  0.02336706  0.198299294  0.069506207  0.14657036  0.183043022 -0.10887799
[10,] -0.36678892  0.240101899  0.031648477  0.17387651  0.131315992 -0.12944992
> lkps
     Complex Protein_ID
1  Complex C       jMXs
2  Complex C       ruTw
3  Complex A       zoCU
4  Complex C       PCev
5  Complex A       aWvm
6  Complex B       vfRO
7  Complex A       GxvG
8  Complex B       jSsh
9  Complex B       lkpQ
10 Complex B       ufxz
Result:
            cor     i     j   Complex
          <num> <int> <int>    <char>
 1: -0.03908718     1     2 Complex C
 2: -0.27185574     1     4 Complex C
 3: -0.02391178     2     4 Complex C
 4:  0.12779187     3     5 Complex A
 5: -0.07411620     3     7 Complex A
 6:  0.18480677     5     7 Complex A
 7:  0.03851846     6     8 Complex B
 8: -0.10887799     6     9 Complex B
 9: -0.12944992     6    10 Complex B
10: -0.05267148     8     9 Complex B
11:  0.04892611     8    10 Complex B
12:  0.18778267     9    10 Complex B
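As a sanity check, a complex with n members should contribute choose(n, 2) pairs. For the lkps printed above (3, 3 and 4 members) that is 3 + 3 + 6 = 12, which matches the 12 rows in the result. The sketch below rebuilds the example inputs and recovers the same pairs in base R with outer() and upper.tri(). The names same_complex, idx, check and expected are introduced here for illustration only.

```r
# Recreate the example inputs from the answer.
set.seed(1)
n_genes <- 10
cors <- cor(matrix(rnorm(n_genes * 50), nrow = 50, ncol = n_genes))
lkps <- data.frame(
  Complex = sample(c("Complex A", "Complex B", "Complex C"), n_genes, replace = TRUE),
  Protein_ID = replicate(n_genes, paste0(sample(c(letters, LETTERS), 4, replace = TRUE), collapse = ""))
)

# TRUE where both genes share a complex, restricted to the upper triangle (row < col).
same_complex <- outer(lkps$Complex, lkps$Complex, "==") & upper.tri(cors)
idx <- which(same_complex, arr.ind = TRUE)

check <- data.frame(
  Complex     = lkps$Complex[idx[, "row"]],
  Correlation = cors[idx],
  i           = idx[, "row"],
  j           = idx[, "col"]
)

# Each complex with n members should contribute choose(n, 2) pairs.
expected <- sum(choose(as.vector(table(lkps$Complex)), 2))
stopifnot(nrow(check) == expected)
```

To get the two-column layout asked for in the question from the data.table result, something like df[, .(Complex, Correlation = cor)] should be enough.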