pythonmatrixpearson-correlationscanpy

scanpy correlation matrix with dendrogram


I tried to recreate the correlation matrix that is described in scanpy's tutorial, using my own RNAseq dataset.

The relevant function in scanpy is: sc.pl.correlation_matrix and the plot looks like this:

enter image description here

The main question here is: how was this Pearson's correlation between different cell types calculated, while the size of the matrix for each cell type is different?

For example: I have 1000 genes as columns, 500 CD34+ cells as rows and only 200 CD19+ B cells. So how is it possible to calculate a correlation between both cell types?

Additionally, calculating a correlation between two matrices, results in another matrix (and not with a single scalar such as displayed in the plot above). For example, numpy's corrcoef() function that is applied on two matrices of the same size results in another matrix and not with a scalar...

I tried to average the genes expression over the cells for each cell type so the calculation contained equally-sized vectors, and it still didn't match scanpy's results.

I encountered this conversation: https://github.com/theislab/scanpy/pull/425 mentioning that when hierarchical clustering is computed, this correlation matrix above is created (but no code was provided).

I'll be happy with any suggestions, explanations and some possible python / code implementations.


Solution

  • I managed to solve this problem with some trial and error method.

    Indeed, it is impossible to obtain a single scalar from a correlation between two unequal-sized matrices.

    In fact, the correlation in this function is not conducted on genes at all! It is calculated on the results of a 50-components-PCA of the dataset!

    After conducting PCA, you'll have to average the values of every component for each cell type (because each cell type contains different number of cells). Thus, you'll get a single vector (1*50) for each cell type.

    You will finally have a dataframe where each column is a different cell type and each row is the average component of the PCA for each cell type.

    The calculation of the pearson correlation can be simply achieved using the corr() function of pandas on the final averaged PCA dataframe.

    The arrangement of the plot, including dendrogram, can be achieved with seaborn's clustermap() function.

    That's it :)