graphdata-miningcluster-analysismarkov-modelsmcl

Graph analysis using mcxquery


I am clustering and analysing graphs using mcl. I'm not familiar with graph theory and I read about the function mcxquery.

It is said in the doc that: " The main use of mcxquery is to analyze a graph at different similarity cutoffs. Typically this is done on a graph constructed using a very permissive threshold. For example, one can create a graph from array expression data using mcxarray with a very low pearson correlation cutoff such as 0.2 or 0.3."

What is the similarity they are talking about? How can the pearson correlation be used to analyse the graph? Is it measuring 'how connected' the nodes are?


Solution

  • In this case Pearson correlation was just used as an example. This could be the case where a node is a gene, and a gene is associated with a time-series of expression. The similarity between two genes can then be taken as the Pearson correlation between the two time-series. This will give the similiarity, or 'connectedness' if you will, between two nodes. Do note that mcl, and most other network clustering methods, require the edge weight between two nodes to be a similarity and not a distance.

    Now, a threshold is necessary. Without a threshold, the resulting network would be a complete graph (all edges exist) or if not complete then very dense. Such networks are very hard to handle (because of time and memory requirements), but more importantly, there is usually very little information contained in the lower ranges of similarity, so all that time and memory is spent largely on relatively useless data. As far as I know, there are no good guidelines for automatic selection of a good threshold. It is probably application dependent, so will require some thinking or exploration. On the other hand, if the data is good, there is probably a plateau (range of values), where the exact threshold chosen does not matter that much.

    The program mcx query can help with picking such a threshold; it will supply information about average node degree, number of singletons, size of the largest component and other network attributes, for a list of networks resulting from applying varying thresholds.