neo4jcypherlowest-common-ancestor

Neo4j Cypher Lowest Common Ancestor performance issue


I am trying to create an application in spring-neo4j using the graphrepository. One of the requirements is finding a Lowest Common Ancestor (lca) between two child nodes.

Currently I use the following query to achieve this:

 @Query("MATCH path = (c1:Concept)-[r:Relation*{type: 'Is a'}]->(ca:Concept)<-[r:Relation*{type: 'Is a'}]-(c2:Concept) 
      WHERE c1.conceptID = {conceptId1} AND c2.conceptID = {conceptId2}
      RETURN ca ORDER BY length(path) LIMIT 1")

Concept findLowestCommonAncestor(@Param("conceptId1") Long conceptId1, @Param("conceptId2") Long conceptId2);

The problem here is performance. Originally my graph consisted of 330 000 nodes and 2 000 000 relations. The relations I am interested in are the ones having type: "Is a". They only move upwards in the tree (connecting child nodes with parent nodes). The maximum upwards distance is 3.

This is the tree structure:

Tree structure example

Because I have several tree structures like this one, I decided to add a root node connecting all the different tree structures together. So that the lca will always be found. But adding this root node changed my performance drastically: From 558 ms to 562286 ms

I know adding a root node will impact the performance but it should not be this much, right? And if so, is there a better way to calculate the lca?

I would think that my cypher query would only look for nodes upwards in the tree. So in that case, adding an extra root node should not have this much influence on performance.


Solution

  • First, I want to thank @Supamiu for his help on my question.

    Although his answer is probably sufficient for some cases. In my case it was not really possible to change the model.

    Surprisingly enough I managed to increase the performance just by changing the cypher query.

    My original query:

    MATCH path = (c1:Concept)-[r1:Relation*{type: "Is a"}]->(ca:Concept)<-[r2:Relation*{type: "Is a"}]-(c2:Concept)
    WHERE c1.conceptID=35104066 AND c2.conceptID=35808913
    RETURN ca
    ORDER BY length(path)
    LIMIT 1
    

    PROFILE: https://i.sstatic.net/b3Xa8.png

    I changed it to:

    MATCH (c1:Concept {conceptID: 35104066})-[:Relation*{type: "Is a"}]->(p1:Concept)
    MATCH (:Concept {conceptID: 35808913})-[:Relation*{type: "Is a"}]->(p2:Concept)
    WHERE p1.conceptID = p2.conceptID
    MATCH path = (c1)-[:Relation*{type: "Is a"}]->(p1)
    RETURN p1
    ORDER BY length(path)
    LIMIT 1
    

    PROFILE: https://i.sstatic.net/sc18H.jpg

    This gives my the Lowest Common Ancestor in around 100 ms, a significant improvement!

    I still don't fully understand why this makes such a difference. But I think one of the reasons is that I use a tree structure. I think somehow in the second query Cypher only searches upwards in the tree, resulting in less relations to search. So in a blob structure this would not really help, I think.

    Can someone maybe confirm this?