neo4jcypherquery-optimization

Neo4j WHERE Predicates Based on Arrays vs. EXISTS and Relationships


I have a Vacancy node, which is related to Country (Country also has the label Requirable):

(v:Vacancy)-[:WORK_PERMIT_IN]-(:Country)

Also, Vacancy has a denormalized property workCountryIds that contains all the IDs of Country.

Could you please suggest which approach is best to use in Neo4j for WHERE predicates in terms of performance with large datasets:

MATCH  ( v:Vacancy )  
WHERE  ( coalesce(size(v.workCountryIds), 0) <= 0  OR  exists { MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable) WHERE req3.id IN [11, 12, 13]}  )  
RETURN v

or

MATCH  ( v:Vacancy )  
WHERE ( NOT exists { MATCH (v)-[:WORK_PERMIT_IN]-(:Country)}  OR  exists { MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable) WHERE req3.id IN [11, 12, 13]}  )
RETURN v

In the first case, I use:

coalesce(size(v.workCountryIds), 0) <= 0

In the second:

NOT exists { MATCH (v)-[:WORK_PERMIT_IN]-(:Country)}

Which of these two approaches is better to use in terms of performance?


Solution

  • [UPDATED]

    Your second case has to follow WORK_PERMIT_IN relationships from every Vacancy to check that there are no Country nodes on the other end (and in the worst case, it has to check every Country node connected in that way). But your first case only has to look at one property of each Vacancy node (ignoring the relationships entirely), so it would be faster. [Note: if WORK_PERMIT_IN relationships always connect Vacancy nodes to Country nodes, then your second case could be made faster by changing (:Country) to just (), and you should also make the relationship pattern directional. But even with these tweaks the second case would be slower.]

    However, there is actually another performance issue, and it exists in both of your cases. I will illustrate with your first case, since that is already faster. Here is your first case (reformatted for readability):

    MATCH (v:Vacancy)  
    WHERE
      // (a)
      COALESCE(SIZE(v.workCountryIds), 0) <= 0 OR
      // (b)
      EXISTS {
        MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable)
        WHERE req3.id IN [11, 12, 13]
      }
    RETURN v
    

    You are returning vacancies that are either: (a) not connected to any Country, or (b) connected to a few desired Requirable nodes. Your implementation of (b) is inefficient because it executes MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable) for every v that fails the (a) test, which presumably is the case for most vs. And, assuming that there are many Requirable ids, it is very wasteful to perform that MATCH on almost every v instead of just the vacancies connected to the 3 desired Requirable nodes.

    The following query should be much more efficient (and does not even require your denormalized workCountryIds property!). It assumes that WORK_PERMIT_IN relationships always connect Vacancy nodes to Country nodes.

    // (a)
    MATCH (v1:Vacancy)
    WHERE NOT (v1)-[:WORK_PERMIT_IN]->()
    WITH COLLECT(v1) AS v1List
    // (b)
    MATCH (v2)-[:WORK_PERMIT_IN]->(req:Requirable)
    WHERE req.id IN [11, 12, 13] AND NOT v2 IN v1List
    WITH v1List, COLLECT(DISTINCT v2) AS v2List
    // Return combined (a) and (b) results
    RETURN v1List + v2List AS v
    

    Here is an explanation of the new (a) code. Since (v)-[:WORK_PERMIT_IN]->() is a relationship pattern that contains only one bound node (v) and the relationship type (WORK_PERMIT_IN), without any other information: neo4j will use the very efficient getDegree() operation to use a cached relationship count without actually needing get any relationships at all. This is very fast. (This syntax is also a shorthand way to test for existence without using EXISTS.)

    Way to delay and hopefully avoid running out of memory due to COLLECT, but slower

    You can COLLECT Vacancy node native ids (or elementIds) instead of the nodes themselves, and then at the end return the actual nodes one at a time. This will use up less memory per node, but can still run out of memory if an enormous number of result nodes are found.

    MATCH (v1:Vacancy)
    WHERE NOT (v1)-[:WORK_PERMIT_IN]->()
    WITH COLLECT(ID(v1)) AS id1List
    MATCH (v2)-[:WORK_PERMIT_IN]->(req:Requirable)
    WHERE req.id IN [11, 12, 13] AND NOT ID(v2) IN id1List
    WITH id1List, COLLECT(DISTINCT ID(v2)) AS id2List
    UNWIND id1List + id2List AS id
    MATCH (v) WHERE ID(v) = id
    RETURN v
    

    Alternatively, you can use run your query in batches using the APOC library. For example, the apoc.periodic.iterate procedure. There are other StackOverflow questions about this approach.