I have a Vacancy
node, which is related to Country
(Country
also has the label Requirable
):
(v:Vacancy)-[:WORK_PERMIT_IN]-(:Country)
Also, Vacancy
has a denormalized property workCountryIds
that contains all the IDs of Country.
Could you please suggest which approach is best to use in Neo4j for WHERE predicates in terms of performance with large datasets:
MATCH ( v:Vacancy )
WHERE ( coalesce(size(v.workCountryIds), 0) <= 0 OR exists { MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable) WHERE req3.id IN [11, 12, 13]} )
RETURN v
or
MATCH ( v:Vacancy )
WHERE ( NOT exists { MATCH (v)-[:WORK_PERMIT_IN]-(:Country)} OR exists { MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable) WHERE req3.id IN [11, 12, 13]} )
RETURN v
In the first case, I use:
coalesce(size(v.workCountryIds), 0) <= 0
In the second:
NOT exists { MATCH (v)-[:WORK_PERMIT_IN]-(:Country)}
Which of these two approaches is better to use in terms of performance?
[UPDATED]
Your second case has to follow WORK_PERMIT_IN
relationships from every Vacancy
to check that there are no Country
nodes on the other end (and in the worst case, it has to check every Country
node connected in that way). But your first case only has to look at one property of each Vacancy
node (ignoring the relationships entirely), so it would be faster. [Note: if WORK_PERMIT_IN
relationships always connect Vacancy
nodes to Country
nodes, then your second case could be made faster by changing (:Country)
to just ()
, and you should also make the relationship pattern directional. But even with these tweaks the second case would be slower.]
However, there is actually another performance issue, and it exists in both of your cases. I will illustrate with your first case, since that is already faster. Here is your first case (reformatted for readability):
MATCH (v:Vacancy)
WHERE
// (a)
COALESCE(SIZE(v.workCountryIds), 0) <= 0 OR
// (b)
EXISTS {
MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable)
WHERE req3.id IN [11, 12, 13]
}
RETURN v
You are returning vacancies that are either: (a) not connected to any Country
, or (b) connected to a few desired Requirable
nodes. Your implementation of (b) is inefficient because it executes MATCH (v)-[:WORK_PERMIT_IN]-(req3:Requirable)
for every v
that fails the (a) test, which presumably is the case for most v
s. And, assuming that there are many Requirable
ids, it is very wasteful to perform that MATCH
on almost every v
instead of just the vacancies connected to the 3 desired Requirable
nodes.
The following query should be much more efficient (and does not even require your denormalized workCountryIds
property!). It assumes that WORK_PERMIT_IN
relationships always connect Vacancy
nodes to Country
nodes.
// (a)
MATCH (v1:Vacancy)
WHERE NOT (v1)-[:WORK_PERMIT_IN]->()
WITH COLLECT(v1) AS v1List
// (b)
MATCH (v2)-[:WORK_PERMIT_IN]->(req:Requirable)
WHERE req.id IN [11, 12, 13] AND NOT v2 IN v1List
WITH v1List, COLLECT(DISTINCT v2) AS v2List
// Return combined (a) and (b) results
RETURN v1List + v2List AS v
Here is an explanation of the new (a) code. Since (v)-[:WORK_PERMIT_IN]->()
is a relationship pattern that contains only one bound node (v
) and the relationship type (WORK_PERMIT_IN
), without any other information: neo4j will use the very efficient getDegree()
operation to use a cached relationship count without actually needing get any relationships at all. This is very fast. (This syntax is also a shorthand way to test for existence without using EXISTS
.)
COLLECT
, but slowerYou can COLLECT
Vacancy node native ids (or elementIds) instead of the nodes themselves, and then at the end return the actual nodes one at a time. This will use up less memory per node, but can still run out of memory if an enormous number of result nodes are found.
MATCH (v1:Vacancy)
WHERE NOT (v1)-[:WORK_PERMIT_IN]->()
WITH COLLECT(ID(v1)) AS id1List
MATCH (v2)-[:WORK_PERMIT_IN]->(req:Requirable)
WHERE req.id IN [11, 12, 13] AND NOT ID(v2) IN id1List
WITH id1List, COLLECT(DISTINCT ID(v2)) AS id2List
UNWIND id1List + id2List AS id
MATCH (v) WHERE ID(v) = id
RETURN v
Alternatively, you can use run your query in batches using the APOC library. For example, the apoc.periodic.iterate procedure. There are other StackOverflow questions about this approach.