I am running the same query in Neo4j and noticing that the execution time varies significantly. For example, the query might execute in 5-6 seconds, while at another time it takes 2-3 minutes, even though the data remains unchanged.
:param {
idsToExclude: []
};
:auto LOAD CSV WITH HEADERS FROM ('file:///PERSON_DATA.csv') AS row
WITH row
WHERE NOT row.`person_id` IN $idsToExclude AND NOT row.`person_id` IS NULL
CALL {
WITH row
MERGE (n: `Person` { `person_id`: row.`person_id` })
SET n.`person_id` = row.`person_id`
SET n.`name` = row.`name`
SET n.`age` = toInteger(row.`age`)
SET n.`email` = row.`email`
SET n.`address` = row.`address`
SET n.`creation_date` = datetime(row.`creation_date`)
SET n.`last_modified_date` = datetime(row.`last_modified_date`)
} IN TRANSACTIONS OF 5000 ROWS;
here is some data from my config file that has been uncommented if needed
server.memory.heap.initial_size=8G
server.memory.heap.max_size=16G
dbms.memory.transaction.total.max=32G
EXPLAIN of my Query
Why can the execution of the same query in Neo4j take different amounts of time? What factors can influence this, and how can I optimize its performance?
The cause of the slowdown is that you do not have an index on :Person(person_id)
, that would explain why early runs (with little data) execute fast, but become increasingly expensive as data is loaded, or over a larger CSV to ingest.
Without an index, the cost of the MATCH part of the MERGE (since a MERGE is like a MATCH and then a CREATE, if no matches were found) increases linearly with the number of :Person nodes in the graph. With an index, there is still a rising cost, but it's log(n) complexity due to index use, so should remain efficient.
Add the index, confirm that the index is used in the EXPLAIN plan (you should see a NodeIndexSeek operator and no NodeByLabelScan operators), then test again.