neo4j cypher py2neo

Best way to query a large list of nodes in neo4j


I'm trying to use the py2neo module to fetch information for a large number of nodes in a neo4j database whose IDs I already know, with something like the following:

query = f'''    
    MATCH 
        (n:MY_LABEL)
    OPTIONAL MATCH 
        (n) -- (u:OTHER_LABEL) // Won't always have a neighbor
    WHERE 
        id(n) in [{','.join(very_long_list_of_nids)}]
    RETURN 
        id(n) as nid, 
        n.feature1,
        u.feature2
'''
resp = graph.run(query)

I have noticed that it is far faster to omit the WHERE clause entirely and filter client-side after the query returns the content of every n:MY_LABEL node. Is there a more elegant way to do this?

For reference, the very_long_list_of_nids list is about 500k elements long (I have tried batching it into smaller 10k chunks and have the same problem), and the database contains 4M nodes and 10M edges.


Solution

  • You should:

    1. Move the WHERE clause right under your MATCH clause. Currently, your WHERE clause sits under the OPTIONAL MATCH clause, so it only constrains the optional pattern: every MY_LABEL node is still matched and returned (with u set to null where the ID test fails), and no ID filtering happens up front.
    2. Remove the :MY_LABEL qualification from the MATCH clause. When you already fetch nodes by native ID, checking the label is unnecessary work, and the query does not use a label-based index that would require it.
    3. Pass the list of IDs as a parameter. The query text then stays short and constant, so the Cypher query planner runs much faster, and once the plan is created it is cached and reused every time you rerun the query with a new ID list. This also makes your client code simpler and faster.

    This should be much faster:

    query = '''
        MATCH 
            (n)
        WHERE 
            ID(n) in $id_list
        OPTIONAL MATCH 
            (n) -- (u:OTHER_LABEL) // Won't always have a neighbor
        RETURN 
            ID(n) as nid, 
            n.feature1,
            u.feature2
    '''
    resp = graph.run(query, id_list=very_long_list_of_nids)
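
If you still want to send the IDs in chunks, the same parameterized query can be reused for every batch. The following is a minimal sketch, not a definitive implementation: `chunked` and `fetch_in_batches` are hypothetical helper names, the batch size of 10,000 is taken from the question, and `run` is assumed to be any callable with the shape of py2neo's `graph.run` (query text plus keyword parameters, returning an iterable of records). It also converts the IDs to integers, on the assumption that the list holds strings (the `','.join(...)` in the question suggests so); `id(n)` is an integer and will not match string values.

```python
from typing import Any, Callable, Iterator, List, Sequence

# Same query text for every batch, so the cached plan can be reused.
QUERY = '''
    MATCH (n)
    WHERE id(n) IN $id_list
    OPTIONAL MATCH (n) -- (u:OTHER_LABEL)
    RETURN id(n) AS nid, n.feature1, u.feature2
'''


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(run: Callable[..., Any],
                     ids: Sequence[Any],
                     batch_size: int = 10_000) -> List[Any]:
    """Run QUERY once per batch and collect all returned records."""
    # Assumption: IDs may arrive as strings; id(n) only matches integers.
    int_ids = [int(i) for i in ids]
    rows: List[Any] = []
    for batch in chunked(int_ids, batch_size):
        rows.extend(run(QUERY, id_list=batch))
    return rows
```

With py2neo this would be called as `rows = fetch_in_batches(graph.run, very_long_list_of_nids)`.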
    

    Also, if the relationships between MY_LABEL and OTHER_LABEL always flow in one direction, you should consider using a directional relationship pattern (either --> or <--) in your OPTIONAL MATCH clause, especially if your MY_LABEL nodes have other kinds of relationships that flow in the opposite direction.
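
As a sketch of the directional variant: which way your relationships actually point is not known here, so the direction is an explicit assumption, and `neighbor_query` is a hypothetical helper that only swaps the arrow in the pattern.

```python
def neighbor_query(outgoing: bool = True) -> str:
    """Build the query with a directed OPTIONAL MATCH pattern.

    Assumption: outgoing=True means relationships point from the
    MY_LABEL node to the OTHER_LABEL node; pass False for the reverse.
    """
    arrow = "-->" if outgoing else "<--"
    # Only the arrow is interpolated; the ID list stays a parameter.
    return f'''
        MATCH (n)
        WHERE id(n) IN $id_list
        OPTIONAL MATCH (n) {arrow} (u:OTHER_LABEL)
        RETURN id(n) AS nid, n.feature1, u.feature2
    '''
```

It would be used as `resp = graph.run(neighbor_query(outgoing=True), id_list=very_long_list_of_nids)`. Because only two fixed query texts are possible, each one can still be planned once and reused.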