I have a graph database with three types of nodes and two relationships:
(p:PERSON)-[:manages]->(c:COMPANY)-[:seeks]->(s:SKILLS)
I want to create a new relationship between the nodes labeled (:PERSON), such as (p1:PERSON)-[:competes_with]->(p2:PERSON) and (p2:PERSON)-[:competes_with]->(p1:PERSON), subject to p1.name <> p2.name, so that I can represent competition for scarce labor in a variety of markets represented by (s:SKILLS).
The condition for establishing the new relationship [:competes_with] is that 2 distinct person nodes (:PERSON) manage companies that seek at least 3 (:SKILLS) profiles in common between the 2 companies.
Orders of magnitude are:
|(:PERSON)| = 6000
|(:COMPANY)| = 15000
|(:SKILLS)| = 95000
In my plodding way, what I did was:
MATCH (p1:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1, collect(DISTINCT s.skill_names) AS p1_skills
MATCH (p2:PERSON)-[:manages]->(:COMPANY)-[:seeks]->(s:SKILLS)
WITH p1,p1_skills, p2, collect(DISTINCT s.skill_names) AS p2_skills
WHERE p1 <> p2
UNWIND p1_skills AS sought_skills
WITH p1,p2, sought_skills, reduce(com_skills=[], sought_skills IN p2_skills | com_skills + sought_skills) AS NCS
WHERE size(NCS) >= 3
MERGE(p1)-[competes_with]->(p2)
MERGE(p2)-[competes_with]->(p1)
Given the size of the problem, this causes a 14GB RAM box to crash after a while with an "out-of-memory" exception.
So, besides the fact that I don't know whether my query actually does what I want (it crashes before completing), the question is:
Can I streamline this to make it work with smaller memory requirements? What would the improved query be like?
Thanks
Some thoughts:
The queries below follow the usual Neo4j naming convention: CamelCase node labels and upper-case relationship types, e.g., Person and MANAGES.
You do not need two COMPETES_WITH relationships between the same 2 Person nodes if the relationship is inherently bidirectional. Neo4j can navigate incoming and outgoing relationships equally easily, and the MATCH clause allows a relationship pattern to not specify a direction (e.g., MATCH (a)-[:FOO]-(b)). Also, the MERGE clause (but not CREATE) allows you to specify an undirected relationship, which ensures that only one relationship exists between the 2 endpoints.
Consider whether the COMPETES_WITH relationship really belongs between Company nodes, since that is really the source of the competition. Also, if a Person left a company, you should not have to remove any COMPETES_WITH relationships from that node (and you should also not have to add a COMPETES_WITH relationship to the replacement Person).
Consider whether a COMPETES_WITH relationship is really needed in the first place. Every time the skills sought by a Company change, you'd have to recalculate its COMPETES_WITH relationships. You should determine whether doing that is worth it, or whether your queries should just dynamically determine a company's competitors as needed.
Here is a simplified version of your original query:
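MATCH (p1:Person)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
// Exclude the case where one person manages both companies
WHERE p1 <> p2
// DISTINCT avoids counting a skill once per pair of companies that share it
WITH p1, p2, COUNT(DISTINCT s) AS num_skills
WHERE num_skills >= 3
MERGE (p1)-[:COMPETES_WITH]-(p2);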
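Since the original problem was running out of memory, it may help to process one Person at a time and commit in batches instead of building every pair in a single transaction. The following is only a sketch under some assumptions: it relies on CALL { ... } IN TRANSACTIONS, which needs Neo4j 4.4 or later and an auto-commit transaction (in Neo4j Browser, prefix the statement with :auto), and the batch size of 100 is an arbitrary starting point to tune. The id(p1) < id(p2) filter is my addition; it skips self-pairs and visits each pair only once.

```cypher
// Sketch only: assumes Neo4j 4.4+ and an auto-commit transaction.
MATCH (p1:Person)
CALL {
  WITH p1
  MATCH (p1)-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
  // Visit each unordered pair once and never pair a person with themselves
  WHERE id(p1) < id(p2)
  WITH p1, p2, count(DISTINCT s) AS num_skills
  WHERE num_skills >= 3
  MERGE (p1)-[:COMPETES_WITH]-(p2)
} IN TRANSACTIONS OF 100 ROWS;
```

On Neo4j 5, id() is deprecated in favor of elementId(), so the comparison may need adjusting there.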
To find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:COMPETES_WITH]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
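Lookups like {id: 123} generally benefit from an index on the lookup property, so the starting Person is found without scanning every Person node. A minimal sketch, assuming Neo4j 4.x or later syntax and that Person nodes really carry an id property as in the example above (the index name person_id is just a placeholder):

```cypher
// Hypothetical index name; assumes Person nodes have an `id` property.
CREATE INDEX person_id IF NOT EXISTS
FOR (p:Person) ON (p.id);
```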
If you changed the data model to have the COMPETES_WITH relationship between Company nodes:
MATCH (c1:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(c2:Company)
WITH c1, c2, COUNT(s) AS num_skills
WHERE num_skills >= 3
MERGE(c1)-[:COMPETES_WITH]-(c2);
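If you do store the relationship, one possible refinement is to record how many skills the two companies share, so that the threshold of 3 can be tightened later without recomputing everything. This is a sketch, not part of the original model: the shared_skills property name is hypothetical, and the id(c1) < id(c2) filter is added so each pair of companies is handled once.

```cypher
// Sketch: `shared_skills` is an illustrative property name.
MATCH (c1:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(c2:Company)
// Handle each unordered pair of companies once
WHERE id(c1) < id(c2)
WITH c1, c2, count(s) AS num_skills
WHERE num_skills >= 3
MERGE (c1)-[r:COMPETES_WITH]-(c2)
SET r.shared_skills = num_skills;
```

A later query could then filter on r.shared_skills >= 5, for example, to find only the closest competitors.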
With this model, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:COMPETES_WITH]-(:Company)<-[:MANAGES]-(p2:Person)
RETURN p1, COLLECT(p2) AS competing_people;
If you did not have COMPETES_WITH relationships at all, to find the Person nodes that compete with a given Person:
MATCH (p1:Person {id: 123})-[:MANAGES]->(:Company)-[:SEEKS]->(s:Skills)<-[:SEEKS]-(:Company)<-[:MANAGES]-(p2:Person)
// Exclude p1 itself, which can match when p1 manages more than one company
WHERE p1 <> p2
WITH p1, p2, COUNT(DISTINCT s) AS num_skills
WHERE num_skills >= 3
RETURN p1, COLLECT(p2) AS competing_people;