neo4j similarity

Is it a problem if the mean similarity score is high when building a similarity graph?


I'm building a similarity graph in Neo4j, and gds.nodeSimilarity.stats reports a mean similarity score in the 0.60 to 0.85 range for the projection I'm using, regardless of how I transform the graph. The projections I've tried are listed further down.

I realize I can always set similarityCutoff in gds.nodeSimilarity.write to a higher value, but I'm second-guessing myself because all the toy problems I trained on, including Neo4j's own examples, had mean Jaccard scores below 0.5. Am I overthinking this, or is it a sign that something is wrong?

This is a graph that has two types of nodes: Posts and entities. The posts reflect various media types, while the entities reflect various authors and proper nouns. In this case, I'm mostly focused on Twitter. Some examples of relationships:

(e1:Entity {Type: 'TwitterAccount'})-[:TWEETED]->(p:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2:Entity {Type: 'TwitterAccount'})

(e1:Entity {Type: 'TwitterAccount'})-[:TWEETED]->(p1:Post {Type: 'Tweet'})-[:QUOTE_TWEETED]->(p2:Post {Type: 'Tweet'})-[:AT_MENTIONED]->(e2:Entity {Type: 'TwitterAccount'})

For my code, I've tried first projecting only AT_MENTIONED relationships:

CALL gds.graph.create('similarity_graph', ['Entity', 'Post'], 'AT_MENTIONED')

I've tried doing that with a reversed orientation:

CALL gds.graph.create('similarity_graph', ['Entity', 'Post'], {AT_MENTIONED: {type: 'AT_MENTIONED', orientation: 'REVERSE'}})

I've tried creating a monopartite, weighted graph by linking the Twitter account entities to each other with a RELATED_TO relationship ...

MATCH (e1:Entity)-[*2..3]->(e2:Entity)
WHERE e1.Type = 'TwitterAccount' AND e2.Type = 'TwitterAccount' AND id(e1) < id(e2)
WITH e1, e2, count(*) AS strength
MERGE (e1)-[r:RELATED_TO]->(e2)
SET r.strength = strength

...and then projecting that:

CALL gds.graph.create("similarity_graph", "Entity", "RELATED_TO")

Whichever one of the above I try, I then get my Jaccard distribution by running:

CALL gds.nodeSimilarity.stats('similarity_graph') YIELD nodesCompared, similarityDistribution

Solution

  • Part of the reason the mean similarity is high is that topK defaults to 10. For each node, only its 10 most similar nodes are considered, so relationships (and the reported statistics) cover only the strongest pairs. Try running the following query:

    CALL gds.nodeSimilarity.stats('similarity_graph', {topK:1000})
    YIELD nodesCompared, similarityDistribution
    

    You will probably get a lower mean similarity now. How dense the similarity graph should be depends on your use case. You can start with the default values and see how it goes. If the result is too dense, raise the similarityCutoff threshold; if it is too sparse, raise the topK parameter. There is no silver bullet: it depends on your use case and your dataset.

    Changing the relationship direction will heavily influence the results. In a graph of

    (:User)-[:RELATIONSHIP]->(:Item)
    

    the resulting monopartite network will be a network of users. However, if you reverse the relationship

    (:User)<-[:RELATIONSHIP]-(:Item)
    

    then the resulting network will be a network of items.

    Finally, a Jaccard mean of 0.7 with topK at 10 is actually a good sign: it means the relationships connect nodes that really are similar. The Neo4j examples lower the similarity cutoff only so that some relationships are created and the similarity graph is not too sparse. You can also raise the topK parameter; it is hard to be more specific without knowing the size of your graph.
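
To see why a small topK pushes the mean up, here is a minimal pure-Python sketch of the idea. It is not GDS code and not how the library is implemented. The `mentions` data, the account names and the helper functions `jaccard`, `top_k_scores` and `reverse_edges` are all made up for illustration. It assumes that pairs with no shared neighbor are ignored, and it keeps only the top_k best scores per node before averaging.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_scores(neighbors, top_k):
    """Collect, for every node, its top_k highest non-zero Jaccard scores."""
    scores = []
    for node, targets in neighbors.items():
        candidates = sorted(
            (
                jaccard(targets, other_targets)
                for other, other_targets in neighbors.items()
                if other != node
            ),
            reverse=True,
        )
        # Pairs without any shared neighbor (score 0) are dropped.
        scores.extend(s for s in candidates[:top_k] if s > 0)
    return scores


def reverse_edges(neighbors):
    """Flip source -> targets into target -> sources (a REVERSE orientation)."""
    inverted = {}
    for source, targets in neighbors.items():
        for target in targets:
            inverted.setdefault(target, set()).add(source)
    return inverted


# Hypothetical toy data: account -> set of accounts it mentions.
mentions = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y"},
    "d": {"z", "w"},
    "e": {"w"},
}

mean_top1 = mean(top_k_scores(mentions, 1))    # strongest pair per node only
mean_top10 = mean(top_k_scores(mentions, 10))  # effectively every non-zero pair
print(f"mean with top_k=1:  {mean_top1:.3f}")
print(f"mean with top_k=10: {mean_top10:.3f}")

# Reversing the edges changes which side of the graph gets compared.
mentioned = reverse_edges(mentions)
print(sorted(mentioned))
```

On this toy data the mean over each node's single best match is higher than the mean over all non-zero pairs, which is the same effect as a small topK on a real projection. Reversing the edges makes the mentioned accounts the nodes being compared, rather than the accounts doing the mentioning.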