neo4j cypher pagerank

How to Normalize PageRank Scores


I'm running PageRank on a group of nodes, where each node has a property year. How can I calculate the average PageRank score for each value of the year property? That is to say, if there are 100 nodes with a total of 20 different year values, I would like to calculate 20 average PageRank values.

Then, for each node, I'd like to calculate a scaled score based on the difference between the node's PageRank score and the average PageRank score of papers in that year (where the average for a year is based on the PageRank scores of all nodes with that same value for the year property).

The code to run PageRank is:

    CALL algo.pageRank.stream(
      'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
      'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
      {graph:'cypher', iterations:20, write:false, concurrency:20})
    YIELD node, score
    WITH *,
      node.title AS title,
      node.year AS year,
      score AS page_rank
    ORDER BY page_rank DESC
    LIMIT 10000
    RETURN title, year, page_rank;

How can I alter this code to return the scaled score?

Any help is greatly appreciated!


Solution

  • This query should return the scaled_score for each year/title combination: the absolute difference between the title's page_rank and the average page_rank for its year, divided by that average. The lower the scaled score, the closer the title's page_rank is to the average for that year:

    CALL algo.pageRank.stream(
      'MATCH (p:Paper) WHERE p.year < 2015 RETURN id(p) as id',
      'MATCH (p1:Paper)-[:CITES]->(p2:Paper) RETURN id(p1) as source, id(p2) as target',
      {graph:'cypher', iterations:20, write:false, concurrency:20})
    YIELD node, score
    WITH 
      node.title AS title,
      node.year AS year, 
      score AS page_rank
    ORDER BY page_rank DESC
    LIMIT 10000
    WITH year, COLLECT({title: title, page_rank: page_rank}) AS data, AVG(page_rank) AS avg_page_rank
    UNWIND data AS d
    RETURN year, d.title AS title, ABS(d.page_rank-avg_page_rank)/avg_page_rank AS scaled_score;
    

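    You may also want to order the results (say, by year or scaled_score) by appending an ORDER BY clause. Note that because LIMIT 10000 is applied before the aggregation, each year's average is computed only over the papers that made the top 10,000 by page_rank, not necessarily over every paper from that year.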
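
To make the arithmetic in the query concrete, here is a minimal Python sketch of the same two steps: group the scores by year to get each year's average, then compute |page_rank - average| / average for every paper. The sample rows and the function name scaled_scores are hypothetical, made up purely for illustration; they do not come from the Neo4j library or from the asker's data.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the (title, year, page_rank)
# values streamed by the PageRank call; the numbers are made up.
rows = [
    ("Paper A", 2010, 1.0),
    ("Paper B", 2010, 3.0),
    ("Paper C", 2012, 0.5),
    ("Paper D", 2012, 1.5),
    ("Paper E", 2012, 4.0),
]


def scaled_scores(rows):
    """Return (year, title, scaled_score) tuples, ordered by year.

    scaled_score = |page_rank - year_average| / year_average,
    mirroring the ABS(...)/avg_page_rank expression in the query.
    """
    # Step 1: group papers by year (the COLLECT(...) part of the query).
    by_year = defaultdict(list)
    for title, year, page_rank in rows:
        by_year[year].append((title, page_rank))

    # Step 2: average per year, then scale each paper (the UNWIND part).
    result = []
    for year, papers in sorted(by_year.items()):
        avg = sum(pr for _, pr in papers) / len(papers)
        for title, pr in papers:
            result.append((year, title, abs(pr - avg) / avg))
    return result


for year, title, score in scaled_scores(rows):
    print(year, title, score)
```

In this made-up sample both years average 2.0, so "Paper A" and "Paper B" each get a scaled score of 0.5, while "Paper D" (0.25) is the closest to its year's average.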