python apache-spark pyspark spark-graphx

Turning a Spark DataFrame of edges into a GraphX graph


I have a DataFrame like this:

> | Id1 | Id2 | attr1 | attr2 | attr3 |
> |----:|----:|------:|------:|------:|
> | 1   | 2   | 1     | 0     | .5    |
> | 1   | 3   | 1     | 1     | .33   |
> | 2   | 3   | 0     | .6    | .7    |

I want to create an edge for each nonzero attribute, weighted by the value in the table. How would I go about doing that? I can't seem to find an easy way, so right now I'm just using a for loop and iterating through each row, but that seems inefficient. Thanks!


Solution

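  • The three attribute columns can be stacked into a single weight column, so that each input row yields one row per attribute. After filtering that column for nonzero values, a GraphFrame can be constructed that has no edges with a zero weight (GraphX itself has no Python API, so this uses the GraphFrames package):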

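    from pyspark.sql import functions as F
    from graphframes import GraphFrame

    df = ...

    # stack() emits one row per attribute value, so each input row becomes three candidate edges
    edges = df.withColumn("weight", F.expr("stack(3, cast(attr1 as double), cast(attr2 as double), cast(attr3 as double))")) \
          .drop("attr1", "attr2", "attr3") \
          .filter("weight <> 0.0") \
          .withColumnRenamed("Id1", "src") \
          .withColumnRenamed("Id2", "dst")

    # GraphFrame expects an "id" column for vertices and "src"/"dst" columns for edges
    vertices = edges.selectExpr("src as id").union(edges.selectExpr("dst as id")).distinct()

    g = GraphFrame(vertices, edges)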
    

    As a test, the in-degree of each vertex can be checked:

    g.inDegrees.show()
    

    This prints:

    +---+--------+
    | id|inDegree|
    +---+--------+
    |  3|       5|
    |  2|       2|
    +---+--------+
    

    This result is consistent with the given data: vertex 2 has two incoming edges from the first row of the example data (attr1 and attr3 are nonzero, attr2 is zero), and vertex 3 has three incoming edges from the second row plus two from the third row, five in total.
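
The expected in-degrees can also be cross-checked without a Spark session. The following is only a minimal sketch in plain Python that mirrors the stack-and-filter logic on the three example rows; the helper name `stack_edges` is hypothetical and not part of Spark or GraphFrames, and the rows are typed in by hand from the question's table.

```python
from collections import Counter

# Example rows from the question: (Id1, Id2, attr1, attr2, attr3)
rows = [
    (1, 2, 1.0, 0.0, 0.5),
    (1, 3, 1.0, 1.0, 0.33),
    (2, 3, 0.0, 0.6, 0.7),
]


def stack_edges(rows):
    """Yield one (src, dst, weight) edge per nonzero attribute value.

    Hypothetical helper: it imitates stacking the attribute columns
    and then filtering out zero weights.
    """
    for src, dst, *attrs in rows:
        for weight in attrs:
            if weight != 0.0:
                yield (src, dst, weight)


edges = list(stack_edges(rows))

# Count incoming edges per destination vertex
in_degrees = Counter(dst for _, dst, _ in edges)
print(dict(in_degrees))  # prints {2: 2, 3: 5}
```

Vertex 1 does not appear in the result because it has no incoming edges, which matches the `inDegrees` output above, where only vertices 2 and 3 are listed.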