I have a dataframe like this:
> | Id1 | Id2 | attr1 | attr2 | attr3 |
> |----:|----:|------:|------:|------:|
> | 1 | 2 | 1 | 0 | .5 |
> | 1 | 3 | 1 | 1 | .33 |
> | 2 | 3 | 0 | .6 | .7 |
I want to create an edge for each nonzero attribute, using the value in the table as the edge weight. How would I go about doing that? I can't seem to find an easy way, so right now I'm just using a for loop and iterating through each row, but that seems inefficient. Thanks!
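The three attribute columns can be unpivoted with Spark SQL's stack function, which turns each input row into three rows that share a single weight column. After filtering out the rows whose weight is zero, the result can be used as the edge DataFrame of a GraphFrame, so the graph has no zero-weight edges: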
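from pyspark.sql import functions as F
from graphframes import GraphFrame

df = ...

# Unpivot attr1..attr3 into one "weight" column (three rows per input row), then drop zero weights
edges = df.withColumn("weight", F.expr("stack(3, cast(attr1 as double), cast(attr2 as double), cast(attr3 as double))")) \
    .drop("attr1", "attr2", "attr3") \
    .filter("weight <> 0.0") \
    .withColumnRenamed("Id1", "src") \
    .withColumnRenamed("Id2", "dst")

# Every id that occurs as a source or a destination becomes a vertex
vertices = edges.selectExpr("src as id").union(edges.selectExpr("dst as id")).distinct()

g = GraphFrame(vertices, edges)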
As a test, the in-degree of each vertex can be checked:
g.inDegrees.show()
This prints:
+---+--------+
| id|inDegree|
+---+--------+
| 3| 5|
| 2| 2|
+---+--------+
This result is consistent with the given data: vertex 2 has two incoming edges from the first row of the example data (attr2 is zero there), and vertex 3 has three incoming edges from the second row plus two from the third row (where attr1 is zero), five in total.
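The expected in-degrees can also be cross-checked without Spark. The following is only a minimal plain-Python sketch of the same unpivot-and-filter logic, with the rows copied by hand from the example table; the names rows, edges and in_degrees are illustrative and not part of GraphFrames:

```python
from collections import Counter

# Rows copied from the example table: (Id1, Id2, attr1, attr2, attr3)
rows = [
    (1, 2, 1.0, 0.0, 0.5),
    (1, 3, 1.0, 1.0, 0.33),
    (2, 3, 0.0, 0.6, 0.7),
]

# One candidate edge per attribute column; zero weights are dropped,
# mirroring stack(...) followed by filter("weight <> 0.0")
edges = [
    (src, dst, weight)
    for src, dst, *attrs in rows
    for weight in attrs
    if weight != 0.0
]

# Count incoming edges per destination vertex
in_degrees = Counter(dst for _, dst, _ in edges)
print(dict(in_degrees))  # {2: 2, 3: 5}
```

This is only meant as a sanity check on small data; for a real dataset the DataFrame version is the one to use, since it runs distributed and avoids row-by-row iteration on the driver.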