apache-spark spark-java

How to create an edge list from an ArrayType column of node paths in Spark?


I have a Spark Dataset with a single ArrayType column that holds the path from one user to another through their mutual friends:

path
["Amy","John","Wally"]
["Beth","Sally","Tim","Jacob"]

What I would like to end up with is a table that explicitly lists the edges in the paths (i.e. an edge list):

src dest
"Amy" "John"
"John" "Amy"
"John" "Wally"
"Beth" "Sally"
"Sally" "Tim"
"Tim" "Sally"
"Tim" "Jacob"
"Jacob" "Tim"

How can I transform the former table into the latter?


Solution

  • You can turn each path into a list of edges (pairs) by applying arrays_zip to two slices of the array: one without the last element and one without the first element. This produces an array of structs. Next, explode that array so that each struct lands in its own row, and split the struct column into two separate columns (for example with withColumn). Finally, add the reversed edges and remove duplicates with distinct.

    I assume that you are working with a DataFrame and using the Spark SQL functions.
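The steps above could be sketched in Java roughly as follows. This is a minimal sketch, not a tested production pipeline: it assumes Spark 2.4 or later (where slice and arrays_zip are available), a column named path of type array<string>, and a local SparkSession. The class name EdgeListSketch, the helper toEdgeList, and the intermediate column names src_nodes, dst_nodes, and edge are made up for illustration. It also relies on arrays_zip naming the struct fields after its input columns.

```java
import static org.apache.spark.sql.functions.arrays_zip;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EdgeListSketch {

    // Hypothetical helper: expects a column "path" of type array<string>.
    public static Dataset<Row> toEdgeList(Dataset<Row> paths) {
        Dataset<Row> forward = paths
            // Paths with fewer than two nodes contain no edge; dropping them
            // also keeps the slice length below from becoming negative.
            .filter("size(path) >= 2")
            // Every node except the last one is the source of an edge.
            .withColumn("src_nodes", expr("slice(path, 1, size(path) - 1)"))
            // Every node except the first one is the destination of an edge.
            .withColumn("dst_nodes", expr("slice(path, 2, size(path) - 1)"))
            // Zip the two slices into an array of structs, one row per struct.
            .withColumn("edge", explode(arrays_zip(col("src_nodes"), col("dst_nodes"))))
            // Assumption: struct fields are named after the zipped columns.
            .select(col("edge.src_nodes").as("src"), col("edge.dst_nodes").as("dest"));

        // Add the reversed edges, then drop duplicates.
        Dataset<Row> backward = forward.select(col("dest").as("src"), col("src").as("dest"));
        return forward.union(backward).distinct();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("edge-list-sketch")
            .master("local[*]")
            .getOrCreate();

        // Sample data from the question, built with an inline SQL table.
        Dataset<Row> paths = spark.sql(
            "SELECT * FROM VALUES "
                + "(array('Amy', 'John', 'Wally')), "
                + "(array('Beth', 'Sally', 'Tim', 'Jacob')) AS t(path)");

        toEdgeList(paths).orderBy("src", "dest").show();
        spark.stop();
    }
}
```

The slices are written with expr because the slice length depends on size(path); passing a Column as the length through the typed slice function is only possible in newer Spark versions. Note that this sketch emits both directions for every edge in a path.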