python apache-spark pyspark apache-spark-sql graphframes

How to create an edge list from a Spark data frame in PySpark?


I am using GraphFrames in PySpark for graph analytics, and I am wondering what the best way is to create the edge list data frame from a vertices data frame.

For example, below is my vertices data frame. I have a list of ids and they belong to different groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+

My objective is to create an edge list data frame that links ids appearing in the same group. Note that one id can appear in multiple groups (e.g. id a above is in groups 1 and 3). Below is the edge list data frame that I'd like to get:

+---+-----+-----+
|src|dst  |group|
+---+-----+-----+
|a  |c    |1    |
|a  |f    |1    |
|c  |f    |1    |
|b  |d    |2    |
|a  |e    |3    |
+---+-----+-----+
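
To pin the rule down, here is a minimal plain-Python sketch of what the table above expresses: one edge for every unordered pair of distinct ids within a group, with `src` sorting before `dst`. This is not Spark code, and the helper name `expected_edges` is hypothetical; it only serves as a small reference against which a Spark result can be compared.

```python
from collections import defaultdict
from itertools import combinations


def expected_edges(vertices):
    """Return (src, dst, group) for every pair of distinct ids sharing a group."""
    members = defaultdict(set)
    for vid, group in vertices:
        members[group].add(vid)  # a set ignores repeated (id, group) rows

    edges = []
    for group in sorted(members):
        # combinations() over the sorted ids yields each pair once, src < dst
        for src, dst in combinations(sorted(members[group]), 2):
            edges.append((src, dst, group))
    return edges


vertices = [("a", 1), ("b", 2), ("c", 1), ("d", 2),
            ("e", 3), ("a", 3), ("f", 1)]

for edge in expected_edges(vertices):
    print(edge)
```

On the sample data this prints the same five edges as the table, in the same order.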

Thanks in advance!


Solution

  • Edit 1

    I'm not sure this is the best way to solve it, but here is a workaround:

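    import pyspark.sql.functions as f
    from pyspark.sql import Window
    
    # Attach to every row the set of ids that share its group.
    df = df.withColumn('match', f.collect_set('id').over(Window.partitionBy('group')))
    
    # One row per (id, group member) pair; this includes self-pairs and both directions.
    df = df.select(f.col('id').alias('src'),
                   f.explode('match').alias('dst'),
                   f.col('group'))
    
    # Order-independent key, so that (a, c) and (c, a) count as the same edge.
    df = df.withColumn('duplicate_edges', f.array_sort(f.array('src', 'dst')))
    df = (df
          .where(f.col('src') != f.col('dst'))          # drop self-loops
          .drop_duplicates(subset=['duplicate_edges'])  # keep one row per unordered pair
          .drop('duplicate_edges'))
    
    df.sort('group', 'src', 'dst').show()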
    

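    Output (which direction of each pair survives drop_duplicates is arbitrary, hence e, a rather than a, e for group 3):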

    +---+---+-----+
    |src|dst|group|
    +---+---+-----+
    |  a|  c|    1|
    |  a|  f|    1|
    |  c|  f|    1|
    |  b|  d|    2|
    |  e|  a|    3|
    +---+---+-----+
    

    Original answer

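    Try this (note that it yields only a single src/dst pair per group, and that first and last depend on row order, which Spark does not guarantee after a shuffle):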

    import pyspark.sql.functions as f
    
    df = (df
          .groupby('group')
          .agg(f.first('id').alias('src'),
               f.last('id').alias('dst')))
    
    df.show()
    

    Output:

    +-----+---+---+
    |group|src|dst|
    +-----+---+---+
    |    1|  a|  c|
    |    3|  e|  a|
    |    2|  b|  d|
    +-----+---+---+