python-3.xmatrixpysparkgroup-byspark-window-function

transition matrix from pyspark dataframe


I have two columns (such as):

from to
1 2
1 3
2 4
4 2
4 2
4 3
3 3

And I want to create a transition matrix (where sum of rows in a columns add up to 1):

         1.      2.     3.   4.

1.      0        0      0    0
2.      0.5*     0      0    2/3
3.      0.5      0.5    1    1/3
4.      0        0.5    0    0

where 1 -> 2 would be : (the number of times 1 (in 'from') is next to 2 (in 'to)) / (total times 1 points to any value).


Solution

  • You can create this kind of transition matrix using a window and pivot.

    First some dummy data:

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    x = np.random.randint(1,5,100)
    y = np.random.randint(1,5,100)
    
    df = spark.createDataFrame(pd.DataFrame({'from': x, 'to': y}))
    df.show()
    
    +----+---+
    |from| to|
    +----+---+
    |   3|  3|
    |   4|  2|
    |   1|  2|
    ...
    

    To create a pct column, first group the data by unique combinations of from/to and get the counts. With that aggregated dataframe, create a new column, pct that uses the Window to find the total number of records for each from group which is used as the denominator.

    Lastly, pivot the table to make the to values columns and the pct data the values of the matrix.

    from pyspark.sql import functions as F, Window
    
    w = Window().partitionBy('from')
    grp = df.groupBy('from', 'to').count().withColumn('pct', F.col('count') / F.sum('count').over(w))
    
    res = grp.groupBy('from').pivot('to').agg(F.round(F.first('pct'), 2))
    res.show()
    
    +----+----+----+----+----+
    |from|   1|   2|   3|   4|
    +----+----+----+----+----+
    |   1| 0.2| 0.2|0.25|0.35|
    |   2|0.27|0.31|0.19|0.23|
    |   3|0.46|0.17|0.21|0.17|
    |   4|0.13|0.13| 0.5|0.23|
    +----+----+----+----+----+