I have a DataFrame after applying crosstab in PySpark; example below:
id | A | B | C |
---|---|---|---|
cssdsd | 0 | 1 | 0 |
sdsdsd | 1 | 1 | 1 |
ssdssd | 1 | 0 | 0 |
xxxxxx | 0 | 0 | 0 |
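For reference, a minimal sketch that rebuilds this frame from literals so the problem is reproducible (the spark session and the hard-coded rows are assumptions taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hard-coded copy of the crosstab result shown above
df = spark.createDataFrame(
    [("cssdsd", 0, 1, 0),
     ("sdsdsd", 1, 1, 1),
     ("ssdssd", 1, 0, 0),
     ("xxxxxx", 0, 0, 0)],
    ["id", "A", "B", "C"])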
Instead of 0s and 1s, I want the row percentages. In pandas I can easily get that with the crosstab function:
pd.crosstab(df.index,df.list, normalize='index')
How can I get this in PySpark?
Start from the crosstab result, which is df in the code below: get the sum of all columns except id, then divide each column by that sum:
from pyspark.sql import functions as F

# All value columns, i.e. everything except the id column
cols = [c for c in df.columns if c != 'id']

# Per-row sum of the value columns, then each column divided by it;
# coalesce to 0 because 0/0 yields null in Spark.
out = (df.withColumn("SumCols", F.expr('+'.join(cols)))
         .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols"), 2),
                                    F.lit(0)).alias(c)
                         for c in cols]))
out.show()
+------+----+----+----+
| id| A| B| C|
+------+----+----+----+
|cssdsd| 0.0| 1.0| 0.0|
|sdsdsd|0.33|0.33|0.33|
|ssdssd| 1.0| 0.0| 0.0|
|xxxxxx| 0.0| 0.0| 0.0|
+------+----+----+----+
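One caveat: '+'.join(cols) is parsed as a SQL string, so it can break if a column name is not a plain identifier. As a sketch, the same row sum can be built directly from Column objects with functools.reduce (reusing df and cols from above), which avoids the parsing step:

from functools import reduce
from pyspark.sql import functions as F

# Add the Column objects themselves instead of parsing a SQL string
row_sum = reduce(lambda a, b: a + b, [F.col(c) for c in cols])
out = df.select(
    "id",
    *[F.coalesce(F.round(F.col(c) / row_sum, 2), F.lit(0)).alias(c)
      for c in cols])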
Note that you can multiply by 100 inside the select loop if you want percentages rather than fractions.
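For example, a percentage variant would look like this (same df and cols as above; only the * 100 is new):

pct = (df.withColumn("SumCols", F.expr('+'.join(cols)))
         .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols") * 100, 2),
                                    F.lit(0)).alias(c)
                         for c in cols]))
# The sdsdsd row now shows 33.33 under A, B and C.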