I have a DataFrame after applying crosstab in PySpark; example below:
id | A | B | C |
---|---|---|---|
cssdsd | 0 | 1 | 0 |
sdsdsd | 1 | 1 | 1 |
ssdssd | 1 | 0 | 0 |
xxxxxx | 0 | 0 | 0 |
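For reference, a minimal sketch that rebuilds this frame from literals so the problem is reproducible (the spark session and the hard-coded rows are assumptions taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hard-coded copy of the crosstab result shown above
df = spark.createDataFrame(
    [("cssdsd", 0, 1, 0),
     ("sdsdsd", 1, 1, 1),
     ("ssdssd", 1, 0, 0),
     ("xxxxxx", 0, 0, 0)],
    ["id", "A", "B", "C"])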
Instead of 0s and 1s, I want the row percentages. In pandas I can easily get that with the crosstab function:
pd.crosstab(df.index,df.list, normalize='index')
How can I get this in PySpark?
Start from the crosstab result, which is df in the code below: get the sum of all columns except id, then divide each column by that sum:
from pyspark.sql import functions as F

# All value columns, i.e. everything except the id column
cols = [c for c in df.columns if c != 'id']

# Per-row sum of the value columns, then each column divided by it;
# coalesce to 0 because 0/0 yields null in Spark.
out = (df.withColumn("SumCols", F.expr('+'.join(cols)))
         .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols"), 2),
                                    F.lit(0)).alias(c)
                         for c in cols]))
out.show()
+------+----+----+----+
| id| A| B| C|
+------+----+----+----+
|cssdsd| 0.0| 1.0| 0.0|
|sdsdsd|0.33|0.33|0.33|
|ssdssd| 1.0| 0.0| 0.0|
|xxxxxx| 0.0| 0.0| 0.0|
+------+----+----+----+
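One caveat: '+'.join(cols) is parsed as a SQL string, so it can break if a column name is not a plain identifier. As a sketch, the same row sum can be built directly from Column objects with functools.reduce (reusing df and cols from above), which avoids the parsing step:

from functools import reduce
from pyspark.sql import functions as F

# Add the Column objects themselves instead of parsing a SQL string
row_sum = reduce(lambda a, b: a + b, [F.col(c) for c in cols])
out = df.select(
    "id",
    *[F.coalesce(F.round(F.col(c) / row_sum, 2), F.lit(0)).alias(c)
      for c in cols])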
Note that you can multiply by 100 inside the select loop if you want percentages rather than fractions.
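For example, a percentage variant would look like this (same df and cols as above; only the * 100 is new):

pct = (df.withColumn("SumCols", F.expr('+'.join(cols)))
         .select("id", *[F.coalesce(F.round(F.col(c) / F.col("SumCols") * 100, 2),
                                    F.lit(0)).alias(c)
                         for c in cols]))
# The sdsdsd row now shows 33.33 under A, B and C.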