Tags: pyspark, group-by, contingency

Create a contingency table with multiple column names and row values


I have a data_frame like this:

Id  A  B  C
1   0  0  1
2   1  0  0
3   1  0  1
4   0  0  1
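
For reference, the sample data can be reproduced with something like this (a minimal sketch, assuming an existing SparkSession):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data_frame = spark.createDataFrame(
        [(1, 0, 0, 1), (2, 1, 0, 0), (3, 1, 0, 1), (4, 0, 0, 1)],
        ['Id', 'A', 'B', 'C'],
    )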

I would like to create a contingency table like the one below so that I can calculate correlations:

Label  T/F  count
A      0    2
A      1    2
B      0    4
B      1    0
C      0    1
C      1    3

I was able to get the count for one column using the code below, but I am not sure how to do this for multiple columns.

df = data_frame.groupBy('A').count()
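
With the sample data above, that gives counts for column A only, roughly:

+---+-----+
|  A|count|
+---+-----+
|  0|    2|
|  1|    2|
+---+-----+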

Any help is much appreciated. Thanks!


Solution

  • Gnarly one. See the step-by-step approach below.

    # melt all columns except Id into (Label, T/F) pairs
    from pyspark.sql import functions as F

    new = (df.withColumn('tab', F.array(*[F.struct(F.lit(x).alias('Label'),
                                                   F.col(x).alias('T/F'))
                                          for x in df.columns if x != 'Id']))
             .selectExpr('Id', 'inline(tab)'))
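
    After the melt, new has one row per (Id, column) pair; with the sample data it should look roughly like this (first rows shown, order not guaranteed):

    +---+-----+---+
    | Id|Label|T/F|
    +---+-----+---+
    |  1|    A|  0|
    |  1|    B|  0|
    |  1|    C|  1|
    |  2|    A|  1|
    |  2|    B|  0|
    |  2|    C|  0|
    ...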
    
    
    new = (new.groupBy('Label').agg(F.collect_list('T/F').alias('T/F'))  # collect the list of T/F values per Label
           .withColumn('m', F.array(*[F.lit(x).cast('integer') for x in [0, 1]]))  # add a column holding the array [0, 1]
           .withColumn('m1', F.array_except(F.col('m'), F.col('T/F')))  # elements of [0, 1] missing from T/F
           .withColumn('T/F', F.explode(F.flatten(
               F.when(F.size(F.col('m1')) > 0,
                      F.concat(F.array(F.col('T/F'), F.array(F.lit(None)))))
                .otherwise(F.array('T/F')))))  # back to one element per row, appending a None where a value was missing from T/F
           .groupBy('Label', 'T/F', 'm1').agg(F.count('T/F').alias('count'))  # count() skips the None rows, yielding 0
           .withColumn('T/F', F.coalesce('T/F', F.col('m1')[0]))  # where T/F is None, fill in the missing element
           .drop('m1'))  # drop the helper column
    
    +-----+---+-----+
    |Label|T/F|count|
    +-----+---+-----+
    |    B|  0|    4|
    |    B|  1|    0|
    |    C|  1|    3|
    |    C|  0|    1|
    |    A|  0|    2|
    |    A|  1|    2|
    +-----+---+-----+
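
    If the array gymnastics feel heavy, here is a simpler alternative sketch (not the approach above: it melts with the SQL stack() function and restores the zero-count pairs with a cross join; assumes the same df):

    value_cols = [c for c in df.columns if c != 'Id']

    # melt with stack(): one row per (Id, column) pair
    melted = df.selectExpr(
        'Id',
        'stack({}, {}) as (Label, `T/F`)'.format(
            len(value_cols),
            ', '.join("'{0}', {0}".format(c) for c in value_cols)))

    # count per (Label, T/F), then left-join against every possible pair
    # so that combinations with no rows keep a count of 0
    pairs = (melted.select('Label').distinct()
                   .crossJoin(melted.select('T/F').distinct()))
    result = (pairs.join(melted.groupBy('Label', 'T/F').count(),
                         ['Label', 'T/F'], 'left')
                   .fillna(0, subset=['count']))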