apache-sparkpysparkapache-spark-sqldatabricks

PIVOT How to convert Dataframe api to spark SQL


I am getting different output while converting dataframe api to spark SQL. Point me where i missed the sql logic.

I have table called demo, have two columns value and marker. the value column is seq number start from 1 to 100. marker value is A [1 to 10] , B [11-20] and so on.

sample tables

+-----+------+
|value|marker|
+-----+------+
|    1|     A|
|    2|     A|
|    3|     A|
|    4|     A|
|    5|     A|
|    6|     A|
|    7|     A|
|    8|     A|
|    9|     A|
|   10|     A|
|   11|     B|
|   12|     B| 

DataFrame API and output.

df_2.groupBy('marker').pivot('marker').agg(F.count('*')).fillna(0).show()

output:

+------+---+---+---+---+---+---+---+---+---+---+
|marker|  A|  B|  C|  D|  E|  F|  G|  H|  I|  J|
+------+---+---+---+---+---+---+---+---+---+---+
|     B|  0| 10|  0|  0|  0|  0|  0|  0|  0|  0|
|     A| 10|  0|  0|  0|  0|  0|  0|  0|  0|  0|
|     C|  0|  0| 10|  0|  0|  0|  0|  0|  0|  0|
|     D|  0|  0|  0| 10|  0|  0|  0|  0|  0|  0|
|     E|  0|  0|  0|  0| 10|  0|  0|  0|  0|  0|
|     F|  0|  0|  0|  0|  0| 10|  0|  0|  0|  0|
|     G|  0|  0|  0|  0|  0|  0| 10|  0|  0|  0|
|     H|  0|  0|  0|  0|  0|  0|  0| 10|  0|  0|
|     I|  0|  0|  0|  0|  0|  0|  0|  0| 10|  0|
|     J|  0|  0|  0|  0|  0|  0|  0|  0|  0|  9|
+------+---+---+---+---+---+---+---+---+---+---+

I tried to convert API to SQL based using PIVOT like below.

spark.sql("""
    
    select
        *
    from 
        ( 
            select marker from demo 
        )
    PIVOT 
        (
            count(marker)
            for marker in ('A','B','C','D','E','F','G','H','I','J')
        )
            
        

""").show()

and output is

+---+---+---+---+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|  G|  H|  I|  J|
+---+---+---+---+---+---+---+---+---+---+
| 10| 10| 10| 10| 10| 10| 10| 10| 10|  9|
+---+---+---+---+---+---+---+---+---+---+ 

How can i get Dataframe API output using Spark SQL.

Note - We can get expected output using sum with case when clause for individual marker, but i am expecting to use pivot.


Solution

  • you are almost there. you'd need to create a column with the same values but a new name.

    here's an example

    select * from (
        select col, col as col1 from data_tbl
    )
    pivot (
        count(col) for col in ('A', 'B', 'C', 'D')
    )
    order by col1
    

    this'd result in

    +----+----+----+----+----+
    |col1|   A|   B|   C|   D|
    +----+----+----+----+----+
    |   A|  10|null|null|null|
    |   B|null|  10|null|null|
    |   C|null|null|  10|null|
    |   D|null|null|null|  10|
    +----+----+----+----+----+