apache-spark, pyspark

Explode a null column in PySpark that can be of type struct of struct


I have a dataframe and I am trying to do the following transformation:

from pyspark.sql import functions as F

exploded_prod_df = exploded_hits_product_df.withColumn("product_cd", F.explode(F.col("products.customdimensions")))

My issue is that products is currently NULL, which may not be the case in the future. I tried many things, but they didn't work. For example, I tried the code below:

from pyspark.sql import types as T

exploded_prod_df = (
    exploded_hits_product_df
    .withColumn('products_customdimensions',
                F.when(F.col('products').isNull(), F.lit(None).cast(T.StructType()))
                 .otherwise('products.customdimensions'))
    .withColumn("product_cd", F.explode_outer(F.col("products_customdimensions")))
)

but this gives an error. Please help. This is Google Analytics data. The error is:

Cannot resolve "CASE WHEN (products IS NULL) THEN NULL ELSE products.customdimensions END" due to data type mismatch: Input to casewhen should all be the same type, but it's ["STRUCT<>", "STRING"].;


Solution

  • If you could provide the output of print(product_df.dtypes), I would be happy to add more context to this answer.

    These are my thoughts so far:

    explode and explode_outer only work on Map and Array types.

    If products were NULL and products.customdimensions were one of those types, you would be unlikely to get errors from these explode functions.

    Since you are getting errors, I think the issue is that products.customdimensions is of type Struct.
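
    As a quick, standalone illustration (made-up data, not your schema), explode accepts an Array column but rejects a Struct:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # explode works on an Array column...
    arr_df = spark.createDataFrame([(["red", "apple"],)], ["arr"])
    arr_df.select(F.explode("arr")).show()

    # ...but the same call on a Struct column raises an AnalysisException,
    # because explode is only defined for Array and Map input.
    struct_df = spark.createDataFrame([(("red", "apple"),)], ["struct_col"])
    # struct_df.select(F.explode("struct_col"))  # AnalysisException: input should be array or map type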


    Option 1. You know the fields in products.customdimensions.

    If so, you can access each field you need using .withColumn.

    df = product_df.withColumn("product_type", F.col("products.customdimensions.type"))\
            .withColumn("product_color", F.col("products.customdimensions.color"))
    
    df.select('products','products.customdimensions', 'product_type', 'product_color').show()
    
    +--------------+----------------+------------+-------------+
    |      products|customdimensions|product_type|product_color|
    +--------------+----------------+------------+-------------+
    |{{red, apple}}|    {red, apple}|       apple|          red|
    |          null|            null|        null|         null|
    +--------------+----------------+------------+-------------+

    If you instead need this information in a vertical format, you could create an Array with the fields you need and then explode it.

    df = product_df.withColumn('products_customdimensions',
                        F.when(F.col('products').isNull(), F.array(F.lit(None)))\
                        .otherwise(F.array("products.customdimensions.type",
                                           "products.customdimensions.color"))\
                       )
    
    df = df.withColumn("product_cd", F.explode('products_customdimensions'))
    
    df.select('products', 'products.customdimensions', 'product_cd').show()
    
    +--------------+----------------+----------+
    |      products|customdimensions|product_cd|
    +--------------+----------------+----------+
    |{{red, apple}}|    {red, apple}|     apple|
    |{{red, apple}}|    {red, apple}|       red|
    |          null|            null|      null|
    +--------------+----------------+----------+

    Option 2. You do not know the fields in products.customdimensions.

    If this is the case, you may want to convert the Struct to a Map. Once it is a Map, you can pass it to explode or explode_outer.

    Here is an example of converting a Struct to a Map: Convert struct to map in Spark SQL.
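
    For instance, here is a minimal sketch of that idea (assuming the dataframe is named product_df as above, and casting every value to string, since create_map needs all values to share one type):

    from pyspark.sql import functions as F

    # Discover the Struct's fields from the schema instead of hard-coding them.
    cd_type = product_df.schema["products"].dataType["customdimensions"].dataType

    # Build alternating key/value arguments for create_map.
    pairs = []
    for field in cd_type.fields:
        pairs.append(F.lit(field.name))
        pairs.append(F.col(f"products.customdimensions.{field.name}").cast("string"))

    df = product_df.withColumn("cd_map", F.create_map(*pairs))

    # explode_outer on a Map yields key and value columns, and keeps rows
    # even when the map values are NULL.
    df = df.select("*", F.explode_outer("cd_map").alias("cd_key", "cd_value"))
    df.select("products", "cd_key", "cd_value").show()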