python, pyspark, sas, switch-statement, code-conversion

How to convert 'if then do' from SAS to PySpark


I have a DataFrame like this:

Zip      City
'River'  'London'
' '      'Rome'
'River'  'Berlin'
'River'  'Madrid'
' '      'Munich'
' '      'Paris'
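
For reference, the sample DataFrame can be built roughly like this (I'm assuming the values are plain strings and that a single space stands in for a missing Zip):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data mirroring the table above; a single space marks a missing Zip
df = spark.createDataFrame(
    [("River", "London"), (" ", "Rome"), ("River", "Berlin"),
     ("River", "Madrid"), (" ", "Munich"), (" ", "Paris")],
    ["Zip", "City"],
)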

Below is the SAS code I'm trying to convert to PySpark:

if Zip = '' 
    then do Flud = City;
            Flag = 'City';
         end;
    else do Flud = Zip; 
            Flag = 'Zip';
         end;

So I expect output like this:

Zip      City      Flud      Flag
'River'  'London'  'River'   'Zip'
' '      'Rome'    'Rome'    'City'
'River'  'Berlin'  'River'   'Zip'
'River'  'Madrid'  'River'   'Zip'
' '      'Munich'  'Munich'  'City'
' '      'Paris'   'Paris'   'City'

I converted it to PySpark, but it's giving me an error.

My code:

output=df.withColumn('Flag',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
         .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
         .otherwise(col('Zip'))))

PySpark gives me this error:

AnalysisException: Column 'Flag' does not exist.

So I tried creating the variable first:

df= df.withColumn("Flag", lit(''))

and wrote this new code:

output=df.withColumn('Flud',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
         .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
         .otherwise(col('Zip'))))

and now I get this error:

Column 'Flud' does not exist.

So I then tried creating the second variable first as well:

df= df.withColumn("Flud", lit(''))

and wrote this new code:

output=df.withColumn('Flud',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
         .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
         .otherwise(col('Zip'))))

and now I get this error:

AnalysisException: cannot resolve 'CASE WHEN ((Flud.Zip IS NOT NULL) AND (Flag = 'Zip')) THEN (Flud = Flud.Zip) ELSE Flud.Zip END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN boolean ELSE string END;

Can someone help me?


Solution

  • The when function in PySpark has the following structure:

    when(EXPRESSION, VALUE IF TRUE).otherwise(VALUE IF FALSE). In your code, when the expression is true you are producing another boolean (a comparison between two columns) rather than the value you want to assign, and the THEN and ELSE branches must all be of the same data type. Using your example output, you can do the following:

    import pyspark.sql.functions as F
    
    zip_is_null = F.col("Zip").isNull()
    
    output = (
        df.withColumn(
            "Flag",
            F.when(zip_is_null, F.lit("City"))
             .otherwise(F.lit("Zip"))
         ).withColumn(
            "Flud",
            F.when(zip_is_null, F.col("City"))
             .otherwise(F.col("Zip"))
         )
    )
    

    In your case, you need two CASE WHEN expressions, since they produce two different columns, Flud and Flag. This should do the trick.
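
    One caveat: your sample data shows ' ' (a blank string) rather than a real null in Zip. If that's what you actually have, isNull() won't match those rows, so you may need to trim and compare against the empty string instead. A possible variant, assuming blanks instead of (or in addition to) nulls:

    import pyspark.sql.functions as F  # same import as above

    # treat Zip as "missing" when it is null or only whitespace
    zip_is_blank = F.col("Zip").isNull() | (F.trim(F.col("Zip")) == "")

    output = (
        df.withColumn(
            "Flag",
            F.when(zip_is_blank, F.lit("City")).otherwise(F.lit("Zip"))
        ).withColumn(
            "Flud",
            F.when(zip_is_blank, F.col("City")).otherwise(F.col("Zip"))
        )
    )

    output.show()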