I have a DataFrame like this:
Zip | City |
---|---|
'River' | 'London' |
' ' | 'Rome' |
'River' | 'Berlin' |
'River' | 'Madrid' |
' ' | 'Munich' |
' ' | 'Paris' |
Below is the SAS code that I'm trying to convert to PySpark:
if Zip = '' then do;
    Flud = City;
    Flag = 'City';
end;
else do;
    Flud = Zip;
    Flag = 'Zip';
end;
So I expected output like this:
Zip | City | Flud | Flag |
---|---|---|---|
'River' | 'London' | 'River' | 'Zip' |
' ' | 'Rome' | 'Rome' | 'City' |
'River' | 'Berlin' | 'River' | 'Zip' |
'River' | 'Madrid' | 'River' | 'Zip' |
' ' | 'Munich' | 'Munich' | 'City' |
' ' | 'Paris' | 'Paris' | 'City' |
I converted it to PySpark, but it's giving me an error. My code:
output = df.withColumn('Flag', when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud') == col('City'))
                       .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud') == col('Zip'))
                       .otherwise(col('Zip'))))
PySpark gives me this error:
AnalysisException: Column 'Flag' does not exist.
So I tried to create the variable first:
df = df.withColumn("Flag", lit(''))
and wrote this new code:
output = df.withColumn('Flud', when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud') == col('City'))
                       .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud') == col('Zip'))
                       .otherwise(col('Zip'))))
and now I get this error:
Column 'Flud' does not exist.
So I tried to create the second variable first:
df = df.withColumn("Flud", lit(''))
and wrote this new code:
output = df.withColumn('Flud', when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud') == col('City'))
                       .otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud') == col('Zip'))
                       .otherwise(col('Zip'))))
and now I get this error:
AnalysisException: cannot resolve 'CASE WHEN ((Flud.Zip IS NOT NULL) AND (Flag = 'Zip')) THEN (Flud = Flud.Zip) ELSE Flud.Zip END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN boolean ELSE string END;
Can someone help me?
The when function in PySpark has the following structure: when(EXPRESSION, VALUE IF TRUE).otherwise(VALUE IF FALSE). In your code, the value you return when the expression is true is itself a boolean (col('Flud') == col('City') compares two columns) rather than the value you want to assign, and all branches of a CASE WHEN must be of the same (or a coercible) data type, which is exactly what the last error says. The first two errors happened because your conditions reference Flag and Flud, columns that don't exist in df yet. Using your example output, you can do the following:
import pyspark.sql.functions as F

zip_is_null = F.col("Zip").isNull()

output = (
    df.withColumn(
        "Flag",
        F.when(zip_is_null, F.lit("City"))
         .otherwise(F.lit("Zip"))
    ).withColumn(
        "Flud",
        F.when(zip_is_null, F.col("City"))
         .otherwise(F.col("Zip"))
    )
)
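To sanity-check it end to end, here's a small sketch that rebuilds your sample data and applies the same two withColumn calls. It assumes the blank Zip values come through as nulls and that spark is your existing SparkSession:

```python
import pyspark.sql.functions as F

# Rebuild the sample data; None stands in for the blank Zip values (an assumption),
# and `spark` is an existing SparkSession.
df = spark.createDataFrame(
    [("River", "London"), (None, "Rome"), ("River", "Berlin"),
     ("River", "Madrid"), (None, "Munich"), (None, "Paris")],
    ["Zip", "City"],
)

zip_is_null = F.col("Zip").isNull()

output = (
    df.withColumn("Flag", F.when(zip_is_null, F.lit("City")).otherwise(F.lit("Zip")))
      .withColumn("Flud", F.when(zip_is_null, F.col("City")).otherwise(F.col("Zip")))
)

output.show()
```

The result matches your expected table: rows where Zip is null get Flag = 'City' and Flud taken from City, and the rest get Flag = 'Zip' and Flud = Zip.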
In your case, you will need two CASE WHENs, since they will produce two different columns, Flud and Flag. This should do the trick.
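One caveat: your sample table shows Zip as ' ' (a blank string) rather than a real null, and the SAS comparison Zip = '' treats blank strings as empty. If your data actually contains blanks instead of nulls, isNull() won't catch those rows; in that case a trim-based condition is safer, something like this sketch:

```python
import pyspark.sql.functions as F

# Treat nulls, empty strings, and whitespace-only strings as a "missing" Zip.
zip_is_blank = F.col("Zip").isNull() | (F.trim(F.col("Zip")) == "")

output = (
    df.withColumn("Flag", F.when(zip_is_blank, F.lit("City")).otherwise(F.lit("Zip")))
      .withColumn("Flud", F.when(zip_is_blank, F.col("City")).otherwise(F.col("Zip")))
)
```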