python · apache-spark · pyspark · apache-spark-sql · databricks

PySpark new column with when/otherwise results in "should be a Column" error


I have a DataFrame in PySpark, and I would like to add a new column based on the value in another column. I know this is fairly common, and I've searched and tried a bunch of different ways, but always end up with the error message TypeError: condition should be a Column.

What I want to do is basically this: if the value in one column is not null, I'll run a regex on that column; otherwise I'll run a different regex on another column.

My original query looked like this:

import pyspark.sql.functions as F

df.withColumn("new_column", F.when(F.col("column_a").isNotNull,
    F.regexp_extract('column_a', 'myregex', 1))
    .otherwise(F.regexp_extract('column_b', 'myotherregex', 1)))

That resulted in the error mentioned above. As I simplified the expression to isolate the problem, I still couldn't get it to work; even this stripped-down example fails with the same error:

df.withColumn("new_column", F.when(F.col("column_a").isNotNull, F.lit("A"))
    .otherwise(F.lit("B")))

I've also tried to refer to column_a in the following way, without success: df["column_a"]

When I search and find examples of people doing the same thing, it looks to my untrained eye like it should work. What am I missing?


Solution

  • isNotNull is a method, and as such it must be called with parentheses: isNotNull(). Without them, F.col("column_a").isNotNull is the bound method object itself rather than a Column, which is why when() raises TypeError: condition should be a Column.
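
For reference, here is a minimal sketch of the corrected expression, reusing the column names and placeholder regex patterns from the question:

import pyspark.sql.functions as F

df = df.withColumn(
    "new_column",
    F.when(
        F.col("column_a").isNotNull(),  # called with parentheses, so it returns a Column
        F.regexp_extract("column_a", "myregex", 1),
    ).otherwise(F.regexp_extract("column_b", "myotherregex", 1)),
)

The same fix applies to the simplified example: F.when(F.col("column_a").isNotNull(), F.lit("A")).otherwise(F.lit("B")).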