I have a DataFrame in PySpark, and I would like to add a new column based on the value in another column. I know this is fairly common, and I've searched and tried a bunch of different ways, but I always end up with the error message TypeError: condition should be a Column.
What I want to do is basically: if the value in one column is not null, run a regex on that column; otherwise, run a different regex on another column.
My original query looked like this:
import pyspark.sql.functions as F

df.withColumn("new_column", F.when(F.col("column_a").isNotNull,
                                   F.regexp_extract('column_a', 'myregex', 1))
                             .otherwise(F.regexp_extract('column_b', 'myotherregex', 1)))
That resulted in the error mentioned above. While simplifying to pinpoint where the error came from, I could never get it working; even this simplified example fails with the same error:
df.withColumn("new_column", F.when(F.col("column_a").isNotNull, F.lit("A"))
                             .otherwise(F.lit("B")))
I've also tried to refer to column_a in the following way, without success: df["column_a"].
When I search and find examples of people doing the same thing, it looks to my untrained eye like it should work. What am I missing?
isNotNull is a method, and as such it must be called with parentheses: isNotNull().
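
Without the parentheses, F.col("column_a").isNotNull is a reference to the bound method itself rather than the Column it would return, which is exactly why F.when raises TypeError: condition should be a Column. With that fixed, the original query works. A minimal sketch (the column names and the 'myregex'/'myotherregex' patterns are placeholders carried over from the question):

import pyspark.sql.functions as F

df = df.withColumn(
    "new_column",
    F.when(
        F.col("column_a").isNotNull(),  # note the parentheses: this call returns a Column
        F.regexp_extract("column_a", "myregex", 1),
    ).otherwise(F.regexp_extract("column_b", "myotherregex", 1)),
)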