I have the below dataframe
Column_1 Column_2 Column_3
A 1
A 2
A 3
A 4
A 5
B 1
B 4
B 5
C 1
C 2
I have to populate Column_3 based on values in Column_1 and Column_2. If Column_1 in ('A','B') and Column_2 not in ('1','3','5') i have to populate column_3 with X else Y.
Expected Output:
Column_1 Column_2 Column_3
A 1 Y
A 2 X
A 3 Y
A 4 X
A 5 Y
B 1 Y
B 4 X
B 5 Y
C 1 Y
C 2 Y
What I tried:
I tried with when and otherwise statement, but not sure how to use Not in along with the when statement. Any help on this would be highly appreciated
You can leverage isin
and inverse ~
:
import pyspark.sql.functions as F
c = (F.when(df['Column_1'].isin(['A','B']) &
(~df['Column_2'].isin([1,3,5])),'X').otherwise('Y'))
df.withColumn("Column_3",c).show()
Or:
expr = """CASE
WHEN Column_1 IN ('A','B') and Column_2 NOT IN (1,3,5)
THEN 'X' ELSE 'Y'
END as Column_3"""
df.selectExpr("*",expr).show()
+--------+--------+--------+
|Column_1|Column_2|Column_3|
+--------+--------+--------+
| A| 1| Y|
| A| 2| X|
| A| 3| Y|
| A| 4| X|
| A| 5| Y|
| B| 1| Y|
| B| 4| X|
| B| 5| Y|
| C| 1| Y|
| C| 2| Y|
+--------+--------+--------+
More details:
df['Column_1'].isin(['A','B'])
#Column<b'(Column_1 IN (A, B))'>
~df['Column_2'].isin([1,3,5])
#Column<b'(NOT (Column_2 IN (1, 3, 5)))'>