How can I filter the structs in an array column down to those whose id appears in a list held in another column?
Suppose I have a dataframe df with a column events whose type is an array of structs with fields id and time. There is another column target_ids which contains a list of ids. For each row, I want to keep only those elements of events whose id appears in the corresponding target_ids list.
If target_ids is a column holding a single integer id, then the following works:

df = df.withColumn('target_time', F.expr('filter(events, x -> x.id == target_ids)').getField('time'))
However, if I change target_ids to a list and == to in, I get a syntax error. How can I solve this? And is there any documentation on the syntax of expressions inside expr? This page is sadly blank
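To make the desired behaviour concrete, here is the same operation in plain Python on hypothetical sample data (not Spark, just a model of what I want per row):

```python
# Keep only the events whose id appears in target_ids.
# Sample data is made up for illustration.
events = [{"id": 1, "time": 10}, {"id": 2, "time": 20}, {"id": 3, "time": 30}]
target_ids = [1, 3]

target_events = [e for e in events if e["id"] in target_ids]
print(target_events)  # [{'id': 1, 'time': 10}, {'id': 3, 'time': 30}]
```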
expr parses Spark SQL, not Python, so the Python in operator isn't available there; SQL's own IN only tests membership in an explicit value list, not in an array column. The standard Spark SQL function for checking whether an element exists in an array is array_contains.
You should be able to fix this by using array_contains inside your filter expression.
For example:
from pyspark.sql import functions as F
df = df.withColumn(
'target_events',
F.expr('filter(events, x -> array_contains(target_ids, x.id))')
)
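If, as in the target_time example from the question, you only need the time values of the matching events, you can wrap the filter in Spark's transform higher-order function. The Spark expression in the comment below is a sketch; the plain-Python lines model its per-row semantics on made-up data:

```python
# Spark SQL sketch (assumed column names from the question):
#   F.expr('transform(filter(events, x -> array_contains(target_ids, x.id)), x -> x.time)')
# Plain-Python model of the same filter-then-project pipeline:
events = [{"id": 1, "time": 100}, {"id": 2, "time": 200}, {"id": 3, "time": 300}]
target_ids = [2, 3]

target_times = [x["time"] for x in events if x["id"] in target_ids]
print(target_times)  # [200, 300]
```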
I don't know if this tutorial will be useful, but I'll link it anyway^^: