python pyspark

How to filter values from struct by field in pyspark?


I'm wondering how to filter the elements of an array-of-structs column, keeping only those whose id appears in a list stored in another column.

Suppose I have a dataframe df with a column events whose type is an array of structs with fields id and time. There is another column, target_ids, which contains a list of ids. From each row's events I want to keep only those events whose id appears in the corresponding target_ids list.
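
For reference, here is a rough sketch of the kind of data I mean (column names match the description above; the sample values are just made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Each row: an array of event structs (id, time) plus the list of ids to keep
df = spark.createDataFrame(
    [
        ([(1, '10:00'), (2, '10:05'), (3, '10:10')], [1, 3]),
        ([(4, '11:00'), (5, '11:30')], [5]),
    ],
    'events array<struct<id: int, time: string>>, target_ids array<int>',
)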

If target_ids is a column holding a single id of integer type, then the following works:

df = df.withColumn('target_time', F.expr('filter(events, x -> x.id == target_ids)').getField('time'))

However, if I change target_ids to a list and == to in, I get a syntax error. How can I solve this? And is there any documentation on the syntax of the expressions inside expr? This page is sadly blank.


Solution

  • The expression passed to expr is Spark SQL, not Python, so the Python in operator isn't understood there, and SQL's IN doesn't test membership in an array column either. The standard Spark SQL function for checking whether an element exists in an array is array_contains.

    You should be able to fix this by using array_contains inside your filter expression.

    For example:

    from pyspark.sql import functions as F
    
    # Keep only the events whose id appears in the row's target_ids array
    df = df.withColumn(
        'target_events',
        F.expr('filter(events, x -> array_contains(target_ids, x.id))')
    )
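
    If you also want just the times, as in your original target_time column, I believe you can chain the same field access onto the filtered array (the column name target_times here is just my choice):

    # Take the time field of each remaining event, giving an array of times per row
    df = df.withColumn(
        'target_times',
        F.expr('filter(events, x -> array_contains(target_ids, x.id))').getField('time')
    )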
    

    I don't know if this tutorial will be useful, but I'll link it anyway^^:

    https://www.youtube.com/watch?v=9zX-OfOzLlQ