pyspark aws-glue aws-glue-spark

AWS Glue Dynamic Frame Pushdown Predicate List


When using a pushdown predicate with an AWS Glue DynamicFrame, how does it evaluate a list of values?

For example, the following list was created to be used as a pushdown predicate:

day=list(p_day.select('day').toPandas()['day'])
month=list(p_month.select('month').na.drop().toPandas()['month'])
year=list(p_year.select('year').toPandas()['year'])

predicate = "day in (%s) and month in (%s) and year in (%s)" % (
    ",".join("'" + str(s) + "'" for s in day),
    ",".join("'" + str(s) + "'" for s in month),
    ",".join("'" + str(s) + "'" for s in year),
)

Let's say it returns this:

"day in ('07','15') and month in ('11','09','08') and year in ('2021')"

How would the pushdown predicate read this combination of lists?

Is it:

day month year
07 11 2021
15 11 2021
07 09 2021
15 09 2021
07 08 2021
15 08 2021

-OR-

day month year
07 11 2021
15 11 2021
15 08 2021
15 09 2021

I suspect the predicate is read like the first table rather than the second, but it is the second that I want to pass as the pushdown predicate. Does building separate lists effectively produce every combination of the values? It seems the true day, month, and year pairing is lost in the lists; the dates I actually want are 11/07/2021, 11/15/2021, 08/15/2021, and 09/15/2021.


Solution

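  • This is not specific to Glue: the pushdown predicate is an ordinary Spark SQL boolean expression that is evaluated against each partition's column values. Because the three IN conditions are joined with AND, every combination of the listed days, months, and years matches, so you will get the first table and not the second. To get the second, restructure the boolean expression so that each wanted date is its own day/month/year condition and the dates are combined with OR.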
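
One way to restructure the expression is sketched below. It assumes the wanted dates are available as a list of (day, month, year) string tuples, for example collected from a single DataFrame that still holds all three columns together; the function name `build_partition_predicate` and the variable names are hypothetical. The first part only illustrates why the AND-of-IN form matches six partitions instead of four.

```python
from itertools import product

# Values from the question's example predicate.
days = ["07", "15"]
months = ["11", "09", "08"]
years = ["2021"]

# "day in (...) and month in (...) and year in (...)" matches every
# combination of the listed values, i.e. their Cartesian product.
in_list_matches = set(product(days, months, years))

# Assumption: the true (day, month, year) pairings are kept together.
wanted = [
    ("07", "11", "2021"),
    ("15", "11", "2021"),
    ("15", "08", "2021"),
    ("15", "09", "2021"),
]


def build_partition_predicate(dates):
    """Build an OR-of-ANDs predicate, one clause per (day, month, year)."""
    if not dates:
        raise ValueError("at least one date is required")
    clauses = [
        f"(day = '{d}' and month = '{m}' and year = '{y}')"
        for d, m, y in dates
    ]
    return " or ".join(clauses)


predicate = build_partition_predicate(wanted)
print(len(in_list_matches), "partitions match the IN-list form")
print(predicate)
```

The resulting string can then be passed wherever the original `predicate` was used as the pushdown predicate. With many dates the expression grows linearly, which is usually acceptable for partition filtering but worth keeping in mind.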