Tags: python, expression, pyarrow

Combining or appending to pyarrow.dataset.expressions


I am trying to filter pyarrow data with pyarrow.dataset, and I want to build the filter expression dynamically from a variable number of conditions.

from pyarrow import parquet as pq
import pyarrow.dataset as ds
import datetime

exp1 = ds.field("IntCol") == 1
exp2 = ds.field("StrCol") == 'A'
exp3 = ds.field("DateCol") == datetime.date.today()

filters = (exp1 & exp2 & exp3)
print(filters)

# To be used when reading Parquet tables (read_table returns a pyarrow.Table)
df = pq.read_table('sample.parquet', filters=filters)

How can I do this without writing "&" explicitly, since I may have N expressions? I have been looking at different ways to collect expressions, such as np.logical_and.accumulate(). It gets me partially there, but I still need to convert the resulting array into a single expression.

import numpy as np
np.logical_and.accumulate([exp1, exp2, exp3])

out: array([<pyarrow.dataset.Expression (IntCol == 1)>,
       <pyarrow.dataset.Expression (StrCol == "A")>,
       <pyarrow.dataset.Expression (DateCol == 2021-06-09)>], dtype=object)

Going down the numpy route may not be the best approach. Does anyone have a suggestion for how this can be done?


Solution

  • operator.and_ is the functional equivalent of the & operator, so you can pass it to functools.reduce to fold a list of expressions of any length into a single expression, combining them pairwise from left to right.

    Using your three example expressions:

    >>> import functools
    >>> import operator
    >>> functools.reduce(operator.and_, [exp1, exp2, exp3])
    <pyarrow.dataset.Expression (((IntCol == 1) and (StrCol == "A")) and (DateCol == 2021-06-10))>
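If the list of expressions is built at runtime it may sometimes be empty, and functools.reduce raises a TypeError on an empty sequence when no initial value is given. A minimal sketch of a wrapper that handles that case follows. The name combine_filters and its combine parameter are hypothetical and not part of pyarrow. The helper is deliberately generic: it works on any objects that support the chosen operator, which includes the expressions from the question.

```python
import functools
import operator


def combine_filters(expressions, combine=operator.and_):
    """Fold a sequence of filter expressions into a single one.

    `combine_filters` is a hypothetical helper, not a pyarrow function.
    Returns None for an empty sequence, so the result can still be passed
    as filters=None (which means "no filtering" in pq.read_table).
    """
    expressions = list(expressions)
    if not expressions:
        return None
    # reduce applies `combine` pairwise, left to right:
    # ((e1 & e2) & e3) & ... for the default operator.and_
    return functools.reduce(combine, expressions)


# Usage with the question's expressions (requires pyarrow):
#   filters = combine_filters([exp1, exp2, exp3])
#   table = pq.read_table('sample.parquet', filters=filters)
# Pass combine=operator.or_ to OR the conditions together instead.
```

Returning None for the empty case is one possible choice. Raising an error instead may be preferable if an empty filter list indicates a bug in the calling code.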