python, pandas, julia, apache-arrow

Fast way to iterate over an Arrow.Table with 30 million rows and 25 columns in Julia


I saved a Python Pandas DataFrame of size (30M x 25) as an Apache Arrow table. Then I read that table in Julia as:

input_arrow = Arrow.Table("path/to/table.arrow")

My question is: how can I iterate over the rows of input_arrow efficiently?

If I just do:

for c in input_arrow
    # Do something
end

then I would be iterating over the columns, but I need to iterate over the rows.

Something else that I've tried is converting the Arrow.Table into a DataFrames.DataFrame:

df = DataFrames.DataFrame(input_arrow)
for row in eachrow(df)
    # do something
end

But this method is very slow. It reminds me of how slow it is to do df.iterrows() in Python.

So, what is the fast way (similar to df.itertuples()) to iterate over an Arrow.Table in Julia?

Update

As suggested by László Hunyadi in the accepted solution below, converting the Arrow.Table into a Tables.rowtable gives a significant speedup.

There was an issue with RAM: the Arrow.Table and the Tables.rowtable didn't fit in memory, so I had to read the Arrow.Table in chunks as follows:

for chunk in Arrow.Stream("/path/to/table.arrow")
    row_table = Tables.rowtable(chunk)
    # do something with row_table
end

Solution

  • Use Tables.rowtable from Tables.jl to convert your Arrow table into a row-oriented table (a Vector of NamedTuples) that you can iterate over. It should be significantly faster than converting to a DataFrame and using eachrow:

    using Arrow, Tables
    input_arrow = Arrow.Table("path/to/table.arrow")
    row_table = Tables.rowtable(input_arrow)
    for row in row_table
        # Do something with row
    end
    

    This approach loads the entire table into memory. If your table is too large to fit, use a different approach, such as processing it in chunks.
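
    If materializing all rows at once is the constraint, another option worth trying (my own sketch, not something covered above) is Tables.namedtupleiterator from Tables.jl, which yields rows lazily as NamedTuples instead of allocating the full row-oriented copy up front. The path below is the same placeholder as above:

    using Arrow, Tables

    input_arrow = Arrow.Table("path/to/table.arrow")

    # Lazily yields one NamedTuple per row; no full
    # row-oriented copy of the table is built up front.
    for row in Tables.namedtupleiterator(input_arrow)
        # Access fields by column name, e.g. row.col1 (hypothetical column name)
    end

    Since Arrow.Table memory-maps the file, pairing it with a lazy row iterator should keep the resident memory footprint small; whether it matches the raw speed of a materialized Tables.rowtable is worth benchmarking on your data.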