I saved a Python Pandas DataFrame of shape (30M rows x 25 columns) as an Apache Arrow table, and I'm reading that table in Julia as:
input_arrow = Arrow.Table("path/to/table.arrow")
My question is: how can I iterate over the rows of input_arrow efficiently?
If I just do:
for c in input_arrow
    # Do something
end
then I iterate over the columns, but I need to iterate over the rows.
Something else I've tried is converting the Arrow.Table into a DataFrames.DataFrame:
df = DataFrames.DataFrame(input_arrow)
for row in eachrow(df)
    # do something
end
But this method is very slow. It reminds me of how slow it is to do df.iterrows() in Python.
So, what is the fast way (similar to df.itertuples() in Python) to iterate over an Arrow.Table in Julia?
As suggested by László Hunyadi in the accepted solution, converting the Arrow.Table into a Tables.rowtable gives a significant speedup. There was a RAM issue, though: the Arrow.Table and the Tables.rowtable didn't fit in my RAM together, so I had to read the Arrow.Table in chunks as follows:
for chunk in Arrow.Stream("/path/to/table.arrow")
    row_table = Tables.rowtable(chunk)
    # do something with row_table
end
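If even a single chunk's rowtable is too large, rows can also be iterated lazily per chunk. Here is a sketch using Tables.rows from Tables.jl (the column name in the comment is a hypothetical placeholder):
using Arrow, Tables

for chunk in Arrow.Stream("/path/to/table.arrow")
    # Each chunk is a record batch that itself satisfies the Tables.jl
    # interface, so Tables.rows yields rows lazily instead of
    # allocating a full rowtable per chunk.
    for row in Tables.rows(chunk)
        # do something with row, e.g. read a field as row.some_column
        # (some_column is a hypothetical column name)
    end
end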
Use Tables.rowtable from Tables.jl to convert your Arrow table into a row-oriented table (a Vector of NamedTuples) that you can iterate over row by row. It is very efficient and should be significantly faster than converting to a DataFrame first:
using Arrow, Tables
input_arrow = Arrow.Table("path/to/table.arrow")
row_table = Tables.rowtable(input_arrow)
for row in row_table
    # Do something with row
end
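Each element of the rowtable is a NamedTuple keyed by column name, so fields are read with dot access. A minimal sketch, assuming hypothetical columns price and quantity:
# Sum a derived quantity over all rows; `price` and `quantity`
# are hypothetical column names for illustration.
total = sum(row.price * row.quantity for row in row_table)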
This approach materializes the entire table in memory. If the table is too large to fit, use a different approach, such as processing it in chunks with Arrow.Stream.
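If you want row-wise iteration without materializing a rowtable at all, one lazier option is a sketch along these lines, using Tables.namedtupleiterator from Tables.jl:
using Arrow, Tables

input_arrow = Arrow.Table("path/to/table.arrow")
# Lazily yields one NamedTuple per row instead of building a
# Vector of all rows up front, so memory use stays low.
for row in Tables.namedtupleiterator(input_arrow)
    # Do something with row
end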