I have a demo function. It is a barebones generator that helps iterate through very large datasets and process the data in chunks, without having to load the entire dataset into memory at once.
import geopandas as gpd

def generator(gpkg_file, chunk_size, offset, layer_name):
    while True:
        # Read one page of rows from the layer using LIMIT/OFFSET pagination
        query = f"""
            SELECT *
            FROM {layer_name}
            LIMIT {chunk_size} OFFSET {offset}
        """
        gdf = gpd.read_file(gpkg_file, sql=query)
        if gdf.empty:
            break
        yield gdf
        offset += chunk_size
When I use this function in a Jupyter notebook environment I get the expected result: a dataframe with the proper limit and offset, i.e. paginated data. But when I use the same code in a plain Python file, it loads the entire GPKG file without any pagination. I am stumped and in desperate need of help.
This is the behaviour when used in a Jupyter notebook environment, which is what I need.

But when used exactly like this in a Python file, it just loads all 175 rows (present in the demo.gpkg file) on every iteration, and since the dataframe never becomes empty, the loop goes on forever.
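For completeness, here is a minimal sketch of how the generator is being driven in both environments (the chunk size and the layer name demo_layer are placeholders for illustration):

# Assumes the generator() defined above; file and layer names are illustrative
for i, chunk in enumerate(generator("demo.gpkg", chunk_size=50, offset=0, layer_name="demo_layer")):
    print(f"chunk {i}: {len(chunk)} rows")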
Using the sql parameter with geopandas.read_file is only supported when using the pyogrio read/write engine. If the fiona engine is used, the parameter will simply be ignored, and you'll get the effect you describe with the .py file.
As a workaround, you can force the pyogrio engine by adding the extra parameter engine="pyogrio", like this:

gdf = gpd.read_file(gpkg_file, sql=query, engine="pyogrio")
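Applied to the generator from the question, that would look roughly like this (a sketch; everything except the added engine argument is unchanged from your code):

import geopandas as gpd

def generator(gpkg_file, chunk_size, offset, layer_name):
    while True:
        query = f"""
            SELECT *
            FROM {layer_name}
            LIMIT {chunk_size} OFFSET {offset}
        """
        # Explicitly request pyogrio so the sql parameter is honoured
        gdf = gpd.read_file(gpkg_file, sql=query, engine="pyogrio")
        if gdf.empty:
            break
        yield gdf
        offset += chunk_size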
For the why... that might be somewhat complicated. In geopandas 0.x, fiona was the default engine, and there were some ways to activate pyogrio: e.g. via the above parameter or globally via an environment variable, and it also depended on which engines were installed. In the recently released geopandas 1.x series, pyogrio has become the default. I suppose there is a combination of factors that leads to the different engine being used.
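For example, one way to make the engine choice explicit for a whole script rather than per call is the global io_engine option (a sketch; this assumes a geopandas version recent enough to have geopandas.options.io_engine, and "my_layer" is a placeholder layer name):

import geopandas as gpd

# Set the IO engine globally, so every read_file call uses pyogrio
gpd.options.io_engine = "pyogrio"

gdf = gpd.read_file("demo.gpkg", sql="SELECT * FROM my_layer LIMIT 10")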
Personally, I would just make sure you are using geopandas 1.x and uninstall fiona: that should avoid the problem.
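If you want to verify what your environment actually has installed (and thus which engine can be picked up), a quick check along these lines may help (a sketch, not part of the fix itself):

import geopandas as gpd

print("geopandas", gpd.__version__)  # ideally 1.x

# Report which IO engines are importable in this environment
for engine in ("pyogrio", "fiona"):
    try:
        module = __import__(engine)
        print(engine, module.__version__, "installed")
    except ImportError:
        print(engine, "not installed")

Running this from both the notebook kernel and the plain Python interpreter can also reveal whether the two are actually using the same environment.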