python · jupyter-notebook · geopandas · geopackage

Geopandas read_file is confusing me when used with 'sql' parameter


I have a barebones demo function that lets me step through very big datasets in chunks, so the data can be processed without having to load the entire dataset into memory at once.


import geopandas as gpd


def generator(gpkg_file, chunk_size, offset, layer_name):
    while True:
        # Read only the next chunk_size rows of the layer.
        query = f"""
        SELECT *
        FROM {layer_name}
        LIMIT {chunk_size} OFFSET {offset}
        """
        gdf = gpd.read_file(gpkg_file, sql=query)
        # An empty result means we have paged past the last row.
        if gdf.empty:
            break
        yield gdf
        offset += chunk_size
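
For illustration, the generator would be consumed with a loop along these lines (the chunk size and layer name here are placeholders, not values taken from demo.gpkg):

for chunk in generator("demo.gpkg", chunk_size=50, offset=0, layer_name="my_layer"):
    # Each chunk should be a GeoDataFrame with at most 50 rows.
    print(len(chunk))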

When I use this function in a Jupyter notebook environment I get the expected result: a dataframe with the proper limit and offset, i.e. paginated data. But when I use the same code in a flat Python file it loads the entire gpkg file without the pagination. I am stumped and in desperate need of help.

[Image: the output when the function is used in a Jupyter notebook environment, which is what I need.]

But when used exactly like this in a Python file, it just loads the entire 175 rows (all the rows present in the demo.gpkg file) on every iteration, and since the dataframe never comes back empty, the loop goes on forever.


Solution

  • Using the sql parameter with geopandas.read_file is only supported when using the pyogrio read/write engine. If the fiona engine is used, the parameter is simply ignored, and you'll get the effect you describe with the .py file.

    As a workaround, you can force the pyogrio engine by adding the extra parameter engine="pyogrio", like this:

    gdf = gpd.read_file(gpkg_file, sql=query, engine="pyogrio")
    

    For the why... that might be somewhat complicated. In geopandas 0.x, fiona was the default engine, and there were several ways to activate pyogrio instead: e.g. via the above parameter or globally via an environment variable, and it also depended on which engines were installed. In the recently released geopandas 1.x series, pyogrio has become the default. I suppose a combination of these factors leads to a different engine being used in your two environments (see the sketch at the end of this answer for a way to check and pin the engine explicitly).

    Personally, I would just make sure you are using geopandas 1.x and uninstall fiona: that should avoid the problem.
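
    A minimal sketch of how to check this and pin the engine explicitly, assuming a geopandas version recent enough to expose the io_engine option (the layer name in the query below is just a placeholder):

    import geopandas as gpd

    # Confirm which geopandas version the .py script actually picks up.
    print(gpd.__version__)

    # Make pyogrio the default engine for all read_file calls,
    # so the sql parameter is honoured without passing engine= each time.
    gpd.options.io_engine = "pyogrio"

    gdf = gpd.read_file("demo.gpkg", sql="SELECT * FROM my_layer LIMIT 10")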