geopandas, pyproj

geopandas to_crs returned fewer records than expected


I am trying to convert a geopandas GeoDataFrame containing POLYGON and MULTIPOLYGON geometries from another coordinate reference system (CRS) to EPSG:4326.

Because the GeoDataFrame has approximately 200 thousand records, I split it into chunks of 1000 records, converted each chunk separately, and wrote each part to its own shapefile.

This conversion process took approximately 2 full days. After applying pd.concat on all the small_gdf parts into a full GeoDataFrame, the result contains only about 60% of the records from the original GeoDataFrame. Could records have been dropped because the to_crs conversion failed?

Meanwhile, I am going to add a new column to each small_gdf and rerun the to_crs operation to trace which records were dropped during the conversion process.
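That tracing idea can be sketched without rerunning the whole two-day conversion: tag every record with its original row number before chunking, then compare the surviving tags against the full original range afterwards. A minimal sketch of the idea using a plain pandas DataFrame as a stand-in for the GeoDataFrame (the `ORIG_IDX` column name is my own, not from the original code):

```python
import pandas as pd

# Stand-in for the original GeoDataFrame: 10 records with one attribute column.
gdf = pd.DataFrame({'value': range(10)})
gdf['ORIG_IDX'] = range(len(gdf))  # tag each record before any processing

# Simulate a lossy pipeline: suppose rows 3 and 7 are dropped somewhere.
concat_gdf = gdf[~gdf['ORIG_IDX'].isin([3, 7])]

# Compare the surviving tags against the full original range.
missing = sorted(set(range(len(gdf))) - set(concat_gdf['ORIG_IDX']))
print(missing)  # [3, 7] -> the original row numbers that were lost
```

The same set difference on the real data would immediately show whether the missing records cluster inside particular parts (pointing at the chunking) or are scattered (pointing at individual geometries).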

Code example [Please excuse any typos. I had to retype just for this post]

import os
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('bigShapefilePath.shp')
n_records = len(gdf)

# create tuples for start-end indexes of each chunk (end is inclusive)
chunksize = 1000
list_start_end_idx_tuples = []
for start in range(0, n_records, chunksize):
    end = min(start + chunksize - 1, n_records - 1)
    list_start_end_idx_tuples.append((start, end))

# convert in chunks
parts_folderpath = <parts_folderpath>
file_counter = 1
for each_start_end in list_start_end_idx_tuples:
    start, end = each_start_end
    small_gdf = gdf.iloc[start:end + 1].copy()  # copy avoids SettingWithCopyWarning
    small_gdf['WITHIN_PART_IDX'] = range(len(small_gdf))
    small_gdf = small_gdf.to_crs('epsg:4326')
    small_gdf.to_file(f'{parts_folderpath}/small_gdf_part{file_counter}.shp')

    file_counter += 1


# find file parts
full_folderpath = <full_folderpath>
list_smallgdf_filename = []
list_smallgdf_filenamenext = []

for dirpath, subdirs, filenames in os.walk(parts_folderpath):
    for filenamenext in filenames:
        # endswith('.shp') also excludes the .shp.xml sidecar files
        if filenamenext.endswith('.shp'):
            filename = filenamenext.split('.')[0]
            list_smallgdf_filename.append(filename)
            list_smallgdf_filenamenext.append(filenamenext)


# concat into full gdf
i = 0
for filenamenext in list_smallgdf_filenamenext:
    small_gdf = gpd.read_file(f'{parts_folderpath}/{filenamenext}')
    small_filename = list_smallgdf_filename[i]
    # filenames look like 'small_gdf_part<N>'; keep only the numeric suffix
    part_num = small_filename.split('part')[-1]
    small_gdf['PART_NUM'] = int(part_num)

    if i < 1:
        concat_gdf = small_gdf
    else:
        concat_gdf = pd.concat([concat_gdf, small_gdf])
    i += 1

concat_gdf.to_file(f'{full_folderpath}/concat_gdf.shp')
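As a side note on the concat step: calling pd.concat inside the loop copies the growing frame on every iteration, which is quadratic in the total number of rows. Collecting the parts in a list and concatenating once is the usual pattern. A small sketch with plain DataFrames standing in for the parts read back from disk:

```python
import pandas as pd

# Toy stand-ins for the per-part frames read back from disk.
parts = [pd.DataFrame({'value': [i * 10, i * 10 + 1], 'PART_NUM': i})
         for i in range(1, 4)]

# Accumulate into a list first, then concatenate once.
concat_gdf = pd.concat(parts, ignore_index=True)
print(len(concat_gdf))  # 6
```

With geopandas, pd.concat over a list of GeoDataFrames works the same way and preserves the geometry column, provided all parts share the same CRS.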

Solution

  • The issue was with the chunksize.
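A chunk-boundary bug like that can be caught up front, before spending days on conversion, by checking that the generated (start, end) tuples cover every row exactly once. A small self-contained check, assuming inclusive end indexes as in the code above (the helper names are my own):

```python
def make_chunk_tuples(n_records, chunksize):
    """Inclusive (start, end) index pairs covering rows 0..n_records-1."""
    return [(start, min(start + chunksize - 1, n_records - 1))
            for start in range(0, n_records, chunksize)]

def covers_exactly_once(tuples, n_records):
    """True if the tuples cover every row index exactly once, no gaps or overlaps."""
    covered = []
    for start, end in tuples:
        covered.extend(range(start, end + 1))
    return sorted(covered) == list(range(n_records))

tuples = make_chunk_tuples(200_000, 1000)
print(covers_exactly_once(tuples, 200_000))  # True
```

Running the same check against the original tuple-building loop (with its hard-coded `start + 999` and `end > n_records` boundary test) would flag any mismatch between the chunk size and the end-index arithmetic immediately.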