pythonpandasdataframealgorithmgis

How to efficiently remove overlapping circles from the dataset


I have a dataset of about 20,000 records that represent global cities of population > 20,000. I have estimated radius which more or less describes the size of the city. It's not exactly accurate but for my purposes it will be enough.

that I'm loading it into my Panda dataframe object. Below is the sample

name_city,country_code,latitude,longitude,geohash,estimated_radius,population
Vitry-sur-Seine,FR,48.78716,2.40332,u09tw9qjc3v3,1000,81001
Vincennes,FR,48.8486,2.43769,u09tzkx5dr13,500,45923
Villeneuve-Saint-Georges,FR,48.73219,2.44925,u09tpxrmxdth,500,30881
Villejuif,FR,48.7939,2.35992,u09ttdwmn45z,500,48048
Vigneux-sur-Seine,FR,48.70291,2.41357,u09tnfje022n,500,26692
Versailles,FR,48.80359,2.13424,u09t8s6je2sd,1000,85416
Vélizy-Villacoublay,FR,48.78198,2.19395,u09t9bmxdspt,500,21741
Vanves,FR,48.82345,2.29025,u09tu059nwwp,500,26068
Thiais,FR,48.76496,2.3961,u09tqt2u3pmt,500,29724
Sèvres,FR,48.82292,2.21757,u09tdryy15un,500,23724
Sceaux,FR,48.77644,2.29026,u09tkp7xqgmw,500,21511
Saint-Mandé,FR,48.83864,2.41579,u09tyfz1eyre,500,21261
Saint-Cloud,FR,48.84598,2.20289,u09tfhhh7n9u,500,28839
Paris,FR,48.85341,2.3488,u09tvmqrep8n,12000,2138551
Orly,FR,48.74792,2.39253,u09tq6q1jyzt,500,20528
Montrouge,FR,48.8162,2.31393,u09tswsyyrpr,500,38708
Montreuil,FR,48.86415,2.44322,u09tzx7n71ub,2000,111240
Montgeron,FR,48.70543,2.45039,u09tpf83dnpn,500,22843
Meudon,FR,48.81381,2.235,u09tdy73p38y,500,44652
Massy,FR,48.72692,2.28301,u09t5yqqvupx,500,38768
Malakoff,FR,48.81999,2.29998,u09tsr6v13tr,500,29420
Maisons-Alfort,FR,48.81171,2.43945,u09txtbkg61z,1000,53964
Longjumeau,FR,48.69307,2.29431,u09th0q9tq1s,500,20771
Le Plessis-Robinson,FR,48.78889,2.27078,u09te9txch23,500,22510
Le Kremlin-Bicêtre,FR,48.81471,2.36073,u09ttwrn2crz,500,27867
Le Chesnay,FR,48.8222,2.12213,u09t8rc3cjwz,500,29154
La Celle-Saint-Cloud,FR,48.85029,2.14523,u09tbufje6p6,500,21539
Ivry-sur-Seine,FR,48.81568,2.38487,u09twq8egqrc,1000,57897
Issy-les-Moulineaux,FR,48.82104,2.27718,u09tezd5njkr,1000,61447
Fresnes,FR,48.75568,2.32241,u09tkgenkj6r,500,24803
Fontenay-aux-Roses,FR,48.79325,2.29275,u09ts4t92cn3,500,24680
Clamart,FR,48.80299,2.26692,u09tes6dp0dn,1000,51400
Choisy-le-Roi,FR,48.76846,2.41874,u09trn12bez7,500,35590
Chevilly-Larue,FR,48.76476,2.3503,u09tmmr7mfns,500,20125
Châtillon,FR,48.8024,2.29346,u09tshnn96xx,500,32383
Châtenay-Malabry,FR,48.76507,2.26655,u09t7t6mn7yj,500,32715
Charenton-le-Pont,FR,48.82209,2.41217,u09twzu3r9hq,500,30910
Cachan,FR,48.79632,2.33661,u09tt5j7nvqd,500,26540
Bagnolet,FR,48.86667,2.41667,u09tyzzubrxb,500,33504
Bagneux,FR,48.79565,2.30796,u09tsdbx727w,500,38900
Athis-Mons,FR,48.70522,2.39147,u09tn6t2mr16,500,31225
Alfortville,FR,48.80575,2.4204,u09txhf6p7jp,500,37290
Quinze-Vingts,FR,48.84656,2.37439,u09tyh0zz6c8,500,26265
Croulebarbe,FR,48.81003,2.35403,u09tttd5hc5f,500,20062
Gare,FR,48.83337,2.37513,u09ty1cdbxcq,1000,75580
Maison Blanche,FR,48.82586,2.3508,u09tv2rz1xgx,1000,64302

Below is the visual representation of the data sample: enter image description here

My goal is to find an efficient algorithm that would remove the intersecting circles and only keep the one with largest population.

My initial approach was to determine which circles are intersecting using haversine formula. Problem was that to check every record for intersections with others, it needs to traverse entire dataset. The time complexity of this was too high.

My second approach was to segregate the dataset by country code, and run the comparisons by chunks:

  def _remove_intersecting_circles_for_country(df_country):
    """Helper function to remove intersections within a single country."""
    indices_to_remove = set()
    for i in range(len(df_country)):
      for j in range(i + 1, len(df_country)):
        distance = haversine(df_country['latitude'].iloc[i], df_country['longitude'].iloc[i],
                            df_country['latitude'].iloc[j], df_country['longitude'].iloc[j])
        if distance < df_country['estimated_radius'].iloc[i] + df_country['estimated_radius'].iloc[j]:
          if df_country['population'].iloc[i] < df_country['population'].iloc[j]:
            indices_to_remove.add(df_country.index[i]) 
          else:
            indices_to_remove.add(df_country.index[j])
    return indices_to_remove

  all_indices_to_remove = set()
  for country_code in df['country_code'].unique():
    df_country = df[df['country_code'] == country_code]
    indices_to_remove = _remove_intersecting_circles_for_country(df_country)
    all_indices_to_remove.update(indices_to_remove)

  new_df = df.drop(index=all_indices_to_remove)
  return new_df

This has significantly improved the performance because to check every record we only need to check against all the records with the same country_code. But that still makes a lot of unnecessary comparisons


Solution

  • Once you have the circles as polygons, determining intersections between polygons is very fast if you use a spatial index to do so.

    So, you can:

    Result (red: original cities, blue: retained cities):

    visualisation of the result

    Code:

    from pathlib import Path
    import geopandas as gpd
    from matplotlib import pyplot as plt
    from pyproj import CRS, Transformer
    from shapely.geometry import Point
    from shapely.ops import transform
    
    
    def geodesic_point_buffer(lat, lon, distance):
        # Azimuthal equidistant projection
        aeqd_proj = CRS.from_proj4(
            f"+proj=aeqd +lat_0={lat} +lon_0={lon} +x_0=0 +y_0=0")
        tfmr = Transformer.from_proj(aeqd_proj, aeqd_proj.geodetic_crs)
        buf = Point(0, 0).buffer(distance)  # distance in metres
        return transform(tfmr.transform, buf)
    
    # Read the csv file with data
    #csv_path = Path(__file__).resolve().with_suffix(".csv")
    csv_path = "https://raw.githubusercontent.com/theroggy/pysnippets/refs/heads/main/pysnippets/stackoverflow_questions/2024/Q4/2024-10-25_circles.csv"
    points_df = gpd.read_file(csv_path, autodetect_type=True)
    
    # Convert the points to circles by buffering them
    points_buffer_gdf = gpd.GeoDataFrame(
        points_df,
        geometry=points_df.apply(
            lambda row : geodesic_point_buffer(row.latitude, row.longitude, row.estimated_radius), axis=1
        ),
        crs=4326,
    )
    
    # Determine the intersecting city buffers (result includes self-intersections)
    intersecting_gdf = points_buffer_gdf.sjoin(points_buffer_gdf)
    
    # Get all city buffers that intersect a city with a larger population
    intersecting_larger_population_df = intersecting_gdf.loc[
        intersecting_gdf.population_left < intersecting_gdf.population_right
    ]
    
    # Remove the city buffers that intersect with a larger population city buffer
    result_gdf = points_buffer_gdf[
        ~points_buffer_gdf.index.isin(intersecting_larger_population_df.index)
    ]
    
    # Plot result
    ax = points_buffer_gdf.boundary.plot(color="red")
    result_gdf.boundary.plot(color="blue", ax=ax)
    plt.show()