Tags: distance, python-polars, geopy

Efficiently compute the distance between two POINTs using geodesic and Polars for large-scale datasets


I have created a dummy sample of ~10_000 rows with two columns (pickup and dropoff location points, encoded as strings).

I read the sample into a Polars DataFrame with the following command:

import polars as pl

df = pl.read_csv("./taxi_coordinates.csv")

I would like to efficiently compute the distance between those points using geodesic from geopy (from geopy.distance import geodesic).

Please note that I am trying to discover the most efficient approach because my original sample is over 30 million rows.

My approach using map_rows()

def compute_coordinates_v2(df: pl.DataFrame, col: str) -> pl.DataFrame:
    target_col: str = 'pu_polygon_centroid' if col == 'pickup' else 'do_polygon_centroid'
    location_data: str = f'{col}_location_cleaned'
    coordinates: str = f'{col}_coordinates'
    df = df.with_columns(
        # strip the WKT wrapper: 'POINT (lon lat)' -> 'lon lat'
        pl.col(target_col).str.replace_all(r'POINT \(|\)', '').alias(location_data)
    ).with_columns(
        # split on the space into a list of strings: [lon, lat]
        pl.col(location_data).str.split(' ').alias(coordinates)
    )
    return df

df = compute_coordinates_v2(df, 'pickup')
df = compute_coordinates_v2(df, 'dropoff')

The above operation generates two columns of type list[str]:

shape: (5, 1)
┌───────────────────────────────────┐
│ pickup_coordinates                │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9656345353807", "40.768615… │
│ ["-73.9924375369761", "40.748497… │
└───────────────────────────────────┘
shape: (5, 1)
┌───────────────────────────────────┐
│ dropoff_coordinates               │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.9656345353807", "40.768615… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9924375369761", "40.748497… │
│ ["-74.007879708664", "40.7177727… │
└───────────────────────────────────┘

Now, to compute the distance, I use the following function:

from geopy.distance import geodesic

def compute_centroid_distance_v2(row):
    # row[0] = pickup [lon, lat], row[1] = dropoff [lon, lat] (strings);
    # the truthiness check guards against empty strings
    if row[0][0] and row[0][1] and row[1][0] and row[1][1]:
        centroid_distance = geodesic(
            (row[0][1], row[0][0]),  # (latitude, longitude)
            (row[1][1], row[1][0])
        ).kilometers
    else:
        centroid_distance = 0.0
    return centroid_distance

df = df.with_columns(
    df.select(["pickup_coordinates", "dropoff_coordinates"])
      .map_rows(compute_centroid_distance_v2)
      .rename({"map": "centroid_distance"})
)

On a benchmark of 30 million rows, the map_rows approach took approximately 1.5 hours.

Obviously something like

df = df.with_columns(
        pl.col("pickup_coordinates").list.first().cast(pl.Float32).alias('pickup_longitude'),
        pl.col("pickup_coordinates").list.last().cast(pl.Float32).alias('pickup_latitude'),
        pl.col("dropoff_coordinates").list.first().cast(pl.Float32).alias('dropoff_longitude'),
        pl.col("dropoff_coordinates").list.last().cast(pl.Float32).alias('dropoff_latitude')
    ).with_columns(
        coords = geodesic( (pl.col("pickup_latitude"), pl.col('pickup_longitude')),  (pl.col("dropoff_latitude"), pl.col('dropoff_longitude'))).kilometers
    )

didn't work, because geodesic() receives the expression tuple (pl.col("pickup_latitude"), pl.col("pickup_longitude")) rather than concrete values, and fails when Polars tries to apply a logical operation to those expressions.

Thus, I would like to understand whether map_rows/map_elements is my only option, or if there is a different workaround that could speed up the computation.


Solution

  • As in the answer at https://stackoverflow.com/a/76265233/, you can attempt to replicate geodesic() using Polars expressions; a rough sketch of that idea follows just below.
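
    As a minimal, hypothetical sketch: geodesic() uses an ellipsoidal model, but a spherical haversine distance can be written entirely in Polars expressions. It assumes the pickup_latitude/pickup_longitude/dropoff_latitude/dropoff_longitude float columns from your second snippet have already been created, and it will differ slightly from geopy's numbers because of the spherical approximation:

    import polars as pl

    R_KM = 6371.0  # mean Earth radius in kilometers (spherical model)

    def haversine_km(lat1, lon1, lat2, lon2) -> pl.Expr:
        # great-circle (haversine) distance as pure Polars expressions --
        # an approximation of geopy's ellipsoidal geodesic()
        lat1, lon1, lat2, lon2 = (x.radians() for x in (lat1, lon1, lat2, lon2))
        a = (
            ((lat2 - lat1) / 2).sin().pow(2)
            + lat1.cos() * lat2.cos() * ((lon2 - lon1) / 2).sin().pow(2)
        )
        return 2 * R_KM * a.sqrt().arcsin()

    df = df.with_columns(
        haversine_km(
            pl.col("pickup_latitude"), pl.col("pickup_longitude"),
            pl.col("dropoff_latitude"), pl.col("dropoff_longitude"),
        ).alias("centroid_distance_km")
    )

    Because this stays inside the expression engine, it runs vectorized over all rows instead of calling a Python function per row.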

    Another potential option could be DuckDB, which can input/output Polars DataFrames easily.

    DuckDB has a SPATIAL extension: https://duckdb.org/2023/04/28/spatial.html

    duckdb.sql("install spatial") # needed once
    

    If I scale your example up for a simple comparison:

    df_big = pl.concat([df] * 10)
    

    Using your map_rows approach:

    (df_big
      .select("pickup_coordinates", "dropoff_coordinates")
      .map_rows(compute_centroid_distance_v2)
      .rename({"map": "centroid_distance"})
    )
    
    shape: (98_510, 1)
    ┌───────────────────┐
    │ centroid_distance │
    │ ---               │
    │ f64               │
    ╞═══════════════════╡
    │ 1.50107           │
    │ 0.0               │
    │ 0.0               │
    │ 3.18019           │
    │ 3.652772          │
    │ …                 │
    │ 2.376629          │
    │ 1.440797          │
    │ 4.583181          │
    │ 0.53954           │
    │ 2.589928          │
    └───────────────────┘
    

    Elapsed time: 4.52725 seconds

    Using duckdb:

    duckdb.sql("load spatial")
    
    duckdb.sql("""
    from df_big
    select
       st_distance_spheroid(
          st_point(
             pickup_coordinates[2]::float, -- NOTE: array indexing is 1-based
             pickup_coordinates[1]::float
          ),
          st_point(
             dropoff_coordinates[2]::float,
             dropoff_coordinates[1]::float
          )
       ) as geodesic
    """) .pl()
    
    shape: (98_510, 1)
    ┌─────────────┐
    │ geodesic    │
    │ ---         │
    │ f64         │
    ╞═════════════╡
    │ 1501.364    │
    │ 0.0         │
    │ 0.0         │
    │ 3180.189287 │
    │ 3652.673199 │
    │ …           │
    │ 2376.786018 │
    │ 1440.740571 │
    │ 4583.039701 │
    │ 539.144276  │
    │ 2590.085087 │
    └─────────────┘
    

    Elapsed time: 0.10821 seconds

    I don't know too much about spatial data, so I'm not entirely sure why there are slight differences in the outputs. (One obvious difference is units: st_distance_spheroid returns meters, whereas geodesic().kilometers returns kilometers. Even after converting, small discrepancies remain, presumably because the two libraries implement the ellipsoidal distance slightly differently.)

    It also seems like you could use st_read() to load the data directly into DuckDB instead of having to manually chop it up with Polars first.
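
    As a purely hypothetical, untested sketch of skipping the Polars preprocessing: here read_csv_auto plus st_geomfromtext parses the WKT strings directly (rather than st_read, which goes through GDAL); the pu_polygon_centroid/do_polygon_centroid column names are taken from the question:

    duckdb.sql("""
    select
       st_distance_spheroid(
          st_point(st_y(pu), st_x(pu)), -- (latitude, longitude), as above
          st_point(st_y(dr), st_x(dr))
       ) as geodesic
    from (
       select
          st_geomfromtext(pu_polygon_centroid) as pu,
          st_geomfromtext(do_polygon_centroid) as dr
       from read_csv_auto('./taxi_coordinates.csv')
    )
    """).pl()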