[SOLVED] How to sjoin using geopandas using a common key column and also location

Suppose I have a dataframe A with two columns: geometry (points) and hour. Dataframe B also has geometry (shapes) and hour.

I am familiar with the standard sjoin. What I want is for sjoin to link rows from the two tables only when the hours are the same. In traditional join terminology, the keys are geometry and hour. How can I achieve this?

Solution

I reviewed two logical approaches:

spatial join followed by a filter on matching hours
shard (filter) both data frames on hour first, spatial join the matching shards, then concatenate the results

I then tested the two results for equality and ran some timings.

Conclusions

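There is little difference between the timings on this test data set, although the sharded approach was somewhat faster in the run shown in the output (4.05 s vs 6.48 s). The simple approach is quicker if the number of points is small.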

import pandas as pd
import numpy as np
import geopandas as gpd
import shapely.geometry
import requests

# source some points and polygons
# fmt: off
dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
dfp = gpd.GeoDataFrame(dfp, geometry=dfp.loc[:,["Longitude", "Latitude",]].apply(shapely.geometry.Point, axis=1))
res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
df_poly = gpd.GeoDataFrame.from_features(res.json())
# fmt: on
# bulk up number of points
dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
HOURS = 24
dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))

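def simple():
    # spatial join on all rows, then keep only the pairs whose hours match
    return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]

def shard():
    # spatial join each hour's points against that hour's polygons, then concatenate
    return pd.concat(
        [
            gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
            for h in range(HOURS)
        ]
    )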

print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
dataframe test: {simple().sort_index().equals(shard().sort_index())}
points: {len(dfp)}
polygons: {len(df_poly)}""")

# %timeit is an IPython / Jupyter magic, not plain Python
%timeit simple()
%timeit shard()

Output

length test: True 3480
dataframe test: True
points: 84000
polygons: 379
6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
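
For anyone who wants to check the behaviour without downloading data, here is a minimal offline sketch of the same two approaches. The toy squares and points, the "name" column, and the function names join_then_filter and shard_then_join are made up for illustration and are not part of the answer above. It assumes sjoin's default suffixes, so the shared "hour" column comes back as "hour_left" and "hour_right".

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Toy data (hypothetical): two squares, each tagged with a single hour
polys = gpd.GeoDataFrame(
    {"name": ["west", "east"], "hour": [1, 2]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
)
# Point 1 lies inside "west" but carries a different hour, so it should be dropped
points = gpd.GeoDataFrame(
    {"hour": [1, 2, 2]},
    geometry=[Point(0.5, 0.5), Point(0.5, 0.6), Point(2.5, 0.5)],
)

def join_then_filter(pts, pol):
    # spatial join first, then keep only pairs whose hours agree
    joined = gpd.sjoin(pts, pol)
    return joined.loc[joined["hour_left"] == joined["hour_right"]]

def shard_then_join(pts, pol):
    # only hours present on both sides can produce matches
    hours = sorted(set(pts["hour"]) & set(pol["hour"]))
    parts = [
        gpd.sjoin(pts.loc[pts["hour"] == h], pol.loc[pol["hour"] == h])
        for h in hours
    ]
    return pd.concat(parts)

a = join_then_filter(points, polys).sort_index()
b = shard_then_join(points, polys).sort_index()
print(a[["hour_left", "hour_right", "name"]])
print(list(a.index) == list(b.index))
```

Iterating over the hours that actually occur on both sides, rather than over range(HOURS), skips shards that cannot match. Note that pd.concat raises an error if there are no common hours at all, so real code may want to guard that case.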