python geopandas

How to sjoin in geopandas on both a common key column and location


Suppose I have a dataframe A with two columns: geometry (points) and hour. Dataframe B also has geometry (shapes) and hour.

I am familiar with the standard sjoin. What I want is for sjoin to link rows from the two tables only when the hours are the same. In traditional join terminology, the keys are geometry and hour. How can I achieve this?


Solution

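  • I have reviewed two logical approaches:

    simple(): spatial join everything, then keep only the rows where the hours match
    shard(): filter both dataframes by hour, spatial join each pair of shards, then concatenate the results

    Conclusions

    Both approaches return the same rows. In the timings below, sharding by hour was faster (4.05 s vs 6.48 s) for 84000 points, 379 polygons and 24 possible hours.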

    import pandas as pd
    import numpy as np
    import geopandas as gpd
    import shapely.geometry
    import requests
    
    # source some points and polygons
    # fmt: off
    dfp = pd.read_html("https://www.latlong.net/category/cities-235-15.html")[0]
    dfp = gpd.GeoDataFrame(dfp, geometry=gpd.points_from_xy(dfp["Longitude"], dfp["Latitude"]))
    res = requests.get("https://opendata.arcgis.com/datasets/69dc11c7386943b4ad8893c45648b1e1_0.geojson")
    df_poly = gpd.GeoDataFrame.from_features(res.json())
    # fmt: on
    # bulk up number of points
    dfp = pd.concat([dfp for _ in range(1000)]).reset_index()
    HOURS = 24
    dfp["hour"] = np.random.randint(0, HOURS, len(dfp))
    df_poly["hour"] = np.random.randint(0, HOURS, len(df_poly))
    
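    def simple():
        # spatial join everything, then keep rows where the hours match;
        # sjoin suffixes the shared "hour" column as hour_left / hour_right
        return gpd.sjoin(dfp, df_poly).loc[lambda d: d["hour_left"] == d["hour_right"]]
    
    def shard():
        # spatial join each hour separately, then concatenate the pieces
        return pd.concat(
            [
                gpd.sjoin(*[d.loc[d["hour"].eq(h)] for d in [dfp, df_poly]])
                for h in range(HOURS)
            ]
        )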
    
    print(f"""length test: {len(simple()) == len(shard())} {len(simple())}
    dataframe test: {simple().sort_index().equals(shard().sort_index())}
    points: {len(dfp)}
    polygons: {len(df_poly)}""")
    
    # %timeit is an IPython/Jupyter magic, so run these two lines in a notebook or IPython shell
    %timeit simple()
    %timeit shard()
    

    Output

    length test: True 3480
    dataframe test: True
    points: 84000
    polygons: 379
    6.48 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    4.05 s ± 34.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
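
The benchmark above depends on two remote data sources. To try the same two approaches offline, here is a minimal sketch on toy data. The `points` and `polys` frames, the `name` column, and the helper names `join_then_filter` and `shard_then_join` are made up for illustration and are not part of the original answer. It assumes both frames share the key column and that at least one key value occurs in both.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Hypothetical toy data: three points and two identical squares, each tagged with an hour.
points = gpd.GeoDataFrame(
    {"hour": [1, 2, 1]},
    geometry=[Point(0.5, 0.5), Point(0.5, 0.5), Point(5, 5)],
)
polys = gpd.GeoDataFrame(
    {"hour": [1, 2], "name": ["a", "b"]},
    geometry=[box(0, 0, 1, 1), box(0, 0, 1, 1)],
)


def join_then_filter(left, right, key="hour"):
    """Spatial join first, then keep rows whose key matches on both sides."""
    joined = gpd.sjoin(left, right)  # the shared key column becomes key_left / key_right
    return joined[joined[f"{key}_left"] == joined[f"{key}_right"]]


def shard_then_join(left, right, key="hour"):
    """Spatial join each key value separately, then concatenate the pieces."""
    # Assumption: at least one key value is present in both frames,
    # otherwise pd.concat receives an empty list and raises.
    common = sorted(set(left[key]) & set(right[key]))
    parts = [gpd.sjoin(left[left[key] == k], right[right[key] == k]) for k in common]
    return pd.concat(parts)


a = join_then_filter(points, polys)
b = shard_then_join(points, polys)
print(a[["hour_left", "hour_right", "name"]])
print(b[["hour_left", "hour_right", "name"]])
```

Here each of the first two points lies inside both squares, but only the pairing with the matching hour survives. The third point lies outside both squares and is dropped by the default inner join. Which variant is faster on real data will depend on the number of rows and of distinct key values, so it is worth timing both.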