python google-bigquery geodesic-sphere

Geodesic distance between visitors and closest store


I have a big .csv dataset containing 10e7 points with coordinates (latitude, longitude) representing locations of visitors. I have another dataset containing 10e3 points with coordinates representing locations of stores.

I want to associate each visitor with the closest store, using some kind of geodesic formula.

I want something really fast and efficient that I can run in Python (pandas, for instance) or in Google BigQuery.

Can someone give me a clue?


Solution

  • To add to Felipe's answer:

    You can use a SQL UDF instead of a JS UDF.
    JS UDFs have some limits that SQL UDFs do not.

    So the equivalent SQL UDF that you can use with the rest of Felipe's code is:

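    -- Haversine great-circle distance in kilometers; inputs are in degrees.
    CREATE TEMPORARY FUNCTION distance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
    RETURNS FLOAT64 AS ((
    WITH constants AS (
      SELECT 0.017453292519943295 AS p  -- pi / 180, converts degrees to radians
    )
    -- 12742 km is the Earth's diameter (2 * 6371 km)
    SELECT 12742 * ASIN(SQRT(
      0.5 - COS((lat2 - lat1) * p)/2 +
      COS(lat1 * p) * COS(lat2 * p) *
      (1 - COS((lon2 - lon1) * p))/2))
    FROM constants
    ));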
    

    I tried to preserve the layout of the corresponding JS UDF as much as possible, so you can see how the SQL version was built.
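
    Since the question also mentions Python, here is a minimal sketch of the same formula in plain Python, kept in the same layout as the SQL UDF. The helper `closest_store` and the `(store_id, lat, lon)` tuple layout are my own assumptions for illustration; they are not part of Felipe's code. The clamp before the square root is also an addition, to guard against floating-point rounding.

```python
import math

EARTH_DIAMETER_KM = 12742.0   # same constant as in the SQL UDF (2 * 6371 km)
P = math.pi / 180             # degrees to radians, 0.017453292519943295


def distance(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km; inputs are in degrees."""
    a = (0.5 - math.cos((lat2 - lat1) * P) / 2
         + math.cos(lat1 * P) * math.cos(lat2 * P)
         * (1 - math.cos((lon2 - lon1) * P)) / 2)
    # Clamp to [0, 1] so rounding errors cannot break sqrt/asin.
    a = min(1.0, max(0.0, a))
    return EARTH_DIAMETER_KM * math.asin(math.sqrt(a))


def closest_store(visitor, stores):
    """Brute-force search (hypothetical helper).

    visitor: (lat, lon) tuple
    stores:  iterable of (store_id, lat, lon) tuples
    Returns (store_id, distance_km) for the nearest store.
    """
    lat, lon = visitor
    return min(
        ((sid, distance(lat, lon, slat, slon)) for sid, slat, slon in stores),
        key=lambda pair: pair[1],
    )


# Made-up example data: two stores and one visitor.
stores = [("paris", 48.8566, 2.3522), ("london", 51.5074, -0.1278)]
visitor = (50.8503, 4.3517)  # Brussels
nearest_id, nearest_km = closest_store(visitor, stores)
print(nearest_id, round(nearest_km, 1))
```

    This is only meant to show the formula. A pure-Python loop over every visitor and store pair would need on the order of 10^12 distance evaluations for the dataset sizes in the question, so it is probably too slow as written; the SQL UDF in BigQuery, or a vectorized or spatially indexed approach, is likely a better fit at that scale.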