
Google Location distance calculation using Python pandas DataFrames


I am trying to tease out the dates on which I was in a certain area (within a mile or so) using Google Location data and a pandas DataFrame.

First, I convert the E7 values (degrees × 10^7) to decimal degrees:

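import json
import pandas as pd

# Takeout_google_location_history holds the path to the Takeout JSON export
with open(Takeout_google_location_history) as f:
    data = json.load(f)

df = pd.json_normalize(data['locations'])
# E7 values are degrees * 10^7; the columns keep their original names
df['latitudeE7'] = df['latitudeE7'].div(10000000.0)
df['longitudeE7'] = df['longitudeE7'].div(10000000.0)
df.head()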

Then calculate the distance:

import haversine as hs
from haversine import Unit
loc1 = (31.393300,-99.070050)
df['diff'] = hs.haversine(loc1,(df['latitudeE7'],df['longitudeE7']),unit=Unit.MILES)
df.head()

This raises the following error:

~\Anaconda2\envs\notebook\lib\site-packages\haversine\haversine.py in haversine(point1, point2, unit)
     92     lat1 = radians(lat1)
     93     lng1 = radians(lng1)
---> 94     lat2 = radians(lat2)
     95     lng2 = radians(lng2)
     96 

~\Anaconda2\envs\notebook\lib\site-packages\pandas\core\series.py in wrapper(self)
    183         if len(self) == 1:
    184             return converter(self.iloc[0])
--> 185         raise TypeError(f"cannot convert the series to {converter}")
    186 
    187     wrapper.__name__ = f"__{converter.__name__}__"

TypeError: cannot convert the series to <class 'float'>      

I am not sure what to do with the data to make it a float.

I have tried:

df['latitudeE7'] = df['latitudeE7'].div(10000000.0).astype(float)

I also tried a hand-written distance function:

import math
def distance(origin, destination):

  lat1, lon1 = origin
  lat2, lon2 = destination
  radius = 6371  # km

  dlat = math.radians(float(lat2) - lat1)
  dlon = math.radians(float(lon2) - lon1)
  a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
     math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
     math.sin(dlon / 2) * math.sin(dlon / 2))
  c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  d = radius * c

  return d

I still get the same error:

~\AppData\Local\Temp/ipykernel_22916/3664391511.py in distance(origin, destination)
     26     radius = 6371  # km
     27 
---> 28     dlat = math.radians(float(lat2) - lat1)
     29     dlon = math.radians(float(lon2) - lon1)
     30     a = (math.sin(dlat / 2) * math.sin(dlat / 2) +

~\Anaconda2\envs\notebook\lib\site-packages\pandas\core\series.py in wrapper(self)
    183         if len(self) == 1:
    184             return converter(self.iloc[0])
--> 185         raise TypeError(f"cannot convert the series to {converter}")
    186 
    187     wrapper.__name__ = f"__{converter.__name__}__"

TypeError: cannot convert the series to <class 'float'>

Solution

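  • You cannot pass a pd.Series directly to the haversine function. As the traceback shows, it calls radians() on each coordinate, and that call needs a single float, so passing a whole column raises the TypeError. Apply the function row by row instead.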

    Code:

    from haversine import haversine, Unit
    import pandas as pd
    
    loc1 = (31.393300, -99.070050)
    
    # Sample dataframe
    df = pd.DataFrame({'latitudeE7': [0, 0], 'longitudeE7': [0, 0]})
    
    # Calculation
    # df['diff'] = haversine(loc1, (df['latitudeE7'], df['longitudeE7']), unit=Unit.MILES) # This doesn't work
    df['diff'] = df.apply(lambda row: haversine(loc1, (row['latitudeE7'], row['longitudeE7']), unit=Unit.MILES), axis=1)
    

    Output:

    latitudeE7  longitudeE7  diff
    0           0            6752.74
    0           0            6752.74
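With a `diff` column in miles, the original goal (dates spent within about a mile of the point) reduces to a boolean filter. The following is only a sketch: the sample values are made up, and `timestampMs` is an assumed column name for an epoch-milliseconds timestamp, so adjust it to whatever your export actually contains.

```python
import pandas as pd

# Hypothetical frame standing in for the real data after 'diff' was computed.
# 'timestampMs' is an ASSUMED column name (epoch milliseconds stored as strings).
df = pd.DataFrame({
    "timestampMs": ["1609459200000", "1609545600000", "1609632000000"],
    "diff": [0.4, 12.7, 0.9],  # distance from loc1 in miles (made-up values)
})

radius_miles = 1.0  # "within a mile or so"

# Keep only the rows that fall inside the radius.
nearby = df[df["diff"] <= radius_miles].copy()

# Convert epoch milliseconds to calendar dates (UTC) and deduplicate.
timestamps = pd.to_datetime(nearby["timestampMs"].astype("int64"), unit="ms")
nearby["date"] = timestamps.dt.date
dates_in_area = sorted(nearby["date"].unique())

print(dates_in_area)
```

The dates are in UTC here; if local time matters, the timestamps would need a timezone conversion before `.dt.date` is taken.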

    Reference:

    The issue you have seems related to the following post: understanding math errors in pandas dataframes
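To illustrate why `.astype(float)` does not help, here is a small reproduction with made-up values. The column is already a float column; the problem is that `math.radians` wants one number, not a Series of them.

```python
import math
import pandas as pd

# Made-up stand-in for the converted latitude column; the dtype is already float64.
lat = pd.Series([31.0, 32.0], dtype=float)

# math.radians needs a single float, and a multi-row Series cannot be coerced to one.
try:
    math.radians(lat)
    series_error = None
except TypeError as exc:
    series_error = str(exc)

# Feeding one value at a time works, which is what a row-wise apply does.
per_row = [math.radians(value) for value in lat]

print(series_error)
print(per_row)
```

The same reasoning applies to the hand-written `distance` function in the question, since `float(lat2)` and the `math.*` calls are scalar-only as well.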


    [EDIT]

    If the number of rows is large, haversine_vector is the better choice for speed: in the test below it takes about 105 ms for one million rows, versus about 9.9 s with apply.

    Code

    # Preparation:

    from haversine import haversine, haversine_vector, Unit
    import pandas as pd
    import numpy as np
    
    loc1 = (31.393300, -99.070050)
    
    # Sample dataframe
    n = 1000000
    df = pd.DataFrame({'latitudeE7': np.random.rand(n) * 180 - 90, 'longitudeE7': np.random.rand(n) * 360 - 180})
    

    # Speed test 1 (Use haversine)

    df['diff'] = df.apply(lambda row: haversine(loc1, (row['latitudeE7'], row['longitudeE7']), unit=Unit.MILES), axis=1)
    
    9.9 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    # Speed test 2 (Use haversine_vector)

    df['diff'] = haversine_vector(loc1, df, unit=Unit.MILES, comb=True)
    
    105 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
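The sample frame in the speed test holds only the two coordinate columns, whereas a real location export will typically carry others as well, so it is safer to select the coordinate columns explicitly. As an alternative that does not depend on the haversine package, the same formula can be written with NumPy's elementwise functions. This is a hand-rolled sketch, not the library's implementation: `haversine_np`, the `accuracy` column, and the Earth radius constant are assumptions chosen for illustration, and results may differ slightly from the library's.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.76  # ASSUMED mean Earth radius (roughly 6371 km)


def haversine_np(origin, lat, lon):
    """Great-circle distance in miles from one (lat, lon) origin to arrays of points."""
    lat1, lon1 = np.radians(origin[0]), np.radians(origin[1])
    lat2, lon2 = np.radians(lat), np.radians(lon)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula, evaluated elementwise over the whole arrays.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


loc1 = (31.393300, -99.070050)

# Made-up sample; 'accuracy' is a hypothetical extra column that must be left out.
df = pd.DataFrame({
    "latitudeE7": [31.393300, 0.0],
    "longitudeE7": [-99.070050, 0.0],
    "accuracy": [10, 20],
})

# Pass only the coordinate columns, as plain NumPy arrays.
df["diff"] = haversine_np(
    loc1,
    df["latitudeE7"].to_numpy(),
    df["longitudeE7"].to_numpy(),
)

print(df)
```

The key difference from the function in the question is `np.radians`, `np.sin`, and so on in place of their `math` counterparts, which lets the whole column be processed in one call.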
    

    Reference: