Tags: python, dictionary, apache-spark, tags, geotagging

Establish location-affinity between two users from their geotags?


The Idea. I would like to build a function like:

location_affinity(user_a, user_b)

which establishes a location affinity between two users. In particular, this function will return a float between 0 (no affinity) and 1 (max affinity) indicating how closely the places user_a has been correspond to the places user_b has been. E.g.: if user_a ALWAYS stays with user_b and follows him to every place he goes, I'm expecting a "1" as the result. If user_a lives far away from user_b and they never even got close to each other, I'm expecting a "0" as the result.

The Data. Each user has a list of points (latitude, longitude) where he has been; those points were already extracted from the user's Facebook geotags. To visualize this: [image]

The Question. Are there any known algorithms which, based on two users' lists of map points, can establish the affinity (which I gather depends on the overlap area)? If not, which keywords should I search for?

Additional. I'm trying to build Python functions with Spark. Are there any integrations?

Thank you.


Solution

  • How about something like this:

    First we use scipy.spatial.distance.cdist to compute the distance between each point from user_a and each point from user_b, and find the closest point for each. We then use the exponential function to exponentially suppress larger distances. The constant c determines how strong this suppression is; a smaller c means large distances are suppressed more (you will need to scale it to make sense in your actual units). Then we just take the mean of that metric.

    import numpy as np
    from scipy.spatial.distance import cdist
    
    def affinity(user_a, user_b, c=0.1):
        # pairwise distances between every point of user_a and every point of user_b
        dists = cdist(user_a, user_b)
        # for each user_b point: distance to its nearest user_a point,
        # exponentially suppressed, then averaged
        return (np.exp(-dists.min(axis=0)/c)).mean()
    

    This has the nice property that if the two sets of points are exactly equal, it returns 1.

    user_a = np.random.rand(1000, 2)
    user_b1 = np.random.rand(1000, 2)
    user_b2 = user_a.copy()
    
    print(affinity(user_a, user_b1))
    # 0.85169834916
    print(affinity(user_b1, user_a))
    # 0.856871315902
    print(affinity(user_a, user_b2))
    # 1.0
    

    It has a small problem, though, as you can see above: this function is not symmetric. However, we can make it symmetric by considering both directions equally:

    def affinity(user_a, user_b, c=0.1):
        dists = cdist(user_a, user_b)
        # nearest-neighbour distances in both directions: a -> b and b -> a
        min_dists = dists.min(axis=0), dists.min(axis=1)
        return np.concatenate([np.exp(-x/c) for x in min_dists]).mean()
    
    print(affinity(user_a, user_b1, 0.01))
    # 0.271448093071
    print(affinity(user_b1, user_a, 0.01))
    # 0.271448093071
    print(affinity(user_a, user_b2, 0.01))
    # 1.0
    
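    Note that cdist defaults to Euclidean distance on the raw coordinate values, so with latitude/longitude points the distances come out in degrees and ignore the Earth's curvature. As a minimal sketch, assuming the points are (latitude, longitude) pairs in degrees, a helper like haversine_cdist (an illustrative name, not a scipy function) could be dropped in so that c becomes a distance in kilometres:

    import numpy as np
    
    def haversine_cdist(points_a, points_b):
        # sketch of a pairwise great-circle distance helper: returns distances
        # in km between two arrays of (latitude, longitude) points in degrees
        a = np.radians(np.asarray(points_a))[:, None, :]   # shape (n, 1, 2)
        b = np.radians(np.asarray(points_b))[None, :, :]   # shape (1, m, 2)
        dlat = b[..., 0] - a[..., 0]
        dlon = b[..., 1] - a[..., 1]
        h = (np.sin(dlat / 2) ** 2
             + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
        # clip guards against tiny floating-point overshoot above 1
        return 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))  # Earth radius ~6371 km
    
    def affinity_km(user_a, user_b, c=1.0):
        # same symmetric affinity as above, but with c expressed in kilometres
        dists = haversine_cdist(user_a, user_b)
        min_dists = dists.min(axis=0), dists.min(axis=1)
        return np.concatenate([np.exp(-x / c) for x in min_dists]).mean()
    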

    Of course you can use many different metrics to determine the fall-off of larger distances. Here I chose exp(-x), but you could also use 1 - tanh(x) or tanh(1/(x+epsilon)) (the epsilon is needed to avoid a division by zero in case two points are exactly identical). This results in different behaviour: [plot comparing the fall-off of exp(-x), 1 - tanh(x) and tanh(1/(x+epsilon))]

    Actually, you could use 1 - any function defined in this post.
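
    For example, a small sketch (the name affinity_with_falloff is illustrative) can take the fall-off function as a parameter, which makes it easy to compare the alternatives above on the same data:

    import numpy as np
    from scipy.spatial.distance import cdist
    
    # candidate fall-off functions: each maps 0 to 1 and decays towards 0
    falloffs = {
        "exp(-x)":         lambda x: np.exp(-x),
        "1 - tanh(x)":     lambda x: 1 - np.tanh(x),
        "tanh(1/(x+eps))": lambda x: np.tanh(1 / (x + 1e-9)),
    }
    
    def affinity_with_falloff(user_a, user_b, c=0.1, falloff=falloffs["exp(-x)"]):
        dists = cdist(user_a, user_b)
        min_dists = dists.min(axis=0), dists.min(axis=1)
        return np.concatenate([falloff(x / c) for x in min_dists]).mean()
    
    user_a = np.random.rand(1000, 2)
    user_b = np.random.rand(1000, 2)
    for name, f in falloffs.items():
        print(name, affinity_with_falloff(user_a, user_b, 0.01, f))
    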