I'm trying to return a dictionary that aggregates tweets by their nearest state center. I'm iterating over all the tweets, and for each tweet, I'm checking all the states to see which state is the closest.
What would be a better way to do this?
    def group_tweets_by_state(tweets):
        """
        The keys of the returned dictionary are state names, and the values are
        lists of tweets that appear closer to that state center than any other.

        tweets -- a sequence of tweet abstract data types
        """
        tweets_by_state = {}
        for tweet in tweets:
            position = tweet_location(tweet)
            min, result_state = 100000, 'CA'
            for state in us_states:
                if geo_distance(position, find_state_center(us_states[state])) < min:
                    min = geo_distance(position, find_state_center(us_states[state]))
                    result_state = state
            if result_state not in tweets_by_state:
                tweets_by_state[result_state] = []
                tweets_by_state[result_state].append(tweet)
            else:
                tweets_by_state[result_state].append(tweet)
        return tweets_by_state
Every little enhancement in that big for-loop yields a large performance gain when the number of tweets is extremely big. There are a few things I can think of:
1. geo_distance()

Call it only once per state and save the result, especially since it is costly:

    distance = geo_distance(position, find_state_center(us_states[state]))
    if distance < min:
        min = distance

rather than

    if geo_distance(position, find_state_center(us_states[state])) < min:
        min = geo_distance(position, find_state_center(us_states[state]))
2. Cache the closest state for each position

Many tweets can share the same position, so save each known result and skip the inner loop entirely on repeats:

    position_closest_state = {}  # save known results
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            for state in us_states:
                distance = geo_distance(position, find_state_center(us_states[state]))
                if distance < min:
                    min = distance
                    result_state = state
            position_closest_state[position] = result_state
So, say you have 1000 tweets from 200 different positions and us_states holds 50 states: your original algorithm would call geo_distance() 1000 * 50 * 2 times, while now it is reduced to 200 * 50 * 1 invocations.
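To sanity-check those numbers, here is a self-contained sketch with stub data and a counting stand-in for geo_distance(); every name and value in it is illustrative, not one of the question's real helpers. It measures the version after fix #1 (one distance call per tweet/state pair), so it shows the 1000 * 50 versus 200 * 50 reduction that the cache buys:

    calls = 0

    def geo_distance(a, b):
        global calls
        calls += 1
        return abs(a[0] - b[0]) + abs(a[1] - b[1])  # placeholder metric

    centers = {s: (s, 0) for s in range(50)}         # 50 fake state centers
    positions = [(p % 200, 1) for p in range(1000)]  # 1000 tweets, 200 positions

    def closest(pos):
        return min(centers, key=lambda s: geo_distance(pos, centers[s]))

    # No cache: one distance call per (tweet, state) pair.
    for pos in positions:
        closest(pos)
    print(calls)  # 50000

    # With the cache: only distinct positions reach the inner loop.
    calls, cache = 0, {}
    for pos in positions:
        if pos not in cache:
            cache[pos] = closest(pos)
    print(calls)  # 10000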
3. find_state_center()

Similar to #2, it is currently called redundantly for every tweet and every state. Precompute the centers once:
    state_center_dict = {}
    for state in us_states:
        state_center_dict[state] = find_state_center(us_states[state])

    position_closest_state = {}  # save known results
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            for state in us_states:
                distance = geo_distance(position, state_center_dict[state])
                if distance < min:
                    min = distance
                    result_state = state
            position_closest_state[position] = result_state
Now find_state_center() is only called 50 times (the number of states) rather than 50 * 1000 (states times tweets). We made another huge improvement!
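As a side note, that precomputation loop can also be written as a dict comprehension, which does exactly the same thing:

    state_center_dict = {state: find_state_center(us_states[state])
                         for state in us_states}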
With #1, we double the performance. With #2, we improve it by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: it cuts the number of find_state_center() calls to 1/(number of tweets) of the original code.
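Putting all three ideas together, a final version might look like the sketch below. It leans on the built-in min() with a key function instead of hand-rolled minimum tracking, and on dict.setdefault() to replace the duplicated append branches; it assumes tweet positions are hashable (e.g. latitude/longitude tuples), since they serve as cache keys:

    def group_tweets_by_state(tweets):
        """Group tweets under the state whose center is nearest to each one."""
        # 3: compute every state's center exactly once.
        state_centers = {state: find_state_center(us_states[state])
                         for state in us_states}
        position_closest_state = {}  # 2: cache the closest state per position
        tweets_by_state = {}
        for tweet in tweets:
            position = tweet_location(tweet)
            if position not in position_closest_state:
                # 1: each distance is computed only once, inside min()'s key.
                position_closest_state[position] = min(
                    state_centers,
                    key=lambda state: geo_distance(position, state_centers[state]))
            tweets_by_state.setdefault(
                position_closest_state[position], []).append(tweet)
        return tweets_by_state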