I'm trying to return a dictionary that aggregates tweets by their nearest state center. I'm iterating over all the tweets, and for each tweet, I'm checking all the states to see which state is the closest.
What would be a better way to do this?
    def group_tweets_by_state(tweets):
        """
        The keys of the returned dictionary are state names, and the values are
        lists of tweets that appear closer to that state center than any other.

        tweets -- a sequence of tweet abstract data types
        """
        tweets_by_state = {}
        for tweet in tweets:
            position = tweet_location(tweet)
            min, result_state = 100000, 'CA'
            for state in us_states:
                if geo_distance(position, find_state_center(us_states[state])) < min:
                    min = geo_distance(position, find_state_center(us_states[state]))
                    result_state = state
            if result_state not in tweets_by_state:
                tweets_by_state[result_state] = []
                tweets_by_state[result_state].append(tweet)
            else:
                tweets_by_state[result_state].append(tweet)
        return tweets_by_state
Every little enhancement in that big for-loop yields a large performance gain when the number of tweets is extremely big. There are a few things I can think of:
1. geo_distance()

Call it only once per state and save the result, especially since it is costly:

    distance = geo_distance(position, find_state_center(us_states[state]))
    if distance < min:
        min = distance

rather than

    if geo_distance(position, find_state_center(us_states[state])) < min:
        min = geo_distance(position, find_state_center(us_states[state]))
2. Cache the closest state for each position

Many tweets can share the same position, so save each known result and skip the inner loop entirely on repeats:

    position_closest_state = {}  # save known results
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            for state in us_states:
                distance = geo_distance(position, find_state_center(us_states[state]))
                if distance < min:
                    min = distance
                    result_state = state
            position_closest_state[position] = result_state
So, say you have 1000 tweets from 200 different positions and us_states holds 50 states: your original algorithm would call geo_distance() 1000 * 50 * 2 times, while now it is reduced to 200 * 50 * 1 invocations.
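To sanity-check those numbers, here is a self-contained sketch with stub data and a counting stand-in for geo_distance(); every name and value in it is illustrative, not one of the question's real helpers. It measures the version after fix #1 (one distance call per tweet/state pair), so it shows the 1000 * 50 versus 200 * 50 reduction that the cache buys:

    calls = 0

    def geo_distance(a, b):
        global calls
        calls += 1
        return abs(a[0] - b[0]) + abs(a[1] - b[1])  # placeholder metric

    centers = {s: (s, 0) for s in range(50)}         # 50 fake state centers
    positions = [(p % 200, 1) for p in range(1000)]  # 1000 tweets, 200 positions

    def closest(pos):
        return min(centers, key=lambda s: geo_distance(pos, centers[s]))

    # No cache: one distance call per (tweet, state) pair.
    for pos in positions:
        closest(pos)
    print(calls)  # 50000

    # With the cache: only distinct positions reach the inner loop.
    calls, cache = 0, {}
    for pos in positions:
        if pos not in cache:
            cache[pos] = closest(pos)
    print(calls)  # 10000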
3. find_state_center()

Similar to #2, it is currently called redundantly for every tweet and every state. Precompute the centers once:
    state_center_dict = {}
    for state in us_states:
        state_center_dict[state] = find_state_center(us_states[state])

    position_closest_state = {}  # save known results
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        if position in position_closest_state:
            result_state = position_closest_state[position]
        else:
            for state in us_states:
                distance = geo_distance(position, state_center_dict[state])
                if distance < min:
                    min = distance
                    result_state = state
            position_closest_state[position] = result_state
Now find_state_center() is only called 50 times (the number of states) rather than 50 * 1000 (states times tweets). We made another huge improvement!
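As a side note, that precomputation loop can also be written as a dict comprehension, which does exactly the same thing:

    state_center_dict = {state: find_state_center(us_states[state])
                         for state in us_states}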
With #1, we double the performance. With #2, we improve it by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: it cuts the number of find_state_center() calls to 1/(number of tweets) of the original code.
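Putting all three ideas together, a final version might look like the sketch below. It leans on the built-in min() with a key function instead of hand-rolled minimum tracking, and on dict.setdefault() to replace the duplicated append branches; it assumes tweet positions are hashable (e.g. latitude/longitude tuples), since they serve as cache keys:

    def group_tweets_by_state(tweets):
        """Group tweets under the state whose center is nearest to each one."""
        # 3: compute every state's center exactly once.
        state_centers = {state: find_state_center(us_states[state])
                         for state in us_states}
        position_closest_state = {}  # 2: cache the closest state per position
        tweets_by_state = {}
        for tweet in tweets:
            position = tweet_location(tweet)
            if position not in position_closest_state:
                # 1: each distance is computed only once, inside min()'s key.
                position_closest_state[position] = min(
                    state_centers,
                    key=lambda state: geo_distance(position, state_centers[state]))
            tweets_by_state.setdefault(
                position_closest_state[position], []).append(tweet)
        return tweets_by_state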