I am using geopy to get lat/long coordinates for a list of addresses. All the documentation points to limiting server queries by caching (many questions here, in fact), but few actually give practical solutions.
What is the best way to accomplish this?
This is for a self-contained data processing job I'm working on ... no app platform involved. Just trying to cut down on server queries as I run through data that I will have seen before (very likely, in my case).
My code looks like this:
from geopy import geocoders

def geocode(address):
    # address ~= "175 5th Avenue NYC"
    g = geocoders.GoogleV3()

    cache = addressCached(address)
    if cache != False:
        # We have seen this exact address before,
        # return the saved location
        return cache

    # Otherwise, get a new location from the geocoder
    location = g.geocode(address)
    saveToCache(address, location)
    return location

def addressCached(address):
    # What does this look like?
    pass

def saveToCache(address, location):
    # What does this look like?
    pass
How exactly you want to implement your cache is really dependent on what platform your Python code will be running on.
You want a pretty persistent "cache", since addresses' locations are not going to change often :-), so a database (used in a key-value fashion) seems best.
So in many cases I'd pick sqlite3, an excellent, very lightweight SQL engine that's part of the Python standard library -- unless perhaps I preferred, e.g., a MySQL instance that I need to have running anyway, one advantage being that it would allow multiple applications running on different nodes to share the "cache". Other DBs, both SQL and non-SQL, would also work for the latter, depending on your constraints and preferences.
But if I were running on, e.g., Google App Engine, then I'd use the datastore it includes instead -- unless I had specific reasons to want to share the "cache" among multiple disparate applications, in which case I might consider alternatives such as Google Cloud SQL and Google Cloud Storage, as well as yet another option: a dedicated "cache server" GAE app of my own serving RESTful results (maybe with Endpoints?). The choice is, again, very dependent on your constraints and preferences (latency, queries-per-second sizing, etc.).
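For concreteness, a minimal sketch of that datastore variant might look like this (assuming the legacy App Engine Python runtime with its ndb client library; the model name GeoCacheEntry is purely illustrative):

from google.appengine.ext import ndb

class GeoCacheEntry(ndb.Model):
    # the address itself serves as the entity's key name
    location = ndb.PickleProperty()  # pickled geopy Location

def address_cached(address):
    entry = ndb.Key(GeoCacheEntry, address).get()
    return entry.location if entry is not None else False

def save_to_cache(address, location):
    GeoCacheEntry(id=address, location=location).put()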
So please clarify what platform you are on, and what other constraints and preferences you have for your database-backed "cache"; the very simple code to implement it can then easily be shown. Showing half a dozen different possibilities before you clarify would not be very productive.
Added: since the comments suggest sqlite3 may be acceptable, and there are a few important details best shown in code (such as how to serialize and deserialize an instance of geopy.location.Location into/from a sqlite3 blob -- similar issues may well arise with other underlying databases, and the solutions are similar), I decided a solution example might be best shown in code. So, as the "geo cache" is clearly best implemented as its own module, I wrote the following simple geocache.py:
import geopy
import pickle
import sqlite3

class Cache(object):
    """Tiny sqlite3-backed address -> Location cache."""

    def __init__(self, fn='cache.db'):
        self.conn = conn = sqlite3.connect(fn)
        cur = conn.cursor()
        # one row per address; the Location is stored as a pickled blob
        cur.execute('CREATE TABLE IF NOT EXISTS '
                    'Geo ( '
                    'address STRING PRIMARY KEY, '
                    'location BLOB '
                    ')')
        conn.commit()

    def address_cached(self, address):
        cur = self.conn.cursor()
        cur.execute('SELECT location FROM Geo WHERE address=?', (address,))
        res = cur.fetchone()
        if res is None:
            # cache miss
            return False
        # cache hit: deserialize the pickled Location
        return pickle.loads(res[0])

    def save_to_cache(self, address, location):
        cur = self.conn.cursor()
        cur.execute('INSERT INTO Geo(address, location) VALUES(?, ?)',
                    (address, sqlite3.Binary(pickle.dumps(location, -1))))
        self.conn.commit()
if __name__ == '__main__':
    # run a small test in this case
    import pprint

    cache = Cache('test.db')
    address = '1 Murphy St, Sunnyvale, CA'
    location = cache.address_cached(address)
    if location:
        print('was cached: {}\n{}'.format(location, pprint.pformat(location.raw)))
    else:
        print('was not cached, looking up and caching now')
        g = geopy.geocoders.GoogleV3()
        location = g.geocode(address)
        print('found as: {}\n{}'.format(location, pprint.pformat(location.raw)))
        cache.save_to_cache(address, location)
        print('... and now cached.')
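To wire this module back into the geocode function from the question, something like the following sketch should do (assuming the module above was saved as geocache.py and is on the import path):

from geopy import geocoders
import geocache

_cache = geocache.Cache()        # defaults to cache.db
_geocoder = geocoders.GoogleV3()

def geocode(address):
    # address ~= "175 5th Avenue NYC"
    location = _cache.address_cached(address)
    if location is not False:
        # seen this exact address before: no server query needed
        return location
    # otherwise query the geocoding server and remember the result
    location = _geocoder.geocode(address)
    _cache.save_to_cache(address, location)
    return location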
I hope the ideas illustrated here are clear enough -- there are alternatives on each design choice, but I've tried to keep things simple (in particular, I'm using a simple example-cum-mini-test when this module is run directly, in lieu of a proper suite of unit-tests...).
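If you do want something closer to a real test, a minimal unittest sketch against a throwaway database file might look like this (it stores a plain string rather than a real Location, since save_to_cache will happily pickle any picklable value):

import os
import unittest
import geocache

class CacheRoundTripTest(unittest.TestCase):
    DB = 'test_roundtrip.db'  # throwaway file, removed after each test

    def tearDown(self):
        if os.path.exists(self.DB):
            os.remove(self.DB)

    def test_miss_then_hit(self):
        cache = geocache.Cache(self.DB)
        self.assertFalse(cache.address_cached('never seen'))
        cache.save_to_cache('seen before', 'any picklable value')
        self.assertEqual(cache.address_cached('seen before'),
                         'any picklable value')
        cache.conn.close()  # so the file can be removed on all platforms

if __name__ == '__main__':
    unittest.main()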
For the bit about serializing to/from blobs, I've chosen pickle with the highest protocol (-1); cPickle would of course be just as good in Python 2 (and faster :-), but these days I try to write code that runs equally well on Python 2 and 3, unless I have specific reasons to do otherwise :-). And of course I'm using a different filename, test.db, for the sqlite database used in the test, so you can wipe it out with no qualms to test some variation, while the default filename meant for "production" code stays intact. (It is quite a dubious design choice to use a relative filename -- meaning "in the current directory" -- but the appropriate way to decide where to place such a file is quite platform-dependent, and I didn't want to get into such esoterica here :-).)
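Incidentally, if you'd rather avoid pickle altogether (say, to keep the blobs readable from other languages), an alternative sketch is to store just the geocoder's raw JSON payload and rebuild the Location by hand; this assumes geopy's Location(address, point, raw) constructor and the GoogleV3-specific layout of the raw dict:

import json
from geopy.location import Location
from geopy.point import Point

def serialize(location):
    # location.raw is the plain dict the geocoder returned, so it's JSON-safe
    return json.dumps(location.raw)

def deserialize(blob):
    raw = json.loads(blob)
    latlng = raw['geometry']['location']  # GoogleV3-specific layout
    point = Point(latlng['lat'], latlng['lng'])
    return Location(raw['formatted_address'], point, raw)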
If any other question is left, please ask (perhaps best as a separate new question, since this answer has already grown so big!-).