python, google-app-engine, transactions, google-cloud-datastore, entity-groups

GAE Lookup Table Incompatible with Transactions?


My Python High Replication Datastore application requires a large lookup table of between 100,000 and 1,000,000 entries. I need to be able to supply a code to some method that will return the value associated with that code (or None if there is no association). For example, if my table held acceptable English words then I would want the function to return True if the word was found and False (or None) otherwise.

My current implementation is to create one parentless entity for each table entry, and for that entity to contain any associated data. I set the datastore key for that entity to be the same as my lookup code. (I put all the entities into their own namespace to prevent any key conflicts, but that's not essential for this question.) Then I simply call get_by_key_name() on the code and I get the associated data.
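Roughly, the current implementation looks like the sketch below (the WordEntry model, its property, and the namespace name are just illustrative stand-ins):

    from google.appengine.api import namespace_manager
    from google.appengine.ext import db


    class WordEntry(db.Model):
        """One parentless entity per table entry; the key_name is the lookup code."""
        # Whatever data is associated with the code.
        info = db.StringProperty()


    def lookup(code):
        """Return the entity for `code`, or None if there is no association."""
        previous = namespace_manager.get_namespace()
        namespace_manager.set_namespace('lookup_table')  # illustrative namespace
        try:
            return WordEntry.get_by_key_name(code)
        finally:
            namespace_manager.set_namespace(previous)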

The problem is that I can't access these entities during a transaction because I'd be trying to span entity groups. So going back to my example, let's say I wanted to spell-check all the words used in a chat session. I could access all the messages in the chat because I'd give them a common ancestor, but I couldn't access my word table because the entries there are parentless. It is imperative that I be able to reference the table during transactions.

Note that my lookup table is fixed, or changes very rarely. Again this matches the spell-check example.

One solution might be to load all the words in a chat session during one transaction, spell-check them outside the transaction (saving the results), then start a second transaction that spell-checks against the saved results. But not only would this be inefficient, new messages might be added to the chat session between the two transactions. This seems like a clumsy solution.

Ideally I'd like to tell GAE that the lookup table is immutable, and that because of this I should be able to query against it without its complaining about spanning entity groups in a transaction. I don't see any way to do this, however.

Storing the table entries in memcache is tempting, but that too has problems. It's a large amount of data, but more troublesome is that if GAE evicts a memcache entry I wouldn't be able to reload it from the datastore during the transaction.

Does anyone know of a suitable implementation for large global lookup tables?

Please understand that I'm not looking for a spell-check web service or anything like that. I'm using word lookup as an example only to make this question clear, and I'm hoping for a general solution for any sort of large lookup tables.


Solution

  • If you can, try to fit the data into instance memory (a minimal in-memory sketch appears after this answer). If it won't fit in instance memory, you have a few options available to you.

    You can store the data in a resource file that you upload with the app, if it changes only infrequently, and access it off disk. This assumes you can build a data structure that permits easy disk lookups - effectively, you're implementing your own read-only disk-based table (see the disk-lookup sketch after this answer).

    Likewise, if it's too big to fit as a static resource, you could take the same approach as above, but store the data in the blobstore (the disk-lookup sketch after this answer applies there as well).

    If your data absolutely must be in the datastore, you may need to emulate your own read-modify-write transactions. Add a 'revision' property to your records. To modify a record, fetch it (outside a transaction) and make the required changes; then, inside a transaction, fetch it again and check the revision value. If it hasn't changed, increment the revision on your updated copy and store it to the datastore (see the revision-check sketch after this answer).

    Note that the underlying RPC layer does theoretically support multiple independent transactions (and non-transactional operations), but the APIs don't currently expose any way to access this from within a transaction, short of horrible (and I mean really horrible) hacks, unfortunately.

    One final option: you could run a backend provisioned with more memory, exposing a 'SpellCheckService', and make URLFetch calls to it from your frontends (a sketch of such a call appears below). Remember, in-memory is always going to be much, much faster than any disk-based option.
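A minimal sketch of the in-memory option, assuming the table fits in instance RAM and ships with the app as a plain-text resource file, one code per line (the words.txt file name is an assumption):

    # Loaded once per instance at first use and cached at module level.
    _WORDS = None


    def _load():
        global _WORDS
        if _WORDS is None:
            with open('words.txt') as f:  # assumed resource file bundled with the app
                _WORDS = frozenset(line.strip() for line in f)
        return _WORDS


    def is_valid(code):
        """Pure in-memory lookup, so it is safe to call inside a datastore transaction."""
        return code in _load()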
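If the table is too big for RAM, here is one way to build the read-only, disk-based table mentioned above: sort the codes, pad each to a fixed width, and binary-search the file by seeking. The record width, file name, and helper name are assumptions; a blobstore-backed variant could swap the open() call for the blobstore's file-like BlobReader.

    import os

    RECORD_WIDTH = 32  # assumed fixed width; pad and sort records when building the file offline


    def disk_lookup(code, path='table.dat'):
        """Binary-search a sorted, fixed-width record file for `code`."""
        count = os.path.getsize(path) // RECORD_WIDTH
        with open(path, 'rb') as f:
            lo, hi = 0, count - 1
            while lo <= hi:
                mid = (lo + hi) // 2
                f.seek(mid * RECORD_WIDTH)
                record = f.read(RECORD_WIDTH).rstrip()
                if record == code:
                    return True
                elif record < code:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return False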
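A sketch of the emulated read-modify-write with a revision check, using the db API; the Record model, its properties, and the give-up-on-conflict policy are illustrative:

    from google.appengine.ext import db


    class Record(db.Model):
        """A record guarded by an optimistic 'revision' counter."""
        value = db.StringProperty()
        revision = db.IntegerProperty(default=0)


    def update_record(key_name, new_value):
        # Fetch and modify outside any transaction.
        current = Record.get_by_key_name(key_name)
        expected_revision = current.revision

        def txn():
            # Re-fetch inside the transaction and compare revisions.
            fresh = Record.get_by_key_name(key_name)
            if fresh.revision != expected_revision:
                raise db.Rollback()  # someone else changed it; the caller can retry
            fresh.value = new_value
            fresh.revision += 1
            fresh.put()

        db.run_in_transaction(txn)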
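Finally, a sketch of the frontend side of the backend idea: a hypothetical 'spellcheck' backend keeps the whole table in RAM and answers lookups over HTTP. The backend name, the /lookup path, and the JSON response format are all assumptions; the backend itself would be an ordinary request handler.

    import json
    import urllib

    from google.appengine.api import backends, urlfetch


    def lookup_via_backend(code):
        """Ask the (hypothetical) spellcheck backend whether `code` is in the table."""
        url = backends.get_url('spellcheck') + '/lookup?' + urllib.urlencode({'code': code})
        result = urlfetch.fetch(url, deadline=5)
        if result.status_code != 200:
            return None
        return json.loads(result.content)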