I used to write small Python programs for simple data analysis. Python is easy to use and efficient.
Recently, however, I have started running into situations where the data in my problem is simply too big to fit entirely in memory for Python to process.
I have been researching possible persistence implementations for Python. I found pickle and some other libraries that are quite interesting, but not exactly what I am looking for.
Simply put, the way pickle handles persistence is not transparent to the program: the programmer has to handle it explicitly, i.e. load and save as needed.
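For instance, with pickle I have to spell out every save and load myself (a small made-up example of the kind of boilerplate I mean):

import pickle
d = {'k': 'v'}
with open('dict.pkl', 'wb') as f:
    pickle.dump(d, f)   # explicit save
with open('dict.pkl', 'rb') as f:
    d = pickle.load(f)  # explicit load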
I was wondering whether it could be implemented in a way that can be programmed more seamlessly. For example,
d1 = p({'k':'v'}) # where p() is the persistent version of dictionary
print d1.has_key('k') # prints True, same as with an ordinary dictionary
d1.dump('dict.pkl') # save the dictionary to a file, or database, etc.
That is, to overload the dictionary methods with persistent versions. It looks doable to me, but I need to find out exactly how many methods I would need to work on.
Browsing the Python source could help, but I haven't really dug into it at that level. I hope you can offer me some pointers and direction on this.
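Just to illustrate the overloading idea, here is a rough sketch of the direction I am imagining (PersistentDict and p are made-up names, and dump here simply pickles the whole dict, so it is not yet the transparent behaviour I am after):

import pickle

class PersistentDict(dict):
    # an ordinary dict plus an explicit dump(); this shows only
    # the method-overloading idea, not real paging
    def dump(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(dict(self), f)

def p(initial):
    return PersistentDict(initial)

d1 = p({'k': 'v'})
print d1.has_key('k')  # True, same as an ordinary dictionary
d1.dump('dict.pkl')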
Thanks!
EDIT
Apologies that I was not very clear in my original question. I wasn't really looking to save the data structure, but rather for some internal "paging" mechanism that can run behind the scenes when my program runs out of memory. For example,
d1 = p({}) # create a persistent dictionary
d1['k1'] = 'v1' # add
# add more entries, maybe a billion of them, to the dictionary
print d1.has_key('k9999999999') # entry that is not in memory
Totally behind the scenes, with no explicit save/load/search required from the programmer.
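To make that concrete, something like the following is what I imagine would need to be overridden (DiskStore is a made-up placeholder for whatever disk-backed storage would do the actual paging):

class PagedDict(object):
    # every dictionary access goes through a disk-backed store,
    # so entries can move between memory and disk invisibly
    def __init__(self, store):
        self._store = store  # e.g. a hypothetical DiskStore()
    def __setitem__(self, key, value):
        self._store[key] = value  # may spill to disk behind the scenes
    def __getitem__(self, key):
        return self._store[key]  # may read back from disk
    def has_key(self, key):
        return key in self._store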
anydbm works almost exactly like your example and should be reasonably fast. One issue is that it only handles string keys and string values. I'm not sure if opening and closing the db is too much overhead; you could probably wrap this in a context manager to make it a bit nicer. Also, you'd need some magic to use different filenames each time p is called (a sketch of both ideas follows the example below).
import anydbm

def p(initial):
    # open (or create) the on-disk database and seed it
    # with the initial contents
    d = anydbm.open('cache', 'c')
    d.update(initial)
    return d
d1 = p({})  # create a persistent dictionary
d1['k1'] = 'v1'  # add an entry
# add more entries, maybe a billion of them, to the dictionary
for i in xrange(100000):
    d1['k{}'.format(i)] = 'v{}'.format(i)
print d1.has_key('k9999999999')  # entry that is not in memory, prints False
d1.close()  # you have to close it yourself
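For example, here is an untested sketch of both ideas, using contextlib.closing for the cleanup and a uuid-based name so each call to p gets its own file (the naming scheme is just one possibility):

import anydbm
import contextlib
import uuid

def p(initial):
    # a different filename each time p is called
    filename = 'cache-{}.db'.format(uuid.uuid4().hex)
    d = anydbm.open(filename, 'c')
    d.update(initial)
    return contextlib.closing(d)

with p({'k1': 'v1'}) as d1:
    print d1.has_key('k1')  # True
# the database is closed automatically on leaving the block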