Tags: python, nosql, database, dbm

Is Python DBM really fast?


I was thinking that Python's native DBM should be considerably faster than NoSQL databases such as Tokyo Cabinet, MongoDB, etc., since Python DBM has fewer features and options, i.e. it is a simpler system. So I tested it with a very simple write/read example:

#!/usr/bin/python
import time
t = time.time()
import anydbm

count = 0
while count < 1000:
    # open, write a single record, close
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()
    # re-open the same file read-only and read the record back
    db = anydbm.open("dbm2", "r")
    print "db['1']:", db["1"]
    print "%.3f" % (time.time() - t)
    db.close()
    count = count + 1

Read/write: 1.3 s, read: 0.3 s, write: 1.0 s

MongoDB is at least 5 times faster than this on the same workload. Is this really the performance of Python DBM?


Solution

  • Python doesn't have a single built-in DBM implementation. Its anydbm module is a generic front-end that uses whichever DBM-style library is available on the system, such as Berkeley DB (dbhash) or GNU dbm (gdbm), falling back to the slow pure-Python dumbdbm if nothing else is installed.
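
    As a quick check, the standard library's whichdb module can report which backend anydbm actually picked on a given system. This is a minimal sketch, not part of the original benchmark; the file name "dbm2" just mirrors the example above:

    import anydbm
    import whichdb

    # create a database file so there is something to inspect
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()

    # prints the backend module name, e.g. 'dbhash', 'gdbm' or 'dumbdbm'
    print whichdb.whichdb("dbm2")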

    Python's dictionary implementation is really fast for key-value storage, but it is not persistent. If you need high-performance runtime key-value lookups, you may find a dictionary better, and you can manage persistence yourself with something like cPickle or shelve, as sketched below. If startup times (and, if you're modifying the data, shutdown times) matter more to you than runtime access speed, then something like DBM would be better.
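
    To make the "dictionary plus pickling" approach concrete, here is a minimal sketch; the file name store.pkl is made up for the example. The whole store is loaded into a dict at startup, all runtime lookups happen in memory, and the dict is written back out at termination:

    import cPickle
    import os

    STORE = "store.pkl"  # hypothetical file name for this sketch

    # load the whole store into a plain dict at startup
    if os.path.exists(STORE):
        with open(STORE, "rb") as f:
            data = cPickle.load(f)
    else:
        data = {}

    # runtime reads and writes are ordinary dict operations (fast)
    data["1"] = "something"
    print data["1"]

    # persist the whole dict again at termination
    with open(STORE, "wb") as f:
        cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)

    (shelve sits between the two approaches: it exposes the same dict-like interface but stores each key in a DBM file, so nothing has to be loaded up front.)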

    In your benchmark, the timed loop includes the dbm open calls as well as the key lookups. Opening a DBM file to store one value, closing it, and then re-opening it before looking the value up is a pretty unrealistic use case; managing a persistent data store in such a manner is quite inefficient, and the slow timings you are seeing are typical of it.

    Depending on your requirements, if you need fast lookups and don't care too much about startup times, DBM might be a solution. But to benchmark it fairly, open the database once and only include the writes and reads in the timed loop. Something like the below might be suitable:

    import anydbm
    from random import random
    import time
    
    # open DBM outside of the timed loops
    db = anydbm.open("dbm2", "c")
    
    max_records = 100000
    
    # only time read and write operations
    t = time.time()
    
    # create some records
    for i in range(max_records):
        db[str(i)] = 'x'

    # do some random reads
    for i in range(max_records):
        x = db[str(int(random() * max_records))]
    
    time_taken = time.time() - t
    print "Took %0.3f seconds, %0.5f microseconds / record" % (time_taken, (time_taken * 1000000) / max_records)
    
    db.close()
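
    With the open and close hoisted out of the loop, the timing measures only the DBM library's read/write path, which is the number worth comparing against MongoDB and the other stores mentioned in the question.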