Tags: python, nosql, database, dbm

Is Python DBM really fast?


I was thinking that Python's native DBM should be considerably faster than NoSQL databases such as Tokyo Cabinet, MongoDB, etc., since Python DBM has fewer features and options, i.e. it is a simpler system. So I tested it with a very simple write/read example:

#!/usr/bin/python
import time
t = time.time()
import anydbm

count = 0
while count < 1000:
    # open, write a single record, close
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()
    # re-open the same file read-only and read the record back
    db = anydbm.open("dbm2", "r")
    print "db['1']:", db["1"]
    print "%.3f" % (time.time() - t)
    db.close()
    count = count + 1

Read/write: 1.3 s, read: 0.3 s, write: 1.0 s

MongoDB is at least 5 times faster than this on the same workload. Is this really the performance of Python DBM?


Solution

  • Python doesn't have a single built-in DBM implementation. Its anydbm module is a generic front-end that uses whichever DBM-style library is available on the system, such as Berkeley DB (dbhash) or GNU dbm (gdbm), falling back to the slow pure-Python dumbdbm if nothing else is installed.
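
    As a quick check, the standard library's whichdb module can report which backend anydbm actually picked on a given system. This is a minimal sketch, not part of the original benchmark; the file name "dbm2" just mirrors the example above:

    import anydbm
    import whichdb

    # create a database file so there is something to inspect
    db = anydbm.open("dbm2", "c")
    db["1"] = "something"
    db.close()

    # prints the backend module name, e.g. 'dbhash', 'gdbm' or 'dumbdbm'
    print whichdb.whichdb("dbm2")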

    Python's dictionary implementation is really fast for key-value storage, but it is not persistent. If you need high-performance runtime key-value lookups, you may find a dictionary better, and you can manage persistence yourself with something like cPickle or shelve, as sketched below. If startup times (and, if you're modifying the data, shutdown times) matter more to you than runtime access speed, then something like DBM would be better.
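
    To make the "dictionary plus pickling" approach concrete, here is a minimal sketch; the file name store.pkl is made up for the example. The whole store is loaded into a dict at startup, all runtime lookups happen in memory, and the dict is written back out at termination:

    import cPickle
    import os

    STORE = "store.pkl"  # hypothetical file name for this sketch

    # load the whole store into a plain dict at startup
    if os.path.exists(STORE):
        with open(STORE, "rb") as f:
            data = cPickle.load(f)
    else:
        data = {}

    # runtime reads and writes are ordinary dict operations (fast)
    data["1"] = "something"
    print data["1"]

    # persist the whole dict again at termination
    with open(STORE, "wb") as f:
        cPickle.dump(data, f, cPickle.HIGHEST_PROTOCOL)

    (shelve sits between the two approaches: it exposes the same dict-like interface but stores each key in a DBM file, so nothing has to be loaded up front.)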

    In your benchmark, the timed loop includes the dbm open calls as well as the key lookups. Opening a DBM file to store one value, closing it, and then re-opening it before looking the value up is a pretty unrealistic use case; managing a persistent data store in such a manner is quite inefficient, and the slow timings you are seeing are typical of it.

    Depending on your requirements, if you need fast lookups and don't care too much about startup times, DBM might be a solution. But to benchmark it fairly, open the database once and only include the writes and reads in the timed loop. Something like the below might be suitable:

    import anydbm
    from random import random
    import time
    
    # open DBM outside of the timed loops
    db = anydbm.open("dbm2", "c")
    
    max_records = 100000
    
    # only time read and write operations
    t = time.time()
    
    # create some records
    for i in range(max_records):
        db[str(i)] = 'x'

    # do some random reads
    for i in range(max_records):
        x = db[str(int(random() * max_records))]
    
    time_taken = time.time() - t
    print "Took %0.3f seconds, %0.5f microseconds / record" % (time_taken, (time_taken * 1000000) / max_records)
    
    db.close()
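
    With the open and close hoisted out of the loop, the timing measures only the DBM library's read/write path, which is the number worth comparing against MongoDB and the other stores mentioned in the question.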