java xml file riak

Is Bitcask OK for a simple, high-performance file store?


I am looking for a simple way to store and retrieve millions of XML files. Currently everything is stored in a filesystem, which has some performance issues.

Our requirements are:

  1. Ability to store millions of XML files in a batch process. Files may be up to a few megabytes in size; most are in the 100 KB range.
  2. Very fast random lookup by id (e.g. document URL)
  3. Accessible by both Java and Perl
  4. Available on the major Linux distributions and on Windows

I have looked at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:

  1. No clustering required
  2. No daemon ("service") required
  3. No clever search functionality required

Having delved deeper into Riak, I found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. Unfortunately, there seems to be no way to access a Bitcask repo from Java (or is there?).

So my question boils down to


Solution

  • I don't think that Bitcask is going to work well for your use case. The Bitcask model looks like it is designed for use cases where each value is relatively small.

    The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is a very large amount of data copying.
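
To put a rough number on that, here is a back-of-envelope sketch. The class and method names are made up for illustration, and the figures (1 million live values, about 100 KB each, 100 MB/s sustained disk throughput) are assumptions, not measurements of Bitcask.

```java
/** Back-of-envelope merge cost; value count, size and disk throughput are assumptions. */
class MergeCostEstimate {
    /** Bytes a full merge rewrites: each live value is copied once. */
    static long bytesCopied(long liveValues, long avgValueBytes) {
        return Math.multiplyExact(liveValues, avgValueBytes);
    }

    /** Whole seconds needed to copy that many bytes at a sustained throughput. */
    static long secondsToCopy(long bytes, long bytesPerSecond) {
        return bytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long bytes = bytesCopied(1_000_000L, 100_000L);    // 1 million values of ~100 KB
        long seconds = secondsToCopy(bytes, 100_000_000L); // assumed 100 MB/s sustained
        // Prints: 100 GB, about 16 minutes
        System.out.println(bytes / 1_000_000_000L + " GB, about " + seconds / 60 + " minutes");
    }
}
```

The cost grows linearly with the number of live values, so several million documents multiply both figures accordingly, and the copy competes with normal reads and writes for the same disk.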


    Note that the above assumes that the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of wasted space, then merging may only need to be done rarely, or not at all.
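
To make the trade-off concrete, here is a toy, in-memory model of a log-structured store with a merge step. It is only a sketch of the general idea (append every write, keep an index to the latest value, copy live values on merge). It is not Bitcask's code, file format or API, and `ToyLogStore` and all of its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, in-memory model of a log-structured store; NOT Bitcask's actual implementation. */
class ToyLogStore {
    /** One appended record: a key and its value bytes. */
    private static final class Entry {
        final String key;
        final byte[] value;
        Entry(String key, byte[] value) { this.key = key; this.value = value; }
    }

    // Stand-in for the on-disk data files: every put is appended, nothing is overwritten.
    private List<Entry> log = new ArrayList<>();
    // In-memory index: key -> position of the latest (live) entry in the log.
    private Map<String, Integer> index = new HashMap<>();

    void put(String key, byte[] value) {
        log.add(new Entry(key, value));
        index.put(key, log.size() - 1);
    }

    byte[] get(String key) {
        Integer pos = index.get(key);
        return pos == null ? null : log.get(pos).value;
    }

    /** Total value bytes held in the log, live and stale. */
    long logBytes() {
        long total = 0;
        for (Entry e : log) total += e.value.length;
        return total;
    }

    /** Value bytes still reachable through the index. */
    long liveBytes() {
        long total = 0;
        for (int pos : index.values()) total += log.get(pos).value.length;
        return total;
    }

    /** Fraction of the log occupied by stale (overwritten) values. */
    double wastedFraction() {
        long total = logBytes();
        return total == 0 ? 0.0 : (double) (total - liveBytes()) / total;
    }

    /** Rebuilds the log from live entries only; returns the value bytes "copied".
     *  A real store would physically rewrite these bytes on disk. */
    long merge() {
        List<Entry> merged = new ArrayList<>();
        Map<String, Integer> mergedIndex = new HashMap<>();
        long copied = 0;
        for (int i = 0; i < log.size(); i++) {
            Entry e = log.get(i);
            Integer live = index.get(e.key);
            if (live != null && live == i) { // only the newest entry per key is live
                merged.add(e);
                mergedIndex.put(e.key, merged.size() - 1);
                copied += e.value.length;
            }
        }
        log = merged;
        index = mergedIndex;
        return copied;
    }
}
```

In this model a store that is written once and never updated has a `wastedFraction()` of zero, so there is nothing for a merge to reclaim. Each overwrite leaves a stale value behind, and the only way to get that space back is a merge that copies every live value again.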