Tags: database, git, database-performance, database-replication, document-database

Using git repository as a database backend


I'm working on a project that deals with a structured document database. I have a tree of categories (~1000 categories, up to ~50 categories on each level), and each category contains several thousand (up to ~10,000) structured documents. Each document is several kilobytes of data in some structured form (I'd prefer YAML, but it may just as well be JSON or XML).

Users of this system perform several types of operations:

Of course, the traditional solution would be to use some sort of document database (such as CouchDB or Mongo) for this problem. However, the version control (history) requirement tempted me toward a wild idea: why shouldn't I use a git repository as the database backend for this application?

At first glance, it could be solved like this:
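To make the idea concrete, here is a minimal sketch of how such a backend might look, using Python's standard library to drive the git CLI. The directory layout, document name, and helper function are made up for illustration; the point is that each document is a file, each edit is a commit, and a document's history is simply its git log.

```python
import pathlib
import subprocess
import tempfile

def run_git(repo, *args):
    """Run a git command inside `repo` and return its stdout."""
    return subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    ).stdout

# Set up a repository acting as the "database".
repo = pathlib.Path(tempfile.mkdtemp())
run_git(repo, "init")
run_git(repo, "config", "user.email", "app@example.com")
run_git(repo, "config", "user.name", "app")

# "INSERT": write a YAML document into its category directory and commit it.
doc = repo / "categories" / "music" / "doc-0001.yaml"
doc.parent.mkdir(parents=True)
doc.write_text("title: Example\nbody: hello\n")
run_git(repo, "add", doc.relative_to(repo).as_posix())
run_git(repo, "commit", "-m", "Add doc-0001")

# "SELECT history": the edit history of a document is just its git log.
history = run_git(repo, "log", "--oneline", "--",
                  doc.relative_to(repo).as_posix())
print(history)
```

This gets versioning, diffs, and replication (push/pull) essentially for free, which is exactly what made the idea tempting.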

Are there any other common pitfalls with this solution? Has anyone already tried to implement such a backend (e.g. for any popular framework: RoR, node.js, Django, CakePHP)? Does this solution have any implications for performance or reliability, i.e. is it proven that git would be much slower than traditional database solutions, or are there scalability/reliability pitfalls? I presume that a cluster of such servers pushing/pulling each other's repositories should be fairly robust and reliable.

Basically, tell me whether this solution will work, and why it will or won't.


Solution

  • Answering my own question is not the best thing to do, but since I ultimately dropped the idea, I'd like to share the rationale that worked in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to the architect to decide.

    Generally, the first main point my question misses is that I'm dealing with a multi-user system whose users work in parallel and concurrently against my server from a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this, but all of them are either too heavy on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation work to git in the first place):

    That said, note that I intentionally used the numbers of a fairly small database and user base: 100K users, 1K active users, 100 MiB total database + edit history, 10 MiB of working copy. If you look at more prominent crowd-sourcing projects, the numbers are much higher:

    │              │ Users │ Active users │ DB+edits │ DB only │
    ├──────────────┼───────┼──────────────┼──────────┼─────────┤
    │ MusicBrainz  │  1.2M │     1K/week  │   30 GiB │  20 GiB │
    │ en.wikipedia │ 21.5M │   133K/month │    3 TiB │  44 GiB │
    │ OSM          │  1.7M │    21K/month │  726 GiB │ 480 GiB │
    

    Obviously, for those amounts of data and activity, this approach would be utterly unacceptable.
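    Even with the small numbers above, the server-side state cost can be estimated with back-of-the-envelope arithmetic. This is a sketch under two assumed strategies (a full clone per active user vs. one shared repository plus a working tree per user, e.g. via git worktrees); neither figure appears in the original post:

    ```python
    # Sizes in MiB, taken from the small-scale scenario in the text.
    repo_with_history = 100   # full repository including edit history
    working_copy = 10         # one checked-out working tree
    active_users = 1_000

    # Naive approach: one full clone per active user.
    full_clones = active_users * repo_with_history              # 100,000 MiB

    # Shared object store: one repo plus a working tree per user.
    shared_store = repo_with_history + active_users * working_copy  # 10,100 MiB

    print(f"full clones:  {full_clones / 1024:.1f} GiB")   # ~97.7 GiB
    print(f"shared store: {shared_store / 1024:.1f} GiB")  # ~9.9 GiB
    ```

    Roughly 10-100 GiB of disk just to hold per-user state for 1K active users of a 100 MiB database, before counting the CPU and I/O of keeping those checkouts in sync.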

    Generally, it could have worked if one could use the web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on the client's side rather than on the server's side.

    There are also other points that I've missed, but they're not that bad compared to the first one:

    So, bottom line: it is possible, but for most current use cases it won't be anywhere near the optimal solution. Rolling your own document-edit-history-to-SQL implementation or trying to use an existing document database would probably be a better alternative.