Tags: java, mongodb, jakarta-ee, jackrabbit, jackrabbit-oak

Apache Jackrabbit OAK - Sharding DocumentNodeStore across cluster by node path


I am struggling to find documentation and examples for constructing and using Jackrabbit OAK in a clustered environment where node stores are sharded by path. I know this is possible because it is mentioned in a few places, but with very little detail, and the OAK and NodeStore APIs are not intuitive enough for me to find this functionality on my own.

Take a look at slide 17 in this PDF which lists the various sharding strategies. http://events.linuxfoundation.org/sites/events/files/slides/the%20architecture%20of%20Oak.pdf

My use case is that I need to have several remote servers all running the same Jackrabbit OAK application which uses the DocumentNodeStore backed by MongoDB for the node and blob storage. What I ultimately want is to shard (or partition) portions of my data across these remote servers organized by different paths in the overall node structure.

For example:

Server (A)
Is responsible for storing content at /a/*

Server (B)
Is responsible for storing content at /b/*

If Server (A) wants to read or write content at /b/*, it should be able to access nodes at that path using the normal JCR or OAK APIs, which should completely hide the network details and the connection to Server (B)'s MongoDB from the user.

Is there any solid documentation relating to this use case? If not, what is the best way to go about learning this? I can spend the whole day wandering through the OAK source code, but documentation would be much preferred.


Solution

  • Oak's MongoDB implementation does not currently have a strategy for sharding. The problem essentially comes down to the fact that the _id of the Mongo documents stored by Oak does not distribute documents across shards in such a way that nodes from the same subtree are likely to land on the same shard instance. There has been some discussion of adding a shard key to handle this use case, but it did not get far because, so far, we have not seen a compelling use case that requires shards.

    That said, AFAIK, you can set up a sharded MongoDB instance and provide the mongouri accordingly. For the reason given above, it most likely won't scale as nicely as you might want. Also, we haven't seen setups that couldn't be handled by a non-sharded setup.

    I know it doesn't answer your question, but maybe it's an indication why you couldn't find much said about the topic.
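
To make the point about _id more concrete, here is a minimal sketch. It assumes a simplified id scheme of the form "<depth>:<path>" (for example "2:/a/x"), which is my understanding of how the DocumentNodeStore names its node documents; very long paths are handled differently and are ignored here. The class and method names are hypothetical and not part of Oak. Sorting the ids the way a range-based shard key on _id would see them shows that documents group by depth, not by subtree:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Oak code.
public class OakIdSketch {

    // Assumed, simplified id scheme: "<depth>:<path>", with the root at depth 0.
    static String idFromPath(String path) {
        if (path.equals("/")) {
            return "0:/";
        }
        int depth = 0;
        for (char c : path.toCharArray()) {
            if (c == '/') {
                depth++;
            }
        }
        return depth + ":" + path;
    }

    // Returns the ids in lexicographic order, which is the order a
    // range-based split on _id would use to cut the key space into chunks.
    static List<String> sortedIds(List<String> paths) {
        TreeSet<String> ids = new TreeSet<>();
        for (String p : paths) {
            ids.add(idFromPath(p));
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/a", "/a/x", "/a/x/y",
                "/b", "/b/x", "/b/x/y");
        // Prints the /a and /b subtrees interleaved, level by level:
        // 1:/a, 1:/b, 2:/a/x, 2:/b/x, 3:/a/x/y, 3:/b/x/y
        for (String id : sortedIds(paths)) {
            System.out.println(id);
        }
    }
}
```

Under this assumption, any contiguous range of _id values mixes nodes from /a and /b, so neither a range split nor a hashed split on _id would keep /a/* on one shard and /b/* on another.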