I would like to build a B+tree that spans a multi-node
computer network (internal subnet of Linux PCs) for
elastic massive storage. Range scans are important.
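To make the range-scan requirement concrete, here is a minimal single-machine sketch (plain Python, no distribution) of why B+Trees fit: all keys live in leaves that are chained together, so a range scan descends once and then walks the leaf chain. In the distributed version I have in mind, each leaf could live on a different host.

```python
# Minimal single-machine sketch of a B+tree leaf chain and a range scan.
# (Illustration only: a real tree would descend from the root to find the
# first relevant leaf; here the leaves are simply handed to the function.)
class Leaf:
    def __init__(self, keys, values, nxt=None):
        self.keys = keys        # sorted keys stored in this leaf
        self.values = values    # values parallel to keys
        self.next = nxt         # right-sibling link that makes range scans cheap

def range_scan(start_leaf, lo, hi):
    """Yield (key, value) pairs with lo <= key < hi by walking the leaf chain."""
    leaf = start_leaf
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k >= hi:
                return
            if k >= lo:
                yield k, v
        leaf = leaf.next

# Example: three chained leaves, scanning keys in [15, 35)
c = Leaf([30, 40], ["v30", "v40"])
b = Leaf([20, 25], ["v20", "v25"], c)
a = Leaf([5, 10], ["v5", "v10"], b)
print(list(range_scan(a, 15, 35)))   # [(20, 'v20'), (25, 'v25'), (30, 'v30')]
```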
Is this basically the underlying data structure of
distributed DB systems such as Cassandra and HBase?
Is there any research out there on distributed B+Trees?
I saw the article at
http://www.cs.yale.edu/homes/aspnes/papers/opodis2005-b-trees-final.pdf
but Skip B-trees simply drop faulty nodes from the structure (so their data is lost).
I'm particularly interested in B+Trees with built-in redundancy
(i.e. if a host fails and all the tree nodes it hosts go offline,
I'd like a host holding replicas of those nodes to be promoted to
primary and take the place of the failed host).
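As a rough sketch of the bookkeeping I mean (all names are hypothetical, nothing here is from an existing library): each tree node has a primary host plus replica hosts, and on a host failure a replica is promoted.

```python
# Hypothetical metadata for one tree node: which host serves it and
# which hosts hold replicas. Promote a replica if the primary dies.
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodePlacement:
    node_id: str
    primary: str                                        # host currently serving this tree node
    replicas: List[str] = field(default_factory=list)   # hosts holding copies

def handle_host_failure(placement: NodePlacement, failed_host: str) -> NodePlacement:
    """If the failed host was the primary for this tree node, promote the first replica."""
    if placement.primary != failed_host:
        # Primary unaffected; just stop counting the failed host as a replica.
        replicas = [h for h in placement.replicas if h != failed_host]
        return NodePlacement(placement.node_id, placement.primary, replicas)
    if not placement.replicas:
        raise RuntimeError(f"node {placement.node_id} lost its last copy")
    new_primary, *remaining = placement.replicas
    return NodePlacement(placement.node_id, new_primary, remaining)
```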
I don't want to use a collection of separate DB instances
(one DB per node), as sharding is not a good fit
for a massively scaled storage system (across commodity
x86/x64 hardware running a FOSS OS).
Am I reinventing the wheel?
Should I just use Cassandra or HBase?
Cassandra supports range queries.
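For example (a minimal sketch using the DataStax Python driver; the keyspace, table, and column names are made up), a range predicate on a clustering column within a partition looks like this:

```python
# Sketch of a Cassandra range query using the DataStax Python driver.
# Range predicates like ts >= x AND ts < y work on clustering columns
# within a single partition; all names here are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])            # any reachable node in the cluster
session = cluster.connect('demo')          # hypothetical keyspace

rows = session.execute(
    "SELECT ts, value FROM readings "
    "WHERE sensor_id = %s AND ts >= %s AND ts < %s",
    ('sensor-42', 1000, 2000),
)
for row in rows:
    print(row.ts, row.value)
```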
Google's Big Table automatically adds new machines to the cluster when you turn them on, so it's very elastic and easy to grow. Unfortunately that speed comes with a drawback: the queries are quite restrictive, although you can do some range queries. See this article for the list and more details: http://geothought.blogspot.com/2009/04/google-app-engine-and-bigtable-very.html
A great example of how data is stored in Big Table: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
A nice Stack Overflow post: storing massive ordered time series data in bigtable derivatives
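If you go the HBase route, an ordered row-key scan is the idiomatic way to get a range. A rough sketch with the happybase client (the table name, column family, and row-key layout are assumptions for illustration):

```python
# Sketch of an HBase range scan via happybase. Rows are kept sorted by
# key, so a start/stop key pair gives you an ordered range scan.
import happybase

connection = happybase.Connection('hbase-thrift-host')   # Thrift gateway host (hypothetical)
table = connection.table('metrics')                       # hypothetical table

# Row keys like b'host01|<timestamp>' make time ranges a simple key range.
for key, data in table.scan(row_start=b'host01|2024-01-01',
                            row_stop=b'host01|2024-02-01'):
    print(key, data.get(b'cf:value'))
```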