I had always thought that Mongo had excellent performance with its map/reduce functionality, but I am now reading that it is a slow implementation. So if I had to pick an alternative to benchmark against, what should it be?
My software will be such that users will often have millions of records, and will often be sorting and crunching through unpredictable subsets of tens or hundreds of thousands of them. Most of the analysis that uses the full millions of records can be done in summary tables and the like. I'd originally thought Hypertable was a viable alternative, but in doing research I saw their documentation mention that Mongo would be the more performant option, while Hypertable had other benefits. For my application, though, speed is my number one initial priority.
First of all, it's important to decide what is "fast enough". Undoubtedly there are faster solutions than MongoDB's map/reduce, but in most cases you will be looking at significantly higher development cost.
That said, MongoDB's map/reduce runs, at the time of writing, on a single thread, which means it will not utilize all the CPU available to it. MongoDB also has very little in the way of native aggregation functionality. This is set to change from version 2.1 onwards with the new aggregation framework, which should improve performance (see https://jira.mongodb.org/browse/SERVER-447 and http://www.slideshare.net/cwestin63/mongodb-aggregation-mongosf-may-2011).
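To make the contrast concrete, here is a minimal pymongo sketch of the kind of grouping job people typically write as a JavaScript map/reduce, expressed as an aggregation pipeline instead. The connection string, database/collection names, and field names (`user_id`, `category`, `amount`) are placeholders for your own schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local mongod
coll = client["mydb"]["records"]                   # hypothetical db/collection names

# A "sum per category" job over one user's subset, written as an aggregation
# pipeline (runs natively in the server) rather than a JavaScript map/reduce.
pipeline = [
    {"$match": {"user_id": 42}},                   # restrict to the subset being crunched
    {"$group": {
        "_id": "$category",                        # group key
        "total": {"$sum": "$amount"},              # summed field
        "count": {"$sum": 1},                      # documents per group
    }},
    {"$sort": {"total": -1}},
]

for doc in coll.aggregate(pipeline):
    print(doc)
```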
Now, what MongoDB is good at is scaling out easily, especially when it comes to reads. And this is important, because the best solution for number crunching on large datasets is definitely a map/reduce cloud like Augusto suggested. Let such an m/r layer do the number crunching while MongoDB serves the required data at high speed. If database query throughput is too low, you add more mongo shards; if number crunching/aggregation is too slow, you add more m/r boxes. Basically, performance becomes a function of the number of instances you reserve for the problem, and thus of cost.
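As a rough illustration of that split of responsibilities (not a real m/r cloud), here is a Python sketch under assumed names (`mydb`, `records`, and the same hypothetical fields as above): MongoDB only hands back the working subset quickly, and a local worker pool stands in for the external map/reduce layer.

```python
from collections import Counter
from multiprocessing import Pool
from pymongo import MongoClient

def crunch(chunk):
    """Reducer stand-in: tally amounts per category for one chunk of documents."""
    totals = Counter()
    for doc in chunk:
        totals[doc["category"]] += doc["amount"]
    return totals

def main():
    coll = MongoClient("mongodb://localhost:27017")["mydb"]["records"]  # hypothetical names
    # MongoDB's job is just to serve the working subset quickly; the crunching
    # itself happens in the worker pool (a stand-in for an m/r cloud).
    docs = list(coll.find({"user_id": 42}, {"category": 1, "amount": 1, "_id": 0}))
    chunks = [docs[i::4] for i in range(4)]          # naive split across 4 workers
    with Pool(4) as pool:
        partials = pool.map(crunch, chunks)
    print(sum(partials, Counter()).most_common(10))  # merge partial results

if __name__ == "__main__":
    main()
```

The point of the sketch is only the shape of the architecture: more shards raise the read throughput on the Mongo side, and more workers (or m/r boxes) raise the crunching throughput on the other side, independently of each other.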