Tags: apache-spark, redis, apache-storm, memcachedb

Compare in-memory cluster computing systems


I am working on the Spark (Berkeley) cluster computing system. During my research, I learned about some other in-memory systems like Redis, MemcacheDB, etc. It would be great if someone could give me a comparison between Spark and Redis (and MemcacheDB). In what scenarios does Spark have an advantage over these other in-memory systems?


Solution

  • They are completely different beasts.

    Redis and MemcacheDB are remote data stores. Redis is a pure in-memory system with optional persistence, featuring various data structures. MemcacheDB provides a memcached-compatible API on top of Berkeley DB. In both cases, they are more likely to be used by OLTP applications, or possibly for simple real-time analytics (on-the-fly aggregation of data).
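    To make that OLTP-style usage concrete, here is a toy pure-Python sketch of the access pattern such stores are built for: fast point reads/writes plus simple on-the-fly counters. The `ToyKVStore` class is invented for illustration and is not a real Redis or MemcacheDB client.

```python
# Toy in-memory key/value store illustrating the OLTP-style access
# pattern (point reads/writes, counters) that Redis and MemcacheDB
# target. Hypothetical sketch, not a real client library.
class ToyKVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def incr(self, key, amount=1):
        # On-the-fly aggregation: a running counter, the kind of
        # simple real-time analytics these stores handle well.
        self._data[key] = self._data.get(key, 0) + amount
        return self._data[key]

store = ToyKVStore()
store.set("user:42:name", "alice")
store.incr("page:home:views")
store.incr("page:home:views")
print(store.get("user:42:name"))     # alice
print(store.get("page:home:views"))  # 2
```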

    Both Redis and MemcacheDB lack mechanisms to efficiently iterate over the stored data in parallel: you cannot easily scan and apply processing to the stored data, because they are not designed for it. Also, short of manual client-side sharding, they cannot be scaled out across a cluster (a Redis cluster implementation is ongoing, though).
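    Client-side manual sharding, as mentioned above, can be sketched as follows. The shard-selection function is an assumption for illustration; real clients often use consistent hashing rather than a plain modulo, to limit key re-mapping when shards are added.

```python
import zlib

# Hypothetical sketch of client-side sharding: the client, not the
# store, decides which server holds each key. The three dicts stand
# in for three independent store instances.
shards = [{}, {}, {}]

def shard_for(key):
    # zlib.crc32 is stable across processes, unlike Python's hash().
    return shards[zlib.crc32(key.encode()) % len(shards)]

def set_key(key, value):
    shard_for(key)[key] = value

def get_key(key):
    return shard_for(key).get(key)

set_key("user:1", "alice")
set_key("user:2", "bob")
print(get_key("user:1"))  # alice
print(get_key("user:2"))  # bob
```

    Note the limitation this implies: every client must agree on the same sharding function, and a cross-shard scan still requires querying every shard separately.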

    Spark is a system for expediting large-scale analytics jobs (especially iterative ones) by providing in-memory distributed datasets. With Spark, you can implement efficient iterative map/reduce jobs on a cluster of machines.
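    The in-memory iterative pattern can be caricatured in pure Python. This toy keeps "partitions" in memory and reuses them across iterations, the way Spark reuses a cached dataset; all names here are invented for illustration and are not the Spark API.

```python
from functools import reduce

# Toy model of Spark's iterative pattern: the dataset stays cached in
# memory across iterations, so each map/reduce pass avoids re-reading
# input from disk. Illustrative sketch, not Spark code.
cached_partitions = [list(range(0, 50)), list(range(50, 100))]  # "cached RDD"

def map_reduce(partitions, map_fn, reduce_fn):
    # Map within each partition, reduce per partition, then combine
    # the partial results (as a cluster would across nodes).
    mapped = [[map_fn(x) for x in part] for part in partitions]
    partials = [reduce(reduce_fn, part) for part in mapped]
    return reduce(reduce_fn, partials)

# An iterative job reuses the cached data on every pass, instead of
# reloading its input each time as plain Hadoop map/reduce would.
total = 0
for _ in range(3):
    total += map_reduce(cached_partitions, lambda x: x, lambda a, b: a + b)

print(total)  # 3 iterations * sum(0..99) = 14850
```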

    Redis and Spark both rely on in-memory data management, but Redis (and memcached) play in the same ballpark as other OLTP NoSQL stores, while Spark is closer to a Hadoop map/reduce system.

    Redis is good at running numerous fast storage/retrieval operations at high throughput with sub-millisecond latency. Spark shines at implementing large-scale iterative algorithms for machine learning, graph analysis, interactive data mining, etc., on a significant volume of data.

    Update: additional question about Storm

    The question is to compare Spark to Storm (see comments below).

    Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the processing to the data rather than the data to the processing. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the processing moves to the data. This is very similar to Hadoop map/reduce, except that memory storage is used aggressively to avoid I/O, which makes it efficient for iterative algorithms (where the output of the previous step is the input of the next). Shark is only a query engine built on top of Spark (supporting ad-hoc analytical queries).

    You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine: each node implements a basic process, and data items flow in and out of a network of interconnected nodes (contrary to Spark). With Storm, the data moves to the processing.

    Both frameworks are used to parallelize computations on massive amounts of data.

    However, Storm is good at dynamically processing numerous small, freshly generated or collected data items (such as calculating an aggregation function or analytics in real time on a Twitter stream).
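    That data-flows-to-the-process style can be sketched in pure Python as a tiny pipeline: each arriving item is pushed through a processing step as it is generated, updating a running aggregate. The names `spout` and `count_bolt` mirror Storm terminology, but the code is an invented toy, not the Storm API.

```python
from collections import Counter

# Toy Storm-style flow: data items move through a small network of
# processing steps as they arrive, rather than being stored first
# and scanned later.
def spout():
    # Stand-in for a live stream (e.g. a feed of hashtags).
    for tag in ["#spark", "#redis", "#spark", "#storm", "#spark"]:
        yield tag

def count_bolt(stream, counts):
    # Running aggregation, updated per item as it flows through.
    for item in stream:
        counts[item] += 1
        yield item, counts[item]

counts = Counter()
for item, running_total in count_bolt(spout(), counts):
    print(item, running_total)

print(counts.most_common(1))  # [('#spark', 3)]
```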

    Spark operates on a corpus of existing data (like Hadoop) which has been imported into the Spark cluster. It provides fast scanning capabilities thanks to in-memory management, and minimizes the global number of I/Os for iterative algorithms.