apache-spark, elasticsearch, cassandra, spark-cassandra-connector, elasticsearch-hadoop

Spark-Cassandra Vs Spark-Elasticsearch


I have been using Elasticsearch for quite some time now and have little experience using Cassandra.

Now I have a project where we want to use Spark to process the data, but I need to decide whether we should use Cassandra or Elasticsearch as the datastore to load my data from.

In terms of connectors, both Cassandra and Elasticsearch now have good connectors for loading data, so that won't be the deciding factor.
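For reference, both connectors expose a DataFrame source, so loading looks nearly identical from Spark's side. A minimal sketch, assuming spark-cassandra-connector and elasticsearch-hadoop are on the classpath; the host names, keyspace, table, and index names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-comparison")
  // Placeholder endpoints for the two clusters
  .config("spark.cassandra.connection.host", "cassandra-host")
  .config("es.nodes", "es-host:9200")
  .getOrCreate()

// spark-cassandra-connector: token-range-aware scan of a table
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// elasticsearch-hadoop: Spark partitions map to index shards
val fromEs = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("my_index")
```

Since both reads parallelize across the cluster (token ranges for Cassandra, shards for Elasticsearch), bulk-load throughput at 20 TB will depend mostly on cluster sizing and data layout rather than the connector API itself.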

The deciding factor will be how fast I can load my data into Spark. My data is almost 20 terabytes.

I know I can run some tests with JMeter and see the results myself, but I would like to hear from anyone familiar with both systems.

Thanks


Solution

  • The short answer is "it depends", mostly on cluster sizes =)

    I wouldn't choose Elasticsearch as the primary store for the data, because it is built for searching. Searching is a very specific task that requires a very specific approach, which in this case means storing the actual data in an inverted index. Each field basically goes into a separate index, and because of that the indexes are very compact. Although it's possible to store complete objects in an index, such an index will hardly benefit from compression. That means much more disk space to store the indexes, and many more CPU cycles and disk seeks to process them.

    Cassandra, on the other hand, is pretty good at storing and retrieving data.

    Without more specific requirements, I'd say that Cassandra is good as primary storage (and supports fairly simple query scenarios), while ES is good at searching.
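To make the inverted-index point concrete, here is a toy illustration (not Elasticsearch's actual internals): instead of storing each document as one row, every term maps to the set of document ids that contain it, which is what makes search fast but makes the index a poor fit for bulk row retrieval.

```scala
// Two tiny "documents", keyed by id -- a row-oriented store like
// Cassandra would keep them roughly in this shape.
val docs = Map(
  1 -> "spark loads data",
  2 -> "cassandra stores data"
)

// Invert them: term -> set of doc ids containing that term.
// This is the shape a search engine keeps per field.
val inverted: Map[String, Set[Int]] =
  docs.toSeq
    .flatMap { case (id, text) => text.split(" ").map(term => (term, id)) }
    .groupBy { case (term, _) => term }
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Term lookup is a single map access...
println(inverted("data"))   // Set(1, 2)
// ...but rebuilding a whole document means touching every posting list,
// which is why scanning out full rows from an index is the slow path.
```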