apache-spark, elasticsearch, hdfs, elasticsearch-hadoop, distributed-filesystem

How do you read and write from/into different ElasticSearch clusters using spark and elasticsearch-hadoop?


Original title: Besides HDFS, what other DFS does spark support (and are recommended)?

I am happily using spark and elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.

From time to time, I would like to pull all of the data out of one cluster, process each doc, and put all of it into a different Elasticsearch (ES) cluster (yes, data migration too).

Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with spark + elasticsearch-hadoop, because that would involve swapping the SparkContext out from under the RDD. So I would like to write the RDDs to object files and later read them back into RDDs with different SparkContexts.

However, here comes the problem: I then need a DFS (Distributed File System) to share the big files across my entire spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that spark supports?
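For concreteness, here is a minimal sketch of that object-file round trip, assuming a hypothetical shared path and a placeholder RDD standing in for the real ES data:

import org.apache.spark.{SparkConf, SparkContext}

// Job 1: persist an RDD to a shared, Hadoop-supported file system.
// The hdfs:// path is hypothetical; an s3a:// or gs:// URI works the same way
// once the matching filesystem connector is on the classpath.
val dumpSc = new SparkContext(new SparkConf().setAppName("Dump lovelydata"))
val docs = dumpSc.parallelize(Seq(("id-1", Map("field" -> "value")))) // stand-in for the ES data
docs.saveAsObjectFile("hdfs://namenode:8020/tmp/lovelydata-dump")
dumpSc.stop()

// Job 2: a different SparkContext (possibly a different application) reads it back.
val loadSc = new SparkContext(new SparkConf().setAppName("Load lovelydata"))
val restored = loadSc.objectFile[(String, Map[String, String])]("hdfs://namenode:8020/tmp/lovelydata-dump")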

Update Below

Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different ElasticSearch clusters using the following Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // brings in sc.esRDD and rdd.saveToEsWithMeta

// es.nodes on the SparkConf points the read at the source cluster.
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")

val sc = new SparkContext(conf)

val allDataRDD = sc.esRDD("some/lovelydata")

// The cfg map overrides es.nodes for this write only, so the same
// SparkContext can read from one cluster and write to another.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
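Since the goal also includes processing each doc on the way through, a transformation can sit between the read and the write; the added "migrated" field below is purely illustrative:

// Hypothetical per-document tweak before the write; allDataRDD holds
// (id, document map) pairs, so mapValues fits naturally here.
val processedRDD = allDataRDD.mapValues(doc => doc + ("migrated" -> "true"))
processedRDD.saveToEsWithMeta("clone/lovelydata", cfg)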

Solution

  • Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.

    I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments. So you can use two ES clusters with the same SparkContext without any problems.
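
    As a sketch of what that looks like on the read side (assuming EsInputFormat from elasticsearch-hadoop's Map/Reduce layer and the hostnames used in the question; the write side would pass a second Configuration the same way):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{MapWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.hadoop.mr.EsInputFormat

    val sc = new SparkContext(new SparkConf().setAppName("Two ES clusters, one context"))

    // This Configuration carries the source cluster's settings; a second one
    // pointing at to.escluster.com would be handed to the corresponding write,
    // so the choice of cluster is made per operation, not per SparkContext.
    val sourceConf = new Configuration()
    sourceConf.set("es.nodes", "from.escluster.com")
    sourceConf.set("es.resource", "some/lovelydata")

    val sourceRDD = sc.newAPIHadoopRDD(
      sourceConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text],
      classOf[MapWritable])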