Tags: scala, hadoop, apache-spark, hdfs, sequencefile

Cached Spark RDD (read from a Sequence File) has invalid entries. How do I fix this?


I am reading Hadoop Sequence Files using Spark (v1.6.1). After caching the RDD, its contents become invalid: the last entry is duplicated n times.

Here is my code snippet:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]) {
    val seqfile = "data-1.seq"
    val conf: SparkConf = new SparkConf()
      .setAppName("..Buffer..")
      .setMaster("local")
      .registerKryoClasses(Array(classOf[Text]))
    val sc = new SparkContext(conf)

    sc.parallelize((0 to 1000).toSeq) //creating a sample sequence file
      .map(i => (new Text(s"$i"), new Text(s"${i*i}")))
      .saveAsHadoopFile(seqfile, classOf[Text], classOf[Text],
        classOf[SequenceFileOutputFormat[Text, Text]])

    val c = sc.sequenceFile(seqfile, classOf[Text], classOf[Text])
      .cache()
      .map(t => {println(t); t})
      .collectAsMap()
    println(c)
    println(c.size)

    sc.stop()
  }
}

The output:

(1000,1000000)
(1000,1000000)
(1000,1000000)
(1000,1000000)
(1000,1000000)
...... //Total 1000 lines with same content as above ...
Map(1000 -> 1000000)
1

EDIT: For future visitors: if you read a sequence file the way I did in the snippet above, refer to the accepted answer. A simple workaround is to copy each Hadoop Writable instance before caching:

val c = sc.sequenceFile(seqfile, classOf[Text], classOf[Text])
  .map(t => (new Text(t._1), new Text(t._2)))   // copy the Writable instances
  .cache()                                      // each cached record is now its own object

Solution

  • Please refer to the doc comment on SparkContext.sequenceFile:

    /** Get an RDD for a Hadoop SequenceFile with given key and value types.
     *
     * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
     * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
     * operation will create many references to the same object.
     * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
     * copy them using a `map` function.
     */
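
The behaviour described in that note can be reproduced without Spark or Hadoop. The following is only a minimal sketch in plain Scala: `MutableText` and `ReuseDemo` are hypothetical names invented for illustration, standing in for a Hadoop Writable such as `Text` and for a RecordReader that overwrites one shared object per record. It is not how Spark is implemented internally; it only shows why storing the records without copying leaves many references to the same object.

```scala
// Hypothetical stand-in for a Hadoop Writable such as Text: a mutable holder.
final class MutableText(var value: String) {
  def set(v: String): Unit = { value = v }
  override def toString: String = value
}

object ReuseDemo {
  // Mimics a RecordReader: one shared key object and one shared value object,
  // overwritten in place for every record that is read.
  def records(n: Int): Iterator[(MutableText, MutableText)] = {
    val key = new MutableText("")
    val value = new MutableText("")
    (0 to n).iterator.map { i =>
      key.set(i.toString)
      value.set((i * i).toString)
      (key, value)
    }
  }

  // Storing the records as-is (like caching the raw RDD) keeps n + 1
  // references to the same two objects, so only the last record survives.
  def withoutCopy(n: Int): Vector[(String, String)] =
    records(n).toVector.map { case (k, v) => (k.value, v.value) }

  // Copying each record before storing it preserves the individual values.
  def withCopy(n: Int): Vector[(String, String)] =
    records(n)
      .map { case (k, v) => (new MutableText(k.value), new MutableText(v.value)) }
      .toVector
      .map { case (k, v) => (k.value, v.value) }

  def main(args: Array[String]): Unit = {
    println(withoutCopy(3).distinct) // prints Vector((3,9))
    println(withCopy(3))             // prints Vector((0,0), (1,1), (2,4), (3,9))
  }
}
```

The copy has to happen inside the `map`, before anything is stored. That is the same reason the workaround in the question applies `new Text(...)` before the RDD is cached or collected.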