scalaapache-sparksequencefile

Is it possible to only evaluate the Key when reading a SequenceFile in Spark?


I'm trying to read a sequence file with custom Writable subclasses for both K and V of a sequencefile input to a spark job.

the vast majority of rows need to be filtered out by a match to a broadcast variable ("candidateSet") and the Kclass.getId. Unfortunately values V are deserialized for every record no matter what with the standard approach, and according to a profile that is where the majority of time is being spent.

here is my code. note my most recent attempt to read here as "Writable" generically, then later cast back which worked functionally but still caused the full deserialize in the iterator.

val rdd = sc.sequenceFile(
      path,
      classOf[MyKeyClassWritable],
      classOf[Writable]
    ).filter(a => candidateSet.value.contains(a._1.getId))```

Solution

  • Turns out Twitter has a library that handles this case pretty well. Specifically, using this class allows to evaluate the serialized fields in a later step by reading them as DataInputBuffers

    https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java