scalaapache-sparksequencefile

Two ways to read hdfs sequece file, but one failed with EOFException


Use sparkContext to read sequence file, as follows:

Method 1:

val rdd = sc.sequenceFile(path, classOf[BytesWritable], 
          classOf[BytesWritable])
rdd.count()

Method 2:

val rdd = sc.hadoopFile(path,
          classOf[SequenceFileAsBinaryInputFormat],
          classOf[BytesWritable],
          classOf[BytesWritable])
rdd.count()

Method 1 end up with EOFException, but method 2 works. What's the differences of these two methods?


Solution

  • The difference starts where "Method 1" immediately makes the call hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions) which uses SequenceFileInputFormat[BytesWritable, BytesWritable], but "Method 2" makes the same call except of course uses SequenceFileAsBinaryInputFormat.
    Then to continue, even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], SequenceFileAsBinaryInputFormat has it's own inner class called SequenceFileAsBinaryRecordReader and although it works similarly to SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences. When we take a look at the code, they are doing some different implementations, namely the former is handling compression better. So if your sequence file is either record compressed or block compressed then it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] is not iterating with the same dependability as SequenceFileAsBinaryInputFormat.

    SequenceFileAsBinaryInputFormat which uses SequenceFileAsBinaryRecordReader (lines 102-115) - https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

    SequenceFileRecordReader (lines 79 - 91) - https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java