I use SparkContext to read a sequence file, as follows:
Method 1:
val rdd = sc.sequenceFile(path, classOf[BytesWritable], classOf[BytesWritable])
rdd.count()
Method 2:
val rdd = sc.hadoopFile(path,
  classOf[SequenceFileAsBinaryInputFormat],
  classOf[BytesWritable],
  classOf[BytesWritable])
rdd.count()
Method 1 ends up with an EOFException, but Method 2 works. What is the difference between these two methods?
The difference starts with the call each method makes. Method 1 (sc.sequenceFile) immediately delegates to hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions) with SequenceFileInputFormat[BytesWritable, BytesWritable] as the input format, while Method 2 makes the same call but of course with SequenceFileAsBinaryInputFormat.
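To make that concrete, here is a rough sketch of what the two calls boil down to (based on SparkContext.sequenceFile delegating to hadoopFile; sc.defaultMinPartitions stands in for the default minPartitions argument):

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapred.{SequenceFileAsBinaryInputFormat, SequenceFileInputFormat}

// Method 1: sc.sequenceFile is shorthand for hadoopFile with
// SequenceFileInputFormat, which reads each key/value into the
// given Writable classes.
val rdd1 = sc.hadoopFile(path,
  classOf[SequenceFileInputFormat[BytesWritable, BytesWritable]],
  classOf[BytesWritable],
  classOf[BytesWritable],
  sc.defaultMinPartitions)

// Method 2: the same call shape, but SequenceFileAsBinaryInputFormat
// hands back each record's raw bytes wrapped in BytesWritable.
val rdd2 = sc.hadoopFile(path,
  classOf[SequenceFileAsBinaryInputFormat],
  classOf[BytesWritable],
  classOf[BytesWritable])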
Then to continue: even though SequenceFileAsBinaryInputFormat extends SequenceFileInputFormat[BytesWritable, BytesWritable], it has its own inner class, SequenceFileAsBinaryRecordReader, and although that reader works similarly to SequenceFileRecordReader[BytesWritable, BytesWritable], there are differences. When we take a look at the code, the implementations differ, namely the former handles compression better. So if your sequence file is either record compressed or block compressed, it makes sense that SequenceFileInputFormat[BytesWritable, BytesWritable] is not iterating with the same dependability as SequenceFileAsBinaryInputFormat.
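If you want to check which case applies to your file, one way is to open it directly with Hadoop's SequenceFile.Reader and look at its metadata (a small sketch; it assumes path and sc.hadoopConfiguration match the file you are reading):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Open the file with the same Hadoop configuration Spark uses.
val reader = new SequenceFile.Reader(sc.hadoopConfiguration,
  SequenceFile.Reader.file(new Path(path)))
try {
  // The key/value classes the file was actually written with.
  println(s"key class:        ${reader.getKeyClassName}")
  println(s"value class:      ${reader.getValueClassName}")
  // Whether record or block compression is in play.
  println(s"compressed:       ${reader.isCompressed}")
  println(s"block compressed: ${reader.isBlockCompressed}")
} finally {
  reader.close()
}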
For reference:

SequenceFileAsBinaryInputFormat, which uses SequenceFileAsBinaryRecordReader (lines 102-115):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileAsBinaryInputFormat.java

SequenceFileRecordReader (lines 79-91):
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java
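One usage note on the RDD you get from Method 2: Hadoop record readers reuse the same Writable objects between records, so if you plan to do anything more than count(), copy the bytes out first (a sketch, using the rdd from Method 2):

// Copy the payload out of the reused BytesWritable instances so the
// resulting RDD holds independent byte arrays.
val bytesRdd = rdd.map { case (k, v) => (k.copyBytes(), v.copyBytes()) }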