apache-sparkhadoopserializationsequencefile

Not serializable result: org.apache.hadoop.io.IntWritable when reading Sequence File with Spark / Scala


Reading a sequence file with Int and String logically,

then if I do this:

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x.toString(), y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect

this is ok as the IntWritable is converted to String.

If I do this:

val sequence_data = sc.sequenceFile("/seq_01/seq-directory/*", classOf[IntWritable], classOf[Text])
                  .map{case (x, y) => (x, y.toString().split("/")(0), y.toString().split("/")(1))}
                  .collect 

then I get this error immediately:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0 in stage 42.0 (TID 692) had a not serializable result: org.apache.hadoop.io.IntWritable

Underlying reason is not really clear - serialization, but why so difficult? This is another type of serialization aspect I note. Also it is only noted at run-time.


Solution

  • If the goal is to just get an Integer value, you would need to call a get on the writable

    .map{case (x, y) => (x.get()
    

    And then the JVM handles serialization of the Integer object rather than not knowing how to process a IntWritable because it doesn't implement the Serializable interface

    String does implement Serializable