java apache-spark rdd sequencefile bigdata

How to split a big Sequence file into multiple sequence files?


I have a large sequence file with around 60 million entries (almost 4.5 GB). I want to split it into, for example, three parts of 20 million entries each. So far my code looks like this:

//Read from sequence file
JavaPairRDD<IntWritable, VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath + File.separator + "output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

But unfortunately, each of the three generated sequence files is also around 4 GB (12 GB in total)! Can anyone suggest a better/valid approach?


Solution

  • Perhaps not the exact answer you are looking for, but it might be worth trying the second overload of sequenceFile, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions (unless you enable its shuffle flag), and that minPartitions is a lower bound rather than an exact partition count.

    Your code should then look like this:

    //Read from sequence file
    JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
    seqVectors.saveAsHadoopFile(outputPath+File.separator+"output", IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
    

    Another thing that may cause problems is that some SequenceFiles are not splittable. If an evenly sized three-way split matters regardless of how the input splits, see the sketch below.
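
    One alternative worth trying is repartition, which always performs a full shuffle and therefore balances the resulting partitions (and hence the output files). The following is an untested sketch, not code from this answer: it assumes the same seqVectors RDD read above and that VectorWritable is Mahout's, it copies each record into plain serializable Java types before the shuffle (Hadoop Writables are not java.io.Serializable, and Hadoop's reader reuses the same Writable objects across records), and it materializes every vector as dense, which can be wasteful for sparse data.

    //Copy each record into plain serializable types before shuffling
    JavaPairRDD<Integer, double[]> plain = seqVectors.mapToPair(pair -> {
        org.apache.mahout.math.Vector v = pair._2().get();
        double[] values = new double[v.size()];
        for (int i = 0; i < v.size(); i++) {
            values[i] = v.get(i);
        }
        return new scala.Tuple2<>(pair._1().get(), values);
    });

    //repartition always shuffles, so the three partitions (and hence the
    //three output part files) come out roughly equal in size
    plain.repartition(3)
         .mapToPair(pair -> new scala.Tuple2<>(
             new IntWritable(pair._1()),
             new VectorWritable(new org.apache.mahout.math.DenseVector(pair._2()))))
         .saveAsHadoopFile(outputPath + File.separator + "output",
             IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);

    Converting back to IntWritable/VectorWritable before saveAsHadoopFile keeps the output format identical to the input, so downstream readers of the original file should work unchanged on the three parts.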