I have a large sequence file with around 60 million entries (almost 4.5GB). I want to split it into, for example, three parts of 20 million entries each. So far my code looks like this:
// Read from the sequence file
JavaPairRDD<IntWritable, VectorWritable> seqVectors =
        sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);

// Merge the data down to 3 partitions and write them back out
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.coalesce(3);
part.saveAsHadoopFile(outputPath + File.separator + "output",
        IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
But unfortunately, each of the generated sequence files is around 4GB too (total 12GB)! Can anyone suggest a better/valid approach?
Perhaps not the exact answer you are looking for, but it might be worth trying the other overload of sequenceFile, the one that takes a minPartitions argument. Keep in mind that coalesce, which you are using, can only decrease the number of partitions.
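If you really do need to grow the partition count of an RDD you have already loaded, the usual alternative is a shuffled repartition. A minimal sketch, reusing the seqVectors RDD from your snippet (note that repartition shuffles the full ~4.5GB across the cluster):

// Unlike coalesce(3) without a shuffle, repartition(3) performs a full
// shuffle and can increase as well as decrease the partition count.
JavaPairRDD<IntWritable, VectorWritable> part = seqVectors.repartition(3);

Reading with minPartitions, as shown below, avoids that shuffle entirely.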
Your code should then look like this:
// Read the sequence file, asking Spark for at least 3 input splits
JavaPairRDD<IntWritable, VectorWritable> seqVectors =
        sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class, 3);
// Write one output file per partition
seqVectors.saveAsHadoopFile(outputPath + File.separator + "output",
        IntWritable.class, VectorWritable.class, SequenceFileOutputFormat.class);
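One caveat: minPartitions is only a lower-bound hint to the underlying Hadoop input format, so it is worth checking how many partitions you actually got before writing. A quick sanity check (getNumPartitions assumes Spark 1.6+; on older versions use partitions().size()):

// minPartitions is a hint, not a guarantee: the input format may create
// more splits than requested, or fewer if the file cannot be split.
System.out.println("Partitions: " + seqVectors.getNumPartitions());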
Another thing that may cause problems is that some SequenceFiles are not splittable; in that case the reader cannot honor minPartitions and you end up with a single partition per file regardless.
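If your file turns out not to split, or you need exactly 20 million entries per part, a possible fallback is to number the records with zipWithIndex and slice them by index range. This is only a sketch, assuming Java 8 lambdas are available; the part-i output names are illustrative and Tuple2 is scala.Tuple2:

// Attach a sequential index to every record (this triggers a Spark job)
JavaPairRDD<Tuple2<IntWritable, VectorWritable>, Long> indexed =
        seqVectors.zipWithIndex();

long total = indexed.count();     // ~60 million in your case
long chunk = (total + 2) / 3;     // ceiling division: entries per slice

for (int i = 0; i < 3; i++) {
    final long lo = i * chunk;
    final long hi = Math.min(lo + chunk, total);
    indexed.filter(t -> t._2() >= lo && t._2() < hi)  // keep the i-th range
           .mapToPair(t -> t._1())                    // drop the index again
           .saveAsHadoopFile(outputPath + File.separator + "part-" + i,
                   IntWritable.class, VectorWritable.class,
                   SequenceFileOutputFormat.class);
}

Each slice is still written with one file per partition; adding coalesce(1) before the save would give you a single file per part, at the cost of funneling ~1.5GB through one task.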