Suppose I have a very large sequence file, but I only want to work with the first 1000 rows locally. How can I do that?
Currently my code looks like this:
JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);
What you should do is take the first 1000 elements, which gives you a local list on the driver, and then parallelize that list:
JavaPairRDD<IntWritable,VectorWritable> RDDwith1000 = sc.parallelizePairs(seqVectors.take(1000));
See a simple example here and below:
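As a rough, self-contained sketch of the take-then-parallelize idea: the class name `TakeFirstRows`, the helper `firstN`, and the generated `Integer`/`String` data are all made up for illustration and stand in for the real sequence file. Plain serializable types are used here on purpose. Hadoop `Writable` classes such as `IntWritable` are generally not `java.io.Serializable`, so with the real file you may need to `mapToPair` the records into serializable copies before calling `take`.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TakeFirstRows {

    // Hypothetical helper: pulls the first n pairs back to the driver as a
    // local list, then redistributes that list as a new, small pair RDD.
    static <K, V> JavaPairRDD<K, V> firstN(JavaSparkContext sc, JavaPairRDD<K, V> rdd, int n) {
        List<Tuple2<K, V>> head = rdd.take(n);
        // parallelizePairs (not parallelize) is what yields a JavaPairRDD.
        return sc.parallelizePairs(head);
    }

    public static void main(String[] args) {
        // Local master is an assumption so the sketch runs without a cluster.
        SparkConf conf = new SparkConf().setAppName("TakeFirstRows").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for the large input: 5000 generated key/value pairs.
            List<Tuple2<Integer, String>> data = new ArrayList<>();
            for (int i = 0; i < 5000; i++) {
                data.add(new Tuple2<>(i, "row-" + i));
            }
            JavaPairRDD<Integer, String> big = sc.parallelizePairs(data);

            // Keep only the first 1000 pairs and work with them as an RDD.
            JavaPairRDD<Integer, String> small = firstN(sc, big, 1000);
            System.out.println("rows kept: " + small.count());
        }
    }
}
```

Note that `take(n)` materializes the rows on the driver, so this approach only makes sense when `n` rows fit comfortably in driver memory.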