java, hadoop, apache-spark, rdd, sequencefile

How to extract a range of rows from sequence file in Spark?


Suppose I have a very large sequence file, but I only want to work with the first 1000 rows locally. How can I do that?

Currently my code looks like this:

JavaPairRDD<IntWritable,VectorWritable> seqVectors = sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);

Solution

  • What you should do is take the first 1000 rows to the driver and parallelize the resulting list back into a pair RDD:

    JavaPairRDD<IntWritable,VectorWritable> RDDwith1000 = sc.parallelizePairs(seqVectors.take(1000));
    

    A complete, runnable sketch follows below.
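A minimal, self-contained sketch of that approach, assuming Mahout's VectorWritable, a hypothetical input path, and a made-up class name (First1000Rows). The key point is that take(1000) pulls at most the first 1000 records to the driver as a local list, and parallelizePairs turns that list back into a JavaPairRDD:

    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class First1000Rows {
        public static void main(String[] args) {
            // Local driver configuration; adjust the master and app name to your setup.
            SparkConf conf = new SparkConf()
                    .setAppName("First1000Rows")
                    .setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input path; replace with the real sequence file location.
            String inputPath = "hdfs:///data/vectors.seq";

            // Read the whole sequence file as (IntWritable, VectorWritable) pairs.
            JavaPairRDD<IntWritable, VectorWritable> seqVectors =
                    sc.sequenceFile(inputPath, IntWritable.class, VectorWritable.class);

            // take(1000) returns at most the first 1000 records as a local list on the driver.
            List<Tuple2<IntWritable, VectorWritable>> first1000 = seqVectors.take(1000);

            // parallelizePairs turns that local list of tuples back into a JavaPairRDD.
            JavaPairRDD<IntWritable, VectorWritable> rddWith1000 = sc.parallelizePairs(first1000);

            System.out.println("Rows in the small RDD: " + rddWith1000.count());

            sc.stop();
        }
    }

One caveat: Hadoop's RecordReader reuses the same Writable instance for each record, so if you plan to cache or hold on to the raw IntWritable/VectorWritable objects, consider mapping them to copies first.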