
How to read and write a compressed SequenceFile in Spark using Python with any supported compression codec



I am using Spark 1.6 on the CDH 5.12 Quickstart VM with Python 2.7.

I found the examples below, but they are Scala syntax and do not work from PySpark:

`rdd.saveAsSequenceFile(<path location>, Some(classOf[compressionCodecClass]))`

`sparkContext.sequenceFile(<path location>, classOf[<class name>], classOf[<compressionCodecClass>]);`

I need working Python code to test.


Solution

  • To read a compressed SequenceFile in PySpark, use the code below:

    `myRDD = sc.sequenceFile("FILE_PATH")`

    Here `sc` is the SparkContext (created automatically in the PySpark shell). A SequenceFile records its compression codec in the file header, so Spark decompresses it transparently and no codec argument is needed when reading.
    

    In Hadoop, the supported compression codecs are listed in the core-site.xml file, under the `io.compression.codecs` property.

    A few of the popular ones are:

    org.apache.hadoop.io.compress.DefaultCodec
    org.apache.hadoop.io.compress.GzipCodec
    org.apache.hadoop.io.compress.BZip2Codec
    org.apache.hadoop.io.compress.DeflateCodec
    org.apache.hadoop.io.compress.SnappyCodec
    org.apache.hadoop.io.compress.Lz4Codec
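
    In core-site.xml, that codec list typically looks like the illustrative fragment below (the exact set of codecs varies by cluster, and Snappy/LZ4 require the Hadoop native libraries):

    ```xml
    <!-- Illustrative core-site.xml entry; the value shown is an example,
         not the exact list on any particular cluster. -->
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    ```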
    

    To write a SequenceFile with any of these codecs in PySpark, pass the codec class name as the second argument to `saveAsSequenceFile`. For example, with GzipCodec:

    `myRDD.saveAsSequenceFile("FILE_PATH", "org.apache.hadoop.io.compress.GzipCodec")`