How to read and write a compressed SequenceFile in Spark using Python?
I am using Spark 1.6 on the CDH 5.12 QuickStart VM with Python 2.7.
I found the Scala example below, but it does not work from Python (note also that `sequenceFile` takes key and value classes, not a codec):

```scala
rdd.saveAsSequenceFile(<path location>, Some(classOf[CompressionCodecClass]))
sparkContext.sequenceFile(<path location>,
    classOf[<keyClass>],
    classOf[<valueClass>])
```
Can anyone share working PySpark code I can test?
To read a compressed SequenceFile in PySpark, no codec argument is needed: the codec used is recorded in the SequenceFile header and is picked up automatically on read. With `sc` as your SparkContext:

```python
myRDD = sc.sequenceFile("FILE_PATH")
```
In Hadoop, the available compression codecs are configured in the core-site.xml file (the `io.compression.codecs` property). A few of the popular ones are:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.Lz4Codec
To write a SequenceFile with any of these codecs in PySpark, pass the fully qualified codec class name as the second argument of `saveAsSequenceFile` (shown here with GzipCodec):

```python
myRDD.saveAsSequenceFile("FILE_PATH", "org.apache.hadoop.io.compress.GzipCodec")
```