
Can I create a sequence file using Spark DataFrames?


I have a requirement to create a sequence file. Right now we have a custom API written on top of the Hadoop API, but since we are moving to Spark, we have to achieve the same thing with Spark. Can this be done using Spark DataFrames?


Solution

  • As far as I know, DataFrame has no native API for this; the only option is the RDD-based approach below.


    Try something like the example below, which uses the RDD API rather than the DataFrame API and is inspired by SequenceFileRDDFunctions.scala and its saveAsSequenceFile method.

    SequenceFileRDDFunctions provides extra functions on RDDs of (key, value) pairs for creating a Hadoop SequenceFile, made available through an implicit conversion.

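    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.SequenceFileRDDFunctions
    import org.apache.hadoop.io.NullWritable
    
    object driver extends App {
    
       val conf = new SparkConf()
            .setAppName("HDFS writable test")
       val sc = new SparkContext(conf)
    
       // Ten empty partitions; each one is filled by the generator below.
       val empty = sc.emptyRDD[Any].repartition(10)
    
       // Generator.generate is a user-supplied function (not shown here) that
       // produces the records for one partition. Its element type must be a
       // Hadoop Writable or implicitly convertible to one. Each record is
       // paired with a NullWritable key.
       val data = empty.mapPartitions(Generator.generate).map{ (NullWritable.get(), _) }
    
       // Wrap the pair RDD explicitly. Depending on the Spark version, this
       // constructor may need extra arguments; calling
       // data.saveAsSequenceFile(path) through the implicit conversion
       // avoids constructing the wrapper by hand.
       val seq = new SequenceFileRDDFunctions(data)
    
       // seq.saveAsSequenceFile("/tmp/s1", None)
    
       // Write to a randomly named directory, uncompressed (codec = None).
       seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
       sc.stop()
    }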
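
    Since the question asks about DataFrames specifically, here is a minimal sketch of going from a DataFrame to a sequence file by dropping down to the underlying RDD of rows. It assumes Spark 2.x or later (SparkSession) and a DataFrame whose first column is an Int and whose second column is a String. The object name, the write/read helpers, the sample data, and the temporary output path are all made up for illustration; adapt the row-to-pair mapping to your own schema.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameToSequenceFile {

  // Assumption: column 0 is an Int key and column 1 is a String value.
  // df.rdd exposes the DataFrame as an RDD[Row]; mapping each Row to a
  // (key, value) pair makes saveAsSequenceFile available through Spark's
  // implicit conversion. Int and String are written as IntWritable and Text.
  def write(df: DataFrame, path: String): Unit =
    df.rdd
      .map(row => (row.getInt(0), row.getString(1)))
      .saveAsSequenceFile(path)

  // Read the file back as plain Scala types, sorted for a stable order.
  def read(spark: SparkSession, path: String): Array[(Int, String)] =
    spark.sparkContext.sequenceFile[Int, String](path).collect().sorted

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrame to SequenceFile sketch")
      .master("local[*]") // assumption: local run for demonstration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // The output directory must not exist yet, so use a fresh temp location.
    val out = Files.createTempDirectory("seq-sketch").resolve("out").toString

    write(df, out)
    read(spark, out).foreach(println)

    spark.stop()
  }
}
```

    The DataFrame is only used to produce the rows; the actual write still goes through the RDD API, which matches the point above that DataFrame itself offers no sequence file writer.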
    

    For further information, please see ..