Tags: apache-spark, rdd, gzip, bz2

Spark: difference when reading in .gz and .bz2


I normally read and write files in Spark using .gz, where the number of files should be the same as the number of RDD partitions, i.e. one giant .gz file will be read into a single partition. However, if I read in one single .bz2 file, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?

Also, how do I know how many partitions there will be when Hadoop reads in one .bz2 file? Thanks!


Solution

  •     However, if I read in one single .bz2 file, would I still get one single giant partition?   
    Or will Spark automatically split one .bz2 into multiple partitions?
    

    If you specify n partitions when reading a bzip2 file, Spark will spawn n tasks to read the file in parallel. This works because bzip2 is a block-based compression format and is therefore splittable, whereas gzip is a single compressed stream and is not. The default value of n is sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).
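
    As a sketch, the call might look like this (the path and partition count below are placeholders, not from the question):

    // bzip2 is block-based, so Spark can split one file across several tasks.
    // The second argument to textFile is minPartitions.
    val lines = sc.textFile("hdfs:///data/input.bz2", 8)

    Note that minPartitions is a hint for the minimum number of splits; Spark may end up with more partitions than requested.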


    one giant .gz file will read in to a single partition.
    

    Please note that you can always do a

    sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

    to get the desired number of partitions after the file has been read.
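
    For example (the file name and partition count here are made up for illustration):

    // gzip is a single stream, so the file arrives as one partition
    val gz = sc.textFile("hdfs:///data/huge.gz")
    // full shuffle of the decompressed lines into 16 partitions
    val spread = gz.repartition(16)

    Keep in mind that repartition performs a full shuffle, so the single-task decompression of the gzip file still happens first; repartitioning only spreads the resulting lines across the cluster afterwards.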


    Also, how do I know how many partitions there will be when Hadoop reads in one .bz2 file?

    That would be yourRDD.partitions.size for the Scala API or yourRDD.getNumPartitions() for the Python API.
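
    For instance, in the Scala shell (the file name is a placeholder):

    val rdd = sc.textFile("hdfs:///data/input.bz2", 4)
    // actual number of partitions (may exceed the requested minimum)
    println(rdd.partitions.size)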