hadoop, apache-spark, zip, bzip2, hadoop-lzo

How does the file compression format affect my Spark processing?


I am confused about splittable and non-splittable file formats in the big data world. I was using the zip format, and I understood that zip files are non-splittable in the sense that when I processed one I had to use ZipFileInputFormat, which basically unzips the file and then processes it.

Then I moved to the gzip format. I am able to process it in my Spark job, but I have always wondered why people say the gzip format is also not splittable.

How is that going to affect my Spark job's performance?

So, for example, if I have 5k gzip files of different sizes (some of them 1 KB and some of them 10 GB) and I load them in Spark, what will happen?

Should I use gzip in my case or some other compression? If so, why?

Also, what is the difference in performance between the following two cases?

CASE 1: I have a very large (10 GB) gzip file, load it in Spark, and run a count on it.

CASE 2: I have a splittable (bzip2) file of the same size, load it in Spark, and run a count on it.


Solution

  • First, you need to remember that both gzip and zip are not splittable. LZO (once indexed) and bzip2 are the common splittable compression formats. Snappy on its own is not splittable either; it is only a compression codec and gets its parallelism from a splittable container format (such as SequenceFile, Parquet, or ORC).

    For the purpose of this discussion, a splittable file is one that can be processed in parallel across many machines rather than on only one.

    Now, to answer your questions:

    If I have a very large (10 GB) gzip file, load it in Spark, and run a count on it

    It is read by a single task on one core of one executor, since the file is not splittable (see the first sketch below).

    If I have a splittable (bzip2) file of the same size, load it in Spark, and run a count on it

    Divide the file size by the HDFS block size, and you should expect roughly that many cores across all executors working on counting that file.

    Regarding any file smaller than the HDFS block size, there is no difference between the codecs: such a file is a single split either way, so one task on one core reads and counts it (the second sketch below shows what this means for your 5k mixed-size gzip files).
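
To make the two cases concrete, here is a minimal sketch in Scala using Spark's RDD API. It reads the same data compressed once as gzip and once as bzip2, prints how many partitions Spark creates, and then counts. The HDFS paths and file names are only placeholders; the interesting part is the partition counts, which follow directly from the splittability rules above.

```scala
import org.apache.spark.sql.SparkSession

object SplittabilityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("splittability-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // CASE 1: a single large gzip file (placeholder path).
    // Gzip is not splittable, so the whole file becomes one partition:
    // one task on one core does all the decompression and counting.
    val gz = sc.textFile("hdfs:///data/big_file.txt.gz")
    println(s"gzip partitions  = ${gz.getNumPartitions}") // expect 1
    println(s"gzip line count  = ${gz.count()}")

    // CASE 2: the same data compressed with bzip2.
    // Bzip2 is splittable, so Spark creates roughly
    // (file size / HDFS block size) partitions and the count
    // runs in parallel across the executors.
    val bz = sc.textFile("hdfs:///data/big_file.txt.bz2")
    println(s"bzip2 partitions = ${bz.getNumPartitions}") // expect many
    println(s"bzip2 line count = ${bz.count()}")

    spark.stop()
  }
}
```

Submitted with spark-submit against a real 10 GB input, you should see 1 partition for the gzip file and on the order of 80 partitions for the bzip2 file with a 128 MB block size (10 GB / 128 MB ≈ 80).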
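
And a second sketch, under the same assumptions (made-up paths), for the 5k-gzip-files scenario: each gzip file becomes exactly one partition, regardless of whether it is 1 KB or 10 GB.

```scala
import org.apache.spark.sql.SparkSession

object ManyGzipFilesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("many-gzip-files-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // A directory holding ~5k gzip files of very different sizes.
    // Gzip is not splittable, so Spark creates one partition per file:
    // the 1 KB files turn into tasks that finish almost instantly, while
    // each 10 GB file is a single task that one core must decompress and
    // count on its own.
    val logs = sc.textFile("hdfs:///data/gzip_logs/*.gz")

    println(s"partitions  = ${logs.getNumPartitions}") // roughly one per file
    println(s"total lines = ${logs.count()}")

    spark.stop()
  }
}
```

In other words, the job as a whole runs in parallel across files, but its runtime is dominated by the largest gzip files: no matter how many executors you have, each 10 GB gzip file is decompressed and counted by a single task.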