I am confused about splittable and non-splittable file formats in the big data world.
I was using the ZIP file format, and I understood that ZIP files are non-splittable in the sense that to process one I had to use a ZipFileInputFormat that basically unzips the file and then processes it.
Then I moved to the gzip format, and I am able to process it in my Spark job, but I have always wondered why people say the gzip file format is also not splittable.
How is that going to affect my Spark job's performance?
For example, if I have 5k gzip files of different sizes, some of them 1 KB and some of them 10 GB, and I load them in Spark, what will happen?
Should I use gzip in my case or some other compression? If so, why?
Also, what is the performance difference between the following cases?
CASE 1: I have a very large (10 GB) gzip file, I load it in Spark, and I run a count on it.
CASE 2: I have a splittable (bzip2) file of the same size, I load it in Spark, and I run a count on it.
First, you need to remember that both gzip and ZIP are not splittable. LZO (once indexed) and bzip2 are the common splittable compression formats. Snappy is also often listed here, but it is only a compression codec, not an archive format, and it is splittable only when used inside a container format such as SequenceFile, Avro, or Parquet.
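If you control the write side, you can pick a splittable codec when producing the data in the first place. A minimal sketch in spark-shell (so `spark` is the SparkSession the shell provides; the output path is hypothetical):

    import org.apache.hadoop.io.compress.BZip2Codec

    val lines = spark.sparkContext.parallelize(1 to 1000000).map(_.toString)
    // The part files written here are bzip2-compressed and therefore splittable;
    // swapping in org.apache.hadoop.io.compress.GzipCodec would make them non-splittable.
    lines.saveAsTextFile("hdfs:///tmp/demo-bz2", classOf[BZip2Codec])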
For the purposes of this discussion, a splittable file is one that can be processed in parallel across many machines rather than on only one.
Now, to answer your questions:
If I have a very large (10 GB) gzip file and I load it in Spark and run a count on it
It is loaded by only one CPU core on one executor, since the file is not splittable.
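You can verify this in spark-shell (a sketch; the path is hypothetical and `spark` is the shell's SparkSession):

    val gz = spark.sparkContext.textFile("hdfs:///data/huge.gz")
    // gzip is not splittable, so the whole 10 GB file becomes one partition
    println(gz.getNumPartitions)  // 1
    gz.count()                    // one task on one core does all the work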
If I have a splittable (bzip2) file of the same size and I load it in Spark and run a count on it
Divide the file size by the HDFS block size, and you should expect that many cores across all executors to be working on counting that file.
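The same check against a bzip2 file of that size (again with a hypothetical path) shows the difference; with the default 128 MB HDFS block size, 10 GB works out to roughly 80 splits:

    val bz = spark.sparkContext.textFile("hdfs:///data/huge.bz2")
    // bzip2 is splittable: roughly fileSize / blockSize partitions
    println(bz.getNumPartitions)  // ~80 for 10 GB with 128 MB blocks
    bz.count()                    // ~80 tasks running across all executors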
For any file smaller than the HDFS block size there is no difference: splittable or not, it occupies a single HDFS block, so counting that one tiny file takes exactly one task on one CPU core either way.
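If you are stuck with large gzip inputs, such as the 10 GB files in your mix, one common workaround (a sketch, not the only option) is to repartition right after the first read: the initial decompression is still single-threaded per file, but everything downstream then runs in parallel.

    val data = spark.sparkContext
      .textFile("hdfs:///data/huge.gz") // still read and decompressed by one task
      .repartition(80)                  // one shuffle, then 80-way parallelism
    // Expensive transformations after this point run on 80 cores:
    data.map(_.toUpperCase).count()

Note that for a bare count this only adds a shuffle; it pays off when the per-record work after the read is the expensive part.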