I am confused about splittable and non-splittable file formats in the big data world.
I was using the ZIP file format, and I understood that ZIP files are non-splittable in the sense that to process one I had to use a ZipFileInputFormat that basically unzips the file and then processes it.
Then I moved to the gzip format, and I am able to process it in my Spark job, but I have always wondered why people say the gzip file format is also not splittable.
How is that going to affect my Spark job's performance?
For example, if I have 5k gzip files of different sizes, some of them 1 KB and some of them 10 GB, and I load them in Spark, what will happen?
Should I use gzip in my case or some other compression? If so, why?
Also, what is the performance difference between the following cases?
CASE 1: I have a very large (10 GB) gzip file, I load it in Spark, and I run a count on it.
CASE 2: I have a splittable (bzip2) file of the same size, I load it in Spark, and I run a count on it.
First, you need to remember that both gzip and ZIP are not splittable. LZO (once indexed) and bzip2 are the common splittable compression formats. Snappy is also often listed here, but it is only a compression codec, not an archive format, and it is splittable only when used inside a container format such as SequenceFile, Avro, or Parquet.
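If you control the write side, you can pick a splittable codec when producing the data in the first place. A minimal sketch in spark-shell (so `spark` is the SparkSession the shell provides; the output path is hypothetical):

    import org.apache.hadoop.io.compress.BZip2Codec

    val lines = spark.sparkContext.parallelize(1 to 1000000).map(_.toString)
    // The part files written here are bzip2-compressed and therefore splittable;
    // swapping in org.apache.hadoop.io.compress.GzipCodec would make them non-splittable.
    lines.saveAsTextFile("hdfs:///tmp/demo-bz2", classOf[BZip2Codec])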
For the purposes of this discussion, a splittable file is one that can be processed in parallel across many machines rather than on only one.
Now, to answer your questions:
If I have a very large (10 GB) gzip file and I load it in Spark and run a count on it
It is loaded by only one CPU core on one executor, since the file is not splittable.
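You can verify this in spark-shell (a sketch; the path is hypothetical and `spark` is the shell's SparkSession):

    val gz = spark.sparkContext.textFile("hdfs:///data/huge.gz")
    // gzip is not splittable, so the whole 10 GB file becomes one partition
    println(gz.getNumPartitions)  // 1
    gz.count()                    // one task on one core does all the work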
If I have a splittable (bzip2) file of the same size and I load it in Spark and run a count on it
Divide the file size by the HDFS block size, and you should expect that many cores across all executors to be working on counting that file.
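The same check against a bzip2 file of that size (again with a hypothetical path) shows the difference; with the default 128 MB HDFS block size, 10 GB works out to roughly 80 splits:

    val bz = spark.sparkContext.textFile("hdfs:///data/huge.bz2")
    // bzip2 is splittable: roughly fileSize / blockSize partitions
    println(bz.getNumPartitions)  // ~80 for 10 GB with 128 MB blocks
    bz.count()                    // ~80 tasks running across all executors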
For any file smaller than the HDFS block size there is no difference: splittable or not, it occupies a single HDFS block, so counting that one tiny file takes exactly one task on one CPU core either way.
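If you are stuck with large gzip inputs, such as the 10 GB files in your mix, one common workaround (a sketch, not the only option) is to repartition right after the first read: the initial decompression is still single-threaded per file, but everything downstream then runs in parallel.

    val data = spark.sparkContext
      .textFile("hdfs:///data/huge.gz") // still read and decompressed by one task
      .repartition(80)                  // one shuffle, then 80-way parallelism
    // Expensive transformations after this point run on 80 cores:
    data.map(_.toUpperCase).count()

Note that for a bare count this only adds a shuffle; it pays off when the per-record work after the read is the expensive part.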