I'm processing some large gzip-compressed files and found something I don't understand: `cat`+`pv`+`zcat` is faster than `cat`+`zcat`, which in turn is faster than `zcat` alone.
For these tests, I'm using a 1 GB gzipped JSON file.
time cat test.json.gz > /dev/null
real 0m8.245s
time zcat test.json.gz > /dev/null
real 0m33.075s
time cat test.json.gz | zcat > /dev/null
real 0m30.504s
time cat test.json.gz | pv | zcat > /dev/null
real 0m26.682s
Similarly, when writing to a file:
time cat test.json.gz > t0.json.gz
real 0m21.053s
time zcat test.json.gz > t1.json
real 0m59.011s
time cat test.json.gz | zcat > t2.json
real 0m57.110s
time cat test.json.gz | pv | zcat > t3.json
real 0m54.439s
I also tried running the tests in reverse order to rule out caching that might make subsequent runs go quicker, but got the same results. I also checked that the output files are identical.
Generally, I think of multiple steps in a pipe as increasing the time it takes to process a file, so why would adding in pv speed things up? Does it have some sort of built-in parallelization happening? What is going on here?
If this is expected behavior, I just stumbled on a very easy way to increase processing speeds by 10%, but I'd love to understand what's going on.
> Generally, I think of multiple steps in a pipe as increasing the time it takes to process a file, so why would adding in pv speed things up? Does it have some sort of built-in parallelization happening? What is going on here?
On a multicore processor, it is likely that at least sometimes you get true concurrency of multiple processes. Even if you don't get true concurrency, multiprocessing is indeed a form of parallel processing, so yes, there's some parallelization happening.
More likely than not, `zcat` itself runs single-threaded, alternating between reading input, decompressing, and writing output. All I/O tends to be slow, but file I/O can plausibly be expected to be slower than I/O through a pipe. Thus, the performance of `zcat` itself may be improved by feeding it input through a pipe instead of relying on it to read from a file.

Some process still needs to read that file, of course, but it is conceivable that `zcat`'s decompression + output I/O is more costly than (say) `cat`'s file I/O + pipe I/O. Under those circumstances, multiprocessing with `cat | zcat` could be a win.
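One way to sanity-check that split of work is a quick experiment (a sketch only — the filename `sample.bin` is made up and the file is token-sized so it runs quickly; real measurements need a file as large as the one in the question):

```shell
# Hypothetical demo: build a small gzip file, then compare the two ways
# of feeding zcat. On a multicore machine, `cat | zcat` lets the file
# reads (cat) and the decompression (zcat) proceed concurrently.
set -e
head -c 1000000 /dev/urandom > sample.bin   # sample.bin is a made-up name
gzip -kf sample.bin                         # leaves sample.bin.gz behind

# Direct read: zcat performs both the file I/O and the decompression.
zcat sample.bin.gz | md5sum > direct.md5

# Piped read: cat handles the file I/O; zcat only decompresses.
cat sample.bin.gz | zcat | md5sum > piped.md5

cmp -s direct.md5 piped.md5 && echo "outputs identical"
```

Timing each variant under `time` should also show `zcat`'s user time staying roughly constant while some of the system time (the file reads) migrates into `cat`.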
It's more mysterious why inserting `pv` into the pipeline between `cat` and `zcat` might be observed to improve performance, but my guess would be that `pv` is buffering more data at a time than `cat` is, so that with `pv` in the middle, `zcat` can read the data in fewer, larger reads. That's a potential performance win, too. It's entirely plausible that `pv`'s pipe I/O + analysis + pipe I/O is at least as fast as `cat`'s file I/O + pipe I/O, so `pv` might well not have any inherent adverse effect on the wall time for execution of the overall pipeline.
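The buffering guess can be probed directly. The sketch below (the filename `demo.json.gz` is hypothetical, and the file is deliberately tiny — substitute your real file for meaningful timings) uses `dd` with an explicit block size to stand in for "`cat` with a bigger buffer", and `pv`'s `-B`/`--buffer-size` option to vary `pv`'s own pipe buffer:

```shell
# Create a small stand-in for the 1 GB test file (demo.json.gz is a
# made-up name; use your real file for meaningful timings).
head -c 2000000 /dev/zero | gzip > demo.json.gz

# dd with an explicit block size emulates "cat with a larger buffer".
time dd if=demo.json.gz bs=64K status=none | zcat > /dev/null

# pv's transfer buffer can be tuned with -B (here 1048576 bytes = 1 MiB);
# -q suppresses the progress meter so it doesn't affect the comparison.
if command -v pv > /dev/null; then
    time cat demo.json.gz | pv -q -B 1048576 | zcat > /dev/null
fi
```

If the buffering hypothesis holds, raising `dd`'s `bs=` (or `pv`'s `-B`) should narrow or eliminate the gap you measured.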
> If this is expected behavior, I just stumbled on a very easy way to increase processing speeds by 10%
It is explainable behavior, but not a priori expected behavior. The speedup you observe is likely to depend on the system, the data, the location of the data, and other factors. You can probably rely on inserting `pv` into a pipeline being cheap under most circumstances, but you cannot reasonably rely on it as a general-purpose performance improvement.