Tags: bash, cat, zcat, pv

Bash: Slower to use zcat versus cat+zcat or cat+pv+zcat


I'm processing some large gzipped files and found something I don't understand: cat+pv+zcat is faster than cat+zcat, which in turn is faster than zcat alone.

For this, I'm using a 1 GB gzipped JSON file as a test.

time cat test.json.gz > /dev/null
real    0m8.245s

time zcat test.json.gz > /dev/null
real    0m33.075s

time cat test.json.gz | zcat > /dev/null
real    0m30.504s

time cat test.json.gz | pv | zcat > /dev/null
real    0m26.682s

Similarly, when writing to a file:

time cat test.json.gz > t0.json.gz
real    0m21.053s

time zcat test.json.gz > t1.json
real    0m59.011s

time cat test.json.gz | zcat > t2.json
real    0m57.110s

time cat test.json.gz | pv | zcat > t3.json
real    0m54.439s

I also tried running the tests in reverse order to see if there was some caching that was making the subsequent runs go quicker, but got the same results. I checked and the output files are identical.

Generally, I think of multiple steps in a pipe as increasing the time it takes to process a file, so why would adding in pv speed things up? Does it have some sort of built-in parallelization happening? What is going on here?

If this is expected behavior, I just stumbled on a very easy way to increase processing speeds by 10%, but I'd love to understand what's going on.


Solution

    "Generally, I think of multiple steps in a pipe as increasing the time it takes to process a file, so why would adding in pv speed things up? Does it have some sort of built-in parallelization happening? What is going on here?"

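    On a multicore processor, it is likely that the processes in a pipeline at least sometimes run truly simultaneously on different cores. Even when they don't, they still run concurrently: while one process is blocked waiting on I/O, another can use the CPU. So yes, there's some parallelization happening, but it comes from the shell running each stage of the pipeline as its own process rather than from anything specific to pv.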

    More likely than not, zcat itself runs single-threaded, alternating between reading input, decompressing, and writing output. All I/O tends to be slow, but file I/O can plausibly be expected to be slower than I/O through a pipe, since a pipe is a buffer in kernel memory while a file read may have to wait on the storage device. Thus, the performance of zcat itself may be improved by feeding it input through a pipe instead of having it read from a file.

    Some process still needs to read that file, of course, but it is conceivable that zcat's decompression + output I/O is more costly than (say) cat's file I/O + pipe I/O. Under those circumstances, multiprocessing with cat | zcat could be a win.
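The overlap between pipeline stages can be seen with a toy example. This is only a sketch, with sleep standing in for the real work of each stage (an assumption for illustration, not part of the original tests): each side of the pipe waits 2 seconds, yet the whole pipeline should finish in roughly 2 seconds rather than 4, because the shell starts both stages at once.

```bash
#!/usr/bin/env bash
# Toy pipeline: `sleep` stands in for the work done by each stage (illustrative only).
start=$(date +%s)

# The shell starts both sides of the pipe at the same time, so the two
# 2-second waits overlap instead of running back to back.
{ sleep 2; echo "producer done"; } | { sleep 2; cat; }

end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s (running the stages one after another would take about 4s)"
```

The same overlap is what lets cat's file reads proceed while zcat is busy decompressing.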

    It's more mysterious why inserting pv between cat and zcat might be observed to improve performance. My guess would be that pv buffers more data at a time than cat does, so that with pv in the middle, zcat gets its input in fewer, larger reads. That's a potential performance win, too. It's also entirely plausible that pv's pipe I/O + analysis + pipe I/O is at least as fast as cat's file I/O + pipe I/O, so pv might well have no inherent adverse effect on the wall time of the overall pipeline.
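One rough way to probe the buffering guess is to replace cat with dd and vary the block size it writes into the pipe. The sketch below makes several assumptions: it generates a small sample file as a stand-in for test.json.gz, the three block sizes are arbitrary, and dd is only an approximation of what cat or pv do internally. Note also that a pipe has a finite capacity, so very large blocks are still delivered to zcat in pieces.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for test.json.gz: a small generated gzip file.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
sample="$workdir/sample.gz"
seq 1 2000000 | gzip > "$sample"

TIMEFORMAT=%R   # make bash's `time` keyword print only elapsed seconds

# dd reads the file and writes it to the pipe in blocks of $bs bytes, so
# changing bs changes how much data each write hands to zcat.
for bs in 4096 65536 1048576; do
    elapsed=$( { time dd if="$sample" bs="$bs" 2>/dev/null | zcat > /dev/null; } 2>&1 )
    echo "bs=$bs elapsed=${elapsed}s"
done
```

If the timings barely change with the block size on your machine, buffering is probably not the main factor there, and the difference is more likely down to process scheduling or the storage device.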

    "If this is expected behavior, I just stumbled on a very easy way to increase processing speeds by 10%"

    It is explainable behavior, but not a priori expected behavior. The speedup you observe is likely to be dependent on the system, the data, the location of the data, and other factors. You can probably rely on it being cheap to insert pv into a pipeline under most circumstances, but you cannot reasonably rely on that for a general-purpose performance improvement.
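Since the effect depends on the system, it is worth measuring on your own machine before relying on it. Here is a minimal benchmarking sketch, not a rigorous harness: BENCH_FILE is a made-up variable name for the path to your file, the function names are mine, and if no file is given the script generates a small sample so that it still runs. Each variant is timed three times so that a single noisy run does not decide the outcome.

```bash
#!/usr/bin/env bash
# Path to your own .gz file; BENCH_FILE is a made-up variable name for this sketch.
file=${BENCH_FILE:-}
if [ -z "$file" ]; then
    # No file given: generate a small stand-in so the script still runs.
    workdir=$(mktemp -d)
    trap 'rm -rf "$workdir"' EXIT
    file="$workdir/sample.gz"
    seq 1 2000000 | gzip > "$file"
fi

# The three variants from the question.
direct()   { zcat "$file"; }
with_cat() { cat "$file" | zcat; }
with_pv()  { cat "$file" | pv | zcat; }

TIMEFORMAT=%R   # make bash's `time` keyword print only elapsed seconds

bench() {
    local fn=$1 run elapsed
    for run in 1 2 3; do
        # Discard the variant's own stdout/stderr; capture only the timing.
        elapsed=$( { time "$fn" > /dev/null 2> /dev/null; } 2>&1 )
        printf '%-9s run %d: %ss\n' "$fn" "$run" "$elapsed"
    done
}

# Read the file once to warm the page cache (helps only if it fits in memory).
cat "$file" > /dev/null

bench direct
bench with_cat
if command -v pv > /dev/null 2>&1; then
    bench with_pv
else
    echo "pv not installed, skipping with_pv"
fi
```

Saved as, say, bench.sh (a hypothetical name), it could be run as `BENCH_FILE=test.json.gz bash bench.sh`. If the ordering of the three variants changes from run to run, the differences are within noise for that setup.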