gzip tar

Extracting files from a .tar.gz while keeping them gzipped


We have a pipeline that produces millions of .tar.gz files, each containing a single text file. We need to deliver the text files gzipped, but not tarred.

I know I can extract and re-compress each file with something like

tar xvf output-059a270d.tar.gz output-059a270d.txt && gzip output-059a270d.txt

But is there any way to take advantage of the details of the tar/.gz format to avoid having to gunzip and re-gzip the result?


Solution

  • As it stands, no, you can't. The .tar file, which consists of a tar header, the file contents, and a tar trailer, is gzipped as a single compressed stream. The compressed data for the start of the file contents can refer back to matching data in the tar header, so the two cannot be split apart to construct a valid gzip stream of just the file.

    However, you say "we have a pipeline". If you control the generation of these .tar.gz files at the other end of the pipeline, then you can write them in a way that permits them to be split. Generate the .tar file with tar, and then feed that to your own code that applies gzip separately to three pieces: the first 512 bytes (the tar header), then the file content, whose length you know, and then the trailer (the padding and end-of-archive blocks), which will be all zeros. Those three gzip outputs can simply be concatenated, because a concatenation of gzip streams is itself a valid gzip stream.

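    The result will be very slightly larger than a normal single gzip compression of the whole thing, since each gzip member carries its own header and trailer and cannot reuse matches from the previous member. But now you can look for the gzip members and pull out the middle one, which is exactly the gzipped text file you want.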
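The scheme described above could be sketched in Python roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the function names `splittable_tar_gz`, `gzip_member_spans`, and `middle_member` are made up for this example, and it assumes each archive holds exactly one regular file and fits in memory. It uses `tarfile` to find where the file data starts rather than hard-coding 512 bytes, so an archive with an extended header would still split at the right place. Note that locating the member boundaries on the reading side still decompresses the data once; what is avoided is the re-compression.

```python
import gzip
import io
import tarfile
import zlib


def splittable_tar_gz(tar_bytes: bytes) -> bytes:
    """Compress a one-file tar archive as three concatenated gzip members:
    header, file contents, trailer. (Hypothetical helper for illustration.)"""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        member = tf.next()
        if member is None or not member.isfile():
            raise ValueError("expected a tar archive holding one regular file")
        start = member.offset_data      # first byte of the file contents
        end = start + member.size       # one past the last byte of contents

    parts = (tar_bytes[:start], tar_bytes[start:end], tar_bytes[end:])
    # mtime=0 keeps the output reproducible; each call yields one gzip member.
    return b"".join(gzip.compress(part, mtime=0) for part in parts)


def gzip_member_spans(data: bytes) -> list:
    """Return the (start, end) byte offsets of each gzip member in data."""
    spans = []
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(wbits=31)   # 31 = gzip header and trailer
        d.decompress(data[pos:])           # output discarded; we want the end
        if not d.eof:
            raise ValueError("truncated gzip member")
        end = len(data) - len(d.unused_data)
        spans.append((pos, end))
        pos = end
    return spans


def middle_member(data: bytes) -> bytes:
    """Return the second of three gzip members, byte for byte."""
    spans = gzip_member_spans(data)
    if len(spans) != 3:
        raise ValueError("expected exactly three gzip members")
    start, end = spans[1]
    return data[start:end]
```

The bytes returned by `middle_member` can be written straight to a `.txt.gz` file. The full output of `splittable_tar_gz` should still be readable as an ordinary .tar.gz, since decompressors treat concatenated members as one stream. If decompressing to find the boundaries is too costly, the writing side could instead record the member offsets somewhere, which is a design choice outside this sketch.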