
How to count lines in tar.gz file without storing decompressed files


I have a number of compressed archives (*.tar.gz) in a directory. The file names include the creation date, and each archive contains multiple files. I need to count the lines that contain a specific word in the files whose names match a specific pattern. Is there a way to do this without decompressing and storing every file?

Let's say I have the following files.

myfile.2023-01-01.tar.gz
myfile.2023-01-02.tar.gz
.......................
myfile.2024-01-01.tar.gz
myfile.2024-05-01.tar.gz

Each file, when uncompressed, contains multiple files, as shown here.

access.log
access.123.log
config.log
config.456.log
config.3234.log
.....

I need to know the number of lines in the "config*" files that contain the word "created".


Solution

  • Without decompressing them, no. Without storing them, yes. You can pipe each .gz file through gunzip to decompress it, and write your own parser for the tar format to count lines in the desired files. The tar format is pretty simple: a 512-byte header precedes each file and contains, among other things, the name and the length of the file. The file contents follow the header, padded to a whole number of 512-byte blocks. The tar file ends with two 512-byte blocks of zeros.

    To process the tar file data:

    1. Read 512 bytes into buf[]. (You didn't specify a language, so here [] is indexing, starting at zero.)
    2. If the bytes are all zeros, or there weren't 512 bytes to read, you've reached the end. Exit.
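    3. If buf[156] is in the range '1'..'6', go to step 1. (Such an entry is a link, device, directory, or FIFO, not a regular file, and it has no data even if the size in the header is not zero.)
    4. Set buf[100] to zero. (The name field is 100 bytes long, so this guarantees that a name filling the whole field is still terminated.)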
    5. Get the zero-terminated name starting with buf[0].
    6. Convert the octal number, i.e. any ASCII digits 0..7, in buf[124] through buf[134], into an integer. That is the size of the entry.
    7. Now read that many bytes, rounded up to the next multiple of 512. If the name matches and the entry is a regular file (buf[156] is 0, '0', or '7'), count the lines in what you read, excluding the padding at the end that rounds the data up to a multiple of 512. You don't have to read it all into memory; 512 bytes at a time is enough. This whole process takes very little memory and no disk space.
    8. Go to step 1.
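
Since no language was specified, here is a minimal Python sketch of the steps above. The function names (`read_exact`, `count_matching_lines`, `count_in_archive`) and their parameters are made up for illustration. It also makes a few assumptions that go beyond the steps: the pattern is matched against the base name only, a line "contains the word" if the bytes appear anywhere in it (so "recreated" also counts), sizes are stored as plain octal digits, and names fit in the 100-byte name field.

```python
import fnmatch
import gzip

BLOCK = 512  # tar headers and data padding both use 512-byte blocks


def read_exact(stream, n):
    """Read up to n bytes, looping because pipes may return short reads."""
    parts = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            break
        parts.append(chunk)
        n -= len(chunk)
    return b"".join(parts)


def count_matching_lines(stream, pattern="config*", word=b"created"):
    """Count lines containing `word` in regular files whose base name matches `pattern`.

    `stream` is a binary file-like object yielding uncompressed tar data.
    """
    total = 0
    while True:
        header = read_exact(stream, BLOCK)                # step 1
        if len(header) < BLOCK or not any(header):        # step 2
            break
        typeflag = header[156:157]
        if typeflag in (b"1", b"2", b"3", b"4", b"5", b"6"):
            continue                                      # step 3: entry has no data
        # Steps 4-5: slicing to 100 bytes has the same effect as zeroing buf[100].
        name = header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        digits = header[124:135].strip(b" \0")
        size = int(digits, 8) if digits else 0            # step 6
        regular = typeflag in (b"\0", b"0", b"7")
        base = name.rsplit("/", 1)[-1]
        wanted = regular and fnmatch.fnmatchcase(base, pattern)

        remaining = size  # payload bytes still to read, excluding padding
        pending = b""     # partial line carried across block boundaries
        while remaining > 0:                              # step 7
            block = read_exact(stream, BLOCK)
            if len(block) < BLOCK:
                return total  # truncated archive: report what was counted so far
            data = block[:min(remaining, BLOCK)]
            remaining -= len(data)
            if wanted:
                lines = (pending + data).split(b"\n")
                pending = lines.pop()
                total += sum(word in line for line in lines)
        if wanted and word in pending:  # last line without a trailing newline
            total += 1
    return total                                          # step 8 is the outer loop


def count_in_archive(path, pattern="config*", word=b"created"):
    """Decompress `path` on the fly and count matches; nothing is written to disk."""
    with gzip.open(path, "rb") as stream:
        return count_matching_lines(stream, pattern, word)
```

To cover the whole directory, you could sum `count_in_archive(p)` over the paths returned by `glob.glob("myfile.*.tar.gz")`. The same `count_matching_lines` function also works on `sys.stdin.buffer` if you prefer to pipe the output of gunzip into the script.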