I have a number of compressed files (*.tar.gz) in a directory. The file names have the creation date in them. Each archive contains multiple files. I need to know the number of lines that contain a specific word in the files that match a specific naming pattern. Is there a way to do this without uncompressing and storing every file?
Let's say I have the following files.
myfile.2023-01-01.tar.gz
myfile.2023-01-02.tar.gz
.......................
myfile.2024-01-01.tar.gz
myfile.2024-05-01.tar.gz
Each file, when uncompressed, will contain multiple files as shown here.
access.log
access.123.log
config.log
config.456.log
config.3234.log
.....
I need to know the number of lines in the "config*" files that contain the word "created".
Without decompressing them, no. Without storing them, yes. You can pipe each .gz file through gunzip to decompress, and write your own parsing of the tar format to count lines in the desired files. The tar format is pretty simple, with a 512-byte header before each file, containing, among other things, the name of the file and the length of the file. The file contents are in an integer number of 512-byte blocks. The tar file ends with two 512-byte blocks of zeros.
To process the tar file data:
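1. Read a 512-byte header into buf[]. (You didn't specify a language, so here [] is indexing, starting at zero.) If the header is all zeros, you have reached the end of the archive and are done.
2. If buf[156] is in the range '1'..'6', go to step 1. (Not a regular file and there is no data even if the size in the header is not zero.)
3. Set buf[100] to zero.
4. The file name is the zero-terminated string starting at buf[0].
5. Convert the octal number, with digits 0..7, in buf[124] through buf[134], into an integer. That is the size of the entry.
6. Read the entry's data, which occupies the size rounded up to a multiple of 512 bytes. Count lines only if the name matches your pattern and buf[156] is 0, '0' or '7'. Otherwise discard the data. Then go to step 1.
You don't have to read it all into memory. Just 512 bytes at a time. This whole process takes very little memory and no disk space.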
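Since no language was specified, here is a minimal Python sketch of the steps above, offered as one possible reading rather than a definitive implementation. The function names (read_exact, count_matching_lines, count_in_archives) are hypothetical. It assumes plain ustar-style headers with octal sizes, matches the pattern against the base name of each entry, and treats "contains the word" as a simple substring test. GNU base-256 sizes and long-name extensions are not handled.

```python
import fnmatch
import glob
import gzip

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def read_exact(stream, n):
    """Read n bytes from stream; returns fewer only at end of input."""
    chunks = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            break
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def count_matching_lines(stream, name_pattern, word):
    """Count lines containing `word` in entries whose base name matches."""
    needle = word.encode("utf-8")
    total = 0
    while True:
        header = read_exact(stream, BLOCK)
        if len(header) < BLOCK or not any(header):
            break  # truncated input or all-zero block: end of archive
        typeflag = header[156:157]
        if b"1" <= typeflag <= b"6":
            continue  # links, devices, directories, FIFOs carry no data
        # The name is the zero-terminated string in the first 100 bytes.
        name = header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        base = name.rstrip("/").rsplit("/", 1)[-1]
        # Size: octal digits in bytes 124..134 (assumes no base-256 sizes).
        field = header[124:135].decode("ascii", "replace").strip(" \0")
        size = int(field or "0", 8)
        wanted = (typeflag in (b"\0", b"0", b"7")
                  and fnmatch.fnmatchcase(base, name_pattern))
        remaining = size
        pending = b""  # partial line carried over between blocks
        for _ in range((size + BLOCK - 1) // BLOCK):
            block = read_exact(stream, BLOCK)
            if not wanted:
                continue  # skip data of entries we do not care about
            data = block[:remaining]  # drop the zero padding at the end
            remaining -= len(data)
            *lines, pending = (pending + data).split(b"\n")
            total += sum(needle in line for line in lines)
        if needle in pending:
            total += 1  # last line without a trailing newline
    return total


def count_in_archives(archive_glob, name_pattern, word):
    """Sum the counts over every .tar.gz file matching archive_glob."""
    total = 0
    for path in sorted(glob.glob(archive_glob)):
        # gzip.open decompresses on the fly; nothing is written to disk.
        with gzip.open(path, "rb") as stream:
            total += count_matching_lines(stream, name_pattern, word)
    return total
```

For the files in the question, a call such as count_in_archives("myfile.*.tar.gz", "config*", "created") would give the total across all archives. Only one 512-byte block plus any unfinished line is held in memory at a time.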