Let's say I have two files, a.txt and b.txt, with some content...
$ tail *.txt
==> a.txt <==
ABC
CDE
123
C
==> b.txt <==
C
321
EDC
CBA
Let's also imagine that the files have now been put in a gzipped tarball...
$ tar -czf tarball.tgz *.txt
$ tar -tf tarball.tgz
a.txt
b.txt
Now, I want to grep through the files in the tarball. Seeing the original filename and line number before each match would be nice, but most importantly I want to see the matched lines.
First, I expected that zgrep 'pattern' tarball.tgz would simply work. It does tell me whether there is a match, and it can even count the matches, but I can't find a way to have them printed...
$ zgrep 'AB' tarball.tgz
Binary file (standard input) matches
$ zgrep 'C' tarball.tgz
Binary file (standard input) matches
$ zgrep -c 'AB' tarball.tgz
1
$ zgrep -c 'C' tarball.tgz
6
Second, I tried to zcat the tarball and run a regular grep on the output. But I still get the exact same "Binary file (standard input) matches" message...
$ zcat tarball.tgz | grep 'C'
Binary file (standard input) matches
I guess zcat (and zgrep) do a gunzip but no tar -xf? If I look at the output of zcat, I see the same thing as if I had just run tar -c...
$ zcat tarball.tgz
a.txt0000664�3���3���0000000001613554050266013370 0ustar useruserABC
CDE
123
C
b.txt0000664�3���3���0000000001613554050301013357 0ustar useruserC
321
EDC
CBA
$ tar -c *.txt
a.txt0000664�3���3���0000000001613554050266013370 0ustar useruserABC
CDE
123
C
b.txt0000664�3���3���0000000001613554050301013357 0ustar useruserC
321
EDC
CBA
So finally, I got to this solution which works OK:
$ tar -xOzf tarball.tgz | grep 'C'
ABC
CDE
C
C
EDC
CBA
Of course, if I now ask for filenames and line-numbers, I don't get anything useful...
$ tar -xOzf tarball.tgz | grep -Hn 'C'
(standard input):1:ABC
(standard input):2:CDE
(standard input):4:C
(standard input):5:C
(standard input):7:EDC
(standard input):8:CBA
The only way I can think of to get the results I want would involve a bit more scripting, to extract the tarball and run grep in a loop...
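To make that concrete, here is a rough sketch of the kind of loop I have in mind. The function name grep_tar_loop is made up for illustration, and it assumes that member names contain no newlines and that grep supports --label (GNU grep does). It calls tar once per member, so it is neither fast nor concise:

```shell
# Hypothetical helper: grep each member of a tarball separately.
# Assumptions: member names contain no newlines; grep supports --label (GNU grep).
grep_tar_loop() {
    pattern=$1
    archive=$2
    # List the members, then extract each one to stdout and grep it on its own,
    # so that line numbers restart for every file.
    tar -tf "$archive" | while IFS= read -r member; do
        # Skip directory entries, which have no content to search.
        case $member in */) continue ;; esac
        tar -xOf "$archive" "$member" | grep -Hn --label="$member" -e "$pattern"
    done
    return 0
}
```

Calling grep_tar_loop 'C' tarball.tgz should then prefix every match with the member name and a per-file line number.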
Is there a nice (easy and concise) way to do this?
tar -czf does two things: it bundles the files into a tar archive (-c), and it compresses that archive with gzip (-z).
As I suspected, zgrep and zcat only do the gunzip step, and are left with a tar file, which is still binary. That explains all the output I was getting.
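A small sketch of that idea, spelling the two steps out by hand and then undoing them in reverse order. The scratch directory and the file name two-step.tgz are only there for illustration; the file contents are the ones from the question:

```shell
# Work in a scratch directory so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'ABC\nCDE\n123\nC\n' > a.txt
printf 'C\n321\nEDC\nCBA\n' > b.txt

# Step 1: bundle the files into a tar stream. Step 2: gzip that stream.
# Together this is roughly what tar -czf does in one go.
tar -cf - a.txt b.txt | gzip > two-step.tgz

# Reading has to undo both steps: gunzip first, then let tar unpack
# the members to stdout, and only then grep the plain text.
gunzip -c two-step.tgz | tar -xOf - | grep 'C'
```

Stopping after the gunzip step is what zcat and zgrep do, which is why grep ends up looking at tar headers and padding and reports a binary file.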
The easiest way around that is to add an option to zgrep:
-a, --text
Process a binary file as if it were text; this is equivalent to the --binary-files=text option.
That works almost as well as tar -xOzf tarball.tgz | grep -Hn 'C': here too, we don't get the individual filenames, and the line numbers run over the whole tar output. We also get some noise, namely the tar headers:
$ zgrep -Hna 'C' tarball.tgz
tarball.tgz:1:a.txt0000664�3���3���0000000001613554050266013370 0ustar jlehuenjlehuenABC
tarball.tgz:2:CDE
tarball.tgz:4:C
tarball.tgz:5:b.txt0000664�3���3���0000000001613554050301013357 0ustar jlehuenjlehuenC
tarball.tgz:7:EDC
tarball.tgz:8:CBA
That is easy enough to remember, and works quite well for e.g. grepping logs, where the first line of a file is rarely the interesting match.
Now, @Shawn pointed me to an answer on the Unix StackExchange. From that, I came up with my favorite option:
$ tar -xf tarball.tgz --to-command='grep -Hn --label="$TAR_ARCHIVE/$TAR_FILENAME" C || true'
tarball.tgz/a.txt:1:ABC
tarball.tgz/a.txt:2:CDE
tarball.tgz/a.txt:4:C
tarball.tgz/b.txt:1:C
tarball.tgz/b.txt:3:EDC
tarball.tgz/b.txt:4:CBA
I'll probably write myself a function for this, because it's not fun to type. The output is exactly what I wanted, though! :)
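Such a function could look something like the sketch below. The name targrep and the TARGREP_PATTERN variable are my own inventions, and it assumes GNU tar (for --to-command and the TAR_ARCHIVE / TAR_FILENAME variables) and a grep that supports --label. The pattern is handed to grep through the environment so that it does not have to survive another layer of quoting, and the || true keeps tar from complaining when a member has no match and grep exits with status 1:

```shell
# Hypothetical helper: grep inside every member of one or more tarballs.
# Assumptions: GNU tar (--to-command, TAR_ARCHIVE, TAR_FILENAME) and GNU grep (--label).
targrep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: targrep PATTERN TARBALL..." >&2
        return 2
    fi
    targrep_pattern=$1
    shift
    for targrep_archive in "$@"; do
        # tar runs the command once per regular file, feeding the file on stdin.
        # The single quotes delay expansion until that per-member shell runs.
        TARGREP_PATTERN=$targrep_pattern tar -xf "$targrep_archive" \
            --to-command='grep -Hn --label="$TAR_ARCHIVE/$TAR_FILENAME" -e "$TARGREP_PATTERN" || true'
    done
}
```

With that in my shell startup file, targrep 'C' tarball.tgz should give the same output as the long command above.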