shell zgrep

Why won't zgrep show the actual matches?


Context

Let's say I have two files a.txt and b.txt with some content...

$ tail *.txt
==> a.txt <==
ABC
CDE
123
C

==> b.txt <==
C
321
EDC
CBA

Let's also imagine that the files have now been put in a gzipped tarball...

$ tar -czf tarball.tgz *.txt
$ tar -tf tarball.tgz
a.txt
b.txt

Goal

Now I want to grep through the files in the tarball. Seeing the original filename and line number before each match would be nice, but most importantly I want to see the matched lines.

What did I try?

First, I expected that zgrep 'pattern' tarball.tgz would simply work. It does tell me whether there is a match, and it can even count the matching lines, but I can't find a way to have the matches printed...

$ zgrep 'AB' tarball.tgz
Binary file (standard input) matches
$ zgrep 'C' tarball.tgz
Binary file (standard input) matches
$ zgrep -c 'AB' tarball.tgz
1
$ zgrep -c 'C' tarball.tgz
6

Second, I thought to zcat the tarball and use a regular grep on that. But still, I get this exact same "Binary file (standard input) matches" message...

$ zcat tarball.tgz | grep 'C'
Binary file (standard input) matches

I guess zcat (and zgrep) do a gunzip but no tar -xf? If I look at the zcat output, I see the same thing as if I had just run tar -c...

$ zcat tarball.tgz
a.txt0000664�3���3���0000000001613554050266013370 0ustar  useruserABC
CDE
123
C
b.txt0000664�3���3���0000000001613554050301013357 0ustar  useruserC
321
EDC
CBA

$ tar -c *.txt
a.txt0000664�3���3���0000000001613554050266013370 0ustar  useruserABC
CDE
123
C
b.txt0000664�3���3���0000000001613554050301013357 0ustar  useruserC
321
EDC
CBA

So finally, I got to this solution which works OK:

$ tar -xOzf tarball.tgz | grep 'C'
ABC
CDE
C
C
EDC
CBA

Of course, if I now ask for filenames and line-numbers, I don't get anything useful...

$ tar -xOzf tarball.tgz | grep -Hn 'C'
(standard input):1:ABC
(standard input):2:CDE
(standard input):4:C
(standard input):5:C
(standard input):7:EDC
(standard input):8:CBA

The only way I can think of to get the results I want would involve a bit more scripting: extract the tarball and run grep in a loop...
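
For reference, the scripted version I had in mind would look roughly like this. It is only a sketch: targrep_loop is a name I made up, it assumes a gzipped tarball and a grep that supports -H, and it extracts everything to a temporary directory first.

```shell
# Hypothetical helper: targrep_loop PATTERN ARCHIVE
# Extracts the archive to a temporary directory, then greps each member.
targrep_loop() (
    pattern=$1
    archive=$2
    tmpdir=$(mktemp -d) || exit 1
    if ! tar -xzf "$archive" -C "$tmpdir"; then
        rm -rf "$tmpdir"
        exit 1
    fi
    # Walk the members in archive order; skip directories and other non-files.
    tar -tzf "$archive" | while IFS= read -r member; do
        if [ -f "$tmpdir/$member" ]; then
            # Run grep from inside the temp dir so the printed name
            # is the member name rather than the temporary path.
            (cd "$tmpdir" && grep -Hn -e "$pattern" "$member")
        fi
    done
    rm -rf "$tmpdir"
)
```

Calling targrep_loop 'C' tarball.tgz prints lines such as a.txt:1:ABC, which is the output I am after, but that is a lot of machinery for a simple search.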


Is there a nice (easy and concise) way to do this?


Solution

  • tar -czf does two things: it bundles the files into a single tar archive, then compresses that archive with gzip.

    As I suspected, zgrep and zcat only undo the gzip step. What they are left with is a tar archive, which is still a binary format. That explains all the output I was getting.
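
To see the two layers separately, here is a small sketch. The function name peel_tgz is made up for illustration, and gzip -dc stands in for zcat since it behaves the same way on a .tgz file. The first half of the pipeline removes only the compression; the second half shows that what comes out is a tar archive by asking tar to list it.

```shell
# Hypothetical helper: peel_tgz ARCHIVE
# Layer 1: gzip -dc undoes the compression only (this is all zcat/zgrep do).
# Layer 2: the decompressed stream is still a tar archive, so tar can list it.
peel_tgz() {
    gzip -dc "$1" | tar -tf -
}
```

Running peel_tgz tarball.tgz should list the member names, the same as tar -tf tarball.tgz does.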

    Easy solution

    The easiest way around that is to add an option to zgrep:

       -a, --text
              Process a binary file as if it were text; this is equivalent to the --binary-files=text option.
    

    That works almost as well as tar -xOzf tarball.tgz | grep -Hn 'C': we still don't get the individual filenames, and the line numbers run over the whole tar stream. We also get some noise from the tar headers, which end up on the same line as the first line of each file:

    $ zgrep -Hna 'C' tarball.tgz
    tarball.tgz:1:a.txt0000664�3���3���0000000001613554050266013370 0ustar  jlehuenjlehuenABC
    tarball.tgz:2:CDE
    tarball.tgz:4:C
    tarball.tgz:5:b.txt0000664�3���3���0000000001613554050301013357 0ustar  jlehuenjlehuenC
    tarball.tgz:7:EDC
    tarball.tgz:8:CBA
    

    That is easy enough to remember, and works quite well for e.g. grepping logs, where the first line of a file is rarely the interesting match.

    Best output

    Now, @Shawn pointed me to that answer on the Unix StackExchange. From that, I came up with my favorite option:

    $ tar -xf tarball.tgz --to-command='grep -Hn --label="$TAR_ARCHIVE/$TAR_FILENAME" C || true'
    tarball.tgz/a.txt:1:ABC
    tarball.tgz/a.txt:2:CDE
    tarball.tgz/a.txt:4:C
    tarball.tgz/b.txt:1:C
    tarball.tgz/b.txt:3:EDC
    tarball.tgz/b.txt:4:CBA
    

    I'll probably write myself a function for this, because it's not fun to type. The output is exactly what I wanted, though! :)
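
A minimal sketch of such a function, assuming GNU tar (for --to-command) and GNU grep (for --label). The names targrep and TARGREP_PATTERN are made up for this example. The pattern is passed through the environment so that it does not have to be quoted inside the --to-command string.

```shell
# Hypothetical helper: targrep PATTERN ARCHIVE...
# Assumes GNU tar (--to-command) and GNU grep (--label).
targrep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: targrep PATTERN ARCHIVE..." >&2
        return 2
    fi
    targrep_pattern=$1
    shift
    for targrep_archive in "$@"; do
        # tar runs the command once per regular file, with the file on stdin.
        # TARGREP_PATTERN (a made-up variable name) carries the pattern to grep.
        # "|| true" keeps tar from complaining when a member has no match.
        TARGREP_PATTERN=$targrep_pattern tar -xf "$targrep_archive" \
            --to-command='grep -Hn --label="$TAR_ARCHIVE/$TAR_FILENAME" -e "$TARGREP_PATTERN" || true'
    done
}
```

With this defined, targrep 'C' tarball.tgz should print the same lines as the command above. Because of the || true, the exit status does not tell you whether anything matched.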