Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial because grep outputs the search results to STDOUT. My default workflow is to write the results to a file, and I would like the progress bar/status to go to STDOUT or STDERR.
Would this require modifying the source code of grep?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, you are most likely using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it has explored the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports, but they would also increase overhead, since each batch requires another grep process start-up, and start-up can take longer than grepping a small file. The progress report would be updated once per batch, so you would want to choose a batch size that gives you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the file size) would make the progress report more exact but add a per-file cost of its own.
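A rough sketch of the size-based variant, in case it helps. It assumes bash and GNU stat (the -c %s format is GNU-specific), and the names grep_by_size and maxBytes are made up for illustration. I haven't measured how much the extra stat calls cost.

```shell
# Sketch only: batch files by cumulative size rather than by count.
# Usage: grep_by_size PATTERN MAX_BYTES FILE...
# Matches go to stdout, "seen/total" progress goes to stderr.
grep_by_size() {
    local pattern=$1 maxBytes=$2
    shift 2
    local total=$# seen=0 batchBytes=0 size f
    local batch=()
    for f in "$@"; do
        seen=$((seen + 1))
        # GNU stat prints the size in bytes; skip anything it cannot stat
        size=$(stat -c %s -- "$f") || continue
        batch+=("$f")
        batchBytes=$((batchBytes + size))
        if ((batchBytes >= maxBytes)); then
            echo "$seen/$total" >&2
            # -H keeps the file name prefix even when the batch holds one file
            grep -H -e "$pattern" -- "${batch[@]}"
            batch=()
            batchBytes=0
        fi
    done
    # Flush whatever is left in the final, partially filled batch
    if ((${#batch[@]} > 0)); then
        echo "$seen/$total" >&2
        grep -H -e "$pattern" -- "${batch[@]}"
    fi
}
```

You would call it with a list of regular files, for example grep_by_size "$pattern" 10000000 "${files[@]}" >results.txt, so that only the progress lines reach the terminal.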
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, here is a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):
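# Requires bash 4 and GNU grep
pattern="STRING"          # the expression to search for
shopt -s globstar
files=(**)                # every path under the current directory, directories included
total=${#files[@]}
batchSize=100
for ((i=0; i<total; i+=batchSize)); do
    echo "$i/$total" >&2  # progress goes to stderr
    # -d skip ignores directories; -H keeps the file name prefix even for a one-file batch
    grep -H -d skip -e "$pattern" -- "${files[@]:i:batchSize}" >>results.txt
done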
For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option, which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
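If you do need the find route, something like the following should work. It is a sketch, assuming a find that supports -print0 (GNU and BSD find both do); collect_files is just a name I picked. Because of -type f, only regular files end up in the array, so the count is exact in this variant.

```shell
# Sketch only: fill the global array "files" without globstar.
# NUL-delimited names survive spaces and newlines in file names.
collect_files() {
    local f
    files=()
    while IFS= read -r -d '' f; do
        files+=("$f")
    done < <(find "${1:-.}" -type f -print0)
}
```

After collect_files . the rest of the loop can use ${#files[@]} and "${files[@]:i:batchSize}" exactly as before.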
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
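As one possible starting point for a cleaner display, this hypothetical helper (show_progress is my own name) redraws a single status line instead of printing a new line per batch. It assumes a terminal that understands the common ANSI "erase to end of line" sequence.

```shell
# Sketch only: overwrite one status line on stderr.
# \r moves the cursor to column 0; ESC[K erases the rest of the line.
show_progress() {
    local count=$1 total=$2
    printf '\r\033[K%d/%d' "$count" "$total" >&2
}
```

Inside the loop you would call show_progress "$i" "$total" in place of the echo, and run echo >&2 once after the loop so the shell prompt starts on a fresh line.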
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
# -0 tells parallel to read the NUL-separated names that -print0 emits
find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
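For comparison, if GNU parallel isn't available, the fixed-segment approach (X segments, X loops) could be sketched roughly like this. The function name, the per-worker output files and the hard-coded batch size are my own choices for illustration; writing to one file per worker is just a way to keep the workers' output from interleaving.

```shell
# Sketch only: split the file list into contiguous segments, one background loop each.
# Usage: grep_in_segments PATTERN WORKERS OUT_PREFIX FILE...
# Worker w writes its matches to OUT_PREFIX.w
grep_in_segments() {
    local pattern=$1 workers=$2 outPrefix=$3
    shift 3
    local files=("$@")
    local total=$# batchSize=100
    local segment=$(( (total + workers - 1) / workers ))   # ceiling division
    local w start end i
    for ((w=0; w<workers; w++)); do
        start=$((w * segment))
        end=$((start + segment))
        ((end > total)) && end=$total
        (
            for ((i=start; i<end; i+=batchSize)); do
                # Never read past the end of this worker's segment
                n=$((end - i))
                ((n > batchSize)) && n=$batchSize
                grep -H -e "$pattern" -- "${files[@]:i:n}"
            done >"$outPrefix.$w"
        ) &
    done
    wait   # block until every background worker has finished
}
```

Afterwards cat results.part.* > results.txt (or whatever prefix you passed) merges the pieces. The segments still won't finish at the same time, which is the sub-optimal part.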