I am trying to run a function once for each file from a large list that I am piping in.
Here is some example code, here just grepping in the files whose names are coming from stdin.
In my real code I am running a program that takes significant time to process each file, but only generates output for some of the files that were processed, so grep is comparable.
I would also ideally like to capture the error code from the program to know which files failed to process and see the stderr output from the program echoed to the terminal's stderr.
#!/bin/bash
searchterm="$1"
filelist=$(cat /dev/stdin)    # reads ALL of stdin before anything else runs
numfiles=$(echo "$filelist" | wc -l)
currfileno=0
while IFS= read -r file; do
    ((++currfileno))
    echo -ne "\r\033[K" 1>&2 # clears the line
    echo -ne "$currfileno/$numfiles $file" 1>&2
    grep "$searchterm" "$file"
done <<< "$filelist"
I saved this as test_so_stream, and I can run it with find ~ -type f -iname \*.txt | test_so_stream searchtext.
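For the error-code part mentioned above, I imagine the loop body growing into something like this rough sketch (failed_files is just an illustrative log name):

grep "$searchterm" "$file"   # the real program's stderr still reaches the terminal
rc=$?
if (( rc > 1 )); then        # for grep: 0 = match, 1 = no match, >1 = error
    echo "$file (exit $rc)" >> failed_files   # record files that failed to process
fi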
The problem is that when I run it on a large list of thousands of files, nothing starts processing at all until the entire list is loaded, which can take significant time.
What I would like to happen is for it to start processing the first file immediately as soon as the first filename appears on stdin.
I know I could use a pipe for this, but I also would like the statusline (including the current file number and the total number of files) updated to stderr after the processing of each file, or every second or so.
I presume I'd need some kind of multithreading to process the list separately from the actual worker process(es), but I'm not sure how to achieve that using bash.
Bonus points if I can process multiple files at once in a worker pool, although I do not want the output from multiple files to be intermingled: I need the full output of one file, then the full output of the next, and so on. This is low priority for me if it's complicated, and it is not the focus of my question.
I have tried to use parallel and xargs, and I know at least parallel can process multiple files at once, in fact very close to what I want, even with the output not intermingled, but I still can't work out how to have the status line updated at the same time so I know how far through the list of files it is. I know about the --bar option of parallel, but it is too ugly for my taste and not customizable (I would like the status bar to have colors and show the filename being processed).
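For the bonus part, the closest I have found is something like the line below; parallel groups each job's output by default and -k (--keep-order) keeps it in input order, but there is still no usable status line:

find ~ -type f -iname \*.txt | parallel -k grep searchtext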
How can I achieve this?
Edit, to answer @markp-fuso's questions in the comments:
I know that stderr/stdout both show on the same terminal.
I would like the status bar to go to stderr so I can pipe the entire stdout of the program somewhere to save and further process it. When I do this I will not be saving stderr; that's just so I can watch the program while it's working. My example program does do this: it shows the status and keeps overwriting that line until there's some output. In my full program it clears the status line and overwrites it with the output, if there is output for that file. I omitted the check for output and the line clear from my example program because that's not the part of the question that's important to me.
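For example, the intent is to be able to run something like this (results.txt is just an example name) and still watch the status line on the terminal while the matches are saved:

find ~ -type f -iname \*.txt | test_so_stream searchtext > results.txt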
Re: the status bar not knowing the total number of files: I want the status bar to show the current total number of files and update it as more are piped in, e.g. like pv does. I imagine having one process that loads a global filelist from stdin and echoes the status bar to stderr every second, while another process simultaneously loops through that global filelist, processing each file. The problem I'm trying to avoid is that the parent process does not know the total number of files immediately; it takes significant time to generate the entire list, and I would like my processing to start immediately.
Perhaps calling it a status bar is overstating what I mean. I just want to be able to see something showing how far it is through the list of files, and which file it is currently processing. Nothing super fancy, but I want it to be in color so it stands out on the console from the stdout data: one colored line at the bottom that is continuously overwritten to show me that it is still working.
"if you manage to spawn 3 parallel threads, how exactly do you envision their outputs and status bar(s) being displayed in a single console/terminal window?"
Exactly like cat filelist | parallel grep searchterm does, i.e. the grep output for each file shown consecutively, not intermingled. The status bar can appear anywhere (because I'm not saving that), although I would rather it appeared in between the output: when there's another chunk of grep output it should overwrite the status line at the bottom, then more status line, and the cycle continues. So the status line is just continually getting overwritten to show me what file it's up to.
I'm not 100% clear on all of OP's requirements so I'm going to focus on a stderr to status line and stdout to a file approach. Hopefully this will get OP a bit closer to the final goal ...
Assumptions/understandings:
- input filenames arrive over a period of time rather than all at once (simulated here by gen_output; filenames are output-#)
- a separate process keeps a running count of the filenames received so far (count_input) and prints the new count to file counter
- the worker (process_input) prints a status line showing how many files it has processed out of the total received (per count_input) at that point in time, plus the current file being processed
- process_input stdout is written to file process_input.stdout
The 3 programs:
######################### generate 10 outputs at 0.5 second intervals
$ cat gen_output
#!/bin/bash
for ((i=1;i<=10;i++))
do
    echo "output-$i"
    sleep .5
done
######################### for each input update a counter and overwrite file 'counter'
$ cat count_input
#!/bin/bash
count=0
while read -r input
do
    ((count++))
    echo "${count}" > counter
done
######################### for each input read current total from file 'counter' and then print status line
$ cat process_input
#!/bin/bash
touch counter
count=0
cl_eol=$(tput el) # clear to end of line
while read -r input
do
    ((count++))
    read -r total < counter
    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${input}" "${cl_eol}" >&2
    echo "something to stdout - ${count} / ${total}"
    sleep 2
done > process_input.stdout
printf "\nDone.\n" >&2
Using tee to feed a copy of gen_output's output to process_input before piping to count_input:
$ ./gen_output | tee >(./process_input) | ./count_input
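The process substitution is what keeps this streaming: tee sends each filename down the pipe to count_input as it arrives and simultaneously feeds a copy to process_input, so processing can start on the first filename immediately while the total in counter keeps growing. In OP's case the generator would presumably be the find command, e.g. (assuming the helper scripts live in the current directory):

find ~ -type f -iname \*.txt | tee >(./process_input) | ./count_input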
I've got a .gif of this in action but SO is not allowing me to upload the image at this time, so imagine the following lines being displayed, one at a time at 2-second intervals, while overwriting the previous line:
processing 1/1 output-1
processing 2/4 output-2
processing 3/8 output-3
processing 4/10 output-4
processing 5/10 output-5
processing 6/10 output-6
processing 7/10 output-7
processing 8/10 output-8
processing 9/10 output-9
processing 10/10 output-10
And then a new line is displayed:
Done.
And the stdout:
$ cat process_input.stdout
something to stdout - 1 / 1
something to stdout - 2 / 4
something to stdout - 3 / 8
something to stdout - 4 / 10
something to stdout - 5 / 10
something to stdout - 6 / 10
something to stdout - 7 / 10
something to stdout - 8 / 10
something to stdout - 9 / 10
something to stdout - 10 / 10
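To map this back to OP's grep example, process_input's loop body could be swapped for the real per-file work; a rough, untested sketch (process_grep and results.txt are illustrative names):

$ cat process_grep
#!/bin/bash
searchterm="$1"
touch counter
count=0
cl_eol=$(tput el)                 # clear to end of line
while IFS= read -r file
do
    ((count++))
    read -r total < counter       # total received so far, as maintained by count_input
    printf "\rprocessing %s/%s %s%s" "${count}" "${total}" "${file}" "${cl_eol}" >&2
    grep "$searchterm" "$file"    # exit status could be checked here to log failures
done
printf "\nDone.\n" >&2

Which could then be wired up the same way, with stdout captured to a file and the status line still visible on stderr:

find ~ -type f -iname \*.txt | tee >(./process_grep searchtext > results.txt) | ./count_input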