I receive operations logs, one per line, in an $input_file: each line describes which resources were used (printers, scanners, etc.). Below are examples of valid lines:
crap I don't care [] # This should not happen, but sometimes it does
crap I don't care [Printer DVX500]
crap I really do not care [Printer DV650V & Scanner SON2000]
some other crap [Printer DVX500 & Printer DV650V & Scanner SON2000]
I have no control over how these lines are produced. I only know that the resources are supposed to be described at the end of each line, between square brackets ([...]). Sometimes, like in the first line above, it is empty; sometimes several resources are used, in which case they are separated with $SEP ($SEP=="&" in the examples).
For usage purposes, I need to produce a sorted list of (unique) resources in an $output_file. The following code works, and follows these steps:

1. Extract what is between the [ and ] at the end of the line into $content;
2. If $content is not empty, try and split it along $SEP into $parts;
3. For each part of $parts, trim trailing/leading whitespace(s) and store the result; finally, write the sorted unique list to $output_file.
With the line examples above, the result should be the following:
Printer DVX500
Printer DV650V
Scanner SON2000
Here is the script I am using, which is functionally correct (at least so far):
# Get the input and output file paths from the command line arguments
input_file="$1"
output_file="$2"

# Check if the input file exists
if [[ ! -f "$input_file" ]]; then
    echo "Error: Input file '$input_file' not found!"
    exit 1
fi

# Use a hash table (associative array) to avoid duplicates directly in memory
declare -A unique_lines
linenumber=0

# Process the file line by line
while IFS= read -r line; do
    linenumber=$[$linenumber +1]
    if (( $linenumber%100 == 0 )); then
        echo "line=$linenumber"
    fi

    # Use sed to extract the content between the last "[" and "]"
    content=$(echo "$line" | sed -E 's/.*\[(.*)\].*/\1/')

    # If content is not empty, process the extracted part
    if [[ -n "$content" ]]; then
        # Split the content by "&" and process each part
        IFS="&" read -ra parts <<< "$content"
        for part in "${parts[@]}"; do
            # Trim leading/trailing whitespace from each part
            trimmed_part=$(echo "$part" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
            # Store the trimmed part in the associative array (ensures uniqueness)
            unique_lines["$trimmed_part"]=1
        done
    fi
done < "$input_file"

# Write the sorted, unique results to the output file
for part in "${!unique_lines[@]}"; do
    echo "$part"
done | sort > "$output_file"   # supposed to be unique already, due to how $unique_lines is fed

echo "Processing completed. Output saved to $output_file"
The issue is performance: with an $input_file of 5,000-10,000 lines, the previous script takes between 12 and 30 minutes. I tried several things, but could not figure out a way of really optimising it:

- sed;
- writing into $output_file along the extraction, instead of storing results in the $unique_lines array (which is way slower because of the many write operations).

The culprit seems to be the second read followed by the for loop, but I don't see an easy way to do Step #2 otherwise.
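For illustration, a minimal (untested) sketch of doing the Step #1 extraction with parameter expansion instead of a per-line sed call, assuming the resources always sit in the last [...] pair of the line:

# Hypothetical alternative to the per-line sed call, using parameter expansion only
content="${line##*\[}"       # drop everything up to and including the last '['
content="${content%%\]*}"    # drop everything from the first remaining ']' onwards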
If you have any indication on how to optimise it, even if the general algorithm is changed, I would be happy.
EDIT: Completing the requirements list wrt. brilliant commenters' remarks :-)

Thanks for the thorough analysis of my use case :-). For the sake of completeness:

Lines may contain more than one [...] group, but the info I need to extract is always within the last pair. So a better test set would be the following:

crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]
There should not be lines without any [...] group, or at least I have never encountered that case yet. Maybe adopting a defensive line here would be better, I don't know (a possible guard is sketched a few lines below)...
The data inside the brackets may indeed vary (lower/upper case, singular/plural forms, etc.), but that does not matter much, as the rest of the workflow takes these variations into consideration.
The data inside [...] may occasionally have trailing/leading spaces, although rarely. For my solution, that was not an issue, as I extracted the whole content between [...], then split it along the [:space:]$SEP[:space:] characters, then trimmed, which should have taken care of it.
Finally, yes: sticking to bash is boss-mandatory, and I fully agree with you that a GPPL would have been way better (readable, faster) than bash.
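If a defensive check against lines without any [...] group ever turns out to be needed, a guard as simple as the following at the top of the while read loop should presumably do (just an illustrative sketch, not something the current script contains):

# Hypothetical guard: silently skip lines that carry no [...] group at all
[[ "$line" == *"["*"]"* ]] || continue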
You're using the wrong tool. A shell (e.g. bash) is a tool to create/destroy files and processes and sequence calls to other tools. It's specifically NOT a tool to manipulate text. The mandatory POSIX (i.e. available on all Unix boxes) tool to manipulate text is awk, so you should be using a single awk script instead of bash calling sed and other tools in a loop, which is a famously inefficient approach as well as usually resulting in lengthy, complicated, fragile, non-portable code. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
This, using any awk, will do what you appear to want to do orders of magnitude faster than what you were trying to do as well as being more concise and portable:
$ cat tst.sh
#!/usr/bin/env bash
input_file="$1"
output_file="$2"
awk '
    sub(/.*\[/,"") && sub(/].*/,"") && NF {
        gsub(/ *& */, ORS)
        print
    }
' "$input_file" | sort -u > "$output_file"
You'd run the above the same way as the script in your question:
$ ./tst.sh file outfile
and the output would be:
$ cat outfile
Printer DV650V
Printer DVX500
Scanner SON2000
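If the bracket contents can also carry leading/trailing spaces (as mentioned in your edit), a variant along these lines (an untested sketch, not a drop-in guarantee) would trim them as part of the same substitutions:

awk '
    sub(/.*\[[[:space:]]*/,"") && sub(/[[:space:]]*].*/,"") && NF {
        # whitespace around each "&" separator is consumed together with it
        gsub(/[[:space:]]*&[[:space:]]*/, ORS)
        print
    }
' "$input_file" | sort -u > "$output_file"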