Tags: arrays, bash, performance, sed

Optimising the performance of nested loops for text extraction/manipulation


I receive operation logs, one per line, in an $input_file: each line describes which resources were used (printers, scanners, etc.). Below are examples of valid lines:

crap I don't care [] # This should not happen, but sometimes it does
crap I don't care [Printer DVX500]
crap I really do not care [Printer DV650V & Scanner SON2000]
some other crap [Printer DVX500 & Printer DV650V & Scanner SON2000]

I have no control over how these lines are produced. I only know that the resources are supposed to be described at the end of each line, between square brackets ([...]). Sometimes, as in the first line above, the bracketed part is empty; sometimes several resources are used, in which case they are separated by $SEP ($SEP=="&" in the examples).

For usage purposes, I need to produce a sorted list of (unique) resources in an $output_file. The following code works, and follows these steps:

  1. Extract the content between [] at the end of the line into $content
  2. If $content is not empty, split it on $SEP into $parts
  3. For each element in $parts, trim leading/trailing whitespace
  4. Add each trimmed part to an associative array (which naturally keeps only unique resources)
  5. Write everything into $output_file

With the line examples above, the result should be the following:

Printer DV650V
Printer DVX500
Scanner SON2000

Here is the script I am using, which is functionally correct (at least so far):

# Get the input and output file paths from the command line arguments
input_file="$1"
output_file="$2"
# Check if the input file exists
if [[ ! -f "$input_file" ]]; then
    echo "Error: Input file '$input_file' not found!"
    exit 1
fi

# Use a hash table (associative array) to avoid duplicates directly in memory
declare -A unique_lines
linenumber=0
# Process the file line by line
while IFS= read -r line; do
    linenumber=$((linenumber + 1))
    if (( linenumber % 100 == 0 )); then
        echo "line=$linenumber"
    fi
    # Use sed to extract the content between the first "[" and "]"
    content=$(echo "$line" | sed -E 's/.*\[(.*)\].*/\1/')

    # If content is not empty, process the extracted part
    if [[ -n "$content" ]]; then
        # Split the content by "&" and process each part
        IFS="&" read -ra parts <<< "$content"
        for part in "${parts[@]}"; do
            # Trim leading/trailing whitespace from each part
            trimmed_part=$(echo "$part" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
            
            # Store the trimmed part in the associative array (ensures uniqueness)
            unique_lines["$trimmed_part"]=1
        done
    fi
done < "$input_file"

# Write the sorted, unique results to the output file
for part in "${!unique_lines[@]}"; do
    echo "$part"
done | sort > "$output_file" # supposed to be unique already, due to how $unique_lines is fed

echo "Processing completed. Output saved to $output_file" 

The issue is performance: with an $input_file of 5,000-10,000 lines, the previous script takes between 12 and 30 minutes. I tried several things, but could not figure out a way of really optimising it.

The culprit seems to be the second read followed by the for loop, but I don't see an easy way to do Step #2 otherwise.
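
To make step 3 concrete: the trimming could in principle be done with parameter expansion instead of a sed call per part (rough, untested sketch below, assuming the $part variable from the loop above), but that only saves one external process per iteration; the per-line sed of step 1 and the loop itself remain.

# Hypothetical drop-in for the per-part trim (step 3), avoiding one sed process:
trimmed_part="${part#"${part%%[![:space:]]*}"}"                  # strip leading whitespace
trimmed_part="${trimmed_part%"${trimmed_part##*[![:space:]]}"}"  # strip trailing whitespace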

Any suggestion on how to optimise it, even if the general algorithm is changed, would be welcome.

EDIT: completing the requirements list w.r.t. the brilliant commenters' remarks :-)

Thanks for the thorough analysis of my use case :-). For the sake of completeness, lines may contain additional bracketed fields, but the resources are always in the last [...] group:

crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]
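
Note that the sed extraction in the script above already picks the last bracketed group, because the .*\[ part of the pattern is greedy, so these lines are handled like the earlier examples. An illustrative run:

$ echo 'some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]' | sed -E 's/.*\[(.*)\].*/\1/'
Printer DVX500 & Printer DV650V & Scanner SON2000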

Solution

  • You're using the wrong tool. A shell (e.g. bash) is a tool to create/destroy files and processes and to sequence calls to other tools. It's specifically NOT a tool to manipulate text. The mandatory POSIX (i.e. available on all Unix boxes) tool to manipulate text is awk, so you should be using a single awk script instead of bash calling sed and other tools in a loop, which is a famously inefficient approach and usually results in lengthy, complicated, fragile, non-portable code. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

    This, using any awk, will do what you appear to want orders of magnitude faster than your current approach, as well as being more concise and portable:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    input_file="$1"
    output_file="$2"
    
    awk '
        sub(/.*\[/,"") && sub(/].*/,"") && NF {
            gsub(/ *& */, ORS)
            print
        }
    ' "$input_file" | sort -u > "$output_file"
    

    You'd run the above the same way as the script in your question:

    $ ./tst.sh file outfile
    

    and the output would be:

    $ cat outfile
    Printer DV650V
    Printer DVX500
    Scanner SON2000
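
    For reference, here is the same awk body with comments added (behaviour unchanged). Because .*\[ is greedy, the first sub() strips everything up to and including the last "[", so the extra bracketed fields from the edited examples are ignored, and NF skips the empty "[]" case:

    awk '
        # remove everything up to and including the last "[" (the regex is greedy),
        # then everything from the first remaining "]" to the end of the line;
        # NF is only true if something is left, so empty "[]" lines are skipped
        sub(/.*\[/,"") && sub(/].*/,"") && NF {
            # turn each "&" separator (and any surrounding spaces) into a newline
            # so every resource prints on its own line
            gsub(/ *& */, ORS)
            print
        }
    ' "$input_file" | sort -u > "$output_file"

    The sort -u at the end then does the sorting and de-duplication that the associative array handled in the original script.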