Tags: arrays, bash, performance, sed

Optimising the performance of nested loops for text extraction/manipulation


I receive operation logs, one per line, in an $input_file: each line describes which resources were used (printers, scanners, etc.). Below are examples of valid lines:

crap I don't care [] # This should not happen, but sometimes it does
crap I don't care [Printer DVX500]
crap I really do not care [Printer DV650V & Scanner SON2000]
some other crap [Printer DVX500 & Printer DV650V & Scanner SON2000]

I have no control over how these lines are produced. I only know that the resources are supposed to be described at the end of each line, between square brackets ([...]). Sometimes, as in the first line above, the bracketed part is empty; sometimes several resources are used, in which case they are separated by $SEP ($SEP=="&" in the examples).

For usage purposes, I need to produce a sorted list of (unique) resources in an $output_file. The following code works, and follows these steps:

  1. Extract the content between [] at the end of the line into $content
  2. If $content is not empty, split it on $SEP into $parts
  3. For each element in $parts, trim leading/trailing whitespace
  4. Add each trimmed part to an associative array (which naturally keeps only unique resources)
  5. Write everything into $output_file

With the line examples above, the result should be the following:

Printer DV650V
Printer DVX500
Scanner SON2000

Here is the script I am using, which is functionally correct (at least so far):

# Get the input and output file paths from the command line arguments
input_file="$1"
output_file="$2"
# Check if the input file exists
if [[ ! -f "$input_file" ]]; then
    echo "Error: Input file '$input_file' not found!"
    exit 1
fi

# Use a hash table (associative array) to avoid duplicates directly in memory
declare -A unique_lines
linenumber=0
# Process the file line by line
while IFS= read -r line; do
    linenumber=$((linenumber + 1))
    if (( linenumber % 100 == 0 )); then
        echo "line=$linenumber"
    fi
    # Use sed to extract the content between the first "[" and "]"
    content=$(echo "$line" | sed -E 's/.*\[(.*)\].*/\1/')

    # If content is not empty, process the extracted part
    if [[ -n "$content" ]]; then
        # Split the content by "&" and process each part
        IFS="&" read -ra parts <<< "$content"
        for part in "${parts[@]}"; do
            # Trim leading/trailing whitespace from each part
            trimmed_part=$(echo "$part" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
            
            # Store the trimmed part in the associative array (ensures uniqueness)
            unique_lines["$trimmed_part"]=1
        done
    fi
done < "$input_file"

# Write the sorted, unique results to the output file
for part in "${!unique_lines[@]}"; do
    echo "$part"
done | sort > "$output_file" # supposed to be unique already, due to how $unique_lines is fed

echo "Processing completed. Output saved to $output_file" 

The issue is performance: with an $input_file of 5,000-10,000 lines, the previous script takes between 12 and 30 minutes. I tried several things, but could not figure out a way of really optimising it.

The culprit seems to be the second read followed by the for loop, but I don't see an easy way to do Step #2 otherwise.
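
To make step 3 concrete: the trimming could in principle be done with parameter expansion instead of a sed call per part (rough, untested sketch below, assuming the $part variable from the loop above), but that only saves one external process per iteration; the per-line sed of step 1 and the loop itself remain.

# Hypothetical drop-in for the per-part trim (step 3), avoiding one sed process:
trimmed_part="${part#"${part%%[![:space:]]*}"}"                  # strip leading whitespace
trimmed_part="${trimmed_part%"${trimmed_part##*[![:space:]]}"}"  # strip trailing whitespace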

Any suggestion on how to optimise it, even if the general algorithm is changed, would be welcome.

EDIT: completing the requirements list w.r.t. the brilliant commenters' remarks :-)

Thanks for the thorough analysis of my use case :-). For the sake of completeness, lines may contain additional bracketed fields, but the resources are always in the last [...] group:

crap I don't care [CRAP] [] # This should not happen, but sometimes it does
crap I don't care [CRAPPYCRAP] [Printer DVX500]
crap I really do not care [IRREL] [CRAP!] [Printer DV650V & Scanner SON2000]
some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]
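
Note that the sed extraction in the script above already picks the last bracketed group, because the .*\[ part of the pattern is greedy, so these lines are handled like the earlier examples. An illustrative run:

$ echo 'some other crap [!!!] [CRAPPPYYYY] [NOTHERE] [Printer DVX500 & Printer DV650V & Scanner SON2000]' | sed -E 's/.*\[(.*)\].*/\1/'
Printer DVX500 & Printer DV650V & Scanner SON2000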

Solution

  • You're using the wrong tool. A shell (e.g. bash) is a tool to create/destroy files and processes and to sequence calls to other tools. It's specifically NOT a tool to manipulate text. The mandatory POSIX (i.e. available on all Unix boxes) tool to manipulate text is awk, so you should be using a single awk script instead of bash calling sed and other tools in a loop, which is a famously inefficient approach and usually results in lengthy, complicated, fragile, non-portable code. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

    This, using any awk, will do what you appear to want orders of magnitude faster than your current approach, as well as being more concise and portable:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    input_file="$1"
    output_file="$2"
    
    awk '
        sub(/.*\[/,"") && sub(/].*/,"") && NF {
            gsub(/ *& */, ORS)
            print
        }
    ' "$input_file" | sort -u > "$output_file"
    

    You'd run the above the same way as the script in your question:

    $ ./tst.sh file outfile
    

    and the output would be:

    $ cat outfile
    Printer DV650V
    Printer DVX500
    Scanner SON2000
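
    For reference, here is the same awk body with comments added (behaviour unchanged). Because .*\[ is greedy, the first sub() strips everything up to and including the last "[", so the extra bracketed fields from the edited examples are ignored, and NF skips the empty "[]" case:

    awk '
        # remove everything up to and including the last "[" (the regex is greedy),
        # then everything from the first remaining "]" to the end of the line;
        # NF is only true if something is left, so empty "[]" lines are skipped
        sub(/.*\[/,"") && sub(/].*/,"") && NF {
            # turn each "&" separator (and any surrounding spaces) into a newline
            # so every resource prints on its own line
            gsub(/ *& */, ORS)
            print
        }
    ' "$input_file" | sort -u > "$output_file"

    The sort -u at the end then does the sorting and de-duplication that the associative array handled in the original script.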