arrays, bash, performance, search

Bash: the quickest and most efficient array search


I need to do an array search many times in a bash process, so I need to know the quickest and most efficient way to do it. I know how to do it; the point of the question is how to do it the fastest way. Right now, I'm doing it like this:

#!/bin/bash

array_test=("text1" "text2" "text3" "text4" "text5")
text_to_search="text4"

START=$(date +%s.%N)

for item in "${array_test[@]}"; do
    if [ "${item}" = "${text_to_search}" ]; then
        echo "found!!"
        break
    fi
done

END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
echo $DIFF

With this code we can measure the time.

Imagine we have 300 items or more in the array. Is there a quicker way? I need to increase performance. Thank you.

EDIT: I'm using bash 4.2. The real array is defined across multiple lines:

array_test=(
"text1"
"text2"
"text3"
"text4"
"text5"
)

Solution

  • Fastest Version (for large arrays)

    Use grep -qFx on "${array[*]}".
    -q suppresses output and exits on the first match.
    -F searches for fixed strings instead of regexes.
    -x matches whole lines instead of substrings. This ensures that the search string b doesn't match the element abc.

    if ( IFS=$'\n'; echo "${array[*]}" ) | grep -qFx "$text_to_search"; then
        echo "found!"
    fi
    

    Assumption: neither the array elements nor the text to search contain line breaks.

    It's also possible to search texts with line breaks, as long as there is a known unused character which can be stored inside a bash variable. I found \x1E (the ASCII control character "Record Separator") to work quite well. The adapted version looks as follows:

    export d=$'\x1E' # unused character, here "Record Separator"
    if ( IFS="$d"; echo "$d${array[*]}$d" ) | grep -qF "$d$text_to_search$d"; then
        echo "found!"
    fi
    

    In theory you might be able to speed this up even further by running multiple grep processes in parallel on slices of the array; a rough sketch follows below. However, a single grep is already so fast (see results below) that you probably won't ever encounter a scenario where the parallel search pays off.
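
    A minimal sketch of that parallel idea, purely illustrative: split the array into slices and search each slice with its own background grep. The worker count and the slicing arithmetic are my assumptions, not something measured in the results below.

    workers=4                                     # assumed worker count
    size=${#array[@]}
    chunk=$(( (size + workers - 1) / workers ))   # ceiling division

    pids=()
    for (( i = 0; i < size; i += chunk )); do
        # search one slice of the array in a background pipeline
        ( IFS=$'\n'; echo "${array[*]:i:chunk}" ) | grep -qFx "$text_to_search" &
        pids+=($!)
    done

    found=0
    for pid in "${pids[@]}"; do
        wait "$pid" && found=1                    # grep exits 0 on a match
    done
    (( found )) && echo "found!"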

    Testing

    I used an array of size 1,000,000, which was generated as follows:

    size=$((10 ** 6))
    array_test=($(seq -f 'text%.0f' 1 "$size"))
    

    (By the way: using {1..1000000} was orders of magnitude slower than seq.)

    The search pattern was the last entry of said array:

    text_to_search="text$size"
    

    Three search methods were tested:

    1. Your method using a for loop
    2. printf '%s\n' "${array_test[@]}" | grep -qFx
    3. (IFS=$'\n'; echo "${array_test[*]}") | grep -qFx

    With the following results:

    1. 65.5 seconds
    2. 59.3 seconds
    3. 00.4 seconds (yes, that's a zero before the decimal point)
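
    For reference, here is a self-contained harness along these lines that reproduces the comparison. The time_it helper and the method_* function names are mine, not part of the original test, and absolute timings will of course vary with hardware and bash version.

    #!/bin/bash

    size=$((10 ** 6))
    array_test=($(seq -f 'text%.0f' 1 "$size"))
    text_to_search="text$size"

    time_it() {                  # run "$@" and print the elapsed seconds
        local start end
        start=$(date +%s.%N)
        "$@"
        end=$(date +%s.%N)
        echo "$1: $(echo "$end - $start" | bc)s"
    }

    method_1_loop() {            # the for loop from the question
        local item
        for item in "${array_test[@]}"; do
            [ "$item" = "$text_to_search" ] && return 0
        done
        return 1
    }

    method_2_printf() {
        printf '%s\n' "${array_test[@]}" | grep -qFx "$text_to_search"
    }

    method_3_echo() {
        ( IFS=$'\n'; echo "${array_test[*]}" ) | grep -qFx "$text_to_search"
    }

    time_it method_1_loop
    time_it method_2_printf
    time_it method_3_echo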