Bash. The quickest and efficient array search
Asked Answered
S

5

8

I need to do an array search so many time in a bash process. I need to know what is the most quick and efficient way to do it. I know how to do it. The point of the question os how to do it in the fastest way. Now, I'm doing in this way:

#!/bin/bash

array_test=("text1" "text2" "text3" "text4" "text5")
text_to_search="text4"

START=$(date +%s.%N)

for item in "${array_test[@]}"; do
    if [ ${item} = "${text_to_search}" ]; then
        echo "found!!"
        break
    fi
done

END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
echo $DIFF

With this code we can measure the time.

Imagine we have 300 items or more in array. Is there a quicker way? I need to increase performance. Thank you.

EDIT I'm using bash 4.2 . The real array has line breaks:

array_test=(
"text1"
"text2"
"text3"
"text4"
"text5"
)
Squall answered 18/2, 2017 at 10:52 Comment(5)
It only takes 30 seconds with 10,000,000 elements!Odometer
Can your array be kept sorted? Can your searches be performed in sorted order? Have you tried associative array to actually let Bash do the retrieval?Bushwhack
Yeah, the array is sorted alphabetically and it works well now, but I'm planning to do that search on every function in my script because of debug mode stuff. I think if I do that could be bad for performance. I'll check the proposals. Thanks.Squall
You can get a decent speed up by using a C-style for loop to iterate over the indices (assuming the array isn't too sparse): for((i=0; i<=${#array[@]}; i++)); do item=${array[i]}; ...; done. This is because the shell does not need to fully expand the array before starting the search.Escallop
Newlines in your array declaration do not change the content of the array (contrary to newlines inside the elements themselves, which is a very different thing).Bushwhack
U
6

Fastest Version (for large arrays)

Use grep -qFx in array[*].
-q suppresses output and exists on the first match.
-F searches fixed strings instead of regexes.
-x searches whole lines instead of substrings. This ensures that the search string b doesn't match the element abc.

if ( IFS=$'\n'; echo "${array[*]}" ) | grep -qFx "$text_to_search"; then
    echo "found!"
fi

Assumption: Neither the array nor the text to search contain linebreaks.

It's also possible to search texts with linebreaks, as long as there is a known unused character which can be stored inside a bash variable. I found \x1E (the ASCII control character »Record Separator«) to work out quite well. The adapted version looks as follows:

export d=$'\x1E' # unused character, here "Record Seperator"
if ( IFS="$d"; echo "$d${array[*]}$d" ) | grep -qF "$d$text_to_search$d"; then
    echo "found!"
fi

In theory you might be able to speed this up even further by using multiple parallel grep on slices of the array. However, this is so fast (see results below) that you probably won't ever encounter a scenario where the parallel search pays off.

Testing

I used an array of size 1'000'000, which was generated as follows:

size=$((10 ** 6))
array_test=($(seq -f 'text%.0f' 1 "$size"))

(By the way: using {1..1000000} was magnitudes slower than seq)

The search pattern was the last entry of said array

text_to_search="text$size"

Three search methods were tested

  1. Your method using a for loop
  2. printf %s\\n "array[@]" | grep -qFx
  3. (IFS=$'\n'; echo "array[*]";) | grep -qFx

With the following results:

  1. 65.5 seconds
  2. 59.3 seconds
  3. 00.4 seconds (yes, that's a zero before the decimal point)
Ultramarine answered 18/2, 2017 at 11:49 Comment(6)
Are you sure the last test is not finishing this quickly because ${array_test[*]} expands to one single string, and grep returns on the first match instead and has no other line to match against? I am pretty sure this is what is going on based on a quick test I did.Bushwhack
I will state my comment differently : your fast solution works in cases where you actually do not need to match against array elements, and matching against a single string containing all element values will do the job. In some cases this may be possible, but you have to really know what you are doing, and as soon as your array changes you have to re-do that big expansion (to build the big string), so the performance advantage may be lost if the array is being written to and read from a lot.Bushwhack
@Bushwhack You are right, the described method was (as already pointed out in the answer) making strong assumptions, but worked out in the described scenario. The answer was updated and is now safer and twice (!) as fast (probably due to the pipe | instead of <<<).Ultramarine
I like your answer, but the real array has line breaks. I edited my question.Squall
what about if you want the index too?Siva
@Siva When you don't have line breaks inside the array, you could let grep output the line number with grep -Fxnm1 instead of grep -Fxq. Then extract that line number from grep's output and subtract 1 to get the index.Ultramarine
B
4

If you deeply care about performance, perhaps (a) Bash is not the best tool and (b) you should try different approaches and test them on your data. This being said, maybe associative arrays could help you.

Try this :

#!/bin/bash

declare -A array_test()
array_test=(["text1"]="" ["text2"]="" ["text3"]="" ["text4"]="" ["text5"]="")
text_to_search="text4"

if
  [[ ${array_test[$text_to_search]+found} ]]
then
  echo "Found!"
fi

Please note in this case I build the associative arrays with keys but empty values (no real use in setting the value to the same thing as the key and eating up more memory).

Another approach would be to sort your array, and use some kind of binary search. That would, of course, involve more code, and if Bash associative arrays are implemented efficiently, is likely to be slower. But, again, nothing beats testing on actual data to validate performance hypotheses.

If you have an associative arrays with information in the keys, you can use an expansion to use them the same way you would use values :

for key in "${!array[@]}"
do
  do_something_with_the key
done

Also, you can build your array with a loop, which would be useful if you read your elements from a file or from the output of a command. As an example :

declare -A array_test=()
while IFS= read -r key
do
  array_test[$key]=
done < <(command_with_output)

It is useful to note that an element set to a null (empty) value is not the same as an unset element. You can easily see that with the following expansion :

declare -A array_test=()
array_test[existing_key]=
echo ${array_test[existing_key]+found} # Echoes "found"
echo ${array_test[missing_key]+found}  # Echoes nothing

The "${var+value}" expansion use useful because it expands to nothing if the variable is unset, and to value if it is set. It does not produce an error when using set -u that catches attempts to expand unset variables.

Bushwhack answered 18/2, 2017 at 11:24 Comment(5)
You can create your associative array with a loop declare -A array; for i in ${array_test[@]}; do array[$i]=1; done.Benedetto
@Benedetto I will add this in my answer, it may not be obvious to someone not familiar with associative arrays.Bushwhack
Very important to quote the array expansion "${array_test[@]}" -- elements containing whitespace will not be treated as a single unit otherwise.Plunger
In bash 4.3 and later, the -v operator works with array indices: if [[ -v array_test[$text_to_search] ]]Escallop
I edited the question, I'm using bash 4.2 Thank you!!Squall
P
2

If none of your array elements contain whitespace, you can use pattern matching

array_test=("text1" "text2" "text3" "text4" "text5")
text_to_search="text4"

[[ " ${array_test[*]} " == *"  $text_to_search "*]] && echo found || echo not found

The quotes and spaces are very important there.

In bash, within [[ ]] double brackets, the == operator is a pattern matching operator, not just string equality.

Plunger answered 18/2, 2017 at 13:44 Comment(2)
Good idea, but you mixed things up. The text to search (not the expanded array) should be surrounded by *, and the search pattern has to be on the right hand side of the ==. Correct version: [[ " ${array[*]} " == *\ $text\ * ]].Ultramarine
Your idea can also be used with whitespaces, when changing IFS.Ultramarine
B
1

A concrete example based on useful @Fred answer using associative array:

script.sh

#!/bin/bash

read -a array_test <<< $(seq 1 10000)
text_to_search="9999"

function from_associative_array {
  declare -A array
  for constant in ${array_test[@]}
  do
      array[$constant]=1
  done
  [[ ${array[$text_to_search]} ]] && echo  "Found in associative array";
}

function from_indexed_array {
  for item in "${array_test[@]}"; do
      if [ ${item} = "${text_to_search}" ]; then
          echo "Found in indexed array"
          break
      fi
  done
}

time from_indexed_array
time from_associative_array

$ bash script.sh
Found in indexed array

real    0m0.611s
user    0m0.604s
sys     0m0.004s

Found in associative array

real    0m0.297s
user    0m0.296s
sys     0m0.000s
Benedetto answered 18/2, 2017 at 12:11 Comment(6)
I do not think your tests are doing the same thing. The first test assigns 10000 elements and searches for one, but the second test retrieves 10000 elements. You should build the arrays first, and time only the lookup part. You could benchmark the creation time of both arrays (associative should be slower), but it is not a "fair" comparison to benchmark creation+lookup for one option, and look-up only for the other one.Bushwhack
By the way, indexed array search (done the way it is coded here) is O(n) in complexity, whereas associative arrays should be O(log n), meaning that the bigger the data set, the bigger the difference becomes. But with a small array, associative arrays probably perform worse than indexed arrays. Generally speaking, since performance impact at low data volume is often not a concern, the better "big data set" algorithm is probably the best bet, especially if it can be coded simply (i.e. use language constructs or standard libraries).Bushwhack
@Bushwhack The associative array creation is part of the first test because the initial array is an indexed one. In the OP code, all the load is on the test in loop and just setting all values to 1 to create the associative array is much more faster. That said you're right, looping over indexed array will be faster with small arrays. but with say 100 length array, the gain will be 0.001ms. Who cares...Benedetto
@Bushwhack And in case you mistinterpreted the results, reread my answer. Associative array test is faster here.Benedetto
I have to assume the OP would build the array as an associative array from the start, but that ultimately is not my call. Please note that you can assign a null value (instead of 1) to create the elements, and it will work just as well. I understood the AA to be faster from your test, I was just pointing out that part of the test was doing more than the other one.Bushwhack
@Bushwhack Just published a way of checking for presence without loop ...Aryn
A
1

Is In array test and some other quick arrays functions

Comming something late on this post, I would add here some efficients tricks I use on small array (upto some megabytes anyway).

1. IsInArray as requested:

is_in_array() {
    local IFS=$'\1' _value="$1"
    shift
    case "$IFS$*$IFS" in
        *"$IFS$_value$IFS"* ) return 0;;
        *) return 1;;
    esac
}

As this function don't use fork nor loop, this is the quickest way:

array_test=("text1" text2 text3 "text4" text5 "Hello world." $'Text with a\nnewline.')

for val in text4 text12 'Hello world.' $'Some\nstring' $'Text with a\nnewline.';do  
    if is_in_array "$val" "${array_test[@]}"; then
        echo ${val@Q} is present in \$array_test.
    else
        echo ${val@Q} is not present in \$array_test.
    fi
done
'text4' is present in $array_test.
'text12' is not present in $array_test.
'Hello world.' is present in $array_test.
$'Some\nstring' is not present in $array_test.
$'Text with a\nnewline.' is present in $array_test.

This use \1 as IFS separator in order to be able to work with strings that contain spaces or even newlines.

2. Array length

to show the length of an array:

echo ${#array_test[@]}
7

will show the number of element in array,

2.1 Array length in chars

echo ${#array_test}
5

will show the length of the 1st element

echo ${#array_test[-1]}
20

will show the length of the last element

: "${array_test[*]}"; echo ${#_}
63

will show the total length of array,

2.2 Array length in bytes (see UTF-8 string length)

As unicode characters could hold more than one byte, this could be usefull:

printf -v _ '%s%n' "${array_test[0]}" val; echo $val
5

Yes, this is not really easy to read, i little counter-intuitive, but this is the quickest ways (tried to play with local LANG=C in a function but this goes slower).

  • printf -v tell to create a variable _ which is a garbage
  • %n is a rare printf option that tell printf to store current output pointer to next argument which have to be a variable name.
printf -v _ '%s%n' "$array_test" val;echo $val
5

will show the length of the 1st element

printf -v _ '%s%n' "${array_test[-1]}" val;echo $val
20

will show the length of the last element, and

printf -v _ '%s%n' "${array_test[*]}" val; echo $val
63
array_test+=('déjà vu')
printf -v _ '%s%n' "${array_test[*]}" val
: "${array_test[*]}"
echo Whole array is $val bytes but ${#_} character length.
Whole array is 73 bytes but 71 character length.

will show the total length of array,

3. dump reusable variable

I sometime use this for transporting arrays:

printf -v string '[%%d]=%q ' "${array_test[@]}"
printf -v string "$string" "${!array_test[@]}"
echo "$string"
[0]=text1 [1]=text2 [2]=text3 [3]=text4 [4]=text5 [5]=Hello\ world. [6]=$'Text w
ith a
newline.' [7]=déjà\ vu 

which could be reused:

declare -a "newVar=($string)"
declare -p newVar
declare -a newVar=([0]="text1" [1]="text2" [2]="text3" [3]="text4" [4]="text5" [
5]="Hello world." [6]=$'Text with a\nnewline.' [7]="déjà vu")
Aryn answered 7/4 at 19:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.