Compare/Difference of two arrays in Bash

Is it possible to take the difference of two arrays in Bash? What is a good way to do it?

Code:

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" ) 

Array3 =diff(Array1, Array2)

Array3 ideally should be :
Array3=( "key7" "key8" "key9" "key10" )
Patino answered 22/2, 2010 at 17:31 Comment(1)
Having skimmed over the solutions, I decided not to use arrays in cases where I've got to diff them.Pontine
S
49

If you strictly want Array1 - Array2, then

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" )

Array3=()
for i in "${Array1[@]}"; do
    skip=
    for j in "${Array2[@]}"; do
        [[ $i == $j ]] && { skip=1; break; }
    done
    [[ -n $skip ]] || Array3+=("$i")
done
declare -p Array3

Runtime might be improved with associative arrays, but I personally wouldn't bother. If you're manipulating enough data for that to matter, shell is the wrong tool.
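As a rough sketch of the associative-array idea mentioned above (not part of the original answer): the elements of Array2 become keys of a lookup table, so each element of Array1 is checked with one lookup instead of an inner loop. The name in_array2 is made up for illustration, the script needs bash 4 or later for declare -A, and it assumes no element is the empty string (bash rejects empty associative keys).

```shell
#!/usr/bin/env bash
# Sketch: Array1 - Array2 via a lookup table (needs bash 4+ for declare -A).
Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" )

declare -A in_array2=()      # hypothetical name: keys are the elements of Array2
for j in "${Array2[@]}"; do
    in_array2["$j"]=1
done

Array3=()
for i in "${Array1[@]}"; do
    # Keep $i only when it is not a key of the lookup table.
    [[ -n ${in_array2["$i"]+set} ]] || Array3+=("$i")
done
declare -p Array3
```

Like the nested loop, this keeps the order of Array1 and copes with embedded whitespace.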


For a symmetric difference like Dennis's answer, existing tools like comm work, as long as we massage the input and output a bit (since they work on line-based files, not shell variables).

Here, we tell the shell to use newlines to join the array into a single string, and discard tabs when reading lines from comm back into an array.

$ oldIFS=$IFS IFS=$'\n\t'
$ Array3=($(comm -3 <(echo "${Array1[*]}") <(echo "${Array2[*]}")))
comm: file 1 is not in sorted order
$ IFS=$oldIFS
$ declare -p Array3
declare -a Array3='([0]="key7" [1]="key8" [2]="key9" [3]="key10")'

It complains because, by lexicographic sorting, key1 < … < key9 > key10. But since both input arrays are sorted similarly, it's fine to ignore that warning. You can use --nocheck-order to get rid of the warning, or add a | sort -u inside the <(…) process substitution if you can't guarantee the order and uniqueness of the input arrays.
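A possible way to write the sort -u variant is sketched below. It swaps the IFS handling for printf and mapfile (bash 4 or later), which is a substitution rather than the method shown above, and it assumes elements contain no tabs or newlines, because tr removes every tab. LC_ALL=C is set only to make the sort order predictable.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # assumption: byte-wise collation, so sort and comm agree on order

Array1=( "key1" "key2" "key3" "key7" "key10" )
Array2=( "key1" "key2" "key3" "key11" )

# sort -u gives comm the sorted, duplicate-free input it expects;
# tr strips the tab that comm puts in front of column-2 lines.
mapfile -t Array3 < <(
    comm -3 <(printf '%s\n' "${Array1[@]}" | sort -u) \
            <(printf '%s\n' "${Array2[@]}" | sort -u) | tr -d '\t'
)
declare -p Array3
```

The result is the symmetric difference in sorted order, so key11 (only in Array2) appears alongside key10 and key7.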

Sultry answered 23/2, 2010 at 1:0 Comment(7)
+1 for the 1st snippet, which also works with elements with embedded whitespace. The 2nd snippet works with elements with embedded spaces only. You can do away with saving and restoring $IFS if you simply prepend IFS=$'\n\t' directly to the Array3=... command.Hewes
@Hewes The command you're suggesting: IFS=$'\n\t' Array3=( ... ) will set IFS globally. Try it!Equivocal
@gniourf_gniourf: Thanks for catching that! Because my fallacy may be seductive to others too, I'll leave my original comment and explain here: While it's a common and useful idiom to prepend an ad-hoc, command-local variable assignment to a simple command, it does NOT work here, because my command is composed entirely of assignments. No command name (external executable, builtin) follows the assignments, which makes all of them global (in the context of the current shell); see man bash, section SIMPLE COMMAND EXPANSION.Hewes
Can you give an example how to do this in a C-shell (csh)?Uredium
@Stefan: Ugh, csh should never be used. set Array3 = ( ) foreach i ( $Array1 ) set skip = 0 foreach j ( $Array2 ) if ( "$i" == "$j" ) then set skip = 1 break endif end if ( "$skip" == 0 ) then set Array3 = ( $Array3:q "$i" ) endif end All the control statements need to be on their own lines.Sultry
would this work if array 1 had an extra value thats not inside array2? for me if its not in array2 then its not needed anyway, so i need it to ignore that case. its breaking at the moment and im wondering if thats why (edit: seems like it does actually, thank you)Kilar
1st snip works great with spaces! Though I had to put $j into double quotes. I had line aws ec2 --profile=profile_name describe-instances --instance-id i-0a8d8b5d1b8b9xxxx --query "Reservations[].Instances[].EnaSupport" which was shown as a difference, even though it existed in both arrays 1 and 2. I checked with ShellCheck and it suggested adding the double quotes around "$j".Rapids
P
210
echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u

Output

key10
key7
key8
key9

You can add sorting if you need

Presumption answered 27/1, 2015 at 0:55 Comment(14)
He came in, he bossed it and he left. For anyone wondering how to save the value to an array, try this: Array3=(`echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u `)Semirigid
This is what shell programming is about. Keep it simple, use the tools available. If you want to implement the other solutions, you can, but you may have an easier time using a more robust language.Sardou
Incredible even to this day.Feinberg
Brilliant. Additional note for those who need the asymmetrical difference. You can get it by outputting the duplicates of the symmetrical difference and the Array you are interested in. IE if you want the values present in Array2, but not in Array1. echo ${Array2[@]} ${Array3[@]} | tr ' ' '\n' | sort | uniq -D | uniq, where Array3 is the output of the above. Additionally if you remove the array notations and assume the variables are space separated strings, this approach is posix shell compliant.Diannediannne
Awesome solution. Slight improvement if array elements might contain spaces: printf '%s\n' "${Array1[@]}" "${Array2[@]}" | sort | uniq -uTimberlake
To simplify @Arwyn's suggestion, you can add the ignored array twice to ensure only the differences in Array2 are shown. echo ${Array1[@]} ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -uOstensive
one small comment to @ChristopherMarkieta's answer: the question was to calculate Array1-Array2, in this case it should be echo ${Array1[@]} ${Array2[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u (Array2 two times). And thanks for great addition to a great answer.Fleda
Adding to the expansion of @misberner: If your array elements contain newlines and whitespace: printf "%s\0" "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu . (The output will be null-delimited and needs to be processed accordingly)Varhol
I have not been able to get any of this thread's suggestions to work in my function. Lend a hand if you are able? #63314941Sectarian
Found an edge case where this solution does not work - if both arrays are empty, doing echo ${Array1[@]} ${Array2[@]} | tr ' ' '\n' | sort | uniq -u | wc -l prints 1, not 0Kermitkermy
Actually, the answer as given is computing the symmetric difference. Things that are in both sets will be removed, things only in one will be kept: $ echo {1..4} {2..6} | tr ' ' '\n' | sort | uniq -u | tr '\n' ' ' gives 1 5 6. Including the second set twice makes a regular difference: $ echo {1..4} {2..6} {2..6} | tr ' ' '\n' | sort | uniq -u | tr '\n' ' ' gives 1.Oceanus
@ilya-bystrov you should expand your answer with the jspencer explanationCentigram
This is buggy; look at what happens if either array contains internal duplicates. (And, as others have pointed out, there are the quoting problems).Undefined
Do not do this! It's exposing the variables to the shell for interpretation and creating input to sort that isn't a valid text file so YMMV.Mcdaniel
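For the null-delimited variant suggested in the comments, one way to read the output back into an array is sketched here. It relies on GNU sort -z and uniq -z and on mapfile -d '' (bash 4.4 or later). Like the original one-liner, it computes the symmetric difference and assumes neither array contains internal duplicates.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # assumption: byte-wise collation for a predictable order

Array1=( "key 1" "key2" $'key\n3' )
Array2=( "key2" )

# NUL-terminate each element so spaces and newlines inside elements survive;
# mapfile -d '' splits the result on NUL bytes again.
mapfile -d '' -t Array3 < <(
    printf '%s\0' "${Array1[@]}" "${Array2[@]}" | sort -z | uniq -zu
)
declare -p Array3
```

Array3 ends up with the two elements that occur only once, including the one with an embedded newline.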
Z
15

Anytime a question pops up dealing with unique values that may not be sorted, my mind immediately goes to awk. Here is my take on it.

Code

#!/bin/bash

diff(){
  awk 'BEGIN{RS=ORS=" "}
       {NR==FNR?a[$0]++:a[$0]--}
       END{for(k in a)if(a[k])print k}' <(echo -n "${!1}") <(echo -n "${!2}")
}

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" )
Array3=($(diff Array1[@] Array2[@]))
echo ${Array3[@]}

Output

$ ./diffArray.sh
key10 key7 key8 key9

Note: Like other answers given, if there are duplicate keys in an array they will only be reported once; this may or may not be the behavior you are looking for. The awk code to handle that is messier.

Zeal answered 24/2, 2010 at 22:10 Comment(5)
To summarize the behavior and constraints: (a) performs a symmetrical difference: outputs a single array with elements unique to either input array (which with the OP's sample data happens to be the same as only outputting elements unique to the first array), (b) only works with elements that have no embedded whitespace (which satisfies the OP's requirements), and (c) the order of elements in the output array has NO guaranteed relationship to the order of input elements, due to awk's unconditional use of associative arrays - as evidenced by the sample output.Hewes
Also, this answer uses a clever-and-noteworthy-but-baffling-if-unexplained workaround for bash's lack of support for passing arrays as arguments: Array1[@] and Array2[@] are passed as strings - the respective array names plus the all-subscripts suffix [@] - to shell function diff() (as arguments $1 and $2, as usual). The shell function then uses bash's variable indirection (${!...}) to indirectly refer to all elements of the original arrays (${!1} and ${!2}).Hewes
how to transform a string "a b C" into an array?Unsupportable
found an error: elements in Array2 not in Array1 will show in diff()Unsupportable
This solution doesn't work for array elements containing whitespace. The example script can fail in multiple ways due to unquoted strings being GLOB expanded by the shell. It fails if you do touch Array1@ before you run the script, because the strings Array1[@] and Array2[@] are used as unquoted shell GLOB patterns. It fails if one array contains the element * because that unquoted GLOB pattern matches all the files in the current directory.Editorialize
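A variant that avoids the points raised in these comments might look like the sketch below. It is not the answer's code: it feeds awk one element per line, quotes every expansion, reads the result with mapfile (bash 4 or later), and reports only the elements of Array1 that are missing from Array2, in their original order. It assumes elements contain no newlines.

```shell
#!/usr/bin/env bash
Array1=( "key1" "key 2" "key3" "*" )
Array2=( "key1" "key3" "key11" )

# First file: remember every element of Array2. Second file: print the
# elements of Array1 that were not remembered, in their original order.
mapfile -t Array3 < <(
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
        <(printf '%s\n' "${Array2[@]}") \
        <(printf '%s\n' "${Array1[@]}")
)
declare -p Array3
```

Because nothing is expanded unquoted, the "*" element and the element with a space pass through untouched, and key11 is not reported.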
G
11

Having ARR1 and ARR2 as arguments, use comm to do the job and mapfile to put it back into RESULT array:

ARR1=("key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10")
ARR2=("key1" "key2" "key3" "key4" "key5" "key6")

mapfile -t RESULT < \
    <(comm -23 \
        <(IFS=$'\n'; echo "${ARR1[*]}" | sort) \
        <(IFS=$'\n'; echo "${ARR2[*]}" | sort) \
    )

echo "${RESULT[@]}" # outputs "key10 key7 key8 key9"

Note that the result may not preserve the source order.

Bonus aka "that's what you are here for":

function array_diff {
    eval local ARR1=\(\"\${$2[@]}\"\)
    eval local ARR2=\(\"\${$3[@]}\"\)
    local IFS=$'\n'
    mapfile -t $1 < <(comm -23 <(echo "${ARR1[*]}" | sort) <(echo "${ARR2[*]}" | sort))
}

# usage:
array_diff RESULT ARR1 ARR2
echo "${RESULT[@]}" # outputs "key10 key7 key8 key9"

Using those tricky evals is the least worst option among others dealing with array parameters passing in bash.

Also, take a look at comm manpage; based on this code it's very easy to implement, for example, array_intersect: just use -12 as comm options.
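A sketch of what such an array_intersect could look like follows. The function name comes from the text above, but the body is an assumption: it uses namerefs (local -n, bash 4.3 or later) instead of eval, with the made-up local names _out, _a and _b, and sort -u so duplicates are reported once.

```shell
#!/usr/bin/env bash
# Hypothetical helper: store in the array named $1 the elements that the
# arrays named $2 and $3 have in common.
array_intersect() {
    local -n _out=$1 _a=$2 _b=$3    # namerefs need bash 4.3+
    mapfile -t _out < <(
        comm -12 <(printf '%s\n' "${_a[@]}" | sort -u) \
                 <(printf '%s\n' "${_b[@]}" | sort -u)
    )
}

ARR1=("key1" "key2" "key3" "key7")
ARR2=("key2" "key3" "key9")
RESULT=()
array_intersect RESULT ARR1 ARR2
echo "${RESULT[@]}"
```

The caller's array names must differ from the nameref names inside the function, otherwise bash reports a circular reference.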

Gervase answered 22/2, 2017 at 18:28 Comment(2)
Noting that mapfile needs bash 4Hierodule
@lantrix, mapfile can be easily replaced with while..read, and even totally cut if one doesn't need an array as a result. All the magic happens in comm.Gervase
M
9

In Bash 4:

declare -A temp    # associative array
for element in "${Array1[@]}" "${Array2[@]}"
do
    ((temp[$element]++))
done
for element in "${!temp[@]}"
do
    if (( ${temp[$element]} > 1 ))
    then
        unset "temp[$element]"
    fi
done
Array3=(${!temp[@]})    # retrieve the keys as values

Edit:

ephemient pointed out a potentially serious bug. If an element exists in one array with one or more duplicates and doesn't exist at all in the other array, it will be incorrectly removed from the list of unique values. The version below attempts to handle that situation.

declare -A temp1 temp2    # associative arrays
for element in "${Array1[@]}"
do
    ((temp1[$element]++))
done

for element in "${Array2[@]}"
do
    ((temp2[$element]++))
done

for element in "${!temp1[@]}"
do
    if (( ${temp1[$element]} >= 1 && ${temp2[$element]-0} >= 1 ))
    then
        unset "temp1[$element]" "temp2[$element]"
    fi
done
Array3=(${!temp1[@]} ${!temp2[@]})
Muslim answered 22/2, 2010 at 18:48 Comment(7)
That performs a symmetric difference, and assumes that the original arrays have no duplicates. So it's not what I would have thought of first, but it works well for OP's one example.Sultry
@ephemient: Right, the parallel would be to diff(1) which is also symmetric. Also, this script will work to find elements unique to any number of arrays simply by adding them to the list in the second line of the first version. I've added an edit which provides a version to handle duplicates in one array which don't appear in the other.Muslim
Thanks a lot. I was wondering whether there was an obvious way of doing it, as I am not aware of any command that would readily give the diff of two arrays. Thanks for your support and help. I modified the code to read the diff of two files, which was a little easier to program.Patino
Your 2nd snippet won't work, because > only works in (( ... )), not in [[ ... ]]; in the latter, it'd have to be -gt; however, since you probably meant >= rather than >, > should be replaced with -ge. To be explicit about what "symmetric" means in this context: the output is a single array containing values that are unique to either array.Hewes
@mklement0: > does work inside double square brackets, but lexically rather than numerically. Because of that, when comparing integers, double parentheses should be used - so you are correct in that regard. I've updated my answer accordingly.Muslim
@DennisWilliamson: Thanks for the clarification re > inside [[ ... ]] and thanks for updating your answer. However, I think it should be >= 1 rather than > 1. More crucially, though, the ((...)) conditional will break with non-existent (empty) temp2 elements, so you either need to use ${temp2[$element]-0} or stick with [[...]] and -ge.Hewes
Since I didn't hear back and didn't want a demonstrably broken answer to stand, I've taken the liberty to fix it. Please let me know if you feel the fix is incorrect or inappropriate. On a general note, this answer only works with array elements without embedded whitespace (which does satisfy the OP's requirements as stated).Hewes
C
8

It is possible to use regex too (based on another answer: Array intersection in bash):

list1=( 1 2 3 4   6 7 8 9 10 11 12)
list2=( 1 2 3   5 6   8 9    11 )

l2=" ${list2[*]} "                    # add framing blanks
for item in ${list1[@]}; do
  if ! [[ $l2 =~ " $item " ]] ; then    # use $item as regexp
    result+=($item)
  fi
done
echo "${result[@]}"

Result:

$ bash diff-arrays.sh 
4 7 10 12
Colosseum answered 15/2, 2016 at 13:6 Comment(1)
@philwalk, I haven't downvoted this personally, but it's doing a full iteration of the string for every item in the outer list. From a big-O perspective that's deeply inefficient -- it's going to get slow faster than it needs to as the content gets longer.Undefined
V
3
Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" )
Array3=( "key1" "key2" "key3" "key4" "key5" "key6" "key11" )
a1=${Array1[@]};a2=${Array2[@]}; a3=${Array3[@]}
diff(){
    a1="$1"
    a2="$2"
    awk -va1="$a1" -va2="$a2" '
     BEGIN{
       m= split(a1, A1," ")
       n= split(a2, t," ")
       for(i=1;i<=n;i++) { A2[t[i]] }
       for (i=1;i<=m;i++){
            if( ! (A1[i] in A2)  ){
                printf A1[i]" "
            }
        }
    }'
}
Array4=( $(diff "$a1" "$a2") )  #compare a1 against a2
echo "Array4: ${Array4[@]}"
Array4=( $(diff "$a3" "$a1") )  #compare a3 against a1
echo "Array4: ${Array4[@]}"

output

$ ./shell.sh
Array4: key7 key8 key9 key10
Array4: key11
Violante answered 23/2, 2010 at 1:24 Comment(0)
H
3

@ilya-bystrov's most upvoted answer calculates the symmetric difference of Array1 and Array2. Please note that this is not the same as removing items from Array1 that are also in Array2. @ilya-bystrov's solution instead concatenates both lists and removes non-unique values. This makes a big difference when Array2 includes items that are not in Array1: Array3 will then contain values that are in Array2 but not in Array1.

Here's a short solution (using sort and uniq) for removing items from Array1 that are also in Array2 (note the additional "key11" in Array2):

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" "key11" )
Array3=( $(printf "%s\n" "${Array1[@]}" "${Array2[@]}" "${Array2[@]}" | sort | uniq -u) )

Array3 will consist of "key7" "key8" "key9" "key10" and exclude the unexpected "key11" when trying to remove items from Array1.

If your array items might contain whitespaces, use mapfile to construct Array3 instead, as suggested by @David:

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" "key11" )
mapfile -t Array3 < <(printf "%s\n" "${Array1[@]}" "${Array2[@]}" "${Array2[@]}" | sort | uniq -u)

Please note: This assumes that all values in Array1 are unique. Otherwise they won't show up in Array3. If Array1 contains duplicate values, you must remove the duplicates first (note the duplicate "key10" in Array1; possibly use mapfile if your items contain whitespaces):

Array1=( "key1" "key2" "key3" "key4" "key5" "key6" "key7" "key8" "key9" "key10" "key10" )
Array2=( "key1" "key2" "key3" "key4" "key5" "key6" "key11" )
Array3=( $({ printf "%s\n" "${Array1[@]}" | sort -u; printf "%s\n" "${Array2[@]}" "${Array2[@]}"; } | sort | uniq -u) )

If you want duplicates in Array1 to carry over to Array3, go with @ephemient's accepted answer. The same is true if Array1 and Array2 are huge: this is a very inefficient solution for a lot of items, even though the cost is negligible for a few items (<100). If you need to process huge arrays, don't use Bash.

Housum answered 15/10, 2021 at 20:29 Comment(1)
Nice! You should use mapfile to build Array3 though, as the sloppy A=(..) expansion could break on whitespaces: a=(string\ 1 string\ 2 string\ 3); b=(string\ 2); mapfile -t c < <(printf "%s\n" "${a[@]}" "${b[@]}" "${b[@]}" | sort | uniq -u);Orchard
S
0

This code can stand in for diff:

echo ${test1[@]} ${test2[@]} | sed 's/ /\n/g' | sort | uniq -u

For the reverse result (the elements present in both arrays), use uniq -d.

Silver answered 8/1, 2022 at 20:6 Comment(0)
T
0

A one liner solution that can handle array entries containing spaces:

readarray -t array3 < <( grep --invert-match --fixed-strings --line-regexp --file=<( IFS=$'\n'; echo "${array2[*]}" ) <<<"$( IFS=$'\n'; echo "${array1[*]}" )" )
  1. Use IFS and echo to show array1 and array2 with one line per entry
  2. Use grep to display all entries of array1 that are not in array2 (--invert-match and --file= for entries to remove)
  3. Put the entries in the array3 using readarray, in order to handle one entry per line

This solution keeps array3 in the same order as array1.

Trophoblast answered 26/10, 2023 at 7:59 Comment(0)
