Algorithm to tell if two arrays have identical members
Asked Answered
C

16

31

What's the best algorithm for comparing two arrays to see if they have the same members?

Assume there are no duplicates, the members can be in any order, and that neither is sorted.

compare(
    [a, b, c, d],
    [b, a, d, c]
) ==> true

compare(
    [a, b, e],
    [a, b, c]
) ==> false

compare(
    [a, b, c],
    [a, b]
) ==> false
Cheyenne answered 29/10, 2008 at 1:33 Comment(2)
Why not kick it up a notch and see what happens if we can't sort? Obviously we need to be able to compare for equality.Steve
Sounds like you're asking how to compare setsDependency
L
20

Obvious answers would be:

  1. Sort both lists, then check each element to see if they're identical
  2. Add the items from one array to a hashtable, then iterate through the other array, checking that each item is in the hash
  3. nickf's iterative search algorithm

Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.
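Sketches of approaches 1 and 2 in Python (function names are illustrative, not from the question; assumes hashable, duplicate-free elements as the question states):

```python
def same_members_sorting(a, b):
    # Approach 1: sort both lists, then compare element by element.
    # O(n log n) time.
    return len(a) == len(b) and sorted(a) == sorted(b)

def same_members_hashing(a, b):
    # Approach 2: put one array's items in a hash-based set, then
    # verify every item of the other array is present. O(n) expected time.
    if len(a) != len(b):
        return False
    seen = set(a)
    return all(x in seen for x in b)

print(same_members_sorting([3, 1, 2], [2, 3, 1]))  # True
print(same_members_hashing([1, 2, 4], [1, 2, 3]))  # False
```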

Lowboy answered 29/10, 2008 at 1:39 Comment(4)
Minor optimizations... 1. As mentioned below, check the length first. 2. Java's Set.add(E o) operation returns true if the element was added, so the iteration can simply test 'if (!setA.add(elementFromB))' and return false.Jocosity
The problem with the hashtable approach is it doesn't work when the list can have duplicate values. Does a[1, 2, 2, 3] == b[1, 2, 3, 3]? Both are the same length and all the items in b can be found in the hashtable of a. You need a set structure that counts the items and check the counts are equal.Talya
From a mathematical perspective a set does not contain the same value twice.Bream
Method 2 is slower than 1 (if n is big enough): in the worst case, Method 2 needs O(n^2) checks, while with a comparison sort you can get down to O(n*log(n)).Latitudinarian
G
7

You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).

To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.

Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).
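The duplicate-aware counting scheme described above, sketched in Python (the function name is illustrative; `collections.Counter` does the per-element bookkeeping):

```python
from collections import Counter

def same_multiset(a, b):
    if len(a) != len(b):
        return False
    counts = Counter(a)            # increment while looping over the first array
    for x in b:                    # decrement while looping over the second
        if counts.get(x, 0) == 0:  # missing, or already used up: unequal
            return False
        counts[x] -= 1
    return True

print(same_multiset([1, 2, 2, 3], [2, 1, 3, 2]))  # True
print(same_multiset([1, 2, 2, 3], [1, 2, 3, 3]))  # False
```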

Gradate answered 29/10, 2008 at 1:39 Comment(0)
C
5

Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:

  1. Return FALSE if arrays differ in size
  2. Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
  3. For each element in the 2nd array, check if it's in the sorted copy of the first array using a binary search -- O(n.log(n)) time

If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.
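A sketch of the three steps in Python, using the standard library's `bisect` for the binary search as the answer recommends (the function name is illustrative; assumes duplicate-free arrays, per the question):

```python
import bisect

def same_members(a, b):
    if len(a) != len(b):           # step 1: size check
        return False
    sorted_a = sorted(a)           # step 2: O(n log n), one extra array
    for x in b:                    # step 3: n binary searches, O(n log n)
        i = bisect.bisect_left(sorted_a, x)
        if i == len(sorted_a) or sorted_a[i] != x:
            return False
    return True

print(same_members([4, 1, 3], [3, 4, 1]))  # True
print(same_members([1, 2, 3], [1, 2, 4]))  # False
```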

[Added after reviewing solutions suggesting dictionary/set/hash lookups:]

In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with a combined worst-case for [insert-N-elements then lookup-N-elements] of O(N.log(N)) time and O(N) space.

Calycle answered 29/10, 2008 at 1:52 Comment(2)
If we assume data is not hostile, worst-case times are rarely interesting. Everyone says quicksort is O(n*log(n)), but its worst-case performance is O(n^2)Dinosaurian
Frentos, Your first solution won't work for this input: Array 1 = [1,2,3] Array 2 = [1,1,1]Sateia
E
5

You can use a signature (a commutative operation over the array members) to further optimize this in the case where the arrays are usually different, saving the O(n log n) comparison or the memory allocation. A signature can take the form of a bloom filter(s), or even a simple commutative operation like addition or xor.

A simple example (assuming a long as the signature size and GetHashCode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than long):

public bool MatchArrays(object[] array1, object[] array2)
{
   if (array1.Length != array2.Length)
      return false;

   long signature1 = 0;
   long signature2 = 0;
   for (int i = 0; i < array1.Length; i++) {
       signature1 = CommutativeOperation(signature1, array1[i].GetHashCode());
       signature2 = CommutativeOperation(signature2, array2[i].GetHashCode());
   }

   if (signature1 != signature2)
       return false;

   // Signatures match; confirm with a full comparison.
   return MatchArraysTheLongWay(array1, array2);
}

where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters)

public long CommutativeOperation(long oldValue, long newElement) {
    return oldValue + newElement;
}
Episode answered 17/12, 2008 at 1:25 Comment(0)
P
3

This can be done in different ways:

1 - Brute force: for each element in array1, check that the element exists in array2. Note that this requires tracking positions/indexes so that duplicates can be handled properly. This is O(n^2) with fairly complicated code; don't even think of it at all...

2 - Sort both lists, then check each element to see if they're identical. O(n log n) for sorting and O(n) to check, so O(n log n) overall. The sort can be done in place if messing up the arrays is not a problem; if not, you need memory for a copy of each list, i.e. 2n extra space.

3 - Add the items, with counts, from one array to a hashtable, then iterate through the other array, checking that each item is in the hashtable; if so, decrement its count if it is nonzero, otherwise remove it from the hashtable. O(n) to create the hashtable and O(n) to check the other array's items, so O(n) overall. This introduces a hashtable with memory for at most n elements.

4 - Best of the best (among the above): subtract each element at the same index in the two arrays and finally sum up the subtracted values. For example, A1={1,2,3}, A2={3,1,2} gives Diff={-2,1,1}; now sum up the Diff = 0, which supposedly means they have the same set of integers. This approach requires O(n) with no extra memory. The C# code would look as follows:

    public static bool ArrayEqual(int[] list1, int[] list2)
    {
        if (list1 == null || list2 == null)
        {
            throw new Exception("Invalid input");
        }

        if (list1.Length != list2.Length)
        {
            return false;
        }

        int diff = 0;

        for (int i = 0; i < list1.Length; i++)
        {
            diff += list1[i] - list2[i];
        }

        return (diff == 0);
    }

Note: method 4 doesn't work at all; it is the worst (see the comments for counterexamples).

Paste answered 17/12, 2008 at 1:4 Comment(4)
that fails on the inputs: [2,4,6] and [-2,8,6] --> diff[4,-4,0] = 0.Cheyenne
Rather, do this: diff += Math.abs(list1[i]) - Math.abs(list2[i]);Arleen
abs will not do the trick. consider case: [2, 4, 0] & [5, 1, 0] => diff[-3, 3, 0] == 0.Quinn
in case of abs [1,2,3] and [-1,2,3] ..... diff =0 which is not correct... So my suggestion to take 2 diffs , one for +ve nos and other for -ve number. Both diffs should be 0Wadi
H
2

If the elements of the arrays are given as distinct, then XOR (bitwise XOR) all the elements of both arrays; if the answer is zero, then both arrays have the same set of numbers. The time complexity is O(n).
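A sketch of the XOR check in Python (illustrative name). As with any signature scheme, a zero result is a necessary condition rather than a standalone proof, so it is safest used as a cheap pre-filter before a full comparison:

```python
from functools import reduce
from operator import xor

def xor_signature_matches(a, b):
    # XOR every element of both arrays together; equal element sets
    # always cancel out to zero (x ^ x == 0).
    return reduce(xor, a + b, 0) == 0

print(xor_signature_matches([1, 2, 3], [3, 2, 1]))  # True
print(xor_signature_matches([1, 2, 3], [1, 2, 4]))  # False
```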

Hectocotylus answered 26/8, 2014 at 17:27 Comment(2)
Not really. Doing XOR on all elements of {1, 2, 3, 3} and {1, 2} will result to 0, but the arrays are different. :)Chlor
@KonstantinYovkov, I took that case into consideration. That's why I wrote "if the elements of an array are given as distinct"Hectocotylus
P
1

I would suggest sorting both arrays first. Then compare the first element of each array, then the second, and so on.

If you find a mismatch you can stop.

Pinkham answered 29/10, 2008 at 1:39 Comment(0)
C
1

If you sort both arrays first, you'd get O(N log(N)).

Changchangaris answered 29/10, 2008 at 1:40 Comment(0)
A
1

What is the "best" solution obviously depends on what constraints you have. If it's a small data set, sorting, hashing, or brute force comparison (like nickf posted) will all be pretty similar. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will either require you to duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have memory overhead for creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n^2) runtime. You can choose which is best for your purposes.

Alexandra answered 29/10, 2008 at 2:22 Comment(1)
However, nickf's solution is the easiest to multi-thread; it doesn't do any writes to shared data and it can scale (near) linearly. It only has problems with duplicate items as mentioned in the comments.Pocky
S
1

Going into deep waters here, but:

Sorted lists: sorting can be O(n log n), as pointed out. Just to clarify, it doesn't matter that there are two lists, because O(2*n log n) == O(n log n). Comparing each element afterwards is another O(n), so sorting both and then comparing each element is O(n) + O(n log n), which is O(n log n).

Hash-tables: Converting the first list to a hash table is O(n) for reading plus the cost of storing in the hash table, which I guess can be estimated as O(n), giving O(n). Then you'll have to check the existence of each element of the other list in the produced hash table, which is (at least?) O(n) (assuming that checking existence of an element in the hash-table is constant). All in all, we end up with O(n) for the check.

The Java List interface defines equals as each corresponding element being equal.

Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.

Finally, the Java Set interface per documentation implements this very behaviour. The implementation should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source; it's probably too strictly licensed. Download and look at it yourself. It comes with the JDK.) Looking at the source, HashSet (which is a commonly used implementation of Set) delegates the equals() implementation to AbstractSet, which uses the containsAll() function of AbstractCollection using the contains() function again from HashSet. So HashSet.equals() runs in O(n) as expected (looping through all elements and looking them up in constant time in the hash-table).

Please edit if you know better, to spare me the embarrassment.

Steve answered 29/10, 2008 at 2:59 Comment(1)
The difference between O(2 * n log n) and O(n log n) does matter in a practical application; it's the hidden constants that largely make up the performance of the algorithm. The theoretical speed is only relevant on a superficial level.Pocky
W
1

Pseudocode :

A : array
B : array
C : hashtable

if A.length != B.length then return false

foreach objA in A
{
    H = objA
    if H is not found in C.Keys then
        C.add(H as key, 1 as initial value)
    else
        C.Val[H as key]++
}

foreach objB in B
{
    H = objB
    if H is not found in C.Keys then
        return false
    else
        C.Val[H as key]--
}

if C contains a non-zero value then
    return false
else
    return true
Wader answered 11/11, 2010 at 12:26 Comment(0)
B
0

Here is another option; let me know what you think. It should be T(n) = 2n*log2(n) -> O(n log n) in the worst case.

private boolean compare(List listA, List listB){
    if (listA.isEmpty() && listB.isEmpty()) return true;
    List runner = new ArrayList();
    List maxList = listA.size() > listB.size() ? listA : listB;
    List minList = listA.size() > listB.size() ? listB : listA;
    int matches = 0;
    List nextList = null;
    int maxLength = maxList.size();
    for (int i = 0; i < maxLength; i++){
        // visit index i of each list in turn, alternating maxList / minList
        for (int j = 0; j < 2; j++) {
            nextList = (nextList == null) ? maxList : (nextList == maxList) ? minList : maxList;
            if (i < nextList.size()) {
                MatchingItem nextItem = new MatchingItem(nextList.get(i), nextList);
                int position = runner.indexOf(nextItem);
                if (position < 0){
                    runner.add(nextItem);
                } else {
                    MatchingItem itemInBag = (MatchingItem) runner.get(position);
                    if (itemInBag.getList() != nextList) matches++;
                    runner.remove(position);
                }
            }
        }
    }
    return maxLength == matches;
}

public class MatchingItem {
    private Object item;
    private List itemList;

    public MatchingItem(Object item, List itemList){
        this.item = item;
        this.itemList = itemList;
    }

    public boolean equals(Object other){
        MatchingItem otherItem = (MatchingItem) other;
        return otherItem.item.equals(this.item) && otherItem.itemList != this.itemList;
    }

    public Object getItem(){ return this.item; }
    public List getList(){ return this.itemList; }
}

Blasto answered 29/10, 2008 at 1:33 Comment(0)
F
0

The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, each of which takes O(1), so another O(n) operation. All in all, it's O(n).

In python:

def comparray(a, b): 
    sa = set(a)
    return len(sa)==len(b) and all(el in sa for el in b)
Fourierism answered 29/10, 2008 at 1:43 Comment(3)
I'd love to know the details of this data structure with O(1) insertions and O(1) lookups. Unless you have a perfect hashing function (no duplicates) and there is no overhead as the hashmap grows, the insertion is going to be more that O(1)Calycle
Well it is amortized O(1), meaning on average it'll be O(1). Obviously doing enough insertions will cause it to grow and collide, also causing lookups to then take slightly longer. I guess that would apply in this case, since it's doing n of them. Do you know details on python's hashmap performance?Fourierism
I agree typical inserts/lookups are amortized O(1), and I'd definitely use some form of dictionary/hash/set as a solution in practice. But O() notation is for worst-case. I googled and found data structures which guarantee O(1) access, but no analysis of worst-case insertion time for N elements.Calycle
M
0

Ignoring the built in ways to do this in C#, you could do something like this:

It's O(1) in the best case (when the lengths differ) and O(N) per list in the worst case.

public bool MatchArrays(object[] array1, object[] array2)
{
   if (array1.Length != array2.Length)
      return false;

   bool retValue = true;

   var ht = new HashSet<object>();

   for (int i = 0; i < array1.Length; i++)
   {
      ht.Add(array1[i]);
   }

   for (int i = 0; i < array2.Length; i++)
   {
      if (!ht.Contains(array2[i]))
      {
         retValue = false;
         break;
      }
   }

   return retValue;
}
Misconstruction answered 29/10, 2008 at 1:50 Comment(2)
Hash tables aren't O(1) to do lookups. Your solution is somewhere between O(N.log N) and O(N^2) depending on the hash table implementaton.Calycle
The O(1) is if the lengths don't match; otherwise it's O(N) twice in the worst case and O(N) in the best case. Hashtables are O(1) when there is no chance of a collision, btw.Misconstruction
P
0

Upon collisions a hashmap is O(n) in most cases, because it uses a linked list to store the collisions. However, there are better approaches, and you should hardly have any collisions anyway, because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, you're unlikely to have more than a small number of collisions in a single hashmap, so performance wouldn't suck that badly; you can safely say that it's O(1), or almost O(1), because the n is so small it can be ignored.

Pocky answered 29/10, 2008 at 10:0 Comment(0)
C
-2

The best I can think of is O(n^2), I guess.

function compare($foo, $bar) {
    if (count($foo) != count($bar)) return false;

    foreach ($foo as $f) {
        foreach ($bar as $b) {
            if ($f == $b) {
                // $f exists in $bar, skip to the next $foo
                continue 2;
            }
        }
        return false;
    }
    return true;
}
Cheyenne answered 29/10, 2008 at 1:33 Comment(4)
O(n) would be the best if you're using a hashtable. If you're sorting first O(nlogn).Youngster
Does this still work if foo has duplicates? It doesn't look like it would check to see if they have the same number of duplicates. I.e., this would falsely match [0 0 0 0 1 2 3] and [0 1 1 2 2 3 3]Clarendon
no, it doesn't work in that case, but that was one of the assumptions at the start. I guess that if you couldn't be sure there weren't duplicates, then you could delete items from the bar array as they were matched.Cheyenne
Making a multi-threaded variant of this implementation that scales linearly is probably easier than any other implementation on this page.Pocky

© 2022 - 2024 — McMap. All rights reserved.