C# Similarities of two arrays
Asked Answered
M

5

7

There must be an better way to do this, I'm sure...

// Simplified code
var a = new List<int>() { 1, 2, 3, 4, 5, 6 };
var b = new List<int>() { 2, 3, 5, 7, 11 };
var z = new List<int>();
for (int i = 0; i < a.Count; i++)
    if (b.Contains(a[i]))
        z.Add(a[i]);
// (z) contains all of the numbers that are in BOTH (a) and (b), i.e. { 2, 3, 5 }

I don't mind using the above technique, but I want something fast and efficient (I need to compare very large Lists<> multiple times), and this appears to be neither! Any thoughts?

Edit: As it makes a difference - I'm using .NET 4.0, the initial arrays are already sorted and don't contain duplicates.

Melba answered 27/2, 2012 at 12:43 Comment(2)
Take a look here: #649944Espagnole
Similar questions: Efficient set intersection algorithm, Fast intersection of sets: C++ vs C#Caitlyncaitrin
P
7

You could use IEnumerable.Intersect.

var z = a.Intersect(b);

which will probably be more efficient than your current solution.

note you left out one important piece of information - whether the lists happen to be ordered or not. If they are then a couple of nested loops that pass over each input array exactly once each may be faster - and a little more fun to write.

Edit In response to your comment on ordering:

first stab at looping - it will need a little tweaking on your behalf but works for your initial data.

    int j = 0;
    foreach (var i in a)
    {
        int x = b[j];
        while (x < i)
        {
            if (x == i)
            {
                z.Add(b[j]);
            }
            j++;
            x = b[j];
        }
    }

this is where you need to add some unit tests ;)

Edit final point - it may well be that Linq can use SortedList to perform this intersection very efficiently, if performance is a concern it is worth testing the various solutions. Dont forget to take the sorting into account if you load your data in an un-ordered manner.

One Final Edit because there has been some to and fro on this and people may be using the above without properly debugging it I am posting a later version here:

        int j = 0;
        int b1 = b[j];
        foreach (var a1 in a)
        {
            while (b1 <= a1)
            {
                if (b1 == a1)
                    z1.Add(b[j]);
                j++;
                if (j >= b.Count)
                    break;
                b1 = b[j];
            }
        }
Premiership answered 27/2, 2012 at 12:49 Comment(8)
The initial arrays are orderedMelba
Thank-you for your update, and thanks to Stack over-flow for closing the question so I can't post my findings ... sigh.Melba
LIES AND SLANDER! Sorry dice, but I was not resetting the StopWatch before your code, yours is BY FAR the most efficient! It took 15 ticks after I reset the StopWatch. Awesome.Melba
Intersect() seems a bit weird - It takes about 2,750 ticks to "load", then no matter what dataset you throw at it on the second/third/subsequent calls it takes ~1 tick to complete. Your method takes longer on subsequent calls (compared to Intersect), but I'd have to do a LOT of array comparisons to make using Intersect the better choice. (Using my test data, I'd have to do over 250 comparisons - not likely in my scenario - for your method to become slower over-all)Melba
Additionally, I can halve the time your method takes by using an int[] instead of a List<int>Melba
again interesting - note that my first implementation was seriously flawed - see the second one for improvements, also with regard to Intersect behavior see here: #9468443Premiership
Thanks for the link, I'll be watching that post to see what people say. As of now, it looks like .Intersect() doesn't actually do anything until you "query" z, so if we changed the code to include a .Count before the .Stop() we might get different results. I'll give that a go at work tomorrow, and post my results.Melba
Running my tests again, with a .Count() on each test before the .Stop(), your method is 4x faster than the fastest .Intersect() (17 ticks vs 70)Melba
C
2

There's IEnumerable.Intersect, but since this is an extension method, I doubt it will be very efficient.

If you want efficiency, take one list and turn it into a Set, then go over the second list and see which elements are in the set. Note that I preallocate z, just to make sure you don't suffer from any reallocations.

var set = new HashSet<int>(a);
var z = new List<int>(Math.Min(set.Count, b.Count));

foreach(int i in b)
{
    if(set.Contains(i))
        a.Add(i);
}

This is guaranteed to run in O(N+M) (N and M being the sizes of the two lists).

Now, you could use set.IntersectWith(b), and I believe it will be just as efficient, but I'm not 100% sure.

Correll answered 27/2, 2012 at 12:50 Comment(1)
I don't mean extension methods are inefficient by themselves, I just mean that it is only aware of the IEnumerable interface, making it hard to be efficient (unless they use a Set internally, do they?)Correll
E
1

The Intersect() method does just that. From MSDN:

Produces the set intersection of two sequences by using the default equality comparer to compare values.

So in your case:

var z = a.Intersect(b);
Emalia answered 27/2, 2012 at 12:46 Comment(1)
Thank-you, it's soooo much easier when you know which word to put into Google!!Melba
S
1

If you can use LINQ, you could use the Enumerable.Intersect() extension method.

Survance answered 27/2, 2012 at 12:47 Comment(0)
V
1

Use SortedSet<T> in System.Collections.Generic namespace:

SortedSet<int> a = new SortedSet<int>() { 1, 2, 3, 4, 5, 6 };
SortedSet<int> b = new SortedSet<int>() { 2, 3, 5, 7, 11 };
b.IntersectWith(s2);

But surely you have no duplicates!
Although your second list needs not to be a SortedSet. It can be any collection (IEnumerable<T>), but internally the method act in a way that if the second list also is SortedSet<T>, the operation is an O(n) operation.

Violoncello answered 27/2, 2012 at 12:51 Comment(3)
Why use a sorted set? Why not just a Set?Correll
@zmbq: There is no Set<T> class in .NET Framework (Only an internal class in System.Linq.Expressions). And SortedSet<T> is the must efficient. It uses a balanced binary tree internally.Violoncello
@MD.Unicorn: There is a HashSet<T> for this purpose.Signory

© 2022 - 2024 — McMap. All rights reserved.