Collection priority in LINQ Intersect, Union, using IEqualityComparer
Asked Answered
N

1

5

If I have two collections of type T, and an IEqualityComparer that compares a subset of their properties, which collection would the resulting elements of an Intersect or Union come from?

The tests I've run so far suggest the following:

  • the item(s) from col1 win
  • if col1 or col2 contain duplicate items (as defined by the comparer) within themselves, the first entry (within col1, then col2) wins.

I'm aware this shouldn't be an issue, as (by definition) I should be viewing the resulting objects as equal. It just occurred to me that using Union with a custom comparer could be a bit neater than the equivalent Join - though this only holds true if the above assumptions are guaranteeed.

    class DummyComparer : IEqualityComparer<Dummy>
    {
        public bool Equals(Dummy x, Dummy y)
        {
            return x.ID == y.ID;
        }

        public int GetHashCode(Dummy obj)
        {
            return obj.ID.GetHashCode();
        }
    }

    class Dummy
    {
        public int ID { get; set; }
        public string Name { get; set; }
    }

    [Test]
    public void UnionTest()
    {
        var comparer = new DummyComparer();

        var d1 = new Dummy { ID = 0, Name = "test0" };
        var d2 = new Dummy { ID = 0, Name = "test1" };
        var d3 = new Dummy { ID = 1, Name = "test2" };
        var d4 = new Dummy { ID = 1, Name = "test3" };

        var col1 = new Dummy[] { d1, d3 };
        var col2 = new Dummy[] { d2, d4 };

        var x1 = col1.Union(col2, comparer).ToList();
        var x2 = col2.Union(col1, comparer).ToList();

        var y1 = col1.Except(col2, comparer).ToList();
        var y2 = col2.Except(col1, comparer).ToList();

        var z1 = col1.Intersect(col2, comparer).ToList();
        var z2 = col2.Intersect(col1, comparer).ToList();

        Assert.AreEqual(2, x1.Count);
        Assert.Contains(d1, x1);
        Assert.Contains(d3, x1);

        Assert.AreEqual(2, x2.Count);
        Assert.Contains(d2, x2);
        Assert.Contains(d4, x2);

        Assert.AreEqual(0, y1.Count);
        Assert.AreEqual(0, y2.Count);

        Assert.AreEqual(2, z1.Count);
        Assert.Contains(d1, z1);
        Assert.Contains(d3, z1);

        Assert.AreEqual(2, z2.Count);
        Assert.Contains(d2, z2);
        Assert.Contains(d4, z2);
    }
Neurasthenia answered 24/1, 2014 at 15:6 Comment(4)
I'd say look at the MSDN documentation, but it actually lies about Intersect. Maybe check out EduLinq, it goes into detail on the original implementations.Cresting
@Rawling: interesting, you're right. I've looked at Intersect with ILSpy and it enumerates the second collection first and then the first ,even if is documented the other way around. What could be the reason? Edit Actually Jon Skeet has also mentioned this "lie": msmvps.com/blogs/jon_skeet/archive/2010/12/30/… (in his words: "This is demonstrably incorrect.")Hutchens
@Tim It's also ambiguous about which elements Intersect returns - I read it as "marks elements in the second sequence, and returns them", but you clearly read it the other way around. (Yeah, that's the page I was thinking of when I linked EduLinq.)Cresting
@Rawling: yes, the implementation doesn't change the behaviour. It will still return the object from the first sequence even if it first enumerates the second. I've updated my answer. Confusing.Hutchens
H
8

The first collection should win always.

MSDN:

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

Here is the implementation of Union(ILSPY, .NET 4), the first collection is enumerated first:

// System.Linq.Enumerable
private static IEnumerable<TSource> UnionIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource current in first)
    {
        if (set.Add(current))
        {
            yield return current;
        }
    }
    foreach (TSource current2 in second)
    {
        if (set.Add(current2))
        {
            yield return current2;
        }
    }
    yield break;
}

The same applies to Intersect (and other similar methods in Linq-To-Objects as well):

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

Update: As Rawling has mentioned in his comment MSDN lies at the documentation of Intersect. I've looked at Intersect with ILSpy and it enumerates the second collection first and only then the first, even if is documented the other way around.

Actually Jon Skeet has also mentioned this "lie" in EduLinq: http://msmvps.com/blogs/jon_skeet/archive/2010/12/30/reimplementing-linq-to-objects-part-16-intersect-and-build-fiddling.aspx (in his words: "This is demonstrably incorrect.")

However, even if it isn't implemented as expected it will still return the element of the first collection as you can see in the implementation:

// System.Linq.Enumerable
private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource current in second)
    {
        set.Add(current);
    }
    foreach (TSource current2 in first)
    {
        if (set.Remove(current2))
        {
            yield return current2;
        }
    }
    yield break;
}
Hutchens answered 24/1, 2014 at 15:12 Comment(2)
@trilson86: updated my answer but even if Intersect behaves differently than documented it'll still return the item from the first collection.Hutchens
That's a curious implementation - thanks for pointing it out.Neurasthenia

© 2022 - 2024 — McMap. All rights reserved.