C# remove duplicates from List<List<int>>
Asked Answered
G

6

14

I'm having trouble coming up with the most efficient algorithm to remove duplicates from List<List<int>>, for example (I know this looks like a list of int[], but just doing it that way for visual purposes:

my_list[0]= {1, 2, 3};
my_list[1]= {1, 2, 3};
my_list[2]= {9, 10, 11};
my_list[3]= {1, 2, 3};

So the output would just be

new_list[0]= {1, 2, 3};
new_list[1]= {9, 10, 11};

Let me know if you have any ideas. I would really appreciate it.

Goodnatured answered 8/10, 2012 at 15:29 Comment(3)
Is {1, 2, 3} equal to {3, 2, 1}?Coadjutor
Well I know I can sort each element in that instance and those two will end up the same, so for the purposes here I'll say no.Goodnatured
I would look at the answers below that use Linq, as that greatly simplifies your code (vs the ones using EqualityComparers).Memorial
T
13

Build custom of EqualityComparer<List<int>>:

public class CusComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> x, List<int> y)
    {
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<int> obj)
    {
        int hashCode = 0;

        for (var index = 0; index < obj.Count; index++)
        {
            hashCode ^= new {Index = index, Item = obj[index]}.GetHashCode();
        }

        return hashCode;
    }
}

Then you can get the result by using Distinct with custom comparer method:

var result = my_list.Distinct(new CusComparer());

Edit:

Include the index into method GetHashCode to make sure different orders will not be equal

Thrum answered 8/10, 2012 at 15:37 Comment(3)
That hashcode will lead to a lot of collisions - e.g. {a,a} will collide for any a, {a,b} will collide with {b,a} as will any permutation... (Although it may be that you want permutations to collide, in which case, nice answer!)Eskew
@Rawling: you are correct, waiting the answer form Tim's comment, I am trying to fixThrum
Here is a hash code generator I feel is better: return obj.Take(5).Aggregate(1, (current, item) => (current * 37) + item.GetHashCode()); First off, I wouldn't iterate the whole sequence. Hashes are only efficient if they are generated quickly; iterating the whole list defeats that purpose. The first 5 or so (edit that number as needed) should be good. If the first several are the same the lists are likely different. Next, a good general algorithm form combining N different hashes into one is to go through each, multiply the current a prime number, and then add the next hash.Kwang
A
10

This simple program does what you want:

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApplication6
{
    class Program
    {
        static void Main(string[] args)
        {
            List<List<int>> lists = new List<List<int>>();

            lists.Add(new List<int> { 1, 2, 3 });
            lists.Add(new List<int> { 1, 2, 3 });
            lists.Add(new List<int> { 9, 10, 11 });
            lists.Add(new List<int> { 1, 2, 3 });

            var distinct = lists.Select(x => new HashSet<int>(x))
                    .Distinct(HashSet<int>.CreateSetComparer());

            foreach (var list in distinct)
            {
                foreach (var v in list)
                {
                    Console.Write(v + " ");
                }

                Console.WriteLine();
            }
        }
    }
}
Aegisthus answered 8/10, 2012 at 15:40 Comment(2)
This is really be best answer as it makes full use of Linq to simplify the problem.Memorial
Very fast solution. This is the correct answer. Thanks! +1Kwiatkowski
M
10
    var finalList = lists.GroupBy(x => String.Join(",", x))
                         .Select(x => x.First().ToList())
                         .ToList();
Mehalek answered 8/10, 2012 at 15:40 Comment(0)
F
6

You can use the LINQ Distinct overload that takes a comparer. The comparer should see if the lists are equal. Note that the default equals operations of lists won't do what you're really looking for, so the comparer will need to loop through each for you. Here's an example of such a comparer:

public class SequenceComparer<T> : IEqualityComparer<IEnumerable<T>>
{
    IEqualityComparer<T> itemComparer;
    public SequenceComparer()
    {
        this.itemComparer = EqualityComparer<T>.Default;
    }

    public SequenceComparer(IEqualityComparer<T> itemComparer)
    {
        this.itemComparer = itemComparer;
    }

    public bool Equals(IEnumerable<T> x, IEnumerable<T> y)
    {
        if (object.Equals(x, y))
            return true;
        if (x == null || y == null)
            return false;
        return x.SequenceEqual(y, itemComparer);
    }

    public int GetHashCode(IEnumerable<T> obj)
    {
        if (obj == null)
            return -1;
        int i = 0;
        return obj.Aggregate(0, (x, y) => x ^ new { Index = i++, ItemHash = itemComparer.GetHashCode(y) }.GetHashCode());
    }
}

Update: I got the idea of using an anonymous type to make a better hash from Cuong Le's answer, and I LINQ-ified it and made it work in my class.

Fulgor answered 8/10, 2012 at 15:31 Comment(5)
Note that T in List<T> has to implement IComparable. If T is a custom type, you'll have to do this yourself.Allianora
@MichaelSallmen I've shown an implementation that optionally takes an IEqualityComparer<T> to specify how to compare T object. Good point, though. (the default equality comparer on a type without appropriate comparison interfaces defined will only check for reference equality)Fulgor
@Kwang yes, indeed. It's just an example implementation, though. Writing a good one is not trivial (e.g. see Coung Le's better implementation, that still has issues). Perhaps multiplying each item's hash by the next larger prime number before XORing would be better?Fulgor
@TimS. I would just leave a //TODO generate hash rather than putting what you have; at least that way readers will know they need to find a good algorithm on their own without thinking that this is fine.Kwang
@Kwang I've replaced it with a good implementation. Otherwise, I'd agree with you.Fulgor
B
1

For small sets of data, a comparer could be useful, but if you have 1000 or more List> then trying to compare them all could begin to take a long amount of time.

I suggest that you instead use your data to build a distinct tree. The building of the tree will be much faster and when you are done you can always bring your data back into your old data structure.

Berar answered 14/10, 2012 at 21:56 Comment(0)
K
0

I wanted to compare the performance of the answers of @Leniel Macaferi and @L.B as I wasn't sure which would be more performant, or if the difference would be significant. It turns out that the difference is very significant:

Method 1: 00:00:00.0976649 @Leniel Macaferi
Method 2: 00:00:32.0961650 @L.B

Here is the code I used to benchmark them:

public static void Main(string[] args)
        {
            var list = new List<List<int>> {new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11, 1}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6, 7}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,}, new List<int> {1, 2, 31, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 6}, new List<int> {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 9, 10, 11}};

            var sw1 = new Stopwatch();
            sw1.Start();

            for (var i = 0; i < 1_000_000; i++)
            {
                var distinct = list.Select(x => new HashSet<int>(x)).Distinct(HashSet<int>.CreateSetComparer());
            }

            sw1.Stop();
            Console.WriteLine($"Method 1: {sw1.Elapsed}");

            var sw2 = new Stopwatch();
            sw2.Start();
            for (var i = 0; i < 1_000_000; i++)
            {
                var distinct = list.GroupBy(a => string.Join(",", a)).Select(a => a.First()).ToList();

            }
            sw2.Stop();
            Console.WriteLine($"Method 2: {sw2.Elapsed}");

            Console.ReadKey();
        }
Kwiatkowski answered 10/2, 2018 at 14:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.