Count items existing in 2 Lists

I have two lists of int, List A and List B. I want to check how many items of List A are in List B. I can do this already, but what would be a more efficient way? I am trying to avoid foreach, as optimization is a prime goal in my code.

List<int> A = new List<int>();
List<int> B = new List<int>();
// Some logic....item added in both lists. Then

int count = 0;
foreach (var item in A)
{
    if (B.Contains(item))
    {
        count++; // item from A also exists in B
    }
}

I tried using Intersect and Any, but Any returns a bool, so I wasn't able to apply them fully to this problem.

Ampersand answered 5/8, 2013 at 9:37 Comment(2)
How about A.Where(x => B.Contains(x)).Count() ?Allegedly
Duplicate of #497838 ?List
29
B.Intersect(A).Count(); //should do the job
Advocaat answered 5/8, 2013 at 9:40 Comment(8)
It's O(m)+O(n) as Intersect produces two foreach statements. msdn.microsoft.com/en-us/library/bb460136.aspx - so according to your comment on my solution it is not any better.Ascham
Intersect is a bit more efficient than Contains as it uses a HashSet internally.Liberec
@Ascham Count+Contains was O(n+mn), so not really much better :PLiberec
If B contains {0, 1, 2} and A contains {0, 0, 0, 0, 0, 4} this will return 1, not 5.Aube
@JonHanna why should it return 5? Only one zero exists in these two lists.Uniformed
@Uniformed "I want to check how many items of List A are there in List B", in the example I give there are 5 items in list A that are in list B.Aube
@JonHanna ah ok, I misunderstood, sorry about thatUniformed
I'm wondering why this is such a popular answer when the OP explicitly says "optimization is a prime target in my code"... while this solution is syntactically concise for the developer, as Matten points out it actually runs two foreach loops internally... and as Adriano Repetti shows in his answer, there are much faster ways of performing the 'intersect'. LINQ, while syntactically more concise, often executes slower than a foreach we code ourselves.Seko
13

The standard implementation B.Intersect(A).Count() has the big advantage of being short and readable; unless you have a measured performance issue you should go with it.

When performance is an issue you can introduce HashSet<int>; it's a good compromise between resource usage and search time. However, because you are worried about performance, we should perform some testing (I am using this free tool I wrote):

CPU: 1.8 GHz Pentium Core 2 Duo
Number of iterations: 100
Number of items in each list: 1000

A.Where(a => B.Contains(a)).Count(): 8338 ticks
A.Intersect(B).Count(): 288 ticks
B.Count - B.Except(A).Count(): 313 ticks

Let's now introduce HashSet<int> in our test (pick implementation from any other answer):
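
For reference, a minimal sketch of what that implementation might look like (mirroring the UseBitArray method below; assumed equivalent to the code actually tested):

public static int UseHashSet(List<int> listA, List<int> listB) {
    var set = new HashSet<int>(listA); // O(n) to build

    int count = 0;
    for (int i = 0; i < listB.Count; ++i) {
        if (set.Contains(listB[i])) // O(1) average lookup
            ++count;
    }

    return count;
}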

HashSet<int>: 163 ticks

It performs much better. Can we do better? If the input range is known (and limited) you can do much better than this using BitArray. In this example I assume (for simplicity) only positive numbers, but it's easy to adapt.

public static int UseBitArray(int range, List<int> listA, List<int> listB) {
    var array = new BitArray(range); // requires using System.Collections
    for (int i = 0; i < listA.Count; ++i)
        array[listA[i]] = true;

    int count = 0;
    for (int i = 0; i < listB.Count; ++i) {
        if (array[listB[i]])
            ++count;
    }

    return count;
}

How does it perform?

BitArray: 95 ticks

[Chart: performance comparison]

It takes only 58% of the time of the 2nd best method (HashSet<int>); I won't even compare it with the others. Note that it uses memory heavily: for a wide range (let's say Int32.MaxValue / 2) it needs a lot of memory, and its size is limited to Int32.MaxValue, so you can't cover the full signed 32-bit integer range. If these limitations aren't a problem for you then you should definitely go with it.

Also note that if you can make some assumptions about your inputs then you can further optimize your search function (for example if you can assume that the lists are sorted).
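
For instance, if both lists were already sorted, a merge-style scan would count the matches in a single O(n + m) pass with no extra memory (a hypothetical sketch, not part of the benchmarks above):

public static int CountSortedIntersect(List<int> sortedA, List<int> sortedB) {
    // Walk both sorted lists in lockstep, advancing the side with the smaller value
    int i = 0, j = 0, count = 0;
    while (i < sortedA.Count && j < sortedB.Count) {
        if (sortedA[i] < sortedB[j])
            ++i;
        else if (sortedA[i] > sortedB[j])
            ++j;
        else {
            ++count;
            ++i;
            ++j;
        }
    }
    return count;
}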

How they scale up (Y axis scale is logarithmic):

[Chart: performance comparison with different input sets]

Note that Except performs better than Intersect as the number of items grows. Also note that for such a trivial item (an integer) you won't get any performance gain from doing it in parallel (see also Finding the difference between two lists of strings): the comparison is so trivial that the overhead and synchronization cost more than they save (unless it's a well-tuned algorithm on a VERY high number of items).
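
For instance, a PLINQ variant like the following (a hypothetical sketch, not benchmarked here) typically loses to the sequential version because of partitioning and merge overhead:

int count = A.AsParallel().Intersect(B.AsParallel()).Count();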

If you're really looking for the last bit of performance you may even implement your own BitArray class (without unneeded stuff and error checking):

sealed class FastBitArray {
    public FastBitArray(int length) {
        m_array = new int[((length - 1) / 32) + 1]; // 32 bits packed per int
    }

    public bool this[int index] {
        get {
            return (m_array[index / 32] & (1 << (index % 32))) != 0;
        }
        set {
            if (value)
                m_array[index / 32] |= (1 << (index % 32));
            else
                m_array[index / 32] &= ~(1 << (index % 32));
        }
    }

    private int[] m_array;
}
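
Usage mirrors the UseBitArray method above (a sketch, assuming the same inputs):

var array = new FastBitArray(range);
for (int i = 0; i < listA.Count; ++i)
    array[listA[i]] = true;

int count = 0;
for (int i = 0; i < listB.Count; ++i) {
    if (array[listB[i]])
        ++count;
}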

Note that there is a branch inside the setter; we don't have to worry about optimizing it away because the pattern is easy (always true) for the branch predictor. There is no performance gain in making it more complicated than this.

Latest tests:

Number of iterations: 100
Number of items in each list: 1000000

HashSet<int>: 144748 ticks
BitArray: 37292 ticks
FastBitArray: 28966 ticks

Let's compare them visually (the blue series is the test with 1,000 items, the orange series 1,000,000; the Y axis is logarithmic to ease comparison with the 1k series). Methods we know are slow are simply omitted:

[Chart: performance comparison, 1k vs 1M items, logarithmic Y axis]

The same data showing only the 1M series, with a linear Y axis:

[Chart: performance comparison, 1M items, linear Y axis]

Upbow answered 26/6, 2015 at 10:14 Comment(0)
5
A.Where(a => B.Contains(a)).Count()
Celestial answered 5/8, 2013 at 9:39 Comment(0)
4
HashSet<int> Btemp = new HashSet<int>(B);
var x = A.Count(p => Btemp.Contains(p));

// or var x = A.Count(Btemp.Contains);
// but I have always found it a little unreadable to skip the lambda;
// this shorter form could be a little faster, because it skips one delegate indirection

or

HashSet<int> Btemp = new HashSet<int>(B);
Btemp.IntersectWith(A); // note that this method is of the HashSet, it isn't 
                        // a "generic" Intersect, so it's optimized against 
                        // the HashSet internals
var y = Btemp.Count;

(theoretically both adding and checking existence in a HashSet are O(1) operations)

Both of them are O(n), where n = A.Count, instead of being O(m * n) with m = B.Count, which is essentially O(x^2).

(technically they are O(n) + O(m), because building the HashSet is O(m), but that's still linear)...

In the end they are linear in time instead of quadratic... but all of this depends on the length of B... if B has only 1-3 elements, it's probably faster to use Contains directly as you did.

In general, if you know that A is much bigger than B, then you should put A in the HashSet and leave B as a List (and do the reverse if B is much bigger than A).

Siusan answered 5/8, 2013 at 9:42 Comment(1)
I was about to recommend this solution.Mylonite
3

You can use the Intersect and Count methods:

List<int> A = new List<int>();
List<int> B = new List<int>();
// Some logic....item added in both lists. Then
int count = A.Intersect(B).Count();
Misanthropy answered 26/6, 2015 at 9:47 Comment(0)
2

I had the same problem, but I was searching for something more efficient.

// Testcase: 500 items exist in both lists
List<int> InputA = Enumerable.Range(0, 1000).ToList();
List<int> InputB = Enumerable.Range(500, 1000).ToList();

// Result
int Result1 = InputA.Where(a => InputB.Contains(a)).Count(); //13000 ticks
int Result2 = InputA.Intersect(InputB).Count(); //5700 ticks
int Result3 = InputB.Count - InputB.Except(InputA).Count(); //5800 ticks

int Result4 = InputA.CountIntersect(InputB); //2400 ticks

My solution works like the internal Intersect method, just counting matches instead of copying the elements. That's why it is more than 2x faster.

Code:

public static int CountIntersect<T>(this IEnumerable<T> collectionA, IEnumerable<T> collectionB)
{
    HashSet<T> tempA = new HashSet<T>(collectionA);
    int Result = 0;
    foreach (var itemB in collectionB)
    {
        if (tempA.Remove(itemB)) // each distinct value is counted at most once
            Result++;
    }
    return Result;
}
Anisette answered 26/5, 2015 at 11:15 Comment(1)
I think changes in the samples chosen for inputA and inputB will significantly impact your results. As such this post is a little misleadingList
1

You can get this by using:

A.Count(match => B.Contains(match));

or

var count = A.Count(B.Contains);
Strigose answered 19/6, 2015 at 13:53 Comment(0)
1

Well, from a theoretical point of view, given that you have to scan one of the two lists completely and, for each element of that list, check whether it is contained in the other, the only thing you can do to asymptotically improve the method is to improve the search for the element in the other list. The possibilities I see are the following (I assume we are looking for elements of list A in list B):

  • Sort the items in list B (easily done in LINQ using OrderBy) - complexity O(m log m) - and look up each element of A using the binary search algorithm; a sketch of this approach follows below. The overall complexity is O(n log m) (taking n as the number of elements in A and m as the number of elements in B).
  • Transform B into a dictionary (using the ToDictionary method) - complexity O(m). This way, the overall complexity becomes max(O(n), O(m)).
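
A sketch of the first option (a hypothetical helper; assumes using System.Linq and System.Collections.Generic):

public static int CountViaBinarySearch(List<int> listA, List<int> listB)
{
    var sortedB = listB.OrderBy(x => x).ToList(); // sort a copy of B: O(m log m)

    int count = 0;
    foreach (var item in listA) // n lookups, each O(log m)
    {
        if (sortedB.BinarySearch(item) >= 0)
            count++;
    }
    return count;
}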

In LINQ, another way to go about it is to perform an inner join between the two lists. This may be more readable, but my guess is that it is not as performant.
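
For example (a sketch; B is deduplicated first, since duplicates on the join side would inflate the count):

var count = (from a in A
             join b in B.Distinct() on a equals b
             select a).Count();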

Let me know if anything is unclear.

Griceldagrid answered 25/6, 2015 at 8:47 Comment(0)
0

Probably not the best performance, but better than the OP's solution and the plain LINQ one.

Another approach, with Except():

int Result = B.Count - B.Except(A).Count();
Anisette answered 19/6, 2015 at 13:46 Comment(0)
0

First of all, it is important to know whether your lists can contain duplicates and how you want to count them if there are any.

For example:

var listA = new List<int> { 1, 1, 1, 2, 3, 4, 4, 5 };
var listB = new List<int> { 1, 1, 2, 2, 3, 4, 5, 6 };
var result = listA.Intersect(listB).Count(); // 5

If you need to get the number of elements which have any equal element in the other list, then you need to write your own method, because the existing library methods use collections that don't allow duplicates (like sets). You can use a HashSet to store the items of the second list (this will increase your lookup speed):

public static int GetDuplicatesCount(List<int> listA, List<int> listB)
{
    var tempB = new HashSet<int>(listB);
    return listA.Count(tempB.Contains);
}

It will return 8 for the lists above. You can also try to profile a slightly more verbose version:

public static int GetDuplicatesCount(List<int> listA, List<int> listB)
{
    var tempB = new HashSet<int>(listB);
    var result = 0;
    foreach (var item in listA)
    {
        if (tempB.Contains(item))
        {
            result++;
        }
    }
    return result;
}

Stopwatch confirms that the explicit loop works faster than LINQ. So to sum it up: if you need to take the duplicates in your first list into account, use a method like the last one provided above; if not, use the method provided by fubo.
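
The kind of Stopwatch check referred to above might look like this (a minimal sketch; absolute numbers will vary by machine):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (var i = 0; i < 100; i++)
{
    GetDuplicatesCount(listA, listB);
}
sw.Stop();
Console.WriteLine(sw.ElapsedTicks);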

Hobbs answered 21/6, 2015 at 9:36 Comment(2)
HashSet<T> doesn't allow duplicates.Anisette
Right, that's why we use it to store the second list, where we don't care about duplicates. If 8 is the right answer for the lists provided, then this code works. If 15 is the right answer (the number of elements in both lists which have an equal element in the other list), then you can use Dictionary<T, int> to store not only the element but its count as well. The code will be a bit longer but will stay efficient enough.Hobbs
0

If the lists are VERY large and you want to be efficient, the first thing to do is sort them. The second is to remove duplicates in the target (non-counted) list. But if the problem is big enough that the simple LINQ expressions described in the other answers are not enough, you should push the data into a SQL Server instance and run a query to get your answer. The multi-threading of SQL Server will then take care of the scaling you need if the problem is large.

Embitter answered 23/6, 2015 at 17:6 Comment(1)
The nice thing about multi-threaded code in SQL Server is that it is exactly the same code as single-threaded code. Select, Insert, Update, and Delete manage all the complexities of threads for you, in a way that is invisible to you. If you want to fan out to a dozen or so threads and not worry about blocking, races, etc., SQL queries are the way to go.Embitter
0

We can't really use a HashSet for the first list since it is entirely possible that the list contains duplicate entries... However, we can create a HashSet for the second list (this adds space complexity and O(m) time, but we could have started with a HashSet), since duplicates there make no sense... Then we can iterate over the first list and check whether the HashSet contains each value... That's O(n) for the loop and O(1) for each HashSet check...

Tested in LINQPad:

  var lst = new List<int>{1,2,3,4,4,5,6,7};
  var lst2 = new List<int>{4,4,6};

  int count=0;
  var hs= new HashSet<int>(lst2);  //O(m) ... contains {4,6}
  foreach (var l in lst)  // O(n)
  {
    if (hs.Contains(l))  // O(1)
      count++;
  }
  count.Dump();  //returns 3
Pietrek answered 23/6, 2015 at 22:50 Comment(0)
0
A.Where(B.Distinct().ToDictionary(_ => _).ContainsKey).Count(); // This should work for other scenarios with good performance
Breland answered 24/6, 2015 at 14:40 Comment(0)
0

From a strict data-structures perspective, the best that can be done is O(n*m) if your input is unsorted. See the notes below as to why O(n+m) is not necessarily correct.

Disgusting pseudocode (here as compilable C#):

int FindCommonIntersects(List<int> listA, List<int> listB)
{
    int count = 0;
    for (int i = 0; i < listA.Count; i++) // assumes listA is sorted
    {
        // Compare only the last element of each run of duplicates in listA
        if (i + 1 < listA.Count && listA[i] == listA[i + 1])
            continue;
        foreach (int b in listB)
            if (listA[i] == b)
                count++;
    }
    return count;
}

Goes through O(n) for list A and O(m) for list B if the lists are not sorted

Ergo the optimal solution runs at O(n*m), where you only traverse each list once. Note that even if there are multiple equal elements in A, the duplicate check means we don't compare a repeated element of A against list B more than once, thus saving us some time.

I'm sure that you can do this faster with some kind of hashing structure, assuming that you can create a map of size n; however, I am assuming that we do not have infinite memory and therefore cannot create a hash map of extraordinary size.

Hurling answered 25/6, 2015 at 22:6 Comment(0)
0

If the information in your two lists is gathered over time, then consider tracking the overlap as items are inserted/deleted. That way the cost of determining the answer is amortized over the lifespan of the lists, rather than incurred in a one-time event.
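
A minimal sketch of that idea (a hypothetical OverlapTracker class, insertions only; it assumes the overlap of interest is how many items of A exist in B, counting A's duplicates):

sealed class OverlapTracker
{
    private readonly Dictionary<int, int> _countsA = new Dictionary<int, int>();
    private readonly Dictionary<int, int> _countsB = new Dictionary<int, int>();

    public int Overlap { get; private set; } // items of A that exist in B

    public void AddToA(int item)
    {
        _countsA.TryGetValue(item, out int a);
        _countsA[item] = a + 1;
        if (_countsB.ContainsKey(item))
            Overlap++; // the new copy in A has a match in B
    }

    public void AddToB(int item)
    {
        _countsB.TryGetValue(item, out int b);
        _countsB[item] = b + 1;
        if (b == 0 && _countsA.TryGetValue(item, out int a))
            Overlap += a; // every existing copy of this value in A now matches
    }
}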

Jerrilyn answered 26/6, 2015 at 0:48 Comment(0)
