The standard implementation B.Intersect(A).Count() has the big advantage of being short and readable; unless you have a measured performance issue, you should just go with it.

When performance is an issue you can introduce HashSet<int>: it's a good compromise between resource usage and search time. However, because you're worried about performance, let's do some testing (I am using this free tool I wrote):
CPU: 1.8 GHz Pentium Core 2 Duo
Number of iterations: 100
Number of items in each list: 1000
A.Where(a => B.Contains(a)).Count(): 8338 ticks
A.Intersect(B).Count(): 288 ticks
B.Count - B.Except(A).Count(): 313 ticks
Let's now introduce HashSet<int> into our test (pick an implementation from any other answer):
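Something like the following would do; this is an assumed equivalent, not the exact code from those answers, and the name UseHashSet is mine (it needs System.Collections.Generic):

public static int UseHashSet(List<int> listA, List<int> listB) {
    // Build a set from the first list: O(1) average lookups afterwards.
    var setA = new HashSet<int>(listA);

    // Count how many values of listB are present in the set.
    int count = 0;
    for (int i = 0; i < listB.Count; ++i) {
        if (setA.Contains(listB[i]))
            ++count;
    }

    return count;
}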
HashSet<int>: 163 ticks
It performs much better. Can we do better still? If the input range is known (and limited) you can do much better than this using BitArray. In this example I assume (for simplicity) only positive numbers, but it's easy to adapt.
public static int UseBitArray(int range, List<int> listA, List<int> listB) {
    // Mark every value from listA in a bit array sized to the known range.
    var array = new BitArray(range);
    for (int i = 0; i < listA.Count; ++i)
        array[listA[i]] = true;

    // Count how many values of listB have their bit set.
    int count = 0;
    for (int i = 0; i < listB.Count; ++i) {
        if (array[listB[i]])
            ++count;
    }

    return count;
}
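A quick usage sketch (the range of 1000 here is just an assumption for the example; it must be greater than the largest value in either list):

var listA = new List<int> { 1, 5, 7, 42 };
var listB = new List<int> { 5, 42, 999 };

// Prints 2: the values 5 and 42 appear in both lists.
Console.WriteLine(UseBitArray(1000, listA, listB));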
How does it perform?
BitArray: 95 ticks
It takes only 58% of the time of the 2nd best method (HashSet<int>); I won't even compare it with the others. Note that it is memory hungry: for a wide range (say Int32.MaxValue / 2) it needs roughly 128 MB just for the bits, one bit per value in the range. Moreover its size is limited to Int32.MaxValue, so you can't cover the full signed 32-bit integer range. If these limitations aren't a problem for you then you should definitely go with it.
Also note that if you can make some assumptions about your inputs then you can further optimize your search function (for example, if you can assume both lists are sorted; see the sketch below).
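To illustrate that idea (this is not one of the benchmarked methods), a minimal sketch of a merge-style count that assumes both lists are already sorted ascending and contain no duplicates:

public static int CountCommonSorted(List<int> listA, List<int> listB) {
    // Walk both sorted lists in lockstep: O(n + m), no auxiliary memory.
    int i = 0, j = 0, count = 0;

    while (i < listA.Count && j < listB.Count) {
        if (listA[i] < listB[j])
            ++i;
        else if (listA[i] > listB[j])
            ++j;
        else {
            // Equal values: count one match and advance both lists.
            ++count;
            ++i;
            ++j;
        }
    }

    return count;
}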
How do they scale up (the Y axis scale is logarithmic):
Note that Except performs better than Intersect as the number of items grows. Also note that for such a trivial object (an integer) you won't get any performance gain from doing it in parallel (see also Finding the difference between two lists of strings): the comparison is so trivial that the overhead and synchronization cost more than they save (unless it's a well-tuned algorithm on a VERY high number of items).
If you're really looking for the last bit of performance you may even implement your own BitArray class (without the unneeded features and error checking):
sealed class FastBitArray {
    public FastBitArray(int length) {
        // One 32-bit word for every 32 bits of the range, rounded up.
        m_array = new int[((length - 1) / 32) + 1];
    }

    public bool this[int index] {
        get {
            return (m_array[index / 32] & (1 << (index % 32))) != 0;
        }
        set {
            if (value)
                m_array[index / 32] |= (1 << (index % 32));
            else
                m_array[index / 32] &= ~(1 << (index % 32));
        }
    }

    private int[] m_array;
}
Note that inside the setter there is a branch; we don't need to bother optimizing it away because the pattern is easy for the branch predictor (in this usage it's always true). There is no performance gain in making it more complicated than this.
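For the benchmark, the counting routine keeps the same shape as UseBitArray above; a sketch of how it might look (the name UseFastBitArray is mine, not from the benchmark code):

public static int UseFastBitArray(int range, List<int> listA, List<int> listB) {
    // Same algorithm as UseBitArray, only the bit storage changes.
    var array = new FastBitArray(range);
    for (int i = 0; i < listA.Count; ++i)
        array[listA[i]] = true;

    int count = 0;
    for (int i = 0; i < listB.Count; ++i) {
        if (array[listB[i]])
            ++count;
    }

    return count;
}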
Latest tests:
Number of iterations: 100
Number of items in each list: 1000000
HashSet<int>: 144748 ticks
BitArray: 37292 ticks
FastBitArray: 28966 ticks
Let's compare them visually (the blue series is the test with 1,000 items, the orange series is 1,000,000; the Y axis is logarithmic to ease comparison with the 1k series). Methods we know are slow are simply omitted:
Same data, showing only the 1M series and with a linear Y axis: