in linq why are subsequent calls of IEnumerable.Intersect so much faster
Asked Answered
O

5

4

while looking at this question C# Similarities of two arrays it was noted that the initial linq call was significantly slower than subsequent calls. What is being cached that is making such a difference? I am interested in when we can expect to achieve this type of behavior (perhaps here it is simply because the same lists are used over and over).

    static void Main(string[] args)
    {
        var a = new List<int>() { 7, 17, 21, 29, 30, 33, 40, 42, 51, 53, 60, 63, 66, 68, 70, 84, 85, 91, 101, 102, 104, 108, 109, 112, 115, 116, 118, 125, 132, 137, 139, 142, 155, 163, 164, 172, 174, 176, 179, 184, 185, 186, 187, 188, 189, 192, 197, 206, 209, 234, 240, 244, 249, 250, 252, 253, 254, 261, 263, 270, 275, 277, 290, 292, 293, 304, 308, 310, 314, 316, 319, 321, 322, 325, 326, 327, 331, 332, 333, 340, 367, 371, 374, 403, 411, 422, 427, 436, 440, 443, 444, 446, 448, 449, 450, 452, 455, 459, 467, 470, 487, 488, 489, 492, 494, 502, 503, 505, 513, 514, 522, 523, 528, 532, 534, 535, 545, 547, 548, 553, 555, 556, 565, 568, 570, 577, 581, 593, 595, 596, 598, 599, 606, 608, 613, 615, 630, 638, 648, 661, 663, 665, 669, 673, 679, 681, 685, 687, 690, 697, 702, 705, 708, 710, 716, 719, 724, 725, 727, 728, 732, 733, 739, 744, 760, 762, 775, 781, 787, 788, 790, 795, 797, 802, 806, 808, 811, 818, 821, 822, 829, 835, 845, 848, 851, 859, 864, 866, 868, 875, 881, 898, 899, 906, 909, 912, 913, 915, 916, 920, 926, 929, 930, 933, 937, 945, 946, 949, 954, 957, 960, 968, 975, 980, 985, 987, 989, 995 };
        var b = new List<int>() { 14, 20, 22, 23, 32, 36, 40, 48, 63, 65, 67, 71, 83, 87, 90, 100, 104, 109, 111, 127, 128, 137, 139, 141, 143, 148, 152, 153, 157, 158, 161, 163, 166, 187, 192, 198, 210, 211, 217, 220, 221, 232, 233, 236, 251, 252, 254, 256, 257, 272, 273, 277, 278, 283, 292, 304, 305, 307, 321, 333, 336, 341, 342, 344, 349, 355, 356, 359, 366, 373, 379, 386, 387, 392, 394, 396, 401, 409, 412, 433, 437, 441, 445, 447, 452, 465, 471, 476, 479, 483, 511, 514, 516, 521, 523, 531, 544, 548, 551, 554, 559, 562, 566, 567, 571, 572, 574, 576, 586, 592, 593, 597, 600, 602, 615, 627, 631, 636, 644, 650, 655, 657, 660, 667, 670, 680, 691, 697, 699, 703, 704, 706, 707, 716, 742, 748, 751, 754, 766, 770, 779, 785, 788, 790, 802, 803, 806, 811, 812, 815, 816, 821, 824, 828, 841, 848, 853, 863, 866, 870, 872, 875, 879, 880, 882, 883, 885, 886, 887, 888, 892, 894, 902, 905, 909, 912, 913, 914, 916, 920, 922, 925, 926, 928, 930, 935, 936, 938, 942, 945, 952, 954, 955, 957, 959, 960, 961, 963, 970, 974, 976, 979, 987 };
        var s = new System.Diagnostics.Stopwatch();
        const int cycles = 10;
        for (int i = 0; i < cycles; i++)
        {
            s.Start();
            var z= a.Intersect(b);
            s.Stop();
            Console.WriteLine("Test 1-{0}: {1} {2}", i, s.ElapsedTicks, z.Count());
            s.Reset();
            a[0]=i;//simple attempt to make sure entire result isn't cached
        }

        for (int i = 0; i < cycles; i++)
        {
            var z1 = new List<int>(a.Count);
            s.Start();
            int j = 0;
            int b1 = b[j];
            foreach (var a1 in a)
            {
                while (b1 <= a1)
                {
                    if (b1 == a1)
                        z1.Add(b[j]);
                    j++;
                    if (j >= b.Count)
                        break;
                    b1 = b[j];
                }
            }
            s.Stop();
            Console.WriteLine("Test 2-{0}: {1} {2}", i, s.ElapsedTicks, z1.Count);
            s.Reset();
            a[0]=i;//simple attempt to make sure entire result isn't cached
        }

        Console.Write("Press Enter to quit");
        Console.ReadLine();
    }
}

as requested by some - example output:

Test 1-0: 2900 45
Test 1-1: 2 45
Test 1-2: 0 45
Test 1-3: 1 45

(the normal loop shows only a slight difference between consecutive runs)

note after alterations to call a.Intersect(b).ToArray(); rather than just a.Intersect(b); as suggested by @kerem the results become:

Test 1-0: 13656 45
Test 1-1: 113 45
Test 1-2: 76 45
Test 1-3: 64 45
Test 1-4: 90 45 
...
Occultism answered 27/2, 2012 at 15:52 Comment(2)
The statement var z = a.Intersect(b) is lazy evaluated meaning that the intersection only is computed when z is enumerated. And as far as I can se you don't enumerate z. Also, please provide some numbers making it easier to understand what you are asking.Stour
Probably the cost of JITting the Intersect and Stop method for the first time.Gest
T
3

I would expect the first run of any loop to be slower for three reasons:

  1. Code has to be jitted the first time, but not subsequently.
  2. If the executable code run is small enough to fit in cache, then it won't have been evicted, and be faster for the CPU to load.
  3. If the data is small enough to fit in cache, then it won't have been evicted, and be faster for CPU to load.
Taxation answered 27/2, 2012 at 15:56 Comment(0)
W
2

LINQ heavily uses deferred execution. Unless you enumerate a query, it does not get executed.

Change

 s.Start();
 z= a.Intersect(b);
 s.Stop();

to

 s.Start();
 z= a.Intersect(b).**ToArray**();
 s.Stop();

and please post the new performance results.

a.Intersect(b) represents an expression, independent of the value of a and b. Values of a and b are only used when the expression is evaluated by enumeration.

Wilbert answered 27/2, 2012 at 16:34 Comment(1)
good point - though the pattern is roughly similar things are that bit slower:Occultism
P
1

JITting System.Enumerable.

Put new List().Intersect(new List()); new System.Diagnostics.Stopwatch().Stop(); as your first line of code and all interations will take the same amount of time.

Pompidou answered 27/2, 2012 at 15:56 Comment(0)
S
1

Enumerable.Intersect does not do any caching. It is implemented using a HashSet. The first sequence is added to the HashSet. Then the second sequence is removed from the HashSet. The remaining elements in the HashSet is yielded as an enumerable sequence of elements. You will have to actually enumerate the HashSet to pay the cost of creating the HashSet. This implementation is suprisingly efficient even for small collections.

If you see a difference in performance in subsequent calls it is not because Enumerable.Intersect does any caching but probably because you need to "warm up" your benchmark.

Stour answered 27/2, 2012 at 16:1 Comment(1)
Enumerable.Intersect doesn't do any caching, but CPUs do caching at both loading executable code and data. I wouldn't rule it out having an effect unless I'd tested for such (pre-jit to take that part out of the equation, and see if there's still any difference).Taxation
C
1

You are enumerating the result of Intersect() only when you call Count(); that's when the calculation of the intersection actually occurs. The part you're timing is the creation of the enumerable object that represents the future calculation of the intersection.

In addition to the jitting penalty others have noted, the first call to Intersect() might be the first use of a type from System.Core.dll, so you might be looking at the time required to load the IL code into memory, as well.

Coition answered 27/2, 2012 at 16:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.