Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

Asked 14/4, 2021 at 20:53 Answered 16/4, 2021 at 4:1

x86-64 cpu-architecture micro-optimization branch-prediction micro-architecture

All benchmarks are run on either Icelake or Whiskey Lake (In Skylake Family).

Summary

I am seeing a strange phenomina where it appears that when a loop transitions from running out of the Uop Cache to running out of the LSD (Loop Stream Detector) there is a spike in Branch Misses that can cause severe performance hits. I tested on both Icelake and Whiskey Lake by benchmarking a nested loop with the outer loop having a sufficiently large body s.t the entire thing did not fit in the LSD itself, but with an inner loop small enough to fit in the LSD.

Basically once the inner loop reaches some iteration count decoding seems to switch for idq.dsb_uops (Uop Cache) to lsd.uops (LSD) and at that point there is a large increase in branch-misses (without a corresponding jump in branches) causing a severe performance drop. Note: This only seems to occur for nested loops. Travis Down's Loop Test for example will not show any meaningful variation in branch misses. AFAICT this has something to do with when a loop transitions from running out of the Uop Cache to running out of the LSD.

Questions

What is happening when the loop transitions from running out of the Uop Cache to running out of the LSD that causes this spike in Branch Misses?
Is there a way to avoid this?

Benchmark

This is the minimum reproducible example I could come up with:

Note: If the .p2align statements are removed both loops will fit in the LSD and there will not be a transitions.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

static const uint64_t outer_N = (1UL << 24);


static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "movl   %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align   10\n"
        "2:\n"
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
}

Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>

Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>

Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>

Graphs

Edit: x label is N iterations of inner loop.

Below is a graph of Branch Misses, Branches, and LSD Uops.

Generally you can see that 1) there is no corresponding jump in Branches. 2) that the number of added Branch Misses stabilizes at a constant. And 3) That there is a strong relationship between the Branch Misses and LSD Uops.

Icelake Graph:

Whiskey Lake Graph:

Below is a graph of Branch Misses, Cycles, and LSD Uops for Icelake only as performance is not affected nearly as much on:

Analysis

Hard numbers below.

For Icelake starting at N = 22 and finishing at N = 27 there is some fluctuation in the number of uops coming from the LSD vs Uop Cache and during that time Branch Misses increases by roughly 3 order of magnitude from 10^4 -> 10^7. During this period Cycles also increased by a factor of 2. For all N > 27 Branch Misses stays at around 1.67 x 10^7 (roughly outer_loop_N). For N = [17, 40] branches continues to only increase linearly.

The results for Whiskey Lake look different in that 1) N begins fluctuating at N = 35 and continues to fluctuate until N = 49. And 2) there is less of a performance impact and more fluctuation in the data. That being said the increase in Branch-Misses corresponding with transitions from uops being fed by Uop Cache to being fed by LSD still exists.

Results

Data is mean result for 25 runs.

Icelake Results:

N	cycles	branches	branch-misses	idq.ms_uops	idq.dsb_uops	lsd.uops
1	33893260	67129521	1590	43163	115243	83908732
2	42540891	83908928	1762	49023	142909	100690381
3	50725933	100686143	1782	47656	142506	117440256
4	67533597	117461172	1655	52538	186123	134158311
5	68022910	134238387	1711	53405	204481	150954035
6	85543126	151018722	1924	62445	141397	167633971
7	84847823	167799220	1935	60248	160146	184563523
8	101532158	184570060	1709	60064	361208	201100179
9	101864898	201347253	1773	63827	459873	217780207
10	118024033	218124499	1698	59480	177223	234834304
11	118644416	234908571	2201	62514	422977	251503052
12	134627567	251678909	1679	57262	133462	268435650
13	285607942	268456135	1770	74070	285032524	315423
14	302717754	285233352	1731	74663	302101097	15953
15	321627434	302010569	81796	77831	319192830	1520819
16	337876736	318787786	71638	77056	335904260	1265766
17	353054773	335565563	1798	79839	352434780	15879
18	369800279	352344970	1978	79863	369229396	16790
19	386921048	369119438	1972	84075	385984022	16115
20	404248461	385896655	29454	85348	402790977	510176
21	421100725	402673872	37598	83400	419537730	729397
22	519623794	419451095	4447767	91209	431865775	97827331
23	702206338	436228323	12603617	109064	427880075	327661987
24	710626194	453005538	12316933	106929	432926173	344838509
25	863214037	469782765	14641887	121776	428085132	614871430
26	761037251	486559974	13067814	113011	438093034	418124984
27	832686921	503337195	16381350	113953	421924080	556915419
28	854713119	520114412	16642396	124448	420515666	598907353
29	869873144	536891629	16572581	119280	421188631	629696780
30	889642335	553668847	16717446	120116	420086570	668628871
31	906912275	570446064	16735759	126094	419970933	702822733
32	923023862	587223281	16706519	132498	420332680	735003892
33	940308170	604000498	16744992	124365	419945191	770185745
34	957075000	620777716	16726856	133675	420215897	802779119
35	974557538	637554932	16763071	134871	419764866	838012637
36	991110971	654332149	16772560	130903	419641144	872037131
37	1008489575	671109367	16757219	138788	419900997	904638287
38	1024971256	687886583	16772585	139782	419663863	938988917
39	1041404669	704669411	16776722	137681	419617131	972896126
40	1058594326	721441018	16773492	142959	419662133	1006109192
41	1075179100	738218235	16776636	141185	419601996	1039892900
42	1092093726	754995452	16776233	142793	419611902	1073373451
43	1108706464	771773224	16776359	139500	419610885	1106976114
44	1125413652	788549886	16764637	143717	419889127	1139628280
45	1142023614	805327103	16778640	144397	419558217	1174329696
46	1158833317	822104321	16765518	148045	419889914	1206833484
47	1175665684	838881537	16778437	148347	419562885	1241397845
48	1192454164	855658755	16778865	153651	419552747	1275006511
49	1210199084	872436025	16778287	152468	419599314	1307945613
50	1226321832	889213188	16778464	155552	419572344	1341893668
51	1242886388	905990406	16778745	155401	419559249	1375589883
52	1259559053	922767623	16778809	154847	419554082	1409206082
53	1276875799	939544839	16778460	162521	419576455	1442424993
54	1293113199	956322057	16778931	154913	419550955	1476316161
55	1310449232	973099274	16778534	157364	419578102	1509485876
56	1327022109	989876491	16778794	162881	419562403	1543193559
57	1344097516	1006653708	16778906	157486	419567545	1576414302
58	1362935064	1023430928	16778959	315120	419583132	1609691339
59	1381567560	1040208143	16778564	179997	419661259	1640660745
60	1394829416	1056985359	16778779	167613	419575969	1677034188
61	1411847237	1073762626	16778071	166332	419613028	1710194702
62	1428918439	1090539795	16778409	168073	419610487	1743644637
63	1445223241	1107317011	16778486	172446	419591254	1777573503
64	1461530579	1124094228	16769606	169559	419970612	1810351736

Whiskey Lake Results:

N	cycles	branches	branch-misses	idq.dsb_uops	lsd.uops
1	8332553879	35005847	37925	1799462	6019
2	8329926329	51163346	34338	1114352	5919
3	8357233041	67925775	32270	9241935	5555
4	8379609449	85364250	35667	18215077	5712
5	8394301337	101563554	33177	26392216	2159
6	8409830612	118918934	35007	35318763	5295
7	8435794672	135162597	35592	43033739	4478
8	8445843118	152636271	37802	52154850	5629
9	8459141676	168577876	30766	59245754	1543
10	8475484632	185354280	30825	68059212	4672
11	8493529857	202489273	31703	77386249	5556
12	8509281533	218912407	32133	84390084	4399
13	8528605921	236303681	33056	93995496	2093
14	8553971099	252439989	572416	99700289	2477
15	8558526147	269148605	29912	109772044	6121
16	8576658106	286414453	29839	118504526	5850
17	8591545887	302698593	28993	126409458	4865
18	8611628234	319960954	32568	136298306	5066
19	8627289083	336312187	30094	143759724	6598
20	8644741581	353730396	49458	152217853	9275
21	8685908403	369886284	1175195	161313923	7958903
22	8694494654	387336207	354008	169541244	2553802
23	8702920906	403389097	29315	176524452	12932
24	8711458401	420211718	31924	184984842	11574
25	8729941722	437299615	32472	194553843	12002
26	8743658904	453739403	28809	202074676	13279
27	8763317458	470902005	32298	211321630	15377
28	8788189716	487432842	37105	218972477	27666
29	8796580152	504414945	36756	228334744	79954
30	8821174857	520930989	39550	235849655	140461
31	8818857058	537611096	34142	648080	79191
32	8855038758	555138781	37680	18414880	70489
33	8870680446	571194669	37541	34596108	131455
34	8888946679	588222521	33724	52553756	80009
35	9256640352	604791887	16658672	132185723	41881719
36	9189040776	621918353	12296238	257921026	235389707
37	8962737456	638241888	1086663	109613368	35222987
38	9005853511	655453884	2059624	131945369	73389550
39	9005576553	671845678	1434478	143002441	51959363
40	9284680907	688991063	12776341	349762585	347998221
41	9049931865	705399210	1778532	174597773	72566933
42	9314836359	722226758	12743442	365270833	380415682
43	9072200927	739449289	1344663	205181163	61284843
44	9346737669	755766179	12681859	383580355	409359111
45	9117099955	773167996	1801713	235583664	88985013
46	9108062783	789247474	860680	250992592	43508069
47	9129892784	806871038	984804	268229102	51249366
48	9146468279	822765997	1018387	282312588	58278399
49	9476835578	840085058	13985421	241172394	809315446
50	9495578885	856579327	14155046	241909464	847629148
51	9537115189	873483093	15057500	238735335	932663942
52	9556102594	890026435	15322279	238194482	982429654
53	9589094741	907142375	15899251	234845868	1052080437
54	9609053333	923477989	16049518	233890599	1092323040
55	9628950166	940997348	16172619	235383688	1131146866
56	9650657175	957049360	16445697	231276680	1183699383
57	9666446210	973785857	16330748	233203869	1205098118
58	9687274222	990692518	16523542	230842647	1254624242
59	9706652879	1007946602	16576268	231502185	1288374980
60	9720091630	1024044005	16547047	230966608	1321807705
61	9741079017	1041285110	16635400	230873663	1362929599
62	9761596587	1057847755	16683756	230289842	1399235989
63	9782104875	1075055403	16299138	237386812	1397167324
64	9790122724	1091147494	16650471	229928585	1463076072

Edit: 2 things worth noting:

If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.
Adding an lfence on with the padding in the outer loop changes N threshold to 31.

edit2: ~~Benchmark which clears branch history~~ The condition was reversed. It should be cmove not cmovne. With the fixed version any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^9). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that LSD is in some way defeating the Branch Predictor (what I believe to be the case).

static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "testl  $3, %k[outer_loop_cnt]\n"
        "movl   $1000, %k[inner_loop_cnt]\n"
        THIS NEEDS TO BE CMOVE
        "cmovne   %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align   10\n"
        "2:\n"
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}

Epiphytotic answered 14/4, 2021 at 20:53 Comment(2)

BTW, you can use taskset perf ... so taskset isn't part of the workload you're profiling. Also, if you config your perf_event_paranoid to 0 for example, you can avoid needing sudo on benchmark runs. – Cuprite 14/4, 2021 at 22:47

You might consider plotting branch misses on the right axis because the values are so small they are hard to see when they share an axis with "branches". – Brevier 23/5, 2021 at 0:24

This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD¹): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.

Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.

Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.

My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either of both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.

With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.

So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)

e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.

N	count	rate	variance
17	59,602	0.02% of all branches	+- 10.85%
20	192,307	0.05% of all branches	( +- 44.60% )
21	79,853	0.02% of all branches	( +- 14.16% )
30	136,308	0.02% of all branches	( +- 18.57% )
31..32	similar to N=34		( +- 2 or 3% )
33	22,415,089	3.71% of all branches	( +- 0.11% )
34	91,483	0.01% of all branches	( +- 2.36% )
35 (and 36..37 similar)	98,806	0.02% of all branches	( +- 2.75% )
38	33,517,630	4.87% of all branches	( +- 0.05% )
39	102,077	0.01% of all branches	( +- 1.96% )
40	33,458,267	4.64% of all branches	( +- 0.06% )
41	116,241	0.02% of all branches	( +- 6.86% )
42	22,376,562	2.96% of all branches	( +- 0.01% )
43	116,713	0.02% of all branches	( +- 5.25% )
44	174,834	0.02% of all branches	( +- 35.08% )
45	124,555	0.02% of all branches	( +- 5.36% )
46	135,838	0.02% of all branches	( +- 9.95% )

These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.

Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)

With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.

Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.

This is surprising, and different from a loop without as much padding.

Cuprite answered 14/4, 2021 at 22:41 Comment(52)

That may not be the case. If I add padding to the inner loop to defeat the LSD I don't see any increase in branch-misses until ~150 inner loop iterations. – Epiphytotic 14/4, 2021 at 22:46

Err, I buy I that the LSD is a red herring, but I don't see where the constants are coming from. – Epiphytotic 14/4, 2021 at 22:56

I updated the benchmark a bit (see edit) to basically try and reset the branch predictor of the inner loop (3/4 of the runs do 1000 which should get it to always predict taken, 1/4 do N). I see the same correspondence between LSD and branch-misses. I think if it was just N reaching the size of the branch history buffer we should not expect the clear jump at N = 22. – Epiphytotic 14/4, 2021 at 23:26

@Noah: The 22 or 23 transition threshold was with a much tighter loop; apparently that makes a difference because your test loop is different. Update with some selected test data for my system. – Cuprite 14/4, 2021 at 23:30

Havent updated my firmware and bought the machine before the update. It does draw attention back to the LSD, however, if on your whiskey lake w/o the LSD you are seeing different rates. How stable are the rates if you use the same idea as my updated benchmark with flushing the branch history 3/4 of the iterations? If its just branch history buffer I would have expected that to produce a relatively stable branch-misses rate across all N. – Epiphytotic 14/4, 2021 at 23:38

@Noah: My machine is Skylake, although that's mostly microarchitecturally identical, except Whiskey Lake has HW fixes for Meltdown and L1TF, and Stepping 12 has extra mitigation for Spectre and Speculative Store Bypass. en.wikichip.org/wiki/intel/microarchitectures/…. – Cuprite 14/4, 2021 at 23:44

@Noah: Anyway, you don't need to update your mobo firmware, just install the Linux package intel-ucode or whatever your distro calls it, and make sure it's configured to have Linux update the firmware during early boot. Although if WHL needed its LSD disabled, its initial firmware would have already had that; it didn't come out until after that fix was already in CPUs like Skylake-server. – Cuprite 14/4, 2021 at 23:46

@Noah: 3/4 of the runs do 1000 iterations: IT-TAGE indexes a prediction based on history of last several branches, it's definitely not simple. I wouldn't be fully confident of that wiping out everything. – Cuprite 14/4, 2021 at 23:49

I see. What would you say as the best approach? Replace the padding with chain of jumps to fill the table? – Epiphytotic 15/4, 2021 at 0:11

@Noah: I don't know enough about IT-TAGE to know how to design experiments to figure out its parameters and do certain things. :/ Doing 1k iterations for 3/4 of the runs is a good idea; does it actually make a difference for you? I wouldn't be sure of what that will do, but it's better than nothing. – Cuprite 15/4, 2021 at 0:15

It makes a difference in that the low bar for branch-misses is roughtly 2 * 10^6 but the high bar still stabalizes as 1.67 * 10^7. I tested again with 63/64 for the 1k iteration and low from [0..20] with ~10^6. At N = 21 sharp rise again in LSD and 1.67 * 10^7 mispredicts. With 1023/1024 sharp rise at N = 22 (along with LSD). Doubtful that the LSD is the cause but there is some common thread that I think goes past just IT-TAGE. Btw do you know of any good reading for IT-TAGE? Google & Agner are not particularly helpful in finding anything. – Epiphytotic 15/4, 2021 at 0:29

@Noah: comparch.net/2013/06/30/why-tage-is-the-best is a brief intro and links a paper. Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore compares simulated IT-TAGE against real measurements on Sandybridge vs. Haswell, but doesn't go into a lot of detail about how IT-TAGE works. (And the use-case is different: primarily an indirect branch as a grand-central dispatch in an interpreter loop.) – Cuprite 15/4, 2021 at 0:34

Those links got me in the right direction. Thanks! Two nice resources I found: This CMU lecture and the origional paper. Based on what I'm seeing TAGE you might be right that I wasnt clearing the history. Think one of the layers could have kept predictor for N. – Epiphytotic 15/4, 2021 at 1:53

Alright, confirmed it was the branch predictor. Guess the constants happen to correspond with the LSD. Big oof on my part as well. The conditions on my "defeat brand predictor" benchmark where reversed. instead of 3/4 1k, it was 1/4 1k.... – Epiphytotic 15/4, 2021 at 2:11

I think what you said about "unrolling" in the LSD is somewhat related. If I pad the inner loop with a few nop instructions s.t that it can still fit in the LSD it will increase the bound where misprediction starts. Something about a jcc being unrolled by the LSD doesn't play well with the branch predictor but not sure what could be mechanism. It seems that the pc is somehow considered different in some sense but not sure how that could be. – Epiphytotic 15/4, 2021 at 14:47

@Noah: A predicted-taken uop is different from a predicted-not-taken uop (e.g. not-taken can run on more ports), so that prediction might get nailed down when it locks the uops down in the LSD. Also keep in mind that changing the relative(?) addresses of branches can affect how they interfere with each other in the BPU, so it's hard to do controlled experiments. Perhaps more smaller NOPs vs. fewer larger NOPs could let you keep all the branches at the same addresses while varying the LSD behaviour by changing how many uops in the inner loop. – Cuprite 15/4, 2021 at 20:0

Decided to just pad the second loop by same amount as first loop. Any overhead on outer loop should be neglidgable and keeps relative offset constants. Ill post a graph if I find any interesting results. – Epiphytotic 15/4, 2021 at 20:21

@Noah: I meant varying the uop count of the inner loop while holding its size in bytes constant, so the address of every branch (and every branch target) stays constant. I guess you could try that with different absolute amounts of padding in different places to see if different layouts cause different interference effects. – Cuprite 15/4, 2021 at 20:24

Edit with results from the test. Pretty sure this is caused by the LSD and its not just some happenstance. – Epiphytotic 16/4, 2021 at 2:28

@Noah: Some of that EDIT / update stuff could go in a separate answer, instead of making such a long question. That would be the best place for further experiments and especially for theories / inferences drawn from their results. – Cuprite 16/4, 2021 at 2:36

I could also cleanup the question and drop all edits buts #5 and add it is evidence that whatever is happening to the branch predictor is indeed caused by the LSD. Reason Im hesitant to write an answer is i really have no idea why this could be happening. What do you suggest? – Epiphytotic 16/4, 2021 at 3:14

@Noah: A tentative answer is fine. You don't have to accept that answer. You can even put in bold at the top of the answer "Results so far; more answers welcome" or something to that effect. – Cuprite 16/4, 2021 at 3:15

Wrote answer and cleaned up question. Sorry had to take away your checkmark, it was a bit premature but I think there is more too this. – Epiphytotic 16/4, 2021 at 4:9

@Noah: Yeah, that's fine, I was surprised you accepted it so early, when it was still just a total guess. – Cuprite 16/4, 2021 at 4:20

The relevant answer is really just the quote from the intel manual. Do you think I should move the "proof its the LSD" (or drop entirely) and make the answer just the quote? – Epiphytotic 16/4, 2021 at 19:57

@Noah: I don't think you should remove your other testing; it's useful to back up that interpretation of Intel's description with some testing. Intel has been known to be wrong in their manuals, or omit mention of other complications. Not being able to break out of the LSD other than by branch mispredict is interesting; I guess that means it doesn't spend power on the BPU and recycles the predictions it had when uops went into the LSD, so it loses the ability to predict the loop exit for short inner loops. – Cuprite 16/4, 2021 at 21:18

@Noah: Hmm, I forget when I tested and found that ~22 iteration inner-loop limit; I thought it was after microcode updates disabled the LSD, but if it was before; that could maybe explain it: LSD kicking in could have been what stopped the loop-exit from predicting correctly. – Cuprite 16/4, 2021 at 21:20

The manual explicitly mentions 20, but these as well point to 21-25. I'll edit and make it an approximate. I think breaking on mispredict also fundamentally makes sense with the design although is VERY annoying in this case (and basically any microbenchmark with a loop!) – Epiphytotic 16/4, 2021 at 21:48

@Noah: It might be interesting to try a microbenchmark with a single loop that has a simple repeating pattern for an if() in the body, like every 3rd iteration there's a taken forward branch, so it's not going to be a multiple of the LSD unroll factor. e.g. dec edi / jnz skip_reset / mov edi, 3 / skip_reset:. (That's actually taken mostly, occasionally not-taken, but I think the LSD can maybe still work if it jumps to a separate block and then back in a pattern. Or does this defeat the LSD? I forget.) – Cuprite 16/4, 2021 at 21:53

In my tests I found for very lower iteration counts it would run out of the LSD and have perfectly predicted branch. I think for just a decl edi; jnz loop iteration count was ~12, though adding nop to the inner loop body would decrease iteration count where this would work. i.e loop: nop; decl edi; jnz loop would leave LSD at ~10. So maybe there is a second case for tiny loops who can entirely unroll in the LSD. – Epiphytotic 16/4, 2021 at 22:0

@Noah: like the whole outer loop fits in the LSD, containing a fully-unrolled copy of the inner loop? – Cuprite 16/4, 2021 at 22:4

The entire nested loop runs out of the LSD for iteration <= 11 and there are a neglible amount of branch misses. It will entirely runout of the LSD for iterations = 11 as long as the inner loop uops <= FE width (5 on ICL). If I increase inner loop uops to 6 can only get this low branch miss / entirely LSD with iterations = 9-10 (it varies, 10 is 25% Uop Cache, 75% LSD but 9 is 100% LSD). The interesting thing is after inner loop uops > FE width each additional uop changes the iteration threshold, but inner loop uops <= FE width are all the same. Sorry for confusion in last comment! – Epiphytotic 17/4, 2021 at 0:57

So yeah, given how iteration count + nops affect how it runs out of the LSD I think its the outer loop + unrolled inner loop is in LSD. Interesting that you essentially get free unrolling for small constant loops at runtime. – Epiphytotic 17/4, 2021 at 1:1

@Noah: Not free, the loop branch still has to go through the issue/rename stage. The LSD usually doesn't make a difference to front-end bottlenecks for tiny loops. Real unrolling reduces loop overhead. – Cuprite 17/4, 2021 at 1:14

Can you think of any no overhead way to defeat this? I'm finding that on some of my old code I mistuned alignment when the extra nops from say aligning a loop was the difference between hitting this extra branch miss or not. Only thing I can think of is to make all my benchmarks use random(ish) data but that has overhead in itself, adding a constant number of nops (but that defeats code size optimizations), or adding empty call; ret in code path to defeat the LSD. Any ideas? (or maybe deserves its own question?). – Epiphytotic 19/4, 2021 at 14:51

@Noah: Is it a significant slowdown? Can you just unroll more so the loop doesn't fit in the LSD? But that might introduce more unpredictable branching in the loop prologue. Mis-predicting before you get the loop iterations into the back-end is probably a lot worse than after. (Unless the code before the loop could equally well fill up the back-end with a long critical-path to chew on.) Also keep in mind that hyperthreading can hide a lot of the cost of a mispredict if your code is threaded. I wouldn't try to defeat the LSD with a call;ret, that would be worse for CPUs without LSDs. – Cuprite 19/4, 2021 at 20:48

Graph 3 my origional question has CPU cycles. The extra branch misses do cause a spike in time. This test code also probably about as ideal as it gets for hiding the code of a mispredict because decl can speculate ahead easily (better though if there where some high latency instructions in the loop). In a data dependent situation (i.e pointer chasing till NULL) will probably be a lot worse. – Epiphytotic 19/4, 2021 at 21:20

@Noah: yeah, in this fully artificial benchmark sure it costs cycles, because there are no data dependency chains or cache misses for the back-end to be crunching during branch recovery. Also, your workload contains a lot of NOPs outside that inner loop, so any mispredict loses cycles of front-end throughput. dec/jnz can only "speculate ahead" when there's a slower loop body; taken branches (including macro-fused) have 1/clock throughput. Is your real use-case also that high throughput, like would come close to saturating the front-end uops / clock limit if not for mispredicts? – Cuprite 19/4, 2021 at 21:25

Okay fair. You might not see this in ideal mispredict hiding circumstances. But I think it can (and does) have an affect in real cases. Noticed this first time when benchmarking some stuff from GLIBC optimized x86_64 library. For example current strchr implementation has serious performance drop off at size = 2305 in benchmarks that loop through exact same size (as glibc currently does for their benchmarks). – Epiphytotic 19/4, 2021 at 21:51

Hey, this is totally unrelated but if you have a minute can you take a look at this glibc patch for memset. The maintainer had me replace sall with andl because andl has better tput. But I'm pretty sure its actively harmful. 1) It costs 16 bytes of code. 2) It makes it so that movl $-1, %ecx in the non page cross case is decoded on cycle 2 instead of cycle 1 essentially costing a cycle on the critical path. The only case I can see is that in page_cross sall will compete with jcc for p06. Am I missing anything? – Epiphytotic 19/4, 2021 at 22:26

Let us continue this discussion in chat. – Cuprite 19/4, 2021 at 22:55

something interesting I've been noticing is that loops aligned to 16 bytes enter the LSD slightly faster. Just thought I'd note that. – Epiphytotic 21/5, 2021 at 16:55

this is totally unrelated but I shared a document with you a while ago on some reverse engineering of apple m1. Any chance you have a link? I can't refind it. – Epiphytotic 30/11, 2022 at 17:19

@Noah: Yeah, I bookmarked the google drive link: drive.google.com/file/d/1WrMYCZMnhsGP4o3H33ioAUKL_bjuJSPt/view . IIRC, it was attributed to Maynard Handley. Yup, techspot.com/news/… is an article about it. – Cuprite 30/11, 2022 at 21:19

totally unrelated, but do you know when the domain cross penalty for unpckpd -> unpckdq was fixed? Is it HWL+ (blends), SKL+(shuffle) or something different? – Epiphytotic 4/4, 2023 at 15:5

@Noah: Was there ever a penalty on Sandybridge-family CPUs? I thought all shuffles could forward to or from int/FP with no penalty on SnB. – Cuprite 4/4, 2023 at 18:4

I think it varies a bit. So according to Agner there is not bypass delay on shufps being misused, but there is a delay for shufd on SnB. By Haswell he says no more delay on any shuffle. Does "shuffle" include unpack? This is for LLVM. We have NoBypassDelayShuffle starting as HSW. – Epiphytotic 5/4, 2023 at 3:8

we have a new pass X86FixupInstTuning that essentially replaces bad codegen that exists because modifying the DAG selector isn't effective, and I want to change the replacement of unpck{lo|hi}pd -> unpck{lo|hi}dq (currently we do shuf{ps|pd}. But haven't been able to track down the set of CPU families it safe for. – Epiphytotic 5/4, 2023 at 3:13

@Noah: Thanks, I'd forgotten that detail. punpckhqdq and unpckhpd are clearly shuffles. They run only on port 5, and they move elements to different positions in the output than they were in the input so aren't just a blend. From Agner's examples for SnB/IvB, I'd guess most or all shuffles can forward to/from SIMD-integer ops like paddd, but only FP shuffles can forward with zero latency to FP instructions, on the theory that a shuffle execution unit is expensive, so it's cheaper to just have one and cross-connect it for forwarding instead of having separate units in separate domains. – Cuprite 5/4, 2023 at 3:22

@Noah: Just a guess, not tested; I have an old SnB laptop I could boot from USB. Oh, also I was forgetting, SnB had a 2nd shuffle unit for integer shuffles because it only had AVX1, so its integer shuffle units only needed to be 128-bit wide. Only the port 5 shuffle unit was 256-bit, and only it could handle FP shuffles. (Haswell dropped the extra shuffle unit, limiting all shuffles to 1/clock throughput until Ice Lake allowed some 256-bit shuffles to run 2/clock on port 1/5). I'm sure that's related to there being bypass latency between FP math and integer shuffles on SnB. – Cuprite 5/4, 2023 at 3:26

re:"Thanks, I'd forgotten that detail. punpckhqdq and unpckhpd are clearly shuffles. They run only on port 5", realized that is somewhat missing a explanation for why on ICX only unpckhpd still only runs on p5, but all other shuffles (and punckhqdq) run on p15. – Epiphytotic 6/4, 2023 at 16:49

@Noah: Lots of shuffles on ICX are p5 only, including palignr, vpermd, unpckhps, unpcklps as well as pd unpacks. It is a bit odd that integer unpacks run on p15 while FP unpacks only run on p5, though, especially when shufps and shufpd run on p15 and can do the same shuffle as unpckl/h ps/pd. – Cuprite 6/4, 2023 at 19:3

Reason

The cause of the spike in Branch Misses is caused by the inner loop running out of the LSD.
The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.

From Intel Optimization Manual Page 86.

The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number of wasted slots during LSD.

Essentially what is happening is that when low enough iteration counts run out of the Uop Cache they are perfectly predictable. But when they run out of the LSD since the built in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is don't let nested loops execute out of the LSD. Note that LSD only kicks in after ~[20, 25] iterations so an inner loop with < 20 iterations will run optimally.

Benchmark

All benchmarks are run on either Icelake

The new benchmark is essentially the same as the one in the origional post but at the advise of @PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is fixed length so that there is no change in how the branches may alias in the BHT (Branch History Table) but varying the number of nops to sometimes defeat the LSD.

I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.

The benchmark code is as follows:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif

         
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

static const uint64_t outer_N   = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask  = (1023 << bht_shift);

#define NOP1   ".byte 0x90\n"
#define NOP2   ".byte 0x66,0x90\n"
#define NOP3   ".byte 0x0f,0x1f,0x00\n"
#define NOP4   ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5   ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6   ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7   ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8   ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9   ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10  ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11  ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"


static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
    asm volatile(
        ".p2align 12\n"
        "movl   %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align   6\n"
        "1:\n"
        "movl   %k[inner_N], %k[inner_loop_cnt]\n"
        ".p2align   10\n"
        "2:\n"
        // This is defined in "inner_nops.h" with the necessary padding.
        INNER_NOPS
        "decl   %k[inner_loop_cnt]\n"
        "jnz    2b\n"
        ".p2align   10\n"
        "decl   %k[outer_loop_cnt]\n"
        "jnz    1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
          [ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
        :);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
    return 0;
}

Tests

I tested nop count = [0, 39].

Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:

#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1

To reach the full 128 byte padding.

Results

At nop count <= 32 the inner loop is able to run out of the LSD and we consistently see elevant Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers outer loop iterations = 2^24
At nop count > 32 the loop has to many uops for the LSD and runs out of the Uop Cache. Here we do not see a consistent elevated Branch Misses until Iterations becomes to large for its BHT entry to store its entire history.

nop count > 32 (No LSD)

Once there are too many nops for the LSD the number of Branch Misses stays relatively low with a few consistent spikes until Iterations = 146 where Branch Misses spike to number of outer loop iterations (2 ^ 24 in this case) and remain constants. My guess is that is the upper bound on history the BHT is able to store.

Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:

All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.

The key point, however, is that for the most cast for Iterations = [20, 145] Branch Misses is relatively low indicating that the entire inner loop is being predicted correctly.

nop count <= 32 (LSD)

This data is a bit more noising bit all of the different nop count follow roughly the same trend of spiking initialing to within a factor of 2 of outer loop iterations Branch Misses at Iterations = [21, 25] (note this is 2-3 orders of magnitude) at the same time the lsd.oups spiked by 4-5 orders of magnitude.

There is also a trend between nop count and what iteration value Branch Misses stablize at outer loop iterations with a Pearson Correlation of 0.81. for nop count = [0, 32] the stablization point is in range iterations = [15, 34].

Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:

Generally, with some noise, all of the different nop count follow the same trend. As well they follow the same trend when compared with lsd.uops.

Below is a table with nop and the Pearson Correlation between Branch Misses and lsd.uop and idq.dsb_uops respectively.

nop	lsd	uop cache
0	0.961	-0.041
1	0.955	-0.081
2	0.919	-0.122
3	0.918	-0.299
4	0.947	-0.117
5	0.934	-0.298
6	0.894	-0.329
7	0.907	-0.308
8	0.91	-0.322
9	0.915	-0.316
10	0.877	-0.342
11	0.908	-0.28
12	0.874	-0.281
13	0.875	-0.523
14	0.87	-0.513
15	0.889	-0.522
16	0.858	-0.569
17	0.89	-0.507
18	0.858	-0.537
19	0.844	-0.565
20	0.816	-0.459
21	0.862	-0.537
22	0.848	-0.556
23	0.852	-0.552
24	0.85	-0.561
25	0.828	-0.573
26	0.857	-0.559
27	0.802	-0.372
28	0.762	-0.425
29	0.721	-0.112
30	0.736	-0.047
31	0.768	-0.174
32	0.847	-0.129

Which should generally indicate that there is a strong correlation between LSD and Branch Misses and no meaningful relationship between the Uop Cache and branch misses.

Overall

Generally I think it is clear that when the inner loop executing out of the LSD is what is causing the Branch Misses until Iterations becomes too large for the BHT entry's history. For the N = [33, 39] save the explained spikes we don't see elevated Branch Misses but we do for the N = [0, 32] case and the only difference I can tell is the LSD.

Epiphytotic answered 16/4, 2021 at 4:1 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Summary

Questions

Benchmark

Graphs

Analysis

Results

Reason

Benchmark

Tests

Results

nop count > 32 (No LSD)

nop count <= 32 (LSD)

Overall

Recommended topics

Hot tags