Why does this function run so much faster when it makes an extra read of memory?

I'm currently trying to understand the performance properties of certain loops on x86_64 (specifically, my Intel(R) Core(TM) i3-8145U CPU @ 2.10GHz processor). Specifically, adding an extra instruction to read memory inside the body of the loop almost doubles the performance, with the details not being particularly important.

I've been using a test program that consists of two main parts: a testing loop, and a function under test. The testing loop runs the function under test 2^32 times, once with each signed 32-bit integer as argument (in order from INT_MIN to INT_MAX). The function under test (named body) is a small function that checks to see if it was called with the expected arguments, and records the error in a global variable otherwise. The test program touches a sufficiently small amount of memory that everything is likely to fit in the L1 cache.

In order to eliminate any speed differences that might be caused by the behaviour of a compiler, I wrote both of the functions in question in assembly language (I'm using clang as the assembler), and forced them to start at fixed addresses (the performance of this sort of test loop often gets dominated by effects relating to alignment or caching, so using fixed addresses eliminates any alignment or cache effects unrelated to the change).

Here's the test loop, disassembled (it takes the address of the function to loop over in %rdi):

  401300:       53                      push   %rbx
  401301:       55                      push   %rbp
  401302:       51                      push   %rcx
  401303:       48 89 fd                mov    %rdi,%rbp
  401306:       bb 00 00 00 80          mov    $0x80000000,%ebx
loop:
  40130b:       89 df                   mov    %ebx,%edi
  40130d:       ff d5                   callq  *%rbp
  40130f:       83 c3 01                add    $0x1,%ebx
  401312:       71 f7                   jno    40130b <loop>
  401314:       59                      pop    %rcx
  401315:       5d                      pop    %rbp
  401316:       5b                      pop    %rbx
  401317:       c3                      retq   

and here's the simplest version of body, the function under test:

  401200:       33 3d 3a 3e 00 00       xor    0x3e3a(%rip),%edi        # 405040 <next_expected>
  401206:       09 3d 30 3e 00 00       or     %edi,0x3e30(%rip)        # 40503c <any_errors>
  40120c:       ff 05 2e 3e 00 00       incl   0x3e2e(%rip)        # 405040 <next_expected>
  401212:       c3                      retq

(The basic idea is that body checks to see if its argument %edi is equal to next_expected, and sets any_errors to a nonzero value if it isn't, otherwise leaving it unchanged. Then it increments next_expected.)
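
In rough C terms (this is just an illustration for readability; all the measurements used the assembly version above), body does the equivalent of:

unsigned any_errors = 0;
unsigned next_expected = 0x80000000u;   /* the bit pattern of INT_MIN */

void body(signed arg)                    /* arg arrives in %edi */
{
    any_errors |= ((unsigned)arg ^ next_expected);  /* nonzero iff arg != next_expected */
    next_expected++;                                /* the incl */
}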

The test loop, given this version of body in %rdi, takes approximately 11 seconds to run on my processor. However, adding an extra read of memory causes it to run in under 6 seconds (a difference much too large to be explained by random variation):

  401200:       33 3d 3a 3e 00 00       xor    0x3e3a(%rip),%edi        # 405040 <next_expected>
  401206:       09 3d 30 3e 00 00       or     %edi,0x3e30(%rip)        # 40503c <any_errors>
  40120c:       33 3d 2e 3e 00 00       xor    0x3e2e(%rip),%edi        # 405040 <next_expected>
  401212:       ff 05 28 3e 00 00       incl   0x3e28(%rip)        # 405040 <next_expected>
  401218:       c3                      retq

I've tried lots of different variants of this code to see what specific properties of the additional statement (the xor at 40120c in the listing above) give the "fast" behaviour. The common pattern seems to be that the statement needs to read from memory. Here are the various statements I've tried there (note: each of these is a single statement that's exactly 6 bytes long, so there are no alignment considerations to worry about):

These statements run quickly (~6 seconds):

  • It doesn't seem to matter what operation we read the memory with:
    • xor 0x3e2e(%rip),%edi # 405040 <next_expected>
    • and 0x3e2e(%rip),%edi # 405040 <next_expected>
    • mov 0x3e2e(%rip),%edi # 405040 <next_expected>
  • or what register we read into:
    • and 0x3e2e(%rip),%eax # 405040 <next_expected>
  • or (in most cases) what we're reading:
    • xor 0x11c7(%rip),%edi # 4023d9 <main>
    • or -0x12(%rip),%edi # 401200 <body>
  • or whether we're writing to memory in addition to reading it:
    • xor %edi,0x3e2a(%rip) # 40503c <junk>
  • Additionally, adding xor 0x11c7(%rip),%edi # 4023d9 <main>, but after rather than before the incl command, also gave fast performance.

These statements run slowly (~11 seconds):

  • It isn't sufficient to use an instruction that's 6 bytes long but doesn't read memory:
    • nopw %cs:(%rax,%rax,1)
    • lea 0x3e2e(%rip),%edi # 405040 <next_expected>
  • It isn't sufficient to write memory without reading it:
    • mov %edi,0x3e2a(%rip) # 40503c <junk>

Additionally, I tried writing the read value back to next_expected rather than incrementing it in place:

  401200:       33 3d 3a 3e 00 00       xor    0x3e3a(%rip),%edi        # 405040 <next_expected>
  401206:       09 3d 30 3e 00 00       or     %edi,0x3e30(%rip)        # 40503c <any_errors>
  40120c:       8b 3d 2e 3e 00 00       mov    0x3e2e(%rip),%edi        # 405040 <next_expected>
  401212:       ff c7                   inc    %edi
  401214:       89 3d 26 3e 00 00       mov    %edi,0x3e26(%rip)        # 405040 <next_expected>
  40121a:       c3                      retq

This had a performance very close to the original 11-second program.

One anomaly is the statement xor 0x3e2a(%rip),%edi # 40503c <any_errors>; adding that as the extra statement consistently gave a performance of 7.3 seconds, not matching either of the other two timings. I suspect that what's happening here is that the read of memory is sufficient to send the function onto the "fast path", but the statement itself is slow (because we just wrote any_errors in the preceding line, and writing and then immediately reading the same memory address is something that processors can struggle with), so we're getting fast-path performance plus a slowdown from the use of a slow statement. Something similar happens if I add a read of next_expected (rather than main) after rather than before the incl statement: again we're reading a memory address that was just written, so the similar behaviour is not surprising.

Another experiment was adding xor next_expected(%rip), %eax earlier in the function (either before the write of %edi to any_errors, or right at the start, before the read of next_expected). These placements gave a performance of around 8.5 seconds.

Anyway, at this point it seems fairly clear what is causing the fast behaviour (adding an additional read from memory is making the function run faster, at least when it's combined with the specific test loop shown here; it wouldn't surprise me if the details of the test loop were relevant). What I don't understand, though, is why the processor would behave like this. In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?

If you want to experiment with this yourself

Here's a minimal version of the program that can be compiled and run, and exhibits this problem (this is C with gcc/clang extensions, and specific to x86_64 processors):

#include <limits.h>

unsigned any_errors = 0;
unsigned next_expected = INT_MIN;

extern void body(signed);
extern void loop_over_all_ints(void (*f)(signed));

asm (
    ".p2align 8\n"
    "body:\n"
    "   xor next_expected(%rip), %edi\n"
    "   or %edi, any_errors(%rip)\n"
//    " xor next_expected(%rip), %edi\n"
    "   addl $2, next_expected(%rip)\n"
    "   retq\n"

    ".p2align 8\n"
    "loop_over_all_ints:\n"
    "   push %rbx\n"
    "   push %rbp\n"
    "   push %rcx\n"
    "   mov %rdi, %rbp\n"
    "   mov $0x80000000, %ebx\n"
    "loop:\n"
    "   mov %ebx, %edi\n"
    "   call *%rbp\n"
    "   inc %ebx\n"
    "   jno loop\n"
    "   pop %rcx\n"
    "   pop %rbp\n"
    "   pop %rbx\n"
    "   retq\n"
    );

int
main(void)
{
    loop_over_all_ints(&body);
    return 0;
}

(The commented-out line is an example of an extra memory read that makes the program run faster.)

Further experiments

After posting the question, I tried some further experiments in which the testing loop was unrolled to depth 2, and modified so that the two calls to the function under test could now be made to go to two different functions. When calling the loop with body as both functions, there was still a clear performance difference between the versions of the code with and without the extra memory read (6-7 seconds with, >11 seconds without), giving a clearer platform to look at the differences.
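
For reference, the unrolled loop looked roughly like the sketch below (not the exact code I ran; the name loop_over_all_ints_2 and the use of %r12 to hold the second function pointer are arbitrary choices for illustration, with the second pointer arriving in %rsi per the usual calling convention):

extern void loop_over_all_ints_2(void (*f)(signed), void (*g)(signed));

asm (
    ".p2align 8\n"
    "loop_over_all_ints_2:\n"
    "   push %rbx\n"
    "   push %rbp\n"
    "   push %r12\n"
    "   mov %rdi, %rbp\n"          /* first function under test */
    "   mov %rsi, %r12\n"          /* second function under test */
    "   mov $0x80000000, %ebx\n"
    "loop2:\n"
    "   mov %ebx, %edi\n"
    "   call *%rbp\n"
    "   inc %ebx\n"
    "   mov %ebx, %edi\n"
    "   call *%r12\n"
    "   inc %ebx\n"
    "   jno loop2\n"
    "   pop %r12\n"
    "   pop %rbp\n"
    "   pop %rbx\n"
    "   retq\n"
    );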

Here are the results of the tests with two separate body functions:

  • Same any_errors/next_expected variables for both, no extra read: ~11 seconds
  • Same any_errors/next_expected variables for both, extra read for both: 6-7 seconds
  • Same any_errors/next_expected variables for both, extra read in one but not the other: 6-7 seconds
  • Same next_expected variable but different any_errors variables, no extra reads: ~11 seconds
  • Same any_errors variable but different next_expected variables (thus an error is reported), no extra reads: 5-5½ seconds (noticeably faster than any case so far)
  • Same any_errors variable but different next_expected variables, addl $2 rather than incl on next_expected (so that no error is reported), no extra reads: 5-5½ seconds
  • Same as previous case, but with the extra reads: 5-5½ seconds (and an almost identical cycle count: it's different by only tens of millions compared to the billions of iterations, so the number of cycles per iteration must be the same)

It looks a lot like whatever is going on here is somehow related to the dependency chain on next_expected, because breaking that chain gives faster performance than anything that's possible with the chain present.

Further experiments #2

I've been trying a lot more variations of the program to try to eliminate possibilities. I've now managed to shrink a test case reproducing this behaviour down to the following asm file (assembled and linked with the GNU toolchain via as test.s -o test.o; ld test.o; it doesn't link against libc, so it's Linux-specific):

    .bss
    .align 256
a:
    .zero   4
b:
    .zero   4
c:
    .zero   4

        .text
    .p2align 8, 0
    .globl _start
_start:
    mov $0x80000000, %ebx
1:
//  .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
//  .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x66, 0x90
    mov a(%rip), %esi
    or %esi, b(%rip)
    or $1, a(%rip)
    mov c(%rip), %eax
    add $1, %ebx
    jno 1b

    mov $60, %rax
    mov $0, %rdi
    syscall

There are three versions of the program to compare: the version as written, the version which has 12 single-byte NOP instructions, and the version which has 11 NOP instructions (I made one of them two-byte to get the same alignment as in the 12-NOP case, but it doesn't matter).

When running the program with no NOPs, or with 11 NOPs, it runs in 11 seconds. When 12 single-byte NOPs are used, it runs in 7 seconds.

At this point, I think it's obvious that something is going wrong when the loop in question runs "too fast", and that the problem goes away when the loop is artificially slowed down. In the original version of the program, the extra read presumably bottlenecked it on the bandwidth for reading memory from L1 cache, which is why adding the read fixed the problem there. This version of the program speeds up when it is (artificially) bottlenecked on the front end instead; the only difference between "12 single-byte NOPs" and "10 single-byte NOPs and a 2-byte NOP" is how quickly the NOP instructions can get through the front end of the processor. So it seems that the loop runs faster if it's artificially slowed down, and it doesn't matter much which mechanism is used to slow it down.

Some performance counter information to exclude possibilities: the loop is running out of the loop stream decoder (lsd.cycles_active over 25 billion, idq.dsb_cycles and idq.mite_cycles less than 10 million, in both the 11-NOP and 12-NOP cases), which eliminates the possibility that the large number of NOPs added here is overloading the instruction caching mechanisms; and ld_blocks.store_forward is a single digit (I thought store-to-load forwarding might be involved, and it still might be, but this is the only performance counter related to it, so we won't get any more information about it this way).

The specific pattern of reads and writes used above is:

  • read memory into a register;
  • read-modify-write a different address, using the same register;
  • read-modify-write the original address;
  • read another address (in the original example, the pop of the instruction pointer served as the unrelated read).

This seems to be the simplest pattern that reproduces the behaviour; I haven't found any further simplifications which cause the behaviour to reproduce.
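
For readers who prefer C, the pattern corresponds roughly to the following (an illustration only; a compiler won't necessarily turn this into the same instruction sequence as the hand-written assembly above, and volatile is used here just to keep the four memory accesses in order):

volatile unsigned a, b, c;

void pattern(void)
{
    unsigned tmp = a;   /* 1: read memory into a register                */
    b |= tmp;           /* 2: read-modify-write a different address      */
    a |= 1;             /* 3: read-modify-write the original address     */
    (void)c;            /* 4: read another, unrelated address            */
}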

I still have no idea what's going on here, but hopefully this information will be useful to anyone who wants to try to figure it out.

Koine answered 17/1, 2021 at 1:6 Comment(13)
Adding a redundant assignment speeds up code when compiled without optimization probably explains a significant part of the difference, but I haven't seen it cause a factor of 2 difference. Perhaps a funky interaction with the call/ret loads/stores, like in Loop with function call faster than an empty loop? I assume your timing methods are controlling for idle vs. turbo frequency? Idiomatic way of performance evaluation?Bixler
I've been running the tests multiple times in a random sequence that should have no correlation to processor frequency scaling, so that's unlikely to explain the difference (additionally, the times are consistent for each given test even if I run it repeatedly in a loop, so frequency scaling likely doesn't have a huge impact). I like the store-to-load forwarding theory, but in that case, I'd expect a read of next_expected to have different effects from a read of some other location. The xor %edi,0x3e2a(%rip) vs. mov %edi,0x3e2a(%rip) result may help to rule out a lot of possibilities.Koine
Have you checked perf counters for effects like memory-order mis-speculation? (Unlikely; you don't have any stores with unknown addresses, and you don't have another thread modifying your memory.) Or maybe more plausibly, some kind of uop replay effect (port 2 / 3 dispatch counts higher than total unfused-domain executed count?) About the RIDL vulnerabilities and the "replaying" of loads. But your load addresses are all static RIP-relative constants, not pointers. So none of your loads depend on other loads. Stores do, but IDK if they'd tie up portsBixler
There's not much suspicious in the performance counters: the "fast" version (with the extra read) is hitting the L1 data cache and data TLB more often and running more instructions, and uses fewer cycles, but none of that is surprising. The only real oddity is that the "slow" version seems to have around twice as many L1 instruction cache misses, but that may just be because it runs for twice as long (the number of misses is only in the millions, compared to the billions of iterations, so it's hard to imagine it producing an effect like this).Koine
If you're letting your counters include kernel code, that's probably just noise from interrupts, with counts accumulating at ~constant rate per second, not per unit of work done in your loop. You might want perf stat --all-user to clean that up, equivalent of putting :u on all your events. We certainly don't expect any I-cache misses, assuming branch prediction is working and not speculatively fetching from crazy locations. (No reason to expect that either).Bixler
You're right, adding a :u (my perf is too old to support --all-user) fixed that issue. I've been trawling the detailed performance counters looking for other oddities, but haven't found much. cycle_activity.stalls_mem_any:u is at 7,625,732,425 on the slow path and 373,451,069 on the fast path, which is very likely relevant but too general to deduce much (the total cpu-cycles:u is 80,967,248,265 on the slow path and 45,281,021,934 on the fast path), and I haven't yet found an appropriate "more precise" counter to explain the stalls.Koine
Higher store-forwarding latency would account for execution stalling on the store/reload dependencies through either any_errors or next_expected. double time might be a uop replay, but IDK exactly why that would be happening. Might be possible to figure out which of the 2 memory vars are actually the critical bottleneck, or if either one could be depending on circumstances. e.g. turn inc mem into load / inc / store, and try adding some latency to that value while you have it in a register (e.g. imul $1, %eax, %eax or or $0, %eax) to see if that slows you down by another cycleBixler
It's hard to do things like adding latency without driving the program away from the slow path (although the read, inc-in-register, write program uses the slow path; I might look into working on that). However, it turns out that unrolling the testing loop doesn't change the basic performance behaviour of the program, so I was able to get some insight into the dependencies that way (by alternating between two different functions under test). I edited the results into the question, but as a summary, it seems to be next_expected that's the performance bottleneck.Koine
On my Haswell processor, I'm getting 8.9s for the case without the extra load vs. 8.1s for the case with the extra load at 3.6GHz. This difference is significant and worth investigating, but it's not nearly as big as what you're observing on your Whiskey Lake. Run your experiments again with perf stat -r 5 ./myexe and show the output. BTW, if I compile the code with gcc 9.3.0 at -O0, the assembly gets emitted in the non-executable .data section, so it crashes. But if I compile it with -O3, it goes into the executable .text section and works fine.Publias
@HadiBrais: I'm not convinced that perf's default output format is that useful (except inasmuch as it shows the standard deviations for the various counts), but in case it helps you, here it is. I ran the blocks of 5 runs, for each of the two versions of the program, 3 times, in an attempt to control for CPU frequency scaling effects, cache warming, etc. (although the 3 runs have very consistent timings anyway).Koine
I didn't realize Intel had fixed the LSD as early as Whiskey Lake. Last I heard, it wasn't until Ice Lake. But your test results prove Intel's fixed it; my Skylake with up-to-date microcode always counts 0 for lsd.cycles_active, with uops coming from the DSB at best. Anyway, that's a tangent and probably unrelated to your problem. The fact that uops are still coming from the LSD, not MITE, proves you didn't overflow the uop cache limit for a 32-byte region with that many single-byte nops.Bixler
Is performance reduced when executing loops whose uop count is not a multiple of processor width? investigated the front-end bottleneck of LSD vs. DSB for small loops, on pre-Haswell where the answer was definitely yes, and on post Haswell were the effect is smaller. That might help you figure out what exact effect your front-end changes are having.Bixler
worth noting on Tigerlake with your second example (test.s) I am seeing "fast" time for all 3 versions. Tested on my Whiskey Lake and seeing the same results. Worth noting with the two slow versions I saw that the loads are over-scheduled on port 3 (p2 = 9.9 bil, p3 = 11.5 bil) as opposed to the fast version where they are evenly split between ports 2/3 (p2 = p3 = 10.7 bil). Can't check load port distribution on Tigerlake as they seem to have compressed the p2 / p3 events into one. Don't think port distribution can explain a near 2x speedup but it does explain some of it.Siracusa

What

So @Peter Cordes was along the right lines in the comments: this is caused by stalls due to attempting to read a memory location less than three cycles after writing to it.

Here's a smaller program which produces the same effects; it's basically a minimal example of the program above:

    .bss
    .align 256
a:
    .zero   4

    .text
    .p2align 8, 0
    .globl _start
_start:
    mov $0x80000000, %ecx
1:
    mov a(%rip), %edx
    mov %edx, a(%rip)
    .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
    add $1, %ecx
    jno 1b

    mov $60, %rax
    mov $0, %rdi
    syscall

The eight no-op instructions inside the loop body slow the loop down to the extent that it runs at the rate of exactly one iteration per three cycles, causing the loop to run in around 3½ seconds (with the exact time depending on frequency scaling). If you remove them, the loop takes more like five seconds to run.
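
(As a sanity check on that rate, using the cycle counts in the table below: 2^32 iterations × 3 cycles per iteration ≈ 12.9 billion cycles, which closely matches the 12,969,389,151 cpu-cycles measured for the "fast" run; the "slow" run's 19,393,649,837 cycles works out to about 4.5 cycles per iteration.)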

Why

Looking at the small program above

That answers the question of what the trigger is, but it still doesn't really explain what's actually going on to cause the issue, or why it creates such a large performance deviation. The key clue to this was looking at the number of µops dispatched on each port. Here are the numbers for the minimized program above, together with a few other interesting statistics:

          slow             fast
 1,156,849,106    1,426,795,253      uops_dispatched_port.port_0:u
 1,386,792,998    1,432,967,110      uops_dispatched_port.port_1:u
 3,635,449,011    3,582,057,598      uops_dispatched_port.port_2:u
 4,169,272,248    5,003,425,117      uops_dispatched_port.port_3:u
18,538,282,061    6,427,027,115      uops_dispatched_port.port_4:u
 1,763,689,624    1,435,813,046      uops_dispatched_port.port_5:u
 4,299,495,754    4,286,287,760      uops_dispatched_port.port_6:u
   784,337,695       12,429,585      uops_dispatched_port.port_7:u
19,393,649,837   12,969,389,151      cpu-cycles:u
17,197,960,860   51,654,810,808      uops_issued.any:u
21,677,912,988   21,641,546,006      uops_executed.core:u
   5.245720659      3.491356466      seconds time elapsed
   5.220800000      3.483496000      seconds user

The only difference between the two programs is that the "fast" program is running an extra eight nops per iteration; with 2^32 iterations, that's an extra 34,359,738,368 nops. At the µop level, a nop is a µop that gets issued and then just disappears, so to get a fair comparison between the programs, we can subtract 34,359,738,368 from 51,654,810,808 to get an "adjusted" count of 17,295,072,440 µops issued for the "fast" program.

This reveals some pretty strange results when it comes to the µop lifecycle. After removing the nops, the number of µops issued is pretty much the same between the two programs – unsurprising, because they have the same code in that case. The number of µops executed is also pretty much the same, which is again unsurprising. (The issue and execution counts don't match each other because µops have a tendency to be split up, fused, rearranged etc. during execution, but this should affect both programs in the same way.)

On the other hand, the µop dispatch counts don't seem to match at all, which initially seems like it should be impossible. The trick is that, despite the naming, the performance counters for dispatch aren't counting µops directly – rather, they're counting the number of cycles that the port is busy handling a µop. So what the crazy mismatch on port 4 is showing is that the µops are getting "stuck" in port 4 and taking a while to execute (around 3 cycles, based on the statistics).

On the processor I'm using, port 4 is the only execution port capable of doing stores to memory (a store occupies two ports: port 2, 3 or 7 calculates the address, and port 4 does the actual store). The mental model I originally had of the processor oddity in question was that trying to read from a memory location less than 3 cycles after storing to it would simply make the read take longer than normal. However, the actual effect seems to be worse than that: the processor's behaviour appears to be that if you attempt to read from a memory location less than 3 cycles after storing to it, all writes on the processor thread get blocked for 3 cycles.

Looking at the program in the question

Now let's look at the program in "Further experiments #2" above, in terms of port usage (a side note: for some reason I had to add an extra three nops to get the "fast" behaviour, so maybe there's been some extra optimisation in a microcode update):

          slow             fast
 8,881,288,565    3,278,012,510      uops_dispatched_port.port_0:u
 8,852,984,797    4,992,142,029      uops_dispatched_port.port_1:u
10,385,944,986   10,726,029,245      uops_dispatched_port.port_2:u
11,048,867,603   10,721,449,693      uops_dispatched_port.port_3:u
25,015,457,606   11,608,396,673      uops_dispatched_port.port_4:u
 9,255,886,026    5,102,583,196      uops_dispatched_port.port_5:u
11,265,935,146    7,295,609,724      uops_dispatched_port.port_6:u
 4,333,803,057    4,304,364,919      uops_dispatched_port.port_7:u
42,554,124,800   26,024,822,032      cpu-cycles:u
38,686,833,876  103,170,056,257      uops_issued.any:u
52,058,581,194   51,940,941,322      uops_executed.core:u
  11.504080112      7.024786187      seconds time elapsed
  11.488620000      7.018980000      seconds user

With fifteen nops being issued per iteration in the "fast" version, the number of non-nop µops issued there is 38,745,546,817. So we're seeing much the same results: not counting the nops, the µop issue and execution rate are the same between the two programs, but some µops are taking much longer to execute in the slow version and tying up the ports.

There's one small difference: the instruction that's reading memory too soon after writing it is now the read-modify-write instruction or $1, a(%rip). The write half of that needs to use both port 4 to do the writing, and a port capable of doing or (0, 1, 5, or 6) to calculate the value to write. So we now see that ports 0, 1, 5, and 6 are all showing as having stuck µops – the instruction is getting stuck due to the early read, and tying up its ALU port in addition to port 4 during the three-cycle "I can't write" period. The main consequence of this appears to be that the rest of the program slows down a little too; my guess at this (which is consistent with the numbers above but might be wrong anyway) is that when port 6 ends up getting stuck, the loop logic has to wait, because port 6 is the only port that can handle the add $1, %ebx; jno 1b macro-fused instruction required to return to the start of the loop. So the program is being slowed by 3 clock cycles per loop iteration, and sometimes more because the loop logic is jammed too.

The interesting effect, though, is how large the performance deviations are. With the small program above, port 4 was stuck for 3 cycles per loop iteration, but that didn't slow down the loop as a whole by that much, because port 4 throughput wasn't the limiting factor. With this more complex program, though, port 4 being stuck for 3 cycles per loop iteration does seem to slow down every loop iteration by 3 cycles (and a little more, perhaps due to port 6 being stuck too), so the port 4 jam is having a close to maximal effect on the program as a whole. It's still not entirely obvious why port 4 being jammed for 3 cycles is causing the program as a whole to stall for 3 cycles (this program isn't obviously limited on port 4 throughput either), but it's at least plausible that an effect like this could be occurring.

Anyway, to answer the main practical question in the OP, "In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?": the rule is "on this processor, reading a memory address less than three cycles after writing it will jam write instructions for 3 cycles, which will in turn slow down your program by up to 3 cycles, maybe even a little more; you can avoid it by preventing reads from happening that quickly after a write". Unfortunately, with out-of-order processors, it can be hard to know when the processor will attempt to run instructions, so it can be hard to guarantee that two instructions won't run too quickly after each other. For programs like the one I was initially working on, the main piece of practical advice is most likely to avoid repeatedly reading and writing the same memory address in a tight loop, as opposed to keeping the value in a register and repeatedly reading and writing it there (the original assembly code only ended up looking the way it did because I was calling a function through a function pointer, meaning that the compiler couldn't safely make this optimisation).
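
To illustrate that last point, here's roughly what the work looks like when the values can be kept in registers across the whole loop (a sketch of the optimisation a compiler could have made if body were inlined; the function name here is just for illustration):

extern unsigned any_errors, next_expected;   /* the globals from the test program */

void loop_over_all_ints_inlined(void)
{
    unsigned expected = next_expected;   /* read the globals once... */
    unsigned errors   = any_errors;
    unsigned i = 0x80000000u;            /* the bit pattern of INT_MIN */
    do {
        errors |= i ^ expected;          /* record an error if i != expected */
        expected++;
        i++;
    } while (i != 0x80000000u);          /* 2^32 iterations, until i wraps around */
    next_expected = expected;            /* ...and write them back once at the end */
    any_errors    = errors;
}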

Koine answered 9/10, 2021 at 11:45 Comment(1)
The all-writes-blocked effect might be something like the 4k aliasing memory disambiguation problem, if it's a real thing. Also, if addresses aren't available when reload is attempted, modern Intel does dynamic prediction of whether it's a reload of a recent store or not. This can lead to machine_clears.memory_ordering events even in single-threaded code. (Haven't read your full answer yet.)Bixler
