What are the microarchitectural details behind MSBDS (Fallout)?
CVE-2018-12126 has been assigned to MSBDS (Microarchitectural Store Buffer Data Sampling), one of the Intel processor vulnerabilities belonging to the newly created MDS (Microarchitectural Data Sampling) class.

I'm trying to understand the microarchitectural details behind these vulnerabilities. I've started with MSBDS, also known as Fallout (cf. Meltdown), which allows an attacker to leak the content of the store buffer.

For some reason, cybersecurity papers discussing microarchitectural details are often imprecise.
Luckily, the MSBDS paper cites patent US 2008/0082765 A1 (from which the pictures are taken).

From what I've gathered, it seems that in the case of MSBDS the vulnerability resides in how the memory disambiguation algorithm handles loads with an invalid physical address.

This is the algorithm that is supposedly used to check whether a load matches a store in the store buffer:

Memory disambiguation algorithm for loads and stores

At 302, the page offset of the load is checked against the page offset of every previous store in the store buffer.
If this check fails, the load doesn't match any store and can be executed (it has already been dispatched) at 304.
If the check at 302 hits, the upper part of the load's virtual address is checked1 against the virtual addresses of the stores.
If a match is found, the load matches and at 308 either the data it needs is forwarded or the load itself is blocked (until the matching store commits) if forwarding is impossible (e.g. a narrow store to a wider load).
Note that the same virtual address can be mapped to two different physical addresses (at different times, but within the store forwarding window). Incorrect forwarding is prevented not by this algorithm but by draining the store buffer (e.g. with a mov cr3, X, which is serialising)2.
If the virtual address of the load doesn't match any virtual address of the stores, the physical addresses are checked at 310.
This is necessary to handle the case where different virtual addresses map to the same physical address. (A small C sketch of this flow follows below.)
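
To make the order of these checks concrete, here is a minimal C model of my reading of the patent flow. It is only a sketch: the names and the sequential structure are mine, and real hardware performs these comparisons in parallel CAM lookups rather than one at a time.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t virt;        /* virtual address of the store */
    uint64_t phys;        /* physical address, if already translated */
    bool     phys_valid;  /* false e.g. on a DTLB miss or a not-present page */
} store_entry;

typedef enum { LOAD_EXECUTE, LOAD_FORWARD_OR_BLOCK } load_action;

#define PAGE_OFFSET(a) ((a) & 0xFFFULL)  /* low 12 bits */
#define PAGE_UPPER(a)  ((a) >> 12)       /* bits above the page offset */

/* Check one load against one older store, following operations 302/304/308/310. */
static load_action disambiguate(uint64_t load_virt, uint64_t load_phys,
                                bool load_phys_valid, const store_entry *st)
{
    /* 302: compare the page offsets only. */
    if (PAGE_OFFSET(load_virt) != PAGE_OFFSET(st->virt))
        return LOAD_EXECUTE;                 /* 304: no match, execute the load */

    /* Next, compare the upper virtual address bits. */
    if (PAGE_UPPER(load_virt) == PAGE_UPPER(st->virt))
        return LOAD_FORWARD_OR_BLOCK;        /* 308: forward the data or block the load */

    /* 310: virtual addresses differ; fall back to the physical addresses.
     * Per paragraph [0026], an invalid physical address is treated as a hit. */
    if (!load_phys_valid || !st->phys_valid)
        return LOAD_FORWARD_OR_BLOCK;        /* 308 again: the MSBDS-relevant case */

    return (load_phys == st->phys) ? LOAD_FORWARD_OR_BLOCK : LOAD_EXECUTE;
}

int main(void)
{
    /* Same page offset (0x007), different pages, and the load's physical
     * address is invalid: the model reports a (false) hit. */
    store_entry st = { .virt = 0x1000007, .phys = 0x5007, .phys_valid = true };
    load_action a = disambiguate(0x7007, 0, false, &st);
    printf("%s\n", a == LOAD_FORWARD_OR_BLOCK ? "forward/block (hit)" : "execute (miss)");
    return 0;
}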

Paragraph [0026] adds:

In one embodiment, if there is a hit at operation 302 and the physical address of the load or the store operations is not valid, the physical address check at operation 310 may be considered as a hit and the method 300 may continue at operation 308. In one instance, if the physical address of the load instruction is not valid, the load instruction may be blocked due to DTLB 118 miss. Further, if the physical address of the store operation is not valid, the outcome may be based on the fine net hit/miss results in one embodiment or the load operation may be blocked on this store operation until the physical address of the store operation is resolved in an embodiment.

This means the CPU will consider only the lower (12) bits of the address if the physical address is not available3.
Considering that the case of a TLB miss is addressed a few lines below, this leaves only the case where the accessed page is not present.

This is indeed how the researchers present their attack:

char *victim_page = mmap(..., PAGE_SIZE, ...);
char *attacker_page = mmap(..., PAGE_SIZE, ...);

mprotect(attacker_page, PAGE_SIZE, PROT_NONE);

offset = 7;
victim_page[offset] = 42;

// Why people hate specpolines??
if (tsx_begin() == 0) {
    // Read the stale value and exfiltrate it with a spectre gadget
    memory_access(lut + 4096 * attacker_page[offset]);
    tsx_end();
}

// Reload phase of FLUSH+RELOAD
for (i = 0; i < 256; i++) {
    if (flush_reload(lut + i * 4096)) {
        report(i);
    }
}
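
The tsx_begin()/tsx_end() helpers are not defined in the listing; the paper uses TSX to suppress the fault. Assuming they are thin wrappers around the RTM intrinsics, a plausible definition would be (compile with -mrtm):

#include <immintrin.h>

// Returns 0 if we are now executing inside a transaction.
static inline int tsx_begin(void)
{
    return _xbegin() == _XBEGIN_STARTED ? 0 : 1;
}

// Commits the transaction; only reached if the transaction did not abort,
// because an abort rolls execution back to _xbegin() with a non-started status.
static inline void tsx_end(void)
{
    _xend();
}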

I'm not sure what else would give rise to an invalid physical address (accesses to privileged pages return the correct physical address).

Is it really the handling of an invalid physical address that triggers the MSBDS vulnerability?


1The SBA (Store Buffer Address) component holds both the virtual and the physical address of a store, though possibly only a fragment of the physical address (with the rest in a dedicated array, possibly named the Physical Address Buffer).
2It's unclear to me if it's really possible to trigger a wrong forwarding by changing a page table entry to point somewhere else and then issuing an invlpg.
3My rationale is that since we are not in a recoverable case anyway (the load is faulting), skipping another check, at the risk of an incorrect forwarding, is worth it performance-wise, since it makes the load retire (and fault) earlier.

Thetis answered 15/5, 2019 at 19:30 Comment(4)
Regarding the second footnote, invlpg is a fully serializing instruction, so incorrect forwarding cannot occur because the mapping cannot be changed for the same virtual address without committing all previous stores. Regarding para 0026, the last sentence looks important because it describes 4K aliasing, which is what the authors seem to call WTF. I've not read the paper, but it looks like WTF is an exploitation of 4K aliasing, which makes perfect sense. I'm planning to read the paper and maybe post an answer after that just to be sure.Kinesics
@HadiBrais Thank you, I'm in fact trying to understand if it's just a 4K aliasing issue. All the official news and papers I've read stress the importance of a "faulting load". Normal 4K aliasing should only cause a delay.Thetis
Yea it looks like this speculation only occurs when the aliasing load faults. The same thing for RIDL. But if it is 4K aliasing, then it would be very strange that the authors have not mentioned 4K aliasing anywhere in the paper, even though it is well known. (Maybe it's intentional to confuse everybody.)Kinesics
@HadiBrais I'm not sure but I think that until the load keeps replaying younger dependent uops cannot dispatch. As I understand it, 4K aliasing is correctly detected and just delays the load while in the case of a faulting load this check is skipped and the load completes execution (allowing dependent uops to dispatch and mount the classical covert channel attack).Thetis
Memory consistency requires that a load uop obtains the value that was most recently stored into the target memory location. Therefore, the memory order buffer (MOB) must determine whether the load overlaps any earlier store uop in program order. Both the load buffer and store buffer are circular and each load is tagged with the ID of the youngest store that precedes the load in program order (the allocator knows the ID of the last store it has allocated at the time it has to allocate the load). This enables the MOB to correctly determine which stores precede which loads.

Starting with the Intel Core microarchitecture and the Goldmont microarchitecture, the scheduler includes speculative memory disambiguation (SMD) logic that uses the IP of the load to decide whether to allow the load to be dispatched out-of-order with respect to the STA uops of all earlier stores. This is similar to how branch prediction uses the IP of the current 16-byte chunk being fetched to predict control flow, except that in this case the IP is used for memory disambiguation. If there are no STAs waiting in the RS or if all STAs can be dispatched in the same cycle as the load uop, the SMD result is ignored and the load is dispatched. Otherwise, if SMD decides to block the load, the scheduler dispatches the load only when all earlier STAs have been dispatched or will be dispatched in the same cycle as the load. For some load uops, the SMD always blocks the load in the RS.

When a load uop is dispatched to one of the load AGU ports, the effective address, i.e., the linear address, of the load is calculated using the specified segment base, base register operand, index register operand, scale, and displacement. At the same time, there can be stores in the store buffer. The linear address of the load is compared against the linear addresses of all earlier stores whose STA uops were executed (i.e., whose linear address is available). It might also be necessary to compare the physical addresses, but the physical address of the load is still not available at this point (this situation is referred to as an invalid physical address in the patent). To minimize the observable latency of the load, the MOB performs a quick comparison using only the least significant 12 bits of the linear addresses of the load and each earlier store. For more information on this comparison, refer to L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes (masked uops are not discussed there, though). This logic is called the loose net, and it constitutes the other part of the speculative memory disambiguation mechanism. The loose net is supported on all Intel microarchitectures since the Pentium Pro (including the in-order Bonnell), but the exact implementation has changed because the size of data a single load or store uop can operate on has increased and because of the introduction of masked memory uops starting with the Pentium II. In parallel with the loose net operation, the linear address of the load is sent to the TLB to obtain the corresponding physical address and perform the necessary page attribute checks, and the segment checks are also performed.
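
The loose net is also what produces the well-known 4K aliasing penalty: a load whose low 12 address bits match those of an in-flight store to a different page gets a false hit and has to be replayed once the conflict is resolved. A rough, illustrative microbenchmark along these lines (my own sketch, not from the paper or the linked question; exact thresholds and penalties vary by microarchitecture) would be:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Each iteration stores to buf[0] and immediately loads from buf[delta].
 * With delta = 4096 the low 12 bits of the load and store addresses are equal
 * (same page offset, different page), so the loose net should report false
 * hits and the loads should get replayed; with delta = 2048 the low 12 bits
 * differ and no false hit is expected. */
static double time_loop(volatile char *buf, size_t delta, long iters)
{
    struct timespec t0, t1;
    char sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        buf[0] = (char)i;      /* store, still in flight... */
        sink += buf[delta];    /* ...when this load dispatches */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sink == 42) putchar('\n');   /* keep sink live */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    const long iters = 200 * 1000 * 1000;
    char *buf = aligned_alloc(4096, 3 * 4096);
    if (!buf) return 1;
    memset(buf, 0, 3 * 4096);
    printf("delta = 4096 (4K aliasing expected): %.3f s\n", time_loop(buf, 4096, iters));
    printf("delta = 2048 (no aliasing expected): %.3f s\n", time_loop(buf, 2048, iters));
    free(buf);
    return 0;
}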

If the load does not overlap with any earlier store whose address was known at the time the load was dispatched according to the loose net result, a load request is sent to the L1D. We already know from the RIDL vulnerabilities that some data might be forwarded to the MOB even without having a valid physical address from the TLB, but only if the load causes a fault or assist. On a first-level TLB miss, the load is blocked in the load buffer so that it doesn't continue with its L1D access just yet. Later when the requested page entry reaches the first-level TLB, the MOB is informed about the address of that virtual page, which in turn checks all of the loads and stores that are blocked on that page and unblocks them by replaying the uops as per the availability of TLB ports.

I think the loose net takes only one cycle to compare the address of a given load with any number of stores in the store buffer and determine the youngest overlapping store that is older than the load, if any found. The process of looking up the first-level TLB and providing the physical address to the L1D on a hit should take only one cycle. This is how a best-case load-to-use latency of 4 cycles can be attained (which also requires (1) correct speculation of the physical page address, (2) the base+disp addressing mode without an index or with a zero index, and (3) a segment base address of zero, otherwise there is a penalty of at least one cycle). See the discussion in the comments for more on this.
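
A simple way to observe that best case (my own sketch, not from the sources above) is a pointer-chasing loop in which a cell contains its own address, so every load feeds the address of the next one through a plain [reg] addressing mode:

#include <stdio.h>
#include <time.h>

int main(void)
{
    void *cell;
    cell = &cell;                  /* the cell stores its own address */
    void *p = cell;
    const long iters = 200 * 1000 * 1000;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = *(void **)p;           /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    /* Multiply by the core clock in GHz to get cycles per load; on recent
     * Intel cores this should come out at roughly 4-5 cycles. */
    printf("%.2f ns per dependent load (p = %p)\n", ns, p);
    return 0;
}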

Note that if the load uop misses in the loose net, it can be concluded that the load does not overlap any previous store, but only if the STAs of all earlier stores had already been executed at the time the load uop was dispatched. It's impossible for two linear addresses whose least significant 12 bits differ to overlap.

If the loose net result indicates that the load overlaps with an earlier store, the MOB does two things in parallel. One of them is that the memory disambiguation process continues using the fine net (i.e., a full linear address comparison). If the load misses in the fine net, the physical addresses are compared when they become available. Otherwise, if the load hits in the fine net, the load and the store overlap. Note that the x86 ISA requires using a fully serializing instruction after making changes to a paging structure, so there is no need to compare the physical addresses in the fine net hit case. In addition to all of that, whenever a new STA uop is dispatched, this whole process is repeated, but this time with all loads in the load buffer. The results of all of these comparisons are combined and, once the load has been checked against all earlier stores, the end result determines how to correctly execute the load uop.

In parallel, the MOB speculates that the store that hit in the loose net with the load has the value that should be forwarded to the load. If the load and store are to the same virtual page, then the speculation is correct. If the load and store are to different virtual pages but the virtual pages are mapped to the same physical page, the speculation is also correct. Otherwise, if the load and store are to different physical pages, the MOB has messed up, resulting in a situation called 4K aliasing. But wait, let's roll back a little.

It may not be possible to forward the store data to the load. For example, if the load is not fully contained in the store, then it has to wait until the store is committed, after which the load is allowed to proceed and get the data from the cache. Also, what if the STD uop of the store has not executed yet (e.g., it depends on a long-latency uop)? Normally, data is only forwarded from the store buffer when the requirements for store forwarding are met. However, the MSBDS vulnerability shows that this is not always the case. In particular, when the load causes a fault or assist, the store buffer may forward the data to the load without performing any of the store forwarding checks. From the Intel article on MDS:

It is possible that a store does not overwrite the entire data field within the store buffer due to either the store being a smaller size than the store buffer width, or not yet having executed the data portion of the store. These cases can lead to data being forwarded that contains data from older stores.

Clearly, the data may be forwarded even if the STD uop has not executed yet. But where will the data come from then? Well, the data field of a store buffer entry is not cleared when the entry is deallocated. The size of the data field is equal to the width of a store uop, which can be determined by measuring the number of store uops it takes to execute the widest available store instruction (e.g., from an XMM, YMM, or ZMM register). This seems to be 32 bytes on Haswell and 64 bytes on Skylake-SP. Each data field of a store buffer entry is that big. Since it is never cleared, it may hold some random combination of data from stores that happened to be allocated in that store buffer entry. When the load hits in the loose net and will cause a fault/assist, the data of the width specified by the load is forwarded to the load from the store buffer without even checking the execution of the STD or the width of the store. That's how the load can get data from one or more stores that may even have been committed a billion instructions ago. Similar to MLPDS, some parts of the data, or the whole data, that gets forwarded may be stale (i.e., it doesn't belong to the store that occupies the entry).
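
A hypothetical way to do that width measurement (my sketch; the perf event name is microarchitecture-dependent and is an assumption) is to run a loop of the widest available stores and count the store uops, e.g. with perf stat -e mem_uops_retired.all_stores. Roughly one counted uop per iteration would indicate a 32-byte data field; two would indicate 16 bytes.

/* Compile with -mavx2 and run under perf stat. */
#include <immintrin.h>
#include <stdint.h>

int main(void)
{
    static uint8_t buf[32] __attribute__((aligned(32)));
    __m256i v = _mm256_set1_epi8(42);
    for (long i = 0; i < 100 * 1000 * 1000; i++) {
        _mm256_store_si256((__m256i *)buf, v);   /* one 32-byte store */
        __asm__ volatile("" ::: "memory");       /* keep the repeated stores from being elided */
    }
    return (int)buf[0];
}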

These details were actually only provided by Intel, not the Fallout paper. In the paper, the authors perform an experiment (Section 4) on systems with KPTI disabled (I'll explain why), but they don't exploit the Meltdown vulnerability. Here is how the experiment works:

  1. The attacker performs a sequence of stores, all of which miss in the cache hierarchy. The number of stores is at least as large as the number of store buffer entries.
  2. A kernel module is invoked, which performs a sequence of stores, each to a different offset in a different kernel page. The stored values are known. The number of stores is varied between 1 and 50, as shown in Figure 5. After that, the kernel module returns to the attacker.
  3. The attacker performs a sequence of loads to user pages (different from the kernel pages) at the same offsets. Each user page is allocated only in the virtual address space and has its access permission revoked (by calling mprotect(...,PROT_NONE), marking it as User and Not Present). Table 1 shows that a Supervisor page that is not Present doesn't work. The number of loads is the same as the number of stores performed by the kernel module. The loaded values are then leaked using a traditional FLUSH+RELOAD attack.

The first step attempts to keep the store buffer as occupied as possible in order to delay committing the stores from the kernel module. Remember that false store forwarding only works on occupied store buffer entries. The first step works because stores have to commit in order. In the third step, all that matters is to get loose net hits. Note how in this experiment the authors were not trying to leak any stale data; they just wanted to get the data from the kernel stores that is hopefully still in the store buffer. When changing the current privilege level, all instructions are retired before any instructions are executed at the new privilege level. The stores can retire quickly, even before the RFO request completes, but they still have to wait in the store buffer to commit in order. Having stores from different privilege levels in the store buffer in this way was not thought to be a problem. However, when the attacker begins executing the loads, if the store that is to the same offset as the load currently being dispatched is still in the store buffer, a loose net hit occurs and the (not stale) data is speculatively forwarded. You know the rest.
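
A hypothetical sketch of what the first step could look like (the helper name, the entry count, and the use of clflush are my assumptions, not the paper's code):

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stdint.h>

#define SB_ENTRIES 64    /* rough upper bound; e.g. 56 entries on Skylake client */

static uint8_t pool[SB_ENTRIES * 64] __attribute__((aligned(64)));

/* Issue at least as many stores as there are store buffer entries, each to a
 * cache line that was just flushed, so the RFOs are slow and the stores stay
 * in the store buffer for as long as possible. */
static void fill_store_buffer(void)
{
    for (int i = 0; i < SB_ENTRIES; i++)
        _mm_clflush(&pool[i * 64]);
    _mm_mfence();                      /* wait for the flushes to complete */
    for (int i = 0; i < SB_ENTRIES; i++)
        pool[i * 64] = (uint8_t)i;     /* cache-missing stores */
}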

When KPTI is enabled, most kernel pages live in a different virtual address space than the user pages. Thus, when returning from the kernel module, the kernel has to switch address spaces by writing a value into the CR3 register. But this is a serializing operation, which means that it stalls the pipeline until all (kernel) stores are committed. That's why the authors needed KPTI to be disabled for their experiment to work (otherwise the store buffer would already be empty when the attacker's loads execute). Unfortunately, since Coffee Lake R has a hardware mitigation for Meltdown, the Linux kernel, by default, disables KPTI on this processor. That's why the authors say that the hardware mitigation has made the processor more vulnerable.

What's described in the Intel article (but not the paper) shows that MSBDS is much more dangerous than that: a faulting/assisting load can also leak stale data from the store buffer. The Intel article also shows that MSBDS works across sibling logical cores: when a logical core goes into a sleep state, the store buffer entries that were statically allocated to it may become usable by the other logical core. Later, if the logical core becomes active again, the store buffer is statically partitioned again, which may enable that core to leak stale data from its entries that were written by the other core.

All of this shows that enabling KPTI is not enough to mitigate MSBDS. The mitigation recommended in Section 6 of the paper (flushing the store buffer using MFENCE when crossing a security boundary) is not sufficient either. Proper MDS mitigations are discussed here.
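
For reference, the documented MDS mitigation relies on updated microcode that overloads the VERW instruction: on processors that enumerate MD_CLEAR (CPUID.(EAX=7,ECX=0):EDX[10]), executing VERW with a memory operand holding any valid, readable segment selector overwrites the store buffer, load ports, and fill buffers. A minimal sketch of that sequence (normally issued by the kernel on privilege transitions, not by user code; without the MD_CLEAR microcode it is just an ordinary VERW):

#include <stdint.h>

static inline void mds_clear_cpu_buffers(void)
{
    uint16_t selector;
    /* Any valid, readable selector works; reuse the current DS selector. */
    __asm__ volatile("mov %%ds, %0" : "=r"(selector));
    /* Only has the buffer-clearing side effect with MD_CLEAR microcode loaded. */
    __asm__ volatile("verw %0" : : "m"(selector) : "cc");
}

int main(void)
{
    mds_clear_cpu_buffers();
    return 0;
}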

I don't know how the authors in Section 3.2 can start from the following quote from the Intel patent:

if there is a hit at operation 302 [partial match using page offsets] and the physical address of the load or the store operations is not valid, the physical address check at operation 310 [full physical address match] may be considered as a hit

and conclude the following:

That is, if address translation of a load μOP fails and the 12 least significant bits of the load address match those of a prior store, the processor assumes that the physical addresses of the load and the store match and forwards the previously stored value to the load μOP.

The patent doesn't mention comparing 12 bits anywhere, and it doesn't say that the load has to fault for the false store forwarding to occur. In addition, the conclusion itself is not correct, because the 12 least significant bits don't have to match exactly and the load doesn't have to fault (although the attack only works if it does).

MSBDS is different from Meltdown in that the attacker leaks data from kernel pages that live in a separate virtual address space. MSBDS is different from SSB (Speculative Store Bypass), where the attacker mistrains the SMD so that it dispatches the load before the STA uops of all earlier stores have been dispatched. That reduces the chance of the load hitting in the loose net, which makes the MOB issue the load to the L1D cache and potentially get a value that is not the most recent one according to program order. SMD can be disabled by setting IA32_SPEC_CTRL[2] to 1. When SMD is disabled, the scheduler handles load uops as on the Pentium Pro.
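
As an aside, bit 2 of IA32_SPEC_CTRL (MSR 0x48) is the SSBD bit. A hedged sketch of flipping it directly through the Linux msr driver (requires root and the msr module, and only affects the one logical CPU whose device file is opened; in practice you would use the kernel's spec_store_bypass_disable= option or prctl(PR_SET_SPECULATION_CTRL, ...) instead):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const off_t IA32_SPEC_CTRL = 0x48;   /* the MSR address is used as the file offset */
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof val, IA32_SPEC_CTRL) != sizeof val) { perror("pread"); return 1; }
    val |= 1ULL << 2;                    /* bit 2: SSBD */
    if (pwrite(fd, &val, sizeof val, IA32_SPEC_CTRL) != sizeof val) { perror("pwrite"); return 1; }

    close(fd);
    return 0;
}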

It's worth noting briefly that there are load and store uops that work differently from what I have described above. Examples include memory uops from MFENCE, SFENCE, and CLFLUSH. But they are not relevant here.

Kinesics answered 20/5, 2019 at 2:14 Comment(6)
L1d doesn't need the full physical address until it has tags loaded to check against. L1 caches are VIPT so TLB access can happen in parallel with fetching tags+data from all ways of the set selected by the index bits. (Which are all in the low 12 bits in Intel designs, i.e. part of the page offset, so they translate for free.) I don't think it's likely that Intel has 1-cycle dTLB access; given the details of the 4-cycle fast path, it may be the TLB that's the critical path. The reg+offset calculation can affect the set index bits without mis-speculating so it's not bypassed for indexingOrdinance
measuring the number of store uops it takes to execute the widest available store instruction. Does that imply that SnB/IvB have 256-bit store-buffer entries, because their 256b loads/stores are single-uop, but they occupy the load-data or store-data execution units for 2 cycles? Or do SnB/IvB actually take 2 entries even for aligned 32-byte loads/stores? 256b loads aren't dispatched twice; a store-address uop can use the load port in the 2nd data-only cycle of a 256b load.Ordinance
Oh, your requirements for 4-cycle load latency aren't quite right. base+disp can have a disp from 0..2047, it doesn't have to be zero. It mis-speculates if the +disp crosses a 4k boundary. Is there a penalty when base+offset is in a different page than the base? This is why I'm concluding that indexing L1d uses the result of the addition, not fully bypassed. The low 12 bits might be ready more quickly than the full 64-bit result, though, even with carry-lookahead etc..Ordinance
@PeterCordes I was thinking that the base address (from the base register) is first sent to the TLB for lookup. In the same cycle, base+disp is calculated in parallel. In the second cycle, the cache set index from the full address is sent to the cache to open a cache set (with the tags) and, in parallel the TLB (on a hit) send the physical address to the cache. In the third cycle, the tags are compared and the matching line is determined. In the 4th cycle, the requested data is placed on the forwarding network. In the 5th cycle, a dependent uop can be dispatched. Hence, 4 cycles.Kinesics
But you might be right. For example, it may take "a half a cycle" to send the base address to the TLB and then the lookup may take up to 2 cycles and the tag comparison may take half a cycle. Notice how all of these operations may be on the critical path simultaneously and constrain the maximum CPU frequency. In the requirements for 4-cycle load, I didn't say that disp has to be zero, I said that the index has to be zero or non-existent and the segment base has to be zero.Kinesics
I don't know the answer to your questions regarding SnB/IvB. An alternative way to measure the width of a store uop is to use a string store instruction and divide the size of the array by the number of uops dispatched to the store buffer. String operations use the full width of memory uops.Kinesics
