What use is the INVD instruction?

The x86 INVD invalidates the cache hierarchy without writing the contents back to memory, apparently.

I'm curious, what use is such an instruction? Given how one has very little control over what data may be in the various cache levels and even less control over what may have already been flushed asynchronously, it seems to be little more than a way to make sure you just don't know what data is held in memory anymore.

Baccivorous answered 21/1, 2017 at 3:13 Comment(3)
Well, if you want to know what's in memory after the INVD instruction, all you have to do is read it. INVD is for when you don't care what's in memory any more. Intel's manual entry for INVD lists "temporary memory, testing, or fault recovery" as cases where you might use it.Johny
I'm shocked: Somehow SO managed not to ask a single question about the purpose of this x86 instruction invd in all the time it's existed (all the way back to 1989's 80486), and so this isn't a dupe. It's also the #4 hit for "invd" on Google already.Chordate
@IwillnotexistIdonotexist that happens to me quite a lot, when I search an acronym and my answer is the only verbatim search Google result and I realise I'm never going to know whatever it is unless someone from Intel architecture design tells me the goodsMatias

Excellent question!

One use-case for such a blunt-acting instruction as invd is in specialized or very-early-bootstrap code, such as when the presence or absence of RAM has not yet been verified. Since we might not know whether RAM is present, its size, or even if particular parts of it function properly, or we might not want to rely on it, it's sometimes useful for the CPU to program part of its own cache to operate as RAM and use it as such. This is called Cache-as-RAM (CAR). During setup of CAR, while using CAR, and during teardown of CAR mode, the kernel must ensure nothing is ever written out from that cache to memory.

Cache-as-RAM

Entering CAR

To set up CAR, the CPU must be set to No-Fill Cache Mode and must designate the memory range to be used for CAR as Write-Back. This can be done by the following steps:

  1. Set up an MTRR (Memory Type Range Register) to designate a chunk of memory as WB (Write-Back).
  2. invd the entire cache, preventing any cached write from being written out and causing chaos.
  3. Set caching mode to Normal Cache Mode (cr0.CD=0).
  4. While in Normal Cache Mode, "touch" all cachelines of the memory span to be used as CAR by reading it, and thus filling the cache with it. Filling cachelines can only be done in Normal Cache Mode.
  5. Set caching mode to No-Fill Cache Mode (cr0.CD=1).
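
The entry sequence above can be sketched in NASM-style assembly. This is only an illustrative sketch, not real firmware: CAR_BASE and CAR_SIZE are placeholder constants chosen to fit in the cache, variable MTRR pair 0 is an arbitrary choice, and 64-byte cachelines with 36 physical address bits are assumed (a real implementation must also program IA32_MTRR_DEF_TYPE and handle APs).

```nasm
; Hypothetical CAR entry (32-bit protected mode, flat segments).
%define CAR_BASE  0xFEF00000        ; placeholder CAR window
%define CAR_SIZE  0x00008000        ; 32 KiB, must fit within the cache

        ; 1. Program variable MTRR 0: mark the CAR range Write-Back (type 06h).
        mov     ecx, 0x200          ; IA32_MTRR_PHYSBASE0
        mov     eax, CAR_BASE | 0x06
        xor     edx, edx
        wrmsr
        mov     ecx, 0x201          ; IA32_MTRR_PHYSMASK0
        mov     eax, (0xFFFFFFFF & ~(CAR_SIZE - 1)) | (1 << 11) ; mask | valid
        mov     edx, 0x0000000F     ; upper mask bits for 36-bit physical
        wrmsr

        ; 2. Discard any stale lines without writing them back.
        invd

        ; 3. Normal Cache Mode: CR0.CD=0, CR0.NW=0.
        mov     eax, cr0
        and     eax, ~((1 << 30) | (1 << 29))
        mov     cr0, eax

        ; 4. Touch every cacheline of the CAR range so it is pulled into cache.
        mov     esi, CAR_BASE
        mov     ecx, CAR_SIZE / 64
.fill:  mov     eax, [esi]
        add     esi, 64
        loop    .fill

        ; 5. No-Fill Cache Mode: CR0.CD=1. Hits are still served from cache,
        ;    but misses no longer allocate lines, so nothing can be evicted.
        mov     eax, cr0
        or      eax, (1 << 30)
        mov     cr0, eax
```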

Using CAR

The motivation for setting up CAR is that once set up, all accesses (read/write) within the CAR region will hit cache and will not hit RAM, yet the cache's contents will be addressable and act just like RAM. Therefore, instead of writing assembler code that only ever uses registers, one can now use normal C code, provided that the stack and local/global variables it accesses are restricted to within the CAR region.
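
Concretely, "using CAR" mostly amounts to pointing the stack at the cached region and jumping into compiled code. A hypothetical hand-off might look like this (romstage_main and the CAR_BASE/CAR_SIZE constants are placeholders, not a real firmware API):

```nasm
        ; Point the stack at the top of the CAR region, then enter C code.
        ; Everything that code touches (stack, locals, globals) must stay
        ; inside [CAR_BASE, CAR_BASE + CAR_SIZE).
        mov     esp, CAR_BASE + CAR_SIZE
        call    romstage_main       ; ordinary compiled C, e.g. DRAM init
```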

Exiting CAR

When CAR is exited, it would be a bad thing for all of the memory writes incurred in this "pseudo-RAM" to suddenly shoot out from cache and trash any actual content at the same address in RAM. So when CAR is exited, once again invd is used to completely delete the contents of the CAR region, and then Normal Cache Mode is set up.
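
A sketch of the teardown, under the same assumptions as above; the crucial point is that it is invd (discard) rather than wbinvd (write back, then discard):

```nasm
        ; Discard the pseudo-RAM contents so no dirty CAR line is ever
        ; written out over real memory at the same addresses.
        invd
        ; Restore Normal Cache Mode: CR0.CD=0, CR0.NW=0.
        mov     eax, cr0
        and     eax, ~((1 << 30) | (1 << 29))
        mov     cr0, eax
```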

Intel 80486 Manual

Intel alluded to the Cache-as-RAM use in the i486 Microprocessor Programmer's Reference Manual. The Intel 80486 was the CPU that first introduced the invd instruction. Section 12.2 read:

12.2 OPERATION OF THE INTERNAL CACHE

Software controls the operating mode of the cache. Caching can be enabled (its state following reset initialization), caching can be disabled while valid cache lines exist (a mode in which the cache acts like a fast, internal RAM), or caching can be fully disabled.

Precautions must be followed when disabling the cache. Whenever CD is set to 1, the i486 processor will not read external memory if a copy is still in the cache. Whenever NW is set to 1, the i486 processor will not write to external memory if the data is in the cache. This means stale data can develop in the i486 CPU cache. This stale data will not be written to external memory if NW is later set to 0 or that cache line is later overwritten as a result of a cache miss. In general, the cache should be flushed when disabled.

It is possible to freeze data in the cache by loading it using test registers while CD and NW are set. This is useful to provide guaranteed cache hits for time critical interrupt code and data.

Note that all segments should start on 16 byte boundaries to allow programs to align code/data in cache lines.

Examples of Usage

coreboot has a slide-deck presenting their implementation of CAR, which describes the above procedure. The invd instruction is used on Slide 21.

AMD calls it Cache-as-general-storage in §2.3.3: Using L2 Cache as General Storage During Boot.

Other Uses

In certain situations involving cache-incoherency due to DMA (Direct Memory Access) hardware, invd might also prove useful.

Chordate answered 21/1, 2017 at 4:0 Comment(10)
Not necessarily expecting you to know, but the INVD instruction has been around since the 486. Was cache-as-RAM actually a thing back then, too? I'm fairly sure RAM configuration was still done in hardware by the chipset back then, but I could be wrong. If it was not used for cache-as-RAM purposes on the 486, what might it then have been used for?Baccivorous
@Baccivorous Well I did know it dated back to the 486, as I put that in a comment underneath your question. AFAIK the 386 did not have on-board cache, so it did not need flush/invalidate instructions (that would then have been the responsibility of the mobo). But as a matter of fact, the Cache-as-RAM trick was alluded to since the very beginning, in the 80486's user manual, §12.2 OPERATION OF THE INTERNAL CACHEChordate
@IwillnotexistIdonotexist: Great find and an interesting read.Baccivorous
@Iwillnotexistidonotexist Also, surely you'd have to make sure not to push anything out of L3 when using the CAR? I.e. by saturating a set. Unless cr0.NW prevents thisMatias
I'm also presuming it would have to be done over memory mapped I/O regions such that a line can actually be read, so it is there to be written toMatias
@LewisKelsey Yes, CR0.NW=1 (No Writethrough) prevents data from being written out unless its cacheline is being evicted, but in no-fill mode no new lines can be cached, so no evictions can be triggered. The code setting up CAR is familiar enough with the geometry of the cache that it will know how large a slab of memory can be cached this way without cache set saturation problems.Chordate
@Iwillnotexistidonotexist I've been thinking about this. Do you know if normal cache coherence rules apply in no fill mode? If there is a miss in L1 does it still check the inclusive L3 before going to memory or does it only check L1 and go straight to memory. I understand obviously that 'no fill' refers to the fact that the cache line isn't filled when it does miss the cache.Matias
Another interesting thing to point out is the 'scope' of CD and NW. I read that setting CD on the cr0 of 1 logical core serialises both logical cores but whether it takes effect for both logical cores is another matter and I'm not sure if it does or not. If it doesn't then the L1d cache controller would have to consult the cr0 of both logical cores before making a decisionMatias
@LewisKelsey It's impossible to paste here, but Intel's SDM, Volume 3, §11.5.1, Table 11-5 Cache Operating Modes answers your L1 vs L2/L3 question precisely. §11.5.6 seems to go some way towards addressing under which conditions L1 can be configured to operate cooperatively or competitively between hyperthreads.Chordate
@IwillnotexistIdonotexist coming back to this topic, If you DMA from say a PCIe device to memory, cache coherency is maintained by the IIO and CBos, so I don't think it will be useful there, and you could do a WBINVD just the same to clear the DMA pollution. When would it ever be incoherent? Currently I think only the CAR reason is viable. I'll look for anotherMatias

To elaborate on IwillnotexistIdonotexist's answer about CAR:

I think how it's actually done is

  1. Set up a WB MTRR covering the desired CAR range (not PAT, because paging is disabled at this point; PAT operates in the PMH, while the MTRRs are consulted at the load/store buffer or L1d). The CAR range needs a correct mapping in the SAD, and I think it needs a backing store, so it makes sense to use the address range that is mapped to the SPI flash and that the code is actually executing from; then you only need to read it into the cache. Alternatively, you could have that range UC, read it into the cache, and write it to a WB range mapped somewhere else. There may not need to be a physical device backing the range, but there does need to be a mapping in the SAD, otherwise you might get an MCA, either any time an access hits the L3 or only when it needs to reach the backing store. You do need to bring valid lines from the range into the cache before switching to CAR mode (unless you can perform a no-RFO write to the range so that no lines need to be brought in; rep stos is supposed to use a no-RFO protocol like ItoM, i.e. it sends only an invalidate and not an RFO). So I would think the range needs to be mapped to an actual device, like the SPI flash, because reading from a range that has a SAD mapping but no receiving device at the IIO or the IMC would, I think, raise an MCA. If such reads actually return 0s, or rep stos works, then an unbacked range would be a possible alternative.
  2. You don't need to INVD the cache unless you know something in the cache is taking up unnecessary space, which is typically not the case at this stage of boot, before the memory controller and the RAM memory map have been configured.
  3. Read the data into the cache
  4. CR0.CD = 1. You're probably only disabling caching on the BSP, so this only disables the L1 and L2 (I don't know whether it disables the L2 for just that logical core or for both; I think just that logical core, because I believe CD=1 is a property of the load/store itself within the store buffer, and it doesn't evict anything for that load/store). Evictions to the L3 will never occur, because the cache is not filled in CAR mode, but you might get a hit on something that was pushed to the L3 by a write before CAR was enabled. The L3 can't be disabled, as there is no ia32_misc_enables bit for it, and will function normally; it's possible that the core informs the L3-slice CBo that it is operating in no-fill cache mode, so that the CBo won't fill the L3 on a miss, or alternatively the core might bypass the L3 entirely by issuing a UC request to the CBo. I don't know whether cache coherency is maintained for write hits, or whether coherent requests from other cores are handled; a couple of sources claim this is the case, but it's irrelevant when you know only the BSP is active. If an access hits in a lower cache level, the line isn't brought up higher.
  5. In CAR mode, read hits are served from the cache line and read misses read from memory without filling a line; write hits write to the cache line, and write misses write to memory without filling a line. This way, no evictions (from L1/L2) can ever occur and cause havoc. Using INVD in this state would effectively disable the cache completely, because nothing would hit in the cache afterwards, so INVD isn't used during CAR, only at the end.
  6. When finished, INVD. If the CAR region is backed by SPI flash, its contents will no longer match the SPI flash 100%: you will have been writing, e.g., a stack over some random code, which you don't want written back to the SPI flash accidentally, so you must INVD and not WBINVD (and if the SPI flash rejected the write, you'd probably get an MCA). The fact that the instruction was introduced on the 486, where the first mention of CAR appears, suggests the instruction was introduced for this very purpose, and I don't think there's another use case.
  7. CR0.CD = 0

On Intel, CAR is set up by microcode to run the Startup ACM, so no specific macroinstruction is needed for that. INVD and CAR are, however, used by the ACM itself and by the SEC core before the memory controller has been initialised, which of course requires the INVD macroinstruction. I'll have to check whether the SEC core enables CAR or whether it's already enabled, but I do know that the IBB blocks containing the SEC + PEI are in L3.

An important thing to mention is that when you load code into the cache, you need to make sure it's pushed out of the L1d and into L2, otherwise it won't be accessible by the instruction cache. This can be achieved by loading the code and then loading something larger than the size of the L1d (which is shared, not statically partitioned, between threads, so it needs to be larger than the full size of the L1d). I think this is because the L1d is not coherent with the L1i, although there is something called SMC detection, so I'm not sure to what degree that is true.
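
That eviction trick can be sketched as a simple read loop. Sizes and the SCRATCH_BASE symbol are placeholders; it assumes a 32 KiB L1d, 64-byte lines, and a scratch range distinct from the just-written code:

```nasm
        ; Read a scratch range larger than the whole L1d so the freshly
        ; written code is evicted from L1d into L2, where L1i can fetch it.
        mov     esi, SCRATCH_BASE
        mov     ecx, (64 * 1024) / 64   ; 64 KiB > 32 KiB L1d; one load per line
.evict: mov     eax, [esi]
        add     esi, 64
        loop    .evict
```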

Matias answered 22/4, 2021 at 10:28 Comment(7)
PMH = Page Miss Handler? That's not a well-known acronym, nor are many of the others you use. :/ Making the first use a hyperlink to a definition can cover that.Overline
@PeterCordes I'm pretty sure it all comes up if you put Intel <acronym> into Google and enable verbatim search. 5th result for Intel PMH is a patent about it for instance and there isn't a similar acronym in the architecture. I will add more references. It's slowly started to dawn on me that I don't actually know why you need no fill cache / CAR mode at all... I'm missing something. Eviction only happens when it needs to bring something into the cache and there's no space. Unless it speculatively evicts or something then CAR doesn't make sense, because though it doesn't evict, it will stillMatias
write the write that would have caused the eviction directly to the backing store. I thought it only evicts on demand, as part of PLRUMatias
What I mean is, I don't see why you need CR0.CD=1 to effectively set a region to WB and use cache as RAM. If you never access a WB region outside of the range, you should be fineMatias
Yeah, I've wondered the same thing. But are you sure cache never speculatively syncs (write-back without eviction)? I have some recollection of reading that that can happen. If even write-back would crash the system (because you're reprogramming the DRAM controllers), then something that on paper guarantees it won't happen seems nice, and like something BIOS engineers would want to use.Overline
@PeterCordes the fact it exists suggests that writebacks can occur that you don't expect, so there must be some sort of write-back algorithm that's not just evict on demand, perhaps evict ahead of time. I've yet to read anything on it. It could have something to do with it disabling prefetchingMatias
I doubt evict ahead of time, but certainly write-back (without eviction, to clean a dirty line and transition from Modified to Exclusive) is plausible. That could make use of idle write bandwidth, and make future eviction cheaper / faster.Overline

© 2022 - 2024 — McMap. All rights reserved.