Are write-combining buffers used for normal writes to WB memory regions on Intel?

Write-combining buffers have been a feature of Intel CPUs going back to at least the Pentium 4 and probably earlier. The basic idea is that these cache-line-sized buffers collect writes to the same cache line so they can be handled as a unit. One implication for software performance is that if you don't write the full cache line, you may experience reduced performance.
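As an aside, the classic way to see the partial-line penalty is with non-temporal stores, which use the write-combining buffers even for WB memory. Here is a minimal C sketch of my own (buffer size arbitrary; you'd time the two functions separately, e.g. under perf): both issue the same number of 16-byte streaming stores, but the first leaves three quarters of every line unwritten, so each WC buffer is evicted partially filled.

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence
#include <stdlib.h>      // aligned_alloc, free

#define BYTES (1u << 26)             // 64 MiB of data stored by each function

// Non-temporal 16-byte stores that only touch the first quarter of each
// 64-byte line: every write-combining buffer is evicted partially filled.
static void partial_line_stores(char *buf) {
    __m128i v = _mm_set1_epi32(1);
    for (size_t line = 0; line < BYTES / 16; line++)
        _mm_stream_si128((__m128i *)(buf + line * 64), v);
    _mm_sfence();
}

// The same number of non-temporal stores, but packed so that every 64-byte
// line is written completely before moving on to the next one.
static void full_line_stores(char *buf) {
    __m128i v = _mm_set1_epi32(1);
    for (size_t off = 0; off < BYTES; off += 16)
        _mm_stream_si128((__m128i *)(buf + off), v);
    _mm_sfence();
}

int main(void) {
    // partial_line_stores spreads its stores over 4x the footprint of
    // full_line_stores, so allocate enough for the larger pattern.
    char *buf = aligned_alloc(64, (size_t)BYTES * 4);
    if (!buf) return 1;
    partial_line_stores(buf);        // time each call separately
    full_line_stores(buf);
    free(buf);
    return 0;
}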

For example, in the Intel 64 and IA-32 Architectures Optimization Reference Manual, section "3.6.10 Write Combining" starts with the following description (emphasis added):

Write combining (WC) improves performance in two ways:

• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.

• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.

There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.

Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory ...

My question is whether write combining applies to WB memory regions (that's the "normal" memory you are using 99.99% of the time in user programs), when using normal stores (that's anything other than non-temporal stores, i.e., the stores you are using 99.99% of the time).
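By "normal stores" I mean ordinary compiler-generated stores to WB memory, something like this trivial C sketch (the compiler typically emits plain store instructions such as mov or vmovdqu here, not movnt* non-temporal stores):

#include <stddef.h>

// Plain stores to ordinary heap/stack memory, which the OS maps as WB.
void fill(long *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        p[i] = (long)i;
}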

The quoted manual text is hard to interpret exactly, and it seems not to have been updated since the Core Duo era. You have the part that says write combining "applies to WC memory but not UC", but of course that leaves out all the other types, like WB. Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

So are write combining buffers used on modern Intel chips for normal stores to WB memory?

Digastric answered 22/11, 2018 at 17:9 Comment(16)
IIRC, I think I read somewhere that cache-miss stores (to WB memory) can commit into the LFB that's waiting for the data for that line to arrive. Or waiting for the RFO. But I might be mis-remembering, because I'm not sure that would let the core snoop those stores efficiently for store-forwarding.Regressive
@PeterCordes that might also complicate memory ordering: since normal stores have to be strongly ordered, if stores to different lines get combined into different in-flight buffers, that puts strong restrictions on the order in which the respective lines can be invalidated/made visible later. Perhaps other ordering concerns already imply this, I'm not sure.Digastric
Hadi's answer on Where is the Write-Combining Buffer located? x86 claims that after gaining Exclusive ownership of a cache line, cache-miss stores can commit into a LFB while waiting for the old copy to actually arrive from DRAM. That was from April 2018, so maybe that's what I was thinking of. Anyway, that might be plausible, but that would still require loads to snoop LFBs, if the data actually left the store buffer. aka memory-order buffer.Regressive
What I was actually looking for was evidence for the store buffer coalescing consecutive writes to the same cache line, saving cache write port bandwidth. This Q&A came up for google on x86 store buffer write coalescing. Ok, I found a comment on Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake where you linked to Dr. Bandwidth's post: software.intel.com/en-us/forums/…Regressive
@Peter - I find it unlikely that things work exactly as Hadi's answer describes it, at least for normal stores to WB regions. For example, I don't think the stores are staged in the LFBs, but rather in the store buffer, until they commit to L1. The LFBs are on the other side of the L1 and I don't think they are snooped by loads that otherwise hit in L1. I think any coalescing that happens in the LFBs and allows a store buffer entry to be freed is very problematic for store ordering on x86, the inter-store ordering is lost.Digastric
Saying that the line has to be held in an exclusive state for this to work doesn't make a lot of sense to me: the E or M state will generally be obtained as part of the response from the outer levels of the cache, essentially at the same time the data itself arrives. So I don't see a scenario where you store miss on a line but somehow have the line in E or M quickly, and then wait a while for data. I am not sure if Hadi is talking about WB regions in any or most of his answer. WC-protocol stuff obviously works differently.Digastric
Yeah, E or M state normally implies actually having a valid copy of the data, and I agree that normally you don't know you have E until the RFO response arrives with the data. I don't think that part of Hadi's answer sounds right, either. Store coalescing only in the store buffer is far more likely. Alpha 21264 definitely did that (Hadi linked ftp.openwatcom.org/devel/docs/21264ev6_hrm.pdf in a comment), so I think we can treat it as a long-established computer-architecture technique to reduce cache-write bottlenecks.Regressive
I think requiring M state does solve the mem-ordering problem that you pointed out in reply to my first comment from November. That's exactly equivalent to committing into an M line in L1d, just that you can't respond to read requests until the data arrives and you merge it. So it's plausible, but I don't think Intel's actual designs work that way.Regressive
They do have to detect loads hitting a pending movnt store, but that flushes instead of reading the LFB. But that could be detected as part of allocating a new line / setting up the request after finding it not present in L1d. movntdqa loads from WC memory do read from an LFB, so load ports are connected to LFBs somehow. (And normal loads may get their data straight from an LFB for early restart? Or do they replay the load uop itself, not just dependent uops, maybe to redo the TLB check?) So LFB snooping is plausible, but I think the main sticking point is having M without data.Regressive
Hmm, so both those examples I gave (movntdqa from WC, and loads that hit NT stores) would miss in L1d, and the special handling could happen only after that. Committing to an LFB would make the load path for store-forwarding involve an L1d miss and then reading from the LFB, but that seems unlikely unless there's some known hump in store forwarding that if the read happens too late, there's some time window where it's worse than forwarding from the store buffer or reading from L1d. (But it's hard to measure dispatch -> ready latency if dispatch isn't bottlenecked by dependencies.)Regressive
@Peter - yes M state "solves it" but doesn't make sense because as we agree the line will never be in M state while you go to the outer caches for the data. Even if that would work somehow any strategy that involves not responding to read requests involving more than one line is prone to deadlock.Digastric
@Peter - yes, the LFBs are definitely probed in all sorts of scenarios, but as above I don't think they are probed in the critical L1-hit case. Once you miss the L1 they are definitely probed, not least to merge requests to the same line. So NT stores can be implemented by first kicking the line out of the L1 (and I think they are) - so the LFBs will naturally be probed on subsequent loads.Digastric
Yup, no LFB probe until L1 miss sounds like a likely design, and is incompatible with this idea. I don't see a deadlock possibility, though. We can't enter M state until after we know for sure the line is definitely coming, so we will definitely be able to respond eventually. It's maybe plausible for L3 tag check, or a snoop filter in a multi-socket system, to be sure that no other core has the line, and maybe send an ok-to-write signal to the requesting core before the data. But that's more message traffic and only helps for writes that can't merge in the store buffer. (alternating lines)Regressive
If one core wants to read while the other wants to write, if L3 / memory controllers see the write RFO first, the writing core probably goes first. (And has to wait for the data to arrive at the writing core before that core will answer a request to share). So we're in the same boat as if it went into M state earlier but still couldn't respond. Unless the mem controller / L3 (wherever the arbitration HW is) could decide that the later-arriving read actually happens first in the global order, and send the line to the reader first when it comes in from DRAM, before answering the RFO.Regressive
@Peter - I'm not sure "early restart" applies for modern designs that transfer an entire cache line in a single cycle (eg between L2 and L1) - but if you can explain how it might I'm interested because I've heard it mentioned repeatedly but can't get my head around it. That said, there definitely seems to be some kind of "arriving line bypass" where the (first) load that triggers an L1 miss can receive its data off the bypass network for an L2 hit - without accessing the L1 again. That's not necessarily exactly "directly" from the LFB - but it's close: as it bypasses L1.Digastric
Oh right, I got the terminology wrong. I was thinking "early restart" meant using the value directly without going through an L1d write / read. But it's actually closely related to critical-word-first and means not waiting for the whole line to arrive. Yeah, it doesn't make sense with a 64B path between L2 and L1.Regressive

Yes, the write combining and coalescing properties of the LFBs support all memory types except the UC type. You can observe their impact experimentally using the following program. It takes two parameters as input:

  • STORE_COUNT: the number of 8-byte stores to perform sequentially.
  • INCREMENT: the stride between consecutive stores.

There are 4 different values of INCREMENT that are particularly interesting:

  • 64: All stores are performed on unique cache lines. Write combining and coalescing will not take effect.
  • 0: All stores are to the same cache line and the same location within that line. Write coalescing takes effect in this case.
  • 8: Every 8 consecutive stores are to the same cache line, but different locations within that line. Write combining takes effect in this case.
  • 4: The target locations of consecutive stores overlap within the same cache line. Some stores might cross two cache lines (depending on STORE_COUNT). Both write combining and coalescing will take effect.

There is another parameter, ITERATIONS, which is used to repeat the same experiment many times to make reliable measurements. You can keep it at 1000.

%define ITERATIONS 1000
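; STORE_COUNT and INCREMENT are assumed to be passed in at assembly time,
; e.g.: nasm -f elf64 -D STORE_COUNT=20 -D INCREMENT=64 <file>.asm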

BITS 64
DEFAULT REL

section .bss
align 64
bufsrc:     resb STORE_COUNT*64

section .text
global _start
_start:  
    mov ecx, ITERATIONS

.loop:
; Flush all the cache lines to make sure that it takes a substantial amount of time to fetch them.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.flush:
    clflush [rsi]
    sfence
    lfence
    add rsi, 64
    sub edx, 1
    jnz .flush

; This is the main loop where the stores are issued sequentially.
    lea rsi, [bufsrc]
    mov edx, STORE_COUNT
.inner:
    mov [rsi], rdx
    sfence ; Prevents potential combining in the store buffer.
    add rsi, INCREMENT
    sub edx, 1
    jnz .inner

; Spend some time doing nothing so that all the LFBs become free for the next iteration.
    mov edx, 100000
.wait:
    lfence
    sub edx, 1
    jnz .wait

    sub ecx, 1
    jnz .loop

; Exit.    
    xor edi,edi
    mov eax,231
    syscall

I recommend the following setup:

  • Disable all hardware prefetchers using sudo wrmsr -a 0x1A4 0xf. This ensures that they will not interfere (or have minimal interference) with the experiments.
  • Set the CPU frequency to the maximum. This increases the probability that the main loop will be fully executed before the first cache line reaches the L1 and causes an LFB to be freed.
  • Disable hyperthreading because the LFBs are shared (at least since Sandy Bridge, but not on all microarchitectures).

The L1D_PEND_MISS.FB_FULL performance counter enables us to capture the effect of write combining on the availability of LFBs. It is supported on Intel Core and later. It is described as follows:

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

First run the code without the inner loop and make sure that L1D_PEND_MISS.FB_FULL is zero, which means that the flush loop has no impact on the event count.

The following figure plots STORE_COUNT against total L1D_PEND_MISS.FB_FULL divided by ITERATIONS.

[Figure: L1D_PEND_MISS.FB_FULL / ITERATIONS versus STORE_COUNT, one curve per INCREMENT value]

We can observe the following:

  • It's clear that there are exactly 10 LFBs.
  • When write combining or coalescing is possible, L1D_PEND_MISS.FB_FULL is zero for any number of stores.
  • When the stride is 64 bytes, L1D_PEND_MISS.FB_FULL is larger than zero when the number of stores is larger than 10.

Later you have that "[WC is] particularly important for writes to uncached memory", seemingly contradicting the "doesn't apply to UC" part.

Both WC and UC are classified as uncacheable. So you can put the two statements together to deduce that write combining is particularly important for writes to WC memory.

See also: Where is the Write-Combining Buffer located? x86.

Poplin answered 22/11, 2018 at 21:35 Comment(17)
Interesting tests. However, I don't think the results support the conclusion. Why would INCREMENT 0, 4 and 8 all also have an "elbow" at exactly 10? You say It appears that write combining or coalescing cannot be performed without some penalty. An LFB seems to be reserved for every issued store until it is determined that it can be merged within an already allocated LFB - but this seems like an unlikely mechanism: allocating an LFB, realizing the mistake, then deallocating it and coalescing the store? Seems prone to races. Let's say that was the mechanism, however...Digastric
... in that case why would they all show different behavior at 10? One would expect this to resolve itself before filling all the buffers. I guess it might have to do with your sfence: perhaps the sfence forces all the stores to get their own LFB. BTW, it's a shame that the l1d_pend_miss.pending and l1d_pend_miss.pending_cycles events don't count LFBs allocated for stores (or that there are no similar events for stores).Digastric
Note that these measurements are taken over the outer loop. Then I'm dividing by ITERATIONS. So I'm not sure whether the elbow at 10 is due to the flush loop, the inner loop, or both. Is there an easy way to measure over only the inner loop so we can know for sure?Poplin
I think the graph can be explained by an observation you already made: This means that LFBs are becoming available much earlier when write combining or coalescing is possible. You are begging the question there: I think you are right that the indication is that more lines become available sooner for the lower increments, but can't this simply be explained by it taking less time to return 1 line from memory (the 0, 4 increment cases) or 2 lines (the 8 case) than 10 lines (the 64 case)? You don't necessarily need to invoke coalescing.Digastric
I don't think it's "easy" but I have "one shot" mode in uarch-bench to do this: the idea is that you do a rdpmc after/before the region of interest (and subtract out events caused by the rdpmc machinery itself), so you can get more-or-less exact counts for small code segments.Digastric
@Digastric It'd be great if you can repeat the experiments using uarch-bench to count only over the inner loop, before we discuss the graph. And I'd like to learn how to do that.Poplin
@Digastric Or I can just remove the inner loop and repeat the experiments; counting over the outer loop.Poplin
@Digastric OK I've done that. If I subtract the L1D_PEND_MISS.FB_FULL due to the flush loop from the count due to both loops, then, when stride is 0, I get zero FB full cycle count for all values of STORE_COUNT. For other strides, the lines look more flat but not zero. But still much smaller than the case with stride 64.Poplin
For stride 4, L1D_PEND_MISS.FB_FULL till STORE_COUNT=15 is zero. For stride 8, the count till STORE_COUNT=11 is zero.Poplin
@Digastric The flush loop was the culprit. I've changed the code to eliminate its effect on the event count. Now I think the graph looks good.Poplin
Now the graph looks like I would expect. Isn't this just telling us that storing to 10+ cache lines (the increment 64 case) in rapid succession exceeds the 10 LFBs, whereas storing to 1 or 2 (the other cases) doesn't? I'm actually starting to worry my question is not well-formed. I expected that a given LFB would absorb all later read or store requests to the same line, and I think that's what your graph shows. Does that make it "write combining" in the sense of the Intel manual, though? Perhaps I didn't do a good job distinguishing the two.Digastric
@Digastric Not sure I understand the difference between showing that an LFB can absorb multiple writes to the same line vs. a write-combining LFB for the WB memory type. Yes, the graph shows exactly that.Poplin
@Digastric The manual clearly says that since Nehalem, all of the 10 LFBs can do write-combining. Your question was whether this is supported for the WB type, not just WC. My answer shows that even WB lines can be combined in the LFBs.Poplin
I misunderstood this test. I think it is doing the right thing. Basically it shows there is combining going on, or else we'd expect the smaller stride tests to show the same spike. That is, stores that miss in the L1, don't sit at the head of the store buffer, rather they are allocated a fill buffer, so the store buffer can keep draining. It also shows that later stores that hit the same fill buffers can drain into them rather than blocking. The only thing that could maybe be added is a check of resource_stalls.sb to check that the SB is doing what we think.Digastric
I.e., that after some additional number of stores we start to get resource_stalls.sb stalls in the stride 64 case, since the store buffer is filling up once lines can't drain into combining buffers any more, but in the other cases the SB never fills up because there is unlimited combining going on (you may have to throttle the stores to 1 per 2 cycles since otherwise the store port limit may hit you).Digastric
@Digastric I don't think that my test does the right thing actually. Furthermore, I'm leaning towards a "No" answer now. There are always resource_stalls.sb stalls because of SFENCE. I think the first thing we should do is to determine how SFENCE works, i.e., whether it blocks allocation when it sees the first store or whether it is handled by the store buffer. I think this is very important to correctly interpret the graph in my answer. I've responded to your comment on my blog post on SFENCE.Poplin
Also I think my test cannot be used to prove that there are 10 LFBs; that would be an invalid conclusion. But already knowing that there are 10 LFBs can be very useful to interpret the results.Poplin
