Is it possible for the RESOURCE_STALLS.RS event to occur even when the RS is not completely full?

Asked 5/10, 2018 at 0:15 Answered 27/10, 2019 at 1:5

performance x86 intel cpu-architecture intel-pmu

The description of the RESOURCE_STALLS.RS hardware performance event for Intel Broadwell is the following:

This event counts stall cycles caused by absence of eligible entries in the reservation station (RS). This may result from RS overflow, or from RS deallocation because of the RS array Write Port allocation scheme (each RS entry has two write ports instead of four. As a result, empty entries could not be used, although RS is not really full). This counts cycles that the pipeline backend blocked uop delivery from the front end.

This basically says that there are two situations where the RS stall event occurs:

When all of the eligible entries of the RS are occupied and the allocator is not stalled.
When "RS deallocation" occurs because there are only two write ports, and the allocator is not stalled.

What does "eligible" mean in the first situation? Does this mean that not all entries can be occupied by all kinds of uops? Because my understanding is that in modern microarchitectures any entry can be used by any kind of uop. Also what is RS array Write Port allocation scheme and how does it cause RS stalls even when not all entries are occupied? Does this mean that there were four write ports in Haswell but now there are only two in Broadwell? Do either of these two situations apply to Skylake or Haswell even though the manual does not explicitly say so?

Manhandle answered 5/10, 2018 at 0:15 Comment(7)

In other contexts in the PMU descriptions, Intel has used "eligible" simply to mean "available" as far as I can tell, so I wouldn't read much into that particular word. About the other half, I'm not sure, but I know that it has been observed that in loops where the RS is full, uops tend to dispatch in a pattern like 4-0-4-0... rather than a pattern like 2-2-2-2 (the exact numbers depend on the uop throughput, but the basic idea is that uops seem to like to issue in "full" groups rather than trickling in as soon as a slot is available in the RS). – Wardship 13/10, 2018 at 0:3

If I understand your tests correctly, each iteration starts with a (presumably) empty RS, then execute a carefully calculated number of "payload" instructions to get the to the target RS occupancy. For example, for add, the basic assumption is that every 4 consecutive dependent add instructions increases the RS occupancy by 3, since 4 uops are going in, but 1 is going out (since they execute at 1 per cycle). Is that right? Overall it seems like a reasonable approach, but I think it also relies on a lot of assumptions, so some of the results may be explained by assumptions that don't hold. – Wardship 14/10, 2018 at 3:45

For example, you use nops to let the ROB drain after the sequence of payload instructions used to reach the target occupancy. However, even though nops don't go into the RS, I wouldn't assume that nops hitting the RAT never increment the RESOURCE_STALLS counter: assuming the RAT works by checking if there is space for 4 uops in the RS, and then taking ops to rename from the IDQ, it could be that even the stream of nop instructions will increment the counter at least until 4 spaces are available in the RS, which could inflate your count by 3 in the add case, for example. – Wardship 14/10, 2018 at 3:54

@Wardship Yes your understanding of the tests is correct. Your analysis is worth further investigation. NOPs could indeed have an impact that I assumed does not exist. – Manhandle 14/10, 2018 at 5:15

Even if nop instructions do behave in that way, I don't think it really explains fully any of the anomalous results. Also, I don't know how to "fix" it: if that's the way the RAT works any instruction would have that problem, unless you could perhaps stall the front-end at exactly the right moment (this seems hard). You could also try bracketing your test with a serializing instruction like cpuid rather than the existing approach of carefully calculated numbers of nops and see if the results are consistent. – Wardship 14/10, 2018 at 5:18

@Wardship I agree. I don't like to use cpuid since it has many uops and I don't understand fully how they all get executed and how much they may overlap with neighboring instructions. – Manhandle 14/10, 2018 at 5:29

or any other serializing instruction, but yes I see the difficulty since you are measuring "from the outside" the results of many interactions so it's hard to remove the effect of whatever serializing instructions. You can also consider putting the rdpmc calls directly in the code at the places you want, perhaps in combination with lfence. This is a more powerful, but harder to implement, type of measurement. – Wardship 14/10, 2018 at 18:32

I've written a program that can be used to explore undocumented limitations of the RS in Intel processors in the hope that I'll be able to eventually answer the question. The basic idea is to make sure that the RS is completely empty before allocating and executing a specific sequence of uops in a loop. The RESOURCE_STALLS.RS can be used to determine whether that sequence has hit a limitation in the RS itself. For example, if RESOURCE_STALLS.RS is 1 per iteration, then the allocator had to stall for one cycle to allocate RS entries for all uops in the sequence. If RESOURCE_STALLS.RS is much smaller than 1 per iteration, then it basically did not have to stall and so we know we didn't not hit any of the RS limitations.

I've experimented with a sequence of dependent ADD instructions, a sequence of dependent BSWAP instructions, a sequence of dependent load instructions to the same location, a sequence of backward or forward unconditional jump instructions, and a sequence of store instructions to the same location. The following two graphs show the results for the sequence of add instructions for different target RS occupancies (the maximum number of RS entries that will be simultaneously required and occupied by the sequence of uops). All the values are shown per iteration.

The following graph shows that RESOURCE_STALLS.RS per iteration becomes at least (or anywhere near) 1 cycle per iteration when the RS occupancy is 50. Although it's not clearly visible, RESOURCE_STALLS.RS becomes larger than zero when the RS occupancy exceeds 43, but only exceeds 1 when the RS occupancy exceeds 49. In other words, I'm only able to simultaneously use up to 49 RS entries out of the 60 (in Haswell) without RS stalls. After that, RESOURCE_STALLS.RS increases on average by 1 per additional uop in the sequence, which is consistent with the bursty behavior of the allocator and the fact that each ADD uop can be completed every cycle (each uop occupies an RS entry for 1 cycle only). cycles increases on average by 2.3 per additional uop. It's larger than 1 per additional uop because there are also additional stalls on the ROB for reasons not related to the add uops, but these are OK because they do not impact RESOURCE_STALLS.RS.

The following graph shows the change in cycles and RESOURCE_STALLS.RS per iteration. It illustrates the strong correlation between execution time and RS stalls.

When the target RS occupancy is between 44-49, RESOURCE_STALLS.RS is very small but still not really zero. I have also noticed that the exact order in which different uops are presented to the allocator slightly impacts the RS occupancy that can be reached. I think this is an effect of the RS array write port allocation scheme mentioned in the Intel manual.

So what's up with the other 11 RS entries (Haswell's RS is supposed to have 60 entries)? The RESOURCE_STALLS.ANY performance event is the key to answer the question. I've updated the code I'm using to perform these experiments to test different kinds of loads:

Loads that can be dispatched with speculative addresses to achieve 4 cycle L1D hit latency. This case is referred to as loadspec.
Loads that cannot be dispatched with speculative addresses. These have an L1D hit latency of 5 cycles on Haswell. This case is referred to as loadnonspec.
Loads that can be dispatched with speculative but incorrect addresses. These have an L1D hit latency of 9 cycles on Haswell. This case is referred to as loadspecreplay.

I followed the same approach with the ADD instructions, but this time we need to watch RESOURCE_STALLS.ANY instead of RESOURCE_STALLS.RS (which actually doesn't capture RS stalls due to loads). The following graph shows the change in cycles and RESOURCE_STALLS.ANY per iteration. The first spike indicates the the target RS occupancy has exceeded the available RS entries for that kind of uop. We can clearly see that for the loadspec case, there are exactly 11 RS entries for load uops! When the target RS occupancy exceeds 11, it takes on average 3.75 cycles for an RS entry to become free to the next load uop. This means that uops are deallocated from the RS when they complete, not when they get dispatched. This also explains how uop replay works. The spike for loadspecreplay occurs at RS occupancy 6. The spike for loadnonspec occurs at RS occupancy 9. As you will see later, these 11 entries are not dedicated for loads. Some of the 11 entries used by the loads may be among the 49 entries used by the ADD uops.

I've also developed two test cases for stores: one that hits the limit of the store buffer and the other hits the limit of the RS. The graph above shows the former case. Note that a store needs two entries in the RS so the cases where the target RS occupancy is odd are the same as the previous even RS occupancies (change is zero). The graph shows that there can be up to 44/2 = 22 stores in the RS simultaneously. (The code I used to make the store graph had a bug in it that would make the achieved RS occupancy larger than what it is. After fixing it, the results show that there can be up to 20 stores in the RS simultaneously.) An entry occupied by a store-address or a store-data uop can be freed in one cycle. Intel says that the Haswell's store buffer has 42 entries, but I was not able to use all of these entries simultaneously. I'll probably have to design a different experiment to achieve that.

The jump sequences did not cause any stalls. I think this can be explained as follows: a jump uop frees the RS entry it occupies in one cycle and the allocator does not behave in a bursty way when it allocates jump uops. That is, every cycle one RS entry becomes free and the allocator will just allocate one jump uop without stalling. So we end up never stalling no matter how many jump uops there are. This is in contrast to add uops where the bursty allocator behavior makes it stall until the required number of RS entries become free (4 entries) even though the latency of an add uop is also one cycle. It makes sense that the jumps are allocated as soon as possible so that any mispredictions can be detected as early as possible. So if the allocator saw a jump and there is enough space in the RS for it but not later uops in its 4 uop group, then it would still allocate it. Otherwise, it might have to wait for potentially many cycles which can significantly delay the detection of mispredictions. This can be very costly

Is there an instruction whose uops can occupy all of the 60 entries of the RS simultaneously? Yes, one example is BSWAP. It requires two RS entries for its two uops and I can clearly see using RESOURCE_STALLS.RS that its uops can use all of the 60 entries of the RS simultaneously (assuming that my calculations are correct as to how the RS occupancy grows using the instruction). This proves that indeed there are exactly 60 entries in the RS. But there are constraints as to how they are used that we still don't know much about.

Manhandle answered 10/10, 2018 at 2:46 Comment(3)

This slide may be relevant. The "Skylake reminder" slide implies that there are two separate RSes: one for all the ALU ports, and one for the load/store ports (and that this increases to 4 in ICL: with one for the store data, and one each for the load/store AGUs). This would imply that perhaps load and store RS entries come from a separate pool? I think you did find some competition between them in your test though... – Wardship 13/12, 2018 at 19:44

@Wardship Interesting. Thank you very much. Most likely the answer to the question is "Yes" but I'm still interested in knowing what the limits of the RS are. The Intel documentation only says it has some X entries on some microarch, but it's clear now there are multiple limits that are relevant to different instructions. It would be interesting and useful to know what all of these limits are. I'm currently digging into patents. Some patents do show that there are many structures in the RS of different sizes. – Manhandle 14/12, 2018 at 4:18

Should "may be among the 39 entries used by" say 49 instead? – Wardship 22/8, 2019 at 3:51

Yes, it is possible for RESOURCE_STALLS to indicate a full RS before the RS is completely full.

As the RS becomes full, allocation of new uops into the RS becomes less ideal until at some point it may stall out entirely, even though some entries remain.

Furthermore, not all RS entries are available to all instructions. For example, on Haswell, I observe that only 30-32 of the 60 RS entries are available to loads: these entries may be special in they support uop replay, for example. On Skylake, the situation is different: the entire RS is not available to any type of instruction: rather, the "97 entry" RS is actually made up of a 64-entry RS for ALU ops, and a 33 entry RS for load ops. So the entire 97 entries of RS(es) will rarely be full, unless by some coincidence both fill up at exactly the same moment.

The RESOURCE_STALLS.RS event (umask 0x4) only triggers when a the "ALU" part of the RS is full (or full enough that an op can't allocate). For the load RS (which overlaps with the ALU RS in Haswell but not Skylake), the corresponding event has umask 0x40. You can use it with perf as 'cpu/event=0xa2,umask=0x40,name=resource_stalls_memrs_full/. Although the events are not documented for Skylake, they seem to work fine (although events with umasks 0x10 through 0x80 are very different than documented on Sandy Bridge.

Future Intel chips are likely to have even finer-grained reservation stations.

Wardship answered 27/10, 2019 at 1:5 Comment(2)

How did you know about the 0x40 if it's not documented? – Liquidize 13/11, 2023 at 20:4

@JosephGarvin - reverse engineering. – Wardship 22/11, 2023 at 13:19

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags