Conditional jump instructions in MSROM procedures?

This relates to this question

Thinking about it though, on a modern Intel CPU the SEC phase is implemented in microcode, meaning there would be a check whereby a burned-in key is used to verify the signature on the PEI ACM. If the signature doesn't match, the routine needs to do one thing; if it does match, it needs to do something else. Given that this is implemented as an MSROM procedure, there must be a way of branching, even though MSROM instructions do not have RIPs.

Usually, when a branch mispredicts as being taken, then when the instruction retires the ROB checks the exception code and either adds the instruction length to the RIP of the ROB line or just uses the next ROB entry's IP, which results in the front end being re-steered to that address, along with branch-prediction updates. With the BOB, this functionality has been handed over to the jump execution units. Obviously this can't happen with an MSROM routine, as the front end has nothing to do with it.

My thought would be that there is a specific jump instruction that only MSROM routines can issue, which jumps to a different location in the MSROM, and it could be configured such that MSROM branch instructions are always predicted not-taken. When the branch execution unit encounters this instruction and the branch is taken, it produces an exception code, perhaps with the special jump destination concatenated to it, and an exception occurs on retirement. Alternatively, the execution unit could take care of it and use the BOB, but I'm under the impression that the BOB is indexed by branch-instruction RIP; there's also the fact that exceptions that invoke MSROM code are usually handled at retirement. A branch misprediction doesn't require the MSROM, I don't think; rather, all the actions are performed internally.

Gribble answered 23/4, 2019 at 14:4 Comment(1)
What is "MSROM"? How do you know the SEC phase is implemented in microcode?Cadet

Microcode branches are apparently special.

Intel's P6 and SnB families do not support dynamic prediction for microcode branches, according to Andy Glew's description of original P6 (What setup does REP do?). Given the similar performance of SnB-family rep-string instructions, I assume this PPro fact applies to even the most recent Skylake / CoffeeLake CPUs1.

But there is a penalty for microcode branch misprediction, so they are statically(?) predicted. (This is why rep movsb startup cost goes in increments of 5 cycles for low/medium/high counts in ECX, and aligned vs. misaligned.)
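One way to see those steps is a count sweep. Below is a minimal sketch (Linux x86-64, NASM) of that kind of timing loop; the buffers, alignment, repeat count and the particular ECX values are arbitrary illustration choices, not any specific documented test. Time the whole program (e.g. with perf stat) and divide core cycles by the repeat count.

    ; rep movsb startup-cost sweep sketch.
    ; Build: nasm -felf64 repsweep.asm && ld -o repsweep repsweep.o
    default rel
    global _start

    section .bss
    align 64
    src:    resb 4096
    dst:    resb 4096

    section .text
    _start:
        mov     ebp, 10000000          ; repeat count; divide total core cycles by this
    .loop:
        lea     rsi, [src]             ; reset the registers destroyed by rep movsb
        lea     rdi, [dst]
        mov     ecx, 15                ; sweep this: e.g. 15, 31, 76, 95, 96, 4096, and misaligned variants
        rep movsb
        dec     ebp
        jnz     .loop

        xor     edi, edi
        mov     eax, 60                ; exit(0)
        syscall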


A microcoded instruction takes a full line to itself in the uop cache. When it reaches the front of the IDQ, it takes over the issue/rename stage until it's done issuing microcode uops. (See also How are microcodes executed during an instruction cycle? for more detail, and some evidence from perf event descriptions like idq.dsb_uops that show the IDQ can be accepting new uops from the uop cache while the issue/rename stage is reading from the microcode-sequencer.)

For rep-string instructions, I think each iteration of the loop has to actually issue through the front-end, not just loop inside the back-end and reuse those uops. So this involves feedback from the OoO back-end to find out when the instruction is finished executing.

I don't know the details of what happens when issue/rename switches over to reading uops from the MS-ROM instead of the IDQ.

Even though each uop doesn't have its own RIP (being part of a single microcoded instruction), I'd guess that the branch mispredict detection mechanism works similarly to normal branches.

rep movs setup times on some CPUs seem to go in steps of 5 cycles depending on which case it is (small vs. large, alignment, etc). If these steps come from microcode branch mispredicts, that would appear to mean the mispredict penalty is a fixed number of cycles, unless that's just a special case of rep movs. Maybe that's because the OoO back-end can keep up with the front-end, and reading from the MS-ROM shortens the front-end path even more than reading from the uop cache, making the miss penalty that low.

It would be interesting to run some experiments into how much OoO exec is possible around rep movsb, e.g. with two chains of dependent imul instructions, to see if it (partially) serializes them like lfence. We hope not, but to achieve ILP the later imul uops would have to issue without waiting for the back-end to drain.

I did some experiments here on Skylake (i7-6700k). Preliminary result: copy sizes of 95 bytes and less are cheap and hidden by the latency of the IMUL chains; they basically fully overlap. Copy sizes of 96 bytes or more drain the RS, serializing the two IMUL chains. It doesn't matter whether it's rep movsb with RCX=95 vs. 96 or rep movsd with RCX=23 vs. 24. See the discussion in comments for some more summary of my findings; if I find time I'll post more details.

The "drains the RS" behaviour was measured with the rs_events.empty_end:u even becoming 1 per rep movsb instead of ~0.003. other_assists.any:u was zero, so it's not an "assist", or at least not counted as one.

Perhaps whatever uop is involved only detects a mispredict when reaching retirement, if microcode branches don't support fast recovery via the BoB? The 96 byte threshold is probably the cutoff for some alternate strategy. RCX=0 also drains the RS, presumably because it's also a special case.

Would be interesting to test with rep scas (which doesn't have fast-strings support, and is just slow and dumb microcode.)

Intel's 1994 Fast Strings patent describes the implementation in P6. It doesn't have an IDQ (so it makes sense that modern CPUs that do have buffers between stages and a uop cache will have some changes), but the mechanism they describe for avoiding branches is neat and maybe still used for modern ERMSB: the first n copy iterations are predicated uops for the back-end, so they can be issued unconditionally. There's also a uop that causes the back-end to send its ECX value to the microcode sequencer, which uses that to feed in exactly the right number of extra copy iterations after that. Just the copy uops (and maybe updates of ESI, EDI, and ECX, or maybe only doing that on an interrupt or exception), not microcode-branch uops.

This switch from an initial batch of n uops to feeding in more after reading RCX could be the 96-byte threshold I was seeing; it came with an extra idq.ms_switches:u per rep movsb (up from 4 to 5).

https://eprint.iacr.org/2016/086.pdf suggests that microcode can trigger an assist in some cases, which might be the modern mechanism for larger copy sizes and would explain draining the RS (and apparently ROB), because it only triggers when the uop is committed (retired), so it's like a branch without fast-recovery.

"The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler."

The difference between this and the P6 patent is that this assist-request can happen after some non-microcode uops from later instructions have already been issued, in anticipation of the microcoded instruction being complete with only the first batch of uops. Or if it's not the last uop in a batch from microcode, it could be used like a branch for picking a different strategy.

But that's why it has to flush the ROB.

My impression of the P6 patent is that the feedback to the MS happens before issuing uops from later instructions, in time for more MS uops to be issued if needed. If I'm wrong, then maybe it's already the same mechanism still described in the 2016 paper.


"Usually, when a branch mispredicts as being taken, then when the instruction retires, ..."

Intel since Nehalem has had "fast recovery", starting recovery when a mispredicted branch executes, not waiting for it to reach retirement like an exception.

This is the point of having a Branch-Order-Buffer on top of the usual ROB retirement state that lets you roll back when any other type of unexpected event becomes non-speculative. (What exactly happens when a skylake CPU mispredicts a branch?)


Footnote 1: IceLake is supposed to have the "fast short rep" feature, which might be a different mechanism for handling rep strings, rather than a change to microcode. e.g. maybe a HW state machine like Andy mentions he wished he'd designed in the first place.

I don't have any info on performance characteristics, but once we know something we might be able to make some guesses about the new implementation.

Tullus answered 23/4, 2019 at 23:17 Comment(20)
Thanks, I've been busy with exams lately, i'll get round to your answer in a bit. I think the exam and curriculum culture just outdid their MBR/MDR stuff though: imgur.com/a/AgUBjitGribble
@LewisKelsey: IDK, seems fine to me. FIFO and LRU are simple enough algorithms that it's easy enough to understand and implement in software; it doesn't seem unreasonable to ask about them. Virtual memory page-replacement decisions are always done in software. If it was a question about a fully-associative CPU cache then there'd be some room to roll our eyes. The problem with MBR/MDR stuff is that it goes really far down the road into the details of one way of building a CPU which isn't how modern CPUs are built. That question is just about replacement algorithms.Tullus
I'm familiar with the LRU algorithm that windows implements when pruning the working set where each PTE has a corresponding structure storing its age etc which gets incremented if the accessed bit is not set on each traversal. This took me by surprise as it was seemingly a random string of numbers not knowing whether they corresponded to ages of PTEs or what. It turns out it was a Frame No. being read from, which does make sense now but anyway, I thought that ages are reset when a page is added to the working set of a process, in this example, the age remains persistent.Gribble
So, you're saying that REP MOVSB has a startup penalty because the decision process is subject to branch mispredictions. So how does this all work then? Does the complex decoder produce a single MSROM uop which is the one that takes up the full line in the uop cache and when it gets to the allocate stage, it takes over and it has a direct path to the MSROM and issues instructions from a certain location in the MSROM and then it makes static predictions on the branch instructions by continuing to fetch from the same location in the MSROM or a different location?Gribble
@LewisKelsey: Yes, my understanding is that the complex decoder produces a special uop, and it takes over the issue/rename stage when it reaches the front of the IDQ. I don't know the details; presumably microcode branches are a different uop than regular x86 branches, but probably they're hard-coded as either predicted-taken or predicted-not-taken. We already know that Haswell port 0 can run predicted-not-taken x86 branches, but only port6 can run predicted-taken x86 branch uops. (Maybe the port-assignment just depends on the prediction associated, rather than decoding to a different uop.)Tullus
I had already come to the assumption that predicted-not-taken and predicted-taken branches decode to different uops that cause them to be issued to different ports. Anyway, I assume the microcode for REP MOVSB can check the ECX value so that it doesn't issue too many, meaning there isn't a misprediction at the end? If there is a misprediction, perhaps there is a bit in the ROB that indicates it is a uop from the MSROM, so control is given to the allocator rather than flushing the whole pipeline. This bit could be passed to the branch execution unit. I suppose the BOB could still be used.Gribble
@LewisKelsey: There are a lot of unanswered questions surrounding how microcoded loops take over the pipeline. I've thought about maybe running some tests to see how well out-of-order execution around them actually works, e.g. using a low-count or even rcx=0 vs. high count rep movsb (optimized) or rep scasb (dumb) in place of lfence in Understanding the impact of lfence on a loop with two long dependency chains. But I haven't gotten around to trying it. It would shed some light on whether it can give up the front-end with work still in flightTullus
@PeterCordes What's the need for the special uop? I think the IDQ is fed through a mux whose input is selected by the decode (or even pre-decode) stage. The MSROM (one of the inputs) will then produce a stream of uops resulting from the execution of a "micro-program". For the MSROM I don't see the need to produce a jump unless it wants to resteer the program flow (i.e. in case of interrupts/exceptions). The only jumps needed are the "micro-jumps" executed by the MSROM (which probably is just a simple Algorithmic State Machine, so, conceptually, not much more than a plain ROM).Theran
@PeterCordes Aww, no, ignore my last comment. rep has ecx as input.Theran
@MargaretBloom: Yup, and a few other instructions (like div on some uarches) have a data-dependent number of uops. I think the only sane explanation is expanding at the issue end of the IDQ. Otherwise the IDQ would be empty when a microcoded loop finished, and you don't want that; you want a buffer for fetch/decode to get farther and start on icache/iTLB misses... But either way, a "special uop" has to represent the instruction after decode in the uop cache. The fact that they take a whole line to themselves is suspicious, though, if they just expand at the far end of the IDQ.Tullus
@PeterCordes This patent is interesting: The abstract says that an instruction is issued to compute the number of iterations and when it retires a uop assist is used to issue the uops for the number of iterations. This link has a section on ucode that links a lot of interesting patents and has evidence that uop sequences are triggered at retirement. It'd be possible that rep movsd does nothing but triggering a uop assist when it retires ...Theran
@PeterCordes ... (so that the inputs are known) and then the MSROM just work as I said.Theran
@MargaretBloom: Interesting; it's not that simple, though. That paper says string instruction "can handle small arrays in hardware, and issue microcode assists for larger arrays". I tried putting rep movsb or movsd in place of lfence between times 40 imul eax,eax and edx chains (with the addresses and count reset every iteration by mov), and there's a big jump in time (slowdown: 191c/i to 289c/i) from size<96 bytes to size>=96 bytes, whether it's with movsd rcx=24 or movsb rcx=96. and a jump in idq.ms_switches:u from 8 per iter to 10.Tullus
@MargaretBloom: oops, those numbers were with an lfence at the top of the loop, to isolate each rep movs / times T imul / rep movs / times T imul iteration. Without that, the difference between 95 and 96 bytes is even more dramatic (a factor of 2 in cycles), and rs_events.empty_end:u goes from 2 per iteration (presumably rep movs somehow drains the RS every time it has to run) to very small, like 0.003 per iter on average. But other_assists.any:u was exactly 0, so it's not literally an assist mechanism of the same form as FP assists.Tullus
@MargaretBloom: I updated How are microcodes executed during an instruction cycle? with a lot more details. I think some microcoded uops result in draining the RS (maybe because microcode branch misses can't be detected until retirement?). With that, my explanation fits everything. The description of the perf event idq.ms_cycles and idq.ms_uops support it: [Uops delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy]. That sounds to me like taking over the issue/rename while the front-end feeds in uops as normal.Tullus
@MargaretBloom: Thanks for the patent link! That conditional-execution of the first n iterations makes a ton of sense, and so does the mechanism for avoiding a branch on every ucode loop "iteration". But it doesn't say the grab-ECX uop has to retire, just execute. But that was from 1994, Andy Glew et al.'s original fast-string implementation for P6. P6 didn't have an IDQ, or buffers between most of the major front-end pieces. And it didn't have a uop cache. It's highly likely that the Microcode Sequencer is still only a cycle or 2 away from issue/rename, i.e. at the end of the IDQ.Tullus
@PeterCordes Very interesting! The SGX paper linked in my previous comment mentions the RS draining (if I got it right): "The execution units can issue an assist or signal a fault by associating an event code with the result of a micro-op. When the micro-op is committed (§ 2.10), the event code causes the out-of-order scheduler to squash all the micro-ops that are in-flight in the ROB. The event code is forwarded to the microcode sequencer, which reads the micro-ops in the corresponding event handler"Theran
@MargaretBloom: oh yes, thanks. That could be the mechanism that rep movs microcode uses after all. Updated this answer. Maybe there aren't really microcode branches at all?Tullus
@PeterCordes If there are, I don't understand them. The MicroSequencer is an "execution machine" on its own, it doesn't need the OoO core to execute, so a ucode branch will never appear in the OoO. However special uops can send a feedback to the MS from the retirement unit, including changing the ucode IP. This allows the uprogram to branch based on the architectural state. I guess the model is for the MS to keep going on until it is eventually resteered. That may be why u-code branches are "predicted" not taken and why, on P6, rep scasb had a penalty for terminating earlier.Theran
@MargaretBloom: Good summary, agreed. The whole idea of ucode branches having a hard-coded prediction for the back-end (or existing at all in SnB-family) is just a hypothesis that I haven't tested via experiment. So far I've only looked at rep movs which apparently flushes the RS (and ROB?) on miss, so the assist mechanism could explain it. We think they exist on P6 from Andy Glew's phrasing. But that fast-strings microcode on P6 doesn't use them, from the patent. Unless strategy-selection is separate and branchy.Tullus

Intel has patented some very assembly-like functionality for microcode, which includes:

Execution from L1, L2 or L3(!!!!!!!!!!!!!!!!!!!!!!!). Heck, they patented loading a "big" microcode update from mass storage into L3 and then updating from there... -- note that "patented" and "implemented" are distinct, I have no idea if they have currently implemented anything else than execution from L1.

Opcode and Ucode(!) sections in the MCU package (unified microprocessor update) -- the thing we call a "microcode update" but which really can have all sorts of stuff inside, including PMU firmware updates, MCROM patches, uncore parameter changes, PWC firmware, etc., that get executed before/after the processor firmware/ucode update procedure.

Subroutine-like behavior including parameters on the Ucode. Conditional branching, or at least conditional loops, they've had for quite a while.

Compression and uncompression of the microcode (unknown if it can be "run" from compressed state directly, but the patent seems to imply it would at least be used to optimize the MCU package).

And WRMSR/RDMSR really are more like an RPC into Ucode than anything else nowadays, which I suppose got really helpful when they found they needed a new MSR, or needed a complex change to an architectural MSR's behavior (like the LAPIC base register, which had to be gatekept to work around the LAPIC memory-sinkhole SMM security hole that made the news a few years ago).
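To make that "RPC" flavour concrete, here is a hedged sketch of what such a call looks like at the ISA level, using the architectural Spectre-mitigation MSR IA32_PRED_CMD (0x49) as the example. This is ring-0-only code (it will #GP from user mode) and support has to be enumerated via CPUID first, so treat it purely as an illustration of the WRMSR convention, not a drop-in routine:

    ; WRMSR "calling convention": ECX = MSR index, EDX:EAX = 64-bit value.
    ; Writing IA32_PRED_CMD doesn't store a value you can read back; it asks the
    ; microcode to do something (here an IBPB: flush indirect-branch-predictor
    ; state), which is why it feels more like a call into ucode than a store.
        mov     ecx, 0x49        ; IA32_PRED_CMD
        mov     eax, 1           ; bit 0 = IBPB command
        xor     edx, edx         ; high half of the value
        wrmsr                    ; traps into microcode; what it does, and how long it takes, is up to the ucode

From user space, the msr-tools wrappers (rdmsr / wrmsr) over the kernel's /dev/cpu/*/msr interface execute the same privileged instruction for you.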

So, just look at it as a hardware-accelerated, Turing-complete RISC machine that implements the "public" instruction architecture.

Cupriferous answered 24/4, 2019 at 11:33 Comment(7)
Yes, the reason they used wrmsr as the mechanism for Spectre mitigation is that a microcode update can add a whole new MSR whose "handler" actually flushes the branch-prediction caches; that was possible via a ucode update alone. But adding a whole new instruction would require modifying the decoders and couldn't be done with just a firmware update for existing CPUs.Tullus
I'm not sure RPC is the best description, a better analogy is a "system call" or hypervisor call to modify the state of the machine that's running your instructions. But yeah, WRMSR is a hook for running arbitrary microcode to modify the real uop-executing machinery.Tullus
However, this question isn't (I think) asking about microcode update mechanisms at all. It's just asking how the MS-ROM works. When you say "execution from L3", what do you even mean? Clearly microcode is totally inside the execution core, not stored in unified caches, except during a microcode update. We know how execution of micro-coded instructions works: the IDQ entry for it reaches the front of the IDQ, and then takes over the issue-rename stage to read from the MS-ROM instead of the IDQ. Cache isn't involved. Not even the uop-cache (DSB) directly. See my answer.Tullus
(There's at least a partial answer to the question in here, but I think it's confusing and/or going off on a tangent. That would be ok if you introduced it as such.)Tullus
@PeterCordes thanks, good point about wrmsr; I was wondering how on earth a microcode update could mitigate something like Spectre. I only really understood the suggestion of retpolines, or otherwise having to modify the underlying microarchitecture completely, like using PCID in the IBTBGribble
@PeterCordes Actually, what does STIBP actually do? Does it just disable indirect branch prediction on the other logical core while the bit is set on the current logical core? As for IBPB, I guess that flushes the IBTB, right? And IBRS I think just disables the RSB / IBTB, which means that the functionality must have already existed for a uop to trigger one of these events..?Gribble
@LewisKelsey: I don't know. I haven't totally kept up on details of Spectre mitigations. Apparently STIBP has big performance impacts, zdnet.com/article/… so IDK if it maybe partitions the BTB so each logical core has half the branch prediction? But it shows massive overheads in Phoronix microbenchmarks like glibc ffs / ffsll function calls. If that's all mispredictions of a call [ffs@gotpcrel] indirect branch with a constant target, maybe it's disabling some part of normal branch prediction?Tullus

What I know now is that the branches are statically predicted by the MSROM, and it uses that prediction in its next-IP logic to choose the next microcode line. These predictions are probably already furnished in the uops stored in the MSROM.

For smaller and more frequent MSROM routines, the complex decoder can emit 1–4 uops before passing control to the MSROM to complete the decoding. Otherwise, it passes control to the MSROM with a delay.

In the preferred embodiment, some of the more frequently-used macroinstructions are decoded by the XLAT PLAs 510-516 into one, two, three, or four of the first Cuops in the micro-operation sequence, which provides high performance at the cost of additional minterms in the XLAT PLAs 510-516. Alternately, for some less frequently-used macroinstructions, the four XLAT PLAs 510-516 issue no Cuops, but simply allow the MS unit 534 to issue all Cuops. This second alternative has a disadvantage of lower performance (i.e., a loss of at least one clock cycle), but can save minterms (entries) in the XLAT PLAs 510-516, which is a design trade-off that reduces die space at the expense of lower performance. This trade-off can be useful for less frequently used instructions or for long microcode flows where the significance of one additional clock is lessened.

The opcodes from the macroinstruction 502 are supplied to the entry point PLA 530 that decodes the opcodes to generate an entry point address into microcode ROM. The generated entry point address is supplied to the MS unit 534 that, responsive to the entry point, generates a series of Cuops. The MS unit 534 includes a microcode ROM ("UROM") that includes microcode routines to supply UROM Cuops for long instructions flows, which in some examples may require over a hundred UROM Cuops. The UROM also includes assist handling routines and other microcode.

The rest is answered here: https://mcmap.net/q/14272/-what-branch-misprediction-does-the-branch-target-buffer-detect

Gribble answered 1/2, 2021 at 8:18 Comment(22)
That sounds consistent with my observations for OoO exec (chain of imul) happening around a short-enough rep movs, but then not happening at all above a certain threshold. Mispredicting the ucode branch that predicted a size <= some_constant leads to draining the back-end and needing to issue more uops.Tullus
I read somewhere that MS uops can be sort of predicated, so the initial burst of uops from rep movs can be enough loads/stores for any size up to a limit, with the later ones executing as NOPs if it turns out that RCX <= their cutoff. This avoids the need for tight feedback between the microcode sequencer and back-end register values for small sizes. (There must also be some actual branching to check size and overlap and maybe alignment, though, not just pure predication.)Tullus
@PeterCordes the big thing I'm trying to work out is whether some uops stall the decoder (or allocator) or whether none do. Also your theory that the uop takes over the allocate stage. Everything I'm reading suggests that the MSROM emits uops in line, and they are 'packed' with other 'fast path' uops from the regular decoders. Patent 5,983,337 AMD mentions the uops from the decoder being issued at the same time as the final line of the MROM if that line only contains 1 or 2 uops. (It also talks about MROM updates, exceptions during MROM procedures, marking MROM instructions etc,)Gribble
And what I mean is, I'm not sure about your theory that a special uop is emitted by the MSROM. I thought that IDQ.MS_DSB_UOPS counts the MS uops from the MS that were initiated by the DSB, as opposed to the MITE complex decoder, not uops being delivered by the DSB to the IDQ while the MS is busy with the allocator. The LSD can include uops from the MSROM, and we see certain counters, i.e. idq.ms_uops, which talk about the MS delivering uops into the IDQ, so we know the MSROM uops are delivered to the queue, not directly to the allocator.Gribble
When the special uop reaches the allocator, there are likely to be instructions before it in the IDQ, meaning 'delivering MS uops to the IDQ' would deliver them after these instructions, breaking program order. It means that the IDQ would have to be flushed upon reception of this special uop and then basically the whole pipeline would need to be flushedGribble
My tests were on i7-6700k Skylake (in a loop running from DSB except when it had to switch to MSROM for rep movs, or maybe I was testing with rep stos, I forget. The threshold was something like 80 bytes IIRC). My SKL had updated microcode so LSD disabled. I used something like times 20 imul ecx, ecx as a dep chain to overlap (or not) with rep * throughput.Tullus
Read the description carefully for idq.ms_uops - "Uops delivered to IDQ while Microcode Sequenser (MS) is busy". My understanding of this is that it counts uops added to the tail of the IDQ while the issue/alloc stage is reading from MS instead of the head of the IDQ. i.e. how much IDQ bubble-filling happened while running microcoded instructions.Tullus
If the LSD can include microcoded instructions, it's likely (if my mental model is anywhere near right) that those are just the indirect pointers that trigger the MS to take over the front-end when they reach the head of the IDQ. For microcoded instructions that run a variable number of uops, it would suck for that to happen with the whole IDQ between the microcode sequencer and the back-end, meaning the IDQ would have to be empty when it was done, or later uops would have to constantly get discarded or something. But actually you know what's going to run next (unless it was a loop insn)Tullus
It does have 2 interpretations. But for ms_uops it says 'uops delivered to the IDQ while the MS is busy, uops may be initiated by the DSB or MITE' and in another Intel manual it says for ms_uops 'uops coming from the microsequencer'. That's what led me to believe that 'busy' meant that it was issuing uops, and therefore the number issued to the IDQ while it is busy is the number of uops it issues. As for rep movs, I think you had 96 bytes; Agner's instruction tables say 9 uops somewhere. I'm guessing 8 of them must be movs instructions and the first one is a special rep uop that...Gribble
... if the value in rcx is less than 8 bytes then it prevents the appropriate number of stores after it from retiring and removes them, otherwise it goes to the MSROM exception address (which could be set by a uop at the start?)Gribble
It works for odd sizes I think; I forget if I tested sizes like 93 bytes with rep movsb, but if so it couldn't just be the final uop cancelling or not some of the earlier uops. It might even be something like masked loads / stores, like "copy up to 16 bytes depending on RCX", then same but depending on RCX-16? IDK. Testing some misaligned and odd-size cases might shed some light.Tullus
Hmm, supporting your idea of the IDQ is that idq.ms_switches says "[Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer]". If MS only activated when a special uop reached the head of the IDQ, DSB and MITE uops would already be unified into just IDQ uops. So this does hint that maybe the tail of the IDQ can get MS uops. Maybe some instructions are different, like maybe some fixed-length microcoded insns (like div) might just work as a source of uops for the IDQ, like DSB or MITE.Tullus
I forgot that MOVSB is 5 uops; uops.info/html-instr/MOVSB.html shows 4 of those uops in the MSROM get sent to the complex decoder to decode. This is also shown for Pentium 4 in Agner's instruction tables, where the MSROM uops always send 4 to the decoder for some unknown reason, and it decodes them there instead of its own decoder... and the decoding is something to do with 'auops' and 'cuops', I don't know the details. According to the instruction tables, SnB has 2n small n best case, 3/16B large n worst case.Gribble
If this is true then it can't be just issuing a bunch of MOVS instructions, because it would be 10 uops for 16B (2 8-byte MOVS). I don't know what 'n' is specificallyGribble
The MSROM doesn't send uops to the decoder, that doesn't make sense. The decoder input is x86 machine code, not uops. uops.info seems to be reporting counts for MITE and MS uops, however the counters report things.Tullus
In Agner's tables, n is the repeat count in E/RCX. The worst case is 2 uops per count, the best case is 3 uops per 16 bytes (not counts) for SnB. IDK if you're mixing up rows because you reversed best/worst case. For P4, not sure where you're seeing "auops", but that might be additional uops, like Agner's "microcode" column? Remember P4 has a trace cache instead of L1i, so some entries in it indirect to microcode, but you can measure trace-cache space consumption separately. Taking the space of 4 normal uops sounds similar to how microcoded uops take up a full line in SnB's uop cache, IIRC.Tullus
There's some evidence that the MSROM uses the complex decoder: abinstein.blogspot.com/2007/05/… patents.google.com/patent/US5721855A/en (search 'MS'). Here are some promising patents I haven't read yet regarding the matter: patents.google.com/patent/US5630083 patents.google.com/patent/US5537629A/en patents.google.com/patent/US5566298 patents.google.com/patent/US5559974A/en uspto.report/patent/grant/5918031 patents.google.com/patent/US5581717A/en patents.google.com/patent/US6041403Gribble
@PeterCordes Performance Counter stats for benchmark.exe: IDQ.MS_UOPS: 3114; IDQ.MS_DSB_UOPS: 0; IDQ.MS_MITE_UOPS: 644; HW_INTERRUPTS.RECEIVED: 1. Again: IDQ.MS_UOPS: 144; IDQ.MS_DSB_UOPS: 0; IDQ.MS_MITE_UOPS: 39; HW_INTERRUPTS.RECEIVED: 0. So you might be right, unless some of those 144 uops were part of an assist / exception handlerGribble
At the same time I still think the flat out 0 MS_DSB_UOPS means that there were no msrom triggering instructions in the DSB (suggesting that my suggestion might be correct that it is showing the number of msrom uops triggered by the dsb), which makes a lot of sense given that this is a loop of macrofused dec jnz which is being issued from the uop cache one per cycle (because the BPU can only supply one cache line per cycle), and the only msrom instructions are going to be exceptions / interrupts / assists which are triggered at retire or ones issued from MITE as part of an interrupt handlerGribble
IDK what machine those test results are from, and I didn't re-read the whole previous comment thread. I think there was debate over whether counters are counting uops from the MS-ROM, or counting uops from other sources added to the IDQ while alloc/rename is reading from the MS-ROM instead. The latter matches the perf list description for idq.ms_dsb_cycles, but perf list on SKL doesn't show a ms_dsb_uops event. A 0 count might just mean that there was no MSROM triggering at all, or at least when there was, it was interrupts, so IDQ filling from DSB was halted.Tullus
@PeterCordes The point is, the MS used to stall the simple decoders until it had finished issuing uops so that the uops remained in program order. Therefore I thought that this counter was probably the number of MS uops that were triggered by the decoder. Upon testing the following on KBL: rdpmc / shl rdx, 20h / or rax, rdx / mov [r14+rbx*8], rax / call rsi (rsi is just a ret) / xor ecx, ecx / rdpmc, I get (testing a single counter at a time): INST_RETIRED.ANY: 7; IDQ.MS_UOPS: 39; IDQ.MS_MITE_UOPS: 23; IDQ.MITE_UOPS: 23; IDQ.DSB_UOPS: 0.Gribble
It shows you are correct that the decoders don't stall, and continue to issue uops to the IDQ. There are 23 MITE uops and 39 MSROM uops (38 I think are from the 2nd rdpmc). All 23 uops issue to the IDQ while the first rdpmc is being issued from the MSROM. The question remains how these uops remain in program order in the IDQ, It could be that you are correct about it taking over the allocation stage and bypassing the IDQ. INST_RETIRED.ANY is clearly counting the first but not the last rdmpc.Gribble
