Estimating interrupt latency on x86 CPUs

I am looking for information that can help in estimating interrupt latencies on x86 CPUs. A very useful paper can be found at "datasheets.chipdb.org/Intel/x86/386/technote/2153.pdf". But this paper opened a very important question for me: how can the delay caused by waiting for the currently executing instruction to complete be determined? I mean the delay between recognition of the INTR signal and the start of the INTR micro-code. As I remember, the Intel Software Developer's Manual also says something about waiting for the currently executing instruction to complete, but it also says that some instructions can be interrupted while in progress. So the main question is: how can the maximum instruction-completion wait be determined for a particular processor? The estimate is needed in core ticks and memory access operations, not in seconds or microseconds. Cache and TLB misses, and other effects that can lengthen the wait, should be taken into account.

This estimate is needed to investigate the possibility of implementing small critical sections that do not affect the interrupt latency. To achieve this, the critical section must be no longer than the longest uninterruptible instruction of the CPU.
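
For example, the kind of critical section I have in mind looks something like the sketch below (hypothetical names; it assumes the code runs at CPL 0, or with IOPL 3, so that pushf/cli/popf are allowed):

    #include <stdint.h>

    /* Hypothetical two-word record that must be updated atomically with
     * respect to interrupt handlers running on the same core. */
    struct pair { uint64_t lo, hi; };
    static struct pair shared;              /* only touched on this core */

    static inline void update_pair(uint64_t lo, uint64_t hi)
    {
        unsigned long flags;

        /* Save RFLAGS and clear IF: maskable interrupts are now held off. */
        __asm__ volatile("pushf; pop %0; cli" : "=r"(flags) : : "memory");

        /* The critical section itself is just a few stores, so the extra
         * interrupt latency should be comparable to one long instruction. */
        shared.lo = lo;
        shared.hi = hi;

        /* Restore the previous IF state. */
        __asm__ volatile("push %0; popf" : : "r"(flags) : "memory", "cc");
    }

This is essentially what the Linux kernel's local_irq_save()/local_irq_restore() pair does on x86.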

Any kind of help is very welcome. If you know of papers that could be helpful, please share links to them.

Morman answered 31/7, 2011 at 18:12 Comment(2)
Beware that a store buffer full of cache-miss stores can lead to pretty high latency before any stores from an IRQ handler can become visible. Or before its in or out instructions can execute, because they flush the store buffer first. iret is serializing, so typically you can't get back to user-space without draining the store buffer. With say 50 cache-miss stores buffered to lines that other cores are also contending for, that's potentially a lot of cycles of latency for RFO requests to be answered. (BeeOnRope's deleted answer says approximately this.)Merrile
Why do you care about the delay waiting for the completion of (one of) the "current" instructions? The OS may have masked IRQs (CLI) and the CPU might be in system management mode; so you might need to wait for several thousand instructions to complete before the CPU responds to INTR. Also don't forget the HLT and MWAIT instructions (the time needed to bring the CPU out of a wait/sleep state).Rickie

If Agner Fog's optimization manuals (supplemented with the Intel developer manuals) don't have anything, it's unlikely anyone/anything else will (save for some internal Intel/AMD data): http://www.agner.org/optimize/

Riordan answered 31/7, 2011 at 20:42 Comment(0)

In general, there is no guaranteed upper bound on interrupt latency. Consider the following example:

  • Maskable interrupts are disabled by executing the cli instruction, which clears the IF flag.
  • A processor is transitioned to the C1 sleep state by executing the hlt instruction.
  • A maskable interrupt occurs whose affinity specifies that it can only be handled on that processor.

In this case, the processor will not handle the interrupt until a non-maskable interrupt occurs to wake up the processor and the IF flag is set to enable handling maskable interrupts.
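
In code, the problematic sequence is nothing more than the following (a minimal sketch; it has to run at CPL 0):

    /* With IF = 0, HLT is only broken by an NMI, SMI, INIT, or RESET
     * (and similar events); a pending maskable INTR stays pending until
     * IF is set again later. */
    __asm__ volatile("cli; hlt");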

The interrupt latency for any interrupt (including non-maskable interrupts) can be on the order of hundreds of microseconds if all the processors that are supposed to handle the interrupt are in a very deep sleep state. On my Haswell processor, the wakeup latency of the C7 state is 133 microseconds. If this is an issue for you, you can use the Linux kernel parameter intel_idle.max_cstate (in case the intel_idle driver is used, which is the default on Intel processors) or processor.max_cstate (for the acpi_idle driver) to limit the deepest C-state. You can tell the kernel to never put any core to sleep using idle=poll, which may minimize the interrupt latency on an idle core, assuming of course that the frequency is not reduced due to thermal throttling. Using a polling loop also reduces the maximum turbo frequency of all cores, which may reduce the overall performance of the system.
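
For example, on a system that uses the intel_idle driver, adding something like the following to the kernel command line limits how deep the cores may sleep (the exact mapping of the number to a C-state depends on the driver and platform, so treat this as a sketch):

    intel_idle.max_cstate=1

The exit latency the kernel assumes for each C-state can be inspected under /sys/devices/system/cpu/cpu0/cpuidle/state*/latency (values are in microseconds).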

On an active core (in state C0), a hardware interrupt is only accepted when the core is in an interruptible state. This state occurs at instruction boundaries, with the exception of string instructions, which can be interrupted partway through. Intel does not provide an upper bound on the number of instructions that are retired before a pending interrupt is accepted. A reasonable implementation may stop issuing uops into the ROB (at an instruction boundary) and wait until all uops in the ROB retire before beginning the execution of the microcode routine for invoking an interrupt handler. In such an implementation, the interrupt latency depends on the time it takes to retire all of the pending uops. High-latency operations such as loads, complex floating-point arithmetic, and locked instructions can easily push the interrupt latency to hundreds of nanoseconds. However, if one of the pending uops requires a microcode assist for any reason (or for some specific reasons), the processor may choose to flush the instruction and all later instructions instead of invoking the assist. This implementation improves performance and power consumption at the cost of increased interrupt latency.
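
As a rough illustration of the kind of pending work that can hold up interrupt acceptance in such an implementation, consider a lock-prefixed read-modify-write to a contended cache line (hypothetical snippet; shared_counter is assumed to be hammered by other cores as well):

    #include <stdatomic.h>

    _Atomic long shared_counter;    /* assumption: contended across cores */

    void hot_path(void)
    {
        /* Typically compiled to "lock xadd" on x86. The instruction cannot
         * complete until the line is owned exclusively (an RFO round trip),
         * so a pending interrupt may have to wait that long before the next
         * instruction boundary is reached. */
        atomic_fetch_add(&shared_counter, 1);
    }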

In another implementation, tuned for minimizing interrupt latency, all in-flight instructions are immediately flushed without retiring anything. However, all of those flushed instructions already went through the pipeline (and some of them might have already completed), yet they have to be fetched and go through the pipeline again when the interrupt handler returns. This results in reduced performance and increased power consumption.

Hardware interrupts drain the store buffer and the write-combining buffers on Intel and AMD x86 processors. See: Interrupting an assembly instruction while it is operating.

A paper from Intel titled Reducing Interrupt Latency Through the Use of Message Signaled Interrupts discusses a methodology to measure the latency of an interrupt from a PCIe device. This paper uses the term "interrupt latency" to mean the same thing as "interrupt response time" from the paper you mentioned. You need to somehow take a timestamp at the time the interrupt reaches the processor and then another timestamp at the very beginning of the interrupt handler. An approximation of the interrupt latency can be calculated by subtracting the two. The problem is of course getting the first timestamp (also in a way that is comparable to the second timestamp). The Intel paper proposes to use a PCIe analyzer, which consists of a PCIe device and an application that records all PCIe traffic with timestamps between the device and the CPU. They use a device driver to write to an MMIO location mapped to the device from the interrupt handler to create the second timestamp.
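
If a PCIe analyzer is not available, a much cruder, software-only approximation (a sketch of the well-known TSC-gap technique, not the method from the Intel paper) is to spin reading the TSC on an otherwise idle core and record unusually large gaps between consecutive reads. Each large gap covers the entry, body, and exit of whatever interrupt (or SMI) preempted the loop, so it is only an upper-bound proxy for the latency itself:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() */

    int main(void)
    {
        uint64_t prev = __rdtsc();
        uint64_t worst = 0;

        /* Spin for a while; anything that preempts this loop shows up as
         * an unusually large delta between two consecutive TSC reads. */
        for (long i = 0; i < 200000000L; i++) {
            uint64_t now = __rdtsc();
            if (now - prev > worst)
                worst = now - prev;
            prev = now;
        }

        printf("largest gap observed: %llu TSC ticks\n",
               (unsigned long long)worst);
        return 0;
    }

Pinning the loop to one core (for example with taskset) makes the numbers easier to interpret.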

Cornelia answered 21/8, 2019 at 1:13 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Offenbach
