What happens to a Startup IPI sent to an Active AP that is not in a Wait-for-SIPI state

In a previous Stackoverflow answer Margaret Bloom says:

Waking the APs

This is achieved by issuing an INIT-SIPI-SIPI (ISS) sequence to all the APs.

The BSP will send the ISS sequence using the destination shorthand All excluding self, thereby targeting all the APs.

A SIPI (Startup Inter Processor Interrupt) is ignored by all the CPUs that are already awake by the time they receive it, thus the second SIPI is ignored if the first one suffices to wake up the target processors. It is advised by Intel for compatibility reasons.

I've been writing multiprocessing code for years, and my observation of hardware has been that some processors seem to behave differently than stated. I'm pretty sure I've observed Application Processors (APs) having their Instruction Pointer modified upon receipt of a Startup IPI even when they were active (not in a Wait-for-Startup-IPI state).

Is there any Intel documentation that states what an AP will do upon receipt of a Startup IPI when not in a Wait-for-Startup-IPI state, or that documents the behaviour as undefined? I can't seem to find a definitive answer in the Intel Software Developer's Manuals or the supplementary Intel document Minimal Boot Loader for Intel® Architecture.

Generally I write the initialization code to initialize and start an AP by assuming that the AP may get a SIPI and have its Instruction Pointer reset while in an active state (not in a Wait-for-Startup-IPI state).

I'm trying to determine the accuracy of Margaret Bloom's statement that a second Startup IPI will be ignored by an AP that has been previously awoken.

Nineteen answered 30/5, 2019 at 19:10 Comment(8)
The behaviour of IPIs is described in great depth in Volume 3, §10 Advanced Programmable Interrupt Controller. Also, §8.4.4.2 Typical AP Initialization Sequence does not restrict itself to just a Wait-for-IPI state, and §10.8.2 Interrupt Handling with the P6 Family and Pentium Processors states the SIPI is always accepted if the processor was targeted. My understanding is that the interrupt code checks whether it has been executed already and skips itself if it has; The second SIPI isn't ignored by hardware, but by software doing the right thing.Imparity
My understanding has been similar although much of the documentation seems to involve the dispatcher and not the receiving side. I've generally found the AP being targeted and the processor that initiated the startup need to work together to effectively avoid an unwanted second SIPI. Although my meaning of ignore is pretty much that you want to avoid sending a target AP another SIPI if you can detect it has already started. I find the Intel preferred (and documented) INIT-SIPI-SIPI sequence to be problematic given that the INIT-delay-SIPI-delay-SIPI is done blind.Nineteen
Usually I do each AP one at a time (not broadcast). Send an INIT, delay 10ms, send a SIPI, delay a couple of milliseconds. See if the target AP incremented some shared counter. If it did then it is finished, otherwise send a second SIPI and wait a longer period of time (half a second to a second). If it times out and the global counter isn't incremented then I assume it is not available and proceed to the next AP (if there are any left to initialize)Nineteen
Would you consider a test as a proof/disproof? :) Assuming the LAPIC behaviour didn't change since Netburst.Vow
SIPIs are ignored when not in wait-for-SIPI according to my testing on a Whiskey Lake. Bochs also ignores SIPIs on an already-awake machine (and warns about it). I also have a Kaby Lake to test it on.Vow
@MargaretBloom : Unfortunately that doesn't mean other processors don't or that a non Intel processor works the same way. I did a bit of research after you commented and I have noticed that there seems to be another person who has made a similar observation to mine over the years: @ Brendan wrote this on OSDev Another problem is that often the AP CPU will start on the first SIPI, execute some of your code, then get "restarted" by the second Startup IPI, which can cause bugs .Nineteen
I guess at this point I'm curious if there is a definitive answer in the documentation that says unequivocally that an AP on receipt of a SIPI will never restart while in an active state. I might have to see about pulling some old equipment out of mothballs and seeing if I can reproduce the behaviour I believe I had previously seen.Nineteen
@MichaelPetch That's true indeed. If you need it I can share the testing code I used. Let's see if a patent can shed some light on this issue. I found one patent that reads: "STARTUP IPIs [...] can be issued only one time after RESET or after an INIT IPI reception or pin assertion.)". It confirms the behaviour I've seen, but I haven't read it in full and it probably just refers to MP-spec-compliant LAPICs. Unfortunately I need to fix a Linux box right now but I'll get back to this interesting question asap.Vow

I consider my statement correct, up to bugs.

I don't claim that buggy hardware should be ignored, but that its impact must first be evaluated.
I'd like to remind the reader that while I have an opinionated position on the matter, I wanted this answer to be as neutral as possible.
To fulfil this purpose I tried to provide sources for my statements.

While I do trust other users' experiences, I cannot base my belief on memories alone (for they cannot be verified) [1], and I'm looking forward to someone correcting my quoted statement with proof.

I understand this is an unpopular view; I hope it just won't pass as totally wrong.


First of all, as usual with computers, it all boils down to standards. While Intel documents the MP behaviour of their CPUs in the manuals, they went a step further and made a proper MultiProcessor specification.
The importance of this specification is its role in the industry: it is not just a description of how Intel's CPUs work; it is, as far as I know, the only x86 SMP industry reference.
AMD and Cyrix pushed the OpenPIC specification but, quoting Wikipedia:

No x86 motherboard was released with OpenPIC however.[3] After the OpenPIC's failure in the x86 market, AMD licensed the Intel APIC Architecture for its AMD Athlon and later processors.

Appendix B.4 of the MP specification contains the line:

If the target processor is in the halted state immediately after RESET or INIT, a STARTUP IPI causes it to leave that state and start executing. The effect is to set CS:IP to VV00:0000h.

As noted in the comments, I had parsed the "if" as the stronger "iff" (if and only if).

Unfortunately, the quoted sentence, as stated, is only a sufficient condition. So it cannot be used to deduce the behaviour of a SIPI on a running CPU.

However, I personally believe this is a mistake: the intent of the authors of the specification is to use the SIPI to wake up a CPU in the wait-for-SIPI state.

The SIPI was specifically introduced with the advent of integrated APICs, along with a revision of the INIT IPI, to manage the booting of the APs.
The SIPI has no effect on the BSP (which never enters the wait-for-SIPI state according to Intel's manuals) and it's clear that it should have no effect on a running CPU.
The usefulness of the SIPI, besides being non-maskable and not requiring the LAPIC to be enabled, is that it avoids running from the reset vector and the need for the warm boot flag for APs.

It makes no sense, from a design perspective, to let SIPI act on running CPUs. CPUs are always restarted with an INIT IPI as the first IPI.

So, I'm confident in parsing the quoted statement as colloquial English with the tacit agreement that it is also a necessary condition.

I believe this sets the official behaviour of SIPI on a woke-up CPU, namely to ignore them.

Fact 1: There is an industry-standard MP specification followed by all major x86 manufacturers; though ambiguous, its intent is to set the behaviour of SIPIs.

Page 98 of the Pentium Spec Update seems to confirm that, at least for the Pentium (and presumably for later Intel generations, which may include AMD's since they bought a license for the LAPIC from Intel):

If an INIT IPI is then sent to the halted upgrade component, it will be latched and kept pending until a STARTUP IPI is received. From the time the STARTUP IPI is received the CPU will respond to further INIT IPIs but will ignore any STARTUP IPIs. It will not respond to future STARTUP IPIs until a RESET assertion or an INIT assertion (INIT Pin or INIT IPI) happens again.

The 75-, 90-, and 100-MHz Pentium processors, when used as a primary processor, will never respond to a STARTUP IPI at any time. It will ignore the STARTUP IPI with no effects.

To shutdown the processors the operating system should only use the INIT IPI, STARTUP IPIs should never be used once the processors are running.
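
The behaviour described in the quoted passage can be pictured as a two-state machine. The following is only a sketch of that reading, in C; the names `ap_model`, `ap_init` and `ap_sipi` are made up for illustration and do not come from any Intel document, and the start address rule is the VV00:0000h one quoted earlier from the MP specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: an AP is either waiting for a SIPI or running. */
typedef enum { AP_WAIT_FOR_SIPI, AP_RUNNING } ap_state;

typedef struct {
    ap_state state;
    uint32_t start_addr;  /* physical address of the first instruction */
    unsigned starts;      /* how many times the AP began executing */
} ap_model;

/* RESET or INIT (pin or IPI) re-arms the AP so the next SIPI is accepted. */
void ap_init(ap_model *ap) {
    ap->state = AP_WAIT_FOR_SIPI;
}

/* Returns true if the SIPI was accepted, false if it was ignored. */
bool ap_sipi(ap_model *ap, uint8_t vector) {
    if (ap->state != AP_WAIT_FOR_SIPI)
        return false;                          /* already running: ignored */
    ap->start_addr = (uint32_t)vector << 12;   /* CS:IP = VV00:0000h */
    ap->state = AP_RUNNING;
    ap->starts++;
    return true;
}
```

Under this model an INIT-SIPI-SIPI sequence starts the AP exactly once, and only a further INIT makes it receptive to a SIPI again. A CPU that restarts on the second SIPI would be one that violates the first `if`.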


This doesn't settle the question of whether there are CPUs where subsequent SIPIs are not ignored.
While this question is still to be addressed, we have, by now, turned it into the question "Are there buggy CPUs that ... ?".
This is a huge leap forward because we can now see how existing OSes deal with it.

I won't discuss Windows; while I recognise this is a big absence, I'm not in the mood to dig into Windows binaries right now.
I may do it later.

Linux

Linux sends two SIPIs and I don't see any feedback in this loop. The code is in smpboot.c where we clearly see that num_starts is set to 2.
I won't discuss the difference between the LAPIC and the 82489DX APIC, particularly that the latter didn't have SIPI [2].

We can however see how Linux follows Intel's algorithm and is not worried by the second SIPI.
In the loop, executed num_starts times, a SIPI is sent to the target AP.

In the comments it has been pointed out that the trampoline is idempotent and that Linux has a synchronisation mechanism.
That doesn't match my experience: of course Linux synchronises code between CPUs, but that's done later in the boot, after the AP is running.
In fact, after the trampoline, the first C code the AP executes is start_secondary, and it doesn't seem idempotent (set_cpu_online is called later in the body, if that counts).

Finally, if the programmers wanted to prevent a double SIPI they'd put the synchronisation logic as early as possible to avoid dealing with complex situations later.
The trampoline goes as far as dealing with SME and vulnerabilities fixes, why would one want to do that before dealing with the SIPI-SIPI issue?

It makes no sense to me to have such a critical check so late.

FreeBSD
I wanted to include a BSD OS because BSD code is known to be very clean and robust.
I was able to find an (unofficial) GitHub repository with the FreeBSD source and, while I'm less confident with that code, I've found the routine that starts an AP in mp_x86.c.

FreeBSD also uses Intel's algorithm. To my amusement, the source also explains why there is the need for a second SIPI: the P5 processor (the P54C Pentium family?) did ignore the first SIPI due to a bug:

/*
 * next we do a STARTUP IPI: the previous INIT IPI might still be
 * latched, (P5 bug) this 1st STARTUP would then terminate
 * immediately, and the previously started INIT IPI would continue. OR
 * the previous INIT IPI has already run. and this STARTUP IPI will
 * run. OR the previous INIT IPI was ignored. and this STARTUP IPI
 * will run.
 */

I was unable to find the source for this statement; the only clue I have is erratum AP11 of the Pentium Specification Update, which I found referenced in an old Android (i.e. Linux) kernel.
Today Linux seems to have dropped support for those old buggy LAPICs.

Considering the detailed comments, I don't see the need to check the code for idempotency up to a hypothetical check.
The BSD code is clearly written with the commented assumptions in mind.

Fact 2: Two mainstream OSes don't consider SIPI bugs to occur often enough to be worth handling.

While searching the Internet I've found a commit in the gem5 simulator with the title X86: Only recognize the first startup IPI after INIT or reset.
Apparently, they got it wrong at first and then fixed it.


Next step is trying to find some online documentation.
I first searched in Google Patents and while a lot of interesting results pop up (including how the APIC IDs are assigned), regarding SIPIs I only found this text in the patent Method and apparatus for initiating execution of an application processor in a clustered multiprocessor system:

STARTUP IPIs do not cause any change of State in the target processor (except for the change to the instruction pointer), and can be issued only one time after RESET or after an INIT IPI reception or pin assertion.

Wikipedia lists VIA as the only other x86 manufacturer still present.
I tried looking for VIA manuals, but it seems they are not public?

About the past manufacturers, I was unable to find out whether any of them ever produced MP CPUs at all. E.g. the Cyrix 6x86MX didn't have an APIC at all, so it could have been put in an MP system only with an external APIC (which couldn't support SIPIs).

Next step would be to look at all of the AMD and Intel errata and see if there's something about the SIPIs.
However, errata are bugs and so the question turns into a search for a proof of non-existence (i.e. do bugged LAPICs exist?) which is hard to find (simply because bugs are hard to find and there are many micro-architectures).

My understanding is that the first integrated APIC (an LAPIC as known today) shipped with the P54C. I've consulted the errata but found nothing regarding the handling of SIPIs.
However understanding the errata in their full consequences is not trivial.

I've then moved to the Pentium Pro Errata (which is the next uarch, the P6) and found an incorrect handling of the SIPIs though not exactly what we are looking for:

3AP. INIT_IPI After STARTUP_IPI-STARTUP_IPI Sequence May Cause AP to Execute at 0h
PROBLEM: The MP Specification states that to wake up an application processor (AP), the interprocessor interrupt sequence INIT_IPI, STARTUP_IPI, STARTUP_IPI should be sent to that processor. On the Pentium Pro processor, an INIT_IPI, STARTUP_IPI sequence will also work. However, if the INIT_IPI, STARTUP_IPI, STARTUP_IPI sequence is sent to an AP, an internal race condition may occur in the APIC logic which leaves the processor in an incorrect state. Operation will be correct in this state, but if another INIT_IPI is sent to the processor, the processor will not stop execution as expected, and will instead begin execution at linear address 0h. In order for the race condition to cause this incorrect state, the system’s core to bus clock ratio must be 5:2 or greater.

IMPLICATION: If a system is using a core to bus clock ratio of 5:2 or greater, and the sequence INIT_IPI, STARTUP_IPI, STARTUP_IPI is generated on the APIC bus to wake up an AP, and then at some later time another INIT_IPI is sent to the processor, that processor may attempt to execute at linear address 0h, and will execute random opcodes. Some operating systems do generate this sequence when attempting to shut the system down, and in a multiprocessor system, may hang after taking the processors offline. The effect seen will be that the OS may not restart the system if ‘shutdown and restart’ or the equivalent is selected upon exiting the operating system. If an operating system gives the user the capability to take an AP offline using an INIT_IPI (Intel has not identified any operating systems which currently have this capability), this option should not be used.

WORKAROUND: BIOS code should execute a single STARTUP_IPI to wake up an application processor. Operating systems, however, will issue an INIT_IPI, STARTUP_IPI, STARTUP_IPI sequence, as recommended in the MP specification. It is possible that BIOS code may contain a workaround for this erratum in systems with C0 or subsequent steppings of Pentium Pro processor silicon. No workaround is available for the B0 stepping of the Pentium Pro processor.

STATUS: For the steppings affected see the Summary Table of Changes at the beginning of this section.

This 3AP erratum is interesting because:

  1. It confirms that an INIT-SIPI sequence is enough to start up an AP. This was evident from the MP specification and from the FreeBSD code.
  2. It may lead to a behaviour similar to a restart. The bug will make an INIT IPI (after the INIT-SIPI-SIPI sequence) restart the AP at 0h (linear, presumably after the initialisation).
    If the BIOS uses the INIT-SIPI-SIPI sequence to start the APs and later the OS attempts to use that sequence again, the first INIT will start the AP.
    However, this won't lead to a predictable behaviour unless the LAPIC is left in a corrupted state where any SIPI will be accepted.

Funnily enough, in the same errata there is even a bug causing "the opposite behaviour": 8AP. APs Do Not Respond to a STARTUP_IPI After an INIT# or INIT_IPI in Low Power Mode

I've also checked the Pentium II, Pentium II Xeon, Pentium III, Pentium 4 errata and found nothing new about SIPIs.

To my understanding, the first AMD processor capable of SMP was the Athlon MP based on the Palomino uarch.
I've checked the revision guide for the Athlon MP and found nothing, checked the revisions in this list and found nothing.

Unfortunately I have little experience with non-AMD, non-Intel x86 CPUs. I was unable to find which secondary manufacturers included an LAPIC.

Fact 3: Official documentation from non-AMD/Intel manufacturers is hard to find and errata are not easily searchable. No erratum describes a bug related to the acceptance of a SIPI on a running processor, but numerous LAPIC bugs are present, which makes the existence of such bugs plausible.


Final step would be a hardware test.
While this test cannot rule out the presence of other behaviour, at least it is documented (crappy) code.
Documented code is good because it can be used by other researchers to repeat an experiment, it can be scrutinised for bugs and it constitutes a proof.
In short, it is scientific.

I have never seen a CPU where subsequent SIPIs restarted it, but this doesn't matter because a single buggy CPU suffices to confirm the presence of the bug.
I'm too young, too poor and too human to have conducted an extensive, bug-free analysis of all the MP CPUs.
So, instead, I made a test and ran it.

Fact 4: Whiskey Lake, Haswell, Kaby Lake and Ivy Bridge all ignore subsequent SIPIs.
Other people are welcome to test on AMD and older CPUs.
Again, this doesn't constitute a proof, but it's important to frame the state of the matter correctly.
The more data we have, the more accurate our knowledge of the bug gets.

The test consists of bootstrapping the APs and making them increment a counter and enter an infinite loop (either with jmp $ or with hlt, the result is the same).
Meanwhile the BSP will send a SIPI every n seconds, where n is at least 2 (but it may be more due to the very imprecise timing mechanism), and print the counter.

If the counter stays at k-1, where k is the number of logical processors (so k-1 is the number of APs), then the subsequent SIPIs are ignored.
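
To make the two possible outcomes concrete, here is a small sketch in C of the counter value the BSP would print after a number of additional SIPIs. It is a simplified model, not part of the test itself: the function name `expected_counter` is hypothetical, every AP is assumed to start on the first SIPI, and the wrap-around of the one-byte counter is not modelled.

```c
#include <stdbool.h>

/* Hypothetical model of the experiment: every AP increments the shared
 * counter each time it starts executing the boot code. */
unsigned expected_counter(unsigned aps, unsigned extra_sipis,
                          bool restart_on_sipi) {
    unsigned counter = aps;            /* first SIPI: every AP starts once */
    if (restart_on_sipi)
        counter += aps * extra_sipis;  /* each later SIPI restarts every AP */
    return counter;
}
```

With 3 APs the counter stays at 3 no matter how many SIPIs follow if they are ignored, and it keeps growing by 3 per SIPI if they are not.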

There are some technical details to address.

First, the bootloader is legacy (not UEFI) and I didn't want to read another sector, so it had to fit in 512 bytes; for that reason the booting sequence is shared between the BSP and the APs.

Second, some code must be executed only by the BSP but before entering protected mode (e.g. video mode setting), so I used a flag (init) instead of checking the BSP flag in the IA32_APIC_BASE_MSR register (which is done later to make the APs diverge from the BSP).

Third, I took some shortcuts. The SIPI boots the CPU at 8000h, so I put a far jump there to 0000h:7c00h. The timing is done with the port 80h trick; it is very imprecise but should suffice. The GDT pseudo-descriptor is stored in the GDT's null entry. The counter is printed a few lines below the top to avoid being cropped by some monitors.

If the loop is modified to include the INIT IPI, the counter is incremented regularly.
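
The two magic numbers written to the ICR in the listing (000c4500h and 000c4608h) can be decomposed as follows. This is a sketch in C rather than NASM; the macro and function names are made up for illustration, while the bit positions are those of the low dword of the xAPIC Interrupt Command Register.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the low dword of the xAPIC Interrupt Command Register. */
#define ICR_DELIVERY_INIT      (5u << 8)   /* delivery mode 101b */
#define ICR_DELIVERY_STARTUP   (6u << 8)   /* delivery mode 110b */
#define ICR_LEVEL_ASSERT       (1u << 14)
#define ICR_DEST_ALL_BUT_SELF  (3u << 18)  /* destination shorthand 11b */

/* INIT IPI broadcast to every processor except the sender. */
uint32_t icr_init_all_but_self(void) {
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_INIT;
}

/* SIPI broadcast; the vector field is the start address divided by 4 KiB,
 * so the address must be page aligned and below 1 MiB. */
uint32_t icr_sipi_all_but_self(uint32_t start_addr) {
    assert((start_addr & 0xFFFu) == 0 && start_addr < 0x100000u);
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_STARTUP
         | (start_addr >> 12);
}
```

For the trampoline at 8000h the vector is 08h, which is why the SIPI value ends in 08.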

Please note that this code is provided without support.

BITS 16
ORG 7c00h

%define IA32_APIC_BASE_MSR 1bh
%define SVR_REG 0f0h
%define ICRL_REG 0300h
%define ICRH_REG 0310h

xor ax, ax
mov ds, ax
mov ss, ax
xor sp, sp      ;This stack ought be enough

cmp BYTE [init], 0
je _get_pm

;Make the trampoline at 8000h
mov BYTE [8000h], 0eah
mov WORD [8001h], 7c00h
mov WORD [8003h], 0000h

mov ax, 0b800h
mov es, ax
mov ax, 0003h
int 10h
mov WORD [es:0000], 0941h

mov BYTE [init], 0

_get_pm:
;Mask interrupts
mov al, 0ffh
out 21h, al
out 0a1h, al

;THIS PART TO BE TESTED
;
;CAN BE REPLACED WITH A cli, SIPIs ARE NOT MASKEABLE
;THE cli REMOVES THE NEED FOR MASKING THE INTERRUPTS AND
;CAN BE PLACED ANYWHERE BEFORE ENTERING PM (BUT LEAVE xor ax, ax
;AS THE FIRST INSTRUCTION)

;Flush pending ones (See Michael Petch's comments)
sti
mov cx, 15
loop $   

lgdt [GDT]
mov eax, cr0
or al, 1
mov cr0, eax
sti

mov ax, 10h
mov es, ax
mov ds, ax
mov ss, ax
jmp 08h:DWORD __START32__

__START32__: 
 BITS 32

 mov ecx, IA32_APIC_BASE_MSR
 rdmsr
 or ax, (1<<11)          ;ENABLE LAPIC
 mov ecx, IA32_APIC_BASE_MSR
 wrmsr

 mov ebx, eax
 and ebx, 0ffff_f000h    ;APIC BASE

 or DWORD [ebx+SVR_REG], 100h

 test ax, 100h
 jnz __BSP__

__AP__: 
 lock inc BYTE [counter]

 jmp $            ;Don't use HLT just in case

__BSP__:
 xor edx, edx 
 mov DWORD [ebx+ICRH_REG], edx 
 mov DWORD [ebx+ICRL_REG], 000c4500h        ;INIT

 mov ecx, 10_000
.wait1:
 in al, 80h
 dec ecx
jnz .wait1 

.SIPI_loop:
 movzx eax, BYTE [counter]
 mov ecx, 100
 div ecx 
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5], ax

 mov eax, edx 
 xor edx, edx
 mov ecx, 10
 div ecx
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5 + 2], ax

 mov eax, edx
 xor edx, edx
 add ax, 0930h
 mov WORD [0b8000h + 80*2*5 + 4], ax

 xor edx, edx 
 mov DWORD [ebx+ICRH_REG], edx 
 mov DWORD [ebx+ICRL_REG], 000c4608h        ;SIPI at 8000h

 mov ecx, 2_000_000
.wait2:
 in al, 80h
 dec ecx
jnz .wait2

jmp .SIPI_loop


GDT dw 17h
    dd GDT
    dw 0

    dd 0000ffffh, 00cf9a00h
    dd 0000ffffh, 00cf9200h

counter db 0
init db 1

TIMES 510-($-$$) db 0
dw 0aa55h

Conclusions

No definitive conclusion can be drawn; the matter is still open.
The reader has been presented with a list of facts.

The intended behaviour is to ignore subsequent SIPIs; the need for two SIPIs is due to a "P5 bug".
Linux and FreeBSD don't seem to mind buggy SIPI handling.
Other manufacturers seem to provide no documentation on their LAPICs, if they produce any of their own.
Recent Intel hardware ignores subsequent SIPIs.


[1] With due respect to all the people involved and without attacking anyone's credibility. I do believe there are buggy CPUs out there, but there are also buggy software and buggy memories. As I don't trust my own old memories, I think I'm still within the bounds of a respectful conversation when I ask others not to trust their vague ones.

[2] Possibly because MP in those days was done with regular CPUs packed together and asserting their INIT# with an external chip (the APIC) was the only way to start them up (along with setting a warm reset vector). However in those years I was too young to have a computer.

According to my testing, SIPIs are ignored when not in a wait-for-SIPI state. I've tested a Whiskey Lake 8565U; of course a real-hardware test doesn't constitute a proof.
I'm confident that all Intel processors since the Pentium 4 have the same behaviour, but this is just my view.
In this answer I solely want to present the result of a test. Everyone will draw their own conclusions.

Vow answered 1/6, 2019 at 21:1 Comment(16)
The rules of logic don't quite allow you to make that sort of inference from the sentence "If the target processor is in the halted state immediately after RESET or INIT, a STARTUP IPI causes it to leave that state and start executing. The effect is to set CS:IP to VV00:0000h." "If P then Q" permits Q without P, for instance if there exist alternate causes of Q. It does however mean that P without Q is forbidden. So here we learn that a SIPI shall wake up a processor in Wait-for-SIPI state, but we don't learn that SIPI is ignored outside of it.Imparity
I'm going to assume that references to unreal in the code were leftover from a previous incarnation of the code and that it is really pmode now. Bochs choked on this code. I noticed that you do sti after entering protected mode. Although you masked off the interrupts in the PICs it is possible that if interrupts were already disabled (IF=0) pending interrupts still need to be serviced (or cleared)Nineteen
Right after masking you could enable interrupts with STI and then you could add some nops (or a loop instruction that runs a number of times - maybe a loop of 15) to flush any pending interrupts. That way by the time the protected mode code is running there won't be possible IRQs to process in the absence of an IDT.Nineteen
An observation is that this code could avoid going into protected (or unreal mode) altogether by changing the APIC_BASE by writing a new 4KiB aligned address to IA32_APIC_BASE_MSR that is in the first 64KiB of memory. Placing it at 0x1000 might be workable with mov eax, 0x1000 | (1<<11) | (1<<8) and then writing that out to IA32_APIC_BASE_MSR rather than reading the existing one and modifying it. Of course all the code writing to the screen would have to be reworked to use a real mode segment register with 0xb800 in it.Nineteen
I also agree with @IwillnotexistIdonotexist . What is in the quoted documentation only says what happens if the processor is in a halted Wait for SIPI state, but it doesn't actually say that you can't receive one in a non-halted state. The idea one may not want to bother is probably best demonstrated in the fact that Linux and BSD don't concern themselves with the situation.Nineteen
A statement like "if X then Y" tells you nothing about the "if not X" case. For example, "If the target processor is in the halted state then Y" does not and can not be used to imply anything about "If the target processor is NOT in the halted state". For another example "If it's raining my lawn will be wet" doesn't tell you anything about what happens if it's not raining (the lawn might be wet or dry when it's not raining).Salience
For the Linux and FreeBSD code; you can not just look at the AP startup code in isolation. You also have to check the code that a recently started AP executes to determine if there is any kind of guard before anything can be upset by "second IPI causes restart". For example; for Linux, there is nothing in trampoline.S that can't be safely executed twice so you'd have to look further than that.Salience
Ok; Linux does have a check for "is the CPU already running?" here: github.com/torvalds/linux/blob/…Salience
..and there's a "tell AP to proceed with initialization" (similar to my "wait for sending CPU to modify synchronization variable") here: github.com/torvalds/linux/blob/…Salience
For FreeBSD it seems to be a similar story - nothing in the AP CPU startup trampoline matters if it's executed twice; then AP CPUs wait for BSP to release them here: github.com/freebsd/freebsd/blob/master/sys/arm64/arm64/…Salience
@Salience You are right, but then why wouldn't the check be done as early as possible? I mean, that seems pretty risky (other than a waste of resources), for it's easy to break such code in the future. Thanks for the links, have you checked the call chain? I haven't (too much work). The first one seems to be used after the boot, the second one seems just a check after the startup and the third is for ARM?Vow
@IwillnotexistIdonotexist and Brendan: Fair, that's formally true. I parsed the if as both a necessary and a sufficient condition (iif). Maybe that was the intent of the authors or maybe not. Unfortunately that's ambiguous. I'll fix the answer thanks!Vow
@MichaelPetch Yeah, that code is ugly. Not going in PM is actually a better idea for this test! BTW I'd feel more comfortable if I could delete this answer, do you mind un-accepting it? I'm sorry for the inconvenienceVow
@MichaelPetch I edited the question with some additional information (and I hope to have addressed its issues). Do you remember anything about the offending CPUs you tested? Model, year range, anything? I've read a lot of errata but narrowing the search will help in giving more careful reads (and making targeted queries to Google). Thank you!Vow
I stumbled across this question and answer after trying to integrate the macOS Hypervisor.framework's virtual APIC implementation into a VMM, and running into a hang while booting an EDK2/Tianocore UEFI firmware on it. The hang turned out to be the non-BSP vCPUs sitting in an infinite hlt+pause loop, while the BSP vCPU is waiting for them.Impossibly
This is immediately after sending them a SIPI, even though they have previously already been initialised and started up. Clearly the expectation here is that the second SIPI very much does have an effect. EDK2 is pretty widely used as far as I'm aware, and I'm not seeing much in the way of CPU specific quirk behaviour being worked around in this code.Impossibly

Short Answer

  • Some CPUs do restart on the second SIPI
  • I don't know which CPUs restart on the second SIPI because I've been guarding against it for too long
  • I haven't checked, but I don't think Intel's documentation specifies the behavior for the "SIPI received by running CPU" case
  • If Intel's documentation does specify the behavior for Intel CPUs, then that doesn't mean CPUs from other vendors (AMD, VIA, SiS, Cyrix, ...) behave the same as Intel CPUs. Intel's manual is only "guaranteed" (excluding errata/specification updates) to apply to Intel's CPUs.

Longer Answer

When I first started implementing multi-CPU support (over 10 years ago) I followed Intel's startup procedure (from Intel's MultiProcessor Specification, with the time delays between INIT, SIPI and SIPI), and after the AP started it incremented a number_of_CPU_running counter (e.g. with a lock inc).

What I found is that some CPUs do restart when they receive the second SIPI; and on some computers that number_of_CPU_running counter would be incremented twice (e.g. with BSP and 3 AP CPUs, the number_of_CPU_running counter could end up being 7 and not 4).

Ever since I've been using memory synchronization to avoid the problem. Specifically, the sending CPU sets a variable (state = 0) before trying to start the receiving CPU, if/when the receiving CPU starts it changes the variable (state = 1) and waits for the variable to be changed again, and when the sending CPU sees that the variable was changed (by receiving CPU) it changes the variable (state = 2) to allow the receiving CPU to continue.

In addition; to improve performance, during the delay after sending the first SIPI the sending CPU monitors that variable, and if the receiving CPU changes the variable it will cancel the delay and won't send a second SIPI at all. I also significantly increase the last delay, because it only expires if there's a failure (and you do not want to assume the CPU failed to start when it started too late, and end up with a CPU doing who-knows-what as the OS changes the contents of memory, etc. later).

In other words, I mostly ignore Intel's "Application Processor Startup" procedure (e.g. from section B.4 of Intel's MultiProcessor Specification) and my code for the sending CPU does:

    set synchronization variable (state = 0)
    send INIT IPI
    wait 10 milliseconds
    send SIPI IPI
    calculate time-out value ("now + 200 microseconds")
    while time-out hasn't expired {
        if the synchronization variable was changed jump to the "CPU_started" code
    }
    send a second SIPI IPI
    calculate time-out value ("now + 500 milliseconds")
    while time-out hasn't expired {
        if the synchronization variable was changed jump to the "CPU_started" code
    }
    do "CPU failed to start" error handling and return

CPU_started:
    set synchronization variable (state = 2) to let the started CPU know it can continue

My code for the receiving CPU does this:

    get info from trampoline (address of stack this CPU needs to use, etc), because sending CPU may change the info after it knows this CPU started
    set synchronization variable (state = 1)
    while synchronization variable remains unchanged (state == 1) {
        pause (can't continue until sending CPU knows this CPU started)
    }
    initialize the CPU (setup protected mode or long mode, etc) and enter the kernel
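
The receiving side can be sketched the same way. Again this is only a sketch under assumptions: the `trampoline` layout, `ap_check_in` and the `relax` hook are hypothetical names. On real hardware the hook would just be a PAUSE instruction; here it lets a simulated sending CPU overwrite the trampoline and release the AP, which shows why the AP copies its parameters before checking in.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { AP_WAITING = 0, AP_STARTED = 1, AP_RELEASED = 2 };

/* Hypothetical trampoline layout: per-AP parameters plus the
 * synchronization variable shared with the sending CPU. */
struct trampoline {
    atomic_int state;
    uintptr_t  stack_top;   /* stack this AP needs to use */
};

/* Called once per spin iteration (PAUSE on real hardware). */
typedef void (*relax_fn)(struct trampoline *t, void *ctx);

/* Returns the stack address this AP captured before checking in. */
uintptr_t ap_check_in(struct trampoline *t, relax_fn relax, void *ctx)
{
    /* Copy the parameters first; the sending CPU may change them as soon
     * as it knows this CPU has started. */
    uintptr_t my_stack = t->stack_top;

    /* Tell the sending CPU we are running (state = 1). */
    atomic_store_explicit(&t->state, AP_STARTED, memory_order_release);

    /* Cannot continue until the sending CPU acknowledges (state = 2). */
    while (atomic_load_explicit(&t->state, memory_order_acquire) == AP_STARTED)
        relax(t, ctx);

    return my_stack;   /* caller would now set up the CPU and enter the kernel */
}

/* ---- Simulated sending CPU, for demonstration only ---- */
struct fake_sender { int spins_before_release; };

static void fake_sender_relax(struct trampoline *t, void *ctx)
{
    struct fake_sender *s = ctx;
    if (--s->spins_before_release > 0)
        return;
    t->stack_top = 0xDEAD0000u;   /* trampoline reused for the next AP */
    atomic_store_explicit(&t->state, AP_RELEASED, memory_order_release);
}

uintptr_t simulate_ap_check_in(uintptr_t stack_top, int spins)
{
    struct trampoline t = { .state = AP_WAITING, .stack_top = stack_top };
    struct fake_sender s = { spins };
    uintptr_t got = ap_check_in(&t, fake_sender_relax, &s);
    assert(atomic_load(&t.state) == AP_RELEASED);
    return got;
}
```

Even though the simulated sender overwrites `stack_top` before releasing the AP, the value returned is the one the AP read before setting state = 1.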

Note 1: Depending on the surrounding code (e.g. if the synchronization variable is in the trampoline and the OS recycles the trampoline to start other CPUs soon after); the sending CPU might need to wait for the receiving CPU to change the synchronization variable one last time (so that the sending CPU knows that it's safe to recycle/reset the synchronization variable).

Note 2: a CPU "almost always" starts on the first SIPI, and it's reasonable to assume that the second SIPI only exists in case the first SIPI got lost/corrupted and reasonable to assume that the 200 microsecond delay is a conservative worst case. For these reasons, my "cancel the time-out and skip the second SIPI" approach is likely to reduce the pair of 200 microsecond delays by a factor of 4 (e.g. 100 uS instead of 400 uS). The 10 millisecond delay (between INIT IPI and first SIPI) can be amortized (e.g. send INIT to N CPUs, then delay for 10 milliseconds, then do the remaining stuff for each of the N CPUs one at a time); and you can "snowball" the AP CPU startup (e.g. use BSP to start a group of N CPUs, then use 1+N CPUs in parallel to start (1+N)*M CPUs, then use (1+N)*(1+M) CPUs to start (1+N)*(1+M)*L CPUs, etc). In other words, starting 255 CPUs with Intel's method adds up to 2.64 seconds of delays, but with sufficiently advanced code this can be reduced to less than 0.05 seconds.
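
The arithmetic behind those figures can be sketched as follows. The sequential total follows directly from the delays above (254 APs plus the BSP, each AP costing 10 ms + 200 us + 200 us). The snowball model is only an assumption made for illustration: a fixed fan-out per round, one amortized 10 ms INIT delay per round, and roughly 100 us per started AP. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    INIT_DELAY_US = 10000,  /* delay after the INIT IPI */
    SIPI_DELAY_US = 200     /* delay after each SIPI */
};

/* Intel-style sequence, one AP at a time, both SIPI delays always taken. */
uint64_t sequential_delay_us(unsigned ap_count)
{
    return (uint64_t)ap_count * (INIT_DELAY_US + 2 * SIPI_DELAY_US);
}

/* Rounds needed if, in each round, every running CPU starts `fanout`
 * new CPUs (assumed model: the running count grows by 1 + fanout). */
unsigned snowball_rounds(unsigned total_cpus, unsigned fanout)
{
    assert(total_cpus >= 1 && fanout >= 1);
    unsigned rounds = 0;
    uint64_t running = 1;                   /* the BSP */
    while (running < total_cpus) {
        running *= (uint64_t)fanout + 1;
        rounds++;
    }
    return rounds;
}

/* Assumed cost per round: one amortized INIT delay, then each starter
 * brings up its `fanout` APs one at a time at `per_ap_us` each. */
uint64_t snowball_delay_us(unsigned total_cpus, unsigned fanout,
                           unsigned per_ap_us)
{
    uint64_t per_round = INIT_DELAY_US + (uint64_t)fanout * per_ap_us;
    return snowball_rounds(total_cpus, fanout) * per_round;
}
```

With these assumptions, 254 APs started sequentially cost 2,641,600 us of delays (about 2.64 seconds), while a fan-out of 3 reaches 255 CPUs in 4 rounds for 41,200 us, which is consistent with the "less than 0.05 seconds" figure.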

Note 3: The "broadcast INIT-SIPI-SIPI" approach is broken and should never be used by an OS (because it makes detecting "CPU failed to start" hard, because it can start CPUs that are faulty, and because it can start CPUs that were disabled for other reasons - e.g. hyper-threading disabled by the user in the firmware's settings). Sadly, Intel's manual has some example code that describes the "broadcast INIT-SIPI-SIPI" approach that is intended for firmware developers (where the "broadcast INIT-SIPI-SIPI" approach makes sense and is safe), and beginners see this example and (incorrectly) assume that an OS can use this approach.
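
To make the difference concrete, here is a hedged sketch of how the two kinds of IPI are encoded in the xAPIC Interrupt Command Register. The bit positions follow the ICR layout in Intel's manual (delivery mode in bits 8-10, level in bit 14, destination shorthand in bits 18-19, destination APIC ID in bits 24-31 of the high dword); the macro and function names are hypothetical, and x2APIC uses a different destination format that is not covered here.

```c
#include <assert.h>
#include <stdint.h>

#define ICR_DELIVERY_INIT          (5u << 8)    /* delivery mode 101b */
#define ICR_DELIVERY_STARTUP       (6u << 8)    /* delivery mode 110b */
#define ICR_LEVEL_ASSERT           (1u << 14)
#define ICR_SHORTHAND_ALL_EX_SELF  (3u << 18)   /* "all excluding self" */

/* Low dword for an INIT IPI with no destination shorthand. */
uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* Low dword for a SIPI; the vector is the trampoline's 4 KiB page number,
 * so the trampoline must be page aligned and below 1 MiB. */
uint32_t icr_low_sipi(uint32_t trampoline_phys)
{
    assert((trampoline_phys & 0xFFFu) == 0 && trampoline_phys < 0x100000u);
    return ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | (trampoline_phys >> 12);
}

/* High dword selecting one specific CPU by its APIC ID (e.g. from the MADT). */
uint32_t icr_high_dest(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* The broadcast variant that an OS should avoid: same command, but with
 * the "all excluding self" shorthand instead of a specific destination. */
uint32_t icr_low_broadcast(uint32_t icr_low)
{
    return icr_low | ICR_SHORTHAND_ALL_EX_SELF;
}
```

A targeted start writes `icr_high_dest(id)` and then the low dword for each APIC ID the OS has decided to start, so every CPU gets its own time-out and failure handling. The broadcast form (`0x000C4500` for INIT) gives up that control.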

Salience answered 31/5, 2019 at 21:7 Comment(16)
I swear the documentation described what the firmware should do when the processors are brought up for the first time, but not necessarily what OS software should do to bring up the processors from an unknown state. I too don't do the broadcast. At a minimum you parse the MADT to find the processors that shouldn't be ignored (and aren't the BSP) and send an INIT-SIPI-SIPI to them.Nineteen
In a way your method has similarities to mine (mine will also change depending on circumstances if warranted). In one of my comments under the question I said "Usually I do each AP one at a time (not broadcast). Send an INIT, delay 10ms, send SIPI, delay a couple of milliseconds. See if the target AP incremented some shared counter. If it did then it is finished, otherwise send a second SIPI and wait a longer period of time (half a second to a second). If it times out and the global counter isn't incremented then I assume it is not available and proceed to next AP (if there are any left to initialize)"Nineteen
This all gets around the problem that I'm usually trying to work in an environment where I know I can adequately program a timer to 1 kHz and deal with millisecond values and not microseconds.Nineteen
I've upvoted your answer. I'll defer accepting an answer to see if someone has any quote from documentation for any processors that might back up this viewNineteen
@MichaelPetch: Yes - the 200 microsecond time delays are almost always horrible during early boot (before you've calibrated/configured timers, etc); especially if you support a wide range of systems (e.g. ancient CPUs that don't support TSC or HPET, and modern systems that might not support PIT).Salience
I see in your code you do calculate time-out value ("now + 200 microseconds"). Do you find that you have to stick to that 200us? I usually don't, and go with 1-2ms.Nineteen
@MichaelPetch: How closely I stick to that depends on the time source I'm relying on. I've used BIOS "ticks since midnight" (rounding up to the nearest 55 ms) before without any trouble; but lately (partly due to firmware independence and partly due to wanting more precise timestamps for "event log" entries) I've been calibrating better timers long before wanting to start other CPUs. Mostly, longer than 200 microseconds seems fine (and shorter than 200 microseconds would probably be fine too); but you have to pick a number and "rounded up to >= 200 microseconds" seems like a sane choice.Salience
Do you have a source for your claim? The INIT-SIPI-SIPI algorithm is used by Linux and BSD. The 2nd SIPI is used due to a "P5 bug" that latched the INIT IPI, preventing the 1st SIPI from happening. An MP-spec compliant CPU must ignore the second SIPI per the specs. CPUs that restart on the second SIPI are buggy. You should mention that Intel's algorithm is not wrong. The CPUs are.Vow
@MichaelPetch There cannot be a documentation for this behaviour. It's a bug, see my previous comment (particularly the BSD sources).Vow
Yes, the BSD code discusses the bug that forced the concept of the INIT-SIPI-SIPI. I had an AST Premmia SMP system from the mid 90s that exhibited the behaviour (would have been a P5). However since that time I'm pretty sure that there have been processors (and they may have even been AMD processors from a decade ago) where the second SIPI caused a reset. I'm not in disagreement that the second SIPI causing a startup sequence from an active state is likely a bug. I think it is possible this happens to be one of those "there is what is documented and then there is reality" situations.Nineteen
This has all spurred an interesting discussion one way or the other!Nineteen
@MargaretBloom: I can't remember which CPU I encountered the "second SIPI causes restart" problem on - at the time I had a LAN of many test machines (including rare stuff - e.g. Transmeta, NSC, etc) and was also using 3 different virtual machines. Since then I've switched to "minimum requirement is 64-bit" and all the old computers that don't meet my new minimum requirements have been packed away in storage.Salience
@MargaretBloom: a somewhat unfortunate issue with APIC relocation is that it will likely work on hardware, and Bochs supports it, but QEMU/KVM doesn't. Probably not much of an issue since the test is targeting real hardware.Nineteen
@MichaelPetch Oh, didn't know that, thanks! I was still looking for errata about mishandling of SIPI. I only found an interesting section at page 98 of the Pentium Spec Update that states that the Pentium will not respond to future SIPIs until RESET or INIT. But also says that SIPIs should never be used once the processor is running (this is in the context of a shutdown). Somebody must have documented something somewhere! >:/Vow
I didn't know that until this afternoon when I modified your code to use real mode exclusively and had it run fine on real hardware/Bochs/VMware, but QEMU/KVM didn't work. I found that this has been seen before.Nineteen
The KVM source code for kvm_lapic_set_base has this test and warning: if ((value & MSR_IA32_APICBASE_ENABLE) && apic->base_address != APIC_DEFAULT_PHYS_BASE) pr_warn_once("APIC base relocation is unsupported by KVM");Nineteen
