I have run into a problem with the Linux NMI watchdog. I want to use it to detect and recover from OS hangs, so I added "nmi_watchdog=1" to grub.cfg. I then checked /proc/interrupts and saw that NMIs were being triggered every second. But after I loaded a module containing a deadlock (it acquires the same spinlock twice), the system hung completely and nothing happened (it never panicked!). It looks like the NMI watchdog did not work.
Then I read Documentation/nmi_watchdog.txt, which says:
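For reference, this is roughly how the per-CPU NMI counts can be checked. It is only a minimal Python sketch, assuming the usual /proc/interrupts layout with an "NMI:" row followed by one count per CPU; the function names (parse_nmi_counts, nmi_deltas, read_nmi_counts) are my own and purely illustrative.

```python
import time


def parse_nmi_counts(text):
    """Return the per-CPU NMI counts found in /proc/interrupts-style text.

    Assumes a row whose first field is "NMI:" followed by one integer per CPU
    and then a free-text description. Returns [] if no such row exists.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for token in fields[1:]:
                if not token.isdigit():
                    break  # reached the trailing description text
                counts.append(int(token))
            return counts
    return []


def nmi_deltas(before, after):
    """Return how many NMIs each CPU received between two samples."""
    return [b - a for a, b in zip(before, after)]


def read_nmi_counts(path="/proc/interrupts"):
    """Read and parse the NMI row from the given file (Linux only)."""
    with open(path) as f:
        return parse_nmi_counts(f.read())


if __name__ == "__main__":
    first = read_nmi_counts()
    time.sleep(5)
    second = read_nmi_counts()
    print("NMIs per CPU over 5 s:", nmi_deltas(first, second))
```

If the deltas stay at zero on a CPU while the machine is running, the watchdog NMI is not being delivered to that CPU.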
Be aware that when using local APIC, the frequency of NMI interrupts it generates, depends on the system load. The local APIC NMI watchdog, lacking a better source, uses the "cycles unhalted" event.
What's the "cycles unhalted" event?
It goes on:
but if your system locks up on anything but the "hlt" processor instruction, the watchdog will trigger very soon as the "cycles unhalted" event will happen every clock tick...If it locks up on "hlt", then you are out of luck -- the event will not happen at all and the watchdog won't trigger.
It seems that the watchdog won't trigger if the processor is executing the "hlt" instruction, so I looked up "hlt" in the "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A", which describes it as follows:
Stops instruction execution and places the processor in a HALT state. An enabled interrupt (including NMI and SMI), a debug exception, the BINIT# signal, the INIT# signal, or the RESET# signal will resume execution.
Then I am lost...
My questions are:
- How does the Linux NMI watchdog work?
- Who triggers the NMI?
My OS is Ubuntu 10.04 LTS with Linux 2.6.32.21, and the CPU is a dual-core Pentium 4 at 3.20 GHz.
I haven't read the whole NMI watchdog source code (no time). If I can't work out how the NMI watchdog works, I plan to use the performance monitoring counter interrupt and inter-processor interrupts (provided by the APIC) to send NMIs myself instead of relying on the NMI watchdog.