Intel x86 - Interrupt Service Routine responsibility

Asked 6/12, 2019 at 17:48 Answered 7/12, 2019 at 3:41

Solved assembly x86 intel interrupt osdev

I don't have a problem in the strict sense; rather, I'd like to clarify a conceptual question. Suppose we have a microkernel (PC Intel x86; 32-bit Protected Mode) with a working Interrupt Descriptor Table (IDT) and an Interrupt Service Routine (ISR) for each CPU exception. The ISR is called successfully, say in case of a Division by Zero exception.

global isr0
extern isr_handler

isr0:

    cli
    push 0x00   ; Dummy error code
    push 0x00   ; Interrupt number (vector 0, #DE)

    jmp isr_exc_handler

isr_exc_handler:

; Save the current processor state

    pusha

    mov ax, ds
    push eax

    mov ax, 0x10 ; Load kernel data segment descriptor
    mov ds, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

    ; Push current stack pointer

    mov eax, esp
    push eax

    call isr_handler ; Additional C/C++ handler function

    pop eax     ; Remove pushed stack pointer

    pop ebx     ; Restore original data segment descriptor
    mov ds, bx
    mov es, bx
    mov fs, bx
    mov gs, bx

    popa

    add esp, 0x08 ; Clean up pushed error code and ISR number
    sti

    iret

The problem is that the interrupt is raised again and again. As a result, the ISR is called again and again. By trial and error I found out that the line that provokes the exception, int x = 5 / 0, is executed in a loop, so the Instruction Pointer (EIP) is not incremented.

When I increment IP's value pushed to stack manually, the expected behavior occurs. The CPU then executes the next instruction after the offending line of code, after the ISR has been called once, of course.

To my actual question: Is it necessary that the ISR increments the IP? Or is this the responsibility of the "CPU/Hardware"? What's the correct behavior to move on?

Biogeochemistry answered 6/12, 2019 at 17:48 Comment(1)

Since it’s an exception the previous EIP points to the faulting instruction and if you want to continue you’ll have to change it yourself. The hardware won’t do anything about it. Usually the code will be terminated so there’s no reason to alter it and it’s more useful to have a pointer to the actual place where the error occurred. – Syllepsis 6/12, 2019 at 17:55

You're responsible for knowing how and why the processor will call your interrupt service routines and for writing the code for your ISRs accordingly. You're trying to treat an exception generated by a division by zero error as if it were generated by a hardware interrupt. However, this is not how Intel x86 processors handle these kinds of exceptions.

How x86 processors handle interrupts and exceptions

There are several different kinds of events that will result in the processor invoking an interrupt service routine given in the interrupt descriptor table. Collectively these are called interrupts and exceptions, and there are three different ways the processor can handle an interrupt or exception: as a fault, as a trap, or as an abort. Your divide instruction generates a Divide Error (#DE) exception, which is handled as a fault. Hardware and software interrupts are handled as traps, while other kinds of exceptions are handled in one of these three ways, depending on the source of the exception.
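As a rough illustration of that classification, the sketch below maps a handful of protected-mode vectors to the way the processor delivers them, and notes which ones come with a CPU-pushed error code. The enum and function names are made up for this example, and only a subset of the vectors from Table 6-1 of the Intel manual is listed.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum delivery { DELIVER_FAULT, DELIVER_TRAP, DELIVER_ABORT, DELIVER_UNKNOWN };

/* How the CPU delivers some well-known exception vectors. */
enum delivery classify_vector(unsigned vector)
{
    switch (vector) {
    case 0:  return DELIVER_FAULT;   /* #DE divide error       */
    case 3:  return DELIVER_TRAP;    /* #BP breakpoint (INT3)  */
    case 4:  return DELIVER_TRAP;    /* #OF overflow (INTO)    */
    case 6:  return DELIVER_FAULT;   /* #UD invalid opcode     */
    case 8:  return DELIVER_ABORT;   /* #DF double fault       */
    case 13: return DELIVER_FAULT;   /* #GP general protection */
    case 14: return DELIVER_FAULT;   /* #PF page fault         */
    case 18: return DELIVER_ABORT;   /* #MC machine check      */
    default: return DELIVER_UNKNOWN; /* not covered by this sketch */
    }
}

/* Among vectors 0 to 19, the ones for which the CPU pushes an error code
 * itself. For the others, a stub like the one in the question pushes a dummy
 * value so that every handler sees the same stack layout. */
bool cpu_pushes_error_code(unsigned vector)
{
    switch (vector) {
    case 8: case 10: case 11: case 12: case 13: case 14: case 17:
        return true;
    default:
        return false;
    }
}
```

A common handler could use such a table to decide whether returning with the saved EIP untouched will retry the instruction (fault) or move past it (trap).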

Faults

The processor handles an exception as a fault if the nature of the exception allows for it to be corrected in some way. Because of this, the return address pushed on the stack points at the instruction that generated the exception, so that the fault handler knows exactly which instruction caused the fault and can resume execution of the faulting instruction after fixing the problem. A Page Fault (#PF) exception is a good example of this. It can be used to implement virtual memory by having the fault handler provide a valid virtual mapping for the address that the faulting instruction tried to access. With a valid page mapping in place, the instruction can be resumed and executed without generating another page fault.
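For the stub in the question, the pointer handed to isr_handler can be viewed from C roughly as follows. This is a sketch that assumes the pushes happen exactly in the order shown in the question and that no privilege change occurred (otherwise the CPU also pushes the user ESP and SS above EFLAGS). The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stack contents at the time of `call isr_handler`, lowest address first.
 * Hypothetical layout derived from the push order in the question's stub. */
struct isr_frame {
    uint32_t ds;                          /* saved by `push eax`        */
    uint32_t edi, esi, ebp, esp_at_pusha,
             ebx, edx, ecx, eax;          /* saved by `pusha`           */
    uint32_t int_no, err_code;            /* pushed by the stub         */
    uint32_t eip, cs, eflags;             /* pushed by the CPU          */
};

/* The stub later does `add esp, 8` to drop int_no and err_code, so the
 * CPU-pushed EIP must sit directly above those two words. */
_Static_assert(offsetof(struct isr_frame, int_no) == 36, "int_no offset");
_Static_assert(offsetof(struct isr_frame, eip) == 44, "eip offset");

/* For a fault such as #DE, frame->eip is the address of the faulting
 * instruction itself; returning with it unchanged re-executes it. */
uint32_t faulting_address(const struct isr_frame *frame)
{
    return frame->eip;
}
```

Logging this address before terminating the offending code is usually more useful than modifying it.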

Traps

Interrupts and certain kinds of exceptions, all of them software exceptions, are handled as traps. Traps don't imply an error in the execution of an instruction. Hardware interrupts occur in between the execution of instructions, and software interrupts and certain software exceptions effectively mimic this behaviour. Traps are handled by pushing the address of the next instruction that would normally have been executed. This allows the trap handler to resume the normal execution of the interrupted code.
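The difference between the two return addresses can be modelled in a few lines of C. This is only a model of what the CPU pushes, not code meant to run inside an ISR, and the names are invented for the example.

```c
#include <stdint.h>

enum event_kind { EVENT_FAULT, EVENT_TRAP };

/* Return address the CPU saves for an event raised by the instruction at
 * `insn_addr`, which is `insn_len` bytes long. */
uint32_t saved_return_address(enum event_kind kind,
                              uint32_t insn_addr, uint32_t insn_len)
{
    /* Fault: points at the instruction itself, so it can be restarted.
     * Trap: points past it, so execution simply carries on. */
    return kind == EVENT_FAULT ? insn_addr : insn_addr + insn_len;
}
```

For a `div ecx` (2 bytes) at 0x1000 that raises #DE, the saved EIP is 0x1000, whereas an `int 0x30` (also 2 bytes) at the same address would save 0x1002.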

Aborts

Serious and unrecoverable errors are handled as aborts. There are only two exceptions that generate aborts: the Machine Check (#MC) exception and the Double Fault (#DF) exception. Machine check exceptions are the result of a hardware failure in the processor itself being detected; this can't be fixed, and normal execution can't be reliably resumed. Double fault exceptions happen when an exception occurs during the handling of an interrupt or an exception. This leaves the CPU in an inconsistent state, somewhere in the middle of the many steps needed to invoke an ISR, one that cannot be resumed. The return address pushed on the stack may or may not have anything to do with whatever caused the abort.

How divide error exceptions are normally handled

Normally, most operating systems handle divide error exceptions by passing them along to a handler in the executing process, or failing that, by terminating the process and indicating that it has crashed. For example, most Unix systems send a SIGFPE signal to the process, while Windows does something similar using its Structured Exception Handling mechanism. This is so the process's programming language runtime can set up its own handler to implement whatever behaviour is necessary for the programming language being used. Since division by zero results in undefined behaviour in C and C++, crashing is an acceptable behaviour, so these languages don't normally install a divide by zero handler.
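From the process's side, that hand-off looks something like the sketch below. It uses raise(SIGFPE) as a stand-in for a real divide error, since actually dividing by zero is undefined behaviour in C and returning normally from a handler for a hardware-generated SIGFPE is not well defined either. The function names are made up.

```c
#include <signal.h>

static volatile sig_atomic_t got_sigfpe = 0;

/* Runs in signal context: only record that the signal arrived. */
static void on_sigfpe(int signo)
{
    (void)signo;
    got_sigfpe = 1;
}

/* Installs the handler, simulates the kernel delivering SIGFPE, and reports
 * whether the handler ran. Returns 1 if it did, 0 otherwise. */
int demo_sigfpe_delivery(void)
{
    got_sigfpe = 0;
    if (signal(SIGFPE, on_sigfpe) == SIG_ERR)
        return 0;
    raise(SIGFPE);             /* stand-in for the kernel's #DE handling */
    signal(SIGFPE, SIG_DFL);   /* restore the default (terminate) action */
    return got_sigfpe;
}
```

A real runtime would typically report the error and exit, or unwind to a recovery point, instead of simply returning from the handler.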

Note that while you could handle divide error exceptions by "incrementing EIP", this is harder than you might think and doesn't produce a very useful result. You can't just add one or some other constant value to EIP; you need to skip over the entire instruction, which could be anywhere from 2 to 15 bytes long. There are three instructions that can cause this exception, AAM, DIV and IDIV, and these can be encoded with various prefixes and operand bytes. You'll need to decode the instruction to figure out how long it is. The result of performing this increment will be as if the instruction was never executed. The faulting instruction won't calculate a meaningful value and you'll have no indication why the program isn't behaving correctly.
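To give an idea of what that decoding involves, here is a minimal sketch of a length decoder for just these three instructions in a 32-bit code segment. It is an illustration under simplifying assumptions (it accepts any run of legacy prefixes and does not enforce the 15-byte limit), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy prefixes that may precede the opcode. */
static int is_prefix(uint8_t b)
{
    switch (b) {
    case 0x26: case 0x2E: case 0x36: case 0x3E:  /* ES, CS, SS, DS overrides */
    case 0x64: case 0x65:                        /* FS, GS overrides         */
    case 0x66: case 0x67:                        /* operand/address size     */
    case 0xF0: case 0xF2: case 0xF3:             /* LOCK, REPNE, REP         */
        return 1;
    default:
        return 0;
    }
}

/* Length in bytes of a DIV, IDIV or AAM instruction at `code`, or -1 if the
 * bytes are not one of those instructions or do not fit in `avail` bytes. */
int div_insn_length(const uint8_t *code, size_t avail)
{
    size_t i = 0;
    int addr16 = 0;

    while (i < avail && is_prefix(code[i])) {
        if (code[i] == 0x67)
            addr16 = 1;            /* address-size override: 16-bit ModRM */
        i++;
    }
    if (i >= avail)
        return -1;

    uint8_t opcode = code[i++];
    if (opcode == 0xD4)                      /* AAM imm8 */
        return i + 1 <= avail ? (int)(i + 1) : -1;
    if (opcode != 0xF6 && opcode != 0xF7)    /* group 3: r/m8 or r/m16/32 */
        return -1;
    if (i >= avail)
        return -1;

    uint8_t modrm = code[i++];
    unsigned mod = modrm >> 6, reg = (modrm >> 3) & 7, rm = modrm & 7;
    if (reg != 6 && reg != 7)                /* /6 = DIV, /7 = IDIV */
        return -1;

    size_t disp = 0;
    if (mod != 3) {                          /* memory operand */
        if (addr16) {
            if (mod == 1) disp = 1;
            else if (mod == 2 || (mod == 0 && rm == 6)) disp = 2;
        } else {
            unsigned base = rm;
            if (rm == 4) {                   /* SIB byte follows */
                if (i >= avail)
                    return -1;
                base = code[i++] & 7;
            }
            if (mod == 1) disp = 1;
            else if (mod == 2 || (mod == 0 && base == 5)) disp = 4;
        }
    }
    i += disp;
    return i <= avail ? (int)i : -1;
}
```

A handler could add the returned length to the saved EIP, but as noted above, the program then continues with whatever happened to be in the result registers.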

Read the documentation

If you're writing your own operating system then you'll need to have the Intel Software Developer's Manual available so you can consult it often. In particular you'll need to read and learn pretty much everything in Volume 3: System Programming Guide, excluding the Virtual Machine Extension chapters and everything afterwards. Everything you need to know about how interrupts and exceptions work is covered in detail there, plus a lot of other things you'll need to know.

Extrusive answered 6/12, 2019 at 19:46 Comment(0)

When I increment IP's value pushed to stack manually, the expected behavior occurs.

That's not the expected behavior. An exception can be considered a serious malfunction that demands the termination of the program. So simply returning to business is often not an option.

Is it necessary that the ISR increments the IP?

No. Generally the process is terminated with a "General protection fault" or a "Division by zero error" or something like this.

Or is this the responsibility of the "CPU/Hardware"?

If you want to continue executing the code somewhere (as in the case of SEH, Structured Exception Handling), your OS has to manage this. You can always do this; it's your choice to clean up the possible mess.

What's the correct behavior to move on?

The correct behavior is what you like it to be, because you're the OS designer, aren't you? ;-) The CPU/hardware just notifies you of the current state.

Bumbailiff answered 6/12, 2019 at 17:57 Comment(4)

Theoretically the ISR could find a way of correcting the div 0 error and continue the code execution, for example in simple embedded code which must not stop executing, but in practice it is easier to check for a 0 divisor in advance and take remedial action. If you haven't then the situation is unexpected and a bug was trapped, rather than continuing to execute somehow. – Missioner 6/12, 2019 at 18:41

@WeatherVane: I actually gave some thought to this: how could I (mathematically) justify that a division by zero is ok - and make it a legit part of the OS... But ATM I cannot elaborate on that, because I don't recall. Sorry. – Bumbailiff 6/12, 2019 at 18:47

No problem, I was only trying to add to your answer and upvoted it. – Missioner 6/12, 2019 at 18:49

An OS would be too general to know what to do about a specific instance - when the fault is in an app it knows nothing about. – Missioner 6/12, 2019 at 18:50

Here is what the Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, chapter 6.5 EXCEPTION CLASSIFICATIONS says:

Faults: A fault is an exception that can generally be corrected and that, once corrected, allows the program to be restarted with no loss of continuity. When a fault is reported, the processor restores the machine state to the state prior to the beginning of execution of the faulting instruction. The return address (saved contents of the CS and EIP registers) for the fault handler points to the faulting instruction, rather than to the instruction following the faulting instruction.

While a division by zero cannot typically be corrected, Table 6-1 (Protected-Mode Exceptions and Interrupts) still shows that the CPU designers decided that the #DE Divide Error should be a fault-type exception.

Knish answered 6/12, 2019 at 17:59 Comment(2)

A division by zero error can be corrected, e.g. by adjusting register contents such that the division yields 0 or some other desired value. – Hemispheroid 6/12, 2019 at 18:26

@fuz: For integer division it can't be correctly corrected. You can only allow "incorrectly corrected" values to propagate while making it significantly harder to identify the underlying cause of bugs. For floating point there's more options ("signalling NaN", infinity), but still no "always correct" option. – Puerilism 7/12, 2019 at 3:10

What's the correct behavior to move on?

Let's talk about the programmer's ability to detect (and then fix) bugs. In order of best to worst (or in order of "how quickly programmers find out about the mistake"), the options are:

detect the bug while the programmer is typing the source code
detect the bug at compile/link time
detect the bug at run-time
detect the bug after spending 3 months trying to figure out why you received a wave of hostile "your software is dodgy trash" emails (that contain no useful clues) from end users

For integer division by zero, detecting the bug while the programmer is typing would require a language and IDE designed for that purpose (it's not practical for the majority of existing languages); and even then it can't be 100% effective (e.g. the bug may be in the compiler and not in the programmer's source code). There are similar problems with detecting the bug at compile/link time.

That means that the "least worst practical option" is detecting the bug at run-time.

However, detecting the bug is only the first step. For example, if the bug is detected when a random/unknown end user runs the software on a laptop in the UK and the developer is in the USA, how does the developer get the information they need to fix the bug?

Ideally, you want some kind of automated system where (after all threads in the buggy process are stopped, but before the process is terminated) all relevant information (where the bug occurred in which version of which program, plus things like the contents of registers, etc.) is collected, then the end user is prompted with a "do you want to submit this info as a bug report" dialog box, and then (if the user agrees) the information is forwarded to some kind of "bug collection database" that allows statistics to be tracked (so that developers can determine things like how often the bug occurs, if the bug only occurs for people using a specific version, if the bug only occurs when people have used a certain feature, etc.).

Note: The "divide error" on 80x86 indicates overflow, not division by zero (division by zero is just one cause of overflow). For example, if a DIV instruction is used to divide a 64-bit integer and get a 32-bit result; then "0x0123456789ABCDEF / 3 = divide error exception" because the result will not fit in 32 bits.
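The overflow condition for the unsigned 64-by-32-bit DIV case can be written down as a small check. This is a sketch with a made-up function name; it models the condition only and does not execute a DIV instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a 32-bit unsigned DIV of the 64-bit dividend (EDX:EAX) by
 * `divisor` would raise #DE: either the divisor is zero or the quotient
 * needs more than 32 bits. */
bool div32_raises_de(uint64_t dividend, uint32_t divisor)
{
    /* The quotient fits in 32 bits exactly when the high half of the
     * dividend is smaller than the divisor; a zero divisor can never
     * satisfy that, so it is covered by the same comparison. */
    return (uint32_t)(dividend >> 32) >= divisor;
}
```

With the numbers from the note, the high half 0x01234567 is not smaller than 3, so the check reports a divide error even though the divisor is non-zero.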

Puerilism answered 7/12, 2019 at 3:41 Comment(0)
