Under which conditions does the KVM_RUN ioctl return?

At https://github.com/qemu/qemu/blob/stable-4.2/cpus.c#L1290 lies a very important piece of QEMU. I guess it's the event loop for a CPU running under KVM.

Here is the code:

static void *qemu_kvm_cpu_thread_fn(void *arg)
{
    CPUState *cpu = arg;
    int r;

    rcu_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
    cpu->thread_id = qemu_get_thread_id();
    cpu->can_do_io = 1;
    current_cpu = cpu;

    r = kvm_init_vcpu(cpu);
    if (r < 0) {
        error_report("kvm_init_vcpu failed: %s", strerror(-r));
        exit(1);
    }

    kvm_init_cpu_signals(cpu);

    /* signal CPU creation */
    cpu->created = true;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_guest_random_seed_thread_part2(cpu->random_seed);

    do {
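        /* kvm_cpu_exec() enters the guest via the KVM_RUN ioctl and returns
         * here when the vCPU has to stop running guest code for now
         * (e.g. a debug exception, a halt, or a stop request). */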
        if (cpu_can_run(cpu)) {
            r = kvm_cpu_exec(cpu);
            if (r == EXCP_DEBUG) {
                cpu_handle_guest_debug(cpu);
            }
        }
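        /* Drain work queued on this vCPU and sleep while it has nothing
         * to do (stopped or halted). */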
        qemu_wait_io_event(cpu);
    } while (!cpu->unplug || cpu_can_run(cpu));

    qemu_kvm_destroy_vcpu(cpu);
    cpu->created = false;
    qemu_cond_signal(&qemu_cpu_cond);
    qemu_mutex_unlock_iothread();
    rcu_unregister_thread();
    return NULL;
}

I'm interested in the do loop. It calls kvm_cpu_exec repeatedly; that function is defined here: https://github.com/qemu/qemu/blob/stable-4.2/accel/kvm/kvm-all.c#L2285

At one point kvm_cpu_exec calls run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);, which issues the KVM_RUN ioctl documented here: https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt

4.10 KVM_RUN

Capability: basic
Architectures: all
Type: vcpu ioctl
Parameters: none
Returns: 0 on success, -1 on error
Errors:
EINTR: an unmasked signal is pending

This ioctl is used to run a guest virtual cpu. While there are no
explicit parameters, there is an implicit parameter block that can be
obtained by mmap()ing the vcpu fd at offset 0, with the size given by
KVM_GET_VCPU_MMAP_SIZE. The parameter block is formatted as a 'struct
kvm_run' (see below).
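
For context, the implicit parameter block is obtained by mmap()ing the vcpu fd, roughly like this (a minimal sketch, not QEMU's code; I believe QEMU does the equivalent in kvm_init_vcpu in accel/kvm/kvm-all.c, and error handling is omitted here):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Map the 'struct kvm_run' block shared with the kernel for one vCPU.
 * kvm_fd is the /dev/kvm fd, vcpu_fd comes from KVM_CREATE_VCPU;
 * the function name is made up for this sketch. */
struct kvm_run *map_run_block(int kvm_fd, int vcpu_fd)
{
    /* The region size is a property of the KVM device, not of the vCPU. */
    long mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);

    /* Offset 0 of the vcpu fd is the kvm_run parameter block that
     * KVM_RUN reads inputs from and writes exit information to. */
    return mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, vcpu_fd, 0);
}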

I am struggling to understand whether this ioctl blocks execution. In which cases does it return?

I would like a little context on what is happening. Given the line qemu_wait_io_event(cpu), it looks like the ioctl returns at least every time an event has to be read from or written to the CPU, but I'm not sure; I'm confused.

Asked by Hoang on 8/12/2020 at 7:24

The KVM API design requires each virtual CPU in the VM to have an associated userspace thread in the program that is controlling that VM, such as QEMU (this program is often called a "Virtual Machine Monitor" or VMM, and it doesn't have to be QEMU; other examples are kvmtool and Firecracker).

The thread behaves like a normal userspace thread within QEMU up to the point where it makes the KVM_RUN ioctl. At that point the kernel uses that thread to execute guest code on the vCPU associated with the thread. This continues until some condition is encountered which means that guest execution can't proceed any further. (One common condition is "the guest made a memory access to a device that is being emulated by QEMU".) At that point, the kernel stops running guest code on this thread, and instead causes it to return from the KVM_RUN ioctl. The code within QEMU then looks at the return code and so on to find out why it got control back, deals with whatever the situation was, and loops back around to call KVM_RUN again to ask the kernel to continue to run guest code.
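
To make that loop concrete, here is a minimal sketch of the pattern (this is not QEMU's actual kvm_cpu_exec; vcpu_fd and run, the mmap'ed struct kvm_run, are assumed to already exist, and real device emulation is reduced to comments):

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <errno.h>

/* Sketch of a VMM vCPU loop: enter the guest with KVM_RUN, and when the
 * ioctl returns, look at run->exit_reason to see why we got control back. */
void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        /* Blocks in the kernel executing guest code until the guest does
         * something userspace must handle, or an unmasked signal arrives. */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            if (errno == EINTR) {
                continue;           /* signal pending: just re-enter */
            }
            return;                 /* real error */
        }

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* x86 port I/O: the data lives in the run block at
             * run->io.data_offset; emulate the device access here. */
            break;
        case KVM_EXIT_MMIO:
            /* Access to "device" memory: run->mmio describes it. */
            break;
        case KVM_EXIT_HLT:
            return;                 /* guest halted */
        default:
            return;                 /* unhandled exit */
        }
        /* ...then loop around and run guest code again. */
    }
}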

Typically when running a VM, you'll see that almost all the time the thread is inside the KVM_RUN ioctl, running real guest code. Occasionally execution returns to userspace, QEMU spends as little time as possible doing whatever it needs to do, and then it loops around and runs guest code again. One way of improving the efficiency of a VM is to keep the number of these "VM exits" as low as possible (e.g. by careful choice of what kind of network or block device the guest is given).

Answered by Averil on 8/12/2020 at 16:12. Comments (4):
Very good explanation. I was looking into how QEMU simulates peripherals, and now I'm wondering how KVM knows to return from the KVM_RUN call. My guess is that QEMU can tell KVM that when certain addresses are accessed, the kernel should return so QEMU can handle the access. Am I right, or at least close? Let's say the guest kernel accesses a peripheral; then the QEMU thread is unblocked and QEMU can respond to that peripheral access. But how? – Hoang
Maybe if the guest kernel tries to read from an address (where a PCI device should be located) into a register, then KVM could return so QEMU can write to that register? It would be nice if you could point me to at least a place to read about this. I'm writing some notes on GitHub about QEMU internals that could help others :) – Hoang
I've taken a closer look at qemu_wait_io_event, which eventually calls process_queued_cpu_work, which runs the functions on cpu->queued_work_first. It looks like it only handles queued work items, so I doubt it's related to I/O device communication, because that I/O traffic would have to be queued constantly. The only relevant calls that queue work on the CPU that I found were in kvm-all, such as kvm_cpu_synchronize_post_reset. – Hoang
KVM returns from the KVM_RUN call for a peripheral access when the access is to some address where the VMM didn't map memory and where there's no built-in kernel device. The guest access causes a stage-2 fault, in Arm terms; I dunno the x86 terminology. In any case the hypervisor gets control and can emulate the load/store however it likes, which for KVM usually means "let userspace deal with it". – Averil
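
To make that concrete, here's a rough sketch (not QEMU's code; device_read()/device_write() are made-up stand-ins for the VMM's device-model dispatch) of what userspace can do with a KVM_EXIT_MMIO before re-entering KVM_RUN:

#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>

/* Made-up hooks standing in for the VMM's device-model dispatch. */
uint64_t device_read(uint64_t addr, unsigned len);
void device_write(uint64_t addr, const void *data, unsigned len);

/* Handle one KVM_EXIT_MMIO: for a guest store, the written bytes are in
 * run->mmio.data; for a guest load, userspace fills run->mmio.data and KVM
 * completes the load when the vCPU is resumed with the next KVM_RUN. */
void handle_mmio(struct kvm_run *run)
{
    if (run->mmio.is_write) {
        device_write(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
    } else {
        uint64_t val = device_read(run->mmio.phys_addr, run->mmio.len);
        memcpy(run->mmio.data, &val, run->mmio.len);   /* little-endian sketch */
    }
}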
