How many bits are there in a TLB ASID tag for Intel processors? And how to handle 'ASID overflow'?

Intel calls ASIDs process-context identifiers (PCIDs). On all Intel processors that support PCIDs, the size of a PCID is 12 bits. They constitute bits 11:0 of the CR3 register. By default, on processor reset, CR4.PCIDE (bit 17 of CR4) is cleared and CR3.PCID is zero, so if the OS wants to use PCIDs, it first has to set CR4.PCIDE to enable the feature. Writing a nonzero PCID value is only allowed when CR4.PCIDE is set. That said, when CR4.PCIDE is set, it is also possible to write zero to CR3.PCID. Therefore, the maximum number of PCIDs that can be simultaneously used is 2^12 = 4096.
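
To make the CR3 layout concrete, here is a minimal user-space sketch of how a CR3 value carries a PCID. The helper names (cr3_pcid, make_cr3) and the macro names are my own and are not kernel or Intel identifiers. The only thing taken from the text above is the position of the PCID field; the sketch also assumes the no-flush hint in bit 63 of the value written to CR3, which is meaningful when CR4.PCIDE is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; only the bit positions matter. */
#define PCID_BITS   12
#define PCID_MASK   ((UINT64_C(1) << PCID_BITS) - 1)  /* CR3 bits 11:0 */
#define CR3_NOFLUSH (UINT64_C(1) << 63)                /* keep entries tagged with this PCID */

/* Extract the PCID field from a CR3 value. */
static inline uint16_t cr3_pcid(uint64_t cr3)
{
    return (uint16_t)(cr3 & PCID_MASK);
}

/* Compose a CR3 value from a 4 KiB-aligned page-table base and a PCID. */
static inline uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid, bool noflush)
{
    uint64_t cr3 = (pgd_phys & ~PCID_MASK) | (pcid & PCID_MASK);

    return noflush ? (cr3 | CR3_NOFLUSH) : cr3;
}
```

Since the field is 12 bits wide, PCID_MASK + 1 is the 4096 mentioned above.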

I'll discuss how the Linux kernel allocates PCIDs. The Linux kernel itself actually uses the term ASIDs even for Intel processors and so I'll use this term as well.

In general, there are many ways to manage the ASID space, such as the following:

When a new process needs to be created, allocate a dedicated ASID for the process. If the ASID space has been exhausted, then refuse to create the process and fail. This is simple and efficient, but may severely limit the number of processes.
Instead of limiting the number of processes to the availability of ASIDs, when the ASID space has been exhausted, behave as if ASIDs are not supported. That is, flush the whole TLB on a process-context switch for all processes. Practically, this is a terrible method since you might end up switching between disabling and enabling ASIDs as processes get created and terminated. This method incurs a potentially high performance penalty.
Allow multiple processes to use the same ASID. In this case, you need to be careful when switching between processes that use the same ASID since the TLB entries tagged with that ASID all still need to be flushed.
In all of the previous methods, each process has an ASID and so the OS data structure that represents a process needs to have a field that stores the ASID. An alternative method is to store the currently allocated ASIDs in a separate structure. ASIDs are allocated to processes dynamically at the time when they need to execute. Processes that are not active will not have ASIDs assigned to them. This has two advantages over the previous methods. First, the ASID space is used more efficiently since mostly dormant processes do not unnecessarily consume ASIDs. Second, all the currently allocated ASIDs are stored in the same data structure, which could be made small enough to fit within a few cache lines. In this way, finding new ASIDs can be done efficiently.
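
For comparison, the first method in the list above could be sketched as a simple bitmap allocator. This is only an illustration of the idea, not code from any kernel; asid_pool, asid_alloc and asid_free are hypothetical names, and the sketch ignores locking and the TLB flush that a real OS would need before reusing a freed ASID.

```c
#include <stdint.h>

#define NR_ASIDS 4096   /* 2^12 possible PCIDs */

/* Hypothetical allocator state: one bit per ASID, set while in use. */
struct asid_pool {
    uint64_t used[NR_ASIDS / 64];
};

/*
 * Return a free ASID, or -1 when the space is exhausted, at which
 * point this strategy refuses to create the process.
 */
static int asid_alloc(struct asid_pool *pool)
{
    for (int i = 0; i < NR_ASIDS; i++) {
        uint64_t bit = UINT64_C(1) << (i % 64);

        if (!(pool->used[i / 64] & bit)) {
            pool->used[i / 64] |= bit;
            return i;
        }
    }
    return -1;
}

/* Release an ASID when its process exits. */
static void asid_free(struct asid_pool *pool, int asid)
{
    pool->used[asid / 64] &= ~(UINT64_C(1) << (asid % 64));
}
```

The hard failure once all 4096 values are taken is exactly the limitation that the dynamic method avoids.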

Linux uses the last method and I'll discuss it in some additional detail.

Linux only remembers the last 6 ASIDs used on each core. This number is specified by the TLB_NR_DYN_ASIDS macro. The kernel creates a per-core data structure of type tlb_state, which contains an array of that size:

struct tlb_context {
    u64 ctx_id;
    u64 tlb_gen;
};

struct tlb_state {

    .
    .
    .

    u16 next_asid;
    struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);

The type includes other fields but I've shown only two for brevity. Linux defines the following ASID spaces:

The canonical ASID space: these include ASIDs 0 to 5 (TLB_NR_DYN_ASIDS - 1). These values are stored in the next_asid field and used as indices into the ctxs array.
The kernel ASID (kPCID) space: these include ASIDs 1 to 6 (TLB_NR_DYN_ASIDS), that is, the canonical ASID plus 1. These values are actually stored in CR3.PCID.
The user ASID (uPCID) space: these include ASIDs 2048 + 1 to 2048 + 6 (2048 + TLB_NR_DYN_ASIDS), that is, the kPCID plus 2048. These values are actually stored in CR3.PCID.

Each process has a single canonical ASID. This is the value used by Linux itself. Each canonical ASID is associated with a kPCID and a uPCID, which are the values that are actually stored in CR3.PCID. The reason for having two ASIDs per process is to support page-table isolation (PTI), which mitigates the Meltdown vulnerability. In fact, with PTI, each process has two virtual address spaces, each with its own ASID, but the two ASIDs have a fixed arithmetic relationship as shown above. So even though Intel processors support 4096 ASIDs per core, Linux only uses 12 per core. I'll get to the ctxs array; just bear with me a little.
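
The fixed arithmetic relationship between the three spaces can be written down in a few lines. This is a sketch based only on the ranges listed above; the function and macro names are my own and should not be read as the kernel's.

```c
#include <stdint.h>

#define USER_PCID_BIT 11   /* 1 << 11 == 2048; illustrative name */

/* kPCID: the canonical ASID shifted up by one. */
static inline uint16_t asid_to_kpcid(uint16_t asid)
{
    return (uint16_t)(asid + 1);
}

/* uPCID: the kPCID with bit 11 set, that is, kPCID + 2048. */
static inline uint16_t asid_to_upcid(uint16_t asid)
{
    return (uint16_t)(asid_to_kpcid(asid) | (1u << USER_PCID_BIT));
}
```

For example, canonical ASID 0 corresponds to kPCID 1 and uPCID 2049. Because the kPCIDs are small, setting bit 11 is the same as adding 2048, so the kernel only has to remember the canonical ASID.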

Linux assigns ASIDs to processes dynamically on context switches, not on creation. The same process may get different ASIDs on different cores and its ASID may change dynamically whenever a thread of that process is scheduled to run on a core. This is done in the switch_mm_irqs_off function, which gets called whenever the scheduler switches from one thread to another on a core, even if the two threads belong to the same process. There are two cases to consider:

A user thread got interrupted or it performed a system call. In this case, the system switches to kernel-mode to handle the interrupt or the system call. Since the user thread was just running, its process must have an already assigned ASID. If the OS decided later to resume executing the same thread or another thread of the same process, then it will just continue using the same ASID. This case is boring.
The OS decides to schedule a thread of another process to run on the core. So the OS has to assign an ASID to the process. This case is very interesting and will be discussed in detail in the rest of this answer.

In this case, the kernel executes the following function call:

choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

The first argument, next, points to the memory descriptor of the process to which the thread that the scheduler selected to resume belongs. This object contains many things, but the one we care about here is ctx_id, a 64-bit value that is unique per existing process. The next_tlb_gen argument is used to determine whether a TLB invalidation is required, as I'll discuss shortly. The function has a void return type; it returns its results through two output parameters: new_asid, which holds the ASID assigned to the process, and need_flush, which says whether a TLB invalidation is required.

static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
                u16 *new_asid, bool *need_flush)
{
    u16 asid;

    if (!static_cpu_has(X86_FEATURE_PCID)) {
        *new_asid = 0;
        *need_flush = true;
        return;
    }

    if (this_cpu_read(cpu_tlbstate.invalidate_other))
        clear_asid_other();

    for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
        if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
            next->context.ctx_id)
            continue;

        *new_asid = asid;
        *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
                   next_tlb_gen);
        return;
    }

    /*
     * We don't currently own an ASID slot on this CPU.
     * Allocate a slot.
     */
    *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
    if (*new_asid >= TLB_NR_DYN_ASIDS) {
        *new_asid = 0;
        this_cpu_write(cpu_tlbstate.next_asid, 1);
    }
    *need_flush = true;
}

Logically, the function works as follows. If the processor does not support PCIDs, then all processes get an ASID value of zero and a TLB flush is always required. I'll skip the invalidate_other check since it's not relevant. Next, the loop iterates over all 6 canonical ASIDs and uses them as indices into the ctxs array. The process whose context identifier equals cpu_tlbstate.ctxs[asid].ctx_id is currently assigned the ASID value asid. So the loop checks whether the process still has an ASID assigned to it. If it does, the same ASID is used and need_flush is set according to next_tlb_gen: a flush is needed only if the generation recorded for the slot is older than next_tlb_gen. The reason that we may need to flush the TLB entries associated with the ASID even though the ASID was not recycled is the lazy TLB invalidation mechanism, which is beyond the scope of your question.

If none of the currently used ASIDs has been assigned to the process, then we need to allocate a new one. The call to this_cpu_add_return increments next_asid by 1 and returns the new value; subtracting 1 recovers the value that next_asid held before the increment, and that becomes the canonical ASID. In other words, next_asid is post-incremented. If the result has gone past the maximum canonical ASID (that is, it is at least TLB_NR_DYN_ASIDS), then we wrap around to canonical ASID zero and write 1 to next_asid so that the following allocation picks slot 1. Slots are therefore recycled in round-robin order. Once the slots have wrapped around, the chosen slot previously belonged to some other process, so we definitely want to flush the TLB entries associated with that ASID on the core. When choose_new_asid returns to switch_mm_irqs_off, the ctxs array and CR3 are updated accordingly. Writing to CR3 with bit 63 (the no-flush bit) clear makes the core automatically flush the TLB entries associated with that ASID. If the process whose ASID was reassigned to another process is still alive, then the next time one of its threads runs on that core, it will get assigned a new ASID there. Otherwise, if that process is dead, its old slot simply gets recycled at some point in the future. This whole procedure happens independently on each core.
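
To see the lookup and the round-robin recycling in action outside the kernel, here is a small user-space model of the logic described above. It is a sketch, not the kernel's code: pick_asid and core_state are hypothetical names, the per-CPU accessors and the invalidate_other check are left out, and the ctxs update that the kernel performs in switch_mm_irqs_off is folded into the same function. It also assumes that a ctx_id of 0 never belongs to a real process, so zero-initialized slots count as empty.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_DYN_ASIDS 6   /* mirrors TLB_NR_DYN_ASIDS */

struct ctx {
    uint64_t ctx_id;
    uint64_t tlb_gen;
};

/* Simplified stand-in for one core's tlb_state. */
struct core_state {
    uint16_t next_asid;
    struct ctx ctxs[NR_DYN_ASIDS];
};

/* Pick the canonical ASID for a process on this core and record it. */
static uint16_t pick_asid(struct core_state *cs, uint64_t ctx_id,
                          uint64_t tlb_gen, bool *need_flush)
{
    uint16_t asid;

    for (asid = 0; asid < NR_DYN_ASIDS; asid++) {
        if (cs->ctxs[asid].ctx_id != ctx_id)
            continue;

        /* Hit: flush only if the recorded generation is stale. */
        *need_flush = cs->ctxs[asid].tlb_gen < tlb_gen;
        if (*need_flush)
            cs->ctxs[asid].tlb_gen = tlb_gen;
        return asid;
    }

    /* Miss: take the next slot, wrapping around after the last one. */
    asid = cs->next_asid++;
    if (asid >= NR_DYN_ASIDS) {
        asid = 0;
        cs->next_asid = 1;
    }
    cs->ctxs[asid].ctx_id = ctx_id;
    cs->ctxs[asid].tlb_gen = tlb_gen;
    *need_flush = true;
    return asid;
}
```

Starting from a zeroed core_state, six distinct processes fill slots 0 through 5. Scheduling a seventh process evicts whichever process held slot 0, and if the evicted process runs again it misses, takes slot 1 and needs a flush.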

The reason that Linux uses exactly 6 ASIDs per core is that it keeps the tlb_state type just small enough to fit within two 64-byte cache lines. Generally, there can be dozens of processes that are simultaneously alive on a Linux system. However, most of them are typically dormant, so the way Linux manages the ASID space is very efficient in practice. It would be interesting to see an experimental evaluation of how the value of TLB_NR_DYN_ASIDS affects performance, but I'm not aware of any such published study.
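
The cache-line budget is easy to check with a bit of arithmetic. The following sketch only accounts for the ctxs array shown earlier; the other tlb_state fields are not reproduced here, so it shows how much room is left for them rather than the real size of the structure. The names are illustrative.

```c
#include <stdint.h>

#define NR_DYN_ASIDS    6    /* mirrors TLB_NR_DYN_ASIDS */
#define CACHE_LINE_SIZE 64

struct ctx {
    uint64_t ctx_id;
    uint64_t tlb_gen;
};

/* Each entry is two 64-bit values, so 16 bytes with no padding. */
_Static_assert(sizeof(struct ctx) == 16, "unexpected padding in struct ctx");

enum {
    CTXS_BYTES   = NR_DYN_ASIDS * sizeof(struct ctx),  /* 6 * 16 = 96 */
    BUDGET_BYTES = 2 * CACHE_LINE_SIZE,                /* 128 */
    OTHER_BYTES  = BUDGET_BYTES - CTXS_BYTES           /* 32 left for the other fields */
};
```

So the array alone takes 96 of the 128 bytes, and a seventh entry would leave only 16 bytes for everything else.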
