Why does x86 paging have no concept of privilege rings?
D

3

6

Back in 1982, when Intel released the 80286, they added 4 privilege levels to the segmentation scheme (rings 0-3), specified by a 2-bit field in each descriptor of the Global Descriptor Table (GDT) and Local Descriptor Table (LDT).

In the 80386 processor, Intel added paging, but surprisingly, it only has 2 privilege levels (supervisor and user), specified by a single bit in the Page Directory Entry (PDE) and Page Table Entry (PTE).

This means that an OS that only uses paging (like most modern OSes) is unable to benefit from the existence of rings 1 and 2, which could be very useful, for example, for drivers. (Win9x, for example, frequently crashed because it was loading buggy unchecked drivers into ring 0).

From the POV of portability, the existence of rings 1 and 2 is a quirk of the x86 architecture and portable OSes shouldn't use them, because other architectures only have 2 privilege levels.

But I am sure that portability to other platforms is not what Intel engineers were thinking back in 1985 when they were designing the 386.

So why didn't Intel allow paging to have 4 privilege levels, like segmentation?

Donee answered 4/2, 2021 at 20:41 Comment(9)
Paging allows 4 levels of privilegeUnmeasured
@Unmeasured The PDE and PTE have only 1 bit to specify the privilege.Donee
Operating systems use 2 levels because they didn't deem it useful to support 4 levels.Unmeasured
@Unmeasured Yes, I was talking from the CPU designer's standpoint, not the OS designer's one.Donee
You are right, I had a wrong memory that page tables had 2 bits specifying the access level.Unmeasured
I think from the point of view of CPU designers, they decided to have 2 levels simply because OSes used 2 levels even when segmentation was a thing.Unmeasured
@Unmeasured Protected mode didn't get much use before the 90s, when the 386 was already 5 years old. So no, when Intel designed the 386, they couldn't have known that nobody would use 4 privilege levels.Donee
To be fair, OSes like Unix existed when 386 was designed. (On other OSes, but also ported to 8086 as Xenix, I guess without memory protection). But of course, so did Multics, as Brendan points out in his answer. Still, paging on other mainstream CPUs / MMUs like MIPS (and probably earlier non-RISC machines) did exist with I think user vs. supervisor.Hochstetler
@DarkAtom, There were plenty of hardware and software systems using protection prior to the 386.Shelburne
H
3

One guess that occurs to me is that Intel intended that when Ring 1 code is running, it is the supervisor, "supervising" ring 3 code. Not ring 1 running under ring 0.

If the ring 1 code wants to call ring 0 code, it can call through a call-gate, and the ring 0 code can change CR3 to a page table that includes mappings for physical pages that weren't present in the page table the ring 1 or 2 code was using.

I really don't know a lot about this stuff, but https://wiki.osdev.org/Task_State_Segment shows that the TSS includes a CR3 field, so using hardware task-switching I'm guessing that calling through a call-gate can trigger the CR3 change directly. (So the call target does not already have to be mapped, otherwise ring 1 / 2 code could have modified it. Or it could be mapped read-only, along with the page table itself and the GDT, to stop the ring 1 code from taking over ring 0 by modifying it.)

This means that an OS that only uses paging [...] unable to benefit from the existence of rings 1 and 2

That's your mistake: you can't "only use paging". Even making interrupt handling from user-space work on a normal x86 OS (with a flat memory model) requires setting up TSS stuff to set ESP to the kernel stack pointer when switching to kernel mode, even if you don't otherwise use hardware task-switching.

x86 has "task gates" and "call gates" and all kinds of really complex stuff I hope I don't ever have to fully understand, but I expect that spending some time reading up on it might shed some light on the kind of things the architects of 386 thought OSes might want to do.

Separate from my previous guess (about ring 1 supervising ring 3), perhaps Intel expected OSes to use segmentation to separate ring 1 / 2 from ring 0 memory in the same page table if desired (see Footnote 1). As you say, they probably weren't trying to create something that portable microkernel OSes could just use as a bonus.

A kernel has the luxury of deciding the layout of virtual address space, so it could well assign chunks of that for use by ring 1 code, setting up CS/DS/ES/SS appropriately when calling it.

I think that would have to mean a non-flat model, though, because x86 segmentation makes addresses go from 0..limit, not e.g. allowing access to a range of virtual addresses from low..high without changing the meaning of a pointer.
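To make the "0..limit" point concrete, here is a minimal sketch of how an expand-up segment turns an offset into a linear address. The names `linear_address` and `SegmentFault` are made up for illustration, `limit` is assumed to be the effective limit in bytes (after granularity scaling), and expand-down segments are ignored.

```python
class SegmentFault(Exception):
    """Stands in for the fault the CPU raises on a limit violation."""


def linear_address(base: int, limit: int, offset: int) -> int:
    """Translate a segment-relative offset for an expand-up segment.

    Valid offsets always run from 0 to `limit`; the segment base decides
    where that window sits in the linear address space.
    """
    if not 0 <= offset <= limit:
        raise SegmentFault(f"offset {offset:#x} is outside 0..{limit:#x}")
    return (base + offset) & 0xFFFFFFFF  # linear addresses are 32 bits wide
```

With a hypothetical ring 1 segment at base `0x40000000`, the pointer value `0x1000` refers to linear address `0x40001000`, while the same pointer value through a flat segment (base 0) refers to `0x1000`. That difference in what a pointer means is why this approach seems to imply a non-flat model.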

Footnote 1:

Is it necessary to have full memory protection between ring 0 and ring 1? An OS might use ring 1 for semi-trusted code.

Some privileged instructions require ring 0 so ring 1 would stop that from happening by accident. IO privilege level can be set separately to allow cli and in/out in ring > 0, but other instructions like invlpg, lgdt, and mov cr, reg require actual ring 0.
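A rough model of that rule, as a sketch only: the function name and the instruction sets below are illustrative and far from complete, and the model ignores the TSS I/O permission bitmap, which can still allow individual ports for in/out when CPL > IOPL.

```python
# Instructions that fault unless CPL == 0 (examples from the text above).
PRIVILEGED = {"invlpg", "lgdt", "mov cr"}

# Instructions gated by the IOPL field in EFLAGS: allowed when CPL <= IOPL.
IOPL_SENSITIVE = {"cli", "sti", "in", "out"}


def may_execute(instruction: str, cpl: int, iopl: int) -> bool:
    """Return True if this simplified model lets the instruction run."""
    if instruction in PRIVILEGED:
        return cpl == 0
    if instruction in IOPL_SENSITIVE:
        return cpl <= iopl
    return True  # everything else is unprivileged in this model
```

So with IOPL set to 1, `may_execute("cli", 1, 1)` is true but `may_execute("lgdt", 1, 1)` is false: ring 1 code could mask interrupts and do port I/O, yet could not reload the GDT.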

Hochstetler answered 4/2, 2021 at 22:25 Comment(8)
BTW, @MichaelPetch or @ IraBaxter, or other users who have some experience playing with toy OSes / bootloaders could probably give a better answer than this, or fill in some details I'm glossing over. Ira has written some about x86 protected-mode and how it allows a Multics-like memory model: What is the "FS"/"GS" register intended for?Hochstetler
I never tested but I'm pretty sure that the limit is ignored once long mode is enabled making segmentation unusable for memory protection.Unmeasured
@user123: Yes, x86-64 long mode neutered segmentation, base=0 and limit=-1 are fixed for segments other than FS/GS. This Q&A is about Intel's design of 386 protected mode, not AMD's decades-later design of AMD64.Hochstetler
Yes of course that holds only for more modern CPUs.Unmeasured
@user123: You can of course run an x86-64 CPU in "legacy mode"; simply don't enable long mode and run a 386 OS. So even modern CPUs have to support all this, just not in long mode.Hochstetler
By "only use paging" I meant that OSes use paging as their main memory protection scheme, and only set up segmentation, TSS and the rest of that stuff if it is absolutely required. To the best of my knowledge, you can't even use segmentation and hardware task switching in long mode.Donee
@DarkAtom: Long mode obviously has nothing to do with the motivations / ideas of the architects of 386. AMD designed that over 15 years after Intel designed 386. But regardless, as I and Brendan have pointed out, mov cr3, reg requires ring 0, so ring 0 can maintain control of the page tables and only map memory that ring 1 or 2 code should be allowed to access. However, Intel's design for 386 certainly hints that hardware task switching was intended for some ways of using multiple privilege levels, so multiple privilege levels in long mode are less versatile for x86-64 than i386.Hochstetler
@DarkAtom: But note that the low 2 bits of the code segment selector are what set the current privilege level, with the rest of the bits being an index into GDT or LDT. Segmentation doesn't allow variable base or limit in long mode (except for variable FS and GS base via an MSR to get the full 64-bit range), but it's still the mechanism for choosing the mode (long mode vs. 16 or 32-bit compat mode) as well as privilege level. However, DS and ES can be 0 (the null selector) in long mode. So basically segmentation is vestigial for modern x86-64, and an over-complicated way to set modes.Hochstetler
B
3

There are four privilege levels (called rings) in 386 protected mode as well as in 286: ring 0 has the highest privilege (operating system), rings 1 and 2 are not widely used, and ring 3 has the lowest privilege (user application). Rings 0-2 are called "Supervisor", while ring 3 is called "User".

The current privilege level (CPL) is a property of the code you are executing: it normally equals the Descriptor Privilege Level (DPL) of the code segment that the instruction is fetched from. For more information about the current privilege level, see CPL vs. DPL vs. RPL.
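For reference, the CPL is held in the low two bits of the selector loaded in CS. A small sketch of how a 16-bit segment selector splits into fields follows; `Selector` and `decode_selector` are names invented for this example.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int     # bits 15..3: descriptor index within the table
    use_ldt: bool  # bit 2 (TI): False selects the GDT, True the LDT
    rpl: int       # bits 1..0: privilege level (the CPL when held in CS)


def decode_selector(value: int) -> Selector:
    """Split a 16-bit x86 segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is 16 bits wide")
    return Selector(index=value >> 3, use_ldt=bool(value & 0b100), rpl=value & 0b11)
```

For example, `decode_selector(0x1B)` gives GDT index 3 with privilege level 3, and `decode_selector(0x08)` gives GDT index 1 with privilege level 0. The selector values are arbitrary examples, not any particular OS's layout.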

The bit you are referring to is bit 2 of a 32-bit Page-Directory Entry (PDE) that maps a 4MB page (or of a 32-bit PDE that references a page table). This bit is called "User/Supervisor" (U/S). A value of 0 in this bit means that user-mode accesses are not allowed to the 4MB region controlled by this entry. This does not mean that there are, as you wrote, just "2 privilege levels (supervisor and user)": the "supervisor" level still consists of three rings, which together with the user ring make four rings in total.

See section 4.6 of Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1:

Every access to a linear address is either a supervisor-mode access or a user-mode access. For all instruction fetches and most data accesses, this distinction is determined by the current privilege level (CPL): accesses made while CPL < 3 are supervisor-mode accesses, while accesses made while CPL = 3 are user-mode accesses.

Therefore, CPL can be 0, 1, 2 and 3, effectively having all 4 rings.

The same manual says more about the U/S flag:

Some operations implicitly access system data structures with linear addresses [...] called implicit supervisor-mode accesses regardless of CPL. Other accesses made while CPL < 3 are called explicit supervisor-mode accesses. Access rights are also controlled by the mode of a linear address as specified by the paging-structure entries controlling the translation of the linear address. If the U/S flag (bit 2) is 0 in at least one of the paging-structure entries, the address is a supervisor-mode address. Otherwise, the address is a user-mode address.
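The quoted rules can be modelled in a few lines. This is only a sketch of the U/S check for explicit accesses on a two-level 32-bit page table; it deliberately ignores the R/W bit, CR0.WP, and later additions such as SMEP/SMAP and NX, and the function names are invented for the example.

```python
PRESENT = 1 << 0  # P: the entry is valid
USER = 1 << 2     # U/S: 1 = user-mode address, 0 = supervisor-only


def is_user_address(pde: int, pte: int) -> bool:
    """An address is a user-mode address only if U/S is set at every level."""
    return bool(pde & USER) and bool(pte & USER)


def access_allowed(cpl: int, pde: int, pte: int) -> bool:
    """Model only the U/S check for an explicit access made at the given CPL."""
    if not (pde & PRESENT and pte & PRESENT):
        return False  # not mapped: page fault regardless of privilege
    supervisor_access = cpl < 3  # rings 0, 1 and 2 all count as supervisor
    return supervisor_access or is_user_address(pde, pte)
```

In this model a supervisor-only page (`PRESENT` without `USER`) is accessible at CPL 0, 1 and 2 alike and inaccessible only at CPL 3, which is the behaviour the question is asking about.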

P.S. My answer does not address why there isn't the same memory protection between ring 1 and ring 0 as there is between ring 3 and rings 0/1/2, which leaves rings 1 and 2 unusable if a page-table entry can't distinguish them from ring 0. See the reply by Peter Cordes, which addresses this issue.

Bemis answered 4/2, 2021 at 21:38 Comment(2)
The question is why there isn't memory protection between ring 1 and ring 0, the way there is between ring 3 and ring 0/1/2. You're not answering that, just going into detail about the fact that it's "missing". (The question is assuming that ring 1 and 2 are unusable if a page-table entry can't distinguish them from ring 0, not saying they don't exist at all.)Hochstetler
@PeterCordes, thank you, I have just added a note about this.Bemis
P
2

The desire is to protect stuff from other stuff. Before paging existed (and before 80x86 existed - the "4 rings" model dates back to Multics if not earlier) the easiest way was to use "rings".

With 4 rings you can have a "D can't access C, and they can't access B, and they all can't access A" arrangement. This is relatively awful for the opposite direction ("C can access everything in D regardless of whether it needs to or not") and relatively awful for granularity (e.g. if you want "C can access part of D but not all of D").

With paging, you can give each thing its own virtual address space and map anything anywhere to control access (as you can't access anything that isn't mapped into your virtual address space). You can still have "D can't access C, and they can't access B, and they all can't access A" (if that's what you actually want) just by mapping all pages belonging to D into A, B and C; and mapping all pages belonging to C into A and B; and so on. However, you can also have any other arrangement - e.g. simulate 10 rings instead of 4 rings, or let C access part of D (but not all of D) and part of B (but not all of B), or...
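As a toy illustration of that idea, each domain's virtual address space can be treated as a set of mapped page numbers. The helper names and the page numbers are hypothetical, and real page tables obviously carry far more state than set membership.

```python
from typing import Dict, List, Set


def nested_address_spaces(pages: Dict[str, Set[int]], order: List[str]) -> Dict[str, Set[int]]:
    """Build ring-like address spaces.

    `order` lists domains from most to least privileged. Each domain maps
    its own pages plus the pages of every domain that comes after it.
    """
    spaces: Dict[str, Set[int]] = {}
    visible: Set[int] = set()
    for name in reversed(order):  # start from the least privileged domain
        visible |= pages[name]
        spaces[name] = set(visible)
    return spaces


def can_access(spaces: Dict[str, Set[int]], domain: str, page: int) -> bool:
    """A domain can only touch pages mapped into its own address space."""
    return page in spaces[domain]
```

With `pages = {"A": {0, 1}, "B": {2}, "C": {3}, "D": {4, 5}}` and `order = ["A", "B", "C", "D"]`, D ends up seeing only pages 4 and 5 while A sees everything. Replacing C's set with, say, `{2, 3, 4}` would give the "C can access part of D and part of B" arrangement that rings cannot express.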

The question then becomes: if paging alone is enough to simulate any number of rings (and more), why do we still have 2 rings?

The answer is that paging only controls access to things that are in memory (code, data), and doesn't/can't control access to things that aren't in memory (e.g. the CPU's control registers). 2 rings are still needed to control whether things that aren't in memory can/can't be accessed (e.g. whether a mov cr0, eax instruction will cause a general protection fault).

However, two things make this less obvious. First, switching between different virtual address spaces has a cost, and people try to minimize that cost (e.g. by not giving shared libraries their own separate virtual address spaces, by not giving individual device drivers their own virtual address space, etc). Second, because paging was added (with backward compatibility concerns) to a pre-existing "segmentation with 4 rings" design, scraps of the old "segmentation with 4 rings" design remain in use (e.g. the TSS, the IO permission system, etc).

Parvati answered 5/2, 2021 at 0:33 Comment(0)
