Why does Linux on x86 use different segments for user processes and the kernel?

Asked 1/1, 2011 at 18:1 Answered 18/3, 2014 at 10:9

Solved linux-kernel x86 memory-segmentation

So, I know that Linux uses four default segments for an x86 processor (kernel code, kernel data, user code, user data), but they all have the same base and limit (0x00000000 and 0xfffff), meaning each segment maps to the same set of linear addresses.

Given this, why even have user/kernel segments? I understand why there should be separate segments for code and data (just due to how the x86 processor deals with the cs and ds registers), but why not have a single code segment and a single data segment? Memory protection is done through paging, and the user and kernel segments map to the same linear addresses anyway.

Variance answered 1/1, 2011 at 18:1 Comment(0)

The x86 architecture associates a type and a privilege level with each segment descriptor. The type of a descriptor allows segments to be made read only, read/write, executable, etc., but the main reason for different segments having the same base and limit is to allow a different descriptor privilege level (DPL) to be used.

The DPL is two bits, allowing the values 0 through 3 to be encoded. When the privilege level is 0, then it is said to be ring 0, which is the most privileged. The segment descriptors for the Linux kernel are ring 0 whereas the segment descriptors for user space are ring 3 (least privileged). This is true for most segmented operating systems; the core of the operating system is ring 0 and the rest is ring 3.

The Linux kernel sets up, as you mentioned, four segments:

__KERNEL_CS (Kernel code segment, base=0, limit=4GB, type=10, DPL=0)
__KERNEL_DS (Kernel data segment, base=0, limit=4GB, type=2, DPL=0)
__USER_CS (User code segment, base=0, limit=4GB, type=10, DPL=3)
__USER_DS (User data segment, base=0, limit=4GB, type=2, DPL=3)

The base and limit of all four are the same, but the kernel segments are DPL 0, the user segments are DPL 3, the code segments are executable and readable (not writable), and the data segments are readable and writable (not executable).
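To make the layout concrete, here is a minimal sketch of how a flat 32-bit descriptor could be packed and how the DPL and type can be read back out of it. The helper names (make_descriptor, access_byte, descriptor_dpl, descriptor_type) are hypothetical and not taken from the kernel source; the bit positions follow the standard protected-mode descriptor format. The flags nibble 0xC (G=1, D/B=1) is an assumption for a 32-bit flat segment; with G=1 the 20-bit limit 0xfffff is counted in 4 KiB units, which is why the question's 0xfffff and the 4GB above describe the same limit.

```c
#include <assert.h>
#include <stdint.h>

/* Access byte for a code/data segment: P=1 (bit 7), DPL (bits 6..5),
 * S=1 (bit 4, code/data rather than system), type (bits 3..0). */
static uint8_t access_byte(unsigned type, unsigned dpl)
{
    assert(type <= 0xFu && dpl <= 3u);
    return (uint8_t)(0x80u | (dpl << 5) | 0x10u | type);
}

/* Pack an 8-byte segment descriptor.
 * limit20 is the 20-bit limit, flags is the G | D/B | L | AVL nibble. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit20,
                                uint8_t access, uint8_t flags)
{
    assert(limit20 <= 0xFFFFFu && flags <= 0xFu);
    uint64_t d = 0;
    d |= (uint64_t)(limit20 & 0xFFFFu);             /* limit 15..0  -> bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;        /* base 23..0   -> bits 16..39 */
    d |= (uint64_t)access << 40;                    /* access byte  -> bits 40..47 */
    d |= (uint64_t)((limit20 >> 16) & 0xFu) << 48;  /* limit 19..16 -> bits 48..51 */
    d |= (uint64_t)flags << 52;                     /* flags nibble -> bits 52..55 */
    d |= (uint64_t)(base >> 24) << 56;              /* base 31..24  -> bits 56..63 */
    return d;
}

/* The DPL sits in bits 45..46 of the descriptor. */
static unsigned descriptor_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 0x3u);
}

/* The 4-bit type sits in bits 40..43. */
static unsigned descriptor_type(uint64_t d)
{
    return (unsigned)((d >> 40) & 0xFu);
}
```

With these helpers, make_descriptor(0, 0xFFFFF, access_byte(10, 0), 0xC) and make_descriptor(0, 0xFFFFF, access_byte(10, 3), 0xC) differ only in the two DPL bits, which is the whole difference between the kernel and user code segments listed above.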

Cormac answered 1/1, 2011 at 18:57 Comment(7)

Ok, so the DPL sets the minimum security level for each segment, but it seems like I can access any linear address as the user anyway, so why have the extra segment for the kernel? If, as a user, I want to access memory address x, I just use the user data segment, with an offset of x. The kernel can use the kernel data segment with an offset of x, but this maps to the same linear address, thus the same address in physical memory, so how does this provide any protection? – Variance 1/1, 2011 at 22:19

@anjruu: Some assembly instructions require a certain privilege level or else a general protection (GP) fault is raised. For example, the IN instruction to read a byte from a port requires the current PL (CPL) to be less than or equal to the input/output PL (IOPL; bits 12 and 13 of the FLAGS register), which is 0 for Linux. The CPL is the DPL of the segment descriptor corresponding to the CS (code segment) register. – Cormac 1/1, 2011 at 22:59

@Daniel: Wait, no I don't, sorry to be so dense. I imagine the CPU could be running in Kernel mode, with a privilege level of 0, and be able to execute any instruction, and still have a segment in the CS, SS, or DS registers with a privilege level of 3. The CPU could access any address in the segment, and then once the CPU is done with the syscall, the CPU could switch to level 3, still be able to access anything in the segment, and be unable to execute protected instructions. The paging system could provide memory protection. Sorry, I'm still not seeing it... – Variance 2/1, 2011 at 1:30

@anjruu: "sorry to be so dense" It's okay. I don't mind; in fact, it helps me to remember this stuff. One thing to be clear about is that the CPU does not run in "kernel mode". In order to take advantage of segmentation, the CPU needs to be in protected mode, but the CPL is a property of each task. Each task is fully described by its Task State Segment (TSS), which, among other things, includes the values of all registers including the segment registers... – Cormac 2/1, 2011 at 2:17

@anjruu: (continued) Now, the way that a task can change its CPL is to load a segment descriptor having a different DPL into its CS register using a far RET instruction. It is possible for a ring 0 task to set its CS register to a segment descriptor with DPL 3 (thus moving the task into ring 3). However, it is not possible for the task to move back to ring 0 because far RET checks that the "return PL" is greater than or equal to the CPL. Thus, if the kernel task moved itself into ring 3, it would be stuck in ring 3, never able to go back! – Cormac 2/1, 2011 at 2:18

@Daniel: Ok, I think I've got it. So, the CPL is a property of the segment descriptor, rather than of the CPU at any given time. Cool, thank you very much! – Variance 2/1, 2011 at 4:16

Thanks. It seems the reason you need separate kernel/user segments for data, is that SS.DPL must match CPL exactly - it cannot be higher, unlike for DS. (Why this rule exists is another question, but I guess it concerns details which are very obscure nowadays). – Pendergast 7/4, 2019 at 21:34
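The privilege checks mentioned in the comments above can be summarised as a few predicates. This is only a sketch of the rules as described there (CPL taken from the low two bits of CS, IN needing CPL <= IOPL, far RET needing return PL >= CPL, SS needing an exact match while DS does not); the function names are made up for illustration, and the real CPU performs further checks that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a selector are its requested privilege level;
 * for the selector currently loaded in CS they give the CPL. */
static unsigned selector_rpl(uint16_t selector)
{
    return selector & 3u;
}

/* IN/OUT pass the privilege check directly when CPL <= IOPL.
 * (If this fails, the CPU still consults the I/O permission bitmap,
 * which this sketch does not model.) */
static bool io_allowed(unsigned cpl, unsigned iopl)
{
    assert(cpl <= 3u && iopl <= 3u);
    return cpl <= iopl;
}

/* A far RET may only go to the same or a less privileged ring. */
static bool far_ret_allowed(unsigned cpl, unsigned return_rpl)
{
    assert(cpl <= 3u && return_rpl <= 3u);
    return return_rpl >= cpl;
}

/* Loading DS/ES/FS/GS with a data segment: DPL >= max(CPL, RPL). */
static bool data_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    unsigned effective = cpl > rpl ? cpl : rpl;
    return dpl >= effective;
}

/* Loading SS: both RPL and DPL must equal CPL exactly. */
static bool stack_load_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    return rpl == cpl && dpl == cpl;
}
```

Under these rules ring 0 code could use a DPL 3 data segment for DS, but not for SS, which is why a separate DPL 0 data segment is still needed for the kernel stack.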

The x86 memory management architecture uses both segmentation and paging. Very roughly speaking, a segment is a partition of a process's address space that has its own protection policy. So, in the x86 architecture, it is possible to split the range of memory addresses that a process sees into multiple contiguous segments, and assign different protection modes to each. Paging is a technique for mapping small (usually 4KB) regions of a process's address space to chunks of real, physical memory. Paging thus controls how regions inside a segment are mapped onto physical RAM.

All processes have two segments:

one segment (addresses 0x00000000 through 0xBFFFFFFF) for user-level, process-specific data such as the program's code, static data, heap, and stack. Every process has its own, independent user segment.
one segment (addresses 0xC0000000 through 0xFFFFFFFF), which contains kernel-specific data such as the kernel instructions, data, and some stacks on which kernel code can execute. More interestingly, a region in this segment is directly mapped to physical memory, so that the kernel can access physical memory locations without having to worry about address translation. The same kernel segment is mapped into every process, but processes can access it only when executing in protected kernel mode.

So, in user-mode, the process may only access addresses less than 0xC0000000; any access to an address higher than this results in a fault. However, when a user-mode process begins executing in the kernel (for instance, after having made a system call), the protection bit in the CPU is changed to supervisor mode (and some segmentation registers are changed), meaning that the process is thereby able to access addresses above 0xC0000000.
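As a rough illustration of the rule just described, the check can be modelled with the user/supervisor bit of a page table entry. This is a simplified sketch, not kernel code: KERNEL_BASE reflects the 3 GB / 1 GB split given above, the three flag bits follow the standard x86 page table entry layout, and the write rule assumes CR0.WP is set so that read-only pages are read-only for the kernel too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_BASE 0xC0000000u   /* start of the kernel region described above */

/* Low bits of an x86 page table entry. */
enum {
    PTE_PRESENT  = 1u << 0,
    PTE_WRITABLE = 1u << 1,
    PTE_USER     = 1u << 2    /* clear = supervisor-only page */
};

static bool is_kernel_address(uint32_t addr)
{
    return addr >= KERNEL_BASE;
}

/* Decide whether an access would succeed or fault.
 * user_mode is true when the access is made at CPL 3. */
static bool page_access_allowed(uint32_t pte_flags, bool user_mode, bool write)
{
    assert(pte_flags <= 7u);  /* only the three low flag bits are modelled */
    if (!(pte_flags & PTE_PRESENT))
        return false;         /* not mapped: page fault */
    if (user_mode && !(pte_flags & PTE_USER))
        return false;         /* supervisor-only page touched from ring 3 */
    if (write && !(pte_flags & PTE_WRITABLE))
        return false;         /* read-only page (assuming CR0.WP = 1) */
    return true;
}
```

A page in the kernel region would be mapped without PTE_USER, so the same address faults from user mode and works after a system call has switched the process to ring 0.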

Referred from: HERE

Cuculiform answered 1/1, 2011 at 18:49 Comment(1)

This answer is about paging. The question is about segmentation which is a mapping done before the mapping done by paging. – Coccyx 28/9, 2018 at 20:46

On x86 Linux, segment registers are also used for stack buffer overflow checks (see the code snippet below, which defines some char arrays on the stack):

static void
printint(int xx, int base, int sgn)
{
    char digits[] = "0123456789ABCDEF";
    char buf[16];
    int i, neg;
    uint x;

    neg = 0;
    if(sgn && xx < 0){
        neg = 1;
        x = -xx;
    } else {
        x = xx;
    }

    i = 0;
    do{
        buf[i++] = digits[x % base];
    }while((x /= base) != 0);
    if(neg)
        buf[i++] = '-';

    while(--i >= 0)
        my_putc(buf[i]);
}

Now look at the disassembly of the gcc-generated code:

Dump of assembler code for function printint:

   0x00000000004005a6 <+0>:     push   %rbp
   0x00000000004005a7 <+1>:     mov    %rsp,%rbp
   0x00000000004005aa <+4>:     sub    $0x50,%rsp
   0x00000000004005ae <+8>:     mov    %edi,-0x44(%rbp)
   0x00000000004005b1 <+11>:    mov    %esi,-0x48(%rbp)
   0x00000000004005b4 <+14>:    mov    %edx,-0x4c(%rbp)
   0x00000000004005b7 <+17>:    mov    %fs:0x28,%rax  ----> load the 8-byte guard from a fixed offset (0x28) from the fs segment base [the base comes from the corresponding GDT descriptor, or on x86-64 from the FS_BASE MSR]
   0x00000000004005c0 <+26>:    mov    %rax,-0x8(%rbp)  ----> store it as the first local variable on the stack
   0x00000000004005c4 <+30>:    xor    %eax,%eax
   0x00000000004005c6 <+32>:    movl   $0x33323130,-0x20(%rbp)
   0x00000000004005cd <+39>:    movl   $0x37363534,-0x1c(%rbp)
   0x00000000004005d4 <+46>:    movl   $0x42413938,-0x18(%rbp)
   0x00000000004005db <+53>:    movl   $0x46454443,-0x14(%rbp)

...
...
   // function end

   0x0000000000400686 <+224>:   jns    0x40066a <printint+196>
   0x0000000000400688 <+226>:   mov    -0x8(%rbp),%rax  ----> reload the guard from the stack to verify whether the stack was smashed
   0x000000000040068c <+230>:   xor    %fs:0x28,%rax  ----> check that the value on the stack still matches the original one at %fs:0x28
   0x0000000000400695 <+239>:   je     0x40069c <printint+246>
   0x0000000000400697 <+241>:   callq  0x400460 <__stack_chk_fail@plt>
   0x000000000040069c <+246>:   leaveq
   0x000000000040069d <+247>:   retq

Now if we remove the stack-based char arrays from this function, gcc won't generate this guard check.
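What the generated prologue and epilogue do can be emulated by hand. The following is only a sketch under assumptions: struct frame, frame_enter, frame_write and frame_intact are hypothetical names, the guard is passed in as a plain parameter instead of being read from %fs:0x28, and the struct merely stands in for the real stack layout chosen by the compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the stack frame: the local array, followed by the slot
 * that the compiler places at -0x8(%rbp). */
struct frame {
    char     buf[16];
    uint64_t canary;
};

/* Prologue: mov %fs:0x28,%rax ; mov %rax,-0x8(%rbp) */
static void frame_enter(struct frame *f, uint64_t guard)
{
    memset(f->buf, 0, sizeof f->buf);
    f->canary = guard;
}

/* Copy n bytes starting at the beginning of the frame. Anything beyond
 * sizeof buf spills into the canary, like an overflowing local array. */
static void frame_write(struct frame *f, const void *src, size_t n)
{
    assert(n <= sizeof *f);   /* keep the emulation inside the struct */
    memcpy(f, src, n);
}

/* Epilogue: mov -0x8(%rbp),%rax ; xor %fs:0x28,%rax ; je ... */
static bool frame_intact(const struct frame *f, uint64_t guard)
{
    return (f->canary ^ guard) == 0;
}
```

Writing 16 bytes leaves frame_intact true; writing 20 bytes changes the canary, which is the situation in which the real code calls __stack_chk_fail.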

I have seen the same code generated by gcc even for kernel modules. I was seeing a crash while bootstrapping some kernel code, and it was faulting on virtual address 0x28. Later I figured out that, although I had initialized the stack pointer correctly and loaded the program correctly, I did not have the right entries in the GDT to translate the fs-based offset into a valid virtual address.

However, in the case of kernel code it was simply ignoring the error instead of jumping to something like __stack_chk_fail@plt.

The relevant compiler option which adds this guard in gcc is -fstack-protector. I think this is enabled by default when compiling a user app.
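Whether the guard is actually on for a given build can be checked from inside the program, because GCC predefines a macro for each stack protector variant (__SSP__, __SSP_STRONG__, __SSP_ALL__, __SSP_EXPLICIT__). The sketch below relies only on those macros being defined or not; the enum and the function names are made up for illustration, and the defaults differ between distributions, so the result depends on the toolchain used.

```c
#include <assert.h>

/* Our own numbering, independent of the values the compiler assigns. */
enum ssp_level { SSP_OFF, SSP_BASIC, SSP_STRONG, SSP_ALL, SSP_EXPLICIT };

/* Report which -fstack-protector variant this file was compiled with. */
static enum ssp_level stack_protector_level(void)
{
#if defined(__SSP_ALL__)
    return SSP_ALL;        /* -fstack-protector-all */
#elif defined(__SSP_STRONG__)
    return SSP_STRONG;     /* -fstack-protector-strong */
#elif defined(__SSP_EXPLICIT__)
    return SSP_EXPLICIT;   /* -fstack-protector-explicit */
#elif defined(__SSP__)
    return SSP_BASIC;      /* -fstack-protector */
#else
    return SSP_OFF;        /* no protector, or a compiler without these macros */
#endif
}

static const char *stack_protector_name(enum ssp_level level)
{
    static const char *const names[] = {
        "off", "basic", "strong", "all", "explicit"
    };
    assert(level >= SSP_OFF && level <= SSP_EXPLICIT);
    return names[level];
}
```

Compiling the same file once with -fno-stack-protector and once with -fstack-protector and printing stack_protector_name(stack_protector_level()) is a quick way to see what a particular gcc does by default.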

For the kernel, we can enable this gcc flag via the CC_STACKPROTECTOR config option.

config CC_STACKPROTECTOR
        bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
        depends on SUPERH32
        help
          This option turns on the -fstack-protector GCC feature. This
          feature puts, at the beginning of functions, a canary value on
          the stack just before the return address, and validates
          the value just before actually returning.  Stack based buffer
          overflows (that need to overwrite this return address) now also
          overwrite the canary, which gets detected and the attack is then
          neutralized via a kernel panic.

          This feature requires gcc version 4.2 or above.

The relevant kernel file where this gs/fs handling lives is linux/arch/x86/include/asm/stackprotector.h

Mendelism answered 18/3, 2014 at 10:9 Comment(0)

Kernel memory should not be readable from programs running in user space.

Program data is often not executable (DEP, a processor feature, which helps guard against executing an overflowed buffer and other malicious attacks).

It's all about access control - different segments have different rights. That's why accessing the wrong segment will give you a "segmentation fault".

Fontes answered 1/1, 2011 at 18:19 Comment(0)
