Linux kernel: schedule() function
Asked Answered

I have seen several other questions on the forum that talk about this schedule() function, but my question is a bit different. I have seen several discussions and plenty of literature about its theoretical, algorithmic and implementation aspects.

What is not clear, and what this question is about, is the execution aspect. Of course, looking at the kernel source in depth and doing all the required debugging, tracing and so on might answer this question, but it seems unwise to reinvent the wheel.

The questions/confusions are as follows:

What is the path traversed by a multi-threaded user program at kernel level?

Who schedules the threads? Which interrupt context(s)? Any name? (If we look at a trace at kernel level, there is nothing called "sched", but there are swappers, inits, ksoft* and so on.) Does it go like this:

A process (user program) and its child threads are all taken in by the kernel at first; then the kernel turns them into executable threads (by amalgamating them with schedule() and/or other functions, i.e., this new executable thread has some instructions from the kernel, maybe from schedule() or others, embedded along with the user task's instructions, so that it gets rescheduled automatically when the situation arises)

OR

schedule() is always executing on some co-processor to observe and act, if necessary, from that co-processor? Is that why, when we see any two threads switching on a CPU, there is only the swapper executing in between, before and after, i.e., there is nothing called a scheduler at that level?

Thanks for reading, and sorry for writing up my confusions to share with you.

Orvas answered 19/12, 2013 at 10:32 Comment(16)
"if we see at a trace at kernel level, there is nothing called "sched", but there are swappers, inits, ksoft"... Those sound like processes or kernel threads. schedule() is a kernel function.Hogg
schedule() looks at the queue of threads that are waiting to run, and swaps one onto the processor. It's typically called from some sort of interrupt context.Hogg
Thanks a lot @JonathonReinhart: logical. Which interrupt contexts may call it? Do these contexts appear in the trace? I mean, are they already there, any names? Again, thanks for the answer and for understanding the query.Orvas
Possible duplicate of https://mcmap.net/q/1479720/-which-tasks-correspond-to-the-linux-kernel-scheduler/1401351Greathearted
@Orvas Martin's answer nailed it.Hogg
A thread is no different than a separate task with the exception that the struct mm is shared; Ie, it has the same view of memory. Typically, the OS arch has a struct thread_info on the kernel stack which has pointers to other owned objects. schedule() swaps the active kernel stack and updates the VM if it is needed.Koffman
@artlessnoise Again my question is: if I want to see where a statement, let's say printf("Hello World!"); from a user program took the CPU time, the answer is: it's probably somewhere when the (task) user binary was occupying the CPU. How about schedule()?Orvas
I cannot believe schedule() just configures a part of the hardware/control unit and then everything happens automatically afterwards. Of course schedule() itself is executed (takes CPU time) to give meaning to its respective instructions (I would say passive source code) and make the desired scheduling happen, right?Orvas
@Greathearted Please read the answer and comments to do a fair deal in the link you have given! btw, that is also my question and over here I have tried again to make my query better phrased with the best of my ability. If it goes against any rule, should i delete the previous one?Orvas
Now, you have changed your question? Who gets the run time accounting for schedule()? This is different depending on how it is sampled. If you depend on the sub-microsecond time it takes schedule() to execute, then you have other problems. The accounting will be added to whoever the active task_struct is. Look at switch_to.h. This arch magic switches the stack atomically for accounting.Koffman
@artlessnoise by saying, "The accounting will be added to who ever the active task_struct " do you mean, "A process (user program) its child threads all are taken by the kernel at first, then kernel makes them as executable threads (by amalgamating them with schedule()" ? NO, the question is same. To me, of course the one who gets the run time accounting is the one executing it, right?Orvas
See core.c. The switch_to changes the kernel stack. There is always a macro call current() which gets the current task. This will get run time incremented when an interrupt samples to see who is running. On one side of switch_to it is the old task and on the other it is the new task. See current.h. get_thread_info() has light weight thread_info, having a task_struct pointer.Koffman
The task accounting can not be better than the granularity of the system tick unless the timer supplies a mechanism for getting a better resolution; this will depend on CONFIG variables.Koffman
@artlessnoise will you say Yes or No for this? by saying, "The accounting will be added to who ever the active task_struct " do you mean, "A process (user program) its child threads all are taken by the kernel at first, then kernel makes them as executable threads (by amalgamating them with schedule()" ?Orvas
Thanks @artlessnoise but I think we are not on the same pageOrvas
Maybe Linux threads and processes can help you? I think you don't understand the data structures? You need to understand them to understand the code.Koffman

X OR Y - neither.

These preemptive, multithreading OS kernels are all much the same overall.

Look at it, (very simply), like this:

The OS kernel scheduler/dispatcher is a big, complex interrupt handler. Interrupts, in OS terms, come in two flavours:

Hardware interrupts from peripherals like disk, network, keyboard, mouse. These interrupts cause drivers to run, and the drivers may request a scheduling run from the kernel when they exit.

Software interrupts from threads - system calls that can change the state of a thread, eg. a thread may request input that is not immediately available, and so that thread will not run on until the input is available.

When an interrupt does occur, the kernel uses its internal state data, together with the request data from the interrupt, to run its scheduling algorithm and decide which threads should be running on the available cores. If it decides that the set of running threads needs to change, it can stop any thread running on any core by using an inter-core driver to cause a hardware-interrupt of the core running the thread.

If there are no interrupts, the kernel does nothing. It can't do anything because it is not being entered from anywhere. It does not need to execute on any co-processor. It does not need to 'inject' any calls into user code.

It's a state-machine with interrupts as inputs and a set of running threads as outputs.

Inexact answered 19/12, 2013 at 11:5 Comment(2)
@Martin James, Thanks Sir, really useful. "It's a state-machine": Does this state-machine configure some control RAM inside the Architecture (May be like a look-up-table) to give this functionality, i.e., at this specific input (interrupt), output this control signal (a signal to select a set of running threads)? Sorry, I am really confused about these things..Orvas
@Martin James: "decide which threads should be running" dission making needs CPU, right? ... if I want to see it at micro level, should it appear in the form of some task/process/thread? what is the name of that instance?Orvas
L
4

Inside the Linux kernel, threads are just processes that share some resources. In other words, for the Linux kernel, threads are just a particular case of processes. The data structure is the same (i.e., task_struct - see include/linux/sched.h).

The system timer is a hardware timer that issues an interrupt at a programmable frequency. When the kernel receives such an interrupt, the execution of the current process/thread is suspended and the system runs the Interrupt Service Routine (ISR) in kernel mode. This routine calls the scheduling functions, which choose which process (or thread) should be executed in the next time slot. Then the routine performs a context switch, preempting the currently executing process/thread. This is the basis of multitasking, and it is the same across all modern general-purpose operating systems.

Lamontlamontagne answered 19/12, 2013 at 10:58 Comment(2)
You seem to be fixated on the timer driver and have ignored all the other interrupts. This is very misleading, but I'm having a 'no-downvoting' day:)Inexact
@MartinJames It is just as wrong as your answer; it is complex and depends on kernel configuration options. For instance, state-machine with interrupts as inputs is contradicted by information in your own answer; a syscall may cause a context switch. For people in IO wait, the other interrupts are important. However, for scheduler work load balance, the timer interrupt is a good thing to fixate upon; besides a task becoming active/sleeping, it is what will cause round robin, etc.Koffman

There are two classes of reasons why/when a task-switch can occur:

  1. the task has nothing to do for the moment, and therefore calls schedule()
  2. the task is busy "for too long", and gets "pre-empted" - by the scheduler, upon a timer interrupt (or some other interrupt for that matter).

Others have hinted at this before. If I recall correctly, Linux has a single multi-purpose IRQ super-service, very thin, which gets invoked by any incoming HW IRQ. This generic super-service almost immediately calls a specific ISR, that's been registered by the HW-specific driver (piece of executable code living in the kernel) for that particular IRQ line. This codepath upon IRQ arrival must be fast, so that specific interrupt service is not delayed. After the HW-specific ISR returns, the Linux super-service has some more time for house-keeping of its own. And, chances are, that it can skim over the scheduling stuff. Upon any incoming IRQ, not just the nominal periodic timer interrupt.

If you remember the old days of DOS, where you'd program the IRQ controller directly, and insert your ISR entry point (service vector) into the interrupt vector table at a well-known position in system RAM... those days have long been over. This low-level stuff is hopefully handled for you by the OS, with its knowledge of the "platform". As a driver author, in your "on driver load" function, you pass the address of your ISR to the system, as a callback function - to call the ISR for you, when its generic super-ISR gets pinged by that particular IRQ line. This probably allows you to write drivers in the C language, including IRQ handling, in a platform-independent fashion. And, it allows the OS to do a bit of IRQ call frequency stats, and speaking of Linux, allows it to muffle IRQ lines that have gone rogue (having an unhandled interrupt source) or to do things like irqpoll (= ask all ISR's registered, if this IRQ event is theirs by any chance) etc. And, among this "etc.", the IRQ super-service can also run the scheduler, if that's not too much of a burden.

To sum up: traditionally, in literature, it is often implied, that only the IRQ from the periodic system timer can invoke a schedule() in the preemptive context. This is IMO not true. Not in Linux.

Unfortunately it's been decades since the last time I browsed the kernel source code in this area. Not sure what changes (if any) the arrival of MSI/MSI-X may have brought... etc. Just be aware of this eventual possibility.

As mentioned in item #1 above, in the standard processing context, i.e. outside of interrupt service, schedule() can be called by a running task/process/thread (let's use them as synonyms). As a programmer, you do not call schedule() directly. It is wrapped for you in various other function calls that may "block waiting". I'm not even sure if schedule() is available from the user-space libc directly; probably not. The function lives in the kernel, i.e. "on the dark side of syscalls" (= inside them), albeit in the execution context of a particular user-mode thread. To you as a user-space programmer, schedule() happens behind the scenes, when a write() or read() or select() call blocks waiting, or when you explicitly sleep(), for instance.

This "cooperative schedule()" is the function that gets entered from a thread, and returns in another thread :-) only to get called again, and return in yet another thread...

Consider what an interrupt service is. An interrupt service routine (ISR) is in fact just a function. A re-usable sequence of CPU instructions. Just like any other function, written in C or ASM. Typically, upon entry, a function pushes some register state on "stack" (especially on architectures starved of general-purpose CPU registers, such as the x86.) This is to prevent the "register state" from being destroyed/lost - or more precisely, so that the state can be restored upon return from the function call, thus masking register use inside the function.

The "stack" in this case, is a region of memory reserved for this use by a particular process/task/thread. Each thread has a piece of stack space mapped to its virtual address space. The CPU has a special register called the Stack Pointer, that points to the "top of the stack" (except that it typically grows down, namely with the x86 architecture). The CPU also does a required bare minimum of stack operations automatically upon function call and return = the respective push/pop are in fact integrated in the respective CPU instructions (call/ret/iret). The invocation of an ISR works a lot like a call. You can also push/pop things on and off the stack explicitly by dedicated CPU instructions.

Interrupt service routines are especially thorough in pushing on stack everything they may incidentally destroy (or the OS, or the CPU, does that automatically upon IRQ call and return) - to mask their occurrence as much as possible, from the thread or context that was just interrupted. Interrupts are ordered by priority and can also be masked altogether, which is a good idea in certain key "critical sections" of the scheduler, individual ISR's and maybe elsewhere. Disabling interrupts borders on the topic of locking, critical sections, interprocess communication (including communication between processes and ISR's).

So: an ISR is a function that you don't call explicitly. Instead, it gets called for you, by the CPU/hardware. And, you're supposed not to get to know! because after a short instant of time, you're supposed to continue execution at the same place, where it got interrupted. Except, if there's the preemptive version of schedule() inside that interrupt, the IRQ call may in fact return to a different thread, rather than the one that got interrupted! :-D That's the principle of preemptive multitasking anyway...

The scheduler does not run on a "co-processor". It runs on the same CPU core as the benign outside-of-IRQ threads that the scheduler herds. It can be called upon, while your user-space code is stuck in a syscall - or it can arrive on an IRQ thunderbolt out of nowhere, because a part of the scheduler does operate in the IRQ context.

Obviously the scheduler has some private data structures (the "list of processes", in broad terms) that take a piece of system RAM, and that eat a bit of CPU horsepower for housekeeping. I.e. the scheduler has a bit of memory and CPU consumption (overhead) of its own, and the general optimization probably is, to waste as little time as possible in the IRQ context. It is also thinkable that some non-critical areas of the scheduler's activity (janitoring) can be offloaded to a dedicated kernel thread (kthread).

IIRC, the kernel tries to minimize the time spent in IRQ context, i.e. with IRQ's masked. Therefore the driver authoring guidelines mention, that the ISR codepath should preferably be short and swift, with no room for unexpected latencies. Any non-critical work is to be postponed and done in a kthread or tasklet or whatever, if possible. Consequently, I seem to recall that the IRQ super-service tries to ack and unmask the IRQ at the earliest opportunity. After which, the super-ISR is left running where: in the kernel space memory context, in the execution context of an IRQ, with IRQ sources unmasked? If the scheduler tries to do some work in this environment, it may itself face being interrupted again, and a concurrent invocation of schedule() may get started. Also, threads running on different CPU cores (in an SMP machine) may call schedule() explicitly, or within interrupts, in parallel. Hence there's likely some highly optimized locking around the scheduler's manipulation of the table of processes, or some lockless arrangement, to guarantee the scheduler's internal integrity, while maximizing throughput / minimizing contention.

The scheduler's memory structures (heap) are allocated in the kernel's shared virtual memory space. Yes, the kernel has a virtual memory space of its very own, similar to the isolated virtual memory spaces inhabited by user-space processes. The kernel's virtual space is different from the machine's physical address space. Come to think of it, the threads of a process definitely share a heap, but I'm wondering if they may have separate stack spaces... I don't know the definitive answer. It would seem weird if you couldn't pass pointers to addresses on the stack (C/C++ local variables) to another thread... The other option would be that stack spaces are assigned to threads at locations wide enough apart within the shared virtual memory space.

It would seem that kthreads and ISR's have access to the shared kernel virtual memory space... except that the ISR, upon invocation, must probably push its return address (and further registers explicitly?) onto the stack that's mapped to the foreground thread's virtual memory space.

In other words, in an operating system with memory protection, the ISR must push all the baggage of the current foreground thread onto that thread's stack, then perform a "memory context switch" to the kernel space, do its dirty background job, and then "context switch back" to the foreground thread that was previously interrupted. A "context switch" means, among other things, to get some CPU core registers pointed to another set (tree) of memory mapping tables. And, get the stack pointer primed for the new "context". Now... in the context of a multitasking OS it can so happen, that an IRQ gets invoked in one foreground thread, but the scheduler (its part running in the IRQ context) arranges that upon return, the IRQ returns into a different thread. The contexts (including stack pointers) are just swapped accordingly. Don't worry - the "abandoned" thread retains the IRQ return address, and all the other "baggage pushed on its stack", and is just waiting until its turn comes again, for the CPU to return to it, from what originally was a random old IRQ :-)

Tangential note: this particular arrangement of virtual address spaces, especially that the kernel has a single shared memory space, is not carved in stone in general. This is the way Linux has it. FreeBSD and Windows are probably similar. Other OS'es, especially those fond of security hardening, may be more compartmentalized = various subsystems of the kernel may be encapsulated in isolated memory spaces. How that plays with ISR's... I don't know.

Regarding the note by Martin James, that if there are no interrupts, the kernel does nothing: principally very correct, the way things are.

Obviously you can imagine a workload, where you do want just to let the machine crunch to the max of its available horsepower. Convert a video file to another resolution, mine bitcoin, whatever. You don't even need interrupts to achieve I/O - in theory, you can poll some inputs in a tight loop and whatnot. The point is, that this is inefficient. Not the way things are done, nowadays.

Interrupts are a method of event delivery in the hardware - from peripheral I/O devices, or between CPU cores within an SMP machine, or about some CPU exceptions (internal error events). A method of delivery that is swift (minimal latency) and allows for concurrency, while at the same time being light on CPU / bus / memory use (bandwidth). Especially well suited for relatively sparse events, BTW. The optimal way to construct a computer whose main role is to respond to user input, including the OS, is: make it all event-driven. The software running on the machine, structured into a myriad of threads, is written such that the normal state of each thread is: waiting for something to happen. Waiting for input. Waiting for user interaction, waiting for data to arrive from disk, waiting for a data write to finish, waiting for something to come from the network. Threads can even form meshes of producer-consumer relationships, where each does its bit and waits for the others to take over. The point is: when no one really has anything to do, everybody will block in some syscall, eventually tagging its own thread "I've got nothing to do" and calling schedule(). And when the scheduler finds out "oh, no one wants to do anything!" it knows that it can go for a nap. Enter a C-state, for instance, by calling the hlt or mwait instruction (x86). The first IRQ that comes, any sort of IRQ, will wake up the CPU, and it will continue executing the scheduler from that sleepy instruction. And the clockwork starts ticking again.

This answer feels that it has not been meticulously researched. I may be wrong about some details. Corrections are welcome.

Earthly answered 10/7, 2024 at 15:17 Comment(0)
