Is it possible to create threads without system calls in Linux x86 GAS assembly?
While learning assembly language (on Linux on an x86 architecture, using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary, as your program runs in user-space.
However, system calls are rather expensive in terms of performance, as they require an interrupt, which means that a context switch must be made from your currently active program in user-space to the system running in kernel-space.

The point I want to make is this: I'm currently implementing a compiler (for a university project), and one of the extra features I wanted to add is support for multi-threaded code, in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be generated automatically by the compiler itself, it is almost guaranteed that there will also be really tiny bits of multi-threaded code in it. To get a performance win, I must be sure that using threads actually pays off for such small units of work.

My fear, however, is that in order to use threading I must make system calls and trigger the necessary interrupts. The tiny (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...

My question is therefore twofold (with an extra bonus question underneath it):

  • Is it possible to write assembler code which can run multiple threads simultaneously on multiple cores at once, without the need of system calls?
  • Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?

My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficiently as possible?

Philippopolis answered 3/4, 2009 at 17:31 Comment(1)
There's a similar (though not duplicate IMHO) question here: stackoverflow.com/questions/980999/… The answers there might give you some insight.Clique

The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.

User threads won't give you the threading support that you want, because they all run on the same hardware thread.

Edit: incorporating Ira Baxter's answer about PARLANSE. If you ensure that your program starts with a thread running on each logical processor to begin with, then you can build your own scheduler without relying on the OS. You still need a scheduler to handle hopping from one thread to another; between calls to the scheduler there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly either, but rather on conventions between the parts of the scheduler in each thread.

Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.

Clique answered 15/6, 2009 at 8:45 Comment(4)
I marked your answer as the correct answer; I was indeed looking for a way to run code simultaneously on multiple cores. I have already accepted the fact that this is not possible in the way I wanted it to be... Do you, by chance, know the correct way to do this? The information on this topic is rather thinly spread. And many thanks for your answer!Philippopolis
That's actually very OS dependent. I can tell you how it's done at the system programming level in x86, but I don't know how to do it as a user in any OS.Clique
Likely you can do this only if you drop the operating system; otherwise you have to go through OS-provided mechanisms.Insatiate
There have historically been some CPUs (like the Tera MTA and the Redcode virtual CPU used in Core Wars) that did indeed support user-level forking into multiple physical threads — there is a separate instruction to fork a new thread. No current CPU that I know of does this.Leaper

"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".

The short answer is that you can do multithreaded programming without calling expensive OS task-management primitives. Simply ignore the OS for thread-scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys.) We chose this approach precisely because Windows process/thread/fiber calls were all too expensive to support computation grains of a few hundred instructions.

Our PARLANSE programming language is a parallel programming language: see http://www.semdesigns.com/Products/Parlanse/index.html

PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of the grain to minimize scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., a "wait") might be required, so the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesn't get control at all; a forked grain simply stores the forking PC and SP, switches to a compiler-preallocated stack and jumps to the grain code. Completion of the grain restarts the forker.

Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can logically fork millions of grains; the scheduler stops parent grains from generating more work if the work queues are long enough that more work won't be helpful. The scheduler also implements work-stealing, to allow work-starved CPUs to grab ready grains from neighboring CPUs' work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp us with more than that in the next few years!
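To make the LOCK DEC trick concrete for readers who haven't seen it: a counting semaphore can be built around a single atomically decremented word. The following is only a generic sketch of the idea in GNU as (i386, AT&T syntax), not PARLANSE's actual code, and the names are invented; where this version spins, a real scheduler like PARLANSE's would pick another ready grain instead:

    # Generic counting semaphore built on LOCK DEC / LOCK INC.
    # Hypothetical cdecl prototypes: void sem_wait(int *c); void sem_post(int *c);
            .text
            .globl  sem_wait
            .globl  sem_post
    sem_wait:
            movl    4(%esp), %ecx       # address of the counter
    1:      lock decl (%ecx)            # atomically take one unit
            jns     2f                  # result >= 0: we got it
            lock incl (%ecx)            # undo: no units were available
            pause                       # spin politely...
            jmp     1b                  # ...a real scheduler would switch grains here
    2:      ret
    sem_post:
            movl    4(%esp), %ecx
            lock incl (%ecx)            # atomically release one unit
            ret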

PARLANSE is a mature language; we've been using it since 1997, and have implemented a several-million-line parallel application in it.

Cruse answered 16/6, 2009 at 4:48 Comment(7)
Hi, you brought up PARLANSE in several of your postings; is it actually available to end users? I checked out the examples on your webpage (semdesigns.com/Products/Parlanse/examples.html) and it looks rather LISPish?Thanh
PARLANSE is available, but only as part of the DMS Software Reengineering Toolkit. It looks like LISP but isn't LISP; no CAR or CDR anywhere! The base language is C-ish: scalars, structs, pointers, functions, but there it diverges: no pointer arithmetic, lambda with real lexical scopes, dynamic strings (UNICODE) and arrays, parallelism (the main point of PARLANSE) and exception handling that works across parallelism boundaries. You can get a better sense of the language from the technical paper at semdesigns.com/Company/Publications/…Cruse
@IraBaxter, how is it even possible to guarantee to "never pass control back to the OS"? The OS would force an interrupt anyway, wouldn't it?Frankie
What I mean by that is that PARLANSE does its own thread scheduling. It multiplexes Windows threads on top of "grains"; when a grain completes execution, it passes control to the PARLANSE scheduler, which picks another ready-to-run grain from the PARLANSE per-thread ready-to-run queue, or tries to steal a grain from another ready-grain queue, and becomes that grain. Yes, it is true that to do OS functions these threads have to make a real call on the OS, but that is expected to be really rare (e.g., reading really big blocks from files), and no, I can't prevent device or clock-tick interrupts.Cruse
@IraBaxter, I saw your profile. Looks like you are one of the victims of "Meta drowning" too, and now reside on Quora. What do you think of Quora thus far?Frankie
"Meta drowning": now there's a term. I've had my fun at SO, yes :-} Quora: mostly silly questions, an occasional really interesting one. What is stunning is the quality of the people that answer them. I don't reside more at Quora than here, though.Cruse
Doctor, my hat is raised. Not much similarly interesting seen in true-[PARALLEL] languages since hardware-driven occam-pi. The explicit language expressivity available for a user-defined block-dependency graph is also a cool design feature for a "just"-[CONCURRENT] type of scheduling. The impressive almost-linear scaling of speedups from a parallelised code execution, demonstrated on PI-example is a lovely piece, to use together with overhead-strict Amdahl Law re-formulation. GREAT THANKS TO HAVE POSTED THE [PARLANSE] EXPERIENCE HERE, INDEED, SIR.Coridon

Implement user-mode threading.

Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-mode threads. Modern usage is 1:1, but it wasn't always like that and it doesn't have to be like that.

You are free to maintain, in a single kernel thread, an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatter yield() calls throughout your own code to ensure regular switching occurs.
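To make "switch between them" concrete, here is a minimal, hedged sketch of such a cooperative switch for 32-bit x86 in GNU as; the function name and two-argument convention are invented for illustration. Saving the callee-saved registers and swapping stack pointers is essentially all a yield() is:

    # Minimal cooperative context switch (i386, AT&T syntax).
    # Hypothetical cdecl prototype: void yield_to(long **save_sp, long *new_sp);
            .text
            .globl  yield_to
    yield_to:
            pushl   %ebp                # save the callee-saved registers
            pushl   %ebx                # (cdecl: EAX/ECX/EDX are caller-saved)
            pushl   %esi
            pushl   %edi
            movl    20(%esp), %eax      # save_sp (arg offsets shifted by the 4 pushes)
            movl    %esp, (%eax)        # park the current thread: *save_sp = ESP
            movl    24(%esp), %esp      # hop onto the other thread's stack
            popl    %edi                # restore its callee-saved registers...
            popl    %esi
            popl    %ebx
            popl    %ebp
            ret                         # ...and "return" into the other thread

A brand-new thread can be launched by pointing new_sp at a freshly prepared stack holding four dummy register values topped by the thread function's entry address.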

Gitagitel answered 3/4, 2009 at 17:38 Comment(5)
Yes... that's the only manageable way to do this and have an actual performance improvement. System threads are designed for long-running tasks, not short bits of code that are multi-threaded only to be able to soak up more CPU time. Beware of the cost of maintaining memory consistency, though...Effervescent
The idea you suggest sounds nice, but how can I implement this in assembler? What system calls/assembler statements can I use for this?Philippopolis
The key is to play around with the call stack.Gitagitel
A word of caution: doing user-mode threading like this will not gain any performance, since it will all run on one CPU core. To get simultaneous multithreading, you really need the kernel's help.Lightweight
This is a wrong answer, since the OP specifically asks about running simultaneously on multiple cores.Clique

If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O-bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, while the one CPU that your real thread is running on will still be running at 100% either way.
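As a concrete illustration (a rough, untested sketch, not the only way to do it): on 32-bit Linux the primitive behind kernel threads is the clone system call. The flag values below are from <linux/sched.h>, thread_func is a placeholder for the thread body, and all error handling is omitted:

    # Spawning a kernel thread with raw sys_clone (i386, GNU as).
            .bss
            .balign 16
            .skip   65536                # the child's stack (grows downward)
    stack_top:

            .text
            .globl  spawn_thread
    spawn_thread:
            pushl   %ebx                 # EBX is callee-saved in cdecl
            movl    $0x00010F00, %ebx    # CLONE_VM|CLONE_FS|CLONE_FILES|
                                         #   CLONE_SIGHAND|CLONE_THREAD
            movl    $stack_top, %ecx     # the child's initial ESP
            movl    $120, %eax           # __NR_clone
            int     $0x80
            testl   %eax, %eax
            jz      1f                   # the child resumes here with EAX == 0
            popl    %ebx
            ret                          # parent: EAX = the new thread's TID
    1:
            call    thread_func          # child: run the thread body
            movl    $1, %eax             # __NR_exit: with CLONE_THREAD set,
            xorl    %ebx, %ebx           # this ends only this thread
            int     $0x80

Whether the new thread lands on a separate core is the kernel scheduler's decision; with more runnable threads than busy cores it normally will, and sched_setaffinity can pin it explicitly.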

System calls have gotten faster. Modern CPUs support the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
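To make the difference concrete: on i386 Linux the fast path is the __kernel_vsyscall entry point in the vDSO. The sketch below contrasts it with the legacy interrupt; note that the %gs:0x10 slot is a glibc-internal detail (glibc caches the vDSO entry address there), so treat this as illustrative rather than portable:

    # Two ways to issue the same getpid system call on i386 Linux.
            movl    $20, %eax            # __NR_getpid
            int     $0x80                # legacy: full software-interrupt path

            movl    $20, %eax            # __NR_getpid
            call    *%gs:0x10            # __kernel_vsyscall: uses sysenter or
                                         # syscall when the CPU supports it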

Make sure that the automatically generated multithreading has the threads run long enough that you gain performance. Don't try to parallelize short pieces of code; you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict): if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly, because of the cache-coherency problem.

Lightweight answered 10/4, 2009 at 14:49 Comment(2)
Thank you for your valuable input! I will most certainly take a look at 'sysenter', but one question remains for me: how can I start a kernel thread in assembler? And how can I be sure that it will run on a separate core?Philippopolis
While the last half of this answer seems on the mark, the bit about "use kernel threads" where kernel means "inside the OS" is simply wrong. You do need to use plain ol' threads (or additional processes, if you can stand the context-switching time), for which Windows and Linux both provide perfectly fine calls. Agreed, the overhead of those calls is higher than one would like.Cruse

Quite a bit late now, but I was interested in this kind of topic myself. In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.

Obligatory BLUF:

Q1: No. At least some initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.

Q2: It depends. If you create/destroy threads that perform tiny operations, then you're wasting resources (the thread-creation process would greatly exceed the time used by the thread before it exits). If you create N threads (where N is roughly the number of cores/hyper-threads on the system) and re-task them, then the answer COULD be yes, depending on your implementation.

Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack), where that stack is continuously prepended (or appended, in the case where it rolls over to the beginning). In such a (weird!) model, the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it, whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but it is intriguing nonetheless.
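A toy, hedged version of that ret-chain model (all labels hypothetical; assemble with as --32 and link with ld -m elf_i386, no libc): ESP is pointed at a list of task addresses, and every ret falls through into the next task. Each task must leave ESP exactly where it found it:

    # Toy sketch of the ROP-chain scheduling model (i386, GNU as).
            .data
    chain:  .long   task_a               # the "schedule": a list of return addresses
            .long   task_b
            .long   task_exit

            .text
            .globl  _start
    _start:
            movl    $chain, %esp         # ESP now walks the chain
            ret                          # "return" into task_a
    task_a:
            # ... do some work, keeping pushes and pops balanced ...
            ret                          # falls through to task_b
    task_b:
            ret                          # falls through to task_exit
    task_exit:
            movl    $1, %eax             # __NR_exit
            xorl    %ebx, %ebx
            int     $0x80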

On to my 2 cents' worth of content.

I recently created what effectively operate as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap and subdivide it into each thread's 'private' area; then only a single syscall would be required (although guard pages between the areas would be smart, and those would require additional syscalls).
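For reference, each such stack region costs one syscall; here is a hedged sketch of allocating a 64 KiB stack with raw mmap2 on i386 (constants from <sys/mman.h>, error handling omitted):

    # Allocate a 64 KiB thread stack with raw mmap2 (i386, GNU as).
            pushl   %ebp                 # EBP is callee-saved but used by mmap2
            xorl    %ebx, %ebx           # addr = NULL: let the kernel choose
            movl    $0x10000, %ecx       # length = 64 KiB
            movl    $3, %edx             # PROT_READ | PROT_WRITE
            movl    $0x22, %esi          # MAP_PRIVATE | MAP_ANONYMOUS
            movl    $-1, %edi            # fd = -1 for an anonymous mapping
            xorl    %ebp, %ebp           # offset (in pages) = 0
            movl    $192, %eax           # __NR_mmap2
            int     $0x80
            popl    %ebp
            # EAX = base of the region (or -errno on failure);
            # the new thread's stack top is EAX + 0x10000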

This implementation uses only the base kernel thread created when the process spawns, and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and the like are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.

To implement this system (entirely in userspace, and via non-root access if desired) the following were required:

A notion of what threads boil down to:

  • A stack for stack operations (kinda self-explanatory and obvious)
  • A set of instructions to execute (also obvious)
  • A small block of memory to hold individual register contents

What a scheduler boils down to: A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).

A thread context switcher: A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.

So it is indeed possible (entirely in assembly, and without system calls other than the initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.

I only added this answer because you specifically mention x86 assembly, and this answer was entirely derived from a self-contained program written entirely in x86 assembly that achieves the goals (minus the multi-core capability) of minimizing system calls and minimizing system-side thread overhead.

Gymnast answered 14/10, 2016 at 1:47 Comment(0)

First you should learn how to use threads in C (pthreads, POSIX threads). On GNU/Linux you will probably want to use POSIX threads or GLib threads. Then you can simply call the C functions from assembly code; a sketch of what that looks like follows the pointers below.

Here are some pointers:

  • POSIX threads: link text
  • A tutorial where you will learn how to call C functions from assembly: link text
  • Butenhof's book on POSIX threads: link text
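To give a flavour of what calling the C from assembly looks like, here is a hedged, untested i386 sketch (cdecl convention; worker is a placeholder thread body; build with something like gcc -m32 -no-pie demo.s -lpthread -o demo):

    # Creating and joining a POSIX thread from 32-bit GNU as.
            .text
            .globl  main
    main:
            subl    $28, %esp            # locals + 16-byte call alignment;
                                         # the pthread_t lives at 24(%esp)
            leal    24(%esp), %eax       # &tid
            pushl   $0                   # arg passed to the thread: NULL
            pushl   $worker              # start routine
            pushl   $0                   # attributes: NULL
            pushl   %eax                 # &tid
            call    pthread_create       # cdecl: the caller cleans the stack
            addl    $16, %esp
            subl    $8, %esp             # padding to realign the next call
            pushl   $0                   # void **retval = NULL
            pushl   36(%esp)             # the tid value (24 + 8 pad + 4 push)
            call    pthread_join
            addl    $16, %esp            # two args plus the padding
            xorl    %eax, %eax           # return 0 from main
            addl    $28, %esp
            ret

    worker:                              # void *worker(void *arg)
            xorl    %eax, %eax           # return NULL
            ret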
Baskin answered 10/4, 2009 at 14:39 Comment(1)
glibc threads (LinuxThreads first, then NPTL) are POSIX threads; POSIX is just a standard.Anabantid

System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be overhead when you create or destroy the threads. Once they are running, there are no further system calls. User-mode threads will not really help you, since they only run on one core.

Barocchio answered 10/4, 2009 at 14:44 Comment(0)
