Lockless Inc

A Different Kind of Threading

There are many different kinds of execution contexts in computer science. The largest of these is the "process". A modern operating system will multi-task between multiple processes, and divide system resources between them. Smaller than a process is a "thread". A process is made of one or more threads of execution, and all resources are shared commonly between them. Finally, a thread may be made of "fibers". Fibers are user-space threads. Since the kernel isn't involved, switching between them is a very fast operation.

The most advanced applications use processes, threads and fibers, and they use the correctly sized execution contexts for their given task. Each of the types of execution context has advantages and disadvantages. Using the wrong type can create an inefficient program. The trick is knowing when to use what.


The advantage of processes is isolation. A process can only affect another process through a limited interface. Typically, this may consist of shared file descriptors, signals, and shared memory. Since the amount of interaction is low, if one process misbehaves, it is quite likely that other processes will not be badly affected. Thus simple programs are just a single process, with no extra threads or other complexity.

The disadvantage of processes is isolation. It can be inefficient to communicate between individual processes. Even if shared memory is used, you may need some form of arbitration over its use. For even more isolation, you can use pipes, sockets or message-queues. These are kernel-controlled, with a corresponding loss of speed.

An example of a multi-process application is the Chrome web browser. It uses a process per web-page tab. If something goes wrong within a given tab, it won't bring down the whole browser. The application can seamlessly close down the offending process and continue. Due to isolation, it doesn't have to worry about that corrupting state elsewhere.


Threads on Linux are like processes; they just share more state. Threads share their address space, open file handles, and other more technical data, like the identity of their parent process. Originally, threads on Linux were implemented exactly like processes. However, improving compatibility with POSIX threading meant that they now have explicit kernel support.

A given Linux process has a pid (process id) and a tid (thread id). In a single-threaded application, the pid and the tid have the same value. In a multi-threaded application, each thread shares the same pid, but the tids differ from thread to thread. Some system calls differentiate between pids and tids. For example, you may broadcast a signal to a whole process (or group of processes)... or just to a single tid.

The advantage to threads is their shared state. This means that multi-threaded applications can use highly efficient algorithms to operate on shared data structures. Copying is avoided, and only the memory that absolutely needs to change will be altered. Multi-threaded applications can also take advantage of multi-core cpus. Multi-process applications can do the same, but converting a single-process application to multi-process is quite a bit harder than converting it to be multi-threaded.

The disadvantage of multi-threading is complexity. You don't want the differing threads to stomp on each other's toes. To prevent problems, constructs like locks and barriers can be used to avoid one thread looking at data that is concurrently being modified by another.

The issue is that while it is easy to have coarse-grained locking, such locking prevents concurrency. Fine-grained locking can also be a loser, due to the overhead of all of the extra locks. The most efficient code has just the right amount of synchronization, and no more. Finding this happy balance can be tricky.

Of course, lock-free data structures exist. If you are lucky, then someone has already made the one that corresponds to your particular problem. If so, then you can get superior performance with threads with comparatively little effort. Unfortunately, usually the world isn't so kind. Many data structures are difficult, if not impossible, to make lock free. In most cases, you will be stuck with traditional lock-based code.


Fibers, user-space threads, or "green threads" are an execution context that is in some sense smaller than a thread. They basically consist of just a stack frame. A context switch between them can be done by swizzling a few registers and then swapping stack pointers. This is much faster than context switching between threads, which needs to go through the kernel.

The disadvantage of fibers is that the kernel doesn't know of their existence. If a fiber blocks waiting on data, the kernel can't switch to some other fiber. This means that although it is possible to build a threading library purely from these user-space threading constructs, it is difficult to get the behaviour exactly right. The program needs to manually switch between fibers, adding complexity.

An advantage of fibers (other than their speed) is that only one can be running at a given time in a thread. This means that fibers may not need synchronization between themselves. This is because fibers map one-to-one onto the concept of co-routines. With careful programming, it is even possible to have "stackless" fibers, making them even lighter weight. You just need to record the current state in some other data structure when you do your context switching.

Data-driven programs can use fibers (or co-routines) to greatly simplify their control flow. A series of co-routines can be chained together to process the data, and each can symmetrically read from, and write to, the next in the chain. Whenever extra data is needed a fiber can "block" waiting for more, and then transfer control to some other fiber. Their generality becomes very helpful when each stage in the data-flow process can require differently sized chunks of data.


The above constructs are well-known... but there is another way of splitting up threads of execution that isn't used as much as it once was. It is possible to have multiple programs running concurrently within the same address space. (As opposed to multiple threads of the same program.) This was relatively common before memory management units with virtual memory and memory protection were available.

The resulting layout looks like an onion. On the outermost layer is possibly a group of interacting meta-processes. Inside each of those is a shared address space, within which are multiple processes. Inside those are perhaps multiple threads, which finally may contain multiple fibers.

So why is this useful? The whole point of having processes within separate virtual memory spaces is to have isolation. Having them within the same address space destroys that property. However, what it does do is provide the same performance characteristics of threads. So if you care about performance, then this unusual process layout might be interesting.

A case where this matters is MPI. The Message Passing Interface is used by many programs that strongly care about performance. The difference between using shared memory and using a shared address space can be more than a factor of two, and as much as an order of magnitude in speed. (Arbitration of shared memory is an often overlooked cost, as are the effects of increased cache footprint.) Thus version 1.1 of Lockless MPI used threads with a shared address space to gain these performance advantages.

MPI specifies that individual MPI ranks act like processes. They have their own global variables untouched by other ranks. This means that if you run the ranks by using threads, you have a problem. The threads will share the global variables, and will stomp on each other's use of them. The way version 1.1 of Lockless MPI worked around this was to compile the source code of MPI programs with all global variables converted into thread-local variables.

Converting the storage class of variables from global to thread-local is relatively trivial in C. You just need to prepend "__thread" to their definitions. However, in other languages, it may be much more difficult or impossible. In C++, how would one convert a singleton into a "per-thread singleton", with the limitation that there is no such thing as per-thread constructors? In FORTRAN, the concept of threads may not even exist, with all variables by default being statically allocated.

Thus even though using threads for each rank was extremely fast, it also was extremely limiting. Much MPI code is written in languages other than C, and could not be used with this mechanism. The way to fix this is to alter the mechanism, and use a Meta-process layout.

We can launch each rank of an MPI job as a separate process within a shared address space, defining a meta-process. The power of the clone() system call allows us to choose exactly what details are shared between these processes. Finally, each process can then launch multiple threads of execution if so desired. With this trick, it is possible to use code compiled from any language in a shared-address-space MPI job. The only requirement is of position independent code, which is supported by nearly all compilers.

The trick is to notice that multiple copies of perhaps the same process can exist in the same address space. Each copy will be situated at a different address. If we use position-independent code, then this is physically possible. (Most libraries are compiled with position independence, and many executables are now as well for security reasons.) When we access a global variable, it will be via %rip-based addressing. Since the instruction pointer will be in a different spot in each copy, each copy will get a different set of global variables.

The really nice thing about this process layout is that the kernel will share all of the copies of the program code between each duplicate. The result is increased virtual memory usage, but hardly any extra physical memory usage. Since we have 64 bits of address space on modern machines (of which you might only be able to use 48 or 56 bits), this is plenty.

So what are the disadvantages of this scheme? The biggest is that it is non-standard so there is missing linker, loader and debugger support. If you wish to use this trick, you'll need to tweak many low-level details that other code can take for granted.

Implementing Meta-Processes

The first question is how do you actually load multiple copies of the same program into memory? You cannot use the dlopen() function. That will simply increment a ref-count, and won't actually map a new version. The dlmopen() function is more promising, but also doesn't work as desired. It is limited to a low fixed number of copies, and doesn't work with the program's main executable.

This means we will have to implement our own loader. Or not. There is another way. We can use the fact that ld.so is an executable as well as a library. If we can just load ld.so then it can do the heavy lifting of loading the rest of the process. This is good because extra features are slowly being added to the elf spec, and it would be difficult to have to support them all. The loader, being a relatively simple program, is much easier to load. (Pun intended.)

Thus with a small amount of code, it is possible to parse the elf headers of ld.so and map it into memory. We just need to set up the arguments so that it loads the program we wish, which also isn't difficult. We finally need to set up the auxiliary vector with data originally from the kernel before we can launch. Fortunately, the vector is readable from /proc/$pid/auxv, so we can obtain the tricky things like hardware capabilities without too much effort.

In effect, we can make a user-space version of the exec() functions. It becomes trivial to do such tricks as making a utility that acts somewhat like xargs, working around the limit execv() places on how large the arguments to a new process can be (one quarter of the stack space defined in rlimit).

However, if you use the above to launch more than one process within the same address space you will run into some subtle bugs. The problem is sbrk(). The memory allocator will use that system call to find out where the break segment is. However, the second process will also do that, and may end up thinking it owns the same region of memory as the first. The result is that the two copies of the memory allocator can cause massive data corruption as allocated structures overlap.

Thus you'll need to overload either sbrk or the whole memory allocator to avoid problems. This can be done with the LD_PRELOAD mechanism, or simply by requiring that code using this trick be linked to something that has that effect. (The second option is taken by the Lockless MPI library.)

Your problems will not end there. The problem is that the memory allocator will probably need to use per-thread variables. Per-thread variables work by using the %fs segment register to refer to different blocks of memory in each thread. Thus by using %fs-based indexing, we can use the same code to access thread-specific variables. The problem is that the slots in that block of memory are chosen by the loader/linker. The slot for a given variable can differ from program to program.

Normally, the particular tls slot chosen is completely transparent to user code. Everything just works; you don't really need to know how or why. However, if different programs try to call each other's code, then the slots might not agree. The end result is that code may read or write to the wrong place in memory. This is catastrophic, especially since basically any system call will set errno, which on modern systems is a thread-local variable. Thus we need to override thread-local variables, and in fact all syscalls, in program-shared code.

We can't use %fs in shared code, but there is one segment register remaining, %gs. By using this second segment register, we can efficiently have process global code. (It would be nice to be able to tell the loader to save a few tls slots in %fs instead, but that isn't possible with the current setup.)

The Lockless MPI library uses this technique to implement processes living within a shared address space. mpiexec will launch all MPI ranks within itself. Those ranks can then call back inside mpiexec when required. Since there is only one copy of the i/o code in mpiexec, this simplifies interfacing with the outside world. For example, the locking hidden within the Infiniband library wouldn't work as intended if multiple copies of the Infiniband library were loaded simultaneously.


If you try to debug a meta-process with gdb, it will not work the way you might like. The problem is that the debugger will not know about all the extra processes running in the address space. Somehow, it needs to be told about them. Otherwise stack traces will be full of unresolved symbols.

A debugger finds out about the state of the application it is running by looking at the structure pointed to by the _r_debug symbol. Since that structure can be updated at runtime by dlopen() loading more libraries, there is also the _dl_debug_state() function. The C library helpfully calls that stub of a function before and after it modifies _r_debug. This allows the debugger to notice the changes, and simultaneously know when _r_debug may be in an inconsistent state due to it being modified.

The problem is that each process inside our meta-program will have its own copy of the C library, and hence its own _r_debug. gdb will only see the first (outermost) of these. Thus we need to somehow use the information in the other structures to update the master version.

Unfortunately, glibc also uses the _r_debug data structure to store its knowledge of the loaded libraries. Thus if the main program tries to use dlopen() then it may not like seeing that list in an altered state. Note that even if dlopen is not called directly, some parts of the C library use it. For example, using DNS lookups will load the NSS library, and thus make an internal call to dlopen.

So if you want debugging to work with a meta-process based program, you need to be very careful. You'll need to hook quite a few places so that the loaded object list is in the required state at all times. Sometimes it needs to be what glibc expects, sometimes what gdb expects. One trick to make this happen is to ensure all problematic C library calls only happen from "initial process" context. The Lockless MPI library uses a structure with function pointers referenced by the %gs segment selector for this purpose.

The end result is that debugging a meta-process based application can be no more difficult than debugging a multi-threaded application. gdb already has support for multiple "inferiors", and using processes or threads as the inferiors works equally well. gdb is smart enough to notice that a given symbol name can appear in multiple places, and will hook all required functions for a given breakpoint. It is also smart enough to know which version of a given global variable is wanted from a given inferior.


Meta-processes add a new type of execution context to the list of possible ways of constructing a program. They combine the features of both processes and threads in an elegant way. They add a new class of variable: global to a process and its threads, but not to the whole program.

Lockless MPI uses the meta-process technique to allow languages other than C to use the large performance improvements that shared address-space programming yields. The only cost is minimal, requiring that the application (and its libraries) be compiled with position-independent code.


David Bakin said...
Windows CE operated somewhat like this: The OS and all user processes shared the same address space. That's why there were a maximum of 32 simultaneous processes: The address space was split up so that each process got a 32Mb region of it (including the OS), which I believe were called slots. DLLs were loaded into one or two slots reserved for the purpose. Memory protection was used so that each process had its own page map and had R/W access to its own slot, Read only access to code slots (and thus all DLLs were sharable), and no access to the rest of the address space. (By default, you could map additional memory.) The OS (including drivers) ran in a process which had a page map which had access to everything.

A process switch was a thread switch + changing a fixed number of page table entries (to adjust permissions) (and maybe TLB flush, etc.)

Because this was the way the OS was designed none of the problems you mention actually happened: all the tools were designed properly, all the programs were compiled properly, etc.

The major issues in practice were the hard 32 process limit and the restricted 32Mb data space for each process. This was considered acceptable in the initial applications the OS was targeted for: embedded systems, including (early) smartphones. But of course, as embedded systems became more and more powerful these limitations became difficult to work around and eventually were showstoppers.