A Different Kind of Threading
There are many different kinds of execution contexts in computer science. The largest of these is the "process". A modern operating system will multi-task between multiple processes, and divide system resources between them. Smaller than a process is a "thread". A process is made of one or more threads of execution, and all resources are shared commonly between them. Finally, a thread may be made of "fibers". Fibers are user-based threads. Since the kernel isn't involved, switching between them is a very fast operation.
The most advanced applications use processes, threads and fibers, and they use the correctly sized execution contexts for their given task. Each of the types of execution context has advantages and disadvantages. Using the wrong type can create an inefficient program. The trick is knowing when to use what.
The advantage of processes is isolation. A process can only affect another process through a limited interface. Typically, this may consist of shared file descriptors, signals, and shared memory. Since the amount of interaction is low, if one process misbehaves, it is quite likely that other processes will not be badly affected. Thus simple programs are just a single process, with no extra threads or other complexity.
The disadvantage of processes is isolation. It can be inefficient to communicate between individual processes. Even if shared memory is used, you may need some form of arbitration over its use. For even more isolation, you can use pipes, sockets or message-queues. These are kernel-controlled, with a corresponding loss of speed.
An example of a multi-process application is the Chrome web browser. It uses a process per web-page tab. If something goes wrong within a given tab, it won't bring down the whole browser. The application can seamlessly close down the offending process and continue. Due to isolation, it doesn't have to worry about that corrupting state elsewhere.
Threads on Linux are like processes, they just share more state. Threads share their address space, open file handles, and other more technical data, like the identity of their parent process. Originally, threads on Linux were exactly like processes. However, improving compatibility with POSIX threading meant that now they have explicit kernel support.
A given Linux process has a pid (process id), and a tid (thread id). In a single-threaded application, the pid and the tid have the same value. In a multi-threaded application, each thread shares the same pid, but the tid's differ in each thread. Some system calls differentiate between pids and tids. For example, you may broadcast a signal to a whole process (or group of processes)... or just to a single tid.
The advantage to threads is their shared state. This means that multi-threaded applications can use highly efficient algorithms to operate on shared data structures. Copying is avoided, and only the memory that absolutely needs to change will be altered. Multi-threaded applications can also take advantage of multi-core cpus. Multi-process applications can do the same, but converting a single-process application to multi-process is quite a bit harder than converting it to be multi-threaded.
The disadvantage of multi-threading is complexity. You don't want the differing threads to stomp on each other's toes. To prevent problems, constructs like locks and barriers can be used to avoid one thread looking at data that is concurrently being modified by another.
The issue is that while it is easy to have coarse-grained locking, such locking prevents concurrency. Fined-grained locking also can be a loser due to the extra overhead of all of the extra locks. The most efficient code has just the right amount of synchronization, and no more. Finding this happy balance can be tricky.
Of course, lock-free data structures exist. If you are lucky, then someone has already made the one that corresponds to your particular problem. If so, then you can get superior performance with threads with comparatively little effort. Unfortunately, usually the world isn't so kind. Many data structures are difficult, if not impossible, to make lock free. In most cases, you will be stuck with traditional lock-based code.
Fibers, user-space threading, or "green threads" are an execution context that are in some sense smaller than threads. They basically consist of just a stack frame. A context switch between them can be done by swizzling a few registers, and then swapping stack pointers. This is much faster than context switching between threads, which is something that needs to go through the kernel.
The disadvantage of fibers is that the kernel doesn't know of their existence. If a fiber blocks waiting on data, the kernel can't switch to some other fiber. This means that although it is possible to build a threading library purely from these user-space threading constructs, it is difficult to get the behaviour exactly right. The program needs to manually switch between fibers, adding complexity
An advantage of fibers (other than their speed) is that only one can be running at a given time in a thread. This means that fibers may not need synchronization between themselves. This is due to the one-to-one mapping of fibers to concept of co-routines. With careful programming, it is even possible to have "stackless" fibers, making them have even lighter weight. You just need to record the current state in some other data-structure when you do your context switching.
Data-driven programs can use fibers (or co-routines) to greatly simplify their control flow. A series of co-routines can be chained together to process the data, and each can symmetrically read from, and write to, the next in the chain. Whenever extra data is needed a fiber can "block" waiting for more, and then transfer control to some other fiber. Their generality becomes very helpful when each stage in the data-flow process can require differently sized chunks of data.
The above constructs are well-known... but there is another way of splitting up threads of execution that isn't used as much as it once was. It is possible to have multiple programs running concurrently within the same address space. (As opposed to multiple threads of the same program.) This was relatively common before memory management units with virtual memory and memory protection were available.
The resulting layout looks like an onion. On the outermost layer is possibly a group of interacting meta-processes. Inside each of those is a shared address space, which inside are multiple processes. Inside those are perhaps multiple threads, which finally may contain multiple fibers.
So why is this useful? The whole point of having processes within separate virtual memory spaces is to have isolation. Having them within the same address space destroys that property. However, what it does do is provide the same performance characteristics of threads. So if you care about performance, then this unusual process layout might be interesting.
A case where this matters is MPI. The Message Passing Interface is used by many programs which strongly care about performance. The difference between using shared memory, and using a shared address space can be more than a factor of two, and as much as an order of magnitude in speed. (Arbitration of shared memory is an often overlooked cost, as is the effects of increased cache footprint.) Thus version 1.1 of Lockless MPI used threads with shared address space to gain these performance advantages.
MPI specifies that individual MPI ranks act like processes. They have their own global variables untouched by other ranks. This means that if you run the ranks by using threads, you have a problem. The threads will share the global variables, and will stomp on each other's use of them. The way version 1.1 of Lockless MPI worked around this was to compile the source code of MPI programs with all global variables converted into thread-local variables.
Converting the storage class of variables from global to thread-local is relatively trivial in C. You just need to prepend "__thread" to their definitions. However, in other languages, it may be much more difficult or impossible. In C++, how would one convert a singleton into a "per-thread-singleton", with the limitation that there are no such thing as per-thread constructors? In FORTRAN, the concept of threads may not even exist, with all variables by default being statically allocated.
Thus even though using threads for each rank was extremely fast, it also was extremely limiting. Much MPI code is written in languages other than C, and could not be used with this mechanism. The way to fix this is to alter the mechanism, and use a Meta-process layout.
We can launch each rank of an MPI job as a separate process within a shared address space, defining a meta-process. The power of the
The trick is to notice that multiple copies of perhaps the same process can exist in the same address space. Each copy will be situated at a different address. If we use position-independent code, then this is physically possible. (Most libraries are compiled with position independence, and many executables are now as well for security reasons.) When we access a global variable, it will be via %rip-based addressing. Since the instruction pointer will be in a different spot in each copy, each copy will get a different set of global variables.
The really nice thing about this process layout is that the kernel will share all of the copies of the program code between each duplicate. The result is increased virtual memory usage, but hardly any extra physical memory usage. Since we have 64 bits of address space on modern machines (of which you might only be able to access 48 or 56), this is plenty.
So what are the disadvantages of this scheme? The biggest is that it is non-standard so there is missing linker, loader and debugger support. If you wish to use this trick, you'll need to tweak many low-level details that other code can take for granted.
The first question is how do you actually load multiple copies of the same program into memory? You cannot use the
This means we will have to implement our own loader. Or not. There is another way. We can use the fact that
Thus with a small amount of code, it is possible to parse the elf headers of
In effect, we can make a user-space version of the
However, if you use the above to launch more than one process within the same address space you will run into some subtle bugs. The problem is
Thus you'll need to overload either
Your problems will not end there. The problem is that the memory allocator will probably need to use per-thread variables. Per-thread variables work by using the %fs segment register to refer to different blocks of memory in each thread. Thus by using %fs-based indexing, we can use the same code to access thread specific variables. The problem is that the slots in that block of memory are chosen by the loader/linker. The slot for a given variable can differ from program to program.
Normally, the particular tls slot chosen is completely transparent to user code. Everything just works, you don't really need to know how or why. However, if different programs try to call each other's code, then the slots might not agree. The end result is that code may read or write to the wrong place in memory. This is catastrophic, especially since basically any system call will set
We can't use %fs in shared code, but there is one segment register remaining, %gs. By using this second segment register, we can efficiently have process global code. (It would be nice to be able to tell the loader to save a few tls slots in %fs instead, but that isn't possible with the current setup.)
The Lockless MPI library uses this technique to implement processes living within a shared address space.
If you try to debug a meta-process with gdb it will not work they way you might like. The problem is that the debugger will not know about all the extra processes running in the address space. Somehow, it needs to be told about them. Otherwise stack traces will be full of unresolved symbols.
A debugger finds out about the state of the application it is running by looking at the structure pointed to by the
The problem is that each process inside our meta-program will have its own copy of the C library, and hence its own
Unfortunately, glibc also uses the
So if you want to get a meta-process based program to work with debugging you need to be very careful. You'll need to hook quite a few places so that the loaded object list is in the required state at all times. Sometimes it needs to be what glibc expects, sometimes what gdb expects. One trick to make this happen is to ensure all problematic C library calls only happen from "initial process" context. The Lockless MPI library uses a structure with function pointers referenced by the %gs segment selector for this purpose.
The end result is that debugging a meta-process based application can be no more difficult than debugging a multi-threaded application. gdb already has support for multiple "inferiors", and using processes or threads as the inferiors works equally well. gdb is smart enough to notice that a given symbol name can appear in multiple places, and will hook all required functions for a given breakpoint. It is also smart enough to know which version of a given global variable is wanted from a given inferior.
Meta-processes add a new type of executable context to the list of possible ways of constructing a program. They combine the features of both processes and threads in an elegant way. They add a new class of variable: global to a process and its threads, but not to the whole program.
Lockless MPI uses the meta-process technique to allow languages other than C to use the large performance improvements that shared address-space programming yields. The only cost is minimal, requiring that the application (and its libraries) be compiled with position-independent code.
Company Info |
Product Index |
Category Index |
Copyright © Lockless Inc All Rights Reserved.