Lockless Inc

How does it work?

The Lockless MPI Library has a very different design than other MPI implementations.

Threads/Processes

Other MPI implementations (and the MPI specification itself) are process based. Each rank in the MPI job has a completely isolated address space. This design allows programs using MPI to be ported seamlessly between clusters and SMP systems. However, with the rise in popularity of multi-core systems, extra efficiency can be gained by changing this.

The Lockless MPI implementation uses a hybrid of threads and processes via the clone() system call. Each rank on the same machine shares address space with the other ranks on that machine. (Ranks on different machines still maintain separate address spaces.) This has performance benefits, because fewer copies are required to transfer data from rank to rank.
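
As a rough illustration of this approach, the sketch below uses clone() with CLONE_VM to start a child that shares the parent's address space while remaining a separate, process-like task. The flag set and the rank_main() helper are assumptions made for the example, not the library's actual start-up code.

    /* Minimal sketch: spawning a shared-address-space "rank" with clone().
     * Illustrative only; not the library's actual start-up code. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int rank_main(void *arg)
    {
        int rank = *(int *)arg;
        printf("hello from rank %d, pid %d\n", rank, (int)getpid());
        return 0;
    }

    int main(void)
    {
        const size_t stack_size = 1 << 20;
        char *stack = malloc(stack_size);
        static int rank = 1;

        /* CLONE_VM shares the address space; omitting CLONE_THREAD keeps the
         * child a separate, process-like task with its own pid. */
        pid_t pid = clone(rank_main, stack + stack_size, CLONE_VM | SIGCHLD, &rank);
        if (pid < 0) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
    }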

Performance in the Lockless MPI implementation is greatly improved over other MPI libraries due to the shared address space. In other libraries, ranks on the same machine typically communicate via a shared-memory region. Messages must be copied from the address space of the source into this shared region, and then copied again out of the shared region into the address space of the receiving rank, so at least two memory copies are required for each message. With a shared address space, however, only a single copy is required. By simply passing a pointer from the source to the destination, the receiving rank may copy directly from the source buffer into its request. For large messages, the time spent in these copies dominates, so Lockless MPI is over twice as fast in that regime because it uses half as many memcpy() calls.
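
The difference in copy counts can be seen in the following minimal sketch, where shm_region stands in for a shared-memory segment between two separate processes; all of the buffer names are illustrative.

    /* Sketch of the copy counts involved in the two schemes. */
    #include <string.h>

    enum { LEN = 4096 };
    static char src_buf[LEN], dst_buf[LEN], shm_region[LEN];

    /* Separate address spaces: two copies per message. */
    static void two_copy_transfer(void)
    {
        memcpy(shm_region, src_buf, LEN);   /* sender copies into the shared region */
        memcpy(dst_buf, shm_region, LEN);   /* receiver copies out of it            */
    }

    /* Shared address space: the sender publishes a pointer and the receiver
     * copies directly from the source buffer - one copy per message. */
    static void one_copy_transfer(const char *published_src)
    {
        memcpy(dst_buf, published_src, LEN);
    }

    int main(void)
    {
        two_copy_transfer();
        one_copy_transfer(src_buf);
        return 0;
    }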

One subtlety is global variables. In the MPI spec, these variables are "global" only within the scope of a single MPI rank (which may itself decide to run many threads). This means we cannot use a standard program address-space layout. We would also like this altered notion of globalness to work in code not compiled with a particular MPI compiler, so that external libraries can be loaded without issue.

The solution is to use position-independent code. Typically, libraries are compiled with that option anyway, and for security reasons, more and more executables are as well. We can load multiple copies of the same program at different locations within the address space. Each copy will have its own set of global variables. Unfortunately, there is no simple way of doing this. The only documented way of multiply loading an object file is via the dlmopen() function, and that doesn't duplicate the address space usage.
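
For reference, the documented route mentioned above looks roughly like the sketch below: dlmopen() loads the same shared object into two separate link-map namespaces, each with its own copy of the globals. The libexample.so file and its counter symbol are hypothetical stand-ins.

    /* Sketch of loading one shared object twice with dlmopen().
     * Build with -ldl; "libexample.so" and "counter" are hypothetical. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *a = dlmopen(LM_ID_NEWLM, "./libexample.so", RTLD_NOW);
        void *b = dlmopen(LM_ID_NEWLM, "./libexample.so", RTLD_NOW);
        if (!a || !b) { fprintf(stderr, "dlmopen: %s\n", dlerror()); return 1; }

        int *ca = dlsym(a, "counter");
        int *cb = dlsym(b, "counter");

        /* Each namespace has its own copy of the global, so the addresses
         * (and values) are independent. */
        printf("counter in namespace a: %p, namespace b: %p\n",
               (void *)ca, (void *)cb);

        dlclose(a);
        dlclose(b);
        return 0;
    }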

Lockless MPI uses low-level code independent of the C library to load duplicate copies of executables (and their libraries) into memory. These duplicates don't greatly increase real memory usage as the kernel is smart enough to share the backing pages between instances. It does, however, require an increase in virtual memory usage. With 64 bits of address space, this isn't much of an issue.

The fact that any position-independent code can be loaded in this way greatly increases the number of libraries that are MPI compatible, compared to previous versions. It also increases compatibility with low-level code that deals with per-process features like signals or threads. An MPI rank becomes much more process-like.

There is one final issue with multiple shared-address-space processes: thread-local variables. The thread-local variable implementation on x86_64 uses the %fs segment register to refer to a different chunk of memory in each thread. By using a relative offset to the base address of %fs, you can read and write different locations in each thread using the same object code.
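
The sketch below shows this mechanism in an ordinary threaded program: each thread gets its own instance of a __thread variable, and the inline assembly simply reads the thread pointer itself (glibc keeps a self-pointer at %fs:0 on x86_64).

    /* Sketch of %fs-based TLS on x86_64/Linux.  Build with -pthread. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    static __thread long tls_counter;   /* one instance per thread */

    static void *worker(void *arg)
    {
        uintptr_t fs_base;
        __asm__ ("movq %%fs:0, %0" : "=r" (fs_base));   /* thread pointer */

        tls_counter = (long)(uintptr_t)arg;
        printf("thread %ld: fs base %#lx, &tls_counter %p\n",
               tls_counter, (unsigned long)fs_base, (void *)&tls_counter);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_create(&t2, NULL, worker, (void *)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }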

The problem here is that the thread-local storage (TLS) slots can vary from program to program, and library to library. The exact offset used by a given variable can change depending on what libraries are loaded. Thus if you try to call a function in one loaded executable from another, bad things will happen; for example, errno might be written to the wrong offset. Don't pass function pointers between MPI ranks and expect the resulting functions to work correctly.

Normally, the complexity of thread-local storage is handled by the linker and loader transparently. However, neither of them understands the non-standard process layout used by Lockless MPI. Fortunately, a second segment register exists. Lockless MPI uses the %gs segment register to implement a parallel set of TLS so that low-level code like the memory allocator or libibverbs can work correctly. One of the link options within the MPI compilation shell scripts hooks the pthread_create() function so that we can be sure %gs is always in a consistent state.
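
A much-simplified sketch of what such a hook might look like is shown below, using dlsym(RTLD_NEXT, ...) interposition and the arch_prctl system call to point %gs at a per-thread area. The region layout and the gs_area_create() helper are assumptions made for the example; this is not the library's actual code.

    /* Sketch: interpose pthread_create() so every new thread gets a private
     * %gs-based region before its start routine runs.  Build with -ldl -pthread. */
    #define _GNU_SOURCE
    #include <asm/prctl.h>      /* ARCH_SET_GS */
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                             void *(*)(void *), void *);

    struct trampoline {
        void *(*start)(void *);
        void *arg;
    };

    static void *gs_area_create(void)
    {
        /* Hypothetical: allocate a per-thread block to be addressed via %gs. */
        return calloc(1, 4096);
    }

    static void *thread_trampoline(void *p)
    {
        struct trampoline t = *(struct trampoline *)p;
        free(p);

        /* Point %gs at the freshly allocated per-thread area. */
        syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)gs_area_create());

        return t.start(t.arg);
    }

    int pthread_create(pthread_t *tid, const pthread_attr_t *attr,
                       void *(*start)(void *), void *arg)
    {
        static create_fn real_create;
        if (!real_create)
            real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

        struct trampoline *t = malloc(sizeof *t);
        t->start = start;
        t->arg = arg;
        return real_create(tid, attr, thread_trampoline, t);
    }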

Messages

Messages in Lockless MPI are handled by a wait-free queue for each rank. This means that sending a message is a non-blocking operation with ideal scalability. The speed is further improved by including the Lockless Memory Allocator within Lockless MPI, allowing non-blocking allocations and frees. Through the liberal use of MPI_Isend(), MPI_Irecv(), MPI_Test(), and similar functions, it is possible to build lock-free algorithms with this MPI implementation without any of the difficult per-cpu knowledge normally required.
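
The following is a small, standard MPI example of that non-blocking pattern: post the receive and send, keep computing, and poll for completion with MPI_Testall().

    /* Non-blocking exchange with a neighbouring rank, overlapping the
     * transfer with computation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double out[1024], in[1024];
        for (int i = 0; i < 1024; i++) out[i] = rank + i;

        int peer = rank ^ 1;            /* exchange with a neighbouring rank */
        MPI_Request reqs[2];
        int done = 0;

        if (size >= 2 && peer < size) {
            MPI_Irecv(in, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(out, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

            /* Overlap: do useful work while the transfer progresses. */
            while (!done) {
                /* ... computation that does not touch 'in' or 'out' ... */
                MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
            }
            printf("rank %d exchanged with rank %d\n", rank, peer);
        }

        MPI_Finalize();
        return 0;
    }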

Small messages are stored within the envelope on the wait-free queue. Larger messages use an allocated buffer that has the message information packed into it. The MPI_bufcopy_set() and MPI_bufcopy_get() functions can be used to tune the maximum size at which this happens. The largest messages are passed by reference: the receiving rank obtains a pointer to the source buffer and copies the data from it.
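
A possible use of these tuning functions is sketched below, assuming they take and return a size in bytes; check mpi.h for the exact prototypes.

    /* Sketch of tuning the copy/reference threshold.  The prototypes and the
     * exact semantics are assumptions for this example. */
    #include <mpi.h>
    #include <stdio.h>

    void tune_threshold(void)
    {
        size_t old = MPI_bufcopy_get();

        /* Assumed semantics: messages above ~64 KiB are now passed by
         * reference rather than packed into an intermediate buffer. */
        MPI_bufcopy_set(64 * 1024);

        printf("bufcopy threshold: %zu -> %zu bytes\n",
               (size_t)old, (size_t)MPI_bufcopy_get());
    }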

Since we usually do not want to have to wait until a matching receive is posted, send operations need to have the ability to complete early. This is done through the use of a signal handler. The lowest priority real-time signal is used for this. The sender copies the data into a newly allocated buffer, and then notifies the receiver that it should use that buffer instead of the source specified in the original message.
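
The sketch below illustrates the underlying mechanism generically: the lowest-priority real-time signal (SIGRTMAX on Linux) carries a pointer to a staged buffer via sigqueue(). The buffer handling itself is purely illustrative and is not the library's internal protocol.

    /* Generic illustration: notify a receiver of a staged buffer with the
     * lowest-priority real-time signal. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void *volatile posted_buffer;

    static void notify_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Receiver side: remember the staged buffer to copy from later. */
        posted_buffer = info->si_value.sival_ptr;
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_sigaction = notify_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMAX, &sa, NULL);

        /* "Sender" side: stage a copy of the data, notify with its address. */
        char *staged = strdup("message payload");
        union sigval v = { .sival_ptr = staged };
        sigqueue(getpid(), SIGRTMAX, v);

        while (!posted_buffer)
            ;                   /* wait for the notification to arrive */

        printf("received via staged buffer: %s\n", (char *)posted_buffer);
        free(staged);
        return 0;
    }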

IO

IO is handled by a separate completion process. This process uses epoll to efficiently wait on the sockets connecting to other machines in the cluster. Multiple ranks within the same machine will share the same connecting sockets. This greatly reduces the number of open file descriptors required as compared to other MPI implementations.
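
A minimal sketch of such an epoll-based completion loop is shown below; the socket set-up and the handle_io() helper are placeholders for this example.

    /* Sketch of an epoll-based completion loop over cluster sockets. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    static void handle_io(int fd, unsigned events)
    {
        /* Hypothetical: progress sends/receives on this connection. */
        (void)fd; (void)events;
    }

    void completion_loop(const int *socks, int nsocks)
    {
        int ep = epoll_create1(0);
        if (ep < 0) { perror("epoll_create1"); exit(1); }

        for (int i = 0; i < nsocks; i++) {
            struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT,
                                      .data.fd = socks[i] };
            epoll_ctl(ep, EPOLL_CTL_ADD, socks[i], &ev);
        }

        for (;;) {
            struct epoll_event events[MAX_EVENTS];
            int n = epoll_wait(ep, events, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++)
                handle_io(events[i].data.fd, events[i].events);
        }
    }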

Since connections are shared between ranks, they use a thread-safe wait-free queue similar to the message queues. The synchronization is fine-grained so that one thread may be reading and another simultaneously writing on the same socket. If no MPI rank is currently executing IO operations, then the IO process will make sure outstanding sends and receives continue to complete over the connections. This allows improved overlap of communication and computation.

This implementation of MPI also supports Infiniband for connecting machines. Due to its low-latency nature, Infiniband uses one additional completion thread within the standard IO process. This extra thread polls Infiniband state whenever messages are rapidly being sent or received. Unlike other MPI implementations, Infiniband communication will proceed even if calls to the MPI library are not being made. (Most other implementations require polling of Infiniband resources in order for communication to happen.)

The Infiniband transport protocol used by Lockless MPI has different algorithms for small, medium, and large messages. The smallest messages are copied whole into the address space of the destination, and directly inserted into the wait-free message queue. Medium-sized messages are transferred over a byte-stream-based ring buffer using the same stream protocol as is used over ethernet sockets. For the largest messages, Infiniband is used to register memory in the source and destination with the hardware. The hardware is then directed to copy from the source to the destination without interrupting either cpu.
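
The large-message path relies on memory registration with libibverbs, roughly as in the sketch below: each side registers its buffer, and the HCA then moves the data with an RDMA operation without involving either cpu. Device and protection-domain set-up, and the RDMA post itself, are omitted; the function is illustrative.

    /* Sketch of registering a large-message buffer with the HCA.
     * Build with -libverbs. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    struct ibv_mr *register_large_buffer(struct ibv_pd *pd, void *buf, size_t len)
    {
        /* LOCAL_WRITE plus REMOTE_READ lets a remote HCA read straight from
         * this buffer; a buffer that will be written into remotely would need
         * IBV_ACCESS_REMOTE_WRITE instead. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr)
            fprintf(stderr, "ibv_reg_mr failed\n");
        else
            printf("registered %zu bytes, lkey %#x rkey %#x\n",
                   len, mr->lkey, mr->rkey);
        return mr;
    }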

Since multiple ranks share the same address space, far fewer Infiniband resources are required as compared to other MPI implementations. Only one queue-pair and a third of a MiB are required per connected machine, no matter how many cores it has. Unfortunately, the current version of MPI does not support the use of multiple Infiniband ports on a machine. Only the first connected port will be used. This version of MPI also assumes that, if Infiniband is used, the network topology allows all machines to talk to all other machines over it.

Sleeping

Most MPI implementations will spin-wait when a requested message hasn't arrived. This increases cpu usage drastically when the required message is delayed. Lockless MPI takes a different tack. It will spin for a short while (about a millisecond) and then go to sleep. When a new message is posted or an Isend completes, the rank will wake up again. This is similar to the difference between spinlocks and mutexes. Spinlocks work well if the wait time is short and you know that there aren't more potential waiters than cpus. Mutexes are more versatile, and perform well in many situations where spinlocks will not.
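
A generic spin-then-sleep wait, similar in spirit to the behaviour described above, might look like the sketch below: poll briefly for a message flag, then block on a futex until the sender issues a wake. The iteration count and flag are illustrative.

    /* Generic spin-then-sleep wait on a flag, using a futex to block. */
    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(atomic_int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Wait until *flag becomes non-zero. */
    void wait_for_message(atomic_int *flag)
    {
        /* Spin for a bounded number of iterations (~a millisecond's worth). */
        for (int i = 0; i < 100000; i++)
            if (atomic_load(flag))
                return;

        /* Still nothing: sleep until the sender calls FUTEX_WAKE. */
        while (!atomic_load(flag))
            futex(flag, FUTEX_WAIT, 0);
    }

    /* Sender side: set the flag and wake any sleeping rank. */
    void post_message(atomic_int *flag)
    {
        atomic_store(flag, 1);
        futex(flag, FUTEX_WAKE, 1);
    }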

Thus the Lockless MPI implementation will work quite well even if you have more executing ranks than cores on a machine. You can dynamically tune your workload by watching total cpu usage and adding more ranks if it is too low. Note that this sleeping can interfere with the logic within some MPI benchmarks. They may expect 100% cpu usage at all times, and can give spuriously bad results when that isn't the case.

The ability to sleep isn't affected by the use of Infiniband. The Infiniband completion thread will notice when messages arrive, and will process them in a timely manner even if no MPI calls are happening. For very large messages, the Infiniband HCA will offload nearly all the work, leaving extremely low cpu usage.

Debugging

Debuggers need to know what object files (executables and libraries) have been loaded by a program. To do this, they use the data structure pointed to by the _r_debug symbol. The C library will update this data structure as libraries are loaded and unloaded by the executable. It notifies the debugger by calling the _dl_debug_state() stub function which the debugger can hook.
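
The sketch below shows the data structure involved by walking the link_map list hanging off _r_debug in an ordinary process; this is exactly the list a debugger inspects.

    /* Walk the C library's loaded-object list via _r_debug. */
    #include <link.h>
    #include <stdio.h>

    extern struct r_debug _r_debug;

    int main(void)
    {
        for (struct link_map *m = _r_debug.r_map; m != NULL; m = m->l_next)
            printf("loaded at %#lx: %s\n",
                   (unsigned long)m->l_addr,
                   m->l_name[0] ? m->l_name : "(main executable)");
        return 0;
    }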

The Lockless MPI implementation loads multiple executables in the same address space. The resulting meta-program has multiple copies of the C library, each with a different idea of what is loaded and where. Somehow the debugger needs to make sense of that, and by default, it can't.

Lockless MPI takes control of the _r_debug structure from the C library within mpiexec. It can then add the extra information specifying what other programs are loaded and where they are situated in memory. It can only do this once an MPI job has started up and called MPI_Init(). Thus, before a program calls MPI_Init(), it will not have registered itself properly with the debugger. This can make debugging early start-up issues difficult. Fortunately, most MPI programs call MPI_Init() very early on.

Another issue is that the same data structure that MPI edits is also used by the C library itself to know what is loaded. Thus if the IO process exits unexpectedly (before it has had time to clean up any changes it has made), then control can be returned to the C library with the loaded-library list in a "corrupt" state. The C library then gets confused and prints out a warning. This warning is harmless, as the MPI program is exiting anyway due to a fatal error.

Note that due to multiple copies of a program being loaded into memory at the same time, a given function may exist in many places. Thus when you insert a breakpoint you need to make sure the right instance of a function is used. Fortunately, gdb has the concept of "inferiors" which allow it to know that a symbol can have different resolutions in differing execution contexts. Lockless MPI makes sure to tell gdb enough information so that it will automatically add breakpoints where you might expect them to be. Debugging a multiple shared address-space process becomes very similar to debugging a threaded process.

Older Versions

Older versions of the MPI library used a different technique to share address space. The individual MPI ranks were represented as threads within a single process. This meant that there was less isolation between them, and thread-safe libraries were needed. MIMD programs had to be run on multiple machines, and the only language supported was C. For more information see the old technical description.
