
How does it work? (For MPI Version 1.1)

The Lockless MPI Library has a very different design than other MPI implementations.

Threads

Other MPI implementations (and the MPI specification itself) are process based. Each rank in the MPI job has a completely isolated address space. This design allows programs using MPI to be ported seamlessly between clusters and SMP systems. However, with the rise in popularity of multi-core systems, extra efficiency can be gained by changing this.

The Lockless MPI implementation uses threads instead of processes. Thus each rank on a machine shares its address space with the other ranks on that machine. (Ranks on different machines still maintain separate address spaces.) This has performance implications because fewer copies are required to transfer data from rank to rank. It also implies that one cannot simply compile and use normal MPI programs with this scheme, as global variables would "stomp" on each other.

Performance in the Lockless MPI implementation is greatly improved over other MPI libraries due to the shared address space. In a typical implementation, ranks on the same machine communicate via shared memory. Messages must be copied from the address space of the source into this shared region, and then copied again out of the shared region into the address space of the receiving rank. So at least two memory copies are required for each message with this technique. With a shared address space, however, only a single copy is required. By simply passing a pointer from the source to the destination, the receiving rank may copy directly from the source buffer into its request. For large messages, the time spent in these copies dominates. Thus Lockless MPI is over twice as fast in that regime, since it uses half as many memcpy() calls.
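
As a rough conceptual illustration (not the library's actual code), the difference in copy counts for an n-byte message looks like this:

    #include <string.h>

    /* Conceptual sketch only -- not Lockless MPI source code. */

    /* Process-based MPI: the payload crosses a shared-memory region,
     * so it is copied twice. */
    void transfer_two_copies(void *dst, const void *src, void *shared, size_t n)
    {
        memcpy(shared, src, n);  /* sender: source buffer -> shared region  */
        memcpy(dst, shared, n);  /* receiver: shared region -> destination  */
    }

    /* Thread-based MPI: source and destination live in the same address
     * space, so the receiver copies straight from the sender's buffer. */
    void transfer_one_copy(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);     /* receiver: source buffer -> destination  */
    }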

The disadvantage of the Lockless MPI implementation is that one cannot simply take a normally compiled program, link it to the MPI library, and have the resulting program work correctly. Ranks on the same machine would share the locations of global variables, breaking the MPI spec and leading to erroneous execution. Code linked to the library therefore needs to be pre-processed to convert global variables into thread-local ones. The mpicpp binary parses C code and does this by prepending __thread annotations to any global variables it finds outside of system headers.
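
As an example of the kind of transformation mpicpp performs (the variable name here is purely illustrative):

    /* Before preprocessing (as written by the programmer): every rank on
     * a machine would share this counter, because ranks are threads
     * within one process. */
    #if 0
    int iteration_count = 0;
    #else
    /* After preprocessing: mpicpp prepends __thread, giving each rank
     * (thread) its own private copy, as the MPI model expects. */
    __thread int iteration_count = 0;
    #endif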

Parsing C is relatively simple because it is a small language. C++ is unfortunately much more complex, so there is currently no version of mpicpp that works for it. However, one may still link to the Lockless MPI library if all global variables are manually converted into thread-local ones. It is generally considered good C++ style to have as few global variables (and singletons) as possible, which makes this job relatively simple for many programs. The other language commonly used with MPI is Fortran. Unfortunately, standard Fortran lacks thread-local variables, so there is no version of mpicpp for it either. In addition, compiler optimizations make using threads in Fortran difficult. Future editions of the Lockless MPI library may rectify these issues.

Messages

Messages in Lockless MPI are handled by a wait-free queue for each rank. This means that sending a message is a non-blocking operation with ideal scalability. The speed is further improved by including the Lockless Memory Allocator within Lockless MPI, allowing non-blocking allocations and frees. Through the liberal use of MPI_Isend(), MPI_Irecv(), MPI_Test(), and similar functions, it is possible to build lock-free algorithms with this MPI implementation without any of the difficult per-cpu knowledge normally required.
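
For instance, a standard non-blocking exchange that overlaps computation with communication can be written with ordinary MPI calls; nothing Lockless-specific is needed:

    #include <mpi.h>

    /* Post a non-blocking send and receive, do useful work while the
     * library makes progress, then complete both requests. */
    void exchange(double *out, double *in, int n, int peer)
    {
        MPI_Request reqs[2];
        int done = 0;

        MPI_Isend(out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        while (!done) {
            /* ... overlap some computation here ... */
            MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
        }
    }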

Small messages are stored within the envelope on the wait-free queue. Larger messages use an allocated buffer with the message information packed into it. The MPI_bufcopy_set() and MPI_bufcopy_get() functions can be used to tune the maximum size at which this happens. The largest messages are passed by reference: the receiving rank obtains a pointer to the source buffer and copies the data from it.
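
A hypothetical usage sketch follows; the exact prototypes of MPI_bufcopy_set() and MPI_bufcopy_get() are not reproduced here, so an interface that takes and returns a threshold in bytes is assumed:

    #include <stddef.h>
    #include <mpi.h>   /* assumed to declare the Lockless-specific calls below */

    /* Hypothetical sketch only: a size-in-bytes interface is assumed for
     * MPI_bufcopy_set() and MPI_bufcopy_get(). */
    void tune_copy_threshold(void)
    {
        size_t old = MPI_bufcopy_get();   /* assumed: returns the current threshold */

        MPI_bufcopy_set(4096);            /* assumed: messages up to 4 KiB are now
                                             copied; larger ones are passed by
                                             reference */
        (void)old;
    }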

Since we usually do not want to wait until a matching receive is posted, send operations need the ability to complete early. This is done through the use of a signal handler, using the lowest-priority real-time signal. The sender copies the data into a newly allocated buffer, and then notifies the receiver that it should use that buffer instead of the source specified in the original message.
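
The following is a purely illustrative sketch of this kind of mechanism, assuming the lowest-priority real-time signal is SIGRTMAX as on Linux; it is not the library's actual internals:

    #include <signal.h>
    #include <string.h>

    /* Illustrative sketch only. In the scheme described above, the
     * sender would copy its data into a freshly allocated buffer here
     * and repoint the pending message at that buffer. */
    static void early_complete(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        /* ... copy the source buffer, update the message envelope ... */
    }

    int install_early_completion_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = early_complete;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);

        /* SIGRTMAX: the lowest-priority real-time signal on Linux */
        return sigaction(SIGRTMAX, &sa, NULL);
    }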

IO

IO is handled by a separate completion thread. This thread uses epoll to efficiently wait on the sockets connecting to other machines in the cluster. Multiple ranks within the same machine (which are threads within a single process) will share the same connecting sockets. This greatly reduces the number of open file descriptors required compared to other MPI implementations.
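
The general shape of such an epoll-driven completion loop is sketched below; this is only an illustration of the technique, not the library's source:

    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    /* Illustrative sketch: one thread waits on all of the sockets that
     * connect this machine to the rest of the cluster (previously added
     * with epoll_ctl(EPOLL_CTL_ADD)), and services whichever are ready. */
    void io_loop(int epfd)
    {
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

            for (int i = 0; i < n; i++) {
                if (events[i].events & EPOLLIN) {
                    /* progress outstanding receives on this connection */
                }
                if (events[i].events & EPOLLOUT) {
                    /* progress outstanding sends on this connection */
                }
            }
        }
    }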

Since connections are shared between ranks, they use a thread-safe wait-free queue similar to the message queues. The synchronization is fine-grained so that one thread may be reading and another simultaneously writing on the same socket. If no MPI rank is currently executing IO operations, then the IO thread will make sure outstanding sends and receives continue to complete over the connections. This allows improved overlap of communication and computation.

This implementation of MPI also supports Infiniband for connecting machines. Since Infiniband requires hooks in the memory allocator in order to work correctly, we also provide a version of the MPI library without Infiniband support, for extra speed on ethernet or SMP-only installations. Due to its low-latency nature, Infiniband uses an extra completion thread alongside the standard IO thread. This extra thread polls Infiniband state whenever messages are being rapidly sent or received.

The Infiniband transport protocol used by Lockless MPI has different algorithms for small, medium, and large messages. The smallest messages are copied whole into the address space of the destination, and directly inserted into the wait-free message queue. Medium sized messages are transferred over a byte-stream based ring-buffer using the same stream protocol as is used over ethernet sockets. The largest messages use Infiniband to register memory in the source and destination with the hardware. The hardware is directed to copy from the source to the destination without interrupting either cpu.
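
In terms of the verbs API, registering a large buffer so that the hardware can move it without cpu involvement looks roughly like the following sketch (illustrative only, with minimal error handling):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Illustrative sketch: register a large message buffer with the HCA
     * so an RDMA operation can copy it directly into the destination's
     * memory without interrupting either cpu. */
    struct ibv_mr *register_large_buffer(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }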

Since multiple ranks share the same process, far fewer Infiniband resources are required compared to other MPI implementations. Only one queue-pair and a third of a MiB are required per connected machine, no matter how many cores it has. Unfortunately, the current version of the library does not support the use of multiple Infiniband ports on a machine; only the first connected port will be used. This version also assumes that the network topology is such that, if Infiniband is used, every machine can reach every other machine over it.

Sleeping

Most MPI implementations will spin-wait when a requested message hasn't arrived. This increases cpu usage drastically when the required message is delayed. Lockless MPI takes a different approach. It will spin for a short while (about a millisecond), and then go to sleep. When a new message is posted, or an Isend completes, the rank will wake up again. This is similar to the difference between spinlocks and mutexes. Spinlocks work well if the wait time is short and you know there aren't more potential waiters than cpus. Mutexes are more versatile, and perform well in many situations where spinlocks will not.
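
Schematically, the waiting strategy is spin-then-block, as in the following pthread-based sketch (illustrative only; the library's internals differ, for example by using its wait-free queues):

    #include <pthread.h>
    #include <stdbool.h>

    #define SPIN_ITERATIONS 10000   /* stand-in for "about a millisecond" of spinning */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static volatile bool   message_ready = false;

    /* Spin briefly in case the message arrives soon, then sleep on a
     * condition variable so the core is freed for other ranks. */
    void wait_for_message(void)
    {
        for (int i = 0; i < SPIN_ITERATIONS; i++) {
            if (message_ready)
                return;                        /* fast path: arrived while spinning */
        }

        pthread_mutex_lock(&lock);
        while (!message_ready)
            pthread_cond_wait(&cond, &lock);   /* sleep until a sender wakes us */
        pthread_mutex_unlock(&lock);
    }

    void post_message(void)
    {
        pthread_mutex_lock(&lock);
        message_ready = true;
        pthread_cond_signal(&cond);            /* wake a sleeping rank */
        pthread_mutex_unlock(&lock);
    }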

Thus the Lockless MPI implementation will work quite well even if you have more executing ranks than cores on a machine. You can dynamically tune your workload by watching total cpu usage, and adding more ranks if it is too low. Note that this sleeping can interfere with the logic of some MPI benchmarks. They may expect 100% cpu usage at all times, and can give spuriously bad results when that isn't the case.

The ability to sleep isn't affected by the use of Infiniband. The Infiniband completion thread will notice when messages arrive, and will process them in a timely manner even if no MPI calls are being made. For very large messages, the Infiniband HCA will offload nearly all the work, leaving extremely low cpu usage.

Debugging

Unlike other MPI implementations, Lockless MPI uses threads to represent ranks. This means that only one debug session is needed to monitor all ranks on a given machine. You can use the gdb thread command to switch between ranks and investigate their state. The first few threads created will be these ranks. The final thread(s) will be the IO thread and, if it exists, the Infiniband IO thread. If you use the -debug option to mpiexec, gdb will be launched in a new window for each machine in the cluster.

Debugging memory access problems can be difficult. Fortunately, Lockless MPI provides a separate library compiled with Valgrind support. This removes the use of the Lockless Memory Allocator, so Valgrind can hook all memory creation and destruction events correctly. However, because of this, Infiniband is not supported. This shouldn't be too much of a problem, since Valgrind's cpu emulation will be slow anyway. You can use the --use_valgrind compile-time flag to force linking against the debug library rather than the normal one.
