How does it work? (For MPI Version 1.1)

The Lockless MPI Library has a very different design from other MPI implementations.

Threads

Other MPI implementations (and the MPI specification itself) are process based. Each rank in the MPI job has a completely isolated address space. This design allows programs using MPI to be ported seamlessly between clusters and SMP systems. However, with the rise in popularity of multi-core systems, extra efficiency can be gained by changing this. The Lockless MPI implementation uses threads instead of processes. Thus each rank shares its address space with the other ranks on the same machine. (Ranks on different machines still maintain separate address spaces.) This has performance implications, since fewer copies are required to transfer data from rank to rank. It also means that one cannot simply compile and run normal MPI programs with this scheme, as global variables would "stomp" on each other.

Performance in the Lockless MPI implementation is greatly improved over other MPI libraries due to the shared address space. Typically, ranks on the same machine communicate via shared memory. Messages then need to be copied from the address space of the source into the shared region, and copied again out of the shared region into the address space of the receiving rank. So at least two memory copies are required for each message with that technique. With a shared address space, however, only a single copy is required. By simply passing a pointer from the source to the destination, the receiving rank may copy directly from the source buffer into its request. For large messages, the time spent in these copies dominates, so Lockless MPI is over twice as fast in that regime due to using half as many copies.

The disadvantage of the Lockless MPI implementation is that one cannot simply take a normally compiled program, link it to the MPI library, and have the result work correctly. Ranks on the same machine will share the locations of global variables, breaking the MPI spec, so erroneous execution will occur. Code linked to the library therefore needs to be pre-processed to remove these globals by converting them into thread-local variables instead. (A small sketch of this rewrite is shown below, after the discussion of messages.) Parsing C is relatively simple, since it is a small language. C++ is unfortunately much more complex, so there is currently no version of this conversion step for C++.

Messages

Messages in Lockless MPI are handled by a wait-free queue for each rank. This means that sending a message is a non-blocking operation with ideal scalability. The speed is further improved by including the Lockless Memory Allocator within Lockless MPI, allowing non-blocking allocations and frees. Through the liberal use of such wait-free techniques, messaging overhead is kept to a minimum.

Small messages are stored within the envelope on the wait-free queue. Larger messages use an allocated buffer that has the message information packed into it.

Since we usually do not want to wait until a matching receive is posted, send operations need the ability to complete early. This is done through the use of a signal handler; the lowest-priority real-time signal is used for this. The sender copies the data into a newly allocated buffer, and then notifies the receiver that it should use that buffer instead of the source specified in the original message.
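The internal structures of Lockless MPI are not reproduced in this document, but a minimal sketch of one well-known wait-free design, an intrusive multi-producer single-consumer queue built on C11 atomics, shows how posting a message can complete in a single non-blocking step. The envelope layout and names here are illustrative assumptions, not the library's actual code:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical envelope; a real one would carry the tag,
       communicator, and either inline data or a buffer pointer. */
    struct envelope {
        _Atomic(struct envelope *) next;
        /* ... message fields ... */
    };

    struct msg_queue {
        _Atomic(struct envelope *) tail; /* shared by all senders */
        struct envelope *head;           /* owned by the receiver */
        struct envelope stub;            /* dummy node; initially
                                            head == tail == &stub */
    };

    /* Posting a message: one atomic exchange, no loops and no locks,
       so any number of sending ranks can post concurrently. */
    static void queue_post(struct msg_queue *q, struct envelope *e)
    {
        atomic_store_explicit(&e->next, NULL, memory_order_relaxed);
        struct envelope *prev =
            atomic_exchange_explicit(&q->tail, e, memory_order_acq_rel);
        atomic_store_explicit(&prev->next, e, memory_order_release);
    }

The receiving rank drains the queue from head without contending with senders, which is what makes a per-rank queue of this shape attractive.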
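As for the global-variable conversion described in the Threads section above, its effect can be pictured with the sketch below. The variable name is invented, and gcc's __thread qualifier (C11 spells it _Thread_local) is shown as one way a global can become per-rank; this only illustrates the idea, not the actual output of the preprocessing step:

    /* Before preprocessing: an ordinary global. With ranks as
       threads, every rank on the machine would share this variable. */
    int iteration_count;

    /* After preprocessing: thread-local, so each rank (thread) gets
       its own private copy, restoring the per-process semantics the
       MPI spec assumes. (Renamed here only so that both versions can
       sit in one file.) */
    __thread int iteration_count_tls;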
IO

IO is handled by a separate completion thread. Since connections are shared between ranks, they use a thread-safe wait-free queue similar to the message queues. The synchronization is fine-grained, so that one thread may be reading while another is simultaneously writing on the same socket. If no MPI rank is currently executing IO operations, the IO thread will make sure that outstanding sends and receives continue to complete over the connections. This allows improved overlap of communication and computation.

This implementation of MPI also supports Infiniband for connecting machines. Since Infiniband requires hooks in the memory allocator in order to work correctly, we also provide a version of the MPI library without Infiniband support, for extra speed on ethernet or SMP-only installations. Due to its low-latency nature, Infiniband uses one additional completion thread beyond the standard IO thread. This extra thread polls Infiniband state whenever messages are rapidly being sent or received.

The Infiniband transport protocol used by Lockless MPI has different algorithms for small, medium, and large messages. The smallest messages are copied whole into the address space of the destination, and inserted directly into the wait-free message queue. Medium-sized messages are transferred over a byte-stream based ring buffer, using the same stream protocol as is used over ethernet sockets. The largest messages register the source and destination memory with the Infiniband hardware, which is then directed to copy from source to destination without interrupting either cpu. (A sketch of this size-based dispatch is shown below, after the section on sleeping.) Since multiple ranks share the same process, far fewer Infiniband resources are required than in other MPI implementations. Only one queue-pair and a third of a MiB is required per connected machine, no matter how many cores it has.

Unfortunately, the current version of MPI does not support the use of multiple Infiniband ports on a machine. Only the first connected port will be used. This version of MPI also assumes that the network topology is such that if Infiniband is used, all machines can talk to all other machines over it.

Sleeping

Most MPI implementations will spin-wait when a requested message hasn't arrived. This increases cpu usage drastically when the required message is delayed. Lockless MPI takes a different approach. It will spin for a short while (about a millisecond), and then go to sleep. When a new message is posted, or other progress is made, the sleeping rank is woken up. Thus the Lockless MPI implementation will work quite well even if you have more executing ranks than cores on a machine. You can dynamically tune your workload by watching total cpu usage, and adding more ranks if it is too low.

Note that this sleeping can interfere with the logic within some MPI benchmarks. They may expect 100% cpu usage at all times, and can give spuriously bad results when that isn't the case. The ability to sleep isn't affected by the use of Infiniband. The Infiniband completion thread will notice when messages arrive, and will process them in a timely manner even if no MPI calls are being made. For very large messages, the Infiniband HCA offloads nearly all the work, leaving extremely low cpu usage.
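A rough sketch of this spin-then-sleep strategy appears below. The condition-variable wake-up and the message_ready() helper are assumptions made for illustration; the library's real progress engine is not shown in this document:

    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    extern bool message_ready(void);  /* assumed: polls the rank's queue */

    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_wake = PTHREAD_COND_INITIALIZER;

    static void wait_for_message(void)
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        /* Spin briefly first: if the message is nearly here, this
           avoids the cost of a sleep/wake round trip. */
        for (;;) {
            if (message_ready())
                return;
            clock_gettime(CLOCK_MONOTONIC, &now);
            long ns = (now.tv_sec - start.tv_sec) * 1000000000L
                    + (now.tv_nsec - start.tv_nsec);
            if (ns > 1000000L)  /* roughly one millisecond */
                break;
        }

        /* Otherwise sleep. A sender posting a new message would call
           pthread_cond_signal(&q_wake) to wake this rank. */
        pthread_mutex_lock(&q_lock);
        while (!message_ready())
            pthread_cond_wait(&q_wake, &q_lock);
        pthread_mutex_unlock(&q_lock);
    }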
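Returning to the Infiniband transport described earlier, the three-way split by message size can be sketched as follows. The threshold values are invented purely for illustration; the real cutoffs are not given in this document:

    #include <stddef.h>

    /* Hypothetical cutoffs, chosen only to make the sketch concrete. */
    #define IB_SMALL_MAX   256         /* fits inside a queue envelope */
    #define IB_MEDIUM_MAX  (64 * 1024) /* streamed via the ring buffer */

    enum ib_path { IB_INLINE, IB_STREAM, IB_RDMA };

    static enum ib_path choose_ib_path(size_t len)
    {
        if (len <= IB_SMALL_MAX)
            return IB_INLINE;   /* copy whole message into the queue  */
        if (len <= IB_MEDIUM_MAX)
            return IB_STREAM;   /* byte-stream over the ring buffer   */
        return IB_RDMA;         /* register memory; the HCA copies it */
    }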
Debugging

Unlike other MPI implementations, Lockless MPI uses threads to represent ranks. This means that only one debug session is needed to monitor all the ranks on a given machine. You can use the gdb thread commands to switch between ranks and inspect each one individually.

Debugging memory access problems can be difficult. Fortunately, Lockless MPI provides a separate library compiled with Valgrind support. This removes the use of the Lockless Memory Allocator, so Valgrind can hook all memory creation and destruction events correctly. However, due to this, Infiniband is not supported. This shouldn't be too much of a problem, since Valgrind's cpu emulation will be slow anyway.