Lockless Inc

Lockless MPI Demo

MPI is the Message Passing Interface, a standard that specifies a set of library functions and types for constructing cross-platform programs that use message passing for parallel communication. It is designed so that the same program may run on both SMP shared-memory machines and clusters of machines without shared memory. This flexibility allows programs to be debugged on the desktop, and then launched on the big iron when done.

The core concept of MPI is that nothing is shared. This share-nothing ideal means that there is no global state that needs to be protected by locks or other synchronization primitives. Instead, each MPI "rank" runs with its own memory and no obscure and slow atomic instructions are needed for computations. All communication is handled through the MPI library, which can be especially optimized so that this communication is as fast as possible.

The beauty of MPI is that only a very few of its functions need to be understood in order to use it. Most of the rest can be thought of as being orthogonally composed from a few primitives. The primary ones are MPI_Send(), which sends a message, and MPI_Recv(), which receives one. Their non-blocking counterparts are MPI_Isend() and MPI_Irecv(), which return a handle of type MPI_Request that can be tested for completion via MPI_Test() or waited on via MPI_Wait(). Other "explicitly buffering" and "ready" send types also exist, and these too come in non-blocking varieties.
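As a sketch of how these primitives fit together, a hypothetical helper (not part of the library) might post a non-blocking send and receive and then wait on both requests; the overlap region is where useful computation can hide communication latency:

```c
#include "mpi.h"

/* Sketch: exchange one int with a partner rank using the
 * non-blocking calls described above. */
void exchange(int peer, int *out, int *in)
{
	MPI_Request sreq, rreq;

	/* Post both operations; neither call blocks */
	MPI_Isend(out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &sreq);
	MPI_Irecv(in, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &rreq);

	/* ... useful computation can overlap the communication here ... */

	/* Wait for both requests to complete */
	MPI_Wait(&sreq, MPI_STATUS_IGNORE);
	MPI_Wait(&rreq, MPI_STATUS_IGNORE);
}
```

The same pattern works with MPI_Test() in a polling loop if you would rather not block at all.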

Lockless MPI Demo

The demo version of Lockless MPI is a free download which lets you test out the library in a limited configuration. The main constraint is that at most two MPI ranks are enabled. However, you may choose to have these two either on the same machine (to test local message passing) or on separate machines (to test message passing over the network). Since only two ranks are allowed, we have disabled nearly all "collective" operations and the communicator-manipulating functions, which typically only make sense with larger numbers of MPI ranks. For simplicity, Infiniband is also disabled (it requires the ibverbs library to be installed). However, if you would like to test Infiniband, just contact us.

The demo version is, however, perfect for doing benchmarks. You can test all of the point-to-point communication methods, and compare latency, bandwidth, and CPU usage with other MPI libraries. We hope you'll find this useful for seeing the advantages of our design.

The key feature of Lockless MPI is that it uses a thread-based layout instead of a process-based layout on each machine. This means that ranks can share address spaces, reducing message-passing latency and increasing bandwidth by halving the number of copies. To implement this scheme, we use a pre-processor which converts all global variables in your C program into thread-local ones, and similarly changes all function-static variables into their thread-local equivalents. The pre-processor only works with C; C++ and Fortran (and other languages) are currently unsupported. However, if you remove the use of global variables from your program, you may still have luck in getting this ultra-fast MPI to work.
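The transformation the pre-processor performs is, in effect, the following (a hand-written illustration of the idea, not actual tool output):

```c
/* Before the pre-processor: an ordinary global, shared process-wide.
 * After: one private copy per thread, i.e. per MPI rank.  gcc spells
 * the thread-local qualifier __thread; C11 has _Thread_local. */
__thread int counter = 0;

int bump(void)
{
	/* Function-static variables get the same treatment */
	static __thread int calls = 0;
	calls++;
	counter++;
	return calls;
}
```

Each thread that calls bump() sees its own counter and calls, so two ranks running in the same process no longer race on shared globals.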

Using MPI

The first thing you should do when testing out MPI is to create a "hostfile". This allows you to describe the shape of a cluster to the library by using the -file option. Lockless MPI uses the standard format for such a file:

# You probably want this to be the name of the local machine (but not "localhost")
user@machine slots=1

user@some_other_machine slots=1

# This machine defaults to using four ranks
some_other_user@machine3 slots=4

# Another quad-core machine
user@machine4 slots=4

MPI programs are not compiled, linked, and run with the normal commands. Instead, the MPI specification lists several executables that should be used: to compile or link an MPI program, use mpicc; to run one, use mpiexec. In the case of Lockless MPI, mpicc is a shell script. You can modify it to point to your favourite C compiler; by default, it uses gcc. mpiexec uses ssh to launch programs on the machines described in the hostfile, with a given total number of ranks to launch.
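A typical session might look like the following (the source and binary names here are made up, and the exact launcher flags may vary by installation; the -file option is the one described above):

```shell
# Compile and link the benchmark with the MPI wrapper script
mpicc -o pingpong pingpong.c

# Launch two ranks across the machines listed in "hostfile"
mpiexec -file hostfile -n 2 ./pingpong
```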

So to create a simple benchmark, we will use a "ping-pong" algorithm. Messages will be passed back and forth between two ranks. For simplicity, we will use the MPI_Sendrecv() function for this, which simultaneously sends and receives a message. By changing the size of the messages used, we can see how the latency and bandwidth are affected.

Such a function may look like:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_P2	21

static void ping_pong(int rank, int ping, int pong, int counts)
{
	int i, j;
	int tag = 1;
	int source = 0, dest = 0;

	int *inmsg = NULL, *outmsg = NULL;
	double startwtime = 0, endwtime;
	MPI_Status status;

	if ((rank == ping) || (rank == pong))
	{
		/* Allocate message buffers */
		inmsg = malloc(1LL << MAX_P2);
		outmsg = malloc(1LL << MAX_P2);

		/* Fill them with something */
		memset(inmsg, 57, 1LL << MAX_P2);
		memset(outmsg, 57, 1LL << MAX_P2);
	}

	/* Set source and destinations */
	if (rank == ping)
	{
		dest = pong;
		source = pong;
	}
	else if (rank == pong)
	{
		dest = ping;
		source = ping;
	}

	/* Make sure everyone is here before we start */
	MPI_Barrier(MPI_COMM_WORLD);

	for (j = 0; j < MAX_P2; j++)
	{
		if (rank == ping) startwtime = MPI_Wtime();

		if ((rank == ping) || (rank == pong))
		{
			for (i = 0; i < counts; i++)
			{
				/* Send and receive a message from our partner */
				MPI_Sendrecv(outmsg, 1LL << j, MPI_CHAR, dest, tag,
					inmsg, 1LL << j, MPI_CHAR, source, tag,
					MPI_COMM_WORLD, &status);
			}
		}

		if (rank == ping)
		{
			endwtime = MPI_Wtime();
			printf("%d took %f seconds.  Bandwidth %f MiB/s\n", j, endwtime - startwtime, (1LL << j) * counts / ((1LL << 20) * (endwtime - startwtime)));
		}
	}

	/* Clean up */
	if ((rank == ping) || (rank == pong))
	{
		free(inmsg);
		free(outmsg);
	}
}

Note how in the above, all ranks use the same control flow even though they may not be part of the ping-pong benchmark. This is considered good programming practice, and the MPI spec has many details which encourage it. In fact, the above could be linearized even further, with extra ranks sending to and receiving from "MPI_PROC_NULL", which does nothing. Other examples may use "inter-communicators" and "topologies" for this purpose.
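A sketch of that fully linearized form (illustrative only, not part of the benchmark above): every rank executes the same MPI_Sendrecv(), and ranks outside the pair simply "talk to" MPI_PROC_NULL, which completes immediately with no effect.

```c
#include "mpi.h"

/* Every rank calls this with identical control flow; only the
 * partner differs.  Buffer and size arguments are assumed to be
 * set up as in the benchmark above. */
static void linear_pingpong(int rank, int ping, int pong,
			    char *outmsg, char *inmsg, int len, int tag)
{
	int partner = MPI_PROC_NULL;

	if (rank == ping)
		partner = pong;
	else if (rank == pong)
		partner = ping;

	/* For uninvolved ranks this is a no-op that returns at once */
	MPI_Sendrecv(outmsg, len, MPI_CHAR, partner, tag,
		     inmsg, len, MPI_CHAR, partner, tag,
		     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```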

To use the function, we can call it like so:

int main(int argc, char **argv)
{
	int numprocs;
	int rank;
	int namelen;
	char processor_name[MPI_MAX_PROCESSOR_NAME];

	/* Initialize MPI */
	MPI_Init(&argc, &argv);

	/* Print out some information about ourselves */
	MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Get_processor_name(processor_name, &namelen);
	printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

	/* Send messages between ranks 0 and 1. */
	ping_pong(rank, 0, 1, 1000);

	/* Done with MPI */
	MPI_Finalize();

	return 0;
}

The results of the above program are shown on the MPI benchmarks page. Try to reproduce them yourself. Note that you may need to run the program a few times to get a stable result: your computer is likely running many other tasks, and these can steal time, occasionally producing spuriously slow results.


