NAME

mpiexec - Execute an MPI program

SYNOPSIS

Single Program, Multiple Data (SPMD) mode:

mpiexec [ options ] <program> [ <args> ]

Multiple Program, Multiple Data (MPMD) mode:

mpiexec [ global_options ] [ local_options1 ] <program1> [ <args1> ] :
        [ local_options2 ] <program2> [ <args2> ] : ... :
        [ local_optionsN ] <programN> [ <argsN> ]

QUICK SUMMARY

If you just want to run an MPI program, you probably want to use a command line like:

mpiexec [ -n # ] [ -file <hostfile> ] <program> <program args>

This will start # ranks of the program, using the hostfile to work out which machines to run on. The remote processes will be launched via ssh. If more ranks are launched than are described in the hostfile, then the extra ranks will be allocated in a round-robin pattern amongst the available machines.

GLOBAL OPTIONS

<program> - MPI application to launch

<args> - The command-line arguments to be passed to the MPI application

-?, -help - Prints a short usage message

-x, -xterm - Create an xterm for every machine. Run the MPI application inside each xterm.

-d, -debug - Create an xterm for every machine, and run the MPI application inside each xterm using gdb.

-v, -verbose - Be more verbose when running an MPI application. Sets the MPI_VERBOSE environment variable in the context of the MPI application.

-V, -version - Print the Lockless MPI version number, and MPI specification version.

-q, -quiet - Don't output connection messages.

-S, -sshport <N> - Changes the port used to connect to the ssh server on remote machines.

-c, -configfile <filename> - Use a config file to launch an MPI application.

-noinfiniband - Disables Infiniband usage.

--remote - Internally used to launch mpiexec on a remote machine with information to connect back to the master node.

LOCAL OPTIONS

-n, -np <N> - Execute this many ranks

-s, -soft {<A>, <A:B>, <A:B:C>, <>} - Range of numbers of ranks to execute: A processes, A to B processes, or A to B processes in increments of C. Multiple triplets can be used, separated by commas.

-h, -host <machine> - Use this host

-a, -arch <architecture> - The only values accepted are: "x86_64", "X86_64", "amd64" or "AMD64".

-w, -wdir <directory> - The working directory of the executing process on the remote machine

-p, -path <path> - The path to use to find the remote executable

-f, -file <filename> - Hostfile providing extra information about the hosts to use.

-m, -mpiexec <filename> - Location of mpiexec executable on remote host. Use this if you have multiple versions of MPI installed, and some other version of mpiexec appears first in your PATH.

DESCRIPTION

Specifying Host Nodes

A host is specified by using the -host option. The number of ranks to run on that host must be specified with a -n or -np option. So to run 3 ranks on c1 we would use:

mpiexec -n 3 -host c1 <prog> <args>

If multiple hosts are needed, separate them with colons. Each section should specify how many ranks to use. Note that you should not use a hostname that is not globally constant, such as "localhost", as DNS lookups will give differing answers on each machine. To launch a single rank on each of c1, c2 and c3:

mpiexec -n 1 -host c1 <prog> : -n 1 -host c2 <prog> : -n 1 -host c3 <prog>

Obviously, the above form can be tedious for large numbers of machines. So using hostfiles is much simpler. The format of a hostfile is:

  # Comments have a leading hash. Blank lines are ignored.

  # A single rank:
  user1@machine1

  # Multiple ranks:
  user2@machine2 slots=4

  # Use default login:
  machine3 slots=2

Using such a hostfile is done by including the -file option on the command line. The number of slots is the default number of ranks to run on that machine before moving to the next in the list. If the total number of ranks requested is less than the number described in the hostfile, mpiexec will use the first machines listed in the hostfile when selecting ranks. If the total number of ranks is greater, then mpiexec will assign them in a round-robin fashion.

For example,

mpiexec -n 1 -file hostfile <prog> <args>

will execute <prog> with one rank on machine1, using user1 as a login name.

mpiexec -n 3 -file hostfile <prog> <args>

will execute one rank on machine1, and two ranks on machine2.

mpiexec -n 9 -file hostfile <prog> <args>

will launch <prog> with 2 ranks on machine1, 5 on machine2, and 2 on machine3.

Multiple hostfiles may be used, separated by colons:

mpiexec -n 12 -file hostfile1 <prog1> <args1> : -n 10 -file hostfile2 <prog2> <args2>

will execute 12 processes described by hostfile1, and 10 by hostfile2.

The number of processes can also be selected by using the -soft option. However, this implementation of MPI will simply calculate the maximal number of ranks allowed by the syntax, and use that as if it were the value entered in a -n or -np option.
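
For example, the largest number of ranks described by the triplet 2:8:2 is 8, so

mpiexec -soft 2:8:2 -file hostfile <prog>

behaves as if -n 8 had been given.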

Current Working Directory

The current working directory of a rank on a remote machine will be the default directory for ssh logins. This will most likely be the home directory. A locally launched rank will not change the current working directory by default. To change this, use the -wdir command line option.
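
For example, to make every rank start in /data/run1 (an arbitrary example directory):

mpiexec -n 4 -wdir /data/run1 -file hostfile <prog>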

Standard I/O

The stdout file descriptor of remote ranks will by default be tunneled by ssh to the local machine.

The stderr file descriptor is not redirected. This allows an MPI process to choose where its output goes. By using the dup2() system call, MPI applications can choose at runtime to output to a local terminal window, or have all output tunneled to the root rank.
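
For example, a rank that wants all of its output collected at the root machine can redirect stderr into the tunneled stdout stream. A minimal sketch (error checking omitted):

  #include <unistd.h>

  /* Send everything written to stderr wherever stdout goes. For
     remote ranks stdout is tunneled over ssh, so stderr output
     will then also arrive at the root machine. */
  void tunnel_stderr(void)
  {
      dup2(STDOUT_FILENO, STDERR_FILENO);
  }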

The stdin file descriptor is not redirected. This allows input at remote machines by using their xterms. Note that multiple ranks may share a common stdin when they run as processes on the same machine. A programmer can use the same techniques as are used in threaded programs to make sure input goes to the required rank.

Signals

This implementation of MPI uses a real-time signal to accelerate blocked message sends. It uses the undocumented __libc_allocate_rtsig() function to allocate an unused signal on startup. The lowest priority unallocated real-time signal is used for this purpose. If you are debugging MPI applications in gdb you may want to use the

(gdb) handle SIG64 noprint

command to turn off halting when this signal is raised. (Assuming that SIG64 is the lowest unallocated one.)

If an MPI process running on a machine is killed via a signal, other processes in the same job will notice. They will exit immediately when their socket connection to the dead process is shut down.

No signals are propagated by MPI, so local ranks will need to inform remote ranks programmatically if, for example, SIGUSR1 is received.
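
A sketch of one way to arrange this, assuming the rank polls between computation steps (the flag-and-forward pattern, the SIGNAL_TAG constant and the poll_signals() helper are illustrative, not part of the MPI API):

  #include <signal.h>
  #include <mpi.h>

  #define SIGNAL_TAG 99   /* arbitrary message tag for this example */

  static volatile sig_atomic_t got_usr1;

  /* Handlers must not call MPI functions (this implementation is not
     async-signal safe, see Process Termination below), so only set a
     flag here. Register with: signal(SIGUSR1, usr1_handler);  */
  static void usr1_handler(int sig)
  {
      (void)sig;
      got_usr1 = 1;
  }

  /* Called from the rank's main loop: forward the event to rank 1
     with an ordinary point-to-point message. */
  static void poll_signals(void)
  {
      if (got_usr1)
      {
          int note = SIGUSR1;
          got_usr1 = 0;
          MPI_Send(&note, 1, MPI_INT, 1, SIGNAL_TAG, MPI_COMM_WORLD);
      }
  }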

The IO threads, which handle communication to other machines, should not be signalled unnecessarily. The other processes, which correspond to MPI ranks, can have their signal properties adjusted at will (provided that the low priority real-time signal is left unblocked).

Process Termination

If during the execution of an MPI application any rank dies unexpectedly, the corresponding process on that machine will also die. Remote machines will exit immediately when their socket connection to the dead process is shut down.

User signal handlers should avoid calling MPI functions. This MPI implementation is not async-signal safe. For example, if MPI_Send() is called with a bad buffer and a segmentation fault occurs, the internal state of MPI is undefined. If the SIGSEGV signal handler attempts to call MPI_Finalize(), it is unlikely the necessary communication can take place. If a critical error like this occurs, the safest thing to do is clean up non-MPI state, and then exit.

Config Files

The -configfile <filename> option allows most of the command line given to mpiexec to be stored in a file. The lines of <filename> correspond to the colon-separated sections of a normal mpiexec command line. Lines beginning with '#' are comments, and are ignored. Lines ending with a backslash '\' continue onto the next line.

The advantage of this invocation method is that it allows arguments to programs to contain colons, which is impossible with the normal execution method.
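
For example, the two-hostfile job above could be stored in a config file like this (a sketch following the format described above):

  # One colon-separated section per line.
  -n 12 -file hostfile1 <prog1> arg:with:colons
  -n 10 -file hostfile2 <prog2>

and launched with:

mpiexec -configfile <filename>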

Infiniband

By default, if an Infiniband network interface is detected, then it will be used by MPI. However, if the -noinfiniband command line argument is given then MPI will fall back to purely using ethernet for inter-machine communication. This can be useful for debugging Infiniband issues. Note that either all machines in a cluster should have Infiniband available, or none of them should. Mixed-use clusters are not supported.
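
For example, to force ethernet-only communication while investigating a problem:

mpiexec -noinfiniband -n 8 -file hostfile <prog>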

It is assumed that all machines are routable to each other over Infiniband. i.e. the Infiniband network isn't split.

It is also assumed that there is a TCP/IP link to each machine in the cluster. The MPI implementation will start each node via ssh. Once an MPI process is running on each node, communication will switch to using Infiniband.

This MPI implementation will only use one Infiniband port per machine. If multiple ports are available, it will detect the first connected one, and use that for communication.

Note that if an MPI application is misbehaving whilst using Infiniband, try the following tests:

Test whether your application uses the fork() system call. If so, memory corruption can result if the child process alters memory mapped by the Infiniband hardware. To test for this, use the pthread_atfork() function to register handlers that will notify you when this occurs, as sketched below.
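
A minimal sketch of such a check, assuming install_fork_check() is called early on (for example just after MPI_Init()); the function names are illustrative:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  /* The first pthread_atfork() handler runs in the parent just
     before every fork(), so any fork() in the process or in the
     libraries it uses will be reported. */
  static void report_fork(void)
  {
      fprintf(stderr, "warning: fork() called in pid %d\n",
              (int)getpid());
  }

  void install_fork_check(void)
  {
      pthread_atfork(report_fork, NULL, NULL);
  }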

This MPI implementation can stress your machine in ways that other applications do not. This may expose latent hardware issues:

Try updating your Infiniband HCA firmware. Some older firmware may crash when subjected to the high loads this MPI implementation can sustain.

Try testing your RAM. DMA from Infiniband to marginally working RAM may trigger issues that are rarely detectable in normal use.

Debugging

To ease debugging, this version of MPI has the -debug option for startup. This will launch remote MPI ranks within gdb. Note that the local ranks will not be affected by this flag. (To debug them, just launch mpiexec with gdb.)

Lockless MPI uses an unusual process arrangement, which can be confusing to the debugger. It starts by executing the mpiexec process. That process then uses the clone() system call to create new processes that share the same address space with mpiexec. The new processes correspond to the individual ranks within the MPI job.

This means that multiple copies of the same program will be executing within the same address space. gdb has some support for this, however you will need to make sure that when you create or remove breakpoints that you do it from the correct gdb 'inferior'.
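
For example, to set a breakpoint in the second rank's copy of the program (compute_step is a hypothetical function name):

(gdb) info inferiors

(gdb) inferior 2

(gdb) break compute_step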

MPI adjusts the list of loaded object files within the C library so that gdb can find all copies loaded. This means that on some unintended exits, the list might not be in the state the C library expects. If this happens, the C library can output some warning messages. The messages are harmless, and simply reflect its lack of knowledge of the real process state.

Unfortunately, this version of Lockless MPI does not support being run within the memory tester Valgrind. The MPI library needs to use a few system calls not yet supported by Valgrind. Also, the use of the %gs segment register is not yet supported there.

Startup

Once the command line parameters are parsed, mpiexec will then attempt to launch the requested programs on the remote machines. It does this by invoking the ssh executable. In order for this to work, ssh needs to be accessible somewhere in your PATH. The port of the remote ssh server can be altered by using the -sshport option.

By default, ssh will use the name of the current user to log in to the remote machines. Other user names can be used by using the "name@host" syntax instead of "host" to describe the remote machines.

Since entering passwords every time an MPI job is started is very tedious, it may be useful to set up ssh-agent. Once the public keys of all the remote machines are known, password-less logins are possible. Use ssh-agent bash to start a bash shell with an associated agent. Then use ssh-add to add your keys. Finally, run mpiexec and no passwords will be needed if the agent is working correctly. If the machine connection messages are no longer wanted, use the -quiet flag to disable them.
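
A typical session then looks like:

ssh-agent bash

ssh-add

mpiexec -quiet -n 8 -file hostfile <prog>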

In order to discover the location of the remote executables, an altered PATH may be needed. To set one, use the -path option. This will replace the default PATH.

The login and connection process may fail. To prevent MPI jobs waiting forever for a process to start up, there is a timeout of 60 seconds. If a remote node still cannot contact the master after this, it will exit.

Finally, the MPI jobs will attempt to connect to each other over ethernet. They do this by creating a server IO thread at each machine. These threads then attempt to open sockets connecting to the root machine. This implementation of MPI uses arbitrary high ports. The port numbers are tunneled over ssh in the startup sequence.

ENVIRONMENT VARIABLES

This MPI implementation uses environment variables to determine how to execute an MPI process. No command line arguments are used, so a program may inspect them before calling MPI_Init() if required. mpiexec sets the environment based on its command line.

MPI_RANKS - The rank within MPI_COMM_WORLD for this process. You should not modify this value.

MPI_PROGNAME - The name of the program. This is used to set the process name for this MPI rank, so it will appear as something different than 'mpiexec' in utilities such as top.

MPI_WDIR - The working directory. On startup, MPI_Init() will change to this directory.

MPI_BUFCOPY_SIZE - Sets the maximal size at which messages will be automatically buffered. See MPI_bufcopy_set().

MPI_IB_MR_MAX - This sets the maximum number of Infiniband Memory Regions to use. The default is to use all available. Using this allows other Infiniband applications to share use of the HCA.

MPI_GLOBALS - Internal pointer to machine-global variables initialized within mpiexec.
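
For example, a program can inspect these variables before calling MPI_Init(). A small sketch:

  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      /* Set by mpiexec; read-only as noted above. */
      const char *rank = getenv("MPI_RANKS");
      if (rank)
          printf("launched as rank %s\n", rank);

      /* ... MPI_Init(&argc, &argv) and the rest of the program ... */
      return 0;
  }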

RETURN VALUE

The return value of mpiexec will be whatever the return value of the root rank of the MPI application is. This will be 0 if all ranks exit by calling MPI_Finalize() and the root process then exits with EXIT_SUCCESS.

If a remote process shuts down prematurely, its error code will not be propagated to the output of mpiexec. Instead, the root process will notice the failing machine by the disconnection of its TCP socket, and will then exit with a status of 1.

If a remote rank aborts using the MPI_Abort() function, the error code passed to it will be propagated to the root rank. The root rank will then exit with the requested error code, and so in turn will mpiexec.
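
Thus a job script can test the combined result directly:

  mpiexec -n 4 -file hostfile <prog>
  if [ $? -ne 0 ]; then
      echo "MPI job failed" >&2
  fi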

COPYRIGHT

Copyright (C) 2012 Lockless Inc.

SEE ALSO

mpicc(1), mpicxx(1), mpif77(1), MPI_Init(3)
