# 2D Wave Equation

A simple yet useful example of the type of problem typically solved in a HPC context is that of the 2D wave equation. This partial differential equation (PDE) can be discretized onto a grid. We can then construct a set of equations describing how the wave amplitude propagates forward, time-step by time-step. Here, we will only worry about the case where the wave speed is constant within the region of interest. We will also use absorbing boundaries, to try and minimize the amount of reflections. The kernel of such a routine then looks like:

``````
#define ASIZE 1000

#define A(x, i, j)  x[(i)+ ((j) * ASIZE)]

/* Single-step forward in time to time-step c from old time-step a, and current time-step b */
static void step(double *restrict a, double *restrict b, double *restrict c)
{
int i, j;
double o, m;

/* Stable as long as lambda <= 0.5 */
double lambda = 0.5;

for (j = 0; j < ASIZE; j++)
{
for (i = 0; i < ASIZE; i++)
{
m = A(b, i, j);
o = A(a, i, j);

/* Top */
if (!j)
{
if (!i)
{
/* Corner */
A(c, i, j) = 0;
}
else if (i == ASIZE - 1)
{
/* Corner */
A(c, i, j) = 0;
}
else
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i, j + 1) + o - A(a, i, j + 1)
+ 0.5 * (A(b, i - 1, j) - 2*m + A(b, i + 1, j))) - o + 2*m;
}
}

/* Bottom */
else if (j == ASIZE - 1)
{
if (!i)
{
/* Corner */
A(c, i, j) = 0;
}
else if (i == ASIZE - 1)
{
/* Corner */
A(c, i, j) = 0;
}
else
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i, j - 1) + o - A(a, i, j - 1)
+ 0.5 * (A(b, i - 1, j) - 2*m + A(b, i + 1, j))) - o + 2*m;
}
}

/* Middle */
else
{
if (!i)
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i + 1, j) + o - A(a, i + 1, j)
+ 0.5 * (A(b, i, j - 1) - 2*m + A(b, i, j + 1))) - o + 2*m;
}
else if (i == ASIZE - 1)
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i - 1, j) + o - A(a, i - 1, j)
+ 0.5 * (A(b, i, j - 1) - 2*m + A(b, i, j + 1))) - o + 2*m;
}
else
{
/* Middle */
A(c, i, j) = lambda*(A(b, i - 1, j) + A(b, i + 1, j) - 4*m
+ A(b, i, j - 1) + A(b, i, j + 1)) - o + 2*m;
}
}

get_colour(A(c, i, j), &bitmap->data[i*4 + j*ASIZE*4+2],
&bitmap->data[i*4 + j*ASIZE*4+1], &bitmap->data[i*4 + j*ASIZE*4]);
}
}
}
``````

Note how in the above, we have split the function into nine special cases, corresponding to the corners, edges or middle of the region. (We can't use the same code in each because the stencil for the middle region would then try to access data outside the grid.) This is a general feature of such PDE solvers, where most of the code has to deal with the boundaries. This situation gets worse in three dimensions. Fortunately, the compiler is smart and can do "loop peeling" for us, so we don't have to that particular optimization by hand. Instead, we can keep the code in the more generic form above for simplicity.

The only slightly non-trivial thing in the above is the introduction of the `lambda` parameter. This describes the spacing of the grid in the time direction relative to the spacial directions. If the grid is incorrectly spaced, then waves can propagate so fast that they need information outside the stencil describing their motion. When that happens, the result is numerical instability. So long as lambda is small enough, the above will work. This limitation is known as the Courant - Friedrichs - Lewy condition (CFL condition), and occurs in many methods of solving PDE's

Since we will plot the results on the screen as the simulation progresses, we need to generate a colour for the wave amplitude. This is the duty of the `get_colour()` function, which simply uses a blue colour gradient:

``````
static void get_colour(double sv, char *r, char *g, char *b)
{
int c1 = (sv + 1.0) * 128.0;

if (c1 > 255) c1 = 255;
if (c1 < 0) c1 = 0;

/* Write rgb value */
*b = (char) 255;
*g = c1;
*r = c1;
}
``````

To put the image on the screen now requires a little fiddling with xlib. Fortunately, this isn't all that hard. We will construct a window, and use our bitmap as its background. We can then update the bitmap as needed to get some crude animation. (Yes, this isn't the best way of doing this - but like most scientific visualization, simple and crude does the job.) A real code would output every n time-steps to a datafile for later processing and replay, plus allow for check-pointing if something goes wrong.

The resulting X code looks like:

``````
#include <X11/Xlib.h>
#include <X11/Xutil.h>

/* X11 data */
static Display *dpy;
static Window win;
static XImage *bitmap;
static char *bitmap_data;
static GC gc;
static Atom wmDeleteMessage;

static void exit_x11(void)
{
XDestroyImage(bitmap);
XDestroyWindow(dpy, win);
XCloseDisplay(dpy);
}

static void init_x11(int size)
{
unsigned long white, black;

char name = "Wave Equation";
char *n = name;
int depth;

Visual *visual;

Status st;

XTextProperty tp;

/* Attempt to open the display */
dpy = XOpenDisplay(NULL);

/* Failure */
if (!dpy) exit(0);

white = WhitePixel(dpy,DefaultScreen(dpy));
black = BlackPixel(dpy,DefaultScreen(dpy));

win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy),
0, 0, ASIZE, ASIZE, 0, black, white);

/* We want to be notified when the window appears */

/* Make it appear */
XMapWindow(dpy, win);

while (1)
{
XEvent e;
XNextEvent(dpy, &e);
if (e.type == MapNotify) break;
}

st = XStringListToTextProperty(&n, 1, &tp);
if (st) XSetWMName(dpy, win, &tp);

/* Wait for the MapNotify event */
XFlush(dpy);

depth = DefaultDepth(dpy, DefaultScreen(dpy));
visual = DefaultVisual(dpy, DefaultScreen(dpy));

if ((depth != 32) && (depth != 24)) errx(1, "Need 32bit colour screen!\n");

/* Make bitmap */
bitmap = XCreateImage(dpy, visual, depth,
ZPixmap, 0, bitmap_data,
size, size, 32, 0);

/* Init GC */
gc = XCreateGC(dpy, win, 0, NULL);
XSetForeground(dpy, gc, black);

wmDeleteMessage = XInternAtom(dpy, "WM_DELETE_WINDOW", False);
XSetWMProtocols(dpy, win, &wmDeleteMessage, 1);
}

static int test_exit(void)
{
XEvent event;
KeySym key;
char text;

/* Handle pending events */
while (XPending(dpy))
{
/* Get the pending event */
XNextEvent(dpy, &event);

/* Press 'q' to quit */
if ((event.type == KeyPress) &&
XLookupString(&event.xkey, text, 255, &key, 0) == 1)
{
if (text == 'q') return 1;
}

/* Or simply close the window */
if ((event.type == ClientMessage) &&
((Atom) event.xclient.data.l == wmDeleteMessage))
{
return 1;
}
}

return 0;
}
``````

Now we just need the main routine that uses the time-stepper, and displays the result. We will add a Gaussian-shaped drop in the center to start with. After 300 time steps, and every 300 thereafter, we will add another drop randomly located within the grid. This will give a series of ring-shaped waves that will expand outwards.

The main trick to the driver routine is that we only ever need three time-steps active any one time. Two of them are needed to calculate the third. Once that third time-step is complete, the oldest of the original two can be dropped. By simply renaming the arrays used to call the `step()` function, we can avoid copies.

``````
int main(void)
{
double *a = malloc(sizeof(double) * ASIZE * ASIZE);
double *b = malloc(sizeof(double) * ASIZE * ASIZE);
double *c = malloc(sizeof(double) * ASIZE * ASIZE);
int i, j;
int x, y;

int count = 0;

if (!a || !b || !b) errx(1, "Failed to allocate arrays\n");

/* Make bitmap */
bitmap_data = malloc(ASIZE * ASIZE * 4);
if (!bitmap_data) errx(1, "Couldn't allocate bitmap\n");

/* Init rng */
srand(time(NULL));

/* Make a window! */
init_x11(ASIZE);

/* Init array */
for (j = 0; j < ASIZE; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - ASIZE/2;
double dy = j - ASIZE/2;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) = 10 * exp(-d2);
A(b, i, j) = A(a, i, j);
}
}

while (1)
{
step(a, b, c);
step(b, c, a);
step(c, a, b);

/* Display it */
XPutImage(dpy, win, gc, bitmap,
0, 0, 0, 0, ASIZE, ASIZE);

XFlush(dpy);

if (test_exit()) break;

count++;
if (count == 100)
{
count = 0;

x = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
y = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
for (j = 0; j < ASIZE; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - x;
double dy = j - y;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) += 10 * exp(-d2);
A(b, i, j) += 10 * exp(-d2);
}
}
}
}

free(a);
free(b);
free(c);

/* Done! */
exit_x11();

return 0;
}
``````

The above very simple code will produce the following output: The next step is to try and speed up the code. The obvious things to try are vectorization and/or putting it on a GPU. However, generically that may be difficult. The numeric scheme may require special functions to be evaluated (which are difficult to vectorize). The numeric scheme may also require something like Newton Iteration to converge to the next time-step, (even if it is still an explicit method). Complex logic doesn't work very well on GPUs. Instead, we will look at methods that will always work. The simplest of which is to divide the calculating region amongst many threads.

We will split the array into horizontal strips, and distribute them between threads. A thread will then only need to calculate its own region. So the stepper routine should be altered a bit. Here, we simply remove the outer loop, and let the caller determine which y-coordinate to use as a scan-line:

``````
/* Single-step forward in time to time-step c from old time-step a, and current time-step b */
static void pstep(double *restrict a, double *restrict b, double *restrict c, int j)
{
int i;
double o, m;

/* Stable as long as lambda <= 0.5 */
double lambda = 0.5;

for (i = 0; i < ASIZE; i++)
{
m = A(b, i, j);
o = A(a, i, j);

/* Top */
if (!j)
{
if (!i)
{
/* Corner */
A(c, i, j) = 0;
}
else if (i == ASIZE - 1)
{
/* Corner */
A(c, i, j) = 0;
}
else
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i, j + 1) + o - A(a, i, j + 1)
+ 0.5 * (A(b, i - 1, j) - 2*m + A(b, i + 1, j))) - o + 2*m;
}
}

/* Bottom */
else if (j == ASIZE - 1)
{
if (!i)
{
/* Corner */
A(c, i, j) = 0;
}
else if (i == ASIZE - 1)
{
/* Corner */
A(c, i, j) = 0;
}
else
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i, j - 1) + o - A(a, i, j - 1)
+ 0.5 * (A(b, i - 1, j) - 2*m + A(b, i + 1, j))) - o + 2*m;
}
}

/* Middle */
else
{
if (!i)
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i + 1, j) + o - A(a, i + 1, j)
+ 0.5 * (A(b, i, j - 1) - 2*m + A(b, i, j + 1))) - o + 2*m;
}
else if (i == ASIZE - 1)
{
/* Edge */
A(c, i, j) = lambda * (-m + A(b, i - 1, j) + o - A(a, i - 1, j)
+ 0.5 * (A(b, i, j - 1) - 2*m + A(b, i, j + 1))) - o + 2*m;
}
else
{
/* Middle */
A(c, i, j) = lambda*(A(b, i - 1, j) + A(b, i + 1, j) - 4*m
+ A(b, i, j - 1) + A(b, i, j + 1)) - o + 2*m;
}
}

get_colour(A(c, i, j), &bitmap->data[i*4 + j*ASIZE*4+2],
&bitmap->data[i*4 + j*ASIZE*4+1], &bitmap->data[i*4 + j*ASIZE*4]);
}
}
``````

We will need to send to each thread the arrays to use. (We could make these global variables... but this is a bit nicer.) We also need to let each thread know which one it is, so it can calculate which strip of the array to use.

``````
typedef struct args args;
struct args
{
int me;
double *a;
double *b;
double *c;
};
``````

The main routine now needs to launch `NTHREAD - 1` threads to get `NTHREAD` threads running at the same time. (Since the main thread can itself do computational work as well.) Fortunately, this is easy to do with the pthread library:

``````

int main(void)
{
double *a = malloc(sizeof(double) * ASIZE * ASIZE);
double *b = malloc(sizeof(double) * ASIZE * ASIZE);
double *c = malloc(sizeof(double) * ASIZE * ASIZE);
int i, j;

args *arg;

if (!a || !b || !b) errx(1, "Failed to allocate arrays\n");

/* Make bitmap */
bitmap_data = malloc(ASIZE * ASIZE * 4);
if (!bitmap_data) errx(1, "Couldn't allocate bitmap\n");

/* Init rng */
srand(time(NULL));

/* Make a window! */
init_x11(ASIZE);

/* Init array */
for (j = 0; j < ASIZE; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - ASIZE/2;
double dy = j - ASIZE/2;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) = 10 * exp(-d2);
A(b, i, j) = A(a, i, j);
}
}

/* Parallel */
for (i = 0; i < NTHREAD - 1; i++)
{
arg = malloc(sizeof(args));
if (!arg) errx(1, "Couldn't allocate args\n");
arg->me = i;
arg->a = a;
arg->b = b;
arg->c = c;
}
arg = malloc(sizeof(args));
if (!arg) errx(1, "Couldn't allocate args\n");
arg->a = a;
arg->b = b;
arg->c = c;

/* Wait for the threads to exit */
for (i = 0; i < NTHREAD - 1; i++)
{
}

free(a);
free(b);
free(c);

/* Done! */
exit_x11();

return 0;
}
``````

In the above, all of the computation has been abstracted out into the function `thread_main()`. Now each thread can use its number, to work out which region to calculate. After calculating that region, it needs to wait until its compatriots are done. If so, they all can proceed onto the next time-step. Thus we will interleave computation with `pthread_barrier_wait()` calls to synchronize everything. Finally, one thread can display the results and add the extra drops as time progresses.

``````
static int exiting = 0;
{
args *ar = arg;
int me = ar->me;
double *a = ar->a;
double *b = ar->b;
double *c = ar->c;

/* Which region do we calculate? */
int y1 = ASIZE * me / NTHREAD;
int y2 = (ASIZE * (me + 1) / NTHREAD);

int i, j;
int x, y;
int count = 0;

free(ar);

while (1)
{
for (j = y1; j < y2; j++)
{
pstep(a, b, c, j);
}

for (j = y1; j < y2; j++)
{
pstep(b, c, a, j);
}

for (j = y1; j < y2; j++)
{
pstep(c, a, b, j);
}

if (!me)
{
/* Display it */
XPutImage(dpy, win, gc, bitmap,
0, 0, 0, 0, ASIZE, ASIZE);

XFlush(dpy);

exiting = test_exit();

count++;

if (count == 100)
{
count = 0;

x = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
y = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
for (j = 0; j < ASIZE; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - x;
double dy = j - y;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) += 10 * exp(-d2);
A(b, i, j) += 10 * exp(-d2);
}
}
}
}

if (exiting) break;
}

return NULL;
}
``````

Notice how none of the X11 code has changed. Also notice how the only synchronization we need is a single pthreads barrier. The above technique is very easy to use, and easy to debug. It results in about double the frames per second as the non-parallel code. Why not more? The problem is that updating the bitmap, and displaying the result is slow. If we didn't have to output, the speedup is closer to the four times expected by the number of threads (equal to the number of cores on this machine).

### Using MPI

The above is slightly inefficient though. Every thread needs to wait for every other thread at the end of every time-step. In reality, each thread only needs to wait for its neighbours which affect the grids on its boundary. It also only needs to wait until those particular grid-cells are calculated. If the threads calculate the boundary cells first, then there could presumably be a much better overlap of computation and synchronization.

Unfortunately, pthreads doesn't really offer a primitive that will work in the way we want. Due to races, it is possible for a thread to have finished one time-slice, and want to start a second, before its neighbours have even acknowledged getting its data. The MPI library offers a solution. The implicit buffering it does decouples the threads from each other nicely. They just need to wait for the right messages when needed.

The only function that needs to change is the main one. Instead of breaking out a separate function for the computation, we can go back to how things were organized in the single-threaded case. We just need to use MPI communication to let each thread know what its neighbours are doing.

``````
#include "mpi.h"

int main(int argc, char **argv)
{
double *a, *b, *c;

int rank;
int nranks;
int y1, y2;
int i, j;

int x, y;

int count = 0;

int uprank, downrank;
MPI_Request up_rq, down_rq;

int *bitmap_counts;
int *bitmap_displ;

int exiting;

MPI_Init(&argc, &argv);

MPI_Comm_size(MPI_COMM_WORLD, &nranks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

y1 = ASIZE * rank / nranks;
y2 = (ASIZE * (rank + 1) / nranks);

if (!rank)
{
uprank = MPI_PROC_NULL;
downrank = 1;

/* Make bitmap */
bitmap_data = malloc(ASIZE * ASIZE * 4);
if (!bitmap_data) errx(1, "Couldn't allocate bitmap\n");

/* Init data for bitmap gather */
bitmap_counts = malloc(nranks * sizeof(int));
bitmap_displ = malloc(nranks * sizeof(int));

bitmap_displ = 0;
for (i = 0; i < nranks; i++)
{
bitmap_counts[i] = 4 * ASIZE * ((ASIZE * (i + 1) / nranks) - ASIZE * i / nranks);
if (i != nranks - 1) bitmap_displ[i + 1] = bitmap_displ[i] + bitmap_counts[i];
}

/* Make a window */
init_x11(ASIZE);

/* Init rng */
srand(time(NULL));
}
else
{
if (rank != nranks - 1)
{
uprank = rank - 1;
downrank = rank + 1;
}
else
{
uprank = rank - 1;
downrank = MPI_PROC_NULL;
}

bitmap_data = malloc((y2 - y1) * ASIZE * 4);
if (!bitmap_data) errx(1, "Couldn't allocate bitmap\n");

/* Shift bitmap_data */
bitmap_data -= y1 * ASIZE * 4;

bitmap_counts = NULL;
bitmap_displ = NULL;
}

/* Allocate shifted arrays using void * arithmetic */
a = malloc(sizeof(double) * (y2 - y1 + 2) * ASIZE);
b = malloc(sizeof(double) * (y2 - y1 + 2) * ASIZE);
c = malloc(sizeof(double) * (y2 - y1 + 2) * ASIZE);

if (!a || !b || !b) errx(1, "Failed to allocate arrays\n");

/* Shift the arrays so that we can use zero-based offsets */
a -= (y1 - 1) * ASIZE;
b -= (y1 - 1) * ASIZE;
c -= (y1 - 1) * ASIZE;

/* Init my part of the array + overlap */
for (j = y1 - 1; j <= y2; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - ASIZE/2;
double dy = j - ASIZE/2;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) = 10 * exp(-d2);
A(b, i, j) = A(a, i, j);
}
}

while (1)
{
/* Calculate boundaries as early as possible, and then send them */
pstep(a, b, c, y1);
MPI_Isend(&A(c, 0, y1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, &up_rq);

pstep(a, b, c, y2 - 1);
MPI_Isend(&A(c, 0, y2 - 1), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, &down_rq);

for (j = y1 + 1; j < y2 - 1; j++)
{
pstep(a, b, c, j);
}

/* Wait for the results */
MPI_Recv(&A(c, 0, y1 - 1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&A(c, 0, y2), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Wait(&up_rq, MPI_STATUS_IGNORE);
MPI_Wait(&down_rq, MPI_STATUS_IGNORE);

/* Calculate boundaries as early as possible, and then send them */
pstep(b, c, a, y1);
MPI_Isend(&A(a, 0, y1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, &up_rq);

pstep(b, c, a, y2 - 1);
MPI_Isend(&A(a, 0, y2 - 1), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, &down_rq);

for (j = y1 + 1; j < y2 - 1; j++)
{
pstep(b, c, a, j);
}

/* Wait for the results */
MPI_Recv(&A(a, 0, y1 - 1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&A(a, 0, y2), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Wait(&up_rq, MPI_STATUS_IGNORE);
MPI_Wait(&down_rq, MPI_STATUS_IGNORE);

/* Calculate boundaries as early as possible, and then send them */
pstep(c, a, b, y1);
MPI_Isend(&A(b, 0, y1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, &up_rq);

pstep(c, a, b, y2 - 1);
MPI_Isend(&A(b, 0, y2 - 1), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, &down_rq);

for (j = y1 + 1; j < y2 - 1; j++)
{
pstep(c, a, b, j);
}

/* Wait for the results */
MPI_Recv(&A(b, 0, y1 - 1), ASIZE, MPI_DOUBLE, uprank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Recv(&A(b, 0, y2), ASIZE, MPI_DOUBLE, downrank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Wait(&up_rq, MPI_STATUS_IGNORE);
MPI_Wait(&down_rq, MPI_STATUS_IGNORE);

/* Send image information to root */
MPI_Gatherv(&bitmap_data[y1*ASIZE*4], ASIZE*4*(y2 - y1), MPI_BYTE,
bitmap_data, bitmap_counts, bitmap_displ, MPI_BYTE, 0, MPI_COMM_WORLD);

if (!rank)
{
/* Display it */
XPutImage(dpy, win, gc, bitmap,
0, 0, 0, 0, ASIZE, ASIZE);

XFlush(dpy);

exiting = test_exit();
}
MPI_Bcast(&exiting, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (exiting) break;

count++;

if (count == 100)
{
count = 0;

/* Decide where */
if (!rank)
{
x = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
y = (((ASIZE - 101.0) * rand()) / RAND_MAX) + 50.0;
}

/* Tell other ranks */
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&y, 1, MPI_INT, 0, MPI_COMM_WORLD);

for (j = y1 - 1; j <= y2; j++)
{
for (i = 0; i < ASIZE; i++)
{
double dx = i - x;
double dy = j - y;

double d2 = (dx*dx+dy*dy) / 20;

A(a, i, j) += 10 * exp(-d2);
A(b, i, j) += 10 * exp(-d2);
}
}
}
}

/* Invert shifts */
a += (y1 - 1) * ASIZE;
b += (y1 - 1) * ASIZE;
c += (y1 - 1) * ASIZE;
bitmap_data += y1 * ASIZE * 4;

free(a);
free(b);
free(c);

/* Done! */
if (!rank)
{
/* X will free our bitmap for us */
exit_x11();
}
else
{
free(bitmap_data);
}

MPI_Finalize();

return 0;
}
``````

The above is a little bit long, and there a few tricky parts. The main idea is to let every MPI rank use the same instruction sequence as much as possible. This is done by only rarely checking what the rank is, and using things like `MPI_PROC_NULL` (which when sent to or received from, does nothing) so that edge cases disappear. The goal is to keep the inner loop as simple and stream-lined as possible.

Another trick is to notice that most threads only need memory allocated for their strip of the arrays. Instead of wasting memory, we can use C pointer arithmetic to just allocate what we need, and then offset everything so that normal array accesses can be used. This means that the `pstep()` function doesn't need to be changed. Similarly, we can do the same thing with the bitmap. However, since the root rank needs to display the bitmap on the screen it does need room for the whole thing.

The only difficulty is that memory isn't shared between MPI ranks. For things like the random drop locations, or whether or not we are exiting, we need explicit communication. Fortunately, MPI has the `MPI_Bcast()` function which does a broadcast of information. As can be seen, the same form can be used on both the sending and receiving end, which simplifies things enormously.

The result is a technique that scales from a single multi-core machine, to a whole cluster. It is even a few percent faster than the pthreads method (the barriers are quite heavy-weight). You just need to worry about how to distribute data as well as computation. 