## 2D Wave Equation

A simple yet useful example of the type of problem typically solved in an HPC context is the 2D wave equation. This partial differential equation (PDE) can be discretized onto a grid, giving a set of equations that describe how the wave amplitude propagates forward, time-step by time-step. Here, we will only worry about the case where the wave speed is constant within the region of interest. We will also use absorbing boundaries to minimize reflections. The kernel of such a routine then looks like:
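The original listing is not reproduced here, but a minimal sketch of such a stepper kernel (interior points only; names and the exact discretization are assumptions) might look like:

```c
#include <stddef.h>

/* One time-step of the discretized 2D wave equation for interior
 * points only.  C is the squared Courant number (c*dt/dx)^2; the
 * arrays are nx*ny grids stored row-major.  The boundary cells
 * (the eight special cases mentioned below) are handled separately. */
static void step_interior(int nx, int ny, double C,
                          const double *uold, const double *u,
                          double *unew)
{
    for (int y = 1; y < ny - 1; y++) {
        for (int x = 1; x < nx - 1; x++) {
            size_t i = (size_t)y * nx + x;

            /* Five-point Laplacian stencil */
            double lap = u[i - 1] + u[i + 1]
                       + u[i - nx] + u[i + nx] - 4.0 * u[i];

            /* Leapfrog update: needs the two previous time levels */
            unew[i] = 2.0 * u[i] - uold[i] + C * lap;
        }
    }
}
```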
Note how in the above, the function is split into nine special cases, corresponding to the corners, edges and middle of the region. (We can't use the same code in each because the stencil for the middle region would otherwise try to access data outside the grid.) This is a general feature of such PDE solvers: much of the code has to deal with the boundaries, and the situation gets worse in three dimensions. Fortunately, the compiler is smart enough to do "loop peeling" for us, so we don't have to do that particular optimization by hand, and can keep the code in the more generic form above for simplicity.

Since we will plot the results on the screen as the simulation progresses, the only slightly non-trivial remaining task is generating a colour for each wave amplitude. This is the duty of a small colour-mapping routine:
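The original routine is not shown; a plausible stand-in (the colour scheme here is an assumption) maps the amplitude to a packed RGB pixel:

```c
#include <stdint.h>

/* Map a wave amplitude in [-1, 1] to a packed 0x00RRGGBB pixel:
 * positive amplitudes shade red, negative shade blue, zero is black.
 * (Hypothetical stand-in for the routine described above.) */
static uint32_t colour(double a)
{
    if (a > 1.0) a = 1.0;       /* clamp out-of-range amplitudes */
    if (a < -1.0) a = -1.0;

    uint32_t r = a > 0 ? (uint32_t)(a * 255.0 + 0.5) : 0;
    uint32_t b = a < 0 ? (uint32_t)(-a * 255.0 + 0.5) : 0;
    return (r << 16) | b;
}
```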
To put the image on the screen now requires a little fiddling with Xlib. Fortunately, this isn't all that hard. We construct a window and use our bitmap as its background; we can then update the bitmap as needed to get some crude animation. (Yes, this isn't the best way of doing this, but like most scientific visualization, simple and crude does the job.) A real code would write every n time-steps to a data file for later processing and replay, plus allow for check-pointing if something goes wrong. The resulting X code looks like:
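Again, the original listing is missing; a minimal sketch of the Xlib setup (error handling and event processing omitted, names assumed, link with -lX11) could look like:

```c
#include <X11/Xlib.h>

/* Create a window and an XImage backed by our pixel buffer, then
 * push frames with XPutImage.  A sketch only: a real program would
 * check for errors and process the X event queue. */
static Display *dpy;
static Window win;
static GC gc;
static XImage *img;

static void x11_init(int w, int h, char *pixels /* w*h*4 bytes */)
{
    dpy = XOpenDisplay(NULL);
    int scr = DefaultScreen(dpy);

    win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr), 0, 0, w, h,
                              0, 0, BlackPixel(dpy, scr));
    gc = DefaultGC(dpy, scr);

    /* Wrap our pixel buffer so XPutImage can blit it directly */
    img = XCreateImage(dpy, DefaultVisual(dpy, scr), 24, ZPixmap,
                       0, pixels, w, h, 32, 0);
    XMapWindow(dpy, win);
}

static void x11_update(int w, int h)
{
    XPutImage(dpy, win, gc, img, 0, 0, 0, 0, w, h);
    XFlush(dpy);
}
```

This needs a running X server, so it is shown for illustration only.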
Now we just need the main routine that uses the time-stepper and displays the result. We will add a Gaussian-shaped drop in the center to start with. After 300 time-steps, and every 300 thereafter, we will add another drop at a random location within the grid. This gives a series of ring-shaped waves that expand outwards. The main trick to the driver routine is that we only ever need three time-steps active at any one time: two of them are needed to calculate the third, and once that third time-step is complete, the older of the original two can be dropped. By simply renaming the arrays passed to the stepper routine, no copying is ever required.
When run, this very simple code animates the expanding ring-shaped waves on screen.
## Parallelization with Pthreads

The next step is to try to speed up the code. The obvious things to try are vectorization and/or moving it to a GPU. However, in general that may be difficult: the numeric scheme may require special functions to be evaluated (which are hard to vectorize), or it may require something like Newton iteration to converge to the next time-step (even if it is still an explicit method), and complex logic doesn't work very well on GPUs. Instead, we will look at methods that will always work. The simplest of these is to divide the calculation region amongst many threads. We will split the array into horizontal strips and distribute them between threads, so each thread only needs to calculate its own region. The stepper routine therefore needs to be altered a bit: we simply remove the outer loop, and let the caller determine which y-coordinate to use as a scan-line:
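Assuming the same interior update as before, the altered per-scan-line stepper might look like:

```c
#include <stddef.h>

/* Update one scan-line y of an nx-wide row-major grid; the caller
 * (a worker thread) decides which rows it owns.  Interior points
 * only, exactly as in the single-threaded kernel. */
static void step_row(int nx, double C, const double *uold,
                     const double *u, double *unew, int y)
{
    for (int x = 1; x < nx - 1; x++) {
        size_t i = (size_t)y * nx + x;

        double lap = u[i - 1] + u[i + 1]
                   + u[i - nx] + u[i + nx] - 4.0 * u[i];

        unew[i] = 2.0 * u[i] - uold[i] + C * lap;
    }
}
```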
We will need to send to each thread the arrays to use. (We could make these global variables... but this is a bit nicer.) We also need to let each thread know which one it is, so it can calculate which strip of the array to use.
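A sketch of such an argument structure, together with the strip arithmetic each thread performs (field and function names are assumptions):

```c
#include <stddef.h>

/* Everything a worker thread needs, passed via the void * argument
 * of pthread_create. */
typedef struct {
    int id;                 /* which thread this is, 0..nthreads-1 */
    int nthreads;
    int nx, ny;
    double C;
    double *uold, *u, *unew;
} thread_args;

/* Interior rows [start, end) owned by thread id.  The rounding is
 * arranged so all ny-2 interior rows are covered exactly once even
 * when the division isn't exact. */
static int strip_start(int id, int nthreads, int ny)
{
    return 1 + (int)((long long)id * (ny - 2) / nthreads);
}

static int strip_end(int id, int nthreads, int ny)
{
    return strip_start(id + 1, nthreads, ny);
}
```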
The main routine now needs to launch one worker thread per strip, and wait for them all to finish.
In the above, all of the computation has been abstracted out into a single worker function that every thread runs.
Notice how none of the X11 code has changed. Also notice that the only synchronization we need is a single pthreads barrier. This technique is very easy to use and easy to debug. It yields about double the frames per second of the non-parallel code. Why not more? The problem is that updating the bitmap and displaying the result is slow. Without the output, the speedup is closer to the factor of four expected from the number of threads (equal to the number of cores on this machine).

## Using MPI

The above is slightly inefficient, though. Every thread waits for every other thread at the end of every time-step. In reality, each thread only needs to wait for the neighbours that affect the grid cells on its boundary, and only until those particular cells are calculated. If the threads calculated their boundary cells first, there could presumably be a much better overlap of computation and synchronization. Unfortunately, pthreads doesn't really offer a primitive that works the way we want: due to races, it is possible for a thread to have finished one time-step, and to want to start a second, before its neighbours have even acknowledged receiving its data.

The MPI library offers a solution. The implicit buffering it does decouples the threads from each other nicely; they just need to wait for the right messages when needed. The only function that needs to change is the main one. Instead of breaking out a separate function for the computation, we can go back to how things were organized in the single-threaded case, and use MPI communication to let each rank know what its neighbours are doing.
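The full listing is not reproduced here, but the heart of it is the halo exchange each rank performs per time-step. A sketch (function name, tags, and layout are assumptions) might look like:

```c
#include <mpi.h>
#include <stddef.h>

/* Exchange ghost rows with the neighbouring ranks.  Each rank owns
 * interior rows [ystart, yend) of the global nx-wide grid, plus one
 * ghost row on each side.  MPI_PROC_NULL makes the top and bottom
 * ranks fall through the same code path as everyone else. */
static void exchange_halos(double *u, int nx, int ystart, int yend,
                           int rank, int nranks)
{
    int up   = rank > 0          ? rank - 1 : MPI_PROC_NULL;
    int down = rank < nranks - 1 ? rank + 1 : MPI_PROC_NULL;

    /* Send our top owned row up; receive the neighbour's top row
     * into our bottom ghost row. */
    MPI_Sendrecv(&u[(size_t)ystart * nx],     nx, MPI_DOUBLE, up,   0,
                 &u[(size_t)yend * nx],       nx, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send our bottom owned row down; receive into our top ghost. */
    MPI_Sendrecv(&u[(size_t)(yend - 1) * nx], nx, MPI_DOUBLE, down, 1,
                 &u[(size_t)(ystart - 1) * nx], nx, MPI_DOUBLE, up, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

Because `MPI_Sendrecv` buffers and pairs the transfers, no rank can race ahead of its neighbours by more than the ghost rows allow. (This requires an MPI installation and mpirun, so it is shown for illustration only.)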
The above is a little long, and there are a few tricky parts. The main idea is to let every MPI rank use the same instruction sequence as much as possible, by only rarely checking what the rank is and by arranging for the edge ranks to fall through the same code paths as the interior ones.

Another trick is to notice that most ranks only need memory allocated for their own strip of the arrays. Instead of wasting memory, we can use C pointer arithmetic to allocate just what we need, and then offset the pointer so that normal array accesses with global indices can still be used.

The only real difficulty is that memory isn't shared between MPI ranks. For things like the random drop locations, or whether or not we are exiting, we need explicit communication; fortunately, MPI's collective operations handle exactly this.

The result is a technique that scales from a single multi-core machine to a whole cluster. It is even a few percent faster than the pthreads method (the barriers are quite heavy-weight). You just need to worry about how to distribute data as well as computation.
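The pointer-offset allocation trick can be sketched in isolation (names are assumptions; note that forming a pointer before the allocated block is technically outside the C standard, though it works on common platforms, which is the point of the trick):

```c
#include <stdlib.h>
#include <stddef.h>

/* Allocate only the rows this rank owns (plus one ghost row on each
 * side), but return a pointer offset so that the usual global-index
 * arithmetic u[y*nx + x] still works for y in [ystart-1, yend]. */
static double *alloc_strip(int nx, int ystart, int nrows)
{
    double *base = calloc((size_t)(nrows + 2) * nx, sizeof *base);
    return base ? base - (size_t)(ystart - 1) * nx : NULL;
}

/* Undo the offset before handing the pointer back to free(). */
static void free_strip(double *u, int nx, int ystart)
{
    if (u)
        free(u + (size_t)(ystart - 1) * nx);
}
```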
