Optimization

Optimization isn't what it used to be. Quite a lot of the advice for speeding up your code is now out of date. Modern machines operate quite differently from those in the 80's and 90's. In that time, processors were in-order machines. They typically also were connected to ram that ran at either the same frequency as the cpu, or a fixed small fraction of it. They also only had a single processor core, and lacked vectorization instructions.

Modern processors are "out of order" machines. This means that they look at the stream of instructions that they want to execute, and then internally reorder them into the most efficient order. Operations that can be done immediately can be scheduled early so that their dependencies can execute sooner. In addition, recent machines split instructions into micro-ops, and can perform "macro-op fusion" This means that the instructions that the machine actually executes may have very little correspondence with the individual opcodes the assembler outputs.

This large change means that the actual instructions you give your processor to run don't matter nearly as much as they used to. Since compilers tend to be relatively stupid, and most code is generated via compilation, most code that exists is relatively poorly optimized at the instruction level. This has caused chip manufacturers to have to optimize their processors to cope with this poor input. Their are exceptions of course, like the Pentium 4, but in general cycle counting is a futile exercise these days.

Thus the old trick of counting the cycles of the instructions within your highly optimized routine is now a poor predictor of its actual performance. It is more important to generate instructions with as few dependencies as possible between them. The less dependencies, the more instructions your cpu can execute simultaneously. Modern machines can execute 5+ instructions at the same time, if only your code would allow it. Unfortunately, most algorithms can't be so easily sped up.

Another big change is that ram is so much slower than it used to be. Now there are perhaps three layers of cache between the processor and the ram to try and mask the latency of accessing it. If your problem fits within the cache, then it may run an order of magnitude or more faster than if it doesn't. Conversely, if your algorithm jumps about in ram randomly, then the caches will be useless, and your program will execute incredibly slowly.

Thus in order to speed up your program you need to take into account the new realities of how processors work. No longer can you simply count the number of multiplications as a function of problem size, and then divide by the number of FLOPs the machine can execute.

Profiling

So you have some code that is running too slowly, and you need to speed it up. What should you do? The very first thing is to work out what it is spending all its time doing. Thus a profiler is a must. The problem with profilers is that like the uncertainty principle of quantum mechanics, they alter what which they are trying to measure. For very small functions, the overhead of the profiler can be relatively large. Of course, the profiler will try to remove this effect, but the out-of-order nature of modern machines means that the perturbation introduced isn't particularly simple to model.

Instead of profiling, another trick is to simulate your program on a virtual machine. (Kcachegrind of the valgrind suite is a free program to do this.) The virtual machine can be quite close to being cycle-accurate. The disadvantage here is that there is a huge slowdown due to the emulation. Also, since the kernel is not emulated, time spent there isn't correctly attributed.

The most fail-safe way of profiling is to simply time your code. Give it a problem large enough to take a reasonable amount of time. A few seconds is good - large enough that the variance due to kernel scheduling is small, but small enough that you do not have to wait long for it to finish. If you time your code before and after every change, you can have concrete evidence that your change helped or not.

So after you profile your code, you see it is spending quite a bit of time in a certain function. The obvious thing to do is to alter that function in some way in order to speed it up until something else is the limiting factor. By repeating this, you eventually get to the state where there is no obvious thing that can be changed that would help. Many programmers give up at this point.

Speeding up a Function

So how do you speed up a function? The problem here is that the advice many people know about is now out of date. Compilers are very smart these days, so tricks like replacing array offsets with pointer arithmetic, or using pre-increment instead of post-increment in loops do not help. The compiler replaces your code with a data-flow diagram within itself. Simple data-flow optimizations like common subexpression elimination, and removal of dead stores completely invalidate these tricks. The compiler is smart enough to do them itself.

The first thing to do is reduce the number of memory accesses. Memory is actually distributed in cache-line (64 byte) sized chunks on modern machines. If you read any value from memory, then the processor will pull the entire cache line along as well into cache. Thus if you read other memory in that same cache line, it will be orders of magnitude faster than some other location which probably won't be in cache.

Thus if you can rearrange your data structures so that your memory accesses are localized, then your code will execute much much faster. As an example, a binary tree is much slower than a btree, even though the latter is technically less efficient computationally. The extra computational overhead is massively outweighed by memory effects.

The next thing to do is to try changing your algorithm. Often, a function is first implemented in a simple manner because a simpler function is easier to verify. Once you have a working function, you can then use it as part of a test-suite in building a more complex version. If the complex function doesn't give the same output, then there is a bug somewhere than needs to be fixed. However, make sure that the more complex version actually improves things. There is no point keeping complex code of dubious value.

A small amount of research may show much faster ways of implementing the bottleneck. For example, if you are sorting strings within a BWT compressor then the obvious method is quite slow for some inputs. Using a customized suffix-sort algorithm can be much much faster. The problem here is knowing which search terms to use when trying to find an improved version of your algorithm. Unfortunately, the only answer to that problem is education. You'll need to find out what the particular algorithms you use are called so that you can research their possibly better analogues.

So you are now using a smarter algorithm, with a more compact memory layout, what is next? The next most important thing to do is investigate the instruction set of your cpu. Processors these days have vectorized instructions which allow you to process multiple chunks of data simultaneously. If your problem can be mapped onto something that can be accelerated by these, then it may be in your best interest to make the change. Note that some compilers support automatic vectorization. Try testing to see if this helps. If not, you may want to try manual vectorization.

The problem with the vector instruction extensions is that they tend to be non-orthogonal. Your problem may not have the exact instruction that it needs. If so, you may be out of luck. However, most of the time you may be able to work out some way of shoe-horning your algorithm into the available instructions. Quite often, changing your algorithm to being data-parallel with the vectorized instructions will probably cause you to have to rearrange your data layout. Don't be afraid to do this. Often the biggest improvements can be obtained only after rethinking the problem, and rearranging things in a completely new way.

Beware, not all machines that support the SSE instructions actually execute them quickly. For example, the early AMD machines support packed double floating-point operations. However, the processor internally separates them into two individual operations. Thus the factor of two speedup you may be expecting may not occur. In general, single precision is much faster than double precision. If you can get away with using that level of precision, then that may be a good option. You may also want to investigate fixed-point arithmetic, or some other mathematical representation of your problem. Mathematical creativity can yield order of magnitude improvements, but you need to think laterally.

Finally, you want to be aware of your compiler's limitations. Compilers tend to be very good at optimizing common types of code. However, after hand-optimization, your code may not be "common" any more. For example, gcc tends to handle double-sized integer types (128 bits on 64 bit machines, and 64 bits on 32 bit machines) particularly poorly. It also has trouble optimizing expressions involving complex numbers, or compiler intrinsics. You may want to rewrite functions where this is a problem into assembly language for optimal performance after trying all other avenues.

Parallel Code

Once you have made use of the data-parallel vector instructions you usually have done the best possible with a single threaded application. It is now time to consider using the power of multiple cores. This is another big change compared to the processors of previous decades. By using multiple threads, it is now possible to speed things up by an order of magnitude or more on top-class server hardware. As the number of cores (and hyperthreads) available on processors increases, this potential improvement also increases.

The problem with parallel code is that it is much more difficult to write and verify than single threaded code. You need to make sure that your multiple threads do not interfere with each other. The easiest way to do this is to use locks. By locking data structures you prevent concurrent modifications from occurring, and thus invalid states from being entered.

Unfortunately, a thread waiting on a lock isn't doing any work. Thus if too many threads are blocked, due to the locking being to "coarse", then your program might actually execute slower than the single-threaded version. This happens surprisingly often, so be sure to profile your new parallel code to make sure you have obtained the speedups you are after. A related problem occurs when locks are too finely-grained. There, there is less chance of blocking, but the overhead of obtaining and releasing the many more locks is a cause of slow-down. Thus the optimal parallel program has exactly the right number of locks, and no more.

In addition to locks causing performance problems, they also can cause problems with program correctness. Technically, it should be possible to construct a hierarchy of locks such that locks higher in the hierarchy are never taken when a lock lower in the hierarchy is owned. This is the only sure way to avoid the problem of deadlock. Unfortunately, most programs are so complex that enforcing this is extremely difficult if not impossible. Thus there is the ever-present threat of some rarely executed code path causing a deadlock.

Lock-free Code

One way to avoid some of the problems with locks is to replace them with lock-free algorithms. Lock-free code uses individual atomic instructions to perform work, and multiple threads may be manipulating the same part of a data structure at the same time. Since parallel updates are allowed, many more program states are possible. Making sure all those new states result in correct behaviour is what makes programming in a lock-free manner so difficult. Indeed "simply" coming up with a lock-free version of a well known data structure is worthy of a PhD in computer science. However, if there is already a known lock-free version of the data structure you are using, then replacing your code with it may result in a large speedup.

The problem with lock-free code is that only a small number of algorithms are known. If you wish to use it, then you will probably need to alter the data structures and algorithms within your program to fit them, rather than vice-versa. Another problem is that lock-free algorithms tend to use many atomic instructions. Unfortunately, atomic instructions are slow, up to two orders of magnitude slower than normal instructions. Thus it may be the case that the lock-free algorithm is actually slower than the higher-contended lock based algorithm it replaced. Only profiling before and afterwards can tell you if the change really was a performance increase. Remember, it is performance you are after, not the buzzword-compliance of using a complex lock-free algorithm.

Even faster than lock-free algorithms are wait-free algorithms. A lock-free algorithm is designed so that at least one thread makes progress no matter how much contention there is. A wait-free algorithm is designed so that all threads can make progress, even if some of those threads currently aren't running on a cpu. Due to this property, wait-free algorithms have excellent scalability, and tend to have the highest performance if many cores are available.

The downside of wait-free code is that it is even harder to design an algorithm with this property. A typical lock-free algorithm transitions between states via atomic compare-and-swap (CAS) instructions. If another thread interferes, then it retries the CAS once verifying which state the data structure now is in. The act of retrying isn't possible with wait-free code. There, every atomic instruction needs to make forward progress. Thus wait-free algorithms are strongly constrained by the atomic primitives available on your machine. If you can find a wait-free algorithm that speeds up your program you have been extremely lucky. Unfortunately, you can't count on such beasts existing, but it is definitely worth checking.

Global Optimization

So now you've speed up every function you can. You've tried vectorization and parallelization. What is left? The slowdowns that remain now are global issues. The biggest of these is due to memory layout. Many common data structures involve "pointer chasing" in their use. (The linked list is the epitome of this.) The act of following a pointer is a very slow operation if the pointed to memory isn't in the cache. Thus quite a large speedup can be obtained if some other data structure is used. For example, replacing a classic linked list with an unrolled linked list could be a huge improvement.

Another global problem is memory fragmentation. If you allocate and free many objects of differing sizes, your memory allocator needs to work very hard to avoid fragmentation. Often, it can't do a perfect job, and the effective memory footprint of your application could be many times the size of the memory it actually uses. One trick to fix this is to provide an extra layer of indirection. If all of your large objects are allocated as a series of equal-sized chunks that are then "glued" together logically by an array of pointers, you can prevent a huge amount of wastage. Memory allocators tend to handle the "many blocks of the same size" case very well.

Another problem that you may not realize exists is that the libraries you use (including the C library, or C++ runtime) have internal locks to protect their data structures. If your application is multithreaded, then contention on these locks could be a major slowdown. Since it isn't your code that is the problem, this may only show indirectly in performance profiles. The biggest offenders are often the i/o functions and the memory allocator.

Since memory is handled in cache-line sized chunks, there can be slowdowns in the case of "false sharing" between threads. This happens when different threads want to modify variables that they own that are adjacent to each other in memory. Since the same cache line is used, this cache line needs to be transferred between cores, which is a slow operation. The biggest offender is using an array where you have assigned one element per thread. If the array elements are not multiples of 64 bytes in size, then false sharing could be a problem. Another issue occurs when your memory allocator isn't aware of this issue. If it doles out memory to different threads that is in the same cache-line, a similar slowdown occurs. The problem here, is that exactly which heap variables become cache-line aliased can be quite random. Such slowdowns could vary between run to run.

The final source of hidden slowdowns is something mentioned before, use of the cache. The processor cache isn't perfect. It uses the middle bits of an address as a hash key, and is only able to keep a certain number of cache-lines with the same key. The problem is that if you step through memory with a stride that is a power of two, then you could hit the worst-case behaviour where every new memory access evicts something you have recently used from the cache. Since many data structures tend to be sized in powers of two, this happens fairly commonly. If you simply resize those structures to be a few bytes larger, you may be able to avoid this worse-case behaviour.

Optimization Example: Lockless Memory Allocator

The Lockless Memory Allocator uses quite a few of these tricks to obtain its performance advantage. It pays particular care to the number of memory accesses it does in allocating and freeing memory. Touching an extra cache-line can cause double-digit percentage performance changes in something that needs to run as quickly as a memory allocator. In addition it also makes sure that cache-lines are not falsely shared between threads, removing that hidden performance drain.

The current version of the Lockless Memory Allocator does not use much SSE vectorization. However, an earlier version did. An attempt was made to speed up the btree code by using vectorized instructions to do a parallel search though the btree nodes. Unfortunately, the simpler linear search was faster in this case. (Since SSE3 cannot be assumed, there are few ways to do horizontal comparisons in a SSE register. Doing them manually results in fairly slow code.) The Lockless Memory Allocator does use SSE registers in non-trivial ways, however. Sometimes using instructions in ways unintended by their creators can yield performance increases.

The Lockless Memory Allocator is designed to be fastest on parallel code. However, the overhead compared to a single-thread only version is minimal. For allocations that are freed on the same thread that they were allocated on, no atomic instructions are needed. This avoids touching the processor bus, giving more bandwidth for the application. Intra-thread allocations and frees tend to be rare, but even so this is optimized by using wait-free queues to synchronize. In fact the only blocking operations in the library occur on thread startup and shutdown. Since the C library does its own locking in this case, the extra overhead is negligible.

To prevent fragmentation, the Lockless Memory Allocator uses "slabs" for small allocation sizes. This means that all allocations of a given size will be lumped together, improving locality. It also prevents the problem of a small allocation blocking the full use of a large chunk of free memory. For larger sizes, the allocator periodically defragments itself allowing best-fit allocation to occur.

No one optimization is the source of the performance advantage you can get by using the allocator. It is the combination of many of the above tricks that yield the factor of two or so improvement over other allocators.

Summary

There are many things you can do to improve performance. Unfortunately, these things are much more difficult than the old simple tricks like using pointers instead of array indices. Compilers have improved over time, but now to gain optimal performance you need to think about parallelization and vectorization. Rudimentary compiler support for doing these automatically exists, but doing them manually is highly likely to result in much better speedups. Profile your code to find out where it is slow... but bear in mind that there can be global causes of slowdown that do not show up anywhere as a profiling hotspot.

Comments

Mohamed said...

Les agradezco mucho su gednorsiead con esta pe1gina, con a ayuda de un amiguito he podido traducirla y es por esta razf3n que recie9n estoy enviando mi mensaje de gratitud par uds, no saben lo feledz que estoy porque es esta mi mayor distraccif3n, amo el punto de cruz y uyds. me han proporcionado maravillas que yo quereda y no teneda. mil gracias

Kairi said...

I guess finding useful, reliable inotrmaoifn on the internet isn't hopeless after all. http://vssyqgbxhu.com [url=http://jxzoqj.com]jxzoqj[/url] [link=http://uyvbqtzet.com]uyvbqtzet[/link]

Malerie said...

Wow, this is in every <a href="http://ovfjvgruudr.com">repcest</a> what I needed to know.