An Improved Random Number Generator
The parallel random number generator described in an earlier article is very good. However, it can be improved.
The biggest flaw is the high long-range correlation between the outputs from differing seed states. To see this, investigate what happens with two initializations that differ by one. Since the output depends mostly on the high 64 bits of the 128-bit state, that output will not change very much. This correlation persists as the rng is stepped forwards. Since the low 64 bits are simply added in the final calculation, their effect is minimal.
Fortunately, the states that are problematic (close in seed-value) are far apart in generation order (by large multiples of 2^64). Thus, the random number generator is still very good for moderate amounts of output. However, the lack of long-range mixing means that the number of results that should be used is much smaller than might initially be suspected.
The problem here comes down to two things. The first is that the seed is not mixed very well as the rng is stepped forwards. The low bits do not affect the high bits to any great degree. The second problem is that the output hash function really only depends on the upper half of the seed.
These problems are easy to remedy. The first can be fixed by changing to a linear congruential generator for seed updates. By doing a multiply as well as an addition, the lower bits affect the upper bits much more strongly. The catch is that the previous prng algorithm is very fast, and we don't want to slow it down. Adding extra multiplications would go against that goal.
Fortunately, we can use a multiplier of 2^64 + 1 in our LCG. Since it is congruent to 1 mod 4, it gives a complete period, provided that the addend is an odd constant. This multiplier is extremely convenient to implement: multiplying by 2^64 + 1 amounts to adding the low 64 bits of the seed into the high 64 bits, so the assembly for the seed update grows by a mere single instruction. Since the load of the old lower 64 bits can pair with other operations, the new method may even be as fast as the old. In effect, we get extra randomness "for free".
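The original assembly listings are not reproduced here. As a hedged C sketch of the two seed updates (using gcc's unsigned __int128; the addend constant below is a placeholder, not the article's):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Placeholder odd addend -- the article's actual constant is not shown. */
#define RNG_ADD ((((u128)0x9E3779B97F4A7C15ULL) << 64) | 0xD1B54A32D192ED05ULL)

/* Old update: a plain 128-bit addition (an add/adc pair in assembly). */
static u128 step_old(u128 s)
{
    return s + RNG_ADD;
}

/* New update: LCG with multiplier 2^64 + 1.  Since
   s * (2^64 + 1) = s + (s << 64) mod 2^128, this is the old update plus
   a single extra add of the low word into the high word -- no real
   multiply is needed. */
static u128 step_new(u128 s)
{
    return s + (s << 64) + RNG_ADD;
}
```

A compiler lowers both to add/adc sequences; the new version needs one extra add, matching the single-instruction cost described above.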
This leaves the problem of the hash function that converts the seed into the output random number. We would like something that uses all 128 bits of the seed, rather than just the upper 64 bits. There are many possibilities here. We could use a 128 bit xorshift instead of a 64 bit xorshift. We could use interleaved multiplies. We could try something more complex with rotates, or add instructions to mix the bits. The possibilities are extremely wide open.
The problem is speed. Let's push the envelope, and try for something that is even faster than the old prng. To do this, we need to relax some of the assumptions used in the previous implementation.
The biggest assumption was that the hash function needed to be reversible. Is this actually required? Well... we would really like the rng to output all possible 64 bit quantities in a uniform manner. However, slight non-uniformities aren't really a problem. In some cases, they could even improve the statistical performance. What we really want is hash output that is statistically indistinguishable from a random source. This means we need to pay attention to the concept of "entropy".
Entropy is proportional to the logarithm of the number of states something could be in. A random source will have maximal entropy. The reason we wanted a reversible hash was that entropy is conserved in one. Every input state corresponds to a unique output state, and vice-versa. This means the input entropy must be identical to the output entropy.
There is one other possibility though. We could use a function that loses a small amount of entropy, and correct for that later. This means that the function would map more than one input state onto the same output state. It would be irreversible. Provided that the correction has enough entropy to make all output states possible, we should be fine.
A really interesting instruction we can use is the double-width multiply, which performs a 64 × 64 → 128 bit operation. Notice how the entropy within each bit increases uniformly across the lower 64 bits of the result, and decreases uniformly across the upper 64 bits. This means that by combining the two outputs we can obtain a result with a flat entropy profile. On the other hand, this operation is non-reversible. If we assume that the two halves are not correlated, we lose about one bit of entropy. Since a single bit of entropy is easy to obtain, this looks like a very good kernel for our random number generator.
We can repeat the above twice to ensure good mixing of both the upper and lower 64 bits of the 128 bit seed. The result is an extremely fast prng.
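The assembly listing for the prng is not reproduced here. As a hedged, hypothetical C sketch of the structure described (the 2^64 + 1 LCG step followed by two widening-multiply mixing rounds; all constants are placeholders, not the article's):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Placeholder odd addend -- not the article's constant. */
#define RNG_ADD ((((u128)0x9E3779B97F4A7C15ULL) << 64) | 0xD1B54A32D192ED05ULL)

static u128 rng_seed;  /* option one: a single non-threaded seed */

static uint64_t rng_u64(void)
{
    /* LCG step: s = s * (2^64 + 1) + c, one extra add over a plain
       128-bit counter. */
    rng_seed = rng_seed * ((((u128)1) << 64) + 1) + RNG_ADD;

    uint64_t lo = (uint64_t)rng_seed;
    uint64_t hi = (uint64_t)(rng_seed >> 64);

    /* Mixing round 1: widening multiply, then fold both halves of the
       128-bit product together for a flat entropy profile. */
    u128 m = (u128)(lo ^ hi) * 0xC6A4A7935BD1E995ULL;   /* placeholder */
    uint64_t x = (uint64_t)m ^ (uint64_t)(m >> 64);

    /* Mixing round 2: repeat so both halves of the seed are well mixed. */
    m = (u128)(x + lo) * 0x2545F4914F6CDD1DULL;         /* placeholder */
    return (uint64_t)m ^ (uint64_t)(m >> 64);
}
```

The threaded variants described below would additionally xor in the seed's address (option two) or a caller-supplied thread/machine number (option three) before the mixing rounds.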
This is actually three different random number generators in one.
For non-threaded use, pick the one without the extra xor instruction. This is the fastest, taking 3.77s to output 10^9 64-bit random numbers on this machine.
If you have multiple threads, then pick option two. The rng uses the different addresses of the per-thread seeds to make sure that each thread gets a different output sequence. This takes 4.05s to output 10^9 64-bit random numbers on this machine.
If you are using multiple computers, or require multi-threaded repeatability, then pick option three. Pass a variable describing the thread/machine-number to the rng. (Another option is to store this extra state into a slightly expanded seed, and use "xor 16(%rdi), %rax" instead.)
In effect, options two and three choose a 2^128 subsequence from a 2^192 element sequence.
This rng passes TestU01 BigCrush on the top 32 bits, bottom 32 bits, middle 32 bits, and an alternating bottom/top 32-bit bitstream, with a total of zero failures.
Unfortunately, the above is implemented in assembly language only. It is possible to make a C version using the 128-bit integer type that gcc provides.
However, the above is not recommended at the present time. GCC (version 4.7, tested with -O3) does a very poor job of optimizing it. It doesn't use an add-with-carry (adc) instruction for the 128 bit LCG update, and it emits many extra register-register moves. The extra instructions slow things down to the point where an inline-asm version would be quite a bit better.
An Improved Hash Function
We can also use the technique described above to make an improved hash function. Instead of using an entropy-conserving mix function, we can use one that loses a small amount of entropy per iteration. If we add enough fresh entropy every time, then the loss isn't significant.
Such a function might look like:
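The article's listing is not shown here; as a hedged sketch of what such a lossy multiply-fold hash loop might look like in C (the constants are placeholders, not the article's):

```c
#include <stdint.h>
#include <stddef.h>

typedef unsigned __int128 u128;

/* Hash an array of 64-bit words.  Each iteration's widening multiply
   plus fold loses roughly one bit of entropy, but the fresh input word
   xored in each round adds far more than that back. */
static uint64_t hash_mul(const uint64_t *p, size_t n)
{
    uint64_t h = 0x2545F4914F6CDD1DULL;   /* placeholder seed constant */

    for (size_t i = 0; i < n; i++) {
        u128 m = (u128)(h ^ p[i]) * 0xC6A4A7935BD1E995ULL; /* placeholder */
        h = (uint64_t)m ^ (uint64_t)(m >> 64);  /* fold both halves */
    }
    return h;
}
```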
(Compare with hash_swap or hash_rol in the article.) The inner loop has fewer instructions, but the multi-precision "mul" is expensive. The two effects cancel out somewhat. The result is that this hash function is perhaps a little slower than the SSE-based hash function described in that article. However, since the mixing is much better, the results may be statistically better.
The extra tool in our toolkit didn't quite help in this case. However, there may well be other areas where it is useful...
Copyright © Lockless Inc All Rights Reserved.