Lockless Inc

A Parallel Random Number Generator

See Update Here.

The task for today is to create a high-quality random number generator that is designed for parallel execution.

It is not easy to create a good random number generator. The most obvious way to do it is to create an "arbitrary" function f(x), and use the output of repeatedly calling this function on its own output as a sequence. i.e. x, f(x), f(f(x)), f(f(f(x))), etc. The problem with this is that for an arbitrarily chosen function this sequence of numbers is highly likely to get in a relatively small cycle of outputs. If f(x) maps to a "random" output, then the length of this cycle will be of the order of the square root of the number of states. So a 64bit random number generator built with this technique may be easily trapped in a cycle of length 2^32, which is far too small to be useful in modern computing.

To prevent "wastage" of bits in the seed, the resulting cycle should be very close if not equal to in length the number of states those seed bits can represent. In order to do this, any random number generator can be described via two functions. The first maps the seed onto a new seed. The second maps a seed into an output. i.e.

snew=f(sold)
output=g(sold)

If we choose f() such that it has a maximal cycle, and a g() that is a random mapping, then we may have a good random generator.

The above is qualified with the word "may" because it is difficult to know if a random number generator is good. The only real way to check is to test it. One may examine the output and compare its properties to what may be expected from a random source. If the generator deviates from what is expected by a statistically significant margin, then we may have found a problem with it.

Unfortunately, statistical tests have a well-defined chance of being failed even by good random sources. i.e. if your test result says that there is only a 5% chance that the sequence of numbers seen could have come from a truely random source, then this actually should happen 1 time in 20. If hundreds of tests are used, then we should expect to see results like this happening as often as predicted. Thus it is possible to make meta-tests comparing the distribution of test results, and indeed possible to make meta-meta-tests comparing the results from meta tests. Each such level of test can detect different flaws in the random number generator.

An example test is one that counts the number of 1bits in the sequence. If this deviates too far from 1/2 of the total, then we may have a problem. A random number generator that strictly alternates between 0 and 1 would not fail this test. However, a meta-test of the results from many such sequences from this extremely poor generator will show the problem.

So which tests should we use? Luckily, there are a few random number test batteries already created. The oldest is that given by Knuth in his art of computer programming books. Most of those tests are out of date though, or subsumed within more recent test suites. Another old and not particularly good program is "ent". If a random number generator can't pass that, it is not very good indeed.

The first of the modern test suites is diehard. This test suite combines many of those from Knuth, with a few new ones. Unfortunately, some of the tests are broken with larger input runs. They were designed with small sequence lengths in mind, and use interpolation of various distribution functions. If the sequence length tested is too long, then the quantization introduced by the approximated "correct" distributions is detected as the random number generator failing. (When in reality, it is the test itself being buggy.)

The direct successor of diehard is the dieharder suite. This takes many of the tests from diehard and updates them to handle the longer sequence runs, and also subsumes most of the useful tests from NIST. It runs the diehard tests multiple times, and compares the output p-values with the uniform distribution from 0 to 1. These meta-tests identify problems with some generators which otherwise would pass the diehard suite.

The best test suite is Testu01, which has three different levels; "Small Crush", "Crush", and "Big Crush". These take a few seconds, about an hour, and several hours to run, respectively. The Big Crush test is extremely powerful, and very few random number generators pass it. The goal is to construct the fastest random number generator that will pass this test.

To start off with, we will design this rng for modern 64bit multicore processors. This means that we can assume that 64bit multiplication is fast, which was not the case in the past. We will also assume that the consumer of the random numbers wants a full 64bit value.

First, we investigate creating a "random mapping" function. This function must take any 64bit value, and map it onto another 64bit value. The whole ensemble of values must map onto unique values. i.e. this function must be invertable. Two candidate functions are:

  1. Multiplication by an odd integer
  2. The xorshift function with the shift equal to half the bitdepth.

The first of these is inverted by calculating the modular inverse, and the second by repetition. (Note that other xorshift shifts are also possible to invert, but we choose this one for simplicity.)

By combining these two functions, we can create an arbitrarily complex invertable mapping function. Note that repeating either of the two is useless. Two multiplications are equivalent to a single multiplication, and two xorshifts with this shift are idempotent. This means that we need to alternate. i.e. let h(x) = c * xorshift(x, 32), and the g(x) defined above is h(x) or h(h(x)), or h(h(h(x))), etc.

Assuming we have a very good g(x); f(x), the seed-mapping function does not need to be complex, it just needs to have maximal period. The simplest f(x) that has this property is f(x) = x + c, where c is some odd constant. So by combining the f and g functions, we can create a random number generator. The only question is how complex g must be to pass Testu01, or in other words, how many h iterations we need.

It turns out that three or four h iterations with a reasonably random multiplication constant are all that is required to pass Testu01. Unfortunately, this is a little slow. There are other random number generators that use less multiplications, and so are faster. Another problem is that this yields a random number generator with a period of 2^64. This is a little small for modern computational work, where you really don't want to use more than about the square root total period. 2^32 numbers is just too small.

The first trick we can use is to increase the size of the seed to 128bits. By adding a 128bit constant to it every iteration, we can use the fact that the top 64bits of the seed will have a period of 2^128. Another trick is to notice that some multiplication constants are better than others. By choosing one with good properties, less h iterations are required. Finally, we can squeeze a tiny bit more performance by xoring the output by the lower 64bits of the seed. This allows use to use one less h iteration.

The resulting code is:

typedef unsigned long long u64b;

static u64b rng64(u64b *s)
{
	u64b c = 7319936632422683419ULL;
	u64b x = s[1];

	/* Increment 128bit counter */
	s[0] += c;
	s[1] += c + (s[0] < c);

	/* Two h iterations */
	x ^= x >> 32;
	x *= c;
	x ^= x >> 32;
	x *= c;

	/* Perturb result */
	return x + s[0];
}

Notice how in the above, we reuse the multiplication constant for the upper and lower 64bits of the seed-mapping constant. Also notice how we overlap the calculation of the output with the updating of the seed, which is faster on modern out of order machines.

So how do we make this efficient in threaded code? There are three main ways of making a multithreaded random number generator. The first is to put locking around access to the seed. Let every thread use the same seed, and let those locks prevent the inner logic of the random number generator from stomping on itself. There are two problems with this technique. The most obvious is that it is slow. A thread waiting on a lock isn't doing any work. Also, due to race conditions, the individual random numbers that threads get will be unpredictable. i.e. Thread 1 may race with thread 2 for the first two random numbers, and who gets which one will change depending on who wins. Since some code depends on repeatable sequences of random numbers to the individual threads, this is very bad.

The second technique is to give each thread its own seed. By doing this carefully, we can partition the sequence into sub-sequences, one for each thread. The difficulty here is predicting how many numbers a given thread might need, since we do not want the sub-sequences to overlap. Another problem is that many random number generators have long-range correlations, and thus comparing subsequences may be statistically suspect. Finally, it simply may not be possible to partition the sequence due to the difficulty of fast-forwarding a badly designed generator.

Finally, it may be possible to construct a random number generator that uses a different function for each thread. Since the functions are different, so must the output sequences. This is obviously more difficult to develop as all such functions must be shown to be good enough to pass the statistical tests.

What we choose to do here is a combination of the second and third techniques. We use the address of the seed itself as a way of parametrizing the random output function. If a user requires identical sequences between runs, then the seed locations can be statically allocated, or a thread id number can be used instead. Assuming 32bits worth of entropy in this pointer, this is equivalent to having a random number generator with a period of 2^160, with each thread getting a 2^128 long sub-sequence:


typedef unsigned long long u64b;

static u64b rng64(u64b *s)
{
	u64b c = 7319936632422683419ULL;
	u64b x = s[1];

	/* Increment 128bit counter */
	s[0] += c;
	s[1] += c + (s[0] < c);

	/* Two h iterations */
	x ^= (x >> 32) ^ (u64b) s;
	x *= c;
	x ^= x >> 32;
	x *= c;

	/* Perturb result */
	return x + s[0];
}

Finally we investigate optimizing this function. The biggest problem is that gcc doesn't use the adc instruction for the 128bit addition. gcc has 128bit types we may use instead. The code for this looks like:


typedef unsigned long long u64b;

static u64b rng64(u64b *s)
{
	u64b c = 7319936632422683419ULL;
	u64b x = s[1];

	typedef __uint128_t u128b;
	union u128_64
	{
		u64b seed[2];
		u128b val;
	};
	
	/* Increment 128bit counter */
	((union u128_64 *)s)->val += c + ((u128b) c << 64); 

	/* Two h iterations */
	x ^= (x >> 32) ^ (u64b) s;
	x *= c;
	x ^= x >> 32;
	x *= c;

	/* Perturb result */
	return x + s[0];
}

Unfortunately, the code is still not optimal. 128bit types in gcc require their upper and lower parts to be in certain registers: %rax and %rdx. Since the output must be in %rax, this confuses the register allocator and extra register to register copies are generated. In addition, gcc fails to realize that the constant is never written to, and thus doesn't need to be loaded multiple times. To get the most optimal code assembly is required:


typedef unsigned long long u64b;

static u64b rng64b(u64b *s)
{
	u64b x;

	asm volatile(
	"mov    $0x6595a395a1ec531b,%%rcx\n"
	"movl   0xc(%1),%k0\n"
	"add    %%rcx,(%1)\n"
	"adc    %%rcx,0x8(%1)\n"
	"xor	0x8(%1),%0\n"
	"xor    %1,%0\n" //Remove this line if you are single threaded
	"imul   %%rcx,%0\n"
	"mov    %0,%%rdx\n"
	"shr    $0x20,%0\n"
	"xor    %%rdx,%0\n"
	"imul   %%rcx,%0\n"
	"add    (%1),%0\n"
	:"=r" (x): "+r"(s): "%rcx", "%rdx", "memory", "cc");
	return x;
}

The above implementation is extremely fast. It also passes all tests of the Testu01 Big Crush suite. (Not to mention diehard and dieharder of course.)


static u64b seed[2];

/* Sample usage */
int main(void)
{
	u64b result;

	/* Initialize seed */
	seed[0] = something
	seed[1] = something else.

	/* Get a random 64bit integer */
	result = rng64(seed);

	/* Get another random 64bit integer */
	result = rng64(seed);
}

Note that the random number generator given above is not secure enough for cryptographic usage. Firstly, a block size of 64bits is too small. Secondly, a few more h iterations are required. My guess is that six is probably enough. Finally, for cryptographic purposes, multiple multiplication constants should be used to prevent round-shifting attacks. Also note that if the seed is discovered, then all output of the rng in the past can be derived. A true cryptographic rng shouldn't have that vulerablilty.

Comments

Yes. said...
The code above is not working. It is giving constant, either positive or negative number depending on seed value.
sfuerst said...
Hmmm, it looks like modern gcc can miscompile the asm. I've changed the constraints so that the seed and result cannot end up mistakenly in the same register. Hopefully, this fixes the problem.
sschaem said...
try this http://www.flipcode.com/archives/07-15-2002.shtml
Generate 4x 32bit random number in parallel on a modern processor using only 2 CPU instruction.
Yet, is stronger than rand() and pass nearly all random function tests.
Dhafer said...
Hello,I am glad that All in ONE SEO is integrating Sitemap Generator.I am very happy be.I have a susetggion.There are many plugins, specially Directory Listing plugins, where we put special SHORTCODE in a Page.But that one single page is then used to show/list different products/listings using different URL query parameters.I don't know the exact tram of it but it looks like this -mysite.com/?page_id=159&aid=1mysite.com/?page_id=159&aid=2etc.I hope that your Sitemap Generator will able to include this type of URLs.
Ayisha said...
I personally use Google (XML) Sitemaps Generator for WordPress.It also have good<a href="http://eybcpwstj.com"> feautres</a> but I will love to use All in One SEO's Sitemap generator because in that case I will have lesser no of plugins in my system
GeorgeS said...
rng64b() is not equivalent to rng64(). In the former, you xor the high 32 bits of the pre-incremented s[1] with the post-incremented s[1]. In the latter, you use the pre-incremented values both times.

May I suggest the following code, which uses asm() for the awkward 128-bit addition and C code everywhere else:

#include <stdint.h>
typedef unsigned __int128 uint128_t;

uint64_t rng64c(uint64_t s[2])
{
uint64_t const c = 7319936632422683419ULL;
uint64_t x = (s[1] >> 32);

asm(
"addq%2, %0\n\t"
"adcq%2, %1"
: "+g" (s[0]), "+g" (s[1]) : "r" (c));

x ^= (uint64_t)s;
x ^= s[1];
x *= c;
x ^= x >> 32;
x *= c;

/* Perturb result */
return x + s[0];
}

When compiled standalone, gcc prefers to load s[] into registers, but you can force it to leave them in memory by changing the destination constraint to m.

Other useful gcc constraints:
a, b, c, d, S, D: Speific registers
%: Marks this and the following parameter as commutative
&: Marks an output operand as earlyclobber; it may not overlap with an input operand. Useful for temporaries; use "=&r" (temp) rather than assigning a specific register like %%rdx and adding it to the clobber list.
said...
Enter your comments here
said...
Enter your comments here
scurvydog said...
I hope you begin writing articles again. I really enjoy and get much benefit from them.
Thank you very much

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.