
An Improved Random Number Generator

The parallel random number generator described in an earlier article is very good. However, it can be improved.

The biggest flaw with it is that there is high long-range correlation between the output from differing seed states. To see this, investigate what happens with two initializations that differ by one. Since the output depends mostly on the high 64 bits of the 128 bit seed, that output will not change very much. This correlation will persist as the rng is stepped forwards. Since the low 64 bits are simply added in the final calculation, their effect is minimal.

Fortunately, the states that are problematic (close in seed-value) are far apart in generation order (by large multiples of 2^64). Thus, the random number generator is still very good for moderate amounts of output. However, the lack of long-range mixing means that the number of results that should be used is much less than might initially be suspected.

The problem here comes down to two things. The first is that the seed is not mixed very well as the rng is stepped forwards. The low bits do not affect the high bits to any great degree. The second problem is that the output hash function really only depends on the upper half of the seed.

These problems are easy to remedy. The first can be fixed by changing to a linear congruential generator for the seed update. By doing a multiply as well as an addition, the lower bits can affect the upper bits to a much greater degree. The problem here is that the previous prng algorithm is very fast, and we don't want to slow it down. Adding extra multiplications would go against that goal.

Fortunately, we can use a multiplier of 2^64 + 1 in our LCG. Since it is equal to 1 mod 4, it will give us a complete period, provided that the addend is an odd constant. This multiplier is extremely convenient to implement. The previous seed update code in assembly looked like:


	mov    $0x6595a395a1ec531b, %rcx
	add    %rcx,(%rdi)
	adc    %rcx,0x8(%rdi)
and this changes into

	mov    $0x6595a395a1ec531b, %rcx
	mov    (%rdi), %rdx
	add    %rcx,(%rdi)
	adc    %rdx,0x8(%rdi)

which is a mere single instruction longer. Since the load of the old lower 64 bits can pair with other operations, the above new method may even be as fast as the old. In effect, we get extra randomness "for free".
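
To see what this update is doing, note that it is just a 128 bit LCG computed mod 2^128. The following C sketch using gcc's __uint128_t type (the lcg_step name here is ours, purely for illustration) is equivalent to the three-instruction assembly sequence above:


typedef unsigned long long u64b;
typedef __uint128_t u128b;

static void lcg_step(u64b s[2])
{
	u64b c = 0x6595a395a1ec531bULL;
	u128b seed = ((u128b)s[1] << 64) | s[0];

	/* s = s * (2^64 + 1) + c mod 2^128 */
	seed = seed * (((u128b)1 << 64) + 1) + c;

	s[0] = (u64b)seed;
	s[1] = (u64b)(seed >> 64);
}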

This leaves the problem of the hash function that converts the seed into the output random number. We would like something that uses all 128 bits of the seed, rather than just the upper 64 bits. There are many possibilities here. We could use a 128 bit xorshift instead of a 64 bit xorshift. We could use interleaved multiplies. We could try something more complex with rotates, or add instructions to mix the bits. The possibilities are extremely wide open.

The problem is speed. Let's push the envelope, and try for something that is even faster than the old prng. To do this, we need to relax some of the assumptions used in the previous implementation.

The biggest assumption was that the hash function needed to be reversible. Is this actually required? Well... we would really like the rng to output all possible 64 bit quantities in a uniform manner. However, slight non-uniformities aren't really a problem. In some cases, they could even make the statistical performance better. What we really want is for the hash output to be statistically indistinguishable from a random source. This means we need to pay attention to the concept of "entropy".

Entropy is proportional to the logarithm of the number of states something could be in. A random source will have maximal entropy. The reason we wanted a reversible hash was that entropy is conserved in one. Every input state corresponds to a unique output state, and vice-versa. This means the input entropy must be identical to the output entropy.

There is one other possibility though. We could use a function that loses a small amount of entropy, and correct for that later. This means that the function would map more than one input state onto the same output state. It would be irreversible. Provided that the correction has enough entropy to make all output states possible, we should be fine.

A really interesting instruction we can use is the double-width multiply. It does a 64 × 64 → 128 bit operation. Notice how the entropy within each bit increases uniformly in the lower 64 bits of the result, and decreases uniformly in the upper 64 bits. This means that by combining the two outputs we can have a result with a flat entropy profile. On the other hand, this operation is non-reversible. If we assume that the two halves are not correlated, we lose about one bit of entropy. Since a single bit of entropy is easy to obtain, this looks like it might be a very good kernel for our random number generator.
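
In C, this kernel is a single widening multiply followed by a fold of the two halves. A minimal sketch (the mul_fold name and the choice of xor for the fold are ours; the rng below combines the halves with xor in one round and add in the other):


typedef unsigned long long u64b;
typedef __uint128_t u128b;

/* Fold a 64 x 64 -> 128 bit multiply back down to 64 bits.  Assuming
   the two halves are uncorrelated, about one bit of entropy is lost. */
static u64b mul_fold(u64b x, u64b c)
{
	u128b xx = (u128b)x * c;	/* widening multiply */

	return (u64b)xx ^ (u64b)(xx >> 64);
}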

We can repeat the above twice to ensure good mixing of both the upper and lower 64 bits of the 128 bit seed. The result is the extremely fast prng:


.align 16
.globl rng_hash_128
.type  rng_hash_128,@function
rng_hash_128:
	mov    8(%rdi), %rax		# x = high 64 bits of seed

	movabs $0x6595a395a1ec531b, %rcx	# odd constant c

#   Option 1)
#   Non-threaded, fastest.  No xor instruction used.

#   Option 2)
#   Threaded, pass the nonce as a second parameter.
#	xor    %rsi, %rax

#   Option 3)
#   Threaded, use the address of the seed as a nonce.
#	xor    %rdi, %rax

	mov    (%rdi), %rsi		# y = low 64 bits of seed
	mul    %rcx			# rdx:rax = x * c

	add    %rcx, (%rdi)		# seed = seed * (2^64 + 1) + c
	adc    %rsi, 8(%rdi)

	xor    %rsi, %rax		# x = lo(x*c) ^ y ^ hi(x*c)
	xor    %rdx, %rax
	mul    %rcx			# second multiply round
	add    %rsi, %rax		# combine halves with addition
	add    %rdx, %rax

	retq
.size	rng_hash_128, .-rng_hash_128

This is actually three different random number generators in one.

For non-threaded use, pick option one, without the extra xor instruction. This is fastest, taking 3.77s to output 10^9 64-bit random numbers on this machine.

If you have multiple threads, then pick option three. The rng will use the different addresses of the seeds to make sure that each thread gets a different output sequence. This takes 4.05s to output 10^9 64-bit random numbers on this machine.

If you are using multiple computers, or require multi-threaded repeatability, then pick option two. Pass a variable describing the thread/machine-number to the rng as the nonce. (Another option is to store this extra state into a slightly expanded seed, and use "xor 16(%rdi), %rax" instead.)

In effect, options two and three choose a 2^128 subsequence from a 2^192 element sequence.

This rng passes TestU01 BigCrush with the top 32 bits, bottom 32 bits, middle 32 bits, and alternating bottom and top 32 bit bitstream, with a total of zero failures.

Unfortunately, the above is implemented in assembly language only. It is possible to make a C version. We can use the __uint128_t 128 bit type in gcc to do the heavy lifting. Such a function might look like:


typedef unsigned long long u64b;
typedef __uint128_t u128b;

u64b rng_hash_128(u64b *s)
{
	u64b c = 7319936632422683419ULL;
	u64b x = s[1];
	u64b y = s[0];
	u128b xx;

	/* Increment 128bit LCG */
	s[0] += c;
	s[1] += (s[0] < c) + y;

	/* Default threaded option of xoring seed location */
	x ^= (u64b) s;

	/* Hash result */
	xx = (u128b)x * c;
	x = xx ^ y ^ (xx >> 64);
	xx = (u128b)x * c;
	return xx + y + (xx >> 64);
}

However, the above is not recommended at the present time. GCC (version 4.7 tested with -O3) does a very poor job of optimizing it. It doesn't use an add-with-carry (adc) instruction for the 128 bit LCG update. It also has many extra register-register moves. The extra instructions slow things down to where an inline-asm version would be quite a bit better.
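
One workaround is gcc's extended inline assembly. The following is an untested sketch of the option-three (seed-address) variant, not code from the Lockless library; getting the constraints right is the tricky part:


typedef unsigned long long u64b;

static inline u64b rng_hash_128_inline(u64b *s)
{
	u64b x, y;
	u64b c = 0x6595a395a1ec531bULL;

	__asm__ volatile(
		"mov	8(%[s]), %%rax\n\t"	/* x = high half of seed */
		"xor	%[s], %%rax\n\t"	/* xor in the seed address */
		"mov	(%[s]), %[y]\n\t"	/* y = low half of seed */
		"mul	%[c]\n\t"		/* rdx:rax = x * c */
		"add	%[c], (%[s])\n\t"	/* seed = seed*(2^64+1) + c */
		"adc	%[y], 8(%[s])\n\t"
		"xor	%[y], %%rax\n\t"
		"xor	%%rdx, %%rax\n\t"
		"mul	%[c]\n\t"		/* second multiply round */
		"add	%[y], %%rax\n\t"
		"add	%%rdx, %%rax"
		: "=&a" (x), [y] "=&r" (y)
		: [s] "r" (s), [c] "r" (c)
		: "rdx", "memory", "cc");

	return x;
}

The early-clobber outputs stop the compiler from allocating the inputs into %rax or the temporary, and the "memory" clobber covers the in-place seed update.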

An Improved Hash Function

We can also use the technique described above to make an improved hash function. Instead of using an entropy-conserving mix function, we can use something that loses just a small amount of entropy per iteration. If we add enough extra entropy every time, then the loss isn't significant.

Such a function might look like:


.globl hash_mul
.type   hash_mul,@function
hash_mul:
	movabs	$0xac95a395a9ec535b, %r8	# odd multiplier constant
	mov	%rsi, %rax		# initialize hash with the length
	xor	%r8, %rax
	sub	$0x8, %rsi
	jb	2f			# fewer than 8 bytes in total
.align 8, 0x90
1:	xor	(%rdi), %rax		# mix in the next 64 bit word
	mul	%r8			# rdx:rax = hash * constant
	add	$0x8, %rdi
	xor	%rsi, %rax		# mix in the remaining length
	xor	%rdx, %rax		# fold in the high product half
	sub	$0x8, %rsi
	jae	1b
2:	cmp	$-0x8, %rsi		# any tail bytes left over?
	jne	4f
3:	mov	%rax, %rsi		# finalize with two more rounds
	mul	%r8
	xor	%rsi, %rax
	xor	%rdx, %rax
	mul	%r8
	add	%rsi, %rax
	add	%rdx, %rax
	retq
4:	xor	%rdx, %rdx		# gather the 1 to 7 tail bytes
	test	$0x4, %rsi
	je	5f
	mov	(%rdi), %edx
	add	$0x4, %rdi
5:	test	$0x2, %rsi
	je	6f
	movzwl	(%rdi), %ecx
	shl	$0x10, %rdx
	add	%rcx, %rdx
	add	$0x2, %rdi
6:	test	$0x1, %rsi
	je	7f
	movzbl	(%rdi), %ecx
	shl	$0x8, %rdx
	add	%rcx, %rdx
7:	xor	%rdx, %rax		# mix in the tail, one extra round
	mul	%r8
	xor	%rdx, %rax
	jmp	3b
.size	hash_mul, .-hash_mul

(Compare with hash_swap or hash_rol in the article.) The inner loop has fewer instructions, but the double-width "mul" is expensive. The two effects cancel out somewhat. The result is that this hash function is perhaps a little slower than the SSE-based hash function described in that article. However, since the mixing is much better, the results may be statistically better.
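
For reference, the word-at-a-time loop of hash_mul translates to C quite directly. A sketch (the hash_mul_words name is ours, and the 4/2/1 byte tail handling from the assembly is omitted for brevity):


typedef unsigned long long u64b;
typedef __uint128_t u128b;

static u64b hash_mul_words(const void *buf, u64b len)
{
	u64b c = 0xac95a395a9ec535bULL;
	const unsigned char *p = buf;
	u64b x = len ^ c;
	u64b y;
	u128b xx;

	while (len >= 8) {
		u64b w;

		/* Unaligned 64 bit load */
		__builtin_memcpy(&w, p, 8);
		p += 8;
		len -= 8;

		x ^= w;
		xx = (u128b)x * c;
		x = (u64b)xx ^ (u64b)(xx >> 64) ^ len;
	}

	/* Finalize with two more multiply rounds, as in the assembly */
	y = x;
	xx = (u128b)x * c;
	x = (u64b)xx ^ y ^ (u64b)(xx >> 64);
	xx = (u128b)x * c;
	return (u64b)xx + y + (u64b)(xx >> 64);
}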

The extra tool in our toolkit didn't quite help in this case. However, there may be other areas where it proves useful...

Comments

Debra Pally said...
You're a star Steven! This website is really impressive.

Having trouble with the Captcha response code though. I've tried 6 times to type different codes - the text is really hard to see for some letters.

Love Deb (Your sister!!!! hahahaa)
schaem said...
overly complicated, try this
http://www.flipcode.com/archives/07-15-2002.shtml
only 2 instruction and pass more random number torture test.(SSE2, 64bit add)