Lockless Inc

256 Bit Arithmetic

Large integer types are very useful. They are used in cryptography, where the larger the type, the more possibilities, and thus exponentially more security. Another use is in fixed point arithmetic. Floating point numbers have intrinsic inaccuracy. Large bit-depth fixed point arithmetic allows expressions to be evaluated without the risk of truncation or rounding errors. They also can be the primitive base used in arbitrary precision arithmetic where the larger the integer, the less overhead.

gcc has in-built 128 bit integer types on 64 bit machines. These can be accessed via __uint128_t for unsigned and __int128_t for signed 128 bit types. These are useful, because their use exposes the 64×64 bit ⇒ 128 bit multiplication assembly instruction, which would otherwise have to be described as a series of 32×32 bit ⇒ 64 bit multiplications.

Unfortunately, the gcc optimizer does not handle these types very efficiently. For example the two simple routines


typedef __uint128_t u128b;

u128b add128b(u128b x, u128b y)
{
	return x + y;
}

u128b mul128b(u128b x, u128b y)
{
	return x * y;
}
compile with maximal optimization into

add128b:
	mov    %rbx,-0x10(%rsp)
	mov    %rdi,%rax
	mov    %rcx,%rbx
	mov    %rdx,%rcx
	mov    %rsi,%rdx
	mov    %rbp,-0x8(%rsp)
	add    %rcx,%rax
	mov    -0x8(%rsp),%rbp
	adc    %rbx,%rdx
	mov    -0x10(%rsp),%rbx
	retq

mul128b:
	imul   %rdi,%rcx
	mov    %rdx,%rax
	imul   %rdx,%rsi
	mul    %rdi
	add    %rsi,%rcx
	lea    (%rcx,%rdx,1),%rdi
	mov    %rdi,%rdx
	retq

with gcc version 4.3 This is absolutely horrible code. The problem here is that the rdx register is defined as an input register for part of the y variable by the ABI. It is also defined to be high part of the output. This confuses the gcc register allocator, and a large number of extra register-register copies are generated. In addition, this causes stack spills, which makes the code even worse.

Moving to gcc version 4.4 helps a little:


add128b:
	mov	%rdx,%rax
	mov	%rcx,%rdx
	push	%rbx
	add	%rdi,%rax
	adc	%rsi,%rdx
	pop	%rbx
	retq

mul128b:
	imul	%rdx,%rsi
	mov	%rdx,%rax
	imul	%rdi,%rcx
	mul	%rdi
	add	%rsi,%rcx
	lea	(%rcx,%rdx,1),%rdx
	retq

This code is noticeably better. However, it still contains obvious extra instructions. To get an optimal implementation, we will need to move to assembly language. Normally, this would best be done via inline gcc asm. Unfortunately, in this case that doesn't work as well as we'd like. The problem is that gcc defines the "A" register description as rax and rdx. We would like to use this combination as an output. However, we need rdx as an input to match the ABI. The overlapping register description is an error. A possible work around is to use the "A" register description as in input/output register. However, this causes the compiler to want to add an extra initialization of rax. Another work around is to specify rax and rdx separately, and construct the return 128 bit type from the results. This doesn't work either - with the compiler again creating the extra register-register moves we are trying to avoid. Thus moving to assembly is the only option. The result is


add128:
	mov	%rdx,%rax
	mov	%rcx,%rdx
	add	%rdi,%rax
	adc	%rsi,%rdx
	retq

mul128:	
	imul	%rdi, %rcx
	mov	%rdx, %rax	
	imul	%rdx, %rsi
	add	%rsi, %rcx
	mul	%rdi
	add	%rcx, %rdx
	retq
Implementing other operations remains an exercise.

So what about the next larger power of two, 256 bit integers? Here we must leave the built-in types behind and develop everything from scratch. For simplicity we will define the structure


typedef unsigned long long u64b;
typedef __uint128_t u128b;
typedef struct u256b u256b;
struct u256b
{
	u64b lo;
	u64b mid;
	u128b hi;
};

as the data layout for our 256 bit unsigned integer. This matches the endianess of the other integers, and also allows easy access to the component 64 bit and 128 bit parts we will use in the following. Implementing the addition and multiplication operations is relatively easy to do by using the inbuilt 128 bit type to handle the carries.


u256b add256b(u256b *x, u256b *y)
{
	u128b lo = (u128b) x->lo + y->lo;
	u128b mid = (u128b) x->mid + y->mid + (lo >> 64);
	u256b result =
	{
		.lo = lo,
		.mid = mid,
		.hi = x->hi + y->hi + (mid >> 64),
	};
	
	return result;
}

/* Dumb n^2 way */
u256b mul256b(u256b *x, u256b *y)
{
	u128b t1 = (u128b) x->lo * y->lo;
	u128b t2 = (u128b) x->lo * y->mid;
	u128b t3 = x->lo * y->hi;
	u128b t4 = (u128b) x->mid * y->lo;
	u128b t5 = (u128b) x->mid * y->mid;
	u64b t6 = x->mid * y->hi;
	u128b t7 = x->hi * y->lo;
	u64b t8 = x->hi * y->mid;

	u64b lo = t1;
	u128b m1 = (t1 >> 64) + (u64b)t2;
	u64b m2 = m1;
	u128b mid = (u128b) m2 + (u64b)t4;
	u128b hi = (t2 >> 64) + t3 + (t4 >> 64) + t5 + ((u128b) t6 << 64) + t7
		 + ((u128b) t8 << 64) + (m1 >> 64) + (mid >> 64);
	
	u256b result = {lo, mid, hi};
	
	return result;
}

Where we have used, as the comment describes a dumb manual n2 multiplication algorithm. Expanding the 128 bit multiplications, this contains a total of six double-precision, and four single precision multiplications. Can we do better?

There are several multiplication algorithms which are asymptotically better than brute-force n2 multiplication. The first of these is Karatsuba multiplication. This replaces four multiplications by three via the trick:(x1 + t x2) × (y1 + t y2) = A + t((x1 + x2)(y1 + y2) - A - B) + B t2, where A = x1y1, and B = x2y2, where x and y are the two numbers to be multiplied, and expanded in terms of a base, t. This can be done recursively, halving the base at each step, yielding a total of nine multiplications for our problem. This is one better than the total of ten given above. However, these multiplications need to all be double precision. This slows the algorithm down, and the one less multiplication doesn't help.

Another method of multiplying high precision numbers is called Toom-Cook multiplication. This works by noticing that the we are multiplying polynomials in t, the base. A polynomial can be described by its value at n+1 points, where n is the maximal order. i.e. if you know the value of a quadratic at three points, then you can derive its functional form. This means we can use point-wise multiplication to evaluate the polynomial multiplication. For a multiplication of n×n components (here we have 4 64 bit components to make up the 256 bit total) we require 2n-1 points, and thus 2n-1 multiplications. For our problem, this yields 7 multiplications. Unfortunately, this isn't the whole story. To go from the point values back to the polynomial expansion form, which will be our wanted result, we need to a few more mathematical operations. Unfortunately, for the case given here, we need to divide by 6 or 3. This is most efficiently done as yet another multiplication, giving a gross total of 8 double-precision multiplications. Again, this is slower than the dumb brute-force method.

Why are the advanced methods slower? The reason is that we are doing a 256 bit × 256bit ⇒ 256 bit multiplication. The full result is 512 bits in length. The more advanced methods calculate this full calculation. If the brute force method were used, it would require 16 double-precision multiplications. Only requiring the half-sized result saves us quite a bit. Can the advanced methods be modified to give the lower bits we require? Unfortunately, there is no simple modification of Toom-Cook to do what we want. In the best case, we might save a single multiplication, but this isn't enough to help.

The Karatsuba algorithm is another story. Here, it is indeed possible to modify it to lower the number of required multiplications significantly. Expanding the terms as x = x1+tx2+t2x3+t3x4, y = y1+ty2+t2y3+t3y4, and z = x×y = z1+tz2+t2z3+t3z4 we can rewrite the multiplication as

z1 = x1 y1
z2 = (x1 + x2)(y1 + y2) - x1 y1 - x2 y2
z3 = (x1 + x3)(y1 + y3) - x1 y1 - x3 y3 + x2 y2
z4 = (x2 + x3)(y2 + y3) - x2 y2 - x3 y3 +x1 y4+x4 y1

If the values calculated multiple times are reused we require a total of 8 multiplies, with five of them double precision, and three single precision. This is lower than the brute force method, so may just be faster. Unfortunately, we cannot use the algorithm shown above as is. The problem is overflow. The additions cause us to have to multiply 65 bit numbers together. This can be fixed by using subtractions instead. Since of a pair of numbers, one must be larger than or equal to another, we can order the subtraction so that the result will always fit in 64 bits. For one set of orderings we might have:

z1 = x1 y1
z2 = (x1 - x2)(y2 - y1) + x1 y1 + x2 y2
z3 = (x1 - x3)(y3 - y1) + x1 y1 + x2 y2 + x3 y3
z4 = (x2 - x3)(y3 - y2) + x2 y2 + x3 y3 +x1 y4+x4 y1
To do the 65 bit multiply, we need to check the orderings of the terms:

u128b mul65(u64b x1, u64b x2, u64b y1, u64b y2, u64b *carry)
{
	*carry = 0;
	if (x1 > x2)
	{
		u64b xx12 = x1 - x2;
		
		if (y1 > y2)
		{
			u64b yy12 = y1 - y2;
			u128b xy12 = (u128b) xx12 * yy12;
			if (xy12) *carry -= 1;
			return -xy12;
		}
		else
		{
			u64b yy12 = y2 - y1;
			u128b xy12 = (u128b) xx12 * yy12;
			return xy12;
		}
	}
	else
	{
		u64b xx12 = x2 - x1;
		
		if (y1 > y2)
		{
			u64b yy12 = y1 - y2;
			u128b xy12 = (u128b) xx12 * yy12;
			return xy12;
		}
		else
		{
			u64b yy12 = y2 - y1;
			u128b xy12 = (u128b) xx12 * yy12;
			if (xy12) *carry -= 1;
			return -xy12;
		}
	}
}

Note how we also need to return the "carry" which is actually a borrow. This is required when the result is negative. We need to return a 192 bit two's complement integer to get the result correct.

The above isn't quite as efficient as it might be though. A little optimization yields the simpler:


u128b mul65b(u64b x1, u64b x2, u64b y1, u64b y2, u64b *carry)
{
	u64b c = 0;
	u64b t1 = x1 - x2;
	u64b t2 = y2 - y1;
	u128b result = (u128b) t1 * t2;

	if ((x2 > x1) && t2)
	{
		result -= (u128b) t2 << 64;
		c = ~c;
	}
	
	if ((y1 > y2) && t1)
	{
		result -= (u128b) t1 << 64;
		c = ~c;
	}
	
	*carry = c;
	return result;
}

Unfortunately, gcc isn't all that good at compiling the above. To get a better result, assembly is required again:


mul65c:
	mov	%rdi, %rax
	sub	%rsi, %rax
	sbb	%rsi, %rsi
	sub	%rdx, %rcx
	sbb	%rdi, %rdi
	mov	%rsi, %r9
	and	%rcx, %rsi
	xor	%rdi, %r9
	and	%rax, %rdi
	mul	%rcx
	
	# Correct for overflow
	sub	%rsi, %rdx
	sub	%rdi, %rdx
	
	# fixup carry
	mov	%rax, %rdi
	or	%rdx, %rdi
	cmove	%rdi, %r9
	
	mov	%r9, (%r8)
	retq

This uses a relation between signed and unsigned multiplication, together with the "sbb" trick to have a branchless function. This can be further simplified in the case where the carry (borrow) isn't required:


mul65d:
	mov	%rdi, %rax
	sub	%rsi, %rax
	sbb	%rsi, %rsi
	sub	%rdx, %rcx
	sbb	%rdi, %rdi
	and	%rcx, %rsi
	and	%rax, %rdi
	mul	%rcx
	
	# Correct for overflow
	sub	%rsi, %rdx
	sub	%rdi, %rdx
	
	retq

Some C code which can use these routines to calculate the 256 bit multiplication via the pseudo-Karatsuba algorithm is:


u256b mul256c(u256b *x, u256b *y)
{
	u256b result;
	
	u64b x1 = x->lo;
	u64b x2 = x->mid;
	u64b x3 = x->hi;
	u64b x4 = x->hi >> 64;
	
	u64b y1 = y->lo;
	u64b y2 = y->mid;
	u64b y3 = y->hi;
	u64b y4 = y->hi >> 64;

	u128b xy11 = (u128b) x1 * y1;
	u128b xy22 = (u128b) x2 * y2;
	u128b xy33 = (u128b) x3 * y3;
	u64b xy14 = x1 * y4 + x4 * y1;
	u64b xy23 = (x3 - x2)*(y2 - y3);
	
	u64b carry;
	
	u128b t1 = xy11;
	u128b t2 = mul65c(x1, x2, y1, y2, &carry);
	u128b t3 = mul65d(x1, x3, y1, y3) + xy11 + xy22 + xy33 + (t2 >> 64); 
	u64b t4 = xy14 + xy23 + xy22 + xy33 + carry;

	t2 = (u64b) t2;
	t2 += t1 >> 64;
	t3 += t2 >> 64;
	
	t2 = (u64b) t2;
	t2 += xy11;
	t3 += t2 >> 64;
	
	t2 = (u64b) t2;
	t2 += xy22;
	t3 += t2 >> 64;

	result.lo = t1;
	result.mid = t2;
	result.hi = t3 + ((u128b) t4 << 64);
	
	return result;
}

The normal method takes 10.9 seconds to evaluate 228 multiplications, and the pseudo-Karatsuba algorithm takes 8.8 seconds. (Using gcc 4.4, which generates the better code.) This is approximately the 8/10 speedup predicted. Now how much faster can we make them go if we use full assembly versions? The pseudo-Karatsuba algorithm is rather complex, so is rather long in machine code:


mul256pseudok:
	mov	0x10(%rdx), %rax
	mov	%rbp, -0x8(%rsp)
	mov	(%rsi), %r11
	mov	(%rdx), %r8
	mov	%rbx, -0x10(%rsp)
	mov	0x18(%rdx), %rcx
	mov	%r12, -0x18(%rsp)
	mov	0x10(%rsi), %rbx
	imul	%r11, %rcx
	
	#xy13
	mov	%r11, %rbp
	mov	%rax, %r10
	sub	%r8, %rax
	sbb	%r12, %r12
	sub	%rbx, %r11
	sbb	%r9, %r9
	and	%r11, %r12
	and	%rax, %r9
	sub	%r12, %rcx
	sub	%r9, %rcx
	mov	0x8(%rdx), %r9
	mul	%r11
	mov	%rax, %r12
	add	%rdx, %rcx
	
	#xy33
	mov	%rbx, %rax
	mul	%r10
	add	%rax, %r12
	adc	%rax, %rcx
	add	%rdx, %rcx
	
	#xy11 + xy14
	mov	%rbp, %rax
	mul	%r8
	mov	%rax, (%rdi)
	mov	%rax, %r11
	add	%rax, %r12
	adc	%rdx, %rcx
	mov	%r8, %rax
	imul	0x18(%rsi), %rax
	add	%rdx, %r11
	adc	%rdx, %r12
	adc	%rax, %rcx
		
	#xy23
	mov	%rbp, %rdx
	mov	0x8(%rsi), %rbp
	mov	%r9, %rax
	sub	%rbp, %rbx
	sub	%r10, %rax
	mov	%rdx, %rsi
	imul	%rax, %rbx
	add	%rbx, %rcx
		
	#xy22
	mov	%rbp, %rax
	mul	%r9
	add	%rax, %r11
	adc	%rax, %r12
	adc	%rax, %rcx
	add	%rdx, %r12
	adc	%rdx, %rcx
		
	#xy12
	mov	%r8, %rax
	sub	%r9, %rax
	sbb	%r8, %r8
	sub	%rsi, %rbp
	sbb	%rsi, %rsi
	mov	%r8, %rbx
	and	%rbp, %r8
	xor	%rsi, %rbx
	and	%rax, %rsi
	mul	%rbp
	mov	-0x8(%rsp), %rbp
	mov	%rax, %r9
	sub	%r8, %rdx
	sub	%rsi, %rdx
	or	%rdx, %r9
	cmove	%r9, %rbx
	add	%rax, %r11
	mov	%r11, 0x8(%rdi)
	adc	%rdx, %r12
	mov	%r12, 0x10(%rdi)
	mov	-0x18(%rsp), %r12
	adc	%rbx, %rcx
	mov	%rcx, 0x18(%rdi)
	mov	-0x10(%rsp), %rbx

	retq
The brute force method simplifies much more:

mul256brute:
	mov	(%rsi), %rax
	mov	(%rdx), %r8
	mov	0x8(%rdx), %r9
	mov	%rbp, -0x8(%rsp)
	mov	%rax, %rbp
	mov	0x18(%rdx), %rcx
	imul	%rbp, %rcx
	mov	0x10(%rdx), %r10
	mul	%r8
	mov	%rax, (%rdi)
	mov	%rdx, %r11
	mov	%rbp, %rax
	mul	%r10
	mov	%r12, -0x10(%rsp)
	mov	%rax, %r12
	add	%rdx, %rcx
	mov	%rbp, %rax
	mul	%r9
	add	%rax, %r11
	mov	0x8(%rsi), %rbp
	adc	%rdx, %r12
	adc	$0, %rcx
	imul	%rbp, %r10
	mov	%rbp, %rax
	mul	%r8
	add	%rax, %r11
	mov	%rbp, %rax
	adc	%rdx, %r12
	adc	%r10, %rcx
	mov	0x10(%rsi), %rbp
	mul	%r9
	mov	%r11, 0x8(%rdi)
	imul	%rbp, %r9
	add	%rax, %r12
	mov	%rbp, %rax
	adc	%rdx, %rcx
	mov	0x18(%rsi), %rbp
	imul	%r8, %rbp
	add	%rbp, %rcx
	mov	-0x8(%rsp), %rbp
	mul	%r8
	add	%rax, %r12
	mov	%r12, 0x10(%rdi)
	mov	-0x10(%rsp), %r12
	adc	%r9, %rcx
	add	%rdx, %rcx
	mov	%rcx, 0x18(%rdi)
	retq

Running the multipliers again for 228 times gives a time of 5.5 seconds for the assembly optimized pseudo-Karatsuba algorithm. However, doing this for the brute force method yields a time of 4.0 seconds. Thus the dumb brute force method is faster overall. Why is this? The reason is that the more complex algorithm requires many more additions and subtractions. The time taken for these adds up, and overwhelms the small advantage in lower number of multiplies. The end result is code which is more than twice as fast than the original C code. The reason we can speed it up so much is because even in version 4.4, gcc does not generate very good code for 128 bit types. Hopefully version 4.5 will be better.

Finally, the optimal 256 bit adder is:


add256:
	mov	(%rdx), %r8
	mov	0x8(%rdx), %r9
	mov	0x10(%rdx), %rcx
	mov	0x18(%rdx), %rax
	add	(%rsi), %r8
	adc	0x8(%rsi), %r9
	adc	0x10(%rsi), %rcx
	adc	0x18(%rsi), %rax
	mov	%r8, (%rdi)
	mov	%r9, 0x8(%rdi)
	mov	%rcx, 0x10(%rdi)
	mov	%rax, 0x18(%rdi)
	retq

The horrible captcha has been updated. For prosperity, the evil old one used to look like: Old evil captcha

The new one should be a little easier for humans to read, and a bit harder for machines.

Comments

Jeff said...
Wow, hardest captcha ever.

I would expect that branch mispredictions on the conditionals would obliterate any benefit of the pseudo-Karatsuba version, if you used unpredictable inputs.

When you did your timings, did you go through integers sequentially (or just use the same inputs over and over), or were they somewhat random?
captchabot said...
your captcha sucks. any bot can apply a few image manipulation function to get the letters into matchable shapes. making it hard to read for humans != making it hard to read for computers.

a good way to think about this is photoshop. if you can make the letters readable by throwing a few photoshop filters at it, then so can a bot.

use a mixture of fonts (big, small, serif, non-serif, outline, italics, etc) instead of those lame dots. combine that with a little bit of random warping, and your captcha will be 100x harder to break, while still being human readable.
Mike said...
Jeff, the good news is that if you can pass the captcha, you know you're not color-blind!
Jaime said...
I was planning on commenting on the topic of the article until I saw the captcha. Then, I forgot what I was going to write. Now I'm just left with this: "WOWZA! WTH where you people thinking!"

The captcha totally stole the spotlight from your article! :)

YATroll said...
Good Lord, that captcha sucks balls. Shit, dude, where did you get this piece of crap? I mean, seriously, wasn't reCAPTCHA free and stuff?
timmey said...
LOL, how everyone is talking about the captcha!
sfuerst said...
I've updated the captcha. It should be a little better now. The old one was something thrown together in a few hours. The idea was to have something unique... the more common a captcha is, the more effort spammers spend to try to crack it. Of course security through stupidity is never a good idea.

Jeff:
The psuedo-Karatsuba code benchmark results use the branchless asm version of the 65bit multiplier. The only "branch" is the cmove instruction.

The benchmark calculated x(n)=x(n-1)*x(n-2), starting with the initial x's being odd numbers. This tests a wide range of numbers, not just adjacent ones, and is faster than using a random number generator.
trayres@gmail.com said...
This article made me decide to go mess with assembly. Thanks for this - very very cool.
Anonymous said...
What about 256bit division?
said...
Really interesting article! How can you wrap the assembly code given in the examples, for example mul65d, such that it is callable from within the C code as shown in mul256c?

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.