256 Bit Arithmetic

Large integer types are very useful. They are used in cryptography, where the larger the type, the more possibilities, and thus exponentially more security. Another use is in fixed point arithmetic. Floating point numbers have intrinsic inaccuracy, whereas large bit-depth fixed point arithmetic allows expressions to be evaluated without the risk of truncation or rounding errors. Large integers can also be the primitive base used in arbitrary precision arithmetic, where the larger the base type, the less overhead.

gcc has built-in 128 bit integer types on 64 bit machines, accessible via a typedef. Unfortunately, the gcc optimizer does not handle these types very efficiently. For example, consider two simple routines operating on 128 bit values.
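A minimal sketch of what this looks like, assuming a mode(TI) typedef for the 128 bit type (u128b is the type name used in the comments at the end of this page; the names u64, add_128 and mul_128 are assumptions) and taking the two routines to be a 128 bit addition and multiplication:

typedef unsigned long long u64;

/* gcc's 128 bit integer type on 64 bit targets, via the TI (tetra-integer) mode */
typedef unsigned int u128b __attribute__((mode(TI)));

/* Two simple routines: a 128 bit addition and a 128 bit multiplication */
u128b add_128(u128b x, u128b y)
{
	return x + y;
}

u128b mul_128(u128b x, u128b y)
{
	return x * y;
}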
Compiled with maximal optimization by gcc version 4.3, the generated code is absolutely horrible. The problem is that the rdx register is needed as an input register for part of the arguments under the ABI, while it is also required for the upper half of the 128 bit result. Moving to gcc version 4.4 helps a little.
The code gcc 4.4 generates is noticeably better. However, it still contains obvious extra instructions. To get an optimal implementation, we will need to move to assembly language. Normally, this would best be done via gcc inline asm. Unfortunately, in this case that doesn't work as well as we'd like. The problem is that gcc defines the "A" register description as the rax and rdx pair. We would like to use this combination as an output. However, we also need rdx as an input to match the ABI, and the overlapping register description is an error. A possible workaround is to use the "A" register description as an input/output register. However, this causes the compiler to want to add an extra initialization of rax. Another workaround is to specify rax and rdx separately, and construct the returned 128 bit type from the results. This doesn't work either - the compiler again creates the extra register-register moves we are trying to avoid. Thus a stand-alone assembly routine is the only option; hand-written, the multiplication comes down to three multiply instructions plus a few moves and adds.
Implementing other operations remains an exercise.
So what about the next larger power of two, 256 bit integers? Here we must leave the built-in types behind and develop everything from scratch. For simplicity we will define a structure as the data layout for our 256 bit unsigned integer. It matches the endianness of the other integer types, and also allows easy access to the component 64 bit and 128 bit parts we will use in the following.
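One possible layout, sketched here under the assumption that an anonymous union is used so that both the 64 bit words and the 128 bit halves are directly accessible (the field names lo, mid, midhi and hi are assumptions):

/* Sketch only: 256 bits as four 64 bit words, lowest word first,
   overlaid with the two 128 bit halves.  Anonymous struct members are
   a GNU C / C11 feature. */
typedef union u256 u256;
union u256
{
	struct
	{
		u64 lo;
		u64 mid;
		u64 midhi;
		u64 hi;
	};
	struct
	{
		u128b lo128;
		u128b hi128;
	};
};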
Implementing the addition and multiplication operations is relatively easy to do, using the built-in 128 bit type to handle the carries between the 64 bit components.
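Sketches of the two operations under that layout follow; add256b and mul256b are the names used in the comments below, and storing the result into the first argument is an assumption:

/* 256 bit addition: x += y, using the 128 bit type to propagate carries */
static void add256b(u256 *x, const u256 *y)
{
	u128b acc = (u128b) x->lo + y->lo;
	x->lo = (u64) acc;

	acc = (acc >> 64) + x->mid + y->mid;
	x->mid = (u64) acc;

	acc = (acc >> 64) + x->midhi + y->midhi;
	x->midhi = (u64) acc;

	x->hi += (u64) (acc >> 64) + y->hi;
}

/* 256 bit multiplication: x *= y, keeping only the low 256 bits.
   A dumb, manual n-squared (schoolbook) algorithm: six double-precision
   and four single-precision multiplications. */
static void mul256b(u256 *x, const u256 *y)
{
	u64 x0 = x->lo, x1 = x->mid, x2 = x->midhi, x3 = x->hi;
	u64 y0 = y->lo, y1 = y->mid, y2 = y->midhi, y3 = y->hi;
	u128b t, c;
	u64 hi = 0;

	/* Column 0 */
	t = (u128b) x0 * y0;
	x->lo = (u64) t;
	c = t >> 64;

	/* Column 1: the upper bits of this column still land inside the
	   result, so track the accumulator as a full 192 bit value */
	t = (u128b) x0 * y1; c += t; if (c < t) hi++;
	t = (u128b) x1 * y0; c += t; if (c < t) hi++;
	x->mid = (u64) c;
	c = (c >> 64) | ((u128b) hi << 64);

	/* Column 2: overflow past 128 bits here would lie beyond bit 255
	   of the result, so plain wrapping additions are fine */
	c += (u128b) x0 * y2;
	c += (u128b) x1 * y1;
	c += (u128b) x2 * y0;
	x->midhi = (u64) c;

	/* Column 3: only the low 64 bits are needed (single precision) */
	x->hi = (u64) (c >> 64) + x0 * y3 + x1 * y2 + x2 * y1 + x3 * y0;
}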
This uses a dumb, manual n² multiplication algorithm. Expanding the 128 bit multiplications, it contains a total of six double-precision and four single-precision multiplications. Can we do better?

There are several multiplication algorithms which are asymptotically better than brute-force n² multiplication. The first of these is Karatsuba multiplication. This replaces four multiplications by three via the trick

(x₁ + t x₂) × (y₁ + t y₂) = A + t((x₁ + x₂)(y₁ + y₂) − A − B) + t² B, where A = x₁y₁ and B = x₂y₂,

where x and y are the two numbers to be multiplied, expanded in terms of a base t. This can be done recursively, halving the base at each step, yielding a total of nine multiplications for our problem. This is one better than the total of ten given above. However, these multiplications all need to be double precision. This slows the algorithm down, and the one fewer multiplication doesn't help.

Another method of multiplying high precision numbers is called Toom-Cook multiplication. This works by noticing that we are multiplying polynomials in t, the base. A polynomial can be described by its value at n+1 points, where n is its maximal order: if you know the value of a quadratic at three points, then you can derive its functional form. This means we can use point-wise multiplication to evaluate the polynomial multiplication. For a multiplication of n×n components (here we have four 64 bit components to make up the 256 bit total) we require 2n−1 points, and thus 2n−1 multiplications. For our problem, this yields 7 multiplications. Unfortunately, this isn't the whole story. To go from the point values back to the polynomial expansion form, which is our wanted result, we need to perform a few more mathematical operations. Unfortunately, for the case given here, we need to divide by 6 or 3. This is most efficiently done as yet another multiplication, giving a gross total of 8 double-precision multiplications. Again, this is slower than the dumb brute-force method.

Why are the advanced methods slower? The reason is that we are doing a 256 bit × 256 bit ⇒ 256 bit multiplication. The full result is 512 bits in length, and the more advanced methods calculate this full result. If the brute-force method were used to do that, it would require 16 double-precision multiplications. Only requiring the half-sized result saves us quite a bit.

Can the advanced methods be modified to give only the lower bits we require? Unfortunately, there is no simple modification of Toom-Cook to do what we want. In the best case, we might save a single multiplication, but this isn't enough to help. The Karatsuba algorithm is another story. Here, it is indeed possible to modify it to lower the number of required multiplications significantly. Expanding the terms as x = x₁ + t x₂ + t² x₃ + t³ x₄, y = y₁ + t y₂ + t² y₃ + t³ y₄, and z = x × y = z₁ + t z₂ + t² z₃ + t³ z₄ (keeping only the low 256 bits), we can rewrite the multiplication in terms of a few reused products.
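Spelling this out while ignoring the carries between components (a sketch of the grouping; the exact presentation may differ, but the operation counts below follow from it):

z₁ = x₁y₁
z₂ = (x₁ + x₂)(y₁ + y₂) − x₁y₁ − x₂y₂
z₃ = (x₁ + x₃)(y₁ + y₃) − x₁y₁ − x₃y₃ + x₂y₂
z₄ = x₁y₄ + x₄y₁ + (x₂ + x₃)(y₂ + y₃) − x₂y₂ − x₃y₃

Only the low 64 bits of z₄ are needed, so its three products are single precision, while x₁y₁, x₂y₂, x₃y₃ and the two remaining sum-products are double precision.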
If the values calculated multiple times are reused, we require a total of 8 multiplies, five of them double precision and three single precision. This is lower than the brute-force method, so it may just be faster. Unfortunately, we cannot use the algorithm shown above as is. The problem is overflow: the additions cause us to have to multiply 65 bit numbers together. This can be fixed by using subtractions instead. Since, of any pair of numbers, one must be larger than or equal to the other, we can order each subtraction so that the result always fits in 64 bits.
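For one set of orderings, a helper along the following lines might be used (a portable C sketch only; the name mul_sub_64, its argument order and the borrow-word convention are assumptions rather than the article's actual routine):

/* Compute (xi - xj) * (yj - yi) as a 192 bit two's complement value.
   Each subtraction is ordered so that its magnitude fits in 64 bits;
   the sign of the product is tracked separately and the result negated
   if required.  *borrow receives the top 64 bits: all ones when the
   product is negative, zero otherwise. */
static u128b mul_sub_64(u64 xi, u64 xj, u64 yi, u64 yj, u64 *borrow)
{
	u64 a = (xi >= xj) ? xi - xj : xj - xi;
	u64 b = (yj >= yi) ? yj - yi : yi - yj;
	int negative = (xi >= xj) != (yj >= yi);

	u128b p = (u128b) a * b;

	if (negative && p)
	{
		*borrow = (u64) -1;	/* sign-extend the negative result */
		return -p;		/* low 128 bits of the two's complement */
	}

	*borrow = 0;
	return p;
}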
Note how we also need to return the "carry", which is actually a borrow. This is required when the result is negative: we need to return a 192 bit two's complement integer to get the result correct. The above isn't quite as efficient as it might be, though, and a little optimization yields something simpler.
Unfortunately, gcc isn't all that good at compiling the above, and to get a better result assembly is required again. The assembly version uses a relation between signed and unsigned multiplication, together with the "sbb" trick, to obtain a branchless function. It can be further simplified in the case where the carry (borrow) isn't required.
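In portable C, the analogue of that simplification is to drop the borrow word and return the product modulo 2^128 (again a sketch; the name is an assumption):

/* As above, but the caller only needs the product modulo 2^128,
   so no borrow word is returned. */
static u128b mul_sub_64_nb(u64 xi, u64 xj, u64 yi, u64 yj)
{
	u64 a = (xi >= xj) ? xi - xj : xj - xi;
	u64 b = (yj >= yi) ? yj - yi : yi - yj;
	u128b p = (u128b) a * b;

	if ((xi >= xj) != (yj >= yi)) p = -p;

	return p;
}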
Some C code can then use these routines to calculate the 256 bit multiplication via the pseudo-Karatsuba algorithm.
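A sketch of what such code might look like, written here on top of the portable helpers sketched above rather than the assembly routines, and following the z₁…z₄ grouping given earlier (the function name and carry handling are assumptions):

/* Pseudo-Karatsuba 256 bit multiply: x *= y, low 256 bits only.
   Five double-precision multiplications (the three reused products
   A, B, C and two signed cross terms) plus three single-precision ones. */
static void mul256b_kara(u256 *x, const u256 *y)
{
	u64 x0 = x->lo, x1 = x->mid, x2 = x->midhi, x3 = x->hi;
	u64 y0 = y->lo, y1 = y->mid, y2 = y->midhi, y3 = y->hi;

	u128b A = (u128b) x0 * y0;
	u128b B = (u128b) x1 * y1;
	u128b C = (u128b) x2 * y2;

	/* Signed cross terms:
	   x0*y1 + x1*y0 = A + B + K01 and x0*y2 + x2*y0 = A + C + K02 */
	u64 b01;
	u128b K01 = mul_sub_64(x0, x1, y0, y1, &b01);
	u128b K02 = mul_sub_64_nb(x0, x2, y0, y2);

	u128b c;
	u64 hi;

	/* Column 0 */
	x->lo = (u64) A;
	c = A >> 64;

	/* Column 1 = carry + A + B + K01, tracked as a full 192 bit value
	   because its upper bits still land inside the 256 bit result */
	hi = 0;
	c += A;   if (c < A)   hi++;
	c += B;   if (c < B)   hi++;
	c += K01; if (c < K01) hi++;
	hi += b01;
	x->mid = (u64) c;
	c = (c >> 64) | ((u128b) hi << 64);

	/* Column 2 = carry + A + B + C + K02.  Overflow past 128 bits would
	   lie beyond bit 255 of the result, so no borrow word is needed. */
	c += A + B + C + K02;
	x->midhi = (u64) c;

	/* Column 3: only the low 64 bits matter, so the remaining cross
	   term (x1 - x2)*(y2 - y1) can be taken directly modulo 2^64 */
	x->hi = (u64) (c >> 64) + (u64) B + (u64) C
	      + (x1 - x2) * (y2 - y1)
	      + x0 * y3 + x3 * y0;
}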
The normal method takes 10.9 seconds to evaluate 2²⁸ multiplications, and the pseudo-Karatsuba algorithm takes 8.8 seconds (using gcc 4.4, which generates the better code). This is approximately the 8/10 speedup predicted. Now how much faster can we make them go if we use full assembly versions? The pseudo-Karatsuba algorithm is rather complex, so its machine code version is rather long.
The brute-force method produces much simpler assembly.
Running the multipliers 2²⁸ times again gives a time of 5.5 seconds for the assembly-optimized pseudo-Karatsuba algorithm. However, doing this for the brute-force method yields a time of 4.0 seconds. Thus the dumb brute-force method is faster overall. Why is this? The reason is that the more complex algorithm requires many more additions and subtractions. The time taken for these adds up, and overwhelms the small advantage of the lower number of multiplies. The end result is code which is more than twice as fast as the original C code. The reason we can speed it up so much is that even in version 4.4, gcc does not generate very good code for 128 bit types. Hopefully version 4.5 will be better. Finally, the optimal 256 bit adder is simply a hand-written chain of an add followed by three adc instructions, one per 64 bit word.
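The benchmark loop described in the comments below (x(n) = x(n−1) × x(n−2), starting from odd numbers) might be sketched as follows, reusing the u256 type and mul256b() from the earlier sketches; the actual timing harness is not shown, so treat the details as assumptions:

#include <stdio.h>

int main(void)
{
	/* Two small odd starting values; the recurrence quickly produces
	   full-width 256 bit operands. */
	u256 a, b, t;
	unsigned i;

	a.lo = 3; a.mid = 0; a.midhi = 0; a.hi = 0;
	b.lo = 5; b.mid = 0; b.midhi = 0; b.hi = 0;

	for (i = 0; i < (1U << 28); i++)
	{
		t = b;

		/* b = b * a, i.e. x(n) = x(n-1) * x(n-2) */
		mul256b(&b, &a);
		a = t;
	}

	/* Print something so the loop isn't optimized away */
	printf("%llx\n", (unsigned long long) b.lo);

	return 0;
}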
The horrible captcha has been updated. The new one should be a little easier for humans to read, and a bit harder for machines.
Comments
Jeff said...I would expect that branch mispredictions on the conditionals would obliterate any benefit of the pseudo-Karatsuba version, if you used unpredictable inputs.
When you did your timings, did you go through integers sequentially (or just use the same inputs over and over), or were they somewhat random?
a good way to think about this is photoshop. if you can make the letters readable by throwing a few photoshop filters at it, then so can a bot.
use a mixture of fonts (big, small, serif, non-serif, outline, italics, etc) instead of those lame dots. combine that with a little bit of random warping, and your captcha will be 100x harder to break, while still being human readable.
The captcha totally stole the spotlight from your article! :)
Jeff:
The pseudo-Karatsuba code benchmark results use the branchless asm version of the 65 bit multiplier. The only "branch" is the cmove instruction.
The benchmark calculated x(n)=x(n-1)*x(n-2), starting with the initial x's being odd numbers. This tests a wide range of numbers, not just adjacent ones, and is faster than using a random number generator.
Maybe the captcha is too easy for bots? Ironic, since it's taken me 3 tries.
int main(void) {
u256 bignum, bignum2;
u256* x = &bignum;
u256* y = &bignum2;
x->lo = 18446743073708559078ULL;
x->mid = 18145508812048236159ULL;
x->hi = 0;
y->lo = 14021700194098315921ULL;
y->mid = 15956328402188045531ULL;
y->hi = 0;
add256b(x, y);
mul256b(x, y);
return 0;
}
I ran this with the given values and the result was exactly correct for both functions.
Fantastic, I can't wait to do the brute force asm functions.
cheers
Can I change:
(t2 >> 64) + t3 + (t4 >> 64) + t5 + ((u128b) t6 << 64) + t7 + ((u128b) t8 << 64) + (m1 >> 64) + (mid >> 64);
to:
(t2 >> 64) + (t3 >> 64) + t4 + (m1 >> 64) + (mid >> 64);
and get the same results???