## Creating a Fast Hash Function

Hash functions convert a stream of arbitrary data bytes into a single number. By the pigeon-hole principle, many possible inputs will map to the same output. However, if a hash function is chosen well, it is difficult to find two keys that hash to the same value. A very good hash function will have output indistinguishable from that of a random number generator. Usually, though, we do not need output quite that uniform. Small correlations between the input and output bits are typically not a problem, and the larger the correlations we allow, the faster a function we can create. For example, an extremely poor hash function is:
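The original listing is not shown here; a minimal sketch of such an add-the-bytes hash (function name is mine) would be:

```c
#include <stddef.h>

/* Sum every input byte into the hash state - fast, but very poor mixing */
unsigned long long hash_add(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = 0;
    size_t i;

    for (i = 0; i < len; i++)
        h += p[i];
    return h;
}
```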
This function simply sums the input bytes. If the input keys are random enough, this may be a perfect solution. In the real world, however, the keys we wish to hash are decidedly non-random, so we would rather have something that produces a less correlated output. To quantify whether or not a hash function is any good, we need some tests. The simplest is to see how a stream of NUL bytes is handled: a good hash will give a different answer for each length.
The bad add-the-input-bytes hash described above fails this test: no matter how long the stream of zeros, it outputs the same value, zero. This test also makes sure that odd-sized inputs are handled well. More advanced hash functions tend to consume their input in large chunks, and the handling of the extra bytes in the partially-full last chunk can be poor. Another useful test is for avalanche: altering a single bit in the key should change the hash. For a good hash, every key bit should statistically alter every output bit about 50% of the time. Bob Jenkins has some nice code that tests for this. A simple version tests that each key bit affects each hash bit:
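The article's test listing is not shown here; a sketch of such an avalanche test (this harness and its names are mine, and `hash64()` is a stand-in splitmix64-style finalizer, not one of the article's hashes) might look like:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in 64-bit hash with known-good avalanche behaviour */
static uint64_t hash64(uint64_t x)
{
    x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;  x *= 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Flip each key bit in turn, and count how often each output bit
   changes.  Returns 1 if every key-bit/hash-bit pair flips within
   the fraction range [lo, hi] over `trials` pseudo-random keys. */
int avalanche_ok(int trials, double lo, double hi)
{
    static int count[64][64];
    uint64_t key = 0x123456789abcdef0ULL;
    int t, kb, hb;

    memset(count, 0, sizeof count);
    for (t = 0; t < trials; t++) {
        /* simple LCG generates the test keys deterministically */
        key = key * 6364136223846793005ULL + 1442695040888963407ULL;
        uint64_t h = hash64(key);

        for (kb = 0; kb < 64; kb++) {
            uint64_t diff = h ^ hash64(key ^ (1ULL << kb));
            for (hb = 0; hb < 64; hb++)
                if ((diff >> hb) & 1) count[kb][hb]++;
        }
    }

    for (kb = 0; kb < 64; kb++) {
        for (hb = 0; hb < 64; hb++) {
            double f = (double) count[kb][hb] / trials;
            if (f < lo || f > hi) return 0;
        }
    }
    return 1;
}
```

This version tests a fixed-width 64-bit key for brevity; the article's version works on byte buffers.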
(The version used here has been cleaned up to support 64-bit hashes, and the coding style altered to match the rest of this website.) If a hash function doesn't pass the above test, something is seriously wrong with it. Next, we should test for correlation between bits. To do this, a helper function for bit-string manipulation is useful:
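The helper listing is not shown here; one plausible sketch (my own, not the article's code) is a bit-population count, the basic primitive needed when comparing bit-strings of hash outputs:

```c
#include <stdint.h>

/* Count the set bits in a 64-bit word by repeatedly
   clearing the lowest set bit (Kernighan's method). */
static int popcount64(uint64_t x)
{
    int c = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        c++;
    }
    return c;
}
```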
We would like to test how often an output bit is changed by input bits, expecting a 50% correlation for every bit-pair. Unfortunately, for any given number of trials there is statistical noise, so we can only flag a problem when the correlation differs from 50% by more than an amount that shrinks as the inverse square root of the total number of trials. There is also the fact that many bit-pairs are checked simultaneously: simply by random chance, this increases the probability that some bit-pair will signal a false over- or under-correlation. In order to get good statistical results, about a million trials are required.
An even more complex test compares pairs of output bits with each input bit. This is more sensitive to subtle problems with a hash function, and can indicate when the output isn't quite mixed enough.
Using about a million trials, the above test will say whether your hash function differs from a random oracle at the 1% level. This is good enough for most (non-cryptographic) purposes. Other tests for hash functions exist; however, I have found that the above ones do a good job. The Maurer test passes some quite poor hash functions. The second and third-order ANFS tests are somewhat discerning, but the simpler bit-correlation tests are quicker to run, and give more useful information on failure.

The task for today is to construct the fastest possible hash function that passes all of these tests. (Even faster hash functions are possible, like variants of the add-all-bytes one, but they may lead to statistical problems for non-random input data.) For ultimate speed, we will write the hash functions in assembly language, which can give up to a 2× speedup in some cases.

The first hash function uses an algorithm similar to the parallel random number generator described in another article. The hash state is repeatedly multiplied by a large odd number, and then "folded" with a half-shift + xor operation. These operations are chosen because they are invertible, which means the available phase-space is conserved, and thus all possible hash values can be obtained. The multiplication by an odd number (mod a power of two) can be inverted by finding the modular inverse; the xorshift operation can be inverted by repeating itself. The multiplication mixes lower bits into higher bits, while the xorshift does the reverse, mixing information downwards to the least significant bits. The combination of the two spreads information throughout the hash state. We need two multiplication + xorshift steps per mix-in of key state in this algorithm; if only one is done, the tests do not pass. The second part of the algorithm deals with the fraction of an 8-byte chunk remaining.
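The scheme just described can be sketched in C as follows (the article's version is in assembly; the constant and the simplified byte-wise tail here are my own choices):

```c
#include <stdint.h>
#include <string.h>

/* Multiply + xorshift hash: mix in 8 bytes, then do two rounds of
   (multiply by a large odd constant, xor with a half-width shift).
   Both steps are invertible mod 2^64, so no hash values are lost. */
uint64_t hash_mulxor(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = len;            /* seed with the length */
    size_t n = len;

    while (n >= 8) {
        uint64_t chunk;
        memcpy(&chunk, p, 8);
        h ^= chunk;
        h *= 0x2545f4914f6cdd1dULL;  h ^= h >> 32;   /* round 1 */
        h *= 0x2545f4914f6cdd1dULL;  h ^= h >> 32;   /* round 2 */
        p += 8;  n -= 8;
    }
    while (n--) {                /* simple byte tail for this sketch */
        h ^= *p++;
        h *= 0x2545f4914f6cdd1dULL;  h ^= h >> 32;
    }
    h *= 0x2545f4914f6cdd1dULL;  h ^= h >> 32;
    h *= 0x2545f4914f6cdd1dULL;
    return h ^ (h >> 32);
}
```

Note that seeding with the length is what makes different-length streams of NUL bytes hash differently.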
Since we don't want to read beyond the end of the key buffer, we mix in the remaining size-four chunk if it exists, then the size-two chunk, and finally the size-one chunk. This is faster than using a jump-table for a calculated goto. Finally, two more multiplications and one xorshift are done to mix in the new bits. The result is an algorithm which passes the above tests, and is quite fast. The inner loop is quite small, and it processes eight bytes per iteration.
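The 4/2/1 tail ladder can be sketched as follows (my own arrangement, mirroring the structure of the C version posted in the comments below):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Gather the last len % 8 bytes of a buffer into one word without
   ever reading past the end: a 4-byte chunk if present, then a
   2-byte chunk, then a final single byte. */
uint64_t gather_tail(const unsigned char *p, size_t rem)
{
    uint64_t v = 0;
    uint32_t w4;
    uint16_t w2;

    if (rem & 4) { memcpy(&w4, p, 4);  v = w4;                p += 4; }
    if (rem & 2) { memcpy(&w2, p, 2);  v = (v << 16) | w2;    p += 2; }
    if (rem & 1) {                     v = (v <<  8) | *p;            }
    return v;
}
```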
So how should we benchmark the above? The problem is that its speed depends on the length of the key buffer. To be fair, we need to include a few sizes in the benchmark and weight each equally. One benchmark which does this is:
This benchmark completes in 1.24s for the above hash function on this machine. Can we do better? In fact we can. Instead of using an xorshift to shift information downwards, we can use other instructions. (Using a multiplication to shift it upwards is fairly optimal.) One that comes to mind is the rotate instruction: by rotating by 32 bits we can swap the upper and lower halves of our 64-bit state. A hash function which does this is:
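A C sketch of the rotate variant (the article's version is in assembly; constants and structure here are my own) might look like the following. The extra xor with a constant in the inner loop is what the article says the rotate version needs in order to pass the correlation tests:

```c
#include <stdint.h>
#include <string.h>

/* Rotate by 32 swaps the halves of the 64-bit state */
static uint64_t rot32(uint64_t x) { return (x << 32) | (x >> 32); }

uint64_t hash_rot(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h = len ^ 0x1591aefa5e7e5a17ULL;
    size_t n = len;

    while (n >= 8) {
        uint64_t c;
        memcpy(&c, p, 8);
        /* two multiply + rotate rounds, plus the constant xor */
        h = rot32((h ^ c) * 0x1591aefa5e7e5a17ULL);
        h = rot32(h * 0x2bb6863566c4e761ULL) ^ 0x9e3779b97f4a7c15ULL;
        p += 8;  n -= 8;
    }
    while (n--)                  /* simple byte tail for this sketch */
        h = rot32((h ^ *p++) * 0x1591aefa5e7e5a17ULL);

    h *= 0x1591aefa5e7e5a17ULL;  h ^= h >> 32;
    h *= 0x2bb6863566c4e761ULL;
    return h ^ (h >> 32);
}
```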
Since a rotate doesn't do as much information mixing as an xorshift, we need more of them. However, since many fewer instructions are required per iteration, we have a net win. This hash function passes the tests, and completes the benchmark in 1.13s, which is a fair speedup. It turns out that there is an even better downward-mixing instruction than rotate: the bswap instruction, which reverses the byte order of a register.
This is a little faster, at 1.07s. This is because we don't need the extra xor by a constant in the inner loop. (Without it, the rotate version doesn't pass the correlation tests.) Is this the end? Not quite: we can do a bit better by expanding the amount of hash state in the inner loop. By using 128 bits of state, we can use 64×64→128 bit multiplications, which make more efficient use of processor resources. In addition, the fact that the high and low results of the multiplication end up in separate registers allows us to perform the xorshift operation in a single instruction. The resulting code looks like:
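The 128-bit-state idea can be sketched in C using GCC's `unsigned __int128` extension (the article's version is in assembly; the constants here are borrowed from the comment code below, and the two-lane structure is my own arrangement). A 64×64→128 multiply yields high and low halves, and xoring them together is exactly the "free" xorshift the article mentions:

```c
#include <stdint.h>
#include <string.h>

uint64_t hash_128(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint64_t h1 = len ^ 0x1591aefa5e7e5a17ULL;
    uint64_t h2 = 0x2bb6863566c4e761ULL;
    size_t n = len;

    while (n >= 16) {
        uint64_t a, b;
        memcpy(&a, p, 8);  memcpy(&b, p + 8, 8);
        unsigned __int128 m1 = (unsigned __int128)(h1 ^ a) * 0x1591aefa5e7e5a17ULL;
        unsigned __int128 m2 = (unsigned __int128)(h2 ^ b) * 0x2bb6863566c4e761ULL;
        h1 = (uint64_t) m1 ^ (uint64_t)(m1 >> 64);  /* low ^ high */
        h2 = (uint64_t) m2 ^ (uint64_t)(m2 >> 64);
        p += 16;  n -= 16;
    }
    while (n--) {                /* simple byte tail for this sketch */
        h1 ^= *p++;
        unsigned __int128 m = (unsigned __int128) h1 * 0x1591aefa5e7e5a17ULL;
        h1 = (uint64_t) m ^ (uint64_t)(m >> 64);
    }
    /* final cross-mix of the two lanes */
    unsigned __int128 m = (unsigned __int128)(h1 ^ h2) * 0x1591aefa5e7e5a17ULL;
    h1 = (uint64_t) m ^ (uint64_t)(m >> 64);
    m = (unsigned __int128) h1 * 0x2bb6863566c4e761ULL;
    return (uint64_t) m ^ (uint64_t)(m >> 64);
}
```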
Using the above algorithm results in superior mixing. This means that only one 128-bit multiply + xor operation is needed per 128 bits of key state. It also means that fewer mixes are required after the final merge. Even though the double-word multiplication instruction is slow, the fact that we are now processing twice the data per iteration more than makes up for it. The benchmark now completes in 0.84s, which well and truly breaks the one-second barrier. Of course, this isn't quite the best code. Now that we are processing 128 bits at a time, how about using the bswap algorithm within the inner loop, and then mixing the final results together with 128-bit multiplies and xorshifts? This removes dependencies in the inner loop, and allows the processor to extract parallelism it couldn't previously.
The extra parallelism increases speed a little. Note how the superior mixing means that we only need one mixing operation per chunk of key data. Of course, we still aren't quite done. There is one other set of instructions we haven't touched yet: the streaming SIMD instructions (SSE) allow computation on a large amount of data in a short amount of time. If we can construct a hash function which uses them efficiently, we may be able to beat the above record.

There are several problems with the SSE instructions. The first is that they require 16-byte alignment of memory. Normally this isn't a problem, as the user of an SSE function can enforce alignment. However, a hash function cannot be so choosy. Therefore, we will need to check the alignment of the input buffer, and fall back on a slightly slower inner loop in the unaligned case. Another problem with the SSE instructions is that they are not very orthogonal. There is a wide variety of instructions, but there doesn't seem to have been much of a plan about what was implemented and what was not. The instructions were selected for their use in multimedia, not for the integer arithmetic within hash functions. Thus we will need to think laterally in order to find ones useful for our task.

We require two types of instructions for the inner loop: ones which mix bits "upwards", and ones which mix them "downwards". By using both sets, we can mix all the bits and create a nice function. The problem is that SSE instructions are designed to work on multiple chunks of data in parallel, and having information cross the boundaries between those chunks isn't easy. The method we used to move information upwards previously was multiplication by large odd constants. The problem here is that the SSE2 instruction set limits which multiplications are allowed. We would like to use as large a multiplication as possible. However, the largest available, which computes two 32×32→64 bit multiplies, only uses half of the data within an SSE register.
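The limitation just described can be seen with intrinsics (my own illustration, not the article's code): `_mm_mul_epu32` (pmuludq) multiplies only the even 32-bit lanes of each operand, producing two 64-bit products, so half the register's data is ignored per multiply.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Multiply the even dword lanes (0 and 2) of `in` by two odd
   constants; lanes 1 and 3 are simply discarded by pmuludq. */
void mul_even_lanes(const uint32_t in[4], uint64_t out[2])
{
    __m128i v = _mm_loadu_si128((const __m128i *) in);
    __m128i c = _mm_set_epi32(0, 0x5e7e5a17, 0, 0x66c4e761);
    __m128i r = _mm_mul_epu32(v, c);   /* lanes 0*0 and 2*2 only */
    _mm_storeu_si128((__m128i *) out, r);
}
```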
Doing the same on the other half requires other instructions to extract the information, and perhaps yet more instructions to recombine everything. It seems the best multiply instruction available is pmuludq. Mixing information downwards presents a new problem. There are no bswap or rotate instructions in SSE2. There are 128-bit shifts, but building a rotate out of them is slow. Another possibility is to use a shift + xor. However, since the 128-bit shifts are limited to units of whole bytes, there aren't many good possibilities, and the amount of mixing available with this technique is limited. What saves us are the unpack instructions. We can use them to interleave the halves of two registers, moving information from the high part of the state into the low lanes for the next multiply.
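One way the unpack instructions can move information "downwards" (my own sketch, not the article's code): `_mm_unpackhi_epi32` (punpckhdq) interleaves the high dwords of two registers into a new register, so high bits of the state become even-lane inputs for the next pmuludq.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Interleave the high dwords of v with itself, so the output
   lanes are {v2, v2, v3, v3} - the old high half now occupies
   the even (multiplied) lanes. */
void unpack_high_dwords(const uint32_t in[4], uint32_t out[4])
{
    __m128i v  = _mm_loadu_si128((const __m128i *) in);
    __m128i hi = _mm_unpackhi_epi32(v, v);
    _mm_storeu_si128((__m128i *) out, hi);
}
```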
The full SSE code is quite complicated. However, it is extremely fast in the benchmark, taking 0.785s. It manages to be so fast because it processes 32 bytes per iteration of the inner loop. The relatively poor mixing of the SSE instructions is made up for by an extra 128-bit multiply and xorshift. If you are hashing keys which are large buffers, the above is a very good function to use. For smaller sizes, the non-SSE functions may be faster.

Copyright © Lockless Inc. All Rights Reserved.

## Comments

sfuerst said...

```
sub $0x10, %rsi    ; subtract 16 from rsi (buffer length)
ja 1b              ; continue if rsi > 0
```

or the C equiv:

```c
do { } while ( (rsi -= 16) > 0 );
```

Shouldn't this be:

```c
do { } while ( (rsi -= 16) > 15 );
```

instead? Since the last 15 bytes are handled by the final chunk parts.

```
$ gcc -O2 h.c -lrt
$ a.out
Asm  551332706 nanosecs (ba560400)
C    414688636 nanosecs (ba560400)
Asm  452776232 nanosecs (ba560400)
C    414525092 nanosecs (ba560400)
NH  1778447535 nanosecs (b0351680)
Asm  452955337 nanosecs (ba560400)
NH  1778373149 nanosecs (b0351680)
```

Full source here: http://www.eetbeetee.org/h.c

```c
unsigned int
hash_128_swapc( const void *p, unsigned int len )
{
    register unsigned long long r8 = 0x1591aefa5e7e5a17ULL,
        r9 = 0x2bb6863566c4e761ULL,
        rax = len ^ r8,
        rcx = r9,
        rdx = 0;  /* initialized: the tail shifts below may read it */

#define bswap( r ) \
    __asm__ __volatile__ ( "bswapq %0" : "+r" (r) : : )

#define mul128( a, d, r ) \
    __asm__ __volatile__ ( "mulq %2" : "+a" (a), "=d" (d) : "r" (r) : )

    while ( len >= 16 ) {
        rax = ( rax ^ ((unsigned long long *) p)[ 0 ] ) * r8;
        rcx = ( rcx ^ ((unsigned long long *) p)[ 1 ] ) * r9;
        bswap( rax );
        bswap( rcx );
        p = &((unsigned long long *) p)[ 2 ];
        len -= 16;
    }

    if ( len != 0 ) {
        if ( ( len & 8 ) != 0 ) {
            rax ^= ((unsigned long long *) p)[ 0 ];
            p = &((unsigned long long *) p)[ 1 ];
        }
        if ( ( len & 4 ) != 0 ) {
            rdx = ((unsigned int *) p)[ 0 ];
            p = &((unsigned int *) p)[ 1 ];
        }
        if ( ( len & 2 ) != 0 ) {
            rdx = ( rdx << 16 ) | ((unsigned short *) p)[ 0 ];
            p = &((unsigned short *) p)[ 1 ];
        }
        if ( ( len & 1 ) != 0 ) {
            rdx = ( rdx << 8 ) | ((unsigned char *) p)[ 0 ];
        }
        rcx ^= rdx;
    }

    mul128( rax, rdx, r8 );
    rcx = ( rcx * r9 ) + rdx;
    rax ^= rcx;
    mul128( rax, rdx, r8 );
    rcx = ( rcx * r9 ) + rdx;
    rax ^= rcx;

    return ( rax >> 32 ) ^ rax;
}
```

I counted the instructions in the newhash inner loop vs bswap, bswap: 11 / 16 bytes = .6875 inst per byte, newhash64: 73 / 24 bytes = 3.042. With that, bswap should be 4.42 times as fast, and it is fairly close to the actual multiple: 4.29.

Although, I have to admit, the final few functions do cheat a bit. There is an increased correlation between bits 23 bytes apart. This is a bit larger than the 8 bytes tested by the correlation checking tests, so the flaw isn't picked up. Fortunately, for most uses, the statistical weakness can be ignored.

I've just updated the article with the slightly more sensible version that updates the key buffer pointer.

I haven't really tested the newer versions of MurmurHash. You are right, it'll probably be quite interesting to see how they compare with those here.

I've also just done some benchmarking with Murmurhash3. It takes 1.5s to run the benchmark on this machine. hash_sse() is about twice as fast, which is pretty amazing...

This machine is a rather old AMD Opteron, so may not have quite the same hash performance characteristics as machines based on Intel's Core architecture.

I managed to speed up FNV1A_Jesteress even further; the result, FNV1A_Yorikke, appears at:

http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture

It is copyleft i.e. free.

I would enjoy some testing from your side.

Regards

You can find it here (for both GAS and MASM, respectively):

https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/gas/i686/memhash.s

https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/msvc/i686/memhash.asm

The challenging part was performing the 128-bit multiplication and mix towards the end of the function, especially given the reduced number of available registers. My original version just used the x86 ALU instructions, and used some stack space for some temporaries.

However, I realized I could keep everything in the SSE registers, batch some of my multiplies together using the PMULUDQ instruction, and then move only what was needed to the 32-bit x86 registers to compute the final 128-bit result via the ADC instruction. This final version doesn't use any stack space for temporaries, is a bit faster as a result, and computes hash values exactly identical to the x86-64 version.

I tested it on both a Core i7 860 and a Core i7 2720QM, and against a few other hash functions compiled in 32-bit mode with Clang, GCC, and MSVC.

Against its x86-64 version it's nearly the same in terms of performance. This isn't surprising, considering that the main loops are identical between them. It only really takes a hit for small keys due to the additional logic needed to handle the 128-bit multiplies.

Of note when compared to Georgi's FNV1A variants, it's nearly as fast as them on small keys below 16 bytes (within 5%), and outperforms them on larger keys greater than 1024-2048 bytes (beating FNV1A-Yorikke by as much as 25%).

Given that Steven's hash function generates 64-bit values and has much better properties, I'd say it's a real winner. It passes all of the tests he describes on this page. Plus, being able to use the exact same hash function in both 32-bit and 64-bit modes, without having to worry about performance, brings certain benefits. Furthermore, in my English-language dictionary test it had zero collisions, even when the upper 32 bits were discarded and only the lower 32 bits of the result were considered. The higher dimensionality of the intermediate values used while the hash is accumulated definitely helps here.

I might try and port it to 32-bit ARM using NEON instructions at some point in the future. A 64-bit ARM port w/ NEON should be pretty simple to do too.

http://www.sanmayce.com/Fastest_Hash/index.html#Yoshimura

AFAIK, this is the first time the nifty technique I call INTERLEAVING has come into play: drawing/fetching from two not-so-adjacent memory positions. This approach has one more nifty speed-wise feature: reduced branching. My tests show ULTRA fast behavior:

```
E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>hash Sentence-list_00,032,359_English_The_Holy_Bible.txt
32359 lines read
65536 elements in the table (16 bits)
Jesteress:       20075 [ 6883]
Meiyan:          20195 [ 6897]
Yorikke:         19221 [ 6872]
Yoshimura:       17738 [ 6903] !!! SIGNIFICANTLY fastEST !!!
Yoshimitsu:      19699 [ 6937]
YoshimitsuTRIAD: 19491 [ 6843]
FNV-1a:          47179 [ 6840]
Larson:          48009 [ 6889]
CRC-32:          42940 [ 6891]
Murmur2:         25825 [ 6786]
Murmur3:         25753 [ 6850]
XXHfast32:       18481 [ 6859]
XXHstrong32:     20691 [ 6887]
E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>
```

It is worth exploring the Jesteress-Yorikke 8-16 GAP (i.e. 1 hash line of 4+4 vs 2 hash lines of 4+4) in order to lessen their collisions further.

Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.

I see the 'non power of 2' workaround as one MONOLITH function with no remainder mixing at all.

Idea #1 is to exterminate all the nasty IFs outside the main loop; I believe such a branchless etude will outperform Jesteress.

Idea #2 is to STRESS memory by fetching not-so-adjacent areas.

For example:

```
Key: hash_with_overlapping_aye_aye
Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'
Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'
Key_Length: 29
Loop_Counter: 3 // if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);

Loop #1 of 3:
  Hash line 1: hash
  Hash line 2: h_ov
  Hash line 3: ping

Loop #2 of 3:
  Hash line 1: _wit
  Hash line 2: erla
  Hash line 3: _aye

Loop #3 of 3:
  Hash line 1: h_ov
  Hash line 2: ppin
  Hash line 3: _aye
```

I don't know the internals, whether lines are 32/64/128 bytes long is a secondary concern.

Well, the key is too short; in reality the above key may span only 1 or 2 cache lines. If the key is longer than 4 cache lines (assuming 32-byte lines), e.g. 128+2 bytes, then it may span 5 or 6 lines.

My dummy/clueless premise is that it will be possible (in future systems) to access RAM effectively in such a manner.

Does someone know whether this type of accessing has any practical value in today's CPUs?

Of course, it is worth trying to "interleave" all the short string hashers in that way, yes?

I believe Yoshimura is the best balance between speed & dispersion on the one hand, and between LATENCY & BANDWIDTH, in the whole INTERNET.

Excuse me for this big dump, but I wanted some PRECISE benchmarking to come into play along with so many functions/tests.

Below are the results after running 32-bit code built by the Intel 12.1 compiler (/Ox used):

Linear speed on Sanmayce's 'Bonboniera' laptop (Core 2 T7500, 2200MHz, RAM bus: 666MHz Dual Channel):

```
// 'BURST_Read_4XMM128bit' Main Loop:
.B1.182:
009d0 8b f1                   mov esi, ecx
009d2 41                      inc ecx
009d3 c1 e6 06                shl esi, 6
009d6 81 f9 00 00 80 00       cmp ecx, 8388608
009dc 66 0f 6f 04 32          movdqa xmm0, XMMWORD PTR [edx+esi]
009e1 66 0f 7f 05 00 00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
009e9 66 0f 6f 4c 32 10       movdqa xmm1, XMMWORD PTR [16+edx+esi]
009ef 66 0f 7f 0d 00 00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
009f7 66 0f 6f 54 32 20       movdqa xmm2, XMMWORD PTR [32+edx+esi]
009fd 66 0f 7f 15 00 00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
00a05 66 0f 6f 5c 32 30       movdqa xmm3, XMMWORD PTR [48+edx+esi]
00a0b 66 0f 7f 1d 00 00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
00a13 72 bb                   jb .B1.182

// 'BURST_Read_8XMM128bit' Main Loop:
.B1.194:
00b00 8b d9                   mov ebx, ecx
00b02 41                      inc ecx
00b03 c1 e3 07                shl ebx, 7
00b06 81 f9 00 00 40 00       cmp ecx, 4194304
00b0c 66 0f 6f 04 1a          movdqa xmm0, XMMWORD PTR [edx+ebx]
00b11 66 0f 7f 05 00 00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
00b19 66 0f 6f 4c 1a 10       movdqa xmm1, XMMWORD PTR [16+edx+ebx]
00b1f 66 0f 7f 0d 00 00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
00b27 66 0f 6f 54 1a 20       movdqa xmm2, XMMWORD PTR [32+edx+ebx]
00b2d 66 0f 7f 15 00 00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
00b35 66 0f 6f 5c 1a 30       movdqa xmm3, XMMWORD PTR [48+edx+ebx]
00b3b 66 0f 7f 1d 00 00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
;;; xmm4 = xmmload(pointerflush+(KT+4)*16);
00b43 66 0f 6f 64 1a 40       movdqa xmm4, XMMWORD PTR [64+edx+ebx]
00b49 66 0f 7f 25 00 00 00 00 movdqa XMMWORD PTR [_xmm4], xmm4
;;; xmm5 = xmmload(pointerflush+(KT+5)*16);
00b51 66 0f 6f 6c 1a 50       movdqa xmm5, XMMWORD PTR [80+edx+ebx]
00b57 66 0f 7f 2d 00 00 00 00 movdqa XMMWORD PTR [_xmm5], xmm5
;;; xmm6 = xmmload(pointerflush+(KT+6)*16);
00b5f 66 0f 6f 74 1a 60       movdqa xmm6, XMMWORD PTR [96+edx+ebx]
00b65 66 0f 7f 35 00 00 00 00 movdqa XMMWORD PTR [_xmm6], xmm6
;;; xmm7 = xmmload(pointerflush+(KT+7)*16);
00b6d 66 0f 6f 7c 1a 70       movdqa xmm7, XMMWORD PTR [112+edx+ebx]
00b73 66 0f 7f 3d 00 00 00 00 movdqa XMMWORD PTR [_xmm7], xmm7
00b7b 72 83                   jb .B1.194

// 'BURST_Read_4DWORDS' Main Loop:
.B2.3:
02e4c 83 c3 f0                add ebx, -16
;;; hash32 = *(uint32_t *)(p+0);
02e4f 8b 07                   mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4);
02e51 8b 77 04                mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8);
02e54 8b 57 08                mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12);
02e57 8b 4f 0c                mov ecx, DWORD PTR [12+edi]
02e5a 83 c7 10                add edi, 16
02e5d 83 fb 10                cmp ebx, 16
02e60 73 ea                   jae .B2.3

// 'BURST_Read_8DWORDSi' Main Loop:
.B3.3:
;;; for(; Loop_Counter; Loop_Counter--, p += 4*sizeof(uint32_t)) {
;;; hash32 = *(uint32_t *)(p+0) ^ *(uint32_t *)(p+0+Second_Line_Offset);
02ebc 8b 07                   mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4) ^ *(uint32_t *)(p+4+Second_Line_Offset);
02ebe 8b 77 04                mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8) ^ *(uint32_t *)(p+8+Second_Line_Offset);
02ec1 8b 57 08                mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12) ^ *(uint32_t *)(p+12+Second_Line_Offset);
02ec4 8b 4f 0c                mov ecx, DWORD PTR [12+edi]
02ec7 33 04 1f                xor eax, DWORD PTR [edi+ebx]
02eca 33 74 1f 04             xor esi, DWORD PTR [4+edi+ebx]
02ece 33 54 1f 08             xor edx, DWORD PTR [8+edi+ebx]
02ed2 33 4c 1f 0c             xor ecx, DWORD PTR [12+edi+ebx]
02ed6 83 c7 10                add edi, 16
02ed9 4d                      dec ebp
02eda 75 e0                   jne .B3.3

// 'FNV1A_YoshimitsuTRIADii' Main Loop:
.B4.4:
;;; hash32 = (hash32 ^ (_rotl_KAZE(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
02f4c 8b 2b                   mov ebp, DWORD PTR [ebx]
02f4e c1 c5 05                rol ebp, 5
02f51 33 2c 33                xor ebp, DWORD PTR [ebx+esi]
02f54 33 fd                   xor edi, ebp
;;; hash32B = (hash32B ^ (_rotl_KAZE(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
02f56 8b 6c 33 04             mov ebp, DWORD PTR [4+ebx+esi]
02f5a c1 c5 05                rol ebp, 5
02f5d 33 6b 04                xor ebp, DWORD PTR [4+ebx]
02f60 33 cd                   xor ecx, ebp
;;; hash32C = (hash32C ^ (_rotl_KAZE(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
02f62 8b 6b 08                mov ebp, DWORD PTR [8+ebx]
02f65 c1 c5 05                rol ebp, 5
02f68 33 6c 33 08             xor ebp, DWORD PTR [8+ebx+esi]
02f6c 83 c3 0c                add ebx, 12
02f6f 33 c5                   xor eax, ebp
02f71 69 ff e7 d3 0a 00       imul edi, edi, 709607
02f77 69 c9 e7 d3 0a 00       imul ecx, ecx, 709607
02f7d 69 c0 e7 d3 0a 00       imul eax, eax, 709607
02f83 4a                      dec edx
02f84 75 c6                   jne .B4.4

; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FAcs";
```

And the full dump:

```
Copying a 256MB block 1024 times i.e. 256GB READ + 256GB WRITTEN ...
memcpy(): (256MB block); 262144MB copied in 141321 clocks or 1.855MB per clock
Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (512MB block); 524288MB fetched in 107812 clocks or 4.863MB per clock
Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (512MB block); 524288MB fetched in 113162 clocks or 4.633MB per clock
Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (64MB block); 524288MB fetched in 107765 clocks or 4.865MB per clock
Fetching a 6
```