## Creating a Fast Hash Function

Hash functions convert a stream of arbitrary data bytes into a single number. By the pigeon-hole principle, many possible inputs will map to the same output. However, if a hash function is chosen well, it will be difficult to find two keys that hash to the same value. A very good hash function will have output indistinguishable from that of a random number generator. Usually, though, we do not need output quite that uniform. Small correlations between the input and output bits are typically not a problem, and the larger the correlations we allow, the faster a function we can create. For example, an extremely poor hash function is:
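The listing isn't reproduced here, but a minimal C sketch of the byte-summing hash the text describes would be:

```c
#include <stddef.h>

/* Sum all the input bytes -- fast, but a very poor hash. */
static unsigned long long bad_hash(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = 0;

    while (len--)
        h += *p++;

    return h;
}
```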
This function simply sums the input bytes. If the input keys are random enough, this may be a perfectly acceptable solution. However, in the real world, the keys we wish to hash are decidedly non-random, so we would rather have something that produces a less correlated output. To quantify whether or not a hash function is any good, we will need some tests. The simplest test is to see how a stream of NUL bytes is handled. A good hash will give a different answer for each length.
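The zero-stream check can be sketched like this; FNV-1a (a well-known hash, used here purely as an example subject, not a function from this article) happens to pass it:

```c
#include <stddef.h>

/* FNV-1a, used as an example hash to run the test against */
static unsigned long long fnv1a(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL;

    while (len--)
        h = (h ^ *p++) * 0x100000001b3ULL;

    return h;
}

/* A good hash gives a different answer for each length of NUL-byte stream. */
static int passes_zero_stream_test(unsigned long long (*hash)(const void *, size_t))
{
    unsigned char zeros[64] = { 0 };
    unsigned long long seen[65];
    size_t i, j;

    for (i = 0; i <= 64; i++) {
        seen[i] = hash(zeros, i);
        for (j = 0; j < i; j++) {
            if (seen[j] == seen[i]) return 0;
        }
    }
    return 1;
}
```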
The bad add-the-input-bytes hash described above will fail this test. No matter how long the stream of zeros, it will output the same value: zero. This test also makes sure that odd-sized inputs are handled well. More advanced hash functions tend to input their data in large chunks. The handling of the extra bytes in the partially-full last chunk can be poor. Another useful thing is to test for avalanche, where altering a single bit in the key changes the hash. For a good hash, every key bit should statistically alter every output bit about 50% of the time. Bob Jenkins has some nice code that tests for this. A simple version tests that each key bit affects each hash bit:
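Bob Jenkins' code is not reproduced here, but the simple version can be sketched as follows, using a splitmix64-style finalizer (an illustrative choice, not one of the article's functions) as the hash under test:

```c
/* splitmix64-style mixer (well-known constants), the hash under test */
static unsigned long long mix64(unsigned long long x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* For every (key bit, hash bit) pair, check that the hash bit toggles at
 * least once when the key bit is flipped, over a number of trial keys. */
static int avalanche_ok(void)
{
    unsigned long long key = 0x0123456789abcdefULL;
    unsigned long long toggled[64] = { 0 };  /* toggled[k]: hash bits moved by key bit k */
    int t, k;

    for (t = 0; t < 64; t++) {
        key = key * 6364136223846793005ULL + 1442695040888963407ULL;  /* LCG trial keys */
        for (k = 0; k < 64; k++)
            toggled[k] |= mix64(key) ^ mix64(key ^ (1ULL << k));
    }
    for (k = 0; k < 64; k++) {
        if (toggled[k] != ~0ULL) return 0;  /* some hash bit never moved */
    }
    return 1;
}
```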
(The above has been cleaned up to support 64-bit hashes, and the coding style altered to match that in the rest of this website.) If a hash function doesn't pass the above test, then there is something seriously wrong with it. Next, we should test for correlation between bits. To do this, a helper function for bit-string manipulation is useful:
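The helper itself is not shown; something along these lines (an assumed shape, treating a byte buffer as a bit string) suffices for such tests:

```c
#include <stddef.h>

/* Treat a byte buffer as a bit string, least-significant bit first. */
static int get_bit(const unsigned char *buf, size_t bit)
{
    return (buf[bit / 8] >> (bit % 8)) & 1;
}

static void flip_bit(unsigned char *buf, size_t bit)
{
    buf[bit / 8] ^= (unsigned char) (1u << (bit % 8));
}
```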
We would like to test how often an output bit is changed by input bits, and would like to see a 50% correlation between every bit-pair. Unfortunately, for any given number of trials, there is statistical noise. Thus we can only say there is a problem when the correlation differs from 50% by more than an amount that varies as the square root of the total number of trials. Then there is the fact that there are multiple bit-pairs being checked simultaneously. Simply by random chance, this increases the probability that some bit-pair will signal a false over- or under-correlation. In order to get good statistical results, about a million trials are required.
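As a concrete check on those numbers: with n trials, the toggle count for one bit-pair is binomial, with a standard deviation of √n/2 about the mean of n/2. A rough pass/fail band (the z value of 5, generous enough to absorb the multiple-comparison effect, is an assumption) can be computed like this:

```c
/* Integer square root by bisection -- avoids pulling in libm. */
static unsigned long long isqrt(unsigned long long n)
{
    unsigned long long lo = 0, hi = 4294967296ULL;

    while (lo + 1 < hi) {
        unsigned long long mid = lo + (hi - lo) / 2;
        if (mid * mid <= n) lo = mid; else hi = mid;
    }
    return lo;
}

/* Allowed deviation of a toggle count from trials/2, at z standard deviations. */
static unsigned long long corr_band(unsigned long long trials, unsigned long long z)
{
    return z * isqrt(trials) / 2;
}
```

For a million trials this flags any bit-pair whose count strays more than 2500 from 500000.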
An even more complex test compares pairs of output bits with each input bit. This is more sensitive to subtle problems with a hash function, and can indicate when the output isn't quite mixed enough.
Using about a million trials, the above test will tell you whether your hash function differs from a random oracle at the 1% level. This is good enough for most (non-cryptographic) purposes. Other tests for hash functions exist; however, I have found that the above ones tend to do a good job. The Maurer test passes some quite poor hash functions. The second and third-order ANFS tests are somewhat discerning. However, the simpler bit-correlation tests are quicker to run, and give more useful information on failure.

The task for today is to construct the fastest possible hash function that passes all of these tests. (Even faster hash functions are possible, like variants of the add-all-bytes one, but they may lead to statistical problems for non-random input data.) For ultimate speed, we will use assembly language for the hash function implementations. This can give up to a 2× speedup in some cases.

The first hash function uses a similar algorithm to the parallel random number generator described in another article. The hash state is repeatedly multiplied by a large odd number, and is then "folded" with a half-shift + xor operation. These operations are chosen because they are invertible. This means that the available phase-space is conserved, and thus all possible hash values can be obtained. The multiplication by an odd number (mod a power of two) can be inverted by finding the modular inverse. The xorshift operation can be inverted by repeating itself. The multiplication mixes lower bits into higher bits. The xorshift does the reverse, mixing information downwards towards the least significant bits. The combination of the two spreads information throughout the hash state. We need two multiplication + xorshift steps per mix-in of key state in this algorithm; if only one is done, the tests do not pass. The second part of the algorithm involves dealing with the fraction of an 8-byte chunk remaining.
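The asm listing and its exact constants are not reproduced here; a C sketch of the scheme (the multiplier is an arbitrary large odd constant, chosen for illustration) looks like:

```c
#include <string.h>
#include <stddef.h>

#define MULX_M 0x2545f4914f6cdd1dULL  /* arbitrary large odd multiplier (assumed) */

static unsigned long long hash_mulxor(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = len, k;  /* seed with length so zero streams differ */

    while (len >= 8) {
        memcpy(&k, p, 8);
        h ^= k;
        h *= MULX_M; h ^= h >> 32;  /* two multiply + xorshift rounds per chunk */
        h *= MULX_M; h ^= h >> 32;
        p += 8; len -= 8;
    }

    /* tail: size-four chunk, then size-two, then size-one, to avoid overreads */
    k = 0;
    if (len & 4) { unsigned int   w; memcpy(&w, p, 4); k = w;               p += 4; }
    if (len & 2) { unsigned short s; memcpy(&s, p, 2); k = (k << 16) | s;   p += 2; }
    if (len & 1) {                                     k = (k << 8) | *p;           }
    h ^= k;

    /* final mixing: two more multiplications and one xorshift */
    h *= MULX_M; h ^= h >> 32; h *= MULX_M;

    return h;
}
```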
Since we don't want to read beyond the end of the key buffer, we mix in the size-four remaining chunk if it exists, then the size-two, and finally the size-one. This is faster than using a jump table for a calculated goto. Finally, two more multiplications and one xorshift are done to mix in the new bits. The result is an algorithm which passes the above tests, and is quite fast. The inner loop is small, and processes eight bytes per iteration.
So how should we benchmark the above? The problem is that its speed depends on the length of the key buffer. To be fair, we will need to include a few sizes in the benchmark, and weight each equally. One which does this is:
This benchmark completes in 1.24s for the above hash function on this machine. Can we do better? In fact we can. Instead of using an xorshift to shift information downwards, we can use other instructions. (Using a multiplication to shift it upwards is close to optimal.) One that comes to mind is the rotate instruction. By rotating by 32 bits we can swap the upper and lower halves of our 64-bit state. A hash function which does this is:
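A C sketch of the rotate-based variant (constants again illustrative; note the extra xor by a constant in the inner loop, which, as discussed below, this version needs in order to pass the tests):

```c
#include <string.h>
#include <stddef.h>

#define ROT_M 0x9e3779b97f4a7c15ULL  /* large odd multiplier (assumed) */
#define ROT_C 0x5851f42d4c957f2dULL  /* inner-loop xor constant (assumed) */

static unsigned long long rot32(unsigned long long x)
{
    return (x << 32) | (x >> 32);  /* swap upper and lower halves */
}

static unsigned long long hash_rot(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = len, k;

    while (len >= 8) {
        memcpy(&k, p, 8);
        h = rot32((h ^ k) * ROT_M) ^ ROT_C;  /* multiply up, rotate down */
        p += 8; len -= 8;
    }

    k = 0;
    if (len & 4) { unsigned int   w; memcpy(&w, p, 4); k = w;             p += 4; }
    if (len & 2) { unsigned short s; memcpy(&s, p, 2); k = (k << 16) | s; p += 2; }
    if (len & 1) k = (k << 8) | *p;
    h ^= k;

    /* a rotate mixes less than an xorshift, so more final rounds */
    h = rot32(h * ROT_M);
    h = rot32(h * ROT_M);
    h *= ROT_M;

    return h;
}
```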
Since a rotate doesn't do as much information mixing as an xorshift does, we need more of them. However, since far fewer instructions are required per iteration, we have a net win. This hash function successfully passes the tests, and completes the benchmark in 1.13s, which is a fair speedup. It turns out that there is an even better downward-mixing instruction than rotate: the bswap instruction, which reverses the byte order of a register, carrying the well-mixed high bytes all the way down.
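A sketch of the bswap variant, using the GCC builtin for the instruction (the constant is again an illustrative assumption):

```c
#include <string.h>
#include <stddef.h>

#define BSW_M 0xff51afd7ed558ccdULL  /* large odd multiplier (assumed) */

static unsigned long long hash_bswap(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h = len, k;

    while (len >= 8) {
        memcpy(&k, p, 8);
        /* multiply mixes upwards; bswap reverses the bytes, bringing the
         * well-mixed high bytes down -- no extra xor constant is needed */
        h = __builtin_bswap64((h ^ k) * BSW_M);
        p += 8; len -= 8;
    }

    k = 0;
    if (len & 4) { unsigned int   w; memcpy(&w, p, 4); k = w;             p += 4; }
    if (len & 2) { unsigned short s; memcpy(&s, p, 2); k = (k << 16) | s; p += 2; }
    if (len & 1) k = (k << 8) | *p;
    h ^= k;

    h = __builtin_bswap64(h * BSW_M);
    h *= BSW_M; h ^= h >> 32;

    return h;
}
```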
This is a little faster, at 1.07s. This is because we don't need the extra xor by a constant in the inner loop. (Without it, the rotate version doesn't pass the correlation tests.) Is this the end? Not quite; we can do a bit better by expanding the amount of hash state in the inner loop. By using 128 bits of state, we can use 64×64→128 bit multiplications. This is a more efficient use of processor resources. In addition, the fact that the high and low results of the multiplication are in separate registers allows us to perform the xorshift operation in a single instruction. The resulting code looks like:
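In C this shape can be sketched with GCC's unsigned __int128; the constants and exact mixing pattern here are assumptions, and the tail is simplified for brevity:

```c
#include <string.h>
#include <stddef.h>

#define W1 0x1591aefa5e7e5a17ULL  /* large odd constants (assumed) */
#define W2 0x2bb6863566c4e761ULL

static unsigned long long hash_mul128(const void *key, size_t len)
{
    const unsigned char *p = key;
    unsigned long long h1 = len ^ W1, h2 = W2, k1, k2;
    unsigned __int128 m;

    while (len >= 16) {
        memcpy(&k1, p, 8);
        memcpy(&k2, p + 8, 8);
        /* one 64x64->128 multiply per 16 bytes; the high half lands in a
         * separate "register", so folding it into the other lane is free */
        m  = (unsigned __int128) (h1 ^ k1) * W1;
        h2 = (h2 ^ k2) * W2 + (unsigned long long) (m >> 64);
        h1 = (unsigned long long) m;
        p += 16; len -= 16;
    }

    /* simplified tail for the sketch: at most 15 bytes remain */
    k1 = k2 = 0;
    if (len >= 8) { memcpy(&k1, p, 8); p += 8; len -= 8; }
    while (len--) k2 = (k2 << 8) | *p++;
    h1 ^= k1; h2 ^= k2;

    /* final mixing: two rounds of the same mul128 + fold pattern */
    m  = (unsigned __int128) h1 * W1;
    h2 = h2 * W2 + (unsigned long long) (m >> 64);
    h1 = (unsigned long long) m ^ h2;
    m  = (unsigned __int128) h1 * W1;
    h2 = h2 * W2 + (unsigned long long) (m >> 64);
    h1 = (unsigned long long) m ^ h2;

    return h1 ^ h2;
}
```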
Using the above algorithm results in superior mixing. This means that only one 128-bit multiply + xor operation is needed per 128 bits of key state. It also means that fewer mixing steps are required after the final merge. Even though the double-word multiplication instruction is slow, the fact that we are now processing twice the data per iteration more than makes up for it. The benchmark now completes in 0.84s, which well and truly breaks the one-second barrier. Of course, this isn't quite the best code. Now that we are processing 128 bits at a time, how about trying the bswap algorithm within the inner loop, and then mixing the final results together with 128-bit multiplies and xorshifts? This removes dependencies in the inner loop, and allows the processor to extract parallelism it couldn't previously.
The extra parallelism increases speed a little. Note how the superior mixing means that only a small amount of final mixing is needed. Of course, we still aren't quite done. There is one other set of instructions we haven't touched yet. The streaming SIMD extensions (SSE) allow computation on a large amount of data in a short amount of time. If we can construct a hash function which uses them efficiently, we may be able to beat the above record.

There are several problems with the SSE instructions. The first is that they require 16-byte alignment of memory. Normally this isn't a problem, as the user of an SSE function can enforce alignment. However, a hash function cannot be so choosy. Therefore, we will need to check the alignment of the input buffer, and fall back on a slightly slower inner loop in the unaligned case. Another problem with the SSE instructions is that they are not very orthogonal. There is a wide variety of instructions, but there doesn't seem to have been much of a plan about what was implemented and what was not. The instructions were selected for their use in multimedia, not for the integer arithmetic within hash functions. Thus we will need to think laterally in order to find ones useful for our task.

We require two types of instructions for the inner loop: ones which mix bits "upwards", and ones which mix them "downwards". By using both sets, we can mix all the bits, and create a nice function. The problem is that SSE instructions are designed to work on multiple chunks of data in parallel; having information cross the boundaries between those chunks isn't easy. The method we used to move information upwards previously was multiplication by large odd constants. The problem here is that the SSE2 instruction set limits which multiplications are available. We would like to use as large a multiplication as possible. However, the largest one, which computes two 32×32→64 bit multiplies, only uses half of the data within an SSE register.
Doing the same on the other half requires extra instructions to extract the information, and perhaps yet more instructions to recombine everything. Even so, it seems the best multiply instruction available is pmuludq. Mixing information downwards presents a new problem. There are no bswap or rotate instructions in SSE2. There are 128-bit shifts, but building a rotate out of them is slow. Another possibility is to use a shift + xor. However, since the 128-bit shifts are limited to units of whole bytes, there aren't many good possibilities, and the amount of mixing available with this technique is limited. What saves us are the unpack instructions, which can be used to move the well-mixed high-order parts of the state back downwards.
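The inner-loop idea can be sketched with SSE2 intrinsics. This is only a demonstration of the mixing pattern: the constants, shuffle choices, and seed are assumptions, unaligned loads are used instead of the article's aligned/unaligned dual loops, and the tail and final mixing are omitted:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

static unsigned long long sse2_mix_demo(const void *key, size_t len)
{
    /* large odd 32-bit constants in the even dword slots, where pmuludq
     * reads its inputs (values assumed for illustration) */
    const __m128i mul_a = _mm_set_epi32(0, 0x85ebca6b, 0, 0x9e3779b9);
    const __m128i mul_b = _mm_set_epi32(0, 0x27d4eb2f, 0, 0xc2b2ae35);
    __m128i state = _mm_set_epi32(0x6c078965, 0x5851f42d, 0x4c957f2d, 0x14057b7e);
    const unsigned char *p = key;
    unsigned long long out[2];

    while (len >= 16) {
        /* the real code would check alignment and use movdqa when possible */
        __m128i k = _mm_loadu_si128((const __m128i *) p);
        __m128i odd;

        state = _mm_xor_si128(state, k);
        /* bring the odd dwords into even positions so the two pmuludq
         * operations together consume the whole register */
        odd = _mm_shuffle_epi32(state, _MM_SHUFFLE(2, 3, 0, 1));
        state = _mm_xor_si128(_mm_mul_epu32(state, mul_a),
                              _mm_mul_epu32(odd, mul_b));
        /* a whole-byte shift + xor mixes information back downwards */
        state = _mm_xor_si128(state, _mm_srli_si128(state, 7));
        p += 16; len -= 16;
    }

    /* tail handling and final 128-bit mixing omitted in this sketch */
    _mm_storeu_si128((__m128i *) out, state);
    return out[0] ^ out[1] ^ (unsigned long long) len;
}
```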
The above code is quite complicated. However, it is extremely fast in the benchmark, taking 0.785s. It manages to be so fast because it processes 32 bytes per iteration of the inner loop. The relatively poor mixing of the SSE instructions is made up for by an extra 128-bit multiply and xorshift. If you are hashing keys which are large buffers, the above is a very good function to use. For smaller sizes, the non-SSE functions may be faster.


## Comments

Dmitry V'jukov said...

sub $0x10, %rsi ; subtract 16 from rsi (buffer length)

ja 1b ; continue if rsi > 0

or the C equiv:

do { } while ( (rsi -= 16) > 0 );

Shouldn't this be:

do { } while ( (rsi -= 16) > 15 );

instead? Since the last 15 bytes are handled by the final chunk parts.

Fortunately, the code can be fixed without affecting the inner loop by offsetting the size by the chunk length. We can do this by changing the initial "cmp" into a "sub", and then having a "jae" instead of a "ja" as the loop condition.

I've replaced the code in the article with the new version. The only other change is that hash_128_swap() needs to do a little more mixing in the odd-sized case. It doesn't pass the 31-byte mix-test otherwise.

$ gcc -O2 h.c -lrt

$ a.out

Asm 551332706 nanosecs (ba560400)

C 414688636 nanosecs (ba560400)

Asm 452776232 nanosecs (ba560400)

C 414525092 nanosecs (ba560400)

NH 1778447535 nanosecs (b0351680)

Asm 452955337 nanosecs (ba560400)

NH 1778373149 nanosecs (b0351680)

Code (full source here: http://www.eetbeetee.org/h.c):

unsigned int
hash_128_swapc( const void *p, unsigned int len )
{
    register unsigned long long r8 = 0x1591aefa5e7e5a17ULL,
        r9 = 0x2bb6863566c4e761ULL,
        rax = len ^ r8,
        rcx = r9,
        rdx = 0; /* initialized: otherwise rdx is read before being set
                    when (len & 8) is zero but a size-2 chunk remains */

#define bswap( r ) \
    __asm__ __volatile__ ( "bswapq %0" : "+r" (r) : : )

#define mul128( a, d, r ) \
    __asm__ __volatile__ ( "mulq %2" : "+a" (a), "=d" (d) : "r" (r) : )

    while ( len >= 16 ) {
        rax = ( rax ^ ((unsigned long long *) p)[ 0 ] ) * r8;
        rcx = ( rcx ^ ((unsigned long long *) p)[ 1 ] ) * r9;
        bswap( rax );
        bswap( rcx );
        p = &((unsigned long long *) p)[ 2 ];
        len -= 16;
    }

    if ( len != 0 ) {
        if ( ( len & 8 ) != 0 ) {
            rdx = 0;
            rax ^= ((unsigned long long *) p)[ 0 ];
            p = &((unsigned long long *) p)[ 1 ];
        }
        if ( ( len & 4 ) != 0 ) {
            rdx = ((unsigned int *) p)[ 0 ];
            p = &((unsigned int *) p)[ 1 ];
        }
        if ( ( len & 2 ) != 0 ) {
            rdx = ( rdx << 16 ) | ((unsigned short *) p)[ 0 ];
            p = &((unsigned short *) p)[ 1 ];
        }
        if ( ( len & 1 ) != 0 ) {
            rdx = ( rdx << 8 ) | ((unsigned char *) p)[ 0 ];
        }
        rcx ^= rdx;
    }

    mul128( rax, rdx, r8 );
    rcx = ( rcx * r9 ) + rdx;
    rax ^= rcx;
    mul128( rax, rdx, r8 );
    rcx = ( rcx * r9 ) + rdx;
    rax ^= rcx;

    return ( rax >> 32 ) ^ rax;
}

However, the comparison between C and asm isn't quite right. GCC is smart, and will inline the C version of the hash - removing some of the size-checks in the benchmark. To prevent that, you'll need to add __attribute__((noclone, noinline)) to get the speed that a user with a non-compiler-optimizable buffer length will see.

I counted the instructions in the newhash inner loop vs bswap: bswap is 11 instructions / 16 bytes = 0.6875 instructions per byte, newhash64 is 73 / 24 bytes = 3.042. With that, bswap should be 4.42 times as fast, which is fairly close to the actual multiple: 4.29.

Although, I have to admit, the final few functions do cheat a bit. There is an increased correlation between bits 23 bytes apart. This is a bit larger than the 8 bytes tested by the correlation checking tests, so the flaw isn't picked up. Fortunately, for most uses, the statistical weakness can be ignored.

I noticed your very first hash has a bug. It never moves the key pointer forward, so you simply add the first byte to itself len times, which is an even worse hashing function than you intended.

How do your hash functions compare to MurmurHash3? https://code.google.com/p/smhasher/

I've just updated the article with the slightly more sensible version that updates the key buffer pointer.

I haven't really tested the newer versions of MurmurHash. You are right, it'll probably be quite interesting to see how they compare with those here.

I've also just done some benchmarking with Murmurhash3. It takes 1.5s to run the benchmark on this machine. hash_sse() is about twice as fast, which is pretty amazing...

"Lockless Inc. seeks to make the highest performance software available. The goal is to optimize for modern machines, and take advantage of the changes in technology that have occurred in the past few years. We seek to extract the power of multiple cores, vectorization, and 64 bit technology to its fullest."

Hi to all,

A few months ago a new superfast hash function, based on FNV1A and named FNV1A_Jesteress, was revealed (by chance). I should like you to test it along with your implementations and, if possible, FNV1A_Mantis too. It screams on new architectures like Core i3+; for reference: http://www.strchr.com/hash_functions

It would be very useful to use your experience in sake of more clarity on this subject, don't you think?

Regards

This machine is a rather old AMD Opteron, so may not have quite the same hash performance characteristics as machines based on Intel's Core architecture.

it is interesting, sadly I have no access to AMDs (even Opteron).

Looking at code (in my view very complicated) presented on your very informative web-page I suggest (and would be very glad) you or someone else to rewrite my FNV variants, to make them 64bit, it is obvious they are with great potential, or maybe you are not so much impressed as I am?

Thanks again and good luck.

I was lucky to speed up even more FNV1A_Jesteress, it resulted in appearance of FNV1A_Yorikke at:

http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture

It is copyleft i.e. free.

I would enjoy some testing from your side.

Regards

You can find it here (for both GAS and MASM, respectively):

https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/gas/i686/memhash.s

https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/msvc/i686/memhash.asm

The challenging part was performing the 128-bit multiplication and mix towards the end of the function, especially given the reduced number of available registers. My original version just used the x86 ALU instructions, and used some stack space for some temporaries.

However, I realized I could keep everything in the SSE registers, batch some of my multiplies together using the PMULUDQ instruction, and then move only what was needed to the 32-bit x86 registers to compute the final 128-bit result via the ADC instruction. This final version doesn't use any stack space for temporaries, is a bit faster as a result, and computes hash values exactly identical to the x86-64 version.

I tested it on both a Core i7 860 and a Core i7 2720QM, and against a few other hash functions compiled in 32-bit mode with Clang, GCC, and MSVC.

Against its x86-64 version it's nearly the same in terms of performance. This isn't surprising, considering that the main loops are identical between them. It only really takes a hit for small keys due to the additional logic needed to handle the 128-bit multiplies.

Of note when compared to Georgi's FNV1A variants, it's nearly as fast as them on small keys below 16-bytes (within 5%), and out-performs them on larger keys greater than 1024-2048 bytes (beats FNV1A-Yorrike by as much as 25%).

Given that Steven's hash function generates 64-bit keys and has much better properties, I'd say it's a real winner. It passes all of his tests that he described in this page. Plus being able to use the exact same hash function in both 32-bit and 64-bit modes, and not having to worry about performance, conjures certain benefits. Furthermore, in my English language dictionary test, it had zero collisions, even when the upper 32 bits were discarded and only the lower 32 bits of the result were considered. The higher dimensionality of the intermediate values used while the hash is accumulated definitely helps here.

I might try and port it to 32-bit ARM using NEON instructions at some point in the future. A 64-bit ARM port w/ NEON should be pretty simple to do too.

it seems you have done it. I like your style.

Assembler is beauty, like stars in the sky.

You see that fine etudes behave sometimes surprisingly not as expected, I mean even a single/simple reordering of instructions leads to butterfly effects on different machines.

Sadly I am not in the assembly (no time for that fun), so it makes me joyous to see such interesting and useful works as yours.

By the way, what is the name of your hasher?

You wrote 'FNV1A-Yorrike' but please don't, it is FNV1A_Yorikke, the name is important.

In fact FNV1A_Yorikke still is my personal favoritess, yet in my tools I use FNV1A_Jesteress simply because my string (x-gram) sizes are in range 1..126 bytes - in my tests she outjuggled the rest hasher(esse)s.

I looked briefly at https://github.com/upcaste/upcaste/blob/master/src/upcore/test/cstring/cstring.hash.cpp and I got frightened how clean your style is and how chaotic mine (after all I am not a programmer) and please if it is not against your policy test FNV1A_Yorikke in your avalanche test, my primary interest lie in short English text strings, saying that your English language dictionary test is perfect interests me a lot how FNV1A_Yorikke fares there as well.

FNV1A_YoshimitsuTRIAD's main loop (code generated by Intel 12.1, /Ox used) is only (291a-28df+2)=61bytes:

[CODE]
;;; for(; wrdlen >= 3*2*sizeof(uint32_t); wrdlen -= 3*2*sizeof(uint32_t), p += 3*2*sizeof(uint32_t)) {
  028d7 83 fa 18           cmp edx, 24
  028da 72 43              jb .B4.5              ; Prob 10%
                           ; LOE eax edx ecx ebx ebp esi edi
.B4.2:                     ; Preds .B4.1
  028dc 89 34 24           mov DWORD PTR [esp], esi
                           ; LOE eax edx ecx ebx ebp edi
.B4.3:                     ; Preds .B4.2 .B4.3
;;; hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;
  028df 8b 31              mov esi, DWORD PTR [ecx]
  028e1 83 c2 e8           add edx, -24
  028e4 c1 c6 05           rol esi, 5
  028e7 33 71 04           xor esi, DWORD PTR [4+ecx]
  028ea 33 de              xor ebx, esi
;;; hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+12))) * PRIME;
  028ec 8b 71 08           mov esi, DWORD PTR [8+ecx]
  028ef c1 c6 05           rol esi, 5
  028f2 33 71 0c           xor esi, DWORD PTR [12+ecx]
  028f5 33 fe              xor edi, esi
;;; hash32C = (hash32C ^ (_rotl(*(uint32_t *)(p+16),5) ^ *(uint32_t *)(p+20))) * PRIME;
  028f7 8b 71 10           mov esi, DWORD PTR [16+ecx]
  028fa c1 c6 05           rol esi, 5
  028fd 33 71 14           xor esi, DWORD PTR [20+ecx]
  02900 83 c1 18           add ecx, 24
  02903 33 ee              xor ebp, esi
  02905 69 db e7 d3 0a 00  imul ebx, ebx, 709607
  0290b 69 ff e7 d3 0a 00  imul edi, edi, 709607
  02911 69 ed e7 d3 0a 00  imul ebp, ebp, 709607
  02917 83 fa 18           cmp edx, 24
  0291a 73 c3              jae .B4.3             ; Prob 82%
[/CODE]

As you can see from the results below on non-overclocked Mr.Eiht's Rig (i7 3930K 3.2GHz 8x4GB @1337MHz) FNV1A_YoshimitsuTRIAD is faster than FNV1A_Yorikke, but my strings are short and I need 24..30 bit hash table lookups so this 10+% gain is like driving a dragster instead of electric car in non-aligned short city streets:

BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 6101 clocks or 10.742MB per clock

FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 7307 clocks or 8.969MB per clock

FNV1A_Yorikke: (64MB block); 65536MB hashed in 7958 clocks or 8.235MB per clock

CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 44693 clocks or 1.466MB per clock

BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 5781 clocks or 14.171MB per clock

FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 7499 clocks or 10.924MB per clock

FNV1A_Yorikke: (10MB block); 81920MB hashed in 8474 clocks or 9.667MB per clock

CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 54796 clocks or 1.495MB per clock

BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 2517 clocks or 16.273MB per clock

FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 3402 clocks or 12.040MB per clock

FNV1A_Yorikke: (5MB block); 40960MB hashed in 3941 clocks or 10.393MB per clock

CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 27029 clocks or 1.515MB per clock

BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 3806 clocks or 17.219MB per clock

FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 5374 clocks or 12.195MB per clock

FNV1A_Yorikke: (2MB block); 65536MB hashed in 6246 clocks or 10.492MB per clock

CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 43143 clocks or 1.519MB per clock

BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 3109 clocks or 21.079MB per clock

FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 5195 clocks or 12.615MB per clock

FNV1A_Yorikke: (128KB block); 65536MB hashed in 6108 clocks or 10.730MB per clock

CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 42873 clocks or 1.529MB per clock

BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 3081 clocks or 21.271MB per clock

FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 5206 clocks or 12.589MB per clock

FNV1A_Yorikke: (16KB block); 65536MB hashed in 6068 clocks or 10.800MB per clock

CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 42888 clocks or 1.528MB per clock

For blocks 16KB..5MB the gain is 16%..15%, but they are cached, when the bigger part is outwith the CPU caches the gain is only (8.969-8.235)/8.235*100% = 9%, and this is linear speed which is not what I need.

No point here to be made, just sharing some thoughts.

http://www.sanmayce.com/Fastest_Hash/index.html#Yoshimura

AFAIK, for the first time the nifty technique which I call INTERLEAVING came into the play drawing/fetching from two not-so-adjacent memory positions. This approach has one more nifty speed-wise feature reducing the branching. My tests show ULTRA fast behavior:

E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>hash Sentence-list_00,032,359_English_The_Holy_Bible.txt

32359 lines read

65536 elements in the table (16 bits)

Jesteress: 20075 [ 6883]

Meiyan: 20195 [ 6897]

Yorikke: 19221 [ 6872]

Yoshimura: 17738 [ 6903] !!! SIGNIFICANTLY fastEST !!!

Yoshimitsu: 19699 [ 6937]

YoshimitsuTRIAD: 19491 [ 6843]

FNV-1a: 47179 [ 6840]

Larson: 48009 [ 6889]

CRC-32: 42940 [ 6891]

Murmur2: 25825 [ 6786]

Murmur3: 25753 [ 6850]

XXHfast32: 18481 [ 6859]

XXHstrong32: 20691 [ 6887]

E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>

It is worth the attempt to explore the Jesteress-Yorikke (i.e. 1 hash line 4+4 vs 2 hash lines 4+4) 8-16 GAP in order to lessen their collisions further more.

Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.

The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.

The idea #1 is to exterminate all nasty IFs outwith the main loop, I believe such branchless etude will outperform Jesteress.

The idea #2 is to STRESS memory by fetching not-so-adjacent areas.

For example:

Key: hash_with_overlapping_aye_aye

Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'

Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'

Key_Length: 29

Loop_Counter: 3 //if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);

Loop #1 of 3:

Hash line 1: hash

Hash line 2: h_ov

Hash line 3: ping

Loop #2 of 3:

Hash line 1: _wit

Hash line 2: erla

Hash line 3: _aye

Loop #3 of 3:

Hash line 1: h_ov

Hash line 2: ppin

Hash line 3: _aye

I don't know the internals, whether lines are 32/64/128 bytes long is a secondary concern.

Well, the key is too short, in reality the above key may span only 1|2 cache lines, if the key is longer than 4 cache lines (assuming 32) e.g. 128+2 bytes then it may span 5|6 lines.

My dummy/clueless premise is that it is possible (in future systems) to access RAM effectively in such a manner.

Does someone know whether such type of accessing has any practical value in nowadays CPUs?

Of course, it is worth trying to "interleave" in that way all the short string hashers, yes?

I believe Yoshimura is the best balance between speed & dispersion, in one hand, and between LATENCY & BANDWIDTH, in the whole INTERNET.

http://www.overclock.net/t/1319572/benchmarking-the-fastest-hash-function/20_20#post_19531418

Feel free to use it in your tasks ... freely.

Excuse me for this big dump, but I wanted some PRECISE benchmarking to come into the play along with so many functions/tests.

Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):

Linear speed on Sanmayce's 'Bonboniera' laptop (Core 2 T7500, 2200MHz, RAM bus: 666MHz Duad Channel):

// 'BURST_Read_4XMM128bit' Main Loop:

.B1.182:

009d0 8b f1 mov esi, ecx

009d2 41 inc ecx

009d3 c1 e6 06 shl esi, 6

009d6 81 f9 00 00 80

00 cmp ecx, 8388608

009dc 66 0f 6f 04 32 movdqa xmm0, XMMWORD PTR [edx+esi]

009e1 66 0f 7f 05 00

00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0

;;; xmm1 = xmmload(pointerflush+(KT+1)*16);

009e9 66 0f 6f 4c 32

10 movdqa xmm1, XMMWORD PTR [16+edx+esi]

009ef 66 0f 7f 0d 00

00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1

;;; xmm2 = xmmload(pointerflush+(KT+2)*16);

009f7 66 0f 6f 54 32

20 movdqa xmm2, XMMWORD PTR [32+edx+esi]

009fd 66 0f 7f 15 00

00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2

;;; xmm3 = xmmload(pointerflush+(KT+3)*16);

00a05 66 0f 6f 5c 32

30 movdqa xmm3, XMMWORD PTR [48+edx+esi]

00a0b 66 0f 7f 1d 00

00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3

00a13 72 bb jb .B1.182

// 'BURST_Read_8XMM128bit' Main Loop:

.B1.194:

00b00 8b d9 mov ebx, ecx

00b02 41 inc ecx

00b03 c1 e3 07 shl ebx, 7

00b06 81 f9 00 00 40

00 cmp ecx, 4194304

00b0c 66 0f 6f 04 1a movdqa xmm0, XMMWORD PTR [edx+ebx]

00b11 66 0f 7f 05 00

00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0

;;; xmm1 = xmmload(pointerflush+(KT+1)*16);

00b19 66 0f 6f 4c 1a

10 movdqa xmm1, XMMWORD PTR [16+edx+ebx]

00b1f 66 0f 7f 0d 00

00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1

;;; xmm2 = xmmload(pointerflush+(KT+2)*16);

00b27 66 0f 6f 54 1a

20 movdqa xmm2, XMMWORD PTR [32+edx+ebx]

00b2d 66 0f 7f 15 00

00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2

;;; xmm3 = xmmload(pointerflush+(KT+3)*16);

00b35 66 0f 6f 5c 1a

30 movdqa xmm3, XMMWORD PTR [48+edx+ebx]

00b3b 66 0f 7f 1d 00

00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3

;;; xmm4 = xmmload(pointerflush+(KT+4)*16);

00b43 66 0f 6f 64 1a

40 movdqa xmm4, XMMWORD PTR [64+edx+ebx]

00b49 66 0f 7f 25 00

00 00 00 movdqa XMMWORD PTR [_xmm4], xmm4

;;; xmm5 = xmmload(pointerflush+(KT+5)*16);

00b51 66 0f 6f 6c 1a

50 movdqa xmm5, XMMWORD PTR [80+edx+ebx]

00b57 66 0f 7f 2d 00

00 00 00 movdqa XMMWORD PTR [_xmm5], xmm5

;;; xmm6 = xmmload(pointerflush+(KT+6)*16);

00b5f 66 0f 6f 74 1a

60 movdqa xmm6, XMMWORD PTR [96+edx+ebx]

00b65 66 0f 7f 35 00

00 00 00 movdqa XMMWORD PTR [_xmm6], xmm6

;;; xmm7 = xmmload(pointerflush+(KT+7)*16);

00b6d 66 0f 6f 7c 1a

70 movdqa xmm7, XMMWORD PTR [112+edx+ebx]

00b73 66 0f 7f 3d 00

00 00 00 movdqa XMMWORD PTR [_xmm7], xmm7

00b7b 72 83 jb .B1.194

// 'BURST_Read_4DWORDS' Main Loop:

.B2.3:

02e4c 83 c3 f0 add ebx, -16

;;; hash32 = *(uint32_t *)(p+0);

02e4f 8b 07 mov eax, DWORD PTR [edi]

;;; hash32B = *(uint32_t *)(p+4);

02e51 8b 77 04 mov esi, DWORD PTR [4+edi]

;;; hash32C = *(uint32_t *)(p+8);

02e54 8b 57 08 mov edx, DWORD PTR [8+edi]

;;; hash32D = *(uint32_t *)(p+12);

02e57 8b 4f 0c mov ecx, DWORD PTR [12+edi]

02e5a 83 c7 10 add edi, 16

02e5d 83 fb 10 cmp ebx, 16

02e60 73 ea jae .B2.3

// 'BURST_Read_8DWORDSi' Main Loop:

.B3.3:

;;; for(; Loop_Counter; Loop_Counter--, p += 4*sizeof(uint32_t)) {

;;; hash32 = *(uint32_t *)(p+0) ^ *(uint32_t *)(p+0+Second_Line_Offset);

02ebc 8b 07 mov eax, DWORD PTR [edi]

;;; hash32B = *(uint32_t *)(p+4) ^ *(uint32_t *)(p+4+Second_Line_Offset);

02ebe 8b 77 04 mov esi, DWORD PTR [4+edi]

;;; hash32C = *(uint32_t *)(p+8) ^ *(uint32_t *)(p+8+Second_Line_Offset);

02ec1 8b 57 08 mov edx, DWORD PTR [8+edi]

;;; hash32D = *(uint32_t *)(p+12) ^ *(uint32_t *)(p+12+Second_Line_Offset);

02ec4 8b 4f 0c mov ecx, DWORD PTR [12+edi]

02ec7 33 04 1f xor eax, DWORD PTR [edi+ebx]

02eca 33 74 1f 04 xor esi, DWORD PTR [4+edi+ebx]

02ece 33 54 1f 08 xor edx, DWORD PTR [8+edi+ebx]

02ed2 33 4c 1f 0c xor ecx, DWORD PTR [12+edi+ebx]

02ed6 83 c7 10 add edi, 16

02ed9 4d dec ebp

02eda 75 e0 jne .B3.3

// 'FNV1A_YoshimitsuTRIADii' Main Loop:
.B4.4:
;;; hash32 = (hash32 ^ (_rotl_KAZE(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
02f4c 8b 2b             mov ebp, DWORD PTR [ebx]
02f4e c1 c5 05          rol ebp, 5
02f51 33 2c 33          xor ebp, DWORD PTR [ebx+esi]
02f54 33 fd             xor edi, ebp
;;; hash32B = (hash32B ^ (_rotl_KAZE(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
02f56 8b 6c 33 04       mov ebp, DWORD PTR [4+ebx+esi]
02f5a c1 c5 05          rol ebp, 5
02f5d 33 6b 04          xor ebp, DWORD PTR [4+ebx]
02f60 33 cd             xor ecx, ebp
;;; hash32C = (hash32C ^ (_rotl_KAZE(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
02f62 8b 6b 08          mov ebp, DWORD PTR [8+ebx]
02f65 c1 c5 05          rol ebp, 5
02f68 33 6c 33 08       xor ebp, DWORD PTR [8+ebx+esi]
02f6c 83 c3 0c          add ebx, 12
02f6f 33 c5             xor eax, ebp
02f71 69 ff e7 d3 0a 00 imul edi, edi, 709607
02f77 69 c9 e7 d3 0a 00 imul ecx, ecx, 709607
02f7d 69 c0 e7 d3 0a 00 imul eax, eax, 709607
02f83 4a                dec edx
02f84 75 c6             jne .B4.4

; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FAcs";
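For reference, here is the same inner loop reassembled as C from the `;;;` comments. The multiplier 709607 (PRIME) is visible in the `imul` lines; the rotate-by-5 is the `rol ebp, 5` instructions. The initial lane values (FNV offset basis) and the final combine are my assumptions for this sketch, not taken from the listing:

```c
#include <stdint.h>
#include <stddef.h>

#define PRIME 709607u  /* the multiplier visible in the imul lines above */
#define _rotl_KAZE(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Sketch of the 'FNV1A_YoshimitsuTRIADii' inner loop: three interleaved
   FNV1a-style lanes, each mixing a rotated dword from one position with a
   plain dword Second_Line_Offset bytes ahead, advancing 12 bytes per
   iteration.  Note lane B rotates the *far* dword while lanes A and C
   rotate the near one, exactly as in the listing. */
static uint32_t fnv1a_yoshimitsu_triad_ii_loop(const char *p, size_t iters,
                                               size_t Second_Line_Offset)
{
    uint32_t hash32  = 2166136261u;  /* assumed FNV offset basis */
    uint32_t hash32B = 2166136261u;
    uint32_t hash32C = 2166136261u;
    for (; iters; iters--, p += 3 * sizeof(uint32_t)) {
        hash32  = (hash32  ^ (_rotl_KAZE(*(const uint32_t *)(p + 0), 5)
                              ^ *(const uint32_t *)(p + 0 + Second_Line_Offset))) * PRIME;
        hash32B = (hash32B ^ (_rotl_KAZE(*(const uint32_t *)(p + 4 + Second_Line_Offset), 5)
                              ^ *(const uint32_t *)(p + 4))) * PRIME;
        hash32C = (hash32C ^ (_rotl_KAZE(*(const uint32_t *)(p + 8), 5)
                              ^ *(const uint32_t *)(p + 8 + Second_Line_Offset))) * PRIME;
    }
    return hash32 ^ hash32B ^ hash32C;  /* assumed final combine */
}
```

Three independent multiply chains is what hides the imul latency: while one lane's multiply is in flight, the other two lanes' loads and xors execute.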

And the full dump:

Copying a 256MB block 1024 times i.e. 256GB READ + 256GB WRITTEN ...
memcpy(): (256MB block); 262144MB copied in 141321 clocks or 1.855MB per clock

Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (512MB block); 524288MB fetched in 107812 clocks or 4.863MB per clock
Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (512MB block); 524288MB fetched in 113162 clocks or 4.633MB per clock

Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (64MB block); 524288MB fetched in 107765 clocks or 4.865MB per clock
Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (64MB block); 524288MB fetched in 113085 clocks or 4.636MB per clock

Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (128KB block); 524288MB fetched in 37580 clocks or 13.951MB per clock
Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (128KB block); 524288MB fetched in 36380 clocks or 14.411MB per clock

Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (16KB block); 524288MB fetched in 20436 clocks or 25.655MB per clock
Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (16KB block); 524288MB fetched in 17456 clocks or 30.035MB per clock

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15148 clocks or 4.326MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 14087 clocks or 4.652MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14539 clocks or 4.508MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15912 clocks or 4.119MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16427 clocks or 3.990MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14695 clocks or 4.460MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71557 clocks or 0.916MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9532 clocks or 6.875MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 8907 clocks or 7.358MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 9407 clocks or 6.967MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10187 clocks or 6.433MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 9984 clocks or 6.564MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69763 clocks or 0.939MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9220 clocks or 7.108MB per clock
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 10577 clocks or 6.196MB per clock
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 10686 clocks or 6.133MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9656 clocks or 6.787MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10094 clocks or 6.493MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 11278 clocks or 5.811MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 69795 clocks or 0.939MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7909 clocks or 8.286MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8923 clocks or 7.345MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 8955 clocks or 7.318MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9719 clocks or 6.743MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9734 clocks or 6.733MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69732 clocks or 0.940MB per clock
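The 'MB per clock' column in the dump is simply the total megabytes processed divided by the reported clock count, as a quick check against one line shows:

```c
/* Sanity check of the dump's throughput column: e.g. the
   FNV1A_YoshimitsuTRIADii line of the 64MB run reports 65536MB hashed in
   14539 clocks, i.e. 65536 / 14539 = 4.508 MB per clock. */
static double mb_per_clock(double megabytes, double clocks)
{
    return megabytes / clocks;
}
```

The same division reproduces every figure in the tables above.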

Mr. Towner, what do you think: are you as impressed by 'YoshimitsuTRIADii' as I am?

This is my last post; I hope other people use the above info/techniques to improve their own functions.