Lockless Inc

Creating a Fast Hash Function

Hash functions convert a stream of arbitrary data bytes into a single number. By the pigeon-hole principle, many possible inputs will map to the same output. However, if a hash function is chosen well, then it is difficult to find two keys that will hash to the same value. A very good hash function will have an output indistinguishable from a random number generator.

However, usually we do not wish to have output quite that uniform. Small correlations between the input and output bits are typically not a problem provided they are not large. The larger the correlations we allow, the faster a function we can create. For example an extremely poor hash function is:


typedef unsigned long long u64b;
u64b hash(char *key, u64b len)
{
	u64b result = 0;
	while (len)
	{
		result += *key;
		key++;
		len--;
	}
	
	return result;
}

This function simply sums the input bytes. If the input keys are random enough, then this may be a perfect solution. However, in the real world, the keys we wish to hash are decidedly non-random. We thus would rather have something that would produce a less correlated output.

To quantify whether or not a hash function is any good, we will need some tests. The simplest test is to see how a stream of NUL bytes is handled. A good hash will give a different answer for each length.


/* check for problems with nulls and odd-sized hashes */
static void simple_string_test(void)
{
	byte buf[8];
	u64b i;

	memset(buf, 0, 8);

	printf("These should all be different\n");
	for (i=0; i<8; i++)
	{
		printf("%lld  0-byte strings, hash is %llx\n", i, hash(buf, i));
	}
	
	memset(buf, 42, 8);
	printf("These should also all be different\n");
	for (i=1; i<8; i++)
	{
		printf("%lld  42-byte strings, hash is %llx\n", i, hash(buf, i));
	}
	
	printf("These should also all be different\n");
	for (i=1; i<8; i++)
	{
		buf[i] += i;
	}
	
	for (i=1; i<8; i++)
	{
		printf("%lld  42+-byte strings, hash is %llx\n", i, hash(buf, i));
	}
}

The bad add-the-input-bytes hash described above will fail this test. No matter how long the stream of zeros, it will output the same value: zero. This test also makes sure that odd-sized inputs are handled well. More advanced hash functions tend to input their data in large chunks. The handling of the extra bytes in the partially-full last chunk can be poor.

Another useful thing is to test for avalanche, where altering a single bit in the key changes the hash. For a good hash, every key bit should statistically alter every output bit about 50% of the time. Bob Jenkins has some nice code that tests for this. A simple version tests that each key bit affects each hash bit:


/* check that every input bit changes every output bit half the time */
#define MAXPAIR 80
#define MAXLEN 100
static void avalanche(void)
{
	byte qa[MAXLEN+1], qb[MAXLEN+2], *a = &qa[0], *b = &qb[1];
	u64b c, d, i, j = 0, k, l, z;
	u64b e, f, g, h;
	u64b x, y;
	u64b hlen;

	printf("No more than %d trials should ever be needed \n",MAXPAIR/2);
	for (hlen = 0; hlen < MAXLEN; ++hlen)
	{
		z = 0;
		
		/* For each input byte, */
		for (i = 0; i < hlen; ++i)
		{
			/* for each input bit, */
			for (j = 0; j < 8; ++j)
			{
				e=f=g=h=x=y=~(u64b)0;

				/* check that every output bit is affected by that input bit */
				for (k = 0; k < MAXPAIR; k += 2)
				{
					/* keys have one bit different */
					for (l=0; l<hlen+1; ++l)
					{
						a[l] = b[l] = 0;
					}

					/* have a and b be two keys differing in only one bit */
					a[i] ^= (k<<j);
					a[i] ^= (k>>(8-j));
					c = hash(a, hlen);
					b[i] ^= ((k+1)<<j);
					b[i] ^= ((k+1)>>(8-j));
					d = hash(b, hlen);

					e &= c^d;
					f &= ~(c^d);
					g &= c;
					h &= ~c;
					x &= d;
					y &= ~d;
					
					if (!(e|f|g|h|x|y)) break;
				}
				if (k > z) z = k;
				if (k == MAXPAIR)
				{
					printf("Some bit didn't change: ");
					printf("%.8llx %.8llx %.8llx %.8llx %.8llx %.8llx  ", e,f,g,h,x,y);
					printf("i %lld j %lld len %lld\n",i,j,hlen);
				}
				if (z == MAXPAIR) goto done;
			}
		}
		
done:
		if (z < MAXPAIR)
		{
			printf("Mix success  %2lld bytes ",i);
			printf("required  %lld  trials\n",z/2);
		}
	}
	printf("\n");
}

(The above has been cleaned up to support 64bit hashes and the coding style altered to match that in the rest of this website.) If a hash function doesn't pass the above test, then there is something seriously wrong with it.

Next, we should test for correlation between bits. To do this, a helper function for bit-string manipulation is useful:



/* Modify the k-th bit */
static void modify_bit(byte *a, unsigned k)
{
	/* XOR the appropriate byte in the bitstring with the right bit */
	a[k/8] ^= 1 << (k&7);
}

We would like to test how often an output bit is changed by input bits, and would like to see a 50% correlation between every bit-pair. Unfortunately, for any given number of trials, there is statistical noise. Thus we can only say there is a problem when the correlation differs from 50% by more than an amount that varies as the square root of the total number of trials. Then there is the fact that there are multiple bit-pairs we are checking simultaneously. Simply by random chance, this increases the probability that any given bit-pair will signal a false over or under-correlation. In order to get good statistical results, about a million trials are required.


static void corr_1st_order(u64b trials, int size)
{
	int i, j, k;

	byte *state = calloc(size, 1);
	byte *save = calloc(size, 1);
	
	u64b inb, outb;

	int *t = calloc(sizeof(int) * 64 * size * 8, 1);
	
	double x, y, ssq = 0;
	double maxt = 50.0, mint = 50.0;
	double sfactor = 4.0*64/sqrt(trials);
	
	srandom(getpid() * time(NULL));

	for (i = 0; i < trials; i++)
	{
		for (j = 0; j < size; j++)
		{
			state[j] = random();
		}
	
		memcpy(save, state, size);
	
		inb = hash(state, size);
	
		for (k = 0; k < 8 * size; k++)
		{
			memcpy(state, save, size);
			
			modify_bit(state, k);
			
		    outb = hash(state, size) ^ inb;
			
		    for (j = 0; j < 64; j++)
		    {
		        t[k * 64 + j] += outb & 1;
				outb /= 2;
		    }
		}
	}

	for (i = 0; i < 8 * size; i++)
	{
		for (j = 0; j < 64; j++)
		{
			x = t[i * 64 + j] * 100.0 / trials;
			if (x > maxt) maxt = x;
			if (x < mint) mint = x;
			
			y = fabs(x - 50.0);
			if (y > sfactor) printf("Bad value %f (%d %d)\n", x, i, j);
			
			ssq += (x-50)*(x-50);
		}
	}
	
	ssq /= (64 * 8 * size);
	
	printf("Max %f Min %f Variance %f sfactor %f\n", maxt, mint, ssq, sfactor);
	
	free(t);
	free(state);
	free(save);
}

An even more complex test compares pairs of output bits with each input bit. This is more sensitive to subtle problems with a hash function, and can indicate when the output isn't quite mixed enough.


static void corr_2nd_order(u64b trials, int size)
{
	int i, j, k, l;

	byte *state = calloc(size, 1);
	byte *save = calloc(size, 1);
	
	u64b inb, outb, o1, o2;

	int *t = calloc(sizeof(int) * 64 * 63 * size * 8, 1);
	
	double x, y, ssq = 0;
	double maxt = 50.0, mint = 50.0;
	double sfactor = 3.0*64/sqrt(trials);
	
	srandom(getpid() * time(NULL));

	for (i = 0; i < trials; i++)
	{
		for (j = 0; j < size; j++)
		{
			state[j] = random();
		}
	
		memcpy(save, state, size);
	
		inb = hash(state, size);
	
		for (k = 0; k < 8 * size; k++)
		{
			memcpy(state, save, size);
			
			modify_bit(state, k);
			
		    outb = hash(state, size) ^ inb;
			
			o1 = outb;
			
		    for (j = 0; j < 64; j++)
		    {
				o2 = o1;
				for (l = j + 1; l < 64; l++)
				{
					o2 /= 2;
					t[k * 64 * 63 + j * 63 + l - 1] += (o1 ^ o2) & 1;
				}
				
				o1 /= 2;
		    }
		}
	}

	for (i = 0; i < 8 * size; i++)
	{
		for (j = 0; j < 64; j++)
		{
			for (k = j + 1; k < 64; k++)
			{
				x = t[i * 64 * 63 + j * 63 + k - 1] * 100.0 / trials;
				if (x > maxt) maxt = x;
				if (x < mint) mint = x;

				y = fabs(x - 50.0);
				if (y > sfactor) printf("Bad value %f (%d %d %d)\n", x, i, j, k);

				ssq += (x-50)*(x-50);
			}
		}
	}
	
	ssq /= (64 * 63 * 8 * size);
	
	printf("Max %f Min %f Variance %f sfactor %f\n", maxt, mint, ssq, sfactor);
	
	free(t);
	free(state);
	free(save);
}

Using about a million trials, the above test will say if your hash function differs from a random oracle at the 1% level. This is good enough for most (non-cryptographic) purposes.

Other tests for hash-functions exist. However, I have found that the above ones tend to do a good job. The Maurer test passes some quite poor hash functions. The second and third-order ANFS tests are somewhat discerning. However, the simpler bit correlation tests are quicker to run, and give more useful information on failure.

The task for today is to construct the fastest possible hash function that passes all of these tests. (Even faster hash functions are possible, like variants of the add-all-bytes one, but may lead to statistical problems for non-random input data.) For ultimate speed, we will use assembly language for the hash function descriptions. This can give up to a 2× speedup in some cases.

The first hash function uses a similar algorithm to the parallel random number generator described in another article. The hash state is repeatedly multiplied by a large odd number. It then is "folded" with a half-shift + xor operation. These operations are chosen because they are invertible. This means that the available phase-space is conserved, and thus all possible hash values can be obtained.

The multiplication by an odd number (mod a power of two) can be inverted by finding the modular inverse. The xorshift operation can be inverted by repeating itself. The multiplication mixes lower bits into higher bits. The xorshift does the reverse, mixing information downwards to the least significant bits. The combination of the two spreads information throughout the hash state. We need two multiplication + xorshifts per mix-in of key state in this algorithm. If only one is done, then the tests do not pass.

The second part of the algorithm involves dealing with the fraction of an 8-byte chunk remaining. Since we don't want to read beyond the end of the key buffer, we mix in the size-four remaining chunk if it exists, then the size two, and finally the size one. This is faster than using a jump-table for a calculated goto.

Finally two more multiplications, and one xorshift is done to mix in the new bits. The result is an algorithm which passes the above tests, and is quite fast. The inner loop is quite small, and it processes eight bytes per iteration.


# %rsi = buffer pointer, %rdi = buffer length
.globl hash_asm
.type   hash_asm,@function
hash_asm:
	# The large odd multiplication constant
	movabs  $0x6595a395a1ec531b, %r8
	mov     %rsi, %rdx
	xor     %r8, %rdx
	
	# Do we only have a partial chunk to do?
	sub     $0x8, %rsi
	jb      2f
	
	# The main loop
.align 8, 0x90
1:	xor     (%rdi), %rdx
	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	mov     %rax, %rdx
	shr     $0x20, %rax
	xor     %rax, %rdx
	sub     $0x8, %rsi
	jae     1b

	# Is there a partial chunk remaining?
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Handle the final mixing
3:	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	imul    %r8, %rax
	retq
		
	# Handle a partial chunk
4:	xor     %rax, %rax
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %eax
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rax
	add     %rcx, %rax
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rax
	add     %rcx, %rax
	
	# Mix in the partial chunk
4:	xor     %rax, %rdx
	imul    %r8, %rdx
	mov     %rdx, %rax
	shr     $0x20, %rdx
	xor     %rdx, %rax
	imul    %r8, %rax
	mov     %rax, %rdx
	shr     $0x20, %rax
	xor     %rax, %rdx
	jmp     3b
.size	hash_asm, .-hash_asm

So how should we benchmark the above? The problem is that its speed depends on the length of the key buffer. To be fair, we will need to include a few sizes in the benchmark, and weight each equally. One which does this is:


static void benchmark(void)
{
	int size = 1 << 28;
	byte *buf = calloc(size, 1);
	u64b result = 0;
	
	int i;
	
	for (i = 0; i < size / 8; i++)
	{
		result += hash(buf, 8);
	}
	
	for (i = 0; i < size / 32; i++)
	{
		result += hash(buf, 32);
	}
	
	for (i = 0; i < size / 1024; i++)
	{
		result += hash(buf, 1024);
	}
	
	for (i = 0; i < size / (1 << 16); i++)
	{
		result += hash(buf, 1 << 16);
	}
	
	for (i = 0; i < size / (1 << 22); i++)
	{
		result += hash(buf, 1 << 22);
	}
	
	printf("Result : %llu\n", result);
}

This benchmark completes in 1.24s for the above hash function on this machine. Can we do better? In fact we can. Instead of using an xorshift to shift information downwards, we can use other instructions. (Using a multiplication to shift it upwards is fairly optimal.) One that comes to mind is the rotate instruction. By rotating by 32 bits we can swap the upper and lower parts of our 64 bit state. A hash function which does this is:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_rol
.type   hash_rol,@function
hash_rol:
	movabs  $0xac95a395a9ec535b, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	sub     $0x8, %rsi
	jb      2f
	
	# Use a rotate instruction to mix downwards
.align 8, 0x90
1:	xor     (%rdi), %rax
	imul    %r8, %rax
	rol     $32, %rax
	xor     %r8, %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	rol     $32, %rax
	sub     $0x8, %rsi
	jae     1b
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Use rotates + multiplications for final mixing
3:	imul    %r8, %rax
	rol     $32, %rax
	xor     %r8, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rdx
	add     %rcx, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rdx
	add     %rcx, %rdx
	
	# Mix in partial chunk using rotates
4:	xor     %rdx, %rax
	imul    %r8, %rax
	rol     $32, %rax
	imul    %r8, %rax
	rol     $32, %rax
	jmp     3b
.size	hash_rol, .-hash_rol

Since a rotate doesn't do as much information mixing as an xorshift does, we need more of them. However, since many less instructions are required per iterations, we have a net win. This hash function successfully passes the tests, and completes the benchmark in 1.13s, which is a fair speedup.

It turns out that there is an even better downward-mixing instruction than rotate. The bswap instruction reverses the bytes in a register. This is perfect for our purposes. Replacing the rotates with this gives:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_bswap
.type   hash_bswap,@function
hash_bswap:
	movabs  $0xac95a395a9ec535b, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	sub     $0x8, %rsi
	jb      2f
	
	# Use a bswap instruction to mix downwards
.align 8, 0x90
1:	xor     (%rdi), %rax
	imul    %r8, %rax
	bswap   %rax
	add     $0x8, %rdi
	imul    %r8, %rax
	bswap    %rax
	sub     $0x8, %rsi
	jae     1b
2:	cmp     $-0x8, %rsi
	jne     4f
	
	# Use bswaps + multiplications for final mixing
3:	imul    %r8, %rax
	bswap   %rax
	xor     %r8, %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %ecx
	shl     $0x10, %rdx
	add     %rcx, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %ecx
	shl     $0x8, %rdx
	add     %rcx, %rdx
	
	# Mix in partial chunk using bswaps
4:	xor     %rdx, %rax
	imul    %r8, %rax
	bswap   %rax
	imul    %r8, %rax
	bswap   %rax
	jmp     3b
.size	hash_bswap, .-hash_bswap

This is a little faster, at 1.07s This is because we don't need the extra xor by the constant in the inner loop. (Without it, the rotate version doesn't pass the correlation tests.)

Is this the end? Not quite, we can do a bit better by expanding the amount of hash state in the inner loop. By using 128 bits of state, we can use 64×64 →128 bit multiplications. This is a more efficient use of processor resources. In addition, the fact that the high and low results of the multiplication are in separate registers allows us to perform the xorshift operation in a single instruction. The resulting code looks like:


# rsi = buffer pointer, %rdi = buffer length
.globl hash_128
.type   hash_128,@function
hash_128:
	movabs  $0x1591aefa5e7e5a17, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	movabs  $0x2bb6863566c4e761, %r9
	mov     %r9, %rcx
	sub     $0x10, %rsi
	jb      2f
	
# Process 128bits = 16 bytes at a time.
.align 8, 0x90
1:	xor     (%rdi), %rax
	xor     8(%rdi), %rcx

# Only one mix operation needed per iteration.
	mul     %r8
	add     $0x10, %rdi
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	sub     $0x10, %rsi
	jae     1b
2:	cmp     $-0x10, %rsi
	jne     4f

# Final mix using 128bit mul and xorshift.
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq

# Need to mix in up to 15 bytes.
4:	xor     %rdx, %rdx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	jmp    3b
.size	hash_128, .-hash_128

Using the above algorithm results in superior mixing. This means that only one 128 bit + xor operation is needed per 128 bits of key state. It also means that less mixes are required after the final merge. Even though the double-word multiplication instruction is slow, the fact that we are now processing twice the data per iteration more than makes up for it. The benchmark now completes in 0.84s, which well and truly breaks the one-second barrier.

Of course, this isn't the quite best code. Now that we are processing 128 bits at a time, how about trying the bswap algorithm within the inner loop, and then mixing the final results together with 128 bit multiplies and xorshifts? This removes dependencies in the inner loop, and allows the processor to extract extra parallelism it couldn't previously.


# rsi = buffer pointer, %rdi = buffer length
.globl hash_128_swap
.type   hash_128_swap,@function
hash_128_swap:
	movabs  $0x1591aefa5e7e5a17, %r8
	mov     %rsi, %rax
	xor     %r8, %rax
	movabs  $0x2bb6863566c4e761, %r9
	mov     %r9, %rcx
	sub     $0x10, %rsi
	jb      2f
	
# Use two multiply + bswap algorithms in parallel
.align 8, 0x90
1:	xor     (%rdi), %rax
	xor     8(%rdi), %rcx
	add     $0x10, %rdi
	imul    %r8, %rax
	imul    %r9, %rcx
	bswap   %rax
	bswap   %rcx
	sub     $0x10, %rsi
	jae      1b
2:	cmp     $-0x10, %rsi
	jne     4f

# Finish off with 128bit multiples to mix.
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq
4:	xor     %rdx, %rdx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	mul     %r8
	imul	%r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	jmp     3b
.size	hash_128_swap, .-hash_128_swap

The extra parallelism increases speed a little bit. Note how the superior mixing means that we only need one bswap per 64bit half-state in the above. (Instead of the two bswaps for full state in the earlier algorithm.) This code takes 0.81s in the benchmark, and as a result the above algorithm is highly recommended.

Of course we still aren't quite done. There is one other set of instructions we haven't touched yet. The streaming SIMD instructions (SSE) allow computation on a large amount of data in a short amount of time. If we can construct a hash function which uses them efficiently, we may be able to beat the above record.

There are several problems with the SSE instructions. The first is that they require 16-byte alignment of memory. Normally, this isn't a problem, as the user of a SSE function can enforce alignment, However, a hash function cannot be so choosey. Therefor, we will need to check the alignment of the input buffer, and fall back on a slightly slower inner loop in the unaligned case.

Another problem with the SSE instructions is that they are not very orthogonal. There are a wide variety of instructions, but there doesn't seem to have been much of a plan about what was implemented, and what not. The instructions chosen were selected on their use in multimedia, not the integer arithmetic within hash functions. Thus we will need to think laterally in order to find ones useful for our task.

We require two types of instructions for the inner loop. Ones which mix bits "upwards", and ones which mix them "downwards". By using both sets, we can mix all the bits, and create a nice function. The problem is that SSE instructions are designed to work on multiple bits of data in parallel. Having information cross the boundaries between those chunks of data isn't easy.

The method we used to move information upwards previously was via multiplication by large odd constants. The problem here is that we are limited in the SSE2 instruction set about which multiplications are allowed. We would like to use as large a multiplication as possible. However, the largest one which computes two 32×32→64 bit multiplies only uses half of the data within a SSE register. Doing the same on the other half requires other instructions to extract the information, and perhaps yet more instructions to recombine everything.

It seems the best multiply instruction available is pmullw, which does a 16×16→16 multiply. This doesn't do very much mixing, which may be a problem. If only a 32 bit or 64 bit version of the above existed in SSE2. (The higher SSE implementations cannot be guaranteed to exist on a 64bit machine.)

Mixing information downwards presents a new problem. There are no bswap or rotate instructions in SSE2. There are 128 bit shifts, but implementing a rotate out of them is slow. Another possibility is to use a shift + xor. However, since the 128 bit shifts are limited to be in units of whole bytes, there aren't many good possibilities. The amount of mixing available with this technique is limited.

What saves us are the unpack instructions. We can use punpckhbw and punpcklbw to interleave the bytes from two SSE registers. This, when combined with the 16 bit multiply yields the following algorithm:


# rsi = buffer pointer, %rdi = buffer length
.align 16
hash_sse_c:
.quad	0x5a379ab38dc5a46b
.quad	0xd6c573e9c613993d 
.globl hash_sse
.type   hash_sse,@function
hash_sse:
	movabs  $0x1591aefa5e7e5a17, %r8
	movq    %rsi, %xmm0
	movaps  hash_sse_c(%rip), %xmm2
	movabs  $0x2bb6863566c4e761, %r9
	movdqa  %xmm2, %xmm1
	pxor    %xmm2, %xmm0
	sub     $0x20, %rsi
	jb      2f
	test    $0x0f, %rdi
	jne     5f

# Inner loop - use 16 bit mul and unpack to mix
.align 8, 0x90
1:	paddq   (%rdi), %xmm0
	paddq   0x10(%rdi), %xmm1
	pmullw  %xmm2, %xmm0
	pmullw  %xmm2, %xmm1
	add     $0x20, %rdi
	movdqa  %xmm0, %xmm3
	punpckhbw %xmm1, %xmm0
	punpcklbw %xmm3, %xmm1
	sub     $0x20, %rsi
	jae     1b
	pxor    %xmm1, %xmm0
2:	cmp     $-0x20, %rsi
	jne     4f

#	Extract 128 bits of state from %xmm0
	movhlps %xmm0, %xmm1
	movq    %xmm0, %rax
	movq    %xmm1, %rcx

# Use 128bit multiple and xorshift for final mix
3:	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	mul     %r8
	imul    %r9, %rcx
	add     %rdx, %rcx
	xor     %rcx, %rax
	retq

# The final chunk can be up to 31 bytes in size
4:	test    $0x10, %rsi
	je      4f
	movups  (%rdi), %xmm3
	pxor    %xmm3, %xmm0
	add     $0x10, %rdi
	pmullw  %xmm2, %xmm0
4:	xor     %rdx, %rdx
	movhlps %xmm0, %xmm1
	movq    %xmm0, %rax
	movq    %xmm1, %rcx
	test    $0x8, %rsi
	je      4f
	xor     (%rdi), %rax
	add     $0x8, %rdi
4:	test    $0x4, %rsi
	je      4f
	mov     (%rdi), %edx
	add     $0x4, %rdi
4:	test    $0x2, %rsi
	je      4f
	movzwl  (%rdi), %r10d
	shl     $0x10, %rdx
	add     %r10, %rdx
	add     $0x2, %rdi
4:	test    $0x1, %rsi
	je      4f
	movzbl  (%rdi), %edi
	shl     $0x8, %rdx
	add     %rdi, %rdx
4:	xor     %rdx, %rcx
	jmp 3b
	
# miss-aligned source inner loop
.align	16
5:	movups  (%rdi), %xmm3
	movups  0x10(%rdi), %xmm4
	paddq   %xmm3, %xmm0
	paddq   %xmm4, %xmm1
	pmullw  %xmm2, %xmm0
	pmullw  %xmm2, %xmm1
	add     $0x20, %rdi
	movdqa  %xmm0, %xmm3
	punpckhbw %xmm1, %xmm0
	punpcklbw %xmm3, %xmm1
	sub     $0x20, %rsi
	jae     5b
	pxor   %xmm1, %xmm0
	jmp     2b
.size	hash_sse, .-hash_sse

The above code is quite complicated. However, it is extremely fast in the benchmark, taking 0.785s. It manages to be so fast because it processes 32 bytes for every iteration of the inner loop. The relatively poor mixing of the SSE instructions is made up for by an extra 128bit multiply and xorshift. If you are hashing keys which are large buffers, the above is a very good function to use. For smaller sizes, the non-sse functions may be faster.

Comments

Dmitry V'jukov said...
Thanks! I really like your blog. You know, it's difficult to find anything deep for developers those days.
sfuerst said...
Thanks! I try to come up with something new and interesting every two weeks or so to write about.
Chris Anderson said...
In most of the inner loops, there's the sequence that tests for branching:

sub $0x10, %rsi ; subtract 16 from rsi (buffer length)
ja 1b ; continue if rsi > 0

or the C equiv:

 do { } while ( (rsi -= 16) > 0 );

Shouldn't this be:

 do { } while ( (rsi -= 16) > 15 );

instead? Since the last 15 bytes are handled by the final chunk parts.
sfuerst said...
Oops, you are right. How embarrassing! Thanks for letting us know so it can be fixed.

Fortunately, the code can be fixed without affecting the inner loop by offsetting the size by the chunk length. We can do this by changing the initial "cmp" into a "sub", and then having a "jae" instead of a "ja" as the loop condition.

I've replaced the code in the article with the new version. The only other change is that hash_128_swap() needs to do a little more mixing in the odd-sized case. It doesn't passing the 31-byte mix-test otherwise.
Chris Anderson said...
I converted one of the versions to C, just to see how close gcc can come to hand coded asm, and the results are pretty good using the benchmark() function above:

$ gcc -O2 h.c -lrt
$ a.out
Asm 551332706 nanosecs (ba560400)
C 414688636 nanosecs (ba560400)
Asm 452776232 nanosecs (ba560400)
C 414525092 nanosecs (ba560400)
NH 1778447535 nanosecs (b0351680)
Asm 452955337 nanosecs (ba560400)
NH 1778373149 nanosecs (b0351680)

code, full source here: http://www.eetbeetee.org/h.c

unsigned int
hash_128_swapc( const void *p, unsigned int len )
{
  register unsigned long long r8 = 0x1591aefa5e7e5a17ULL,
                              r9 = 0x2bb6863566c4e761ULL,
                              rax = len ^ r8,
                              rcx = r9,
                              rdx;
#define bswap( r ) \
  __asm__ __volatile__ ( "bswapq %0" : "+r" (r) : : )
#define mul128( a, d, r ) \
  __asm__ __volatile__ ( "mulq %2" : "+a" (a), "=d" (d) : "r" (r) : )

  while ( len >= 16 ) {
    rax = ( rax ^ ((unsigned long long *) p)[ 0 ] ) * r8;
    rcx = ( rcx ^ ((unsigned long long *) p)[ 1 ] ) * r9;
    bswap( rax );
    bswap( rcx );
    p = &((unsigned long long *) p)[ 2 ];
    len -= 16;
  }
  if ( len != 0 ) {
    if ( ( len & 8 ) != 0 ) {
      rdx = 0;
      rax ^= ((unsigned long long *) p)[ 0 ];
      p = &((unsigned long long *) p)[ 1 ];
    }
    if ( ( len & 4 ) != 0 ) {
      rdx = ((unsigned int *) p)[ 0 ];
      p = &((unsigned int *) p)[ 1 ];
    }
    if ( ( len & 2 ) != 0 ) {
      rdx = ( rdx << 16 ) | ((unsigned short *) p)[ 0 ];
      p = &((unsigned short *) p)[ 1 ];
    }
    if ( ( len & 1 ) != 0 ) {
      rdx = ( rdx << 8 ) | ((unsigned char *) p)[ 0 ];
    }
    rcx ^= rdx;
  }
  mul128( rax, rdx, r8 );
  rcx = ( rcx * r9 ) + rdx;
  rax ^= rcx;
  mul128( rax, rdx, r8 );
  rcx = ( rcx * r9 ) + rdx;
  rax ^= rcx;

  return ( rax >> 32 ) ^ rax;
}


sfuerst said...
Interesting benchmark results. It looks like the bswap-based hash is quite a bit faster than "newhash".

However, the comparison between C and asm isn't quite right. GCC is smart, and will inline the C version of the hash - removing some of the size-checks in the benchmark. To prevent that, you'll need to add __attribute__((noclone, noinline)) to get the speed that a user with a non-compiler-optimizable buffer length will see.
Chris Anderson said...
Yeah, newhash is a derivative of Bob Jenkins' newhash64(): http://www.burtleburtle.net/bob/c/lookup8.c

I counted the instructions in the newhash inner loop vs bswap, bswap: 11 / 16 bytes = .6875 inst per byte, newhash64: 73 / 24 bytes = 3.042. With that, bswap should be 4.42 times as fast, and it is fairly close to the actual multiple: 4.29.
sfuerst said...
Yeah, this article isn't called "Creating a Fast Hash Function" for nothing. ;-)

Although, I have to admit, the final few functions do cheat a bit. There is an increased correlation between bits 23 bytes apart. This is a bit larger than the 8 bytes tested by the correlation checking tests, so the flaw isn't picked up. Fortunately, for most uses, the statistical weakness can be ignored.
Timo Ewalds said...
Very interesting article!

I noticed your very first hash has a bug. It never moves the key pointer forward, so you simply add the first byte to itself len times, which is an even worse hashing function than you intended.

How do your hash functions compare to MurmurHash3? https://code.google.com/p/smhasher/
sfuerst said...
Haha, thanks for noticing that. Yes, it was definitely a really bad hash. :-)

I've just updated the article with the slightly more sensible version that updates the key buffer pointer.

I haven't really tested the newer versions of MurmurHash. You are right, it'll probably be quite interesting to see how they compare with those here.
sfuerst said...
It looks like there was a bug in the non-aligned version of hash_sse. The final mixing was done with a padd instead of a pxor. The version in the article has been fixed. Thanks go out to Samy Bahra for spotting this.

I've also just done some benchmarking with Murmurhash3. It takes 1.5s to run the benchmark on this machine. hash_sse() is about twice as fast, which is pretty amazing...
Georgi 'Sanmayce' said...
Well-said:
"Lockless Inc. seeks to make the highest performance software available. The goal is to optimize for modern machines, and take advantage of the changes in technology that have occurred in the past few years. We seek to extract the power of multiple cores, vectorization, and 64 bit technology to its fullest."

Hi to all,

a few months ago a new superfast hash function was revealed (by chance) based on FNV1A named FNV1A_Jesteress, I should like you to test it along with your implementations and if it is possible with FNV1A_Mantis. It screams on new architectures like Core i3+, for reference: http://www.strchr.com/hash_functions

It would be very useful to use your experience in sake of more clarity on this subject, don't you think?

Regards
sfuerst said...
I've now timed your FNV variants. It takes 0.833s for FNV1A_Jesteress to complete the benchmark, and 0.875s for FNV1A_Mantis. Note that these hashes output 32 bits rather than 64, so it is a little unfair to compare them to the other hashes here which do twice the work...

This machine is a rather old AMD Opteron, so may not have quite the same hash performance characteristics as machines based on Intel's Core architecture.
Georgi 'Sanmayce' said...
Thanks sfuerst,
it is interesting, sadly I have no access to AMDs (even Opteron).
Looking at code (in my view very complicated) presented on your very informative web-page I suggest (and would be very glad) you or someone else to rewrite my FNV variants, to make them 64bit, it is obvious they are with great potential, or maybe you are not so much impressed as I am?

Thanks again and good luck.
sfuerst said...
If you expand your FNV1A variants to 64 bit, then you get something that looks very much like the hash_rol function on this page. Using bswap instead of the rol instruction seems to be better due to the increased amount of downwards diffusion.
Georgi 'Sanmayce' said...
Hi again sfuerst,

I was lucky to speed up even more FNV1A_Jesteress, it resulted in appearance of FNV1A_Yorikke at:
http://www.sanmayce.com/Fastest_Hash/index.html#KT_torture

It is copyleft i.e. free.
I would enjoy some testing from your side.

Regards
Jesse Towner said...
For anyone interested, I've implemented an optimized version of the final SSE2-enabled hash function for 32-bit x86 targets.

You can find it here (for both GAS and MASM, respectively):
https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/gas/i686/memhash.s
https://github.com/upcaste/upcaste/blob/master/src/upcore/src/cstring/msvc/i686/memhash.asm

The challenging part was performing the 128-bit multiplication and mix towards the end of the function, especially given the reduced number of available registers. My original version just used the x86 ALU instructions, and used some stack space for some temporaries.

However, I realized I could keep everything in the SSE registers, batch some of my multiplies together using the PMULUDQ instruction, and then move only what was needed to the 32-bit x86 registers to compute the final 128-bit result via the ADC instruction. This final version doesn't use any stack space for temporaries, is a bit faster as a result, and computes hash values exactly identical to the x86-64 version.

I tested it on both a Core i7 860 and a Core i7 2720QM, and against a few other hash functions compiled in 32-bit mode with Clang, GCC, and MSVC.

Against its x86-64 version it's nearly the same in terms of performance. This isn't surprising, considering that the main loops are identical between them. It only really takes a hit for small keys due to the additional logic needed to handle the 128-bit multiplies.

Of note when compared to Georgi's FNV1A variants, it's nearly as fast as them on small keys below 16-bytes (within 5%), and out-performs them on larger keys greater than 1024-2048 bytes (beats FNV1A-Yorrike by as much as 25%).

Given that Steven's hash function generates 64-bit keys and has much better properties, I'd say it's a real winner. It passes all of his tests that he described in this page. Plus being able to use the exact same hash function in both 32-bit and 64-bits modes, and not having to worry about performance, conjures certain benefits. Furthermore, in my english language dictionary test, it had zero collisions, even when the upper 32-bit were discared and only the lower 32-bits of the result were considered. The higher-dimensionality of the intermediate values used while the hash is accumulated definitely helps here.

I might try and port it to 32-bit ARM using NEON instructions at some point in the future. A 64-bit ARM port w/ NEON should be pretty simple to do too.
Georgi 'Sanmayce' said...
Hi Mr.Towner,
it seems you have done it. I like your style.

Assembler is beauty, like stars in the sky.
You see that fine etudes behave sometimes surprisingly not as expected, I mean even a single/simple reordering of instructions leads to butterfly effects on different machines.
Sadly I am not in the assembly (no time for that fun), so it makes me joyous to see such interesting and useful works as yours.
By the way, what is the name of your hasher?
You wrote 'FNV1A-Yorrike' but please don't, it is FNV1A_Yorikke, the name is important.
In fact FNV1A_Yorikke still is my personal favoritess, yet in my tools I use FNV1A_Jesteress simply because my string (x-gram) sizes are in range 1..126 bytes - in my tests she outjuggled the rest hasher(esse)s.
I looked briefly at https://github.com/upcaste/upcaste/blob/master/src/upcore/test/cstring/cstring.hash.cpp and I got frightened how clean your style is and how chaotic mine (after all I am not a programmer) and please if it is not against your policy test FNV1A_Yorikke in your avalanche test, my primary interest lie in short English text strings, saying that your English language dictionary test is perfect interests me a lot how FNV1A_Yorikke fares there as well.

FNV1A_YoshimitsuTRIAD's main loop (code generated by Intel 12.1, /Ox used) is only (291a-28df+2)=61bytes:
[CODE]
;;; for(; wrdlen >= 3*2*sizeof(uint32_t); wrdlen -= 3*2*sizeof(uint32_t), p += 3*2*sizeof(uint32_t)) {

  028d7 83 fa 18 cmp edx, 24
  028da 72 43 jb .B4.5 ; Prob 10%
                                ; LOE eax edx ecx ebx ebp esi edi
.B4.2: ; Preds .B4.1
  028dc 89 34 24 mov DWORD PTR [esp], esi ;
                                ; LOE eax edx ecx ebx ebp edi
.B4.3: ; Preds .B4.2 .B4.3

;;; hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;

  028df 8b 31 mov esi, DWORD PTR [ecx]
  028e1 83 c2 e8 add edx, -24
  028e4 c1 c6 05 rol esi, 5
  028e7 33 71 04 xor esi, DWORD PTR [4+ecx]
  028ea 33 de xor ebx, esi

;;; hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+12))) * PRIME;

  028ec 8b 71 08 mov esi, DWORD PTR [8+ecx]
  028ef c1 c6 05 rol esi, 5
  028f2 33 71 0c xor esi, DWORD PTR [12+ecx]
  028f5 33 fe xor edi, esi

;;; hash32C = (hash32C ^ (_rotl(*(uint32_t *)(p+16),5) ^ *(uint32_t *)(p+20))) * PRIME;

  028f7 8b 71 10 mov esi, DWORD PTR [16+ecx]
  028fa c1 c6 05 rol esi, 5
  028fd 33 71 14 xor esi, DWORD PTR [20+ecx]
  02900 83 c1 18 add ecx, 24
  02903 33 ee xor ebp, esi
  02905 69 db e7 d3 0a
        00 imul ebx, ebx, 709607
  0290b 69 ff e7 d3 0a
        00 imul edi, edi, 709607
  02911 69 ed e7 d3 0a
        00 imul ebp, ebp, 709607
  02917 83 fa 18 cmp edx, 24
  0291a 73 c3 jae .B4.3 ; Prob 82%
[/CODE]

As you can see from the results below on non-overclocked Mr.Eiht's Rig (i7 3930K 3.2GHz 8x4GB @1337MHz) FNV1A_YoshimitsuTRIAD is faster than FNV1A_Yorikke, but my strings are short and I need 24..30 bit hash table lookups so this 10+% gain is like driving a dragster instead of electric car in non-aligned short city streets:

BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 6101 clocks or 10.742MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 7307 clocks or 8.969MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 7958 clocks or 8.235MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 44693 clocks or 1.466MB per clock

BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 5781 clocks or 14.171MB per clock
FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 7499 clocks or 10.924MB per clock
FNV1A_Yorikke: (10MB block); 81920MB hashed in 8474 clocks or 9.667MB per clock
CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 54796 clocks or 1.495MB per clock

BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 2517 clocks or 16.273MB per clock
FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 3402 clocks or 12.040MB per clock
FNV1A_Yorikke: (5MB block); 40960MB hashed in 3941 clocks or 10.393MB per clock
CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 27029 clocks or 1.515MB per clock

BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 3806 clocks or 17.219MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 5374 clocks or 12.195MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 6246 clocks or 10.492MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 43143 clocks or 1.519MB per clock

BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 3109 clocks or 21.079MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 5195 clocks or 12.615MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 6108 clocks or 10.730MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 42873 clocks or 1.529MB per clock

BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 3081 clocks or 21.271MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 5206 clocks or 12.589MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 6068 clocks or 10.800MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 42888 clocks or 1.528MB per clock

For blocks 16KB..5MB the gain is 16%..15%, but they are cached, when the bigger part is outwith the CPU caches the gain is only (8.969-8.235)/8.235*100% = 9%, and this is linear speed which is not what I need.

No point here to be made, just sharing some thoughts.
Georgi 'Sanmayce' said...
Hi, just wanted to share my latest best:
http://www.sanmayce.com/Fastest_Hash/index.html#Yoshimura

AFAIK, for the first time the nifty technique which I call INTERLEAVING came into the play drawing/fetching from two not-so-adjacent memory positions. This approach has one more nifty speed-wise feature reducing the branching. My tests show ULTRA fast behavior:

E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>hash Sentence-list_00,032,359_English_The_Holy_Bible.txt
32359 lines read
65536 elements in the table (16 bits)
Jesteress: 20075 [ 6883]
Meiyan: 20195 [ 6897]
Yorikke: 19221 [ 6872]
Yoshimura: 17738 [ 6903] !!! SIGNIFICANTLY fastEST !!!
Yoshimitsu: 19699 [ 6937]
YoshimitsuTRIAD: 19491 [ 6843]
FNV-1a: 47179 [ 6840]
Larson: 48009 [ 6889]
CRC-32: 42940 [ 6891]
Murmur2: 25825 [ 6786]
Murmur3: 25753 [ 6850]
XXHfast32: 18481 [ 6859]
XXHstrong32: 20691 [ 6887]

E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r1>

It is worth the attempt to explore the Jesteress-Yorikke (i.e. 1 hash line 4+4 vs 2 hash lines 4+4) 8-16 GAP in order to lessen their collisions further more.
Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.
The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.
The idea #1 is to exterminate all nasty IFs outwith the main loop, I believe such branchless etude will outperform Jesteress.
The idea #2 is to STRESS memory by fetching not-so-adjacent areas.

For example:
Key: hash_with_overlapping_aye_aye
Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'
Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'
Key_Length: 29
Loop_Counter: 3 //if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);

Loop #1 of 3:
Hash line 1: hash
Hash line 2: h_ov
Hash line 3: ping
Loop #2 of 3:
Hash line 1: _wit
Hash line 2: erla
Hash line 3: _aye
Loop #3 of 3:
Hash line 1: h_ov
Hash line 2: ppin
Hash line 3: _aye

I don't know the internals, whether lines are 32/64/128 bytes long is a secondary concern.
Well, the key is too short, in reality the above key may span only 1|2 cache lines, if the key is longer than 4 cache lines (assuming 32) e.g. 128+2 bytes then it may span 5|6 lines.
My dummy/clueless premise is that is possible (in future systems) to access effectively RAM in such manner.
Does someone know whether such type of accessing has any practical value in nowadays CPUs?
Of course, it is worth trying to "interleave" in that way all the short string hashers, yes?

I believe Yoshimura is the best balance between speed & dispersion, in one hand, and between LATENCY & BANDWIDTH, in the whole INTERNET.
Georgi 'Sanmayce' said...
With help from Fantasy, Przemyslaw Skibinski, Maciej Adamczyk (m^2) and Peter Kankowski I have shown that FNV1A_Yoshimura is faster and better than FNV1A_Yorikke, I think:
http://www.overclock.net/t/1319572/benchmarking-the-fastest-hash-function/20_20#post_19531418

Feel free to use it in your tasks ... freely.
Georgi 'Sanmayce' said...
I am happy to share my latest fastest (with bigger latency though)FNV1A_YoshimitsuTRIADii 2x12 bytes interleaved&interlaced hasher.

Excuse me for this big dump, but I wanted some PRECISE benchmarking to come into the play along with so many functions/tests.

Below, the results after running 32bit code by Intel 12.1 compiler (/Ox used):
Linear speed on Sanmayce's 'Bonboniera' laptop (Core 2 T7500, 2200MHz, RAM bus: 666MHz Duad Channel):

// 'BURST_Read_4XMM128bit' Main Loop:

.B1.182:
  009d0 8b f1 mov esi, ecx
  009d2 41 inc ecx
  009d3 c1 e6 06 shl esi, 6
  009d6 81 f9 00 00 80
        00 cmp ecx, 8388608
  009dc 66 0f 6f 04 32 movdqa xmm0, XMMWORD PTR [edx+esi]
  009e1 66 0f 7f 05 00
        00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
  009e9 66 0f 6f 4c 32
        10 movdqa xmm1, XMMWORD PTR [16+edx+esi]
  009ef 66 0f 7f 0d 00
        00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
  009f7 66 0f 6f 54 32
        20 movdqa xmm2, XMMWORD PTR [32+edx+esi]
  009fd 66 0f 7f 15 00
        00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
  00a05 66 0f 6f 5c 32
        30 movdqa xmm3, XMMWORD PTR [48+edx+esi]
  00a0b 66 0f 7f 1d 00
        00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
  00a13 72 bb jb .B1.182

// 'BURST_Read_8XMM128bit' Main Loop:

.B1.194:
  00b00 8b d9 mov ebx, ecx
  00b02 41 inc ecx
  00b03 c1 e3 07 shl ebx, 7
  00b06 81 f9 00 00 40
        00 cmp ecx, 4194304
  00b0c 66 0f 6f 04 1a movdqa xmm0, XMMWORD PTR [edx+ebx]
  00b11 66 0f 7f 05 00
        00 00 00 movdqa XMMWORD PTR [_xmm0], xmm0
;;; xmm1 = xmmload(pointerflush+(KT+1)*16);
  00b19 66 0f 6f 4c 1a
        10 movdqa xmm1, XMMWORD PTR [16+edx+ebx]
  00b1f 66 0f 7f 0d 00
        00 00 00 movdqa XMMWORD PTR [_xmm1], xmm1
;;; xmm2 = xmmload(pointerflush+(KT+2)*16);
  00b27 66 0f 6f 54 1a
        20 movdqa xmm2, XMMWORD PTR [32+edx+ebx]
  00b2d 66 0f 7f 15 00
        00 00 00 movdqa XMMWORD PTR [_xmm2], xmm2
;;; xmm3 = xmmload(pointerflush+(KT+3)*16);
  00b35 66 0f 6f 5c 1a
        30 movdqa xmm3, XMMWORD PTR [48+edx+ebx]
  00b3b 66 0f 7f 1d 00
        00 00 00 movdqa XMMWORD PTR [_xmm3], xmm3
;;; xmm4 = xmmload(pointerflush+(KT+4)*16);
  00b43 66 0f 6f 64 1a
        40 movdqa xmm4, XMMWORD PTR [64+edx+ebx]
  00b49 66 0f 7f 25 00
        00 00 00 movdqa XMMWORD PTR [_xmm4], xmm4
;;; xmm5 = xmmload(pointerflush+(KT+5)*16);
  00b51 66 0f 6f 6c 1a
        50 movdqa xmm5, XMMWORD PTR [80+edx+ebx]
  00b57 66 0f 7f 2d 00
        00 00 00 movdqa XMMWORD PTR [_xmm5], xmm5
;;; xmm6 = xmmload(pointerflush+(KT+6)*16);
  00b5f 66 0f 6f 74 1a
        60 movdqa xmm6, XMMWORD PTR [96+edx+ebx]
  00b65 66 0f 7f 35 00
        00 00 00 movdqa XMMWORD PTR [_xmm6], xmm6
;;; xmm7 = xmmload(pointerflush+(KT+7)*16);
  00b6d 66 0f 6f 7c 1a
        70 movdqa xmm7, XMMWORD PTR [112+edx+ebx]
  00b73 66 0f 7f 3d 00
        00 00 00 movdqa XMMWORD PTR [_xmm7], xmm7
  00b7b 72 83 jb .B1.194

// 'BURST_Read_4DWORDS' Main Loop:

.B2.3:
  02e4c 83 c3 f0 add ebx, -16
;;; hash32 = *(uint32_t *)(p+0);
  02e4f 8b 07 mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4);
  02e51 8b 77 04 mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8);
  02e54 8b 57 08 mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12);
  02e57 8b 4f 0c mov ecx, DWORD PTR [12+edi]
  02e5a 83 c7 10 add edi, 16
  02e5d 83 fb 10 cmp ebx, 16
  02e60 73 ea jae .B2.3

// 'BURST_Read_8DWORDSi' Main Loop:

.B3.3:
;;; for(; Loop_Counter; Loop_Counter--, p += 4*sizeof(uint32_t)) {
;;; hash32 = *(uint32_t *)(p+0) ^ *(uint32_t *)(p+0+Second_Line_Offset);
  02ebc 8b 07 mov eax, DWORD PTR [edi]
;;; hash32B = *(uint32_t *)(p+4) ^ *(uint32_t *)(p+4+Second_Line_Offset);
  02ebe 8b 77 04 mov esi, DWORD PTR [4+edi]
;;; hash32C = *(uint32_t *)(p+8) ^ *(uint32_t *)(p+8+Second_Line_Offset);
  02ec1 8b 57 08 mov edx, DWORD PTR [8+edi]
;;; hash32D = *(uint32_t *)(p+12) ^ *(uint32_t *)(p+12+Second_Line_Offset);
  02ec4 8b 4f 0c mov ecx, DWORD PTR [12+edi]
  02ec7 33 04 1f xor eax, DWORD PTR [edi+ebx]
  02eca 33 74 1f 04 xor esi, DWORD PTR [4+edi+ebx]
  02ece 33 54 1f 08 xor edx, DWORD PTR [8+edi+ebx]
  02ed2 33 4c 1f 0c xor ecx, DWORD PTR [12+edi+ebx]
  02ed6 83 c7 10 add edi, 16
  02ed9 4d dec ebp
  02eda 75 e0 jne .B3.3

// 'FNV1A_YoshimitsuTRIADii' Main Loop:

.B4.4:
;;; hash32 = (hash32 ^ (_rotl_KAZE(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;
  02f4c 8b 2b mov ebp, DWORD PTR [ebx]
  02f4e c1 c5 05 rol ebp, 5
  02f51 33 2c 33 xor ebp, DWORD PTR [ebx+esi]
  02f54 33 fd xor edi, ebp
;;; hash32B = (hash32B ^ (_rotl_KAZE(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;
  02f56 8b 6c 33 04 mov ebp, DWORD PTR [4+ebx+esi]
  02f5a c1 c5 05 rol ebp, 5
  02f5d 33 6b 04 xor ebp, DWORD PTR [4+ebx]
  02f60 33 cd xor ecx, ebp
;;; hash32C = (hash32C ^ (_rotl_KAZE(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+8+Second_Line_Offset))) * PRIME;
  02f62 8b 6b 08 mov ebp, DWORD PTR [8+ebx]
  02f65 c1 c5 05 rol ebp, 5
  02f68 33 6c 33 08 xor ebp, DWORD PTR [8+ebx+esi]
  02f6c 83 c3 0c add ebx, 12
  02f6f 33 c5 xor eax, ebp
  02f71 69 ff e7 d3 0a
        00 imul edi, edi, 709607
  02f77 69 c9 e7 d3 0a
        00 imul ecx, ecx, 709607
  02f7d 69 c0 e7 d3 0a
        00 imul eax, eax, 709607
  02f83 4a dec edx
  02f84 75 c6 jne .B4.4

; mark_description "Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.1.258 Build 20111011";
; mark_description "-Ox -TcHASH_linearspeed_FURY.c -FaHASH_linearspeed_FURY_Intel_IA-32_12 -FAcs";

And the full dump:

Copying a 256MB block 1024 times i.e. 256GB READ + 256GB WRITTEN ...
memcpy(): (256MB block); 262144MB copied in 141321 clocks or 1.855MB per clock

Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (512MB block); 524288MB fetched in 107812 clocks or 4.863MB per clock

Fetching a 512MB block 1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (512MB block); 524288MB fetched in 113162 clocks or 4.633MB per clock

Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (64MB block); 524288MB fetched in 107765 clocks or 4.865MB per clock

Fetching a 64MB block 8*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (64MB block); 524288MB fetched in 113085 clocks or 4.636MB per clock

Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (128KB block); 524288MB fetched in 37580 clocks or 13.951MB per clock

Fetching a 128KB block 4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (128KB block); 524288MB fetched in 36380 clocks or 14.411MB per clock

Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_4XMM128bit: (16KB block); 524288MB fetched in 20436 clocks or 25.655MB per clock

Fetching a 16KB block 8*4*1024*1024 times i.e. 512GB ...
BURST_Read_8XMM128bit: (16KB block); 524288MB fetched in 17456 clocks or 30.035MB per clock

Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15148 clocks or 4.326MB per clock
BURST_Read_8DWORDSi: (64MB block); 65536MB fetched in 14087 clocks or 4.652MB per clock
FNV1A_YoshimitsuTRIADii: (64MB block); 65536MB hashed in 14539 clocks or 4.508MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15912 clocks or 4.119MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16427 clocks or 3.990MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14695 clocks or 4.460MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71557 clocks or 0.916MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9532 clocks or 6.875MB per clock
BURST_Read_8DWORDSi: (2MB block); 65536MB fetched in 8907 clocks or 7.358MB per clock
FNV1A_YoshimitsuTRIADii: (2MB block); 65536MB hashed in 9407 clocks or 6.967MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10187 clocks or 6.433MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 9984 clocks or 6.564MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69763 clocks or 0.939MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9220 clocks or 7.108MB per clock
BURST_Read_8DWORDSi: (128KB block); 65536MB fetched in 10577 clocks or 6.196MB per clock
FNV1A_YoshimitsuTRIADii: (128KB block); 65536MB hashed in 10686 clocks or 6.133MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9656 clocks or 6.787MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10094 clocks or 6.493MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 11278 clocks or 5.811MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 69795 clocks or 0.939MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_8DWORDSi: (16KB block); 65536MB fetched in 7909 clocks or 8.286MB per clock
FNV1A_YoshimitsuTRIADii: (16KB block); 65536MB hashed in 8923 clocks or 7.345MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 8955 clocks or 7.318MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9719 clocks or 6.743MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9734 clocks or 6.733MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69732 clocks or 0.940MB per clock

Mr.Towner what do you think, are you impressed of 'YoshimitsuTRIADii' as I am?

This is my last post, I hope other people to use above info/techniques to improve their functions.

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.