Lockless Inc

Creating a Cryptographic Hash Function

A cryptographic hash function is something that mechanically takes an arbitrary amount of input and produces an "unpredictable" output of a fixed size. The unpredictability isn't in the operation itself. Obviously, due to its mechanical nature, every time a given input is used the same output will result. Rather, it means that you cannot easily do any of the following: craft an input to produce a wanted output, or similarly guess what input produced a given output; create two different inputs that give the same output hash; create a suffix or prefix to a given input to create a wanted hash; or, in different terms, distinguish the output from an abstract "Random Oracle".

This differs from non-cryptographic hash functions used for high-speed hash-tables or other data structures. Those are designed to produce high enough quality statistical randomness to fulfil their purposes. However, they aren't designed to defeat a determined adversary who can craft inputs and outputs. Such an adversary might be willing to spend a large amount of computer time or memory to find a matching hash value. If the hash algorithm isn't secure, then the adversary can perform their attack faster than on an ideal hash function.

An ideal hash function produces an output of n bits. This means that the output can have up to 2^n states. Thus, if an adversary wants to find a match to a given value, around 2^n different inputs will need to be tried to get a match. However, if the matching value isn't given, the Birthday Paradox means that only around the square root of this value is needed (2^(n/2)). This is because we can check all possible pairs for a match. The number of such pairs increases as the square of the number of outputs.
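
The square-root saving can be seen on a deliberately tiny example. The sketch below is illustrative only: toy_hash is a made-up, non-cryptographic mixer (its constants are borrowed from the MurmurHash3 finalizer) truncated to a handful of output bits, and birthday_search is a hypothetical helper that remembers every output until two inputs collide.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy 64-bit mixer (NOT cryptographic), truncated to `bits` output bits.
// bits must be in [1, 63].
static std::uint64_t toy_hash(std::uint64_t x, unsigned bits)
{
	x ^= x >> 33;
	x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return x & ((1ULL << bits) - 1);
}

struct collision
{
	std::uint64_t a, b;   // two distinct inputs with equal truncated hashes
	std::uint64_t tries;  // number of inputs hashed before the match appeared
};

// Hash 0, 1, 2, ... and store every output until some pair collides.
static collision birthday_search(unsigned bits)
{
	std::unordered_map<std::uint64_t, std::uint64_t> seen;
	for (std::uint64_t x = 0;; x++)
	{
		std::uint64_t h = toy_hash(x, bits);
		auto it = seen.find(h);
		if (it != seen.end()) return collision{it->second, x, x + 1};
		seen[h] = x;
	}
}
```

With a 16-bit output the pigeon-hole principle guarantees a match within 2^16 + 1 inputs, but the birthday argument suggests it should typically show up after a number of inputs on the order of 2^8. Matching one specific output would instead need on the order of 2^16 tries.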

If Quantum Computers scale upwards enough (as of this writing, they can factor 15=3×5) then Grover's algorithm becomes applicable. It allows a much faster search of hash outputs, basically acting as if the number of output bits had been halved. Fortunately, this attack is easy to defeat: just double the number of output bits. This threat is the reason why 512 bit hashes are starting to become popular. Testing 2^512 states is obviously impossible. However, finding a collision might only take the square root of the square root of this, i.e. 2^128 work. This is the security level of modern block ciphers.

So you've decided to create your own cryptographic hash function. The correct response is: Don't! There exist many well-designed and, more importantly, well-tested cryptographic hash functions out there. Use them instead. The chance of you being able to make one better is small, and the chance of you making something much worse is great. However, let's assume that you can't. Assume that you don't trust them for one reason or another. Many published functions have arbitrary constants. You may feel that the authors may (or may not) have tweaked the constants in some way to have a subtle weakness.

The Merkle-Damgård construction

Many modern cryptographic hash functions are based on the "Merkle-Damgård construction". This uses a compression function to take an input message block, and a previous hash to produce a new hash. The hashes are chained along as each message is added to produce an ending hash. Finally, some kind of terminating feature is used (via a padding format, or appending the total length, or both) so that a final hash can be created that is collision resistant.
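
The chaining can be sketched in a few lines. Everything here is an assumption made for illustration: toy_compress is a hypothetical 64-bit stand-in with no collision resistance at all, the initial value is arbitrary, and the padding (a 0x80 byte, zeros up to the block boundary, then a final block holding the bit length) is just one common choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy compression function: 64-bit chaining value, 64-bit block.
// It only stands in for a real, collision-resistant one.
static std::uint64_t toy_compress(std::uint64_t h, std::uint64_t block)
{
	h ^= block;
	h *= 0x9e3779b97f4a7c15ULL;
	h ^= h >> 29;
	return h;
}

static std::uint64_t md_hash(const unsigned char *msg, std::size_t len)
{
	const std::size_t block_size = 8;

	// Terminate the message: 0x80, then zeros up to a block boundary
	std::vector<unsigned char> buf(msg, msg + len);
	buf.push_back(0x80);
	while (buf.size() % block_size) buf.push_back(0);

	// Chain the blocks through the compression function
	std::uint64_t h = 0x0123456789abcdefULL;   // arbitrary initial value
	for (std::size_t i = 0; i < buf.size(); i += block_size)
	{
		std::uint64_t block;
		std::memcpy(&block, &buf[i], block_size);
		h = toy_compress(h, block);
	}

	// Final block: the total message length in bits
	return toy_compress(h, std::uint64_t(len) * 8);
}
```

Each call to toy_compress takes the previous chaining value and one message block, which is the whole of the construction. The security rests entirely on the compression function and the termination rule.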

It can be proved that if the compression function is collision resistant, and if the correct type of termination is done, then the result is also collision resistant. However, there are other properties of this construction that are very unlike an ideal hash function. For example, an attacker given just a hash can predict what a specially crafted suffix block appended to the unknown message will convert the hash into. This weakness is why you cannot use many of the SHA variants directly as a MAC by hashing a secret key followed by the message. (However, with a bit of care, you can hash twice to make an HMAC from them.)

Similarly, since it is trivial to lengthen a hash-chain, once one collision is known, it is easy to create many collisions. Just append blocks to the chain using the same message blocks. If the (same length) prefix hashes to the same value, the suffix will perform the same operations, and produce identical results.

Another serious weakness of the M-D construction is the multi-collision attack. It takes 2^(n/2) work to find one collision. However, finding many more related collisions doesn't take much more work. Basically, you can find a block (or sequence of blocks) that, given a hash input, will hash to the same hash output but with differing message inputs. It takes the typical birthday-attack time to do this, and you have two messages that hash to the same value. Now, look at some other blocks on the M-D chain. Apply the same attack to those blocks. After a similar amount of work, you now have 2×2=4 messages that hash to the same value. Repeat. Each time you get a new pair of collisions, you double the number of messages that hash to the wanted result, as the available combinations increase geometrically.

This attack means that if you use a pair of M-D based hashes to protect something, it isn't much better than using one. The multi-collision attack takes roughly (n/2) × 2^(n/2) time to find enough combinations with the first hash to run a 2^(n/2)-time birthday attack on the second hash. This is only moderately longer than attacking just one hash. If you had used a double-length hash, it would have been roughly 2^(n/2 - log(n)) times stronger.

Fortunately, it is relatively simple to fix these problems. Just double the amount of internal state (a wide-pipe design). This doubling means that the internal birthday-attack now takes of order 2^n time. Since this is the same as finding a specific collision, no advantage is now gained. Unfortunately, the doubling makes the hash function construction less efficient. Related to this is the ability to simply not output some of the final bits. The SHA-224 and SHA-384 functions are based on the SHA-256 and SHA-512 functions respectively, with some output bits elided. The missing bits mean that extension attacks are impossible. It also weakens the multi-collision attack a bit. Applying this tactic fully, if you used just half the output of SHA-512, it would be much stronger than SHA-256.

Keccak

The recent winner of the competition to create SHA-3 is the Keccak algorithm. This doesn't use a Merkle-Damgård construction, but uses a "sponge" instead. A sponge soaks up message bits slowly, and then squeezes output bits slowly. In other words, it is a generalization of the wide-pipe M-D method, with the pipe being arbitrarily large.

A sponge is based on a mixing function instead of a compression function. The new message bits can either replace old sponge bits, or be xored with them. Then a mixing function is used to distribute the information contained in those bits throughout the sponge. Once enough mixing is done, further message bits can be added. Provided the number of message bits per block is less than or equal to half the total bits in the sponge, entropy considerations mean that the soaking-up phase is secure. In that extreme, it is identical to an M-D wide-pipe design.

The final phase of a sponge is squeezing out the output bits. This is done in the reverse way in which information was added. No more than half the bits in the sponge can be removed at any one time for it to remain secure. After each removal, a secure mixing should follow. Finally, no more bits should be removed in total than there are in the sponge. If more are used, then by the pigeon-hole principle it is easy to show that the Birthday Paradox attack will work with an increased efficiency.
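
The soak-then-squeeze pattern can be sketched with a tiny 128-bit sponge. This is not Keccak: toy_sponge, its mix permutation and sponge_hash are hypothetical names invented here, message padding is left out, and the mixing is far too weak for real use. Half of the state (rate) takes in message bits and gives out output bits; the other half (capacity) is never exposed directly.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct toy_sponge
{
	std::uint64_t rate = 0;      // half of the state that touches message and output bits
	std::uint64_t capacity = 0;  // half of the state that is never exposed directly

	// Hypothetical stand-in mixing function. Every step is reversible,
	// so the whole thing is a permutation of the 128-bit state.
	void mix()
	{
		for (std::uint64_t r = 0; r < 8; r++)
		{
			rate += (capacity ^ r) * 0x9e3779b97f4a7c15ULL;
			rate = (rate << 23) | (rate >> 41);
			capacity += (rate ^ r) * 0xbf58476d1ce4e5b9ULL;
			capacity = (capacity << 29) | (capacity >> 35);
		}
	}

	// Soak up one 64-bit block (half the state), then mix
	void absorb(std::uint64_t block) { rate ^= block; mix(); }

	// Squeeze out one 64-bit block (half the state), then mix
	std::uint64_t squeeze() { std::uint64_t out = rate; mix(); return out; }
};

// out_words should be at most 2, so that no more bits are
// removed in total than the sponge holds.
static std::vector<std::uint64_t> sponge_hash(const std::vector<std::uint64_t> &blocks,
                                              std::size_t out_words)
{
	toy_sponge s;
	for (std::size_t i = 0; i < blocks.size(); i++) s.absorb(blocks[i]);

	std::vector<std::uint64_t> out;
	for (std::size_t i = 0; i < out_words; i++) out.push_back(s.squeeze());
	return out;
}
```

Because mix is a permutation, absorbing two different blocks into the same starting state always leaves two different states behind, even if the visible rate halves happened to agree.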

Keccak is designed to be very fast in hardware. Most of the mixing is done with hardware-friendly choices such as parity operations. The nonlinearity is due to a simple bitwise operation. (Nothing complex like integer addition is used.) Many rounds are used so that the small amount of nonlinearity snowballs into enough complexity so that the mix function is "random" enough.

Compression and Mixing Functions

The compression or mixing function is the core to many hash function designs. It takes m message bits, h hash bits, and then acts like a black box that produces h new hash bits. The compression function needs to diffuse and confuse the effects of the message bits into all of the h output bits. A secure compression function acts like a keyed hash function that takes only a single fixed input block size. It is expected to have all the collision resistances that such a hash function would need.

The difficult task is coming up with a good compression function. One way to do that is to use some other well known cryptographic primitive. An example is a block cipher. One way is to use the h input bits as a key, and then "encrypt" the message block with it. This isn't quite secure yet. You need to xor the result with either the key, the message, or both. Other possibilities exist, but the above isn't used very often in practice. The reason is that block ciphers tend to be too narrow to be of much use. Hash functions need to have an output bit size that is double that of a typical block cipher.

What we will do instead is use a Feistel structure to construct a secure compression function from smaller components. This takes a pair of inputs, a left and a right half. The left half is then passed into a randomizing function, and the result is xored onto the right half. Both halves are then swapped. This is a reversible operation, so is information preserving. By repeating this set of operations (called a "round") multiple times, we can reversibly mix the information in a highly secure way between the left and right halves. To convert this permutation into a compression function, we can then drop the last right half, and keep the left part.
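
As a minimal sketch of that structure, here is a Feistel ladder on 32-bit halves. The round function toy_f is a hypothetical placeholder (it does not need to be invertible), and the names halves, feistel_forward, feistel_backward and feistel_compress are made up for this example.

```cpp
#include <cstdint>

struct halves
{
	std::uint32_t left, right;
};

// Hypothetical randomizing function; it need not be reversible itself
static std::uint32_t toy_f(std::uint32_t x, int round)
{
	x += 0x9e3779b9u * std::uint32_t(round + 1);
	x *= 0x85ebca6bu;
	x ^= x >> 15;
	return x;
}

// Each round: xor f(left) onto the right half, then swap the halves
static halves feistel_forward(halves v, int rounds)
{
	for (int r = 0; r < rounds; r++)
	{
		std::uint32_t mixed = v.right ^ toy_f(v.left, r);
		v.right = v.left;
		v.left = mixed;
	}
	return v;
}

// Undo the rounds in reverse order, which shows no information is lost
static halves feistel_backward(halves v, int rounds)
{
	for (int r = rounds - 1; r >= 0; r--)
	{
		std::uint32_t old_left = v.right;
		std::uint32_t old_right = v.left ^ toy_f(old_left, r);
		v.left = old_left;
		v.right = old_right;
	}
	return v;
}

// Compression: run the permutation, drop the right half, keep the left
static std::uint32_t feistel_compress(std::uint32_t h, std::uint32_t m, int rounds)
{
	return feistel_forward(halves{h, m}, rounds).left;
}
```

The backward function never needs to invert toy_f, only to recompute it. That is what lets an arbitrary randomizing function be turned into a reversible mixing step.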

The question becomes how many rounds are needed? Obviously, the better the job the randomizing function does, the fewer rounds we need. The converse is that if the randomizing function is linear, even an infinite number of rounds won't help. So, we need some sort of non-linear function, the characteristics of which determine the security.

Assume the randomizing function is amongst the class of perfectly secure ones, and each round uses a different member from that class. This is the ideal case, which we can only hope to approach. It turns out that if you do three rounds with this, the result isn't quite mixed well enough. This can be proved with differential analysis: Try two inputs into the Feistel Ladder. Let the left halves be identical, and the right halves differ by some delta. After one round, the left halves will act identically. This means that the inputs into the second round will differ by the delta in the left half, and zero on the right half. After two rounds, the right half (which is the old left half) will differ by the delta. The left half will be the "encryption" of two values differing by the delta.

So we would like all possible outputs from the third round to be equally likely. This unfortunately isn't the case. Look at the case where the left half difference is the delta, and right half difference is zero. What is the chance for this to happen in the above? Well... if the encryptions both encrypt to the same value we get that result. The chance for that is 1 in 2^(n/2). However, the wanted probability is 1 in 2^n. Thus we have found a weakness, and the three-round version isn't secure. Four rounds are needed, even with a perfectly random function.

If four rounds are the lower limit, how close can we approach it with a reasonable round function? We want to try the most random-looking things we can. A good choice might be a secure block cipher. What is the limit now? It turns out you can use the above differential analysis to show that five rounds aren't enough. (Lars Knudsen showed this with his analysis of DEAL.) Use the same inputs as above in the first round. Assume that at the end of the fifth round the left halves differ by zero, and the right halves by the delta. Work backwards. In the third round we get a contradiction. Basically, we find that E3(E2(x)) = E3(E2(x xor delta)), where the E functions correspond to those chosen at each round. Now these functions are from a block cipher. That means that they are permutations. Thus we end up with a combination of permutations with differing inputs that yields the same output. This is impossible. The resulting impossible differential can be used to attack the algorithm, and more rounds are needed.

Another attack is to look at the bitwise polynomial form of the resulting function as a black-box. Any function can be written as a polynomial of the input bits, with each coefficient being zero or one. If this polynomial has low order, then it is reversible in some sense, and the compression function will be weak. What we want is the polynomial to have terms of all possible orders. It turns out that linear operations don't touch the polynomial order. Thus this is just a check of testing exactly how nonlinear our function is. The higher the nonlinearity the better.

We just need to be able to work out the approximate effect of operations on the polynomial's order. A linear operation, like "xor", will return the maximum order of its two inputs. A non-linear simple operation, like "and", will return the sum of the orders of its inputs. (This is assuming that the inputs are uncorrelated.) An S-box will return an order that is equal to its inputs multiplied by half the log of the input bit-depth. The S-box acts like a binary polynomial function, and the binomial theorem can be used to show that this is the most common term. Other more complex operations can be built out of simpler ones.

So we want a nonlinear randomizing function to use in the Feistel Ladder. Block ciphers are a little slow. Besides, using one is like begging the question. It is possible to construct a block cipher from a hash, and a hash function from a block cipher. However, it is better for efficiency to use techniques optimized for each. Thus we will likely need more than the six rounds determined above. However, if each round is faster by enough, the resulting compression function will be quicker to evaluate even though it is built from a less secure foundation.

The question becomes how nonlinear to make each round? The more nonlinear the rounds are, the more expensive they are. I'd like to make the argument that using fewer, highly nonlinear rounds is better than using many slightly nonlinear rounds. The problem is that attackers on the cryptographic system will try to reduce the nonlinearity. A small reduction can be catastrophic because of the way nonlinearity compounds exponentially.

Assume we have two systems. The first uses 20 rounds, with a nonlinear factor of two for each round. The second system uses 8 rounds, with a nonlinear factor of five. The factor refers to how the underlying bitwise polynomial increases in degree. Each round will multiply the degree of the polynomial by this amount, assuming the inputs are diffused correctly, and there isn't too much cancellation. What we would like is the resulting number to be much larger than the maximal degree possible. If we have 512 bits, then we want 2^20 = 1048576 >> 512, and 5^8 = 390625 >> 512.

As can be seen, both systems are secure. However, what happens if a crafty adversary manages to reduce the effective multiplicative factor per round? (They can do this by choosing particular forms of inputs that cause coefficients to statistically cancel.) Let's say that they can reduce the effective factor by 0.5 per round. The results are then 1.5^20 ≈ 3325, and 4.5^8 ≈ 168151. Both are still secure, but the one with less nonlinearity is looking shaky, with hardly any margin left.
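
The compounding is easy to reproduce. The helper below is a hypothetical name introduced here; it just multiplies the per-round factor out so that the two systems can be compared before and after the adversary shaves 0.5 off each round.

```cpp
// Estimated polynomial degree after `rounds` rounds, when every round
// multiplies the degree by `factor` (ignoring the cap at the state size).
static double compounded_degree(double factor, int rounds)
{
	double degree = 1.0;
	for (int r = 0; r < rounds; r++) degree *= factor;
	return degree;
}
```

Here compounded_degree(2, 20) and compounded_degree(5, 8) give the 1048576 and 390625 quoted above, while compounded_degree(1.5, 20) and compounded_degree(4.5, 8) give roughly 3325 and 168151. Against a 512-bit target the first system's margin falls from about 2048 to about 6.5, while the second only falls from about 763 to about 328.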

We would like to make a highly non-linear round function. Using S-boxes has traditionally been the method to do this. However, S-boxes have their issues. Firstly, it is difficult to simultaneously make them have high entropy (needed for the high nonlinearity), and to have low entropy for their description. If your S-boxes seem arbitrary, then people may think you have hidden some sort of back door in them. AES gets around this by using a finite field to generate its S-boxes. This hasn't stopped people from feeling uneasy about their algebraic nature.

A second problem with S-boxes is that they assume that accessing an entry within them is O(1). This might seem true, but modern machines have multiple layers of cache. This means that subtle timing attacks might be possible due to the different speeds of access. (AES was famously attacked this way by Dan Bernstein.) This is one of the reasons why AES instructions have been added to recent Intel processors.

Finally, S-boxes are limited in size. They can't get too large, as the memory to describe them grows exponentially with their input size. This means that they suffer from the effects of the "law of small numbers". Small numbers have many many patterns within them, simply because there aren't many small numbers to begin with. Patterns are the enemy of someone trying to construct a cryptographic primitive. You need to be very careful and make sure someone can't use something like differential analysis on your S-boxes to weaken your algorithm. Basically, any statistical weakness in your S-boxes can be used as a foothold in an attack. Testing for any possible weakness is difficult.

A Cryptographic Hash Function

If we avoid using S-boxes, and would like to have high nonlinearity, the possibility of using multiply instructions appeals. Multiplication is the very definition of nonlinearity. Unfortunately, we can't really let both inputs to the multiply be controlled by the user, as that gives them too much control over the output. However, we can use a multiply as a super-powered add instruction by multiplying by a known constant.

A multiply sends information "upwards", so if it is the only data movement, then the lower bits will not be mixed enough. Basically, we can just look at the lowest bit (which will have only simple linear dependencies), and solve for it. Next, we can look at the next-to-lowest bit, and solve for that. Each solved bit gives enough information to solve the next. This means that we need to be able to send information "downwards" as well to have security.
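
A quick way to see the one-way flow of information is to check that the low bits of a product never depend on the high bits of the input. The helper below is a hypothetical illustration using ordinary 64-bit wrap-around multiplication.

```cpp
#include <cstdint>

// The low k bits of x * c (mod 2^64), for k in [1, 63]
static std::uint64_t low_bits_of_product(std::uint64_t x, std::uint64_t c, unsigned k)
{
	return (x * c) & ((1ULL << k) - 1);
}
```

Flipping any of the upper 56 bits of x leaves low_bits_of_product(x, c, 8) unchanged, whatever the constant c is. An attacker can therefore solve for the input one low bit at a time, exactly as described above.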

A very interesting operation is the "folded multiply". Take an n-bit number. Multiply it by an n-bit constant. The result is 2n bits long. Now, take the top n bits of the result, and xor them with the bottom n bits. What have we got now? Well... every output bit depends on every input bit. This is optimal diffusion. The result is also very nonlinear, with each bit depending in a complex way on many others. This fact is used to create a fast pseudorandom number generator in a separate article.
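
For n = 64 the folded multiply fits in a few lines. This sketch assumes a compiler that provides the unsigned __int128 extension (gcc and clang do on 64-bit targets); fold64 is a name chosen here for illustration.

```cpp
#include <cstdint>

// Multiply two 64-bit values into a 128-bit product,
// then xor the top 64 bits with the bottom 64 bits.
static std::uint64_t fold64(std::uint64_t x, std::uint64_t c)
{
	unsigned __int128 product = (unsigned __int128)x * c;
	return std::uint64_t(product) ^ std::uint64_t(product >> 64);
}
```

Two easy special cases are worth noting: fold64(0, c) is 0 and fold64(1, c) is just c, since in both cases the top half of the product is empty. Inputs like these need care when the operation is used as a round function.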

The question is how nonlinear is the folded multiply? The answer to that depends on the size of the multiply operation. We are multiplying two numbers together, which typically will have about half their bits set. So, in the sum corresponding to that multiply, about one quarter of the bits will be ones. Thus each output bit will be the result of a "random" n/4 bit sum, modified by potential carries from below. So the question becomes, statistically, how far do carries propagate? The answer to that is log2(n/4) binary places.

It isn't quite so simple though. The bits in the middle of the output get slightly fewer carries than the top or bottom bits. To see this, draw a triangle representing the total number of bits summed in each column of the 2n-bit result. The peak of that triangle is in the middle. The folding operation then moves that peak to the top and bottom. The middle of the n-bit result is thus the xor of two half-sized sums. Thus the result varies from about log2(n) - 2 at the edges, to log2(n) - 3 in the middle. Averaging between them, since the mixing is very good, we get a bitwise polynomial order multiplication factor of log2(n) - 2.5 per folded multiply operation.

Since we now know how much nonlinearity each round improves our mix/compression function by, we can work out how many rounds we need. Basically we want a number of rounds, r, such that (log2(n) - 2.5)^r >> 2n. The question becomes how much greater? To work that out, we can compare to some other cryptographic algorithms.

The AES block cipher uses an S-box for its nonlinearity. The eight-bit box can therefore be approximated as multiplying the polynomial order by four for each round. There are ten rounds for 128 bit AES, so we have the ratio 4^10/128 = 8192. The Keccak hash algorithm uses a single nonlinear operation per sub-round. Assuming optimal mixing, this doubles the polynomial order each time. There are 24 rounds, and 1600 bits of internal state. So we have the ratio 2^24/1600 ≈ 10485. Using these two as benchmarks, we can expect that a factor of above eight thousand or so seems to be secure.

So, for our case, we will assume a 256 bit hash function, with 512 bits of internal state. This means n = 256. Simple arithmetic shows that if we choose nine rounds we have 5.5^9/512 ≈ 8995 as a safety factor. Nine rounds is about twice as many as what would be optimal, but our nonlinear mix function isn't perfect. If the number of bits is increased, it turns out that nine rounds is still good up until unreasonably large hash sizes. For example, let n = 8192 (an eight kilobyte hash), then we have 10.5^9/16384 ≈ 94686; where it looks like we could decrease by a round. At even higher (completely unrealistic) values more rounds are needed though.
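
The ratios used in this comparison can be recomputed with a small helper. Both function names are hypothetical; safety_margin is simply factor^rounds divided by the number of state bits, and folded_multiply_margin plugs in the log2(n) - 2.5 estimate with 2n bits of state.

```cpp
#include <cmath>

// (per-round degree factor)^rounds divided by the number of state bits
static double safety_margin(double factor, int rounds, double state_bits)
{
	return std::pow(factor, rounds) / state_bits;
}

// The same ratio for the folded-multiply design with n-bit halves
static double folded_multiply_margin(double n, int rounds)
{
	return safety_margin(std::log2(n) - 2.5, rounds, 2.0 * n);
}
```

With these, safety_margin(4, 10, 128) gives the AES figure of 8192 and safety_margin(2, 24, 1600) the Keccak figure of about 10485. For the design here, folded_multiply_margin(256, 9) is about 8995 and folded_multiply_margin(8192, 9) about 94686, while dropping to eight rounds at n = 256 would give only about 1635.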

Before implementing the hash function, we will need to have a way of implementing arithmetic on large unsigned integers. Fortunately, it isn't difficult to describe how to do math on a 2n-bit integer in terms of n-bit components. This isn't the most efficient way of doing things, but will suffice to test the results. A C++ template class that implements just the bits we need looks like:


#include <string.h>
#include <ostream>
#include <string>
#include <vector>
#include <stdexcept>

template <typename T, typename T_half, int size>
class double_uint
{
private:
	void fromstring(const std::string &s);
	std::string tostring(void) const;
public:
	T lo_, hi_;
	enum type_size
	{
		tsize = size,
		hsize = size / 2,
		qsize = size / 4
	};

	double_uint(const double_uint &u) : lo_(u.lo_), hi_(u.hi_) {}
	double_uint(const T_half &hi1, const T_half &hi2, const T_half &lo1, const T_half &lo2)
		: lo_(T(lo2) + (T(lo1) << qsize)),
		  hi_(T(hi2) + (T(hi1) << qsize)) {}
	double_uint(const T &hi, const T &lo) : lo_(lo), hi_(hi) {}
	explicit double_uint(const T &lo) : lo_(lo), hi_() {}
	explicit double_uint(const std::string &s) {fromstring(s);}
	double_uint(void) : lo_(), hi_() {}

	/* Hack - basically we are a little-endian integer type */
	explicit double_uint(const unsigned char *c) {memcpy(this, c, sizeof(*this));}
	explicit double_uint(size_t len, const unsigned char *c)
	{
		if (len >= sizeof(*this))
		{
			memcpy(this, c, sizeof(*this));
		}
		else
		{
			memcpy(this, c, len);
			memset(reinterpret_cast<unsigned char *>(this) + len, 0, sizeof(*this) - len);
		}
	}

	friend std::ostream& operator<< (std::ostream &o, const double_uint &x)
	{
		return o << std::string(x);
	}

	double_uint& operator= (int x)
	{
		lo_ = x;
		hi_ = T();
		return *this;
	}
	double_uint& operator=(const double_uint &u)
	{
		lo_ = u.lo_;
		hi_ = u.hi_;
		return *this;
	}

	operator T() const {return lo_;}
	operator T() {return lo_;}
	operator const std::string() const {return tostring();}
	operator std::string() {return tostring();}

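	/*
	 * Ordering. Needed so that the carry tests below compare whole values
	 * when T is itself a double_uint, rather than just converted low parts.
	 */
	bool operator< (const double_uint &u) const
	{
		if (hi_ < u.hi_) return true;
		if (u.hi_ < hi_) return false;
		return lo_ < u.lo_;
	}

	bool operator> (const double_uint &u) const
	{
		return u < *this;
	}

	double_uint operator+= (const double_uint &u)
	{
		T old = lo_;
		lo_ += u.lo_;
		/* Carry out of the low half */
		if (lo_ < old) hi_++;
		hi_ += u.hi_;
		return *this;
	}

	double_uint operator-= (const double_uint &u)
	{
		T old = lo_;
		lo_ -= u.lo_;
		/* Borrow from the high half */
		if (lo_ > old) hi_--;
		hi_ -= u.hi_;
		return *this;
	}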

	double_uint operator+ (const double_uint &u) const
	{
		double_uint res(*this);
		return res += u;
	}

	double_uint operator- (const double_uint &u) const
	{
		double_uint res(*this);
		return res -= u;
	}

	double_uint operator++ (void)
	{
		return *this += double_uint(1);
	}

	double_uint operator++ (int)
	{
		double_uint res(*this);
		*this += double_uint(0, 1);
		return res;
	}

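	/* Needed by operator-= when T is itself a double_uint */
	double_uint operator-- (int)
	{
		double_uint res(*this);
		*this -= double_uint(0, 1);
		return res;
	}

	double_uint operator- (void) const
	{
		/* double_uint(0) would be ambiguous with the pointer constructor */
		return double_uint() - *this;
	}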

	double_uint operator+ (void) const
	{
		return *this;
	}

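	double_uint operator<<= (int x)
	{
		x &= (tsize - 1);
		/* A zero shift would otherwise shift a half by its full width below */
		if (x == 0) return *this;
		if (x >= hsize)
		{
			hi_ = lo_ << (x - hsize);
			lo_ = 0;
			return *this;
		}

		hi_ <<= x;
		hi_ += lo_ >> (hsize - x);
		lo_ <<= x;

		return *this;
	}

	double_uint operator>>= (int x)
	{
		x &= (tsize - 1);
		/* A zero shift would otherwise shift a half by its full width below */
		if (x == 0) return *this;
		if (x >= hsize)
		{
			lo_ = hi_ >> (x - hsize);
			hi_ = 0;
			return *this;
		}

		lo_ >>= x;
		lo_ += hi_ << (hsize - x);
		hi_ >>= x;

		return *this;
	}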

	double_uint operator<< (int x) const
	{
		double_uint res(*this);
		return res <<= x;
	}

	double_uint operator>> (int x) const
	{
		double_uint res(*this);
		return res >>= x;
	}

	static double_uint halfmul(T x, T y)
	{
		T_half x1, x2, y1, y2;
		x1 = x;
		x2 = x >> qsize;
		y1 = y;
		y2 = y >> qsize;

		T xx1(x1);
		T xx2(x2);
		T yy1(y1);
		T yy2(y2);

		double_uint p1, p2, p3, p4;
		p1.lo_ = xx1 * yy1;
		p2.lo_ = xx1 * yy2;
		p3.lo_ = xx2 * yy1;
		p4.lo_ = xx2 * yy2;

		p2 <<= qsize;
		p3 <<= qsize;
		p4 <<= hsize;
		return p1 + p2 + p3 + p4;
	}

	double_uint operator*= (const double_uint &u)
	{
		double_uint p1 = halfmul(lo_, u.lo_);
		double_uint p2 = halfmul(lo_, u.hi_);
		double_uint p3 = halfmul(hi_, u.lo_);

		p2 <<= hsize;
		p3 <<= hsize;
		return (*this = p1 + p2 + p3);
	}

	double_uint operator* (const double_uint &u) const
	{
		double_uint res(*this);
		return res *= u;
	}

	double_uint operator^= (const double_uint &u)
	{
		lo_ ^= u.lo_;
		hi_ ^= u.hi_;
		return *this;
	}

	double_uint operator^ (const double_uint &u) const
	{
		double_uint res(*this);
		return res ^= u;
	}

	/* Calculates the double-width product self * u, then xors its two halves together */
	double_uint fold(const double_uint &u) const
	{
		typedef double_uint<double_uint, T, tsize * 2> Tdouble;

		Tdouble x(Tdouble::halfmul(*this, u));

		return x.lo_ ^ x.hi_;
	}
};

template <class T, class T_half, int size>
    std::string double_uint<T, T_half, size>::tostring(void) const
{
	int num = size / 4;

	static const char hex_symb[] = "0123456789ABCDEF";

	/* Pad with zeros */
	std::string s(num, '0');

	double_uint<T, T_half, size> v(*this);

	for (int i = num - 1; i >= 0; i--)
	{
		s[i] = hex_symb[v.lo_ & 0xf];
		v >>= 4;
	}

	return s;
}

template <class T, class T_half, int size>
    void double_uint<T, T_half, size >::fromstring(const std::string &s)
{
	double_uint<T, T_half, size> temp;

	*this = 0;

	for (std::string::const_iterator i = s.begin(); i != s.end(); i++)
	{
		if ((*i >= '0') && (*i <= '9'))
		{
			temp = *i - '0';
		}
		else if ((*i >= 'A') && (*i <= 'F'))
		{
			temp = *i - 'A' + 10;
		}
		else if ((*i >= 'a') && (*i <= 'f'))
		{
			temp = *i - 'a' + 10;
		}
		else
		{
			throw std::runtime_error("Invalid hex character\n");
		}

		*this <<= 4;
		*this += temp;
	}
}

static std::string tostring(const __uint128_t &x)
{
	int num = 128 / 4;

	static const char hex_symb[] = "0123456789ABCDEF";

	/* Pad with zeros */
	std::string s(num, '0');

	__uint128_t v(x);

	for (int i = num - 1; i >= 0; i--)
	{
		s[i] = hex_symb[v & 0xf];
		v >>= 4;
	}

	return s;
}

std::ostream& operator<< (std::ostream &o, const __uint128_t &x)
{
	return o << tostring(x);
}

Such a template can be instantiated as a type like:


typedef double_uint<__uint128_t, unsigned long long, 256> u256;

to define our unsigned 256 bit type from the built-in unsigned 128 bit types that gcc has. If you have a 32-bit machine, then such types may not exist. If so, you'll need to use two levels of templates:


typedef double_uint<unsigned long long, unsigned, 128> u128;
typedef double_uint<u128, unsigned long long, 256> u256;

Using the above, we can construct our nine-round mix function:


/* Reversible mix function */
void hash_step(const u256 &i1, const u256 &i2, u256 &o1, u256 &o2)
{
	static const u256 t1, t2, t3, t4, t5, t6, t7, t8, t9;
	
	// Insert initialization for constants here

	o1 = i1;
	o2 = i2;

	o1 += o2.fold(t1);
	o2 += o1.fold(t2);
	o1 += o2.fold(t3);
	o2 += o1.fold(t4);
	o1 += o2.fold(t5);
	o2 += o1.fold(t6);
	o1 += o2.fold(t7);
	o2 += o1.fold(t8);
	o1 += o2.fold(t9);
}

Before we start to think about what multiplication constants to use, we have one issue. If the inputs are both zero, what happens? Well... each step involves a folded multiply by a constant with zero. This causes us to add zero to the other half each time. (We could choose xor, but addition mixes bits slightly more.) The result is that the output is always zero, no matter how many rounds we choose. This is bad.

So we need some way to alter the above so that there are no fixed points, especially ones as obvious as the number zero. One trick is to use an xor operation to modify the rounds. If we do this right, then zero will be altered, and since each round will xor in a different constant, there will be no obvious fixed points. i.e.


/* Reversible mix function */
void hash_step(const u256 &i1, const u256 &i2, u256 &o1, u256 &o2)
{
	static const u256 t1, t2, t3, t4, t5, t6, t7, t8, t9;
	
	// Insert initialization for constants here

	o1 = i1;
	o2 = i2;

	o1 += o2.fold(t1) ^ t1;
	o2 += o1.fold(t2) ^ t2;
	o1 += o2.fold(t3) ^ t3;
	o2 += o1.fold(t4) ^ t4;
	o1 += o2.fold(t5) ^ t5;
	o2 += o1.fold(t6) ^ t6;
	o1 += o2.fold(t7) ^ t7;
	o2 += o1.fold(t8) ^ t8;
	o1 += o2.fold(t9) ^ t9;
}

How does this do? Well... the problem with zero is fixed, but now if both halves are equal to one, then we have a similar issue. 1×t = t. t ^ t = 0. So we end up not modifying again. The trick to finally fix this is to put the xor operation inside the folded multiply:


/* Reversible mix function */
void hash_step(const u256 &i1, const u256 &i2, u256 &o1, u256 &o2)
{
	static const u256 t1, t2, t3, t4, t5, t6, t7, t8, t9;
	
	// Insert initialization for constants here

	o1 = i1;
	o2 = i2;

	o1 += (o2^t1).fold(t1);
	o2 += (o1^t2).fold(t2);
	o1 += (o2^t3).fold(t3);
	o2 += (o1^t4).fold(t4);
	o1 += (o2^t5).fold(t5);
	o2 += (o1^t6).fold(t6);
	o1 += (o2^t7).fold(t7);
	o2 += (o1^t8).fold(t8);
	o1 += (o2^t9).fold(t9);
}
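
The difference between the three variants can be checked on a 64-bit miniature of a single round. This is a sketch under stated assumptions: fold64 relies on the unsigned __int128 compiler extension, and the three round_* helpers are hypothetical names for "the amount added to the other half" in each version of hash_step above.

```cpp
#include <cstdint>

static std::uint64_t fold64(std::uint64_t x, std::uint64_t c)
{
	unsigned __int128 product = (unsigned __int128)x * c;
	return std::uint64_t(product) ^ std::uint64_t(product >> 64);
}

// First version: plain folded multiply
static std::uint64_t round_plain(std::uint64_t x, std::uint64_t t)
{
	return fold64(x, t);
}

// Second version: xor with the constant after the fold
static std::uint64_t round_xor_outside(std::uint64_t x, std::uint64_t t)
{
	return fold64(x, t) ^ t;
}

// Third version: xor with the constant before the fold
static std::uint64_t round_xor_inside(std::uint64_t x, std::uint64_t t)
{
	return fold64(x ^ t, t);
}
```

The plain round adds nothing when x is 0, and the xor-outside round adds nothing when x is 1. The xor-inside round still adds nothing for one input, namely x equal to the constant itself, but that input is different in every round because the constants are distinct.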

The above is looking much better. Now we just need to come up with some constants to use. These need to be "random", but still with low entropy. (They also need to be distinct to avoid slide attacks.) One common choice is to use the bit-strings that correspond to the square roots of primes. (Another choice might be the binary expansion of pi.) Unfortunately, this isn't quite good enough. Since we are using multiplies, what happens when we start with zero again? Well, the constant is an approximation to a square root, and then we end up multiplying that approximation by itself. The result will have many one or zero bits in the upper half depending on whether that approximation is slightly larger or smaller than the exact value. The long strings of identical bits are worrying. We should use something other than square roots.

So we choose to use cubic roots of primes to avoid the above issue. Now there are a couple more things to worry about. The first is scaling. We will scale the numbers so that the top bit is always a one. This maximizes the size of the resulting product, giving more overlap in the fold operation. Similarly, we will forcibly set the least significant bit to always be a one. The combination of the above will yield a full 2n-bit product, rather than having padding zero-bits on the top or bottom.
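
The scaling rules can be tried out at a much smaller width. The sketch below produces 32-bit constants rather than 256-bit ones, using only integer arithmetic. scaled_cbrt32 is a hypothetical helper, not the generator used for the constants in this article, and it assumes the unsigned __int128 compiler extension.

```cpp
#include <cstdint>

// Returns floor(cbrt(p) * 2^k), where k is chosen so that bit 31 of the
// result is set, and then forces the lowest bit to one. Intended for small
// p >= 1 such as the first few primes.
static std::uint32_t scaled_cbrt32(unsigned p)
{
	typedef unsigned __int128 u128;

	for (int k = 31; k >= 0; k--)
	{
		// Find the largest x with x^3 <= p * 2^(3k) by binary search
		u128 target = u128(p) << (3 * k);
		std::uint64_t lo = 0, hi = 1ULL << 33;
		while (hi - lo > 1)
		{
			std::uint64_t mid = lo + (hi - lo) / 2;
			if (u128(mid) * mid * mid <= target) lo = mid;
			else hi = mid;
		}

		// The first scale that fits in 32 bits has its top bit set
		if (lo < (1ULL << 32)) return std::uint32_t(lo) | 1u;
	}
	return 0;
}
```

For p = 2 this gives 0xa14517cd, and for p = 11 it gives 0x8e55b097. Apart from the forced low bit, these match the leading 32 bits of the cbrt(2) and cbrt(11) constants used in the final mixing function.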

Thus the completed mixing function looks like:


/* Reversible mix function for 256 bit hash */
void hash_step(const u256 &i1, const u256 &i2, u256 &o1, u256 &o2)
{
	/*
	 * Cube roots of primes.
	 * (Square roots don't work as well when folded with zero.
	 *  The high part becomes nearly all one bits.)
	 *
	 * They are scaled so that the uppermost bit is set.
	 * Then the lowest bit is also set, so that the constant is odd
	 */

	/* cbrt(2) */
	static const u256 t1(0xa14517cc6b945711, 0x1eed5b8adf128686,
	                     0x144788148b18fde0, 0x30c00661b7d16e9d);

	/* cbrt(3) */
	static const u256 t2(0xb89ba24891f7b2e6, 0xef3f8b62b71933e0,
	                     0x50c4a6157ab766cc, 0xfa2ba143e9029653);

	/* cbrt(5) */
	static const u256 t3(0xdae07de7f6269d97, 0xed0ddb59924b141a,
	                     0x0ae36687aa58c29f, 0xe8293af2918f493b);

	/* cbrt(7) */
	static const u256 t4(0xf4daedd2c0c4edde, 0x50536bb743875dac,
	                     0xfdb214852ccf272e, 0x53a3540f5e5aa011);

	/* cbrt(11) */
	static const u256 t5(0x8e55b096fcd22d4e, 0x3c1e6d4936833117,
	                     0x0ae1a0b51ea515b2, 0x6ef98efb6ebf35e3);

	/* cbrt(13) */
	static const u256 t6(0x967c447c6d817406, 0x7bc5196b06dc9887,
	                     0x214ac2f50046dc65, 0x0f9bfa326367aeb7);

	/* cbrt(17) */
	static const u256 t7(0xa48fe0a92bc653e6, 0xec03c7ed7e59981b,
	                     0x3e3a27d8d8e54797, 0xd607fe20b08d6175);

	/* cbrt(19) */
	static const u256 t8(0xaac717b5769b6046, 0x27896d0e27f2c11e,
	                     0x281e73be041f0383, 0xa937169045fb3849);

	/* cbrt(23) */
	static const u256 t9(0xb601eaa628c0c090, 0xac51900eab494a5c,
	                     0x236edd364b4df8c4, 0x5a0cdaed7df05aed);

	o1 = i1;
	o2 = i2;

	o1 += (o2^t1).fold(t1);
	o2 += (o1^t2).fold(t2);
	o1 += (o2^t3).fold(t3);
	o2 += (o1^t4).fold(t4);
	o1 += (o2^t5).fold(t5);
	o2 += (o1^t6).fold(t6);
	o1 += (o2^t7).fold(t7);
	o2 += (o1^t8).fold(t8);
	o1 += (o2^t9).fold(t9);
}

The next task is to convert this into a cryptographic hash function. One possibility is to not use "o2" and convert the mix function into a compression function. This compression function can then be plugged into a standard Merkle-Damgård construction to create the hash we desire. However, as was mentioned above, that construction has issues, and we would like to do better. We could use it as a mixing function within a sponge, but we can perhaps improve on that as well.

Instead, we will use a tree-based hash. This increases parallelism greatly. An implementation can then work on multiple leaves or branches simultaneously, with either vectorization, threading, or by simple instruction interleaving. By taking half of the data from one side of the tree, and half from the other, we can use the mix function to propagate the hash up the structure. At the very last step we can drop "o2".

The above will work. However, there is a slightly better method. Instead of taking half the data (which exposes us to multi-collision attacks) we can combine the data from two halves irreversibly. Simply xoring the data from both halves will work. To add some asymmetry, we choose to flip the sense of one side of the tree, and xor o1 with o2.
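The xor combination can be sketched with 64-bit words standing in for the 256-bit values. The `Node` type and `merge_inputs` name are assumptions made for illustration; only the pattern of xoring one child's o1 with the other child's o2 comes from the text.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Toy stand-in for the (o1, o2) pair produced for one subtree.
struct Node { std::uint64_t o1, o2; };

// Inputs handed to the mix function when two children are merged.
// Each input xors o1 of one child with o2 of the other.
std::pair<std::uint64_t, std::uint64_t> merge_inputs(const Node &left, const Node &right)
{
	return { left.o1 ^ right.o2, right.o1 ^ left.o2 };
}
```

Swapping the children swaps the two inputs, so the merge is not symmetric. The step is also many-to-one: xoring the same mask into `left.o1` and `right.o2` leaves the inputs unchanged, so the children cannot be recovered from them.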

The only remaining question is how to start the hash process. We need 256 bits of "initial vector" for each leaf. We choose to use the combination of 128 bits of the total length, and 128 bits of "length so far". (Lengths are in bytes rather than bits, unlike some other hash functions.) This makes the IV of each leaf different. Also, if someone tries to use suffix or prefix attacks, every leaf will differ, because the total length changes. The result could perhaps be called a "tree sponge" construction.
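A per-leaf IV of this kind might be laid out as below. This is a sketch only: the word order, the `LeafIV` type and the `leaf_iv` name are assumptions, and both lengths are taken as 64-bit values zero-extended to 128 bits.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 256-bit IV as four 64-bit words. The word order is assumed for
// illustration and is not specified by the text.
using LeafIV = std::array<std::uint64_t, 4>;

LeafIV leaf_iv(std::uint64_t total_len, std::uint64_t len_so_far)
{
	LeafIV iv = {};		// high words stay zero for inputs below 2^64 bytes

	iv[1] = total_len;	// low word of the 128-bit total length
	iv[3] = len_so_far;	// low word of the 128-bit "length so far"

	return iv;
}
```

Two leaves of the same message differ in the second half, and the same position in messages of different lengths differs in the first half.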

The hash can be implemented recursively, or we can use a stack of values. Modulo bugs, the completed hash function looks like:


void hash_foldmul256(const std::vector<unsigned char> &input, unsigned char output[32])
{
	size_t len = input.size();
	size_t offset = 0;

	/* Assume less than 64 levels of recursion for now */
	u256 p1[64], p2[64];

	unsigned long long filled = 0;

	int i, j;

	for (len = input.size();; len -= 32, offset += 32)
	{
		u256 t1(len, offset);
		/* data() avoids indexing an empty vector when the input has no bytes */
		u256 t2(len, input.data() + offset);

		hash_step(t1, t2, t1, t2);

		filled++;

		for (i = 0; i < 64; i++)
		{
			if (filled & (1ull << i)) break;

			/* These xors are irreversible */
			hash_step(p1[i] ^ t2, t1 ^ p2[i], t1, t2);
		}

		p1[i] = t1;
		p2[i] = t2;

		if (len <= 32) break;
	}

	/* Find the first value */
	for (i = 0; i < 64; i++)
	{
		if (filled & (1ull << i)) break;
	}

	/* Handle the remaining partial evaluations */
	for (j = i + 1; j < 64; j++)
	{
		if (!(filled & (1ull << j))) continue;

		/* These xors are irreversible */
		hash_step(p1[i] ^ p2[j], p1[j] ^ p2[i], p1[i], p2[i]);
	}

	/* Final irreversible step - only return a single output */
	memcpy(output, &p1[i], 32);
}
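The stack bookkeeping can be easier to follow with a toy combiner that records the tree shape instead of hashing. The sketch below mirrors the binary-counter logic of the function above, with strings in place of u256 pairs. It assumes the final loop is meant to fold in only those levels whose bit is set in the counter; `combine` and `tree_shape` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy stand-in for the merge step: records which subtrees were joined.
static std::string combine(const std::string &a, const std::string &b)
{
	return "(" + a + " " + b + ")";
}

// Returns the shape of the tree built over the given number of leaves.
std::string tree_shape(std::size_t leaves)
{
	if (leaves == 0) return "";

	std::string pending[64];	// one completed subtree per level
	unsigned long long filled = 0;	// binary counter of leaves seen

	for (std::size_t n = 0; n < leaves; n++)
	{
		std::string t = "L" + std::to_string(n);
		int i;

		filled++;

		/* Each carry out of bit i merges with the subtree stored at level i */
		for (i = 0; i < 64; i++)
		{
			if (filled & (1ull << i)) break;
			t = combine(pending[i], t);
		}

		pending[i] = t;
	}

	/* Lowest occupied level */
	int i = 0;
	while (!(filled & (1ull << i))) i++;

	/* Fold the remaining occupied levels into it */
	for (int j = i + 1; j < 64; j++)
	{
		if (!(filled & (1ull << j))) continue;
		pending[i] = combine(pending[i], pending[j]);
	}

	return pending[i];
}
```

With four leaves this gives a balanced tree, "((L0 L1) (L2 L3))". With five, the odd leaf is joined to the completed four-leaf subtree at the end: "(L4 ((L0 L1) (L2 L3)))".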

The expansion to 512-bit, 1024-bit or wider hashes is easy. You just need to update the constants and use wider unsigned integers. With templates in C++, the result isn't too complicated. However, a straightforward implementation would be quite inefficient. A real implementation would worry about faster multiplication techniques, such as Karatsuba or Toom-Cook. It would also consider vectorization and other parallelization techniques. The design here is very conducive to software optimization.
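As a reminder of what Karatsuba buys, here is a minimal sketch of a single Karatsuba step at a small width. It is not tied to the u256 type; `karatsuba32` is a hypothetical name, and a real implementation would apply the same idea recursively to multi-word operands.

```cpp
#include <cassert>
#include <cstdint>

// One Karatsuba step: a 32x32 -> 64 bit product built from three
// multiplies of 16-bit halves instead of four.
std::uint64_t karatsuba32(std::uint32_t x, std::uint32_t y)
{
	const std::uint64_t x1 = x >> 16, x0 = x & 0xFFFFu;
	const std::uint64_t y1 = y >> 16, y0 = y & 0xFFFFu;

	const std::uint64_t z2 = x1 * y1;	// high halves
	const std::uint64_t z0 = x0 * y0;	// low halves

	/* Cross terms x1*y0 + x0*y1, recovered from one extra multiply */
	const std::uint64_t z1 = (x1 + x0) * (y1 + y0) - z2 - z0;

	return (z2 << 32) + (z1 << 16) + z0;
}
```

The saving of one multiply per level is what makes the technique attractive once the operands are many words long.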

Summary

Constructing a new cryptographic hash is fairly difficult. It is a good idea to use something already existing. However, making a new one is possible. You just need to be aware of the many possible attacks an adversary might use. If you have an estimate of how nonlinear your round function is, then it is possible to work out how many rounds are needed to give a fairly good guarantee of security.

Here, we create a hash based on a folded high precision multiply. The primitive has high nonlinearity, so only nine rounds are required. We use a tree-based hashing structure to allow parallelization and to be resistant to many generic attacks. The resulting function is quite simple conceptually, which enables easy analysis.

Copyright © Lockless Inc All Rights Reserved.