Lockless Inc

Implementing calloc()

There are a couple of subtle issues that one must deal with when implementing the calloc() function. This function takes in two values; the number of elements in an array, and the size of each element. It then returns a block of memory that is filled with zeroes. The most obvious way to implement this is:

void *calloc(size_t n, size_t size)
	size_t total = n * size;
	void *p = malloc(total);
	if (!p) return NULL;
	return memset(p, 0, total);

More than a few memory allocators implement it this way. Can you spot the problems? The first is fairly obvious. The memory returned from malloc may already be filled with zeroes. This happens when malloc obtains the memory straight from the operating system. Due to security reasons, the memory is pre-cleared by the operating system first. This means that the memset may not be required. This doesn't seem like that big of a deal. However, the most likely case where this happens is when a large chunk of memory is asked for. Clearing it again can be extremely slow as the whole block of memory will need to be paged in. The operating system trick of merging pages filled with zeroes is thwarted, and much more memory may be used. The resulting slow-down can be extreme if the program was going to use the memory for a sparse array.

The second problem is more subtle. The multiplication may overflow. This causes the memory allocator to allocate less than the program asked for. The resulting malloc may succeed, and a pointer may be returned. The program can then overflow the buffer, and a security problem will exist. The solution is of course to securely do the multiplication, and return NULL if the result is too large.

The problem for today is to work out the fastest way to implement the multiplication and overflow detection process. If this were an addition, the code would be rather simple. Add the two numbers, and check to see if the result is less than either of them. If so, then we have overflowed.

This algorithm doesn't work with multiplication. An example that shows this is imagine your machine has a 4bit size_t. Let n=7 and size=9. The true value of the multiplication is 63. However, mod 16, this is 15. 15 is larger than both 7 and 9, however, we have still overflowed.

A slow technique to do the multiplication safely is to break the size_t values into two chunks. For the rest of this page, assume size_t is a 64bit type. We split them into two 32bit chunks, corresponding to the upper and lower 32bits.

size_t safemul(size_t n, size_t size)
	unsigned n_high = n >> 32;
	unsigned n_low = n;
	unsigned size_high = size >> 32;
	unsigned size_low = size;
	size_t t1, t2;
	/* We can't overflow if both are less than 32bits */
	if (!n_high && !size_high) return n * size;
	/* We will definately overflow */
	if (n_high && size_high) return goto overflow;
	/* Only one of the values is more than 32bits in size, overflow uncertain */
	if (n_high)
		t1 = (size_t) n_high * size_low;
		t2 = (size_t) n_low * size_low;
		t1 = (size_t) size_high * n_low;
		t2 = (size_t) size_low * n_low;
	/* Overflow? */
	if ((t1 + (t2 >> 32)) >= (1ULL << 32)) goto overflow;
	/* Add together the two chunks, we know that they won't overflow */
	return (t1 << 32) + t2;

	/* Return a large number that malloc will fail to allocate */
	return ~0ULL;

The above works. However, it is quite slow, since it has quite a few multiplies. Can we do better? Looking in the "Hackers Delight" book gives a new technique based upon counting the number of bits used in each value.

size_t safemul(size_t n, size_t size)
	unsigned lz_n = nlz(n);
	unsigned lz_size = nlz(size);
	size_t t;
	/* Numbers too large to multiply without overflow? */
	if (lz_n + lz_size <= 62) goto overflow;

	/* Do most of the multiplication */
	t = n * (size/2);
	/* Top bit set? */
	if ((s_size_t)t < 0) goto overflow;

	/* Correct for the factor of 1/2 above */
	t += t;
	/* Add in the effect of the bottom bit */
	if (size & 1)
		/* The addition overflow detection trick works here */
		t += n;
		if (t < n) goto overflow;
	return t;
	/* Return a large number that malloc will fail to allocate */
	return ~0ULL;	

where the nlz() function counts the number of leading zeros. This routine is unfortunately even slower in the common case where allocations are smaller than 4GiB. Larger allocations are faster... but those are extremely rare, so the net result is something slower. In 32bit code, where allocations are quite often larger than 64KiB, this routine is the winner, however.

Is there an evern faster way? Yes, but it is non-portable. We require a 128bit type to do the multiplication correctly. Fortunately, gcc has this. The resulting code is rather simple

static size_t safemul(size_t n, size_t size)
	__uint128_t dn = n;
	__uint128_t dsize = size;
	__uint128_t drsize = dn*dsize;
	/* Overflow? */
	if (drsize >> 64) return ~0ULL;

	/* Return the bottom 64bits */
	return drsize;

Note how the shifts actually correspond to a choice of register, so aren't actually emitted. Also note how there is only a single multiplication this way. The above code is non-portable. However, the 32bit version is in C99, where the unsigned long long type exists.


Enter the 10 characters above here

About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.