Lockless Inc

Dynamic Strings in C

Strings in C are defined as a stream of contiguous bytes, terminated by a byte with the value zero. The C standard library has many functions that deal with this type of string, but they suffer from one major problem. The definition of a C string does not contain the size of the memory allocated for that string. Thus the user must remember this buffer size in some other variable. Obviously, this can be error-prone, and the result are the famous class of security vulnerability: buffer overflows.

The other common type of string type in other computer languages (in particular PASCAL) is that where the string is specified by a stream of contiguous bytes, together with the length of that string. This has the advantage that strings may have embedded null characters. It however, does not fix the problem of not knowing the buffer size, and thus buffer overflows are still quite likely.

The reason C uses its particular type of string is simplicity. The pointer ↔ array connection allows the syntax "string" to specify an array of characters, and having the compiler add a terminating zero (nul) byte is all that's required to make a string. Another reason is that it allows the implementation of the string functions to be unrelated to the memory allocation functions, making them more useful. (strdup is the exception which proves the rule.)

If we drop the simplicity requirement, and move to some other data structure, what is the best for describing strings to avoid buffer overflow problems? To make things easier, we will define strings to be an "untyped" stream of bytes in memory. Thus things like locale and Unicode will not be relevant.

The obvious step is to choose a structure that looks like


typedef struct string string;
struct string
{
	char *s;
	size_t size;
	size_t b_size;
};

Where size is the length of the string, and b_size is the length of the buffer containing the string. Another possibility is to prepend the size information in the front of the character buffer *s, but this causes type aliasing so we will avoid that option.

The next big problem is to notice that if we are describing a dynamic string library that the strings will need to allocate memory in some operations. Allocating memory can fail, and this is a problem. The failure is an extremely rare occurance, so we would like to make checking for it cheap. One obvious way to do this would be to return an error code in those functions which could allocate memory. Unfortunately, such an API is rather annoying to use. Another option would be to allow the user to register some sort of error handler that could be called in out of memory situations. The problem with that is that it doesn't work well if multiple client libraries are simultaneously using dynamic strings. The C++ way is to raise an exception. However, C doesn't have exceptions, so this is also not ideal.

Other standard library code has similar problems. Specifically, the floating point functions may fail in some cases. For example, if you try to take the square root of a negative number, the result cannot be expressed as a double. To signify the problem, the floating point library returns "NaN" standing for "Not a Number". Operations involving NaNs also produce NaNs. Thus the error check may be done only when the application cares to look, and not after every single floating point function call.

We will choose this API method for dynamic strings. If memory allocation fails, then a dynamic string goes into the "NaS" or "Not a String" state. Any operation on a NaS will maintain that status. Thus an application can check for the problem only when it needs to. Thus we can define two macros, and implement the new version of strlen() as:


#define NaS ((string) {NULL, 0, 0})
#define isnas(S) (!(S)->s)

static size_t strsize(string *s)
{
	if (isnas(s)) return 0;
	return s->size;
}

The next problem with the implementation of dynamic strings in C is to notice that pointers can have multiple sources. The string memory buffer could point to static memory defined at compile time. It could point to somewhere on the stack. It also could point to a block allocated by malloc() or realloc() Only the last of these can be resized with a call to realloc() and must be freed with a call to free(). Thus we need to somehow remember what type of memory block we have. To do this, we choose the upper bit of b_size to store that state. This limits the size of string buffers to 263-1 bytes on 64bit machines, which is large enough to live with for a while.

Thus we may now define functions to allocate and free dynamic strings


#define STR_FREEABLE (1ULL << 63)

/* An initialized empty struct string */
#define STRINIT ((string) {malloc(16), 0, (16)})

static string strmalloc(size_t size)
{
	if (size < 16) size = 16;
	return (string) {malloc(size), 0, size | STR_FREEABLE};
}

/* Try to compact string memory */
static void strrealloc(string *s)
{
	char *buf;
	
	/* Not a string? */
	if (isnas(s)) return;
	
	/* Can't realloc? */
	if (!(s->b_size & STR_FREEABLE)) return;
	
	/* Don't invoke undefined behaviour with realloc(x, 0) */
	if (!s->size)
	{
		free(s->s);
		s->s = malloc(16);
	}
	else
	{
		/* Try to compact */
		buf = realloc(s->s, s->size);
		if (buf) s->s = buf;
	}
}

static void strresize(string *s, size_t size)
{
	char *buf;
	size_t bsize;
	
	/* Are we not a string? */
	if (isnas(s)) return;
	
	/* Not resizable */
	if (!(s->b_size & STR_FREEABLE))
	{
		string s2;
		
		/* Don't do anything if we want to shrink */
		if (size <= s->size) return;
		
		/* Need to alloc a new string */
		s2 = strmalloc(size);
		
		/* Copy into new string */
		memcpy(s2.s, s->s, s->size);
		
		/* Point to new string */
		s->s = s2.s;
		s->b_size = s2.b_size;
		
		return;
	}
	
	/* Too big */
	if (size & STR_FREEABLE)
	{
		free(s->s);
		*s = NaS;
		return;
	}
	
	bsize = s->b_size - STR_FREEABLE;
	
	/* Keep at least 16 bytes */
	if (size < 16) size = 16;
	
	/* Nothing to do? */
	if ((4 * size > 3 * bsize) && (size <= bsize)) return;
	
	/* Try to double size instead of using a small increment */
	if ((size > bsize) && (size < bsize * 2)) size = bsize * 2;
	
	/* Keep at least 16 bytes */
	if (size < 16) size = 16;

	buf = realloc(s->s, size);
	
	if (!buf)
	{
		/* Failed, go to NaS state */
		free(s->s);
		*s = NaS;
	}
	else
	{
		s->s = buf;
		s->b_size = size | STR_FREEABLE;
	}
}

static void strfree(string *s)
{
	if (s->b_size & STR_FREEABLE) free(s->s);
	
	*s = NaS;
}

With the above, we may also define functions and macros to convert C-style strings to dynamic strings


/* Allocate room for a struct string on the stack */
#define stralloca(S) ((string) {alloca(S), 0, (S)})

/*
 * Copy a struct string to the stack.
 * (Could use strdupa(), but this is more portable)
 */
#define stradupstr_aux(S)\
	__extension__ ({\
		char *_stradupstr_aux = alloca((S).size + 1);\
		memcpy(_stradupstr_aux, (S).s, (S).size);\
		strcstr_aux(_stradupstr_aux, (S).size);\
	})

#define stradupstr(S) strdupastr_aux(*(S))

/* A struct string based on a C string, stored on the stack */
#define S(C) stradupstr_aux(strcstr((char *)C))

static string strcstr_aux(char *c, size_t len)
{
	return (string) {c, len, len + 1};
}

/* A struct string based on a C string, stored in whatever c points to */
static string strcstr(char *c)
{
	size_t len = strlen(c);
	return strcstr_aux(c, len);
}

The above code uses the alloca() function to allocate memory on the stack for the magic S() macro. Doing this means that such strings do not need to be explicitly freed, and are deallocated automatically at block exit. (If C99 dynamic arrays are not used, then they are deallocated at function exit.) Unfortunately, alloca() cannot be used inside the parameters to a function call. This is because it alters the stack, which may be simultaneously be altered by the compiler building the parameter list. Thus we need to use a gcc extension for block expressions instead of the obvious nested-function code.

Since most of the time, strings are local to a function, allocating on the stack is quite convenient. However, some of the time we would like to store a string into a longer-lived data structure. To do this, we need to allocate memory explicitly.


/* Create a new string as a copy of an old one */
static string strdupstr(string *s)
{
	string s2;
	
	/* Not a string? */
	if (isnas(s)) return NaS;
	
	s2 = strmalloc(s->size);
	s2.size = s->size;
	memcpy(s2.s, s->s, s->size);
	
	return s2;
}

/* Copy the memory from the source string into the dest string */
static void strcpystr(string *dest, string *src)
{
	/* Are we no a string */
	if (isnas(src)) return;
	
	strresize(dest, src->size);
	
	if (isnas(dest)) return;
	dest->size = src->size;
	memcpy(dest->s, src->s, src->size);
}

Of course, since the C standard library deals with pointers-to-char terminated by a nul as its string type, we need to be able to convert back to this type of string. Fortunately, this isn't very difficult. A zero byte can be added on the end since we have made room in the above functions for it. (This is the reason for the "+ 1" in strstr_aux() etc.


static char *strtocstr(string *s)
{
	size_t bsize;
	
	/* Are we not a string? */
	if (isnas(s)) return NULL;
	
	/* Get real buffer size */
	bsize = s->b_size & ~STR_FREEABLE;
	
	if (s->size == bsize)
	{
		/* Increase buffer size */
		strresize(s, bsize + 1);
		
		/* Are we no longer a string? */
		if (isnas(s)) return NULL;
	}
	
	/* Tack a zero on the end */
	s->s[s->size] = 0;
	
	/* Don't update the size */
	
	/* Can use this buffer as long as you don't append anything else */
	return s->s;
}

Of course, the above doesn't really deal with the problem we set out to solve, that of buffer overflows. To do that, we will split modifying a string into different conceptual problems. The simplest to solve is accessing a string beyond the end of the buffer. A simple accessor macro can be used which can bounds-check when required.


#ifdef DEBUG_BOUNDS
#define S_C(S, I)\
	(* __extension__ ({\
		assert((I) >= 0);\
		assert((I) < (S)->size);\
		assert((I) < ((S)->b_size & ~STR_FREEABLE));\
		&((S)->s[I]);\
	}))
#else
#define S_C(S, I) ((S)->s[I])
#endif

The next common task that can result in a buffer overflow problem is appending to a string. We would like to be able to append characters, C strings, and dynamic strings to our initial dynamic string. We would also like to append multiple C strings and dynamic strings simultaneously, with a simple API.


static void strncatcstr(string *s, size_t len, const char *str)
{
	size_t bsize;
	
	/* Are we not a string? */
	if (isnas(s)) return;
	
	/* Nothing to do? */
	if (!str || !len) return;
	
	/* Get real buffer size */
	bsize = s->b_size & ~STR_FREEABLE;
	
	if (s->size + len >= bsize)
	{
		strresize(s, s->size + len);
		
		/* Are we no longer a string? */
		if (isnas(s)) return;
	}
	
	memcpy(&s->s[s->size], str, len);
	s->size += len;
}

static void strcatcstr(string *s, const char *str)
{
	if (str) strncatcstr(s, strlen(str), str);
}

static void strcatstr(string *s, const string *s2)
{
	strncatcstr(s, s2->size, s2->s);
}

static void strcatcstrs(string *s, ...)
{
	const char *str;
	va_list v;
	
	/* Are we not a string? */
	if (isnas(s)) return;
	
	va_start(v, s);
	
	for (str = va_arg(v, const char *); str; str = va_arg(v, const char *))
	{
		strncatcstr(s, strlen(str), str);
	}
	
	va_end(v);
}

static void strcatstrs(string *s1, ...)
{
	const string *s2;
	va_list v;
	
	/* Are we not a string? */
	if (isnas(s1)) return;
	
	va_start(v, s1);
	
	for (s2 = va_arg(v, const string *); s2; s2 = va_arg(v, const string *))
	{
		strncatcstr(s1, s2->size, s2->s);
	}
	
	va_end(v);
}

The final situation that can cause a buffer overflow problem is formatting a string from input data. Using sprintf() can be error prone due to possibly not knowing the length of the resulting string. (Remember that a user may have a different locale from you, and thus have different sized output.) The snprintf() function fixes this problem... but if the output buffer is too small, the results will be truncated. We would like a version of sprintf() that will dynamically allocate a large enough string for the result. (And if that isn't possible, return the NaS constant.) A function which does this, based on vsnprintf() is


static void strprintf(string *s, const char *fmt, ...)
{
	va_list v;
	size_t len;
	
	/* Are we not a string? */
	if (isnas(s)) *s = STRINIT;
	
	/* Nothing to do? */
	if (!fmt) return;
	
	va_start(v, fmt);
	len = vsnprintf(NULL, 0, fmt, v) + 1;
	va_end(v);
	
	strresize(s, len);
		
	/* Are we no longer a string? */
	if (isnas(s)) return;

	va_start(v, fmt);
	vsnprintf(s->s, len, fmt, v);
	va_end(v);
	s->size = len - 1;
}

Finally, we would often like to allocate such a formatted string on the local stack. Very often we just wish to print out the result to the screen or a file, so having to worry about remembering to free the data is a burden. Fortunately, we can use a similar trick to stradupstr_aux() to create the formatted string dynamically. The result is the macro:


/* Use a (C string) format and return a stack-allocated struct string */
#define straprintf(...)\
	__extension__ ({\
		size_t _straprintf_len = snprintf(NULL, 0, __VA_ARGS__) + 1;\
		char *_straprintf_buf = alloca(_straprintf_len);\
		snprintf(_straprintf_buf, _straprintf_len, __VA_ARGS__);\
		strcstr_aux(_straprintf_buf, _straprintf_len - 1);\
	})

Using these functions and macros, strings in C become much easier to use. Buffer overflows become much less likely, and thus creating security-conscious C code becomes a less difficult task. In fact, the cgi scripts that run this website are written in C using a predecessor to the string library described here.

Comments

said...
Would be great to have source available to download!
~ik said...
Nice, was looking for similar thing
Den said...
no compile:

COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/i686-pc-linux-gnu/4.9.0/lto-wrapper
Target: i686-pc-linux-gnu
Configured with: ./configure --prefix=/usr
Thread model: posix
gcc version 4.9.0 20131027 (experimental) (GCC)


ar rcs libllalloc.a libllalloc.o
BFD: libllalloc.o: unknown [0] section `' in group [wm4.1.4170c524e98ed4ce7a3e989dd25425a1]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.features.h.30.5c3c447adabc0811a924b6e3287995b8]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.cdefs.h.21.8178d3d0707f696c84de397cc087c0de]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.cdefs.h.331.1e7d2855dc11325b112c801067008ed2]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.stubs32.h.10.60002ca50c741fede33a2cc4eecdf17d]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.endian.h.20.078fabf31bd583d1be6b7c75c8966567]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.endian.h.42.5a1d690a342518667b7a927de16f31f4]
BFD: libllalloc.o: unknown [0] section `' in group [wm4.stddef.h.184.f740e675efc82b9790b2d86753673cd5]
... etc
said...
Enter your comments here

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.