Obscure C

The C language is relatively "small" in comparison to other modern computer languages. Completely specifying it (and its standard library) only requires about 550 pages. Doing the same for Java or C++ would require an entire bookshelf rather than a single book. However, even though the language is small enough to be easily comprehended, it has some dark corners. The purpose of this article is to explore some of them. Since C is used in a wide variety of applications, the "dialect" of it varies. This means that some people may be quite familiar with many of the following items. However, each reader's "non-obscure" subset should hopefully differ, resulting in at least a few items you might not know about.

1) Pointers are not unsigned integers

Pointers are not unsigned integers. That is fairly obvious: you can't dereference an unsigned integer (even if it holds an address). However, many people assume that properties of integers apply to pointers, even when they don't. Unfortunately, this can result in security issues. A common problem is bounds-checking an access to a buffer. A simple solution might look like:
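A sketch of the idea (foo, buf, len and LEN are illustrative names, not from any particular code base):

#define LEN 1024

void foo(unsigned int len)
{
	char buf[LEN];

	/* try to reject out-of-range accesses with pointer arithmetic */
	if (buf + len >= buf + LEN)
		return;		/* reject */

	/* ... work with buf[0..len-1] ... */
}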
Unfortunately, that code might not work correctly. If the len variable is too big, then the addition can overflow and the inequality will not hold. The obvious fix is something like:
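Perhaps by adding an explicit wrap-around check (same illustrative names):

#define LEN 1024

void foo(unsigned int len)
{
	char buf[LEN];

	/* second check tries to catch wrap-around of the addition */
	if (buf + len >= buf + LEN || buf + len < buf)
		return;		/* reject */

	/* ... */
}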
However, this doesn't work either. The problem is that in C, you can only compare pointers that point into the same underlying object. That means that all pointer arithmetic is assumed never to overflow. The second check can be optimized away, as it can only trigger after undefined behavior has already occurred. The gcc compiler did exactly this, breaking a security check in the Linux kernel. As an aside, the proper way to check is simply:
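That is, compare the index against the buffer size as integers, before doing any pointer arithmetic (same illustrative names):

#define LEN 1024

void foo(unsigned int len)
{
	char buf[LEN];

	/* compare the integers, not the pointers - no overflow is possible */
	if (len >= LEN)
		return;		/* reject */

	/* ... */
}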
This isn't the only pointer arithmetic that can be a little tricky. Another case is:
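Something along these lines, which tries to obtain the numeric value of a pointer p by subtracting NULL (p and the function name are illustrative; the variable u matches the discussion below):

#include <stddef.h>
#include <stdint.h>

void example(char *p)
{
	/* try to turn the pointer into a plain number by subtracting NULL */
	uintptr_t u = p - (char *) NULL;

	/* ... use u ... */
}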
Code like the above might appear in macros attempting to do address-space manipulation, so it isn't quite as strange as it first appears. The problem is that, again, it is undefined behavior. NULL isn't within any valid object, so we are subtracting two pointers into different objects - undefined behavior. The compiler is allowed to set u to any value it pleases. These features of C pointers mean that no object can be larger than half the address space. (Otherwise pointer arithmetic within it would overflow the (signed) ptrdiff_t.) Another limitation is that no object can end at address (uintptr_t) -1, i.e. 0xffffffff on 32-bit machines. You are always allowed to form a pointer one past the end of a C object, and that isn't possible in this case due to the overflow. Again, these subtleties don't affect most programmers. However, in embedded work, where every byte matters, they do.

2) Loops in the C pre-processor

The C pre-processor is a simple text-replacing macro implementation. It isn't very complex, and deliberately restricts its own scope to avoid extra complexity. Macros are only expanded once per phase, so arbitrary computation via recursion is hindered. However, wouldn't it be nice to have loops controlled by macros? You could programmatically cause varying amounts of text substitution. The obvious way - exploiting the relationship between recursion and iteration - doesn't work, since recursive macro expansion is disallowed. However, that doesn't stop us from being tricky. The key idea is that a text file is allowed to #include itself. A new problem is that a macro definition can't really refer to previous definitions of itself, so we can't update a loop counter in the obvious way. That in turn can be surmounted by using binary logic: we can manually maintain a counter, one macro per bit. The following program uses the technique:
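A minimal sketch of the technique (not necessarily the article's original program). The header keeps a two-bit counter in the macros BIT1 and BIT0, and keeps re-including itself until the counter overflows:

/* loop.h - expands to one statement per "iteration" */
#if !defined(BIT1)
/* first inclusion: initialise the two-bit counter to 00 */
#define BIT1 0
#define BIT0 0
#endif

printf("iteration %d\n", BIT1 * 2 + BIT0);

/* manually add one to the counter, bit by bit */
#if BIT0 == 0
#undef BIT0
#define BIT0 1
#else
#undef BIT0
#define BIT0 0
#if BIT1 == 0
#undef BIT1
#define BIT1 1
#else
#undef BIT1
#define BIT1 2	/* marks overflow */
#endif
#endif

/* keep including ourselves until the counter wraps */
#if BIT1 != 2
#include "loop.h"
#endif

/* main.c */
#include <stdio.h>

int main(void)
{
#include "loop.h"
	return 0;
}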
Can you guess what it does without compiling and running it?

3) strxfrm()

The C standard library has quite a few functions in it. Some of them are used more than others. Some are quite obvious from their name, so if you run into them you can derive what they should do. However, others are relatively obscure.
The standard's description of strxfrm() is terse: it "transforms" one string into another, such that comparing two transformed strings with strcmp() gives the same ordering as comparing the originals with strcoll(). Okay... so it transforms a string in some poorly-described way, but then refers to the also obscure function strcoll(). However, that doesn't really explain the need (and use) for strxfrm(). The point is locale-aware comparison: strcoll() orders strings according to the current locale, but doing that repeatedly (say, while sorting) is slow, so strxfrm() lets you transform each string once and then compare the results with cheap strcmp() calls. If you never do internationalization work, you may never run into this function. Even if you do, you will only need to know about it if you care about performance in specific types of algorithms. It's there if you need it. The rest of us can safely ignore it, relegating it to a dark corner of the language.
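A sketch of the usual pattern (the strings and locale are arbitrary examples):

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *s1 = "apple", *s2 = "Banana";
	char key1[64], key2[64];

	setlocale(LC_COLLATE, "");

	/* transform each string once; a real program would check the
	   returned lengths against the buffer sizes */
	strxfrm(key1, s1, sizeof key1);
	strxfrm(key2, s2, sizeof key2);

	/* the two comparisons always agree in sign */
	printf("strcoll: %d, strcmp on keys: %d\n",
		strcoll(s1, s2), strcmp(key1, key2));

	return 0;
}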
4) Integer Division

C is a low-level language, and details of the underlying hardware matter. This means that C code can be very fast, but it also exposes C programs to some dark corners. One such issue is exposed by integer division. The effect of division is described in paragraphs 5 and 6 of section 6.5.5 of the C99 standard:

The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined. When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. (This is often called "truncation towards zero".) If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a.

So the above says that division by zero is undefined behavior. However, there is one other case not explicitly mentioned. The problem is that C supports three different representations of signed integers: sign-magnitude, ones' complement, and two's complement. The first two of these have no other undefined possibilities for division. The third does. Guess which one your hardware probably uses? Two's complement division has a hidden undefined case: INT_MIN / -1 causes integer overflow. The mathematical result is INT_MAX + 1, which isn't representable in the integer type. You might expect it to wrap around to INT_MIN again... but since signed overflow is undefined, anything can actually happen. On some machines you'll get a hardware exception. Thus, in addition to checking for division by zero, security-conscious code also needs to test for the INT_MIN / -1 case.

5) Bit Shifting

Bit shifts on unsigned integers work as you might expect. Shifting left is like multiplication, and right like division. Signed integers are a different story. Shifting leftwards works until you overflow. Once overflow occurs, we are again in the world of undefined behavior: the compiler can do anything, including ignoring the possibility. Thus you cannot left shift into the sign bit! (Checking the sign bit is a really fast comparison with zero, or an overflow-flag test, on some hardware.) If you want to use the special properties of manipulating the sign bit with shifts, you first need to cast to unsigned to do the work, and then cast back when done. Unsigned overflow is well defined, avoiding all the problems. Right-shifting signed integers is also slightly tricky. It isn't division by powers of two, as in the unsigned case. Non-negative integers are divided as you might expect, but negative ones exercise implementation-defined behavior. This isn't quite as bad as undefined behavior: the result will be defined by your compiler maker. However, different compilers might choose different things, which makes it almost as annoying. Fortunately, most compiler vendors choose sane defaults here, with a right shift of a signed integer acting as an "arithmetic shift": the sign bit stays constant, propagating its state rightwards. This is quite useful, allowing simplified masking operations. Since the common case for right shifts is to behave this way, many people assume that the behavior is portable. Unfortunately, it isn't.

6) K&R Declarations

C predates its standardization by the ANSI organization, so some really old C code doesn't quite look like its more modern descendants. One thing in particular that has changed is the way functions are declared. Now, the types of the arguments are specified within the parentheses. In pre-ANSI code, they were declared after them:
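For example (add is an arbitrary name; the two definitions below show the same function written in each style, so only one would appear in a real program):

/* modern style: parameter types inside the parentheses */
int add(int a, int b)
{
	return a + b;
}

/* old K&R style: the parameter types are declared between the
   closing parenthesis and the opening brace */
int add(a, b)
int a;
int b;
{
	return a + b;
}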
You'll typically only run into this when maintaining really old code (which might have macros allowing both forms of declaration to co-exist). However, there is one case in modern C code where you might want to use this archaic style. The problem occurs when dealing with variable length arrays passed as parameters. C is a lingua franca of programming languages, and has to talk to many others. One such common interface is with FORTRAN, where it is quite common to pass arrays as parameters to functions. The problem is that there are two ways to order the array and its length in the parameter list. The first works nicely:
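A sketch (sum, a and b are illustrative names). The length b comes first, so it can be used in the declaration of the array parameter a:

double sum(int b, double a[b])
{
	double total = 0.0;
	int i;

	for (i = 0; i < b; i++)
		total += a[i];
	return total;
}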
In the above, the length of the array is passed first, and the array second. Thus we can use the length in the definition of the array. What happens when the order is the other way around?
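Reversing the parameter order would look like this (same illustrative names):

double sum(double a[b], int b)	/* error: 'b' is not yet declared */
{
	/* ... */
}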
Unfortunately, this doesn't compile. We can't use b before it is defined in the list of function parameters. This is where the old K&R syntax comes to the rescue:
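A sketch of the K&R-style definition; gcc, at least, accepts this mixture of C99 variable length arrays and the pre-ANSI syntax:

double sum(a, b)
int b;
double a[b];
{
	double total = 0.0;
	int i;

	for (i = 0; i < b; i++)
		total += a[i];
	return total;
}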
Notice how we can reorder the type declarations now so that a is specified after b. The above is nice, but how do we pre-declare such a function, so that others can call it? This is another obscure part of C99:
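The prototype might look like this, where [*] stands in for an array bound that is specified elsewhere:

double sum(double a[*], int b);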
The square brackets with the asterisk inside represent a variable length array of unspecified size. You can only use this syntax in pre-declarations - which is the only place it is needed. Most of the time you won't need it at all; only in the special case where the array is passed before its length does it come up.

7) typedef

The typedef keyword is grammatically a storage-class specifier, like static or extern. That means it can appear in some surprising positions within a declaration:
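For example (the names t1 to t4 are arbitrary):

typedef int t1;			/* the usual form */
int typedef t2;			/* legal: declaration specifiers may appear in any order */
int typedef unsigned t3;	/* also legal */
int t4 typedef;			/* not legal: typedef cannot follow the declarator */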
The first three of these are legal, and do as you might expect. The last isn't allowed though. A more complex case might look like:
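Perhaps something like this (test_t is an illustrative name):

typedef struct test { int x; } test_t;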
Here we declare a struct tag "test", and make the name test_t an alias for that structure type, all in a single declaration. As an aside, the following is also legal C:
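A sketch of the kind of declaration meant:

struct x { volatile int x; } const volatile x;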
Unlike C++, struct tags live in a separate "namespace" from type names. Thus the three x's refer to different things. The first is a struct tag. The second is a field of that struct. The final x is the name of a variable. Also notice how we are allowed to make the struct member volatile. We can also add an arbitrary number of "const" or "volatile" keywords to the variable declaration. (Not that you'd really want to, though; beyond the first, they don't change anything.)

8) Goto labels and case statements

Goto is very powerful in C. Its labels are the only names with full function scope; everything else depends on bracing for scoping. Thus you can use goto to jump into places you might not expect, such as into the middle of a loop or an if block. The case labels of a switch statement are similarly flexible: they can sit inside other constructs nested within the switch body, which is what makes tricks like Duff's Device possible. An example of strange code using this flexibility is:
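A sketch (not necessarily the article's original example) with case labels placed inside a loop that is itself inside the switch:

int drain(int state, int limit)
{
	int n = 0;

	switch (state) {
	default:
		while (1) {
	case 1:
			n++;
	case 2:
			n++;
			if (n >= limit)
				break;	/* leaves the while loop, not the switch */
		}
	}

	return n;
}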
Note how the "break" keyword isn't so flexible. It always binds to the innermost enclosing loop or switch; in this case, to the while loop rather than the switch statement.

9) The conditional operator

The conditional operator allows you to convert an if statement into an expression, which can simplify some code. However, what happens when the two alternatives have different types? As you might expect, incommensurable types are not allowed. Arithmetic types get converted to the "larger" type, as in other binary operations. The subtle cases occur when you have pointers to types that are qualified differently. The result is typed with all the qualifiers of both alternatives, even if one alternative is "impossible":
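For example (x and y are illustrative names):

void example(void)
{
	const int x = 1;
	int y = 2;

	/* the condition is always true, so only &y is ever selected,
	   yet the expression's type is const int * */
	*(1 ? &y : &x) = 3;
}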
The above doesn't compile. Even though we are technically only accessing y, the type of the conditional expression is a pointer to a constant integer; the constness leaks in from the definition of x. This can affect some macros which might otherwise be nicely optimized away. You can, of course, make the types equal by using casts. The comma operator is also sometimes useful in this situation.

10) Array Magic

What does the following do?
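Something along these lines (reconstructed to match the description below, so it may differ from the original snippet):

int a[10];
a[0] = 0;
a[0][a] = 1;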
The first line is obvious: it declares an array that holds ten integers. The second line is also clear: it sets the first such integer to zero. The tricky bit is the last line. It uses the fact that in C, x[y] is the same as *(x + y), which equals *(y + x) and thus y[x]. What it is "really" doing is:
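a[a[0]] = 1;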
Which is quite a bit clearer. However, that isn't the point of this entry. What does that line do? Well... obviously, since a[0] is zero, it sets a[a[0]], which is a[0], to be equal to 1. Right? Nope. The above is technically undefined behavior in C99. (It is fixed in C11 to do what you might expect, though.) Why is it undefined? The tricky bit is that we are modifying an object at the same time as we are reading it. The expression is similar to the more obviously wrong:
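i = i++;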
where we are adding one to i at the same time as using it as the destination of the assignment. The problem is that in the expression a[a[0]] = 1, there don't seem to be two different modifications happening. So what is going on? Well... the issue is that in C, the [] array access operation isn't a sequence point. That means that even though you might expect some ordering between the calculation of which array member to use and the modification of that array entry, there is no such explicit constraint in C99. In short, we evaluate a[0], and then use that evaluation to work out which a[] location to modify. However, that evaluation can occur _after_ we have done the modification. (Which seems to make no sense.) However, there are cases where this weird atemporal strangeness matters. Imagine some hardware where memory reads are destructive: on such a device, every value read must later be written back to memory to avoid problems. What happens with the above construct? Well... we read a[0]. We then write 1 to that location. We then re-write the old value into a[0], because the C compiler "knows" that no concurrent modification has occurred. The result is that a[0] stays zero. Oops. Another case is when your compiler has a very good optimizer. It can notice that a[0] cannot be modified by that line, and thus cache its value. If a later line assigns a[0] to some other variable, then the compiler can optimize that assignment into a set-to-zero. Undefined behavior is a strange beast, and with optimizations its effects can be wide-reaching.

Summary

C has some dark corners. Most of them aren't really relevant to most programmers. However, if you deal with security you should know about the integer overflow issues. Low-level programmers should know about pointer arithmetic and its interaction with address spaces. Of course, if you are interested there is always more to learn. How many did you know about?
Comments
Resuna said...

And dealing with objects larger than half the address space used to be reasonably common, back when there was barely room for a decent-sized edit buffer even in a split-I&D executable on a PDP-11. You just have to be careful.
And speaking of evil and wicked, what ANSI C did to typecasts of signed values was scary. It used to be saner in most implementations.
Truncation towards zero has only become required behavior in C99/C++11.
Wouldn't the following be more readable?
---
void foo(unsigned int len) {
char buf[LEN];
char *buf_end = buf + LEN;
if (buf + len >= buf_end) errx(1, "Buffer Overflow!\n");
// ...
}
Unfortunately, function foo might not work correctly. If the len argument is too big, then the addition can overflow and the inequality will not hold. The obvious fix is something like:
---
I've observed that making code snippets more 'real-life' makes the reader more happy and article more interesting. E.g. rewriting the above into sth like the following:
---
char lookup_character(unsigned int idx) {
char buf[LEN] = { /* ... */ };
char *buf_end = buf + LEN;
if (buf + idx >= buf_end) errx(1, "Buffer Overflow!\n");
return buf[idx];
}
Unfortunately, lookup_character may fail if the idx (...)
--
Mind that I'm not trying to be picky, just giving you feedback so you can write more articles that I can read with more joy on my face :-) Peace.
If you want the increment first you do i = ++i;
i=i++ would not be sure to do the increment after the assignment. It is only sure to do the increment after reading the value for assignment. The actual assignment _may_or_may_not_ come later.
/* assume "buff" is declared as an array, not a pointer; it matters
* not what the type of buff[0] is
*/
if (idx < (sizeof buff / sizeof buff[0]))
{
/* buff[idx] is defined */
}
else
{
/* idx is out of bounds as an index for buff[] */
}