GCC Inline ASM

GCC has an extremely powerful feature where it allows inline assembly within C (or C++) code. Other compilers allow verbatim assembly constructs to be inserted into object code. The assembly code then interfaces with the outside world through the standard ABI. GCC is different. It exposes an interface into its "Register Transfer Language" (RTL). This means that gcc understands the meaning of the inputs and outputs to the fragment of assembly code.

The extra information gcc has allows it to carefully choose the registers (or other operands) that define the interface. The ones chosen can vary depending on the surrounding code. In addition, gcc can be told which registers will be "clobbered" by the assembly code. It will then automatically save and restore them if required. This contrasts strongly with other methods, where inlined assembly code needs to manually do this saving and restoring. (Even when the surrounding code is such that it isn't needed.)

The result is that commonly a piece of gcc inline assembly will compile into a single asm instruction in the executable or library. (Often you just want access to a single instruction not exposed by C.) However, to do this, you need to understand how to craft the constraints told to the compiler. If they are incorrect, then subtle bugs can result.

Constraints

The above shows several features of gcc's interface. Firstly, the asm code is a compile-time C constant string. You can put anything you like within that string. GCC doesn't parse the assembly language itself. What it does do is use escape sequences (i.e. %0 in the above) to reference the interface described by the programmer. In this case %0 corresponds to the zeroth constraint, which in turn is described after the colon.

That constraint "=r" is an output-constraint (due to the use of the '=' symbol), and consists of a general-purpose register (due to the use of the 'r' symbol). The resulting output is then stored into the variable within the parentheses, 'out'.

The result is a magic bit of code that somehow materializes a value, and then stores it into the variable 'out'. GCC doesn't understand where the value comes from. So in turn, it doesn't know that the variable 'var1' is actually used unless you tell it explicitly by the used attribute. (An unused variable can be elided from the executable object as a simple optimization.)

When the above is put inside a .c file called gcc_asm.c, and then compiled, the result is:

The standard ABI on 64-bit x86 machines is to return integers in the %eax register. GCC picks this for the register chosen to contain the variable 'out'. Thus the resulting function actually only consists of two instructions. (The above has a whole lot of asm directives describing unwinding and debug information in addition, but that doesn't appear in the straight-line code.)

See how gcc has replaced the '%0' in the asm string with the register it picked for the zeroth constraint. If there were more constraints, we could use '%1', '%2' etc. for them in the asm string. Old versions of gcc stopped at '%9'; current versions allow up to 30 operands in total.

The above describes how to get information out of a fragment of inline assembly code. So what about the reverse, getting information in? An example function that does that looks like:

The above looks very similar to the first function. However, it has two more colon-delimited parts to the asm intrinsic. The first of these is again the asm string. The second, for the outputs, is blank in this case. This function has no outputs. The third section is an input constraint. Notice that the '=' symbol is missing. (It's an input, not an output.) What remains is the 'r', describing that this asm code wants that input stored in some general register. Finally, the asm code ends with a 'memory' clobber. This tells gcc that it writes to arbitrary memory.

One other difference from the other function is that the asm fragment has an extra 'volatile' keyword. GCC needs to know if it is allowed to elide the perhaps useless asm which may not interact with anything else. The 'volatile' tells gcc that it shouldn't be removed. (Current versions of gcc treat an asm with no outputs as implicitly volatile, but writing the keyword makes the intent explicit.) The 'memory' clobber tells gcc that it shouldn't move this asm across other memory references. (Otherwise our access to 'var2' might cross other reads or writes of it.)

It is possible to write inline asm without the 'volatile' keyword or the 'memory' clobber. However, be aware that gcc can optimize away a non-volatile asm whose outputs are unused, and can move an asm around if the 'memory' clobber is missing. If this happens when you don't expect it, the result will again be subtle bugs.

Which again is as small as possible. GCC picks the %edi register corresponding to the ABI register for the first parameter on x86_64. (If you want to find the exact code generated by the asm fragment, look for the areas surrounded by #APP, #NO_APP comments.)

Here, the input parameter is %1, and the output is %0. Note the AT&T syntax used by default, which has outputs on the right of the asm instructions. Intel format can be used, which swaps things around. However, most gcc inline asm you will see will stick to AT&T format, so you should get used to seeing it.

GCC has picked both input and output registers so that again the result is a single instruction.

A slightly more complex example is when you want something to be both an input and an output at the same time. For that, use the position of an output, and use a '+' symbol instead of an '=':

The above also shows how you should prefix immediates with a dollar symbol in AT&T syntax. It also has the 'cc' constraint. This stands for "condition codes". Since the add instruction will affect the carry flag amongst other things, we need to tell gcc about it. Otherwise it might want to split a test-and-branch around our code. If it did so, the branch might go the wrong way due to the condition codes being corrupted. Basically, any inline asm that does arithmetic should explicitly clobber the flags like this.

So now the input and outputs are one and the same register, %eax. However, since the parameter passed to the function is in %edi, gcc helpfully copies it into %eax for us. Only when the copying was really needed did gcc insert it.

Functions 5 and 6 attempt to do something similar to function 4. However, instead of returning a value, they call some other function called foo. This means that the output should be in the %edi register. However, the input will also be in that register. The result shows how gcc will assume that output and input registers are allowed to overlap unless you tell it otherwise.

func6() will not work correctly. gcc will pick %edi for both of 'out' and 'parm'. This will compile into:

To fix this, use the '=&' constraint. That tells gcc that the output constraint register shouldn't overlap an input register. Using that instead gives us function 5:

Which uses two registers, as required. It picks %eax for this, and inserts the extra copy needed.

You may have noticed that the multi-line asm used '\n\t' control codes. This simply makes the generated assembly look nice. You just need a newline '\n' to go to the next line. The tab character indents things to line up with the code generated by gcc from the rest of the program. (Remember that the inline asm string is basically inserted verbatim into the output sent to the assembler, modulo simple replacements.)

Another possibility is that you might want some inputs and outputs to share a register. As described above, one way to do that is to use the '+' constraint. However, there is another way. You can use the number corresponding to another constraint within a second constraint. If you do this, then gcc will know that the two are linked, and must be the same. An example of using this is:

This may, or may not be a more readable technique than using a '+' constraint. '+' used to be buggy in old versions of gcc, so old code tends to use this method. Newer code might want to use the more concise '+' descriptor.

In addition to passing information in registers, gcc can understand references to raw memory. This will expand to some more complex addressing mode within the asm string. Note that not all instructions can handle arbitrary memory references. Thus sometimes you need gcc to create a register with the required information. However, if you can get away with it, it is more efficient to use memory directly. Some code that does this looks like:

Notice how in the above, gcc has generated a %rip-relative addressing mode for us.

Sometimes you really want a constraint to be satisfied by a certain register. Fortunately, gcc has specialized constraints for many (but not all) of the general purpose registers used on x86_64.

The above code shows how you can explicitly use the 'a' register (which corresponds to %al, %ax, %eax, or %rax, depending on size). Note how we need to use a double-percent sign within the asm string. This is similar to a normal printf format string, where to print a single percent you need two of them. (This is due to a percent symbol being an escape character.)

GCC has copied from %edi into the constraint register defined by 'a', %eax for us. Note that different machines will have differing names, and differing constraint symbols for their registers. You will need to look at the gcc documentation for your particular machine to find out what they are. This article will concentrate on the x86_64 case.

The above is a little tricky. p3 is passed in within %edx as specified by the function ABI. This means that gcc needs to copy it into another register so that p1 can go there. Fortunately, gcc handles all of the marshalling for us:

Note the extra moves before the add instruction, and afterwards in order to get things where they need to be. This is the reason why you really shouldn't use explicit named registers if you can avoid them. The only time where they are unavoidable is if you want to match some kind of ABI, or have to interface with an instruction with fixed inputs or outputs.

An example of this on x86 is the mul instruction. That will put its output in the 'a' and 'd' registers, and always takes one of its inputs from the 'a' register. So to describe its use you might do something like:

The above uses another feature of gcc asm. Sometimes inputs commute, and we don't really care which of them uses a particular register. In this case p1*p2 = p2*p1, and we don't mind which of them goes in %eax. To tell gcc this, we can use the '%' constraint flag, which means that that constraint and the following one commute.

In this case, gcc decides not to swap the order of the two inputs because it doesn't matter.

We can try something slightly different, where we use the 'D' constraint to force the use of %edi as the multiplicand.

Unfortunately, gcc fails to make the swap in this case as well, even though it would be very profitable to do so. It looks like you can't really count on the '%' constraint specifier, which is a shame.

Generalized Constraints

There is another way to get more flexibility within the constraints. You can simply list more than one constraint symbol. GCC will choose the best one. An example of using either a register, or a direct memory reference is:

Another way of gaining flexibility is using a more general constraint. 'g' allows a register, memory, or immediate operand. Using it:

GCC will again pick the best option, which in this case is direct memory addressing modes.

Of course, if you want an immediate, there is a symbol for that as well, 'i'. The limitation is that an immediate must be a compile-time or link-time constant.

Notice how gcc automatically converts into the AT&T syntax for us, with the dollar symbol preceding the constant.

There are other constraint modifiers. One of these is the '#' symbol, which acts like a comment character.

Everything after the hash symbol is ignored. Unfortunately, you can't include spaces or punctuation symbols within the comment. The other thing that ends the 'comment' is a comma. This is because you can use commas to allow multiple alternatives in an inline asm. The alternatives are linked together (all first option, all second option, etc.) rather than being unlinked like in the 'rm' case. Some example code is:

The above shows the power of the technique. In x86 assembly language, there can only be a single reference to memory within an instruction. Thus if we use two 'g' constraints, we can sometimes generate invalid code. One fix for this is to use register-only 'r' constraints. However, they can lead to inefficiency. What we want to do is only ban the invalid option. By using alternative constraints, we select the valid 'm + r', 'r + m', and 'r + r' options.

Note that this feature isn't used very often within inline asm code, so it is a little buggy. The final inline asm in the above function, which is #defined out, should work. However, gcc gets confused by it. The fix is to add the 'r + r' option, like in the other cases.

Disparagement

Another possibility is when you want a constraint, but you don't want the compiler to worry too much about the cost of that constraint. This doesn't really come into play very often. In fact, with orthogonal architectures like x86, it may not happen at all. This is really a case of API leakage, where gcc offers to all machines a feature that may be useful on only some of them. The '*' constraint specifier causes the following character to not count in terms of register pressure. The canonical example is the following:

In the above we have an instruction (an add, in this case), which will either take two references to the same register, or a memory-register combination. The same-reg, same-reg case is more strict, and we would like gcc to use the memory-addressing version if possible. The '*' accomplishes this. However, this trick is rather subtle... and probably shouldn't be used with inline asm. The above compiles into:

A much better technique is to use constraint modifiers that explicitly penalize some alternatives over others. By using the right amount of penalization, you can create patterns that match the machine's costs. GCC will then be able to make intelligent choices about which is best. The simple way to do this is to add a '?' character to the more costly alternative.

The above shows how you can tell the compiler that (for example) %eax can be more or less expensive to use than %edx. It compiles into:

Of course, a single level of penalization might not be enough. You can add more '?' symbols. Two question marks is even more penalized than one.

For even greater penalization, you can use the '!' symbol. It is equivalent to 100 '?' symbols. This should be very rarely needed.

Clobbers

Up until now, we have only used the clobber part of the asm intrinsic for 'memory', and 'cc' (condition codes). However, there are other things you can put in there. The most often used are names of registers. This tells gcc that that register is somehow used in the asm string. It will not use that register for inputs or outputs, and will helpfully save that register before the asm is called, and then will automatically restore it afterwards.

The mul instruction will write to %rax and %rdx. We don't care about the upper part, so it isn't an output. To tell gcc about the register write, the clobber does the job. (Yes, there are other versions of the x86 multiply instruction that don't clobber %rdx unnecessarily, but this is just an example of how clobbers might be useful.) This compiles into:

In this case, %rdx is 'dead' because it is a parameter-register in the ABI. GCC doesn't need to save or restore it, so doesn't. Without the clobber, we would need to save and restore the register manually. That would be inefficient in cases like the above, where such saves and restores are not needed.

The above is bad coding style. You really shouldn't use control-flow altering instructions inside inline asm. GCC doesn't know about them, and can do optimizations that invalidate what you are trying to do. (If 'foo' is inlined everywhere, it may not even exist to call.) Also, there have been many bugs when the number of clobbered registers gets too large. If gcc can't find a way to save and restore everything it may simply give up and crash.

In the above case, we are lucky, and it compiles without issue. The trick is to notice that the clobbered registers are all dead (except %rdi) due to the x86_64 SYSV ABI.

A much better technique is to use explicit temporaries. GCC can then allocate them wherever it has space. It can also move things around for more efficiency, based on the needs of surrounding code. An example of doing this is:

In the above, we use two temporary registers. Since we don't want them to overlap the other inputs or outputs, they need to be defined by '=&r' constraints. The only thing left on the clobber list is the 'cc' due to the arithmetic and logic instructions altering the condition codes.

Finally, there is another way to name registers within the asm string itself. Depending on your point of view, the numerical '%0-%9' scheme may be more or less readable than the following:

By putting a name within square brackets in the constraints we can then use those names in the asm string. Note that the asm operand name does not have to be the same as the C variable it comes from. However, for readability, it may be better to keep the two the same if possible. The main disadvantage of the technique is that it can make the asm string a little longer, and can make it harder to see what addressing modes are used.

Less Common Constraint Types

One of these is 'o', for "offsettable memory", which is any memory reference which can take an offset to it. In the orthogonal x86 architecture, this is anything that 'm' could reference, so this constraint class isn't too useful there. Other machines may be different though. An example of its usage is:

The linker and assembler understand the more complex addressing within "out.2398+4(%rip)", and will generate the appropriate fix-up for us.

Since some machines have offsettable memory as a separate class from normal memory constraints, there is some memory which is not offsettable. If you want to have a constraint that references such memory, you can use the 'V' constraint flag. However, since x86 doesn't have such a beast, we don't provide an example of its use.

Some machines provide memory that automatically increments or decrements things stored within it. Such memory can be described by the '<' and '>' constraints. Again, x86 doesn't have anything like that, so those constraints are not supported, and no example is provided.

Another constraint that isn't so useful on x86 is 'n'. That refers to a constant integer that is known at assembly time. Some machines have less capable assemblers and linkers, and cannot use the more general 'i' constraint. 'i' is an integer constant known at link time. Since 'n' defines a sub-category of 'i', you can also use it on x86:

Another integer immediate constraint type is 's'. This describes an integer that is known at link time, but not compile or assembly time. This isn't particularly useful on x86, but on other machines can lead to optimizations.

Not all immediates are integers. Some machines allow immediate floating point numbers. The 'E' constraint is for floating point immediates that are defined on the compiling machine. If the target machine is different, then the bit-values may be incorrect. Thus, this constraint shouldn't be used if you are cross-compiling.

The x86 architecture really doesn't allow floating point immediates. You should get constants into SSE registers and the legacy floating point stack from memory instead. However, there are a couple of special cases that still work:

The above uses the bit-pattern for the double '2.0', and indirectly moves it into an SSE register (defined by the 'x' constraint). It would be more efficient to do a direct memory load, but the above does work:

In addition to the 'E' constraint is the 'F' constraint. This is cross-compiling friendly, and should probably be used instead. Otherwise, it has the same meaning as its 'E' cousin.

Another rarely used constraint is 'p'. It describes a valid memory address. On x86, it behaves just like 'm' does. You should use the more standard 'm' instead.

There is one final constraint common to all machines, 'X'. This constraint matches absolutely everything. This catch-all doesn't give gcc any information about how to pass the information to the inline asm, so gcc picks the form most convenient for it. Since the exact output will be highly variable, it is difficult to use in normal asm instructions. However, it may be helpful in asm directives:

This creates a zero-terminated ASCII string containing the operand used by gcc. With a bit of section magic, it obtains a pointer to it, which is then returned in the output.

X86 Register Constraints

Most of the previous constraint types will work on all machines. Some have been x86-only though. For example, 'a', which will expand to '%al', '%ax', '%eax' or '%rax', will obviously not work the same way on another architecture. We have seen a few of these x86-only constraints, but there are many more.

A simple register constraint is 'R'. This selects any legacy register for use. i.e. one of the a,b,c,d,si,di,bp, or sp registers. This may be useful when interfacing with old code unable to use any of the new 64 bit registers. Otherwise, the constraint acts just like 'r' would do:

The above cannot use p5 as is because it is passed in %r8 by the ABI. Thus gcc will insert a move instruction into a legacy register as requested. This copy wouldn't happen if 'r' were used instead:

Another constraint that picks a subset of the available registers is 'q'. This picks a register with an addressable lower 8-bit part. The list of available registers differs between 64-bit mode and 32-bit mode. In 32-bit mode, some of the registers don't exist. i.e. you can't access %dil or %sil.

A variant of the above is the 'Q' constraint, that picks a register with a 'high' 8-bit sub-register. i.e. any of the a, b, c or d registers:

Notice how the compiler was not allowed to use the %edi register as the operand any more. Instead, it picked %edx.

As we have seen in the earlier sections, some of the x86 registers have constraints of their very own. We have seen 'a' and 'd'. Similarly, 'b' and 'c' do what you might expect, and refer to the '%bl', '%bx', '%ebx', and '%rbx' registers, and the '%cl', '%cx', '%ecx', and '%rcx' registers respectively. An example of this might be:

Where every input has had its register manually defined by an explicit constraint. GCC needs to do a little bit of copying to get everything into the right spot:

There are also special constraints for the si and di registers, 'S' and 'D' respectively. (We have used 'D' before in func13().) Something using them looks like:

There is one final way to access the general purpose registers, which is via the 'A' constraint. This is the two-register pair defined by the a and d registers. This is useful when you want to deal with 128-bit quantities in 64-bit mode, or 64-bit quantities in 32-bit mode. The low bits are stored in the a register, and the high bits in the d register, just like the multiply and division asm instructions expect. Its use looks like:

Since the ABI requires a function returning a 128-bit integer to do so in %rax and %rdx, the above has no extra register to register copies. (Other than that required to get the multiply instruction initialized.)

X86 Floating Point Constraints

The x86 has a strange floating-point coprocessor which uses an internal stack of registers. Dealing with this is difficult with gcc. You need to make sure that the right number of values are added and removed from this stack. GCC assumes that all output constraints are under its purview, and are popped by it. Input constraints are more complex: they can either be popped by gcc afterwards or not.

The least complex method is to tie an input constraint to an output. That makes it popped afterwards with the output that replaces it. You can also clobber an input to make it assumed to have been implicitly popped. Otherwise, gcc will assume it can use the input later for other calculations, and will handle the popping of that register itself.

One critical detail is that the floating point processor acts on a stack. That means that the used (popped or not) registers must be contiguous. It's not possible for gcc to re-arrange the stack by popping something in the middle. You need to make sure the outputs are first in the stack, followed by all registers you pop, and finally followed by the ones gcc will pop from that stack.

The constraint for the top of the floating point stack is 't'. We can add things to the stack without a floating point register input by using memory instead:

The ABI mandates that long doubles are returned in st(0), so the above routine doesn't need to alter the stack.

The next-from-top floating point register, st(1), also has a special constraint: 'u'. An example of its use might be:

Note how in the above we link the first input to the output, so it is stored in st(0), and popped by gcc afterwards. The other input is in st(1), and since it is not clobbered, will also be popped by gcc afterwards.

You can see how gcc sets up the floating point stack (in a not particularly efficient way). You can also see how the st(1) input is cleaned up afterwards by the fstp instruction. st(0) is still live at the end of the function, and is used for the long double output.

Finally, you can create an input in an arbitrary floating point slot by using the 'f' constraint. (This doesn't work as an output constraint.) An example of this is:

Where just to be different from the previous function, we use an in-out parameter on the top of the stack.

Again the code generated has one more fxch than is needed. You really shouldn't use the legacy floating point instructions. Instead, modern code should use SSE instructions for its floating point work.

Another legacy part of the x86 instruction set is the set of mmx registers. These are aliases of the legacy floating point stack. This means that they are difficult to use because you need to use the 'emms' instruction afterwards to avoid floating point exceptions. However, some older vectorized code does use them. The constraint for their use is 'y':

The above is obviously very inefficient, as gcc goes through the better SSE registers as mandated by the vector ABI. Another thing missing is the emms instruction. You'll need to use yet another inline asm in order to add it where needed. A better option is to avoid these registers if possible.

Instead, most modern code should be using the 16-byte SSE registers. The constraint for accessing those is 'x'. (This was also used in func29.) Since the ABI is much more compatible, the overhead is lower:

Many fewer instructions are used in the above, with the bulk of the function just a single SSE instruction.

The final register constraint type is defined by the two-character string 'Yz'. This constrains to the first SSE register, %xmm0. This is useful because that register is often mentioned by the ABI. It is the first floating point or vector parameter passed to a function, and also the register used for floating point or vectorized output. Using it is easy:

Here we deliberately cause gcc to have to swap the SSE registers around in order to get p2 into %xmm0:

X86 Integer Constraints

In addition to the machine-specific register constraints, the x86 inline asm in gcc also supports special integer constraints. Most of these are actually not useful for inline asm - being 'leakage' from the RTL pattern-matching used by the optimizer. They still can be used, although this is not recommended as these are not really documented.

The first of these is relatively useful. The 'I' constraint specifies a constant integer in the range 0-31. It is useful for 32 bit shift instructions:

Similarly, there is the 'J' constraint which specifies a constant integer in the range 0-63 for 64 bit shift instructions:

The above two constraints are helpful in that gcc will error out if the constants are the wrong size. This extra error-checking can prevent bugs.

Perhaps less useful is the 'K' constraint. This specifies a signed 8-bit integer constant:

On the other hand the 'L' constraint is obviously something that has escaped from RTL-land. It only allows the two integers 0xFF or 0xFFFF. It basically is a method of pattern-matching certain zero-extending constructs. Since you can't alter the asm string based on register matches, this constraint is barely useful. Of course, it still can be used:

Another not so useful constraint is 'M'. This specifies integer constants from 0-3. This is useful for RTL pattern-matching shifts that may otherwise be better done with an lea instruction. However, again the result is something not so useful for inline asm. You probably shouldn't use it. However, if you do, it may look something like:

The next integer constraint is 'N'. This one specifies an unsigned 8-bit integer constant. It is useful for the I/O instructions 'in' and 'out':

The addition of 64-bit support to gcc meant that constraints needed to be added to support it. Since most instructions do not support 64 bit immediates, we need to differentiate from 'i' which will allow such large integers. Instead, you can use 'e', for a constraint for a constant 32-bit signed integer:

Similarly, there now is also a constraint for 32-bit unsigned integer constants, 'Z':

Finally there are two floating point constant constraints that you probably shouldn't use at all. These are used by gcc for optimizations. The first of these, 'G', will match a constant that can be easily generated by the i387 by a single instruction. However, since the resulting operand cannot actually be used by floating point instructions, there is very little point in using it in inline asm:

Where in the above we use the same trick as used with the 'X' constraint, and simply convert the operand into a string. The resulting code after compilation is:

The other floating point constraint is the equivalent for SSE registers, 'C'. Since there are fewer constants constructible from a single instruction, this is even less useful:

X86 Operand Modifiers

The use of constraints doesn't fulfil all the possible things you might want to do in an inline assembly statement. The problem is that the operand %0 might not be in quite the form you want. For example, you may want to access a sub-register of %0, or use a different addressing mode that perhaps requires some slightly different formatting than the default. Fortunately, gcc offers operand modifiers that allow doing these changes.

Operand modifiers work by inserting a symbol between the percent sign and the number for the operand (or its square-bracketed operand name). By using different modifiers, you can get different effects. However, many of the modifiers are really designed for RTL usage, so aren't helpful in inline asm mode.

The simplest modifier is one that just outputs the character 'b' (for byte-sized accesses) if the compiler is in AT&T mode. This helps in writing asm strings that can also be parsed in Intel mode, which requires unadorned instructions. Use the 'B' symbol to do this:

Note how 'mov' gets changed into 'movb'. This particular operand modifier doesn't really depend on the operand itself.

There are other versions of this for the 16-bit and 32-bit cases. 'W' will generate a 'w', and 'L' will create an 'l':

Unfortunately, this pattern does not continue into 64 bits. The 'Q' modifier outputs an 'l', rather than the 'q' you might expect. Perhaps this is due to the fact that most instructions cannot take a 64-bit immediate. An example of using it is:

Finally, there are two other character-printing modifiers. 'S' creates an 's', and 'T' makes a 't'. These are less useful, corresponding to legacy floating-point use. Of course, since the output is a raw string, you don't actually have to use them for that... and other sillier usages are possible, as is shown below.

Of course it goes without saying that such tricks should be avoided in real code.

Another operand modifier, 'l', tells gcc that the operand is a label. This is used in the "asm goto" extension. Labels are listed after the clobber list, and can be referred to inside the asm string. Such asms should not have any outputs. They are designed for control flow usage.

The problem is that there is no real way to get condition code information into and out of an inline asm statement. The asm goto method avoids this problem by letting the user do the branching inside, and thus all condition usage is encapsulated. Other gcc optimizers can then deal with the jump labels and move them around as needed. The result can be very efficient code. An example using it is:

If this function gets inlined inside an if statement, then the extra statements that set the output will be removed by optimizers.

The above modifiers didn't really change the output of the operands. However, the following do. The 'a' and 'A' modifiers deal with addresses. They are helpful when allowing compilation in Intel mode. They modify the operands in the correct way so that dereferencing is written in the right syntax. An example of their use is:

Note how 'a' added brackets around the register name, and 'A' added an asterisk in front.

The 'p' modifier is similar. It modifies an operand to be a raw symbol name. For constants, it removes the leading dollar symbol. This is useful because in some contexts a dollar symbol is incorrect syntax. For example, in segment-offset addressing:

The 'P' modifier does a little more work; it removes things like '@PLT'. This is helpful if you are creating something like a dynamic linker, where you need to do inline asm before relocations have been calculated:

The 'X' modifier is similar to 'P'. It outputs a symbol name with a prefixed dollar symbol. It is useful for symbolic immediates:

Compare with the output from the 'P' modifier. Basically, these symbol modifiers are only useful if you are playing with linker tricks. Usually, the default behavior from the 'm' or 'g' constraint is what you want. Only when you absolutely need some other form of linkage are they needed.

Occasionally, you may want to use a differently sized sub-register based on a given constraint. Without operand modifiers there is no way to do this. The given asm string for a register name may be completely different. Compare %rax to %eax, versus %r8 to %r8d. Fortunately, gcc provides ways of accessing all possible registers based on a given constraint.

The 'b' operand modifier gives you the 8-bit register related to a given register operand. (For those registers that have two 8-bit sub-registers, it picks the low one, i.e. %al, not %ah from %eax.) Code using it looks like:

The above takes the bottom 8 bits of the 32-bit integer parameter, and sets the corresponding bits of the 64-bit output:

There are, of course, other sized sub-registers. The 16-bit operand modifier is 'w':

The above relies on the x86-64 rule that a 32-bit instruction writing to a 64-bit register clears the upper 32 bits. The asm looks like:

Of course, we still may want to access the other 8-bit "high" sub-register. The 'h' operand modifier allows this:

Note how we had to use the 'Q' constraint to make sure that the high sub-register existed. The resulting code chooses the %edx and %dh registers for this:

Somewhat related is the 'H' operand modifier. This allows you to access the high 8-byte part of a 16 byte SSE variable in memory. It adds 8 bytes to the offset in the memory access. This effect can of course be simulated manually.

Operand modifiers can also help with inline asm statements that deal with constants. The main issue is that in AT&T syntax, you may need to add a suffix to an instruction to tell the assembler what size of instruction to use. In Intel syntax, this suffix should not be there. The other problem is that flexible code may need to accept many possible instruction sizes. The 'z' and 'Z' modifiers help here. They print the correct suffix for a given register size:

Notice the 'l' in the 'movl' instruction has been added for us. The 'Z' variant is similar:

The difference between 'z' and 'Z' is that 'Z' is more flexible. It works with floating-point registers as well as just the integer ones. Unfortunately, neither modifier will work with constant asm constraints, just register constraints.

Sometimes you may want to write accesses to the top of the legacy floating-point stack slightly differently. The 'y' modifier converts 'st' into 'st(0)':

'n' is a weird operand modifier. It negates the value of an integer constant. It also suppresses the leading dollar sign:

Another strange one is the 's' modifier. It prints out an integer constant, followed by a comma. It does not suppress the leading dollar sign:

The next set of modifiers helps asm using AVX instructions. The 't' modifier converts an SSE register name into its AVX equivalent:

The reverse is implemented by the 'x' modifier, which converts an AVX name into the SSE version:

Also potentially useful for AVX code is the 'd' operand modifier. This is documented to duplicate an operand. Since the fused multiply-add instructions come in three and four operand variants, it would be convenient to be able to support both from the same code-base. Using duplicated operands would help somewhat. Unfortunately, simple usage of 'd' with AVX registers leads to internal compiler errors with the current version of gcc (4.7.1), so this modifier should be avoided for now.

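Other modifiers to be avoided are those dealing with condition codes. There is no way for inline asm to input a condition code operand type. (They are generated from RTL, however.) So you shouldn't use the 'C', 'f', 'F', 'D' and 'Y' modifiers, nor 'c' on anything other than a constant. (Applied to a constant operand, '%c' is a generic gcc modifier that prints the value without the leading dollar sign.)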

The one remaining modifier is 'O'. It isn't particularly useful. It prints nothing if sun syntax is off. (The default.) Otherwise it prints 'w', 'l' or 'q', helpful for cmov instructions, which are slightly different in that asm dialect.

Special Operands

In addition to operands specified by the constraints, there are a few others. The first of these we have seen before. '%%' will print a single percent sign. This is helpful for writing asm registers explicitly within the output string. The '%%' behavior is the same as that for the printf() function, so it is easy to remember. func10(), above, shows its use.

The '%*' operand prints an asterisk if you are using AT&T assembly output. Otherwise, nothing is printed. This is helpful for portability:

Again, you probably shouldn't use control flow instructions like that in inline asm, since gcc will not understand them. However... sometimes you might just need to, and tricks like that often help.

The '%=' operand prints a unique numeric identifier within the compilation region. This is helpful for constructing a unique symbol from within an inline asm. Perhaps __LINE__, or local symbols should be used instead though. For example:

Where in this particular case, it expanded to "820". Note that since you can construct a symbol name with a given pattern, this trick may be helpful for debugging.

The '%@' operand expands to the thread TLS segment register. In 32-bit mode, this is %gs. In 64-bit mode, %fs. If you are writing low-level thread library code, this may be helpful for portability.

The '%~' operand expands to 'i' if avx2 is available. Otherwise it expands to 'f'. I don't know why this could be useful.

The '%;' operand expands to ';' if gcc was built for certain buggy versions of the GNU assembler that need a separator after an instruction prefix such as 'rep' or 'lock'. Otherwise, it expands to nothing. However, these days, binutils is most likely modern, so you don't have to worry about this.

Finally, there are two more operands that are not useful from inline asm. The '%+' operand is designed to add branch-prediction prefixes. However, inline asm can't give the information it needs. The '%&' operand expands to the name of a dynamic tls variable used within the function the inline asm is invoked in. This is used internally within gcc to get thread local variables to work correctly. You shouldn't need to use it in inline asm code.

Other Tricks

Another interface to assembly language within gcc is register variables. GCC has an extension that lets you assign which particular register a variable may use. An example of this is:

Where we would like the input parameter p1 to be stored in %r10, before being copied into %eax for output. Unfortunately, reality isn't so kind:

GCC ignores our request, and instead optimizes the extra moves away. You might think you could use a volatile specifier on the variable to make loads and stores to it more explicit. This doesn't work either. In fact, there is a warning "-Wvolatile-register-var" for this broken usage. In light of the fact that asm register variables are held captive to the whims of the optimizer, they should perhaps not be used. It is difficult to make sure they will have the behavior you might need.

A final trick is that it is possible to insert asm at top-level within a C source code file. Normally, asm with operands has to be inside a function; only basic asm, with no operands, is accepted at file scope. However, we can use the fact that the section attribute is inserted verbatim into the output. Since we can embed newlines, we can put anything we like there. The only constraint is that the input must be a constant C string:

The above creates a function called func85() within the section attribute. The 'used' attribute is there to make sure that the variable func85a is not removed. The result is that func85 is inserted into the object code manually:

A similar version of this trick allows variables to be put into elf sections that are not '@progbits'. Simply add the section details you want, and then end them with a comment '#' character. The comment will remove the unwanted details gcc adds as a suffix.

Comments

Jike said...

Great! The best article I have ever read about gcc inline assembly:)

airc said...

true, its the best gcc-asm article i've ever seen , there is some thing i was hoping to find here is : how to use gcc asm with classes

thanks for artile

AHS said...

Wonderful!!! Really it's the best documentation in gcc line assembly I've ever read. Please write a book ;-)

Thanks a lot!!!

said...

Best resource on the net.

a said...

Register variables work fine, *if* you remember to use them as inline assembly operands, like so:

int func84(int p1) {
  register int out asm("r10") = p1;
  asm volatile("" : "+r"(out));
  return out;
}

The generated assembly:

_Z6func84i:
        movl %edi, %r10d
        movl %r10d, %eax
        ret

awakening said...

static __attribute__((used)) int var2;
void func2(int parm)
{
/* Register input - volatile because has no outputs, writes to memory */
asm volatile("mov %0, var2" : : "r" (parm) : "memory");
}
make fail:
/tmp/cc7iWbmV.s: Assembler messages:
/tmp/cc7iWbmV.s:34: Error: immediate expression requires a # prefix -- `mov r3,var2'

Philias said...

I guess thats pretty much comprehensive, great article :) (and sad to see the comment section being abused for useless spam ...)

Gordon said...

Here is some of my C code that I converted to assembly after
reading your article. It works, but maybe can be better. All
local variables are ints. Any suggestions for improvement?

//d += (((x - y--) << 2) + 10);

asm (
"movl %1, %%eax;"
"subl %2, %%eax;"
"shl $2, %%eax;"
"addl $10, %%eax;"
"addl %%eax, %0;"
: "+r"(d)
: "r"(x), "r"(y)
: "cc","%eax"
);
//y decrement is done here.
asm ("dec %%eax;" : "+a"(y) :: "cc"); //force into eax

Should y decrement be done differently? Thanks.

said...

This remains the best online tutorial for (x86-64) gcc inline asm. I agree that the comment section needs purging, though not necessarily disabling. Anything to keep this resource up to date with most recent gcc release would be worthwhile.

SEMA said...

This was an excellent article. However it says: "Without the clobber, we would need to save and restore the register manually"... How??? More generally how do you use the stack in inline assembly without breaking the exception mechanism all to hell? We need an entire new article just about that :-)

SEMA said...

In x86 (32 bit) mode:

unsigned long long Test = 0x123456789ABCDEF0U; size_t TestO;
asm("MOV %1,%0" : "=&r" (TestO) : "r" (Test):);

Produces:
T64 Test = 0x123456789ABCDEF0U; TPtr TestO;
  424fc9:c7 45 e8 f0 de bc 9a movl $0x9abcdef0,-0x18(%ebp)
  424fd0:c7 45 ec 78 56 34 12 movl $0x12345678,-0x14(%ebp)
M:/Misc/SEMAAsmUtil.cpp:1846
    asm("MOV %X1,%0" : "=&r" (TestO) : "r" (Test):);
  424fd8:8b 45 e8 mov -0x18(%ebp),%eax
  424fdb:8b 55 ec mov -0x14(%ebp),%edx
  424fde:89 c1 mov %eax,%ecx
  424fe0:89 4d e4 mov %ecx,-0x1c(%ebp)

You can see that the compiler actually uses 2 registers for %0, but %0 is substituted with %eax is there some way of referring to %edx (or whatever register the compiler is in the mood to use)?

Seth said...

int var9;
int func9(void)
{
int out;
asm ("mov %0, %1"
: "=r" (out) : "m" (var9));
return out;
}

shouldn't the %0 and %1 be reversed if out is to contain the int at var9?