X86 Instruction Wishlist
The x86 instruction set has evolved over time. Originally, the 8086 was a 16-bit machine; 32-bit support was added later. Recently AMD extended the general purpose registers to 64 bits, and Intel has added 128-bit vector (SIMD) registers, which will be widened to 256 bits when processors with the AVX extensions are released. As time has gone by, the number of instructions has increased dramatically, because the flood of newly added instructions has greatly outweighed the small number that have been removed.
The net result of all this innovation is a rather complex encoding of the work the processor needs to do. Some instructions may be up to 15 bytes in length, whereas others require only a single byte. In general, the more complex and rarer the instruction, the longer its encoding tends to be. In effect, this acts like a Huffman encoding of the processor's list of work, shrinking the amount of memory required to store it. Since cache is expensive on the processor die, this saves money and increases performance.
Through the trials of history, the original CISC instruction set has been whittled down to a fairly orthogonal and lean RISC-like core for the general purpose instructions. A fair number of the original complex instructions have been deprecated, and most compiler-optimized code these days avoids their use. Conversely, the relatively recently added SSE extensions have introduced many decidedly non-orthogonal instructions. These new instructions tend to be designed for specific tasks like video encoding or encryption. Unless you explicitly use compiler intrinsics, the more complex of these instructions are unlikely to appear in your compiled code.
The non-orthogonality of the SSE instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 etc.) poses a problem when optimizing code. Fairly often it is quite difficult to conceive of the best set of instructions to perform a given task. If the exact instruction you need isn't implemented, but a slightly different variant is, then you are out of luck.
So what improvements could be made to the instruction set? The following is a "wish-list" of useful new instructions, together with changes that would make everything more orthogonal.
It would be really useful if the processor had an instruction that returned random bits. This could be used to initialize cryptographic keys and user-space pseudo random number generators, and would alleviate the problems of trying to get random bits on an embedded machine with very few sources of I/O.
The most obvious way of doing this has subtle problems, which is probably why this instruction has not been implemented in the processor itself. (Some chipsets do have RNGs attached though.) This "obvious" method is to use the amplified thermal noise from a resistor. Basic physics says that this noise should be a good source of entropy. However, such a resistor will also pick up interference from other sources, and its self-inductance will introduce correlations into the output. The hardest part of creating a good random number generator is removing those correlations.
Imagine you have used the above technique to create a good RNG on your processor chip. Now comes the time to shrink your design down to the next fabrication feature-size. Unfortunately, you will need to redo all your testing to see if the statistical properties of the randomness still hold. (The RNG isn't like a normal digital circuit, and will have unpredictable scaling properties.) This increase in certification time is problematic.
However, there is another way to generate good random bits. Construct a circuit that cryptographically hashes the output from a counter that is updated every clock cycle. By giving each processor a different initialization vector for the hash function, each processor will yield a different sequence of bits. The trick is to make sure that this hidden key cannot be reverse engineered by any code running on the machine. This prevents the privacy problems of having a readable unique serial number for each machine, which occurred in Pentium III systems.
By pipe-lining the encryption process, a random number per clock cycle could be obtained. Since the hash is cryptographically secure, it will have good randomness guarantees. It will also not require the expensive recertification on process shrinks.
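A sketch of the construction (everything here is hypothetical: the mnemonic is invented, and the first part describes hardware behaviour, not software):

```
; conceptually, every clock cycle the hardware computes:
;     out = HASH(chip_key, ctr)    ; pipelined cryptographic hash,
;                                  ; chip_key is a per-chip secret
;     ctr = ctr + 1                ; free-running counter
;
; and a wishlist instruction simply reads the latest output:
getrand eax                        ; hypothetical: eax = next 32 random bits
```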
The conditional move instructions (cmov) are extremely useful for removing unpredictable branches. However, they are not as orthogonal as they could be: unlike most arithmetic instructions, there is no form that takes an immediate operand.
For example, it is fairly common in C to require that the output from some calculation be converted into a boolean result. C treats any non-zero value as true, and zero as false. Further calculations may require just the values zero and one instead. To do this on current machines requires three instructions:
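One possible three-instruction sequence (register names are illustrative):

```
test  eax, eax      ; set ZF according to the input value
setne cl            ; cl = 1 if the input was non-zero, else 0
movzx ecx, cl       ; zero-extend the byte into a full register
```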
or, using just a single register:
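A sketch of that variant:

```
test  eax, eax      ; set ZF according to the input value
setne al            ; overwrite the low byte of the input register
movzx eax, al       ; zero-extend in place
```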
If an immediate conditional move existed, then only two instructions would be required:
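A sketch of how this could look; note that cmovne with an immediate operand is hypothetical and does not assemble today:

```
test   eax, eax     ; if eax is zero it already holds the result;
cmovne eax, 1       ; otherwise set it to 1 (hypothetical immediate form)
```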
The bit-scan instructions (bsf and bsr) find the index of the lowest or highest set bit in a value. Unfortunately, their output is documented as undefined when the input is zero.
If the result is not defined, then an explicit check with zero is required nearly every time these instructions are used. (They will set the zero flag in the undefined case, but extra code is still needed to interpret that result.) For example, 32-bit code which scans a 64-bit variable for the highest set bit looks like:
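A sketch of such a scan (the 64-bit value is assumed to live in edx:eax, with -1 used as the result for a zero input):

```
        bsr  ecx, edx       ; scan the high half first
        jnz  high           ; found a set bit there?
        bsr  ecx, eax       ; no: scan the low half
        jnz  done
        mov  ecx, -1        ; the input was completely zero
        jmp  done
high:   add  ecx, 32        ; the bit was in the upper 32 bits
done:
```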
The above code is obviously rather complex due to all the extra error-checking required. A much simpler result occurs if the bit-scan instructions are defined to leave the output register unaltered when their input is zero. My AMD Opteron has this undocumented behaviour. Using it gives the shorter sequence:
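A sketch relying on that (undocumented) leave-unchanged behaviour, with the same register assumptions as above:

```
        mov  ecx, -1          ; result if no bits are set anywhere
        bsr  ecx, eax         ; low half: ecx untouched if eax == 0
        bsr  edx, edx         ; high half: sets ZF if it was zero
        jz   done
        lea  ecx, [edx + 32]  ; the highest bit was in the upper half
done:
```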
Note how the extra branches needed to handle the zero-input special case disappear.
Larger offset multiples
With the introduction of the i386 processor, Intel added extra addressing modes. These use the SIB byte in addition to the old Mod-R/M byte to allow any general purpose register to act as a displacement or index. In addition, this new addressing mode allows these indices to be scaled by 2, 4 or 8.
A well-known hack is that by combining a displacement with an index, other multiples are possible. For example, one can use the lea instruction with the same register as both base and index to multiply by 3, 5 or 9.
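For example:

```
lea eax, [eax + eax*2]    ; eax *= 3
lea eax, [eax + eax*4]    ; eax *= 5
lea eax, [eax + eax*8]    ; eax *= 9
```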
The problem here is that as time has passed, the amount of data that computers deal with has increased. Scale factors of at most eight simply are no longer enough. SSE registers are 16 bytes in size, and their AVX extensions will be 32 bytes. Also, many data structures in multithreaded applications are sized to a 64-byte cache line. Thus modern code tends to need extra bit-shift instructions to scale an index before a pointer access.
Imagine we had four bits to describe an index register's multiplication factor. If we include a zero multiple, this allows fifteen other scale factors. In addition, it might be nice to allow negative scale factors, which can save a separate negation instruction in some address calculations.
The set-condition-code instructions are very useful. They set a given byte to 0 or 1 depending on the value of a flag bit. This allows branch-free code in some cases. However, very often a result larger than a byte is required. To get these larger results, zero-extension or pre-zeroing of a register is used:
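The two usual sequences look something like (register names are illustrative):

```
; zero-extension version:
cmp   eax, ebx
setg  cl
movzx ecx, cl

; pre-zeroing version (the xor must come before the cmp,
; since it clobbers the flags):
xor   ecx, ecx
cmp   eax, ebx
setg  cl
```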
Where this second version is faster because the clearing of the result register can be hoisted off the critical path, and the xor also breaks the dependency on the register's old value.
It would be nice if the extra instructions were not required. 32-bit or 64-bit versions of the set-condition-code instructions would make them unnecessary.
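A sketch of such a hypothetical wider form:

```
cmp  eax, ebx
setg ecx              ; hypothetical 32-bit form: ecx = 0 or 1
```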
Many lock-free algorithms are complex due to dealing with pointers. The problem with a pointer is that it takes two instructions to use it. The first instruction reads the value of the pointer stored in a memory location. The second instruction uses that pointer in some indexed operation. Since two instructions are required, nothing prevents another thread from altering the contents of the pointer's memory location between the two.
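For example (64-bit code, registers illustrative):

```
mov  rax, [rdi]       ; 1: load the pointer value itself
mov  rdx, [rax + 8]   ; 2: use it -- by now another thread may have
                      ;    freed or recycled the object rax points to
```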
This second thread may have no idea that the first thread has read the pointer. (It hasn't had any time to notify anyone of that fact.) Thus it may not be sure whether or not it is allowed to access or free the memory pointed to.
Making sure of the non-existence of these hidden references is what makes most lock-free and wait-free algorithms so complex. The threads need to communicate with each other in some way that prevents invalid states from occurring. Note that since it is possible that the pointer-accessing thread can be scheduled out after reading that pointer, an arbitrarily long amount of time may be required to notify other threads. This deferred-notification means that simple delay loops don't work.
A common way of managing data is to use a reference count. A reference count will tell you how many other threads may be accessing this object concurrently. For example, if others are using the data, you cannot free it. For good abstraction, this reference count should be ideally placed within the object it protects. The problem here is that to access that reference count, an active pointer to the object is required. In the time taken to read that pointer, the pointer could be invalidated and the reference count could be set to zero. Thus when the first thread finally gets around to incrementing the reference count it is accessing invalid memory. This is a version of the lock-free pointer problem described above.
If the pointer-following could be made to be atomic, then these data races disappear. To do this from the instruction viewpoint requires "double-dereference" addressing. How could this work? There are some corner cases to work out. What should we do if the pointer we read is invalid? How can we be notified? How should the result of the operation be encoded?
One way of solving this is to require that the result of the instruction in the register is the value of the pointer at the time of execution. If an invalid pointer is used, then the operation simply isn't performed. The user can look at the pointer value to determine whether success occurred or not.
Example syntax of this may look like:
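The double-bracket syntax and semantics here are invented for illustration:

```
mov  rdx, [[rax] + 8]       ; hypothetical: atomically read the pointer
                            ; stored at [rax], then load 8 bytes past it;
                            ; rax is replaced by the pointer value read,
                            ; and the load is skipped if it was invalid
```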
AMD has discussed the possible addition of extra instructions for lock-free memory transactions. However, this method appears to be much simpler, and can perhaps add similar power without the complexity.
Arithmetic with complex numbers is more complex than most programmers realize. The problem is that floating point numbers are approximations of the real number line. IEEE floating point numbers have some nice properties, but some of these get in the way of efficient complex math.
The obvious way of implementing complex multiplication is to use the formula from your math text book:
(a+ib)×(c+id) = (ac - bd) + i(ad + bc).
Look what happens if we use IEEE arithmetic with some extreme values. Let's multiply infinity by one. The result is obviously infinity. This is the case in real arithmetic, and naively, the fact that the imaginary parts of the complex numbers are zero shouldn't affect this result. However, this isn't true:
(∞ + i0)×(1 + i0) = (∞×1 − 0×0) + i(∞×0 + 0×1) = ∞ + i NaN
Oops, a NaN has appeared because zero times infinity is undefined. Further calculations will cause this NaN to contaminate the results, yielding NaN for both the real and imaginary parts. Thus before we multiply we need to check for extreme cases. Similar problems appear with division, where signed zeros require special checks in order to be produced correctly.
So high-quality complex arithmetic libraries need to pre-check every calculation to avoid these problems. This is slow. Wouldn't it be nice if single and double-precision complex arithmetic were supported in hardware? Some rudimentary complex support has been added via the addsubps and addsubpd instructions in SSE3, but these only accelerate the non-exceptional cases.
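For example, one common SSE3 double-precision complex multiply sequence (the labels a and b are assumed to hold the real and imaginary parts of the first operand) still implements the naive formula, and so still produces the spurious NaNs shown above:

```
; (a+ib)*(c+id); xmm1 = {c, d} with c in the low half
movddup  xmm2, [a]          ; {a, a}
mulpd    xmm2, xmm1         ; {a*c, a*d}
movddup  xmm3, [b]          ; {b, b}
shufpd   xmm1, xmm1, 1      ; {d, c}
mulpd    xmm3, xmm1         ; {b*d, b*c}
addsubpd xmm2, xmm3         ; {a*c - b*d, a*d + b*c}
```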
A little-known part of the x86_64 ABI is the __float128 type: a 128-bit quadruple-precision floating point format, currently implemented entirely in software.
It would be nice if hardware natively supported this data type. This would speed up high-precision applications by perhaps an order of magnitude or more over the current methods which implement high precision in software.
Packed Integer Multiplication
As new SSE versions have been implemented, there are more and more ways of multiplying integers in SSE registers. However, these instructions are not orthogonal, and different possibilities exist for different sizes. For example, there are no instructions at all that multiply packed 8-bit quantities.
There are three different ways to multiply packed 16-bit quantities. The first instruction (pmullw) yields the low 16 bits of each product. The other two (pmulhw and pmulhuw) yield the high 16 bits, for signed and unsigned multiplication respectively.
Packed 32-bit quantities work differently. Here, there is an instruction (pmuludq) that multiplies the even-indexed 32-bit elements, producing full 64-bit unsigned results. A low-half 32-bit multiply (pmulld) only arrived with SSE4.1, and there is no instruction that yields just the high halves.
There are currently no 64-bit packed multiplication instructions in any SSE extension. Thus to perform these multiplies one needs to move the data into general purpose registers and back again.
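Alternatively, a packed 64-bit low multiply can be emulated in SSE2 via 32-bit pieces; a sketch, assuming xmm0 and xmm1 hold the two vectors of 64-bit operands:

```
; per lane: a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32)
movdqa  xmm2, xmm0
psrlq   xmm2, 32            ; a_hi
pmuludq xmm2, xmm1          ; a_hi * b_lo
movdqa  xmm3, xmm1
psrlq   xmm3, 32            ; b_hi
pmuludq xmm3, xmm0          ; b_hi * a_lo
paddq   xmm2, xmm3          ; sum of the cross terms
psllq   xmm2, 32            ; shifted into the high half
pmuludq xmm0, xmm1          ; a_lo * b_lo
paddq   xmm0, xmm2          ; low 64 bits of each product
```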
Finally there are "weird" instructions (such as pmaddwd, which multiplies packed 16-bit values and then adds adjacent pairs of the 32-bit products) that combine multiplication with other operations.
It would be nice if the SSE instruction set were more orthogonal. Perhaps an equivalent of the 16-bit high-multiply instructions would be useful for packed 32-bit values, and vice versa. Packed 64-bit multiplies would also be extremely useful for cryptography, hashing, compression, fixed-point arithmetic, and perhaps even graphics applications.
SSE to General Purpose moves
One problem with the current instruction set is that transferring data from SSE registers to general purpose registers and back is difficult. The movd and movq instructions only move 32 or 64 bits at a time, so extra shuffle instructions are needed to extract the rest of a 128-bit register.
It would be nice if there were an instruction to transfer all the information in an SSE register into general purpose registers and back again. Since SSE registers are larger than general purpose ones, such an instruction would obviously need to affect two (or more) of them. The problem here is that the x86 architecture really isn't designed for multiple-output instructions. AVX will go part-way to fixing this with its FMA additions, but those only operate on SSE registers.
Instead, in typical x86 style, we should choose the rax/rdx pair as the input/output for this operation. This register pair is used for double-width integer multiplies, and is also specified by the ABI as the way to return a 128-bit integer result.
If such an instruction were available, it would simplify the task of constructing software-based floating point support for the __float128 type. Currently, splitting such a value into general purpose registers requires something like:
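One possible sequence:

```
movq    rax, xmm0         ; low 64 bits
movhlps xmm1, xmm0        ; copy the high half into a scratch register
movq    rdx, xmm1         ; high 64 bits
```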
which uses an extra SSE register to perform the operation.
SSE shifts and rotates
Currently performing shifts and rotates on packed data in SSE registers is difficult. The instructions are not particularly orthogonal, and rotate instructions don't exist at all. Instead, they need to be emulated with a pair of shifts together with a combining logic operation.
For 16, 32 and 64-bit packed types we have the psllw, pslld and psllq left shifts and their logical right-shift counterparts. However, the arithmetic right shifts only exist for the 16 and 32-bit cases, and every element must be shifted by the same amount.
What is much worse is the 128-bit case. The instruction pair that shifts the whole register (pslldq and psrldq) can only shift by whole bytes, with the count given as an immediate constant. Emulating a 128-bit rotate therefore requires a sequence like:
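A sketch for a rotate left by a constant n bits (0 < n < 64); variable shift counts are considerably worse still:

```
movdqa xmm1, xmm0
psllq  xmm0, n             ; both 64-bit lanes shifted left
psrlq  xmm1, 64 - n        ; the bits that cross a lane boundary
pshufd xmm1, xmm1, 0x4e    ; swap the two 64-bit lanes
por    xmm0, xmm1          ; 128-bit rotate left by n
```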
Compared to the hypothetical case with a hardware accelerated version which is an order of magnitude shorter:
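Something like (mnemonic invented):

```
prolq128 xmm0, n           ; hypothetical 128-bit rotate-left
```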
Horizontal Maximum and Minimum
Quite often one would like to obtain the maximum or minimum from a set of numbers. There are SSE instructions which allow this operation to be vectorized. By using these, we can obtain quite a large speedup. However, the final step of the algorithm requires us to find the "horizontal" maximum or minimum within the accumulator SSE register.
This operation exposes some non-orthogonality in the SSE instruction sets. There are very few ways to transmit information between the packed types. Thus we need to use multiple shuffle and combine operations to complete this task.
However, if you are extremely lucky, there is a case that is already hardware optimized. Finding the horizontal minimum of a set of unsigned 16-bit words can be done with the phminposuw instruction from SSE4.1, which returns both the minimum value and its position.
Why aren't there equivalent instructions for other types, sizes and signedness? Note that a maximum may be converted into a minimum via a negation operation, but this still doesn't help us much if we would like to quickly scan bytes or single precision floats:
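For example, the horizontal minimum of 16 unsigned bytes takes a log-time reduction; one possible SSE2 sequence:

```
movdqa xmm1, xmm0
psrldq xmm1, 8             ; fold the register in half...
pminub xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm1, 4             ; ...and again...
pminub xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm1, 2
pminub xmm0, xmm1
movdqa xmm1, xmm0
psrldq xmm1, 1
pminub xmm0, xmm1          ; minimum of all 16 bytes in the low byte
```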
compared to a hypothetical single instruction:
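(mnemonic invented by analogy with phminposuw):

```
phminposub xmm0, xmm0      ; hypothetical byte analogue of phminposuw
```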
The final thing that would be really good to change with the SSE extensions is the elephant in the room: alignment. SSE instructions typically require that their memory accesses be 16 byte aligned. This is not too much trouble if you are in control of the source of the data you are processing. However, sometimes you need to accept data from an arbitrary source. If so, you may need to handle arbitrary alignments.
The fact that you may need to handle non-aligned data means the code wastes time checking for it in order to decide which version of the algorithm to use. More space (and thus cache) is then wasted keeping the non-aligned version around. And then there is the effort of constructing the non-aligned algorithm in the first place.
Some algorithms are easy to change: just use unaligned loads and stores, i.e. movdqu instead of movdqa.
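For example:

```
movdqa xmm0, [esi]         ; faults unless esi is 16-byte aligned
movdqu xmm0, [esi]         ; works for any address, at some cost in speed
```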
With the addition of the AVX extensions the situation should improve, since the new VEX-encoded instructions relax the alignment requirements on most memory operands. It would be nice if unaligned accesses were simply supported everywhere, with aligned accesses remaining the fast case.
The above lists some possibilities of how the x86 instruction set could be improved. Some instructions could be added to increase orthogonality. Some could be slightly changed to define previously undefined behaviour. Some could be improved to work in more situations than before. What is your wishlist? In what other ways do you think the instruction set could be improved?
Copyright © Lockless Inc All Rights Reserved.