X86 Instruction Wishlist

The x86 instruction set has evolved over time. Originally, it was based around 16 bits in the 8086; later, 32 bit support was added. Recently, AMD extended the general purpose registers to 64 bits, and Intel has added 128 bit vector (SIMD) registers, which will be widened to 256 bits when processors with the AVX extensions are released. As time has gone by, the number of instructions has increased dramatically, because the number of newly added instructions has greatly outweighed the small number of instructions that have been removed.

The net result of all this innovation is a rather complex encoding of the work the processor needs to do. Some instructions may be up to 15 bytes in length, whereas others may only require a single byte. In general, the more complex and rarer the instruction, the longer it tends to be. In effect, this acts like a Huffman encoding of the list of work for the processor, shrinking the amount of memory required to store that work. Since cache is expensive on the processor die, this saves money and increases performance.

Through the trials of history, the original CISC instruction set has been converted into a fairly orthogonal and lean RISC for the general purpose instructions. A fair number of the original complex instructions have been deprecated, and most compiler-optimized code these days avoids their use. Conversely, the relatively recently added SSE extensions have added many decidedly non-orthogonal instructions. These new instructions tend to be designed for specific tasks like video encoding or encryption. Without explicitly using compiler intrinsics, the more complex of these instructions are unlikely to appear in your compiled code.

The non-orthogonality of the SSE instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 etc.) poses a problem when optimizing code. Fairly often it is quite difficult to conceive of the best set of instructions to perform a given task. If the exact instruction you need isn't implemented, whereas a slightly different variant is, then you are out of luck. So what improvements could be made to the instruction set? The following is a "wish-list" of useful instructions, together with changes that would make everything more orthogonal.

RNG

It would be really useful if the processor had an instruction that returned random bits. This could be used to initialize cryptographic keys, seed user-space pseudo random number generators, and alleviate the problems of trying to get random bits from an embedded machine with very few sources of I/O. The most obvious way of doing this has subtle problems, which is probably why this instruction has not been implemented in the processor itself. (Some chipsets do have RNGs attached, though.) This "obvious" method is to use the amplified thermal noise from a resistor. Basic physics says that this noise should be a good source of entropy. However, such a resistor will also pick up interference from other sources, and have a self-inductance that gives increased correlations in the output. The hardest part of creating a good random number generator is removing those correlations.

Imagine you have used the above technique to create a good RNG on your processor chip. Now comes the time to shrink your design down to the next fabrication feature-size. Unfortunately, you will need to redo all your testing to see if the statistical properties of the randomness still hold. (The RNG isn't like a normal digital circuit, and will have unpredictable scaling properties.)
This increase in certification time is problematic. However, there is another way to generate good random bits. Construct a circuit that cryptographically hashes the output of a counter that is updated every clock cycle. By giving each processor a different initialization vector for the hash function, each processor will yield a different sequence of bits. The trick is making sure that this hidden initialization vector cannot be reverse engineered by any code running on the machine. This prevents the privacy problems of having a readable unique serial number for each machine, which occurred in Pentium III systems. By pipelining the hashing process, a random number per clock cycle could be obtained. Since the hash is cryptographically secure, it will have good randomness guarantees. It will also not require the expensive recertification on process shrinks.

CMOV

The conditional move instructions (cmovcc) copy one register to another depending on the state of the flags, allowing branches to be removed. Unfortunately, none of them take an immediate operand. For example, it is fairly common in C to require that the output from some calculation be converted into a boolean result. C treats any non-zero value as true, and zero as false. Further calculations may require just the values zero and one instead. To do this on current machines requires three instructions:
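    # one possible sequence:
    xorl   %edx, %edx        # pre-clear the result register
    testl  %eax, %eax        # set ZF from the input value
    setne  %dl               # %edx = (%eax != 0) ? 1 : 0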
or, using just a single register:
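    testl  %eax, %eax        # set ZF from the input value
    setne  %al               # low byte = 0 or 1
    movzbl %al, %eax         # zero-extend the byte to all 32 bits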
If an immediate conditional move existed, then only two instructions would be required:
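    # hypothetical: no cmovcc form currently accepts an immediate
    testl   %eax, %eax
    cmovnel $1, %eax         # load the constant 1 if %eax was non-zero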
Bit Scan

The bit-scan instructions bsf and bsr find the index of the lowest or highest set bit in their source operand. Unfortunately, their output is undefined when the source is zero, so an explicit check against zero is required nearly every time these instructions are used. (They will set the zero flag in the undefined case, but extra code is still needed to interpret that result.) For example, 32 bit code which scans a 64 bit variable for the highest set bit looks like:
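    # one possible sequence: the 64 bit value is in %edx:%esi (high:low),
    # and the index of the highest set bit (or -1) is returned in %eax
    bsrl  %edx, %ecx         # scan the high half
    jnz   1f                 # a set bit was found there
    bsrl  %esi, %ecx         # otherwise scan the low half
    jnz   2f
    movl  $-1, %eax          # the whole 64 bit value was zero
    jmp   3f
1:  addl  $32, %ecx          # high-half indices are offset by 32
2:  movl  %ecx, %eax
3: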
The above code obviously is rather complex due to all the extra error-checking required. A much simpler result occurs if the bit-scan instructions are defined to not alter the output register when their input is zero. My AMD Opteron has this undocumented behaviour. Using this gives the shorter sequence:
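    # the same scan, relying on bsr leaving its destination unchanged
    # when the source is zero
    movl  $-1, %eax          # default result: no bits set
    bsrl  %esi, %eax         # low half: only written if %esi is non-zero
    bsrl  %edx, %ecx         # high half
    jz    1f                 # high half empty: low-half result stands
    leal  32(%ecx), %eax     # otherwise the high half wins
1: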
Note how the extra tests and branches have almost entirely disappeared.

Larger offset multiples

With the introduction of the i386 processor, Intel added extra addressing modes. These use the SIB byte in addition to the old Mod-R/M byte to allow any general purpose register to act as a displacement or index. In addition, this new addressing mode allows these indices to be scaled by 2, 4 or 8. A well-known hack is that by combining a displacement with an index, other multiples are possible. For example, one can use the lea instruction to multiply a register by three, five or nine in a single operation:
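    # multiply %rax by five in a single instruction
    leaq  (%rax,%rax,4), %rax
    # multiply %rax by nine in a single instruction
    leaq  (%rax,%rax,8), %rax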
The problem here is that as time has passed, the amount of data that computers deal with has increased. Multiples of up to eight simply are no longer enough. SSE registers are 16 bytes in size. Their AVX extensions will be 32 bytes in size. Also, many data structures in multithreaded applications are cache-line sized, at a total of 64 bytes. Thus modern code tends to require extra bit-shift instructions to construct the scale factor in pointer accesses. Imagine we had four bits to describe an index register's multiplication factor. If we include a zero multiple, this allows fifteen different scale factors. In addition, it might be nice to allow negative scale factors, which can save a negation instruction.

SETcc

The set-condition-code instructions are very useful. They set a given byte to 0 or 1 depending on the value of a flag bit. This allows branch-free code in some cases. However, very often a value larger than a byte is required. To get these larger results, zero-extension or pre-zeroing of a register is done:
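    # zero-extension of the byte result
    cmpl   %esi, %edi
    setl   %al               # %al = (%edi < %esi) ? 1 : 0
    movzbl %al, %eax         # widen to 32 bits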
or
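    # pre-zeroing the destination instead
    xorl  %eax, %eax         # off the critical path
    cmpl  %esi, %edi
    setl  %al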
This second version is faster because the clearing of the destination register is done with a dependency-breaking xor that can execute early, off the critical path. It would be nice if the extra instructions were not required. 32 bit or 64 bit versions of the setcc instructions would remove this overhead entirely:
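    # hypothetical 32 bit form of setcc
    cmpl  %esi, %edi
    setll %eax               # invented mnemonic: set all of %eax to 0 or 1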
Double-dereference

Many lock-free algorithms are complex due to dealing with pointers. The problem with a pointer is that it takes two instructions to use it. The first instruction reads the value of the pointer stored in a memory location. The second instruction uses that pointer in some indexed operation. Since two instructions are required, nothing prevents another thread from altering the contents of the pointer's memory location between the two. This second thread may have no idea that the first thread has read the pointer. (It hasn't had any time to notify anyone of that fact.) Thus it may not be sure whether or not it is allowed to access or free the memory pointed to.

Making sure of the non-existence of these hidden references is what makes most lock-free and wait-free algorithms so complex. The threads need to communicate with each other in some way that prevents invalid states from occurring. Note that since the pointer-accessing thread can be scheduled out after reading the pointer, an arbitrarily long amount of time may be required to notify other threads. This deferred notification means that simple delay loops don't work.

A common way of managing data is to use a reference count. A reference count will tell you how many other threads may be accessing an object concurrently. For example, if others are using the data, you cannot free it. For good abstraction, this reference count should ideally be placed within the object it protects. The problem here is that to access that reference count, an active pointer to the object is required. In the time taken to read that pointer, the pointer could be invalidated and the reference count set to zero. Thus when the first thread finally gets around to incrementing the reference count, it is accessing invalid memory. This is a version of the lock-free pointer problem described above.

If the pointer-following could be made atomic, then these data races disappear. To do this from the instruction viewpoint requires "double-dereference" addressing. How could this work? There are some corner cases to work out. What should we do if the pointer we read is invalid? How can we be notified? How should the result of the operation be encoded? One way of solving this is to require that the result left in a register is the value of the pointer at the time of execution. If an invalid pointer is used, then the operation simply isn't performed. The user can look at the pointer value to determine whether success occurred or not. Example syntax may look like:
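    # hypothetical syntax: the doubled brackets denote an extra level of
    # indirection, performed atomically; %rax receives the pointer value
    # that was dereferenced, or NULL if the operation was not performed
    lock addl $1, ((%rbx)), %rax
    testq %rax, %rax
    jz    failed             # the pointer was invalid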
AMD has discussed the possible addition of extra instructions for lock-free memory transactions. However, this method appears to be much simpler, and can perhaps add similar power without the complexity.

Complex Arithmetic

Arithmetic with complex numbers is more complex than most programmers realize. The problem is that floating point numbers are approximations of the real number line. IEEE floating point numbers have some nice properties, but some of these get in the way of efficient complex math. The obvious way of implementing complex multiplication is to use the formula from your math text book: (a+ib)×(c+id) = (ac - bd) + i(ad + bc). Look what happens if we use IEEE arithmetic with some extreme values. Let's multiply infinity by one. The result is obviously infinity. This is the case in real arithmetic, and naively, the fact that the imaginary parts of the complex numbers are zero shouldn't affect this result. However, this isn't true:

(∞ + i 0)(1 + i 0) = ∞×1 - 0×0 + i(∞×0 + 1×0) = ∞ + i NaN

Oops, a NaN has appeared because zero times infinity is undefined. Further calculations will cause this NaN to contaminate the results, yielding NaN for both the real and imaginary parts. Thus before we multiply we need to check for extreme cases. Similar problems appear with division, where signed zeros require special checks in order to be produced correctly. So high-quality complex arithmetic libraries need to pre-check every calculation to avoid these problems. This is slow. Wouldn't it be nice if single and double-precision complex arithmetic were supported in hardware? Some rudimentary complex support has been added via the addsubps and addsubpd instructions in SSE3, but full complex multiplication and division that handle the special cases correctly would be far more useful.

__float128

A little-known part of the x86_64 ABI is the __float128 type. This is a 128 bit quadruple-precision floating point format, currently implemented entirely in software.
It would be nice if hardware natively supported this data type. This would speed up high-precision applications by perhaps an order of magnitude or more over the current methods, which implement high precision in software.

Packed Integer Multiplication

As new SSE versions have been implemented, there are more and more ways of multiplying integers in SSE registers. However, these instructions are not orthogonal, and different possibilities exist for different sizes. For example, there are no instructions that multiply packed 8 bit quantities. There are three different ways to multiply packed 16 bit quantities. The first instruction (pmullw) yields the low 16 bits of the products. The others (pmulhw and pmulhuw) yield the high 16 bits, for signed and unsigned operands respectively.

Packed 32 bit quantities work differently. Here, there is an instruction (pmuludq) that multiplies the even-indexed 32 bit elements to produce full 64 bit results, and SSE4.1 adds pmulld, which yields the low 32 bits of each product. There is no instruction giving the high halves of packed 32 bit multiplies.

There are currently no 64 bit packed multiplication instructions in any SSE extension. Thus to perform these multiplies one needs to move the data into general purpose registers and back again. Finally, there are "weird" instructions (such as pmaddwd and pmulhrsw) that combine multiplication with addition or rounding for special tasks.

It would be nice if the SSE instruction set was more orthogonal. Perhaps an equivalent of the 16 bit high-multiply instructions would be useful for packed 32 bit registers, and vice-versa. Also, packed 64 bit multiplies would be extremely useful for cryptography, hashes, compression, fixed-point arithmetic, and perhaps even graphics applications.

SSE to General Purpose moves

One problem with the current instruction set is that transferring data from SSE registers to general purpose registers and back is difficult. The movd and movq instructions only transfer the low 32 or 64 bits of an SSE register, so extracting all 128 bits takes several instructions and shuffles. It would be nice if there was an instruction to transfer all the information in an SSE register into general purpose registers and back again. Since SSE registers are larger than general purpose ones, such an instruction will obviously need to affect two (or more) of these. The problem here is that the x86 architecture really isn't designed for multiple-output instructions. AVX will go part-way to fixing this with its FMA additions, but that is only for SSE registers. Instead, in typical x86 style, we should choose the rax/rdx pair as the input/output for this operation. This register pair is used for double-precision integer multiplies, and is also specified by the ABI as the way to return a 128 bit integer result. If such an instruction were available, it would simplify the task of constructing software-based floating point support for the __float128 type. For example, negating a __float128 value held in %xmm0 might look like:
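    # invented mnemonics: movxgp moves all 128 bits of an SSE register
    # into the rdx:rax pair, and movgpx moves them back
    movxgp %xmm0, %rax       # %rax = low 64 bits, %rdx = high 64 bits
    btcq   $63, %rdx         # flip the __float128 sign bit
    movgpx %rax, %xmm0       # reassemble the value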
Compared to
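    # the current method
    movq       %xmm0, %rax   # extract the low 64 bits
    psrldq     $8, %xmm0     # bring the high 64 bits down
    movq       %xmm0, %rdx
    btcq       $63, %rdx     # flip the __float128 sign bit
    movq       %rax, %xmm0
    movq       %rdx, %xmm1
    punpcklqdq %xmm1, %xmm0  # recombine the two halves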
which uses an extra SSE register to perform the operation.
SSE shifts and rotates

Currently, performing shifts and rotates on packed data in SSE registers is difficult. The instructions are not particularly orthogonal, and rotate instructions don't exist at all. Instead, they need to be emulated with a pair of shifts together with a combining logic operation. For 16, 32 and 64 bit packed types we have the psllw, pslld and psllq left shifts, together with their logical right-shift counterparts. (Arithmetic right shifts only exist for the 16 and 32 bit sizes, and there are no 8 bit shifts at all.) What is much worse is the 128 bit case. The pslldq and psrldq instructions only shift by a whole number of bytes, so a 128 bit rotate by an arbitrary bit count requires quite a long emulation:
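    # a 128 bit rotate left by the count in %ecx (0 <= count < 64)
    movd   %ecx, %xmm2       # shift count for the left shifts
    movl   $64, %eax
    subl   %ecx, %eax
    movd   %eax, %xmm3       # complementary count, 64 - n
    movdqa %xmm0, %xmm1
    psllq  %xmm2, %xmm0      # shift both 64 bit halves left by n
    psrlq  %xmm3, %xmm1      # the bits that cross between the halves
    pshufd $0x4e, %xmm1, %xmm1  # swap the two halves
    por    %xmm1, %xmm0      # combine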
Compared to the hypothetical case of a hardware accelerated version, which is an order of magnitude shorter:
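    # invented mnemonic for a 128 bit rotate-left instruction
    prollq128 %cl, %xmm0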
Horizontal Maximum and Minimum

Quite often one would like to obtain the maximum or minimum from a set of numbers. There are SSE instructions which allow this operation to be vectorized. By using these, we can obtain quite a large speedup. However, the final step of the algorithm requires us to find the "horizontal" maximum or minimum within the accumulator SSE register. This operation exposes some non-orthogonality in the SSE instruction sets. There are very few ways to transmit information between the packed types. Thus we need to use multiple shuffle and combine operations to complete this task. However, if you are extremely lucky, there is a case that is already hardware optimized. Finding the horizontal minimum of a set of unsigned words can be done with the phminposuw instruction from SSE4.1. Why aren't there equivalent instructions for other types, sizes and signedness? Note that a maximum may be converted into a minimum via a negation operation, but this still doesn't help us much if we would like to quickly scan bytes or single precision floats:
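    # horizontal maximum of the four floats in %xmm0
    movaps %xmm0, %xmm1
    shufps $0x4e, %xmm1, %xmm1  # swap the two 64 bit halves
    maxps  %xmm1, %xmm0
    movaps %xmm0, %xmm1
    shufps $0xb1, %xmm1, %xmm1  # swap the elements within each half
    maxps  %xmm1, %xmm0      # the maximum is now in every element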
compared to the hypothetical:
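    # invented mnemonic for a horizontal single-precision maximum
    hmaxps %xmm0, %xmm0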
Unaligned SSE

The final thing that would be really good to change with the SSE extensions is the elephant in the room: alignment. SSE instructions typically require that their memory accesses be 16 byte aligned. This is not too much trouble if you are in control of the source of the data you are processing. However, sometimes you need to accept data from an arbitrary source. If so, you may need to handle arbitrary alignments. The fact that you may need to handle non-aligned data requires the code to waste time checking for this, in order to decide which version of the algorithm to use. More space (and thus cache) is then wasted in having the non-aligned version around. Then there is the problem of constructing the non-aligned algorithm. Some algorithms are easy to change: just use unaligned loads and stores, i.e. movdqu instead of movdqa. With the addition of the AVX extensions, most vector instructions will accept unaligned memory operands, which should finally remove much of this pain.

Your Opinion

The above lists some possibilities for how the x86 instruction set could be improved. Some instructions could be added to increase orthogonality. Some could be slightly changed to define previously undefined behaviour. Some could be improved to work in more situations than before. What is your wishlist? In what other ways do you think the instruction set could be improved?
Comments
Dmitry V'jukov said...
1. LOCK_RELAXED ADD [ecx], 1
2. System-wide memory fence (hw analog of FlushProcessWriteBuffers())
3. AMD ASF, double dereference won't do the trick.
AMD's ASF is nice... but I thought it would be a bit presumptuous arguing for it here. Double-dereference is much simpler, and it is quite surprising how powerful it is. I should probably write an article on it. :-)
Well, I think that it's a bit late to worry about security when the robber is already in the apartment. I believe there are a thousand and one ways a malicious process can compromise the system's performance. If you want security, just do not run a suspicious process, or use a virtual environment.
Anyway, QPI based systems *already* have such a feature, so we are not losing anything.
As for "poor man's transactional memory", MOV [mem],[mem] can be useful too. It would allow to implement a zero cost SMR. However, it still requires locking of 2 separate cache-lines, which is quite problematic on QPI-based systems.
Btw, the problem you outlined in the "Double-dereference" part is solvable today with strongly thread-safe reference counting algorithms. They basically allow you to acquire a reference to an object when you do not already have one.
The fact that 2 separate cache lines are required for double-dereference isn't much of a problem. As things currently stand, atomic ops are allowed to cross cachelines, and in fact pages. Of course the maximum number of cache lines required is currently two, and double dereference might need four...
Yeah, you can emulate the reference counting effect of double-dereference with hazard pointers. The problem with those is handling the annoying special cases where threads are being created and destroyed simultaneously with the memory ops. Then there is the fact that hazard pointers are O(n) rather than O(1) :-/
The problem is performance rather than possibility. A split atomic RMW on a QPI based system takes 5000 cycles, and effectively acts as FlushProcessWriteBuffers():
http://blogs.sun.com/dave/entry/qpi_quiescence
>Yeah, you can emulate the reference counting effect of double-deference with hazard pointers.
I meant atomic reference counting: one still pays 1 atomic RMW (perhaps double-word) per operation (acquire/release), and there are no corner cases.
btw, the captcha is annoying
Hazard pointers are quite efficient, and are a general solution to the same problem that works on existing hardware with a wide variety of data structures. They require that the number of threads be statically bounded, however.
I agree with pretty much everything you said about x86 SIMD. There are so many special cases, and at this point it's impossible to make any sense of it. It has been particularly annoying that basic ops like multiply continue to work only for specific integral types.
I have also been frustrated with limitations of the SETcc and CMOVcc instructions. In fact I wish SETcc would set a 4-byte register to either 00000000 or FFFFFFFF (or sign-extend to 8-bytes in AMD64) for branch optimization. I'm sure C++ compiler-writers would kill me for that suggestion.
The BSF/BSR instructions are generally implemented as CLZ/CTZ on other processors. These instructions are well-defined in all cases. If the source operand is zero, then they store the total number of bits in the register (i.e. 32 for x86 and 64 for x64). This is consistent with GCC intrinsics.
Finally, there are processors that have an on-die RNG: the VIA C3, C7, and Nano. These are embedded x86 processors with specialized instructions for accessing the on-die RNG, and they have additional hardware support for accelerating the AES, RSA, and SHA algorithms.
1) ALU instructions that do not set flags:
addnf dst, src
ornf dst, src
adcnf dst, src
sbbnf dst, src
andnf dst, src
subnf dst, src
xornf dst, src
incnf dst, src
decnf dst, src
rolnf dst, src
rornf dst, src
shlnf dst, src
shrnf dst, src
sarnf dst, src
These are very useful in branch-free code for scheduling instructions when the flags need to be preserved. I did not include cmp or test for obvious reasons, and I also didn't include rcr or rcl. By the way, shifts, rotates, and inc/dec have historically been problematic for the architecture because of the way they update eflags. They result in stalls under certain conditions.
These are also useful in virtualization code. It makes it possible to update the virtual cycle counter without saving eflags, for example.
2) bitswap dst, src:
Reverses all the bits in src and stores the result in dst. It's similar to bswap. This isn't incredibly common, but it comes up over and over again. It would also be acceptable to reverse the bits in each byte of src such that using bswap on the result would complete the swap operation.
3) Bit string manipulation instructions!
insr dst, src, imm6, imm6
extr dst, src, imm6, imm6
mux4 dst, src, imm8
The insr instruction computes the following:
msk = rol((1 << len) - 1, off)
dst = (dst & ~msk) | (rol(src, off) & msk)
This allows a right-aligned bit string of any length to be copied from one register into another at any arbitrary position.
The extr instruction computes the following:
dst = ror(src, off) & ((1 << len) - 1)
This allows any arbitrary bit string to be copied from one register into another and zero-extended or sign-extended.
These two instructions are available on IA-64 as dep and extr, and PowerPC has rlwinm which is very similar. These instructions would optimize C bit-field manipulation, and they're also extremely useful in decoding RISC instructions.
The mux4 instruction divides the src & dst registers into 4 equal-sized partitions and computes the following:
dst[0] = src[imm8 & 3]
dst[1] = src[(imm8 >> 2) & 3]
dst[2] = src[(imm8 >> 4) & 3]
dst[3] = src[(imm8 >> 6) & 3]
This allows any byte to be moved into any position in a 32-bit register or any two bytes to be moved into any position in a 64-bit register. It would be possible to also have a mux8 instruction with a 24-bit immediate or a mux16 instruction with a 64-bit immediate.
An alternative implementation might be to allow copying an arbitrary element from the source register into an arbitrary position in the destination register.
4) ll/sc
Many of the RISCs have load-linked/store-conditional. It's more generic than cmpxchg and solves the A-B-A problem without requiring DCAS. It would also totally obviate the need for the lock prefix.
The whole point of atomic double-dereference is to avoid the possibility of accessing invalid pointers. The instruction should just not perform the required operation if it would segfault. By returning the address it did or did not use, the client code can work out whether or not the operation actually occurred. i.e. since we may know that the page at address zero isn't mapped, NULL works as intended. If that page is used, you can transparently use the address of some other unmapped page for the NULL bit-pattern.
I like hazard pointers. However, they aren't the perfect solution. As you mention, they are much harder to use when the number of threads is either unknown, or changes at runtime. Another problem with them is that the writer side needs to scan a list that is proportional to the number of threads. Since each single check will inevitably result in a cache-miss, the overhead can become quite large when you have loads of threads.
Yes, I'd love the set-all-bits on cc instruction. "sbb reg,reg" is my favourite instruction because it does that with the carry flag. In fact using the "neg" instruction to test for zero instead of "test" can be much more elegant due to its use of the carry instead of the zero flag.
I know it sounds weird, but having new instructions that don't set the flags might actually be a pessimization. Internally, the processor counts on the fact that most instructions completely rewrite/invalidate the flags. This allows dependency chains to be shortened or removed, and more OOO parallelism to be obtained. Also, with macro-op fusion which exists in modern processors test+branch takes but a single cycle, so redoing a test isn't expensive.
Bitswap would be nice for hash functions and encryption. :-) It is also very useful for locality-sensitive data structures. (Not quite as good as Hilbert Curve order, but close.)
ll/sc is actually quite horrible. The problem with it is that the sc can fail. This turns every atomic operation into some sort of loop. Now you have the difficulty of solving live-lock, where multiple threads interfere with each other. Correcting that with some sort of exponential back-off greatly bloats your atomic primitives. AMD has suggested something much like this to help lock free code. However, it doesn't help the much more performant wait free algorithms. We really need more atomics... the problem is finding which ones help the most. :-)
2. A reliable and faster MONITOR/MWAIT.
The documentation is found under http://software.intel.com/en-us/avx/ when you click on "Intel Advanced Vector Extensions Programming Reference".
You can test it with version >=3.88 of the emulator: http://software.intel.com/en-us/articles/intel-software-development-emulator/
In my opinion, the best way to implement this would be a new prefix "nof" that tells the processor to ignore any effects on the flags register caused by the next instruction. Then you can write something like
nof add %eax,%ebx
nof dec %eax
etc.
I like this more since the usage looks very intuitive.
I absolutely love your articles, especially this one.
Cheers from Michigan!
Maybe an instruction that interleaves the bits from two registers would be cool too (improves data locality of 2d arrays).
Intel's new pdep instruction can do the bit interleave on your wishlist!
mov ecx,0x55555555
pdep eax,eax,ecx
pdep edx,edx,ecx
lea eax,[eax+edx*2]
I would like to add one more that is very important in my opinion.
It is about memory access.
Very often, otherwise well-vectorizable algorithms require reading/writing from/to memory addresses which are calculated per-channel (reading from a table, sampling a texture, etc.). When you get to this, you are forced to make that part of the algorithm scalar, by extracting each channel in turn to a GP register, performing the memory operation, and then inserting the result back into a vector register.
I don't think a single instruction that interprets each channel as an address and reads/writes to different memory locations at once is hardware feasible (though it would be extremely good), but at least they should try to devise some helper instructions that would ease the situation. I don't have an immediate idea what those instructions would be.
It would be great if they added instructions for memory access that read the address directly from an SSE/AVX register instead of a GP register, e.g.:
ploaddq $i, %xmm0, %xmm1 - read a double quadword from address specified in the i-th dword of xmm0 and store it in xmm1
ploadq $(i + (j<<4)), %xmm0, %xmm1 - read a quadword from address specified in the i-th dword of xmm0 and store it in j-th half of xmm1
ploadd $(i + (j<<4)), %xmm0, %xmm1 - read a dword from address specified in the i-th dword of xmm0 and store it in j-th quarter of xmm1
pstored $(i + (j<<4)), %xmm0, %xmm1 - store a dword to address specified in the i-th dword of xmm0, the source is the j-th quarter of xmm1
something like that :)
pcmpged / pcmpled
Currently I need to check for (a<b) | (a==b) to get the move mask.