Lockless Inc

X86 Instruction Wishlist

The x86 instruction set has evolved over time. Originally it was based around 16 bit registers in the 8086; later, 32 bit support was added. More recently, AMD extended the general purpose registers to 64 bits, and Intel added 128 bit vector (SIMD) registers, which will be widened to 256 bits when processors with the AVX extensions are released. As time has gone by, the number of instructions has increased dramatically: the number of newly added instructions has greatly outweighed the few that have been removed.

The net result of all this innovation is a rather complex encoding of the work the processor needs to do. Some instructions may be up to 15 bytes in length, whereas others may only require a single byte. In general, the more complex and rarer the instruction, the longer it tends to be. In effect, this acts like a Huffman encoding of the list of work for the processor, shrinking the amount of memory required to store that work. Since cache is expensive on the processor die, this saves money, and increases performance.

Through the trials of history, the original CISC instruction set has been whittled down to a fairly orthogonal and lean RISC-like core for the general purpose instructions. A fair number of the original complex instructions have been deprecated, and most compiler-optimized code these days avoids their use. Conversely, the relatively recently added SSE extensions have added many decidedly non-orthogonal instructions. These new instructions tend to be designed for specific tasks like video encoding or encryption. Without explicitly using compiler intrinsics, the more complex of these instructions are unlikely to appear in your compiled code.

The non-orthogonality of the SSE instruction sets (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, etc.) poses a problem when optimizing code. Fairly often it is quite difficult to conceive of the best set of instructions to perform a given task. If the exact instruction you need isn't implemented, whereas a slightly different variant is, then you are out of luck.

So what improvements could be made to the instruction set? The following is a "wish-list" of useful instructions, together with changes that would make everything more orthogonal.


Random Numbers

It would be really useful if the processor had an instruction that returned random bits. This could be used to initialize cryptographic keys and user-space pseudo random number generators, and would alleviate the problem of trying to get random bits on an embedded machine with very few sources of I/O.

The most obvious way of doing this has subtle problems, which is probably why this instruction has not been implemented in the processor itself. (Some chipsets do have rngs attached, though.) This "obvious" method is to use the amplified thermal noise from a resistor. Basic physics says that this noise should be a good source of entropy. However, such a resistor will also pick up interference from other sources, and have a self-inductance that introduces correlations into its output. The hardest part of creating a good random number generator is removing those correlations.

Imagine you have used the above technique to create a good rng on your processor chip. Now comes the time to shrink your design down to the next fabrication feature-size. Unfortunately, you will need to redo all your testing to see if the statistical properties of the randomness still hold. (The rng isn't like a normal digital circuit, and will have unpredictable scaling properties.) This increase in certification time is problematic.

However, there is another way to generate good random bits. Construct a circuit that cryptographically hashes the output from a counter that is updated every clock cycle. By giving each processor a different initialization vector for the hash function, each processor will yield a different sequence of bits. The trick is to make sure that this hidden key cannot be reverse engineered by any code running on the machine. This prevents the privacy problems of having a readable unique serial number for each machine, which occurred in Pentium III systems.

By pipe-lining the encryption process, a random number per clock cycle could be obtained. Since the hash is cryptographically secure, it will have good randomness guarantees. It will also not require the expensive recertification on process shrinks.
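The hashed-counter scheme can be sketched in C. This is purely illustrative: a real implementation would use a cryptographic hash in silicon, whereas here a SplitMix64-style mixer and a made-up CHIP_SECRET constant stand in for the hash and the per-processor initialization vector.

```c
#include <stdint.h>

/* Illustrative stand-in for the secret per-processor initialization
   vector baked into the chip.  Not a real hardware interface. */
#define CHIP_SECRET 0x9e3779b97f4a7c15ULL

/* SplitMix64-style finalizer, standing in for a cryptographic hash.
   It is bijective, so distinct counter values give distinct outputs. */
static uint64_t mix(uint64_t x)
{
	x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27; x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* One 64 bit "random" output per tick of the clock-cycle counter */
uint64_t rng_next(uint64_t counter)
{
	return mix(counter ^ CHIP_SECRET);
}
```

Since the mixer is a bijection, the output sequence never repeats until the counter wraps, and without knowing CHIP_SECRET the sequence cannot be predicted by software (assuming the real hash is cryptographically strong).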


Immediate Conditional Moves

The conditional move instructions (cmov and friends) allow branch-free code to be efficiently implemented. Two different values can be calculated simultaneously via the processor's parallel execution units. Those values can then be efficiently selected between with a conditional move. The problem here is that, as currently implemented, cmov requires two registers as arguments. Many instruction sequences could be shortened by an immediate version of the conditional moves.

For example, it is fairly common in C to require that the output from some calculation be converted into a boolean result. C treats any non-zero value as true, and zero as false. Further calculations may require just the values zero and one instead. To do this on current machines requires three instructions:

mov	$1, %rcx
test	%rax, %rax		# Compare %rax with itself, and set zero flag if zero
cmovne	%rcx, %rax		# If I'm not zero, I need to be one.
or, using just a single register:

cmp	$1, %rax		# Are we less than one; if so, set the carry flag
sbb	%rax, %rax		# Set %rax to -1 if carry is set, zero otherwise
inc	%rax			# Change to zero or one.

If an immediate conditional move exists, then only two instructions are required:

test	%rax, %rax		# Compare %rax with itself, and set zero flag if zero
cmovne	$1, %rax		# If I'm not zero, I need to be one.

Bit Scan

The bitscan instructions bsf and bsr are used to find the first or last set bit in a register or memory location. This allows fast scanning of bit-string data structures. The problem is what to return for an all-zero input. A return value of zero corresponds to the 2^0 = 1 bit. Thus Intel and AMD have left this case undefined.

If the result is not defined, then an explicit check against zero is required nearly every time these instructions are used. (They set the zero flag in the undefined case, but extra code is still needed to interpret that result.) For example, 32 bit code which scans a 64 bit variable for the highest set bit looks like:

mov	$-1, %ebp		# Use -1 as the value for all-zeros
bsr	%eax, %ecx		# Scan the low 32 bits
cmove	%ebp, %ecx		# Choose -1 if they were all zero
add	$-0x20, %ecx		# Subtract 32, so the later add cancels out
bsr	%edx, %ebp		# Scan the upper 32 bits
cmove	%ecx, %ebp		# If the upper bits are all zero, choose the lower result
add	$0x20, %ebp		# Add 32 so the upper result is correct

The above code is obviously rather complex due to all the extra error-checking required. A much simpler result occurs if the bit-scan instructions are defined not to alter the output register when their input is zero. My AMD Opteron has this undocumented behaviour. Using it gives the shorter sequence:

mov	$-1, %ebp		# Use -1 as the value for all-zeros
bsr	%eax, %ebp		# Scan the low 32 bits
add	$-0x20, %ebp		# Subtract 32, so the later add cancels out
bsr	%edx, %ebp		# Scan the upper 32 bits
add	$0x20, %ebp		# Add 32 so the upper result is correct

Note how the extra cmove instructions are now not needed. Simply by defining the behaviour to that actually implemented on modern machines, more efficient use of these instructions is possible.
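The semantics being argued for here can be modelled in C. This is a sketch: bsr32() mimics a hypothetical bsr that leaves its destination untouched on zero input, and bsr64() mirrors the five-instruction sequence above, returning -1 when no bit is set.

```c
#include <stdint.h>

/* Model of a bsr instruction that keeps the old destination value when
   the input is all-zero (the undocumented Opteron behaviour). */
static int bsr32(uint32_t x, int dest)
{
	if (!x) return dest;		/* All-zero: leave destination alone */
	int i = 31;
	while (!(x & (1U << i))) i--;	/* Find the highest set bit */
	return i;
}

/* Mirrors the asm sequence: scan low half, bias by -32, scan high half,
   then add 32 so the two cases come out right. */
int bsr64(uint64_t x)
{
	int r = -1;				/* Value for all-zeros */
	r = bsr32((uint32_t) x, r);		/* Scan the low 32 bits */
	r -= 32;				/* Bias so the later add cancels */
	r = bsr32((uint32_t)(x >> 32), r);	/* Scan the upper 32 bits */
	return r + 32;				/* Correct the upper result */
}
```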

Larger offset multiples

With the introduction of the i386 processor, Intel added extra addressing modes. These use the SIB byte in addition to the old Mod-R/M byte to allow any general purpose register to act as a base or index. In addition, this new addressing mode allows these indices to be scaled by 2, 4 or 8.

A well-known hack is that by combining a displacement with an index other multiples are possible. For example, one can use the lea instruction to multiply a register by 5 and store into another register:

lea	(%eax, %eax, 4), %edx	# %edx = 5 * %eax

The problem here is that as time has passed, the amount of data that computers deal with has increased, and scale factors of at most eight are simply no longer enough. SSE registers are 16 bytes in size; their AVX extensions will be 32 bytes. Also, many data structures in multithreaded applications are cache-line sized, at a total of 64 bytes. Thus modern code tends to require explicit shift instructions to scale indices in pointer accesses.

Imagine we had four bits to describe an index register's multiplication factor. If we include a zero multiple, this leaves fifteen other scale factors. In addition, it might be nice to allow negative scale factors, which can save a neg instruction in some cases. Using powers of two (and minus two) for these factors gives the possibilities: -64, -32, -16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16, 32, 64, 128. A scale factor of 128 should be enough for a while.


Larger setcc

The set-condition-code instructions are very useful. They set a given byte to 0 or 1 depending on the value of a flag bit. This allows branch-free code in some cases. However, very often a value larger than a byte is required. To get these larger results, zero-extension or pre-zeroing of a register is needed:

test	%eax, %eax		# Example flag-setting instruction
sete	%bl			# Set %bl to the result of the test
movzbl	%bl, %ebx		# Zero-extend result to full 32 bits.

xor	%ebx, %ebx		# Clear %ebx
test	%eax, %eax		# Example flag-setting instruction
sete	%bl			# Set %bl, and thus %ebx to the result of the test

This second version is faster because the clearing of %ebx can occur in parallel with the test instruction.

It would be nice if the extra instructions were not required. 32 bit, or 64 bit versions of the setcc instructions could automatically zero the high bits of the register:

test	%eax, %eax		# Example flag-setting instruction
sete	%ebx			# Set %ebx to the result of the test with new instruction


Double-dereference

Many lock-free algorithms are complex due to dealing with pointers. The problem with a pointer is that it takes two instructions to use it. The first instruction reads the value of the pointer stored in a memory location. The second instruction uses that pointer in some indexed operation. Since two instructions are required, nothing prevents another thread from altering the contents of the pointer's memory location between the two.

This second thread may have no idea that the first thread has read the pointer. (It hasn't had any time to notify anyone of that fact.) Thus it may not be sure whether or not it is allowed to access or free the memory pointed to.

Making sure of the non-existence of these hidden references is what makes most lock-free and wait-free algorithms so complex. The threads need to communicate with each other in some way that prevents invalid states from occurring. Note that since it is possible that the pointer-accessing thread can be scheduled out after reading that pointer, an arbitrarily long amount of time may be required to notify other threads. This deferred-notification means that simple delay loops don't work.

A common way of managing data is to use a reference count. A reference count will tell you how many other threads may be accessing this object concurrently. For example, if others are using the data, you cannot free it. For good abstraction, this reference count should be ideally placed within the object it protects. The problem here is that to access that reference count, an active pointer to the object is required. In the time taken to read that pointer, the pointer could be invalidated and the reference count could be set to zero. Thus when the first thread finally gets around to incrementing the reference count it is accessing invalid memory. This is a version of the lock-free pointer problem described above.
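The race can be seen in a C11 sketch of the two-step acquire (the object layout and names here are illustrative, not from any particular library):

```c
#include <stdatomic.h>

/* An object with its reference count embedded inside it, as described
   above.  The payload is omitted for brevity. */
struct object {
	atomic_int refcount;
};

_Atomic(struct object *) shared;	/* The shared pointer being raced on */

struct object *acquire(void)
{
	struct object *p = atomic_load(&shared);    /* Step 1: read pointer */

	/* RACE WINDOW: another thread may swap out `shared`, observe the
	   count drop to zero, and free the object right here, before the
	   increment below ever runs. */

	if (p)
		atomic_fetch_add(&p->refcount, 1);  /* Step 2: touch object */
	return p;
}
```

Single-threaded the code is fine; the problem is precisely that nothing makes steps 1 and 2 atomic with respect to other threads.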

If the pointer-following could be made atomic, then these data races disappear. To do this from the instruction viewpoint requires "double-dereference" addressing. How could this work? There are some corner cases to work out. What should we do if the pointer we read is invalid? How can we be notified? How should the result of the operation be encoded?

One way of solving this is to require that the result of the instruction in the register is the value of the pointer at the time of execution. If an invalid pointer is used, then the operation simply isn't performed. The user can look at the pointer value to determine whether success occurred or not.

Example syntax of this may look like:

lea	pointer(%rip), %rax	# Load the location of the pointer
lock				# Lock prefix to make next instruction atomic
inc	((%rax))		# Atomic double-dereference increment of reference count
test	%rax, %rax		# Was the pointer null?

AMD has discussed the possible addition of extra instructions for lock-free memory transactions. However, this method appears to be much simpler, and can perhaps add similar power without the complexity.

Complex Arithmetic

Arithmetic with complex numbers is more complex than most programmers realize. The problem is that floating point numbers are approximations of the real number line. IEEE floating point numbers have some nice properties, but some of these get in the way of efficient complex math.

The obvious way of implementing complex multiplication is to use the formula from your math text book:

(a+ib)×(c+id) = (ac - bd) + i(ad + bc).

Look what happens if we use IEEE arithmetic with some extreme values. Let's multiply infinity by one. The result is obviously infinity. This is the case in real arithmetic, and naively, the fact that the imaginary parts of the complex numbers are zero shouldn't affect this result. However, this isn't what happens:

(∞ + i0)(1 + i0) = ∞×1 - 0×0 + i(∞×0 + 1×0) = ∞ + i NaN

Oops, a NaN has appeared because zero times infinity is undefined. Further calculations will cause this NaN to contaminate the results, yielding NaN for both the real and imaginary parts. Thus before we multiply we need to check for extreme cases. Similar problems appear with division, where signed zeros require special checks in order to be produced correctly.
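This failure is easy to reproduce in C with the naive textbook formula (naive_cmul is just an illustrative name):

```c
#include <math.h>

/* Naive textbook complex multiply: (a+ib)(c+id) = (ac-bd) + i(ad+bc).
   Applied to (inf + 0i) * (1 + 0i), the real part comes out as inf,
   but the imaginary part is inf*0 + 1*0 = NaN + 0 = NaN. */
void naive_cmul(double a, double b, double c, double d,
                double *re, double *im)
{
	*re = a * c - b * d;
	*im = a * d + b * c;
}
```

This is exactly why C99's Annex G requires a checked complex multiply: the naive formula silently poisons later results with NaN.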

So high-quality complex arithmetic libraries need to pre-check every calculation to avoid these problems. This is slow. Wouldn't it be nice if single and double-precision complex arithmetic were supported in hardware? Some rudimentary complex support has been added via the addsubpd and addsubps instructions, which simplify a small part of the calculation. Something like the following would be ideal:

cmuld	%xmm0, %xmm1		# Hypothetical complex multiply double instruction
addpd	%xmm2, %xmm1		# Packed addition already works for complex numbers 
cdivd	%xmm1, %xmm2		# Hypothetical complex division


128 bit Floats

A little-known part of the x86_64 ABI is the __float128 data type. This uses an SSE register to store a 128 bit IEEE floating point value. This allows quite a bit of extra precision compared to 64 bit doubles, and even 80 bit long doubles. The extra precision is useful in scientific, engineering, and mathematical applications. For example, if one needs to calculate the moments of the relativistic electron scattering function, there are many cancellations in the analytic formulae when Taylor-expanded. These prevent useful results from being obtained unless double-double or quad-double precision is used.

It would be nice if hardware natively supported this data type. This would speed up high-precision applications by perhaps an order of magnitude or more over the current methods which implement high precision in software.
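Until such hardware exists, high precision is built in software from pairs of doubles. The basic building block of double-double arithmetic is Knuth's two-sum, which computes a + b exactly as a (sum, error) pair using only double precision operations:

```c
/* Knuth's two-sum: s + err == a + b exactly, where s is the rounded
   double sum and err is the rounding error that was lost.  Six floating
   point operations, versus one for a plain add -- which is why native
   128 bit hardware support would be such a large speedup. */
void two_sum(double a, double b, double *s, double *err)
{
	double sum = a + b;
	double bb = sum - a;		/* The part of b that made it in */
	*err = (a - (sum - bb)) + (b - bb);
	*s = sum;
}
```

For example, adding 1.0 to 1e16 rounds away the 1.0 entirely; two_sum recovers it in the error term.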

Packed Integer Multiplication

As new SSE versions have been implemented, there are more and more ways of multiplying integers in SSE registers. However, these instructions are not orthogonal, and different possibilities exist for different sizes. For example, there are no instructions that multiply packed 8 bit quantities.

There are three different ways to multiply packed 16 bit quantities. The first instruction (pmullw) obtains the lowest 16 bits of each multiplication. The next two instructions (pmulhw and pmulhuw) calculate the upper 16 bits of the 32 bit result for signed and unsigned multiplications respectively. Thus to obtain a full 32 bit result for 16 bit arguments two multiplies are needed, and the results combined somehow.

Packed 32 bit quantities work differently. Here, there is an instruction (pmulld) that acts like pmullw, calculating the low half of the result. However, there are no instructions that calculate just the high part. Instead, there are two instructions that do a full 32 bit × 32 bit to 64 bit multiply. One instruction (pmuldq) performs signed, the other (pmuludq) unsigned multiplies. The problem here is that the results are twice the size of the arguments, so something needs to be left out to store them. Intel decided to only store the results of multiplying the 1st and 3rd packed 32 bit quantities. Thus to perform a full packed multiply, two multiply instructions are needed, with a shuffle to get the quantities in the right locations.
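A scalar C model of pmuludq's semantics makes the limitation clear: only lanes 0 and 2 participate, each producing a full 64 bit result.

```c
#include <stdint.h>

/* Scalar model of pmuludq: from four packed unsigned 32 bit lanes, only
   lanes 0 and 2 are multiplied, each to a full 64 bit product.  Lanes 1
   and 3 are simply dropped, which is why a full packed 32x32 multiply
   needs two of these plus a shuffle. */
void pmuludq_model(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
	out[0] = (uint64_t) a[0] * b[0];
	out[1] = (uint64_t) a[2] * b[2];
}
```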

There are currently no 64 bit packed multiplication instructions in any SSE extension. Thus to perform these multiplies one needs to move the data into general purpose registers and back again.

Finally there are "weird" instructions (pmulhrsw, pmaddwd, pmaddubsw) that perform multiply-based operations useful for video and graphics applications. These obviously need to be directly specified as intrinsics because of their strange semantics.

It would be nice if the SSE instruction set were more orthogonal. Perhaps an equivalent of the 16 bit high-multiply instructions would be useful for packed 32 bit quantities, and vice-versa. Also, packed 64 bit multiplies would be extremely useful for cryptography, hashes, compression, fixed-point arithmetic, and perhaps even graphics applications.

SSE to General Purpose moves

One problem with the current instruction set is that transferring data from SSE registers to general purpose registers and back is difficult. The movq instruction allows the lower 64 bits of an SSE register to be accessed. However, the high 64 bits (and, with AVX, the high 192 bits) are inaccessible without either going via memory (movhpd/movhps) or through a shuffle.

It would be nice if there was an instruction to transfer all the information in an SSE register into general purpose registers and back again. Since SSE registers are larger than general purpose ones, such an instruction will obviously need to affect two (or more) of these. The problem here is that the x86 architecture really isn't designed for multiple-output instructions. AVX goes part-way to fixing this with its FMA additions, but that is only for SSE registers.

Instead, in typical x86 style, we should choose the rax/rdx pair as the input/output for this operation. This register pair is used for double-precision integer multiplies, and is also specified by the ABI as the way to return a 128 bit integer result.

If such an instruction were available, it would simplify the task of constructing software based floating point support for the __float128 type. It would also allow fast access to 64 bit × 64 bit to 64 bit multiplies:

mov		constant(%rip), %rcx	# Get constant to multiply both parts by
movssetoad	%xmm0			# Move upper part to %rdx, and lower part of %xmm0 to %rax
imul		%rcx, %rax		# Multiply
imul		%rcx, %rdx
movadtosse	%xmm0			# Move the data back into the sse register.
Compared to

mov	constant(%rip), %rcx		# Get constant to multiply both parts by
movhlps	%xmm0, %xmm1			# Split off upper part into %xmm1
movq	%xmm0, %rax			# Move to general purpose registers
movq	%xmm1, %rdx
imul	%rcx, %rax			# Do the multiplies
imul	%rcx, %rdx
movq	%rax, %xmm0			# Move back into SSE registers
movq	%rdx, %xmm1
movlhps	%xmm1, %xmm0			# Recombine the data
which uses an extra SSE register to perform the operation.

SSE shifts and rotates

Currently performing shifts and rotates on packed data in SSE registers is difficult. The instructions are not particularly orthogonal, and rotate instructions don't exist at all. Instead, they need to be emulated with a pair of shifts together with a combining logic operation.

For 16, 32 and 64 bit packed types we have the psllw, pslld and psllq instructions for left shifts. (Right shifts are similarly implemented.) These instructions allow shifting by a fixed immediate constant, or by a variable count stored in the low 64 bits of an SSE register. Unfortunately, there is no shift that takes a count from a general purpose register. Thus you may need to transfer that value to a spare SSE register via a movq instruction.

The 128 bit case is much worse. The pslldq instruction is much more limited: it only allows immediate constants to describe the shift amount. Thus to generate a computed shift, a jump table is needed. This isn't so bad. However, the shift has byte granularity. Constructing a computed 128 bit variable shift with bit granularity becomes rather verbose:

mov	$0x40, %rax			# %rax = 64
and	$0x7f, %rcx			# Mask count to the range 0 to 127
test	%rax, %rcx			# Is shift >= 64?
jne	1f				# Go to special case if it is

sub	%rcx, %rax			# Calculate 64 - count for right-shift component
movapd	%xmm0, %xmm1			# Copy to spare register
movq	%rcx, %xmm2			# Move counts to SSE registers
movq	%rax, %xmm3
pslldq	$8, %xmm0			# 64 bit shift of lower part into upper part
psllq	%xmm2, %xmm1			# Compute two 64 bit left shifts in parallel
psrlq	%xmm3, %xmm0			# Compute the part that moves from low to high
por	%xmm1, %xmm0			# Combine the results
jmp	2f				# Done

1:
sub	%rax, %rcx			# Get part of shift greater than 64 bits
pslldq	$8, %xmm0			# Shift 8 bytes = 64 bits
movq	%rcx, %xmm1			# Need to move shift count to SSE register
psllq	%xmm1, %xmm0			# Complete shift
2:					# Done

Compared to the hypothetical hardware accelerated version, which is an order of magnitude shorter:

sldq	%rax, %xmm0			# Just do the shift
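For reference, the scalar logic that the long SSE sequence implements looks like this in C (shl128 is an illustrative name, operating on a 128 bit value held as two 64 bit halves):

```c
#include <stdint.h>

/* 128 bit left shift of the value hi:lo by count bits.  The two branches
   mirror the two paths in the asm above: counts of 64 or more move the
   low half into the high half, smaller counts shift both halves and
   carry bits from low to high. */
void shl128(uint64_t *hi, uint64_t *lo, unsigned count)
{
	count &= 127;			/* Mask count to the range 0 to 127 */
	if (count >= 64) {
		*hi = *lo << (count - 64);
		*lo = 0;
	} else if (count > 0) {
		*hi = (*hi << count) | (*lo >> (64 - count));
		*lo <<= count;
	}
	/* count == 0: nothing to do (and avoids a shift by 64, which is
	   undefined behaviour in C) */
}
```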

Horizontal Maximum and Minimum

Quite often one would like to obtain the maximum or minimum from a set of numbers. There are SSE instructions which allow this operation to be vectorized. By using these, we can obtain quite a large speedup. However, the final step of the algorithm requires us to find the "horizontal" maximum or minimum within the accumulator SSE register.

This operation exposes some non-orthogonality in the SSE instruction sets. There are very few ways to transmit information between the packed types. Thus we need to use multiple shuffle and combine operations to complete this task.

However, if you are extremely lucky, there is a case that is already hardware optimized. Finding the horizontal minimum of a set of unsigned words can be done with phminposuw in a single instruction.

Why aren't there equivalent instructions for other types, sizes and signedness? Note that a maximum may be converted into a minimum via a negation operation, but this still doesn't help us much if we would like to quickly scan bytes or single precision floats:

movhlps	%xmm0, %xmm1			# Move top two floats to lower part of %xmm1
minps	%xmm1, %xmm0			# Get minimum of sets of two floats
pshufd	$0x55, %xmm0, %xmm1		# Move second float to lower part of %xmm1
minps	%xmm1, %xmm0			# Get minimum of all four floats originally in %xmm0
compared to the hypothetical

hminps	%xmm0, %xmm1			# Fictional Horizontal minimum of packed floats.
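The hypothetical hminps boils down to a horizontal reduction over the four single precision lanes, as this scalar C sketch shows:

```c
/* Scalar model of the fictional hminps: reduce four packed single
   precision lanes to their minimum. */
float hmin4(const float v[4])
{
	float m = v[0];
	for (int i = 1; i < 4; i++)
		if (v[i] < m)
			m = v[i];
	return m;
}
```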

Unaligned SSE

The final thing that would be really good to change with the SSE extensions is the elephant in the room: alignment. SSE instructions typically require that their memory accesses be 16 byte aligned. This is not too much trouble if you are in control of the source of the data you are processing. However, sometimes you need to accept data from an arbitrary source. If so, you may need to handle arbitrary alignments.

The fact that you may need to handle non-aligned data requires the code to waste time checking for it and deciding which version of the algorithm to use. More space (and thus cache) is then wasted in keeping the non-aligned version around. Then there is the problem of constructing the non-aligned algorithm in the first place.

Some algorithms are easy to change: just use unaligned loads and stores, i.e. movupd instead of movapd. However, move instructions aren't the only ones that can access memory. The power of the x86 instruction set allows arithmetic and logic operations to also have memory as one of their arguments. Unfortunately, no unaligned versions of these instructions exist. Thus a highly optimized aligned algorithm may need significant changes to work in the unaligned case, with perhaps many extra explicit load and store instructions.

The lddqu instruction shows that the hardware can automatically decide the best way to load from memory in 16 byte chunks: a fast method if the memory is aligned, and a less fast method if not. Why can't this hardware also be used in other instructions? In short, if the alignment restrictions are removed (just as they are for operations with general purpose registers), then code can be made much more compact and efficient.

Your Opinion

The above lists some possibilities of how the x86 instruction set could be improved. Some instructions could be added to increase orthogonality. Some could be slightly changed to define previously undefined behaviour. Some could be improved to work in more situations than before. What is your wishlist? In what other ways do you think the instruction set could be improved?


Dmitry V'jukov said...
1. Atomic RMW operations w/o trailing #StoreLoad style memory fence, something along the line of:

2. System-wide memory fence (hw analog of FlushProcessWriteBuffers())

3. AMD ASF, double dereference won't do the thing.

sfuerst said...
These are pretty good. However, I think number 2 may lead to a DoS problem with current systems. Imagine a malicious process calling the system memory fence over and over in a tight loop. Every other process on the system is going to run much more slowly due to the enforced synchronization.

AMD's ASF is nice... but I thought it would be a bit presumptuous arguing for it here. Double-dereference is much simpler, and it is quite surprising how powerful it is. I should probably write an article on it. :-)
Dmitry V'jukov said...
> However, I think number 2 may lead to a DOS problem with current systems.

Well, I think that it's a bit late to cope about security when a robber is already in an apartment. I believe there are 1000 and 1 ways how a malicious process can compromise the system's performance. If you want security just do not run a suspicious process, or use a virtual environment.
Anyway, QPI based systems *already* have such a feature, so we are not losing anything.

As for "poor man's transactional memory", MOV [mem],[mem] can be useful too. It would allow to implement a zero cost SMR. However, it still requires locking of 2 separate cache-lines, which is quite problematic on QPI-based systems.

Btw, the problem you outlined in the "Double-dereference" part is solvable today with strongly thread-safe reference counting algorithms. They allow you to acquire a reference to an object when you do not already have one.
Peter Kankowski said...
About bit scan: AMD recently added LZCNT instruction, which is similar to BSR, but returns the register size (e.g., 32 bits) for all-zeros.
sfuerst said...
Argh, I forgot about LZCNT. That's what I get for reading through the latest Intel docs, and forgetting to do the same with the AMD ones. :-(
sfuerst said...
Oooh the "operand [mem], [mem]" version looks very interesting. I had overlooked the possibility of adding double-memory operations. If you are going to add an addressing mode, why not forget all the previous limits. ;-)

The fact that 2 separate cache lines are required for double-dereference isn't much of a problem. As things currently stand atomic ops are allowed to cross cachelines, and in fact pages. Of course the maximum number of cache lines required is currently two - and double dereference might need four...

Yeah, you can emulate the reference counting effect of double-deference with hazard pointers. The problem with those is handling the annoying special cases where threads are being created and destroyed simultaneously with the memory ops. Then there is the fact that hazard pointers are O(n) rather than O(1) :-/
Dmitriy V'jukov said...
>As things currently stand atomic ops are allowed to cross cachelines, and in fact pages

The problem is performance rather than possibility. Split atomic RMW on QPI based system takes 5000 cycles, and effectively acts as FlushProcessWriteBuffers():

>Yeah, you can emulate the reference counting effect of double-deference with hazard pointers.

I meant atomic reference counting, one still pays 1 atomic RMW (perhaps double-word) per operation (acquire/release), and no corner cases.

btw, the captcha is annoying
sfuerst said...
I've fixed the captcha now. The timeout was too low in some cases, so it would fail you even if you got the answer correct. :-/
Jeff K said...
Atomic double-dereference is a band-aid for reference-counting, and it's fundamentally broken. The problem is what happens when the first dereference is invalid. You could take an exception (consistent with the rest of the architecture!), but the resulting performance would be awful. If you were writing code, you would probably include a null check, but the architecture doesn't give any special meaning to null. In fact, I have seen several times memory mapped at null and used. On Windows machines, NTVDM does this in order to set up the V86 VM. I suppose you could have it compare the address with a special register, but this sort of solution is somewhat ugly, and there is still the question of how to signal the condition that the first dereference matches that value.

Hazard pointers are quite efficient and are a general solution to the same problem that works on existing hardware with a wide variety of data structures. They require that the number of threads be statically bounded, however.

I agree with pretty much everything you said about x86 SIMD. There are so many special cases, and at this point it's impossible to make any sense of it. It has been particularly annoying that basic ops like multiply continue to work only for specific integral types.

I have also been frustrated with limitations of the SETcc and CMOVcc instructions. In fact I wish SETcc would set a 4-byte register to either 00000000 or FFFFFFFF (or sign-extend to 8-bytes in AMD64) for branch optimization. I'm sure C++ compiler-writers would kill me for that suggestion.

The BSF/BSR instructions are generally implemented as CLZ/CTZ on other processors. These instructions are well-defined in all cases. If the source operand is zero, then they store the total number of bits in the register (i.e. 32 for x86 and 64 for x64). This is consistent with GCC intrinsics.

Finally, there is a processor that has an on-die RNG: the VIA C3, C7, and Nano processors. They are embedded x86 processors with specialized instructions for accessing the on-die RNG and have additional hardware support for accelerating AES, RSA, and SHA algorithms.
Jeff K said...
I have a few additional suggestions for instructions to add to your wish list:

1) ALU instructions that do not set flags:
   addnf dst, src
   ornf dst, src
   adcnf dst, src
   sbbnf dst, src
   andnf dst, src
   subnf dst, src
   xornf dst, src
   incnf dst, src
   decnf dst, src
   rolnf dst, src
   rornf dst, src
   shlnf dst, src
   shrnf dst, src
   sarnf dst, src

   These are very useful in branch-free code for scheduling instructions when the flags need to be preserved. I did not include cmp or test for obvious reasons, and I also didn't include rcr or rcl. By the way, shifts, rotates, and inc/dec have historically been problematic for the architecture because of the way they update eflags. They result in stalls under certain conditions.
   These are also useful in virtualization code. It makes it possible to update the virtual cycle counter without saving eflags, for example.
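To see why this matters, consider a multiword add: in assembly the carry travels in CF between the add and the adc, so any bookkeeping instruction scheduled in between must not write the flags, which is exactly what the proposed *nf forms would guarantee. A C sketch of the same carry chain (the u128/add128 names are my own):

```c
#include <stdint.h>

/* 128-bit add as two 64-bit adds with an explicit carry -- the C
 * analogue of an add/adc pair.  In assembly the carry lives in CF
 * between the two instructions, so nothing flag-writing can be
 * scheduled there. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
    return r;
}
```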

2) bitswap dst, src:

   Reverses all the bits in src and stores the result in dst. It's similar to bswap. This isn't incredibly common, but it comes up over and over again. It would also be acceptable to reverse the bits in each byte of src such that using bswap on the result would complete the swap operation.
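For reference, the usual software fallback is the divide-and-conquer reversal, which is exactly what a bitswap instruction would collapse into one cycle; a C sketch:

```c
#include <stdint.h>

/* Portable 32-bit bit reversal.  Classic divide-and-conquer: swap
 * odd/even bits, then 2-bit pairs, nibbles, bytes, and finally 16-bit
 * halves (the last two steps are what bswap already provides). */
static uint32_t bitswap32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
    return (x << 16) | (x >> 16);
}
```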

3) Bit string manipulation instructions!

   insr dst, src, imm6, imm6
   extr dst, src, imm6, imm6
   mux4 dst, src, imm8

   The insr instruction computes the following:
   msk = rol((1 << len) - 1, off)
   dst = (dst & ~msk) | (rol(src, off) & msk)

   This allows a right-aligned bit string of any length to be copied from one register into another at any arbitrary position.

   The extr instruction computes the following:
   dst = ror(src, off) & ((1 << len) - 1)

   This allows any arbitrary bit string to be copied from one register into another and zero-extended or sign-extended.

   These two instructions are available on IA-64 as dep and extr, and PowerPC has rlwinm which is very similar. These instructions would optimize C bit-field manipulation, and they're also extremely useful in decoding RISC instructions.
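A C reference model of the two operations, following the msk/rol recipes above (32-bit, zero-extending extr only; the insr32/extr32 names are my own):

```c
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* insr: deposit the low `len` bits of src into dst at bit offset `off`. */
static uint32_t insr32(uint32_t dst, uint32_t src, unsigned off, unsigned len)
{
    uint32_t msk = rol32(len >= 32 ? 0xFFFFFFFFu : (1u << len) - 1, off);
    return (dst & ~msk) | (rol32(src, off) & msk);
}

/* extr: extract `len` bits of src starting at bit `off`, zero-extended. */
static uint32_t extr32(uint32_t src, unsigned off, unsigned len)
{
    uint32_t msk = len >= 32 ? 0xFFFFFFFFu : (1u << len) - 1;
    return rol32(src, 32 - off) & msk;   /* ror(src, off) & mask */
}
```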

   The mux4 instruction divides the src & dst registers into 4 equal-sized partitions and computes the following:
   dst[0] = src[imm8 & 3]
   dst[1] = src[(imm8 >> 2) & 3]
   dst[2] = src[(imm8 >> 4) & 3]
   dst[3] = src[(imm8 >> 6) & 3]

   This allows any byte to be moved into any position in a 32-bit register or any two bytes to be moved into any position in a 64-bit register. It would be possible to also have a mux8 instruction with a 24-bit immediate or a mux16 instruction with a 64-bit immediate.

   An alternative implementation might be to allow copying an arbitrary element from the source register into an arbitrary position in the destination register.
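A C reference model of mux4 for the 32-bit case, per the dst[i] = src[(imm8 >> 2*i) & 3] recipe above (the mux4_32 name is my own):

```c
#include <stdint.h>

/* mux4 on a 32-bit register: four byte-sized partitions, each
 * destination byte selecting any source byte via a 2-bit field of the
 * immediate. */
static uint32_t mux4_32(uint32_t src, uint8_t imm8)
{
    uint32_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3;       /* which source byte */
        uint32_t byte = (src >> (8 * sel)) & 0xFFu;
        dst |= byte << (8 * i);                     /* place it in lane i */
    }
    return dst;
}
```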

4) ll/sc

   Many of the RISCs have load-linked/store-conditional. It's more generic than cmpxchg and solves the A-B-A problem without requiring DCAS. It would also totally obviate the need for the lock prefix.
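For comparison, here is the shape an ll/sc update loop takes when emulated with C11 compare-exchange, which is how such loops must be written on x86 today. Note that cmpxchg compares only the value, which is where the A-B-A problem sneaks back in; a real sc would fail on any intervening access to the line:

```c
#include <stdatomic.h>

/* An ll/sc-style read-modify-write loop, emulated with C11
 * compare-exchange: load, compute the new value, and retry if anyone
 * changed the location in between.  `old` is refreshed by the failed
 * exchange on each retry. */
static long fetch_update_double(_Atomic long *p)
{
    long old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old * 2))
        ;  /* retry with the freshly observed value */
    return old;
}
```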
sfuerst said...
Hello Jeff,

The whole point of atomic double-dereference is to avoid the possibility of accessing invalid pointers. The instruction should simply not perform the operation if it would segfault. By returning the address it used/did not use, the client code can work out whether or not the operation actually occurred. Since we know that the page at address zero isn't mapped, NULL works as intended. If that page is used, you can transparently use the address of some other unmapped page for the NULL bit-pattern.

I like hazard pointers. However, they aren't the perfect solution. As you mention, they are much harder to use when the number of threads is either unknown, or changes at runtime. Another problem with them is that the writer side needs to scan a list that is proportional to the number of threads. Since each single check will inevitably result in a cache-miss, the overhead can become quite large when you have loads of threads.

Yes, I'd love the set-all-bits on cc instruction. "sbb reg,reg" is my favourite instruction because it does that with the carry flag. In fact using the "neg" instruction to test for zero instead of "test" can be much more elegant due to its use of the carry instead of the zero flag.
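The C spelling of these idioms, which compilers will often lower to exactly the cmp+sbb or neg+sbb sequences mentioned; a small sketch (the function names are my own):

```c
#include <stdint.h>

/* All-ones or all-zeroes mask from a comparison, with no branch --
 * the "sbb reg,reg" idiom in C. */
static uint32_t mask_lt(uint32_t a, uint32_t b)
{
    return -(uint32_t)(a < b);       /* 0xFFFFFFFF if a < b, else 0 */
}

/* The neg-based zero test: CF is set iff the operand was nonzero. */
static uint32_t mask_nonzero(uint32_t x)
{
    return -(uint32_t)(x != 0);
}

/* Branch-free select built on the mask: r = c ? x : y */
static uint32_t select_u32(uint32_t c, uint32_t x, uint32_t y)
{
    uint32_t m = mask_nonzero(c);
    return (x & m) | (y & ~m);
}
```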

I know it sounds weird, but having new instructions that don't set the flags might actually be a pessimization. Internally, the processor counts on the fact that most instructions completely rewrite/invalidate the flags. This allows dependency chains to be shortened or removed, and more OOO parallelism to be obtained. Also, with macro-op fusion which exists in modern processors test+branch takes but a single cycle, so redoing a test isn't expensive.

Bitswap would be nice for hash functions and encryption. :-) It is also very useful for locality-sensitive data structures. (Not quite as good as Hilbert Curve order, but close.)

ll/sc is actually quite horrible. The problem with it is that the sc can fail. This turns every atomic operation into some sort of loop. Now you have the difficulty of solving live-lock, where multiple threads interfere with each other. Correcting that with some sort of exponential back-off greatly bloats your atomic primitives. AMD has suggested something much like this to help lock-free code. However, it doesn't help the much more performant wait-free algorithms. We really need more atomics... the problem is finding which ones help the most. :-)
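To illustrate the bloat, a cmpxchg loop wrapped in exponential back-off might look like this (the spin limits are arbitrary illustrative numbers, not tuned values):

```c
#include <stdatomic.h>

/* A retry loop with exponential back-off to damp live-lock: on each
 * failed exchange, spin for a doubling (capped) delay before retrying. */
static long fetch_add_backoff(_Atomic long *p, long n)
{
    unsigned delay = 1;
    long old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(p, &old, old + n)) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                        /* burn a few cycles */
        if (delay < 1024)
            delay <<= 1;             /* exponential growth, capped */
    }
    return old;
}
```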
Jason Bourne said...
1. Multiword BSF/BSR.
2. A reliable and faster MONITOR/MWAIT.
Jasper said...
A random number generator instruction (RDRAND reg16/32/64) has been announced by Intel as a post-32nm processor instruction.
The documentation is found under http://software.intel.com/en-us/avx/ when you click on "Intel Advanced Vector Extensions Programming Reference".
You can test it with version >=3.88 of the emulator: http://software.intel.com/en-us/articles/intel-software-development-emulator/
FUZxxl said...
@Jeff K

In my opinion, the best way to implement this would be a new prefix "nof" that tells the processor to ignore any effects on the flags register caused by the next instruction. Then you can write something like

nof add %eax,%ebx
nof dec %eax


I like this more since the usage looks very intuitive.
AcidTonic said...
Finally, some meat from a technical writer.

I absolutely love your articles, especially this one.

Cheers from Michigan!
mad said...
A fast way to do linear interpolation from floating point indexes would be nice. Kinda like the texturing stuff they had in the Larrabee.
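In the meantime the operation has to be done in software; a scalar C sketch of table lookup plus linear interpolation from a floating-point index (the lerp_index name is my own):

```c
#include <stddef.h>

/* Split a floating-point index into an integer part (which table entry)
 * and a fraction (the blend weight), then linearly interpolate -- the
 * scalar version of what texture units do in hardware.  Out-of-range
 * indexes clamp to the table edges. */
static double lerp_index(const double *table, size_t n, double idx)
{
    if (idx <= 0.0) return table[0];
    if (idx >= (double)(n - 1)) return table[n - 1];
    size_t i = (size_t)idx;          /* integer part selects the entry */
    double f = idx - (double)i;      /* fraction blends with the next  */
    return table[i] + f * (table[i + 1] - table[i]);
}
```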

Maybe an instruction that interleaves the bits from two registers would be cool too (improves data locality of 2d arrays).
Jeremy said...
See the new rdrand instruction that Ivy Bridge introduces for the first item
on your wishlist!
Jasper said...
With the new Haswell instructions PDEP and PEXT you can do some bit magic, such as the bit interleaving mad wished for:

mov ecx,0x55555555
pdep eax,eax,ecx
pdep edx,edx,ecx
lea eax,[eax+edx*2]
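For readers without Haswell hardware, the same computation spelled out portably in C: scatter the low 16 bits of each input to the even bit positions, then merge one of them shifted up by one, which mirrors the lea with the *2 scale (the spread16/interleave16 names are my own):

```c
#include <stdint.h>

/* Scatter the low 16 bits of x to the even bit positions of a 32-bit
 * word -- the software equivalent of pdep with mask 0x55555555. */
static uint32_t spread16(uint32_t x)
{
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Interleave the low 16 bits of a and b (a in even bits, b in odd). */
static uint32_t interleave16(uint32_t a, uint32_t b)
{
    return spread16(a) + spread16(b) * 2;   /* mirrors the lea */
}
```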
Lucho said...
Very good and spot on suggestions.

I would like to add one more that is very important in my opinion. It is about memory access. Very often, otherwise well-vectorizable algorithms require reading/writing from/to memory addresses that are calculated per-channel (reading from a table, sampling a texture, etc.).
When you get to this, you are forced to make that part of the algorithm scalar by extracting each channel in turn to a GP register, performing the memory operation, and then inserting the result back into a vector register.
I don't think a single instruction that interprets each channel as an address and reads/writes to different memory locations at once is hardware-feasible (though it would be extremely good), but at least they should try to devise some helper instructions that would ease the situation. I don't have an immediate idea what those instructions would be.
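The scalar extract/load/insert dance being described looks like this in C (the gather4 name is my own; a real gather instruction would issue all four loads at once):

```c
#include <stdint.h>

/* Scalar gather fallback: each lane of an index vector is pulled out,
 * used as a table offset, and the loaded value is inserted back into
 * the corresponding result lane. */
static void gather4(uint32_t dst[4], const uint32_t *table,
                    const uint32_t idx[4])
{
    for (int lane = 0; lane < 4; lane++)
        dst[lane] = table[idx[lane]];   /* extract, load, insert */
}
```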
Lucho said...
..continuing the above :)

It would be great if they added instructions for memory access that read the address directly from an SSE/AVX register instead of a GP register, e.g.:

ploaddq $i, %xmm0, %xmm1 - read a double quadword from the address specified in the i-th dword of xmm0 and store it in xmm1

ploadq $(i + (j<<4)), %xmm0, %xmm1 - read a quadword from the address specified in the i-th dword of xmm0 and store it in the j-th half of xmm1

ploadd $(i + (j<<4)), %xmm0, %xmm1 - read a dword from the address specified in the i-th dword of xmm0 and store it in the j-th quarter of xmm1

pstored $(i + (j<<4)), %xmm0, %xmm1 - store a dword to the address specified in the i-th dword of xmm0; the source is the j-th quarter of xmm1

something like that :)

Reneg said...
Please add comparison of <= and/or >= for integers.
pcmpged / pcmpled

Currently I need to check for (a<b) | (a==b) to get the move mask.
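A scalar model of one lane of the desired pcmpged, built from the greater-than and equality primitives exactly as described (the lane_cmpge name is my own):

```c
#include <stdint.h>

/* Per-lane >= mask from the existing primitives:
 * ge(a,b) = gt(a,b) | eq(a,b).  Produces the usual all-ones/all-zeroes
 * SIMD-style mask for one signed 32-bit lane. */
static uint32_t lane_cmpge(int32_t a, int32_t b)
{
    uint32_t gt = -(uint32_t)(a > b);
    uint32_t eq = -(uint32_t)(a == b);
    return gt | eq;                 /* same as -(uint32_t)(a >= b) */
}
```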
