IBM Search Java Site Map Microelectronics News Products Services Technology About Us IBM Microelectronics Order Contact Legal

power pcProducts

overview
news
products
documents
performance
technology


[ Table of Contents | Index ]
Chapter 5

5. Clever Examples


The following code sequences illustrate interesting ways to implement various functions using PowerPC code. A compiler might generate some of these examples, but in many cases the code would more likely be found in a run-time library function. These examples apply to 32-bit implementations and 64-bit implementations running in 32-bit mode. The concepts apply to 64-bit mode, but the specific code sequences may require some adjustments.


5.1 Sign Function

The sign or signum function is defined by:

Figure 5-1 shows a four-instruction sequence that computes the sign function for integers. Section D.4 on page 205 presents additional sequences.



Figure 5-1. Sign Function Code Sequence

# R3 contains x
srawi R4,R3,31 # x >> 31
neg R5,R3 # -x
srwi R5,R5,31 # t = (unsigned) -x >> 31
or R6,R4,R5 # sign(x) = (x >> 31) | t


5.2 Transfer of Sign

The function that transfers the sign of one argument to another, the integer version of which is called ISIGN in FORTRAN, is defined by:

Figure 5-2 shows a four-instruction sequence that calculates the ISIGN function (mod 232).



Figure 5-2. Fortran ISIGN Function Code Sequence

# R3 contains x
# R4 contains y
xor R5,R4,R3 # x ^ y
srawi R5,R5,31 # t = (x ^ y) >> 31
xor R6,R3,R5 # x ^ t
sub R6,R6,R5 # ISIGN(x,y) = x ^ t - t


5.3 Register Exchange

Figure 5-3 shows two code sequences that exchange the contents of two registers without the need of a temporary register. The approach of the first sequence, which uses XOR operations, derives from the fact that a ^ b ^ a = b. The EQV operation can substitute for XOR operation. The second sequence uses an addition and two subtractions. The large number of registers in the PowerPC architecture, however, makes the need for such exchanges unlikely.



Figure 5-3. Register Exchange Code Sequence

Using XOR
# R3 = a
# R4 = b
xor R3,R3,R4 # R3 = a ^ b
xor R4,R4,R3 # R4 = b ^ (a ^ b) = a
xor R3,R3,R4 # R3 = (a ^ b) ^ a = b
Using Arithmetic
# R3 = a
# R4 = b
add R3,R3,R4 # R3 = a + b
sub R4,R3,R4 # R4 = (a + b) - b = a
sub R3,R3,R4 # R3 = (a + b) - a = b


5.4 x = y Predicate

The x = y predicate has the value 1 if x = y, and 0 if x y. The three-instruction code sequence in Figure 5-4 computes this predicate. The x = 0 special case requires only two instructions because the subtraction is not necessary.



Figure 5-4. "x = y" Predicate Code Sequence

# R3 contains x
# R4 contains y
sub R5,R4,R3 # x - y
cntlzw R5,R5 # nlz(x - y)
srwi R5,R5,5 # (unsigned) nlz(x - y) >> 5


5.5 Clear Least-Significant Nonzero Bit

The code in Figure 5-5 illustrates how to clear the least significant non-zero bit of an integer x by ANDing the value with its value decremented by 1.



Figure 5-5. Clear Least-Significant Nonzero Bit Code Sequence

# R3 contains x
subi R4,R3,1 # tmp = x - 1
and R3,R3,R4 # x = x & (x - 1)

The code in Figure 5-6 uses this idea to test for 0 or a power of 2. If the result following the clearing of the least-significant bit is 0, the original value was either 0 or a power of 2.



Figure 5-6. Test for 0 or a Power of 2 Code Sequence

# R3 contains x
subi R4,R3,1 # tmp = x - 1
and. R4,R4,R3 # tmp = x & (x - 1)
beq cr0,ZeroOrPowerOfTwo # branch if x = 0 or
# if x = power of 2


5.6 Round to a Multiple of a Given Power of 2

Figure 5-7 illustrates how to round a value up to a multiple of a given power of 2.



Figure 5-7. Round Up to a Multiple of 8 Code Sequence

# R3 contains x
addi R4,R3,7 # x + 7
rlwinm R4,R4,0,0,28 # (x + 7) & -8


5.7 Round Up or Down to Next Power of 2

The floor power of 2 (flp2) and ceiling power of 2 (clp2) functions are similar to the floor and ceiling functions, respectively, but they round to an integral power of 2, rather than to an integer. Figure 5-8 tabulates some sample values of these functions.



Figure 5-8. Values of flp2(x) and clp2(x)

x flp2(x) clp2(x)
0 0 0
1 1 1
2 2 2
3 2 4
4 4 4
5 4 8
... ... ...
231-1 230 231
231 231 231
231+1 231 0
... ... ...
232-1 231 0

For x > 231, clp2(x) is defined to be 0 because 0 is the mathematically correct result modulo 232, following the usual computer arithmetic convention. Defining these functions to be 0 for x = 0 is arbitrary.

Figures 5-9 and 5-10 show code sequences that calculate flp2(x) and clp2(x), respectively. The notation nlz(x) denotes the number of leading zeros function (evaluated using the cntlzw instruction). Because PowerPC shifts in 32-bit mode are mod 64, these instructions give a zero result if the shift amount is in the range 32 to 63.



Figure 5-9. flp2(x) Code Sequence

# R3 contains x
lis R0,0x8000 # load constant 0x8000_0000
cntlzw R4,R3 # nlz(x)
srw R4,R0,R4 # flp2(x) = 0x8000_0000 >> nlz(x)



Figure 5-10. clp2(x) Code Sequence

# R3 contains x
li R1,1 # load constant 1
addi R4,R3,-1 # x - 1
cntlzw R4,R4 # nlz(x - 1)
subfic R4,R4,32 # 32 - nlz(x - 1)
slw R4,R1,R4 # clp2(x) = 1 << (32 - nlz(x - 1))


5.8 Bounds Checking

Bounds checking refers to verification that an integer x lies between two bounds a and b; that is, a x b. If the integers are signed and a b, a x b is equivalent to (x - a) u (b - a), where the notation denotes a signed comparison, and u denotes an unsigned comparison. Similarly, if the integers are unsigned and a u b, a u x u b is equivalent to (x - a) u (b - a). Thus, for both signed and unsigned integers, a single comparison can perform a check that seems to require two comparisons.

An important application of bounds checking is to ensure that array indices fall in the proper range. For example, suppose values from 1 to 10 index a one-dimensional array A. For a reference A(i), a compiler might generate code to check that 1 i 10 and trap if this condition is not satisfied. The compiler can do this check by evaluating the inequality (i - 1) u 9. In this example, there is a good chance that the quantity (i - 1) is needed to do array indexing, so only one additional instruction (trap on unsigned greater than 9, using the twi instruction) effectively accomplishes the check.

These transformations are correct only if a b (or a u b). Computer languages that do not allow arrays to have a zero or negative number of elements may use these transformations even when the array bounds are variables.


5.9 Power of 2 Crossing

Given an address A and a length L that address memory, we wish to determine whether the referenced bytes cross a power of 2 boundary of some particular size. The four-instruction sequence in Figure 5-11 illustrates this operation for a page boundary (4096 bytes).



Figure 5-11. Detect Page Boundary Crossing Code Sequence

# R3 contains address A
# R4 contains length L
rlwinm R5,R3,0,20,31 # A & 4095
subfic R5,R5,0x1000 # t = 4096 - (A & 4095)
cmplw cr3,R5,R4 # unsigned compare of t and L
blt cr3,boundary_cross # branch if t u< L

If a boundary crossing occurs,

L - (4096 - (A & 4095))

gives the length that extends beyond the block boundary and may be calculated with one additional instruction (subf).


5.10 Count Trailing Zeros

The four instruction sequence in Figure 5-12 calculates the number of trailing zeros (ntz) of a word x. The first two instructions form a mask identifying the trailing zeros. Then, the number of leading zeros is subtracted from 32 to yield the result.



Figure 5-12. Count Trailing Zeros Code Sequence

# R3 contains x
addi R4,R3,-1 # x - 1
andc R4,R4,R3 # ~x & (x - 1)
cntlzw R4,R4 # t = nlz(~x & (x - 1))
subfic R4,R4,32 # ntz(x) = 32 - t

The "number of powers of 2" (npow2) function might be defined as follows:

This variation of the count trailing zeros function treats a 0 argument as a special case, returning -1. Figure 5-13 shows a code sequence that calculates npow2(x). The first two instructions form a mask identifying the least-significant 1-bit. Then, the number of leading zeros is subtracted from 31 to yield the result. An argument of 0 generates an all-0 mask, which has 32 leading zeros. Subtracting 32 from 31, the function returns -1.



Figure 5-13. Number of Powers of 2 Code Sequence

# R3 contains x
neg R4,R3 # -x
and R4,R4,R3 # x & -x
cntlzw R4,R4 # t = nlz(x & -x)
subfic R4,R4,31 # npow2(x) = 31 - t


5.11 Population Count

The population count is the number of 1-bits in a 32-bit word. Figure 5-14 shows a branch-free function for population count. The algorithm involves summing the 1-bits in 2-bit, 4-bit, 8-bit, 16-bit and 32-bit fields sequentially. This function requires 18 instructions, counting one for a load of each of the large immediate values (but neglecting the function prolog and epilog). The advantage of this algorithm is its relatively short worst-case execution time and the lack of branches.



Figure 5-14. Branch-Free Population Count Code Sequence

C Source Code
int nbits(unsigned int x)
{
unsigned long int t;
x = x - ((x >> 1) & 0x55555555);
t = ((x >> 2) & 0x33333333);
x = (x & 0x33333333) + t;
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x << 8);
x = x + (x << 16);
return(x >> 24);
}
Assembly Code
# R3 contains x
lwz R4 # load R4 with 0x33333333
lwz R5 # load R5 with 0x55555555
lwz R6 # load R6 with 0x0F0F0F0F
srwi R7,R3,1 # x >> 1
and R7,R7,R5 # t = (x >> 1) & 0x55555555
sub R3,R3,R7 # x = x - t
srwi R7,R3,2 # x >> 2
and R7,R7,R4 # t1 = (x >> 2) & 0x33333333
and R8,R3,R4 # t2 = x & 0x33333333
add R3,R7,R8 # x = t1 + t2
srwi R7,R3,4 # x >> 4
add R7,R7,R3 # t = x + x >> 4
and R3,R7,R6 # x = t & 0x0F0F0F0F
slwi R7,R3,8 # x << 8
add R3,R7,R3 # x = x + x << 8
slwi R7,R3,16 # x << 16
add R3,R7,R3 # x = x + x << 16
srwi R3,R3,24 # return x >> 24

Other algorithms that employ loops have smaller code volume and execute faster for some values of x. For example, if you expect that x contains only a few 1-bits, construct a loop that counts the number of times a bit in x is turned off. The code in Figure 5-15 uses x = x & (x - 1) to clear the least-significant 1-bit.



Figure 5-15. Branching Population Count Code Sequence

# R3 contains x
cmplwi cr0,R3,0 # test for no bits set
li R4,0 # initialize the counter
beq Done # exit if x = 0
Loop:
subi R5,R3,1 # tmp = x - 1
and. R3,R5,R3 # x = x & (x - 1)
addi R4,R4,1 # increment the counter
bne Loop # next iteration
Done: # R4 contains population count

Figure 5-16 shows an alternative approach employing table lookup. This code will probably execute faster than that of Figures 5-14 and 5-15 on many implementations, and for many values of the argument x, particularly on implementations with a large number of functional units.



Figure 5-16. Alternative Population Count Code Sequence

C Source Code
int nbits(unsigned int x)
{
static unsigned char popcnt[256] =
{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
int count0, count1, count2, count3;
count0 = popcnt[(x & 0x000000FF)];
count1 = popcnt[((x >> 8) & 0x000000FF)];
count2 = popcnt[((x >> 16) & 0x000000FF)];
count3 = popcnt[((x >> 24) & 0x000000FF)];
return(count0 + count1 + count2 + count3);
}
Assembly Code
# R3 contains x
lwz R2 # load R2 with address of POPCNT array
andi R4,R3,0x00FF # extract & right-justify byte 3
rlwinm R5,R3,24,24,31 # extract & right-justify byte 2
lbzx R4,R2,R4 # popcnt[byte 3]
rlwinm R6,R3,16,24,31 # extract & right-justify byte 1
lbzx R5,R2,R5 # popcnt[byte 2]
rlwinm R7,R3, 8,24,31 # extract & right-justify byte 0
lbzx R6,R2,R6 # popcnt[byte 1]
lbzx R7,R2,R7 # popcnt[byte 0]
add R3,R4,R5 #
add R4,R6,R7 # accumulate result into R3 and return
add R3,R3,R4 #

You may also code a novel algorithm from the following rather surprising formula (Morton [1990]):

,

where rotatel(x,i) rotates x to the left i places.


5.12 Find First String of 1-Bits of a Given Length

The problem of finding the first occurrence of a string of 1-bits of a given length has application in disk allocation algorithms. For example, if x = 0b001110001111100011...1 and you are searching for 4 consecutive 1-bits, this function should return 8 (which is the position of the leftmost bit in the leftmost string of 4 or more consecutive 1-bits, where the bits are numbered from the left starting with 0).

One algorithm for this operation uses a series of cntlzw instructions. The function first counts the number of leading 0s in x, and shifts x left by that amount. The shift amounts are summed to a variable to keep track of the total amount shifted. Then, the function counts the number of leading 1s in x (by counting the number of leading 0s in ~x). If this number is sufficiently large, it returns, with the return value equal to the total amount shifted. Otherwise, it shifts left by the value that "count leading zeros" returned to discard the too-short sequence of 1s just encountered, and increases the total shift by this shift amount. This process repeats until either the function returns, or x is 0, indicating that no sequence of sufficient length was found.

This algorithm is fast if the argument x consists of a small number of groups of consecutive 0s and 1s, or if the desired sequence is found quickly, which may be common situations. On the other hand, this algorithm has a worst-case execution time of about 180 instructions for x = 0b0101...01 and n 2.

An algorithm based on a sequence of Shift Left and AND instructions has a shorter worst-case execution time. To see how this algorithm works, consider searching for a string of eight or more consecutive 1-bits in a 32-bit word x. This search might proceed as follows:

x = x & (x << 1);
x = x & (x << 2);
x = x & (x << 4);

After the first assignment, the 1s in x indicate the starting positions of strings of length two. After the second assignment, the 1s in x indicate the starting positions of strings of length four (a string of length two followed by another string of length two). After the third assignment, the 1s in x indicate the starting positions of strings of length eight. Executing "count leading zeros" on this word generates the position of the first string of length eight (or more) or the value 32 if none exists.

Observe that the above three assignments may occur in any order. To develop an algorithm that works for any length n from 1 to 32, the reverse order is more convenient. The case n = 10 illustrates the general method:

x1 = x & (x << 5);
x2 = x1 & (x1 << 2);
x3 = x2 & (x2 << 1);
x4 = x3 & (x3 << 1);

The first statement uses an n/2 shift, which reduces the problem to that of finding a string of five 1-bits in x1. The next statement shifts x1 left by floor(5/2) = 2 and ANDs it with x1, which reduces the problem to that of finding a string of length 3 (5 - 2). The last two statements identify the location of length-3 strings in x2. The sum of the shift amounts is always n - 1.

Figure 5-17 shows this algorithm for a general n. The execution requires from 3 to 38 instructions, as n ranges from 1 to 32.

If n is often moderately large, unroll this loop by repeating the loop body five times. This unrolled loop gives a branch-free algorithm that executes in a constant 20 instructions. The unrolled version requires fewer instructions than the looping version for n 5. (For a 64-bit version of this algorithm, the unrolled code would repeat the loop six times).



Figure 5-17. Detect First String of n 1-Bits Code Sequence

C Source Code
int ffstr(int x, int n)
/* Must have 1 n 32 */
{
int s;
while(n > 1) {
s = n >> 1;
x = x & (x << s);
n = n - s;
}
return(nlz(x)); /* (Returns 32 if not found) */
}
Assembly Code
# R3 contains x
# R4 contains n
loop:
cmpwi cr3,R4,1 # compare n and 1
ble done # branch if n 1
srwi R5,R4,1 # s = n/2
slw R6,R3,R5 # x << s
and R3,R3,R6 # x = x & x << s
sub R4,R4,R5 # n = n - s
b loop # next iteration of loop
done:
cntlzw R3,R3 # return nlz(x)


5.13 Incrementing a Reversed Integer

The problem of incrementing a reversed integer has application to the Fast Fourier Transform (FFT) algorithm, which employs an integer and its bitwise reversal to index an array in memory. The integer and its reversal have the same length, almost certainly less than 32 bits, and they are both right-justified in a register. Here, we assume that the reversed integer is 32 bits in length. For the FFT application, it is necessary to shift the result right a certain number of bits before using it to index an array in memory. Thus, the incrementing proceeds as follows, in hexadecimal: 00000000, 80000000, 40000000, C0000000, 20000000, A0000000,....

The five instructions in Figure 5-18 increment a reversed integer. When n = 32, the algebraic shift yields all 1s as a result, so this code sequence properly steps from 0xFFFFFFFF to 0.



Figure 5-18. Incrementing a Reversed Integer Code Sequence

# R3 contains x
lis R4,0x8000 # load R8 with 0x80000000
not R5,R3 # ~x
cntlzw R5,R5 # n = nlz(~x)
sraw R5,R4,R5 # m = 0x80000000 >> n
xor R3,R3,R5 # x = x ^ m


5.14 Decoding a "Zero Means 2n" Field

Sometimes a zero or negative value for a quantity does not make sense, so the quantity is encoded in an n-bit field with a zero value indicating 2n and a nonzero value having its normal binary interpretation. The 5-bit length field of Load String Word Immediate (lswi) instruction is a good example. An instruction that loads zero bytes is not useful, but it is definitely useful to be able to load 32 bytes. Values of 0 to 31 for the length field could denote lengths from 1 to 32, but the zero means 32 convention results in simpler logic when the processor must also support a corresponding instruction with a variable (in-register) length that employs straight binary encoding (e.g., PowerPC's lswx instruction). To encode an integer in the range 1 to 2n into the zero means 2n encoding, simply mask the integer with 2n - 1. The following 3-instruction sequences perform the decoding operation without a test-and-branch:

((x - 1) & 7) + 1
((x + 7) | 8) - 7
((x - 1) & 8) + x

There are many other similar and equivalent expressions.


5.15 2n in Fortran

The IBM XL Fortran compiler defines the 2n function as:

The exponent n and the result are interpreted as signed integers. This definition satisfies the ANSI/ISO Fortran standard, but is more restrictive than necessary to meet that standard. The definition is reasonable for n 31, because it generates the mathematically correct result, modulo 232, and the result agrees with the result of repeated multiplication.

The standard way to compute 2n involves putting the integer 1 in a register and shifting it left n places. The difficulty with this procedure is that shift amounts are treated modulo 64, giving incorrect results for large or negative shift amounts.

The code sequence in Figure 5-19 computes the function correctly using four instructions.



Figure 5-19. 2n in Fortran Code Sequence

addi R4,R3,-32 # n - 32
andc R4,R4,R3 # tmp = ¬n & (n - 32)
srw R4,R4,31 # tmp = (unsigned) tmp >> 31
slw R5,R4,R3 # pow2(n) = tmp << n


5.16 Integer Log Base 10

We define of integer log base 10 to be:

where log10(x) is the ordinary (real) base-10 logarithm. This function has application to the conversion of a binary number to decimal for inclusion into a line with leading zeros suppressed. The conversion process successively divides by 10, producing the least significant digit first. It would be convenient to know where to place the least significant digit so that putting the converted number in a temporary area and then moving it can be avoided. For this application, it would be even more convenient to define ilog10(0) to be 0; this is considered in the following.

The code sequence in Figure 5-20 computes the integer log base 10. This sequence assumes x is an unsigned number, extending its range of application and avoiding the undefined cases. It computes ilog10(x) with two table lookups, executing in 11 branch-free instructions as coded, or 10 instructions if the values in table1 are multiplied by 4, to save a shift when accessing table2.

You can modify this procedure to return the value 0, rather than -1, for x = 0 (which is preferable for the decimal conversion problem) by changing the last entry in table1 to 1 (i.e., change the final 0 in table1 to a 1).



Figure 5-20. Integer Log Base 10 Code Sequence

C Source Code
int ilog10(unsigned int x)
{
static unsigned char table1[33] = { 10, 9, 9, 8, 8, 8,
7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4,
3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 0, 0, 0, 0 };
static int table2[11] = { 1, 10, 100, 1000,
10000, 100000, 1000000, 10000000,
100000000, 1000000000, 0 };
int y;
y = table1[nlz(x)];
y = y - ((unsigned) (x - table2[y]) >> 31);
return(y);
}
Assembly Code
# R3 contains x
lwz R4,.table1 # load table1's base address
lwz R5,.table2 # load table2's base address
cntlzw R6,R3 # nlz(x)
lbzx R6,R4,R6 # load table1(nlz(x))
slwi R7,R6,2 # times 4 for offset into table2
lwzx R7,R5,R7 # load table2(y)
subf R7,R7,R3 # t = x - table2(y)
srwi R7,R7,31 # t = (unsigned) t >> 31
subf R3,R7,R6 # ilog10(x) = y - t
blr # return


[ Table of Contents | Index ]
Copyright 1998 IBMchips