IBM Search Java Site Map Microelectronics News Products Services Technology About Us IBM Microelectronics Order Contact Legal

power pcProducts

overview
news
products
documents
performance
technology


[ Table of Contents | Index ]

3.2 Integer and String Operations

Optimal code selection generally depends on the surrounding code. The total number of instructions should be minimized, but scheduling considerations may give longer code sequences a performance advantage. Where possible, branches and long latency instructions (multiplies, divides, loads, and stores) should be avoided. Because of the pipelined, superscalar nature of many PowerPC implementations, the minimization of dependences introduces flexibility in scheduling and opportunities for parallel computation.

Dependences, particularly the so-called false dependences (anti-dependences and output dependences) may cause the processor to stall, even when it can be avoided. (Renaming mechanisms are frequently provided in advanced implementations to eliminate false dependences.) The PowerPC instruction forms explicitly indicate all operands, facilitating the determination of dependence. Moreover, most operations have instruction forms that do not record, set the carry bit, or set the overflow bit, thereby reducing the potential for conflicts over these resources.

The PowerPC integer instruction set architecture includes:


3.2.1 Memory Access

The PowerPC architecture is a load-store architecture; that is, only load and store instructions can move values between registers and memory. These instructions introduce delays due to memory latencies, so their use should generally be minimized.


3.2.1.1 Single Scalar Load or Store

Figures 3-22 and 3-23 show the PowerPC scalar load and store instructions. The update form of the load or store writes the calculated address to the base register, simplifying the iterative load or store of data structures, such as strings or arrays. The update load or store instructions execute as fast or faster than the equivalent non-update load or store instruction and the associated add. If the RA field of a non-updating load or store is R0, the value 0 is used instead of the contents of R0.



Figure 3-22. Scalar Load Instructions

Scalar Type Basic Form Indexed Form Update Form Update Indexed Form
logical byte lbz lbzx lbzu lbzux
logical halfword lhz lhzx lhzu lhzux
logical word lwz lwzx lwzu lwzux
logical doubleword (ld) (ldx) (ldu) (ldux)
algebraic halfword lha lhax lhau lhaux
algebraic word (lwa) (lwax) (lwaux)
Instructions in parentheses are available only in 64-bit implementations.



Figure 3-23. Scalar Store Instructions

Scalar Size Basic Form Indexed Form Update Form Update Indexed Form
byte stb stbx stbu stbux
halfword sth sthx sthu sthux
word stw stwx stwu stwux
doubleword (std) (stdx) (stdu) (stdux)
Instructions in parentheses are available only in 64-bit implementations.


3.2.1.2 Load and Reserve/ Store Conditional

The load and reserve instructions load the addressed value from memory and then set a reservation on an aligned unit of real storage (called a reservation granule) containing the address. The size of a reservation granule is implementation dependent. It must be a multiple of 4 bytes for lwarx and a multiple of 8 bytes for ldarx. In most implementations, it equals the size of a cache line. A subsequent store conditional instruction to this address verifies that the reservation is still set on the granule before carrying out the store. If the reservation does not exist, the instruction completes without altering storage. If the store is performed, bit 2 of CR0 is set; otherwise, it is cleared. The processor may clear the reservation by setting another reservation or by executing a conditional store to any address. Another processor may clear the reservation by accessing the same reservation granule. A pair of load-and-reserve and store-conditional instructions permit the atomic update of a single aligned word or doubleword (only in 64-bit implementations) in memory.

A compiler that directly manages threads may use these instructions for in-line locks and to implement wait-free updates using primitives similar to compare and swap. Because locked or synchronized operations in multi-processor systems are complex, these operations are usually exposed only through calls to appropriate run-time library functions.

Further details about the use of these instructions may be found in Book I Section 3.3.7 and Book II Section 1.8.2 of The PowerPC Architecture.


3.2.1.3 Multiple Scalar Load or Store

The load and store multiple and move assist instructions access a sequence of words or bytes in memory. In PowerPC Little-Endian mode, the execution of these instructions generates an interrupt.

Some implementations may not execute these instructions as efficiently as the equivalent sequence of scalar loads and stores. If the output of the compiler targets a generic PowerPC implementation, the use of an equivalent sequence of scalar loads or stores may be preferable. For a particular implementation, check Appendix B and relevant implementation-specific documentation for the latency and timing of these instructions.


3.2.1.4 Byte-Reversal Load or Store

The byte-reversal loads and stores reverse the order of the bytes in the accessed halfword or word, regardless of the processor's endian mode. With knowledge of the data structure, data types, and endian orientation, these instructions may be used for endian conversion. Some implementations may have a longer latency for byte reversing loads and stores than for ordinary loads and stores.


3.2.1.5 Cache Touch Instructions

The data cache touch instructions may improve performance by providing the processor with a hint that the cache line containing the addressed byte will soon be fetched into the data cache. The processor is not required to act on this hint. Successful use of the touch instructions requires knowledge of the cache geometry and a non-blocking cache. See Section 4.4.3 on page 134 for further details.


3.2.2 Computation

All integer computational instructions have operands that are stored in registers or appear as intermediate fields in the instruction. With the exception of multiplies and divides, they usually execute in a single cycle.


3.2.2.1 Setting Status

Integer operations may write the status of the result to either the Condition Register or the XER, depending on the opcode (some cause the Carry bit to be written) and the Rc and OE fields (if present) in the instruction. Unnecessarily setting status bits can reduce instruction-level parallelism by increasing instruction interdependence. The addi, addis, add, and subf instructions do not access status bits.

Recording
Most integer instructions have a Record (Rc) field. When the Rc field in an instruction is set, CR0 is written reflecting both a comparison of the result to zero and the Summary Overflow bit. The andi., andis., and addic. instructions also record, even though they do not have an Rc field. For 64-bit implementations operating in 64-bit mode, the full 64-bit register value is compared to zero. For 64-bit implementations operating in 32-bit mode, only the low-order 32 bits are compared to zero. Even unsigned instructions with the Rc field set, such as the logical and rotation instructions, set CR0 based on signed comparison of the result to zero.

Carry
Carrying and extended arithmetic operations may set the Carry (CA) bit. In addition, the Shift Right Algebraic instruction sets CA when the argument is negative and a 1-bit has been shifted out; CA is cleared otherwise. In 64-bit implementations, the value of CA depends on whether the processor is in 32-bit or 64-bit mode. During a sequence of instructions using CA, changing modes can lead to unexpected results.

Overflow
XER contains two overflow bits: overflow (OV) and summary overflow (SO). OV indicates whether an overflow occurred during the execution of an integer arithmetic instruction with the OE field set. Add, subtract from, and negate instructions may set OV if the carry into the most-significant bit is not the same as the carry out of it. Multiply low and divide instructions may set OV if the result cannot be represented in 64 bits for doubleword forms and 32 bits for word forms.

The processor sets SO whenever setting OV. SO remains set, however, until explicitly cleared by an mtxer or mcrxr instruction. Remaining set allows the check for overflow to occur at the end of a sequence of integer arithmetic instructions, rather than immediately after each instruction.

In 64-bit implementations, the OV and SO values depend on whether the processor is in 32-bit or 64-bit mode. During a sequence of instructions using OV, changing modes can lead to unexpected results.


3.2.2.2 Arithmetic Instructions

The PowerPC arithmetic operations include:

addi, addis, add, and subf are the preferred forms for addition and subtraction because they set no status bits. If the source register is R0 for addi and addis, the value 0 is used instead of the contents of R0.

In the most general case, the multiplication of two n-bit operands yields a 2n-bit product. Multiply low and multiply high return the low-order and high-order n-bits of the product, respectively. For mulli and mullw, the low-order 32 bits of the product are the correct 32-bit product for 32-bit mode. The low-order 32 bits of the product are independent of whether the operands are regarded as signed or unsigned 32-bit integers. For mulli and mulld, the low-order 64 bits of the product are independent of whether the operands are regarded as signed or unsigned 64-bit integers. In some implementations, multiply instructions, with the exception of mulli, may execute faster if the second argument is smaller in absolute value than the first.


3.2.2.3 Logical Instructions

The PowerPC architecture provides a large number of Boolean instructions (and, or, nand, nor, xor, eqv, andc and orc) so that any 2-argument Boolean function can be calculated with only a single instruction. These instructions can easily implement the bit-parallel operators in higher-level languages. Logical expressions can be calculated as predicates as described in Section 3.1.5.1 on page 38.

The AND Immediate instruction has only a recording form. If the immediate value is of the form 2n - 2m with n greater than m (i.e., an unbroken series of 1s), the AND Immediate instruction may be replaced with a non-recording rlwinm instruction.

The Count Leading Zeros operation is useful in many algorithms. To implement this function with other operations requires at least 10 instructions. If the cntlzw instruction's record bit is set, the LT bit in CR0 is cleared.


3.2.2.4 Rotate and Shift Instructions

The PowerPC architecture implements instructions that can shift, rotate, extract, clear and insert bit fields in a general way. The shift instructions have analogous word and doubleword forms. The rotate instructions, however, do not have analogous word and doubleword forms because the 32-bit instruction size prevents an equivalent specification of the mask. In a 64-bit implementation, the rotate and shift word instructions operate on the low-order 32 bits; the high-order 32 bits are cleared.


3.2.2.5 Compare Instructions

The compare instructions write the condition code resulting from the comparison of two operands to the specified field in the Condition Register, whose 4 bits hold the result of the comparison and a copy of the Summary Overflow bit from the XER. The L field in the instruction determines whether the operands are treated as 32-bit (L = 0) or 64-bit (L = 1) quantities. In a 64-bit implementation, bit 32 is the sign bit in 32-bit mode. In 32-bit implementations, L must be cleared.


3.2.2.6 Move To/From XER

mtxer may be used to clear the SO bit or to write bits 25:31 of the XER, which specify the number of bytes a Move Assist instruction transfers. mfxer may be used to save the value of the XER during a function call.


3.2.3 Uses of Integer Operations

Many high-level language operations and functions have a straightforward, but not necessarily unique translation into PowerPC assembly. The following sections examine code sequences for important or non-obvious operations and consider some of the trade-offs among possible sequences.


3.2.3.1 Loading a Constant into a Register

Loading an integer constant into a register (either an address or a datum) is a common operation. The optimum way to handle this operation depends on the size of the constant and the availability of registers and cache lines to hold a constant pool.

The Load Immediate extended mnemonic (li Rd,value, which is equivalent to addi Rd,R0,value) loads a 16-bit sign-extended value into the destination register. The large number of immediate instruction forms, however, can handle most operations involving constants smaller than 16 bits without the need for a separate load. Immediate arithmetic and signed-compare instructions sign extend their 16-bit immediate value to allow negative intermediates. Add Immediate and Add Immediate Shifted instructions with source register R0 use the value of zero instead of the contents of R0.

The method of loading an arbitrary 32-bit or 64-bit value into a destination register involves either constructing it in a register or loading it from memory. The following code sequence builds a 32-bit value in a register provided bit 16 of value is 0 and Rd is not R0:

li Rd,(value & 0xFFFF)
addis Rd,Rd,((value >> 16) & 0xFFFF)

If bit 16 of value is 1, use the following code sequence:

li Rd,(value & 0xFFFF)
addis Rd,Rd,((value >> 16) & 0xFFFF) + 1

If the destination register is R0, use an intermediate register other than R0. An alternative to the preceding sequence involves the OR Immediate instruction (lis Rd,value is equivalent to addis Rd,R0,value):

lis Rd,((value >> 16) & 0xffff)
ori Rd,Rd,(value & 0xffff)

The add instructions are preferred to logical instructions because future PowerPC implementations may have a three-input adder, which permits executing the preceding two instructions (and other forms of add-feeding-add) in parallel, even though they are not independent. Do not use andi. to load a constant because it needlessly sets CR0.

The construction of a general 64-bit constant may use the rldimi instruction to combine two 32-bit constants formed as indicated above. If the implementation has multiple integer units, the formation of the two 32-bit constants may proceed in parallel so that the entire 64-bit constant requires 3 cycles. Special 64-bit values or special circumstances, such as the availability of the needed 32-bit components in registers, might reduce the number of required cycles.

If a register is dedicated to contain the base address of a constant pool in memory, the cost of loading a constant is the single load cost plus the associated cache or memory access delay. Loading the constant from memory is preferred to constructing the constant if the load is likely to hit in the cache and addressing code is unnecessary, or if too many constants are required to be kept in registers. It is almost always better to load 64-bit values from a constant pool in memory. The ABI can affect this trade-off. For example, in the AIX ABI, constants may be placed in the Table Of Contents (TOC), whose base address resides in R2. Accessing these constants, therefore, requires no addressing code and most likely the access will hit in the cache.


3.2.3.2 Endian Reversal

Byte-reversal load and store instructions allow block transfers of data between applications with different endian orientations. For example, assume a 4KB block of data is known to be a sequence of 32-bit quantities. Figure 3-24 shows a loop that reverses the endian orientation of the data block. This loop correctly reverses the endian orientation of the data independent of the endian mode of the processor.



Figure 3-24. Endian Reversal of a 4KB Block of Data Code Sequence

# R3 = base address of the source buffer
# R4 = base address of the destination buffer
li R5,1024 # 1024 words to transfer
mtctr R5 # load count register
addi R4,R4,-4 # adjust the destination for the loop
subf R6,R4,R3 # difference between source and destination
loop:
lwbrx R5,R4,R6 # load byte-reverse at (dest + diff)
stwu R5,4(R4) # store with update to destination
bdnz loop # branch to loop if ctr != 0

This example illustrates the use of byte-reversal instructions and how store update can subsume address computation. Because the byte-reversal instructions do not have an update form, this example calculates the difference between the source and destination buffer addresses outside of the loop. Then, the load address is given by the destination address indexed by the difference, allowing the byte-reversal load to use the address updated by the store.


3.2.3.3 Absolute Value

The absolute value of an integer may be expressed in C as:

(a >= 0) ? a : (0 - a)

Figure 3-25 shows a non-branching instruction sequence to compute this function.



Figure 3-25. Absolute Value Code Sequence

# R3 = argument a
srawi R4,R3,31 # R4 = (a < 0) ? -1 : 0
xor R5,R4,R3 # R5 = (a < 0) ? ~a : a
subf R6,R4,R5 # R6 = (a < 0) ? (~a + 1) : a
# R6 = result


3.2.3.4 Minimum and Maximum

The integer minimum or maximum can be expressed, assuming a and b have the same type:

min(a,b): (a <= b) ? a : b
max(a,b): (a >= b) ? a : b

Figure 3-26 shows a code sequence computing max(a,b) for unsigned values. This approach is based on the fact that subtraction of unsigned values generates a carry if the subtrahend is greater than or equal to the minuend. By replacing andc with and, the code in Figure 3-26 produces min(a,b).



Figure 3-26. Unsigned Maximum of a and b Code Sequence

# R3 = a
# R4 = b
subfc R5,R3,R4 # R5 = b - a with carry
# CA = (b >= a) ? 1 : 0
subfe R6,R4,R4 # R6 = (b >= a) ? 0 : -1
andc R5,R5,R6 # R5 = (b >= a) ? (Rb - Ra) : 0
add R7,R3,R5 # R3 = (b >= a) ? Rb : Ra
# R7 = result

Figure 3-27 shows a code sequence computing max(a,b) for signed values. The first two lines shift the unsigned values so that the problem becomes finding the maximum of unsigned values. By replacing andc with and, the code in Figure 3-27 produces min(a,b).


3.2.3.5 Division by Integer Constants

On many implementations, integer division is rather slow compared to integer multiplication or other integer arithmetic and logical operations. When the divisor is a constant, integer division instructions can be replaced by shifts to the right for divisors that are powers of 2 or by multiplication by a magic number for other divisors. The following describes techniques for 32-bit code, but everything extends in a straightforward way to 64-bit code.



Figure 3-27. Signed Maximum of a and b Code Sequence

# R3 = a
# R4 = b
xoris R5,R4,0x8000 # flip sign b
xoris R6,R3,0x8000 # flip sign a
# the problem is now analogous to that of the unsigned maximum
subfc R6,R6,R5 # R6 = R5 - R6 = b - a with carry
# CA = (b >= a) ? 1 : 0
subfe R5,R5,R5 # R5 = (b >= a) ? 0 : -1
andc R6,R6,R5 # R6 = (b >= a) ? (Rb - Ra) : 0
add R6,R6,R3 # R6 = (b >= a) ? Rb : Ra
# R6 = result

Signed Division
Most computers and high-level languages use a truncating type of signed-integer division in which the quotient q is defined as the integer part of n/d with the fraction discarded. The remainder is defined as the integer r that satisfies

n = q × d + r

where 0 r < |d| if n 0, and -|d| < r 0 if n < 0. If n = -231 and d = -1, the quotient is undefined. Most computers implement this definition, including PowerPC processor-based computers. Consider the following examples of truncating division:

7/3 = 2 remainder 1

(-7)/3 = -2 remainder -1

7/(-3) = -2 remainder 1

(-7)/(-3) = 2 remainder -1

Signed Division by a Power of 2
If the divisor is a power of 2, that is 2k for 1 k 31, integer division may be computed using two elementary instructions:

srawi Rq,Rn,Rk
addze Rq,Rq

Rn contains the dividend, and after executing the preceding instructions, Rq contains the quotient of n divided by 2k. This code uses the fact that, in the PowerPC architecture, the shift right algebraic instructions set the Carry bit if the source register contains a negative number and one or more 1-bits are shifted out. Otherwise, the carry bit is cleared. The addze instruction corrects the quotient, if necessary, when the dividend is negative. For example, if n = -13, (0xFFFF_FFF3), and k = 2, after executing the srawi instruction, q = -4 (0xFFFF_FFFC) and CA = 1. After executing the addze instruction, q = -3, the correct quotient.

Signed Division by Non-Powers of 2
For any divisor d other than 0, division by d can be computed by a multiplication and a few elementary instructions such as adds and shifts. The basic idea is to multiply the dividend n by a magic number, a sort of reciprocal of d that is approximately equal to 232/d. The high-order 32 bits of the product represent the quotient. This algorithm uses the PowerPC multiply high instructions. The details, however, are complicated, particularly for certain divisors such as 7. Figures 3-28, 3-29, and 3-30 show the code for divisors of 3, 5, and 7, respectively. The examples include steps for obtaining the remainder by simply subtracting q ×d from the dividend n.



Figure 3-28. Signed Divide by 3 Code Sequence

lis Rm,0x5555 # load magic number = m
addi Rm,Rm,0x5556 # m = 0x55555556 = (232 + 2)/3
mulhw Rq,Rm,Rn # q = floor(m*n/232)
srwi Rt,Rn,31 # add 1 to q if
add Rq,Rq,Rt # n is negative
#
mulli Rt,Rq,3 # compute remainder from
sub Rr,Rn,Rt # r = n - q*3



Figure 3-29. Signed Divide by 5 Code Sequence

lis Rm,0x6666 # load magic number = m
addi Rm,Rm,0x6667 # m = 0x66666667 = (233 + 3)/5
mulhw Rq,Rm,Rn # q = floor(m*n/232)
srawi Rq,Rq,1
srwi Rt,Rn,31 # add 1 to q if
add Rq,Rq,Rt # n is negative
#
mulli Rt,Rq,5 # compute remainder from
sub Rr,Rn,Rt # r = n - q*5



Figure 3-30. Signed Divide by 7 Code Sequence

lis Rm,0x9249 # load magic number = m
addi Rm,Rm,0x2493 # m = 0x92492493 = (234 + 5)/7 - 232
mulhw Rq,Rm,Rn # q = floor(m*n/232)
add Rq,Rq,Rn # q = floor(m*n/232) + n
srawi Rq,Rq,2 # q = floor(q/4)
srwi Rt,Rn,31 # add 1 to q if
add Rq,Rq,Rt # n is negative
#
mulli Rt,Rq,7 # compute remainder from
sub Rr,Rn,Rt # r = n - q*7

The general method is:

  1. Multiply n by a certain magic number.
  2. Obtain the high-order half of the product and shift it right some number of positions from 0 to 31.
  3. Add 1 if n is negative.
The general method always reduces to one of the cases illustrated by divisors of 3, 5, and 7. In the case of division by 3, the multiplier is representable in 32 bits, and the shift amount is 0. In the case of division by 5, the multiplier is again representable in 32 bits, but the shift amount is 1. In the case of 7, the multiplier is not representable in 32 bits, but the multiplier less 232 is representable in 32 bits. Therefore, the code multiplies by the multiplier less 232, and then corrects the product by adding n ×232, that is by adding n to the high-order half of the product. For d = 7, the shift amount is 2.

For most divisors, there exists more than one multiplier that gives the correct result with this method. It is generally desirable, however, to use the minimum multiplier because this sometimes results in a zero shift amount and the saving of an instruction.

The corresponding procedure for dividing by a negative constant is analogous. Because signed integer division satisfies the equality n/(-d) = -(n/d), one method involves the generation of code for division by the absolute value of d followed by negation. It is possible, however, to avoid the negation, as illustrated by the code in Figure 3-31 for the case of d = -7. This approach does not give the correct result for d = -231, but for this case and other divisors that are negative powers of 2, you may use the code described previously for division by a positive power of 2, followed by negation.



Figure 3-31. Signed Divide by -7 Code Sequence

lis Rm,0x6DB7 # load magic number = m
addi Rm,Rm,0xDB6D # m = 0x6DB6DB6D = -(234 + 5)/7 +232
mulhw Rq,Rm,Rn # q = floor(m*n/232)
sub Rq,Rq,Rn # q = floor(m*n/232) - n
srawi Rq,Rq,2 # q = floor(q/4)
srwi Rt,Rq,31 # add 1 to q if
add Rq,Rq,Rt # q is negative (n is positive)
#
mulli Rt,Rq,-7 # compute remainder from
sub Rr,Rn,Rt # r = n - q*(-7)

The code in Figure 3-31 is the same as that for division by +7, except that it uses the multiplier of the opposite sign, subtracts rather than adds following the multiply, and shifts q rather than n to the right by 31. (The case of d = +7 could also shift q to the right by 31, but there would be less parallelism in the code.)

The multiplier for -d is nearly always the negative of the multiplier for d. For 32-bit operations, the only exceptions to this rule are d = 3 and 715,827,883.

Unsigned Division
Perform unsigned division by a power of 2 using a srwi instruction (a form of rlwinm). For other divisors, except 0 and 1, Figures 3-32 and 3-33 illustrate the two cases that arise.



Figure 3-32. Unsigned Divide by 3 Code Sequence

lis Rm,0xAAAB # load magic number = m
addi Rm,Rm,0xAAAB # m = 0xAAAAAAAB = (233 + 1)/3
mulhwu Rq,Rm,Rn # q = floor(m*n/232)
srwi Rq,Rq,1 # q = q/2
#
mulli Rt,Rq,3 # compute remainder from
sub Rr,Rn,Rt # r = n - q*3



Figure 3-33. Unsigned Divide by 7 Code Sequence

lis Rm,0x2492 # load magic number = m
addi Rm,Rm,0x4925 # m = 0x24924925 = (235 + 3)/7 - 232
mulhwu Rq,Rm,Rn # q = floor(m*n/232)
sub Rt,Rn,Rq # t = n - q
srwi Rt,Rt,1 # t = (n - q)/2
add Rt,Rt,Rq # t = (n - q)/2 + q = (n + q)/2
srwi Rq,Rt,2 # q = (n + m*n/232)/8 = floor(n/7)
#
mulli Rt,Rq,7 # compute remainder from
sub Rr,Rn,Rt # r = n - q*7

The quotient is

(m ×n)/2p,

where m is the magic number (e.g., (235 + 3)/7 in the case of division by 7), n is the dividend, p 32, and the "/" denotes unsigned integer (truncating) division. The multiply high of c and n yields (c× n)/232, so we can rewrite the quotient as

[(m ×n)/232]/2s,

where s 0.

In many cases, m is too large to represent in 32 bits, but m is always less than 233. For those cases in which m 232, we may rewrite the computation as

[((m - 232) ×n)/232 + n]/2s,

which is of the form (x + n)/2s, and the addition may cause an overflow. If the PowerPC architecture had a Shift Right instruction in which the Carry bit participated, that would be useful here. This instruction is not available, but the computation may be done without causing an overflow by rearranging it:

[(n - x)/2 + x]/2s-1,

where x = [(m - 232)n]/232. This expression does not overflow, and s > 0 when c 232. The code for division by 7 in Figure 3-33 uses this rearrangement.

If the shift amount is zero, the srwi instruction can be omitted, but a shift amount of zero occurs only rarely. For 32-bit operations, the code illustrated in the divide by 3 example has a shift amount of zero only for d = 641 and 6,700,417. For 64-bit operations, the analogous code has a shift amount of zero only for d = 274,177 and 67,280,421,310,721.

Sample Magic Numbers
The C code sequences in Figures 3-34 and 3-35 produce the magic numbers and shift values for signed and unsigned divisors, respectively. The derivation of these algorithms is beyond the scope of this book, but it is given in Warren [1992] and Granlund and Montgomery [1994].



Figure 3-34. Signed Division Magic Number Computation Code Sequence

struct ms {int m; /* magic number */
int s; }; /* shift amount */
struct ms magic(int d)
/* must have 2 <= d <= 231-1 or -231 <= d <= -2 */
{
int p;
unsigned int ad, anc, delta, q1, r1, q2, r2, t;
const unsigned int two31 = 2147483648; /* 231 */
struct ms mag;
ad = abs(d);
t = two31 + ((unsigned int)d >> 31);
anc = t - 1 - t%ad; /* absolute value of nc */
p = 31; /* initialize p */
q1 = two31/anc; /* initialize q1 = 2p/abs(nc) */
r1 = two31 - q1*anc; /* initialize r1 = rem(2p,abs(nc)) */
q2 = two31/ad; /* initialize q2 = 2p/abs(d) */
r2 = two31 - q2*ad; /* initialize r2 = rem(2p,abs(d)) */
do {
p = p + 1;
q1 = 2*q1; /* update q1 = 2p/abs(nc) */
r1 = 2*r1; /* update r1 = rem(2p/abs(nc)) */
if (r1 >= anc) { /* must be unsigned comparison */
q1 = q1 + 1;
r1 = r1 - anc;
}
q2 = 2*q2 /* update q2 = 2p/abs(d) */
r2 = 2*r2 /* update r2 = rem(2p/abs(d)) */
if (r2 >= ad) { /* must be unsigned comparison */
q2 = q2 + 1;
r2 = r2 - ad;
}
delta = ad - r2;
} while (q1 < delta || (q1 == delta && r1 == 0));
mag.m = q2 + 1;
if (d < 0) mag.m = -mag.m; /* resulting magic number */
mag.s = p - 32; /* resulting shift */
return mag;
}



Figure 3-35. Unsigned Division Magic Number Computation Code Sequence

struct mu {unsigned int m; /* magic number */
int a; /* "add" indicator */
int s;} /* shift amount */
struct mu magicu(unsigned int d)
/* must have 1 <= d <= 232-1 */
{
int p;
unsigned int nc, delta, q1, r1, q2, r2;
struct mu magu;
magu.a = 0; /* initialize "add" indicator */
nc = - 1 - (-d)%d;
p = 31; /* initialize p */
q1 = 0x80000000/nc; /* initialize q1 = 2p/nc */
r1 = 0x80000000 - q1*nc; /* initialize r1 = rem(2p,nc) */
q2 = 0x7FFFFFFF/d; /* initialize q2 = (2p-1)/d */
r2 = 0x7FFFFFFF - q2*d; /* initialize r2 = rem((2p-1),d) */
do {
p = p + 1;
if (r1 >= nc - r1 ) {
q1 = 2*q1 + 1; /* update q1 */
r1 = 2*r1 - nc; /* update r1 */
}
else {
q1 = 2*q1; /* update q1 */
r1 = 2*r1; /* update r1 */
}
if (r2 + 1 >= d - r2) {
if (q2 >= 0x7FFFFFFF) magu.a = 1;
q2 = 2*q2 + 1; /* update q2 */
r2 = 2*r2 + 1 - d; /* update r2 */
}
else {
if (q2 >= 0x80000000) magu.a = 1;
q2 = 2*q2; /* update q2 */
r2 = 2*r2 + 1; /* update r2 */
}
delta = d - 1 - r2;
} while (p < 64 && (q1 < delta || (q1 == delta && r1 == 0)));
magu.m = q2 + 1; /* resulting magic number */
mag.s = p - 32; /* resulting shift */
return magu;
}

Even if a compiler includes these functions to calculate the magic numbers, it may also incorporate a table of magic numbers for a few small divisors. Figure 3-36 shows an example of such a table. Magic numbers and shift amounts for divisors that are negative or are powers of 2 are shown just as a matter of general interest; a compiler would probably not include them in its tables. Figure 3-37 shows the analogous table for 64-bit operations.

The tables need not include even divisors because other techniques handle them better. If the divisor d is of the form b ×2k, where b is odd, the magic number for d is the same as that for b, and the shift amount is the shift for b increased by k. This procedure does not always give the minimum magic number, but it nearly always does. For example, the magic number for 10 is the same as that for 5, and the shift amount for 10 is 1 plus the shift amount for 5.



Figure 3-36. Some Magic Numbers for 32-Bit Operations

d signed unsigned
m (hex) s m (hex) a s
-5 99999999 1
-3 55555555 1
-2k 7FFFFFFF k-1
1 0 1 0
2k 80000001 k-1 232-k 0 0
3 55555556 0 AAAAAAAB 0 1
5 66666667 1 CCCCCCCD 0 2
6 2AAAAAAB 0 AAAAAAAB 0 2
7 92492493 2 24924925 1 3
9 38E38E39 1 38E38E39 0 1
10 66666667 2 CCCCCCCD 0 3
11 2E8BA2E9 1 BA2E8BA3 0 3
12 2AAAAAAB 1 AAAAAAAB 0 3
25 51EB851F 3 51EB851F 0 3
125 10624DD3 3 10624DD3 0 3



Figure 3-37. Some Magic Numbers for 64-Bit Operations

d signed unsigned
m (hex) s m (hex) a s
-5 9999999999999999 1
-3 5555555555555555 1
-2k 7FFFFFFFFFFFFFFF k-1
1 0 1 0
2k 8000000000000001 k-1 264-k 0 0
3 5555555555555556 0 AAAAAAAAAAAAAAAB 0 1
5 6666666666666667 1 CCCCCCCCCCCCCCCD 0 2
6 2AAAAAAAAAAAAAAB 0 AAAAAAAAAAAAAAAB 0 2
7 4924924924924925 1 2492492492492493 1 3
9 1C71C71C71C71C72 0 E38E38E38E38E38F 0 3
10 6666666666666667 2 CCCCCCCCCCCCCCCD 0 3
11 2E8BA2E8BA2E8BA3 1 2E8BA2E8BA2E8BA3 0 1
12 2AAAAAAAAAAAAAAB 1 AAAAAAAAAAAAAAAB 0 3
25 A3D70A3D70A3D70B 4 47AE147AE147AE15 1 5
125 20C49BA5E353F7CF 4 0624DD2F1A9FBE77 1 7

In the special case when the magic number is even, divide the magic number by 2 and reduce the shift amount by 1. The resulting shift might be 0 (as in the case of signed division by 6), saving an instruction.

To use the values in Figure 3-36 to replace signed division:

  1. Load the magic value.
  2. Multiply the numerator by the magic value with the mulhw instruction.
  3. If d > 0 and m < 0, add n.
If d < 0 and m > 0, subtract n.

  1. Shift s places to the right with the srawi instruction.
  2. Add the sign bit extracted with the srwi instruction.
To use the values in Figure 3-37 to replace unsigned division:

  1. Load the magic value.
  2. Multiply the numerator by the magic value with the mulhwu instruction.
  3. If a = 0, shift s places to the right with the srwi instruction.
  4. If a = 1,
It can be shown that s - 1 0, except in the degenerate case d = 1, for which this technique is not recommended.


3.2.3.6 Remainder

Figure 3-38 shows a code sequence that computes the 32-bit signed remainder assuming that the quotient is well-defined. The code in Figure 3-39 shows the computation of the 32-bit unsigned remainder.



Figure 3-38. 32-Bit Signed Remainder Code Sequence

divw Rt,Ra,Rb # quotient = (int)(Ra / Rb)
mullw Rt,Rt,Rb # quotient * Rb
subf Rt,Rt,Ra # remainder = Ra - quotient * Rb



Figure 3-39. 32-Bit Unsigned Remainder Code Sequence

divwu Rt,Ra,Rb # quotient = (int)(Ra / Rb)
mullw Rt,Rt,Rb # quotient * Rb
subf Rt,Rt,Ra # remainder = Ra - quotient * Rb

The corresponding code sequences for the signed and unsigned 64-bit remainders appears the same, except that the doubleword forms of the multiply and divide are used.


3.2.3.7 32-Bit Implementation of a 64-Bit Unsigned Divide

With the exception of division, 32-bit implementations of 64-bit arithmetic operations are straightforward exercises in multi-precision arithmetic. This section presents a 32-bit code sequence that performs 64-bit unsigned division. Signed division may use the same routine for the magnitudes followed by appropriate correction of the quotient and remainder (and preceded by trapping for the -263/(-1) case).

Figure 3-40 shows a 32-bit implementation of a 64-bit unsigned division routine, which uses a restoring shift and subtract algorithm. The dividend (dvd) is placed in the low-order half of a 4-register combination (tmp:dvd). Each iteration includes the following steps:

  1. Shift the tmp:dvd combination 1 bit to the left.
  2. Subtract the divisor from tmp.
  3. If the result is negative, do not modify tmp and clear the low-order bit of dvd.
  4. If the result is positive, place the result in tmp and set the low-order bit of dvd.
  5. If the number of iterations is less than the width of dvd, goto step 1.
This implementation of the algorithm shifts the original dividend value in tmp:dvd so that the minimum number of iterations is required.



Figure 3-40. 32-Bit Implementation of 64-Bit Unsigned Division Code Sequence

# (R3:R4) = (R3:R4) / (R5:R6) (64b) = (64b / 64b)
# quo dvd dvs
#
# Remainder is returned in R5:R6.
#
# Code comment notation:
# msw = most-significant (high-order) word, i.e. bits 0..31
# lsw = least-significant (low-order) word, i.e. bits 32..63
# LZ = Leading Zeroes
# SD = Significant Digits
#
# R3:R4 = dvd (input dividend); quo (output quotient)
# R5:R6 = dvs (input divisor); rem (output remainder)
#
# R7:R8 = tmp
# count the number of leading 0s in the dividend
cmpwi cr0,R3,0 # dvd.msw == 0?
cntlzw R0,R3 # R0 = dvd.msw.LZ
cntlzw R9,R4 # R9 = dvd.lsw.LZ
bne cr0,lab1 # if(dvd.msw == 0) dvd.LZ = dvd.msw.LZ
addi R0,R9,32 # dvd.LZ = dvd.lsw.LZ + 32
lab1:
# count the number of leading 0s in the divisor
cmpwi cr0,R5,0 # dvd.msw == 0?
cntlzw R9,R5 # R9 = dvs.msw.LZ
cntlzw R10,R6 # R10 = dvs.lsw.LZ
bne cr0,lab2 # if(dvs.msw == 0) dvs.LZ = dvs.msw.LZ
addi R9,R10,32 # dvs.LZ = dvs.lsw.LZ + 32
lab2:
# determine shift amounts to minimize the number of iterations
cmpw cr0,R0,R9 # compare dvd.LZ to dvs.LZ
subfic R10,R0,64 # R10 = dvd.SD
bgt cr0,lab9 # if(dvs > dvd) quotient = 0
addi R9,R9,1 # ++dvs.LZ (or --dvs.SD)
subfic R9,R9,64 # R9 = dvs.SD
add R0,R0,R9 # (dvd.LZ + dvs.SD) = left shift of dvd for
# initial dvd
subf R9,R9,R10 # (dvd.SD - dvs.SD) = right shift of dvd for
# initial tmp
mtctr R9 # number of iterations = dvd.SD - dvs.SD
# R7:R8 = R3:R4 >> R9
cmpwi cr0,R9,32 # compare R9 to 32
addi R7,R9,-32
blt cr0,lab3 # if(R9 < 32) jump to lab3
srw R8,R3,R7 # tmp.lsw = dvd.msw >> (R9 - 32)
li R7,0 # tmp.msw = 0
b lab4
lab3:
srw R8,R4,R9 # R8 = dvd.lsw >> R9
subfic R7,R9,32
slw R7,R3,R7 # R7 = dvd.msw << 32 - R9
or R8,R8,R7 # tmp.lsw = R8 | R7
srw R7,R3,R9 # tmp.msw = dvd.msw >> R9
lab4:
# R3:R4 = R3:R4 << R0
cmpwi cr0,R0,32 # compare R0 to 32
addic R9,R0,-32
blt cr0,lab5 # if(R0 < 32) jump to lab5
slw R3,R4,R9 # dvd.msw = dvd.lsw << R9
li R4,0 # dvd.lsw = 0
b lab6
lab5:
slw R3,R3,R0 # R3 = dvd.msw << R0
subfic R9,R0,32
srw R9,R4,R9 # R9 = dvd.lsw >> 32 - R0
or R3,R3,R9 # dvd.msw = R3 | R9
slw R4,R4,R0 # dvd.lsw = dvd.lsw << R0
lab6:
# restoring division shift and subtract loop
li R10,-1 # R10 = -1
addic R7,R7,0 # clear carry bit before loop starts
lab7:
# tmp:dvd is considered one large register
# each portion is shifted left 1 bit by adding it to itself
# adde sums the carry from the previous and creates a new carry
adde R4,R4,R4 # shift dvd.lsw left 1 bit
adde R3,R3,R3 # shift dvd.msw to left 1 bit
adde R8,R8,R8 # shift tmp.lsw to left 1 bit
adde R7,R7,R7 # shift tmp.msw to left 1 bit
subfc R0,R6,R8 # tmp.lsw - dvs.lsw
subfe. R9,R5,R7 # tmp.msw - dvs.msw
blt cr0,lab8 # if(result < 0) clear carry bit
mr R8,R0 # move lsw
mr R7,R9 # move msw
addic R0,R10,1 # set carry bit
lab8:
bdnz lab7
# write quotient and remainder
adde R4,R4,R4 # quo.lsw (lsb = CA)
adde R3,R3,R3 # quo.msw (lsb from lsw)
mr R6,R8 # rem.lsw
mr R5,R7 # rem.msw
blr # return
lab9:
# Quotient is 0 (dvs > dvd)
mr R6,R4 # rmd.lsw = dvd.lsw
mr R5,R3 # rmd.msw = dvd.msw
li R4,0 # dvd.lsw = 0
li R3,0 # dvd.msw = 0
blr # return


3.2.3.8 Bit Manipulation

Extracting and inserting bit fields in register sized quantities are common operations. For example, consider the following C structure declaration:

struct {
unsigned f1 :1;
unsigned f3 :3;
unsigned f4 :4;
unsigned f8 :8;
} x;

This structure can be packed into a 32-bit word from left-to-right, consistent with a big-endian system, as shown in Figure 3-41. Figure 3-42 presents instructions to extract these bit fields. Figure 3-43 presents instructions to insert into these bit fields.


Figure 3-41. Structure x



Figure 3-42. Code sequences to Extract Bit Fields

rlwinm Rt,Rx,1,31,31 # Rt = f1 from Rx
rlwinm Rt,Rx,4,29,31 # Rt = f3 from Rx
rlwinm Rt,Rx,8,28,31 # Rt = f4 from Rx
rlwinm Rt,Rx,16,24,31 # Rt = f8 from Rx



Figure 3-43. Code Sequences to Insert Bit Fields

rlwimi Rt,Rx,31,0,0 # insert f1 into Rt from Rx
rlwimi Rt,Rx,28,1,3 # insert f3 into Rt from Rx
rlwimi Rt,Rx,24,4,7 # insert f4 into Rt from Rx
rlwimi Rt,Rx,16,8,15 # insert f8 into Rt from Rx


3.2.3.9 Multiple-Precision Shifts

A multiple-precision shift is the shift of an N-doubleword quantity (64-bit mode) or an N-word quantity (32-bit mode). The quantity to be shifted is contained in N consecutive registers. For further details, including immediate and algebraic shifts, see Appendix E of Book I of The PowerPC Architecture and Kacmarcik [1995]. For a general reference, see Lamport [1975].

Figure 3-44 shows the example of the left shift of a 3-word quantity stored in R2 through R4. The size of the shift obviously affects the algorithm, as indicated in the figure. Figure 3-45 shows assembly code that executes this shift. The shift instructions in the code function in such a way that those which are nonzero when the shift is less than 32 bits, are zero for shifts greater than or equal to 32 bits and vice versa.


Figure 3-44. Left Shift of a 3-Word Value



Figure 3-45. Code Sequence to Shift 3 Words Left When sh < 64

# R6 = shift amount, sh
# R2 (msw) to R4 (lsw) = multi-word
# R0 = temporary
# introductory assignments
subfic R31,R6,32 # R31 = 32 - sh
addi R30,R6,-32 # R30 = sh - 32
subfic R29,R6,64 # R29 = 64 - sh
# register 2
slw R2,R2,R6 # R2 << sh (nonzero if sh < 32)
srw R0,R3,R31 # (unsigned) R3 >> 32 - sh
# (nonzero if sh < 32)
or R2,R2,R0 # combine results in R2
slw R0,R3,R30 # R3 << sh - 32 (nonzero if sh > 31)
or R2,R2,R0 # combine results in R2
srw R0,R4,R29 # (unsigned) R4 >> 64 - sh
# (nonzero if sh > 31)
or R2,R2,R0 # combine results in R2
# register 3
slw R3,R3,R6 # R3 << sh (nonzero if sh < 32)
srw R0,R4,R31 # (unsigned) R4 >> 32 - sh
# (nonzero if sh < 32)
or R3,R3,R0 # combine results in R3
slw R0,R4,R30 # R4 << sh - 32 (nonzero if sh > 31)
or R3,R3,R0 # combine results in R3
# register 4
slw R4,R4,R6 # R4 = r4 << sh (nonzero if sh < 32)


3.2.3.10 String and Memory Functions

Library functions such as those that perform block moves or pattern searching may perform better by:

Searching for the First Occurrence of a Specified Byte Value
The end of a string in the C language is denoted by an all-0 byte or null character. Therefore, the length of a string is determined by searching the string with increasing address for the 0-byte, and returning the number of bytes scanned, not counting the 0-byte. A string length function might load and test single bytes until reaching a word boundary, and then load a word at a time into a register and test the register for the presence of the 0-byte. In the big-endian case, the index of the first 0-byte from the high-order end of the register is desired. A convenient encoding is values from 0 to 3, denoting bytes 0 to 3, and a value of 4 denoting there is no 0-byte in the word. Therefore, a value of 0 through 4 is returned. This value is added to the string length as successive words are searched. The little-endian case functions analogously. Figure 3-46 shows a branch-free implementation of this function, which uses the cntlzw instruction.



Figure 3-46. Find Leftmost 0-Byte: Non-Branching Code Sequence

lis Rc,0x7F7F
addi Rc,Rc,0x7F7F # c = 0x7F7F7F7F
and Ry,Rx,Rc # x & 0x7F7F7F7F
or Rt,Rx,Rc # x | 0x7F7F7F7F
add Ry,Ry,Rc # (x & 0x7F7F7F7F) + 0x7F7F7F7F
nor Ry,Ry,Rt # ~(y | t)
# nonzero bytes = 0x00
# zero bytes = 0x80
cntlzw Rn,Ry # n = 0, 8, 16, 24, or 32
srwi Rn,Rn,3 # divide by 8 to get result

The same branch-free algorithm can be used to search for any particular byte value by first XORing the word to be searched with a word consisting of the desired byte value replicated in each byte position. For example, to search x for an ASCII blank (0x20), search x ^ 0x20202020 for a 0-byte. Two words can be compared for matching bytes by searching for a 0-byte in the word resulting from the XOR of the two words. If all that is required is a test for a 0-byte, use the preceding code sequence up to the nor instruction. If the word contains a 0-byte, the result of the nor instruction is nonzero. By using the recording form of the nor instruction, the EQ bit in CR0 can be tested by a subsequent conditional branch.

Memset
Figure 3-47 shows a version of the C memset function, which copies a specified byte value into a range of bytes beginning at some destination address. The Move Assist instructions may be used to implement this function, but not all implementations execute them well, and little-endian systems must trap and emulate them. This memset function employs scalar store instructions. This code sequence loads a register with four copies of the byte value. The sequence begins by storing a byte and/or halfword until it reaches a word boundary, at which point it can use aligned-word stores. If any bytes remain after the final aligned-word store, byte stores manage these.



Figure 3-47. Memset Code Sequence with Scalar Store Instructions

# R3 = address of the start of the block of memory
# LSB of R4 = value to be copied
# R5 = size of the block in bytes
mr R0,R5
mr R31,R3
cmplwi cr1,R0,3 # total vs. 3
rlwimi R5,R4,8,16,23 # low-order of R5 = 2 copies of byte
rlwimi R5,R5,16,0,15 # R5 = 4 copies of byte
ble cr1,done # if (total <= 3) goto done
andi. R6,R3,3 # low-order 2 bits of dest. address
cmpwi cr1,R6,2
beq cr0,W_align # if (2 low-order bits == 0)
# block is word aligned
subf R0,R6,R0
beq cr1,H_align # block is halfword aligned
stb R5,0(R31) # store 1 byte
addi R31,R31,1
blt cr1,W_align # remainder of block is word aligned
H_align: # halfword aligned
sth R5,0(R31)
addi R31,R31,2
W_align: # remainder of block is word aligned
cmpwi cr0,R0,0 # set cr0 comparing R0 to 0
srwi R6,R0,3 # total bytes/8
cmplwi cr2,R0,8 # total vs. 8
mtctr R6 # CTR = number of 8 byte blocks
addi R31,R31,-4 # R31 = R3 - 4
blt cr2,done # total < 8 goto done
andi. R0,R0,7 # R0 = total % 8
loop:
stw R5,4(R31)
stwu R5,8(R31) # issue 2 aligned stores per iteration
bdnz loop # loop till no more 8-byte blocks
done:
beqlr cr0 # return if zero bytes left
mtctr R0 # CTR = number of bytes left
addi R31,R31,-1
bloop:
stbu R5,1(R31)
bdnz bloop
blr # return, R3 = destination address


[ Table of Contents | Index ]
Copyright 1998 IBMchips