IBM Search Java Site Map Microelectronics News Products Services Technology About Us IBM Microelectronics Order Contact Legal

power pcProducts

overview
news
products
documents
performance
technology


[ Table of Contents | Index ]
Chapter 3

3. Code Selection


An effective compiler must select machine language representations that perform best for the operations and structures of the high-level language being compiled. This selection does not necessarily occur at a single step in the compilation process, but rather may be distributed among several steps. Some of the selection may occur during the transformation of the source code to an intermediate language, during various optimization transformations, or during the final generation of the assembly or machine language code.

An important consideration for code selection is the relationship between instructions. A dependence is a relationship between two instructions that requires them to execute in program order. A control dependence requires an instruction to execute in program order with respect to a branch instruction. An instruction has a data dependence on a preceding instruction when its source operand is the result of the preceding instruction. This result may be an indirect input through a data dependence on one or more intermediate instructions. Two instructions have a name dependence when, although not data dependent, both access a particular register or memory location as an operand, so they must be executed in program order. If the register or memory location for one of the instructions is renamed, the name dependence is removed. An output dependence is a name dependence for which the instruction's destination register or memory location is the preceding instruction's destination register or memory location. An antidependence is a name dependence for which the instruction's destination register or memory location is the preceding instruction's source register or memory location.

This chapter shows how to compile certain common operations and structures in high-level languages, such as C or Fortran, into PowerPC instructions. For the purpose of clarity, the examples illustrate simple sequences that generally do not include full scheduling optimizations.


3.1 Control Flow

The performance of modern processors is extremely sensitive to branches and program branching behavior. Branches fall into one of three categories: unconditional branches, conditional branches that select one of two potential successor addresses (one of which is the next sequential address, sometimes called the fall-through successor), and unconstrained branches whose target address is computed.

Programs decompose into basic blocks, which are single-entry, multiple-exit units with no internal branch targets. Basic blocks form the simplest unit of optimization because, within a basic block, only local data dependencies need be considered. Optimizations that span multiple basic blocks require more complicated analysis of data flow within the program. Thus, minimizing the use of branches increases the size of the basic blocks and simplifies optimization.

A branch is unresolved when either the condition or the target address is unavailable when the branch is processed. If the processor delays execution until a branch is resolved, the pipeline usually stalls. As an alternative, the processor may execute a predicted path of the unresolved branch. When the branch is resolved, if the prediction was correct, execution simply continues. If the prediction was incorrect, the processor must back up, cancel speculative instructions subsequent to the branch, and begin execution from the correct instruction, a potentially large penalty in pipelined and superscalar processors. By using branch instructions that are always resolved (unconditional branches or branch-on-count instructions) or by ensuring that the condition and target address are available before the branch is processed, the unused cycles and possible mispredict penalties of an unresolved branch may be avoided. If a branch, taken or not, is resolved early enough, prefetching along the target path may permit the execution of the branch in parallel with other instructions preventing stalls in the pipeline. Therefore, a highly desirable optimization is to compute the condition or to load the Link or Count Register as early as possible in the basic block so that the dependent branch is resolved when it is encountered.

Because most PowerPC processors are pipelined and superscalar, many clock cycles may go unused during branch-resolution delays. Therefore, most PowerPC processors have the ability to execute speculatively. The branch prediction may be derived from the static prediction (y) bit in the conditional branch instruction or from dynamic branch-prediction hardware. Accurate predictions can dramatically improve performance.

The PowerPC architecture allows a variety of features that can reduce the number of branches, enable the early resolution of conditional branches, and permit the accurate prediction of unresolved branches. These features include:


3.1.1 Architectural Summary

The PowerPC architecture supports many flow-control options. The Link Register can be used to save the return address during subroutine linkage. The Count Register can be used to maintain the loop counter for loops with a fixed number of iterations. The result of compare and record operations are first-class values, so all the traditional optimizations can be applied (e.g., common sub-expression and dead code elimination). Book I: Chapter 2 of the The PowerPC Architecture includes complete details regarding the control-flow resources and instructions.


3.1.1.1 Link Register

The Link Register is loaded either directly from a General-Purpose Register (using the mtlr instruction) or by executing a branch instruction with the link bit (LK) in the instruction set to one. All branch instructions permit setting LK to one to save the address of the next instruction in the Link Register. A subsequent branch to the Link Register using the bclr[l] instruction initiates a transfer of control to the instruction following the branch that last wrote the Link Register. This mechanism permits subroutine linkage because a branch to a function may automatically write its return address in the Link Register. In fact, a branch instruction with LK set to one writes the address of the next instruction in the Link Register whether the branch is actually taken or not. Optimizations that attempt to minimize the saving and restoring of the Link Register need to be aware of this feature.

The address of the next instruction is given by:


bcl 20,31,$+4

The value 20 in the BO field means branch always, so the result of the test of bit 31 in the Condition Register is ignored. This instruction is an unconditional branch to the relative address 4 bytes forward (the next instruction) with LK set to one so that the address of the next instruction is written to the Link Register.

During a control transfer to a computed target address, depending on the local context and ABI constraints, either the Link Register or the Count Register may hold the target address in order to preserve some ongoing use of the other. The Link Register, however, typically maintains the return address for function linkage, so control transfer to a computed address normally uses the Count Register. The PowerPC architecture does not require this division of labor.

In most ABIs, the Link Register is volatile and used for subroutine linkage. A function may save the Link Register value in its stack frame or leave it in a General-Purpose Register. If the intent is to return via a branch to the address in the Link Register, the value must be restored to the Link Register before the return.


3.1.1.2 Count Register

If a loop has a predetermined number of iterations, this number may be written to the Count Register so that a branch-on-count instruction (a conditional branch with bit 2 of the BO field cleared) can automatically decrement the Count Register and test it for zero. Branch-on-count operations are always resolved, except perhaps for the first iteration if the value in the Count Register is not yet valid. In 64-bit implementations, whether in 64-bit or 32-bit mode, the processor decrements the entire 64-bit Count Register, but tests only the low-order 32 bits in 32-bit mode.

To transfer control to a computed address, the Count Register may be loaded with a target address from a General-Purpose Register using the mtctr instruction, followed by a branch to the Count Register. When possible, independent instructions should separate the load of the Count Register from the branch to prevent pipeline stalls.

In most implementations, saving and restoring the Count Register introduces delays that cancel the performance advantage of the branch-on-count operation over the equivalent add, compare and branch sequence. In particular, the branch-on-count operation should be used only on a single loop in a set of nested loops (usually the innermost loop that can be implemented using it) and should not be used if the loop body contains a call to a subprogram that may alter the Count Register.


3.1.1.3 Condition Register

The 32-bit Condition Register reflects certain properties of computations and their results, either implicitly through recording operations or explicitly through comparison operations and direct transfers. Condition Register logical instructions can manipulate individual Condition Register bits. Conditional branch instructions may test these computational properties.

The Condition Register has various interpretations. The mfcr instruction treats it as a 32-bit register. Recording instructions, compare instructions, mcrf, mcrxr, and mcrfs treat it as eight 4-bit registers. Conditional branch and Condition Register logical instructions treat it as 32 1-bit registers. The meaning associated with each bit is managed by the compiler.

Recording Instructions
Most fixed-point operations have recording instruction forms, that is, forms that write a 4-bit value to Condition Register field 0 that includes three bits denoting whether the result is less than, greater than, or equal to zero, and the summary overflow (SO) bit. This implicit recording may eliminate the need for an additional comparison before a conditional branch. Conditional branch instructions can test SO for fixed-point exception support.

Most floating-point operations have recording instruction forms, that is, forms that write a 4-bit value to Condition Register field 1 that includes floating-point exception summary information. This implicit recording allows conditional branches to test for floating-point exceptions generated by the recording instruction or a previous instruction.

Compare Instructions
The result of an fixed-point or floating-point comparison can be written to any of the Condition Register fields. Compare instructions treat the Condition Register as eight 4-bit fields. Condition Register logical instructions may combine several of the results from such comparisons, potentially eliminating branches.

Fixed-point comparisons may be signed or unsigned, and 32-bit or 64-bit (for 64-bit implementations). Floating-point comparisons may be ordered, which cause an exception if either operand is a NaN, or unordered, which cause an exception only if either operand is an SNaN.

mcrf Instruction
The mcrf instruction copies the contents of one Condition Register field to another, treating the Condition Register as eight 4-bit fields. In particular, this instruction can copy the result of one recording instruction out of Condition Register field 0 or 1 so that a second recording instruction will not overwrite the previous record.

If the linkage conventions require certain Condition Register fields to be preserved across call boundaries and the cost of saving the Condition Register in memory is prohibitive, the mcrf instruction can move data from volatile to nonvolatile Condition Register fields. This use of the Condition Register is limited by the aggressiveness of optimizers and assembly programmers.

mtcrf and mfcr Instructions
The mtcrf instruction loads 4-bit fields of the Condition Register from a General-Purpose Register. The mfcr instruction copies the contents of the Condition Register to a General-Purpose Register. The most common use for these instructions is saving and restoring fields of the Condition Register across subroutine boundaries. Depending on ABI requirements, the contents of the Condition Register may remain in a General-Purpose Register or be stored in memory.

mcrxr Instruction
The mcrxr instruction moves bits 0:3 of the XER to one of the Condition Register fields. This field of the XER contains the Summary Overflow (SO), Overflow (OV), and Carry (CA) bits. Their presence in the Condition Register allows conditional branching on these bits for fixed-point exception support. This instruction clears bits 0:3 of the XER.

mcrfs Instruction
The mcrfs instruction copies the contents of a 4-bit field of the FPSCR to a 4-bit field of the Condition Register. The FPSCR contains the floating-point exception summary bits, the floating-point exception bits, the Floating-Point Fraction Rounded bit, the Floating-Point Fraction Inexact bit, and the Floating-Point Result Flags. One way to interrogate these bits is to load them into the Condition Register. Then, conditional branching can determine the presence and nature of an exception and the class of a floating-point result.

Condition Register Logical Instructions
The Condition Register logical instructions (crand, cror, crxor, crnand, crnor, creqv, crandc, and crorc) allow direct manipulation of bits in the Condition Register. These eight instructions can combine the results of multiple conditions into a single condition for test by a conditional branch. Condition Register logical instructions treat the Condition Register as 32 1-bit fields.


3.1.2 Branch Instruction Performance

Implementation details affect the manner in which the control flow instructions are selected. The dispatch and cache access behavior favor the fall-through path of branches. Using some branch unit or recording instructions unnecessarily can degrade performance.


3.1.2.1 Fall-Through Path

When possible, the most likely outcome of a branch should lie on the fall-through (not-taken) path. Consider a PowerPC processor that can dispatch four instructions per cycle. If an unresolved branch is the first instruction, the remaining three instructions, those lying on the fall-through path, may be dispatched speculatively in the same cycle as the branch, even if the branch is predicted taken. No fetch is needed to access instructions on the fall-through path. If the unresolved branch is the last instruction and it is predicted taken, the next group of fetched instructions will come from the branch target. It is also likely, however, that the instructions on the fall-through path are available in the cache.


3.1.2.2 Needless Branch Register and Recording Activity

On some PowerPC implementations, the execution of the mtlr, mflr, mtctr, mfctr or Condition Register instructions causes serialized execution, when they could otherwise have executed in parallel. If a branch instruction has the LK bit set to one, it loads the Link Register with the address of the following instruction, whether the branch is taken or not. Therefore, do not enable LK unless you need the address of the next instruction for some purpose, such as function linkage. Setting the record bit of instructions needlessly can prevent parallel execution by introducing resource dependencies.


3.1.2.3 Condition Register Contention

Conditional branch instructions and Condition Register logical instructions treat the Condition Register as 32 1-bit fields. Most implementations, however, treat the Condition Register as a set of eight 4-bit fields, so better timing characteristics result if the destination bit is in a different field than either of the source fields, thereby avoiding a delay due to contention. For example, on some implementations,


cror 11,0,2 # cr2[so] = cr0[lt] | cr0[eq]

will execute faster than


cror 3,0,2 # cr0[so] = cr0[lt] | cr0[eq].


3.1.3 Uses of Branching

High-level language flow control features map to unconditional branches, conditional branches, multi-way conditional branches, counting loops, and function calls and returns. The following sections describe PowerPC code to implement these features.


3.1.3.1 Unconditional Branches

Unconditional flow control statements in C include unconditional goto, break, continue, and return. The unconditional goto, break, and continue statements may be implemented using unconditional branches to relative immediate addresses (b) or using unconditional branches to the Link or Count Register (bclr or bcctr). The return statement is implemented using a branch to the Link Register (see Section 3.1.3.5 on page 32).

Branching to an absolute address transfers control to the sign-extended word address given in the immediate field of the branch instruction. Some ABIs use the linking forms of this instruction (bla or bcla) to call special functions.


3.1.3.2 Conditional Branches

Conditional branching statements in C include if and if-else, which are most often implemented using conditional branch instructions and perhaps some unconditional branches. The conditional expression, which, depending on the source language, can take various forms, is evaluated, and the result is placed in a bit of the Condition Register. The most common way to write the result of a conditional expression to the Condition Register for testing by conditional branch instructions involves compare instructions or recording instructions, both of which set a 4-bit field in the Condition Register. These bits may be further manipulated using Condition Register logical instructions. Figure 3-1 shows C and assembly code for an if-else sequence. This simple example uses a recording instruction form to provide the conditions for both conditional branches.

Some simple sequences that involve fixed-point comparisons can be replaced with branch-free code, as shown in Section 3.1.5.1 on page 38 and Appendix D. For floating-point comparisons, the optional Floating Select (fsel) instruction can sometimes replace this type of conditional structure when IEEE 754 compatibility is not required. See Section 3.3.9 on page 86.



Figure 3-1. if-else Code Example

C Source Code
a = b + c;
if (a > 0)
...action 1...
else if (a < 0)
...action 2...
else
...action 3...
Assembly Code
add. Ra,Rb,Rc # recording form generates conditions
# for both conditional branches
ble cr0,lab1 # if a > 0, do action 1
...action 1...
b out # exit
lab1:
bge cr0,lab2 # else if a < 0, do action 2
...action 2...
b out # exit
lab2: # else do action 3
...action 3...
out:


3.1.3.3 Multi-Way Conditional Branches

The multi-way branch construction, as represented by the C switch or the Fortran computed goto, can be implemented in several different ways. These include if-else sequences, branch tables, hash tables, arithmetic progressions, search algorithms over binary or ternary trees, range testing, or combinations of these and others. The choice of the best implementation depends on problem specifics, including the number and distribution of test conditions and the instruction timings and latencies.

Figure 3-2 shows the example of a C Switch and assembly code that implements it as a series of if-else constructions.



Figure 3-2. C Switch: if-else Code Sequence

C Source Code
switch (x) {
case 10: case 11: case 12: case 13: case 14: case 15:
...do_something...
}
Assembly Code
lwz R3,x # load x into R3
cmpwi cr0,R3,10
beq cr0,lab10 # if (x == 10) goto lab10
cmpwi cr1,R3,11
beq cr1,lab11 # else if (x == 11) goto lab11
cmpwi cr5,R3,12
beq cr5,lab12 # else if (x == 12) goto lab12
cmpwi cr6,R3,13
beq cr6,lab13 # else if (x == 13) goto lab13
cmpwi cr7,R3,14
beq cr7,lab14 # else if (x == 14) goto lab14
cmpwi cr0,R3,15
beq cr0,lab15 # else if (x == 15) goto lab15
...
lab10:
lab11:
lab12:
lab13:
lab14:
lab15:
...do_something...

The same example can be coded as a range test as shown in Figure 3-3. Multiple compares and combining branch conditions facilitate this approach.



Figure 3-3. C Switch: Range Test Code Sequence

lwz R3,x # load x into R3
subic R4,R3,10 # tmp = (x - min)
cmplwi cr3,R4,5 # logically compare (tmp, max-min)
bgt cr3,out # if tmp < 0 or tmp > 5,
# x is outside the range [min,max]
...do_something...
out:

Another example involves the use of a branch table. Figure 3-4 shows a C switch. Assume that TABLE contains the 32-bit addresses of code corresponding to the various case n: labels. To branch to a computable address, load that address (after computation) into either the Count Register or the Link Register, and then branch to the contents of the register. The local context and ABI conventions determine which of the special registers to use for this purpose. For example, within the body of an iterative loop based on the value in the Count Register, you might choose the Link Register. If the Link Register is required for other purposes such as subroutine linkage, its contents can be saved in another register or memory temporarily while the Link Register is being used for this purpose. Figure 3-4 shows how to implement the C switch by using the Count Register to hold the branch target.

The cases of a switch are frequently decomposed into clusters, which are a group of cases handled as a unit for the purpose of selecting the implementation. Some guidelines that are useful for selecting the implementation of a switch statement in the PowerPC architecture include:


3.1.3.4 Iteration

A do group is any kind of iterative construct, such as a C for loop or a Fortran do loop. The latch point of an iterative construct is the branch that transfers control from the bottom of the loop back to the top. It is a back edge in the program flow graph. The following instructions may serve as latch points:

A do group has the general form:


loop:
...body of code...
latch_to_loop

The C language while, do, and for statements are examples of do groups. Figure 3-5 shows a simple implementation of the C strlen function that uses the while statement.



Figure 3-5. strlen Code Example

C Source Code
i = 0;
while(s[i] != `\0')
++i;
/* i is the length of the string s */
Assembly Code
addi R4,R3,-1 # initialize
loop:
lbzu R5,1(R4) # read s[i] and increment
cmpwi cr3,R5,0 # compare s[i] to `\0'
bne cr3,loop # if (s[i] != `\0') goto loop
out:
subf R3,R3,R4 # R3 is the length of the string s

Loops for which the number of iterations can be computed at compile time or at execution time represent a special case of do groups. In this case, you may use the branch-on-count form of conditional branch (extended mnemonics bdnz and bdz) for loop control, rather than some form of add, compare, and branch on a Condition Register bit. Branch-on-count operations are almost always resolved, so they may execute in parallel with other instructions, preventing stalls in the fixed-point and floating-point pipelines. Fortran do loops and many of the C for loops represent opportunities to exploit the branch-on-count feature.

Figure 3-6 shows a simple Fortran do loop and the corresponding assembly code. If the program uses the loop index in computations, increment a separate value because of delays associated with access of the Count Register.



Figure 3-6. Branch-On-Count Loop: Simple Code Example

Fortran Source Code
do 10 i=1,10
...loop body...
10 continue
Assembly Code
li R3,10 # number of iterations = 10
li R4,1 # set up induction variable i
mtctr R3 # load CTR with the number of iterations
loop:
...loop body...
addi R4,R4,1 # i = i + 1 (needed if program uses i)
bdnz loop # decrement and branch to loop if (CTR) 0

Figure 3-7 shows an example where the number of iterations is not known until execution time. The program must confirm that the number of iterations is at least one. If not, branch around the loop.



Figure 3-7. Branch-On-Count Loop: Variable Number of Iterations Code Example

Fortran Source Code
do 10 i=1,N
...loop body...
10 continue
Assembly Code
lwz R3,n # load number of iterations = n
cmpwi cr0,R3,1 # compare the number of iterations to 1
li R4,1 # setup induction variable i
blt cr0,out # goto out if n < 1
mtctr R3 # load CTR with the number of iterations
loop:
...loop body...
addi R4,R4,1 # i = i + 1 (needed if program uses i)
bdnz loop # decrement and branch to loop if (CTR) 0
out:

Figure 3-8 shows the example of a loop where the initial and final values of the loop index, as well as the stride, are determined at execution time. You must calculate the number of iterations and confirm that it is at least one. If not, as in the preceding example, branch around the loop.



Figure 3-8. Branch-On-Count Loop: Variable Range and Stride Code Example

Fortran Source Code
do 10 i=ns,nf,nstep
...loop body...
10 continue
Assembly Code
lwz R3,ns # load starting value of i
lwz R4,nf # load final value of i
subf R6,R3,R4 # nf - ns
lwz R5,nstep # load stepping increment
divw R6,R6,R5 # (nf - ns)/nstep
addi R6,R6,1 # (nf-ns)/nstep + 1 = number of iterations
cmpwi cr0,R6,1 # compare the number of iterations to 1
mr R7,R3 # set up induction variable i
blt cr0,out # goto out if number of iterations < 1
mtctr R6 # load CTR with the number of iterations
loop:
...loop body...
add R7,R7,R5 # i = i + nstep (needed if program uses i)
bdnz loop # decrement and branch to loop if (CTR) 0
out:

Figure 3-9 shows a Fortran example with a compound branch form of latch point. The do loop and the if inside the loop combine to form a single latch point. Although this form of branch-on-count is not resolved, it combines the two conditional branches into one.



Figure 3-9. Compound Latch Point Code Example

Fortran Source Code
do 10 i=1,100
...loop body...
if(a .eq. b) goto 20
10 continue
20 continue
Assembly Code
# R3 contains a
# R4 contains b
li R5,100 # i = 100
mtctr R5 # CTR = 100
loop:
...loop body...
cmpw cr3,R3,R4 # compare a and b
bdnzf cr3[eq],loop # decrement and branch to loop
# if (CTR) 0 and a b
lab20:


3.1.3.5 Procedure Calls and Returns

Procedure calls usually use the Link Register to hold the return address. Details of the linkage conventions, such as parameter passing, stack format, and specifics of indirect procedure calling, are ABI-specific. The examples in this section use the Link Register to hold the return address for procedure linkage.

A procedure call can be a relative branch to another procedure or an indirect call through a pointer. For a call through a pointer, the address of the procedure is loaded from memory into a General-Purpose Register. The simplest case is a direct call through a pointer, which copies the procedure address to the Count Register and then branches to the Count Register with LK set to one.

Figure 3-10 shows the C source for two calls that illustrate:

Some system libraries provide stub routines with special linkage conventions to handle transfers via pointers to arbitrary procedure addresses. The examples in Figures 3-10 through 3-12 do not assume such a routine.



Figure 3-10. Function Call Code Example—C Source

#include <stdio.h>
int foo(int i)
{
printf("I'm in function foo\n");
return(i);
}
main()
{ int (*foo_bar)(int i);
foo_bar = foo;
foo(1); /* call foo directly */
foo_bar(2) ; /* call foo via pointer foo_bar */
}

Figures 3-11 shows the relative call in assembly language. The procedure parameter is loaded into R3, and the code branches unconditionally to the address of the procedure with LK set to one.



Figure 3-11. Relative Call to foo Code Sequence

li R3,1 # function argument R3 = 1
bl foo # branch relative to foo sets Link Register

Figure 3-12 shows the call via pointer in assembly language. The address of the procedure is loaded from memory into a General-Purpose Register, and then copied to the Count Register. The procedure parameter is loaded into R3, and the code branches to the Count Register with LK set to one. In both cases, the return will occur by branching to the Link Register, perhaps first having to load it (using mtlr) if it was stored. The Link Register can be used instead of the Count Register in this example.



Figure 3-12. Call to foo Via Pointer Code Sequence

lwz R11,foo # copy address of function foo into R11
li R3,2 # function argument R3 = 2
mtctr R11 # load Count Register
bctrl # branch to contents of Count Register
# copies return address to Link Register

Another possibility involves calling a procedure through an intermediate routine in order to support some type of global subroutine linkage. The details of this subject depend strongly on linkage conventions established in the ABI. Figure 3-13 shows an example of this process. The procedure main executes until it encounters a procedure call through a pointer, represented in this example by the bl instruction. Some identifier of the desired procedure is written to a register that is passed as a parameter. Then, a branch to the intermediate routine, glue, is executed with LK set to one so that the desired procedure, task, will return to the original procedure, main. In glue, the address of task is loaded from memory into a General-Purpose Register and then copied to the Count Register. The glue procedure executes a branch to the Count Register with LK disabled, transferring control to task. When ready to return from task, a branch to the Link Register, which contains the address of the next instruction in main, is executed. For details regarding ABI linkage conventions, see Appendix A.


3.1.3.6 Traps and System Calls

Trap and system call instructions generate conditional and unconditional interrupts, respectively. They are seldom used by compilers. Section 3.3.12 on page 93 describes an example which uses trap instructions to handle floating-point interrupts precisely using software. Another possible use involves bounds checking (see Section 5.8 on page 144).


Figure 3-13. Indirect Subroutine Linkage


3.1.4 Branch Prediction

Branches constitute varying portions of PowerPC programs, but usually around 10% for floating-point code and around 20% for integer code (see Appendix C). Often these branches can neither be avoided nor resolved, so speculatively executing down the path indicated by the predicted outcome of the unresolved branch offers a significant performance gain if the prediction is accurate.

PowerPC implementations have varying types of branch prediction. Some implementations support only static branch prediction; some include dynamic branch prediction hardware. In general, use the static branch prediction bit (the y bit in the BO field of the conditional branch instruction) when the likely direction of a conditional branch is known. The use of this bit helps on some implementations and hurts on none.


3.1.4.1 Default Prediction and Rationale

The default static branch prediction (i.e., the y bit in BO is cleared) corresponds to:

Conditional branches to either the Link Register or the Count Register are assumed to be not taken by default. Conditional branching with a negative displacement (a backward branch) defaults as taken, but with a positive displacement (a forward branch) defaults as not taken. In general, branches tend to be not taken, with the exception of loops, in which they are almost always taken. Loops involve backward branches, so the default for this case is taken. These considerations lead to a Backward Taken/Forward Not-Taken branch prediction algorithm. The rules apply to absolute as well as relative displacements, although the sign of the displacement for absolute branches does not necessarily indicate forward or backward. Branch prediction does not apply to unconditional branches, including the case of a Branch Conditional with the BO field indicating branch always.


3.1.4.2 Overriding Default Prediction

Use conditional branching with the override of the static branch prediction (i.e., the y bit in BO is set to one) when the branch will most likely behave in contradiction to the default prediction. The assembly mnemonic indicates this override by the addition of a suffix: "+" for predicted taken and "-" for predicted not taken. Overriding static branch prediction is useful when:

Ball and Larus [1993] describe heuristics that enable correct static branch prediction more than 80% of the time. Even better prediction is possible if the code is profiled, that is, executed to collect information on branch outcomes for subsequent compilation.

It is preferable to reverse the direction of a branch, rather than override the default branch prediction because branch reversal makes more effective use of the instruction cache. To reverse the direction of a branch, exchange the instructions on the taken path with those on the fall-through path and adjust the condition calculation and program layout appropriately. In some cases, however, it is not convenient to reverse a branch's direction, so overriding the default branch prediction remains an important option.



Figure 3-14. Conditional Return Code Example

C Source Code
int condret(int a)
{
if(a>0) return(a);
else return(a+1);
}
Assembly Code
# R3 contains the input and return value
# the compiler knows that most likely a > 0
cmpwi cr0,R3,0 # set cr0 with comparison a > 0
bgtlr+ # conditional branch to the link register
# determined by cr0[gt], R3 = a
# "+" indicates branch is predicted taken
addi R3,R3,1 # R3 = a + 1
blr # normal return


3.1.4.3 Dynamic Branch Prediction

PowerPC processors have implementation-dependent dynamic branch-prediction capabilities. Although software does not directly control these mechanisms, knowledge of their behavior can help software estimate the costs of misprediction for those processors that implement dynamic prediction. See Section 4.2.2 on page 102 for details.


3.1.5 Avoiding Branches

Branching, both conditional and unconditional, slows most implementations. Even an unconditional branch or a correctly predicted taken branch may cause a delay if the target instruction is not in the fetch buffer or the cache. It is therefore best to use branch instructions carefully and to devise algorithms that reduce branching. Many operations that normally use branches may be performed either with fewer or no branches.

Reducing the number of branches:


3.1.5.1 Computing Predicates

Predicates utilize a boolean expression as an integer value. For example, in C:

ne = ((x != y) ? 1 : 0);

Figure 3-15 shows how to calculate this expression using an ordinary control flow translation.



Figure 3-15. Predicate Calculation: Branching Code Sequence

cmpw cr0,Rx,Ry # place compare result in cr0
li R3,1 # R3 = 1
bne cr0,lab # x != y
li R3,0 # R3 = 0
lab:

You can avoid the branch by using Condition Register logical instructions, as shown in Figure 3-16. In this case, the result of the comparison is directly transferred from the Condition Register to a General-Purpose Register, from which the bit is extracted and then flipped.



Figure 3-16. Predicate Calculation: Condition-Register Logical Code Sequence

cmpw cr0,Rx,Ry # place compare result in cr0
mfcr R4 # R4 = condition register
rlwinm R5,R4,3,31,31 # extract the cr0[eq] bit
xori R3,R5,1 # flip the bit to obtain 0/1

Some implementations have delays associated with accessing the Condition Register using the mfcr instruction. An alternative that uses only fixed-point operations is shown in Figure 3-17.



Figure 3-17. Predicate Calculation: Fixed-Point-Operation Code Sequence

subf R0,Rx,Ry # R0 = y - x
subf R3,Ry,Rx # R3 = x - y
or R3,R3,R0 # R3 = R3 | R0
# sign bit holds desired result
rlwinm R3,R3,1,31,31 # extract the sign bit

You can generate all boolean predicates with straight-line code that does not use the Condition Register. Figure 3-18 shows arithmetic expressions that yield a sign-bit reflecting the appropriate result.



Figure 3-18. Arithmetic Expressions for Boolean Predicates

Boolean Predicate Arithmetic Expression
x y (x - y) | (y - x)
x = y ¬((x - y) | (y - x))
x < y (x & ¬y) | ((x y) & (x - y))
x y (x | ¬y) & ((x y) | ¬(y - x))
x < y, unsigned (¬x & y) | ((x y) & (x - y)), or (¬x & y) | ((¬x | y) & (x - y))
x y, unsigned (¬x | y) & ((x y) | ¬(y - x))

Shorter sequences exist for many of these operations. The GNU superoptimizer is a program that exhaustively searches a subset of machine instructions to find the shortest branch-free combinations that perform a specified operation. Appendix D lists the PowerPC GNU superoptimizer results for a number of functions.


3.1.5.2 Conditionally Incrementing a Value by 1

Figure 3-19 shows the C code fragment for conditionally incrementing a value by 1 and equivalent branching and non-branching assembly code sequences. For simple conditions, branch-free equivalents can be formed using computed predicates. See Appendix D.



Figure 3-19. Conditionally Incrementing a Value by 1 Code Example

C Source Code
if (a < b) ++b;
Branching Code
# R3 contains a
# R4 contains b
cmpw cr0,R3,R4 # compare a and b
bge cr0,lab # branch if a >= b
addi R4,R4,1 # b = b + 1
lab:
# R4 contains the desired result
Branch-Free Code
# R3 contains a
# R4 contains b
subf R0,R4,R3 # a - b
eqv R2,R3,R4 # a b
and R2,R0,R2 # (a b) & (a - b)
andc R0,R3,R4 # a & ~b
or R0,R0,R2 # (a & ~b) | ((a b) & (a - b))
rlwinm R0,R0,1,31,31 # extract predicate
add R4,R4,R0 # if (a < b) then b++


3.1.5.3 Condition Register Logical

The Condition Register logical instructions can be used to combine several branch conditions, thus reducing the number of branches. For example, Figure 3-20 shows the C code fragment for a complex conditional expression and two equivalent assembly code sequences: one that has a comparison and branch for each side of the OR, and another that uses a Condition Register logical OR to combine the results of the compare and recording operations for a single test by a conditional branch. This form may present itself in situations where a common sub-expression exists between this and other code, thus offering opportunities for multiple compare and recording instructions within a single basic block.



Figure 3-20. Complex Condition Code Example

C Source Code
if ((a + b) < 0) || ((x + y) > 0)
...do_something...
Separate Branches
add. R3,Ra,Rb # perform add (a + b) with record
blt cr0,lab1 # if (a + b) < 0, goto lab1
add. R4,Rx,Ry # perform add (x + y) with record
ble cr0,else # if (x + y) <= 0, goto else
lab1:
...statement...
else:
Combined Branch
add R3,Ra,Rb # perform add (a + b)
cmpwi cr3,R3,0 # compare (a + b) to 0
add. R4,Rx,Ry # perform add (x + y) with record
cror 27,1,12 # cr6[so] = cr0[gt] | cr3[lt]
bf cr6[so],else # branch to else if condition bit is false
...statement...
else:

Figure 3-21 shows code sequences for a C switch for which the optimal implementation of the multi-way branch is simply a sequence of compare-branch tests. Because the tests all have a common target address, they can be combined using Condition Register logical instructions, reducing the total number of branches from four to one. For a specific implementation, compare the timing of sequences using Condition Register logical instructions to the equivalent multiple-branch sequences because the relative performance may vary.



Figure 3-21. C Switch: Condition Register Logical Code Example

C Source Code
switch(i){
case 0: case 20: case 30: case 40:
i+=10; break;
}
Assembly Code
lwz R3,i # load i into R3
cmpwi cr0,R3,0 # compare R3 to 0 -> cr0
cmpwi cr1,R3,20 # compare R3 to 20 -> cr1
cmpwi cr6,R3,30 # compare R3 to 30 -> cr6
cmpwi cr7,R3,40 # compare R3 to 40 -> cr7
cror cr5[eq],cr0[eq],cr1[eq] # cr5[eq] = cr0[eq] | cr1[eq]
cror cr0[eq],cr7[eq],cr6[eq] # cr0[eq] = cr7[eq] | cr6[eq]
cror cr1[eq],cr5[eq],cr0[eq] # cr1[eq] = cr5[eq] | cr0[eq]
bne cr1,out # i != 0, 20, 30, 40, goto out
addi R3,R3,10 # i += 10
stw R3,i # store new i
out:


[ Table of Contents | Index ]
Copyright 1998 IBMchips