IBM Search Java Site Map Microelectronics News Products Services Technology About Us IBM Microelectronics Order Contact Legal

power pcProducts

overview
news
products
documents
performance
technology


[ Table of Contents | Index ]

4.3 Scheduling

Scheduling is a machine-dependent optimization that reorders the instruction sequence subject to data and control flow restrictions so as to minimize the execution time for a given processor hardware. To do effective scheduling, the compiler must possess a model for the processor that reflects instruction timing and serialization properties, mispredict penalties for branches, and hardware resources available. Instruction scheduling is an area of code generation that is sensitive to small perturbations of the algorithm parameters.


4.3.1 Fixed-Point Instructions

The availability of the result of a fixed-point instruction depends on the length of the pipeline in the Fixed-Point Unit and whether forwarding is supported. Figure 4-3 shows the instruction pipeline for most fixed-point instructions. Each stage requires a single cycle. The completion and write back stages usually occur concurrently. The result is forwarded from the execution stage to the input latches of the execution stage for use by a subsequent integer instruction, which can execute concurrently with the write back stage of the first. Therefore, two dependent, single-execution-cycle fixed-point instructions can execute in consecutive cycles.


Figure 4-3. Integer Instruction Pipeline

The fixed-point multiply and divide instructions generally require multiple cycles of execution. The instruction pipeline has the same form as for single-cycle instructions, but these instructions remain in the execution stage for several cycles, the number of which is implementation-dependent (see Appendix B for the specific timing values). In most PowerPC implementations, the number of cycles may also depend on the values of the operands. For example, if a sufficient number of the high-order bits of the multiplier are sign-extension bits, the number of cycles may be reduced.

Although most implementations forward the General-Purpose Register result of a recording instruction from the execution stage to the input latches of the fixed-point execute unit for the next fixed-point instruction, the condition code result may not be available in the Branch Processing Unit until after the write back stage. Similar restrictions may also apply to the Carry (CA) and Overflow (OV) fields in the Fixed-Point Exception Register (XER).


4.3.2 Floating-Point Instructions

Figure 4-4 shows the floating-point pipeline. The execute stages have a multiply-add structure: 1 cycle for multiply, 1 cycle for add, and one cycle for normalization. Some implementations combine these stages in different ways. Each stage of the pipeline requires 1 cycle, but the Floating-Point Unit displays more variation in timing than the Fixed-Point Unit. Implementations that have only a single-precision multiplier or adder increase the latency of some double-precision operations by 1 cycle; for example, a double-precision multiply instruction would occupy a single-precision multiply stage for 2 cycles. Divide, square root, and floating reciprocal estimate are instructions that require multiple cycles to execute. Floating-point operations whose source or destination operands are IEEE 754 special values (NaNs or infinity) may require additional cycles. With these exceptions, independent floating-point instructions can be dispatched on successive cycles. Higher-performance PowerPC processors forward the result from the normalize execute stage to the input latches of the multiply execute stage for use by a subsequent dependent floating-point instruction. This forwarding reduces the stall between consecutive dependent floating-point instructions by 1 cycle.


Figure 4-4. Floating-Point Instruction Pipeline


4.3.3 Load and Store Instructions

Load and store instructions have complex timing and reordering properties because they access cache and memory. Organizing code to minimize cache misses is a difficult implementation-specific task that is normally attempted only in research compilers, but code that minimizes cache misses may realize significant performance gains. If a data-cache access misses, several cycles are required to retrieve the data from memory. This section assumes that all accesses hit in the cache.

Figure 4-5 shows the instruction pipeline for load and store instructions. The execution of load and store operations is composed of two stages. The first stage calculates the address. The second stage performs the cache access and either loads or stores the value when the data is available. Floating-point loads and stores differ from fixed-point loads and stores, even though both execute in the same Fixed-Point Unit or Load-Store Unit, due to differences in operand availability, conversions, and alignment. In most implementations, load instructions forward their result to the input latches of either the fixed-point or floating-point execute stages. The load of a General-Purpose Register should be separated from the instruction which uses it by the cache access delay, which is usually 1 cycle. The cache access delay for the load of a Floating-Point Register is usually 2 cycles.


Figure 4-5. Load-Store Instruction Pipeline

The example in Figure 4-6 uses pointer chasing to illustrate how execution pipelines can stall because of the latency for cache access. This latency stalls dispatch of the dependent compare, creating an idle execution cycle in the pipeline. Moving an independent instruction between the compare and the branch can hide the stall, that is, perform useful work during the delay. The delay is referred to as the load-use delay. The same principle applies to any instruction which follows the load and has operands that depend on the result of the load.

To enhance performance, some PowerPC implementations may dynamically reorder the execution of memory accessing instructions, executing loads prior to stores with the intent of preventing processor starvation. Processor starvation occurs when an execution unit is stalled waiting for operand data. This reordering could violate program semantics if a reordered load is executed prior to a store that modifies an overlapping area in memory. This situation is called a load-following-store contention. PowerPC implementations must correct this situation in order to maintain correct program behavior, but the mechanism of correction varies among implementations. The correction, however, must result in re-executing the load in program order. This serialization of the load may involve redispatching or even refetching the load and subsequent instructions, thus significantly adding to the effective latency of the load instruction. This situation can arise in implementations with a single Load-Store Unit that dynamically reorder the loads and stores, or in implementations with multiple Load-Store Units, which can execute a load instruction and a store instruction during the same cycle in different units.



Figure 4-6. Pointer Chasing—Load-Use Delay

C Source Code
typedef struct ss {
struct ss *p_prev,*p_nxt;
int field;
} SS,*P_SS;
P_SS p_s;
void t001(int i,P_SS x)
{ P_SS p;
for(p = p_s; ; p = p->p_nxt){
if(p->p_nxt == x) break;
}
p->field = i;
}
Assembly Code for Loop Body
# R4 contains x
CL.40:
mr R5,R0 # p = p->p_nxt
lwz R0,4(R5) # load p->p_nxt
cmplw cr0,R0,R4 # compare p->p_nxt and x
bne cr0,CL.40 # if p->p_nxt != x, branch back
CL.4:

The example in Figure 4-7 uses an integer to floating-point conversion to illustrate the load-following-store contention. The code fragment sums two integer vectors (a and b) into a double-precision floating-point vector (d), converting an integer value to a floating-point value. This example uses consecutive storage locations to perform the in-line conversion. The instructions indicated by asterisks at the left margin consecutively access the same memory location so that the loaded value is data-dependent on the preceding store value. Some implementations require additional cycles to perform this sequence because of the special serialization of the cache accesses. Where possible, compilers should avoid such load-following-store contentions by separating the memory-dependent instructions.



Figure 4-7. Integer-to-Float Conversion: Load-Store Contention Code Example

C Source Code
void t011(int *a, int *b, double *d)
{ int i;
for(i=0;i<100;++i){
d[i] = (double)(a[i] + b[i]);
}
}
Assembly Code
# R3 points to int array a
# R4 points to int array b
# R5 points to double array d
# SP is reference address for temporary location
lfs FR1,0(R6) # float short constant 0x59800004
addis R0,R0,0x4330 # R0 = 0x4330000
stw R0,-8(SP) # prepare floating-point double
# temporary location containing the
... # value 0x4330 0000 0000 0000
loop:
lwzu R0,4(R3) # load a[i]
lwzu R6,4(R4) # load b[i]
add R0,R0,R6 # a[i] + b[i]
xoris R0,R0,0x8000 # flip sign bit
* stw R0,-4(SP) # store into the low-order part of the
* # floating-point temporary location
* lfd FR0,-8(SP) # floating-point load double of value
fsub FR0,FR0,FR1 # perform the conversion
stfdu FR0,8(R5) # d[i] = converted value
bdnz loop

If load and store queues are present, they may perform some of the following functions:

The implementation-dependent depth of the store queue affects the number of outstanding loads and stores that the processor can support. A filled store queue may stall instruction dispatch.

In most implementations, a delay should be scheduled following an instruction that computes the value for a subsequent store. For example, consider a floating-point add and dependent store of the result. If the store immediately follows the add, the store will likely complete execution and wait several cycles for completion of the add. On the other hand, a dependent store should immediately follow a floating-point arithmetic operation in implementations that have dynamic store forwarding, which synchronizes the store with the completion of the arithmetic operation. The User Manual for each processor specifies the specific behavior.


4.3.4 Branch Instructions

Branch prediction in PowerPC implementations uses a combination of static and dynamic branch prediction. In order to hide any delay due to a mispredicted branch, independent instructions should be scheduled between the instruction generating the branch condition or the target address and the dependent branch that tests the condition or transfers control to the address. The length of the delay depends on when the condition is available to the Branch Processing Unit and whether the branch was predicted correctly. A move to the Link Register or move to the Count Register instruction that generates the target address for a dependent branch behaves similarly to the compare instruction case, but the dependence is through the target address rather than the condition code.

The example in Figure 4-8 illustrates the use of mtctr in a switch statement implemented as a branch table. In this example, assume TABLE contains the 32-bit instruction addresses of code corresponding to the various case n: labels. The mtctr instruction loads the Count Register, and the bctr instruction branches to the destination code for the desired case, which depends on the value in the Count Register. For most implementations, you should schedule several independent instructions between these two operations to eliminate the stall caused by an empty instruction fetch buffer. A comparable delay also occurs between mtlr and a dependent branch.



Figure 4-8. mtctr Delay: C Switch Code Example

C Source Code
switch(x){
case 0: code_for_case_0;
case 1: code_for_case_1;
case 2: code_for_case_2;
case 3: code_for_case_3;
case 4: code_for_case_4;
case 5: code_for_case_5;
...
}
Assembly Code
lwz R4,x # load the value of x
lwz R7,$TABLE # load the address of the base of TABLE
slwi R5,R4,2 # multiply by 4 (4 bytes/entry in TABLE)
lwz R3,R7,R5 # R3 = TABLE[x]
mtctr R3 # load Count Register
bctr # branch to contents of Count Register

Figure 4-9 illustrates a similar situation that occurs when making function calls via pointers. The mtctr instruction loads the Count Register, and the bctrl instruction branches to the destination code for the desired control transfer, which depends on the value in the Count Register. Implementations frequently require additional execution cycles between these two instructions to eliminate the stall caused by an empty instruction fetch buffer.



Figure 4-9. mtctr Delay: Call to Function foo Via Pointer Code Example

lwz R11,foo # load address of function foo into R11
mtctr R11 # load Count Register with address of foo
bctrl # branch to contents of Count Register
# sets LR for return address


4.3.5 List Scheduling Algorithm

List scheduling attempts to reorder instructions in a basic block, subject to data dependence constraints, to yield the minimum execution time for a given processor. This section presents the list scheduling algorithm used in the IBM XL family of compilers, which evolved from many years of extensive empirical testing at IBM. Because the scheduler is required to support a wide variety of processor implementations, it isolates processor-specific information in tables, so scheduling for a different processor merely requires a different set of tables. The algorithm is processor-independent. Section 4.3.6 presents the tables for the PowerPC Common Model. For further description of the list scheduling algorithm as well as global scheduling issues and techniques, see Blainey [1994] and references contained therein.

The list scheduling algorithm reorders the instructions in a window that contains straight-line code. This window is determined by reaching a maximum allowed size (chosen to limit compile time), a basic block boundary, or an instruction that restricts code motion. If the windows are smaller than the basic block, they are overlapped to ensure that instructions near the boundaries are well scheduled.

The dependence graph for the scheduling window is computed to map the data and name dependences. Each node of the dependence graph represents an instruction, which is marked with the execution unit and execution time (in cycles) required to process the instruction. The directed edges that connect the nodes represent the dependences. A data dependence edge, marked with the number representing the delay (in cycles) between the connected instructions, points to the instruction node that has a data dependence on the source of the edge. A name dependence edge, marked weak, points to the instruction node that has a name dependence on the source of the edge. Figure 4-10 shows a Fortran code sequence and a simple translation to an assembly code sequence. Figure 4-11 shows the corresponding dependence graph for this example. The indicated execution times, execution units, and delays are those of the Common Model (see Section 4.3.6).



Figure 4-10. Basic Block Code Example

Fortran Source Code
t5 = (a(i) + b(i))
t0 = (a(i) - b(i))
t0 = t0*c(i)
e(i) = t5/t0
* Assembly Code
1 lwz R0,-1596(R3) # load b(i)
2 lwz R4,-1196(R3) # load a(i)
3 add R5,R0,R4 # t5 = (a(i) + b(i))
4 subf R0,R0,R4 # t0 = (a(i) - b(i))
5 lwz R6,-796(R3) # load c(i)
6 mullw R0,R0,R6 # t0 = t0*c(i)
7 divw R0,R5,R0 # e(i) = t5/t0
8 stw R0,4(R3) # store e(i)
* Instruction labels to which Figure 4-11 refers.


Figure 4-11. Basic Block Dependence Graph

Before describing the algorithm, some preliminary concepts need to be defined. The instructions at the beginning of the window include:

The sum-delay for instruction v, Sv, is the maximum delay over all execution paths from the instruction to the end of the window:

,

where Wvi is the delay between a successor instruction i that is data dependent on instruction v, and Succ(v) represents the set of successor instructions to instruction v in the dependence graph. The critical path for instruction v, Cv, is the maximum execution time over all execution paths from the instruction to the end of the window:

,

where Ev is the CPI for the instruction v. The expected execution time for a window, T, is the maximum of the longest critical path for the instructions in the window or the longest execution time required from an execution unit for the window:

The earliest time for instruction v, Dv, is the longest of the execution paths from the beginning of the window to the instruction.

,

where Pred(v) is the set of predecessor instructions of the instruction v in the dependence graph. The latest time for instruction v, Fv, is

.

Figure 4-12 shows the values of the sum delay, critical path, earliest time, and latest time for the instructions in Figures 4-10 and 4-11.



Figure 4-12. Values for Scheduling Example

Instruction Sum Delay Critical Path Earliest Time Latest Time
1 1 45 0 2
2 1 45 0 2
3 0 44 2 3
4 0 43 2 4
5 1 44 0 3
6 0 42 3 5
7 0 37 8 10
8 0 1 44 46

The list scheduling algorithm executes a time-driven simulation that dispatches instructions in each cycle that are expected to dispatch on the target processor during the equivalent cycle. Given the instruction window, the list scheduling algorithm proceeds as follows:

  1. Generate the dependence graph and compute the expected time to complete the window and the sum delay, critical path, earliest time, and latest time for each instruction.
  2. Initialize the ready list of instructions. For each cycle, a ready list is generated that includes all instructions that have no incoming edges from undispatched nodes or weak incoming edges from nodes that are currently in the ready list. The instructions at the beginning of the window comprise the initial ready list.
  3. Clear the machine cycle counter.
  4. Clear the tentatively scheduled set. The tentatively scheduled set is a working set of instructions considered for commitment during this cycle.
  5. Select the next instruction v from the ready list. If all the instructions in the ready list have been considered, goto step 8.
  6. Determine if instruction v meets the criteria to be eligible for dispatch this cycle:
    If ineligible, goto step 5.
    If eligible, goto step 7.
  7. Attempt to allocate resources for instruction v.
    If the resources are available, enter the instruction in the tentatively scheduled list. Goto step 5.
    If the resources are not available, identify the instructions in the tentatively scheduled set that compete for the needed resources. Calculate the preference function, defined below, for instruction v and each of the competing instructions. If the preference function prefers the instruction v over a competing instruction, replace this instruction in the tentatively scheduled set by instruction v. Attempt to allocate resources for the displaced instruction and calculate the preference function for the displaced instruction with competing members of the tentatively scheduled set, except for instruction v. If the initially displaced instruction can displace another instruction, do it. Goto step 5.
  8. Commit the tentatively scheduled instructions for this cycle. Add any uncovered instructions to the ready list. Increment the time counter. If there are additional instructions in the window to dispatch, goto step 3; otherwise, the algorithm has completed.
The preference function takes two instructions as arguments and returns the instruction that is heuristically preferred for dispatch in this cycle. The function involves making the following sequence of steps until one instruction is preferred to the other.

  1. If the store queue is filled, prefer a non-store instruction.
  2. If the number of stores remaining to be dispatched in the window is greater than the critical store count, qr/(q+1), where q is the size of the store queue and r is the number of instructions that remain to be dispatched in the window, prefer a store instruction.
  3. If the number of required registers is greater than the number of available registers for a given register file, prefer the instruction that minimizes the use of registers. This check involves lifetime analysis.
  4. If dispatching one instruction of the pair causes the other to dispatch after its latest time, prefer the other instruction.
  5. Prefer the instruction with the larger uncover count. The uncover count is the number of new instructions that enter the ready list following the dispatch of an instruction.
  6. Prefer the instruction with the larger sum delay.
  7. Prefer the instruction with the larger critical path.
  8. Prefer the instruction that appears first in program order.
Figure 4-13 compares the example from Figure 4-10 to the same basic block scheduled for the Common Model. The difference is the upward movement of instruction 5, the load of c(i), because its sum delay was greater than that of instruction 3 or 4.


4.3.6 Common Model

In order for the compiler to schedule instructions, it requires a model of the target processor. This section describes the tables of timing values and resources for the Common Model, which closely resembles the PowerPC 601 processor. The scheduler has similar tables for each target processor.

The parameters in the tables do not necessarily reflect the timing of an actual processor, but rather they lead to the generation of well scheduled code when used in the scheduler. For example, following a fixed-point compare, there is a delay of 3 cycles before a conditional branch can use the result. In most current implementations, there is no delay. The 3 cycle delay parameter causes the scheduling algorithm to insert independent operations between the compare and branch to fill the stall that occurs in the event of a misprediction.



Figure 4-13. Scheduled Basic Block Code Example

Unscheduled Assembly Code
1 lwz R0,-1596(R3) # load b(i)
2 lwz R4,-1196(R3) # load a(i)
3 add R5,R0,R4 # t5 = (a(i) + b(i))
4 subf R0,R0,R4 # t0 = (a(i) - b(i))
5 lwz R6,-796(R3) # load c(i)
6 mullw R0,R0,R6 # t0 = t0*c(i)
7 divw R0,R5,R0 # e(i) = t5/t0
8 stw R0,4(R3) # store e(i)
Scheduled Assembly Code for Common Model
1 lwz R0,-1596(R3) # load b(i)
2 lwz R4,-1196(R3) # load a(i)
5 lwz R6,-796(R3) # load c(i)
3 add R5,R0,R4 # t5 = (a(i) + b(i))
4 subf R0,R0,R4 # t0 = (a(i) - b(i))
6 mullw R0,R0,R6 # t0 = t0*c(i)
7 divw R0,R5,R0 # e(i) = t5/t0
8 stw R0,4(R3) # store e(i)

The Common Model is a fictional processor whose properties serve as a compiler target. Code developed for this target executes well on all PowerPC processors, although probably not optimally for any given processor.

The implementation of the Common Model consists of a 32-bit processor that has three execution units: Branch (BPU), Fixed-Point (FXU), and Floating-Point (FPU). It has a store queue of size 1, and the synchronization can vary between +6 and -2.

The instructions are divided into a series of classes that share certain timing or delay properties.

The unit-restricted instruction classes are denoted as follows:

Figure 4-14 shows further classification of the instruction set along with the unit(s) required to execute the class of instruction and the execution time in cycles. The execution units and execution times label the nodes in the dependence graph.



Figure 4-14. Common Model Instruction Classes

Class Description Execution Time (cycles)
BPU FXU FPU
call_insn Branch instructions with LK = 1. 1 1 1
branch_unconditional Unconditional branch instructions. 1
branch_on_count Branch-on-count instructions. 1
branch_conditional Conditional branch instructions. 1
cr_logic Condition Register logical instructions. 1
mcrf Move Condition Register fields instruction. 1
uses_cr Branch_on_count, branch_conditional, cr_logic, and mcrf classes 1
fixed_load Fixed-point load (except multiple and string). 1
fixed_store Fixed-point store (except multiple and string). 1
multiple_string Load multiple, store multiple, load string, and store string instructions. #reg
touch Data cache touch instructions. 1
fixed Integer arithmetic instructions (except multiplication and division. 1
fixed_mul Integer multiply instructions (when the magnitude of the multiplier has 16 or fewer bits). 5
long_mul Integer multiply instructions (when the magnitude of the multiplier has more than 16 bits). 10
fixed_div Integer divide instructions. 36
fixed_compare Integer compare instructions. 1
trap_insn Trap instructions. 1
fixed_logic Logic instructions. 1
fixed_rot_shift Rotate and shift instructions. 1
mtfspr Move to/from SPRs instructions. 1
mfctrlr Move from CTR or LR instructions. 1
mtctrlr Move to CTR or LR instructions. 1 1
fixed_normal_setcr Integer compare and recording operations (except multiply and divide). 1
fixed_delayed_setcr Integer multiply and divide recording operations. same as Rc=0
uses_ctrlr Instructions that read the Link Register or Count Register. Varies with instruction.
float_load Floating-point load instructions. 1 1
float_store Floating-point store instructions. 1 1
flstor8 Double-precision floating-point store instructions. 1 1
flstor4 Single-precision floating-point store instructions. 1 1
float_sngl Single-precision floating-point arithmetic instructions. 1
float_dbl_mult Double-precision floating-point operations that include a multiply. 2
float_dbl_nomult Double-precision floating-point operations that do not include a multiply. 1
float_ds Single-precision floating-point division. 17
float_dl Double-precision floating-point divide instructions. 31
float_compare Floating-point compare instructions. 2 1
frsp Floating-point round to single instructions. 1
convert_to_integer Convert to integer instructions. 1
mffs mffs instruction. 1
mcrfs mcrfs instruction. 1 1
mtfsf mtfsf, mtfsfi, mtfsb1, mtfsb0 instructions. 1
#reg—The number of registers accessed.
Rc=0—Non-recording instruction forms.

The call_insn class uses all three units for one cycle. From a scheduling perspective, this establishes a resource dependence on all units (typically calls pass parameters in registers and place results into General-Purpose Registers or Floating-Point Registers).

Figure 4-15 shows the delay between dependent instructions of the indicated classes. These delays label the directed edges in the dependence graph. Adding the execution time and the delay yields the instruction latency in most cases. Where a data dependence exists but a delay is not specified, the delay is zero. A name dependence leads to a weak delay, which means that the weak predecessor must dispatch before or concurrently with dispatch of the name dependent instruction. The weak delay (i.e., in call_insn) means that the instruction order is preserved between a call instruction and any subsequent instruction, even though they may be dispatched and/or executed in the same cycle. This name dependance prevents the scheduler from hoisting instructions around call points because the hoisted code might cause register save/restore code to be emitted. (The nature of save/restore code and rules governing it are ABI issues.) The delay time should be added to the earlier throughput number to yield the latency. The (m) on some delays indicates that cache latency may modify the delay.

The examples in the next section compare scheduling for the Common Model to scheduling for the PowerPC 604 processor. Although the full scheduling model of the PowerPC 604 processor is not presented here, its most significant differences from the Common Model include:

See Appendix B for a summary of the PowerPC 604 implementation.



Figure 4-15. Common Model Instruction Delays

Delay Type Class of First Instruction Class of Second Instruction Delay (cycles)
Call Delays call_insn any_insn weak
any_insn call_insn weak
mfctrlr any_insn 1
Load-Use Delays fixed_load FXU_insn 1
float_load FPU_insn 2
Float-Float Delay float_dbl_mult flstor8 2
frsp flstor4 1
FPU_insn flstor4 2
FPU_insn FPU_insn 3
Compare-Branch Delays fixed_normal_setcr branch_conditional 3
fixed_delayed_setcr branch_conditional 4
float_compare branch_conditional 8
Compare-CR Delays fixed_normal_setcr uses_cr 2
fixed_delayed_setcr uses_cr 3
float_compare uses_cr 7
Load CTR/LR Delays sets_ctrlr branch_on_count 3
sets_ctrlr uses_ctrlr 4
Store-Load Delays fixed_store fixed_load 0 (m)
float_store fixed_load 2 (m)
fixed_store float_load 0 (m)
float_store float_load 2 (m)
weak—Implies a name dependence, so the second instruction must execute during the same cycle or later than the first instruction.
(m)—The cache latency may modify the delay.


4.3.7 Examples

From the perspective of scheduling, the most important implementation features are the number of instructions dispatched in a cycle and the number and kinds of execution units. If instructions are available for the different execution units, they should be intermixed so that the dispatcher can make forward progress by dispatching some instructions to execution units on every cycle. Therefore, the compiler must try to schedule the instructions so that the processor dispatches them at a rate commensurate with their flow through the pipelines. If the instructions are not well mixed, dispatch may stall because the next instruction requires a particular execution unit that is busy. Scheduling can affect code selections for a given operation. Code that is sub-optimal in isolation may execute faster than optimal code in a specific context.

The code sequence in Figure 4-16 shows a simple example with a load-use delay slot that has been scheduled for the Common Model and the PowerPC 604 processor. The cycle column on the left indicates the clock cycle in which the instruction begins execution. The difference between these sequences is the location of the subtract instruction. The Common Model executes the subtraction in the load-use delay slot, but the PowerPC 604 processor executes the subtraction early because it has multiple Fixed-Point Units.



Figure 4-16. Simple Scheduling Example with Load-Use Delay Slot

C Code
int foo(int a[], int i, int j, int k) {
a[i] = a[i] + (k - j);
}
Cycle PowerPC 601 Processor Schedule
0 slwi R4,R4,2 # i * 4
1 lwzx R0,R3,R4 # load a[i]
2 subf R5,R5,R6 # k - j
3 add R5,R0,R5 # a[i] = a[i] + (k - j)
4 stwx R5,R3,R4 # store a[i]
Cycle PowerPC 604 Processor Schedule
0 subf R0,R5,R6 # k - j
0 slwi R5,R4,2 # i * 4
1 lwzx R4,R3,R5 # load a[i]
3 add R0,R4,R0 # a[i] = a[i] + (k - j)
4 stwx R0,R3,R5 # store a[i]

Figure 4-17 shows a related example that uses the evaluation of an expression with many independent parts to illustrate the differences between various PowerPC implementations with respect to number and type of execution units. Each iteration involves the loading and summing of ten elements of an array.



Figure 4-17. Multi-Part Expression Evaluation Scheduling Example

C Source Code
int t008i(int *a)
{ int i; int r;
r = 0;
for(i=0;i<100;i+=10){
r = r + a[i+0] + a[i+1] + a[i+2] + a[i+3] + a[i+4]
+ a[i+5] + a[i+6] + a[i+7] + a[i+8] + a[i+9];
}
return(r);
}
Cycle Unscheduled Assembly Fragment
0 lwz R0,4(R3) # load a[i+1]
2 add R5,R5,R0 # r = r + a[i+1]
3 lwz R6,8(R3) # load a[i+2]
5 add R5,R5,R6 # r = r + a[i+2]
...
Cycle Assembly Fragment Scheduled to Account for Cache Latency
0 lwz R0,4(R3) # load a[i+1]
1 lwz R6,8(R3) # load a[i+2]
2 add R5,R5,R0 # r = r + a[i+1]
3 add R5,R5,R6 # r = r + a[i+2]
...
Cycle Assembly Fragment Scheduled for a PowerPC 604 Processor
0 lwz R0,4(R3) # load a[i+1]
1 lwz R6,8(R3) # load a[i+2]
2 add R5,R5,R0 # r = r + a[i+1]
2 lwz R0,12(R3) # load a[i+3]
3 add R5,R5,R6 # r = r + a[i+2]
3 lwz R0,16(R3) # load a[i+4]
...

The unscheduled assembly fragment consists of alternating loads and dependent adds, whose sum accumulates in R0. Because of the load-use delay, each load-add combination requires 3 cycles to execute. The fragment that has been scheduled to account for cache latency pairs the loads and adds to fill the load-use delay slot. This code motion reduces the number of cycles to execute each load-add combination to 2 cycles. The schedule for the Common Model would consist of a larger group of loads followed by a group of adds because the load and add operations require the same execution unit, and the sum delay associated with the loads is larger. On the PowerPC 604 processor or other implementations with a separate Load-Store Unit, after some initial setup, the schedule will consist of alternating independent loads and adds. The optimal arrangement of the 20 operations varies among implementations due to variations in the maximum number of instructions dispatched per cycle and the types of execution units.

Figure 4-18 shows the C code for a basic block that contains a mix of floating-point, load, store, and other fixed-point instructions. Figures 4-19 and 4-20 show the corresponding assembly code when scheduled for the Common Model and for the PowerPC 604 processor. It shows that the primary scheduling differences arise from the fact that the PowerPC 604 processor can execute loads and stores in parallel with fixed-point arithmetic.



Figure 4-18. Basic Block Code Example: C Code

#include <stdlib.h>
double compute (double a[], double b[], double c[], double d[],
int i, int j, int k) {
double asq, bsq, csq, dsq;
asq = a[i+j] * a[i+j+1];
bsq = b[i+k] * b[i+k+1];
csq = c[j+k] * c[j+k+1];
dsq = d[i+j+k] * d[i+j+k+1];
a[i+j] = bsq;
b[i+k] = asq;
c[j+k] = dsq;
d[i+j+k] = csq;
return asq + bsq + csq + dsq;
}



Figure 4-19. Basic Block Code Example: Scheduled for Common Model

Cycle Instruction
0 add R10,R8,R9 # j + k
1 add R0,R7,R9 # i + k
2 addi R11,R0,1 # i + k + 1
3 slwi R11,R11,3 # (i + k + 1) * 8
4 addi R12,R10,1 # j + k + 1
5 slwi R12,R12,3 # (j + k + 1) * 8
6 lfdx FR3,R5,R12 # load c[j+k+1]
7 add R8,R7,R8 # i + j
8 add R9,R8,R9 # i + j + k
9 addi R7,R9,1 # i + j + k + 1
10 slwi R7,R7,3 # (i + j + k + 1) * 8
11 lfdx FR1,R6,R7 # load d[i+j+k+1]
12 slwi R7,R0,3 # (i + k)*8
13 addi R0,R8,1 # i + j + 1
14 lfdx FR5,R4,R11 # load b[i+k+1]
15 slwi R11,R10,3 # (j + k) * 8
16 slwi R10,R0,3 # (i + j + 1) * 8
17 slwi R8,R8,3 # (i + j) * 8
18 lfdx FR0,R3,R10 # load a[i+j+1]
19 lfdx FR4,R4,R7 # load b[i+k]
20 lfdx FR2,R3,R8 # load a[i+j]
22 fmul FR5,FR4,FR5 # bsq = b[i+k] * b[i+k+1]
23 fmul FR2,FR2,FR0 # asq = a[i+j] * a[i+j+1]
24 lfdx FR0,R5,R11 # load c[j+k]
25 slwi R9,R9,3 # (i + j + k) * 8
27 fmul FR3,FR0,FR3 # csq = c[j+k] * c[j+k+1]
28 lfdx FR4,R6,R9 # load d[i+j+k]
29 fadd FR0,FR5,FR2 # asq + bsq
30 stfdx FR5,R3,R8 # store a[i+j] = bsq
31 stfdx FR2,R4,R7 # store b[i+k] = asq
32 fmul FR1,FR4,FR1 # dsq = d[i+j+k] * d[i+j+k+1]
33 fadd FR0,FR3,FR0 # (asq + bsq) + csq
35 stfdx FR1,R5,R11 # store c[j+k] = dsq
36 stfdx FR3,R6,R9 # store d[i+j+k] = csq
37 fadd FR1,FR1,FR0 # (asq + bsq + csq) + dsq



Figure 4-20. Basic Block Code Example: Scheduled for PowerPC 604 Processor

Cycle Instruction
0 add R10,R7,R9 # i + k
0 add R0,R8,R9 # j + k
1 add R7,R7,R8 # i + j
1 addi R11,R10,1 # i + k + 1
2 slwi R8,R11,3 # (i + k + 1) * 8
2 addi R12,R0,1 # j + k + 1
3 slwi R11,R12,3 # (j + k + 1) * 8
3 lfdx FR2,R4,R8 # load b[i+k+1]
3 addi R8,R7,1 # i + j + 1
4 slwi R8,R8,3 # (i + j + 1) * 8
4 lfdx FR3,R5,R11 # load c[j+k+1]
4 add R9,R7,R9 # i + j + k
5 slwi R7,R7,3 # (i + j) * 8
5 lfdx FR0,R3,R8 # load a[i+j+1]
5 slwi R8,R0,3 # (j + k) * 8
6 addi R0,R9,1 # i + j + k + 1
6 lfdx FR1,R3,R7 # load a[i+j]
6 slwi R9,R9,3 # (i + j + k) * 8
7 slwi R11,R0,3 # (i + j + k + 1) * 8
9 fmul FR5,FR1,FR0 # asq = a[i+j] * a[i+j+1]
7 lfdx FR4,R6,R9 # load d[i+j+k]
8 slwi R10,R10,3 # (i + k) * 8
8 lfdx FR1,R4,R10 # load b[i+k]
10 fmul FR1,FR1,FR2 # bsq = b[i+k] * b[i+k+1]
9 lfdx FR0,R5,R8 # load c[j+k]
11 fmul FR0,FR0,FR3 # csq = c[j+k] * c[j+k+1]
10 lfdx FR2,R6,R11 # load d[i+j+k+1]
13 stfdx FR1,R3,R7 # store a[i+j] = bsq
13 fadd FR1,FR1,FR5 # asq + bsq
14 fmul FR2,FR4,FR2 # dsq = d[i+j+k] * d[i+j+k+1]
14 stfdx FR5,R4,R10 # store b[i+k] = asq
16 fadd FR1,FR0,FR1 # (asq + bsq) + csq
17 stfdx FR2,R5,R8 # store c[j+k] = dsq
18 stfdx FR0,R6,R9 # store d[i+j+k] = csq
19 fadd FR1,FR2,FR1 # (asq + bsq + csq) + dsq

The decision to dispatch instructions may be dependent on:

  1. Data pending due to a memory (cache) latency or an instruction currently executing.
  2. Required execution resource busy.
  3. Insufficient execution units of a desired type.
  4. Synchronization events.
The following rules summarize an effective means to manage the instruction scheduling process:

Thus far, the examples have involved code motion within a basic block. The average size of a basic block in a typical program is between five and ten instructions, so restricting scheduling to basic blocks may substantially reduce opportunities for performance enhancement. Figure 4-21 shows an example that uses serially dependent arithmetic and a conditional assignment. In the C code, each statement in the body of the loop has a data dependence on the statement immediately preceding it, so the statements themselves cannot be rearranged. That is, each addition to r feeds data to a compare which, if true, resets the variable r used by the next sum.

The basic blocks of the assembly code fragment are indicated by the letters A through D on the left margin. Consider the first basic block, which is marked with A. Scheduling within the basic block has placed the load at the top of the block because of the load-use delay, which is covered by the add. The compare is data dependent on both the load and the add, and the branch is data dependent on the compare. Assuming the branch is predicted correctly, this block requires at least 3 cycles to execute.



Figure 4-21. Dependent Arithmetic-Conditional Assignments Example

C Source Code
/* dependent integer ops */
int t009i(int *a)
{ int i; int r;
int r0,r1,r2,r3,r4,r5,r6,r7,r8,r9;
r = 0;
r0 = a[0]; r1 = a[1]; r2 = a[2]; r3 = a[3]; r4 = a[4];
for(i=0;i<100;i+=10){
r = r + r0; if(r >= a[i+0]) r = 0;
r = r + r1; if(r >= a[i+1]) r = 0;
r = r + r2; if(r >= a[i+2]) r = 0;
r = r + r3; if(r >= a[i+3]) r = 0;
r = r + r4; if(r >= a[i+4]) r = 0;
}
return(r);
}
Assembly Code
# R29 contains the address of a[i]
# R12 contains r0
# R13 contains r1
A lwz R10,0(R29) # get next a[i+0]
A add R9,R9,R12 # r = r + r0
A cmpw cr0,R9,R10 # r >= a[i+0]
A blt cr0,lab1
B li R9,0 # r = 0
lab1:
C lwz R10,4(R29) # get next a[i+1]
C add R9,R9,R13 # r = r + r1
C cmpw cr0,R9,R10 # r >= a[i+1]
C blt cr0,lab2
D li R9,0 # r = 0
lab2: ....

Figure 4-22 shows a version of this code that has been scheduled using global scheduling to move code between basic blocks. The load in the first basic block has been hoisted (moved up in program order) from the third basic block, which lies deeper (later) in the flow graph. This hoist of the load is legal because every thread of control entering the third basic block includes the first basic block. This code motion removes the load-use delay. The compare retains its data dependence on the add, but the compare, load, and branch can execute simultaneously on a processor with a Load-Store Unit that is separate from the Fixed-Point Unit, such as the PowerPC 604 processor. If the branch is predicted correctly, the basic block can execute in 2 cycles.



Figure 4-22. Rescheduled Dependent Arithmetic-Conditional Assignments Example

# This sequence is the same as in Figure 4-22,
# except as indicated
A add R9,R9,R12
A cmpw cr0,R9,R10
A
A lwz R10,4(R29) # This load is hoisted from the
A # following basic block.
A blt cr0,lab1
B li R9,0
lab1:
C add R9,R9,R13
C cmpw cr0,R9,R10
C
C lwz R30,8(R29) # This load is hoisted from the
C # following basic block.
C blt cr0,lab2
D li R9,0
lab2: ....

Figure 4-23 shows the Fortran code for a basic matrix multiply kernel before and after loop optimizations have been applied. The optimizations include interchanging the i and j loops, unrolling the i loop five times, and fusing the resulting five copies of the k loop.

Figure 4-24 shows the assembly code for the innermost loop. In addition to the previously mentioned optimizations, the loop has been software pipelined to cover the latency of the loads.



Figure 4-23. Basic Matrix Multiply Kernel Code Example

Fortran Source Code

subroutine multiply (a, b, c, n, m)

integer*4 i, j, k, n;

real*8 a(n,m), b(m,n), c(n,n)

do 30 i = 1, n

do 20 j = 1, n

c(i,j) = 0.0d0

do 10 k = 1, m

c(i,j) = c(i,j) + a(i,k) * b(k,j)
10 continue
20 continue
30 continue

end

Fortran Code Following Loop Transformations

subroutine multiply (a, b, c, n, m)

integer*4 i, j, k, n;

real*8 a(n,m), b(m,n), c(n,n)

do 20 j= 1, n

do 30 i= 1, n/5

c(i,j) = 0.0d0

do 10 k = 1, m

c(i,j) = c(i,j) + a(i,k) * b(k,j)

c(i+1,j) = c(i+1,j) + a(i+1,k) * b(k,j)

c(i+2,j) = c(i+2,j) + a(i+2,k) * b(k,j)

c(i+3,j) = c(i+3,j) + a(i+3,k) * b(k,j)

c(i+4,j) = c(i+4,j) + a(i+4,k) * b(k,j)
10 continue
30 continue
...clean-up code for mod(n,5) iterations of i loop...
20 continue
end



Figure 4-24. Matrix Multiply Code Example—Scheduled for PowerPC 604 Processor

mtctr R30 # load m-1 into the Count Register
lfdux FR8,R7,R14 # load a(i,1)
mr R6,R8 # put address b(1,j) - 8 in R6
addi R3,R3,40
lfd FR6,16(R4) # load c(i+1,j)
lfd FR2,24(R4) # load c(i+2,j)
lfd FR0,32(R4) # load c(i+3,j)
lfd FR4,8(R7) # load a(i+1,1)
lfd FR1,40(R4) # load c(i+4,j)
bdz CL.18
CL.27:
lfdu FR3,8(R6) # load b(k,j)
lfd FR5,16(R7) # load a(i+2,k)
lfd FR7,24(R7) # load a(i+3,k)
fmadd FR10,FR8,FR3,FR10 # c(i,j)=c(i,j)+a(i,k)*b(k,j)
lfd FR9,32(R7) # load a(i+4,k)
fmadd FR6,FR4,FR3,FR6 # c(i+1,j)=c(i+1,j)+a(i+1,k)*b(k,j)
fmadd FR2,FR5,FR3,FR2 # c(i+2,j)=c(i+2,j)+a(i+2,k)*b(k,j)
lfdux FR8,R7,R14 # load a(i,k)
fmadd FR0,FR7,FR3,FR0 # c(i+3,j)=c(i+3,j)+a(i+3,k)*b(k,j)
lfd FR4,8(R7) # load a(i+1,k+1)
fmadd FR1,FR9,FR3,FR1 # c(i+4,j)=c(i+4,j)+a(i+4,k)*b(k,j)
bdnz CL.27 # latch to CL.27
CL.18:
lfdu FR7,8(R6) # load b(m,j)
lfd FR3,16(R7) # load a(i+2,m)
lfd FR5,24(R7) # load a(i+3,m)
fmadd FR9,FR8,FR7,FR10 # c(i+4,j)=c(i+4,j)+a(i+4,m)*b(m,j)
fmadd FR4,FR4,FR7,FR6 # c(i+4,j)=c(i+4,j)+a(i+4,m)*b(m,j)
fmadd FR2,FR3,FR7,FR2 # c(i+4,j)=c(i+4,j)+a(i+4,m)*b(m,j)
stfd FR9,8(R4) # store c(i,j)
lfd FR8,32(R7) # load a(i+4,m)
fmadd FR0,FR5,FR7,FR0 # c(i+4,j)=c(i+4,j)+a(i+4,m)*b(m,j)
stfd FR4,16(R4) # store c(i+1,j)
fmadd FR1,FR8,FR7,FR1 # c(i+4,j)=c(i+4,j)+a(i+4,m)*b(m,j)
stfd FR2,24(R4) # store c(i+2,j)
stfd FR0,32(R4) # store c(i+3,j)
stfdu FR1,40(R4) # store c(i+4,j)

The overall observations are:

Although loop transformations are performed in a different part of the compilation process than instruction scheduling, they may play an important role in balancing the instruction mix for effective use of the processor's resources.


[ Table of Contents | Index ]
Copyright 1998 IBMchips