Figure 4-4 shows the floating-point pipeline. The execute stages have a multiply-add structure: 1 cycle for multiply, 1 cycle for add, and one cycle for normalization. Some implementations combine these stages in different ways. Each stage of the pipeline requires 1 cycle, but the Floating-Point Unit displays more variation in timing than the Fixed-Point Unit. Implementations that have only a single-precision multiplier or adder increase the latency of some double-precision operations by 1 cycle; for example, a double-precision multiply instruction would occupy a single-precision multiply stage for 2 cycles. Divide, square root, and floating reciprocal estimate are instructions that require multiple cycles to execute. Floating-point operations whose source or destination operands are IEEE 754 special values (NaNs or infinity) may require additional cycles. With these exceptions, independent floating-point instructions can be dispatched on successive cycles. Higher-performance PowerPC processors forward the result from the normalize execute stage to the input latches of the multiply execute stage for use by a subsequent dependent floating-point instruction. This forwarding reduces the stall between consecutive dependent floating-point instructions by 1 cycle.
Figure 4-5 shows the instruction pipeline for load and store instructions. The execution of load and store operations is composed of two stages. The first stage calculates the address. The second stage performs the cache access and either loads or stores the value when the data is available. Floating-point loads and stores differ from fixed-point loads and stores, even though both execute in the same Fixed-Point Unit or Load-Store Unit, due to differences in operand availability, conversions, and alignment. In most implementations, load instructions forward their result to the input latches of either the fixed-point or floating-point execute stages. The load of a General-Purpose Register should be separated from the instruction which uses it by the cache access delay, which is usually 1 cycle. The cache access delay for the load of a Floating-Point Register is usually 2 cycles.
The example in Figure 4-6 uses pointer chasing to illustrate how execution pipelines can stall because of the latency for cache access. This latency stalls dispatch of the dependent compare, creating an idle execution cycle in the pipeline. Moving an independent instruction between the compare and the branch can hide the stall, that is, perform useful work during the delay. The delay is referred to as the load-use delay. The same principle applies to any instruction which follows the load and has operands that depend on the result of the load.
The example in Figure 4-7 uses an integer to floating-point conversion to illustrate the load-following-store contention. The code fragment sums two integer vectors (a and b) into a double-precision floating-point vector (d), converting an integer value to a floating-point value. This example uses consecutive storage locations to perform the in-line conversion. The instructions indicated by asterisks at the left margin consecutively access the same memory location so that the loaded value is data-dependent on the preceding store value. Some implementations require additional cycles to perform this sequence because of the special serialization of the cache accesses. Where possible, compilers should avoid such load-following-store contentions by separating the memory-dependent instructions.
The example in Figure 4-8 illustrates the use of mtctr in a switch statement implemented as a branch table. In this example, assume TABLE contains the 32-bit instruction addresses of code corresponding to the various case n: labels. The mtctr instruction loads the Count Register, and the bctr instruction branches to the destination code for the desired case, which depends on the value in the Count Register. For most implementations, you should schedule several independent instructions between these two operations to eliminate the stall caused by an empty instruction fetch buffer. A comparable delay also occurs between mtlr and a dependent branch.
Figure 4-9 illustrates a similar situation that occurs when making function calls via pointers. The mtctr instruction loads the Count Register, and the bctrl instruction branches to the destination code for the desired control transfer, which depends on the value in the Count Register. Implementations frequently require additional execution cycles between these two instructions to eliminate the stall caused by an empty instruction fetch buffer.
Figure 4-13 compares the example from Figure 4-10 to the same basic block scheduled for the Common Model. The difference is the upward movement of instruction 5, the load of c(i), because its sum delay was greater than that of instruction 3 or 4.
Figure 4-14 shows further classification of the instruction set along with the unit(s) required to execute the class of instruction and the execution time in cycles. The execution units and execution times label the nodes in the dependence graph.
Figure 4-15 shows the delay between dependent instructions of the indicated classes. These delays label the directed edges in the dependence graph. Adding the execution time and the delay yields the instruction latency in most cases. Where a data dependence exists but a delay is not specified, the delay is zero. A name dependence leads to a weak delay, which means that the weak predecessor must dispatch before or concurrently with dispatch of the name dependent instruction. The weak delay (i.e., in call_insn) means that the instruction order is preserved between a call instruction and any subsequent instruction, even though they may be dispatched and/or executed in the same cycle. This name dependance prevents the scheduler from hoisting instructions around call points because the hoisted code might cause register save/restore code to be emitted. (The nature of save/restore code and rules governing it are ABI issues.) The delay time should be added to the earlier throughput number to yield the latency. The (m) on some delays indicates that cache latency may modify the delay.
The code sequence in Figure 4-16 shows a simple example with a load-use delay slot that has been scheduled for the Common Model and the PowerPC 604 processor. The cycle column on the left indicates the clock cycle in which the instruction begins execution. The difference between these sequences is the location of the subtract instruction. The Common Model executes the subtraction in the load-use delay slot, but the PowerPC 604 processor executes the subtraction early because it has multiple Fixed-Point Units.
Figure 4-17 shows a related example that uses the evaluation of an expression with many independent parts to illustrate the differences between various PowerPC implementations with respect to number and type of execution units. Each iteration involves the loading and summing of ten elements of an array.
Figure 4-18 shows the C code for a basic block that contains a mix of floating-point, load, store, and other fixed-point instructions. Figures 4-19 and 4-20 show the corresponding assembly code when scheduled for the Common Model and for the PowerPC 604 processor. It shows that the primary scheduling differences arise from the fact that the PowerPC 604 processor can execute loads and stores in parallel with fixed-point arithmetic.
Figure 4-22 shows a version of this code that has been scheduled using global scheduling to move code between basic blocks. The load in the first basic block has been hoisted (moved up in program order) from the third basic block, which lies deeper (later) in the flow graph. This hoist of the load is legal because every thread of control entering the third basic block includes the first basic block. This code motion removes the load-use delay. The compare retains its data dependence on the add, but the compare, load, and branch can execute simultaneously on a processor with a Load-Store Unit that is separate from the Fixed-Point Unit, such as the PowerPC 604 processor. If the branch is predicted correctly, the basic block can execute in 2 cycles.
Figure 4-23 shows the Fortran code for a basic matrix multiply kernel before and after loop optimizations have been applied. The optimizations include interchanging the i and j loops, unrolling the i loop five times, and fusing the resulting five copies of the k loop.
Figure 4-24 shows the assembly code for the innermost loop. In addition to the previously mentioned optimizations, the loop has been software pipelined to cover the latency of the loads.