IBM Search Java Site Map Microelectronics News Products Services Technology About Us IBM Microelectronics Order Contact Legal

power pcProducts

overview
news
products
documents
performance
technology


[ Table of Contents | Index ]

4.4 Alignment

The alignment of data and instructions relative to various implementation- and system-defined boundaries can result in increased throughput. Detailed knowledge of these boundaries and sufficiently regular algorithms and data structures to take advantage of this knowledge generally occur only in high-performance situations such as numerically intensive computing or performance-optimized library routines. Alignment of scalar load and store instructions, however, is important in PowerPC implementations.


4.4.1 Loads and Stores

A value is aligned in memory if its address is a multiple of its size. Aligned memory references almost always execute in fewer cycles than misaligned references, which usually require two sequential accesses to transfer the data. More importantly, most implementations may not handle the misaligned access within the hardware. Instead, these implementations generate an interrupt and the associated overhead (~100 cycles). The use of interrupts to handle misaligned references is especially common in Little-endian mode. Therefore, compilers should search carefully for misaligned or potentially misaligned storage references at compile time and generate appropriate equivalent code to avoid the misaligned references.

For the purpose of high-speed transfers, like memory-to-memory copies, if the hardware manages misaligned accesses with a sufficiently small penalty, the most efficient way to perform the fixed-length loads and stores may be to execute misaligned loads and stores and let the hardware handle the alignment corrections. The specific trade-offs are highly implementation-specific.


4.4.2 Fetch Buffer

The fetch rate of PowerPC implementations ranges from 1 to 8 instructions per cycle. This number is generally determined by the cache interface and width of the data bus between the CPU and the cache. To make most effective use of the fetch buffer, all fetched instructions should be executed. Branches complicate matters because they may redirect program flow so as to cancel the execution of instructions loaded into the fetch buffer. It is possible to arrange the code or introduce no-ops such that branches reside in the last position in the buffer, and branch targets reside in the first position. In this way, instructions loaded into the fetch buffer are not needlessly cancelled. Perhaps the most useful application of fetch buffer alignment is small loops. Small loops that can be condensed (perhaps with the assistance of branch-on-count and load-with-update instructions) and aligned to fit in one or two widths of the fetch buffer can substantially increase the efficiency of code fetching.


4.4.3 TLB and Cache

TLB size, cache size, and cache geometry can have an important effect on the performance of code. The example in Figure 4-25 shows the Fortran source code for some nested loops and the corresponding optimized assembly code for the body of the innermost loop. The principal optimizations include loop unrolling, software pipelining, scheduling, and the use of data cache touch instructions.

The inner loop in this code sequence has been unrolled eight times, and the copies are indicated in the figure. Software pipelining separates the execution of the loop body into two stages:

The code motion among the multiple copies in the inner loop reveals the scheduling of the code.

Data cache touch instructions prefetch the parts of the arrays that will be needed in a few iterations. This prefetching prevents cache misses and associated stalls. This example does a 100-byte forward touch to ensure that the touch prefetches data from the next cache block. This code was compiled for a PowerPC 601 processor, which has a 32KB unified 8-way set associative cache with a block size of 64 bytes.

The techniques, such as array blocking, that make the most efficient use of the cache and TLB resources are beyond the scope of this book and are not specific to the PowerPC architecture. For further reading on this topic, see IBM Corporation [1993b].



Figure 4-25. Nested Loops: Touch Instruction Example

Fortran Source Code
program main
integer i,j,n
real*4 a(1000,1000), b(1000,1000)
real*4 c(1000,1000)
n=1000
do 20 j=1,n
do 10 i = 1,n
c(i,j) = c(i,j+1) + a(i,j+2) + b(i,j+3)
10 continue
20 continue
end
Assembly Code for Innermost Loop Body
# index for touching a is (R31) = 100
# index for touching b is (R4) = 100
# index for touching the next column in c is (R3) = 4100
CL.2:
lfsu FR2,4(R5) # load b(i,j+3) copy 1
fadds FR0,FR1,FR0 # tmp = c(i,j+1) + a(i,j+2) copy 1
lfsu FR3,4(R29) # load a(i,j+2) copy 2
lfs FR4,4008(R30) # load c(i,j+1) copy 2
lfsu FR5,4(R5) # load b(i,j+3) copy 2
fadds FR0,FR0,FR2 # c(i,j) = tmp + b(i,j+3), copy 1
stfsu FR0,4(R30) # store c(i,j) copy 1
dcbt R30,R3 # touch c
fadds FR1,FR4,FR3 # tmp = c(i,j+1) + a(i,j+2) copy 2
lfs FR6,4008(R30) # load c(i,j+1) copy 3
lfsu FR2,4(R29) # load a(i,j+2) copy 3
lfsu FR4,4(R29) # load a(i,j+2) copy 4
fadds FR0,FR1,FR5 # c(i,j) = tmp + b(i,j+3) copy 2
stfsu FR0,4(R30) # store c(i,j) copy 2
lfs FR3,4008(R30) # load c(i,j+1) copy 4
lfsu FR0,4(R5) # load b(i,j+3) copy 3
fadds FR1,FR6,FR2 # tmp = c(i,j+1) + a(i,j+2) copy 3
lfsu FR5,4(R5) # load b(i,j+3) copy 4
lfsu FR2,4(R29) # load a(i,j+2) copy 5
fadds FR0,FR1,FR0 # c(i,j) = tmp + b(i,j+3) copy 3
stfsu FR0,4(R30) # store c(i,j) copy 3
lfs FR6,4008(R30) # load c(i,j+1) copy 5
fadds FR1,FR3,FR4 # tmp = c(i,j+1) + a(i,j+2) copy 4
lfsu FR4,4(R29) # load a(i,j+2) copy 6
fadds FR0,FR1,FR5 # c(i,j) = tmp + b(i,j+3) copy 4
stfsu FR0,4(R30) # store c(i,j) copy 4
lfs FR3,4008(R30) # load c(i,j+1) copy 6
lfsu FR0,4(R5) # load b(i,j+3) copy 5
fadds FR1,FR6,FR2 # tmp = c(i,j+1) + a(i,j+2) copy 5
dcbt R5,R4 # touch b
lfsu FR5,4(R5) # load b(i,j+3) copy 6
lfsu FR2,4(R29) # load a(i,j+2) copy 7
fadds FR0,FR1,FR0 # c(i,j) = tmp + b(i,j+3) copy 5
stfsu FR0,4(R30) # store c(i,j) copy 5
lfs FR6,4008(R30) # load c(i,j+1) copy 7
fadds FR1,FR3,FR4 # tmp = c(i,j+1) + a(i,j+2) copy 6
lfsu FR4,4(R29) # load a(i,j+2) copy 8
fadds FR0,FR1,FR5 # c(i,j) = tmp + b(i,j+3) copy 6
stfsu FR0,4(R30) # store c(i,j) copy 6
lfs FR3,4008(R30) # load c(i,j+1) copy 8
dcbt R29,R31 # touch a
lfsu FR0,4(R5) # load b(i,j+3) copy 7
fadds FR1,FR6,FR2 # tmp = c(i,j+1) + a(i,j+2) copy 7
lfsu FR5,4(R5) # load b(i,j+3) copy 8
fadds FR2,FR3,FR4 # tmp = c(i,j+1) + a(i,j+2) copy 8
fadds FR1,FR1,FR0 # c(i,j) = tmp + b(i,j+3) copy 7
stfsu FR1,4(R30) # store c(i,j) copy 7
fadds FR2,FR2,FR5 # c(i,j) = tmp + b(i,j+3) copy 8
lfsu FR0,4(R29) # load a(i,j+2) copy 1
lfs FR1,4008(R30) # load c(i,j+1) copy 1
stfsu FR2,4(R30) # store c(i,j) copy 8
bdnz CL.2 # latch to CL.2
CL.40:


[ Table of Contents | Index ]
Copyright 1998 IBMchips