and run with a cycle time of about 40 nanoseconds. However, as we will see, they use quite different algorithms. The Weitek chip is well described in Birman et al. [1990], the MIPS chip is described in less detail in Rowen, Johnson, and Ries [1988], and details of the TI chip can be found in Darley et al. [1989].
These three chips have a number of things in common. They perform addition and multiplication in parallel, and they implement neither extended precision nor a remainder step operation. (Recall from section A.6 that it is easy to implement the IEEE remainder function in software if a remainder step instruction is available.) The designers of these chips probably decided not to provide extended precision because the most influential users are those who run portable codes, which can't rely on extended precision. However, as we have seen, extended precision can make for faster and simpler math libraries.
In the summary of the three chips given in Figure A.36, note that a higher transistor count generally leads to smaller cycle counts. Comparing the cycles/op numbers needs to be done carefully, because the figures for the MIPS chip are those for a complete system (R3000/3010 pair), while the Weitek and TI numbers are for stand-alone chips and are usually larger when used in a complete system. The MIPS chip has the fewest transistors of the three. This is reflected in the fact that it is the only chip of the three that does not have any pipelining or hardware square root. Further, the multiplication and addition operations are not completely independent because they share the carry-propagate adder that performs the final rounding (as well as the rounding logic).
Addition on the R3010 uses a mixture of ripple, CLA, and carry select. A carry-select adder is used in the fashion of Figure A.20 (page A-45). Within each half, carries are propagated using a hybrid ripple-CLA scheme of the type indicated in Figure A.18 (page A-43). However, this is further tuned by varying the size of each block, rather than having each fixed at 4 bits (as they are in Figure A.18). The multiplier is midway between the designs of Figures A.2 (page A-4) and A.27 (page A-53). It has an array just large enough so that output can be fed back into the input without having to be clocked. Also, it uses radix-4 Booth recoding and the even/odd technique of Figure A.29 (page A-55). The R3010 can do a divide and multiply in parallel (like the Weitek chip but unlike the TI chip). The divider is a radix-4 SRT method with quotient digits −2, −1, 0, 1, and 2, and is similar to that described in Taylor [1985]. Double-precision division is about four times slower than multiplication. The R3010 shows that for chips using an O(n) multiplier, an SRT divider can operate fast enough to keep a reasonable ratio between multiply and divide.
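To make the recoding concrete, here is a small behavioral model in C of radix-4 Booth multiplication (our sketch, not the R3010's hardware): each step examines an overlapping triplet of multiplier bits and selects a multiple of the multiplicand from {−2, −1, 0, +1, +2}.

    #include <stdio.h>
    #include <stdint.h>

    /* Radix-4 Booth recoding: step i inspects the overlapping triplet
       (a[2i+1], a[2i], a[2i-1]) of the multiplier and selects a multiple
       of b, weighted by 4^i. */
    static const int booth4[8] = {
        0, 1, 1, 2,    /* triplets 000, 001, 010, 011 */
       -2,-1,-1, 0     /* triplets 100, 101, 110, 111 */
    };

    int64_t booth_multiply(int32_t a, int32_t b) {
        uint32_t ua = (uint32_t)a;   /* extract bits without signed-shift issues */
        int64_t product = 0;
        int prev = 0;                /* implicit bit a[-1] = 0 */
        for (int i = 0; i < 32; i += 2) {
            int t = (((ua >> (i + 1)) & 1) << 2) | (((ua >> i) & 1) << 1) | prev;
            product += (int64_t)booth4[t] * b * ((int64_t)1 << i);
            prev = (ua >> (i + 1)) & 1;
        }
        return product;
    }

    int main(void) {
        printf("%lld\n", (long long)booth_multiply(-12345, 6789)); /* -83810205 */
        return 0;
    }

Because each digit covers two multiplier bits, only 16 partial products are summed for a 32-bit multiply, which is the point of the recoding.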
The Weitek 3364 has independent add, multiply, and divide units. It also uses radix-4 SRT division. However, the add and multiply operations on the Weitek chip are pipelined. The three addition stages are (1) exponent compare, (2) add followed by shift (or vice versa), and (3) final rounding. Stages (1) and (3) take only a half-cycle, allowing the whole operation to be done in two cycles, even though there are three pipeline stages. The multiplier uses an array of the style of Figure A.28 but uses radix-8 Booth recoding, which means it must compute 3 times the multiplier. The three multiplier pipeline stages are (1) compute 3b, (2) pass through array, and (3) final carry-propagation add and round. Single precision passes through the array once, double precision twice. Like addition, the latency is two cycles.

The Weitek chip uses an interesting addition algorithm. It is a variant on the carry-skip adder pictured in Figure A.19 (page A-44). However, P_ij, which is the logical AND of many terms, is computed by rippling, performing one AND per ripple. Thus, while the carries propagate left within a block, the value of P_ij is propagating right within the next block, and the block sizes are chosen so that both waves complete at the same time. Unlike the MIPS chip, the 3364 has hardware square root, which shares the divide hardware. The ratio of double-precision multiply to divide is 2:17. The large disparity between multiply and divide is due to the fact that multiplication uses radix-8 Booth recoding, while division uses a radix-4 method. In the MIPS R3010, multiplication and division use the same radix.

The notable feature of the TI 8847 is that it does division by iteration (using the Goldschmidt algorithm discussed in section A.6). This improves the speed of division (the ratio of multiply to divide is 3:11), but means that multiplication and division cannot be done in parallel as on the other two chips. Addition has a two-stage pipeline. Exponent compare, fraction shift, and fraction addition are done in the first stage, normalization and rounding in the second stage. Multiplication uses a binary tree of signed-digit adders and has a three-stage pipeline. The first stage passes through the array, retiring half the bits; the second stage passes through the array a second time; and the third stage converts from signed-digit form to two's complement. Since there is only one array, a new multiply operation can only be initiated in every other cycle. However, by slowing down the clock, two passes through the array can be made in a single cycle. In this case, a new multiplication can be initiated in each cycle. The 8847 adder uses a carry-select algorithm rather than carry lookahead. As mentioned in section A.6, the TI carries 60 bits of precision in order to do correctly rounded division.
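Division by iteration is easy to state in software. Here is a minimal C sketch of the Goldschmidt idea (our illustration, not the 8847's actual datapath, which seeds the iteration and carries 60 bits so the result can be correctly rounded): both numerator and denominator are repeatedly scaled by r = 2 − d, driving the denominator toward 1.

    #include <stdio.h>
    #include <math.h>

    /* Goldschmidt division, a/b for b > 0: scale numerator and denominator
       together by r = 2 - d each iteration, so d -> 1 quadratically and
       n -> a/b. */
    double goldschmidt_div(double a, double b) {
        int e;
        double d = frexp(b, &e);      /* write b = d * 2^e with 0.5 <= d < 1 */
        double n = ldexp(a, -e);      /* scale a by the same factor 2^-e */
        for (int i = 0; i < 6; i++) { /* |1-d| <= 1/2 is squared each pass */
            double r = 2.0 - d;
            n *= r;
            d *= r;
        }
        return n;                     /* d is now ~1, so n is ~a/b */
    }

    int main(void) {
        printf("%.17g\n", goldschmidt_div(2.0, 3.0)); /* ~0.66666666666666663 */
        return 0;
    }

Note that the two multiplies per iteration are independent, which is why the technique maps well onto a fast pipelined multiplier.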
These three chips illustrate the different trade-offs made by designers with similar constraints. One of the most interesting things about these chips is the diversity of their algorithms. Each uses a different add algorithm, as well as a different multiply algorithm. In fact, Booth recoding is the only technique that is universally used by all the chips.
A.11 Fallacies and Pitfalls
Fallacy: Underflows rarely occur in actual floating-point application code.
Although most codes rarely underflow, there are actual codes that underflow frequently. SDRWAVE [Kahaner 1988], which solves a one-dimensional wave equation, is one such example. This program underflows quite frequently, even when functioning properly. Measurements on one machine show that adding hardware support for gradual underflow would cause SDRWAVE to run about 50% faster.

Fallacy: Conversions between integer and floating point are rare.
In fact, in Spice they are as frequent as divides. The assumption that conversions are rare leads to a mistake in the SPARC version 8 instruction set, which does not provide an instruction to move from integer registers to floating-point registers.
FIGURE A.37 Chip layout for the TI 8847, MIPS R3010, and Weitek 3364. In the left-hand columns are the photomicrographs; the right-hand columns show the corresponding floor plans.
Pitfall: Don’t increase the speed of a floating-point unit without increasing its memory bandwidth.
A typical use of a floating-point unit is to add two vectors to produce a third vector. If these vectors consist of double-precision numbers, then each floating-point add will use three operands of 64 bits each, or 24 bytes of memory. At a sustained rate of one add per clock cycle, the memory system must therefore deliver 24 bytes per cycle. The memory bandwidth requirements are even greater if the floating-point unit can perform addition and multiplication in parallel (as most do).

Pitfall: −x is not the same as 0 − x.
This is a fine point in the IEEE standard that has tripped up some designers. Because floating-point numbers use the sign/magnitude system, there are two zeros, +0 and −0. The standard says that 0 − 0 = +0, whereas −(0) = −0. Thus −x is not the same as 0 − x when x = 0.
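A short C program makes the distinction visible (assuming IEEE arithmetic with the default round-to-nearest mode, under which 0 − 0 and −(0) differ only in sign):

    #include <stdio.h>

    int main(void) {
        double x = 0.0;
        double neg = -x;        /* sign flip gives -0 */
        double sub = 0.0 - x;   /* IEEE says 0 - 0 = +0 */
        /* The two zeros compare equal, but dividing exposes the sign: */
        printf("-x    = %g, 1/(-x)    = %g\n", neg, 1.0 / neg); /* -0, -inf */
        printf("0 - x = %g, 1/(0 - x) = %g\n", sub, 1.0 / sub); /* 0, inf */
        return 0;
    }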
A.12 Historical Perspective and References

The earliest computers used fixed point rather than floating point. In "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," Burks, Goldstine, and von Neumann [1946] put it like this:
There appear to be two major purposes in a "floating" decimal point system both of which arise from the fact that the number of digits in a word is a constant fixed by design considerations for each particular machine. The first of these purposes is to retain in a sum or product as many significant digits as possible and the second of these is to free the human operator from the burden of estimating and inserting into a problem "scale factors" — multiplicative constants which serve to keep numbers within the limits of the machine.

There is, of course, no denying the fact that human time is consumed in arranging for the introduction of suitable scale factors. We only argue that the time so consumed is a very small percentage of the total time we will spend in preparing an interesting problem for our machine. The first advantage of the floating point is, we feel, somewhat illusory. In order to have such a floating point, one must waste memory capacity which could otherwise be used for carrying more digits per word. It would therefore seem to us not at all clear whether the modest advantages of a floating binary point offset the loss of memory capacity and the increased complexity of the arithmetic and control circuits.
This enables us to see things from the perspective of early computer designers, who believed that saving computer time and memory were more important than saving programmer time.
The original papers introducing the Wallace tree, Booth recoding, SRT division, overlapped triplets, and so on, are reprinted in Swartzlander [1990]. A good explanation of an early machine (the IBM 360/91) that used a pipelined Wallace tree, Booth recoding, and iterative division is in Anderson et al. [1967]. A discussion of the average time for single-bit SRT division is in Freiman [1961]; this is one of the few interesting historical papers that does not appear in Swartzlander.

The standard book of Mead and Conway [1980] discouraged the use of CLAs as not being cost-effective in VLSI. The important paper by Brent and Kung [1982] helped combat that view. An example of a detailed layout for CLAs can be found in Ngai and Irwin [1985] or in Weste and Eshraghian [1993], and a more theoretical treatment is given by Leighton [1992]. Takagi, Yasuura, and Yajima [1985] provide a detailed description of a signed-digit tree multiplier.
Before the ascendancy of IEEE arithmetic, many different floating-point formats were in use. Three important ones were used by the IBM/370, the DEC VAX, and the Cray. Here is a brief summary of these older formats. The VAX format is closest to the IEEE standard. Its single-precision format (F format) is like IEEE single precision in that it has a hidden bit, 8 bits of exponent, and 23 bits of fraction. However, it does not have a sticky bit, which causes it to round halfway cases up instead of to even. The VAX has a slightly different exponent range from IEEE single: Emin is −128 rather than −126 as in IEEE, and Emax is 126 instead of 127. The main differences between VAX and IEEE are the lack of special values and gradual underflow. The VAX has a reserved operand, but it works like a signaling NaN: it traps whenever it is referenced. Originally, the VAX's double precision (D format) also had 8 bits of exponent. However, as this is too small for many applications, a G format was added; like the IEEE standard, this format has 11 bits of exponent. The VAX also has an H format, which is 128 bits long.

The IBM/370 floating-point format uses base 16 rather than base 2. This means it cannot use a hidden bit. In single precision, it has 7 bits of exponent and 24 bits (6 hex digits) of fraction. Thus, the largest representable number is 16^(2^7) = 2^(4 × 2^7) = 2^(2^9), compared with 2^(2^8) for IEEE. However, a number that is normalized in the hexadecimal sense only needs to have a nonzero leading hex digit. When interpreted in binary, the three most-significant bits could be zero. Thus, there are potentially fewer than 24 bits of significance. The reason for using the higher base was to minimize the amount of shifting required when adding floating-point numbers. However, this is less significant in current machines, where the floating-point add time is usually fixed independently of the operands. Another difference between 370 arithmetic and IEEE arithmetic is that the 370 has neither a round digit nor a sticky digit, which effectively means that it truncates rather than rounds. Thus, in many computations, the result will systematically be too small. Unlike the VAX and IEEE arithmetic, every bit pattern is a valid number. Thus, library routines must establish conventions for what to return in case of errors. In the IBM FORTRAN library, for example, √−4 returns 2!
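A small C model (ours, for illustration only) shows how such a base-16 number is decoded; note that the 24-bit fraction is normalized only to a nonzero leading hex digit:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Decode an IBM System/370 single-precision hex float: 1 sign bit,
       7-bit excess-64 exponent, 24-bit fraction; the value is
       (-1)^s * 0.f * 16^(e-64).  No hidden bit is possible in base 16. */
    double ibm370_to_double(uint32_t w) {
        int    s = (w >> 31) & 1;
        int    e = (int)((w >> 24) & 0x7F) - 64;   /* unbias the exponent */
        double f = (w & 0xFFFFFF) / 16777216.0;    /* 24-bit fraction as 0.f */
        double v = ldexp(f, 4 * e);                /* times 16^e = 2^(4e) */
        return s ? -v : v;
    }

    int main(void) {
        /* 0x41100000: fraction 1/16, exponent 1, so 16 * (1/16) = 1.0.
           The leading hex digit is 1, so the top three fraction bits are 0:
           the "fewer than 24 bits of significance" effect described above. */
        printf("%g\n", ibm370_to_double(0x41100000u));
        return 0;
    }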
Arithmetic on Cray computers is interesting because it is driven by a motivation for the highest possible floating-point performance. It has a 15-bit exponent field and a 48-bit fraction field. Addition on Cray computers does not have a guard digit, and multiplication is even less accurate than addition. Thinking of multiplication as a sum of p numbers, each 2p bits long, Cray computers drop the low-order bits of each summand. Thus, analyzing the exact error characteristics of the multiply operation is not easy. Reciprocals are computed using iteration, and division of a by b is done by multiplying a times 1/b. The errors in multiplication and reciprocation combine to make the last three bits of a divide operation unreliable. At least Cray computers serve to keep numerical analysts on their toes!

The IEEE standardization process began in 1977, inspired mainly by W. Kahan and based partly on Kahan's work with the IBM 7094 at the University of Toronto [Kahan 1968]. The standardization process was a lengthy affair, with gradual underflow causing the most controversy. (According to Cleve Moler, visitors to the U.S. were advised that the sights not to be missed were Las Vegas, the Grand Canyon, and the IEEE standards committee meeting.) The standard was finally approved in 1985. The Intel 8087 was the first major commercial IEEE implementation and appeared in 1981, before the standard was finalized. It contains features that were eliminated in the final standard, such as projective bits. According to Kahan, the length of double-extended precision was based on what could be implemented in the 8087. Although the IEEE standard was not based on any existing floating-point system, most of its features were present in some other system. For example, the CDC 6600 reserved special bit patterns for INDEFINITE and INFINITY, while the idea of denormal numbers appears in Goldberg [1967] as well as in Kahan [1968]. Kahan was awarded the 1989 Turing prize in recognition of his work on floating point.
Although floating point rarely attracts the interest of the general press, newspapers were filled with stories about floating-point division in November 1994. A bug in the division algorithm used on all of Intel's Pentium chips had just come to light. It was discovered by Thomas Nicely, a math professor at Lynchburg College in Virginia. Nicely found the bug when doing calculations involving reciprocals of prime numbers. News of Nicely's discovery first appeared in the press on the front page of the November 7 issue of Electronic Engineering Times. Intel's immediate response was to stonewall, asserting that the bug would only affect theoretical mathematicians. Intel told the press, "This doesn't even qualify as an errata ... even if you're an engineer, you're not going to see this."

Under more pressure, Intel issued a white paper, dated November 30, explaining why they didn't think the bug was significant. One of their arguments was based on the fact that if you pick two floating-point numbers at random and divide one into the other, the chance that the resulting quotient will be in error is about 1 in 9 billion. However, Intel neglected to explain why they thought that the typical customer accessed floating-point numbers randomly.

Pressure continued to mount on Intel. One sore point was that Intel had known about the bug before Nicely discovered it, but had decided not to make it public. Finally, on December 20, Intel announced that they would unconditionally replace any Pentium chip that used the faulty algorithm and that they would take an unspecified charge against earnings, which turned out to be $300 million.
The Pentium uses a simple version of SRT division as discussed in section A.9. The bug was introduced when they converted the quotient lookup table to a PLA. Evidently there were a few elements of the table containing the quotient digit 2 that Intel thought would never be accessed, and they optimized the PLA design using this assumption. The resulting PLA returned 0 rather than 2 in these situations. However, those entries really were accessed, and this caused the division bug. Even though the effect of the faulty PLA was to cause 5 out of 2048 table entries to be wrong, the Pentium only computes an incorrect quotient 1 out of 9 billion times on random inputs. This is explored in Exercise A.34.
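The widely circulated test case for the flaw (due to Tim Coe; a sketch, not from this text) fits in a few lines of C. On a flawed Pentium the quotient 4195835/3145727 is wrong starting in about the fifth significant decimal digit, so the residual below was reported to come out 256 instead of 0:

    #include <stdio.h>

    int main(void) {
        double x = 4195835.0, y = 3145727.0;
        double r = x - (x / y) * y;   /* 0 on a correct divider */
        printf("residual = %g\n", r);
        return 0;
    }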
References
Anderson, S. F., J. G. Earle, R. E. Goldschmidt, and D. M. Powers [1967]. "The IBM System/360 Model 91: Floating-point execution unit," IBM J. Research and Development 11, 34–53.

Birman, M., A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, and J. Barnes [1990]. "Developing the WTL3170/3171 SPARC floating-point coprocessors," IEEE Micro 10:1, 55–64.
These chips have the same floating-point core as the Weitek 3364, and this paper has a fairly detailed description of that floating-point design.

Brent, R. P. and H. T. Kung [1982]. "A regular layout for parallel adders," IEEE Trans. on Computers C-31, 260–264.
This is the paper that popularized CLAs in VLSI.

Burgess, N. and T. Williams [1995]. "Choices of operand truncation in the SRT division algorithm," IEEE Trans. on Computers 44:7.
Analyzes how many bits of divisor and remainder need to be examined in SRT division.

Burks, A. W., H. H. Goldstine, and J. von Neumann [1946]. "Preliminary discussion of the logical design of an electronic computing instrument," Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, Calif., 1987, 97–146.

Cody, W. J., J. T. Coonen, D. M. Gay, K. Hanson, D. Hough, W. Kahan, R. Karpinski, J. Palmer, F. N. Ris, and D. Stevenson [1984]. "A proposed radix- and word-length-independent standard for floating-point arithmetic," IEEE Micro 4:4, 86–100.
Contains a draft of the 854 standard, which is more general than 754. The significance of this article is that it contains commentary on the standard, most of which is equally relevant to 754. However, be aware that there are some differences between this draft and the final standard.

Coonen, J. [1984]. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, Ph.D. thesis, Univ. of Calif., Berkeley.
The only detailed discussion of how rounding modes can be used to implement efficient binary-decimal conversion.
Darley, H. M., et al. [1989]. "Floating point/integer processor with divide and square root functions," U.S. Patent 4,878,190, October 31, 1989.
Pretty readable as patents go. Gives a high-level view of the TI 8847 chip, but doesn't have all the details of the division algorithm.

Demmel, J. W. and X. Li [1994]. "Faster numerical algorithms via exception handling," IEEE Trans. on Computers 43:8, 983–992.
A good discussion of how the features unique to IEEE floating point can improve the performance of an important software library.

Freiman, C. V. [1961]. "Statistical analysis of certain binary division algorithms," Proc. IRE 49:1, 91–103.
Contains an analysis of the performance of the shifting-over-zeros SRT division algorithm.

Goldberg, D. [1991]. "What every computer scientist should know about floating-point arithmetic," Computing Surveys 23:1, 5–48.
Contains an in-depth tutorial on the IEEE standard from the software point of view.

Goldberg, I. B. [1967]. "27 bits are not enough for 8-digit accuracy," Comm. ACM 10:2, 105–106.
This paper proposes using hidden bits and gradual underflow.

Gosling, J. B. [1980]. Design of Arithmetic Units for Digital Computers, Springer-Verlag, New York.
A concise, well-written book, although it focuses on MSI designs.

Hamacher, V. C., Z. G. Vranesic, and S. G. Zaky [1984]. Computer Organization, 2nd ed., McGraw-Hill, New York.
Introductory computer architecture book with a good chapter on computer arithmetic.

Hwang, K. [1979]. Computer Arithmetic: Principles, Architecture, and Design, Wiley, New York.
This book contains the widest range of topics of the computer arithmetic books.

IEEE [1985]. "IEEE standard for binary floating-point arithmetic," SIGPLAN Notices 22:2, 9–25.
IEEE 754 is reprinted here.

Kahan, W. [1968]. "7094-II system support for numerical analysis," SHARE Secretarial Distribution SSD-159.
This system had many features that were incorporated into the IEEE floating-point standard.

Kahaner, D. K. [1988]. "Benchmarks for 'real' programs," SIAM News (November).
The benchmark presented in this article turns out to cause many underflows.

Knuth, D. [1981]. The Art of Computer Programming, vol. II, 2nd ed., Addison-Wesley, Reading, Mass.
Has a section on the distribution of floating-point numbers.

Kogge, P. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.
Has a brief discussion of pipelined multipliers.

Kohn, L. and S.-W. Fu [1989]. "A 1,000,000 transistor microprocessor," IEEE Int'l Solid-State Circuits Conf., 54–55.
There are several articles about the i860, but this one contains the most details about its floating-point algorithms.

Koren, I. [1989]. Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, N.J.

Leighton, F. T. [1992]. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, Calif.
This is an excellent book, with emphasis on the complexity analysis of algorithms. Section 1.2.1 has a nice discussion of carry-lookahead addition on a tree.
Magenheimer, D. J., L. Peters, K. W. Pettis, and D. Zuras [1988]. "Integer multiplication and division on the HP Precision architecture," IEEE Trans. on Computers 37:8, 980–990.
Gives rationale for the integer multiply- and divide-step instructions in the Precision architecture.

Markstein, P. W. [1990]. "Computation of elementary functions on the IBM RISC System/6000 processor," IBM J. of Research and Development 34:1, 111–119.
Explains how to use fused multiply-add to compute correctly rounded division and square root.

Mead, C. and L. Conway [1980]. Introduction to VLSI Systems, Addison-Wesley, Reading, Mass.

Montoye, R. K., E. Hokenek, and S. L. Runyon [1990]. "Design of the IBM RISC System/6000 floating-point execution unit," IBM J. of Research and Development 34:1, 59–70.
Describes one implementation of fused multiply-add.

Ngai, T.-F. and M. J. Irwin [1985]. "Regular, area-time efficient carry-lookahead adders," Proc. Seventh IEEE Symposium on Computer Arithmetic, 9–15.
Describes a CLA like that of Figure A.17, where the bits flow up and then come back down.

Patterson, D. A. and J. L. Hennessy [1994]. Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco.
Chapter 4 is a gentler introduction to the first third of this appendix.

Peng, V., S. Samudrala, and M. Gavrielov [1987]. "On the implementation of shifters, multipliers, and dividers in VLSI floating point units," Proc. Eighth IEEE Symposium on Computer Arithmetic, 95–102.
Highly recommended survey of different techniques actually used in VLSI designs.

Rowen, C., M. Johnson, and P. Ries [1988]. "The MIPS R3010 floating-point coprocessor," IEEE Micro, 53–62 (June).

Santoro, M. R., G. Bewick, and M. A. Horowitz [1989]. "Rounding algorithms for IEEE multipliers," Proc. Ninth IEEE Symposium on Computer Arithmetic, 176–183.
A very readable discussion of how to efficiently implement rounding for floating-point multiplication.

Scott, N. R. [1985]. Computer Number Systems and Arithmetic, Prentice Hall, Englewood Cliffs, N.J.

Swartzlander, E., ed. [1990]. Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif.
A collection of historical papers in two volumes.

Takagi, N., H. Yasuura, and S. Yajima [1985]. "High-speed VLSI multiplication algorithm with a redundant binary addition tree," IEEE Trans. on Computers C-34:9, 789–796.
A discussion of the binary-tree signed multiplier that was the basis for the design used in the TI 8847.

Taylor, G. S. [1981]. "Compatible hardware for division and square root," Proc. Fifth IEEE Symposium on Computer Arithmetic, 127–134.
Good discussion of a radix-4 SRT division algorithm.

Taylor, G. S. [1985]. "Radix 16 SRT dividers with overlapped quotient selection stages," Proc. Seventh IEEE Symposium on Computer Arithmetic, 64–71.
Describes a very sophisticated high-radix division algorithm.

Weste, N. and K. Eshraghian [1993]. Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Addison-Wesley, Reading, Mass.
This textbook has a section on the layouts of various kinds of adders.
Williams, T. E., M. Horowitz, R. L. Alverson, and T. S. Yang [1987]. "A self-timed chip for division," Advanced Research in VLSI, Proc. 1987 Stanford Conf., MIT Press, Cambridge, Mass.
Describes a divider that tries to get the speed of a combinational design without using the area that would be required by one.
EXERCISES
A.1 [12] <A.2> Using n bits, what is the largest and smallest integer that can be represented in the two's complement system?
A.2 [20/25] <A.2> In the subsection Signed Numbers (page A-7), it was stated that two's complement overflows when the carry into the high-order bit position is different from the carry-out from that position.

a. [20] <A.2> Give examples of pairs of integers for all four combinations of carry-in and carry-out. Verify the rule stated above.

b. [25] <A.2> Explain why the rule is always true.

A.3 [12] <A.2> Using 4-bit binary numbers, multiply −8 × −8 using Booth recoding.

A.4 [15] <A.2> Equations A.2.1 and A.2.2 are for adding two n-bit numbers. Derive similar equations for subtraction, where there will be a borrow instead of a carry.

A.5 [25] <A.2> On a machine that doesn't detect integer overflow in hardware, show how you would detect overflow on a signed addition operation in software.
A.6 [15/15/20] <A.3> Represent the following numbers as single-precision and double-precision IEEE floating-point numbers.

a. [15] <A.3> 10.

b. [15] <A.3> 10.5.

c. [20] <A.3> 0.1.

A.7 [12/12/12/12/12] <A.3> Below is a list of floating-point numbers. In single precision, write down each number in binary, in decimal, and give its representation in IEEE arithmetic.

a. [12] <A.3> The largest number less than 1.

b. [12] <A.3> The largest number.

c. [12] <A.3> The smallest positive normalized number.

d. [12] <A.3> The largest denormal number.

e. [12] <A.3> The smallest positive number.

A.8 [15] <A.3> Is the ordering of nonnegative floating-point numbers the same as integers when denormalized numbers are also considered?

A.9 [20] <A.3> Write a program that prints out the bit patterns used to represent floating-point numbers on your favorite computer. What bit pattern is used for NaN?
A.10 [15] <A.4> Using p = 4, show how the binary floating-point multiply algorithm computes the product of 1.875 × 1.875.

A.11 [12/10] <A.4> Concerning the addition of exponents in floating-point multiply:

a. [12] <A.4> What would the hardware that implements the addition of exponents look like?

b. [10] <A.4> If the bias in single precision were 129 instead of 127, would addition be harder or easier to implement?

A.12 [15/12] <A.4> In the discussion of overflow detection for floating-point multiplication, it was stated that (for single precision) you can detect an overflowed exponent by performing exponent addition in a 9-bit adder.

a. [15] <A.4> Give the exact rule for detecting overflow.

b. [12] <A.4> Would overflow detection be any easier if you used a 10-bit adder instead?

A.13 [15/10] <A.4> Floating-point multiplication:

a. [15] <A.4> Construct two single-precision floating-point numbers whose product doesn't overflow until the final rounding step.

b. [10] <A.4> Is there any rounding mode where this phenomenon cannot occur?

A.14 [15] <A.4> Give an example of a product with a denormal operand but a normalized output. How large was the final shifting step? What is the maximum possible shift that can occur when the inputs are double-precision numbers?

A.15 [15] <A.5> Use the floating-point addition algorithm on page A-24 to compute 1.010_2 − 1001_2 (in 4-bit precision).

A.16 [10/15/20/20/20] <A.5> In certain situations, you can be sure that a + b is exactly representable as a floating-point number, that is, no rounding is necessary.

a. [10] <A.5> If a, b have the same exponent and different signs, explain why a + b is exact. This was used in the subsection Speeding Up Addition on page A-27.

b. [15] <A.5> Give an example where the exponents differ by 1, a and b have different signs, and a + b is not exact.

c. [20] <A.5> If a ≥ b ≥ 0, and the top two bits of a cancel when computing a − b, explain why the result is exact (this fact is mentioned on page A-23).

d. [20] <A.5> If a ≥ b ≥ 0, and the exponents differ by 1, show that a − b is exact unless the high-order bit of a − b is in the same position as that of a (mentioned in Speeding Up Addition, page A-27).

e. [20] <A.5> If the result of a − b or a + b is denormal, show that the result is exact (mentioned in the subsection Underflow, page A-38).

A.17 [15/20] <A.5> Fast floating-point addition (using parallel adders) for p = 5.

a. [15] <A.5> Step through the fast addition algorithm for a + b, where a = 1.0111_2 and b = 11011_2.
b. [20] <A.5> Suppose the rounding mode is toward +∞. What complication arises in the above example for the adder that assumes a carry-out? Suggest a solution.

A.18 [12] <A.4,A.5> How would you use two parallel adders to avoid the final round-up addition in floating-point multiplication?

A.19 [30/10] <A.5> This problem presents a way to reduce the number of addition steps in floating-point addition from three to two using only a single adder.

a. [30] <A.5> Let A and B be integers of opposite signs, with a and b their magnitudes. Show that the following rules for manipulating the unsigned numbers a and b give A + B.

1. Complement one of the operands.

2. Use end-around carry to add the complemented operand and the other (uncomplemented) one.

3. If there was a carry-out, the sign of the result is the sign associated with the uncomplemented operand.

4. Otherwise, if there was no carry-out, complement the result, and give it the sign of the complemented operand.

b. [10] <A.5> Use the above to show how steps 2 and 4 in the floating-point addition algorithm can be performed using only a single addition.

A.20 [20/15/20/15/20/15] <A.6> Iterative square root.

a. [20] <A.6> Use Newton's method to derive an iterative algorithm for square root. The formula will involve a division.

b. [15] <A.6> What is the fastest way you can think of to divide a floating-point number by 2?

c. [20] <A.6> If division is slow, then the iterative square root routine will also be slow. Use Newton's method on f(x) = 1/x^2 − a to derive a method that doesn't use any divisions.

d. [15] <A.6> Assume that the ratio division by 2 : floating-point add : floating-point multiply is 1:2:4. What ratios of multiplication time to divide time make each iteration step in the method of part (c) faster than each iteration in the method of part (a)?

e. [20] <A.6> When using the method of part (a), how many bits need to be in the initial guess in order to get double-precision accuracy after three iterations? (You may ignore rounding error.)

f. [15] <A.6> Suppose that when Spice runs on the TI 8847, it spends 16.7% of its time in the square root routine (this percentage has been measured on other machines). Using the values in Figure A.36 and assuming three iterations, how much slower would Spice run if square root were implemented in software using the method of part (a)?

A.21 [10/20/15/15/15] <A.6> Correctly rounded iterative division. Let a and b be floating-point numbers with p-bit significands (p = 53 in double precision). Let q be the exact quotient q = a/b, 1 ≤ q < 2. Suppose that q̄ is the result of an iteration process, that q̄ has a few extra bits of precision, and that 0 < q − q̄ < 2^−p. For the following, it is important that q̄ < q, even when q can be exactly represented as a floating-point number.
a. [10] <A.6> If x is a floating-point number, and 1 ≤ x < 2, what is the next representable number after x?
b. [20] <A.6> Show how to compute q′ from q̄, where q′ has p + 1 bits of precision and q − q′ < 2^−p.

c. [15] <A.6> Assuming round to nearest, show that the correctly rounded quotient is either q′, q′ − 2^−p, or q′ + 2^−p.

d. [15] <A.6> Give rules for computing the correctly rounded quotient from q′ based on the low-order bit of q′ and the sign of a − bq′.

e. [15] <A.6> Solve part (c) for the other three rounding modes.

A.22 [15] <A.6> Verify the formula on page A-31. [Hint: If x_n = x_0(2 − x_0 b) × Π_{i=1,n} [1 + (1 − x_0 b)^(2^i)], then 2 − x_n b = 2 − x_0 b(2 − x_0 b) Π [1 + (1 − x_0 b)^(2^i)] = 2 − [1 − (1 − x_0 b)^2] Π [1 + (1 − x_0 b)^(2^i)].]
A.23 [15] <A.7> Our example that showed that double rounding can give a different answer from rounding once used the round-to-even rule. If halfway cases are always rounded up, is double rounding still dangerous?
A.24 [10/10/20/20] <A.7> Some of the cases of the italicized statement in the Precisions
subsection (page A-34) aren’t hard to demonstrate.
a. [10] <A.7> What form must a binary number have if rounding to q bits followed by rounding to p bits gives a different answer than rounding directly to p bits?
b. [10] <A.7> Show that for multiplication of p-bit numbers, rounding to q bits followed
by rounding to p bits is the same as rounding immediately to p bits if q ≥ 2p.
c. [20] <A.7> If a and b are p-bit numbers with the same sign, show that rounding a + b
to q bits followed by a rounding to p bits is the same as rounding immediately to p bits
if q ≥ 2p + 1.
d. [20] <A.7> Do part (c) when a and b have opposite signs.
A.25 [Discussion] <A.7> In the MIPS approach to exception handling, you need a test for
determining whether two floating-point operands could cause an exception. This should be fast and also not have too many false positives. Can you come up with a practical test? The performance cost of your design will depend on the distribution of floating-point numbers. This is discussed in Knuth [1981] and the Hamming paper in Swartzlander [1990].
A.26 [12/12/10] <A.8> Carry-skip adders:

a. [12] <A.8> Assuming that time is proportional to logic levels, how long does it take an n-bit adder divided into (fixed) blocks of length k bits to perform an addition?

b. [12] <A.8> What value of k gives the fastest adder?

c. [10] <A.8> Explain why the carry-skip adder takes time O(√n).
A.27 [10/15/20] <A.8> Complete the details of the block diagrams for the following adders.

a. [10] <A.8> In Figure A.15, show how to implement the "1" and "2" boxes in terms of AND and OR gates.

b. [15] <A.8> In Figure A.18, what signals need to flow from the adder cells in the top row into the "C" cells? Write the logic equations for the "C" box.

c. [20] <A.8> Show how to extend the block diagram in Figure A.17 so it will produce the carry-out bit c_8.
A.28 [15] <A.9> For ordinary Booth recoding, the multiple of b used in the ith step is simply a_{i−1} − a_i. Can you find a similar formula for radix-4 Booth recoding (overlapped triplets)?
A.29 [20] <A.9> Expand Figure A.29 in the fashion of A.27, showing the individual
adders.
A.30 [25] <A.9> Write out the analogue of Figure A.25 for radix-8 Booth recoding.
A.31 [18] <A.9> Suppose that a_{n−1}...a_1a_0 and b_{n−1}...b_1b_0 are being added in a signed-digit adder as illustrated in the Example on page A-56. Write a formula for the ith bit of the sum, s_i, in terms of a_i, a_{i−1}, a_{i−2}, b_i, b_{i−1}, and b_{i−2}.
A.32 [15] <A.9> The text discussed radix-4 SRT division with quotient digits of −2, −1, 0, 1, 2. Suppose that 3 and −3 are also allowed as quotient digits. What relation replaces |r_i| ≤ 2b/3?
A.33 [25/20/30] <A.9> Concerning the SRT division table, Figure A.34:

a. [25] <A.9> Write a program to generate the results of Figure A.34.

b. [20] <A.9> Note that Figure A.34 has a certain symmetry with respect to positive and negative values of P. Can you find a way to exploit the symmetry and only store the values for positive P?

c. [30] <A.9> Suppose a carry-save adder is used instead of a propagate adder. The input to the quotient lookup table will be k bits of divisor and l bits of remainder, where the remainder bits are computed by summing the top l bits of the sum and carry registers. What are k and l? Write a program to generate the analogue of Figure A.34.
A.34 [12/12/12] <A.9,A.12> The first several million Pentium chips produced had a flaw that caused division to sometimes return the wrong result. The Pentium uses a radix-4 SRT algorithm similar to the one illustrated in the Example on page A-59 (but with the remainder stored in carry-save format: see Exercise A.33(c)). According to Intel, the bug was due to five incorrect entries in the quotient lookup table.

a. [12] <A.9,A.12> The bad entries should have had a quotient of plus or minus 2, but instead had a quotient of 0. Because of redundancy, it's conceivable that the algorithm could "recover" from a bad quotient digit on later iterations. Show that this is not possible for the Pentium flaw.
b. [12] <A.9,A.12> Since the operation is a floating-point divide rather than an integer divide, the SRT division algorithm on page A-47 must be modified in two ways. First, step 1 is no longer needed, since the divisor is already normalized. Second, the very first remainder may not satisfy the proper bound (|r| ≤ 2b/3 for the Pentium; see page A-58). Show that skipping the very first left shift in step 2(a) of the SRT algorithm will solve this problem.
c. [12] <A.9,A.12> If the faulty table entries were indexed by a remainder that could occur at the very first divide step (when the remainder is the divisor), random testing would quickly reveal the bug. This didn't happen. What does that tell you about the remainder values that index the faulty entries?

A.35 [12/12/12] <A.6,A.9> The discussion of the remainder-step instruction assumed that division was done using a bit-at-a-time algorithm. What would have to change if division were implemented using a higher-radix method?
A.36 [25] <A.9> In the array of Figure A.28, the fact that an array can be pipelined is not exploited. Can you come up with a design that feeds the output of the bottom CSA into the bottom CSAs instead of the top one, and that will run faster than the arrangement of Figure A.28?
B Vector Processors
I’m certainly not inventing vector processors There are three kinds that I know of existing today They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor
Those three were all pioneering processors One of the problems
of being a pioneer is you always make mistakes and I never, never want to be a pioneer It’s always best to come second when you can look at the mistakes the pioneers made.
Seymour Cray
Public Lecture at Lawrence Livermore Laboratories
on the Introduction of the CRAY-1 (1976)
B.1 Why Vector Processors?
■ Clock cycle time—The clock cycle time can be decreased by making the pipelines deeper, but a deeper pipeline will increase the pipeline dependences and result in a higher CPI. At some point, each increase in pipeline depth has a corresponding increase in CPI. As we saw in Chapter 3's Fallacies and Pitfalls, very deep pipelining can slow down a processor.
■ Instruction fetch and decode rate—This obstacle, sometimes called the Flynn bottleneck (based on Flynn [1966]), makes it difficult to fetch and issue many instructions per clock. This obstacle is one reason that it has been difficult to build processors with high clock rates and very high issue rates.
The dual limitations imposed by deeper pipelines and issuing multiple instructions can be viewed from the standpoint of either clock rate or CPI: It is just as difficult to schedule a pipeline that is n times deeper as it is to schedule a processor that issues n instructions per clock cycle.
instruc-High-speed, pipelined processors are particularly useful for large scientificand engineering applications A high-speed pipelined processor will usually use acache to avoid forcing memory reference instructions to have very long latency.Unfortunately, big, long-running, scientific programs often have very large activedata sets that are sometimes accessed with low locality, yielding poor perfor-mance from the memory hierarchy This problem could be overcome by not cach-ing these structures if it were possible to determine the memory-access patternsand pipeline the memory accesses efficiently Novel cache architectures and com-piler assistance through blocking and prefetching are decreasing these memoryhierarchy problems, but they continue to be serious in some applications
Vector processors provide high-level operations that work on vectors—linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element vector result. The vector instruction is equivalent to an entire loop, with each iteration computing one of the 64 elements of the result, updating the indices, and branching back to the beginning.
Vector instructions have several important properties that solve most of the problems mentioned above:
■ The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. Essentially, the absence of data hazards was determined by the compiler or by the programmer when she decided that a vector instruction could be used.

■ A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Thus, the instruction bandwidth requirement is reduced, and the Flynn bottleneck is considerably mitigated.

■ Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well (as we saw in section 5.6; a sketch of the bank mapping follows this list). The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than to a single word. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.

■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
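For instance, here is a tiny C sketch (ours; NBANKS = 16 is an assumed figure, not a DLXV parameter) of the word-interleaved address mapping such a banked memory uses:

    #include <stdio.h>

    /* Word-interleaved memory: element i of a unit-stride vector of 8-byte
       words lands in bank (addr/8) mod NBANKS, so successive elements hit
       successive banks and a new access can start every clock even though
       each individual bank is slow. */
    enum { NBANKS = 16, WORDSIZE = 8 };

    int bank_of(unsigned long byte_addr) {
        return (int)((byte_addr / WORDSIZE) % NBANKS);
    }

    int main(void) {
        for (unsigned long i = 0; i < 4; i++)
            printf("element %lu -> bank %d\n", i, bank_of(0x1000 + i * WORDSIZE));
        return 0;
    }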
For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the applications domain can use them frequently.
As mentioned above, vector processors pipeline the operations on the individual elements of a vector. The pipeline includes not only the arithmetic operations (multiplication, addition, and so on), but also memory accesses and effective address calculations. In addition, most high-end vector processors allow multiple vector operations to be done at the same time, creating parallelism among the operations on different elements. In this appendix, we focus on vector processors that gain performance by pipelining and instruction overlap.

B.2 Basic Vector Architecture

A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. All functional units within the vector unit have a latency of several clock cycles. This allows a shorter clock cycle time and is compatible with long-running vector operations that can be deeply pipelined without generating hazards. Most vector processors allow the vectors to be dealt with as floating-point numbers, as integers, or as logical data. Here we will focus on floating point. The scalar unit is basically no different from the type of advanced pipelined CPU discussed in Chapter 3.
There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations—except load and store—are among the vector registers. These architectures are the vector counterpart of a load-store architecture. All major vector computers shipped since the late 1980s use a vector-register architecture; these include the Cray Research processors (CRAY-1, CRAY-2, X-MP, Y-MP, and C-90), the Japanese supercomputers (NEC SX/2 and SX/3, Fujitsu VP200 and VP400, and the Hitachi S820), as well as the mini-supercomputers (Convex C-1 and C-2). In a memory-memory vector processor, all vector operations are memory to memory. The first vector computers were of this type, as were CDC's vector computers. From this point on we will focus on vector-register architectures only; we will briefly return to memory-memory vector architectures at the end of the appendix (section B.7) to discuss why they have not been as successful as vector-register architectures.
We begin with a vector-register processor consisting of the primary components shown in Figure B.1. This processor, which is loosely based on the CRAY-1, is the foundation for discussion throughout most of this appendix. We will call it DLXV; its integer portion is DLX, and its vector portion is the logical vector extension of DLX. The rest of this section examines how the basic architecture of DLXV relates to other processors.
The primary components of the instruction set architecture of DLXV are
■ Vector registers—Each vector register is a fixed-length bank holding a single vector. DLXV has eight vector registers, and each vector register holds 64 elements. Each vector register must have at least two read ports and one write port in DLXV. This will allow a high degree of overlap among vector operations to different vector registers. (We do not consider the problem of a shortage of vector-register ports. In real machines this would result in a structural hazard.) The read and write ports, which total at least 16 read ports and eight write ports, are connected to the functional unit inputs or outputs by a pair of crossbars. (The CRAY-1 manages to implement the register file with only a single port per register using some clever implementation techniques.)
FIGURE B.1 The basic structure of a vector-register architecture, DLXV. This processor has a scalar architecture just like DLX. There are also eight 64-element vector registers, and all the functional units are vector functional units. Special vector instructions are defined both for arithmetic and for memory accesses. We show vector units for logical and integer operations. These are included so that DLXV looks like a standard vector processor, which usually includes these units. However, we will not be discussing these units except in the Exercises. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. These ports are connected to the inputs and outputs of the vector functional units by a set of crossbars (shown in thick gray lines). In section B.5 we add chaining, which will require additional interconnect capability.
■ Vector functional units—Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional units (structural hazards) and from conflicts for register accesses (data hazards). DLXV has five functional units, as shown in Figure B.1. For simplicity, we will focus exclusively on the floating-point functional units. Depending on the vector processor, scalar operations either use the vector functional units or use a dedicated set. We assume the functional units are shared, but again, for simplicity, we ignore potential conflicts.

■ Vector load-store unit—This is a vector memory unit that loads or stores a vector to or from memory. The DLXV vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency. This unit would also normally handle scalar loads and stores.

■ A set of scalar registers—Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit. These are the normal 32 general-purpose registers and 32 floating-point registers of DLX, though more read and write ports are needed. The scalar registers are also connected to the functional units by the pair of crossbars.

Figure B.2 shows the characteristics of some typical vector processors, including the size and count of the registers, the number and types of functional units, and the number of load-store units.
In DLXV, vector operations use the same names as DLX operations, but with the letter "V" appended. These are double-precision, floating-point vector operations. (We have omitted single-precision FP operations and integer and logical operations for simplicity.) Thus, ADDV is an add of two double-precision vectors. The vector instructions take as their input either a pair of vector registers (ADDV) or a vector register and a scalar register, designated by appending "SV" (ADDSV). In the latter case, the value in the scalar register is used as the input for all operations—the operation ADDSV will add the contents of a scalar register to each element in a vector register. Most vector operations have a vector destination register, though a few (population count) produce a scalar value, which is stored to a scalar register. The names LV and SV denote vector load and vector store, and they load or store an entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other operand, which is a DLX general-purpose register, is the starting address of the vector in memory. Figure B.3 lists the DLXV vector instructions. In addition to the vector registers, we need two additional special-purpose registers: the vector-length and vector-mask registers. We will discuss these registers and their purpose in sections B.3 and B.5, respectively.
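As a rough C sketch of these semantics (ours, with hypothetical helper names), ADDSV and LV are shorthand for elementwise loops like the following:

    #define VLEN 64   /* elements in one DLXV vector register */

    /* ADDSV V2,F0,V1: add the scalar in F0 to each element of V1. */
    void addsv(double v2[VLEN], double f0, const double v1[VLEN]) {
        for (int i = 0; i < VLEN; i++)
            v2[i] = f0 + v1[i];
    }

    /* LV V1,R1: load 64 adjacent double-precision words starting at R1. */
    void lv(double v1[VLEN], const double *r1) {
        for (int i = 0; i < VLEN; i++)
            v1[i] = r1[i];
    }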
Processor | Year announced | Clock rate (MHz) | Registers | Elements per register (64-bit elements) | Functional units | Load-store units
CRAY-1 | 1976 | 80 | 8 | 64 | 6: add, multiply, reciprocal, integer add, logical, shift | 1
CRAY X-MP / CRAY Y-MP | 1983 / 1988 | 120 / 166 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store
CRAY-2 | 1985 | 166 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer (add, shift, population count), logical | 1
Cray C-90 | 1991 | 240 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 4
Convex C-4 | 1994 | 135 | 16 | 128 | 3: each is full integer, logical, and FP (including multiply-add) |
NEC SX/4 | 1995 | 400 | 8 + 8192 | 256 variable | 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift | 8

FIGURE B.2 Characteristics of several vector-register architectures. The vector functional units include all operation units used by the vector instructions. The functional units are floating point unless stated otherwise. If the processor is a multiprocessor, the entries correspond to the characteristics of one processor. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The Fujitsu VP200's vector registers are configurable: The size and count of the 8 K 64-bit entries may be varied inversely to one another (e.g., eight registers each 1 K elements long, or 128 registers each 64 elements long). The NEC SX/2 has eight fixed registers of length 256, plus 8 K of configurable 64-bit registers. The reciprocal unit on the CRAY processors is used to do division (and square root on the CRAY-2). Add pipelines perform floating-point add and subtract. The multiply/divide–add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, just like DLX, and several of the processors use the same units for FP scalar and FP vector operations. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units.
A vector processor is best understood by looking at a vector loop on DLXV. Let's take a typical vector problem, which will be used throughout this appendix:

Y = a × X + Y

X and Y are vectors, initially resident in memory, and a is a scalar. This is the so-called SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark. (SAXPY stands for single-precision a × X plus Y; DAXPY for double-precision a × X plus Y.) Linpack is a collection of linear algebra routines, and the routines for performing Gaussian elimination constitute what is known as the Linpack benchmark.
Instruction | Operands | Function
LV | V1,R1 | Load vector register V1 from memory starting at address R1
SV | R1,V1 | Store vector register V1 into memory starting at address R1
LVWS | V1,(R1,R2) | Load V1 from address at R1 with stride in R2, i.e., R1 + i × R2
SVWS | (R1,R2),V1 | Store V1 to address at R1 with stride in R2, i.e., R1 + i × R2
LVI | V1,(R1+V2) | Load V1 with vector whose elements are at R1 + V2(i), i.e., V2 is an index
SVI | (R1+V2),V1 | Store V1 to vector whose elements are at R1 + V2(i), i.e., V2 is an index
CVI | V1,R1 | Create an index vector by storing the values 0, 1 × R1, 2 × R1, ..., 63 × R1 into V1
POP | R1,VM | Count the 1s in the vector-mask register and store the count in R1
MOVI2S | VLR,R1 | Move contents of R1 to the vector-length register
MOVS2I | R1,VLR | Move the contents of the vector-length register to R1
MOVF2S | VM,F0 | Move contents of F0 to the vector-mask register
MOVS2F | F0,VM | Move contents of the vector-mask register to F0

FIGURE B.3 The DLXV vector instructions. Only the double-precision FP operations are shown. In addition to the vector registers, there are two special registers, VLR (discussed in section B.3) and VM (discussed in section B.5). The operations with stride are explained in section B.3, and the use of the index creation and indexed load-store operations are explained in section B.5.
The DAXPY routine, which implements the above loop, represents a small fraction of the source code of the Linpack benchmark, but it accounts for most of the execution time for the benchmark.
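In C, the loop that the vector code in the following Example replaces is just a few lines (a sketch; 64 elements matches one DLXV vector register):

    #define N 64   /* one vector register's worth of elements */

    /* DAXPY: Y = a*X + Y, the heavily executed inner loop of Linpack. */
    void daxpy(double a, const double x[N], double y[N]) {
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }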
For now, let us assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. (This restriction will be lifted shortly.)

EXAMPLE Show the code for DLX and DLXV for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and Ry, respectively.
ANSWER Here is the DLX code.

            LD     F0,a          ;load scalar a
            ADDI   R4,Rx,#512    ;last address to load
    Loop:   LD     F2,0(Rx)      ;load X(i)
            MULTD  F2,F0,F2      ;a × X(i)
            LD     F4,0(Ry)      ;load Y(i)
            ADDD   F4,F2,F4      ;a × X(i) + Y(i)
            SD     0(Ry),F4      ;store into Y(i)
            ADDI   Rx,Rx,#8      ;increment index to X
            ADDI   Ry,Ry,#8      ;increment index to Y
            SUB    R20,R4,Rx     ;compute bound
            BNEZ   R20,Loop      ;check if done

Here is the code for DLXV for DAXPY.

            LD     F0,a          ;load scalar a
            LV     V1,Rx         ;load vector X
            MULTSV V2,F0,V1      ;vector-scalar multiply
            LV     V3,Ry         ;load vector Y
            ADDV   V4,V2,V3      ;add
            SV     Ry,V4         ;store the result
There are some interesting comparisons between the two code segments in this Example. The most dramatic is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only six instructions versus almost 600 for DLX. This reduction occurs both because the vector operations work on 64 elements and because the overhead instructions that constitute nearly half the loop on DLX are not present in the DLXV code.
Another important difference is the frequency of pipeline interlocks. In the straightforward DLX code every ADDD must wait for a MULTD, and every SD must wait for the ADDD. On the vector processor, each vector instruction operates on all the vector elements independently. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline-stall frequency on DLX will be about 64 times higher than it is on DLXV. The pipeline stalls can be eliminated on DLX by using software pipelining or loop unrolling (as we saw in Chapter 4). However, the large difference in instruction bandwidth cannot be reduced.
Vector Execution Time
The execution time of a sequence of vector operations primarily depends on three factors: the length of the vectors being operated on, structural hazards among the operations, and the data dependences. Given the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can compute the time for a single vector instruction. The initiation rate is usually one per clock cycle for individual operations. However, some supercomputers have vector instructions that can produce two or more results per clock, and others have units that may not be fully pipelined. For simplicity, we assume that initiation rates are one throughout this appendix. Thus, the execution time for a single vector instruction is approximately the vector length.
To simplify the discussion of vector execution and its timing, we will use the notion of a convoy, which is the set of vector instructions that could potentially begin execution together in one clock period. (Although the concept of a convoy is used in vector compilers, no standard terminology exists. Hence, we created the term convoy.) The instructions in a convoy must not contain any structural or data hazards (though we will relax this later); if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys. To keep the analysis simple, we assume that a convoy of instructions must complete execution before any other instructions (scalar or vector) can begin execution. We will relax this in section B.6 by using a less restrictive, but more complex, method for issuing instructions.
Accompanying the notion of a convoy is a timing metric, called a chime, that
can be used for estimating the performance of a vector sequence consisting ofconvoys A chime is an approximate measure of execution time for a vector se-quence; a chime measurement is independent of vector length Thus, a vector se-
quence that consists of m convoys executes in m chimes, and for a vector length
of n, this is approximately m × n clock cycles A chime approximation ignoressome processor-specific overheads, many of which are dependent on vectorlength Hence, measuring time in chimes is a better approximation for long vec-tors We will use the chime measurement, rather than clock cycles per result, toexplicitly indicate that certain overheads are being ignored
If we know the number of convoys in a vector sequence, we know the execution time in chimes. One source of overhead ignored in measuring chimes is any limitation on initiating multiple vector instructions in a clock cycle. If only one vector instruction can be initiated in a clock cycle (the reality in most vector processors), the chime count will underestimate the actual execution time of a convoy. Because the vector length is typically much greater than the number of instructions in the convoy, we will simply assume that the convoy executes in one chime.
execu-E X A M P L execu-E Show how the following code sequence lays out in convoys, assuming a
single copy of each vector functional unit:
LV     V1,Rx      ;load vector X
MULTSV V2,F0,V1   ;vector-scalar multiply
LV     V3,Ry      ;load vector Y
ADDV   V4,V2,V3   ;add
SV     Ry,V4      ;store the result
How many chimes will this vector sequence take? How many chimes per FLOP (floating-point operation) are needed?
A N S W E R   The first convoy is occupied by the first LV instruction. The MULTSV is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULTSV. The ADDV is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV, so it must go in a following convoy. This leads to the following layout of vector instructions into convoys:

1. LV
2. MULTSV  LV
3. ADDV
4. SV

The sequence requires four convoys and hence takes four chimes. Note that although we allow the MULTSV and the LV both to execute in convoy 2, most vector machines will take two clock cycles to initiate the instructions. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of chimes per FLOP is 2. ■
The chime approximation is reasonably accurate for long vectors. For example, for 64-element vectors, the time in chimes is four, so the sequence would take about 256 clock cycles. The overhead of issuing convoy 2 in two separate clocks would be small.
Another source of overhead is far more significant than the issue limitation. The most important source of overhead ignored by the chime model is vector start-up time. The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used. The start-up time increases the effective time to execute a convoy to more than one chime. Because of our assumption that convoys do not overlap in time, the start-up time delays the execution of subsequent convoys. Of course the instructions in successive convoys either have structural conflicts for some functional unit or are data dependent, so the assumption of no overlap is reasonable. The actual time to complete a convoy is determined by the sum of the vector length and the start-up time. If vector lengths were infinite, this start-up overhead would be amortized, but finite vector lengths expose it, as the following Example shows.

E X A M P L E   Assume the start-up overhead for functional units is shown in Figure B.4. Show the time that each convoy can begin and the total number of cycles needed. How does the time compare to the chime approximation for a vector of length 64?
A N S W E R   Figure B.5 provides the answer in convoys, assuming that the vector length is n:

Unit                  Start-up overhead
Load and store unit   12 cycles
Multiply unit          7 cycles
Add unit               6 cycles

FIGURE B.4 Start-up overhead.

Convoy         Starting time   First-result time   Last-result time
1. LV          0               12                  11 + n
2. MULTSV LV   12 + n          12 + n + 12         23 + 2n
3. ADDV        24 + 2n         24 + 2n + 6         29 + 3n
4. SV          30 + 3n         30 + 3n + 12        41 + 4n

FIGURE B.5 Starting times and first- and last-result times for convoys 1 through 4. The vector length is n.

One tricky question is when we assume the vector sequence is done; this determines whether the start-up time of the SV is visible or not. We assume that the instructions following cannot fit in the same convoy, and we
have already assumed that convoys do not overlap. Thus the total time is given by the time until the last vector instruction in the last convoy completes. This is an approximation, and the start-up time of the last vector instruction may be seen in some sequences and not in others. For simplicity, we always include it.

The time per result for a vector of length 64 is 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4. The execution time with start-up overhead is 1.16 times higher. ■
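The timing in Figure B.5 is easy to reproduce mechanically. The sketch below encodes our reading of the model (it is not code from the text): each convoy starts one clock after the previous convoy's last result, produces its first result after its largest start-up latency (12 for convoy 2, since the LV dominates the MULTSV), and then delivers one result per clock.

    def convoy_times(startups, n):
        """(start, first-result, last-result) clocks for each convoy."""
        start, rows = 0, []
        for s in startups:
            first = start + s          # pipeline fills for s clocks
            last = first + n - 1       # then one result per clock for n elements
            rows.append((start, first, last))
            start = last + 1           # next convoy begins after this one completes
        return rows

    for row in convoy_times([12, 12, 6, 12], n=64):   # LV; MULTSV+LV; ADDV; SV
        print(row)
    # (0, 12, 75), (76, 88, 151), (152, 158, 221), (222, 234, 297):
    # the last result arrives at clock 41 + 4*64 = 297, i.e., 4*64 + 42 = 298 cycles.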
For simplicity, we will use the chime approximation for running time, incorporating start-up time effects only when we want more detailed performance or to illustrate the benefits of some enhancement. For long vectors, a typical situation, the overhead effect is not that large. Later in the appendix we will explore ways to reduce start-up overhead.
situa-Start-up time for an instruction comes from the pipeline depth for the tional unit implementing that instruction If the initiation rate is to be kept at oneclock cycle per result, then
func-For example, if an operation takes 10 clock cycles, it must be pipelined 10 deep
to achieve an initiation rate of one per clock cycle Pipeline depth, then, is mined by the complexity of the operation and the clock cycle time of the proces-sor The pipeline depths of functional units vary widely—from two to 20 stages isnot uncommon—though the most heavily used units have pipeline depths of four
deter-to eight clock cycles
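The relationship is a ceiling division; a one-line helper (ours) makes the 10-cycle example concrete:

    import math

    def pipeline_depth(total_unit_time: float, clock_cycle_time: float) -> int:
        """Stages needed to sustain one result per clock cycle."""
        return math.ceil(total_unit_time / clock_cycle_time)

    print(pipeline_depth(10, 1))   # a 10-clock operation needs a 10-deep pipeline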
For DLXV, we will use the same pipeline depths as the CRAY-1, though more modern processors might have units with lower latency. All functional units are fully pipelined. As shown in Figure B.6, pipeline depths are six clock cycles for floating-point add and seven clock cycles for floating-point multiply. On DLXV, as on most vector processors, independent vector operations using different functional units can issue in the same convoy.

Operation         Start-up penalty
Vector add               6
Vector multiply          7
Vector divide           20
Vector load             12

FIGURE B.6 Start-up penalties on DLXV. These are the start-up penalties in clock cycles for DLXV vector operations.
Vector Load-Store Units and Vector Memory Systems
The behavior of the load-store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. Unlike simpler functional units, the initiation rate may not necessarily be one clock cycle.
Typically, penalties for start-ups on load-store units are higher than those for arithmetic functional units—up to 50 clock cycles on some processors. For DLXV we will assume a start-up time of 12 clock cycles; by comparison, the CRAY-1 and CRAY X-MP have load-store start-up times of between nine and 17 clock cycles. Figure B.6 summarizes the start-up penalties for DLXV vector operations.
To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by creating multiple memory banks, as discussed in section 5.6. As we will see in the next section, having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data. Most vector processors use memory banks rather than simple interleaving for two primary reasons:
1. Many vector computers support multiple loads or stores per clock. To support multiple simultaneous accesses, the memory system needs to have multiple banks and be able to control the addresses to the banks independently.

2. As we will see in the next section, many vector processors support the ability to load or store data words that are not sequential. In such cases, independent bank addressing, rather than interleaving, is required.
In Chapter 5 we saw that the desired access rate and the bank access time determined how many banks were needed to access a memory without a stall. The next Example shows how these timings work out in a vector processor.

E X A M P L E   Suppose we want to fetch a vector of 64 elements starting at byte address 136, and a memory access takes six clocks. How many memory banks must we have? With what addresses are the banks accessed? When will the various elements arrive at the CPU?
A N S W E R   Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. Figure B.7 shows what byte addresses each bank accesses within each time period. Remember that a bank begins a new access as soon as it has completed the old access.
Figure B.8 shows the timing for the first few sets of accesses for an eight-bank system with a six-clock-cycle access latency. There are two important observations about Figures B.7 and B.8: First, notice that the exact address fetched by a bank is largely determined by the lower-order bits in the bank number; however, the initial access to a bank is always within eight double words of the starting address. Second, notice that once the initial latency is overcome (six clocks in this case), the pattern is to access a bank every n clock cycles, where n is the total number of banks (n = 8 in this case). ■
mem-a word in the smem-ame block of eight distinguishes this type of memory system from interleaved memory Normally, interleaved memory systems combine the bank ad- dress and the base starting address by concatenation rather than addition Also, interleaved memories are almost always implemented with synchronized access Memory banks require address latches for each bank, which are not normally needed in a system with only interleaving This timing diagram is drawn as if all banks access in clock 0, clock 16, etc In practice, since the bus allocations needed
to return the words are staggered, the actual accesses are often staggered.
FIGURE B.8 Access timing for the first 64 double-precision words of the load. After the six-clock-cycle initial latency, eight double-precision words are returned every eight clock cycles.
[Timing diagram: the first memory access occupies clocks 0–6; thereafter each eight-word group is delivered while the next access proceeds, at clocks 6, 14, 22, ..., 62, with the final eight words delivered by clock 70.]
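The access pattern in Figures B.7 and B.8 can be reproduced with a small simulation. This sketch encodes our reading of the figures (bank selected by the low-order double-word address bits, one word returned per clock on the bus); it is not code from the text:

    # 8 banks, 6-clock access time, 64 double words starting at byte address 136.
    NUM_BANKS, ACCESS, START, N = 8, 6, 136, 64

    bank_free = [0] * NUM_BANKS     # clock at which each bank can start a new access
    bus_free = 0                    # clock after the last word delivered on the bus
    for i in range(N):
        addr = START + 8 * i                  # byte address of element i
        bank = (addr // 8) % NUM_BANKS        # low-order double-word bits pick the bank
        ready = bank_free[bank] + ACCESS      # word is available after the access time
        deliver = max(ready, bus_free)        # but only one word per clock on the bus
        bank_free[bank] = deliver             # bank then starts its next access
        bus_free = deliver + 1
    print(deliver + 1)   # 70 clocks in all: 6-cycle latency, then ~1 word per clock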
The number of banks in the memory system and the pipeline depth in the functional units are essentially counterparts, since they determine the initiation rates for operations using these units. The processor cannot access a memory bank faster than the memory cycle time. Thus, if memory is built from DRAM, where the memory cycle time is about twice the access time, the processor needs twice as many banks as the above Example shows. For memory systems that support multiple simultaneous vector accesses or allow nonsequential accesses in vector loads or stores, the number of memory banks should be larger than the minimum; otherwise, memory bank conflicts will exist. We explore this in more detail in the next section.
B.3 Two Real-World Issues: Vector Length and Stride

This section deals with two issues that arise in real programs: What do you do when the vector length in a program is not exactly 64? How do you deal with nonadjacent elements in vectors that reside in memory? First, let's consider the issue of vector length.
Vector-Length Control
A vector-register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for DLXV, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:

      do 10 i = 1,n
10    Y(i) = a * X(i) + Y(i)

The size of all the vector operations depends on n, which may not even be known until runtime! The value of n might also be a parameter to a procedure containing the above loop and therefore be subject to change during execution.
The solution to these problems is to create a vector-length register (VLR). The VLR controls the length of any vector operation, including a vector load or store. The value in the VLR, however, cannot be greater than the length of the vector registers. This solves our problem as long as the real length is less than the maximum vector length (MVL) defined by the processor.
What if the value of n is not known at compile time, and thus may be greater than MVL? To tackle this second problem, where the vector is longer than the maximum length, a technique called strip mining is used. Strip mining is the generation of code such that each vector operation is done for a size less than or equal to the MVL. We could strip-mine the loop in the same manner that we unrolled loops in Chapter 4: Create one loop that handles any number of iterations that is a multiple of MVL and another loop that handles any remaining iterations, which must be less than MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to handle both portions by changing the length. The strip-mined version of the DAXPY loop written in FORTRAN, the major language used for scientific applications, is shown with C-style comments:
un-low = 1
VL = (n mod MVL) /*find the odd size piece*/
do 1 j = 0,(n / MVL) /*outer loop*/
do 10 i = low, low+VL-1 /*runs for length VL*/
Y(i) = a*X(i) + Y(i) /*main operation*/
low = low+VL /*start of next vector*/
VL = MVL /*reset the length to max*/
The term n/MVL represents truncating integer division (which is what FORTRAN does) and is used throughout this section. The effect of this loop is to block the vector into segments, which are then processed by the inner loop. The length of the first segment is (n mod MVL), and all subsequent segments are of length MVL. This is depicted in Figure B.9.

The inner loop of the code above is vectorizable with length VL, which is equal to either (n mod MVL) or MVL. The VLR register must be set twice—once at each place where the variable VL in the code is assigned. With multiple vector operations executing in parallel, the hardware must copy the value of VLR when a vector operation issues, in case VLR is changed for a subsequent vector operation.
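The same blocking logic can be sketched in Python (0-indexed, unlike the FORTRAN above; the helper name is ours):

    def strip_mine(n: int, mvl: int = 64):
        """Return (start, end) index pairs for each strip-mined segment."""
        low, vl = 0, n % mvl                 # odd-size piece first
        segments = []
        for _ in range(n // mvl + 1):        # outer loop runs n/MVL + 1 times
            if vl:                           # skip an empty first piece
                segments.append((low, low + vl - 1))
            low += vl
            vl = mvl                         # all later pieces use the full MVL
        return segments

    print(strip_mine(200))   # [(0, 7), (8, 71), (72, 135), (136, 199)]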
FIGURE B.9 A vector of arbitrary length processed with strip mining. All blocks but the first are of length MVL, utilizing the full power of the vector processor. In this figure, the variable m is used for the expression (n mod MVL). [Diagram: as j runs 0, 1, 2, ..., n/MVL, the range of i covers 1 to m, then m + 1 to m + MVL, then m + MVL + 1 to m + 2 × MVL, and so on up to n.]
In addition to the start-up overhead, we need to account for the overhead of executing the strip-mined loop. This strip-mining overhead, which arises from the need to reinitiate the vector sequence and set the VLR, effectively adds to the vector start-up time, assuming that a convoy does not overlap with other instructions. If that overhead for a convoy is 10 cycles, then the effective overhead per 64 elements increases by 10 cycles, or 0.15 cycles per element.
There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:

1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.

2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.

There may also be a fixed overhead associated with setting up the vector sequence the first time. In recent vector processors this overhead has become quite small, so we ignore it.
The components can be used to state the total running time for a vector sequence operating on a vector of length n, which we will call Tn:

    Tn = ⌈n/MVL⌉ × (Tloop + Tstart) + n × Tchime

The values of Tstart, Tloop, and Tchime are compiler and processor dependent. The register allocation and scheduling of the instructions affect both what goes in a convoy and the start-up overhead of each convoy.
For simplicity, we will use a constant value for Tloop on DLXV. Based on a variety of measurements of CRAY-1 vector execution, the value chosen is 15 for Tloop. At first glance, you might think that this value is too small. The overhead in each loop requires setting up the vector starting addresses and the strides, incrementing counters, and executing a loop branch. In practice, these scalar instructions can be totally or partially overlapped with the vector instructions, minimizing the time spent on these overhead functions. The value of Tloop of course depends on the loop structure, but the dependence is slight compared with the connection between the vector code and the values of Tchime and Tstart.
E X A M P L E What is the execution time on DLXV for the vector operation A = B × s,
where s is a scalar and the length of the vectors A and B is 200?
A N S W E R   Assume the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for DLX (and DLXV) R0 always holds 0. Since (200 mod 64) = 8, the first iteration of the strip-mined loop will execute for a vector length of eight elements, and the following iterations will execute for a vector length of 64 elements. The starting byte address of the next segment of each vector is eight times the vector length. Since the vector length is either eight or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for the later segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600. Here is the actual code:

        ADDI   R2,R0,#1600  ;total # bytes in vector
        ADD    R2,R2,Ra     ;address of the end of A vector
        ADDI   R1,R0,#8     ;loads length of 1st segment
        MOVI2S VLR,R1       ;load vector length in VLR
        ADDI   R1,R0,#64    ;length in bytes of 1st segment
        ADDI   R3,R0,#64    ;vector length of other segments
Loop:   LV     V1,Rb        ;load B
        MULTSV V2,Fs,V1     ;vector * scalar
        SV     Ra,V2        ;store A
        ADD    Ra,Ra,R1     ;address of next segment of A
        ADD    Rb,Rb,R1     ;address of next segment of B
        ADDI   R1,R0,#512   ;load byte offset of next segment
        MOVI2S VLR,R3       ;set length to 64 elements
        SUB    R4,R2,Ra     ;at the end of A?
        BNEZ   R4,Loop      ;if not, go back
The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3. Let's use our basic formula:

    T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × Tchime
    T200 = 4 × (15 + Tstart) + 200 × 3
    T200 = 60 + (4 × Tstart) + 600 = 660 + (4 × Tstart)

The value of Tstart is the sum of

■ The vector load start-up of 12 clock cycles

■ A seven-clock-cycle start-up for the multiply

■ A 12-clock-cycle start-up for the store.

Thus, the value of Tstart is given by

    Tstart = 12 + 7 + 12 = 31

So the overall value becomes

    T200 = 660 + 4 × 31 = 784
The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three. In section B.6, we will be more ambitious—allowing overlapping of separate convoys. ■
Figure B.10 shows the overhead and effective rates per element for the above example (A = B × s) with various vector lengths. A chime counting model would lead to three clock cycles per element, while the two sources of overhead add 0.9 clock cycles per element in the limit.
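Sweeping the same formula over n makes the sawtooth in Figure B.10 visible numerically (a sketch using the constants from the Example above):

    import math

    # Tn = ceil(n/MVL) * (Tloop + Tstart) + n * Tchime, with Tloop = 15,
    # Tstart = 31, Tchime = 3. Per-element time jumps whenever n crosses a
    # multiple of 64, since another strip-mined block must be executed.
    for n in (8, 64, 65, 128, 129, 200, 256):
        t_n = math.ceil(n / 64) * (15 + 31) + n * 3
        print(f"n = {n:3d}: {t_n / n:.2f} clocks per element")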
The next few sections introduce enhancements that reduce this time. We will see how to reduce the number of convoys and hence the number of chimes using a technique called chaining. The loop overhead can be reduced by further overlapping the execution of vector and scalar instructions, allowing the scalar loop overhead in one iteration to be executed while the vector instructions in the previous iteration are completing. Finally, the vector start-up overhead can also be eliminated, using a technique that allows overlap of vector instructions in separate convoys.
FIGURE B.10 Total execution time per element and total overhead time per element, versus the vector length, for the Example on page B-17. For short vectors the total start-up time is more than one-half of the total time, while for long vectors it reduces to about one-third of the total time. The sudden jumps occur when the vector length crosses a multiple of 64, forcing another iteration of the strip-mining code and execution of a set of vector instructions. These operations increase Tn by Tloop + Tstart.
Vector Stride
The second problem this section addresses is that the position in memory of adjacent elements in a vector may not be sequential. Consider the straightforward code for matrix multiply:

      do 10 i = 1,100
         do 10 j = 1,100
            A(i,j) = 0.0
            do 10 k = 1,100
10          A(i,j) = A(i,j) + B(i,k) * C(k,j)

At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C and strip-mine the inner loop with k as the index variable.
To do so, we must consider how adjacent elements in B and adjacent elements in C are addressed. As we discussed in section 5.3, when an array is allocated memory it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, if the above loop were written in FORTRAN, which allocates column-major order, the elements of B that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry), for a total of 800 bytes. In Chapter 5, we saw that blocking could be used to improve the locality in cache-based systems. In vector processors we do not have caches, so we need another technique to fetch elements of a vector that are not adjacent in memory.
This distance separating elements that are to be gathered into a single register is called the stride. In the current example, using column-major layout for the matrices means that matrix C has a stride of 1, or 1 double word (8 bytes), separating successive elements, and matrix B has a stride of 100, or 100 double words (800 bytes).
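A short sketch makes the arithmetic concrete, assuming 1-indexed, column-major 100 × 100 matrices of 8-byte double words as described above (the helper is ours):

    ROWS, WORD = 100, 8   # rows per column, bytes per double word

    def byte_offset(i, k):
        """Byte offset of element (i, k), 1-indexed, column-major layout."""
        return ((k - 1) * ROWS + (i - 1)) * WORD

    # Inner loop over k: B(i,k) -> B(i,k+1) jumps a whole column...
    print(byte_offset(1, 2) - byte_offset(1, 1))   # 800 bytes: stride of 100 words
    # ...while C(k,j) -> C(k+1,j) is the next word in memory.
    print(byte_offset(2, 1) - byte_offset(1, 1))   # 8 bytes: unit stride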
Once a vector is loaded into a vector register it acts as if it had logically adjacent elements. Thus a vector-register processor can handle strides greater than one, called nonunit strides, using only vector-load and vector-store operations with stride capability. This ability to access nonsequential memory locations and to reshape them into a dense structure is one of the major advantages of a vector processor over a cache-based processor. Caches inherently deal with unit stride data, so that while increasing block size can help reduce miss rates for large scientific data sets, increasing block size can have a negative effect for data that is accessed with nonunit stride. While blocking techniques can solve some of these problems (see section 5.3), the ability to efficiently access data that is not contiguous remains an advantage for vector processors on certain problems.
On DLXV, where the addressable unit is a byte, the stride for our example would be 800. The value must be computed dynamically, since the size of the matrix may not be known at compile time, or—just like vector length—may change for different executions of the same statement. The vector stride, like the vector starting address, can be put in a general-purpose register. Then the DLXV instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required.
Complications in the memory system can occur from supporting strides greater than one. In Chapter 5 we saw that memory accesses could proceed at full speed if the number of memory banks was at least as large as the memory-access time in clock cycles. Once nonunit strides are introduced, however, it becomes possible to request accesses from the same bank at a higher rate than the memory-access time. When multiple accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if

    Least common multiple (Stride, Number of banks) / Stride < Memory-access latency
E X A M P L E   Suppose we have 16 memory banks with a read latency of 12 clocks. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32?

A N S W E R   Since the number of banks is larger than the read latency, for a stride of 1 the load will take 12 + 64 = 76 clock cycles, or 1.2 clocks per element. The worst possible stride is a value that is a multiple of the number of memory banks, as in this case with a stride of 32 and 16 memory banks. Every access to memory will collide with the previous one. This leads to a read latency of 12 clock cycles per element and a total time for the vector load of 12 × 64 = 768 clock cycles. ■
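This Example can be checked with a small simulation of the banks. The model below is ours (one new request may issue per clock, and a request stalls while its bank is still busy); its stride-1 total differs from the text's rounded count by one clock of issue overhead:

    def banked_load_time(stride, n=64, banks=16, latency=12):
        """Clock at which the last element of the load arrives."""
        bank_free = [0] * banks      # clock at which each bank can start an access
        issue = 0                    # earliest clock for the next new request
        last = 0
        for i in range(n):
            bank = (i * stride) % banks          # bank holding element i
            start = max(issue, bank_free[bank])  # stall if the bank is still busy
            bank_free[bank] = start + latency    # bank is tied up for the full access
            issue = start + 1                    # at most one new request per clock
            last = start + latency               # arrival time of this element
        return last

    print(banked_load_time(stride=1))    # 75: 12-clock latency, then one word per clock
    print(banked_load_time(stride=32))   # 768: every access collides, 12 clocks each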
Memory bank conflicts will not occur if the stride and number of banks are relatively prime with respect to each other and there are enough banks to avoid conflicts in the unit-stride case. When there are no bank conflicts, multiword and unit strides run at the same rates. Increasing the number of memory banks to a number greater than the minimum to prevent stalls with a stride of length 1 will decrease the stall frequency for some other strides. For example, with 64 banks, a stride of 32 will stall on every other access, rather than every access. If we originally had a stride of 8 and 16 banks, every other access would stall; while with 64 banks, a stride of 8 will stall on every eighth access. If we have multiple memory pipelines, we will also need more banks to prevent conflicts. In 1995, most vector supercomputers have at least 64 banks, and some have as many as 1024 in the maximum memory configuration. Because bank conflicts can still occur in nonunit stride cases, many programmers favor unit stride accesses whenever possible.
supercom-Least common multiple (Stride, Number of banks)
Stride - < Memory-access latency
Trang 39B-22 Appendix B Vector Processors
B.4 Effectiveness of Compiler Vectorization

Two factors affect the success with which a program can be run in vector mode. The first factor is the structure of the program itself: Do the loops have true data dependences, or can they be restructured so as not to have such dependences? This factor is influenced by the algorithms chosen and, to some extent, by how they are coded. The second factor is the capability of the compiler. While no compiler can vectorize a loop where no parallelism among the loop iterations exists, there is tremendous variation in the ability of compilers to determine whether a loop can be vectorized. The techniques used to vectorize programs are the same as those discussed in Chapter 4 for uncovering ILP; here we simply review how well these techniques work.
As an indication of the level of vectorization that can be achieved in scientific programs, let's look at the vectorization levels observed for the Perfect Club benchmarks, mentioned in Chapter 1. These benchmarks are large, real scientific applications. Figure B.11 shows the percentage of floating-point operations in each benchmark and the percentage executed in vector mode on the CRAY X-MP. The wide variation in level of vectorization has been observed by several studies of the performance of applications on vector processors. While better compilers might improve the level of vectorization in some of these programs, most will require rewriting to achieve significant increases in vectorization. For example, a new program or a significant rewrite will be needed to obtain the benefits of a vector processor on SPICE.

FIGURE B.11 Level of vectorization among the Perfect Club benchmarks on the CRAY X-MP. (The table itself, giving the FP operations in each benchmark and the percentage executed in vector mode, is not reproduced here.)
There is also tremendous variation in how well compilers do in vectorizing programs. As a summary of the state of vectorizing compilers, consider the data in Figure B.12, which shows the extent of vectorization for different processors using a test suite of 100 hand-written FORTRAN kernels. The kernels were designed to test vectorization capability and can all be vectorized by hand; we will see several examples of these loops in the Exercises.
B.5 Enhancing Vector Performance

Three techniques for improving the performance of vector processors are discussed in this section. The first deals with making a sequence of dependent vector operations run faster. The other two deal with expanding the class of loops that can be run in vector mode. The first technique, chaining, originated in the CRAY-1 but is now supported on most vector processors. The techniques discussed in the second and third parts of this section combat the effects of conditional execution and sparse matrices. The extensions are taken from a variety of processors including the most recent supercomputers.
[Table not reproduced: for each processor and compiler, the number of the 100 kernels completely vectorized, partially vectorized, and not vectorized.]

FIGURE B.12 Result of applying vectorizing compilers to the 100 FORTRAN test kernels. For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the CRAY X-MP show the large dependence on compiler technology.