
Appendix A Computer Arithmetic

and run with a cycle time of about 40 nanoseconds. However, as we will see, they use quite different algorithms. The Weitek chip is well described in Birman et al. [1990], the MIPS chip is described in less detail in Rowen, Johnson, and Ries [1988], and details of the TI chip can be found in Darley et al. [1989].

These three chips have a number of things in common. They perform addition and multiplication in parallel, and they implement neither extended precision nor a remainder step operation. (Recall from section A.6 that it is easy to implement the IEEE remainder function in software if a remainder step instruction is available.) The designers of these chips probably decided not to provide extended precision because the most influential users are those who run portable codes, which can't rely on extended precision. However, as we have seen, extended precision can make for faster and simpler math libraries.

In the summary of the three chips given in Figure A.36, note that a higher transistor count generally leads to smaller cycle counts. Comparing the cycles/op numbers needs to be done carefully, because the figures for the MIPS chip are those for a complete system (R3000/3010 pair), while the Weitek and TI numbers are for stand-alone chips and are usually larger when used in a complete system. The MIPS chip has the fewest transistors of the three. This is reflected in the fact that it is the only chip of the three that does not have any pipelining or hardware square root. Further, the multiplication and addition operations are not completely independent, because they share the carry-propagate adder that performs the final rounding (as well as the rounding logic).

Addition on the R3010 uses a mixture of ripple, CLA, and carry select. A carry-select adder is used in the fashion of Figure A.20 (page A-45). Within each half, carries are propagated using a hybrid ripple-CLA scheme of the type indicated in Figure A.18 (page A-43). However, this is further tuned by varying the size of each block, rather than having each fixed at 4 bits (as they are in Figure A.18). The multiplier is midway between the designs of Figures A.2 (page A-4) and A.27 (page A-53). It has an array just large enough so that output can be fed back into the input without having to be clocked. Also, it uses radix-4 Booth recoding and the even-odd technique of Figure A.29 (page A-55). The R3010 can do a divide and multiply in parallel (like the Weitek chip but unlike the TI chip). The divider is a radix-4 SRT method with quotient digits −2, −1, 0, 1, and 2, and is similar to that described in Taylor [1985]. Double-precision division is about four times slower than multiplication. The R3010 shows that for chips using an O(n) multiplier, an SRT divider can operate fast enough to keep a reasonable ratio between multiply and divide.
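Radix-4 Booth recoding is compact enough to sketch in software. The following C fragment is our own illustration of the recoding idea, not the R3010's circuit; the function and table names are invented, and the multiplier is assumed nonnegative with an even bit count:

    /* Radix-4 Booth recoding: scan the multiplier two bits per step with
       one bit of overlap; each triplet a(i+1)a(i)a(i-1) selects a multiple
       of b in {-2,-1,0,+1,+2}, halving the number of partial products. */
    static const int booth4[8] = { 0, 1, 1, 2, -2, -1, -1, 0 };

    long booth4_multiply(unsigned long a, long b, int nbits)
    {
        long product = 0;
        int prev = 0;                          /* implicit bit a(-1) = 0 */
        for (int i = 0; i < nbits; i += 2) {
            int triplet = (int)(((a >> i) & 3) << 1) | prev;
            product += (long)booth4[triplet] * (b << i); /* digit * b * 4^(i/2) */
            prev = (int)(a >> (i + 1)) & 1;
        }
        return product;
    }

Each iteration retires two multiplier bits, which is why the recoding halves the height of the partial-product array.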

The Weitek 3364 has independent add, multiply, and divide units. It also uses radix-4 SRT division. However, the add and multiply operations on the Weitek chip are pipelined. The three addition stages are (1) exponent compare, (2) add followed by shift (or vice versa), and (3) final rounding. Stages (1) and (3) take only a half-cycle, allowing the whole operation to be done in two cycles, even though there are three pipeline stages. The multiplier uses an array of the style of Figure A.28 but uses radix-8 Booth recoding, which means it must compute 3 times the multiplier. The three multiplier pipeline stages are (1) compute 3b, (2) pass through array, and (3) final carry-propagation add and round. Single precision passes through the array once, double precision twice. Like addition, the latency is two cycles.

The Weitek chip uses an interesting addition algorithm. It is a variant on the carry-skip adder pictured in Figure A.19 (page A-44). However, P_ij, which is the logical AND of many terms, is computed by rippling, performing one AND per ripple. Thus, while the carries propagate left within a block, the value of P_ij is propagating right within the next block, and the block sizes are chosen so that both waves complete at the same time. Unlike the MIPS chip, the 3364 has hardware square root, which shares the divide hardware. The ratio of double-precision multiply to divide is 2:17. The large disparity between multiply and divide is due to the fact that multiplication uses radix-8 Booth recoding, while division uses a radix-4 method. In the MIPS R3010, multiplication and division use the same radix.

The notable feature of the TI 8847 is that it does division by iteration (using the Goldschmidt algorithm discussed in section A.6). This improves the speed of division (the ratio of multiply to divide is 3:11), but means that multiplication and division cannot be done in parallel as on the other two chips. Addition has a two-stage pipeline. Exponent compare, fraction shift, and fraction addition are done in the first stage, normalization and rounding in the second stage. Multiplication uses a binary tree of signed-digit adders and has a three-stage pipeline. The first stage passes through the array, retiring half the bits; the second stage passes through the array a second time; and the third stage converts from signed-digit form to two's complement. Since there is only one array, a new multiply operation can only be initiated in every other cycle. However, by slowing down the clock, two passes through the array can be made in a single cycle. In this case, a new multiplication can be initiated in each cycle. The 8847 adder uses a carry-select algorithm rather than carry lookahead. As mentioned in section A.6, the TI carries 60 bits of precision in order to do correctly rounded division.
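The Goldschmidt iteration itself fits in a few lines of C. This is a sketch of the algorithm from section A.6, not the 8847's implementation; it assumes the divisor has been prescaled into the interval (0, 2), as a normalized significand is:

    /* Goldschmidt division: multiply numerator and denominator by the
       same factors so the denominator converges to 1; the scaled
       numerator then converges to a/b.  Only multiplies and subtracts
       are needed -- no divide. */
    double goldschmidt(double a, double b, int iterations)
    {
        double n = a, d = b;        /* requires 0 < d < 2 */
        for (int i = 0; i < iterations; i++) {
            double f = 2.0 - d;     /* next scaling factor */
            n *= f;
            d *= f;                 /* d -> 1 quadratically */
        }
        return n;                   /* n -> a/b */
    }

Because the error in d squares on each pass, a few iterations suffice, which is why the chip carries the extra precision mentioned above to round the result correctly.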

These three chips illustrate the different trade-offs made by designers with similar constraints. One of the most interesting things about these chips is the diversity of their algorithms. Each uses a different add algorithm, as well as a different multiply algorithm. In fact, Booth recoding is the only technique that is universally used by all the chips.



A.11 Fallacies and Pitfalls

Fallacy: Underflows rarely occur in actual floating-point application code.

Although most codes rarely underflow, there are actual codes that underflow frequently. SDRWAVE [Kahaner 1988], which solves a one-dimensional wave equation, is one such example. This program underflows quite frequently, even when functioning properly. Measurements on one machine show that adding hardware support for gradual underflow would cause SDRWAVE to run about 50% faster.

Fallacy: Conversions between integer and floating point are rare.

In fact, in Spice they are as frequent as divides. The assumption that conversions are rare leads to a mistake in the SPARC version 8 instruction set, which does not provide an instruction to move from integer registers to floating-point registers.

FIGURE A.37 Chip layout for the TI 8847, MIPS R3010, and Weitek 3364. In the left-hand columns are the photomicrographs; the right-hand columns show the corresponding floor plans.


Pitfall: Don’t increase the speed of a floating-point unit without increasing its memory bandwidth.

A typical use of a floating-point unit is to add two vectors to produce a third vector. If these vectors consist of double-precision numbers, then each floating-point add will use three operands of 64 bits each, or 24 bytes of memory. The memory bandwidth requirements are even greater if the floating-point unit can perform addition and multiplication in parallel (as most do).
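The arithmetic behind the pitfall fits in one line of C (a sketch; the operation rate is a hypothetical parameter):

    /* Each double-precision vector add touches three 8-byte operands --
       two reads and one write -- so it generates 24 bytes of memory
       traffic per floating-point operation. */
    double required_bytes_per_second(double adds_per_second)
    {
        return adds_per_second * 3 * 8;
    }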

Pitfall: −x is not the same as 0 − x.

This is a fine point in the IEEE standard that has tripped up some designers. Because floating-point numbers use the sign/magnitude system, there are two zeros, +0 and −0. The standard says that 0 − 0 = +0, whereas −(0) = −0. Thus −x is not the same as 0 − x when x = 0.
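A two-line C program makes the difference visible on any IEEE machine:

    #include <stdio.h>

    int main(void)
    {
        double x = 0.0;             /* +0 */
        printf("%g\n", -x);         /* prints -0, since -(0) = -0 */
        printf("%g\n", 0.0 - x);    /* prints 0, since 0 - 0 = +0 */
        return 0;
    }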

A.12 Historical Perspective and References

The earliest computers used fixed point rather than floating point. In "Preliminary Discussion of the Logical Design of an Electronic Computing Instrument," Burks, Goldstine, and von Neumann [1946] put it like this:

There appear to be two major purposes in a "floating" decimal point system both of which arise from the fact that the number of digits in a word is a constant fixed by design considerations for each particular machine. The first of these purposes is to retain in a sum or product as many significant digits as possible and the second of these is to free the human operator from the burden of estimating and inserting into a problem "scale factors" — multiplicative constants which serve to keep numbers within the limits of the machine.

There is, of course, no denying the fact that human time is consumed in arranging for the introduction of suitable scale factors. We only argue that the time so consumed is a very small percentage of the total time we will spend in preparing an interesting problem for our machine. The first advantage of the floating point is, we feel, somewhat illusory. In order to have such a floating point, one must waste memory capacity which could otherwise be used for carrying more digits per word. It would therefore seem to us not at all clear whether the modest advantages of a floating binary point offset the loss of memory capacity and the increased complexity of the arithmetic and control circuits.

This enables us to see things from the perspective of early computer designers, who believed that saving computer time and memory were more important than saving programmer time.


The original papers introducing the Wallace tree, Booth recoding, SRT division, overlapped triplets, and so on, are reprinted in Swartzlander [1990]. A good explanation of an early machine (the IBM 360/91) that used a pipelined Wallace tree, Booth recoding, and iterative division is in Anderson et al. [1967]. A discussion of the average time for single-bit SRT division is in Freiman [1961]; this is one of the few interesting historical papers that does not appear in Swartzlander.

The standard book of Mead and Conway [1980] discouraged the use of CLAs as not being cost effective in VLSI. The important paper by Brent and Kung [1982] helped combat that view. An example of a detailed layout for CLAs can be found in Ngai and Irwin [1985] or in Weste and Eshraghian [1993], and a more theoretical treatment is given by Leighton [1992]. Takagi, Yasuura, and Yajima [1985] provide a detailed description of a signed-digit tree multiplier.

Before the ascendancy of IEEE arithmetic, many different floating-point formats were in use. Three important ones were used by the IBM/370, the DEC VAX, and the Cray. Here is a brief summary of these older formats. The VAX format is closest to the IEEE standard. Its single-precision format (F format) is like IEEE single precision in that it has a hidden bit, 8 bits of exponent, and 23 bits of fraction. However, it does not have a sticky bit, which causes it to round halfway cases up instead of to even. The VAX has a slightly different exponent range from IEEE single: Emin is −128 rather than −126 as in IEEE, and Emax is 126 instead of 127. The main differences between VAX and IEEE are the lack of special values and gradual underflow. The VAX has a reserved operand, but it works like a signaling NaN: it traps whenever it is referenced. Originally, the VAX's double precision (D format) also had 8 bits of exponent. However, as this is too small for many applications, a G format was added; like the IEEE standard, this format has 11 bits of exponent. The VAX also has an H format, which is 128 bits long.

The IBM/370 floating-point format uses base 16 rather than base 2. This means it cannot use a hidden bit. In single precision, it has 7 bits of exponent and 24 bits (6 hex digits) of fraction. Thus, the largest representable number is 16^(2^7) = 2^(4 × 2^7) = 2^(2^9), compared with 2^(2^8) for IEEE. However, a number that is normalized in the hexadecimal sense only needs to have a nonzero leading digit. When interpreted in binary, the three most-significant bits could be zero. Thus, there are potentially fewer than 24 bits of significance. The reason for using the higher base was to minimize the amount of shifting required when adding floating-point numbers. However, this is less significant in current machines, where the floating-point add time is usually fixed independently of the operands. Another difference between 370 arithmetic and IEEE arithmetic is that the 370 has neither a round digit nor a sticky digit, which effectively means that it truncates rather than rounds. Thus, in many computations, the result will systematically be too small. Unlike the VAX and IEEE arithmetic, every bit pattern is a valid number. Thus, library routines must establish conventions for what to return in case of errors. In the IBM FORTRAN library, for example, √−4 returns 2!
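For illustration, here is a sketch of a decoder for the 370 single-precision format in C (our own; the function name is invented, and unnormalized inputs are simply decoded as they stand):

    #include <math.h>
    #include <stdint.h>

    /* IBM System/370 single precision: 1 sign bit, a 7-bit exponent
       biased by 64, and a 24-bit fraction with no hidden bit.
       Value = (-1)^s * 0.ffffff (hex fraction) * 16^(e - 64). */
    double ibm370_to_double(uint32_t w)
    {
        int    s = (w >> 31) & 1;
        int    e = (int)((w >> 24) & 0x7F) - 64;        /* unbias */
        double f = (double)(w & 0xFFFFFF) / 16777216.0; /* fraction / 2^24 */
        double v = f * ldexp(1.0, 4 * e);               /* 16^e = 2^(4e) */
        return s ? -v : v;
    }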

Arithmetic on Cray computers is interesting because it is driven by a motivation for the highest possible floating-point performance. It has a 15-bit exponent field and a 48-bit fraction field.

Addition on Cray computers does not have a guard digit, and multiplication is even less accurate than addition. Thinking of multiplication as a sum of p numbers, each 2p bits long, Cray computers drop the low-order bits of each summand. Thus, analyzing the exact error characteristics of the multiply operation is not easy. Reciprocals are computed using iteration, and division of a by b is done by multiplying a times 1/b. The errors in multiplication and reciprocation combine to make the last three bits of a divide operation unreliable. At least Cray computers serve to keep numerical analysts on their toes!

The IEEE standardization process began in 1977, inspired mainly by W. Kahan and based partly on Kahan's work with the IBM 7094 at the University of Toronto [Kahan 1968]. The standardization process was a lengthy affair, with gradual underflow causing the most controversy. (According to Cleve Moler, visitors to the U.S. were advised that the sights not to be missed were Las Vegas, the Grand Canyon, and the IEEE standards committee meeting.) The standard was finally approved in 1985. The Intel 8087 was the first major commercial IEEE implementation and appeared in 1981, before the standard was finalized. It contains features that were eliminated in the final standard, such as projective bits. According to Kahan, the length of double-extended precision was based on what could be implemented in the 8087. Although the IEEE standard was not based on any existing floating-point system, most of its features were present in some other system. For example, the CDC 6600 reserved special bit patterns for INDEFINITE and INFINITY, while the idea of denormal numbers appears in Goldberg [1967] as well as in Kahan [1968]. Kahan was awarded the 1989 Turing prize in recognition of his work on floating point.

Although floating point rarely attracts the interest of the general press, newspapers were filled with stories about floating-point division in November 1994. A bug in the division algorithm used on all of Intel's Pentium chips had just come to light. It was discovered by Thomas Nicely, a math professor at Lynchburg College in Virginia. Nicely found the bug when doing calculations involving reciprocals of prime numbers. News of Nicely's discovery first appeared in the press on the front page of the November 7 issue of Electronic Engineering Times. Intel's immediate response was to stonewall, asserting that the bug would only affect theoretical mathematicians. Intel told the press, "This doesn't even qualify as an errata. Even if you're an engineer, you're not going to see this."

Under more pressure, Intel issued a white paper, dated November 30, explaining why they didn't think the bug was significant. One of their arguments was based on the fact that if you pick two floating-point numbers at random and divide one into the other, the chance that the resulting quotient will be in error is about 1 in 9 billion. However, Intel neglected to explain why they thought that the typical customer accessed floating-point numbers randomly.

Pressure continued to mount on Intel. One sore point was that Intel had known about the bug before Nicely discovered it, but had decided not to make it public. Finally, on December 20, Intel announced that they would unconditionally replace any Pentium chip that used the faulty algorithm and that they would take an unspecified charge against earnings, which turned out to be $300 million.


The Pentium uses a simple version of SRT division as discussed in section A.9. The bug was introduced when they converted the quotient lookup table to a PLA. Evidently there were a few elements of the table containing the quotient digit 2 that Intel thought would never be accessed, and they optimized the PLA design using this assumption. The resulting PLA returned 0 rather than 2 in these situations. However, those entries really were accessed, and this caused the division bug. Even though the effect of the faulty PLA was to cause 5 out of 2048 table entries to be wrong, the Pentium only computes an incorrect quotient 1 out of 9 billion times on random inputs. This is explored in Exercise A.34.
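The unforgiving nature of a bad table entry can be seen even in the simpler radix-2 flavor of SRT, sketched below in C (our own illustration; the Pentium's radix-4 selection table is more elaborate). Each step picks a digit from a coarse look at the shifted partial remainder; a wrong digit drives the remainder outside the range the remaining digits can absorb, which is the point of Exercise A.34(a):

    /* Radix-2 SRT division with quotient digits -1, 0, +1.  Operands
       are assumed prescaled so that 0.5 <= a, b < 1; the invariant
       |r| < b then holds at every step -- unless digit selection is
       wrong. */
    double srt2_divide(double a, double b, int digits)
    {
        double r = a / 2;           /* partial remainder */
        double q = 0.0, w = 1.0;    /* quotient so far, digit weight */
        for (int i = 0; i < digits; i++) {
            r *= 2;
            int d = (r >= 0.5) ? 1 : (r <= -0.5) ? -1 : 0;
            r -= d * b;             /* subtract digit * divisor */
            w /= 2;
            q += d * w;
        }
        return 2 * q;               /* approximates a/b */
    }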

References

Anderson, S. F., J. G. Earle, R. E. Goldschmidt, and D. M. Powers [1967]. "The IBM System/360 Model 91: Floating-point execution unit," IBM J. Research and Development 11, 34–53.

Birman, M., A. Samuels, G. Chu, T. Chuk, L. Hu, J. McLeod, and J. Barnes [1990]. "Developing the WTL3170/3171 SPARC floating-point coprocessors," IEEE Micro 10:1, 55–64.
  These chips have the same floating-point core as the Weitek 3364, and this paper has a fairly detailed description of that floating-point design.

Brent, R. P., and H. T. Kung [1982]. "A regular layout for parallel adders," IEEE Trans. on Computers C-31, 260–264.
  This is the paper that popularized CLAs in VLSI.

Burgess, N., and T. Williams [1995]. "Choices of operand truncation in the SRT division algorithm," IEEE Trans. on Computers 44:7.
  Analyzes how many bits of divisor and remainder need to be examined in SRT division.

Burks, A. W., H. H. Goldstine, and J. von Neumann [1946]. "Preliminary discussion of the logical design of an electronic computing instrument," Report to the U.S. Army Ordnance Department, p. 1; also appears in Papers of John von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, Mass., and Tomash Publishers, Los Angeles, Calif., 1987, 97–146.

Cody, W. J., J. T. Coonen, D. M. Gay, K. Hanson, D. Hough, W. Kahan, R. Karpinski, J. Palmer, F. N. Ris, and D. Stevenson [1984]. "A proposed radix- and word-length-independent standard for floating-point arithmetic," IEEE Micro 4:4, 86–100.
  Contains a draft of the 854 standard, which is more general than 754. The significance of this article is that it contains commentary on the standard, most of which is equally relevant to 754. However, be aware that there are some differences between this draft and the final standard.

Coonen, J. [1984]. Contributions to a Proposed Standard for Binary Floating-Point Arithmetic, Ph.D. thesis, Univ. of Calif., Berkeley.
  The only detailed discussion of how rounding modes can be used to implement efficient binary-decimal conversion.

Darley, H. M., et al. [1989]. "Floating point/integer processor with divide and square root functions," U.S. Patent 4,878,190, October 31, 1989.
  Pretty readable as patents go. Gives a high-level view of the TI 8847 chip, but doesn't have all the details of the division algorithm.

Demmel, J. W., and X. Li [1994]. "Faster numerical algorithms via exception handling," IEEE Trans. on Computers 43:8, 983–992.
  A good discussion of how the features unique to IEEE floating point can improve the performance of an important software library.

Freiman, C. V. [1961]. "Statistical analysis of certain binary division algorithms," Proc. IRE 49:1, 91–103.
  Contains an analysis of the performance of the shifting-over-zeros SRT division algorithm.

Goldberg, D. [1991]. "What every computer scientist should know about floating-point arithmetic," Computing Surveys 23:1, 5–48.
  Contains an in-depth tutorial on the IEEE standard from the software point of view.

Goldberg, I. B. [1967]. "27 bits are not enough for 8-digit accuracy," Comm. ACM 10:2, 105–106.
  This paper proposes using hidden bits and gradual underflow.

Gosling, J. B. [1980]. Design of Arithmetic Units for Digital Computers, Springer-Verlag, New York.
  A concise, well-written book, although it focuses on MSI designs.

Hamacher, V. C., Z. G. Vranesic, and S. G. Zaky [1984]. Computer Organization, 2nd ed., McGraw-Hill, New York.
  Introductory computer architecture book with a good chapter on computer arithmetic.

Hwang, K. [1979]. Computer Arithmetic: Principles, Architecture, and Design, Wiley, New York.
  This book contains the widest range of topics of the computer arithmetic books.

IEEE [1985]. "IEEE standard for binary floating-point arithmetic," SIGPLAN Notices 22:2, 9–25.
  IEEE 754 is reprinted here.

Kahan, W. [1968]. "7094-II system support for numerical analysis," SHARE Secretarial Distribution SSD-159.
  This system had many features that were incorporated into the IEEE floating-point standard.

Kahaner, D. K. [1988]. "Benchmarks for 'real' programs," SIAM News (November).
  The benchmark presented in this article turns out to cause many underflows.

Knuth, D. [1981]. The Art of Computer Programming, vol. II, 2nd ed., Addison-Wesley, Reading, Mass.
  Has a section on the distribution of floating-point numbers.

Kogge, P. [1981]. The Architecture of Pipelined Computers, McGraw-Hill, New York.
  Has a brief discussion of pipelined multipliers.

Kohn, L., and S.-W. Fu [1989]. "A 1,000,000 transistor microprocessor," IEEE Int'l Solid-State Circuits Conf., 54–55.
  There are several articles about the i860, but this one contains the most details about its floating-point algorithms.

Koren, I. [1989]. Computer Arithmetic Algorithms, Prentice Hall, Englewood Cliffs, N.J.

Leighton, F. T. [1992]. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, Calif.
  This is an excellent book, with emphasis on the complexity analysis of algorithms. Section 1.2.1 has a nice discussion of carry-lookahead addition on a tree.

Magenheimer, D. J., L. Peters, K. W. Pettis, and D. Zuras [1988]. "Integer multiplication and division on the HP Precision architecture," IEEE Trans. on Computers 37:8, 980–990.
  Gives rationale for the integer- and divide-step instructions in the Precision architecture.

Markstein, P. W. [1990]. "Computation of elementary functions on the IBM RISC System/6000 processor," IBM J. of Research and Development 34:1, 111–119.
  Explains how to use fused multiply-add to compute correctly rounded division and square root.

Mead, C., and L. Conway [1980]. Introduction to VLSI Systems, Addison-Wesley, Reading, Mass.

Montoye, R. K., E. Hokenek, and S. L. Runyon [1990]. "Design of the IBM RISC System/6000 floating-point execution unit," IBM J. of Research and Development 34:1, 59–70.
  Describes one implementation of fused multiply-add.

Ngai, T.-F., and M. J. Irwin [1985]. "Regular, area-time efficient carry-lookahead adders," Proc. Seventh IEEE Symposium on Computer Arithmetic, 9–15.
  Describes a CLA like that of Figure A.17, where the bits flow up and then come back down.

Patterson, D. A., and J. L. Hennessy [1994]. Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, San Francisco.
  Chapter 4 is a gentler introduction to the first third of this appendix.

Peng, V., S. Samudrala, and M. Gavrielov [1987]. "On the implementation of shifters, multipliers, and dividers in VLSI floating point units," Proc. Eighth IEEE Symposium on Computer Arithmetic, 95–102.
  Highly recommended survey of different techniques actually used in VLSI designs.

Rowen, C., M. Johnson, and P. Ries [1988]. "The MIPS R3010 floating-point coprocessor," IEEE Micro, 53–62 (June).

Santoro, M. R., G. Bewick, and M. A. Horowitz [1989]. "Rounding algorithms for IEEE multipliers," Proc. Ninth IEEE Symposium on Computer Arithmetic, 176–183.
  A very readable discussion of how to efficiently implement rounding for floating-point multiplication.

Scott, N. R. [1985]. Computer Number Systems and Arithmetic, Prentice Hall, Englewood Cliffs, N.J.

Swartzlander, E., ed. [1990]. Computer Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif.
  A collection of historical papers in two volumes.

Takagi, N., H. Yasuura, and S. Yajima [1985]. "High-speed VLSI multiplication algorithm with a redundant binary addition tree," IEEE Trans. on Computers C-34:9, 789–796.
  A discussion of the binary-tree signed multiplier that was the basis for the design used in the TI 8847.

Taylor, G. S. [1981]. "Compatible hardware for division and square root," Proc. Fifth IEEE Symposium on Computer Arithmetic, 127–134.
  Good discussion of a radix-4 SRT division algorithm.

Taylor, G. S. [1985]. "Radix 16 SRT dividers with overlapped quotient selection stages," Proc. Seventh IEEE Symposium on Computer Arithmetic, 64–71.
  Describes a very sophisticated high-radix division algorithm.

Weste, N., and K. Eshraghian [1993]. Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Addison-Wesley, Reading, Mass.
  This textbook has a section on the layouts of various kinds of adders.

Williams, T. E., M. Horowitz, R. L. Alverson, and T. S. Yang [1987]. "A self-timed chip for division," Advanced Research in VLSI, Proc. 1987 Stanford Conf., MIT Press, Cambridge, Mass.
  Describes a divider that tries to get the speed of a combinational design without using the area that would be required by one.

EXERCISES

A.1 [12] <A.2> Using n bits, what is the largest and smallest integer that can be represented in the two's complement system?

A.2 [20/25] <A.2> In the subsection Signed Numbers (page A-7), it was stated that two's complement overflows when the carry into the high-order bit position is different from the carry-out from that position.

a. [20] <A.2> Give examples of pairs of integers for all four combinations of carry-in and carry-out. Verify the rule stated above.

b. [25] <A.2> Explain why the rule is always true.

A.3 [12] <A.2> Using 4-bit binary numbers, multiply −8 × −8 using Booth recoding.

A.4 [15] <A.2> Equations A.2.1 and A.2.2 are for adding two n-bit numbers. Derive similar equations for subtraction, where there will be a borrow instead of a carry.

A.5 [25] <A.2> On a machine that doesn't detect integer overflow in hardware, show how you would detect overflow on a signed addition operation in software.

A.6 [15/15/20] <A.3> Represent the following numbers as single-precision and double-precision IEEE floating-point numbers.

a. [15] <A.3> 10.

b. [15] <A.3> 10.5.

c. [20] <A.3> 0.1.

A.7 [12/12/12/12/12] <A.3> Below is a list of floating-point numbers. In single precision, write down each number in binary, in decimal, and give its representation in IEEE arithmetic.

a. [12] <A.3> The largest number less than 1.

b. [12] <A.3> The largest number.

c. [12] <A.3> The smallest positive normalized number.

d. [12] <A.3> The largest denormal number.

e. [12] <A.3> The smallest positive number.

A.8 [15] <A.3> Is the ordering of nonnegative floating-point numbers the same as integers when denormalized numbers are also considered?

A.9 [20] <A.3> Write a program that prints out the bit patterns used to represent floating-point numbers on your favorite computer. What bit pattern is used for NaN?


A.10 [15] <A.4> Using p = 4, show how the binary floating-point multiply algorithm computes the product of 1.875 × 1.875.

A.11 [12/10] <A.4> Concerning the addition of exponents in floating-point multiply:

a. [12] <A.4> What would the hardware that implements the addition of exponents look like?

b. [10] <A.4> If the bias in single precision were 129 instead of 127, would addition be harder or easier to implement?

A.12 [15/12] <A.4> In the discussion of overflow detection for floating-point multiplication, it was stated that (for single precision) you can detect an overflowed exponent by performing exponent addition in a 9-bit adder.

a. [15] <A.4> Give the exact rule for detecting overflow.

b. [12] <A.4> Would overflow detection be any easier if you used a 10-bit adder instead?

A.13 [15/10] <A.4> Floating-point multiplication:

a. [15] <A.4> Construct two single-precision floating-point numbers whose product doesn't overflow until the final rounding step.

b. [10] <A.4> Is there any rounding mode where this phenomenon cannot occur?

A.14 [15] <A.4> Give an example of a product with a denormal operand but a normalized output. How large was the final shifting step? What is the maximum possible shift that can occur when the inputs are double-precision numbers?

A.15 [15] <A.5> Use the floating-point addition algorithm on page A-24 to compute 1.010_2 − .1001_2 (in 4-bit precision).

A.16 [10/15/20/20/20] <A.5> In certain situations, you can be sure that a + b is exactly representable as a floating-point number, that is, no rounding is necessary.

a. [10] <A.5> If a, b have the same exponent and different signs, explain why a + b is exact. This was used in the subsection Speeding Up Addition on page A-27.

b. [15] <A.5> Give an example where the exponents differ by 1, a and b have different signs, and a + b is not exact.

c. [20] <A.5> If a ≥ b ≥ 0, and the top two bits of a cancel when computing a − b, explain why the result is exact (this fact is mentioned on page A-23).

d. [20] <A.5> If a ≥ b ≥ 0, and the exponents differ by 1, show that a − b is exact unless the high order bit of a − b is in the same position as that of a (mentioned in Speeding Up Addition, page A-27).

e. [20] <A.5> If the result of a − b or a + b is denormal, show that the result is exact (mentioned in the subsection Underflow, page A-38).

(men-A.17 [15/20] <A.5> Fast floating-point addition (using parallel adders) for p = 5.

a. [15] <A.5> Step through the fast addition algorithm for a + b, where a = 1.01112 and

b = 11011.


b. [20] <A.5> Suppose the rounding mode is toward +∞. What complication arises in the above example for the adder that assumes a carry-out? Suggest a solution.

A.18 [12] <A.4,A.5> How would you use two parallel adders to avoid the final round-up addition in floating-point multiplication?

A.19 [30/10] <A.5> This problem presents a way to reduce the number of addition steps in floating-point addition from three to two using only a single adder.

a. [30] <A.5> Let A and B be integers of opposite signs, with a and b their magnitudes. Show that the following rules for manipulating the unsigned numbers a and b gives A + B.

1. Complement one of the operands.

2. Use end-around carry to add the complemented operand and the other (uncomplemented) one.

3. If there was a carry-out, the sign of the result is the sign associated with the uncomplemented operand.

4. Otherwise, if there was no carry-out, complement the result, and give it the sign of the complemented operand.

b. [10] <A.5> Use the above to show how steps 2 and 4 in the floating-point addition algorithm can be performed using only a single addition.

al-A.20 [20/15/20/15/20/15] <A.6> Iterative square root.

a [20] <A.6> Use Newton’s method to derive an iterative algorithm for square root The formula will involve a division

b [15] <A.6> What is the fastest way you can think of to divide a floating-point number

by 2?

c [20] <A.6> If division is slow, then the iterative square root routine will also be slow.

Use Newton’s method on f(x) = 1/x2 − a to derive a method that doesn’t use any

divi-sions.

d [15] <A.6> Assume that the ratio division by 2 : floating-point add : floating-point multiply is 1:2:4 What ratios of multiplication time to divide time makes each itera- tion step in the method of part(c) faster than each iteration in the method of part(a)?

e [20] <A.6> When using the method of part(a), how many bits need to be in the initial guess in order to get double-precision accuracy after three iterations? (You may ignore rounding error.)

f [15] <A.6> Suppose that when Spice runs on the TI 8847, it spends 16.7% of its time

in the square root routine (this percentage has been measured on other machines) ing the values in Figure A.36 and assuming three iterations, how much slower would Spice run if square root was implemented in software using the method of part(a)?

Us-A.21 [10/20/15/15/15] <A.6> Correctly rounded iterative division Let a and b be

floating-point numbers with p-bit significands (p = 53 in double precision) Let q be the exact tient q = a/b, 1 q < 2 Suppose that q is the result of an iteration process, that q has a few

Trang 14

quo-Exercises A-75

extra bits of precision, and that 0 < q q < 2 −p For the following, it is important that

q < q, even when q can be exactly represented as a floating-point number.

a. [10] <A.6> If x is a floating-point number, and 1 x < 2, what is the next representable number after x?

b. [20] <A.6> Show how to compute q from q, where q has p + 1 bits of precision and

q q′< 2 −p.

c [15] <A.6> Assuming round to nearest, show that the correctly rounded quotient is

either q, q′ − 2 −p , or q′ + 2 −p.

d. [15] <A.6> Give rules for computing the correctly rounded quotient from q′ based on

the low-order bit of q and the sign of a bq

e [15] <A.6> Solve part(c) for the other three rounding modes.

A.22 [15] <A.6> Verify the formula on page A-31. [Hint: If x_n = x_0(2 − x_0 b) × Π_{i=1..n} [1 + (1 − x_0 b)^(2^i)], then 2 − x_n b = 2 − x_0 b(2 − x_0 b) Π_{i=1..n} [1 + (1 − x_0 b)^(2^i)] = 2 − [1 − (1 − x_0 b)^2] Π_{i=1..n} [1 + (1 − x_0 b)^(2^i)].]

A.23 [15] <A.7> Our example that showed that double rounding can give a different answer from rounding once used the round-to-even rule. If halfway cases are always rounded up, is double rounding still dangerous?

A.24 [10/10/20/20] <A.7> Some of the cases of the italicized statement in the Precisions subsection (page A-34) aren't hard to demonstrate.

a. [10] <A.7> What form must a binary number have if rounding to q bits followed by rounding to p bits gives a different answer than rounding directly to p bits?

b. [10] <A.7> Show that for multiplication of p-bit numbers, rounding to q bits followed by rounding to p bits is the same as rounding immediately to p bits if q ≥ 2p.

c. [20] <A.7> If a and b are p-bit numbers with the same sign, show that rounding a + b to q bits followed by rounding to p bits is the same as rounding immediately to p bits if q ≥ 2p + 1.

d. [20] <A.7> Do part (c) when a and b have opposite signs.

A.25 [Discussion] <A.7> In the MIPS approach to exception handling, you need a test for determining whether two floating-point operands could cause an exception. This should be fast and also not have too many false positives. Can you come up with a practical test? The performance cost of your design will depend on the distribution of floating-point numbers. This is discussed in Knuth [1981] and the Hamming paper in Swartzlander [1990].

A.26 [12/12/10] <A.8> Carry-skip adders:

a. [12] <A.8> Assuming that time is proportional to logic levels, how long does it take an n-bit adder divided into (fixed) blocks of length k bits to perform an addition?

b. [12] <A.8> What value of k gives the fastest adder?

c. [10] <A.8> Explain why the carry-skip adder takes time O(√n).


A.27 [10/15/20] <A.8> Complete the details of the block diagrams for the following adders.

a. [10] <A.8> In Figure A.15, show how to implement the "1" and "2" boxes in terms of AND and OR gates.

b. [15] <A.8> In Figure A.18, what signals need to flow from the adder cells in the top row into the "C" cells? Write the logic equations for the "C" box.

c. [20] <A.8> Show how to extend the block diagram in A.17 so it will produce the carry-out bit c_8.

A.28 [15] <A.9> For ordinary Booth recoding, the multiple of b used in the ith step is simply a_{i−1} − a_i. Can you find a similar formula for radix-4 Booth recoding (overlapped triplets)?

A.29 [20] <A.9> Expand Figure A.29 in the fashion of A.27, showing the individual adders.

A.30 [25] <A.9> Write out the analogue of Figure A.25 for radix-8 Booth recoding.

A.31 [18] <A.9> Suppose that a_{n−1}...a_1a_0 and b_{n−1}...b_1b_0 are being added in a signed-digit adder as illustrated in the Example on page A-56. Write a formula for the ith bit of the sum, s_i, in terms of a_i, a_{i−1}, a_{i−2}, b_i, b_{i−1}, and b_{i−2}.

A.32 [15] <A.9> The text discussed radix-4 SRT division with quotient digits of −2, −1, 0, 1, 2. Suppose that 3 and −3 are also allowed as quotient digits. What relation replaces |r_i| ≤ 2b/3?

A.33 [25/20/30] <A.9> Concerning the SRT division table, Figure A.34:

a. [25] <A.9> Write a program to generate the results of Figure A.34.

b. [20] <A.9> Note that Figure A.34 has a certain symmetry with respect to positive and negative values of P. Can you find a way to exploit the symmetry and only store the values for positive P?

c. [30] <A.9> Suppose a carry-save adder is used instead of a propagate adder. The input to the quotient lookup table will be k bits of divisor and l bits of remainder, where the remainder bits are computed by summing the top l bits of the sum and carry registers. What are k and l? Write a program to generate the analogue of Figure A.34.

A.34 [12/12/12] <A.9,A.12> The first several million Pentium chips produced had a flaw that caused division to sometimes return the wrong result. The Pentium uses a radix-4 SRT algorithm similar to the one illustrated in the Example on page A-59 (but with the remainder stored in carry-save format; see Exercise A.33(c)). According to Intel, the bug was due to five incorrect entries in the quotient lookup table.

a. [12] <A.9,A.12> The bad entries should have had a quotient of plus or minus 2, but instead had a quotient of 0. Because of redundancy, it's conceivable that the algorithm could "recover" from a bad quotient digit on later iterations. Show that this is not possible for the Pentium flaw.

b. [12] <A.9,A.12> Since the operation is a floating-point divide rather than an integer divide, the SRT division algorithm on page A-47 must be modified in two ways. First, step 1 is no longer needed, since the divisor is already normalized. Second, the very first remainder may not satisfy the proper bound (|r| ≤ 2b/3 for the Pentium; see page A-58). Show that skipping the very first left shift in step 2(a) of the SRT algorithm will solve this problem.

c. [12] <A.9,A.12> If the faulty table entries were indexed by a remainder that could occur at the very first divide step (when the remainder is the divisor), random testing would quickly reveal the bug. This didn't happen. What does that tell you about the remainder values that index the faulty entries?

oc-A.35 [12/12/12] <A.6,A.9> The discussion of the remainder-step instruction assumed that

division was done using a bit-at-a-time algorithm What would have to change if division were implemented using a higher-radix method?

A.36 [25] <A.9> In the array of Figure A.28, the fact that an array can be pipelined is not exploited. Can you come up with a design that feeds the output of the bottom CSA into the bottom CSAs instead of the top one, and that will run faster than the arrangement of Figure A.28?


B Vector Processors

I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor.

Those three were all pioneering processors. One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made.

Seymour Cray

Public Lecture at Lawrence Livermore Laboratories

on the Introduction of the CRAY-1 (1976)


B.1 Why Vector Processors?

■ Clock cycle time—The clock cycle time can be decreased by making the pipelines deeper, but a deeper pipeline will increase the pipeline dependences and result in a higher CPI. At some point, each increase in pipeline depth has a corresponding increase in CPI. As we saw in Chapter 3's Fallacies and Pitfalls, very deep pipelining can slow down a processor.

■ Instruction fetch and decode rate—This obstacle, sometimes called the Flynn bottleneck (based on Flynn [1966]), makes it difficult to fetch and issue many instructions per clock. This obstacle is one reason that it has been difficult to build processors with high clock rates and very high issue rates.

The dual limitations imposed by deeper pipelines and issuing multiple instructions can be viewed from the standpoint of either clock rate or CPI: It is just as difficult to schedule a pipeline that is n times deeper as it is to schedule a processor that issues n instructions per clock cycle.

High-speed, pipelined processors are particularly useful for large scientific and engineering applications. A high-speed pipelined processor will usually use a cache to avoid forcing memory reference instructions to have very long latency. Unfortunately, big, long-running, scientific programs often have very large active data sets that are sometimes accessed with low locality, yielding poor performance from the memory hierarchy. This problem could be overcome by not caching these structures if it were possible to determine the memory-access patterns and pipeline the memory accesses efficiently. Novel cache architectures and compiler assistance through blocking and prefetching are decreasing these memory hierarchy problems, but they continue to be serious in some applications.

Vector processors provide high-level operations that work on vectors—linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element vector result. The vector instruction is equivalent to an entire loop, with each iteration computing one of the 64 elements of the result, updating the indices, and branching back to the beginning.

Vector instructions have several important properties that solve most of the problems mentioned above:

■ The computation of each result is independent of the computation of previous results, allowing a very deep pipeline without generating any data hazards. Essentially, the absence of data hazards was determined by the compiler or by the programmer when she decided that a vector instruction could be used.

■ A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Thus, the instruction bandwidth requirement is reduced, and the Flynn bottleneck is considerably mitigated.

■ Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well (as we saw in section 5.6). The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than to a single word. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.

■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.


For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the applications domain can use them frequently.

B.2 Basic Vector Architecture

As mentioned above, vector processors pipeline the operations on the individual elements of a vector. The pipeline includes not only the arithmetic operations (multiplication, addition, and so on), but also memory accesses and effective address calculations. In addition, most high-end vector processors allow multiple vector operations to be done at the same time, creating parallelism among the operations on different elements. In this appendix, we focus on vector processors that gain performance by pipelining and instruction overlap.

A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. All functional units within the vector unit have a latency of several clock cycles. This allows a shorter clock cycle time and is compatible with long-running vector operations that can be deeply pipelined without generating hazards. Most vector processors allow the vectors to be dealt with as floating-point numbers, as integers, or as logical data. Here we will focus on floating point. The scalar unit is basically no different from the type of advanced pipelined CPU discussed in Chapter 3.

There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations—except load and store—are among the vector registers. These architectures are the vector counterpart of a load-store architecture. All major vector computers shipped since the late 1980s use a vector-register architecture; these include the Cray Research processors (CRAY-1, CRAY-2, X-MP, Y-MP, and C-90), the Japanese supercomputers (NEC SX/2 and SX/3, Fujitsu VP200 and VP400, and the Hitachi S820), as well as the mini-supercomputers (Convex C-1 and C-2). In a memory-memory vector processor, all vector operations are memory to memory. The first vector computers were of this type, as were CDC's vector computers. From this point on we will focus on vector-register architectures only; we will briefly return to memory-memory vector architectures at the end of the appendix (section B.7) to discuss why they have not been as successful as vector-register architectures.

We begin with a vector-register processor consisting of the primary components shown in Figure B.1. This processor, which is loosely based on the CRAY-1, is the foundation for discussion throughout most of this appendix. We will call it DLXV; its integer portion is DLX, and its vector portion is the logical vector extension of DLX. The rest of this section examines how the basic architecture of DLXV relates to other processors.


The primary components of the instruction set architecture of DLXV are:

■ Vector registers—Each vector register is a fixed-length bank holding a single vector. DLXV has eight vector registers, and each vector register holds 64 elements. Each vector register must have at least two read ports and one write port in DLXV. This will allow a high degree of overlap among vector operations to different vector registers. (We do not consider the problem of a shortage of vector register ports. In real machines this would result in a structural hazard.) The read and write ports, which total at least 16 read ports and eight write ports, are connected to the functional unit inputs or outputs by a pair of crossbars. (The CRAY-1 manages to implement the register file with only a single port per register using some clever implementation techniques.)

FIGURE B.1 The basic structure of a vector-register architecture, DLXV. This processor has a scalar architecture just like DLX. There are also eight 64-element vector registers, and all the functional units are vector functional units. Special vector instructions are defined both for arithmetic and for memory accesses. We show vector units for logical and integer operations. These are included so that DLXV looks like a standard vector processor, which usually includes these units. However, we will not be discussing these units except in the Exercises. The vector and scalar registers have a significant number of read and write ports to allow multiple simultaneous vector operations. These ports are connected to the inputs and outputs of the vector functional units by a set of crossbars (shown in thick gray lines). In section B.5 we add chaining, which will require additional interconnect capability.

[The figure shows main memory feeding a vector load-store unit and the eight vector registers; the vector and scalar registers connect through crossbars to the FP add/subtract, FP multiply, FP divide, integer, and logical functional units.]


■ Vector functional units—Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional units (structural hazards) and from conflicts for register accesses (data hazards). DLXV has five functional units, as shown in Figure B.1. For simplicity, we will focus exclusively on the floating-point functional units. Depending on the vector processor, scalar operations either use the vector functional units or use a dedicated set. We assume the functional units are shared, but again, for simplicity, we ignore potential conflicts.

■ Vector load-store unit—This is a vector memory unit that loads or stores a vector to or from memory. The DLXV vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency. This unit would also normally handle scalar loads and stores.

■ A set of scalar registers—Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit. These are the normal 32 general-purpose registers and 32 floating-point registers of DLX, though more read and write ports are needed. The scalar registers are also connected to the functional units by the pair of crossbars.

Figure B.2 shows the characteristics of some typical vector processors, including the size and count of the registers, the number and types of functional units, and the number of load-store units.

In DLXV, vector operations use the same names as DLX operations, but with the letter "V" appended. These are double-precision, floating-point vector operations. (We have omitted single-precision FP operations and integer and logical operations for simplicity.) Thus, ADDV is an add of two double-precision vectors. The vector instructions take as their input either a pair of vector registers (ADDV) or a vector register and a scalar register, designated by appending "SV" (ADDSV). In the latter case, the value in the scalar register is used as the input for all operations—the operation ADDSV will add the contents of a scalar register to each element in a vector register. Most vector operations have a vector destination register, though a few (population count) produce a scalar value, which is stored to a scalar register. The names LV and SV denote vector load and vector store, and they load or store an entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other operand, which is a DLX general-purpose register, is the starting address of the vector in memory. Figure B.3 lists the DLXV vector instructions. In addition to the vector registers, we need two additional special-purpose registers: the vector-length and vector-mask registers. We will discuss these registers and their purpose in sections B.3 and B.5, respectively.
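Concretely, the architected state just described can be pictured as a C structure. This is only a sketch for orientation; the type and field names are ours:

    #include <stdint.h>

    #define VREGS 8                 /* eight vector registers */
    #define MVL   64                /* 64 elements per vector register */

    typedef struct {
        double   vr[VREGS][MVL];    /* V0..V7, double-precision elements */
        double   fpr[32];           /* DLX floating-point registers */
        int32_t  gpr[32];           /* DLX general-purpose registers */
        int      vlr;               /* vector-length register, 0..MVL */
        uint64_t vm;                /* vector-mask register, one bit per element */
    } DLXVRegs;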


Processor | Year announced | Clock rate (MHz) | Registers | Elements per register (64-bit elements) | Functional units | Load-store units
CRAY-1 | 1976 | 80 | 8 | 64 | 6: add, multiply, reciprocal, integer add, logical, shift | 1
CRAY X-MP / CRAY Y-MP | 1983 / 1988 | 120 / 166 | 8 | 64 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 2 loads, 1 store
CRAY-2 | 1985 | 166 | 8 | 64 | 5: FP add, FP multiply, FP reciprocal/sqrt, integer (add, shift, population count), logical | 1
NEC SX/2 | … | … | 8 + 8192 | 256 variable | 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift | 8
DLXV | … | … | 8 | 64 | 5: multiply, divide, add, integer add, logical | 1
Cray C-90 | 1991 | 240 | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 4
Convex C-4 | 1994 | 135 | 16 | 128 | 3: each is full integer, logical, and FP (including multiply-add) | …
NEC SX/4 | 1995 | 400 | 8 + 8192 | 256 variable | 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift | 8
Cray T-90 | … | … | 8 | 128 | 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity | 4

FIGURE B.2 Characteristics of several vector-register architectures. The vector functional units include all operation units used by the vector instructions. The functional units are floating point unless stated otherwise. If the processor is a multiprocessor, the entries correspond to the characteristics of one processor. Each vector load-store unit represents the ability to do an independent, overlapped transfer to or from the vector registers. The Fujitsu VP200's vector registers are configurable: the size and count of the 8 K 64-bit entries may be varied inversely to one another (e.g., eight registers each 1 K elements long, or 128 registers each 64 elements long). The NEC SX/2 has eight fixed registers of length 256, plus 8 K of configurable 64-bit registers. The reciprocal unit on the CRAY processors is used to do division (and square root on the CRAY-2). Add pipelines perform floating-point add and subtract. The multiply/divide–add unit on the Hitachi S810/820 performs an FP multiply or divide followed by an add or subtract (while the multiply-add unit performs a multiply followed by an add or subtract). Note that most processors use the vector FP multiply and divide units for vector integer multiply and divide, just like DLX, and several of the processors use the same units for FP scalar and FP vector operations. Several of the machines have different clock rates in the vector and scalar units; the clock rates shown are for the vector units.


A vector processor is best understood by looking at a vector loop on DLXV. Let's take a typical vector problem, which will be used throughout this appendix:

Y = a × X + Y

X and Y are vectors, initially resident in memory, and a is a scalar. This is the so-called SAXPY or DAXPY loop that forms the inner loop of the Linpack benchmark. (SAXPY stands for single-precision a × X plus Y; DAXPY for double-precision a × X plus Y.)

so-Instruction Operands Function

LV V1,R1 Load vector register V1 from memory starting at address R1

SV R1,V1 Store vector register V1 into memory starting at address R1

LVWS V1,(R1,R2) Load V1 from address at R1 with stride in R2 , i.e., R1+i × R2

SVWS (R1,R2),V1 Store V1 from address at R1 with stride in R2 , i.e., R1+i × R2

LVI V1,(R1+V2) Load V1 with vector whose elements are at R1+V2(i) , i.e., V2 is an index.

SVI (R1+V2),V1 Store V1 to vector whose elements are at R1+V2(i) , i.e., V2 is an index.

CVI V1,R1 Create an index vector by storing the values 0, 1 × R1, 2 × R1, ,63 × R1

POP R1,VM Count the 1s in the vector-mask register and store count in R1

MOVI2S

MOVS2I

VLR,R1

R1,VLR

Move contents of R1 to the vector-length register.

Move the contents of the vector-length register to R1

MOVF2S

MOVS2F

VM,F0

F0,VM

Move contents of F0 to the vector-mask register.

Move contents of vector-mask register to F0

FIGURE B.3 The DLXV vector instructions Only the double-precision FP operations are shown In addition to the vector

registers, there are two special registers, VLR (discussed in section B.3) and VM (discussed in section B.5) The operations

with stride are explained in section B.3, and the use of the index creation and indexed load-store operations are explained

in section B.5.
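In software terms, the strided load corresponds to the following C sketch (ours; it assumes the stride register counts double words and uses the fixed vector length of 64):

    /* LVWS V1,(R1,R2): element i of V1 comes from address R1 + i * R2. */
    void lvws(double v1[64], const double *base, long stride)
    {
        for (int i = 0; i < 64; i++)
            v1[i] = base[i * stride];
    }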


Linpack is a collection of linear algebra routines, and the routines for performing Gaussian elimination constitute what is known as the Linpack benchmark. The DAXPY routine, which implements the above loop, represents a small fraction of the source code of the Linpack benchmark, but it accounts for most of the execution time for the benchmark.

For now, let us assume that the number of elements, or length, of a vector register (64) matches the length of the vector operation we are interested in. (This restriction will be lifted shortly.)

EXAMPLE  Show the code for DLX and DLXV for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and Ry, respectively.

ANSWER  Here is the DLX code:

        LD     F0,a          ;load scalar a
        ADDI   R4,Rx,#512    ;last address to load
Loop:   LD     F2,0(Rx)      ;load X(i)
        MULTD  F2,F0,F2      ;a × X(i)
        LD     F4,0(Ry)      ;load Y(i)
        ADDD   F4,F2,F4      ;a × X(i) + Y(i)
        SD     0(Ry),F4      ;store into Y(i)
        ADDI   Rx,Rx,#8      ;increment index to X
        ADDI   Ry,Ry,#8      ;increment index to Y
        SUB    R20,R4,Rx     ;compute bound
        BNEZ   R20,Loop      ;check if done

Here is the code for DLXV for DAXPY:

        LD     F0,a        ;load scalar a
        LV     V1,Rx       ;load vector X
        MULTSV V2,F0,V1    ;vector-scalar multiply
        LV     V3,Ry       ;load vector Y
        ADDV   V4,V2,V3    ;add
        SV     Ry,V4       ;store the result

There are some interesting comparisons between the two code segments in this Example. The most dramatic is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only six instructions versus almost 600 for DLX. This reduction occurs both because the vector operations work on 64 elements and because the overhead instructions that constitute nearly half the loop on DLX are not present in the DLXV code.
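As a rough sanity check on these counts, the following Python sketch tallies the dynamic instructions under the assumptions above: a 9-instruction DLX loop body (as in the code shown), two setup instructions, and 64 iterations. The counts are back-of-the-envelope approximations, not simulator output.

    n = 64                     # elements per vector
    dlx_loop_body = 9          # instructions per iteration of the DLX loop
    dlx_setup = 2              # scalar load of a plus loop-bound setup
    dlx_total = dlx_setup + n * dlx_loop_body
    dlxv_total = 6             # LD, LV, MULTSV, LV, ADDV, SV
    print(dlx_total, dlxv_total)   # 578 versus 6: "almost 600" versus 6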


Another important difference is the frequency of pipeline interlocks. In the straightforward DLX code every ADDD must wait for a MULTD, and every SD must wait for the ADDD. On the vector processor, each vector instruction operates on all the vector elements independently. Thus, pipeline stalls are required only once per vector operation, rather than once per vector element. In this example, the pipeline-stall frequency on DLX will be about 64 times higher than it is on DLXV. The pipeline stalls can be eliminated on DLX by using software pipelining or loop unrolling (as we saw in Chapter 4). However, the large difference in instruction bandwidth cannot be reduced.

Vector Execution Time

The execution time of a sequence of vector operations primarily depends on three factors: the length of the vectors being operated on, structural hazards among the operations, and the data dependences. Given the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can compute the time for a single vector instruction. The initiation rate is usually one per clock cycle for individual operations. However, some supercomputers have vector instructions that can produce two or more results per clock, and others have units that may not be fully pipelined. For simplicity, we assume that initiation rates are one throughout this appendix. Thus, the execution time for a single vector instruction is approximately the vector length.

To simplify the discussion of vector execution and its timing, we will use the notion of a convoy, which is the set of vector instructions that could potentially begin execution together in one clock period. (Although the concept of a convoy is used in vector compilers, no standard terminology exists. Hence, we created the term convoy.) The instructions in a convoy must not contain any structural or data hazards (though we will relax this later); if such hazards were present, the instructions in the potential convoy would need to be serialized and initiated in different convoys. To keep the analysis simple, we assume that a convoy of instructions must complete execution before any other instructions (scalar or vector) can begin execution. We will relax this in section B.6 by using a less restrictive, but more complex, method for issuing instructions.

Accompanying the notion of a convoy is a timing metric, called a chime, that can be used for estimating the performance of a vector sequence consisting of convoys. A chime is an approximate measure of execution time for a vector sequence; a chime measurement is independent of vector length. Thus, a vector sequence that consists of m convoys executes in m chimes, and for a vector length of n, this is approximately m × n clock cycles. A chime approximation ignores some processor-specific overheads, many of which are dependent on vector length. Hence, measuring time in chimes is a better approximation for long vectors. We will use the chime measurement, rather than clock cycles per result, to explicitly indicate that certain overheads are being ignored.
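In code form, the chime approximation is just a product; the sketch below (Python, with illustrative numbers) makes the model explicit.

    def chime_time(m_convoys, n):
        # m convoys on vectors of length n take about m * n clock cycles,
        # ignoring start-up and issue overheads
        return m_convoys * n

    print(chime_time(4, 64))   # e.g., a 4-convoy sequence on 64 elements: 256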


If we know the number of convoys in a vector sequence, we know the execution time in chimes. One source of overhead ignored in measuring chimes is any limitation on initiating multiple vector instructions in a clock cycle. If only one vector instruction can be initiated in a clock cycle (the reality in most vector processors), the chime count will underestimate the actual execution time of a convoy. Because the vector length is typically much greater than the number of instructions in the convoy, we will simply assume that the convoy executes in one chime.

execu-E X A M P L execu-E Show how the following code sequence lays out in convoys, assuming a

single copy of each vector functional unit:

MULTSV V2,F0,V1 ;vector-scalar multiply

ADDV V4,V2,V3 ;add

SV Ry,V4 ;store the result

How many chimes will this vector sequence take? How many chimes per FLOP (floating-point operation) are needed?

ANSWER  The first convoy is occupied by the first LV instruction. The MULTSV is dependent on the first LV, so it cannot be in the same convoy. The second LV instruction can be in the same convoy as the MULTSV. The ADDV is dependent on the second LV, so it must come in yet a third convoy, and finally the SV depends on the ADDV, so it must go in a following convoy. This leads to the following layout of vector instructions into convoys:

1. LV
2. MULTSV  LV
3. ADDV
4. SV

The sequence requires four convoys and hence takes four chimes. Note that although we allow the MULTSV and the LV both to execute in convoy 2, most vector machines will take two clock cycles to initiate the instructions. Since the sequence takes a total of four chimes and there are two floating-point operations per result, the number of chimes per FLOP is 2. ■
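The convoy assignment above can be mechanized. The sketch below (Python) forms convoys greedily, starting a new convoy whenever an instruction has a structural hazard (same functional unit) or a read-after-write dependence against the current one; the instruction encoding is a simplified stand-in for illustration, not DLXV's actual format.

    def _conflicts(convoy, unit, srcs):
        # structural hazard: same unit already used in this convoy;
        # data hazard: a source of the new instruction is written in it
        for _name, u, dests, _s in convoy:
            if u == unit or set(dests) & set(srcs):
                return True
        return False

    def convoys(instrs):
        result = []
        for ins in instrs:
            _name, unit, _dests, srcs = ins
            if result and not _conflicts(result[-1], unit, srcs):
                result[-1].append(ins)
            else:
                result.append([ins])
        return result

    seq = [("LV",     "ldst", ["V1"], ["Rx"]),
           ("MULTSV", "mult", ["V2"], ["F0", "V1"]),
           ("LV",     "ldst", ["V3"], ["Ry"]),
           ("ADDV",   "add",  ["V4"], ["V2", "V3"]),
           ("SV",     "ldst", [],     ["Ry", "V4"])]
    for i, c in enumerate(convoys(seq), 1):
        print(i, [name for name, *_ in c])
    # prints: 1 ['LV']  2 ['MULTSV', 'LV']  3 ['ADDV']  4 ['SV']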

The chime approximation is reasonably accurate for long vectors. For example, for 64-element vectors, the time in chimes is four, so the sequence would take about 256 clock cycles. The overhead of issuing convoy 2 in two separate clocks would be small.


Another source of overhead is far more significant than the issue limitation. The most important source of overhead ignored by the chime model is vector start-up time. The start-up time comes from the pipelining latency of the vector operation and is principally determined by how deep the pipeline is for the functional unit used. The start-up time increases the effective time to execute a convoy to more than one chime. Because of our assumption that convoys do not overlap in time, the start-up time delays the execution of subsequent convoys. Of course the instructions in successive convoys have either structural conflicts for some functional unit or are data dependent, so the assumption of no overlap is reasonable. The actual time to complete a convoy is determined by the sum of the vector length and the start-up time. If vector lengths were infinite, this start-up overhead would be amortized, but finite vector lengths expose it, as the following Example shows.

fol-E X A M P L fol-E Assume the start-up overhead for functional units is shown in Figure B.4.

Show the time that each convoy can begin and the total number of cycles needed How does the time compare to the chime approximation for a vector of length 64?

ANSWER  Figure B.5 provides the answer in convoys, assuming that the vector length is n:

Unit                 Start-up overhead
Load and store unit  12 cycles
Multiply unit         7 cycles
Add unit              6 cycles

FIGURE B.4 Start-up overhead.

Convoy          Starting time   First-result time   Last-result time
1. LV           0               12                  11 + n
2. MULTSV LV    12 + n          12 + n + 12         23 + 2n
3. ADDV         24 + 2n         24 + 2n + 6         29 + 3n
4. SV           30 + 3n         30 + 3n + 12        41 + 4n

FIGURE B.5 Starting times and first- and last-result times for convoys 1 through 4. The vector length is n.

One tricky question is when we assume the vector sequence is done; this determines whether the start-up time of the SV is visible or not. We assume that the instructions following cannot fit in the same convoy, and we


have already assumed that convoys do not overlap. Thus the total time is given by the time until the last vector instruction in the last convoy completes. This is an approximation, and the start-up time of the last vector instruction may be seen in some sequences and not in others. For simplicity, we always include it.

The time per result for a vector of length 64 is 4 + (42/64) = 4.65 clock cycles, while the chime approximation would be 4. The execution time with start-up overhead is 1.16 times higher. ■
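The numbers in Figure B.5 follow from a simple recurrence: each convoy starts when the previous one delivers its last result, and delivers its first result after its start-up penalty. A small Python sketch of that model, using the Figure B.4 penalties (the MULTSV LV convoy is charged the larger load start-up of 12):

    def convoy_times(startups, n):
        # returns (start, first result, last result) per convoy, assuming
        # convoys do not overlap and one result per clock after start-up
        start, rows = 0, []
        for s in startups:
            first = start + s
            last = first + n - 1
            rows.append((start, first, last))
            start = last + 1
        return rows

    n = 64
    rows = convoy_times([12, 12, 6, 12], n)   # LV; MULTSV LV; ADDV; SV
    total = rows[-1][2] + 1
    print(total, total / n)   # 298 cycles, 4.65625 per element (4 + 42/64)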

For simplicity, we will use the chime approximation for running time, incorporating start-up time effects only when we want more detailed performance or to illustrate the benefits of some enhancement. For long vectors, a typical situation, the overhead effect is not that large. Later in the appendix we will explore ways to reduce start-up overhead.

Start-up time for an instruction comes from the pipeline depth for the functional unit implementing that instruction. If the initiation rate is to be kept at one clock cycle per result, then

Pipeline depth = ⌈Total functional unit time / Clock cycle time⌉

For example, if an operation takes 10 clock cycles, it must be pipelined 10 deep to achieve an initiation rate of one per clock cycle. Pipeline depth, then, is determined by the complexity of the operation and the clock cycle time of the processor. The pipeline depths of functional units vary widely (from 2 to 20 stages is not uncommon), though the most heavily used units have pipeline depths of four to eight clock cycles.

For DLXV, we will use the same pipeline depths as the CRAY-1, though more modern processors might have units with lower latency. All functional units are fully pipelined. As shown in Figure B.6, pipeline depths are six clock cycles for floating-point add and seven clock cycles for floating-point multiply. On DLXV, as on most vector processors, independent vector operations using different functional units can issue in the same convoy.

Operation        Start-up penalty
Vector add        6
Vector multiply   7
Vector divide    20
Vector load      12

FIGURE B.6 Start-up penalties on DLXV. These are the start-up penalties in clock cycles for DLXV vector operations.


Vector Load-Store Units and Vector Memory Systems

The behavior of the load-store vector unit is significantly more complicated than that of the arithmetic functional units. The start-up time for a load is the time to get the first word from memory into a register. If the rest of the vector can be supplied without stalling, then the vector initiation rate is equal to the rate at which new words are fetched or stored. Unlike simpler functional units, the initiation rate may not necessarily be one clock cycle.

Typically, penalties for start-ups on load-store units are higher than those for arithmetic functional units, up to 50 clock cycles on some processors. For DLXV we will assume a start-up time of 12 clock cycles; by comparison, the CRAY-1 and CRAY X-MP have load-store start-up times of between nine and 17 clock cycles. Figure B.6 summarizes the start-up penalties for DLXV vector operations.

To maintain an initiation rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by creating multiple memory banks, as discussed in section 5.6. As we will see in the next section, having significant numbers of banks is useful for dealing with vector loads or stores that access rows or columns of data. Most vector processors use memory banks rather than simple interleaving for two primary reasons:

1. Many vector computers support multiple loads or stores per clock. To support multiple simultaneous accesses, the memory system needs to have multiple banks and be able to control the addresses to the banks independently.

2. As we will see in the next section, many vector processors support the ability to load or store data words that are not sequential. In such cases, independent bank addressing, rather than interleaving, is required.

In Chapter 5 we saw that the desired access rate and the bank access time determined how many banks were needed to access a memory without a stall. The next Example shows how these timings work out in a vector processor.

deter-E X A M P L deter-E Suppose we want to fetch a vector of 64 elements starting at byte address

136, and a memory access takes six clocks How many memory banks must we have? With what addresses are the banks accessed? When will the various elements arrive at the CPU?

ANSWER  Six clocks per access require at least six banks, but because we want the number of banks to be a power of two, we choose to have eight banks. Figure B.7 shows what byte addresses each bank accesses within each time period. Remember that a bank begins a new access as soon as it has completed the old access.


Figure B.8 shows the timing for the first few sets of accesses for an eight-bank system with a six-clock-cycle access latency. There are two important observations about Figures B.7 and B.8: First, notice that the exact address fetched by a bank is largely determined by the lower-order bits in the bank number; however, the initial access to a bank is always within eight double words of the starting address. Second, notice that once the initial latency is overcome (six clocks in this case), the pattern is to access a bank every n clock cycles, where n is the total number of banks (n = 8 in this case).
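The Example's access pattern can be simulated directly. The sketch below (Python) models eight banks with a six-clock access time and a single result bus that delivers one word per clock; exactly which clock a word counts as "delivered" differs by a cycle or so from Figure B.8's convention, so treat the output as illustrative.

    def element_arrivals(n_elems=64, base=136, banks=8, access=6):
        free_at = [0] * banks      # clock at which each bank can start anew
        deliver = 0                # last clock at which the bus delivered
        out = []
        for i in range(n_elems):
            addr = base + 8 * i                # 8-byte elements
            bank = (addr // 8) % banks         # word-interleaved banks
            ready = free_at[bank] + access     # bank finishes this access
            free_at[bank] = ready              # then may begin the next one
            deliver = max(ready, deliver + 1)  # one word per clock on the bus
            out.append((i, addr, bank, deliver))
        return out

    for row in element_arrivals()[:9]:
        print(row)   # first word at clock 6, then roughly one per clock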

mem-a word in the smem-ame block of eight distinguishes this type of memory system from interleaved memory Normally, interleaved memory systems combine the bank ad- dress and the base starting address by concatenation rather than addition Also, interleaved memories are almost always implemented with synchronized access Memory banks require address latches for each bank, which are not normally needed in a system with only interleaving This timing diagram is drawn as if all banks access in clock 0, clock 16, etc In practice, since the bus allocations needed

to return the words are staggered, the actual accesses are often staggered.

FIGURE B.8 Access timing for the first 64 double-precision words of the load. After the six-clock-cycle initial latency, eight double-precision words are returned every eight clock cycles: memory accesses begin at clocks 0, 6, 14, 22, ..., 62, and the last eight words are delivered by clock 70. This timing diagram is drawn as if all banks access in clock 0, clock 16, etc. In practice, since the bus allocations needed to return the words are staggered, the actual accesses are often staggered.


The number of banks in the memory system and the pipeline depth in the functional units are essentially counterparts, since they determine the initiation rates for operations using these units. The processor cannot access a memory bank faster than the memory cycle time. Thus, if memory is built from DRAM, where the memory cycle time is about twice the access time, the processor needs twice as many banks as the above Example shows. For memory systems that support multiple simultaneous vector accesses or allow nonsequential accesses in vector loads or stores, the number of memory banks should be larger than the minimum; otherwise, memory bank conflicts will exist. We explore this in more detail in the next section.
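As a quick rule-of-thumb calculation (a sketch, not from the text): the number of banks must cover the bank busy time in clocks, rounded up to a power of two as in the Example above; for DRAM, the busy time is the memory cycle time, about twice the access time.

    import math

    def banks_needed(busy_clocks):
        # smallest power of two >= the bank busy time in clocks
        return 2 ** math.ceil(math.log2(busy_clocks))

    print(banks_needed(6))       # 6-clock access time -> 8 banks
    print(banks_needed(2 * 6))   # DRAM cycle time ~ 2x access -> 16 banks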

B.3  Two Real-World Issues: Vector Length and Stride

This section deals with two issues that arise in real programs: What do you do when the vector length in a program is not exactly 64? How do you deal with nonadjacent elements in vectors that reside in memory? First, let's consider the issue of vector length.

Vector-Length Control

A vector-register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for DLXV, is unlikely to match the real vector length in a program. Moreover, in a real program the length of a particular vector operation is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:

      do 10 i = 1,n
10       Y(i) = a * X(i) + Y(i)

The size of all the vector operations depends on n, which may not even be known until runtime! The value of n might also be a parameter to a procedure containing the above loop and therefore be subject to change during execution.

The solution to these problems is to create a vector-length register (VLR). The VLR controls the length of any vector operation, including a vector load or store. The value in the VLR, however, cannot be greater than the length of the vector registers. This solves our problem as long as the real length is less than the maximum vector length (MVL) defined by the processor.

What if the value of n is not known at compile time, and thus may be greater than MVL? To tackle this problem, where the vector is longer than the maximum length, a technique called strip mining is used. Strip mining is the generation of code such that each vector operation is done for a size less than or


equal to the MVL. We could strip-mine the loop in the same manner that we unrolled loops in Chapter 4: Create one loop that handles any number of iterations that is a multiple of MVL and another loop that handles any remaining iterations, which must be less than MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to handle both portions by changing the length. The strip-mined version of the DAXPY loop written in FORTRAN, the major language used for scientific applications, is shown with C-style comments:

      low = 1
      VL = (n mod MVL)           /*find the odd size piece*/
      do 1 j = 0,(n / MVL)       /*outer loop*/
         do 10 i = low,low+VL-1  /*runs for length VL*/
10          Y(i) = a*X(i) + Y(i) /*main operation*/
         low = low+VL            /*start of next vector*/
         VL = MVL                /*reset the length to max*/
1     continue

The term n/MVL represents truncating integer division (which is what FORTRAN does) and is used throughout this section. The effect of this loop is to block the vector into segments that are then processed by the inner loop. The length of the first segment is (n mod MVL) and all subsequent segments are of length MVL. This is depicted in Figure B.9.

The inner loop of the code above is vectorizable with length VL, which is equal to either (n mod MVL) or MVL. The VLR register must be set twice, once at each place where the variable VL in the code is assigned. With multiple vector operations executing in parallel, the hardware must copy the value of VLR when a vector operation issues, in case VLR is changed for a subsequent vector operation.
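For readers who prefer it, here is the same strip-mining logic as a Python sketch (0-indexed, with MVL = 64 assumed):

    def daxpy_stripmined(a, X, Y, MVL=64):
        n = len(X)
        low = 0
        vl = n % MVL                        # odd-size piece first
        for _ in range(n // MVL + 1):       # outer loop
            for i in range(low, low + vl):  # one vector operation, length vl
                Y[i] = a * X[i] + Y[i]
            low += vl
            vl = MVL                        # reset the length to the maximum
        return Y

    X = [float(i) for i in range(200)]
    Y = [1.0] * 200
    assert daxpy_stripmined(2.0, X, Y) == [2.0 * i + 1.0 for i in range(200)]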

FIGURE B.9 A vector of arbitrary length processed with strip mining. All blocks but the first are of length MVL, utilizing the full power of the vector processor. In this figure, the variable m is used for the expression (n mod MVL).

Value of j:   0      1               2                      3                      ...   n/MVL
Range of i:   1..m   (m+1)..(m+MVL)  (m+MVL+1)..(m+2×MVL)   (m+2×MVL+1)..(m+3×MVL) ...   (...)..n


In addition to the start-up overhead, we need to account for the overhead of executing the strip-mined loop. This strip-mining overhead, which arises from the need to reinitiate the vector sequence and set the VLR, effectively adds to the vector start-up time, assuming that a convoy does not overlap with other instructions. If that overhead for a convoy is 10 cycles, then the effective overhead per 64 elements increases by 10 cycles, or 0.15 cycles per element.

There are two key factors that contribute to the running time of a strip-mined loop consisting of a sequence of convoys:

1. The number of convoys in the loop, which determines the number of chimes. We use the notation Tchime for the execution time in chimes.

2. The overhead for each strip-mined sequence of convoys. This overhead consists of the cost of executing the scalar code for strip mining each block, Tloop, plus the vector start-up cost for each convoy, Tstart.

There may also be a fixed overhead associated with setting up the vector sequence the first time. In recent vector processors this overhead has become quite small, so we ignore it.

These components can be used to state the total running time for a vector sequence operating on a vector of length n, which we will call Tn:

Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime

The values of Tstart, Tloop, and Tchime are compiler and processor dependent. The register allocation and scheduling of the instructions affect both what goes in a convoy and the start-up overhead of each convoy.

For simplicity, we will use a constant value for Tloop on DLXV. Based on a variety of measurements of CRAY-1 vector execution, the value chosen is 15 for Tloop. At first glance, you might think that this value is too small. The overhead in each loop requires setting up the vector starting addresses and the strides, incrementing counters, and executing a loop branch. In practice, these scalar instructions can be totally or partially overlapped with the vector instructions, minimizing the time spent on these overhead functions. The value of Tloop of course depends on the loop structure, but the dependence is slight compared with the connection between the vector code and the values of Tchime and Tstart.

EXAMPLE  What is the execution time on DLXV for the vector operation A = B × s, where s is a scalar and the length of the vectors A and B is 200?

ANSWER  Assume the addresses of A and B are initially in Ra and Rb, s is in Fs, and recall that for DLX (and DLXV) R0 always holds 0. Since (200 mod 64) = 8, the first iteration of the strip-mined loop will execute for a vector length


of eight elements, and the following iterations will execute for a vector length of 64 elements. The starting byte address of the next segment of each vector is eight times the vector length. Since the vector length is either eight or 64, we increment the address registers by 8 × 8 = 64 after the first segment and 8 × 64 = 512 for latter segments. The total number of bytes in the vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment to the initial address plus 1600. Here is the actual code:

      ADDI   R2,R0,#1600   ;total # bytes in vector
      ADD    R2,R2,Ra      ;address of the end of A vector
      ADDI   R1,R0,#8      ;loads length of 1st segment
      MOVI2S VLR,R1        ;load vector length in VLR
      ADDI   R1,R0,#64     ;length in bytes of 1st segment
      ADDI   R3,R0,#64     ;vector length other segments
Loop: LV     V1,Rb         ;load B
      MULTSV V2,Fs,V1      ;vector * scalar
      SV     Ra,V2         ;store A
      ADD    Ra,Ra,R1      ;address of next segment of A
      ADD    Rb,Rb,R1      ;address of next segment of B
      ADDI   R1,R0,#512    ;load byte offset next segment
      MOVI2S VLR,R3        ;set length to 64 elements
      SUB    R4,R2,Ra      ;at the end of A?
      BNEZ   R4,Loop       ;if not, go back

The three vector instructions in the loop are dependent and must go into three convoys, hence Tchime = 3. Let's use our basic formula:

T200 = ⌈200/64⌉ × (Tloop + Tstart) + 200 × Tchime
T200 = 4 × (15 + Tstart) + 200 × 3 = 660 + (4 × Tstart)

The value of Tstart is the sum of:

■ The vector load start-up of 12 clock cycles
■ A seven-clock-cycle start-up for the multiply
■ A 12-clock-cycle start-up for the store

Thus, the value of Tstart is 12 + 7 + 12 = 31, and so the total time is

T200 = 660 + (4 × 31) = 784 clock cycles


The execution time per element with all start-up costs is then 784/200 = 3.9, compared with a chime approximation of three. In section B.6, we will be more ambitious, allowing overlapping of separate convoys. ■
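The whole calculation fits in a few lines of Python; a sketch that reproduces the Example's numbers:

    import math

    def T(n, t_loop, t_start, t_chime, MVL=64):
        # total running time in clocks for a strip-mined vector sequence
        return math.ceil(n / MVL) * (t_loop + t_start) + n * t_chime

    t_start = 12 + 7 + 12        # load + multiply + store start-ups
    total = T(200, 15, t_start, 3)
    print(total, total / 200)    # 784 clocks, 3.92 per element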

Figure B.10 shows the overhead and effective rates per element for the above example (A = B × s) with various vector lengths. A chime counting model would lead to three clock cycles per element, while the two sources of overhead add 0.9 clock cycles per element in the limit.

The next few sections introduce enhancements that reduce this time. We will see how to reduce the number of convoys and hence the number of chimes using a technique called chaining. The loop overhead can be reduced by further overlapping the execution of vector and scalar instructions, allowing the scalar loop overhead in one iteration to be executed while the vector instructions in the previous instruction are completing. Finally, the vector start-up overhead can also be eliminated, using a technique that allows overlap of vector instructions in separate convoys.

FIGURE B.10 This shows the total execution time per element and the total overhead time per element, versus the vector length for the Example on page B-17. For short vectors the total start-up time is more than one-half of the total time, while for long vectors it reduces to about one-third of the total time. The sudden jumps occur when the vector length crosses a multiple of 64, forcing another iteration of the strip-mining code and execution of a set of vector instructions. These operations increase Tn by Tloop + Tstart. The two curves plotted are the total time per element and the total overhead per element.


Vector Stride

The second problem this section addresses is that the position in memory of adjacent elements in a vector may not be sequential. Consider the straightforward code for matrix multiply:

      do 10 i = 1,100
         do 10 j = 1,100
            A(i,j) = 0.0
            do 10 k = 1,100
10             A(i,j) = A(i,j) + B(i,k)*C(k,j)

At the statement labeled 10 we could vectorize the multiplication of each row of B with each column of C and strip-mine the inner loop with k as the index variable.

To do so, we must consider how adjacent elements in B and adjacent elements in C are addressed. As we discussed in section 5.3, when an array is allocated memory it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, if the above loop were written in FORTRAN, which allocates column-major order, the elements of B that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry) for a total of 800 bytes. In Chapter 5, we saw that blocking could be used to improve the locality in cache-based systems. In vector processors we do not have caches, so we need another technique to fetch elements of a vector that are not adjacent in memory.
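To make the 800-byte figure concrete, here is a small Python sketch of column-major address arithmetic (100 × 100 matrices and 8-byte elements assumed):

    def addr_colmajor(base, i, j, rows=100, size=8):
        # byte address of element (i, j), 1-indexed as in FORTRAN
        return base + ((j - 1) * rows + (i - 1)) * size

    # successive inner-loop accesses B(i,k) and B(i,k+1) are a row apart:
    print(addr_colmajor(0, 1, 2) - addr_colmajor(0, 1, 1))   # 800 bytes
    # successive accesses C(k,j) and C(k+1,j) are adjacent:
    print(addr_colmajor(0, 2, 1) - addr_colmajor(0, 1, 1))   # 8 bytes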

This distance separating elements that are to be gathered into a single register is called the stride. In the current example, using column-major layout for the matrices means that matrix C has a stride of 1, or 1 double word (8 bytes), separating successive elements, and matrix B has a stride of 100, or 100 double words (800 bytes).

Once a vector is loaded into a vector register it acts as if it had logically adjacent elements. Thus a vector-register processor can handle strides greater than one, called nonunit strides, using only vector-load and vector-store operations with stride capability. This ability to access nonsequential memory locations and to reshape them into a dense structure is one of the major advantages of a vector processor over a cache-based processor. Caches inherently deal with unit stride data, so that while increasing block size can help reduce miss rates for large scientific data sets, increasing block size can have a negative effect for data that is accessed with nonunit stride. While blocking techniques can solve some of these problems (see section 5.3), the ability to efficiently access data that is not contiguous remains an advantage for vector processors on certain problems.

On DLXV, where the addressable unit is a byte, the stride for our example would be 800. The value must be computed dynamically, since the size of the matrix may not be known at compile time, or, just like vector length, may


change for different executions of the same statement. The vector stride, like the vector starting address, can be put in a general-purpose register. Then the DLXV instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit stride vector is being stored, SVWS (store vector with stride) can be used. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required.
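Functionally, a strided load is a gather from addresses R1 + i × R2; a Python sketch of that semantics follows (the memory is modeled as a simple address-to-word mapping, purely for illustration):

    def lvws(mem, r1, r2, vlr=64):
        # load vector with stride: element i comes from address r1 + i*r2
        return [mem[r1 + i * r2] for i in range(vlr)]

    mem = {800 * i: float(i) for i in range(64)}   # one row of B, stride 800
    print(lvws(mem, 0, 800)[:4])                   # [0.0, 1.0, 2.0, 3.0]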

Complications in the memory system can occur from supporting strides greater than one. In Chapter 5 we saw that memory accesses could proceed at full speed if the number of memory banks was at least as large as the memory-access time in clock cycles. Once nonunit strides are introduced, however, it becomes possible to request accesses from the same bank at a higher rate than the memory-access time. When multiple accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if

Least common multiple(Stride, Number of banks) / Stride < Memory-access latency

EXAMPLE  Suppose we have 16 memory banks with a read latency of 12 clocks. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32?

ANSWER  Since the number of banks is larger than the read latency, for a stride of 1, the load will take 12 + 64 = 76 clock cycles, or 1.2 clocks per element. The worst possible stride is a value that is a multiple of the number of memory banks, as in this case with a stride of 32 and 16 memory banks. Every access to memory will collide with the previous one. This leads to a read latency of 12 clock cycles per element and a total time for the vector load of 12 × 64 = 768 clock cycles. ■

Memory bank conflicts will not occur if the stride and number of banks are relatively prime with respect to each other and there are enough banks to avoid conflicts in the unit-stride case. When there are no bank conflicts, multiword and unit strides run at the same rates. Increasing the number of memory banks to a number greater than the minimum to prevent stalls with a stride of length 1 will decrease the stall frequency for some other strides. For example, with 64 banks, a stride of 32 will stall on every other access, rather than every access. If we originally had a stride of 8 and 16 banks, every other access would stall; while with 64 banks, a stride of 8 will stall on every eighth access. If we have multiple memory pipelines, we will also need more banks to prevent conflicts. In 1995, most vector supercomputers have at least 64 banks, and some have as many as 1024 in the maximum memory configuration. Because bank conflicts can still occur in nonunit stride cases, many programmers favor unit stride accesses whenever possible.
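The stall condition above is easy to check mechanically; a Python sketch using the gcd to form the least common multiple, applied to the cases just discussed:

    import math

    def stalls(stride, banks, latency):
        # a bank conflict (stall) occurs if lcm(stride, banks)/stride
        # is smaller than the memory-access latency in clocks
        lcm = stride * banks // math.gcd(stride, banks)
        return lcm // stride < latency

    print(stalls(1, 16, 12))    # False: unit stride with enough banks
    print(stalls(32, 16, 12))   # True: every access hits the same bank
    print(stalls(32, 64, 12))   # True, but only every other access stalls
    print(stalls(8, 64, 12))    # True, with a stall every eighth access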


B.4  Effectiveness of Compiler Vectorization

Two factors affect the success with which a program can be run in vector mode. The first factor is the structure of the program itself: Do the loops have true data dependences, or can they be restructured so as not to have such dependences? This factor is influenced by the algorithms chosen and, to some extent, by how they are coded. The second factor is the capability of the compiler. While no compiler can vectorize a loop where no parallelism among the loop iterations exists, there is tremendous variation in the ability of compilers to determine whether a loop can be vectorized. The techniques used to vectorize programs are the same as those discussed in Chapter 4 for uncovering ILP; here we simply review how well these techniques work.

As an indication of the level of vectorization that can be achieved in scientific programs, let's look at the vectorization levels observed for the Perfect Club benchmarks, mentioned in Chapter 1. These benchmarks are large, real scientific applications. Figure B.11 shows the percentage of floating-point operations in each benchmark and the percentage executed in vector mode on the CRAY X-MP. The wide variation in level of vectorization has been observed by several studies of the performance of applications on vector processors. While better compilers might improve the level of vectorization in some of these programs, most will

FIGURE B.11 Benchmark name, FP operations, and the percentage of FP operations executed in vector mode, for the Perfect Club benchmarks measured on the CRAY X-MP.


require rewriting to achieve significant increases in vectorization. For example, a new program or a significant rewrite will be needed to obtain the benefits of a vector processor on SPICE.

There is also tremendous variation in how well compilers do in vectorizing programs. As a summary of the state of vectorizing compilers, consider the data in Figure B.12, which shows the extent of vectorization for different processors using a test suite of 100 hand-written FORTRAN kernels. The kernels were designed to test vectorization capability and can all be vectorized by hand; we will see several examples of these loops in the Exercises.

B.5  Enhancing Vector Performance

Three techniques for improving the performance of vector processors are discussed in this section. The first deals with making a sequence of dependent vector operations run faster. The other two deal with expanding the class of loops that can be run in vector mode. The first technique, chaining, originated in the CRAY-1, but is now supported on most vector processors. The techniques discussed in the second and third parts of this section combat the effects of conditional execution and sparse matrices. The extensions are taken from a variety of processors including the most recent supercomputers.

Processor | Compiler | Completely vectorized | Partially vectorized | Not vectorized

FIGURE B.12 Result of applying vectorizing compilers to the 100 FORTRAN test kernels. For each processor we indicate how many loops were completely vectorized, partially vectorized, and unvectorized. These loops were collected by Callahan, Dongarra, and Levine [1988]. Two different compilers for the CRAY X-MP show the large dependence on compiler technology.
