
PRINCIPLES OF COMPUTER ARCHITECTURE, Part 8: Trends in Computer Architecture


At the other extreme of complexity is the bus topology, which is illustrated in Figure 10-15b. With the bus topology, a fixed amount of bus bandwidth is shared among the PEs. The crosspoint complexity is N for N PEs, and the network diameter is 1, so the bus grows more gracefully than the crossbar. There can only be one source at a time, and there is normally only one receiver, so blocking is a frequent situation for a bus.

Figure 10-15 Network topologies: (a) crossbar; (b) bus; (c) ring; (d) mesh; (e) star; (f) tree; (g) perfect shuffle; (h) hypercube.

In a ring topology, there are N crosspoints for N PEs, as shown in Figure 10-15c. As for the crossbar, each crosspoint is contained within a PE. The network diameter is N/2, but the collective bandwidth is N times greater than for the case of the bus. This is because adjacent PEs can communicate directly with each other over their common link without affecting the rest of the network.

In the mesh topology, there are N crosspoints for N PEs, but the diameter is only 2√N, as shown in Figure 10-15d. All PEs can simultaneously communicate in just 3√N steps, as discussed in (Leighton, 1992), using an off-line routing algorithm (in which the crosspoint settings are determined external to the PEs).

In the star topology, there is a central hub through which all PEs communicate, as shown in Figure 10-15e. Since all of the connection complexity is centralized, the star can only grow to sizes that are bounded by the technology, which is normally less than for decentralized topologies like the mesh. The crosspoint complexity within the hub varies according to the implementation, which can be anything from a bus to a crossbar.

In the tree topology, there are N crosspoints for N PEs, and the diameter is 2log2N - 1, as shown in Figure 10-15f. The tree is effective for applications in which there is a great deal of distributing and collecting of data.

In the perfect shuffle topology, there are N crosspoints for N PEs, as shown in Figure 10-15g. The diameter is log2N, since it takes log2N passes through the network to connect any PE with any other in the worst case. The perfect shuffle name comes from the property that if a deck of 2^N cards, in which N is an integer, is cut in half and interleaved N times, then the original configuration of the deck is restored. All N PEs can simultaneously communicate in 3log2N - 1 passes through the network, as presented in (Wu and Feng, 1981).
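The shuffle connection itself can be stated compactly as a bit manipulation: a PE's log2N-bit address is rotated left by one position. A minimal C sketch of this mapping (the function name and the 8-PE demo size are ours, not from the text):

#include <stdio.h>

/* Perfect-shuffle mapping: rotate the log2(N)-bit PE address left by
   one position.  Applying the mapping log2(N) times returns every
   address to itself, consistent with the log2(N)-pass diameter. */
unsigned shuffle(unsigned pe, unsigned bits, unsigned N)
{
    return ((pe << 1) | (pe >> (bits - 1))) & (N - 1);
}

int main(void)
{
    unsigned N = 8, bits = 3;            /* 8 PEs with 3-bit addresses */
    for (unsigned pe = 0; pe < N; pe++)
        printf("PE %u -> PE %u\n", pe, shuffle(pe, bits, N));
    return 0;
}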

Finally, the hypercube has N crosspoints for N PEs, with a diameter of log2N, as shown in Figure 10-15h. The smaller number of crosspoints with respect to the perfect shuffle topology is balanced by a greater connection complexity in the PEs.
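For a concrete feel for these growth rates, the following sketch (our own illustration, not from the text) tabulates the crosspoint counts and diameters quoted above; it assumes N is a power of two whose square root is an integer:

#include <stdio.h>
#include <math.h>

/* Crosspoint counts and network diameters for N PEs, as quoted in
   the discussion of Figure 10-15.  Compile with -lm. */
int main(void)
{
    int N = 64;
    double lg = log2(N), rt = sqrt(N);

    printf("crossbar:  %4d crosspoints, diameter 1\n", N * N);
    printf("bus:       %4d crosspoints, diameter 1\n", N);
    printf("ring:      %4d crosspoints, diameter %d\n", N, N / 2);
    printf("mesh:      %4d crosspoints, diameter %.0f\n", N, 2 * rt);
    printf("tree:      %4d crosspoints, diameter %.0f\n", N, 2 * lg - 1);
    printf("shuffle:   %4d crosspoints, diameter %.0f\n", N, lg);
    printf("hypercube: %4d crosspoints, diameter %.0f\n", N, lg);
    return 0;
}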

Let us now consider the behavior of blocking in interconnection networks. Figure 10-17a shows a configuration in which four processors are interconnected with a two-stage perfect shuffle network in which each crosspoint either passes both inputs straight through to the outputs, or exchanges the inputs to the outputs. A path is enabled from processor 0 to processor 3, and another path is enabled from processor 3 to processor 0. Neither processor 1 nor processor 2 needs to communicate, but they participate in some arbitrary connections as a side effect of the crosspoint settings that are already specified.

Suppose that we want to add another connection, from processor 1 to processor 1. There is no way to adjust the unused crosspoints to accommodate this new connection, because all of the crosspoints are already set, and the needed connection does not occur as a side effect of the current settings. Thus, connection 1 → 1 is blocked. If we are allowed to rearrange the settings of the crosspoints that are already in use, then we can accommodate all three connections, as illustrated in Figure 10-17b. An interconnection network that operates in this manner is referred to as a rearrangeably nonblocking network.

Figure 10-17 (a) Crosspoint settings for connections 0 → 3 and 3 → 0; (b) adjusted settings to accommodate connection 1 → 1.

The three-stage Clos network is strictly nonblocking. That is, there is no need to disturb the existing settings of the crosspoints in order to add another connection. An example of a three-stage Clos network is shown in Figure 10-18 for four PEs. In the input stage, each crosspoint is actually a crossbar that can make any connection of the two inputs to the three outputs. The crosspoints in the middle stage and the output stage are also small crossbars. The number of inputs to each input crosspoint and the number of outputs from each output crosspoint is selected according to the desired complexity of the crosspoints, and the desired complexity of the middle stage.

The middle stage has three crosspoints in this example, and in general, there are (n - 1) + (p - 1) + 1 = n + p - 1 crosspoints in the middle stage, in which n is the number of inputs to each input crosspoint and p is the number of outputs from each output crosspoint. This is how the three-stage Clos network maintains a strictly nonblocking property. There are n - 1 ways that an input can be blocked at the output of an input-stage crosspoint as a result of existing connections. Similarly, there are p - 1 ways that existing connections can block a desired connection into an output crosspoint. In order to ensure that every desired connection can be made between available input and output ports, there must be one more path available.

For this case, n = 2 and p = 2, and so we need n + p - 1 = 2 + 2 - 1 = 3 paths from every input crosspoint to every output crosspoint. Architecturally, this relationship is satisfied with three crosspoints in the middle stage that each connect every input crosspoint to every output crosspoint.
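The counting argument translates directly into a sizing rule. A minimal sketch in C (the function name is ours):

#include <stdio.h>

/* Existing connections can occupy at most n-1 middle-stage crossbars
   as seen from an input crossbar, and at most p-1 as seen from an
   output crossbar, so (n-1) + (p-1) + 1 middle-stage crossbars
   guarantee that a free path always exists. */
int clos_middle(int n, int p)
{
    return (n - 1) + (p - 1) + 1;    /* = n + p - 1 */
}

int main(void)
{
    printf("n = 2, p = 2 needs %d middle-stage crossbars\n",
           clos_middle(2, 2));       /* prints 3, as in Figure 10-18 */
    return 0;
}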

EXAMPLE: STRICTLY NONBLOCKING NETWORK

For this example, we want to design a strictly nonblocking (three-stage Clos) network for 12 channels (12 inputs and 12 outputs to the network) while maintaining a low maximum complexity of any crosspoint in the network.

There are a number of ways that we can organize the network. For the input stage, we can have two input nodes with 6 inputs per node, or 6 input nodes with two inputs per node, to list just two possibilities. We have similar choices for the output stage. Let us start by looking at a configuration that has two nodes in the input stage and two nodes in the output stage, with 6 inputs for each node in the input stage and 6 outputs for each node in the output stage. For this case, n = p = 6, which means that n + p - 1 = 11 nodes are needed in the middle stage, as shown in Figure 10-19. The maximum complexity of any node for this case is 6 × 11 = 66, for each of the input and output nodes.

Figure 10-19 A 12-channel three-stage Clos network with n = p = 6.

Now let us try using 6 input nodes and 6 output nodes, with two inputs for each input node and two outputs for each output node. For this case, n = p = 2, which means that n + p - 1 = 3 nodes are needed in the middle stage, as shown in Figure 10-20. The maximum node complexity for this case is 6 × 6 = 36 for each of the middle stage nodes, which is better than the maximum node complexity of 66 for the previous case.

Similarly, networks for n = p = 4 and n = p = 3 are shown in Figure 10-21 and Figure 10-22, respectively. The maximum node complexity for each of these networks is 4 × 7 = 28 and 4 × 4 = 16, respectively. Among the four configurations, the n = p = 3 network achieves the lowest maximum node complexity.

Figure 10-21 A 12-channel three-stage Clos network with n = p = 4.
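The arithmetic of the four configurations can be reproduced by a short search over the divisors of the channel count. This sketch is entirely ours; it assumes n = p and that an r-input, c-output crossbar costs r × c crosspoints:

#include <stdio.h>

/* For a C-channel Clos network with n = p, there are C/n input nodes
   and C/n output nodes of size n x (n + p - 1), and n + p - 1 middle
   nodes of size (C/n) x (C/n).  Report the maximum node complexity. */
int main(void)
{
    int C = 12;
    for (int n = 2; n <= C / 2; n++) {
        if (C % n)
            continue;
        int m    = 2 * n - 1;              /* middle-stage nodes     */
        int edge = n * m;                  /* input/output node cost */
        int mid  = (C / n) * (C / n);      /* middle node cost       */
        int max  = edge > mid ? edge : mid;
        printf("n = p = %d: %2d middle nodes, max complexity %d\n",
               n, m, max);
    }
    return 0;
}

For the 12-channel case this prints maximum complexities of 36, 16, 28, and 66 for n = 2, 3, 4, and 6, matching the figures worked out above.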


10.9.3 MAPPING AN ALGORITHM ONTO A PARALLEL ARCHITECTURE

The process of mapping an algorithm onto a parallel architecture begins with a dependency analysis, in which data dependencies among the operations in a program are identified. Consider the C code shown in Figure 10-23. In an ordinary SISD processor, the four numbered statements require four time steps to complete, as illustrated in the control sequence of Figure 10-24a. The dependency graph shown in Figure 10-24b exposes the natural parallelism in the control sequence. The dependency graph is created by assigning each operation in the original program to a node in the graph, and then drawing a directed arc from each node that produces a result to the node(s) that needs it.

The control sequence requires four time steps to complete, but the dependency graph shows that the program can be completed in just three time steps, since operations 0 and 1 do not depend on each other and can be executed simultaneously (as long as there are two processors available). The resulting speedup of 4/3 may not be very great, but for other programs the opportunity for speedup can be substantial, as we will see.

func(x, y)                     /* Compute (x^2 + y^2) * y^2 */
int x, y;
{
    int temp0, temp1, temp2, temp3;

    temp0 = x * x;             /* Operation 0 */
    temp1 = y * y;             /* Operation 1 */
    temp2 = temp0 + temp1;     /* Operation 2 */
    temp3 = temp1 * temp2;     /* Operation 3 */
    return (temp3);
}

Figure 10-23 A C function that computes (x^2 + y^2) × y^2.

Figure 10-24 (a) Control sequence and (b) dependency graph for the program of Figure 10-23. Arrows in the control sequence represent flow of control; arrows in the dependency graph represent flow of data.

Consider a matrix multiplication problem Ax = b, in which A is a 4×4 matrix and x and b are both 4×1 matrices, as illustrated in Figure 10-25a. Our goal is to solve for the b_i, using the equations shown in Figure 10-25b. Every operation is assigned a number, starting from 0 and ending at 27. There are 28 operations, assuming that no operation can receive more than two operands. A program running on a SISD processor that computes the b_i requires 28 time steps to complete, if we make a simplifying assumption that additions and multiplications take the same amount of time.

A dependency graph for this problem is shown in Figure 10-26. The worst case path from any input to any output traverses three nodes, and so the entire process can be completed in three time steps, resulting in a speedup of 28/3 ≈ 9.3.
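A direct rendering of the computation in C makes the operation count explicit. This is a sketch with made-up values for A and x; the pairwise grouping of the additions mirrors the three-node critical path of the dependency graph:

#include <stdio.h>

/* b = Ax for a 4x4 matrix A: each b_i costs 4 multiplications and
   3 additions, 28 operations in all.  Grouping the additions pairwise
   gives a critical path of one multiply followed by two adds, i.e. the
   three nodes per path seen in Figure 10-26. */
int main(void)
{
    double A[4][4] = { { 1,  2,  3,  4}, { 5,  6,  7,  8},
                       { 9, 10, 11, 12}, {13, 14, 15, 16} };
    double x[4] = { 1, 2, 3, 4 };
    double b[4];
    int mults = 0, adds = 0;

    for (int i = 0; i < 4; i++) {
        double t[4];
        for (int j = 0; j < 4; j++) {
            t[j] = A[i][j] * x[j];
            mults++;
        }
        b[i] = (t[0] + t[1]) + (t[2] + t[3]);
        adds += 3;
        printf("b%d = %g\n", i, b[i]);
    }
    printf("%d multiplications + %d additions = %d operations\n",
           mults, adds, mults + adds);
    return 0;
}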

Now that we know the structure of the data dependencies, we can plan a mapping of the nodes of the dependency graph to PEs in a parallel processor. Figure 10-27a shows a mapping in which each node of the dependency graph for b_0 is assigned to a unique PE. The time required to complete each addition is 10 ns, the time to complete each multiplication is 100 ns, and the time to communicate between PEs is 1000 ns. These numbers are for a fictitious processor, but the methods extend to real parallel processors.

Figure 10-26 Dependency graph for matrix multiplication.

Figure 10-27 Mapping tasks onto PEs: (a) one PE per operation (fine grain, parallel time PT = 2120 ns); (b) one PE per b_i. Timing assumptions: addition = 10 ns, multiplication = 100 ns, communication = 1000 ns.


As we can see from the parallel time of 2120 ns to execute the program using the mapping shown in Figure 10-27a, the time spent in communication dominates performance. This is worse than a SISD approach, since the 16 multiplications and the 12 additions would require 16 × 100 ns + 12 × 10 ns = 1720 ns. There is no processor-to-processor communication cost within a SISD processor, and so only the computation time is considered.

An alternative mapping is shown in Figure 10-27b, in which all of the operations needed to compute b_0 are clustered onto the same PE. We have thus increased the granularity of the computation, which is a measure of the number of operations assigned to each PE. A single PE is a sequential, SISD processor, and so none of the operations within a cluster can be executed in parallel, but the communication time among the operations is reduced to 0. As shown in the diagram, the parallel time for b_0 is now 430 ns, which is much better than either the previous parallel mapping or a straight SISD mapping. Since there are no dependencies among the b_i, they can all be computed in parallel, using one processor per b_i. The actual speedup is now:

Speedup = T_Sequential / T_Parallel = 1720 ns / 430 ns = 4
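The arithmetic behind these figures is easy to check. The sketch below is our own; it assumes the fine-grain critical path is multiply, communicate, add, communicate, add, which reproduces the 2120 ns quoted for Figure 10-27a:

#include <stdio.h>

/* Timing model for the two mappings of Figure 10-27: addition 10 ns,
   multiplication 100 ns, PE-to-PE communication 1000 ns. */
int main(void)
{
    int add = 10, mul = 100, comm = 1000;

    int fine   = mul + comm + add + comm + add;  /* 2120 ns: one PE
                                                    per operation     */
    int coarse = 4 * mul + 3 * add;              /*  430 ns: one PE
                                                    per b_i           */
    int sisd   = 16 * mul + 12 * add;            /* 1720 ns: no
                                                    communication     */

    printf("fine grain: %d ns, coarse grain: %d ns, SISD: %d ns\n",
           fine, coarse, sisd);
    printf("speedup = %d / %d = %g\n", sisd, coarse,
           (double)sisd / coarse);
    return 0;
}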

Communication is always a bottleneck in parallel processing, and so it is important that we maintain a proper balance. We should not be led astray into thinking that adding processors to a problem will speed it up, when in fact adding processors can increase execution time as a result of communication time. In general, we want to maintain a ratio in which:

T_Communication / T_Computation ≤ 1

The Connection Machine (CM-1) is a massively parallel SIMD processor designed and built by Thinking Machines Corporation during the 1980s. The architecture is noted for high connectivity between a large number of small processors. The CM-1 consists of a large number of one-bit processors arranged at the vertices of an n-space hypercube. Each processor communicates with other processors via routers that send and receive messages along each dimension of the hypercube.

A block diagram of the CM-1 is shown in Figure 10-28. The host computer is a conventional SISD machine, such as a Symbolics computer (which was popular at the time), that runs a program written in a high level language such as LISP or C. Parallelizable parts of a high level program are farmed out to 2^n processors (2^16 processors is the size of a full CM-1) via a memory bus (for data) and a microcontroller (for instructions), and the results are collected via the memory bus. A separate high bandwidth datapath is provided for input and output directly to and from the hypercube.

The CM-1 makes use of a 12-space hypercube between the routers that send and receive data packets. The overall CM-1 prototype uses a 16-space hypercube, and so the difference between the 12-space router hypercube and the 16-space PE hypercube is made up by a crossbar that serves the 16 PEs attached to each router. For the purpose of example, a four-space hypercube is shown in Figure 10-29 for the router network. Each vertex of the hypercube is a router with an attached group of 16 PEs, each of which has a unique binary address. The router hypercube shown in Figure 10-29 thus serves 256 PEs. Routers that are directly connected to other routers can be found by inverting any one of the four most significant bits in the address.

Figure 10-28 Block diagram of the CM-1. (Adapted from [Hillis, 1985].)
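Hypercube neighbors are found by flipping one address bit at a time; for the router network of Figure 10-29, the router address occupies the four most significant bits of each attached PE's address. A small sketch of the neighbor computation (ours):

#include <stdio.h>

/* Neighbors of a router in an n-space hypercube: invert each of the
   n address bits in turn.  For Figure 10-29, n = 4. */
int main(void)
{
    unsigned n = 4, router = 0x5;            /* router address 0101 */
    for (unsigned bit = 0; bit < n; bit++)
        printf("across dimension %u: router %X\n",
               bit, router ^ (1u << bit));
    return 0;
}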

Each PE is made up of a 16-bit flag register, a three-input, two-output ALU, and a 4096-bit random access memory, as shown in Figure 10-30. During operation, an external controller (the microcontroller of Figure 10-28) selects two bits from memory via the A address and B address lines. Only one value can be read from memory at a time, so the A value is buffered while the B value is fetched. The controller selects a flag to read, and feeds the flag and the A and B values into an ALU whose function it also selects. The result of the computation produces a new value for the A-addressed location and one of the flags.

Figure 10-29 A four-space hypercube for the router network. (The router address forms the four most significant bits of the address of each of the 16 PEs that the router serves.)

The ALU takes three one-bit data inputs, two from the memory and one from the flag register, along with 16 control inputs from the microcontroller, and produces two one-bit data outputs for the memory and flag registers. The ALU generates all 2^3 = 8 combinations (minterms) of the input variables for each of the two outputs. Eight of the 16 control lines select the minterms that are needed in the sum-of-products form of each output.
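In other words, each 8-bit control field is itself a truth table over the three one-bit inputs. A sketch of the idea (the function and the example constants are ours, not the CM-1's actual control encoding):

#include <stdio.h>

/* One CM-1-style ALU output: the 8 control bits act as a lookup table
   indexed by the minterm number of the three one-bit inputs.  Setting
   a control bit includes that minterm in the sum-of-products output. */
unsigned alu_out(unsigned ctl, unsigned a, unsigned b, unsigned flag)
{
    unsigned minterm = (a << 2) | (b << 1) | flag;   /* 0..7 */
    return (ctl >> minterm) & 1u;
}

int main(void)
{
    /* Example tables: 0x96 selects minterms 1,2,4,7 (odd parity, the
       sum bit of a full adder); 0xE8 selects minterms 3,5,6,7
       (majority, the carry bit).  This illustrates how a one-bit PE
       can perform bit-serial addition. */
    for (unsigned a = 0; a <= 1; a++)
        for (unsigned b = 0; b <= 1; b++)
            for (unsigned f = 0; f <= 1; f++)
                printf("%u+%u+%u -> sum %u carry %u\n", a, b, f,
                       alu_out(0x96, a, b, f), alu_out(0xE8, a, b, f));
    return 0;
}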

PEs communicate with other PEs through routers. Each router services communication between a PE and the network by receiving packets from the network intended for the attached PEs, injecting packets into the network, buffering when necessary, and forwarding messages that use the router as an intermediary to get to their destinations.

The CM-1 is a landmark machine for the massive parallelism made available by the architecture. For scalable problems like finite element analysis (such as modeling heat flow through the use of partial differential equations), the available parallelism can be fully exploited. There is usually a need for floating point manipulation in this case, and so floating point processors augment the PEs in the next generation CM-2. A natural way to model heat flow is through a mesh interconnect, which is implemented as a hardwired bypass to the message-passing routing mechanism through the North-East-West-South (NEWS) grid. Thus we can reduce the cost of PE-to-PE communication for this application.

Not all problems scale so well, and there is a general trend moving away from fine grain parallel processing. This is largely due to the difficulty of keeping the PEs busy doing useful work, while also keeping the time spent in computation greater than the time spent in communication. In the next section, we look at a coarse grain architecture: the CM-5.

The CM-5 (Thinking Machines Corporation) combines properties of both SIMD and MIMD architectures, and thereby provides greater flexibility for mapping a parallel algorithm onto the architecture. The CM-5 is illustrated in Figure 10-31. There are three types of processors, for data processing, control, and I/O. These processors are connected primarily by the Data Network and the Control Network, and to a lesser extent by the Diagnostic Network.

The processing nodes are assigned to control processors, which form partitions, as illustrated in Figure 10-32. A partition contains a control processor, a number of processing nodes, and dedicated portions of the Control and Data Networks. Note that there are both user partitions (where the data processing takes place) and I/O partitions.

The Data Network uses a fat-tree topology, as illustrated in Figure 10-33. The general idea is that the bandwidth from child nodes to parent nodes increases as the network approaches the root, to account for the increased traffic as data travels from the leaves toward the root.

Figure 10-31 Organization of the CM-5: data processors, control processors, and I/O processors, connected by the Data Network, the Control Network, and the Diagnostic Network.

The Control Network uses a simple binary tree topology in which the system components are at the leaves. A control processor occupies one leaf in a partition, and the processing nodes are placed in the remaining nodes, although not necessarily filling all possible node positions in a subtree.

The Diagnostic Network is a separate binary tree in which one or more diagnostic processors are at the root. At the leaves are physical components, such as circuit boards and backplanes, rather than logical components such as processing nodes.

Each control processor is a self-contained system that is comparable in complexity to a workstation. A control processor contains a RISC microprocessor that serves as a CPU, a local memory, I/O that contains disks and Ethernet connections, and a CM-5 interface.

Each processing node is much smaller, and contains a SPARC-based microprocessor, a memory controller for 8, 16, or 32 Mbytes of local memory, and a network interface to the Control and Data Networks. In a full implementation of a CM-5, there can be up to 16,384 processing nodes, each performing 64-bit floating point and integer operations, operating at a clock rate of 32 MHz.

micropro-Overall, the CM-5 provides a true mix of SIMD and MIMD styles of processing,and offers greater applicability than the stricter SIMD style of the CM-1 and

Figure 10-33 An example of a fat tree.

Trang 17

CM-2 predecessors.

10.10 Case Study: Parallel Processing in the Sega Genesis

Home video game systems are examples of (nearly) full-featured computer architectures. They have all of the basic features of modern computer architectures, and several advanced features. One notably lacking feature is permanent storage (like a hard disk) for saving information, although newer models even have that to a degree. One notably advanced feature, which we explore here, is the use of multiple processors in a MIMD configuration.

Three of the most prominent home video game platforms are manufactured by Sony, Nintendo, and Sega. For the purpose of this discussion, we will study the Sega Genesis, which exploits parallel processing for real-time performance.

Figure 10-34 illustrates the external view of the Sega Genesis home video game system. The Sega Genesis consists of a motherboard containing electronic components such as the processor, memory, and interconnects, a few hand-held controllers, and an interface to a television set.

In terms of the conventional von Neumann model of a digital computer, the Sega Genesis has all of the basic parts: input (the controllers), output (the television set), arithmetic logic unit (inside the processor), control unit (also inside the processor), and memory (which includes the internal memory and the plug-in game cartridges).

Figure 10-34 External view of the Sega Genesis home video game system.

The system bus model captures the logical connectivity of the Sega architecture as well as some of the physical organization. Figure 10-35 illustrates the system bus model view of the Sega Genesis. The Genesis contains two general-purpose microprocessors, the Motorola 68000 and the Zilog Z80. These processors are older, low cost processors that handle the general program execution. Video game systems must be able to generate a wide variety of sound effects, a process that is computationally intensive. In order to maintain game speed and quality during sound generation, the Genesis off-loads sound effect computations to two special purpose chips, the Texas Instruments programmable sound generator (TI PSG) and the Yamaha sound synthesis chip. There are also I/O interfaces for the video system and the hand-held controls.

The 68000 processor runs the main program and controls the rest of the machine. The 68000 accomplishes this by transferring data and instructions to the other components via the system bus. One of the components that the 68000 processor controls is the architecturally similar, but smaller, Z80 processor, which can be loaded with a program that executes while the 68000 returns to execute its own program, using an arbitration mechanism that allows both processors to share the bus (but only one at a time).

The TI PSG has 3 square wave tones and 1 white noise tone. Each tone/noise can have its own frequency and volume.

The Yamaha synthesis chip is based on FM synthesis. There are 6 voices with 4 operators each. The chip is similar to those used in the Yamaha DX27 and DX100 synthesizers. By setting up registers within the chips, a rich variety of sounds can be created.

Figure 10-35 System bus model view of the Sega Genesis: the 68000 and Z80 processors, main memory, the plug-in cartridge, the programmable sound generator, the sound synthesis chip, the video and sound output DACs, and the interface to the hand-held controllers, all connected by the system bus.

The plug-in game cartridges contain the programs, and there is additional runtime memory available in a separate unit (labeled "Main Memory" in Figure 10-35). Additional components are provided for video output, sound output, and hand-held controllers.

10.10.2 SEGA GENESIS OPERATION

When the Sega Genesis is initially powered on, a RESET signal is enabled, which allows all of the electrical voltage levels to stabilize and initializes a number of runtime variables. The RESET signal is then automatically disabled, and the 68000 begins reading and executing instructions from the game cartridge.

During operation, the instructions in the game cartridge instruct the 68000 to load a program into the Z80 processor, and to start the Z80 program execution while the 68000 returns to its own program. The Z80 program controls the sound chips, while the 68000 carries out graphical operations, probes the hand-held controllers for activity, and runs the overall game program.

[Note from the Authors: This section is adapted from a contribution by David Ashley, dash@xdr.com.]

The Sega Genesis uses plug-in cartridges to store the game software. Blank cartridges can be purchased from third party vendors, and can then be programmed using an inexpensive PROM burner card that can be plugged into the card cage of a desktop computer. Games can be written in high level languages and compiled into assembly language, or, more commonly, programmed in assembly language directly (even today, assembly language is still heavily used for game programming). A suite of development tools translates the source code into object code that can then be burned directly into the cartridges (once per cartridge). As an alternative to burning cartridges during development, the cartridge can be replaced with a reprogrammable development card.

The Genesis contains two general-purpose microprocessors, the Motorola 68000 and the Zilog Z80. The 68000 runs at 8 MHz and has 64 KB of memory devoted to it. The ROM cartridge appears at memory location 0. The 68000 off-loads sound effect computations to the TI PSG and the Yamaha sound synthesis chip.

The Genesis graphics hardware consists of 2 scrollable planes. Each plane is made up of tiles. Each tile is an 8×8 pixel square with 4 bits per pixel. Each pixel can thus have 16 colors. Each tile can use 1 of 4 color tables, so on the screen there can be 64 colors at once, but only 16 different colors can be in any specific tile. Tiles require 32 bytes. There is 64 KB of graphics memory, which allows for 2048 unique tiles if memory is used for nothing else.
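The storage arithmetic is simple to verify with a throwaway sketch:

#include <stdio.h>

int main(void)
{
    int bits_per_tile  = 8 * 8 * 4;                  /* 256 bits   */
    int bytes_per_tile = bits_per_tile / 8;          /* 32 bytes   */
    int max_tiles      = 64 * 1024 / bytes_per_tile; /* 2048 tiles */

    printf("%d bytes per tile, %d tiles in 64 KB\n",
           bytes_per_tile, max_tiles);
    return 0;
}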

Each plane can be scrolled independently in various ways. Planes consist of tables of words, in which each word describes a tile. A word contains 11 bits for identifying the tile, 2 bits for "flip x" and "flip y," 2 bits for the selection of the color table, and 1 bit for a depth selector. Sprites are also composed of tiles. A sprite can be up to 4 tiles wide by 4 tiles high. Since the size of each tile is 8×8, this means sprites can be anywhere from 8×8 pixels to 32×32 pixels in size. There can be 80 sprites on the screen at one time. On a single scan line there can be 10 32-pixel wide sprites or 20 16-pixel wide sprites. Each sprite can only have 16 colors taken from the 4 different color tables. Colors are allocated 3 bits for each gun, and so 512 colors are possible. (Color 0 = transparent.)
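The 16-bit plane-table word can be pictured as a packed bitfield. The following decoding sketch matches the field widths given above; the exact bit positions are our assumption for illustration, not taken from the text:

#include <stdio.h>

/* Decode a plane-table word: 11 bits of tile index, flip x, flip y,
   2 bits of color-table select, and 1 depth (priority) bit.  The bit
   positions below are illustrative. */
struct tile_ref { unsigned tile, flip_x, flip_y, table, depth; };

struct tile_ref decode(unsigned short w)
{
    struct tile_ref t;
    t.tile   =  w        & 0x7FF;   /* bits 10..0: one of 2048 tiles */
    t.flip_x = (w >> 11) & 1;
    t.flip_y = (w >> 12) & 1;
    t.table  = (w >> 13) & 3;       /* one of 4 color tables */
    t.depth  = (w >> 15) & 1;
    return t;
}

int main(void)
{
    struct tile_ref t = decode(0xA123);
    printf("tile %u, flip(%u,%u), table %u, depth %u\n",
           t.tile, t.flip_x, t.flip_y, t.table, t.depth);
    return 0;
}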

A memory copier resident in hardware performs fast copies from the 68000 RAM into the graphics RAM. The Z80 also has 8 KB of RAM. The Z80 can access the graphics chip or the sound chips, but usually these chips are controlled by the 68000.

The process of creating a game cartridge involves (1) writing the game program, (2) translating the program into object code (compiling, assembling, and linking the code into an executable object module; some parts of the program may be written in a high level language, and other parts directly in assembly language), (3) testing the program on a reprogrammable development card (if one is available), and (4) burning the program into a blank game cartridge.

See Further Reading below for more information on programming the Sega Genesis.


■ SUMMARY

In the RISC approach, the most frequently occurring instructions are optimized by eliminating or reducing the complexity of other instructions and addressing modes commonly found in CISC architectures. The performance of RISC architectures is further enhanced by pipelining and by increasing the number of registers available to the CPU. Superscalar and VLIW architectures are examples of newer performance enhancements that extend, rather than replace, the RISC approach.

Parallel architectures can be classified as MISD, SIMD, or MIMD. The MISD approach is used for systolic array processing, and is the least general architecture of the three. In a SIMD architecture, all PEs carry out the same operations on different data sets, in an "army of ants" approach to parallel processing. The MIMD approach can be characterized as a "herd of elephants," because there are a small number of powerful processors, each with its own data and instruction streams.

The current trend is moving away from the fine grain parallelism that is exemplified by the MISD and SIMD approaches, and toward the MIMD approach. This trend is due to the high time cost of communicating among PEs, and the economy of using networks of workstations over tightly coupled parallel processors. The goal of the MIMD approach is to better balance the time spent in computation with the time spent in communication.

Three primary characteristics of RISC architectures enumerated in Section 10.2 originated at IBM's T. J. Watson Research Center, as summarized in (Ralston and Reilly, 1993, pp. 1165-1167). (Hennessy and Patterson, 1995) is a classic reference on much of the work that led to the RISC concept, although the word "RISC" does not appear in the title of their textbook. (Stallings, 1991) is a thorough reference on RISCs. (Tamir and Sequin, 1983) show that a window size of eight will shift on less than 1% of the calls or returns. (Tanenbaum, 1999) provides a readable introduction to the RISC concept. (Dulong, 1998) describes the IA-64. The PowerPC 601 architecture is described in (Motorola).

(Quinn, 1987) and (Hwang, 1993) overview the field of parallel processing in terms of architectures and algorithms. (Flynn, 1972) covers the Flynn taxonomy of architectures. (Yang and Gerasoulis, 1991) argue for maintaining a ratio of communication time to computation time of less than 1. (Hillis, 1985) and (Hillis, 1993) describe the architectures of the CM-1 and CM-5, respectively. (Hui, 1990) covers interconnection networks, and (Leighton, 1992) covers routing algorithms for a few types of interconnection networks. (Wu and Feng, 1981) cover routing on a shuffle-exchange network.

Additional information on programming the Sega Genesis can be found at http://hiwaay.net/~jfrohwei/sega/genesis.html.

Dulong, C., "The IA-64 Architecture at Work," IEEE Computer, vol. 31, (1998).

Hillis, W. D., The Connection Machine, The MIT Press, (1985).

Hillis, W. D. and L. W. Tucker, "The CM-5 Connection Machine: A Scalable Supercomputer," Communications of the ACM, vol. 36, no. 11, pp. 31-40, (Nov. 1993).

Hui, J. Y., Switching and Traffic Theory for Integrated Broadband Networks, Kluwer Academic Publishers, (1990).

Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, (1993).

Knuth, D. E., "An Empirical Study of FORTRAN Programs," Software: Practice and Experience, vol. 1, pp. 105-133, (1971).

Leighton, F. T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, (1992).

Motorola Inc., PowerPC 601 RISC Microprocessor User's Manual, Motorola Literature Distribution, P.O. Box 20912, Phoenix, AZ 85036.

Quinn, M. J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, (1987).

Ralston, A. and E. D. Reilly, eds., Encyclopedia of Computer Science, 3/e, Van Nostrand Reinhold, (1993).

SPARC International, Inc., The SPARC Architecture Manual: Version 8, Prentice Hall, Englewood Cliffs, New Jersey, (1992).

Stallings, W., Computer Organization and Architecture: Designing for Performance, 4/e, Prentice Hall, Upper Saddle River, (1996).

Stallings, W., Reduced Instruction Set Computers, 3/e, IEEE Computer Society Press, Washington, D.C., (1991).

Stone, H. S. and J. Cocke, "Computer Architecture in the 1990s," IEEE Computer, vol. 24, no. 9, pp. 30-38, (Sept. 1991).

Tamir, Y. and C. Sequin, "Strategies for Managing the Register File in RISC," IEEE Transactions on Computers, (Nov. 1983).

Tanenbaum, A., Structured Computer Organization, 4/e, Prentice Hall, Upper Saddle River, New Jersey, (1999).

Wu, C.-L. and T.-Y. Feng, "The Universality of the Shuffle-Exchange Network," IEEE Transactions on Computers, vol. C-30, no. 5, pp. 324-, (1981).

Yang, T. and A. Gerasoulis, "A Fast Static Scheduling Algorithm for DAGs on an Unbounded Number of Processors," Proceedings of Supercomputing '91, Albuquerque, New Mexico, (Nov. 1991).

10.1 Increasing the number of cycles per instruction can sometimes improve the execution efficiency of a pipeline. If the time per cycle for the pipeline described in Section 10.3 is 20 ns, then the average time per instruction is 1.5 × 20 ns = 30 ns. Compute the execution efficiency for the same pipeline in which the pipeline depth increases from 5 to 6 and the cycle time decreases from 20 ns to 10 ns.

10.2 The SPARC code below is taken from the gcc generated code in Figure 10-10. Can %o0 be used in all three lines, instead of "wasting" %o1 in the second line?

st    %o0, [%fp-28]
sethi %hi(.LLC0), %o1
or    %o1, %lo(.LLC0), %o1

10.3 Calculate the speedup that can be expected if a 200 MHz Pentium chip is replaced with a 300 MHz Pentium chip, if all other parameters remain unchanged.

10.4 What is the speedup that can be expected if the instruction set of a certain machine is changed so that the branch instruction takes 1 clock cycle instead of 3 clock cycles, if branch instructions account for 20% of all instructions executed by a certain program? Assume that other instructions average 3 clock cycles per instruction, and that nothing else is altered by the change.

10.5 Create a dependency graph for the following expression:

f(x, y) = x^2 + 2xy + y^2

10.6 Given 100 processors for a computation in which 5% of the code cannot be parallelized, compute the speedup and efficiency.

10.7 What is the diameter of a 16-space hypercube?

10.8 For the EXAMPLE at the end of Section 10.9.2, compute the total crosspoint complexity over all three stages.


APPENDIX A: DIGITAL LOGIC

In this appendix, we take a look at a few basic principles of digital logic that we can apply in the design of a digital computer. We start by studying combinational logic, in which logical decisions are made based only on combinations of the inputs. We then look at sequential logic, in which decisions are made based on combinations of the current inputs as well as the past history of inputs. With an understanding of these underlying principles, we can design digital logic circuits from which an entire computer can be constructed. We begin with the fundamental building block of a digital computer, the combinational logic unit (CLU).

A combinational logic unit translates a set of inputs into a set of outputs according to one or more mapping functions. The outputs of a CLU are strictly functions of the inputs, and the outputs are updated immediately after the inputs change. A basic model of a CLU is shown in Figure A-1. A set of inputs i0 - in is presented to the CLU, which produces a set of outputs according to mapping functions f0 - fm. There is no feedback from the outputs back to the inputs in a combinational logic circuit (we will study circuits with feedback in Section A.11).

Figure A-1 External view of a combinational logic unit.

Inputs and outputs for a CLU normally have two distinct values: high and low. When signals (values) are taken from a finite set, the circuits that use them are referred to as being digital. A digital electronic circuit receives inputs and produces outputs in which 0 volts (0 V) is typically considered to be a low value and +5 V is considered to be a high value. This convention is not used everywhere: high speed circuits tend to use lower voltages, and some circuits use other signal conventions entirely; optical circuits, for example, might use phase or polarization, in which high and low values are no longer meaningful. An application in which analog circuitry is appropriate is flight simulation, since analog circuits more closely approximate the mechanics of an aircraft than do digital circuits.

Although the vast majority of digital computers are binary, multi-valued circuits also exist. A wire that is capable of carrying more than two values can be more efficient at transmitting information than a wire that carries only two values. A digital multi-valued circuit is different from an analog circuit in that a multi-valued circuit deals with signals that take on one of a finite number of values, whereas an analog signal can take on a continuum of values. The use of multi-valued circuits is theoretically valuable, but in practice it is difficult to create reliable circuitry that distinguishes between more than two values. For this reason, multi-valued logic is currently in limited use.

In this text, we are primarily concerned with digital binary circuits, in which exactly two values are allowed for any input or output. Thus, we will consider only binary signals.

In 1854, George Boole published his seminal work on an algebra for representing logic. Boole was interested in capturing the mathematics of thought, and developed a representation for factual information such as "The door is open." or "The door is not open." Boole's algebra was further developed by Shannon into the form we use today. In Boolean algebra, we assume the existence of a basic postulate: that a binary variable takes on a single value of 0 or 1. This value corresponds to the 0 V and +5 V voltages mentioned in the previous section. The assignment can also be made in reverse order, for 1 and 0, respectively. For purposes of understanding the behavior of digital circuits, we can abstract away the physical correspondence to voltages and consider only the symbolic values 0 and 1.

A key contribution of Boole is the development of the truth table, which captures logical relationships in a tabular form. Consider a room with two 3-way switches A and B that control a light Z. Each switch can be up or down, independently of the other. When exactly one switch is up, the light is on. When both switches are up or both are down, the light is off. A truth table can be constructed that enumerates all possible settings of the switches, as shown in Figure A-2. In the table, a switch is assigned the value 0 if it is down; otherwise, it is assigned the value 1. The light is on when Z = 1.

In a truth table, all possible input combinations of binary variables are enumerated, and a corresponding output value of 0 or 1 is assigned for each input combination. For the truth table shown in Figure A-2, the output function Z depends upon input variables A and B. For each combination of input variables there are two values that can be assigned to Z: 0 or 1. We can choose a different assignment of outputs than that of Figure A-2, in which the light is on only when both switches are up or both are down; the truth table shown in Figure A-3 then enumerates all possible states of the light for each switch setting. The wiring pattern would also need to be changed to correspond. For two input variables, there are 2^2 = 4 input combinations, and 2^4 = 16 possible assignments of outputs to input combinations.

Figure A-2 Truth table for the light Z controlled by switches A and B:

A B | Z
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0

Figure A-3 Alternate assignment of outputs to switch settings:

A B | Z
0 0 | 1
0 1 | 0
1 0 | 0
1 1 | 1

In general, since there are 2^n input combinations for n inputs, there are 2^(2^n) possible assignments of output values to input combinations.

If we enumerate all possible assignments of outputs for two input variables, then we will obtain the 16 assignments shown in Figure A-4. We refer to these functions as Boolean logic functions. A number of the assignments have special names. The AND function is true (produces a 1) only when A and B are 1, whereas the OR function is true when either A or B is 1, or when both A and B are 1. A function is false when its output is 0, and so the False function is always 0, whereas the True function is always 1. The plus signs '+' in the Boolean expressions denote logical OR, and do not imply arithmetic addition. The juxtaposition of two variables, as in AB, denotes logical AND among the variables.

The A and B functions simply repeat the A and B inputs, respectively, whereas the Ā and B̄ functions complement A and B, by producing a 0 where the uncomplemented function is a 1 and by producing a 1 where the uncomplemented function is a 0. In general, a bar over a term denotes the complement operation, and so the NAND and NOR functions are complements to AND and OR, respectively. The XOR function is true when either of its inputs, but not both, is true. The XNOR function is the complement to XOR. The remaining functions are interpreted similarly.

Figure A-4 The 16 Boolean functions of two variables. Reading each output column for inputs AB = 00, 01, 10, 11, the columns run through all 16 possible 4-bit patterns, from 0000 (False) through 0001 (AND), 0110 (XOR), 0111 (OR), 1000 (NOR), 1001 (XNOR), and 1110 (NAND), up to 1111 (True).
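The enumeration of Figure A-4 can be generated mechanically: reading each output column for AB = 00, 01, 10, 11 as a 4-bit number (most significant bit first), function k simply has truth table k. A sketch of this (ours, not from the text):

#include <stdio.h>

/* Print the 16 Boolean functions of two variables.  Bit (3 - (2A + B))
   of k is the output of function k for inputs A, B, so k = 1 is AND
   (0001), k = 6 is XOR (0110), k = 7 is OR (0111), k = 9 is XNOR,
   k = 14 is NAND, and k = 15 is True. */
int main(void)
{
    for (int k = 0; k < 16; k++) {
        printf("f%-2d:", k);
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                printf(" %d", (k >> (3 - (2 * a + b))) & 1);
        printf("\n");
    }
    return 0;
}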

A logic gate is a physical device that implements a simple Boolean function. The functions listed in Figure A-4 have representations as logic gate symbols, a few of which are shown in Figure A-5 and Figure A-6. For each of the functions, A and B are binary inputs and F is the output.

In Figure A-5, the AND and OR gates behave as previously described. The output of the AND gate is true when both of its inputs are true, and is false otherwise. The output of the OR gate is true when either or both of its inputs are true, and is false otherwise. The buffer simply copies its input to its output. Although the buffer has no logical significance, it serves an important practical role as an amplifier, allowing a number of logic gates to be driven by a single signal. The NOT gate (also called an inverter) produces a 1 at its output for a 0 at its input, and produces a 0 at its output for a 1 at its input. Again, the inverted output signal is referred to as the complement of the input. The circle at the output of the NOT gate denotes the complement operation.

sig-nal is referred to as the complement of the input The circle at the output of the

A

A

0 0 1 1

B

0 1 0 1

F

0 0 0 1

AND

A

0 0 1 1

B

0 1 0 1

F

0 1 1 1

F

0 1

Buffer

A

0 1

F

1 0

NOT (Inverter)

Figure A-5 Logic gate symbols for AND, OR, buffer, and NOT Boolean functions.

Trang 30

NOT gate denotes the complement operation

In Figure A-6, the NAND and NOR gates produce complementary outputs to the AND and OR gates, respectively. The exclusive-OR (XOR) gate produces a 1 when either of its inputs, but not both, is 1. In general, XOR produces a 1 at its output whenever the number of 1's at its inputs is odd. This generalization is important in understanding how an XOR gate with more than two inputs behaves. The exclusive-NOR (XNOR) gate produces a complementary output to the XOR gate.

Figure A-6 Logic gate symbols for the NAND, NOR, XOR, and XNOR Boolean functions, with their truth tables:

A B | NAND | NOR | XOR | XNOR
0 0 |  1   |  1  |  0  |  1
0 1 |  1   |  0  |  1  |  0
1 0 |  1   |  0  |  1  |  0
1 1 |  0   |  0  |  0  |  1
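Each of these gates corresponds to a one-line C expression on 0/1 values, which serves as an executable restatement of the truth tables (a sketch; the helper names are ours):

#include <stdio.h>

int nand_g(int a, int b) { return !(a & b); }   /* complement of AND    */
int nor_g (int a, int b) { return !(a | b); }   /* complement of OR     */
int xor_g (int a, int b) { return a ^ b;    }   /* 1 when inputs differ */
int xnor_g(int a, int b) { return !(a ^ b); }   /* complement of XOR    */

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            printf("A=%d B=%d  NAND=%d NOR=%d XOR=%d XNOR=%d\n",
                   a, b, nand_g(a, b), nor_g(a, b),
                   xor_g(a, b), xnor_g(a, b));
    return 0;
}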

The logic symbols shown in Figure A-5 and Figure A-6 are only the basic forms, and there are a number of variations that are often used. For example, there can be more inputs, as for the three-input AND gate shown in Figure A-7a. The circles at the outputs of the NOT, NOR, and XNOR gates denote the complement operation, and can be placed at the inputs of logic gates to indicate that the inputs are inverted upon entering the gate, as shown in Figure A-7b.


Depending on the technology used, some logic gates produce complementary outputs. The corresponding logic symbol for a complementary logic gate indicates both outputs, as illustrated in Figure A-7c.

Physically, logic gates are not magical, although it may seem that they are when a device like an inverter can produce a logical 1 (+5 V) at its output when a logical 0 (0 V) is provided at the input. The next section covers the underlying mechanism that makes electronic logic gates work.

Electrically, logic gates have power terminals that are not normally shown. Figure A-8a illustrates an inverter in which the +5 V and 0 V (GND) terminals are made visible. The +5 V signal is commonly referred to as VCC, for "voltage collector-collector." In a physical circuit, all of the VCC and GND terminals are connected to the corresponding terminals of a power supply.

Figure A-8 (a) Power terminals for an inverter made visible; (b) schematic symbol for a transistor; (c) transistor circuit for an inverter; (d) static transfer function for an inverter.

Logic gates are composed of electrical devices called transistors, which have a fundamental switching property that allows them to control a strong electrical signal with a weak signal. This supports the process of amplification, which is crucial for cascading logic gates. Without amplification, we would only be able to send a signal through a few logic gates before the signal deteriorates to the point that it is overcome by noise, which exists at every point in an electrical circuit to some degree.

The schematic symbol for a transistor is shown in Figure A-8b. When there is no positive voltage on the base, a current will not flow from VCC to GND. Thus, for an inverter, a logical 0 (0 V) on the base will produce a logical 1 (+5 V) at the collector terminal, as illustrated in Figure A-8c. If, however, a positive voltage is applied to Vin, then a current will flow from VCC to GND, which prevents Vout from producing enough signal for the inverter output to be a logical 1. In effect, when +5 V is applied to Vin, a logical 0 appears at Vout. The input-output relationship of a logic gate follows a nonlinear curve, as shown in Figure A-8d for transistor-transistor logic (TTL). The nonlinearity is an important gain property that makes cascadable operation possible.

A useful paradigm is to think of current flowing through wires as water flowing through pipes. If we open a connection on a pipe from VCC to GND, then the water flowing to Vout will be reduced to a great extent, although some water will still make it out. By choosing an appropriate value for the resistor RL, the flow can be restricted in order to minimize this effect.

Since there will always be some current that flows even when we have a logical 0 at Vout, we need to assign logical 0 and 1 to voltages using safe margins. If we assign logical 0 to 0 V and logical 1 to +5 V, then our circuits may not work properly if 1 V appears at the output of an inverter instead of 0 V, which can happen in practice. For this reason, we design circuits in which assignments of logical 0 and 1 are made using thresholds. In Figure A-9a, logical 0 is assigned to the voltage range [0 V to 0.4 V] and logical 1 is assigned to the voltage range [2.4 V to +5 V]. The ranges shown in Figure A-9a are for the output of a logic gate. There may be some attenuation (a reduction in voltage) introduced in the connection between the output of one logic gate and the input to another, and for that reason, the thresholds are relaxed by 0.4 V at the input to a logic gate, as shown in Figure A-9b. These ranges can differ depending on the logic family. The output ranges only make sense, however, if the gate inputs settle into the corresponding valid input ranges.
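The thresholds can be restated as a small classification function. This sketch uses the input-side thresholds implied above (the 0.4 V and 2.4 V output levels relaxed by 0.4 V of allowed attenuation, giving 0.8 V and 2.0 V at the input); the function name is ours:

#include <stdio.h>

/* Map an input voltage to a logic value using TTL-style thresholds:
   at most 0.8 V reads as logical 0, at least 2.0 V as logical 1, and
   anything in between is not a valid logic level (returned as -1). */
int input_logic_level(double v)
{
    if (v <= 0.8) return 0;
    if (v >= 2.0) return 1;
    return -1;
}

int main(void)
{
    double samples[] = { 0.2, 1.0, 3.3 };
    for (int i = 0; i < 3; i++)
        printf("%.1f V -> %d\n", samples[i],
               input_logic_level(samples[i]));
    return 0;
}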
