
PARALLEL COMPUTER ARCHITECTURES

Figure 8-1. (a) A multiprocessor with 16 CPUs sharing a common memory. (b) An image partitioned into 16 sections, each being analyzed by a different CPU.

Figure 8-2. (a) A multicomputer with 16 CPUs, each with its own private memory. (b) The bit-map image of Fig. 8-1 split up among the 16 memories.

Figure 8-3. Various layers where shared memory can be implemented. (a) The hardware. (b) The operating system. (c) The language runtime system.

Figure 8-4. Various topologies. The heavy dots represent switches. The CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.
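The 4D hypercube in panel (h) can be explored with a small sketch. It assumes the usual convention of labelling the 2^d switches with d-bit numbers so that two switches are linked exactly when their labels differ in one bit. The function names `neighbors` and `hops` are hypothetical and chosen only for this illustration.

```python
def neighbors(node, dims):
    """Return the switches directly linked to `node` in a `dims`-dimensional hypercube."""
    if not 0 <= node < (1 << dims):
        raise ValueError("node label out of range")
    # Flipping one bit of the label moves along one dimension of the cube.
    return [node ^ (1 << bit) for bit in range(dims)]


def hops(src, dst):
    """Minimum number of links between two switches: the count of differing label bits."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    print(neighbors(0b0000, 4))   # the four neighbours of switch 0 in a 4D hypercube
    print(hops(0b0000, 0b1111))   # opposite corners are 4 links apart
```

Under this labelling every switch has d links and the longest shortest path is d hops, which is why the hypercube scales better in diameter than a ring or grid of the same size.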

Figure 8-5. An interconnection network in the form of a four-switch square grid. Only two of the CPUs are shown.

Figure 8-6. Store-and-forward packet switching.

Figure 8-8. Real programs achieve less than the perfect speed-up indicated by the dotted line.

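Figure 8-9. (a) A program has a sequential part and a parallelizable part. (b) Effect of running part of the program in parallel.

Labels: f is the sequential fraction and 1 - f the potentially parallelizable part. With 1 CPU active the sequential part takes fT; with n CPUs active the parallelizable part takes (1 - f)T/n.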
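The two times marked in the figure, fT and (1 - f)T/n, add up to the parallel running time, and dividing the original time T by that sum gives the speed-up n / (1 + (n - 1)f). The following is a minimal sketch of that arithmetic; the function names `parallel_time` and `speedup` are hypothetical.

```python
def parallel_time(total_time, f, n):
    """Running time on n CPUs when a fraction f of the work is sequential."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    # Sequential part runs on 1 CPU, the rest is split evenly over n CPUs.
    return f * total_time + (1.0 - f) * total_time / n


def speedup(f, n):
    """Speed-up T / (fT + (1 - f)T/n); T cancels out."""
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, round(speedup(0.1, n), 2))
```

With f = 0.1 the speed-up can never exceed 1/f = 10 no matter how many CPUs are added, which is one reason real programs fall below the perfect speed-up line.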

Figure 8-10. (a) A 4-CPU bus-based system. (b) A 16-CPU bus-based system. (c) A 4-CPU grid-based system. (d) A 16-CPU grid-based system.

Figure 8-11. Computational paradigms. (a) Pipeline. (b) Phased computation. (c) Divide and conquer. (d) Replicated worker.

Hardware        Software model    Examples
Multiprocessor  Message passing   Message passing simulated with buffers in memory
Multicomputer   Shared variables  DSM, Linda, Orca, etc. on an SP/2 or a PC network
Multicomputer   Message passing   PVM or MPI on an SP/2 or a network of PCs

Instruction streams   Data streams   Name   Examples

Figure 8-14. A taxonomy of parallel computers.

Leaf labels: bus, switched, CC-NUMA, NC-NUMA, grid, hypercube. Group labels: shared memory, message passing.

Figure 8-15. A vector ALU.

Figure 8-19. Registers and functional units of the Cray-1.

Registers: 8 64-bit scalar registers; 64 64-bit holding registers for scalars; 8 64-bit vector registers with 64 elements per register.
Functional unit groups: scalar integer units; scalar/vector floating-point units; vector integer units.
Operation labels: ADD, BOOLEAN, SHIFT, POP COUNT, MUL, RECIP.

Figure 8-20. (a) Two CPUs writing and two CPUs reading a common memory word. (b)-(d) Three possible ways the two writes and four reads might be interleaved in time.

(a) CPU 1 writes 100 and CPU 2 writes 200 to word x; CPUs 3 and 4 each read x twice.
(b) W100, W200, R3 = 200, R3 = 200, R4 = 200, R4 = 200
(c) W100, R3 = 100, W200, R4 = 200, R3 = 200, R4 = 200
(d) W200, R4 = 200, W100, R3 = 100, R4 = 100, R3 = 100
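The interleavings in this figure can be enumerated mechanically. The sketch below assumes CPU 1 writes 100, CPU 2 writes 200, CPUs 3 and 4 each read the word twice, and the word starts at 0 (the initial value is an assumption, as are the names `interleavings`, `run`, and `OUTCOMES`). It merges the four instruction streams in every order that preserves each CPU's own program order and records what the readers see.

```python
PROGRAMS = [
    [("W", 100)],                  # CPU 1
    [("W", 200)],                  # CPU 2
    [("R", None), ("R", None)],    # CPU 3
    [("R", None), ("R", None)],    # CPU 4
]


def interleavings(programs):
    """Yield every global order that keeps each CPU's operations in program order."""
    total = sum(len(p) for p in programs)
    pcs = [0] * len(programs)      # next operation index per CPU
    schedule = []

    def rec():
        if len(schedule) == total:
            yield list(schedule)
            return
        for cpu, prog in enumerate(programs):
            if pcs[cpu] < len(prog):
                schedule.append((cpu, prog[pcs[cpu]]))
                pcs[cpu] += 1
                yield from rec()
                pcs[cpu] -= 1
                schedule.pop()

    yield from rec()


def run(schedule, initial=0):
    """Execute one schedule on a single word; return (reads by CPU 3, reads by CPU 4)."""
    x = initial
    reads = {2: [], 3: []}         # list indices 2 and 3 are CPU 3 and CPU 4
    for cpu, (kind, value) in schedule:
        if kind == "W":
            x = value
        else:
            reads[cpu].append(x)
    return tuple(reads[2]), tuple(reads[3])


OUTCOMES = {run(s) for s in interleavings(PROGRAMS)}
```

Every outcome in `OUTCOMES` is one that a single global ordering of the six operations can produce. A result in which CPU 3 reads 100 then 200 while CPU 4 reads 200 then 100 is absent, because the two readers would disagree about the order of the writes.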

Figure 8-21. Weakly consistent memory uses synchronization operations to divide time into sequential epochs.

Figure 8-22. Three bus-based multiprocessors. (a) Without caching. (b) With caching. (c) With caching and private memories.

Figure 8-23. The write-through cache coherence protocol. The empty boxes indicate that no action is taken.

Figure 8-24. The MESI cache coherence protocol.

Steps shown: CPU 1 reads block A; CPU 2 reads block A; CPU 2 writes block A; CPU 3 reads block A.
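The sequence of reads and writes in this figure can be traced with a simplified model of MESI for a single block. This is a sketch under stated assumptions: bus transactions and write-backs are not modelled, only the resulting Modified/Exclusive/Shared/Invalid state in each cache, and the names `mesi_step` and `trace` are hypothetical.

```python
def mesi_step(states, cpu, op):
    """Return the per-CPU states of one block after `cpu` performs `op` ("read" or "write")."""
    new = list(states)
    others = [i for i in range(len(states)) if i != cpu]
    if op == "read":
        if states[cpu] == "I":                       # read miss
            holders = [i for i in others if states[i] != "I"]
            for i in holders:                        # M (after write-back) and E drop to S
                new[i] = "S"
            new[cpu] = "S" if holders else "E"       # sole copy is Exclusive
        # a read hit in M, E, or S changes nothing
    elif op == "write":
        for i in others:                             # all other copies are invalidated
            new[i] = "I"
        new[cpu] = "M"
    else:
        raise ValueError("op must be 'read' or 'write'")
    return new


def trace(n_cpus, ops):
    """Apply (cpu, op) pairs in order, starting with the block cached nowhere."""
    states = ["I"] * n_cpus
    history = []
    for cpu, op in ops:
        states = mesi_step(states, cpu, op)
        history.append("".join(states))
    return history


if __name__ == "__main__":
    # CPU 1 reads, CPU 2 reads, CPU 2 writes, CPU 3 reads (CPUs are 0-indexed here).
    print(trace(3, [(0, "read"), (1, "read"), (1, "write"), (2, "read")]))
```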

Figure 8-25. (a) An 8×8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint.

Figure 8-27. (a) A 2×2 switch. (b) A message format.

Message fields: Module, Address, Opcode, Value.

Figure 8-29. A NUMA machine based on two levels of buses. The Cm* was the first multiprocessor to use this design.

Figure 8-30. (a) A 256-node directory-based multiprocessor. (b) Division of a 32-bit memory address into fields. (c) The directory at node 36.

Labels: directory entries run from 0 to 2^18 - 1.
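Panel (b) splits a 32-bit address into fields. The sketch below assumes the widths are 8 bits for the node (enough for 256 nodes), 18 bits for the block (matching the 2^18 - 1 bound among the figure labels), and the remaining 6 bits for the offset within a block. These widths are inferred, not stated in the caption, and `split_address` is a hypothetical name.

```python
NODE_BITS = 8      # assumed: 2**8 = 256 nodes
BLOCK_BITS = 18    # assumed: 2**18 directory entries per node
OFFSET_BITS = 6    # assumed: whatever is left of the 32 bits


def split_address(addr):
    """Split a 32-bit address into (node, block, offset) fields, high bits first."""
    if not 0 <= addr < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    offset = addr & ((1 << OFFSET_BITS) - 1)
    block = (addr >> OFFSET_BITS) & ((1 << BLOCK_BITS) - 1)
    node = addr >> (OFFSET_BITS + BLOCK_BITS)
    return node, block, offset


if __name__ == "__main__":
    # An address homed at node 36, block 4, byte 8 within the block.
    addr = (36 << 24) | (4 << 6) | 8
    print(split_address(addr))
```

The node field selects which directory to consult and the block field indexes into that directory, so a lookup never has to search.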

Directory states: uncached, shared, modified.

This is the directory for cluster 13. This bit tells whether cluster 0 has block 1 of the memory homed here in any of its caches.

Figure 8-32. The NUMA-Q multiprocessor.

Labels: snooping bus interface; quad board with 4 Pentium Pros and up to 4 GB of RAM.

Figure 8-33. SCI chains all the holders of a given cache line together in a doubly-linked list. In this example, a line is shown cached at three nodes.

Labels: local memory table at home node (entries 0 to 2^19 - 1); cache directories at nodes 4, 9, and 22.

Figure 8-34. A generic multicomputer.

Figure 8-35. The Cray Research T3E.

Figure 8-36. The Intel/Sandia Option Red system. (a) The Kestrel board. (b) The interconnection network.

Figure 8-37. Scheduling a COW. (a) FIFO. (b) Without head-of-line blocking. (c) Tiling. The shaded areas indicate idle CPUs.

Figure 8-38. (a) Three computers on an Ethernet. (b) An Ethernet switch.

Figure 8-39. Sixteen CPUs connected by four ATM switches. Two virtual circuits are shown.

Figure 8-40. A virtual address space consisting of 16 pages spread over four nodes of a multicomputer. (a) The initial situation. (b) After CPU 0 references page 10. (c) After CPU 1 references page 10, here assumed to be a read-only page.

("abc", 2, 5)

("matrix-1", 1, 6, 3.14)

("family", "is sister", Carolyn, Elinor)

Figure 8-41. Three Linda tuples.
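Tuples like these live in a shared tuple space that processes add to and match against. The following is a minimal single-process sketch, not Linda's actual implementation: the class name `TupleSpace`, the wildcard `ANY`, and the method names `out`, `rd`, and `in_` are assumptions for illustration, and unlike a real Linda system the read operations here return None instead of blocking when nothing matches. Carolyn and Elinor are written as strings.

```python
ANY = object()   # hypothetical wildcard standing in for a formal (unbound) field


class TupleSpace:
    """A toy tuple space with insert, non-destructive read, and destructive read."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Place a tuple into the space."""
        self._tuples.append(tuple(fields))

    def _find(self, template):
        # A tuple matches when it has the same length and every non-wildcard field is equal.
        for t in self._tuples:
            if len(t) == len(template) and all(
                p is ANY or p == f for p, f in zip(template, t)
            ):
                return t
        return None

    def rd(self, *template):
        """Return a matching tuple and leave it in the space (None if no match)."""
        return self._find(template)

    def in_(self, *template):
        """Return a matching tuple and remove it from the space (None if no match)."""
        t = self._find(template)
        if t is not None:
            self._tuples.remove(t)
        return t


if __name__ == "__main__":
    ts = TupleSpace()
    ts.out("abc", 2, 5)
    ts.out("matrix-1", 1, 6, 3.14)
    ts.out("family", "is sister", "Carolyn", "Elinor")
    print(ts.rd("family", "is sister", ANY, ANY))
    print(ts.in_("abc", 2, ANY))
```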

Object implementation stack;

  stack: array [integer 0..N-1] of integer;

  operation push(item: integer);   # function returning nothing
  begin

  end;

  operation pop( ): integer;       # function returning an integer
  begin

Figure 8-42. A simplified ORCA stack object, with internal data and two operations.
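The operation bodies are not shown in the listing, so the following Python sketch is only a guess at the behavior the names suggest: a fixed array of N integers with push and pop. The `_top` index, the exceptions raised on overflow and underflow, and the lock that makes each operation run without interference are all assumptions, not part of the ORCA object above.

```python
import threading


class Stack:
    """A bounded integer stack whose operations each run under one lock."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self._items = [0] * n    # models: array [integer 0..N-1] of integer
        self._top = 0            # assumed: index of the next free slot

    def push(self, item):
        """Store item on top of the stack; returns nothing."""
        with self._lock:
            if self._top == len(self._items):
                raise OverflowError("stack is full")
            self._items[self._top] = item
            self._top += 1

    def pop(self):
        """Remove and return the top item."""
        with self._lock:
            if self._top == 0:
                raise IndexError("stack is empty")
            self._top -= 1
            return self._items[self._top]


if __name__ == "__main__":
    s = Stack(4)
    s.push(1)
    s.push(2)
    print(s.pop(), s.pop())
```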
