Fig. 12.2 Larrabee architecture from Intel

Fig. 12.3 Fermi architecture from NVIDIA
multiprocessor (SM). The block diagram of a single SM is shown in Fig. 12.4, and the block diagram of a core within an SM is shown in Fig. 12.5.
Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi

With these upcoming architectures, newer approaches for hardware acceleration of algorithms would become viable. These approaches could exploit the more general computing paradigm offered by the newer architectures. For example, the close coupling between the GPU and the CPU (which reside on the same die) would
reduce the communication cost. Also, in these upcoming architectures the instruction dispatch unit is distributed, and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm of current GPUs), which in turn would enable acceleration opportunities for more EDA applications.
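To make the contrast with the SIMD paradigm concrete, the following minimal CUDA sketch (not taken from this monograph; the kernel name, data layout, and sizes are illustrative assumptions) shows the data-parallel style that current GPUs favor: every thread executes the same instruction stream on its own data element, in the spirit of the bit-parallel, pattern-parallel formulations used for fault simulation.

#include <cuda_runtime.h>

// Hypothetical kernel: each thread evaluates one two-input gate on one packed
// word of test patterns (32 patterns per unsigned word).
__global__ void evaluatePatterns(const unsigned *in, unsigned *out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
  if (tid < n)
    out[tid] = ~(in[2 * tid] & in[2 * tid + 1]);     // NAND of the two fanins
}

int main() {
  const int n = 1 << 20;                             // number of pattern words
  unsigned *in = 0, *out = 0;
  cudaMalloc((void **)&in, 2 * n * sizeof(unsigned));
  cudaMalloc((void **)&out, n * sizeof(unsigned));
  // (test patterns would be copied into 'in' with cudaMemcpy here)
  evaluatePatterns<<<(n + 255) / 256, 256>>>(in, out, n);
  cudaDeviceSynchronize();                           // wait for the kernel to finish
  cudaFree(in);
  cudaFree(out);
  return 0;
}

Because every thread runs the same code, divergent and control-dominated algorithms map poorly onto this model; a more general-purpose instruction set and distributed dispatch would relax this restriction.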
The approaches presented in this monograph collectively aim to contribute toward enabling the CAD community to accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques to rearchitect several EDA algorithms to maximally harness their performance on the alternative platforms under consideration.
Fig. 12.5 Block diagram of a single processor (core) in an SM
References
1. The MiniSAT Page. http://www.cs.chalmers.se/cs/research/formalmethods/minisat/main.html
2. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
3. OmegaSim Mixed-Signal Fast-SPICE Simulator. http://www.nascentric.com/product.html
4. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference on Test, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
5. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27(3), 1–15 (2008)
6. Marques-Silva, J.P., Sakallah, K.A.: GRASP - a new search algorithm for satisfiability. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 220–227 (1996)
Index

A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl’s Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40
B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
SSTA, 108
Board test, 15
Boolean Constraint Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
SPICE, 158
BSIM4
SPICE, 158
Bulldog Fortran, 171
C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete
SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated
EDA, 3
Control plus data parallel
EDA, 3
Core, 185
Critical line
critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33
D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
fault table, 134
Dynamic
power, 10
Dynamic bulk modulation, 10
E
EDA, 3
Embedded processor, 10
F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
Factor Graph, 87
G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85
H
Hardware
IP cores, 15
HDL, 10, 14, 19
Hybrid
SAT, 85
I
Immediate dominator
dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
non-chronological, 32
Intra-bin
non-chronological, 32
IP cores, 15
K
Kernel, 28, 167, 184
L
Larrabee, 184
Latency, 11, 13
Leakage power, 10
Levelize, 112
Literal, 37
free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12
M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
SSTA, 101, 106
Moore's Law, 24
MOPs, 17
MOPs per watt
MOPs, 17
Multi-GPU, 16
Multi-port memory, 20
Multiprocessor, 12, 24
MUX, 11
N
Newton-Raphson, 154
NMOS
passgates, 11
Non-chronological backtrack, 32, 43, 45, 64, 68, 85
Non-recurring engineering, 10, 18
Non-volatile
memory, 20
O
Off-chip, 14
On-chip, 14
OPB, 67, 72
P
Paging, 12
Parafrase, 170
Parallel
SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
SSTA, 108
Pattern parallel
data parallel, 120
PCI, 15
PCI-X
PCI, 15
Pipeline, 11
Piracy
security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
code, 16
Programmable, 12
Prototyping, 16
Q
QuickPath Interconnect, 18
R
Random
variations, 106
Re-programmability, 19
Reconfigurable logic
FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19
S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
SAT, 83, 85
Subroutine, 167
Subsumption resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic variations, 106
T
Termination cell, 40
Texture fetching
Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree
Factor Graph, 87
U
Unate covering, 134
V
Variable, 31, 37
Factor Graph, 87
Variable ordering
SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93
W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19
X
XC2VP30 FPGA, 32