Fig. 12.2 Larrabee architecture from Intel

Fig. 12.3 Fermi architecture from NVIDIA
multiprocessor (SM). The block diagram of a single SM is shown in Fig. 12.4, and the block diagram of a core within an SM is shown in Fig. 12.5.
Fig. 12.4 Block diagram of a single shared multiprocessor (SM) in Fermi

With these upcoming architectures, newer approaches for hardware acceleration of algorithms would become viable. These approaches could exploit the more general computing paradigm offered by the newer architectures. For example, the close coupling between the GPU and the CPU (which reside on the same die) would
reduce the communication cost. Also, in these upcoming architectures the instruction dispatch unit is distributed, and the instruction set is more general purpose. These enhancements would enable a more general computing paradigm (in comparison to the SIMD paradigm of current GPUs), which in turn would enable acceleration opportunities for more EDA applications.
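To make the contrast with the SIMD paradigm concrete, the following minimal CUDA sketch (not taken from this monograph; the kernel name, data layout, and sizes are illustrative assumptions) shows the data-parallel style that current GPUs favor: every thread executes the same instruction stream on its own data element, in the spirit of the bit-parallel, pattern-parallel formulations used for fault simulation.

#include <cuda_runtime.h>

// Hypothetical kernel: each thread evaluates one two-input gate on one packed
// word of test patterns (32 patterns per unsigned word).
__global__ void evaluatePatterns(const unsigned *in, unsigned *out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
  if (tid < n)
    out[tid] = ~(in[2 * tid] & in[2 * tid + 1]);     // NAND of the two fanins
}

int main() {
  const int n = 1 << 20;                             // number of pattern words
  unsigned *in = 0, *out = 0;
  cudaMalloc((void **)&in, 2 * n * sizeof(unsigned));
  cudaMalloc((void **)&out, n * sizeof(unsigned));
  // (test patterns would be copied into 'in' with cudaMemcpy here)
  evaluatePatterns<<<(n + 255) / 256, 256>>>(in, out, n);
  cudaDeviceSynchronize();                           // wait for the kernel to finish
  cudaFree(in);
  cudaFree(out);
  return 0;
}

Because every thread runs the same code, divergent and control-dominated algorithms map poorly onto this model; a more general-purpose instruction set and distributed dispatch would relax this restriction.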
The approaches presented in this monograph collectively aim to contribute toward enabling the CAD community to accelerate EDA algorithms on modern hardware platforms. Our work demonstrates techniques to rearchitect several EDA algorithms to maximally harness their performance on the alternative platforms under consideration.
Fig. 12.5 Block diagram of a single processor (core) in an SM
References
1. The MiniSAT Page. http://www.cs.chalmers.se/cs/research/formalmethods/minisat/main.html
2. NVIDIA Tesla GPU Computing Processor. http://www.nvidia.com/object/IO_43499.html
3. OmegaSim Mixed-Signal Fast-SPICE Simulator. http://www.nascentric.com/product.html
4. Lee, H.K., Ha, D.S.: An efficient, forward fault simulation algorithm based on the parallel pattern single fault propagation. In: Proceedings of the IEEE International Test Conference on Test, pp. 946–955. IEEE Computer Society, Washington, DC (1991)
5. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27(3), 1–15 (2008)
6. Marques-Silva, J.P., Sakallah, K.A.: GRASP - a new search algorithm for satisfiability. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), pp. 220–227 (1996)
Index

A
Accelerators, 9
ACML-GPU, 15
Activity, 93
Algorithm parallel, 120, 121, 134
Amdahl’s Law, 158, 170
Application specific, 64
Arrival time, 110
Assignment, 31, 37, 40
B
Backtracking, 32
Bandwidth, 13
Bandwidth minimization, 52
Bank conflict, 27
BCP, 32, 37, 40
Bias
survey propagation, 89
Bins, 64
Bin packing, 52, 70
Bin utilization, 74
Bit parallel, 135, 146
Block, 28
Block-based
SSTA, 108
Board test, 15
Boolean Constraint Propagation, see BCP
Boolean Satisfiability, see SAT
Box-Muller, 101
BRAM, 11, 14, 32, 63, 66, 72, 78
Brook+, 15
BSIM3
SPICE, 158
BSIM4
SPICE, 158
Bulldog Fortran, 171
C
Capacity, 31, 35
CDFG, 160
Clause, 31
Clock speed, 11
CNF, 31, 34
Co-processors, 9
Compilers, 16
Complete
SAT, 83, 85
Conflict, 37, 40, 42, 44, 71
Conflict clause, 31
Conflict clause generation, 33, 64
Conjunctive Normal Form, 34
Constant Memory, 26, 161
Control and dataflow graph, 173
Control dominated
EDA, 3
Control plus data parallel
EDA, 3
Core, 185
Critical line
critical path tracing, 138
Critical path tracing, 138
CUBLAS, 15
CUDA, 15, 24
CUFFT, 15
Cumulative detectability, 138
Custom IC, 7, 10, 33
D
Data parallel, 28, 106, 120, 122, 134
Debuggers, 16
Decision engine, 37, 39, 49, 70
Decision level, 39, 67
Decisions
SAT, 32
Detectability, 138
DFF, 11
DIMACS, 45
Dimblock, 29
Dimensionality, 29
Dimgrid, 29
Divide, 12
Dominator, 138
DPLL, 85
DRAM, 14, 66, 184
Dropped
fault table, 134
Dynamic
power, 10
Dynamic bulk modulation, 10
E
EDA, 3
Embedded processor, 10
F
Factor Graph, 87
Fault detection, 134
Fault diagnosis, 134
Fault dropping, 134
Fault grading, 102, 120
Fault injection, 135
Fault parallel
data parallel, 120
Fault simulation, 4, 119
Fault table, 4, 134
Fermi, 184
Fingerprinting, 19
FPGA, 3, 7, 10, 32
Function
Factor Graph, 87
G
Global Memory, 13, 27, 110, 159
GPGPU, 3
Graphics Processors, see GPU
GRASP, 35, 64, 85
Grid, see dimgrid
GridSAT, 87
GSAT, 85
H
Hardware
IP cores, 15
HDL, 10, 14, 19
Hybrid
SAT, 85
I
Immediate dominator
dominator, 138
Implication, 37, 40, 44
Implication graph, 31, 33, 37, 50, 64
Infringement
security, 19
Input vector control, 10
Instance specific, 64
Inter-bin
non-chronological, 32
Intra-bin
non-chronological, 32
IP cores, 15
K
Kernel, 28, 167, 184
L
Larrabee, 184
Latency, 11, 13
Leakage power, 10
Levelize, 112
Literal, 37
free literal, 41
Local memory, 12, 27
Logic analyzers, 15
Lookup table, 11, 106, 120
LUT, 12
M
Memory bandwidth, 1, 13
Memory wall, 1
Mersenne Twister, 101, 106, 112
MIMD, 171
Minimum unsatisfiable core, 31, 33, 53
MiniSAT, 85
MNA
SPICE, 154
Model evaluation, 154
Model parallel, 122, 134
Monte Carlo, 4
SSTA, 101, 106
Moore's Law, 24
MOPs, 17
MOPs per watt
MOPs, 17
Multi-GPU, 16
Multi-port memory, 20
Multiprocessor, 12, 24
MUX, 11
N
Newton-Raphson, 154
NMOS
passgates, 11
Non-chronological backtrack, 32, 43, 45, 64, 68, 85
Non-recurring engineering, 10, 18
Non-volatile
memory, 20
O
Off-chip, 14
On-chip, 14
OPB, 67, 72
P
Paging, 12
Parafrase, 170
Parallel
SAT, 85
Partition, 32, 35, 63, 78
Pass/fail fault dictionary, 134
Path-based
SSTA, 108
Pattern parallel
data parallel, 120
PCI, 15
PCI-X
PCI, 15
Pipeline, 11
Piracy
security, 19
PLB, 67, 72
PLB2OPB bridge, 72
Power, 10, 56
average power, 58
Power delay product, 18
Power gating, 10
Power wall, 1
PowerPC, 32
Precharged, 39
Predischarged, 39
Process variations, 106
Processor, 24
Profiling
code, 16
Programmable, 12
Prototyping, 16
Q
QuickPath Interconnect, 18
R
Random
variations, 106
Re-programmability, 19
Reconfigurable logic
FPGA, 11
Reconfigure, 12
Reduced OR, 144
Register, 26, 172
Resolution, 36
Reuse-based design, 19
S
Sample parallelism, 106
SAT, 4, 31, 33, 34, 36
3-SAT, 36
Scalability, 15, 31, 35, 66
Scattered reads, 29
SEE, 18, 114
Self-test, 15
Sensitive input, 138
Shared Memory, 26, 27, 110
Shared multiprocessor, 185
SIMD, 3, 18, 29
Software
IP cores, 15
Span, 69
Speedup, 31
SPICE, 31, 153
Square root, 12
SRAM, 11
SSTA, 4, 101, 106
STA, 101, 106
SPICE, 154
Stem, 137
Stem region, 138, 143
Stochastic
SAT, 83, 85
Subroutine, 167
Subsumption resolution, 56
Successive chord, 156
Supply voltage, 10
Survey propagation, 84
Surveys
survey propagation, 88
Synchronization points, 29
Synchronize, 28
System test, 15
Systematic variations, 106
T
Termination cell, 40
Texture fetching
Texture Memory, 27
Texture Memory, 26, 110, 155, 160
Thread, 28, 146
Thread block, 28
Thread parallel, 135
Thread scheduler, 29
Throughput, 11
Time slicing, 29
Tree
Factor Graph, 87
U
Unate covering, 134
V
Variable, 31, 37
Factor Graph, 87
Variable ordering
SAT, 32
Variable Vt, 10
Variations, 106
Virtual memory, 12
VLIW, 171
VLSI, 106
VSIDS, 93
W
WalkSAT, 85, 90, 96
Warp size, 29
Warps, 29
Watermarking, 19
X
XC2VP30 FPGA, 32