by
Chang Woo Kang
A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)
May 2006
Copyright 2006 Chang Woo Kang
To my parents and family, my fiancée Eunju Lee, and my best friend Sung-Hoon
Kang, with thanks for their unconditional support and love.
I would like to offer my humble acknowledgment to Professor Massoud Pedram,
who supervised and guided me through this achievement. From him, I received not
only knowledge for my research but also emotional support whenever I encountered
frustration. I also thank Professor Jeff Draper and Professor Roger Zimmermann
for being on my thesis committee.
I would like to extend my deep gratitude to Professor Jeff Draper, who supported
me at the Information Sciences Institute for three years. His support has been an
enormous encouragement during my study at USC.
I would like to thank all of the SPORT group members who have given freely of
their time, hearts, and resources to support this research. A partial list includes
Chanseok Hwang, Kihwan Choi, Ali Iranli, Yazdan Aghahiri, Peng Rong, Yu Hou,
Afshin Abdollahi, Wonbok Lee, Maryam Soltan, Morteza Maleki, Hanif Fatemi,
Soroush Abbaspour, Hwisung Jung, and Behnam Amelifard.
Finally, I would like to express my deep affection to Yunjung Choi, Ihn Kim,
Joongseok Moon, Kisup Chong, and Kihoon Jeong.
Chang Woo Kang
USC, May 2006
TABLE OF CONTENTS
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
CHAPTER 1 Introduction
1.1 Motivation
1.2 Dissertation Outline
CHAPTER 2 Coarse-grained FPGAs and Previous Work
2.1 Overview of Coarse-grained FPGA Architecture
2.2 FPGA CAD Flow
2.2.1 Technology mapping
2.2.2 Clustering Techniques
2.2.3 Placement and Routing
2.3 Summary
CHAPTER 3 Tool Flow and Cell Library Generation
3.1 Introduction
3.2 Tool Flow
3.3 Cell Library Generation
3.4 Cost Assignment
3.5 Summary
CHAPTER 4 Area-driven Clustering Algorithm with Considerations of Interconnect Connectivity and Circuit Speed
4.1 Introduction
4.2 Lower-bound Calculation
4.2.1 Problem Statement and Dynamic Programming Approach
4.2.2 Set containment relations
4.2.3 Minimum number of pASIC3 logic cells with given base gates
4.2.4 Type distribution table
4.2.5 Problem formulation and solution
4.3 Area-driven Clustering Technique
4.3.1 Interconnect-aware Clustering
4.3.2 Timing Slack-driven Clustering
4.4 Experiment Results
4.5 Summary
CHAPTER 5 Timing-driven Clustering
5.1 Introduction
5.2 Problem statement
5.3 Multi-dimensional labeling algorithm
5.4 Signal Path Aware Slack-time relaxation
5.5 Merging algorithm
5.6 Experiment Results
5.7 Summary
CHAPTER 6 Low-power Clustering with Minimum Logic Replication
6.1 Introduction
6.2 Design Flow and Problem Description
6.3 Low Power Clustering
6.3.1 Cluster generation and power-delay curves
6.3.2 Correct accounting of logic replication
6.4 Cluster selection
6.5 Implementation and Experimental Results
6.6 Summary
CHAPTER 7 Conclusion and Future Work
7.1 Dissertation Summary
7.2 Future Work
Bibliography
LIST OF TABLES
Table 3.1: Cell distribution after cell personalization from base gates
Table 3.2: Cell distribution after identifying common primitive cells among base gates
Table 3.3: Filtered primitive cells
Table 4.1: The type distribution table for primitive cell to base-gate mapping
Table 4.2: Results of lower-bound calculation
Table 4.3: Results of different clustering objectives with the minimum area solution
Table 5.1: Results of timing-driven clustering
Table 5.2: Results of slack-time relaxation
Table 6.1: Low-power clustering results: Area and delay
Table 6.2: Low-power clustering results: Power and CPU time
LIST OF FIGURES
Figure 1.1: Virtex II CLB Element
Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell, (b) FPGA architecture, and (c) antifuse switch
Figure 2.1: Coarse-grained, SRAM-based FPGA [1]
Figure 2.2: Coarse-grained, antifuse-based FPGA
Figure 2.3: FPGA CAD flow
Figure 2.4: Input reduction by adding a BLE
Figure 2.5: BLE criticality assignment
Figure 2.6: Clustering: (a) before packing node B into cluster C and (b) after packing node B into cluster C
Figure 3.1: Proposed CAD tool flow for pASIC3 family FPGA
Figure 3.2: Functions in Packer-pASIC3
Figure 3.3: pASIC3 base gates derived from the configurable logic cell
Figure 3.4: Venn diagram for the set of logic cells that can be personalized from the base gates
Figure 4.1: Interconnect switch architecture for two different FPGAs
Figure 4.2: One dimensional coin change problem
Figure 4.3: Examples of local neighborhood connectivity factor computation
Figure 4.4: Clustering nodes
Figure 4.5: Packing un-clustered nodes by using linear assignment: (a) partially clustered network; (b) bipartite graph for linear assignment
Figure 4.6: Selecting the best node for clustering: (a) greedy selection and (b) intelligent selection
Figure 4.7: Selecting the best node for delay improvement
Figure 5.1: Multi-dimensional labeling algorithm
Figure 5.2: Clustering example
Figure 5.3: Slack-time relaxation with awareness of signal path
Figure 6.1: An example of redundant logic replication in clustering: (a) clusters and the corresponding area-delay points, (b) non-inferior clusters, (c) circuit after logic replication (i.e., n1, n2, and n3 are duplicated), and (d) a desired clustering solution
Figure 6.2: PD curve generation for a node with a cluster
Figure 6.3: Example of logic replication prediction
Figure 6.4: Prediction of logic replication
Figure 6.5: Logic replication cases: (a) child node is replicated, and (b) root node is replicated
Figure 6.6: Logic replication for cluster selection
Coarse-grained, antifuse-based FPGAs have emerged as a compelling technology
to minimize the performance gaps between FPGAs and ASICs in area, speed, and
power dissipation. As these FPGA architectures favor large, programmable logic
blocks, efficient clustering algorithms are vital to exploit the benefits of
those advanced architectures.
Circuit clustering is an important technique for coarse-grained FPGAs. First,
clustering can reduce the complexity of large circuit designs by a significant factor.
Second, clustering can improve the quality of the results of other operations such as
placement and routing.
In this dissertation, clustering techniques for area, delay, and power dissipation are
proposed. First, an area-driven clustering algorithm is presented to minimize the
number of macro logic cells required to cover a network. This algorithm calculates
the minimum number of logic cells by solving a multi-dimensional coin-change
problem or a linear programming formulation. Subsequently, with the minimum
number of macro logic cells available, the actual clustering, which packs nodes into
clusters, is performed to improve routability and delay. Next, a timing-driven
clustering algorithm is presented to minimize the number of macro logic cells on
the longest input-output path. The algorithm optimally labels nodes for the smallest
delay and then minimizes redundant logic replication by using slack-time
relaxation during the clustering phase. Finally, a low-power clustering algorithm is
presented to minimize power dissipation with minimum logic replication. The
algorithm accurately emulates logic replication to estimate the cost incurred by
the logic replication needed to meet timing constraints. Based on this information, the
proposed algorithm substantially reduces the size of the replicated logic, resulting in
benefits in area, delay, and power dissipation.
CHAPTER 1 INTRODUCTION
1.1 MOTIVATION
Getting a product to market quickly is a pivotal success factor in today’s
ever-changing electronics market. Field programmable gate arrays (FPGAs) can offer
unique advantages over application specific integrated circuits (ASICs) because
they allow quick and cost-effective validation of products.
There are two basic classes of FPGA devices in the market today: SRAM-based
FPGAs and antifuse-based FPGAs. SRAM-based FPGAs utilize look-up tables
(LUTs). A LUT is a small one-bit-wide memory array, where the address lines of
the memory are the inputs of the logic cell, and the single-bit output of the memory is
the LUT output. Generally, a programmable logic cell contains two or more LUTs
connected in some manner. Each LUT can realize any logic function of K inputs by
writing the logic function’s truth table directly into the memory. Loading
configuration data into the internal memory cells customizes the FPGA. In
contrast, antifuse-based FPGAs are based on a structure similar to traditional gate
arrays. An antifuse resides in a high-impedance state and can be programmed into
a low-impedance or "fused" state. The antifuse technology is less expensive than
the SRAM technology, but an antifuse device can be programmed only once. Here, a
programmable logic cell is composed of simple gates and multiplexers. The logic cell
is programmed by assigning the input signals to constant binary values or shorting
them together. Consequently, such a logic cell can also realize a wide range of
Boolean functions. Antifuse-based FPGAs are, therefore, smaller than
SRAM-based FPGAs of the same equivalent gate
capacity. SRAM-based interconnect contains transistor switches, while antifuse-based
interconnect can be regarded as the standard metal interconnect found in
ASIC chips.
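The LUT description above can be illustrated with a minimal Python sketch. It assumes only what the text states: a K-input LUT is a 2^K-entry, one-bit-wide memory addressed by the input signals. The helper names `program_lut` and `read_lut`, and the example function, are hypothetical and chosen for illustration only.

```python
from itertools import product


def program_lut(func, k):
    """Return the 2**k-entry truth table (the LUT memory contents) of func."""
    # product() enumerates input combinations with the first input as the
    # most significant address bit: (0,...,0), (0,...,1), ...
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def read_lut(table, inputs):
    """Evaluate the LUT: the input bits form the memory address."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]


# Hypothetical example: a 4-input LUT configured as (a AND b) XOR (c OR d).
table = program_lut(lambda a, b, c, d: (a & b) ^ (c | d), 4)
```

Reprogramming the same LUT for a different function only changes the 16 stored bits; the read path stays the same, which is why loading configuration data is enough to customize an SRAM-based device.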
There are three primary classes of FPGA architectures: coarse-grained,
medium-grained, and fine-grained. Coarse-grained architectures contain very large
programmable logic blocks with 20-30 inputs and 4-6 outputs. One can think of
such large blocks as standard programmable logic devices (PLDs). This
architecture provides a cross-networked interconnect and therefore
supports interconnect flexibility between such large blocks. Medium-grained
architectures consist of large logic blocks, often containing two or more look-up
tables and two or more flip-flops. In a majority of these architectures, a four-input
look-up table (think of it as a 16x1 ROM) implements the actual logic. A larger
logic block usually corresponds to improved speed. The other architecture type is
called fine-grained. These devices contain a large number of relatively simple
logic blocks. A fine-grained architecture usually allows only close connections,
which greatly reduces interconnect flexibility. These devices are good at systolic
functions and have some benefits for designs created by logic synthesis.
Figure 1.1: Virtex II CLB Element
As interconnect has become a dominant component in deep-submicron design
technology, coarse-grained FPGA architectures have become ubiquitous in the FPGA
industry because of their benefits in area, delay, and power. Figure 1.1 shows the
organization of the Xilinx Virtex II family [69]. It has configurable logic blocks
(CLBs), which are organized in an array and may be used to build combinational
and sequential logic designs. Each CLB element comprises four similar slices.
Each slice includes two four-input function generators. Each function generator is a
four-input LUT. Therefore, each CLB has eight LUTs. The QuickLogic pASIC3 family,
shown in Figure 1.2, is based on antifuse technology [52]. Devices in the
family are based on an array of highly flexible logic cells, which have been
optimized to efficiently implement a wide range of logic functions at high speed.
Each cell can implement one large function, four independent smaller functions, or
any combination in between. The logic cell has a fan-in of 29 and five outputs,
including the output of a flip-flop.
The structure and granularity of the logic block have a significant impact on the
area efficiency of the FPGA [1]. If the logic block is fine-grained, the circuit to be
implemented will be distributed over a larger group of logic blocks. This has a
negative impact on routability, since more blocks need to be interconnected. If
several LUTs are clustered into one logic block, signal sharing among LUTs can be
exploited. Since the interconnect inside a logic block is hardwired, local
interconnect can be made very fast and efficient. This improves
routability and significantly decreases the load on the router. On the other hand, it
is not feasible to increase the complexity of the logic blocks beyond a certain limit.
If the logic blocks become too complex, it becomes difficult to fully utilize these
blocks, which may waste logic and hence chip area. This may cause diminishing
returns such as large area and low speed.
Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell,
(b) FPGA architecture, and (c) antifuse switch
In a typical FPGA CAD tool flow, clustering, which follows the technology
mapping step, is an important optimization phase because it maps the target circuit
netlist onto the FPGA array. Clustering refers to the task of grouping
logic gates in the circuit netlist and assigning each group to a configurable logic
block in the FPGA array. Since poor clustering may significantly degrade
the final design in terms of area, delay, and power, clustering must be done carefully
before placement and routing.
In this dissertation, our research focuses on clustering techniques for
coarse-grained, antifuse-based FPGAs. The clustering problem for coarse-grained,
antifuse-based FPGAs is quite different from the typical clustering problems
known for SRAM-based FPGAs. A coarse-grained, antifuse-based FPGA
architecture demands highly intelligent CAD algorithms, because the architecture
provides tremendous flexibility with the least hardware overhead. The hardware
overhead of a large logic cell with many inputs and multiple outputs is
very small; an antifuse that connects two metal wires is smaller than a via
[52]. On the other hand, the number of functions that the logic cell can
implement increases exponentially. In order to utilize the logic cell efficiently,
CAD tools should be equipped with highly efficient algorithms.
1.2 DISSERTATION OUTLINE
There are four parts to this research: library cell generation, area-driven clustering,
timing-driven clustering, and low-power clustering.
In CHAPTER 2, an overview of coarse-grained FPGAs is provided and a brief
review of the FPGA CAD tool flow is presented. In CHAPTER 3, we present
the procedure for generating library cells from the target pASIC3 family FPGA
architecture. The library cells are used during technology mapping, before
clustering, to cover a logically optimized network. For the area-driven clustering, in
CHAPTER 4, we present two approaches to calculate the minimum number of
pASIC3 logic cells needed to cover a network mapped with library cells. First, we use
multi-dimensional coin-change dynamic programming as a general solution for
general logic cell architectures. Second, by fully exploiting the architectural
characteristics of the pASIC3 logic cell, we set up two linear equations and find the
optimal packing solution. Under the constraints of the minimum-area solution, we
present an interconnect-aware clustering algorithm and a timing-driven clustering
algorithm. In CHAPTER 5, we present a timing-driven clustering algorithm, which
minimizes the number of pASIC3 logic cells on the longest input-output path.
Logic replication is minimized by slack-time relaxation. A low-power clustering
algorithm is presented in CHAPTER 6. In the low-power clustering algorithm, we
minimize logic replication while meeting the timing constraint under a low-power
optimization objective. The algorithm accurately simulates the logic replication caused
by timing constraints during a post-order traversal. This technique substantially reduces the
size of the duplicated logic, resulting in benefits in area, delay, and power
dissipation. We provide a summary of this dissertation and possible extensions in
CHAPTER 7.
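To make the coin-change idea mentioned for CHAPTER 4 more concrete, the following Python sketch solves a small covering variant of the multi-dimensional coin-change problem with memoized dynamic programming. It is only an illustration under stated assumptions: the demand vector (number of mapped nodes of each type) and the list of cell configurations (how many nodes of each type one logic cell can hold) are hypothetical and are not the pASIC3 data used in the dissertation.

```python
from functools import lru_cache


def min_cells(demand, configs):
    """Minimum number of cells whose configurations cover the demand vector.

    demand  -- tuple with the number of nodes of each type to be covered
    configs -- hypothetical per-cell capacity vectors (the "coins")
    """
    configs = [tuple(c) for c in configs]

    @lru_cache(maxsize=None)
    def solve(remaining):
        if not any(remaining):
            return 0  # everything is covered
        best = float("inf")
        for cfg in configs:
            nxt = tuple(max(r - c, 0) for r, c in zip(remaining, cfg))
            if nxt != remaining:  # the cell must cover at least one node
                best = min(best, 1 + solve(nxt))
        return best

    return solve(tuple(demand))


# Hypothetical example: two node types, three ways to configure one cell.
cells_needed = min_cells((3, 2), [(2, 0), (1, 1), (0, 2)])
```

In this toy instance every configuration holds two nodes and five nodes must be covered, so three cells are needed. The state space grows with the product of the demand entries, which is why a closed-form formulation is attractive when the cell architecture allows one.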
CHAPTER 2 COARSE-GRAINED FPGAS AND PREVIOUS WORK
This chapter contains an overview of the typical architectures of coarse-grained
FPGAs and a review of previous work on FPGA CAD tools.
2.1 OVERVIEW OF COARSE-GRAINED FPGA ARCHITECTURE
FPGAs usually consist of small, configurable basic elements connected by rich
programmable interconnects [4]. Since routing resources grow faster than on-chip
logic resources, routing resources account for the major portion of the device’s
overall area and delay. The speed and area efficiency of an FPGA are directly related to
the granularity of its logic block [1][47][48]. While coarse-grained blocks have
long internal logic delays, they can reduce placement and routing stress by
providing fast local routing and significantly reducing external routing.
Typically, synthesis tools prefer “gate array-like” fine-grained architectures;
however, fine-grained FPGA architectures generally yield very poor delay, due to
the long delays resulting from building functions with multiple levels of gates and
slow interconnect elements. A coarse-grained architecture gives tools the needed
degrees of freedom to obtain the high logic utilization benefits of a fine-grained
architecture without sacrificing the high performance benefits of a coarse-grained,
high fan-in architecture.
Figure 2.1 shows a typical SRAM-based FPGA. A basic logic element (BLE)
consists of a look-up table and a flip-flop, and several BLEs make up a
programmable logic block (PLB). Dedicated routing is provided inside each cluster
for communication between the local BLEs. For clusters of size greater than one,
the architecture is fully connected: each BLE input can be connected to any
of the cluster inputs or to the output of any BLE within the cluster.
Examples of coarse-grained FPGAs are the Xilinx Virtex [69] and the Apex and
Flex families from Altera [5]. In these architectures, groups of basic logic elements are
clustered to provide better performance.
Figure 2.2 illustrates the logic cell architecture of pASIC3 from QuickLogic [52].
The pASIC3 family FPGA is based on antifuse programming technology.
The pASIC3 device architecture consists of an array of user-configurable logic
building blocks, called logic cells, set beneath a grid of metal wiring channels
similar to those of a gate array. Through antifuses located at the wire intersections,
the output(s) of any cell may be programmed to connect to the input(s) of any other
cell.
Figure 2.1: Coarse-grained, SRAM-based FPGA [1]
The pASIC3 logic cell, shown in Figure 2.2(a), is a general-purpose building block
that can implement most gate array macro library functions. Since the logic cell has
multiple outputs, it can implement one large function or multiple smaller
independent functions in parallel. The function of a logic cell is determined by the
logic levels applied to the inputs of the AND gates and multiplexers. The high logic
capacity and fan-in of the logic cell accommodate many user functions with a
single level of logic delay. Figure 2.2(b) shows some of the possible configurations
of the logic cell. Since all connections within the cell are hardwired, the various
functions are available in parallel. Thus, very wide, complex functions are
implemented at the same cell speed as the much smaller “fragment” functions.
Related and unrelated functions can be packed into the same logic cell, increasing
effective density and gate utilization.
Figure 2.2: Coarse-grained, antifuse-based FPGA: (a) logic cell and (b) some of the possible configurations of the logic cell
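The idea that a cell's function is fixed by the logic levels applied to its inputs can be illustrated with a toy Python sketch. It does not model the actual pASIC3 cell; it assumes a single hypothetical 2:1 multiplexer (`mux2`) and shows how tying its data inputs to constants or to other signals "personalizes" it into different gates.

```python
def mux2(d0, d1, sel):
    """Hypothetical 2:1 multiplexer: output d1 when sel is 1, else d0."""
    return d1 if sel else d0


# Personalizations obtained by tying inputs to constants or signals.
def and2(a, b):
    return mux2(0, b, a)  # a ? b : 0


def or2(a, b):
    return mux2(b, 1, a)  # a ? 1 : b


def not1(a):
    return mux2(1, 0, a)  # a ? 0 : 1
```

The same hardwired structure therefore realizes several Boolean functions with identical delay; only the programmed connections differ.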
Figure 2.3: FPGA CAD flow
2.2 FPGA CAD FLOW
Figure 2.3 illustrates the CAD tool flow that is typically encountered in FPGA design.
The logic can be optimized by any RT-level optimization tool, e.g., SIS [59].
During the technology-mapping phase, the logic-optimized circuit is mapped to
basic logic elements, and closely connected basic logic elements are then packed
together into programmable logic blocks. Finally, placement and routing are
performed on those programmable logic blocks. In the following sections, we
briefly review technology mapping techniques, clustering techniques, and placement
and routing algorithms.
2.2.1 TECHNOLOGY MAPPING
In a standard cell design procedure for application specific integrated circuits
(ASICs), technology mapping maps the optimized circuit onto a target library.
FPGAs, however, have clusters of basic logic elements, and those basic logic
elements are ready to be programmed to implement certain functions. Therefore,
technology mapping locates a feasible portion of the circuit and implements
the function of that portion in a basic logic element. Various mapping techniques
have been developed for the two types of FPGAs. An extensive survey of
existing SRAM-based FPGA mapping techniques is given by Cong and Ding [19].
They also developed FlowMap, which is guaranteed to produce depth-optimal mapping
solutions [22]. CutMap [23] is an improvement over FlowMap that considers
area minimization during delay optimization. For antifuse-based FPGAs, Boolean
matching techniques have been used for technology mapping, and research results
on technology mapping for antifuse logic cells have been reported [30]. Boolean
matching is therefore a key enabler for antifuse-based FPGA mapping. Lai et al.
[42] proposed a Boolean matching algorithm and introduced matching filters for
speedup. A more comprehensive review is provided in [6].
2.2.2 CLUSTERING TECHNIQUES
Once technology mapping is accomplished, the mapped netlist is provided
to a clustering algorithm. The clustering algorithm packs multiple basic logic
elements into a logic cluster.
Many clustering techniques for SRAM-based FPGAs are
constructive. In the following sections, the major
achievements in clustering techniques are reviewed in chronological order.
2.2.2.1 Rapid System Prototyping (RASP)
The RASP system is a general synthesis and mapping system for SRAM-based
FPGAs [20]. It consists of a core with a set of synthesis and optimization algorithms
for technology-independent logic synthesis and technology mapping for generic
look-up table (LUT) network generation. It also has a set of architecture-specific
technology mapping routines to map the generic LUT network to programmable
logic blocks in various SRAM-based FPGA architectures.
The clustering algorithm is based on a sequence of maximum weighted matching
operations on a compatibility graph, which yields the proper grouping of LUTs into
programmable logic blocks (PLBs). For each step, a compatibility graph is formed
in which vertices represent the partial PLBs (initially LUTs) that will be considered
for grouping at this step. An edge is formed between two vertices if the two
corresponding partial PLBs can be grouped into one. Then, weights are
assigned to the edges to guide the matching algorithm to select the best merging of
partial PLBs. Different weights are assigned for different optimization objectives.
For delay optimization, a larger weight is given to an edge corresponding to the
grouping of two LUTs that may reduce the length of a critical path in the PLB
network. For routability, a larger weight is given to an edge that
corresponds to the grouping of two “close” LUTs, so that it does not create
complex interconnection patterns in the final mapping solution. The “closeness” of
two LUTs can be measured by the overlap of their fanin subnetworks. The weight
is computed from $N_v$ and $N_w$, the sets of edges on vertices v and w, respectively. The
compatibility graph becomes a bipartite graph whose edges are weighted. Finally,
the bipartite matching problem is solved to find the best packing solution.
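A single matching step of this kind can be sketched in Python as follows. This is not the RASP implementation: for brevity it uses an exhaustive maximum-weight matching on a general (not necessarily bipartite) compatibility graph, which is practical only for a handful of LUTs, and it assumes a hypothetical closeness weight equal to the number of shared fanin signals and a hypothetical input limit `MAX_INPUTS` as the compatibility test.

```python
def max_weight_matching(nodes, weight):
    """Exhaustive maximum-weight matching over a small compatibility graph.

    weight(u, v) returns the edge weight, or None if u and v are incompatible.
    Returns (total_weight, list_of_pairs).
    """
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0, []
    first, rest = nodes[0], nodes[1:]
    # Option 1: leave `first` unmatched.
    best_w, best_pairs = max_weight_matching(rest, weight)
    # Option 2: pair `first` with each compatible partner.
    for partner in rest:
        w = weight(first, partner)
        if w is None:
            continue
        others = [n for n in rest if n != partner]
        sub_w, sub_pairs = max_weight_matching(others, weight)
        if w + sub_w > best_w:
            best_w, best_pairs = w + sub_w, [(first, partner)] + sub_pairs
    return best_w, best_pairs


# Hypothetical LUT fanin sets and a hypothetical PLB input limit.
fanin = {
    "L1": {"a", "b", "c"},
    "L2": {"b", "c", "d"},
    "L3": {"x", "y"},
    "L4": {"y", "z"},
}
MAX_INPUTS = 4


def closeness(u, v):
    if len(fanin[u] | fanin[v]) > MAX_INPUTS:
        return None  # the two LUTs cannot share one PLB
    return len(fanin[u] & fanin[v])  # assumed weight: shared fanin signals


total, pairs = max_weight_matching(sorted(fanin), closeness)
```

A production flow would replace the exhaustive search with a polynomial-time matching algorithm; the sketch only shows how edge weights steer which partial PLBs are merged.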
2.2.2.2 VPACK
VPACK [7] is a clustering algorithm that minimizes both the number
of logic clusters and the number of used inputs of each cluster. Minimizing the used
inputs of each cluster is important for a routable design. The algorithm constructs
each cluster sequentially. First, a seed BLE is chosen: the one with the most used
inputs among the currently un-clustered BLEs, because inputs are a scarce resource. Then,
VPACK greedily selects the BLE that shares the most inputs and outputs with the
cluster being constructed. The attraction between a BLE B and the current cluster
C is the number of nets they share:
$$\mathrm{Attraction}(B, C) = |\mathrm{Nets}(B) \cap \mathrm{Nets}(C)| \qquad (2.2)$$
This procedure of greedily selecting a BLE to include in the cluster continues until
either the cluster is full or adding any additional un-clustered BLE would
cause the number of distinct inputs needed by the cluster to exceed the number of
inputs allowed. If the cluster is full, a new seed BLE is selected and a new cluster is
started. If, however, the cluster is not full but no BLE
can be added without exceeding the number of allowed inputs, a hill-climbing
phase is invoked to give further BLEs a chance to be packed into the cluster. If a BLE has all
its inputs in the cluster and its output is connected to a BLE in the cluster, adding the
BLE to the cluster will reduce the number of inputs by one. Figure 2.4 illustrates an
example of reducing inputs by adding a BLE. The hill-climbing phase terminates
when the cluster is full; if the cluster is still infeasible, VPACK backs up to the last
point at which the cluster was feasible.
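The greedy part of the procedure described above can be sketched in Python as follows. This is a simplified illustration, not the VPACK code: the hill-climbing phase is omitted, ties are broken alphabetically, and the netlist, the function name `vpack`, and the parameters `cluster_size` and `max_inputs` are assumptions made for the example.

```python
def vpack(bles, cluster_size, max_inputs):
    """Greedy VPACK-style packing (hill-climbing phase omitted).

    bles -- dict: BLE name -> (set of input nets, output net)
    """
    unclustered = dict(bles)
    clusters = []

    def nets(name):
        return bles[name][0] | {bles[name][1]}

    def external_inputs(members):
        produced = {bles[m][1] for m in members}
        used = set().union(*(bles[m][0] for m in members))
        return used - produced  # nets generated inside need no input pin

    while unclustered:
        # Seed: the un-clustered BLE with the most used inputs.
        seed = max(sorted(unclustered), key=lambda b: len(bles[b][0]))
        cluster = [seed]
        del unclustered[seed]
        while len(cluster) < cluster_size:
            cluster_nets = set().union(*(nets(m) for m in cluster))
            feasible = [b for b in sorted(unclustered)
                        if len(external_inputs(cluster + [b])) <= max_inputs]
            if not feasible:
                break
            # Attraction: number of nets shared with the current cluster.
            best = max(feasible, key=lambda b: len(nets(b) & cluster_nets))
            cluster.append(best)
            del unclustered[best]
        clusters.append(cluster)
    return clusters


# Hypothetical four-BLE netlist.
example_bles = {
    "A": ({"i1", "i2", "i3"}, "n1"),
    "B": ({"n1", "i2"}, "n2"),
    "C": ({"i4", "i5"}, "n3"),
    "D": ({"n3", "i5"}, "n4"),
}
example_clusters = vpack(example_bles, cluster_size=2, max_inputs=4)
```

In the example, B joins A because it shares two nets with A and keeps the cluster within four external inputs, whereas adding C or D to A would require five.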
2.2.2.3 T-VPACK
T-VPACK [49] is based on the VPACK algorithm [7]. Its optimization goal is to
minimize the number of external connections (connections between clusters) on
the critical path. Since the external routing delay between clusters is much larger than the
delay of local routing inside a cluster, minimizing the number of external connections on critical paths
can improve delay significantly.
The algorithm consists of two steps: static timing analysis and clustering. In the
static timing analysis step, slack is estimated. Slack [33] is defined as the amount
of delay that can be added to a connection without increasing the delay of the
entire circuit. The slack at BLE input pin i is defined as:
$$\mathrm{Slack}(i) = T_{\mathrm{required}}(i) - T_{\mathrm{arrival}}(i) \qquad (2.3)$$
where $T_{\mathrm{required}}(i)$ and $T_{\mathrm{arrival}}(i)$ are the required time and arrival time at input pin i.
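The criticality of the connection at input i is described as:
$$\mathrm{Connection\_Criticality}(i) = 1 - \frac{\mathrm{Slack}(i)}{\mathrm{MaxSlack}} \qquad (2.4)$$
where MaxSlack is the largest slack among all connections.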
During the clustering phase, a seed BLE is selected and further BLEs are attracted to it.
The seed BLE is the un-clustered BLE with the most critical connection in the circuit.
The attraction function is extended to include timing information. The criticality of a
BLE is defined as the maximum Connection_Criticality value of all connections
that connect the BLE to BLEs within the cluster currently being packed. An
example is shown in Figure 2.5. The attraction function is defined as follows:
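$$\mathrm{Attraction}(B) = \alpha \cdot \mathrm{Criticality}(B) + (1 - \alpha) \cdot \frac{|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)|}{G} \qquad (2.5)$$
where $\alpha$ is a trade-off parameter between 0 and 1 and G is a normalization factor, set to the maximum number of nets to
which any BLE can connect. The time complexity of this algorithm is $O(n^2)$,
where n is the number of BLEs in the circuit.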
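The timing quantities used by T-VPACK can be sketched in Python as follows. The slack definition follows the text; the criticality formula (one minus slack divided by MaxSlack) and the weighted-sum attraction with a trade-off parameter `alpha` follow the formulation commonly given for T-VPACK and should be read as assumptions here. All function names and numeric values are hypothetical.

```python
def slacks(required, arrival):
    """Slack of each input pin: required time minus arrival time."""
    return {pin: required[pin] - arrival[pin] for pin in required}


def connection_criticality(slack):
    """Assumed formula: criticality = 1 - slack / MaxSlack."""
    max_slack = max(slack.values())
    if max_slack == 0:
        return {pin: 1.0 for pin in slack}  # every connection is critical
    return {pin: 1.0 - s / max_slack for pin, s in slack.items()}


def attraction(ble_criticality, shared_nets, g, alpha):
    """Assumed weighted sum of timing criticality and normalized net sharing."""
    return alpha * ble_criticality + (1 - alpha) * shared_nets / g


# Hypothetical timing data for three input pins.
pin_slack = slacks({"a": 10, "b": 10, "c": 10}, {"a": 10, "b": 8, "c": 6})
pin_crit = connection_criticality(pin_slack)
```

With `alpha` close to 1 the packer follows timing criticality almost exclusively, and with `alpha` close to 0 it degenerates to the net-sharing attraction of equation (2.2).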
Figure 2.5: BLE criticality assignment
2.2.2.4 Routability-driven Packing (RPACK)
RPACK [10] is a routability-driven packing algorithm that first identifies
routability factors and then combines these factors, by priority, into an improved clustering cost
function. Three factors have been introduced to achieve routable clustering:
• Ratio of pins per net
• Ratio of used pins to the total number of pins of the logic block
• Number of nets
The ratio of pins per net indicates the density of high-fanout nets in the circuit. The
ratio of used pins to the total number of pins of the logic block indicates the traffic
into and out of a logic block. The number of nets is also closely related to
routability.
In the first stage, a LUT and a register are packed into a basic logic block when
possible. After that, the logic blocks are packed into clusters using a heuristic
approach, in which clusters are constructed sequentially. A seed is chosen
to start a new cluster. The best choice of seed for each cluster is the
un-clustered logic block with the most used inputs, as indicated by VPACK and
T-VPACK. After choosing a seed, the basic logic element that gives the highest
gain is selected and added to the current cluster, provided the number of external inputs
does not exceed the number of input pins of the cluster.
Figure 2.6: Clustering: (a) before packing node B into cluster C and (b)
after packing node B into cluster C
In equation (2.2), the gain is the number of inputs and outputs that
a BLE, B, and the current cluster, C, have in common. However, the contributions of
individual nets between two blocks can differ significantly in terms of routability.
By considering just the number of shared inputs and outputs, the packing algorithm
cannot differentiate among candidate blocks that have different impacts on
routability. For example, the three nets N1, N2, and N3 in Figure 2.6 make different
contributions. By moving BLE B into the cluster, one terminal can be removed from
N1, one terminal and one input pin can be saved in N2, and one terminal and one
unit of output congestion can be removed from N3. The authors derived a table of
gains of the candidate block with respect to a single net, which stores the gains for
the different connections of a net. The table is used to compute the total gain of
packing the logic element into a cluster.
The results showed that the major portion of the decrease in the number of nets
is due to the decrease in the number of two-terminal nets. The complexity of this
algorithm is $O(M^2)$, where M is the number of basic logic elements (BLEs).
2.2.2.5 Interconnect Resource Aware Clustering (iRAC)
iRAC [61] reduces the external routing requirement of clustered FPGAs by packing
closely connected components together while considering spatial uniformity
in the clustered design, using Rent’s rule [27]. It alleviates routing congestion in
clustered FPGAs by absorbing as many small nets into clusters as possible and
depopulating clusters according to Rent’s rule, in order to achieve spatial
uniformity in the clustered netlist.
The complexity of a cluster can be characterized with the well-known
power-law relationship of Rent’s rule [27]:
$$N_{io} = K \cdot B^{p} \qquad (2.6)$$
where $N_{io}$ is the number of pins in a cluster, B is the number of basic logic elements
(BLEs), K is the average number of connections per BLE in the cluster, and p (0 <
p < 1) is the Rent’s parameter (exponent). Given a specific FPGA architecture, a
Rent’s parameter, $P_a$, can be estimated from equation (2.6), because $N_{io}$, K, and
B can be obtained from the architecture. The connectivity factor is defined by the
ratio:
$$c = \frac{\mathrm{separation}}{\mathrm{degree}^{2}} \qquad (2.7)$$
where degree is the number of nets incident to a BLE and
separation is the sum of all terminals of the nets incident to that BLE. The smaller the c
value of a BLE, the more BLEs are located in its neighborhood. In
chosen as a seed for a new cluster Then gains, for neighbors connected to the BLE,
are assigned and the un-clustered BLE with the highest gain is absorbed into the
cluster The gain function for BLE X with a net x to the cluster C is defined as
follows:
where $\alpha_x$ is the number of pins of net x already inside cluster C, n is the cluster
size, and w(x) is the weight of net x (w(x) = 2/r, where r is the number of pins on x).
In order to guarantee spatial uniformity of the clustered netlist, the algorithm limits
the number of available pins using Rent’s rule. With this limit, the Rent’s parameter
of any cluster is no more than the Rent’s parameter of the FPGA architecture, $P_a$.
On average, the area saved is 35% compared to previously published results
[10].
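The quantities used by iRAC can be illustrated with a small sketch. It assumes Rent's rule in the form N_io = K * B^p and the connectivity factor c = separation / degree^2. The function names, the lexicographic tie-breaking for the seed (highest degree first, then lowest c), and the sample numbers are hypothetical and are not taken from [61].

```python
import math

def rent_exponent(n_io, k, b):
    """Solve N_io = K * B**p for the Rent exponent p."""
    return math.log(n_io / k) / math.log(b)

def connectivity_factor(net_sizes):
    """net_sizes holds the terminal count of every net incident to one BLE."""
    degree = len(net_sizes)       # number of incident nets
    separation = sum(net_sizes)   # sum of terminals over those nets
    return separation / degree ** 2

def pick_seed(bles):
    """bles maps a BLE name to the terminal counts of its incident nets.

    Assumed ordering: highest degree first, ties broken by lowest c.
    """
    return min(bles, key=lambda name: (-len(bles[name]),
                                       connectivity_factor(bles[name])))

def max_cluster_pins(k, n_bles, p_arch):
    """Pin budget for a cluster of n_bles BLEs under the architecture exponent."""
    return k * n_bles ** p_arch

# Hypothetical architecture: 8 BLEs per cluster, 4 connections per BLE, 16 pins.
p_a = rent_exponent(n_io=16, k=4, b=8)
seed = pick_seed({"a": [2, 2], "b": [2, 3, 2], "c": [4, 5, 6]})
```

Capping the pins of a partially filled cluster at `max_cluster_pins(k, n_bles, p_a)` is one way to keep its Rent exponent at or below that of the architecture, which is the depopulation criterion described above.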
2.2.3 PLACEMENT AND ROUTING
Placement is the process by which a netlist of logic blocks is mapped into physical
locations in an FPGA. The locations where the blocks are mapped can significantly
affect the performance of the FPGA.
Simulated annealing (SA) placement has been commonly used, since the number of
logic blocks has so far been manageable within a reasonable amount of run time. The
simulated annealing algorithm mimics the annealing process used to gradually cool
molten metal to produce high-quality metal structures [57][40]. A simulated
annealing placer initially places logic blocks randomly into physical locations in an
FPGA. The placement is then iteratively improved by randomly swapping blocks
and evaluating the quality of each swap with a cost function. If a move results
in a reduction in the placement cost, the move is accepted. If a move would
cause an increase in the placement cost, it may still be accepted even
though it makes the placement worse. The purpose of accepting some bad moves is
to prevent the placer from being trapped in a local minimum.
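The annealing loop described above can be sketched as follows. The text does not state how bad moves are accepted, so the sketch assumes the standard Metropolis rule (accept with probability exp(-delta/T)) and a geometric cooling schedule. All names and parameter values are hypothetical and do not describe any particular placer.

```python
import math
import random

def accept_move(delta_cost, temperature, rng):
    """Assumed Metropolis rule: always accept improvements, sometimes accept bad moves."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

def anneal(blocks, cost_fn, t_start=10.0, t_end=0.01, alpha=0.9,
           moves_per_temp=50, seed=0):
    """placement[slot] = block; cost_fn maps a placement to a number."""
    rng = random.Random(seed)
    placement = list(blocks)
    rng.shuffle(placement)                      # random initial placement
    cost = cost_fn(placement)
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            i, j = rng.sample(range(len(placement)), 2)
            placement[i], placement[j] = placement[j], placement[i]   # swap
            new_cost = cost_fn(placement)
            if accept_move(new_cost - cost, temperature, rng):
                cost = new_cost                 # keep the swap
            else:
                placement[i], placement[j] = placement[j], placement[i]  # undo
        temperature *= alpha                    # cool down
    return placement, cost
```

At high temperature almost every swap is accepted, which lets the placement escape local minima. As the temperature falls, the acceptance probability for cost-increasing moves shrinks toward zero and the loop behaves like a greedy improver.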
The VPR placer [8] employs a simulated annealing algorithm for FPGA placement to
minimize the wirelength of circuits. The placer uses a bounding-box-based “linear
congestion” [4][8] cost function to estimate wirelength requirements. The linear
congestion cost function is expressed as follows:
where $N_{nets}$ is the number of nets, and $bb_x(i)$ and $bb_y(i)$ are the horizontal and
vertical spans of net i, respectively.
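A bounding-box wirelength estimate built from these quantities can be sketched as below. This is a simplified reading that sums the horizontal and vertical spans over all nets. It is an assumption, not the cost function of [4][8], which may weight each net (for example by fanout or channel capacity). The function name and the input format are hypothetical.

```python
def bounding_box_cost(nets):
    """nets is a list of nets; each net is a list of (x, y) terminal locations."""
    total = 0
    for terminals in nets:
        xs = [x for x, _ in terminals]
        ys = [y for _, y in terminals]
        bbx = max(xs) - min(xs)   # horizontal span of the net
        bby = max(ys) - min(ys)   # vertical span of the net
        total += bbx + bby        # assumed unweighted sum
    return total

# Two hypothetical nets on a small grid.
example_nets = [[(0, 0), (2, 3)], [(1, 1), (1, 4), (3, 1)]]
example_cost = bounding_box_cost(example_nets)
```

Because only the extreme coordinates of a net matter, the cost of a swap can be updated by recomputing the bounding boxes of the nets attached to the two moved blocks rather than the whole sum.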
The problem of routing FPGAs can be stated simply as that of assigning signals to
routing resources so that all signals are routed successfully while a given
overall performance is achieved. The problem bears a considerable
resemblance to the problem of global routing for custom integrated circuit design.
However, the two problems differ in several fundamental respects. First, routing
resources in FPGAs are discrete and scarce, while they are reasonably continuous
in custom integrated circuits. For this reason, FPGAs require an integrated approach
using both a global and a detailed router. A second difference is that the global
routing problem for custom ICs is rooted in an undirected graph, whereas in
SRAM-based FPGAs the switches are often directional. Both of these distinctions
are important, as they prevent direct application of much of the work that has been
done in custom IC routing to FPGAs [28].
PathFinder is a router based on an iterative maze-type router [50]. Nets
are routed sequentially, and once a track segment has been used for one net, other
nets are still allowed to use that segment but must pay a higher cost. Consequently, nets
tend to avoid overusing a segment unless it is necessary or particularly efficient. At
the end of the first iteration (after all nets have been routed), either no
segments are overused and the routing is successful, or some segments are overused
and more routing iterations are executed to try to resolve the contention. In each
of these subsequent routing iterations, every net is ripped up and re-routed. Since
the cost of overused track segments increases with every routing iteration, they
become more expensive and are less likely to be used by more than one net. This
gradual reduction in routing violations is a very successful routing approach.
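The rip-up and re-route loop can be sketched on a toy routing graph. The sketch assumes a node cost of (base + history) * (1 + pres_fac * overuse), a unit capacity for every shared segment, and a linearly growing pres_fac. These are common choices for negotiated-congestion routers but are assumptions here, not details given in [50]. All names and constants are hypothetical.

```python
import heapq

CAPACITY = 1  # assumed capacity of every shared segment

def shortest_path(graph, src, dst, node_cost):
    """Dijkstra over node costs; returns the node list from src to dst."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue                      # stale heap entry
        for v in graph[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def route(graph, nets, shared, max_iters=20):
    """nets maps a name to (src, dst); only nodes in 'shared' can be overused."""
    history = {n: 0.0 for n in graph}
    occupancy = {n: 0 for n in graph}
    routes = {}
    for iteration in range(1, max_iters + 1):
        pres_fac = 0.5 * iteration        # hypothetical schedule
        for name, (src, dst) in nets.items():
            for n in routes.get(name, []):
                occupancy[n] -= 1         # rip up the old route of this net

            def cost(n):
                over = max(0, occupancy[n] + 1 - CAPACITY) if n in shared else 0
                return (1.0 + history[n]) * (1.0 + pres_fac * over)

            routes[name] = shortest_path(graph, src, dst, cost)
            for n in routes[name]:
                occupancy[n] += 1
        overused = [n for n in shared if occupancy[n] > CAPACITY]
        if not overused:
            return routes, iteration      # legal routing found
        for n in overused:
            history[n] += 0.5             # overused segments get pricier
    return None, max_iters
```

In this formulation a net may still pass through an occupied segment, but it pays the present-congestion penalty immediately and the history penalty in every later iteration, so nets with a cheap alternative move away first.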
The VPR router [63] is based on the PathFinder routing algorithm [28]. It has two
enhancements to increase the speed of the basic breadth-first search maze router.
The first is to employ a depth-first search, which directs the router to head towards
specific targets. The second is to reduce the amount of activity on the routing
expansion list for higher-fanout nets by placing on the expansion list only those
segments that are in the neighborhood of the target.
2.3 SUMMARY
In this chapter, an overview of coarse-grained FPGA architectures was provided.
Mainly, SRAM-based FPGAs and antifuse-based FPGAs were introduced. The
CAD flow for FPGAs was presented, and each step was briefly reviewed.
The flow consists of logic optimization, technology mapping, clustering, placement,
and routing.
CHAPTER 3
In this chapter, the proposed tool flow and the procedure for generating library cells
from the pASIC3 logic cell are discussed. The cell library is used during the
technology mapping phase to cover a network.
3.1 INTRODUCTION
The cell library is the key to implementing an application-specific integrated
circuit (ASIC). Each library cell has a specified function, a layout, and various kinds
of information, such as pin delays from inputs to outputs, the area of the cell, and
the expected power dissipation. Library cells are not required until the
technology mapping phase of the ASIC computer-aided design flow, but they
represent the physical implementation of the functions optimized by logic synthesis.
Each cell comes in a few different sizes to provide different characteristics with the
same functionality; thus area, drive strength, and power dissipation differ
among the differently sized cells. Similarly, the cell library for antifuse-based FPGAs is a
collection of cells, each of which has its own function, delay, and power
dissipation. However, such a library cannot contain cells of multiple sizes for the
same function, because all of the library cells are generated from a single logic cell,
such as the pASIC3 logic cell in Figure 2.2. On the other hand, all of the library cells
may have nearly the same delay. In addition, library cells derived from a logic cell
provide more complex functions than ASIC library cells.
The organization of this chapter is as follows. A brief overview of the proposed
tool flow is presented in section 3.2. In section 3.3, the procedure for constructing
the cell library is presented. The cost assignment for each library cell is discussed
in section 3.4. Finally, we conclude in section 3.5.
3.2 TOOL FLOW
Figure 3.1 shows our CAD tool flow for pASIC3 family FPGAs. We generate both
a cell library set and configuration information from the pASIC3 logic cell. A
target circuit is synthesized by SIS [59], and the circuit is then mapped to cells in
the cell library by the technology mapper in SIS [59]. Our clustering tool, called
Packer-pASIC3, packs nodes with four different functions, as shown in Figure 3.2.
A cluster is assigned to a pASIC3 logic cell. VPR [8] places and routes the
clustered network using the architecture description of pASIC3 family FPGAs.