CLUSTERING TECHNIQUES FOR COARSE-GRAINED, ANTIFUSE-BASED FPGAS



by

Chang Woo Kang

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA

In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2006


UMI Number: 3237159

UMI Microform 3237159
Copyright 2007 by ProQuest Information and Learning Company

ProQuest Information and Learning Company
300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346


DEDICATION

To my parents and family, my fiancée Eunju Lee, and my best friend Sung-Hoon Kang, thanks for their unconditional support and love.


ACKNOWLEDGEMENTS

I would like to offer my humble acknowledgment to Professor Massoud Pedram, who supervised and guided me through this achievement. From him, I received not only knowledge on my research but also emotional support whenever I encountered frustration. I also thank Professor Jeff Draper and Professor Roger Zimmermann for being on my thesis committee.

I would like to extend my deep gratitude to Professor Jeff Draper, who supported me at the Information Sciences Institute for three years. His support has been an enormous encouragement during my studies at USC.

I would like to thank all of the SPORT group members who have given freely of their time, hearts, and resources to support this research. A partial list includes: Chanseok Hwang, Kihwan Choi, Ali Iranli, Yazdan Aghahiri, Peng Rong, Yu Hou, Afshin Abdollahi, Wonbok Lee, Maryam Soltan, Morteza Maleki, Hanif Fatemi, Soroush Abbaspour, Hwisung Jung, and Behnam Amelifard.

Finally, I would like to express my deep affection to Yunjung Choi, Ihn Kim, Joongseok Moon, Kisup Chong, and Kihoon Jeong.

Chang Woo Kang

USC, May 2006


TABLE OF CONTENTS

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

CHAPTER 1 Introduction
  1.1 Motivation
  1.2 Dissertation Outline

CHAPTER 2 Coarse-grained FPGAs and Previous Work
  2.1 Overview of Coarse-grained FPGA Architecture
  2.2 FPGA CAD Flow
    2.2.1 Technology mapping
    2.2.2 Clustering Techniques
    2.2.3 Placement and Routing
  2.3 Summary

CHAPTER 3 Tool Flow and Cell Library Generation
  3.1 Introduction
  3.2 Tool Flow
  3.3 Cell Library Generation
  3.4 Cost Assignment
  3.5 Summary

CHAPTER 4 Area-driven Clustering Algorithm with Considerations of Interconnect Connectivity and Circuit Speed
  4.1 Introduction
  4.2 Lower-bound Calculation
    4.2.1 Problem Statement and Dynamic Programming Approach
    4.2.2 Set containment relations
    4.2.3 Minimum number of pASIC3 logic cells with given base gates
    4.2.4 Type distribution table
    4.2.5 Problem formulation and solution
  4.3 Area-driven Clustering Technique
    4.3.1 Interconnect-aware Clustering
    4.3.2 Timing Slack-driven Clustering
  4.4 Experiment Results
  4.5 Summary

CHAPTER 5 Timing-driven Clustering
  5.1 Introduction
  5.2 Problem statement
  5.3 Multi-dimensional labeling algorithm
  5.4 Signal Path Aware Slack-time relaxation
  5.5 Merging algorithm
  5.6 Experiment Results
  5.7 Summary

CHAPTER 6 Low-power Clustering with Minimum Logic Replication
  6.1 Introduction
  6.2 Design Flow and Problem Description
  6.3 Low Power Clustering
    6.3.1 Cluster generation and power-delay curves
    6.3.2 Correct accounting of logic replication
  6.4 Cluster selection
  6.5 Implementation and Experimental Results
  6.6 Summary

CHAPTER 7 Conclusion and Future Work
  7.1 Dissertation Summary
  7.2 Future Work

Bibliography


LIST OF TABLES

Table 3.1: Cell distribution after cell personalization from base gates
Table 3.2: Cell distribution after identifying common primitive cells among base gates
Table 3.3: Filtered primitive cells
Table 4.1: The type distribution table for primitive cell to base-gate mapping
Table 4.2: Results of lower-bound calculation
Table 4.3: Results of different clustering objectives with the minimum area solution
Table 5.1: Results of timing-driven clustering
Table 5.2: Results of slack-time relaxation
Table 6.1: Low-power clustering results: Area and delay
Table 6.2: Low-power clustering results: Power and CPU time


LIST OF FIGURES

Figure 1.1: Virtex II CLB Element
Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell, (b) FPGA architecture, and (c) antifuse switch
Figure 2.1: Coarse-grained, SRAM-based FPGA [1]
Figure 2.2: Coarse-grained, antifuse-based FPGA
Figure 2.3: FPGA CAD flow
Figure 2.4: Input reduction by adding a BLE
Figure 2.5: BLE criticality assignment
Figure 2.6: Clustering: (a) before packing node B into cluster C and (b) after packing node B into cluster C
Figure 3.1: Proposed CAD tool flow for pASIC3 family FPGA
Figure 3.2: Functions in Packer-pASIC3
Figure 3.3: pASIC3 base gates derived from the configurable logic cell
Figure 3.4: Venn’s diagram for the set of logic cells that can be personalized from the base gates
Figure 4.1: Interconnect switch architecture for two different FPGAs
Figure 4.2: One dimensional coin change problem
Figure 4.3: Examples of local neighborhood connectivity factor computation
Figure 4.4: Clustering nodes
Figure 4.5: Packing un-clustered nodes by using linear assignment: (a) partially clustered network; (b) bipartite graph for linear assignment
Figure 4.6: Selecting the best node for clustering: (a) greedy selection and (b) intelligent selection
Figure 4.7: Selecting the best node for delay improvement
Figure 5.1: Multi-dimensional labeling algorithm
Figure 5.2: Clustering example
Figure 5.3: Slack-time relaxation with awareness of signal path
Figure 6.1: An example of redundant logic replication in clustering: (a) clusters and the corresponding area-delay points, (b) non-inferior clusters, (c) circuit after logic replication (i.e., n1, n2, and n3 are duplicated), and (d) a desired clustering solution
Figure 6.2: PD curve generation for a node with a cluster
Figure 6.3: Example of logic replication prediction
Figure 6.4: Prediction of logic replication
Figure 6.5: Logic replication cases: (a) child node is replicated, and (b) root node is replicated
Figure 6.6: Logic replication for cluster selection


ABSTRACT

Coarse-grained, antifuse-based FPGAs have emerged as a compelling technology to minimize the performance gaps between FPGAs and ASICs in area, speed, and power dissipation. As these FPGA architectures favor large, programmable logic blocks, efficient clustering algorithms are vital to exploit the benefits of those advanced architectures.

Circuit clustering is an important technique for coarse-grained FPGAs. First, clustering can reduce the complexity of large circuit designs by a significant factor. Second, clustering can improve the quality of the results of other operations such as placement and routing.

In this dissertation, clustering techniques for area, delay, and power dissipation are proposed. First, an area-driven clustering algorithm is presented to minimize the number of macro logic cells required to cover a network. This algorithm calculates the minimum number of logic cells by solving a multi-dimensional coin-change problem or a linear programming formulation. Subsequently, with the minimum number of available macro logic cells, the actual clustering, which packs nodes into clusters, is performed to improve routability and delay. Next, a timing-driven clustering algorithm is presented to minimize the number of macro logic cells on the longest input-output path. The algorithm optimally labels nodes for the smallest delay and then minimizes redundant logic replication by using slack-time relaxation during the clustering phase. Finally, a low-power clustering algorithm is presented to minimize power dissipation with minimum logic replication. The algorithm accurately emulates logic replication to estimate the cost incurred by replicating logic to meet timing constraints. Based on this information, the proposed algorithm substantially reduces the size of the replicated logic, resulting in benefits in area, delay, and power dissipation.


CHAPTER 1

INTRODUCTION

1.1 MOTIVATION

Getting a product to market quickly is a pivotal success factor in today’s ever-changing electronics market. Field programmable gate arrays (FPGAs) can offer unique advantages over application specific integrated circuits (ASICs) because they allow quick and cost-effective validation of products.

There are two basic classes of FPGA devices in the market today: SRAM-based FPGAs and antifuse-based FPGAs. SRAM-based FPGAs utilize look-up tables (LUTs). A LUT is a small one-bit wide memory array, where the address lines for the memory are the inputs of the logic cell, and the single-bit output from the memory is the LUT output. Generally, a programmable logic cell contains two or more LUTs connected in some manner. Each LUT can realize any logic function of K inputs by writing the logic function’s truth table directly into the memory. Loading configuration data into the internal memory cells customizes the FPGA. In contrast, antifuse-based FPGAs are based on a structure similar to traditional gate arrays. An antifuse resides in a high-impedance state and can be programmed into a low-impedance or "fused" state. The antifuse technology is less expensive than the RAM technology, but an antifuse-based device can be programmed only once. Here, a programmable logic cell comprises simple gates and multiplexers. The logic cell is programmed by assigning the input signals to constant binary values or shorting them together. Consequently, such a logic cell can also realize a wide range of Boolean functions. Antifuse-based FPGAs are, therefore, smaller than SRAM-based FPGAs with the same equivalent gate capacity. SRAM-based interconnect contains transistor switches, while the antifuse-based interconnect can be considered as a standard metal interconnect found in ASIC chips.
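To make the LUT description above concrete, the following is a minimal Python sketch of a K-input LUT modelled as a 2^K x 1 memory whose address lines are the logic inputs. The class name `LUT` and its methods are illustrative assumptions introduced here; they do not correspond to any vendor tool or to the software developed in this dissertation.

```python
class LUT:
    """K-input look-up table modelled as a 2**K x 1 memory (illustrative only)."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.memory = list(truth_table)

    @classmethod
    def from_function(cls, k, func):
        # "Program" the LUT: enumerate every input combination and store
        # the function's output bit at the corresponding address.
        table = []
        for address in range(2 ** k):
            bits = [(address >> i) & 1 for i in range(k)]
            table.append(int(bool(func(*bits))))
        return cls(k, table)

    def evaluate(self, *inputs):
        # Input i drives address line i; the stored bit is the LUT output.
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.memory[address]


# Two different 4-input functions realized by the same 16x1 structure.
and4 = LUT.from_function(4, lambda a, b, c, d: a and b and c and d)
xor4 = LUT.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

The same structure realizes either function purely by changing the stored truth table, which is the sense in which a LUT can implement any function of K inputs.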

There are three primary classes of FPGA architectures: coarse-grained, medium-grained, and fine-grained. Coarse-grained architectures contain very large programmable logic blocks with 20-30 inputs and 4-6 outputs. One can think of such large blocks as standard programmable logic devices (PLDs). This architecture provides a cross-networked interconnect and therefore supports interconnect flexibility between such large blocks. Medium-grained architectures consist of large logic blocks, often containing two or more look-up tables and two or more flip-flops. In a majority of these architectures, a four-input look-up table (think of it as a 16x1 ROM) implements the actual logic. The larger logic block usually corresponds to improved speed. The other architecture type is called fine-grained. In these devices, there are a large number of relatively simple logic blocks. Fine-grained architecture usually only allows close connections, which greatly reduces interconnect flexibility. These devices are good at systolic functions and have some benefits for designs created by logic synthesis.

Figure 1.1: Virtex II CLB Element

As interconnect became a dominant component in deep-submicron design technology, coarse-grained FPGA architecture has become ubiquitous in the FPGA industry because of its benefits in area, delay, and power. Figure 1.1 shows the organization of the Xilinx Virtex II family [69]. It has configurable logic blocks (CLBs), which are organized in an array and may be used to build combinational and sequential logic designs. Each CLB element comprises four similar slices. Each slice includes two four-input function generators. Each function generator is a four-input LUT. Therefore, each CLB has eight LUTs. The QuickLogic pASIC3 family, shown in Figure 1.2, is based on antifuse technology [52]. Devices in the family are based on an array of highly flexible logic cells, which have been optimized to efficiently implement a wide range of logic functions at high speed.

Each cell can implement one large function, four independent smaller functions, or any combination in between. The logic cell has a fan-in of 29 and five outputs, including the output of a flip-flop.

The structure and granularity of the logic block have a significant impact on the area-efficiency of the FPGA [1]. If the logic block is fine-grained, the circuit to be implemented will be distributed over a larger group of logic blocks. This has a negative impact on routability, since more blocks need to be interconnected. If several LUTs are clustered into one logic block, signal sharing among LUTs can be exploited. Since the interconnect inside the logic blocks is hardwired, local interconnect can be made very fast and efficient. This improves routability and significantly decreases the load on the router. On the other hand, it is not feasible to increase the complexity of the logic blocks beyond a certain limit. If the logic blocks become too complex, it becomes difficult to fully utilize these blocks, which may lead to a waste of logic and hence of chip area. This may cause diminishing returns such as large area and low speed.

Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell, (b) FPGA architecture, and (c) antifuse switch

In a typical flow of FPGA CAD tools, clustering, which follows the technology mapping step, is an important optimization phase because it maps the target circuit netlist into an FPGA array. Clustering, therefore, refers to the task of grouping logic gates in the circuit netlist and assigning each group to a configurable logic block in the FPGA array. Since poor clustering may significantly degrade the final design in terms of area, delay, and power, clustering must be done carefully before placement and routing.

In this dissertation, our research focuses on clustering techniques for coarse-grained, antifuse-based FPGAs. The clustering problem for coarse-grained, antifuse-based FPGAs is quite different from the typical clustering problems known for SRAM-based FPGAs. Coarse-grained, antifuse-based FPGA architecture demands highly intelligent CAD algorithms, because the architecture provides tremendous flexibility with the least hardware overhead. The hardware overhead of a large logic cell with many inputs and multiple outputs is very small; the antifuse that connects two metal wires is smaller than a via [52]. On the other hand, the number of functions that the logic cell can implement increases exponentially. In order to utilize the logic cell efficiently, CAD tools should be equipped with highly efficient algorithms.

1.2 DISSERTATION OUTLINE

There are four parts to this research: library cell generation, area-driven clustering, timing-driven clustering, and low-power clustering.

In CHAPTER 2, an overview of coarse-grained FPGAs is provided and a brief review of the flow of FPGA CAD tools is presented. In CHAPTER 3, we present the procedure for generating library cells from the target pASIC3 family FPGA architecture. The library cells are used during technology mapping, before clustering, to cover a logically optimized network. For the area-driven clustering, in CHAPTER 4, we present two approaches to calculate the minimum number of pASIC3 logic cells needed to cover a network mapped with library cells. First, we use multi-dimensional coin-change dynamic programming as a general solution for general logic cell architectures. Second, by fully exploring the architectural characteristics of the pASIC3 logic cell, we set up two linear equations and find the optimal packing solution. Under the constraints of the minimum area solution, we present an interconnect-aware clustering algorithm and a timing-driven clustering algorithm. In CHAPTER 5, we present a timing-driven clustering algorithm, which minimizes the number of pASIC3 logic cells on the longest input-output path. Logic replication is minimized by slack-time relaxation. A low-power clustering algorithm is presented in CHAPTER 6. In the low-power clustering algorithm, we minimize the logic replication while meeting the timing constraint under a low-power optimization objective. The algorithm accurately simulates the logic replication caused by timing constraints during the post-order traversal. This technique substantially reduces the size of the duplicated logic, resulting in benefits in area, delay, and power dissipation. We provide a summary of this dissertation and possible extensions in CHAPTER 7.


CHAPTER 2

COARSE-GRAINED FPGAS AND PREVIOUS WORK

This chapter contains an overview of the typical architectures of coarse-grained FPGAs and of related previous work in the area of FPGA CAD tools.

2.1 OVERVIEW OF COARSE-GRAINED FPGA ARCHITECTURE

FPGAs usually consist of small, configurable basic elements connected by rich programmable interconnects [4]. Since routing resources grow faster than on-chip logic resources, routing resources account for the major portion of the device’s overall area and delay. The speed and area-efficiency of an FPGA are directly related to the granularity of its logic block [1][47][48]. While coarse-grained blocks have long internal logic delays, they can reduce the placement and routing stress by providing fast local routing and significantly reducing external routing.

Typically, synthesis tools prefer “gate array-like” fine-grained architectures; however, fine-grained FPGA architectures generally yield very poor delay, due to the long delays that result from building functions with multiple levels of gates and slow interconnect elements. A coarse-grained architecture gives tools the degrees of freedom needed to obtain the high logic utilization benefits of a fine-grained architecture without sacrificing the high performance benefits of a coarse-grained, high fan-in architecture.

Figure 2.1 shows a typical SRAM-based FPGA. A basic logic element (BLE) consists of a lookup table and a flip-flop, and those basic elements make up a programmable logic block (PLB). Dedicated routing is provided inside each cluster for communication between the local BLEs. For clusters of size greater than one, the architecture is fully connected: each BLE input can be connected to any of the cluster inputs or to the output of any of the BLEs within the cluster. Examples of coarse-grained FPGAs are the Xilinx Virtex [69] and the Apex and Flex from Altera [5]. In these architectures, groups of basic logic elements are clustered to provide better performance.

Figure 2.2 illustrates the logic cell architecture of pASIC3 from QuickLogic [52]. The pASIC3 family FPGA is based on antifuse-based programming technology. The pASIC3 device architecture consists of an array of user-configurable logic building blocks, called logic cells, set beneath a grid of metal wiring channels similar to those of a gate array. Through antifuses, located at the wire intersections, the output(s) of any cell may be programmed to connect to the input(s) of any other cell.

Figure 2.1: Coarse-grained, SRAM-based FPGA [1]

The pASIC3 logic cell, shown in Figure 2.2(a), is a general-purpose building block that can implement most gate array macro library functions. Since the logic cell has multiple outputs, it can implement one large function or multiple smaller independent functions in parallel. The function of a logic cell is determined by the logic levels applied to the inputs of the AND gates and multiplexers. The high logic capacity and fan-in of the logic cell accommodate many user functions with a single level of logic delay. Figure 2.2(b) shows some of the possible configurations of the logic cell. Since all connections within the cell are hard-wired, the various functions are available in parallel. Thus, very wide, complex functions are implemented with the same cell speed as the much smaller “fragment” functions. Related and unrelated functions can be packed into the same logic cell, increasing effective density and gate utilization.

Figure 2.2: Coarse-grained, antifuse-based FPGA: (a) logic cell and (b) some of the possible configurations of the logic cell

Figure 2.3: FPGA CAD flow

2.2 FPGA CAD FLOW

Figure 2.3 illustrates the CAD tool flow that is typically encountered in FPGA design. The logic can be optimized by any RT-level optimization tool, e.g., SIS [59]. During the technology-mapping phase, the logic-optimized circuit is mapped to basic logic elements, and closely connected basic logic elements are packed together into programmable logic blocks. Finally, placement and routing are conducted with those programmable logic blocks. In the following sections, we briefly review technology mapping techniques, clustering techniques, and placement and routing algorithms.

2.2.1 TECHNOLOGY MAPPING

In a standard cell design procedure for application specific integrated circuits (ASICs), technology mapping maps the optimized circuit onto a target library. However, FPGAs have clusters of basic logic elements, and those basic logic elements are ready to be programmed to implement certain functions. Therefore, technology mapping locates a feasible portion of the circuit and implements the function of that portion in a basic logic element. Various mapping techniques have been developed for the two different types of FPGAs. An extensive survey of existing SRAM-based FPGA mapping techniques is given by Cong and Ding [19]. They also developed FlowMap, which is guaranteed to produce depth-optimal mapping solutions [22]. CutMap [23] is an improvement over FlowMap that considers area minimization during delay optimization. For antifuse-based FPGAs, Boolean matching techniques have been used for technology mapping, and research results on technology mapping for antifuse logic cells have been reported [30]. Boolean matching is therefore a key enabler for antifuse-based FPGA mapping. Lai et al. [42] proposed a Boolean matching algorithm and introduced matching filters for speedup. A more comprehensive review is provided in [6].

2.2.2 CLUSTERING TECHNIQUES

Once technology mapping is accomplished, the mapped netlist is provided to a clustering algorithm. The clustering algorithm packs multiple basic logic elements into a logic cluster.

Many clustering techniques for SRAM-based FPGAs have been based on constructive clustering. In the following sections, the major achievements in clustering techniques are reviewed in chronological order.

2.2.2.1 Rapid System Prototyping (RASP)

The RASP system is a general synthesis and mapping system for SRAM-based FPGAs [20]. It consists of a core with a set of synthesis and optimization algorithms for technology-independent logic synthesis and technology mapping, which generates a generic look-up table (LUT) network. It also has a set of architecture-specific technology mapping routines to map the generic LUT network to programmable logic blocks in various SRAM-based FPGA architectures.

The clustering algorithm is based on a sequence of maximum weighted matching operations on a compatibility graph, which yields the proper grouping of LUTs into programmable logic blocks (PLBs). For each step, a compatibility graph is formed in which vertices represent the partial PLBs (initially LUTs) that will be considered for grouping at this step. An edge is formed between two vertices if the two corresponding partial PLBs can be grouped into one. Then, weights are assigned to the edges to guide the matching algorithm to select the best merging of partial PLBs. Different weights are assigned for different optimization objectives. For delay optimization, a larger weight is given to an edge corresponding to the grouping of two LUTs that may reduce the length of a critical path in the PLB network. For routability, a larger weight is given to an edge that corresponds to the grouping of two “close” LUTs, so that it does not create complex interconnection patterns in the final mapping solution. The “closeness” of two LUTs can be measured by the overlap of their fanin subnetworks. The weight of the edge between vertices v and w is computed from $N_v$ and $N_w$, the sets of edges on the vertices v and w, respectively. The compatibility graph becomes a bipartite graph whose edges have weights. Finally, the bipartite matching problem is solved to find the best packing solution.
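The matching step described above can be sketched for a small compatibility graph as follows. This is a minimal illustration, not the RASP implementation: the edge weight used here (the number of shared fanin signals) is an assumption standing in for the weight function of [20], two LUTs are assumed compatible whenever they share at least one fanin, and the matching is found by exhaustive search rather than by a polynomial-time matching algorithm, so it is only suitable for a handful of vertices. All names (`max_weight_matching`, `fanins`) are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def max_weight_matching(vertices, weights):
    """Exhaustive maximum-weight matching on a small graph.

    weights maps frozenset({u, v}) -> weight for every compatible pair.
    Returns (total_weight, tuple_of_matched_pairs).
    """
    order = tuple(sorted(vertices))

    @lru_cache(maxsize=None)
    def best(remaining):
        if len(remaining) < 2:
            return 0, ()
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unmatched.
        best_value, best_pairs = best(rest)
        # Option 2: match `first` with each compatible partner in turn.
        for partner in rest:
            w = weights.get(frozenset((first, partner)))
            if w is None:
                continue
            others = tuple(v for v in rest if v != partner)
            value, pairs = best(others)
            if w + value > best_value:
                best_value = w + value
                best_pairs = ((first, partner),) + pairs
        return best_value, best_pairs

    return best(order)


# Hypothetical LUTs and their fanin signals.
fanins = {
    "L1": {"a", "b", "c"},
    "L2": {"b", "c", "d"},
    "L3": {"d", "e"},
    "L4": {"e", "f", "a"},
}

# Assumed routability weight: number of shared fanin signals.
weights = {}
for u, v in combinations(sorted(fanins), 2):
    overlap = len(fanins[u] & fanins[v])
    if overlap > 0:
        weights[frozenset((u, v))] = overlap

total, pairs = max_weight_matching(fanins, weights)
```

In this example the pairing of L1 with L2 and of L3 with L4 is preferred over the alternative pairing because it absorbs more shared signals. Repeating the step on the merged partial PLBs would correspond to the sequence of matching operations described in the text.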

2.2.2.2 VPACK

VPACK [7] is a clustering algorithm that minimizes both the number of logic clusters and the number of used inputs to each cluster. Minimizing the used inputs of each cluster is important for a routable design. The algorithm constructs each cluster sequentially. First, a seed BLE is chosen, namely the BLE with the most used inputs among the currently un-clustered BLEs, because inputs are a scarce resource. Then, VPACK greedily selects the BLE that shares the most inputs and outputs with the cluster being constructed. The attraction between a BLE B and the current cluster C is the number of nets that they share:

$$\mathrm{Attraction}(B, C) = |\mathrm{Nets}(B) \cap \mathrm{Nets}(C)| \qquad (2.2)$$

This procedure of greedily selecting a BLE to include in the cluster continues until either the cluster is full or adding any additional un-clustered BLE would cause the number of distinct inputs needed by the cluster to exceed the number of inputs allowed. If the cluster is full, a new seed BLE is selected and a new cluster is generated. If, however, the cluster is not full but no BLE can be added without exceeding the number of allowed inputs, a hill-climbing phase is invoked to give further BLEs a chance of being packed into the cluster. If a BLE has all of its inputs in the cluster and its output is connected to a BLE in the cluster, adding the BLE to the cluster will reduce the number of cluster inputs by one. Figure 2.4 illustrates an example of reducing inputs by adding a BLE. The hill-climbing phase terminates when the cluster is full; if the cluster is still infeasible, VPACK backs up to the last point at which the cluster was feasible.
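The greedy phase described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the VPACK code of [7]: each BLE is represented as a pair (set of input nets, output net), ties are broken alphabetically, and the hill-climbing phase is omitted. The function names and the data layout are hypothetical.

```python
def cluster_inputs(cluster, bles):
    """Distinct external inputs needed by a cluster.

    Nets generated by a BLE inside the cluster are routed locally and
    therefore do not count as cluster inputs.
    """
    inputs = set().union(*(bles[b][0] for b in cluster))
    outputs = {bles[b][1] for b in cluster}
    return inputs - outputs


def attraction(ble, cluster, bles):
    """Number of nets shared by a BLE and the cluster, as in equation (2.2)."""
    nets_b = bles[ble][0] | {bles[ble][1]}
    nets_c = set()
    for member in cluster:
        nets_c |= bles[member][0] | {bles[member][1]}
    return len(nets_b & nets_c)


def vpack_greedy(bles, cluster_size, max_inputs):
    """Greedy seed-and-grow packing (hill-climbing phase not modelled)."""
    unclustered = set(bles)
    clusters = []
    while unclustered:
        # Seed: un-clustered BLE with the most used inputs.
        seed = max(sorted(unclustered), key=lambda b: len(bles[b][0]))
        cluster = [seed]
        unclustered.remove(seed)
        while len(cluster) < cluster_size:
            feasible = [
                b for b in sorted(unclustered)
                if len(cluster_inputs(cluster + [b], bles)) <= max_inputs
            ]
            if not feasible:
                break
            pick = max(feasible, key=lambda b: attraction(b, cluster, bles))
            cluster.append(pick)
            unclustered.remove(pick)
        clusters.append(cluster)
    return clusters


# Hypothetical netlist: BLE name -> (input nets, output net).
bles = {
    "A": ({"i1", "i2", "i3"}, "a"),
    "B": ({"a", "i1"}, "b"),
    "C": ({"i4", "i5"}, "c"),
    "D": ({"c", "i4"}, "d"),
}
clusters = vpack_greedy(bles, cluster_size=2, max_inputs=4)
```

Here B is attracted to A because it shares the nets `a` and `i1` with it, and packing B adds no new cluster input since net `a` becomes internal to the cluster.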

2.2.2.3 T-VPACK

T-VPACK [49] is based on the VPACK algorithm [7]. Its optimization goal is to minimize the number of external connections (connections between clusters) on the critical path. Since the external routing delay between clusters is much larger than the delay of the local routing inside a cluster, minimizing the number of external connections on critical paths can improve delay significantly.

The algorithm consists of two steps: static timing analysis and clustering. In the static timing analysis step, slack is estimated. Slack [33] is defined as the amount of delay that can be added to a connection without increasing the delay of the entire circuit. The slack at BLE input pin i is defined as:

$$\mathrm{Slack}(i) = T_{\mathrm{required}}(i) - T_{\mathrm{arrival}}(i) \qquad (2.3)$$

where $T_{\mathrm{required}}(i)$ and $T_{\mathrm{arrival}}(i)$ are the required time and the arrival time at input pin i. The criticality of the connection at input i is described as:

$$\mathrm{Connection\_Criticality}(i) = 1 - \frac{\mathrm{Slack}(i)}{\mathrm{MaxSlack}} \qquad (2.4)$$

where MaxSlack is the largest slack among all connections.

During the clustering phase, a seed BLE is selected and further BLEs are attracted to it. The seed BLE is the un-clustered BLE that has the most critical connection in the circuit. The attraction function is extended to include timing information. The criticality of a BLE is defined as the maximum Connection_Criticality value of all connections that connect the BLE to BLEs within the cluster currently being packed. An example is shown in Figure 2.5. The attraction function is defined as follows:

$$\mathrm{Attraction}(B, C) = \alpha \cdot \mathrm{Criticality}(B) + (1 - \alpha) \cdot \frac{|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)|}{G} \qquad (2.5)$$

where $\alpha$ is a weight that trades off timing criticality against net sharing, as in T-VPACK [49], and G is a normalization factor, which is set to the maximum number of nets to which any BLE can connect. The time complexity of this algorithm is $O(n^2)$, where n is the number of BLEs in the circuit.

Figure 2.5: BLE criticality assignment
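The timing quantities used by T-VPACK can be illustrated with a short Python sketch. It assumes the formulation published for T-VPACK [49], in which the criticality of a connection is one minus its slack divided by the largest slack, and the attraction is a weighted sum of BLE criticality and normalized net sharing. The weight `alpha`, its default value, and all function names are assumptions made for this illustration.

```python
def slacks(required, arrival):
    """Slack of each pin: required time minus arrival time."""
    return {pin: required[pin] - arrival[pin] for pin in required}


def connection_criticality(slack):
    """Criticality in [0, 1]; the connection with zero slack gets 1.0."""
    max_slack = max(slack.values())
    if max_slack == 0:
        # Every connection is critical when no slack is available.
        return {pin: 1.0 for pin in slack}
    return {pin: 1.0 - s / max_slack for pin, s in slack.items()}


def timing_attraction(ble_criticality, shared_nets, g, alpha=0.75):
    """Weighted sum of timing criticality and normalized net sharing.

    g is the maximum number of nets to which any BLE can connect;
    alpha is an assumed trade-off weight between the two terms.
    """
    return alpha * ble_criticality + (1 - alpha) * shared_nets / g


# Hypothetical required and arrival times at three BLE input pins.
required = {"p1": 10, "p2": 10, "p3": 10}
arrival = {"p1": 10, "p2": 6, "p3": 8}

slack = slacks(required, arrival)
criticality = connection_criticality(slack)
score = timing_attraction(criticality["p3"], shared_nets=2, g=4, alpha=0.5)
```

Pin `p1` has no slack and is therefore the most critical connection, so a BLE reached through it would be the preferred seed in the clustering phase.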

2.2.2.4 Routability-driven Packing (RPACK)

RPACK [10] is a routability-driven packing algorithm that first identifies routability factors and then prioritizes these factors in an improved clustering cost function. Three factors have been introduced to achieve routable clustering:

• Ratio of pins per net

• Ratio of used pins to the total number of pins of the logic block

• Number of nets

The ratio of pins per net indicates the density of high-fanout nets in the circuit. The ratio of used pins to the total number of pins of the logic block indicates the traffic in and out of a logic block. The number of nets is also closely related to routability.

In the first stage, a LUT and a register are packed into a basic logic block when possible. After that, the logic blocks are packed into clusters using a heuristic approach. In this approach, clusters are constructed sequentially. A seed is chosen to generate a new cluster. The best choice for the seed of each cluster is the un-clustered logic block with the most used inputs, as indicated by VPACK and T-VPACK. After choosing a seed, the basic logic element that gives the highest gain is selected to be added to the current cluster, provided the number of external inputs does not exceed the number of input pins of the cluster.

Figure 2.6: Clustering: (a) before packing node B into cluster C and (b) after packing node B into cluster C

In equation (2.2), the gain is the number of inputs and outputs that a BLE, B, and the current cluster, C, have in common. However, the contributions of individual nets between two blocks can differ significantly in terms of routability. By considering just the number of shared inputs and outputs, the packing algorithm cannot differentiate among candidate blocks that have different impacts on routability. For example, the three nets N1, N2, and N3 in Figure 2.6 make different contributions. By moving the BLE B, one terminal can be removed from N1, one terminal and one input pin can be saved in N2, and one terminal and one unit of output congestion can be removed from N3. The authors derived a table of gains of the candidate block with respect to a single net, which stores the gains for the different connection types of a net. The table was used to compute the total gain of packing the logic element into a cluster.

The results showed that the major portion of the decrease in the number of nets is due to the decrease in the number of two-terminal nets. The complexity of this algorithm is $O(M^2)$, where M is the number of basic logic elements (BLEs).

2.2.2.5 Interconnect Resource Aware Clustering (iRAC)

By packing closely connected components together and considering spatial uniformity in the clustered design through Rent’s rule [27], iRAC [61] reduces the external routing requirement in clustered FPGAs. It alleviates routing congestion for clustered FPGAs by absorbing as many small nets into clusters as possible and by depopulating clusters according to Rent’s rule, in order to achieve spatial uniformity in the clustered netlist.

Characterizing the complexity of a cluster can be done with the well-known

exponential relationship in Rent’s rule [27]:

Trang 34

p io

where N io is the number of pins in a cluster, B is the number of basic logic elements

(BLEs), K is the average number of connections per BLE in the cluster, and p (0 <

p < 1) is the Rent’s parameter (exponent) Given a specific FPGA architecture, a

Rent’s parameter, Pa, can be estimated from the equation (2.6), because Nio, K, and

B can be obtained from the architecture The connectivity factor is defined by the

ratio:

2

separation c

degree

where degree is the number of nets incident to a BLE and separation is the sum of all terminals of the nets incident to that BLE. The smaller the c value of a BLE, the more BLEs are located in its neighborhood. In the second step, the BLE with the highest degree and lowest connectivity factor is chosen as the seed for a new cluster. Gains are then assigned to the neighbors connected to the seed, and the unclustered BLE with the highest gain is absorbed into the cluster. The gain function for BLE X with a net x to the cluster C is defined as follows:

where αx is the number of pins of net x already inside cluster C, n is the cluster size, and w(x) is the weight of net x (w(x) = 2/r, where r is the number of pins on x).

In order to guarantee spatial uniformity of the clustered netlist, the algorithm limits the number of available pins using Rent's rule. With this limit, the Rent's parameter of any cluster is no more than the Rent's parameter of the FPGA architecture, Pa. The area saved is 35% on average compared to previously published results [10].
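
The quantities used by iRAC can be illustrated with a minimal Python sketch. It assumes Rent's rule in the form Nio = K * B^p and the connectivity factor c = separation / degree^2. The function names, the example netlist, and the architecture numbers (14 pins, 5 connections per BLE, 4 BLEs per cluster) are hypothetical, and combining the two seed criteria as "degree first, then lowest c" is an assumption, since the text does not say how ties are resolved. The gain function is not covered.

```python
import math

def rent_parameter(n_io, k, b):
    """Solve N_io = K * B**p for the exponent p."""
    return math.log(n_io / k) / math.log(b)

def pin_limit(k, num_bles, p_a):
    """Pin budget of a cluster holding num_bles BLEs under Rent's rule."""
    return k * num_bles ** p_a

def connectivity_factors(nets):
    """Map each BLE to (degree, c), where c = separation / degree**2."""
    degree, separation = {}, {}
    for terminals in nets.values():
        for ble in set(terminals):
            degree[ble] = degree.get(ble, 0) + 1                        # nets incident to the BLE
            separation[ble] = separation.get(ble, 0) + len(terminals)   # terminals on those nets
    return {b: (degree[b], separation[b] / degree[b] ** 2) for b in degree}

def pick_seed(nets):
    """Assumed rule: highest degree first, lowest c as the tie-breaker."""
    factors = connectivity_factors(nets)
    return max(factors, key=lambda b: (factors[b][0], -factors[b][1]))

# Hypothetical architecture: 14 cluster pins, 5 connections per BLE, 4 BLEs per cluster.
p_a = rent_parameter(n_io=14, k=5, b=4)

# Hypothetical netlist: net name -> BLEs on that net.
nets = {
    "n1": ["A", "B"],
    "n2": ["A", "C", "D"],
    "n3": ["A", "B", "C"],
    "n4": ["D", "E"],
}
seed = pick_seed(nets)
print(f"Pa = {p_a:.3f}, seed = {seed}, pin limit for 3 BLEs = {pin_limit(5, 3, p_a):.2f}")
```

In this sketch, a partially filled cluster is compared against `pin_limit` for its current BLE count, which is one way to read the statement that no cluster may exceed the Rent's parameter of the architecture.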

2.2.3 PLACEMENT AND ROUTING

Placement is the process by which a netlist of logic blocks is mapped onto physical locations in an FPGA. The locations to which the blocks are mapped can significantly affect the performance of the FPGA.

Simulated annealing (SA) placement has been commonly used, because the number of logic blocks has so far been small enough to handle within a reasonable amount of time. The simulated annealing algorithm mimics the annealing process used to gradually cool molten metal to produce high-quality metal structures [57][40]. A simulated annealing placer initially places logic blocks randomly into physical locations in an FPGA. Then the placement is iteratively improved by randomly swapping blocks and evaluating the quality of each swap with a cost function. If the move results in a reduction in the placement cost, then the move is accepted. If the move would cause an increase in the placement cost, then the move may still be accepted even though it makes the placement worse. The purpose of accepting some bad moves is to prevent the placer from being trapped in a local minimum.
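
The loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the placer of any particular tool: the acceptance probability exp(-delta/T) for worsening moves (the usual Metropolis criterion), the geometric cooling schedule, and all names and parameter values are assumptions. The example cost is a plain sum of Manhattan distances over two-pin connections.

```python
import math
import random

def anneal_placement(blocks, sites, cost_fn, t_start=5.0, t_end=0.01,
                     cooling=0.9, moves_per_temp=100, seed=0):
    """Swap-based simulated annealing; returns (placement, cost)."""
    rng = random.Random(seed)
    sites = list(sites)
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))        # random initial placement
    cost = cost_fn(placement)
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)        # pick two blocks to swap
            placement[a], placement[b] = placement[b], placement[a]
            delta = cost_fn(placement) - cost
            # Improving moves are always kept; worsening moves are kept
            # with a probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost += delta
            else:
                placement[a], placement[b] = placement[b], placement[a]  # undo
        temp *= cooling
    return placement, cost

# Hypothetical four-block chain on a 2x2 array of sites.
blocks = ["A", "B", "C", "D"]
sites = [(x, y) for x in range(2) for y in range(2)]
connections = [("A", "B"), ("B", "C"), ("C", "D")]

def wire_cost(placement):
    """Sum of Manhattan distances over all two-pin connections."""
    return sum(abs(placement[u][0] - placement[v][0]) +
               abs(placement[u][1] - placement[v][1]) for u, v in connections)

placement, cost = anneal_placement(blocks, sites, wire_cost)
print(placement, cost)
```

At high temperature almost every swap is accepted, which lets the placement escape local minima; as the temperature falls the loop behaves more and more like a greedy improver.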

The VPR placer [8] employs a simulated annealing algorithm for FPGA placement to minimize the wirelength of circuits. The placer uses a bounding-box-based "linear congestion" [4][8] cost function to estimate wirelength requirements. The linear congestion cost function is expressed as follows:

where Nnets is the number of nets, and bbx(i) and bby(i) are the horizontal span and vertical span of net i, respectively.
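
As a rough illustration of a bounding-box wirelength estimate, the sketch below sums bbx(i) + bby(i) over all nets. It is a simplification under stated assumptions: any per-net weighting or channel-capacity terms of the actual linear congestion function are omitted, the span is taken as the maximum coordinate minus the minimum coordinate, and the netlist and coordinates are hypothetical.

```python
def bounding_box_cost(nets, placement):
    """Sum of horizontal and vertical bounding-box spans over all nets."""
    total = 0
    for terminals in nets.values():
        xs = [placement[block][0] for block in terminals]
        ys = [placement[block][1] for block in terminals]
        bbx = max(xs) - min(xs)   # horizontal span of the net
        bby = max(ys) - min(ys)   # vertical span of the net
        total += bbx + bby
    return total

# Hypothetical placement (block -> (x, y)) and netlist (net -> blocks).
placement = {"A": (0, 0), "B": (2, 1), "C": (1, 3)}
nets = {"n1": ["A", "B"], "n2": ["A", "B", "C"]}
total_cost = bounding_box_cost(nets, placement)
print(total_cost)
```

The estimate depends only on the extreme terminals of each net, so it can be evaluated quickly after every swap of an annealing placer.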

The problem of routing FPGAs can be stated simply as that of assigning signals to routing resources so that all signals are routed successfully while a given overall performance is achieved. The problem bears a considerable resemblance to the problem of global routing for custom integrated circuit design. However, the two problems differ in several fundamental respects. First, routing resources in FPGAs are discrete and scarce, while they are reasonably continuous in custom integrated circuits. For this reason, FPGAs require an integrated approach using both a global and a detailed router. A second difference is that the global routing problem for custom ICs is rooted in an undirected graph, whereas in SRAM-based FPGAs the switches are often directional. Both of these distinctions are important, as they prevent direct application of much of the work that has been done in custom IC routing to FPGAs [28].

PathFinder is a router based on an iterative maze-type router [50]. Nets are routed sequentially, and once a track segment has been used for one net, other nets are allowed to use that segment but must pay a higher cost. Consequently, nets tend to avoid overusing a segment unless it is necessary or particularly efficient. At the end of the first iteration (after all nets have been routed), either no segments are overused and the routing is successful, or some segments are overused and more routing iterations are executed to try to resolve the contention. In each of these subsequent routing iterations, every net is ripped up and re-routed. Since the cost of overused track segments increases with every routing iteration, they become more expensive and are less likely to be used by more than one net. This gradual reduction in routing violations has proven to be a very successful routing approach.
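
The negotiation scheme described above can be sketched on a toy routing graph. This is a minimal sketch in the spirit of PathFinder, not its published formulation: the node cost (1 + history) * (1 + pres_fac * overuse), the doubling of `pres_fac` per iteration, the parameter values, and the graph itself are all assumptions made for illustration. Two nets compete for one shared node M, and each net also has a longer private detour.

```python
import heapq

def route_net(graph, src, dst, node_cost):
    """Cheapest path from src to dst under the current node costs (Dijkstra)."""
    heap = [(0.0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt in graph[node]:
            if nxt not in done:
                heapq.heappush(heap, (cost + node_cost(nxt), nxt, path + [nxt]))
    raise ValueError("no path between %s and %s" % (src, dst))

def negotiate(graph, nets, capacity=1, pres_fac=0.5, hist_fac=0.5, max_iters=20):
    """Rip up and re-route all nets until no node is used beyond its capacity."""
    occupancy = {n: 0 for n in graph}
    history = {n: 0.0 for n in graph}
    routes = {}
    for iteration in range(1, max_iters + 1):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):          # rip up the previous route
                occupancy[n] -= 1

            def cost(n):
                over = max(0, occupancy[n] + 1 - capacity)   # overuse if this net joins
                return (1.0 + history[n]) * (1.0 + pres_fac * over)

            routes[name] = route_net(graph, src, dst, cost)
            for n in routes[name]:
                occupancy[n] += 1
        overused = [n for n in graph if occupancy[n] > capacity]
        if not overused:
            return routes, iteration                # legal routing found
        for n in overused:                          # overused nodes get more expensive
            history[n] += hist_fac * (occupancy[n] - capacity)
        pres_fac *= 2.0
    return None, max_iters

# Hypothetical graph: M is a shared shortcut; a-b and c-d are private detours.
graph = {
    "s1": ["M", "a"], "t1": ["M", "b"],
    "s2": ["M", "c"], "t2": ["M", "d"],
    "M": ["s1", "t1", "s2", "t2"],
    "a": ["s1", "b"], "b": ["a", "t1"],
    "c": ["s2", "d"], "d": ["c", "t2"],
}
nets = {"n1": ("s1", "t1"), "n2": ("s2", "t2")}
routes, iterations = negotiate(graph, nets)
print(routes, iterations)
```

In the first iteration both nets take the shortcut through M, because the penalty for sharing is still small. After the cost of M has been raised, one net moves to its detour and the contention is resolved.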

The VPR router [63] is based on the PathFinder routing algorithm [28]. It has two enhancements that increase the speed of the basic breadth-first search maze router. The first is to employ a depth-first search, which directs the router to head towards specific targets. The second is to reduce the amount of activity on the routing expansion list for higher-fanout nets by placing on the expansion list only those segments that are in the neighborhood of the target.

2.3 SUMMARY

In this chapter, an overview of coarse-grained FPGA architectures was provided. SRAM-based FPGAs and antifuse-based FPGAs were introduced. The CAD flow for FPGAs was presented, and each step was briefly reviewed. The flow consists of logic optimization, technology mapping, clustering, placement, and routing.

CHAPTER 3

In this chapter, the proposed tool flow and the procedure for generating library cells from the pASIC3 logic cell are discussed. The cell library is used during the technology mapping phase to cover a network.

3.1 INTRODUCTION

The cell library is the key to implementing an application-specific integrated circuit (ASIC). Each library cell has a specified function, a layout, and various kinds of information such as pin delays from inputs to outputs, cell area, and expected power dissipation. Library cells are not required until the technology mapping phase of the computer-aided design flow for ASICs, but they represent the physical implementation of the functions optimized by logic synthesis. Each cell comes in a few different sizes that provide different characteristics with the same functionality; thus area, drive strength, and power dissipation differ among the different-sized cells. Similarly, the cell library for antifuse-based FPGAs is a collection of cells, each of which has its own function, delay, and power dissipation. However, such a library cannot have multiple cell sizes for the same function, because all of the library cells are generated from a single logic cell such as the pASIC3 cell in Figure 2.2. On the other hand, all of the library cells might have nearly, though not exactly, the same delay. In addition, library cells derived from a logic cell provide more complex functions than ASIC library cells.

The organization of this chapter is as follows. A brief overview of the proposed tool flow is presented in section 3.2. In section 3.3, the procedure for constructing the cell library is presented. The cost assignment for each library cell is discussed in section 3.4. Finally, we conclude in section 3.5.

3.2 TOOL FLOW

Figure 3.1 shows our CAD tool flow for pASIC3 family FPGAs. We generate both a cell library set and configuration information from the pASIC3 logic cell. A target circuit is synthesized by SIS [59], and the circuit is then mapped onto cells in the cell library by a technology mapper in SIS [59]. Our clustering tool, called Packer-pASIC3, packs nodes with four different functions as shown in Figure 3.2. A cluster is assigned to a pASIC3 logic cell. VPR [8] places and routes the clustered network using the architecture description of pASIC3 family FPGAs.
