Handbook of Algorithms for Physical Design Automation


FIGURE 45.15 Memory/logic interconnect block.

interconnect blocks; crosses indicate programmable connections. The flexibility of the connection block, Fm, can be defined as the number of programmable connections available between each horizontal pin and the adjacent vertical channel. In Figure 45.14, Fm = 4. In Ref. [Wilton99], it is shown that a value of Fm between 4 and 7 works well. To increase routability, the architecture in Figure 45.14 includes dedicated tracks for memory-to-memory connections. These tracks are used when multiple memory arrays are cascaded together to form larger user arrays, and are more efficient for such memory-to-memory connections. EMBs can also be used to implement logic by configuring them as large ROMs [Cong98] [Wilton00].

45.5.2 DISTRIBUTED MEMORY

Commercial FPGAs such as Xilinx's Virtex-4, Virtex-II, and Spartan-3 devices allow the 4-input LUTs in their logic blocks to be configured as 16×1-bit memories [Xilinx05a]. These memories have synchronous inputs. Their outputs can be made synchronous through the use of the LUT's associated register. These 16×1-bit memories can also be cascaded to implement deeper or wider memory arrays through specialized logic resources.
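The cascading idea can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class names `LutRam16x1` and `CascadedRam` are hypothetical, clocking and the optional output register are not modeled, and the "specialized logic resources" that select between LUTs are reduced to a simple address split.

```python
class LutRam16x1:
    """Behavioral model of one 4-input LUT used as a 16 x 1-bit memory."""

    def __init__(self):
        self.bits = [0] * 16

    def write(self, addr, bit):
        self.bits[addr & 0xF] = bit & 1

    def read(self, addr):
        return self.bits[addr & 0xF]


class CascadedRam:
    """Hypothetical depth x width memory assembled from 16 x 1 LUT RAMs."""

    def __init__(self, depth, width):
        self.width = width
        self.banks = -(-depth // 16)  # ceiling division: one bank per 16 words
        self.luts = [[LutRam16x1() for _ in range(width)]
                     for _ in range(self.banks)]

    def write(self, addr, word):
        bank, offset = divmod(addr, 16)   # upper address bits pick the bank
        for b in range(self.width):       # one LUT per data bit
            self.luts[bank][b].write(offset, (word >> b) & 1)

    def read(self, addr):
        bank, offset = divmod(addr, 16)
        return sum(self.luts[bank][b].read(offset) << b
                   for b in range(self.width))

    def lut_count(self):
        return self.banks * self.width


ram = CascadedRam(depth=64, width=8)
ram.write(37, 0xA5)
```

In this model a 64×8 array costs 4 banks of 8 LUTs each, which shows why distributed memory suits small arrays better than large ones.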

Another method for supporting distributed memory is proposed in Ref. [Oldridge05]. This architecture allows the configuration memory in the interconnect switch blocks to be used as user memory and is very efficient for wide, shallow memories.

45.6 EMBEDDED COMPUTATION BLOCKS

45.6.1 MULTIPLIERS AND DSP BLOCKS

To address the performance requirements of digital signal processing (DSP) applications, FPGA manufacturers typically include dedicated hardware multipliers in their devices. Altera Cyclone II and Xilinx Virtex-II/-II Pro devices include embedded 18×18-bit multipliers, which can be split into 9×9-bit multipliers [Xilinx05a]. The Virtex-II/-II Pro devices are further optimized with direct connections to the Xilinx block RAM resources for fast access to input operands. As manufacturers moved toward high-performance platform FPGAs, they began to include more complex dedicated hardware blocks, referred to as DSP blocks, which are optimized for a wider range of DSP applications. Altera's Stratix and Stratix II DSP blocks support pipelining and shift registers, and can be configured to implement


Field-Programmable Gate Array Architectures 953

9×9-bit, 18×18-bit, or 36×36-bit multipliers that can optionally feed a dedicated adder/subtractor or accumulator [Altera05]. Xilinx Virtex-4 XtremeDSP slices contain a dedicated 18×18-bit 2's complement signed multiplier, adder logic, a 48-bit accumulator, and pipeline registers. They also have dedicated connections for cascading DSP slices, with an optional wire-shift, without having to use the slower general routing fabric [Xilinx05a].
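As a rough illustration of what such a slice computes, the following sketch models a multiply-accumulate step with an 18×18-bit signed multiplier and a 48-bit accumulator. It is a behavioral approximation only: the function names are hypothetical, pipeline registers and cascade connections are ignored, and wrap-around on accumulator overflow is an assumption of this model rather than a statement about the device.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, a, b):
    """One multiply-accumulate step: 18x18 signed product, 48-bit result."""
    product = to_signed(a, 18) * to_signed(b, 18)
    return to_signed(acc + product, 48)   # assumed wrap-around behavior


def dot(xs, ys):
    """Dot product computed by repeated MAC steps, as in an FIR filter tap sum."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


result = dot([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6
```

The wide accumulator is what allows many 36-bit products to be summed without intermediate rounding.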

This inclusion of dedicated multipliers or DSP blocks to complement the general logic resources results in a heterogeneous FPGA architecture. Research has considered what could be gained from tuning FPGA architectures to specific application domains, in particular DSP. The work in Ref. [Leijten03] deliberately avoids creating a heterogeneous architecture because the authors found that DSP applications contain both arithmetic and random logic, but that a suitable ratio between arithmetic and random logic is difficult to determine. Instead, they develop two mixed-grain logic blocks that are suitable for implementing both arithmetic and random logic by looking at properties of the target arithmetic operations and of the 4-LUT. Their logic blocks are coarse-grained: each block can implement up to 4-bit addition/subtraction, 4 bits of an array multiplier, a 4-bit 2:1 multiplexer, or wide Boolean functions. At the same time, each logic block continues to be able to implement single-bit output random logic functions much like a normal LUT. Their architecture reduces configuration memory requirements by a factor of 4, which is good for embedded systems or those with dynamic reconfiguration, and offers higher flexibility for handling a range of proportions of datapath operations to random logic.

45.6.2 EMBEDDED PROCESSORS

The increase in the capacity of FPGAs has enabled the creation of entire systems on a chip. To support applications involving microcontrollers and microprocessors, FPGA manufacturers offer embedded processors tailored to interface with the FPGA logic fabric. There are two types of FPGA embedded processors: soft and hard.

Soft processors are intellectual property cores that have configurable features, such as caches, register file sizes, RAM/ROM blocks, and custom instructions. They are typically available as hardware description language descriptions and are implemented in the logic blocks of the FPGA. Altera and Xilinx have 32-bit reduced instruction set computer (RISC) processor cores that are optimized for their FPGAs: Nios/Nios II and PicoBlaze/MicroBlaze, respectively. Altera and Xilinx also offer development and debugging tools and other intellectual property cores that interface with their processors. The advantages of soft processors include the options to use and configure features only when they are needed, reducing area, and the ability to include multiple processors on a single chip. A Xilinx MicroBlaze requires as few as 923 LUTs [Xilinx05b] and can be used in the creation of multiprocessor systems. Because soft processors are implemented using logic resources, they are slower and consume more power than off-the-shelf processors.

Hard processors are dedicated hardware embedded on the FPGA. Altera Excalibur devices include the ARM 32-bit RISC processor, and Xilinx Virtex-4 and Virtex-II Pro devices include up to two IBM PowerPC 32-bit RISC processors [Altera02,Xilinx05b].

45.7 SUMMARY

This chapter has described the essential architectural features of contemporary FPGAs. Most commercial FPGAs contain small LUTs, in which logic is implemented. These LUTs are usually arranged in clusters, often with special support for arithmetic circuits (such as carry chains). Signals are transmitted between logic blocks using fixed metal tracks, connected using programmable switches. The topology of these tracks and switches makes up the device's routing architecture. In addition to logic blocks, modern FPGAs contain significant amounts of embedded memory and dedicated arithmetic functional blocks (such as multipliers). This chapter has set the stage for the next chapter, which describes physical design algorithms that target FPGAs.


[Actel05a] Actel Corp., ProASIC3 Flash Family FPGAs Handbook, 2005. Available at: http://www.actel.com/documents/PA3_HB.pdf

[Actel05b] Actel Corp., Actel Quality and Reliability Guide, 2005. Available at: http://www.actel.com/document/RelGuide.pdf

[Ahmed04] E. Ahmed and J. Rose, The effect of LUT and cluster size on deep-submicron FPGA performance and density, IEEE Transactions on VLSI, 12(3): 288–298, March 2004.

[Altera02] Altera Corp., Excalibur Device Overview, May 2002. Available at: http://www.altera.com/literature/ds/ds_arm.pdf

[Altera05] Altera Corp., Stratix II Device Handbook, 2005. Available at: http://www.altera.com/literature/list_stx2.jsp

[Betz99] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, Norwell, MA, February 1999.

[Brown96] S. Brown, M. Khellah, and G. Lemieux, Segmented routing for speed-performance and routability in field-programmable gate arrays, Journal of VLSI Design, 4(4): 275–291, 1996.

[Chang96] Y.-W. Chang, D. Wong, and C. Wong, Universal switch modules for FPGA design, in ACM Transactions on Design Automation of Electronic Systems, Vol. 1, NY, January 1996, pp. 80–101.

[Cong98] J. Cong and S. Xu, Technology mapping for FPGAs with embedded memory blocks, in Proceedings of the 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 179–188, Monterey, CA, 1998.

[Dehon05] A. DeHon, Design of programmable interconnect for sublithographic programmable logic arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2005, pp. 127–137.

[Ebeling96] C. Ebeling, D. Cronquist, and P. Franklin, RaPiD - Reconfigurable pipelined datapath, in International Conference on Field-Programmable Logic and Applications, Darmstadt, Germany, 1996, pp. 126–135.

[Ferrera04] S. P. Ferrera and N. Carter, A magnetoelectronic macrocell employing reconfigurable threshold logic, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,

[Goldstein00] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor, PipeRench: A reconfigurable architecture and compiler, Computer, 33(4): 70–77, 2000.

[Hauck00] S. Hauck, M. M. Hosler, and T. W. Fry, High-performance carry chains for FPGAs, IEEE Transactions on VLSI Systems, 8(2): 138–147, April 2000.

[Lattice05] Lattice Semiconductor Corp., LatticeXP Datasheet, 2005. Available at: http://www.latticesemi.com/lit/docs/datasheets/fpga/DS1001.pdf

[Leijten03] K. Leijten-Nowak and J. van Meerbergen, An FPGA architecture with enhanced datapath functionality, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,

[Lemieux04a] G. Lemieux and D. Lewis, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, Norwell, MA, November 2004.

[Lemieux04b] G. Lemieux, E. Lee, M. Tom, and A. Yu, Directional and single-driver wires in FPGA interconnect, in IEEE International Conference on Field-Programmable Technology, Brisbane, Australia, December 2004, pp. 41–48.

[Lewis05] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, The Stratix II logic and routing architecture, in ACM/SIGDA International Symposium on FPGAs,

[Lin94] C.-C. Lin, M. Marek-Sadowska, and D. Gatlin, Universal logic gate for FPGA design, in Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 1994, pp. 164–168.

[Marquardt00] A. Marquardt, V. Betz, and J. Rose, Speed and area trade-offs in cluster-based FPGA architectures, IEEE Transactions on VLSI, 8(1): 84–93, February 2000.


[Masud99] M. I. Masud and S. J. E. Wilton, A new switch block for segmented FPGAs, in International Workshop on Field Programmable Logic and Applications, Glasgow, U.K., August 1999, pp. 274–281.

[Mei03] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, in International Conference on Field-Programmable Logic and Applications, Lisbon, Portugal, 2003, pp. 61–70.

[Oldridge05] S. W. Oldridge and S. J. E. Wilton, A novel FPGA architecture supporting wide, shallow memories, IEEE Transactions on Very-Large Scale Integration (VLSI) Systems, 13(6): 758–762, June 2005.

[Quick05] Quicklogic, Eclipse II Family Data Sheet, 2005. Available at: http://www.quicklogic.com/images/eclipse2_family_DS.pdf

[Rose90] J. S. Rose, R. J. Francis, D. Lewis, and P. Chow, Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency, IEEE Journal of Solid-State Circuits, 25(5): 1217–1225, October 1990.

[Rose93] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli, Architecture of field-programmable gate arrays, Proceedings of the IEEE, 81(7): 1013–1029, July 1993.

[Singh92] S. Singh, J. Rose, P. Chow, and D. Lewis, The effect of logic block architecture on FPGA performance, IEEE Journal of Solid-State Circuits, 27(3): 281–287, March 1992.

[Singh00] H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Chaves, MorphoSys: An integrated reconfigurable system for dataparallel and compute intensive applications, IEEE Transactions on Computers, 49(5): 465–481, 2000.

[Singh01a] A. Singh, A. Mukherjee, and M. Marek-Sadowska, Interconnect pipelining in a throughput-intensive FPGA architecture, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2001, pp. 153–160.

[Singh01b] D. P. Singh and S. D. Brown, The case for registered routing switches in field programmable gate arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,

[Sivaswamy05] S. Sivaswamy, G. Wang, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh, HARP: Hardwired routing pattern FPGAs, in ACM International Symposium on Field Programmable Gate Arrays, Monterey, CA, February 2005, pp. 21–32.

[Trimberger94] S. Trimberger, Field-Programmable Gate Array Technology, Kluwer Academic Publishers, Norwell, MA, 1994.

[Weaver04] N. Weaver, J. Hauser, and J. Wawrzynek, The SFRA: A corner-turn FPGA architecture, in ACM/SIGDA International Symposium on FPGAs, February 2004, pp. 3–12.

[Wilton00] S. J. E. Wilton, Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(1): 56–68, 2000.

[Wilton97] S. J. E. Wilton, Architecture and algorithms for field-programmable gate arrays with embedded memory, PhD thesis, University of Toronto, Toronto, Ontario, Canada, 1997.

[Wilton99] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, The memory/logic interface in FPGA's with large embedded memory arrays, IEEE Transactions on Very-Large Scale Integration Systems, 7(1): 80–91, March 1999.

[Xilinx05a] Xilinx Corp., Virtex-4 Users Guide, 2005. Available at: http://www.xilinx.com/support/documentation/user_guides/ug070.pdf

[Xilinx05b] Xilinx Corp., Processor IP Reference Guide, February 2005.

[Ye05] A. G. Ye and J. Rose, Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits, in ACM/SIGDA Symposium on FPGAs, February 2005, Monterey, CA, pp. 3–13.

[Zeidman02] B. Zeidman and R. Zeidman, Designing with FPGAs and CPLDs, CMP Books, Upper Saddle River, NJ, 2002.


46 FPGA Technology Mapping, Placement, and Routing

Kia Bazargan

CONTENTS

46.1 Introduction
46.2 Technology Mapping and Clustering
46.2.1 Technology Mapping
46.2.2 Clustering
46.3 Floorplanning
46.3.1 Hierarchical Methods
46.3.2 Floorplanning on FPGAs with Heterogeneous Resources
46.3.3 Dynamic Floorplanning
46.4 Placement
46.4.1 Island-Style FPGA Placement
46.4.2 Hierarchical FPGA Placement
46.4.3 Physical Synthesis and Incremental Placement Methods
46.4.4 Linear Datapath Placement
46.4.5 Variation-Aware Placement
46.4.6 Low Power Placement
46.5 Routing
46.5.1 Hierarchical Routing
46.5.2 SAT-Based Routing
46.5.3 Graph-Based Routing
46.5.4 Low Power Routing
46.5.5 Other Routing Methods
46.5.5.1 Pipeline Routing
46.5.5.2 Congestion-Driven Routing
46.5.5.3 Statistical Timing Routing
References

46.1 INTRODUCTION

Computer-aided design (CAD) tools for field-programmable gate arrays (FPGAs) primarily emerged as extensions of their application-specific integrated circuit (ASIC) counterparts in the 1980s because of the relative maturity of the ASIC CAD tools at that time. Traditional logic optimization techniques, simulated-annealing-based placement algorithms, and maze routing methods were common in the FPGA world. However, as FPGA architectures developed distinct features in terms of both logic and routing, FPGA CAD tools evolved into today's FPGA design flows, which are highly optimized for specific characteristics of FPGA devices. More specialized timing models, technology mapping solutions, and placement and routing strategies are needed to ensure high-quality mapping of circuits to FPGAs.

This work is supported in part by the National Science Foundation under grant CCF-0347891.

FIGURE 46.1 Typical FPGA flow. (Stages shown: design entry, RTL synthesis, technology mapping, placement, routing, and configuration bitfile, supported by verification/simulation, power analysis, and timing analysis.)

Figure 46.1 shows a common design flow for FPGA designs. The high-level description of the FPGA design is fed to a register transfer level (RTL) synthesis tool that performs technology-independent logic optimization. The synthesis tool might detect opportunities for utilizing special-purpose logic gates within the FPGA logic fabric. Examples are carry chains, high-fanin sum-of-product gates, and embedded multipliers (see Sections 45.3.1.2 and 45.3.2).

The functional gates of the technology-independent optimized design are mapped to FPGA lookup tables (LUTs) (see Section 45.3.1), a process called technology mapping. Clustering of the LUTs is performed next (see Section 45.3.2). Placement and routing steps follow clustering. Floorplanning may or may not precede placement. Each of these steps would use timing and power analysis engines to better optimize the design. Furthermore, the user might simulate or perform formal verification at various steps of the design cycle. If timing or power constraints are not met, the design flow might backtrack to a previous step. For example, if routing fails due to high congestion, then placement might be attempted again with different parameters.

The rest of the chapter is organized into four sections. FPGA-specific technology mapping and clustering algorithms are covered in Section 46.2. Sections 46.3 and 46.4 cover floorplanning and placement algorithms. We conclude the chapter by discussing routing algorithms in Section 46.5.

46.2 TECHNOLOGY MAPPING AND CLUSTERING

Technology mapping converts a logic circuit into a netlist of FPGA K-LUTs and their connections. A K-LUT is usually implemented as a K-input, one-output static random-access memory (SRAM) block. By writing the truth table of a Boolean function into the K-LUT, we can implement any function that has K or fewer inputs regardless of the complexity of the function. Neighboring LUTs can be clustered into local groups with dedicated fast routing resources to improve the delay of the circuit. Clustering algorithms are used to group together local LUTs to minimize connection delays. Later in the design flow, these clusters are used as input to the placement step. Some placement algorithms might never touch a cluster, but some other placement methods (such as the ones presented in Section 46.4.3) might move individual logic blocks from one cluster to another to improve timing, power, etc.
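The truth-table view of a K-LUT can be made concrete with a short sketch. The helper names `program_lut` and `eval_lut` are hypothetical, and the bit ordering (first input as the most significant address bit) is an arbitrary choice of this example rather than a property of any device.

```python
from itertools import product


def program_lut(func, k):
    """Return the 2**k SRAM bits that make a K-LUT compute func."""
    # product() enumerates input tuples with the first input as the MSB.
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def eval_lut(table, inputs):
    """Look up the output: the inputs simply form the SRAM address."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | (bit & 1)
    return table[addr]


# A 3-input majority function stored in a 3-LUT.
majority = program_lut(lambda a, b, c: (a & b) | (b & c) | (a & c), 3)
```

However complicated the expression passed to `program_lut` is, the stored table always has exactly 2^K bits, which is the sense in which the complexity of the function does not matter.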

Given the fact that technology mapping considering area and delay optimization is NP-hard, Cong and Minkovich [1] synthesize benchmarks with known optimal or upper-bound technology mapping solutions and test state-of-the-art FPGA synthesis algorithms to see how far these algorithms are from producing optimal solutions (a preliminary version of their work appeared in the FPGA 2007 conference). They show that current technology mapping solutions are close to optimal (between 3 and 22 percent away, see Table III in Ref. [1]), while logic optimization methods have much room for improvement. Although some argue that the generated benchmarks are artificial and do not reflect characteristics of large industry benchmarks, the work in Ref. [1] gives us insights into what needs to be done to improve existing CAD algorithms. Our goal in the next two sections is to introduce basic technology mapping and clustering algorithms so that the reader can better understand placement and routing algorithms for FPGAs. Many other technology mapping algorithms (such as DAOmap [2], ABC [3], and the work by Mishchenko et al. [4]) are not discussed here.

46.2.1 TECHNOLOGY MAPPING

A major breakthrough in FPGA technology mapping came about in 1994 with the introduction of the FlowMap tool [5]. Library-based ASIC technology mapping (which maps a logic network to gates such as AND, OR, etc.) for depth minimization was known to be NP-hard, but Cong et al. proved that K-LUT technology mapping can be done in O(KVE) time, where V and E are the number of nodes (gates) and edges (wires) in the circuit, respectively. The FlowMap algorithm traverses the circuit graph containing simple gates and their connections in a breadth-first search fashion and determines depth-optimal mappings of the fanin cones of the nodes as it progresses toward primary outputs. The fanin cone of a node is the set of all gates from the circuit primary inputs (input pads) to the node itself.

The algorithm uses the notion of K-feasible cuts to find K-LUT mappings of a subcircuit. Figure 46.2a shows an example subgraph in which a cut separates the nodes into disjoint sets $X$ and $\overline{X}$, where only three nodes in set $X$ provide inputs to nodes in $\overline{X}$, that is, the nodes that are drawn using thick lines. Cut $(X, \overline{X})$ is said to be K-feasible for $K \geq 3$. All the nodes in set $\overline{X}$ can be mapped into one 3-LUT, which gets its input values from the LUTs that implement the three boundary nodes in $X$ and their fanin cones.
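A K-feasibility check for a given cut takes only a few lines. In this sketch the netlist encoding (`fanins`, a dictionary from each node to the nodes driving it) and the function names are hypothetical, and the small example circuit is invented for illustration; it is not the circuit of Figure 46.2a.

```python
def cut_inputs(fanins, x_bar):
    """Nodes outside x_bar that drive at least one node inside x_bar."""
    return {u for v in x_bar for u in fanins.get(v, ()) if u not in x_bar}


def is_k_feasible(fanins, x_bar, k):
    """A cut is K-feasible if at most k distinct nodes feed x_bar."""
    return len(cut_inputs(fanins, x_bar)) <= k


# Hypothetical circuit: t = f(g1, g2), g1 = f(a, b), g2 = f(b, c).
fanins = {"t": ["g1", "g2"], "g1": ["a", "b"], "g2": ["b", "c"]}
x_bar = {"t", "g1", "g2"}          # nodes we would like to pack into one LUT
boundary = sorted(cut_inputs(fanins, x_bar))
```

Here the boundary consists of `a`, `b`, and `c`, so the three gates fit in one 3-LUT but not in one 2-LUT; note that `b` is counted once even though it drives two gates.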

The labels on the nodes in Figure 46.2a show the depth of the minimum-depth K-LUT mapping of the input cone of the node. The authors prove that for a node t, the minimum depth is either the maximum label l in X, or l + 1.* Consider an example graph for another circuit shown in Figure 46.2b.

FIGURE 46.2 FlowMap mapping steps; panel (c): checking the feasibility of a mapping of depth 2. (From Cong, J. and Ding, Y., IEEE Trans. Comput. Aided Des.

* If the new node t can be packed with the rest of the nodes with label l, then the depth of LUTs used in implementing the circuit up to this point would not increase. Otherwise, a new LUT with depth l + 1 has to be allocated to house the new node t.


In a breadth-first search traversal on subgraph $N_t$, when we get to node $t$, the question is whether we can pack $t$ with nodes $a$, $b$, $c$ (which have the maximum depth of $l$) in one K-LUT. We can create an auxiliary graph $N'_t$ (shown in Figure 46.2c; note that nodes with labels correspond to their counterpart nodes in Figure 46.2b with the same labels), which replaces $a$, $b$, $c$, $t$ with one node $t'$, and see if $t'$, and possibly other nodes, can be packed in one K-LUT. Node $t'$ can be mapped to a K-LUT if we can find a cut $(X, \overline{X})$ where $t' \in \overline{X}$ and at most K nodes in $X$ provide input to nodes in $\overline{X}$. Network flow algorithms can be used to answer this question. We can model one LUT in the fanin cone as a flow of one unit, and look for a maximum flow of K units at the sink node. If the maximum flow is at most K, it means that we have found a cut with at most K LUTs as inputs, and anything below the cut can be packed into a K-LUT. More details are provided next.

Subgraph $N'_t$ can be transformed to a dual graph $N''_t$ (Figure 46.2d) in which each node $y$ is replaced by two nodes $y_i$ and $y_o$ that are connected by an edge of weight 1. An edge $(y, z)$ in $N'_t$ corresponds to edge $(y_o, z_i)$ in $N''_t$ with an infinite edge weight. If the maximum flow in $N''_t$ is at most K units, then at most K nodes in $X$ provide inputs to nodes in $\overline{X}$, which means node $t$ in the original $N_t$ graph can indeed be packed with other nodes with the maximum label. The authors introduce variations on the original technology mapping algorithm to minimize area as a secondary objective.
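The node-splitting construction and the flow test can be sketched as follows. This is a minimal illustration under stated assumptions, not the FlowMap implementation: the `fanins` encoding, the function name, and the `(node, "in")`/`(node, "out")` naming are hypothetical; a plain breadth-first augmenting-path search is used; the sink is assumed to have at least one fanin; and the collapsing of nodes into the sink is assumed to have been done already.

```python
from collections import defaultdict, deque

INF = float("inf")


def min_node_cut_at_most_k(fanins, sink, k):
    """True if some set of at most k nodes separates all inputs from sink."""
    cap = defaultdict(lambda: defaultdict(int))   # residual capacities

    def add_edge(u, v, c):
        cap[u][v] += c
        cap[v][u] += 0                            # create the reverse edge

    nodes = set(fanins) | {u for ins in fanins.values() for u in ins}
    for y in nodes:
        if y != sink:
            add_edge((y, "in"), (y, "out"), 1)    # unit edge: y may be cut
        if not fanins.get(y):
            add_edge("source", (y, "in"), INF)    # y is a primary input
    for z, ins in fanins.items():
        for y in ins:
            add_edge((y, "out"), (z, "in"), INF)  # wires cannot be cut

    target, flow = (sink, "in"), 0
    while flow <= k:
        parent = {"source": None}
        queue = deque(["source"])
        while queue and target not in parent:     # BFS in the residual graph
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if target not in parent:
            return True                           # max flow reached, <= k
        bottleneck, v = INF, target
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = target
        while parent[v] is not None:              # push flow along the path
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
    return False                                  # more than k units fit


# Hypothetical cone: three gates, each fed by two of the inputs a, b, c.
cone = {"t": ["g1", "g2", "g3"],
        "g1": ["a", "b"], "g2": ["b", "c"], "g3": ["c", "a"]}
fits_3lut = min_node_cut_at_most_k(cone, "t", 3)
fits_2lut = min_node_cut_at_most_k(cone, "t", 2)
```

Stopping after at most K + 1 augmentations is what keeps each feasibility test cheap; by the max-flow min-cut theorem, the flow value equals the smallest number of nodes that must sit on the LUT boundary.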

46.2.2 CLUSTERING

Today's FPGAs cluster LUTs into groups and provide fast routing resources for intracluster connections. When two LUTs are assigned to one cluster, their connections can use the fast routing resources within the cluster, and hence reduce the delay on the connection. On the other hand, if two LUTs are in two separate clusters, they have to use intercluster routing resources that are more scarce and more costly in terms of delay. Placement and routing algorithms are needed to balance the usage of intercluster routing resources (see Sections 46.4 and 46.5).

Many clustering algorithms were introduced in the past decade. Most work by first selecting a seed and then choosing LUTs to cluster with the seed. The difference between various clustering algorithms is in their criteria for choosing the seed node and the way other nodes are chosen to be absorbed by the seed. The clustering algorithm used in the popular versatile placement and routing (VPR) tool [6] is called T-VPack [7], which is an extension of the earlier packing algorithm VPack. VPack chooses LUTs with a high number of input connections as initial seeds for clusters. The criterion for packing a node B into a cluster C is the attraction of the node, defined as the number of nets that are shared between node B and nodes inside C. The more sharing there is between nodes within a cluster, the less routing demand is needed to connect clusters.
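The attraction measure described above can be sketched in a few lines. The data layout (`node_nets`, mapping each LUT to the set of nets it touches) and the helper names are hypothetical, and the greedy choice shown covers a single packing step only; cluster capacity and pin limits are not modeled.

```python
def attraction(node_nets, candidate, cluster):
    """Number of nets shared between candidate and the nodes in cluster."""
    cluster_nets = set().union(*(node_nets[n] for n in cluster))
    return len(node_nets[candidate] & cluster_nets)


def best_candidate(node_nets, cluster, unclustered):
    """Pick the unclustered node with the highest attraction."""
    return max(unclustered, key=lambda n: attraction(node_nets, n, cluster))


# Hypothetical netlist: each LUT lists the nets it is connected to.
node_nets = {
    "A": {"n1", "n2"},
    "B": {"n2", "n3"},
    "C": {"n4"},
    "D": {"n1", "n2", "n3"},
}
choice = best_candidate(node_nets, {"A", "B"}, ["C", "D"])
```

Node `D` shares three nets with the cluster while `C` shares none, so `D` is absorbed first and those three nets no longer need intercluster routing for that connection.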

T-VPack is the timing-driven version of VPack and extends the definition of the attraction of a node to include timing criticality of nets connecting the node to those packed into the cluster. Timing criticality of a net $i$ is defined as $1 - \mathrm{slack}(i)/\mathrm{MaxSlack}$. If two nodes have equal net criticality values connecting them to nodes packed into a cluster, then the one through which more critical paths pass is chosen to be packed into the cluster first. The results in Ref. [7] show that clusters of size 7–10 provide the best area/delay tradeoff.
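A short numeric example may help with the criticality formula. The slack values below are invented, and taking MaxSlack to be the largest slack among the nets considered is an assumption of this sketch, since the text does not define it.

```python
def criticality(slack, max_slack):
    """Timing criticality of a net: 1 - slack / MaxSlack."""
    return 1.0 - slack / max_slack


# Hypothetical slacks in nanoseconds for three nets.
slacks = {"n1": 0.0, "n2": 2.0, "n3": 4.0}
max_slack = max(slacks.values())   # assumed definition of MaxSlack
crit = {net: criticality(s, max_slack) for net, s in slacks.items()}
```

A net with zero slack gets criticality 1, and the net with the largest slack gets criticality 0, so the measure ranks nets from most to least timing-critical on a fixed scale.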

Clustering algorithms such as RPack [8] and the work by Singh et al. [9] improve routability of the clustered circuit by introducing absorption costs that try to weigh nodes based on how promising they are in absorbing more nets into the cluster. The authors in Ref. [9] define the connectivity factor $c$ of a LUT $x$ as $c(x) = \mathrm{separation}(x)/\mathrm{degree}(x)^2$, where the separation of a LUT is the number of LUTs adjacent to it. Figure 46.3a shows node A with a separation value of 18, degree of 4, and connectivity of 1.125. Figure 46.3b shows node B with the same degree as A, but with a smaller separation and hence smaller connectivity. Node A cannot absorb any nets if one node from each net is clustered into the same cluster as A. On the other hand, node B can absorb all the nets shown in Figure 46.3 by including one node from each net in its cluster. The selection of the seed node in the work of Singh et al. is done by lexicographically sorting nodes by their (degree, –connectivity) values and choosing the ones with highest values as initial seeds (T-VPack used only the degree values).
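The connectivity factor and the seed ordering can be checked numerically. Node A uses the values quoted in the text (separation 18, degree 4); the figures for B and C are hypothetical, as are the function names.

```python
def connectivity(separation, degree):
    """Connectivity factor c(x) = separation(x) / degree(x)^2."""
    return separation / degree ** 2


def seed_order(luts):
    """Sort by (degree, -connectivity), highest first, to rank seed candidates."""
    return sorted(
        luts,
        key=lambda x: (luts[x][1], -connectivity(*luts[x])),
        reverse=True,
    )


# name -> (separation, degree); only A's values come from the text.
luts = {"A": (18, 4), "B": (6, 4), "C": (9, 3)}
c_a = connectivity(*luts["A"])
order = seed_order(luts)
```

With equal degree, B precedes A because its lower connectivity factor signals that its nets are more likely to be fully absorbed; C comes last because degree is compared first.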


FIGURE 46.3 Examples illustrating the usefulness of the connectivity factor: (a) number of nets absorbed = 0; (b) number of nets absorbed = 4. (Based on Singh, A. and Marek-Sadowska, M., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 59–66, 2002. With permission.)

Nodes are greedily packed into seed clusters based on how many nets they absorb, with higher priority given to the nets with fewer terminals. To guarantee spatial uniformity of the clustered netlist, the authors limit the number of available pins to a cluster so that the number of logic blocks inside a cluster and the number of connections to the nodes within the cluster follow Rent's rule. Doing so effectively depopulates clusters to reduce overall intercluster routing demands. Such strategies are in line with what DeHon's study [10] on routing requirements of FPGA circuits suggested. Because interconnect resources (switches and buffers) consume most of the silicon area of an FPGA (80–90 percent), sometimes it is beneficial to underutilize clusters to reduce routing demand in congested regions of the FPGA array.

46.3 FLOORPLANNING

Floorplanning is used on FPGAs to speed up the placement process or to place hard macros with prespecified shapes. The traditional FPGA floorplanning problem is discussed in Section 46.3.1. Another class of floorplanning algorithms for FPGAs deals with heterogeneous resource types. An example of this approach is the work by Cheng and Wong [11], to be covered in Section 46.3.2. A third class of floorplanning for FPGAs addresses dynamically reconfigurable systems in which modules are added or removed at runtime, requiring fast, on-the-fly modification of the floorplan. These approaches are discussed in Section 46.3.3.

46.3.1 HIERARCHICAL METHODS

Sankar and Rose [12] first use a bottom-up clustering method to build larger clusters out of logic blocks (refer to Section 46.2.2). Then they use a hierarchical simulated annealing algorithm to speed up the placement compared to a flat annealing methodology. They show trade-offs between placement runtime and quality.

While clustering the circuit into larger subcircuits, they limit the shape and size of the clusters to prespecified values. The leaves of the clustering tree are the logic blocks, and the nodes at the first level of the tree combine exactly two leaves. All level-one nodes will be placed in 1×2 regions, that is, on two adjacent clusters in the same row. The next level of hierarchy clusters two level-one clusters, which will be placed as 2×2 squares. Figure 46.4 shows the clustering and placement conceptually. Such restrictions on the clustering and placement steps limit the ability of the algorithms to search a larger solution space compared to an unrestricted version of the problem, but on the other hand relieve the algorithm designers of dealing with the sizing problem during the floorplanning process, described in Section 9.4.1.

The work by Emmert and Bhatia [13] also starts by clustering the logic elements into larger subcircuits. The input to their flow is a list of macros, each macro being either a logic block or a set of logic blocks with a list of predefined shapes. An example of a macro is a multiplier with two shape options, one for minimum area and the other for minimum delay.
