FIGURE 45.15 Memory/logic interconnect block. (Figure not reproduced; it shows a memory block with connections labeled "To logic".)
interconnect blocks; crosses indicate programmable connections. The flexibility of the connection block, Fm, can be defined as the number of programmable connections available between each horizontal pin and the adjacent vertical channel. In Figure 45.14, Fm = 4. In Ref. [Wilton99], it is shown that a value of Fm between 4 and 7 works well. To increase routability, the architecture in Figure 45.14 includes dedicated tracks for memory-to-memory connections. These tracks are used when multiple memory arrays are cascaded together to form larger user arrays, and are more efficient for such memory-to-memory connections. EMBs can also be used to implement logic by configuring them as large ROMs [Cong98], [Wilton00].
45.5.2 DISTRIBUTED MEMORY
Commercial FPGAs such as Xilinx's Virtex-4, Virtex-II, and Spartan-3 devices allow the 4-input LUTs in their logic blocks to be configured as 16×1-bit memories [Xilinx05a]. These memories have synchronous inputs. Their outputs can be made synchronous through the use of the LUT's associated register. These 16×1-bit memories can also be cascaded to implement deeper or wider memory arrays through specialized logic resources.
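The cascading of 16×1-bit LUT memories into deeper or wider arrays can be pictured with a small behavioral model. This is only a sketch: the class names `LutRam16x1` and `CascadedRam` are hypothetical, the depth is assumed to be a multiple of 16, and the model ignores clocking and the specialized cascade logic of real devices.

```python
class LutRam16x1:
    """Behavioral model of one 4-input LUT used as a 16x1-bit memory."""

    def __init__(self):
        self.bits = [0] * 16

    def write(self, addr, bit):
        self.bits[addr & 0xF] = bit & 1   # the low 4 address bits select the cell

    def read(self, addr):
        return self.bits[addr & 0xF]


class CascadedRam:
    """depth x width memory built from 16x1 LUT memories (illustrative only)."""

    def __init__(self, depth, width):
        # One row of `width` LUTs per 16 words of depth.
        self.banks = [[LutRam16x1() for _ in range(width)]
                      for _ in range(depth // 16)]

    def write(self, addr, word):
        bank = self.banks[addr >> 4]      # the upper address bits select the bank
        for i, lut in enumerate(bank):
            lut.write(addr, (word >> i) & 1)

    def read(self, addr):
        bank = self.banks[addr >> 4]
        return sum(lut.read(addr) << i for i, lut in enumerate(bank))


ram = CascadedRam(depth=64, width=8)      # uses 4 * 8 = 32 LUTs
ram.write(37, 0xA5)
value = ram.read(37)
```

In this model a 64×8 array costs 32 LUTs, which shows why distributed memory suits small arrays while larger ones are better served by embedded memory blocks.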
Another method for supporting distributed memory is proposed in Ref. [Oldridge05]. This architecture allows the configuration memory in the interconnect switch blocks to be used as user memory and is very efficient for wide, shallow memories.
45.6 EMBEDDED COMPUTATION BLOCKS
45.6.1 MULTIPLIERS AND DSP BLOCKS
To address the performance requirements of digital signal processing (DSP) applications, FPGA manufacturers typically include dedicated hardware multipliers in their devices. Altera Cyclone II and Xilinx Virtex-II/-II Pro devices include embedded 18×18-bit multipliers, which can be split into 9×9-bit multipliers [Xilinx05a]. The Virtex-II/-II Pro devices are further optimized with direct connections to the Xilinx block RAM resources for fast access to input operands. As manufacturers moved toward high-performance platform FPGAs, they began to include more complex dedicated hardware blocks, referred to as DSP blocks, which are optimized for a wider range of DSP applications. Altera's Stratix and Stratix II DSP blocks support pipelining and shift registers, and can be configured to implement
9×9-bit, 18×18-bit, or 36×36-bit multipliers that can optionally feed a dedicated adder/subtractor or accumulator [Altera05]. Xilinx Virtex-4 XtremeDSP slices contain a dedicated 18×18-bit 2's complement signed multiplier, adder logic, a 48-bit accumulator, and pipeline registers. They also have dedicated connections for cascading DSP slices, with an optional wire-shift, without having to use the slower general routing fabric [Xilinx05a].
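As a rough illustration of what such a slice computes, the following sketch models a multiply-accumulate step with an 18×18-bit signed multiplier feeding a 48-bit accumulator. The function names are hypothetical, and wraparound on accumulator overflow is an assumption of this sketch, not a statement about the vendor's hardware.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b, with 18-bit operands
    and a 48-bit accumulator (assumed to wrap around on overflow)."""
    product = to_signed(a, 18) * to_signed(b, 18)
    return to_signed(acc + product, 48)


# Dot product of two short vectors, one MAC step per element pair.
acc = 0
for a, b in zip([3, -2, 7], [5, 4, -1]):
    acc = mac(acc, a, b)
```

Here the three steps accumulate 15, then 7, then 0, which is the kind of inner loop found in FIR filters and other DSP kernels.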
This inclusion of dedicated multipliers or DSP blocks to complement the general logic resources results in a heterogeneous FPGA architecture. Research has considered what could be gained from tuning FPGA architectures to specific application domains, in particular DSP. The work in Ref. [Leijten03] deliberately avoids creating a heterogeneous architecture because its authors found that DSP applications contain both arithmetic and random logic, but that a suitable ratio between arithmetic and random logic is difficult to determine. Instead, they develop two mixed-grain logic blocks that are suitable for implementing both arithmetic and random logic by looking at properties of the target arithmetic operations and of the 4-LUT. Their logic blocks are coarse-grained: each block can implement up to 4-bit addition/subtraction, 4 bits of an array multiplier, a 4-bit 2:1 multiplexer, or wide Boolean functions. At the same time, each logic block continues to be able to implement single-bit output random logic functions much like a normal LUT. Their architecture reduces configuration memory requirements by a factor of 4, which is good for embedded systems or those with dynamic reconfiguration, and offers higher flexibility for handling a range of proportions of datapath operations to random logic.
45.6.2 EMBEDDED PROCESSORS
The increase in the capacity of FPGAs has enabled the creation of entire systems on a chip. To support applications involving microcontrollers and microprocessors, FPGA manufacturers offer embedded processors tailored to interface with the FPGA logic fabric. There are two types of FPGA embedded processors: soft and hard.
Soft processors are intellectual property cores that have configurable features, such as caches, register file sizes, RAM/ROM blocks, and custom instructions. They are typically available as hardware description language descriptions and are implemented in the logic blocks of the FPGA. Altera and Xilinx have 32-bit reduced instruction set computer (RISC) processor cores that are optimized for their FPGAs: Nios/Nios II and PicoBlaze/MicroBlaze, respectively. Altera and Xilinx also offer development and debugging tools and other intellectual property cores that interface with their processors. The advantages of soft processors include the options to use and configure features only when they are needed, reducing area, and the ability to include multiple processors on a single chip. A Xilinx MicroBlaze requires as few as 923 LUTs [Xilinx05b] and can be used in the creation of multiprocessor systems. Because soft processors are implemented using logic resources, they are slower and consume more power than off-the-shelf processors.
Hard processors are dedicated hardware embedded on the FPGA. Altera Excalibur devices include the ARM 32-bit RISC processor, and Xilinx Virtex-4 and Virtex-II Pro devices include up to two IBM PowerPC 32-bit RISC processors [Altera02], [Xilinx05b].
45.7 SUMMARY
This chapter has described the essential architectural features of contemporary FPGAs. Most commercial FPGAs contain small LUTs, in which logic is implemented. These LUTs are usually arranged in clusters, often with special support for arithmetic circuits (such as carry chains). Signals are transmitted between logic blocks using fixed metal tracks, connected using programmable switches. The topology of these tracks and switches makes up the device's routing architecture. In addition to logic blocks, modern FPGAs contain significant amounts of embedded memory and dedicated arithmetic functional blocks (such as multipliers). This chapter has set the stage for the next chapter, which describes physical design algorithms that target FPGAs.
[Actel05a] Actel Corp., ProASIC3 Flash Family FPGAs Handbook, 2005. Available at: http://www.actel.com/documents/PA3_HB.pdf
[Actel05b] Actel Corp., Actel Quality and Reliability Guide, 2005. Available at: http://www.actel.com/document/RelGuide.pdf
[Ahmed04] E. Ahmed and J. Rose, The effect of LUT and cluster size on deep-submicron FPGA performance and density, IEEE Transactions on VLSI, 12(3): 288–298, March 2004.
[Altera02] Altera Corp., Excalibur Device Overview, May 2002. Available at: http://www.altera.com/literature/ds/ds_arm.pdf
[Altera05] Altera Corp., Stratix II Device Handbook, 2005. Available at: http://www.altera.com/literature/list_stx2.jsp
[Betz99] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, Norwell, MA, February 1999.
[Brown96] S. Brown, M. Khellah, and G. Lemieux, Segmented routing for speed-performance and routability in field-programmable gate arrays, Journal of VLSI Design, 4(4): 275–291, 1996.
[Chang96] Y.-W. Chang, D. Wong, and C. Wong, Universal switch modules for FPGA design, in ACM Transactions on Design Automation of Electronic Systems, Vol. 1, NY, January 1996, pp. 80–101.
[Cong98] J. Cong and S. Xu, Technology mapping for FPGAs with embedded memory blocks, in Proceedings of the 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 179–188, Monterey, CA, 1998.
[Dehon05] A. DeHon, Design of programmable interconnect for sublithographic programmable logic arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2005, pp. 127–137.
[Ebeling96] C. Ebeling, D. Cronquist, and P. Franklin, RaPiD - Reconfigurable pipelined datapath, in International Conference on Field-Programmable Logic and Applications, Darmstadt, Germany, 1996, pp. 126–135.
[Ferrera04] S. P. Ferrera and N. Carter, A magnetoelectronic macrocell employing reconfigurable threshold logic, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2004, pp. 143–154.
[Goldstein00] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor, PipeRench: A reconfigurable architecture and compiler, Computer, 33(4): 70–77, 2000.
[Hauck00] S. Hauck, M. M. Hosler, and T. W. Fry, High-performance carry chains for FPGAs, IEEE Transactions on VLSI Systems, 8(2): 138–147, April 2000.
[Lattice05] Lattice Semiconductor Corp., LatticeXP Datasheet, 2005. Available at: http://www.latticesemi.com/lit/docs/datasheets/fpga/DS1001.pdf
[Leijten03] K. Leijten-Nowak and J. van Meerbergen, An FPGA architecture with enhanced datapath functionality, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2003, pp. 195–204.
[Lemieux04a] G. Lemieux and D. Lewis, Design of Interconnection Networks for Programmable Logic, Kluwer Academic Publishers, Norwell, MA, November 2004.
[Lemieux04b] G. Lemieux, E. Lee, M. Tom, and A. Yu, Directional and single-driver wires in FPGA interconnect, in IEEE International Conference on Field-Programmable Technology, Brisbane, Australia, December 2004, pp. 41–48.
[Lewis05] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose, The Stratix II logic and routing architecture, in ACM/SIGDA International Symposium on FPGAs, Monterey, CA, February 2005, pp. 14–20.
[Lin94] C.-C. Lin, M. Marek-Sadowska, and D. Gatlin, Universal logic gate for FPGA design, in Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 1994, pp. 164–168.
[Marquardt00] A. Marquardt, V. Betz, and J. Rose, Speed and area trade-offs in cluster-based FPGA architectures, IEEE Transactions on VLSI, 8(1): 84–93, February 2000.
[Masud99] M. I. Masud and S. J. E. Wilton, A new switch block for segmented FPGAs, in International Workshop on Field Programmable Logic and Applications, Glasgow, U.K., August 1999, pp. 274–281.
[Mei03] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, in International Conference on Field-Programmable Logic and Applications, Lisbon, Portugal, 2003, pp. 61–70.
[Oldridge05] S. W. Oldridge and S. J. E. Wilton, A novel FPGA architecture supporting wide, shallow memories, IEEE Transactions on Very-Large Scale Integration (VLSI) Systems, 13(6): 758–762, June 2005.
[Quick05] Quicklogic, Eclipse II Family Data Sheet, 2005. Available at: http://www.quicklogic.com/images/eclipse2_family_DS.pdf
[Rose90] J. S. Rose, R. J. Francis, D. Lewis, and P. Chow, Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency, IEEE Journal of Solid-State Circuits, 25(5): 1217–1225, October 1990.
[Rose93] J. Rose, A. El Gamal, and A. Sangiovanni-Vincentelli, Architecture of field-programmable gate arrays, Proceedings of the IEEE, 81(7): 1013–1029, July 1993.
[Singh92] S. Singh, J. Rose, P. Chow, and D. Lewis, The effect of logic block architecture on FPGA performance, IEEE Journal of Solid-State Circuits, 27(3): 281–287, March 1992.
[Singh00] H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Chaves, MorphoSys: An integrated reconfigurable system for dataparallel and compute intensive applications, IEEE Transactions on Computers, 49(5): 465–481, 2000.
[Singh01a] A. Singh, A. Mukherjee, and M. Marek-Sadowska, Interconnect pipelining in a throughput-intensive FPGA architecture, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2001, pp. 153–160.
[Singh01b] D. P. Singh and S. D. Brown, The case for registered routing switches in field programmable gate arrays, in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2001, pp. 161–172.
[Sivaswamy05] S. Sivaswamy, G. Wang, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh, HARP: Hardwired routing pattern FPGAs, in ACM International Symposium on Field Programmable Gate Arrays, Monterey, CA, February 2005, pp. 21–32.
[Trimberger94] S. Trimberger, Field-Programmable Gate Array Technology, Kluwer Academic Publishers, Norwell, MA, 1994.
[Weaver04] N. Weaver, J. Hauser, and J. Wawrzynek, The SFRA: A corner-turn FPGA architecture, in ACM/SIGDA International Symposium on FPGAs, February 2004, pp. 3–12.
[Wilton00] S. J. E. Wilton, Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(1): 56–68, 2000.
[Wilton97] S. J. E. Wilton, Architecture and algorithms for field-programmable gate arrays with embedded memory, PhD thesis, University of Toronto, Toronto, Ontario, Canada, 1997.
[Wilton99] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, The memory/logic interface in FPGA's with large embedded memory arrays, IEEE Transactions on Very-Large Scale Integration Systems, 7(1): 80–91, March 1999.
[Xilinx05a] Xilinx Corp., Virtex-4 Users Guide, 2005. Available at: http://www.xilinx.com/support/documentation/user_guides/ug070.pdf
[Xilinx05b] Xilinx Corp., Processor IP Reference Guide, February 2005.
[Ye05] A. G. Ye and J. Rose, Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits, in ACM/SIGDA Symposium on FPGAs, February 2005, Monterey, CA, pp. 3–13.
[Zeidman02] B. Zeidman and R. Zeidman, Designing with FPGAs and CPLDs, CMP Books, Upper Saddle River, NJ, 2002.
46 FPGA Technology Mapping, Placement, and Routing
Kia Bazargan
CONTENTS
46.1 Introduction
46.2 Technology Mapping and Clustering
46.2.1 Technology Mapping
46.2.2 Clustering
46.3 Floorplanning
46.3.1 Hierarchical Methods
46.3.2 Floorplanning on FPGAs with Heterogeneous Resources
46.3.3 Dynamic Floorplanning
46.4 Placement
46.4.1 Island-Style FPGA Placement
46.4.2 Hierarchical FPGA Placement
46.4.3 Physical Synthesis and Incremental Placement Methods
46.4.4 Linear Datapath Placement
46.4.5 Variation-Aware Placement
46.4.6 Low Power Placement
46.5 Routing
46.5.1 Hierarchical Routing
46.5.2 SAT-Based Routing
46.5.3 Graph-Based Routing
46.5.4 Low Power Routing
46.5.5 Other Routing Methods
46.5.5.1 Pipeline Routing
46.5.5.2 Congestion-Driven Routing
46.5.5.3 Statistical Timing Routing
References
46.1 INTRODUCTION
Computer-aided design (CAD) tools for field-programmable gate arrays (FPGAs) primarily emerged as extensions of their application-specific integrated circuit (ASIC) counterparts in the 1980s because of the relative maturity of the ASIC CAD tools at that time. Traditional logic optimization techniques, simulated-annealing-based placement algorithms, and maze routing methods were common in the FPGA world. But as FPGA architecture developed distinct features both in terms of logic and routing architectures, FPGA CAD tools evolved into today's FPGA design flows that are highly optimized for specific characteristics of FPGA devices. More specialized timing models, technology mapping solutions, and placement and routing strategies are needed to ensure high-quality mapping of circuits to FPGAs.

This work is supported in part by the National Science Foundation under grant CCF-0347891.

FIGURE 46.1 Typical FPGA flow. (Figure not reproduced; its boxes are labeled design entry, RTL synthesis, technology mapping, placement, routing, and configuration bitfile, alongside verification/simulation, power analysis, and timing analysis.)
Figure 46.1 shows a common design flow for FPGA designs. The high-level description of the FPGA design is fed to a register transfer level (RTL) synthesis tool that performs technology-independent logic optimization. The synthesis tool might detect opportunities for utilizing special-purpose logic gates within the FPGA logic fabric. Examples are carry chains, high-fanin sum-of-product gates, and embedded multipliers (see Sections 45.3.1.2 and 45.3.2).
The functional gates of the technology-independent optimized design are mapped to FPGA lookup tables (LUTs) (see Section 45.3.1), a process called technology mapping. Clustering of the LUTs is performed next (see Section 45.3.2). Placement and routing steps follow clustering. Floorplanning may or may not precede placement. Each of these steps would use timing and power analysis engines to better optimize the design. Furthermore, the user might simulate or perform formal verification at various steps of the design cycle. If timing or power constraints are not met, the design flow might backtrack to a previous step. For example, if routing fails due to high congestion, then placement might be attempted again with different parameters.
The rest of the chapter is organized into four sections. FPGA-specific technology mapping and clustering algorithms are covered in Section 46.2. Sections 46.3 and 46.4 cover floorplanning and placement algorithms. We conclude the chapter by discussing routing algorithms in Section 46.5.
46.2 TECHNOLOGY MAPPING AND CLUSTERING
Technology mapping converts a logic circuit into a netlist of FPGA K-LUTs and their connections. A K-LUT is usually implemented as a K-input, one-output static random-access memory (SRAM) block. By writing the truth table of a Boolean function in the K-LUT, we can implement any function that has K or fewer inputs regardless of the complexity of the function. Neighboring LUTs can be clustered into local groups with dedicated fast routing resources to improve the delay of the circuit. Clustering algorithms are used to group together local LUTs to minimize connection delays. Later in the design flow, these clusters are used as input to the placement step. Some placement algorithms might never touch a cluster, but some other placement methods (such as the ones presented in Section 46.4.3) might move individual logic blocks from one cluster to another to improve timing, power, etc.
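The idea that a K-LUT is simply a stored truth table can be sketched in a few lines. The helper names `make_lut` and `lut_read` are hypothetical, and the convention that the first input is the most significant address bit is an assumption of this sketch.

```python
from itertools import product


def make_lut(func, k):
    """Build the 2**k-entry truth table (the LUT's SRAM contents) for func."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def lut_read(table, inputs):
    """Evaluate the LUT: the input values form the address of the stored bit."""
    address = 0
    for bit in inputs:                 # first input = most significant address bit
        address = (address << 1) | bit
    return table[address]


# Two unrelated 3-input functions each fit in a single 3-LUT.
majority = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
parity = make_lut(lambda a, b, c: a ^ b ^ c, 3)
```

Both tables hold exactly eight bits, which illustrates why the cost of a function in a LUT depends only on its number of inputs and not on its logical complexity.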
Because technology mapping with area and delay optimization is NP-hard, Cong and Minkovich [1] synthesize benchmarks with known optimal or upper-bound technology mapping solutions and test state-of-the-art FPGA synthesis algorithms to see how far these algorithms are from producing optimal solutions (a preliminary version of their work appeared in the FPGA 2007 conference). They show that current technology mapping solutions are close to optimal (between 3 and 22 percent away; see Table III in Ref. [1]), while logic optimization methods have much room for improvement. Although some argue that the generated benchmarks are artificial and do not reflect characteristics of large industry benchmarks, the work in Ref. [1] gives us insights into what needs to be done to improve existing CAD algorithms. Our goal in the next two sections is to introduce basic technology mapping and clustering algorithms so that the reader can better understand placement and routing algorithms for FPGAs. Many notable technology mapping algorithms (such as DAOmap [2], ABC [3], and the work by Mishchenko et al. [4]) are not discussed here.
46.2.1 TECHNOLOGY MAPPING
A major breakthrough in FPGA technology mapping came about in 1994 with the introduction of the FlowMap tool [5]. Library-based ASIC technology mapping (which maps a logic network to gates such as AND, OR, etc.) for depth minimization was known to be NP-hard, but Cong et al. proved that K-LUT technology mapping can be done in O(KVE) time, where V and E are the number of nodes (gates) and edges (wires) in the circuit, respectively. The FlowMap algorithm traverses the circuit graph containing simple gates and their connections in a breadth-first search fashion and determines depth-optimal mappings of the fanin cones of the nodes as it progresses toward primary outputs. The fanin cone of a node is the set of all gates from the circuit primary inputs (input pads) to the node itself.
The algorithm uses the notion of K-feasible cuts to find K-LUT mappings of a subcircuit. Figure 46.2a shows an example subgraph in which a cut separates the nodes into disjoint sets X and X̄, where only three nodes in set X provide inputs to nodes in X̄, that is, the nodes that are drawn using thick lines. Because only three nodes cross the cut, Cut(X, X̄) is said to be K-feasible for K ≥ 3. All the nodes in set X̄ can be mapped into one 3-LUT, which gets its input values from the LUTs that implement the three boundary nodes in X and their fanin cones.
The labels on the nodes in Figure 46.2a show the depth of the minimum depth K-LUT mapping of the input cone of the node. The authors prove that for a node t, the minimum depth is either the maximum label l in X, or l + 1.* Consider an example graph for another circuit shown in Figure 46.2b.
FIGURE 46.2 FlowMap mapping steps. Panel (c): checking the feasibility of a mapping of depth 2. (Figure not reproduced.) (From Cong, J. and Ding, Y., IEEE Trans. Comput. Aided Des. Integrated Circuits Syst., 13, 1, 1994. Copyright IEEE 1994. With permission.)
* If the new node t can be packed with the rest of the nodes with label l, then the depth of LUTs used in implementing the circuit up to this point would not increase. Otherwise, a new LUT with depth l + 1 has to be allocated to house the new node t.
In a breadth-first search traversal on subgraph N_t, when we get to node t, the question is whether we can pack t with nodes a, b, c (which have the maximum depth of l) in one K-LUT. We can create an auxiliary graph N′_t (shown in Figure 46.2c; note that nodes with labels correspond to their counterpart nodes in Figure 46.2b with the same labels), which replaces a, b, c, t with one node t′, and see if t′ (and possibly other nodes) can be packed in one K-LUT. Node t′ can be mapped to a K-LUT if we can find a cut (X, X̄) where t′ ∈ X̄ and at most K nodes in X provide input to nodes in X̄. Network flow algorithms can be used to answer this question. We can model one LUT in the fanin cone as a flow of one unit, and look for a maximum flow of K units at the sink node. If the maximum flow is at most K, it means that we have found a cut with at most K LUTs as inputs, and anything below the cut can be packed into a K-LUT. More details are provided next. Subgraph N′_t can be transformed to a dual graph N″_t (Figure 46.2d) in which each node y is replaced by two nodes y_i and y_o that are connected by an edge of weight 1. An edge (y, z) in N′_t corresponds to edge (y_o, z_i) in N″_t with an infinite edge weight. If the maximum flow in N″_t is at most K units, then at most K nodes in X provide inputs to nodes in X̄, which means node t in the original N_t graph can indeed be packed with other nodes with the maximum label. The authors introduce variations on the original technology mapping algorithm to minimize area as a secondary objective.
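The node-splitting construction and the flow test described above can be sketched as follows. This is a minimal illustration, not the FlowMap authors' implementation: the function name `has_k_feasible_cut` is hypothetical, the input is assumed to be the edge list of the collapsed network with a single source and sink, and a plain breadth-first augmenting-path search is used for the maximum flow.

```python
from collections import deque

INF = float("inf")


def has_k_feasible_cut(edges, source, sink, k):
    """Return True if at most k nodes need to be cut to separate source from sink."""
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)      # residual edge

    def split(y, side):
        # Source and sink are not split; every other node y becomes y_in -> y_out.
        return y if y in (source, sink) else (y, side)

    nodes = {n for edge in edges for n in edge}
    for y in nodes - {source, sink}:
        add_edge((y, "in"), (y, "out"), 1)          # unit capacity per node
    for u, v in edges:
        add_edge(split(u, "out"), split(v, "in"), INF)

    flow = 0
    while flow <= k:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return True                             # max flow == flow <= k
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        if bottleneck == INF:
            return False                            # source feeds sink directly
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck
    return False                                    # more than k units of flow


# Hypothetical cone: inputs a..d feed gates g1 and g2, which feed sink t.
cone = [("s", "a"), ("s", "b"), ("s", "c"), ("s", "d"),
        ("a", "g1"), ("b", "g1"), ("c", "g2"), ("d", "g2"),
        ("g1", "t"), ("g2", "t")]
fits_2lut = has_k_feasible_cut(cone, "s", "t", 2)
fits_1lut = has_k_feasible_cut(cone, "s", "t", 1)
```

In this example the cut through g1 and g2 has two boundary nodes, so the test succeeds for K = 2 and fails for K = 1. Because the search stops after at most K + 1 augmentations, the work per node stays proportional to K times the number of edges.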
46.2.2 CLUSTERING
Today's FPGAs cluster LUTs into groups and provide fast routing resources for intracluster connections. When two LUTs are assigned to one cluster, their connections can use the fast routing resources within the cluster, and hence reduce the delay on the connection. On the other hand, if two LUTs are in two separate clusters, they have to use intercluster routing resources that are more scarce and more costly in terms of delay. Placement and routing algorithms are needed to balance the usage of intercluster routing resources (see Sections 46.4 and 46.5).
Many clustering algorithms were introduced in the past decade. Most work by first selecting a seed and then choosing LUTs to cluster with the seed. The difference between various clustering algorithms is in their criteria for choosing the seed node and the way other nodes are chosen to be absorbed by the seed. The clustering algorithm used in the popular versatile placement and routing (VPR) tool [6] is called T-VPack [7], which is an extension of the earlier packing algorithm VPack. VPack chooses LUTs with a high number of input connections as initial seeds for clusters. The criterion for packing a node B into a cluster C is the attraction of the node, defined as the number of nets that are shared between node B and nodes inside C. The more sharing there is between nodes within a cluster, the less routing demand is needed to connect clusters.
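The attraction criterion can be illustrated with a short sketch. The data layout (a dictionary from node name to the set of nets it touches) and the function names are assumptions made for this example, and cluster capacity limits are ignored.

```python
def attraction(node_nets, cluster_nets):
    """Number of nets shared between a candidate node and the cluster."""
    return len(node_nets & cluster_nets)


def pick_next(candidates, cluster, nets_of):
    """Choose the candidate with the highest attraction to the cluster."""
    cluster_nets = set().union(*(nets_of[n] for n in cluster))
    return max(candidates, key=lambda b: attraction(nets_of[b], cluster_nets))


# Hypothetical netlist: node name -> nets attached to it.
nets_of = {
    "A": {"n1", "n2"},
    "B": {"n2", "n3"},
    "C": {"n4"},
    "D": {"n1", "n2", "n5"},
}
best = pick_next(["B", "C", "D"], cluster=["A"], nets_of=nets_of)
```

With A as the seed, D shares two nets with the cluster, B shares one, and C shares none, so D is absorbed first.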
T-VPack is the timing-driven version of VPack and extends the definition of the attraction of a node to include timing criticality of nets connecting the node to those packed into the cluster. Timing criticality of a net i is defined as 1 − slack(i)/MaxSlack. If two nodes have equal net criticality values connecting them to nodes packed into a cluster, then the one through which more critical paths pass is chosen to be packed into the cluster first. The results in Ref. [7] show that clusters of size 7–10 provide the best area/delay tradeoff.
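The criticality formula and the tie-breaking rule can be sketched as below. The slack values, path counts, and function names are hypothetical, MaxSlack is assumed to be nonzero, and the way T-VPack combines criticality with net sharing is not modeled here.

```python
def criticality(slack, max_slack):
    """Timing criticality of a net: 1 - slack / MaxSlack (max_slack > 0)."""
    return 1.0 - slack / max_slack


def pick_by_timing(candidates, net_criticality, critical_paths):
    """Prefer the highest criticality; break ties by the number of
    critical paths passing through the node."""
    return max(candidates, key=lambda n: (net_criticality[n], critical_paths[n]))


# Hypothetical slacks (in ns) for three nets.
slacks = {"n1": 0.0, "n2": 2.0, "n3": 4.0}
max_slack = max(slacks.values())
crit = {net: criticality(s, max_slack) for net, s in slacks.items()}

# B and C connect to the cluster through equally critical nets.
net_criticality = {"B": crit["n2"], "C": crit["n2"], "D": crit["n3"]}
critical_paths = {"B": 3, "C": 7, "D": 9}
chosen = pick_by_timing(["B", "C", "D"], net_criticality, critical_paths)
```

A net with zero slack gets criticality 1 and the net with the largest slack gets 0. Nodes B and C tie on criticality, so C is chosen because more critical paths pass through it.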
Clustering algorithms such as RPack [8] and the work by Singh et al. [9] improve routability of the clustered circuit by introducing absorption costs that try to weigh nodes based on how promising they are in absorbing more nets into the cluster. The authors in Ref. [9] define the connectivity factor (c) of a LUT x as c(x) = separation(x)/degree(x)^2, where separation of a LUT is the number of LUTs adjacent to it. Figure 46.3a shows node A with a separation value of 18, degree of 4, and connectivity of 1.125. Figure 46.3b shows node B with the same degree as A, but with a smaller separation and hence smaller connectivity. Node A cannot absorb any nets if one node from each net is clustered into the same cluster as A. On the other hand, node B can absorb all the nets shown in Figure 46.3 by including one node from each net in its cluster. The selection of the seed node in the work by Singh et al. is done by lexicographically sorting nodes by their (degree, −connectivity) values and choosing the ones with highest values as initial seeds (T-VPack used only the degree values).
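The connectivity factor and the seed ordering can be checked with a small sketch. Node A uses the values quoted in the text; the separation assumed for node B and the whole of node C are hypothetical values chosen only for illustration.

```python
def connectivity(separation, degree):
    """Connectivity factor c(x) = separation(x) / degree(x)**2."""
    return separation / degree ** 2


# (separation, degree) per LUT. A is from the text; B and C are assumed values.
luts = {"A": (18, 4), "B": (8, 4), "C": (6, 3)}

# Seeds are ordered by (degree, -connectivity), highest first.
seed_order = sorted(
    luts,
    key=lambda x: (luts[x][1], -connectivity(*luts[x])),
    reverse=True,
)
```

Node A evaluates to 18/16 = 1.125 as stated in the text. A and B have the same degree, so B is ranked first because of its lower connectivity factor.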
(a) Number of nets absorbed = 0
(b) Number of nets absorbed = 4
FIGURE 46.3 Examples illustrating the usefulness of the connectivity factor. (Based on Singh, A. and Marek-Sadowska, M., Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 59–66, 2002. With permission.)
Nodes are greedily packed into seed clusters based on how many nets they absorb, with higher priority given to the nets with fewer terminals. To guarantee spatial uniformity of the clustered netlist, the authors limit the number of available pins to a cluster so that the number of logic blocks inside a cluster and the number of connections to the nodes within the cluster follow Rent's rule. Doing so effectively depopulates clusters to reduce overall intercluster routing demands. Such strategies are in line with what DeHon's study [10] on routing requirements of FPGA circuits suggested. Because interconnect resources (switches and buffers) consume most of the silicon area of an FPGA (80–90 percent), sometimes it is beneficial to underutilize clusters to reduce routing demand in congested regions of the FPGA array.
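Rent's rule relates the number of external connections T of a group of B blocks as T = t · B^p, where t is the average number of pins per block and p is the Rent exponent. A pin limit derived from it might be sketched as follows. The function name and the parameter values are hypothetical, and how Ref. [9] actually applies the rule is not reproduced here.

```python
import math


def rent_pin_limit(num_blocks, pins_per_block, rent_exponent):
    """Pin budget for a cluster of num_blocks logic blocks under Rent's rule,
    T = t * B**p, rounded down to a whole number of pins."""
    return math.floor(pins_per_block * num_blocks ** rent_exponent)


# Hypothetical parameters: 5 pins per logic block, Rent exponent 0.5.
limit_16 = rent_pin_limit(16, 5, 0.5)
limit_1 = rent_pin_limit(1, 5, 0.5)
```

With these assumed parameters a 16-block cluster is allowed 20 external pins rather than the 80 pins its blocks have in total, so a packer that respects the limit must either absorb nets inside the cluster or leave some blocks unused.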
46.3 FLOORPLANNING
Floorplanning is used on FPGAs to speed up the placement process or to place hard macros with prespecified shapes. The traditional FPGA floorplanning problem is discussed in Section 46.3.1. Another class of floorplanning algorithms for FPGAs is the ones that deal with heterogeneous resource types. An example of this approach is the work by Cheng and Wong [11], to be covered in Section 46.3.2. A third class of floorplanning for FPGAs addresses dynamically reconfigurable systems in which modules are added or removed at runtime, requiring fast, on-the-fly modification of the floorplan. These approaches are discussed in Section 46.3.3.
46.3.1 HIERARCHICAL METHODS
Sankar and Rose [12] first use a bottom-up clustering method to build larger clusters out of logic blocks (refer to Section 46.2.2). Then they use a hierarchical simulated annealing algorithm to speed up the placement compared to a flat annealing methodology. They show trade-offs between placement runtime and quality.
While clustering the circuit into larger subcircuits, they limit the shape and size of the clusters to prespecified values. The leaves of the clustering tree are the logic blocks, and the first level of the tree consists of nodes that combine exactly two leaves. All level-one nodes will be placed in 1×2 regions, that is, on two adjacent clusters in the same row. The next level of hierarchy clusters two level-one clusters, which will be placed as 2×2 squares. Figure 46.4 shows the clustering and placement conceptually. Such restrictions on the clustering and placement steps limit the ability of the algorithms to search a larger solution space compared to an unrestricted version of the problem, but on the other hand relieve the algorithm designers of dealing with the sizing problem during the floorplanning process, described in Section 9.4.1.
The work by Emmert and Bhatia [13] also starts by clustering the logic elements into larger subcircuits. The input to their flow is a list of macros, each macro being either a logic block or a set of logic blocks with a list of predefined shapes. An example of a macro is a multiplier with two shape options, one for minimum area, the other for minimum delay.