to map a circuit to an application-specific integrated circuit (ASIC). The next chapter will describe the physical design algorithms for FPGAs; this chapter sets the stage by describing the architecture of FPGAs. Section 45.2 describes several programming technologies, Section 45.3 describes logic block architectures, Section 45.4 describes routing architectures, and Sections 45.5 and 45.6 describe embedded memories and embedded computation blocks.
45.2 PROGRAMMING TECHNOLOGIES
The circuit being implemented on an FPGA is stored in the FPGA using a set of configuration bits. These bits can be constructed in various ways; this section describes static random access memory (SRAM), Flash, and antifuse-based configuration bits. These schemes are all used in contemporary commercial FPGAs; many FPGA vendors, such as Xilinx, Altera, and Lattice, use SRAM configuration bits to control the programmable switches that configure routing and logic [Altera05,Lattice05,Xilinx05a]. Actel produces both Flash and antifuse FPGA products [Actel05a]. QuickLogic uses antifuse technology in its products [Quick05]. Table 45.1 provides a comparison among these three technologies; details on each are provided below. FPGAs based on emerging technologies have also been described [Ferrera04,Dehon05], but because they are not yet commercially available, they are not discussed further here.
45.2.1 SRAM-BASED FPGAS
The most popular scheme to implement configuration bits is to use SRAM cells. SRAM technology is fast and allows for reprogrammability. In addition, SRAM bits can be implemented using standard complementary metal-oxide-semiconductor (CMOS) processes, meaning FPGAs using SRAMs can be implemented in leading-edge processes. Figure 45.1 shows a typical six-transistor SRAM memory cell. It uses the data bit in both the true and complement forms to achieve fast read and write times [Trimberger94]. Although a six-transistor cell is generally more stable because it is resistant to state flipping owing to crosstalk or charge sharing [Betz99], four-transistor and five-transistor SRAM cells are possible. Xilinx uses a five-transistor SRAM cell for their FPGAs [Trimberger94]. The main disadvantage of SRAM is its volatility. Data stored in SRAM cells is lost when the power is turned off. Therefore, additional off-chip memory, such as electrically erasable programmable read-only memory (EEPROM), is necessary to store the configuration bits and program the FPGA at power-up. This potentially causes security concerns, because designs can be copied by capturing the external bit stream [Zeidman02]. To address this, some FPGA vendors, such as Altera and Lattice, use on-chip Flash memory to store the configuration bits, so the SRAM-based FPGA can be programmed without external memory upon power-up.

TABLE 45.1 Comparison among SRAM, Antifuse, and Flash
Features                  SRAM    Flash   Antifuse
In-system programmable    Yes     Yes     No
Power consumption         High    Lower   Lower
Soft-error resistance     Low     High    High

FIGURE 45.1 Six-transistor SRAM cell. (Labels: Data, Data_bar; program line, asserted during the configuration phase; load line and load line_bar, which load the values for data and data_bar during the configuration phase.)

A second disadvantage of this technology is that SRAM cells are susceptible to neutron-induced errors, also known as soft errors, which are caused by neutrons, alpha particles, and cosmic or terrestrial radiation. These errors are common in high-radiation environments, such as at high altitude or in space. Such errors do not permanently damage the FPGA, but they may cause instability and functional failure in the system. The main strategies to overcome these errors in SRAM-based FPGAs are triple redundancy, error-correcting or parity codes, and redundancy in time.
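As a rough illustration of the triple-redundancy strategy mentioned above, the following sketch votes bitwise over three copies of a stored word so that an upset in any single copy is masked. The function name `majority` and the 8-bit example values are assumptions made for illustration; they do not describe any vendor's implementation.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Three redundant copies of a hypothetical 8-bit value.
golden = 0b1011_0010
# One copy suffers a single-bit upset (bit 3 flips).
upset = golden ^ 0b0000_1000

voted = majority(golden, upset, golden)
print(voted == golden)  # the upset in one copy is outvoted
```

Voting only masks an error while at most one copy is wrong in a given bit position, which is why such schemes are often paired with periodic rewriting of the stored bits.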
45.2.2 FLASH-BASED FPGAS
Flash cells provide nonvolatile programmability while retaining the ability to reprogram the FPGA. Figure 45.2 illustrates the Flash switch used in Actel’s ProASIC3. In the Flash switch, two transistors share the floating gate, which stores the programming data. The sensing transistor is used for writing and verification of the floating gate voltage, while the switching transistor is employed to configure routing nets and logic. Flash-based FPGAs are more secure and consume less power than their SRAM counterparts [Actel05a]. However, the manufacturing process for Flash is more complicated than that of SRAM. As a result, Flash technology usually lags one to two process generations behind SRAM technology. Testing is also lengthy owing to the nature of Flash. Therefore, Flash-based FPGAs have a slower time-to-market than SRAM-based FPGAs.

FIGURE 45.2 Flash-based switch. (Labels: wordline, switch input, switch output, sensing transistor, switching transistor, floating gate.)
45.2.3 ANTIFUSE-BASED FPGAS
Antifuses can also be used to implement configuration bits [Actel05b]. An antifuse is a thin insulating layer between conductors. Applying a high voltage alters the insulating layer, creating a low-resistance path between the conductors. This alteration is irreversible. Like Flash, antifuse technology is nonvolatile. The major disadvantage of antifuse FPGAs is their one-time programmability. However, antifuse technology consumes less power and is more area-efficient than SRAM and Flash.
45.3 LOGIC BLOCK ARCHITECTURES
Programmability is provided in an FPGA in two ways. Logic is implemented in configurable logic blocks; these logic blocks are then connected to each other and to the I/O pads using a configurable routing network [Rose93,Betz99]. This section focuses on logic blocks and the next section focuses on the routing network.
45.3.1 LOOKUP-TABLES
Most FPGAs use lookup-tables (LUTs) as their basic logic element. A K-input LUT (K-LUT) is a memory with 2^K bits, K address lines, and a single output line. Each K-LUT can be configured to implement any function of K inputs by storing the truth table of the desired function in the 2^K storage bits. Figure 45.3 shows an example of a 2-input LUT implemented using SRAM cells (antifuse and Flash memory cells could also be used).
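The idea of a K-LUT as a small memory addressed by its inputs can be sketched in a few lines. This is a behavioral model only; the class name `LUT` and the choice of the first input as the most significant address bit are assumptions for illustration.

```python
from itertools import product


class LUT:
    """K-input lookup table: 2**K configuration bits addressed by the inputs."""

    def __init__(self, k, func):
        self.k = k
        # Programming the LUT means storing the truth table of func.
        # Entry i holds the output for the input combination whose binary
        # encoding is i (first input is the most significant address bit).
        self.bits = [int(bool(func(*ins))) for ins in product((0, 1), repeat=k)]

    def read(self, *inputs):
        """Use the inputs as address lines and return the stored bit."""
        assert len(inputs) == self.k
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return self.bits[addr]


# A 2-input LUT programmed as an AND gate.
and2 = LUT(2, lambda a, b: a and b)
print(and2.bits)        # [0, 0, 0, 1]
print(and2.read(1, 1))  # 1
```

Reprogramming the same structure with a different truth table (for example `lambda a, b: a ^ b`) changes the function without changing the hardware model, which is the property the text relies on.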
Early research has shown that K = 4 works well; this is used in most commercial FPGAs [Rose90,Singh92]. Later work reconfirmed that K = 4 is a good choice for area, but that for performance, K = 7 works well [Ahmed04]. In general, the parameter K has a significant impact on the efficiency of the architecture. If K is too large, it may not be possible to completely fill each logic block, while if K is too small, delay will suffer because more logic blocks will be needed along the critical path of a circuit. Figure 45.4 shows how a 6-input function might be implemented with two 4-LUTs; had a 6-LUT been used, only one LUT would be required.
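To make the two-LUT mapping concrete, here is a minimal sketch that cascades two 4-LUTs to compute a 6-input function. The function chosen (6-input parity) is an assumption; it is simply one function that decomposes this way, and not every 6-input function can be split across just two 4-LUTs. The helper names `make_lut` and `read_lut` are hypothetical.

```python
from itertools import product


def make_lut(k, func):
    """Return the 2**k-entry truth table of func (first input is the MSB)."""
    return [int(func(*ins)) for ins in product((0, 1), repeat=k)]


def read_lut(table, *inputs):
    """Look up the stored bit addressed by the inputs."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]


# First 4-LUT: parity of A, B, C, D.
lut1 = make_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
# Second 4-LUT: parity of the first LUT's output with E and F.
# Its fourth input is unused and tied to 0.
lut2 = make_lut(4, lambda x, e, f, unused: x ^ e ^ f)


def parity6(a, b, c, d, e, f):
    """6-input parity built from two cascaded 4-LUTs."""
    return read_lut(lut2, read_lut(lut1, a, b, c, d), e, f, 0)


# Exhaustively compare against a direct computation of parity.
ok = all(parity6(*v) == sum(v) % 2 for v in product((0, 1), repeat=6))
print(ok)
```

The cascade puts two LUT delays on the path from A through the output, which illustrates the delay penalty of a small K described above.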
Variations on the basic LUT architecture have been used. Figure 45.5 shows a logic block that employs a fracturable LUT mask (FLM) [Lewis05]. A k,m-FLM can implement a single k-input function or two functions, each with up to k − 1 inputs, which together use no more than k + m distinct inputs. The architecture in Figure 45.5a is a 6,2-FLM. An extension of the FLM architecture,
called a shared LUT mask (SLM) architecture, is shown in Figure 45.5b. A k,m-SLM can implement two identical functions of k inputs provided that the two functions share k − m inputs. The SLM architecture does this through the sharing of LUT masks (the set of configuration bits that indicate the function implemented by the LUT) so that both functions are the same but can have different inputs. The logic block in the Altera Stratix II FPGA is based on a 6,2-SLM [Altera05].

FIGURE 45.3 Two-input LUT, unprogrammed and programmed as a two-input AND gate.
FIGURE 45.4 Implementing a 6-input function using two 4-LUTs.
FIGURE 45.5 Advanced logic block structures.
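The input-count constraints stated above for a k,m-FLM can be expressed as a small check. This sketch tests only the number of inputs, exactly as worded in the text; it does not decide whether a particular pair of functions can actually be packed by a real tool. The function name `fits_flm` is hypothetical.

```python
def fits_flm(k, m, inputs_a, inputs_b=None):
    """Check the k,m-FLM input-count constraints quoted in the text."""
    a = set(inputs_a)
    if inputs_b is None:
        # A single function may use up to k inputs.
        return len(a) <= k
    b = set(inputs_b)
    return (
        len(a) <= k - 1          # each function has up to k - 1 inputs
        and len(b) <= k - 1
        and len(a | b) <= k + m  # together, at most k + m distinct inputs
    )


# Two 5-input functions sharing inputs A and B: 8 distinct inputs in total.
print(fits_flm(6, 2, "ABCDE", "ABFGH"))  # True
# Two 5-input functions with no shared inputs: 10 distinct inputs.
print(fits_flm(6, 2, "ABCDE", "FGHIJ"))  # False
```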
Lookup-tables are usually coupled with flip-flops, as shown in Figure 45.6. In this structure, a configuration bit is used to control the state of the output multiplexer. Depending on the value of this configuration bit, the output signal of the LUT can be either registered or unregistered. As in Ref. [Betz99], we refer to the LUT and flip-flop as a basic logic element (BLE).
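A behavioral sketch of a BLE may help show the role of the output-multiplexer configuration bit. The class name `BLE`, the method names, and the reset value of 0 for the flip-flop are assumptions for illustration.

```python
class BLE:
    """Basic logic element: a LUT, a D flip-flop, and an output multiplexer."""

    def __init__(self, truth_table, registered):
        self.truth_table = list(truth_table)  # 2**K LUT configuration bits
        self.registered = registered          # configuration bit of the output mux
        self.q = 0                            # flip-flop state (assumed reset to 0)

    def _lut(self, inputs):
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return self.truth_table[addr]

    def clock(self, *inputs):
        """Clock edge: the flip-flop captures the current LUT output."""
        self.q = self._lut(inputs)

    def output(self, *inputs):
        """The mux selects the flip-flop output or the raw LUT output."""
        return self.q if self.registered else self._lut(inputs)


xor_comb = BLE([0, 1, 1, 0], registered=False)
xor_reg = BLE([0, 1, 1, 0], registered=True)

print(xor_comb.output(1, 0))  # unregistered: follows the LUT immediately
print(xor_reg.output(1, 0))   # registered: still holds the reset value
xor_reg.clock(1, 0)
print(xor_reg.output(0, 0))   # registered: holds the value captured at the edge
```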
45.3.1.1 Clusters
To increase speed and reduce area and compile time, larger logic blocks are preferred. However, LUT complexity grows exponentially with the number of inputs [Rose93]. Clusters are logic blocks of larger granularity, typically composed of multiple BLEs, internal cluster routing, and possibly specialized internal cluster connections, such as carry and arithmetic chains [Marquardt00]. Within a cluster, BLE inputs are typically connected to the cluster inputs and BLE outputs by a multiplexer-based crossbar. This internal interconnect is generally faster than the general-purpose routing between blocks. Altera refers to clusters as logic array blocks (LABs), while Xilinx refers to clusters as configurable logic blocks (CLBs).

FIGURE 45.6 LUT coupled with a flip-flop (BLE).
FIGURE 45.7 Basic BLE and basic cluster composed of identical BLEs.
Figure 45.7 shows a typical cluster. The cluster architecture is described by four parameters: (1) K, the number of inputs to a LUT; (2) N, the number of BLEs in a cluster; (3) I, the number of inputs to the cluster that connect to LUT inputs; and (4) Mclk, the number of clock inputs to a cluster (most studies assume this is 1).
Increasing K or N increases the functionality of the cluster. This reduces the number of blocks needed to implement circuits and the number of blocks on the critical path, but increases the size of the block and makes the local cluster interconnect slower. Research has found that K = 4–6 and N = 3–10 provide the best combined speed and area [Ahmed04].
The value of I is often smaller than K × N, because BLEs often share inputs or use the outputs from BLEs within the cluster. Smaller values of I use smaller multiplexers in the crossbar, reducing area, but overly small I values make some BLEs unusable. Research has found that 98 percent utilization can be achieved when I = [(K/2) × (N + 1)] [Ahmed04].
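As a quick worked example of the relation I = (K/2) × (N + 1), the sketch below compares the suggested number of cluster inputs against the K × N inputs that a fully populated cluster would need. Rounding up for odd K is an assumption of this sketch, not something stated in the text, and the function name `cluster_inputs` is hypothetical.

```python
import math


def cluster_inputs(k, n):
    """Cluster input count I = (K/2) * (N + 1), rounded up (assumption)."""
    return math.ceil(k * (n + 1) / 2)


for k, n in [(4, 4), (4, 10), (6, 8)]:
    i = cluster_inputs(k, n)
    print(f"K={k} N={n}: I={i} versus K*N={k * n}")
```

For K = 4 and N = 10, for instance, this gives I = 22, a little over half of the 40 LUT inputs in the cluster.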
45.3.1.2 Carry Chains
Carry chains are locally routed connections that aid in the efficient implementation of arithmetic operations. They can also be used in the efficient implementation of logical operations, such as parity and comparison. Fast carry chains are important because the critical path for these operations is often through the carry.
Each 4-LUT in a BLE can be fractured to implement two 3-LUTs; this is sufficient to implement both the sum and carry, given two input bits (a and b) and a carry input, as shown in Figure 45.8. The carry out signal from one BLE would typically be connected to the carry in of an adjacent BLE using a fast dedicated connection. The Z-input is used to break the carry chain before the first bit of an addition.
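The fractured-LUT adder described above can be modeled as two 3-input truth tables per bit, chained through the carry. This is a behavioral sketch of a ripple-carry adder only; the Z-input is modeled simply by starting the chain with a carry of 0, and the names `SUM_LUT`, `CARRY_LUT`, and `ripple_add` are hypothetical.

```python
from itertools import product

# Truth tables of the two 3-LUTs obtained by fracturing a 4-LUT, indexed by
# (a, b, carry_in) with a as the most significant address bit.
SUM_LUT = [a ^ b ^ c for a, b, c in product((0, 1), repeat=3)]
CARRY_LUT = [(a & b) | (a & c) | (b & c) for a, b, c in product((0, 1), repeat=3)]


def ripple_add(x, y, width):
    """Add two unsigned integers using one BLE (two 3-LUTs) per bit."""
    carry = 0  # chain is broken before the first bit, so carry-in is 0
    result = 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        addr = (a << 2) | (b << 1) | carry
        result |= SUM_LUT[addr] << i
        carry = CARRY_LUT[addr]  # dedicated connection to the adjacent BLE
    return result, carry


# 11 + 6 = 17, which in 4 bits is a sum of 1 with a carry out of 1.
print(ripple_add(11, 6, 4))  # (1, 1)
```

Because each bit waits for the carry from the previous one, the loop also makes visible why the carry path dominates the critical path of such an adder.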
More complex carry schemes have been described. In Ref. [Hauck00], carry chains based on carry select, variable block, and Brent–Kung schemes are described; the Brent–Kung scheme is shown to be 3.8 times faster than the simple ripple carry adder in Figure 45.8. Support for carry-lookahead adders is included in the Actel Axcelerator device and in the Xilinx Virtex-II, Virtex-II Pro, and Virtex-4 devices. Carry select capabilities are included in the Altera Stratix FPGAs. The Altera Stratix II contains two dedicated 1-bit adders in each logic block. Because high-fanin arithmetic can cause routing congestion in a small area of the device, both Xilinx and Altera parts support two independent carry chains in each cluster. This allows for narrower fanin logic, which helps reduce routing congestion around the adders.

FIGURE 45.8 Carry chain connections to a 4-LUT. (Labels: a, b, Z, P, carry in, carry out, sum out.)
45.3.2 NON-LUT-BASED LOGIC BLOCKS
Not all FPGAs contain logic blocks based on LUTs. The Actel ProASIC3 logic blocks contain a set of multiplexers, which allow for the implementation of 3-input combinational or sequential functions in each logic block [Actel05a]. The QuickLogic Eclipse II logic cell contains two 6-input AND gates, four 2-input AND gates, and seven two-to-one multiplexers [Quick05]. The use of universal logic modules as FPGA logic blocks has also been proposed; these blocks can implement any function of their inputs by applying input permutation and negation [Lin94]. Finally, programmable devices using more coarse-grained logic blocks exist; these logic blocks are typically arithmetic/logic units and are suitable for computationally intensive applications [Ebeling96,Goldstein00,Singh00,Mei03].
45.4 ROUTING ARCHITECTURES
Connections between logic blocks are implemented using fixed prefabricated metal tracks. These tracks are arranged in channels; channels typically run vertically and horizontally, forming a grid [Lemieux04a]. Although many academic studies have assumed that all channels contain the same number of tracks [Betz99], many commercial architectures (such as those from Altera) contain more tracks in each horizontal channel than in each vertical channel. Figure 45.9 shows an FPGA with tracks arranged in horizontal and vertical channels.
45.4.1 SEGMENTATION
Tracks within a channel can span one logic block or multiple logic blocks. Typically, not all tracks within a channel will be of the same length. Several studies have investigated the optimum segment length. In Ref. [Brown96], a heterogeneous routing architecture, in which some tracks span three logic blocks, some span two logic blocks, and some span one logic block, is found to work well. In Ref. [Betz99], it is shown that longer wires result in a more efficient architecture; the authors suggest that a homogeneous architecture in which all tracks span either four or eight logic blocks gives the most efficient FPGA.
FIGURE 45.9 Overall routing architecture. (Labels: logic block, routing track, vertical channel, switch block, connection block.)
45.4.2 PROGRAMMABLE SWITCHES
The tracks are connected to each other and to the logic blocks using programmable switches. These programmable switches can be buffered or unbuffered, as shown in Figure 45.10. Switches in modern FPGAs are typically buffered, because unbuffered switches result in a quadratic increase in delay for long connections. Buffered switches can be bidirectional, as shown in Figure 45.10b, or unidirectional, as shown in Figure 45.10c. Although many academic studies assume bidirectional switches [Betz99], most modern FPGAs contain unidirectional switches [Lemieux04b]; these switches allow for better delay optimization and result in a denser routing fabric.
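The quadratic growth mentioned above can be illustrated with a first-order Elmore delay estimate, which is one common way to approximate RC delay. The sketch assumes every switch-plus-wire segment has the same resistance and capacitance and that each buffer adds a fixed intrinsic delay; the numeric values (1 kΩ, 10 fF, 30 ps) are hypothetical and chosen only to show the trend.

```python
def unbuffered_delay(n, r, c):
    """Elmore delay of n series RC segments driven from one end.

    The capacitance at node i sees the resistance of all i upstream
    segments, so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n + 1)/2.
    """
    return sum((i * r) * c for i in range(1, n + 1))


def buffered_delay(n, r, c, t_buf):
    """With a buffer per segment, each stage is isolated and delays add linearly."""
    return n * (t_buf + r * c)


# Hypothetical per-segment values, for illustration only.
R = 1e3      # ohms
C = 10e-15   # farads
T_BUF = 30e-12  # seconds

for n in (1, 2, 4, 8, 16):
    u = unbuffered_delay(n, R, C)
    b = buffered_delay(n, R, C, T_BUF)
    print(f"{n:2d} segments: unbuffered {u * 1e12:7.1f} ps, buffered {b * 1e12:7.1f} ps")
```

With these assumed numbers the unbuffered path is faster for short connections and slower for long ones, which matches the qualitative argument in the text.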
45.4.3 SWITCH BLOCKS AND CONNECTION BLOCKS
Tracks are connected to each other using switch blocks, and to logic blocks using connection blocks. Commercial FPGAs often contain combined switch blocks and connection blocks; however, for clarity, this section describes each separately.
FIGURE 45.10 Programmable switches: (a) unbuffered, (b) buffered bidirectional, and (c) buffered unidirectional.
FIGURE 45.11 Switch block patterns.
A switch block lies at the intersection of each horizontal and vertical channel, and can connect each incident track to some number of other incident tracks. Academic work uses the notation Fs to describe the number of outgoing tracks to which each incoming track can be connected [Rose93]. Most physical design algorithm studies assume Fs = 3; in this case, each incoming track can be connected to one track on each of the other three sides of the switch block. The switch pattern determines the Fs tracks to which each incoming track can be connected. Academic work has proposed the three switch patterns in Figure 45.11. The disjoint pattern divides the routing fabric into domains; if there are W tracks in each channel, there are W domains. This simplifies the routing task, and results in an efficient layout. The universal pattern has been shown to support the largest number of simultaneous connections through each switch block [Chang96], while the Wilton block has been shown to result in good overall routability [Wilton97]. An extension of the Wilton block to architectures with different segment lengths is described in Ref. [Masud99].
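The disjoint pattern is simple enough to sketch directly: under the usual description, track i on one side connects only to track i on each of the other three sides, which gives Fs = 3 and keeps a signal in the same track index (domain) throughout. The side names and the function `disjoint_targets` are assumptions for illustration; the universal and Wilton patterns are not modeled here.

```python
SIDES = ("left", "top", "right", "bottom")


def disjoint_targets(side, track):
    """Disjoint pattern: track i connects to track i on the other three sides."""
    return [(other, track) for other in SIDES if other != side]


W = 4  # tracks per channel, hence W domains
for t in range(W):
    targets = disjoint_targets("left", t)
    assert len(targets) == 3                      # Fs = 3
    assert {idx for _, idx in targets} == {t}     # never leaves domain t

print(disjoint_targets("left", 2))
# [('top', 2), ('right', 2), ('bottom', 2)]
```

Because a route can never change track index inside such a switch block, a router working with this pattern effectively chooses one of W independent domains for each net.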
In Ref. [Sivaswamy05], it is proposed that some of the connections in a switch block should be hard-wired (nonprogrammable); this gives a 30 percent speedup, a slight reduction in area, and an 8 percent reduction in power.
Connection blocks are used to connect logic block pins to the routing tracks. Each logic block pin can be connected to a subset of routing tracks in the neighboring channel. The quantity Fc indicates the proportion of the tracks in each channel to which a pin can be connected. In Ref. [Betz99], it is shown that Fc = 0.25–0.5 (depending on the type of switch block employed) works well.
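As a small worked example of Fc, the sketch below converts the fraction into a track count for a channel of W tracks and then picks an evenly spaced subset of tracks for a pin. Rounding up and the staggered selection rule are assumptions of this sketch; real connection block patterns are chosen by the architect and may differ. The function names are hypothetical.

```python
import math


def tracks_per_pin(fc, w):
    """Number of tracks in a W-track channel that one pin can reach."""
    return math.ceil(fc * w)  # rounding up is an assumption


def pin_tracks(pin_index, fc, w):
    """One hypothetical choice: evenly spaced tracks, offset by the pin index."""
    n = tracks_per_pin(fc, w)
    step = w / n
    return sorted({(pin_index + int(j * step)) % w for j in range(n)})


W = 20
print(tracks_per_pin(0.25, W))  # 5
print(pin_tracks(0, 0.25, W))   # [0, 4, 8, 12, 16]
print(pin_tracks(1, 0.25, W))   # [1, 5, 9, 13, 17]
```

Offsetting the pattern per pin spreads the pins of a logic block across different tracks, which is one plausible way to avoid all pins competing for the same wires.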
45.4.4 BUS-BASED ROUTING ARCHITECTURES
FPGAs are often used to implement datapath-intensive circuits, in which many signals are part of wide buses. Because each bit of a bus is connected in the same way, it has been suggested that a datapath routing architecture, in which a single configuration bit controls multiple switches, will lead to an improvement in FPGA density. In Ref. [Ye05], the architecture in Figure 45.12 is presented. In this architecture, some of the tracks (the top four in Figure 45.12) are dedicated bus-based routing tracks, and connections to them are controlled by a bus switch; a bus switch contains one switch for each bit, all controlled by a single configuration cell. In this case, each bus (and each bus switch) is 4 bits wide. The lower tracks are regular bit-based routing tracks, which are connected to each other and to the logic cells using standard connection and switch blocks, as described above. In Ref. [Ye05], it is shown that a bus width of 4 works well, and that 40–50 percent of the tracks should be buses (with the remainder being bit-based routing tracks). It is shown that this results in a density improvement of 9.6 percent compared to a conventional architecture.
FIGURE 45.12 Bus-based routing architecture. (Labels: 4-bit bus; bit-based routing tracks; P = configuration cell.)
45.4.5 PIPELINED INTERCONNECT ARCHITECTURES
In deep-submicron technologies, the delay of long wires can limit the clock speed of the circuit implemented on an FPGA. To address this, several authors have proposed pipelined interconnect architectures [Singh01a,Singh01b,Weaver04]. In these architectures, some of the interconnect switches contain registers. This results in additional complexity for the router, however, because it must now balance the number of registers on each path.
45.5 MEMORIES
Today, FPGAs are often used to implement entire systems. These systems often require storage. Although it is possible to implement storage off-chip, on-chip storage has a number of advantages. On-chip storage reduces system costs, allows for a wider, faster memory interface, and reduces I/O demands on the FPGA.
There are two ways of implementing memory on FPGAs: embedded memory and distributed memory. Embedded memory solutions offer a number of relatively large, fixed, dedicated memory blocks on the FPGA. Distributed memory, on the other hand, uses small memories spread across the entire FPGA chip, often implemented in unused logic elements.
45.5.1 EMBEDDED MEMORY
Most FPGAs contain embedded memory blocks (EMBs). EMBs are typically arranged in columns or rows to simplify connections to logic and between other EMBs [Wilton99], as shown in Figure 45.13. Altera’s Stratix and Stratix II devices include three different sized EMBs: 512 bits, 4 Kbits, and 512 Kbits [Altera05]. Xilinx’s Virtex-4, Virtex-II, and Spartan series contain 18-Kbit EMBs [Xilinx05a]. Actel’s ProASIC3 and ProASIC-Plus contain 4-Kbit and 2-Kbit EMBs, respectively [Actel05].
Each EMB has a fixed number of bits, but its aspect ratio can be configured by the user. For example, in the Stratix II architecture, a 4-Kbit EMB may be configured to act as a memory with an aspect ratio of 4096 × 1, 2048 × 2, 1024 × 4, 512 × 8, 256 × 16, or 128 × 32. On many devices, EMBs can be configured to act as a ROM, single-port RAM, or dual-port RAM. In addition, they typically include parity bits and various enable/reset control signals, and have synchronous inputs with synchronous or asynchronous outputs.
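The list of aspect ratios quoted above follows directly from keeping the bit count fixed while doubling the word width. A minimal sketch, assuming power-of-two widths up to a maximum width (32 in the 4-Kbit example); the function name `aspect_ratios` is hypothetical.

```python
def aspect_ratios(total_bits, max_width):
    """List (depth, width) pairs with power-of-two widths and a fixed bit count."""
    ratios = []
    width = 1
    while width <= max_width:
        if total_bits % width == 0:
            ratios.append((total_bits // width, width))
        width *= 2
    return ratios


# A 4-Kbit block with widths from 1 to 32 bits.
for depth, width in aspect_ratios(4096, 32):
    print(f"{depth} x {width}")
```

Each doubling of the width removes one address bit and adds data lines, so the wider configurations trade depth for bandwidth without changing the storage capacity.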
Of particular importance is the interface between the memory and the logic. Figure 45.14 shows one published scheme; in this architecture, each EMB connects to the logic through a memory-logic interconnect block [Wilton99]. Figure 45.15 shows the contents of one of these memory-logic
FIGURE 45.13 Logic and memory in an FPGA. (Labels: logic blocks, memory arrays.)
FIGURE 45.14 Memory/logic interconnect architecture. (Labels: logic block, memory block, memory/logic interconnect block.)