Handbook of algorithms for physical design automation part 97 pdf



to map a circuit to an application-specific integrated circuit (ASIC). The next chapter will describe the physical design algorithms for FPGAs; this chapter sets the stage by describing the architecture of FPGAs. Section 45.2 describes several programming technologies, Section 45.3 describes logic block architectures, Section 45.4 describes routing architectures, and Sections 45.5 and 45.6 describe embedded memories and embedded computation blocks.

45.2 PROGRAMMING TECHNOLOGIES

The circuit being implemented on an FPGA is stored in the FPGA using a set of configuration bits. These bits can be constructed in various ways; this section describes static random access memory (SRAM), Flash, and antifuse-based configuration bits. These schemes are all used in contemporary commercial FPGAs. Many FPGA vendors, such as Xilinx, Altera, and Lattice, use SRAM configuration bits to control the programmable switches that configure routing and logic [Altera05,Lattice05,Xilinx05a]. Actel produces both Flash and antifuse FPGA products [Actel05a]. QuickLogic uses antifuse technology in its products [Quick05]. Table 45.1 compares these three technologies; details on each are provided below. FPGAs based on emerging technologies have also been described [Ferrera04,Dehon05], but because they are not yet commercially available, they will not be discussed further here.

45.2.1 SRAM-BASED FPGAS

The most popular scheme to implement configuration bits is to use SRAM cells. SRAM technology is fast and allows for reprogrammability. In addition, SRAM bits can be implemented using standard complementary metal-oxide-semiconductor (CMOS) processes, meaning FPGAs using SRAMs can be implemented in leading-edge processes. Figure 45.1 shows a typical six-transistor SRAM memory cell. It uses the data bit in both the true and complement forms to achieve fast read and write times [Trimberger94]. Although a six-transistor cell is generally more stable because it is resistant to state flipping owing to crosstalk or charge sharing [Betz99], four-transistor and five-transistor SRAM cells are possible. Xilinx uses a five-transistor SRAM cell for its FPGAs [Trimberger94].

The main disadvantage of SRAM is its volatility. Data stored in SRAM cells is erased when the power is turned off. Therefore, additional off-chip memory, such as electrically erasable programmable read-only memory (EEPROM), is necessary to store the configuration bits and program the FPGA at power-up. This potentially causes security concerns, because designs can be copied by capturing the external bit stream [Zeidman02]. To address this, some FPGA vendors, such as Altera and Lattice, use on-chip Flash memory to store the configuration bits, so the SRAM-based FPGA can be programmed without external memory upon power-up. A second disadvantage of this technology is that SRAM cells are susceptible to neutron-induced errors, also known as soft errors, which are

TABLE 45.1 Comparison among SRAM, Antifuse, and Flash

Features                  SRAM    Flash    Antifuse
In-system programmable    Yes     Yes      No
Power consumption         High    Lower    Lower
Soft-error resistance     Low     High     High


FIGURE 45.1 Six-transistor SRAM cell. (Signals: Data and Data_bar; Program line, asserted during the configuration phase; Load line, which loads the value for data during the configuration phase; Load line_bar, which loads the value for data_bar during the configuration phase.)

caused by neutrons, alpha particles, and cosmic or terrestrial radiation. These errors are common in high-radiation environments, such as at high altitude or in space. Such errors do not permanently damage the FPGA, but they may cause instability and functional failure in the system. The main strategies to overcome these errors in SRAM-based FPGAs are triple redundancy, error-correcting or parity codes, and redundancy in time.
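
Two of the mitigation strategies named above can be illustrated with a minimal Python sketch. It assumes three redundant copies of an 8-bit configuration word and a single even-parity bit; the function names and the example word are hypothetical and do not describe any vendor's scheme.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies (triple redundancy)."""
    return (a & b) | (a & c) | (b & c)


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1s in the word even."""
    return bin(word).count("1") & 1


golden = 0b10110010              # hypothetical stored configuration word
upset = golden ^ 0b00001000      # one copy suffers a single-bit soft error
voted = majority_vote(golden, upset, golden)   # the two good copies outvote the bad one
parity_changed = even_parity(golden) != even_parity(upset)  # parity detects the flip
```

Voting masks the error without locating it, whereas a single parity bit only detects an odd number of flipped bits and cannot correct them.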

45.2.2 FLASH-BASED FPGAS

Flash cells provide nonvolatile programmability while retaining the ability to reprogram the FPGA. Figure 45.2 illustrates the Flash switch used in Actel’s ProASIC3. In the Flash switch, two transistors share the floating gate, which stores the programming data. The sensing transistor is used for writing and verifying the floating gate voltage, while the switching transistor is used to configure routing nets and logic. Flash-based FPGAs are more secure and consume less power than their SRAM

FIGURE 45.2 Flash-based switch. (Labels: Wordline, Switch input, Switch output, Sensing transistor, Switching transistor, Floating gate.)


counterparts [Actel05a]. However, the manufacturing process for Flash is more complicated than that of SRAM. As a result, Flash technology usually lags one to two process generations behind SRAM technology. Testing is also lengthy owing to the nature of Flash. Therefore, Flash-based FPGAs have a slower time-to-market than SRAM-based FPGAs.

45.2.3 ANTIFUSE-BASED FPGAS

Antifuses can also be used to implement configuration bits [Actel05b]. An antifuse is a thin insulating layer between conductors. Applying a high voltage alters the insulating layer, creating a low-resistance path between the conductors. This alteration is irreversible. Like Flash, antifuse technology is nonvolatile. The major disadvantage of an antifuse FPGA is its one-time programmability. However, it consumes less power and is more area-efficient than SRAM and Flash.

45.3 LOGIC BLOCK ARCHITECTURES

Programmability is provided in an FPGA in two ways. Logic is implemented in configurable logic blocks; these logic blocks are then connected to each other and to the I/O pads using a configurable routing network [Rose93,Betz99]. This section focuses on logic blocks and the next section focuses on the routing network.

45.3.1 LOOKUP-TABLES

Most FPGAs use lookup-tables (LUTs) as their basic logic element. A K-input LUT (K-LUT) is a memory with 2^K bits, K address lines, and a single output line. Each K-LUT can be configured to implement any function of K inputs by storing the truth table of the desired function in the 2^K storage bits. Figure 45.3 shows an example of a 2-input LUT implemented using SRAM cells (antifuse and Flash memory cells could also be used).
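
The behavior of a K-LUT can be modeled in a few lines of Python. The sketch below is illustrative only: the class name, the method names, and the choice of the first input as the most significant address bit are assumptions, not part of any FPGA architecture definition.

```python
from itertools import product


class LUT:
    """A K-input lookup table: 2**K storage bits addressed by the K inputs."""

    def __init__(self, k: int):
        self.k = k
        self.bits = [0] * (2 ** k)   # unprogrammed: every storage bit is 0

    def _address(self, inputs) -> int:
        # Assumption: the first input is the most significant address bit.
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return addr

    def program(self, func) -> None:
        """Store the truth table of func, a callable taking K single-bit arguments."""
        for inputs in product((0, 1), repeat=self.k):
            self.bits[self._address(inputs)] = int(bool(func(*inputs)))

    def read(self, *inputs) -> int:
        """Return the stored bit selected by the input values."""
        return self.bits[self._address(inputs)]


and2 = LUT(2)
and2.program(lambda a, b: a & b)   # a 2-input LUT programmed as an AND gate
```

After programming, `and2.bits` holds the truth table of the AND function, and evaluating the function is a single memory read regardless of which function was stored.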

Early research showed that K = 4 works well; this is used in most commercial FPGAs [Rose90,Singh92]. Later work reconfirmed that K = 4 is a good choice for area, but that for performance, K = 7 works well [Ahmed04]. In general, the parameter K has a significant impact on the efficiency of the architecture. If K is too large, it may not be possible to completely fill each logic block, while if K is too small, delay will suffer because more logic blocks will be needed along the critical path of a circuit. Figure 45.4 shows how a 6-input function might be implemented with two 4-LUTs; had a 6-LUT been used, only one LUT would be required.
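
The trade-off can be made concrete with a small Python sketch that cascades two 4-LUTs. The 6-input AND used here is a hypothetical example chosen because it decomposes cleanly; not every 6-input function can be split across two 4-LUTs in this way, and the helper names are assumptions.

```python
from itertools import product


def make_lut(k, func):
    """Truth table of func as a list of 2**k bits, first input as the MSB."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def read_lut(table, *bits):
    """Look up the stored bit addressed by the input values."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | b
    return table[addr]


# First 4-LUT computes a partial result; second 4-LUT combines it with E and F.
first = make_lut(4, lambda a, b, c, d: a & b & c & d)
second = make_lut(4, lambda x, e, f, unused: x & e & f)   # fourth input unused


def six_input_and(a, b, c, d, e, f):
    x = read_lut(first, a, b, c, d)       # first LUT level
    return read_lut(second, x, e, f, 0)   # second LUT level on the critical path
```

The signal passes through two LUTs in series, which is the extra delay the text attributes to a K that is too small; a single 6-LUT would need one lookup but 64 storage bits instead of 32.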

Variations on the basic LUT architecture have been used. Figure 45.5 shows a logic block that employs a fracturable LUT mask (FLM) [Lewis05]. A k,m-FLM can implement a single k-input function or two functions, each with up to k − 1 inputs, which together use no more than k + m distinct inputs. The architecture in Figure 45.5a is a 6,2-FLM. An extension of the FLM architecture, called a shared LUT mask (SLM) architecture, is shown in Figure 45.5b. A k,m-SLM can implement two identical functions of k inputs provided that the two functions share k − m inputs. The SLM

FIGURE 45.3 Two-input LUT: unprogrammed, and programmed as a two-input AND gate. (When programmed, the SRAM cells hold 0, 0, 0, 1 for input addresses 00, 01, 10, 11.)


FIGURE 45.4 Implementing a 6-input function using two 4-LUTs. (Inputs A through F, output Out; shown with 4-input LUTs and, for comparison, a 6-input LUT.)

FIGURE 45.5 Advanced logic block structures. (Panel contents: 4-LUTs and 3-LUTs with fixed input routing and a MUX network; inputs A, B, C0, C1, D0, D1, DC0, DC1, E0, E1, F0, F1; outputs Z0, Z1, Z2.)

architecture does this through the sharing of LUT masks (the set of configuration bits that indicate the function implemented by the LUT) so that both functions are the same but can have different inputs. The logic block in the Altera Stratix II FPGA is based on a 6,2-SLM [Altera05].
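
The input-count conditions given above for a k,m-FLM can be written as a short check. This Python sketch tests only the number of inputs; it does not model the LUT mask itself, and the function name and set-based interface are assumptions made for illustration.

```python
def fits_in_flm(k: int, m: int, inputs_f: set, inputs_g: set = None) -> bool:
    """Check the k,m-FLM input-count conditions stated in the text."""
    if inputs_g is None:
        return len(inputs_f) <= k                       # one function of up to k inputs
    each_small = len(inputs_f) <= k - 1 and len(inputs_g) <= k - 1
    total_ok = len(inputs_f | inputs_g) <= k + m        # distinct inputs overall
    return each_small and total_ok


# Two 5-input functions sharing A and B use 8 distinct inputs: fits a 6,2-FLM.
shares_two = fits_in_flm(6, 2, set("ABCDE"), set("ABFGH"))
# Sharing only A gives 9 distinct inputs, one more than k + m = 8.
shares_one = fits_in_flm(6, 2, set("ABCDE"), set("AFGHI"))
```

A packing tool would need further checks beyond this count, but the sketch shows why shared inputs determine whether two functions can occupy one fracturable block.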

Lookup-tables are usually coupled with flip-flops, as shown in Figure 45.6. In this structure, a configuration bit is used to control the state of the output multiplexer. Depending on the value of this configuration bit, the output signal of the LUT can be either registered or unregistered. As in Ref. [Betz99], we refer to the LUT and flip-flop as a basic logic element (BLE).
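
A behavioral sketch of the BLE in Python follows. It assumes a rising-edge flip-flop that resets to 0 and a truth table indexed with the first input as the most significant bit; the class and method names are hypothetical.

```python
class BLE:
    """LUT + D flip-flop + output multiplexer selected by one configuration bit."""

    def __init__(self, truth_table, registered: bool):
        self.truth_table = list(truth_table)   # the 2**K LUT storage bits
        self.registered = registered           # configuration bit for the output mux
        self.ff = 0                            # flip-flop state (assumed reset value)

    def _lut(self, inputs) -> int:
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return self.truth_table[addr]

    def output(self, inputs) -> int:
        """BLE output for the current inputs and flip-flop state."""
        return self.ff if self.registered else self._lut(inputs)

    def clock(self, inputs) -> None:
        """Clock edge: the flip-flop captures the LUT output."""
        self.ff = self._lut(inputs)


comb = BLE([0, 0, 0, 1], registered=False)   # unregistered AND gate
reg = BLE([0, 0, 0, 1], registered=True)     # registered AND gate
```

In the registered configuration the output changes only after `clock` is called, whereas the unregistered configuration follows the LUT immediately.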

45.3.1.1 Clusters

To increase speed and reduce area and compile time, larger logic blocks are preferred. However, LUT complexity grows exponentially with the number of inputs [Rose93]. Clusters are logic blocks of larger granularity, typically composed of multiple BLEs, internal cluster routing, and possibly specialized internal cluster connections, such as carry and arithmetic chains [Marquardt00]. Within a cluster, BLE inputs are typically connected to the cluster inputs and BLE outputs by a multiplexer-based crossbar. This internal interconnect is generally faster than the general-purpose routing between

FIGURE 45.6 LUT coupled with a flip-flop (BLE). (Labels: Clock, flip-flop.)


FIGURE 45.7 Basic BLE and basic cluster composed of identical BLEs. (A logic cluster with I inputs, a clock input, and N outputs, containing BLE #1 through BLE #N.)

blocks. Altera refers to clusters as logic array blocks (LABs), while Xilinx refers to clusters as configurable logic blocks (CLBs).

Figure 45.7 shows a typical cluster. The cluster architecture is described by four parameters: (1) K, the number of inputs to a LUT; (2) N, the number of BLEs in a cluster; (3) I, the number of inputs to the cluster that connect to LUT inputs; and (4) Mclk, the number of clock inputs to a cluster (most studies assume this is 1).

Increasing K or N increases the functionality of the cluster. This reduces the number of blocks needed to implement circuits and the number of blocks on the critical path, but increases the size of the block and makes the local cluster interconnect slower. Research has found that K = 4–6 and N = 3–10 provide the best combined speed and area [Ahmed04].

The value of I is often smaller than K × N, because BLEs often share inputs or use the outputs from BLEs within the cluster. Smaller values of I use smaller multiplexers in the crossbar, reducing area, but overly small I values make some BLEs unusable. Research has found that 98 percent utilization can be achieved when I = (K/2) × (N + 1) [Ahmed04].
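
As a worked instance of this rule, the Python sketch below compares it with the fully populated case I = K × N. The values K = 4 and N = 10 are example choices within the ranges quoted above, and how the result should be rounded for odd K is not stated in the text, so the function returns the unrounded value.

```python
def cluster_inputs(k: int, n: int) -> float:
    """Cluster input count from the rule I = (K/2) * (N + 1)."""
    return (k / 2) * (n + 1)


k, n = 4, 10                     # example values, not a specific device
i_rule = cluster_inputs(k, n)    # inputs suggested by the rule
i_full = k * n                   # inputs if every LUT pin had its own cluster input
```

For these values the rule gives 22 cluster inputs against 40 for the fully populated case, which is where the crossbar area saving comes from.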

45.3.1.2 Carry Chains

Carry chains are locally routed connections that aid in the efficient implementation of arithmetic operations. They can also be used in the efficient implementation of logical operations, such as parity and comparison. Fast carry chains are important because the critical path for these operations is often through the carry.

Each 4-LUT in a BLE can be fractured to implement two 3-LUTs; this is sufficient to implement both the sum and carry, given two input bits (a and b) and a carry input, as shown in Figure 45.8. The carry-out signal from one BLE would typically be connected to the carry-in of an adjacent BLE using a fast dedicated connection. The Z-input is used to break the carry chain before the first bit of an addition.
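
The sum and carry functions held in the two 3-LUTs, and the ripple of the carry from one BLE to the next, can be sketched in Python as follows. The Z-input is not modeled; the chain is simply started with a carry of 0, and all names are illustrative assumptions.

```python
from itertools import product

# Truth tables of the two 3-LUTs, indexed by (a, b, carry_in) with a as the MSB.
SUM_LUT = [a ^ b ^ c for a, b, c in product((0, 1), repeat=3)]
CARRY_LUT = [(a & b) | (a & c) | (b & c) for a, b, c in product((0, 1), repeat=3)]


def ble_add(a: int, b: int, carry_in: int):
    """One BLE: both 3-LUTs are read at the same (a, b, carry_in) address."""
    addr = (a << 2) | (b << 1) | carry_in
    return SUM_LUT[addr], CARRY_LUT[addr]


def ripple_add(x: int, y: int, width: int) -> int:
    """Chain `width` BLEs; the carry into bit 0 is fixed at 0."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = ble_add((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)   # final carry-out becomes the top bit
```

Each iteration depends on the carry produced by the previous one, which is why the carry path, rather than the sum logic, sets the critical path of the adder.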

More complex carry schemes have been described. In Ref. [Hauck00], carry chains based on carry select, variable block, and Brent–Kung schemes are described; the Brent–Kung scheme is shown to be 3.8 times faster than the simple ripple carry adder in Figure 45.8. Support for carry-lookahead adders is included in the Actel Axcelerator device and the Xilinx Virtex-II, Virtex-II Pro,


FIGURE 45.8 Carry chain connections to a 4-LUT. (Signals: a, b, Z, P, Carry in, Carry out, Sum out.)

and Virtex-4 devices. Carry select capabilities are included in the Altera Stratix FPGAs. The Altera Stratix-II contains two dedicated 1-bit adders in each logic block. Because high-fanin arithmetic can cause routing congestion in a small area of the device, both Xilinx and Altera parts support two independent carry chains in each cluster. This allows for narrower fanin logic, which helps reduce routing congestion around the adders.

45.3.2 NON-LUT-BASED LOGIC BLOCKS

Not all FPGAs contain logic blocks based on LUTs. The Actel ProASIC3 logic blocks contain a set of multiplexers, which allow for the implementation of 3-input combinational or sequential functions in each logic block [Actel05a]. The QuickLogic Eclipse II logic cell contains two 6-input AND gates, four 2-input AND gates, and seven two-to-one multiplexers [Quick05]. The use of universal logic modules as FPGA logic blocks has also been proposed; these blocks can implement any function of their inputs by applying input permutation and negation [Lin94]. Finally, programmable devices using more coarse-grained logic blocks exist; these logic blocks are typically arithmetic/logic units and are suitable for computationally intensive applications [Ebeling96,Goldstein00,Singh00,Mei03].

45.4 ROUTING ARCHITECTURES

Connections between logic blocks are implemented using fixed prefabricated metal tracks. These tracks are arranged in channels; channels typically run vertically and horizontally, forming a grid [Lemieux04a]. Although many academic studies have assumed that all channels contain the same number of tracks [Betz99], many commercial architectures (such as those from Altera) contain more tracks in each horizontal channel than in each vertical channel. Figure 45.9 shows an FPGA with tracks arranged in horizontal and vertical channels.

45.4.1 SEGMENTATION

Tracks within a channel can span one logic block or multiple logic blocks. Typically, not all tracks within a channel will be of the same length. Several studies have investigated the optimum segment length. In Ref. [Brown96], a heterogeneous routing architecture, in which some tracks span three logic blocks, some span two logic blocks, and some span one logic block, is found to work well. In Ref. [Betz99], it is shown that longer wires result in a more efficient architecture; the authors suggest that a homogeneous architecture in which all tracks span either four or eight logic blocks gives the most efficient FPGA.


FIGURE 45.9 Overall routing architecture. (Labels: Logic block, Routing track, Switch block, Connection block, Vertical channel.)

45.4.2 PROGRAMMABLE SWITCHES

The tracks are connected to each other and to the logic blocks using programmable switches. These programmable switches can be buffered or unbuffered, as shown in Figure 45.10. Switches in modern FPGAs are typically buffered, because unbuffered switches result in a quadratic increase in delay for long connections. Buffered switches can be bidirectional, as shown in Figure 45.10b, or unidirectional, as shown in Figure 45.10c. Although many academic studies assume bidirectional switches [Betz99], most modern FPGAs contain unidirectional switches [Lemieux04b]; these switches allow for better delay optimization and result in a denser routing fabric.
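
The quadratic growth can be seen from a first-order Elmore delay estimate. The Python sketch below assumes a lumped model in which every switch-plus-wire stage contributes the same resistance r and capacitance c, and every buffer adds a fixed delay t_buf; these parameters and the unit values used are hypothetical.

```python
def unbuffered_delay(n: int, r: float, c: float) -> float:
    """Elmore delay of n identical RC stages connected by unbuffered switches."""
    # The capacitance at node j is charged through j series resistances,
    # so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n + 1)/2.
    return sum(j * r * c for j in range(1, n + 1))


def buffered_delay(n: int, r: float, c: float, t_buf: float) -> float:
    """Each buffer isolates its stage, so the stage delays simply add."""
    return n * (r * c + t_buf)


short_path = unbuffered_delay(4, 1.0, 1.0)      # 4 stages
long_path = unbuffered_delay(8, 1.0, 1.0)       # doubling the length
long_buffered = buffered_delay(8, 1.0, 1.0, 0.5)
```

Doubling the number of unbuffered stages from 4 to 8 raises the estimate from 10 to 36 time units, whereas the buffered estimate grows only linearly with the number of stages.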

45.4.3 SWITCH BLOCKS AND CONNECTION BLOCKS

Tracks are connected to each other using switch blocks, and to logic blocks using connection blocks. Commercial FPGAs often contain combined switch blocks and connection blocks; however, for clarity, this section describes each separately.

FIGURE 45.10 Programmable switches: (a) unbuffered, (b) buffered bidirectional, and (c) buffered unidirectional.


FIGURE 45.11 Switch block patterns.

A switch block lies at the intersection of each horizontal and vertical channel, and can connect each incident track to some number of other incident tracks. Academic work uses the notation Fs to describe the number of outgoing tracks to which each incoming track can be connected [Rose93]. Most physical design algorithm studies assume Fs = 3; in this case, each incoming track can be connected to one track on each of the other three sides of the switch block. The switch pattern determines the Fs tracks to which each incoming track can be connected. Academic work has proposed the three switch patterns in Figure 45.11. The disjoint pattern divides the routing fabric into domains; if there are W tracks in each channel, there are W domains. This simplifies the routing task and results in an efficient layout. The universal pattern has been shown to support the largest number of simultaneous connections through each switch block [Chang96], while the Wilton block has been shown to result in good overall routability [Wilton97]. An extension of the Wilton block to architectures with different segment lengths is described in Ref. [Masud99].
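
The disjoint pattern is the simplest of the three to write down. The Python sketch below assumes the usual formulation in which track i on one side connects only to track i on each of the other three sides (Fs = 3); the side names and function name are illustrative, and the universal and Wilton patterns are not modeled.

```python
SIDES = ("left", "top", "right", "bottom")


def disjoint_connections(side: str, track: int, width: int):
    """Outgoing (side, track) pairs reachable from one incoming track, Fs = 3."""
    if side not in SIDES:
        raise ValueError("unknown side")
    if not 0 <= track < width:
        raise ValueError("track index out of range")
    # Disjoint pattern: track i only ever meets track i on the other sides.
    return [(other, track) for other in SIDES if other != side]


W = 4                                             # tracks per channel (example)
reachable = disjoint_connections("left", 2, W)    # where can left track 2 go?
```

Because the track index never changes, a signal that enters the fabric on track 2 stays on track 2 through every switch block, which is the sense in which W tracks form W separate domains.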

In Ref. [Sivaswamy05], it is proposed that some of the connections in a switch block should be hard-wired (nonprogrammable); this gives a 30 percent speedup, a slight reduction in area, and an 8 percent reduction in power.

Connection blocks are used to connect logic block pins to the routing tracks. Each logic block pin can be connected to a subset of routing tracks in the neighboring channel. The quantity Fc indicates the proportion of the tracks in each channel to which a pin can be connected. In Ref. [Betz99], it is shown that Fc = 0.25–0.5 (depending on the type of switch block employed) works well.

45.4.4 BUS-BASED ROUTING ARCHITECTURES

FPGAs are often used to implement datapath-intensive circuits, in which many signals are part of wide buses. Because each bit of a bus is connected in the same way, it has been suggested that a datapath routing architecture, in which a single configuration bit controls multiple switches, will lead to an improvement in FPGA density. In Ref. [Ye05], the architecture in Figure 45.12 is presented. In this architecture, some of the tracks (the top four in Figure 45.12) are dedicated bus-based routing tracks, and connections to them are controlled by a bus switch; a bus switch contains one switch for each bit, all controlled by a single configuration cell. In this case, each bus (and each bus switch) is 4 bits wide. The lower tracks are regular bit-based routing tracks, which are connected to each other and to the logic cells using standard connection and switch blocks, as described above. In Ref. [Ye05], it is shown that a bus width of 4 works well, and that 40–50 percent of the tracks should be buses (with the remainder being bit-based routing tracks). It is shown that this results in a density improvement of 9.6 percent compared to a conventional architecture.
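
The saving in configuration cells can be illustrated with a rough count. The Python sketch below is a simplification that counts only configuration cells, assuming a hypothetical fabric of 1000 programmable switches; it does not reproduce the area model behind the 9.6 percent figure, and the function name is an assumption.

```python
def config_cells(num_switches: int, bus_fraction: float, bus_width: int) -> int:
    """Configuration cells needed when some switches are grouped into bus switches."""
    bus_switches = int(num_switches * bus_fraction)
    bus_switches -= bus_switches % bus_width          # only whole buses share a cell
    bit_switches = num_switches - bus_switches        # one cell per bit-based switch
    return bit_switches + bus_switches // bus_width   # one cell per bus switch


cells_conventional = config_cells(1000, 0.0, 4)   # every switch has its own cell
cells_bus_based = config_cells(1000, 0.5, 4)      # half the switches in 4-bit buses
```

With half of the switches grouped into 4-bit bus switches, the count falls from 1000 cells to 625, since each group of four switches shares one cell.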


FIGURE 45.12 Bus-based routing architecture. (Labels: 4-bit bus; bit-based routing tracks; P = configuration cell.)

45.4.5 PIPELINED INTERCONNECT ARCHITECTURES

In deep-submicron technologies, the delay of long wires can limit the clock speed of the circuit implemented on an FPGA. To address this, several authors have proposed pipelined interconnect architectures [Singh01a,Singh01b,Weaver04]. In these architectures, some of the interconnect switches contain registers. This results in additional complexity for the router, however, because it must now balance the number of registers on each path.
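
The balancing requirement can be stated as a small calculation. The Python sketch below considers only the simplest case, several paths converging on one block, and computes how many extra registers each path would need so that all signals arrive with the same latency; the path names and register counts are hypothetical, and the routers described in the cited work solve a harder problem.

```python
def balance_registers(path_registers: dict) -> dict:
    """Extra registers needed on each path so that all paths have equal latency."""
    target = max(path_registers.values())   # the slowest path sets the latency
    return {name: target - count for name, count in path_registers.items()}


# Hypothetical register counts on three routed paths into the same logic block.
padding = balance_registers({"a": 2, "b": 0, "c": 1})
```

Here path b crosses no registered switches and would need two added pipeline stages to stay aligned with path a.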

45.5 MEMORIES

Today, FPGAs are often used to implement entire systems. These systems often require storage. Although it is possible to implement storage off-chip, on-chip storage has a number of advantages. On-chip storage reduces system costs, allows for a wider, faster memory interface, and reduces I/O demands on the FPGA.

There are two ways of implementing memory on FPGAs: embedded memory and distributed memory. Embedded memory solutions offer a number of relatively large, fixed, dedicated memory blocks on the FPGA. Distributed memory, on the other hand, uses small memories spread across the entire FPGA chip, often implemented in unused logic elements.

45.5.1 EMBEDDED MEMORY

Most FPGAs contain embedded memory blocks (EMBs). EMBs are typically arranged in columns or rows to simplify connections to logic and between other EMBs [Wilton99], as shown in Figure 45.13. Altera’s Stratix and Stratix-II devices include three different sized EMBs: 512 bits, 4 Kbits, and 512 Kbits [Altera05]. Xilinx’s Virtex-4, Virtex-II, and Spartan series contain 18-Kbit EMBs [Xilinx05a]. Actel’s ProASIC3 and ProASIC-Plus contain 4-Kbit and 2-Kbit EMBs, respectively [Actel05].

Each EMB has a fixed number of bits, but its aspect ratio can be configured by the user. For example, in the Stratix II architecture, a 4-Kbit EMB may be configured to act as a memory with an aspect ratio of 4096 × 1, 2048 × 2, 1024 × 4, 512 × 8, 256 × 16, or 128 × 32. On many devices, EMBs can be configured to act as a ROM, single-port RAM, or dual-port RAM. In addition, they typically include parity bits and various enable/reset control signals, and have synchronous inputs with synchronous or asynchronous outputs.
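
The list of aspect ratios follows directly from holding the bit count fixed while doubling the word width. A minimal Python sketch, assuming power-of-two widths up to a maximum of 32 as in the example above (the function name is illustrative):

```python
def aspect_ratios(total_bits: int, max_width: int):
    """List (depth, width) pairs for power-of-two widths up to max_width."""
    ratios = []
    width = 1
    while width <= max_width:
        if total_bits % width == 0:          # depth must be a whole number of words
            ratios.append((total_bits // width, width))
        width *= 2
    return ratios


ratios_4k = aspect_ratios(4096, 32)   # the 4-Kbit example from the text
```

Each step halves the depth and doubles the width, so every configuration uses exactly the same 4096 storage bits.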

Of particular importance is the interface between the memory and the logic. Figure 45.14 shows one published scheme; in this architecture, each EMB connects to the logic through a memory-logic interconnect block [Wilton99]. Figure 45.15 shows the contents of one of these memory-logic


FIGURE 45.13 Logic and memory in an FPGA. (Labels: Logic blocks, Memory arrays.)

FIGURE 45.14 Memory/logic interconnect architecture. (Each memory block connects to the surrounding logic blocks through a memory/logic interconnect block.)
