Scalable Ternary Content Addressable Memory Implementation Using FPGAs
Weirong Jiang Xilinx Research Labs San Jose, CA, USA weirongj@acm.org
ABSTRACT
Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study of RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units, which saves logic resources. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.
Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures;
C.2.6 [Computer Communication Networks]: Internetworking
General Terms
Algorithms, Design, Performance
Keywords
FPGA; RAM; TCAM
1. INTRODUCTION
Ternary Content Addressable Memory (TCAM) is a specialized associative memory where each bit can be 0, 1, or "don't care" (i.e., "∗"). TCAM has been widely used in network infrastructure for various search functions including longest prefix matching (LPM), multi-field packet classification, etc. For each input key, TCAM performs a parallel search over all stored words and finds the matching word(s) in a single clock cycle. A priority encoder is needed to obtain the index of the matching word with the highest priority. In a TCAM, the physical location normally determines the priority, e.g., the top word has the highest priority.
Most current TCAMs are implemented as standalone application-specific integrated circuits (ASICs). We call them native TCAMs. Native TCAMs are expensive, power-hungry, and not scalable with respect to clock rate or circuit area, especially compared with Random Access Memories (RAMs). The limited configurability of native TCAMs does not fit the requirements of some network applications (e.g., OpenFlow [1]) where the width and/or the depth of different lookup tables can be variable [2]. Various algorithmic solutions have been proposed as alternatives to native TCAMs, but none of them is exactly equivalent to TCAM. The success of algorithmic solutions is limited to a few specific search functions such as exact matching and LPM. For some other search functions such as multi-field packet classification, the algorithmic solutions [3, 4] employ various heuristics, leading to nondeterministic performance that is often dependent on the characteristics of the data set.
On the other hand, reconfigurable hardware such as the field-programmable gate array (FPGA) combines the flexibility of software with near-ASIC performance. State-of-the-art FPGA devices such as Xilinx Virtex-7 [5] and Altera Stratix-V [6] provide high clock rates, low power dissipation, rich on-chip resources, and large amounts of embedded memory with configurable word width. Due to their increasing capacity, modern FPGAs have become an attractive option for implementing various networking functions [7, 8, 9, 10]. Compared with ASIC, FPGA technology is increasingly favorable because of its shorter time to market, lower development cost, and the shrinking performance gap between the two. Due to the demand for TCAM to be flexible to configure and easy to integrate, there has been a growing interest in employing FPGAs to implement TCAM or TCAM-equivalent search engines.
While there exist several FPGA-based TCAM designs, most of them are brute-force implementations that mimic the native TCAM architecture. Their resource usage is inefficient, which makes them less interesting in practice. On the other hand, some recent work [11, 12, 13] shows that RAMs can be employed to emulate/implement a TCAM.
But none of them gives a correctness proof or a thorough study of efficient FPGA implementation. Their architectures are monolithic and do not scale well in implementing large TCAMs. A goal of this paper is to advance FPGA-based TCAM designs by investigating both the theory and the architecture for RAM-based TCAM implementation. The main contributions include:
• We give an in-depth introduction to the RAM-based TCAM. We formalize the key ideas and the algorithms behind it. We analyze thoroughly the theoretical performance of the RAM-based TCAM and identify the key challenges in implementing a large RAM-based TCAM.
• We propose a modular and scalable architecture that consists of arrays of small-size RAM-based TCAM units. By decoupling the update logic from each unit, such a modular architecture enables each update engine to be shared among multiple units, thus saving logic resources.
• We share our experience in implementing the proposed architecture on a state-of-the-art FPGA. The post-place-and-route results show that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second (Mpps). To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.
• We conduct comprehensive experiments to characterize the various performance trade-offs offered by the configurable architecture. We also discuss the support of range matching without range-to-ternary conversion.
The rest of the paper is organized as follows. Section 2 gives a detailed introduction to the theoretic aspects of the RAM-based TCAM. Section 3 discusses the hardware architectures for scalable RAM-based TCAM. Section 4 presents the comprehensive evaluation results based on the implementation on a state-of-the-art FPGA. Section 5 reviews the related work on FPGA-based TCAM designs. Section 6 concludes the paper.
We first have the following definitions:
• The Depth of a TCAM (or RAM) is the number of words in the TCAM (or RAM), denoted as N.
• The Width of a TCAM (or RAM) is the width (i.e., the number of bits) of each TCAM (or RAM) word, denoted as W.
• The Size of a TCAM (or RAM) is the total number of bits of the TCAM (or RAM). It equals N × W.
• The address width of a RAM is the number of bits of the RAM address, denoted as d. Note that N = 2^d for a RAM.
We describe the organization of a TCAM or RAM as Depth × Width, i.e., N × W. For example, a 2 × 1 RAM consists of 2 words where each word is 1 bit. We call a TCAM or RAM wide (or narrow) if its width is large (or small). We call a TCAM or RAM deep (or shallow) if its depth is large (or small).
We also use the notation shown in Table 1.

Table 1: Notation

Notation   Description
k          An input key or a binary number
t          A ternary word
A          An alphabet for 1-bit characters
A^n        The set of all n-bit strings over A
|s|        The length of a string s ∈ A^n; |s| = n
A TCAM can be divided into two logical areas: (1) TCAM words and (2) the priority encoder. Each TCAM word consists of a row of matching cells attached to the same match line. During lookup, each input key goes to all the N words in parallel and retrieves an N-bit match vector. The i-th bit of the match vector indicates whether the key matches the i-th word, i = 1, 2, · · · , N. In this section, for ease of discussion, we consider a TCAM without the priority encoder. Thus the output of the considered TCAM is an N-bit match vector instead of the index of the matching word with the highest priority.
Looking up an N × W TCAM is basically mapping a W-bit binary input key to an N-bit binary match vector. The same mapping can be achieved by using a 2^W × N RAM, where the W-bit input key is used as the address to access the RAM and each RAM word stores an N-bit vector. Figure 1(a) shows a 1 × 1 TCAM and its corresponding RAM-based implementation. As the TCAM word stores a "don't care" bit, the match vector is always 1 no matter whether the input 1-bit key is 0 or 1.
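To make this mapping concrete, the following minimal Python sketch (a software emulation of the concept, not part of the hardware design; the helper names are ours) builds the 2^W-entry, 1-bit-wide RAM for a single ternary word written as a string over {0, 1, ∗}, anticipating Principle 1 below:

def matches(key_bits, t):
    # key_bits and t are W-character strings; '*' in t matches either key bit
    return all(tc in ('*', kc) for kc, tc in zip(key_bits, t))

def build_ram(t):
    # RAM[k] = 1 iff the W-bit key k matches the ternary word t
    W = len(t)
    return [1 if matches(format(k, '0' + str(W) + 'b'), t) else 0 for k in range(2 ** W)]

print(build_ram('*'))    # [1, 1]: the 1 x 1 TCAM of Figure 1(a), both keys match
print(build_ram('0*'))   # [1, 1, 0, 0]: keys 00 and 01 match, 10 and 11 do not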
2.2.1 Depth Extension
The depth of a native TCAM is increased by stacking words with the same width vertically. Correspondingly, in the RAM-based implementation, the depth of a TCAM is extended by increasing the width of the RAM. Each column of the RAM represents the match vector for a word. Figure 1(b) shows a 2 × 1 TCAM which adds a word to the TCAM shown in Figure 1(a). Correspondingly, the RAM-based implementation adds a column to the RAM shown in Figure 1(a). We see that the memory requirement of either the native TCAM or its RAM-based implementation is linear in the depth.
We can also view the depth extension as concatenating the match vectors from multiple "shallower" TCAMs. For instance, an N × W TCAM can be horizontally divided into two TCAMs: one is N_1 × W and the other N_2 × W, where N = N_1 + N_2. Then there are two RAMs in the corresponding RAM-based implementation: one is 2^W × N_1 and the other 2^W × N_2. The outputs of the two RAMs are concatenated to obtain the final N-bit match vector. This is essentially equivalent to building a wider RAM by concatenating two RAMs with the same depth. For the sake of simplicity, we consider a wide RAM built by concatenating multiple RAMs to be a single RAM.
Figure 1: (a) Matching a 1-bit key with a 1 × 1 TCAM; (b) Matching a 1-bit key with a 2 × 1 TCAM; (c) Matching a 2-bit key with a 1 × 2 TCAM.
2.2.2 Width Extension
A wider TCAM deals with a wider input key. When implementing the TCAM in a single RAM, a wider input key (which is used as the address to access the RAM) indicates a wider address width for the RAM. This results in a deeper RAM whose depth is 2^W. Figure 1(c) shows a 1 × 2 TCAM which extends the width of the TCAM shown in Figure 1(a). As the width of the input key is increased by 1 bit, the depth of the RAM in the corresponding RAM-based TCAM is doubled. Such a design cannot scale well for wide input keys.
An alternative solution is using multiple "narrow" TCAMs to implement a wide TCAM. For example, an N × W TCAM can be vertically divided into two TCAMs: one is N × W_1 and the other N × W_2, where W = W_1 + W_2. During lookup, a W-bit input key is divided into two segments accordingly: one is W_1 bits and the other W_2 bits. Each of the two "narrower" TCAMs matches the corresponding segment of the key and outputs an N-bit match vector. The two match vectors are then bitwise ANDed to obtain the final match vector. The two "narrow" TCAMs map to two "shallow" RAMs in the corresponding RAM-based implementation. The total number of RAM words becomes 2^{W_1} + 2^{W_2} instead of 2^W = 2^{W_1} · 2^{W_2}. For example, splitting a 32-bit key into two 16-bit segments requires 2 · 2^16 = 131072 RAM words in total rather than 2^32 words. Figure 2 shows how a 1 × 2 TCAM is built based on two 1 × 1 TCAMs.
2.2.3 Populating the RAM
Given a set of ternary words, we need to populate the RAMs so that the RAM-based implementation fulfills the same search function as the native TCAM. As shown in Figure 1(a), it is easy to populate the RAM for the RAM-based implementation of a 1 × 1 TCAM. Table 2 shows the content of the 2 × 1 RAM populated for the 1 × 1 TCAM, where RAM[k] denotes the RAM word at address k, k = 0, 1.
Figure 2: Building a 1 × 2 TCAM using two 1 × 1 TCAMs.
Table 2: Representing a ternary bit in RAM

The value of the   The value stored at
ternary bit        RAM[0]    RAM[1]
0                  1         0
1                  0         1
don't care         1         1
Principle 1 gives the rule for populating the 2 × 1 RAM to represent a 1-bit TCAM, where k ∈ {0, 1} and t ∈ {0, 1, ∗}.

Principle 1. RAM[k] = 1 if and only if k matches t; otherwise RAM[k] = 0.

Theorem 1. The RAM populated following Principle 1 achieves the equivalent function as the TCAM that stores t.

Proof. In the TCAM, the output for an input k is 1 ⇔ k matches t; otherwise, the output is 0. In the populated RAM, the output for an input k is 1 ⇔ RAM[k] = 1 ⇔ k matches t (by Principle 1); otherwise, the output is 0. Thus the populated RAM is equivalent to the represented TCAM.
Both Principle 1 and Theorem 1 are directly applicable to the case of a 1 × W TCAM where k ∈ A^W with A = {0, 1}, and t ∈ A^W with A = {0, 1, ∗}.
Principle 1 can be extended to the case of an N × W TCAM that is implemented in a 2^W × N RAM. Let RAM[k][i] denote the i-th bit of the k-th word in the RAM. Let t_i denote the i-th word in the TCAM, i = 1, 2, · · · , N. So we have Principle 2 for populating the 2^W × N RAM to represent an N × W TCAM, where k ∈ A^W with A = {0, 1} and t_i ∈ A^W with A = {0, 1, ∗}.

Principle 2. RAM[k][i] = 1 if and only if k matches t_i; otherwise RAM[k][i] = 0, i = 1, 2, · · · , N.

When a wide TCAM is built using multiple narrower TCAMs, the RAM corresponding to each narrow TCAM is populated individually by following Principle 2.
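As a concrete illustration of Principle 2, the short Python sketch below (an emulation with our own helper names, not the paper's implementation; word indices are 0-based here) populates the 2^W × N bit table for a list of N ternary words:

def populate(tcam_words):
    # tcam_words: N ternary strings of equal width W over {'0', '1', '*'}
    N = len(tcam_words)
    W = len(tcam_words[0])
    ram = [[0] * N for _ in range(2 ** W)]    # ram[k][i], per Principle 2
    for k in range(2 ** W):
        key = format(k, '0' + str(W) + 'b')
        for i, t in enumerate(tcam_words):
            if all(c in ('*', kc) for kc, c in zip(key, t)):
                ram[k][i] = 1
    return ram

ram = populate(['0*', '11'])
print(ram[0b01])    # [1, 0]: key 01 matches the word '0*' but not '11'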
2.3 Algorithms and Analysis
This section formalizes the algorithms for using a RAM-based TCAM that is built according to the discussion in Section 2.2.
2.3.1 General Model
Based on the discussion in Section 2.2.2, an N × W TCAM can be constructed using P narrow TCAMs, P = 1, 2, · · · , W. The size of the i-th TCAM is N × W_i, i = 1, 2, · · · , P, and W = Σ_{i=1}^{P} W_i. Let RAM_i denote the RAM corresponding to the i-th narrow TCAM, i = 1, 2, · · · , P. The size of RAM_i is 2^{W_i} × N, i = 1, 2, · · · , P. Hence the N × W TCAM can be implemented using these P RAMs.
2.3.2 Lookup
Algorithm 1 shows the algorithm to search a key over the RAM-based TCAM. It takes O(1) time to access each RAM. Since the P RAMs are accessed in parallel, the overall time complexity for lookup is O(1).
Algorithm 1 Lookup
Input: A W-bit key k
Input: {RAM_i}, i = 1, 2, · · · , P
Output: An N-bit match vector m
1: Divide k into P segments: k → {k_1, k_2, · · · , k_P}, |k_i| = W_i, i = 1, 2, · · · , P
2: Initialize m to be all 1's: m ← 11 · · · 1
3: for i ← 1 to P do {bitwise AND}
4:   m ← m & RAM_i[k_i]
5: end for
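A direct software rendering of Algorithm 1 (an illustrative sketch; the data layout, with each RAM row stored as an N-bit Python integer, is our own choice):

def lookup(key_bits, rams, widths):
    # key_bits: the W-bit key as a string; widths[i] = W_i
    # rams[i]: list of 2^{W_i} integers, each an N-bit match vector for one address
    m = None
    pos = 0
    for ram, w in zip(rams, widths):
        k_i = int(key_bits[pos:pos + w], 2)              # segment k_i used as the RAM address
        m = ram[k_i] if m is None else (m & ram[k_i])    # bitwise AND of the match vectors
        pos += w
    return m                                             # final N-bit match vector

# Example: the 2 x 2 TCAM {'0*', '11'} split into two 1-bit segments.
# Bit 0 of each vector corresponds to the word '0*', bit 1 to the word '11'.
ram1 = [0b01, 0b10]   # first key bit:  '0' matches '0*', '1' matches '11'
ram2 = [0b01, 0b11]   # second key bit: '0' matches '*',  '1' matches '*' and '1'
print(bin(lookup('01', [ram1, ram2], [1, 1])))   # 0b1: only '0*' matches key 01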
2.3.3 Update
Updating a TCAM can be either adding or deleting a specific TCAM word. Algorithm 2 shows the algorithm to add or delete the n-th word of the TCAM in the RAM-based implementation, where n = 1, 2, · · · , N. It takes O(2^{W_i}) time to update RAM_i, i = 1, 2, · · · , P. As the P RAMs are updated in parallel, the overall time complexity for update is determined by the RAM that takes the longest time to update, which is O(max_{i=1}^{P} 2^{W_i}) = O(2^{max_{i=1}^{P} W_i}).
Algorithm 2 Updating a TCAM word
Input: A W-bit ternary word t
Input: The index of t: n
Input: The update operation: op ∈ {add, delete}
Output: Updated {RAM_i}, i = 1, 2, · · · , P
1: Divide t into P segments: t → {t_1, t_2, · · · , t_P}, |t_i| = W_i, i = 1, 2, · · · , P
2: for i ← 1 to P do {Update each RAM}
3:   for k ← 0 to 2^{W_i} − 1 do
4:     if k matches t_i and op == add then
5:       RAM_i[k][n] = 1
6:     else
7:       RAM_i[k][n] = 0
8:     end if
9:   end for
10: end for
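The same operation in Python (again an emulation sketch using the integer-vector layout of the earlier lookup example): adding or deleting word n sets or clears bit n in every RAM row whose address matches the corresponding segment of t, and clears it everywhere else, exactly as in Algorithm 2.

def update(rams, widths, t, n, op):
    # rams[i]: list of 2^{W_i} N-bit integers; t: ternary string; op: 'add' or 'delete'
    pos = 0
    for ram, w in zip(rams, widths):
        t_i = t[pos:pos + w]                          # segment of the ternary word
        for k in range(2 ** w):
            key = format(k, '0' + str(w) + 'b')
            if op == 'add' and all(c in ('*', kc) for kc, c in zip(key, t_i)):
                ram[k] |= (1 << n)                    # set bit n
            else:
                ram[k] &= ~(1 << n)                   # clear bit n
        pos += w

rams = [[0, 0], [0, 0]]
update(rams, [1, 1], '0*', 0, 'add')
print(rams)    # [[1, 0], [1, 1]]: the bit-0 columns of ram1 and ram2 above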
2.3.4 Space Analysis
The size of RAM_i is 2^{W_i} × N, i = 1, 2, · · · , P. Hence the overall memory requirement is Σ_{i=1}^{P} (2^{W_i} × N) = N Σ_{i=1}^{P} 2^{W_i}. To minimize the overall memory requirement, we formulate the problem as:

  min Σ_{i=1}^{P} 2^{W_i} = min_{P=1,...,W} ( min_{{W_1, W_2, · · · , W_P}} Σ_{i=1}^{P} 2^{W_i} )    (1)

subject to

  Σ_{i=1}^{P} W_i = W.    (2)

For a given P, min_{{W_1, W_2, · · · , W_P}} Σ_{i=1}^{P} 2^{W_i} = P · 2^{W/P}, achieved when W_i = W/P, i = 1, 2, · · · , P. Hence the overall memory requirement is minimum when all the P RAMs have the same address width, denoted as w = W/P. The depth of each RAM is then 2^w, and the overall memory requirement is

  Σ_{i=1}^{P} (2^{W_i} · N) = Σ_{i=1}^{W/w} (2^w · N) = (W/w) · 2^w · N = N W 2^w / w.    (3)

We define the RAM/TCAM ratio as the number of RAM bits needed to implement a TCAM bit. According to Equation (3), the RAM/TCAM ratio is 2^w / w when all the RAMs employ the same address width w. Basically, a larger w results in a larger RAM/TCAM ratio, which indicates lower memory efficiency. The minimum RAM/TCAM ratio is 2, achieved when w = 1 (P = W) or w = 2 (P = W/2). In other words, when the depth of each RAM is 2 (w = 1) or 4 (w = 2), the overall memory requirement achieves its minimum of 2NW, i.e., twice the size of the corresponding native TCAM.
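As a quick numeric check of the ratio 2^w / w (illustrative arithmetic only), here are the address widths that matter later in the paper:

for w in (1, 2, 3, 5, 9):
    print(w, 2 ** w / w)
# 1 -> 2.0, 2 -> 2.0, 3 -> 2.67, 5 -> 6.4 (distributed RAM in Section 4), 9 -> 56.9 (BRAM)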
2.3.5 Comparison with Native TCAM
Table 3 summarizes the difference between the native TCAM and its corresponding implementation using P RAMs, with respect to time and space complexities. Here we assume all the RAMs employ the same address width w, so that both the update time and the space complexity achieve the optimum for the RAM-based TCAM (as discussed in Sections 2.3.3 and 2.3.4).
Table 3: Native TCAM vs. RAM-based TCAM

              Native TCAM   RAM-based TCAM
Lookup time   O(1)          O(1)
Update time   O(1)          O(2^w)
Space         NW            (2^w / w) NW
We are interested in implementing the RAM-based TCAM on FPGA. While the theoretical discussion in Section 2 excludes the priority encoder, the hardware architecture of the RAM-based TCAM must consider the priority encoder.
3.1 Basic Architecture
The theoretical model of the RAM-based TCAM implementation (discussed in Section 2.3.1) can be directly mapped to the hardware architecture shown in Figure 3. An N × W TCAM is implemented using P RAMs, where the size of the i-th RAM is 2^{W_i} × N and Σ_{i=1}^{P} W_i = W.
As illustrated in Algorithm 1, a lookup is performed by dividing the input W-bit key into P segments, where the length of the i-th segment is W_i, i = 1, 2, · · · , P. Each segment of the key is then used as the address to access the corresponding RAM. Each RAM outputs an N-bit vector. The P N-bit vectors are bitwise ANDed to generate the final match vector. The match vector is finally fed into the priority encoder to obtain the index of the matching word with the highest priority. A 1-bit Match signal is also generated to indicate whether there is any match.
Figure 3: Basic architecture (without update logic).
We add update logic to the RAM-based TCAM so that it can complete any update by itself at run time. In accordance with Algorithm 2, Figure 4 shows the logic for updating the RAM-based TCAM, where W_max = max_{i=1}^{P} W_i. We use two W-bit binary numbers, denoted as Data and Mask, to represent the W-bit ternary word t that is to be updated. The i-th bit of t is a "don't care" bit if and only if the i-th bit of Mask is set to 0, i = 1, 2, · · · , W. For example, the 2-bit ternary word 0∗ can be represented by Data = 00 or 01, and Mask = 10. Id specifies the index of the ternary word to be updated. Op indicates whether the ternary word is to be added (Op = 0) or deleted (Op = 1).
Figure 4: Update logic. "CMP": compare; "CHG": change.
Adding or deleting the Id-th ternary word t is accomplished by setting or clearing the Id-th bit in all the RAM words whose addresses match t. Meanwhile, we must keep the rest of the bits of these RAM words unchanged. Hence we need to read the original content of the RAM words, change only the Id-th bit, and then write the updated RAM words back to the RAM. This requires 2 · 2^w clock cycles to update a single-port RAM whose address width is w. To reduce the update latency, we utilize a simple dual-port RAM and perform the read and the write in the same clock cycle. A simple dual-port RAM has two input (address) ports and one output port; one input port is for Read only and the other is for Write only. At each clock cycle during an update, the update logic writes the updated RAM word to address k while reading the content of the RAM word at address k + 1. Hence the update latency becomes 2^w + 1 clock cycles, where the extra clock cycle is consumed to fetch the content of the first RAM word. Another part of the update logic is a state machine (not shown in Figure 4) that switches the state of the TCAM between lookup and update. During an update, no lookup is permitted and any match result is invalid.
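The cycle-by-cycle behavior can be sketched as follows (a software model of the read-modify-write schedule described above, with hypothetical names; the actual design is RTL). At every cycle the engine writes back the word fetched in the previous cycle while reading the next address, so a RAM of depth 2^w is updated in 2^w + 1 cycles:

def update_ram_word(ram, w, t_seg, n, op):
    # ram: list of 2^w N-bit integers; t_seg: the w-bit ternary segment for this RAM
    def new_bit(k):
        key = format(k, '0' + str(w) + 'b')
        match = all(c in ('*', kc) for kc, c in zip(key, t_seg))
        return 1 if (op == 'add' and match) else 0

    depth = 2 ** w
    prev = ram[0]                                    # cycle 0: read address 0
    for k in range(depth):                           # cycles 1 .. 2^w
        nxt = ram[k + 1] if k + 1 < depth else None  # read address k+1 ...
        if new_bit(k):
            ram[k] = prev | (1 << n)                 # ... while writing back address k
        else:
            ram[k] = prev & ~(1 << n)
        prev = nxt
    # total latency: 2^w + 1 cycles, matching the text above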
In implementing a large-scale RAM-based TCAM on FPGA, there are two main challenges:
• Throughput: When the TCAM is deeper or wider, the logic and routing complexities become larger, especially for bitwise-ANDing many wide bit vectors and for priority encoding a deep match vector. This results in significant degradation of the achievable clock rate, which determines the maximum throughput of the RAM-based TCAM.
• Resource usage: The on-chip resources of an FPGA device are limited. Hence we must optimize the architecture to save resources or use them efficiently. We need to find the best memory configuration based on the physical capability. It is also desirable to enable resource sharing between subsystems.
We propose a scalable and modular architecture that employs configurable small-size RAM-based TCAM units as building blocks. Both bit-vector bitwise-ANDing and priority encoding are performed in a localized and pipelined fashion so that high throughput is sustained for large TCAMs. We decouple the update logic from each unit so that a single update engine can be shared flexibly by multiple TCAM units. On-chip logic resources are thus saved. Note that such resource sharing is only possible in a modular architecture.
3.2.1 Overview
The top-level design consists of a grid of units, which are organized in multiple rows. Figure 5 shows the top-level architecture with R rows, each of which contains L units. The TCAM words with higher priority are stored in the units with lower index. The units within a row are searched sequentially in a pipelined fashion. Priority is resolved locally within each unit. After each row outputs a matching result, a global priority encoder is needed to select the one with the globally highest priority.
3.2.2 Unit Design
A TCAM unit is basically a U × W TCAM implemented in RAMs, where U is the number of TCAM words per unit. Figure 6 depicts the architecture of a TCAM unit. Each unit performs the local TCAM lookup and combines the local match result with the result from the preceding unit.
Figure 5: Top-level architecture.
As the unit index determines the priority, a matching TCAM word stored in a preceding unit always has a higher priority than the local matching one. The U × W TCAM is constructed using P RAMs based on the basic architecture shown in Section 3.1. We use the same address width w for all the P RAMs to achieve the maximum memory efficiency, as discussed in Section 2.3.4.
Figure 6: A unit.
When W is large, there are many RAMs, each of which outputs a U-bit vector. The throughput may degrade when bitwise-ANDing a large number of bit vectors. We divide a unit into multiple pipelined stages. Let H denote the number of stages in a unit. Then each stage contains P/H RAMs. Within each stage, the bit vectors generated by the P/H RAMs are bitwise-ANDed. The resulting U-bit vector is combined with the bit vector passed from the previous stage and then passed to the next stage. The last stage of the unit performs the local priority encoding.
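A behavioral sketch of this staged reduction (our own illustration; in hardware the H loop iterations below execute as H pipeline stages, one per clock cycle):

def unit_match(vectors, U, H):
    # vectors: the P U-bit match vectors (as integers) output by the unit's RAMs
    P = len(vectors)
    per_stage = P // H                 # assumes H divides P evenly
    m = (1 << U) - 1                   # all-ones U-bit partial result
    for s in range(H):                 # stage s ANDs its P/H vectors into m
        for v in vectors[s * per_stage:(s + 1) * per_stage]:
            m &= v
    return m                           # the unit's local U-bit match vector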
3.2.3 Update Engine
We make the following observations:
• Updating a TCAM word involves updating only one unit.
• The update logic is identical for units with the same memory organization.
To save logic resources, it is desirable to share the update logic between units. We decouple the update logic from the units and build multiple update engines. Each update engine contains the update logic and serves multiple units. An update engine maintains a single state machine and decides which unit to update based on the index (Id) of the TCAM word to be updated. A unit receives from its update engine the addresses and the write enable signals for its RAMs. The unit also interacts with its update engine to exchange the bit vectors used to update each RAM word. Due to the decoupling of the update logic from the units, the association between the lookup units (LUs) and the update engines (UEs) is flexible. The only constraint is that the units served by the same update engine must have the same memory organization (i.e., P and w). Figure 7 shows three different example layouts of the update engines in a 4-row, 4-unit-per-row architecture.
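One way such a shared engine could dispatch an update (a sketch under our own assumption that TCAM words are numbered consecutively across units in priority order, which the paper does not spell out):

def dispatch(Id, U, served_units):
    # Id: global index of the TCAM word to update; U: TCAM words per unit
    # served_units: global indices of the lookup units handled by this engine
    unit = Id // U          # the unit holding word Id
    local_n = Id % U        # the bit position inside that unit's RAM words
    assert unit in served_units, "word belongs to a unit served by another engine"
    return unit, local_n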
3.3 Explicit Range Matching
In some network search applications such as access control lists (ACLs), a packet is matched against a set of rules. An ACL-like rule specifies the match condition on each of multiple packet header fields. Some fields such as TCP ports are specified using ranges rather than ternary strings. Taking 5-field ACL as an example, the two 16-bit port fields are normally specified as ranges. The ranges must be converted into ternary strings so that such rules can be stored in a TCAM. However, a range may be converted into multiple ternary strings. An r-bit range can be expanded to 2(r − 1) prefixes or 2(r − 2) ternary strings. If there are D such fields in a rule, then the rule can be expanded to (2r − 4)^D ternary words in the worst case. Such a problem is called "rule expansion" [14]. Various range encoding methods have been proposed to minimize rule expansion. Even with the optimal range encoding [14], it still needs r ternary words to represent an r-bit range. In such a case, a rule with D range fields will occupy O(r^D) TCAM words.
Figure 7: Example layouts of update engines (UEs): (a) row; (b) column; (c) square. LU: lookup unit.
An attractive advantage of FPGA over ASIC is that we can reprogram the hardware on the fly to add customized logic. So for the ACL-like search problems, we adopt a similar idea to [15] and augment the TCAM design with explicit range match support instead of converting ranges into ternary strings. This is achieved by storing the lower and upper bounds of each range explicitly in registers. Hence, if there are N rules each containing D r-bit port fields, then we require a total of N · D · r · 2 bits of registers to store the lower and upper bounds of all the ranges. On the other hand, the size of the TCAM that needs to be stored in RAMs is reduced to N × (W − D · r).
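A behavioral sketch of the explicit range match (illustrative only; in hardware the bounds live in registers and the comparisons are combinational): each range field is compared against its stored lower and upper bounds, and the result is ANDed with the ternary match of the remaining bits.

def rule_matches(ternary_hit, field_values, bounds):
    # ternary_hit: match result for the non-range bits from the RAM-based TCAM
    # field_values: the D range-based header fields extracted from the packet
    # bounds: the (lo, hi) pair stored for each of the rule's D range fields
    return ternary_hit and all(lo <= v <= hi for v, (lo, hi) in zip(field_values, bounds))

# Example: a rule whose 16-bit destination port must fall in [1024, 65535]
print(rule_matches(True, [8080], [(1024, 65535)]))    # True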
According to the theoretical analysis in Section 2.3.4, the RAM-based implementation of an N × W TCAM requires the minimum memory when employing shallow RAMs with the same depth of 2 or 4. However, real hardware has limitations on the minimum depth of physical RAMs. For example, each block RAM (BRAM) available on a Xilinx Virtex 7 FPGA can be configured as 512 × 72, 1K × 36, 2K × 18, 4K × 9, 8K × 4, 16K × 2, or 32K × 1 in simple dual-port mode. In other words, the minimum depth for a BRAM is 512 = 2^9. Let d_min denote the minimum address width of the physical RAM. An N × W (logical) RAM where N ≤ 2^{d_min} will be mapped to a 2^{d_min} × W physical RAM. Thus the RAM/TCAM ratio becomes 2^{max(w, d_min)} / w instead of 2^w / w.
A trick that can be played is to map multiple shallow (logical) RAMs to a deep physical RAM. For example, two 2^d × W (logical) RAMs can be mapped to a single 2^{d+1} × W physical RAM. But the throughput will be halved unless the physical RAM has two sets of input/output ports used independently for the two (logical) RAMs. While some multi-port RAM designs [16] are available, they bring extra complications and are beyond the scope of this paper.
Hence, when implementing the RAM-based TCAM in real hardware, the address width of each RAM, i.e., w, should be carefully chosen based on the available physical configuration.
We implement our modular RAM-based TCAM architecture on a Xilinx Virtex 7 XC7V2000T device with -2 speed grade. We evaluate the performance based on the post-place-and-route results from the Xilinx Vivado 2013.1 development toolset. To recap, we list the key parameters of the architecture in Table 4. Note that N = R · L · U.
Table 4: Architectural parameters

Parameter   Description
R           The number of rows
L           The number of units per row
U           The number of TCAM words per unit
H           The number of stages per unit
w           The address width of the RAM
4.1 Analysis and Estimation
Due to its pipelined architecture, our RAM-based TCAM implementation processes one packet every clock cycle. Thus the throughput is F million packets per second (Mpps) when the clock rate of the implementation achieves F MHz. During lookup, each packet traverses the R rows in parallel. It takes L · H clock cycles to go through each row. One clock cycle is needed for final priority encoding when the architecture consists of more than one row. Thus the lookup latency, in terms of the number of clock cycles, is L · H if R = 1, and L · H + 1 if R > 1.
The address width of the RAMs, i.e., w, is a critical parameter in our RAM-based TCAM. The update latency is 2^w + 1 clock cycles, while the memory requirement for implementing an N × W TCAM is (2^w / w) N W. To determine the optimal w,
we examine the physical memory resources available on the FPGA device. There are two types of memory resources in Xilinx Virtex FPGAs: distributed RAM and block RAM (BRAM). While BRAMs are provided as standalone RAMs, distributed RAM is coupled with logic resources. The basic logic resource unit of an FPGA is usually called a Slice. Only a certain type of Slice, named SliceM, can be used to build distributed RAM. As required by our architecture, we consider the RAMs only in simple dual-port (SDP) mode. Table 5 summarizes the total amount and the minimum address width (d_min) of the memory resources available on our target FPGA device.

Figure 8: Increasing the TCAM depth (N). [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) versus the number of words N, for L = 4, 8, 16, 32.]
Table 5: Memory resources on a XC7V2000T

RAM type (in SDP mode)   Total size (bits)   d_min
Distributed RAM          16550400            5
Block RAM                47628288            9
Either distributed RAM or BRAM can be employed to implement the RAM-based TCAM architecture. In either case, we set w = d_min of the employed RAM type to achieve the highest memory efficiency. Based on the information in Table 5, we can estimate the maximum size of the TCAM that can be implemented on the target device. When the architecture is implemented using distributed RAM, the RAM/TCAM ratio is 2^5/5 and the maximum TCAM size is 16550400/(32/5) = 2586000 bits. When using BRAM, the RAM/TCAM ratio is 2^9/9 and the maximum TCAM size is 47628288/(512/9) = 837216 bits. We can see that, though the total amount of BRAM bits is nearly triple that of distributed RAM bits, the BRAM-based implementation supports a much smaller TCAM due to the higher RAM/TCAM ratio. Moreover, the update latency of the distributed RAM-based implementation is 33 clock cycles, while the update latency of the BRAM-based implementation is 513 clock cycles. Hence in most of our experiments, distributed RAMs (w = 5) are employed. Also note that our architecture is modular, where each unit may independently select the RAM type for TCAM implementation. Thus the maximum TCAM size would be 3423216 bits when both distributed RAMs and BRAMs are utilized.
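These capacity numbers follow directly from the ratio formula of Section 2.3.4; a two-line check (illustrative arithmetic):

for total_bits, w in ((16550400, 5), (47628288, 9)):
    print(total_bits / (2 ** w / w))    # 2586000.0 (distributed RAM), 837216.0 (BRAM)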
4.2 Scalability
We are interested in how the performance scales when the TCAM depth (i.e., N) or the TCAM width (W) is increased. The key performance metrics include the throughput, the memory requirement, the power consumption estimates, and the resource usage. In these experiments, the default parameter settings are L = 4, U = 64, H = 1, and w = 5. Each unit contains its own update logic.
First, we fix W = 150 and increase N by doubling R. Figure 8 shows the results, where the memory and the Slice results are drawn using a logarithmic scale. The throughput is measured in Mpps. As expected, the throughput degrades for a deeper TCAM. This is because a larger R results in a deeper final priority encoder, which becomes the critical path. Also, with a larger TCAM, the resource utilization approaches 100%. This makes it difficult to route signals, which further lowers the achievable clock rate. Fortunately, because of the configurable architecture, we can trade latency for throughput. Since N = R · L · U, we can increase L to reduce R for a given N while keeping other parameters fixed. As shown in Figure 8, a larger L results in a higher throughput, though at the expense of a larger latency. By tuning the latency-throughput trade-off, our design can sustain a 150 MHz clock rate for large TCAMs up to 16K × 150 bits = 2.4 Mbits. Such a clock rate allows the design to process 150 million packets per second (Mpps), which translates to 100 Gbps throughput for minimum-size Ethernet packets.
Second, we fix the TCAM depth N = 4096 and increase the TCAM width W. Figure 9 shows that a larger TCAM width results in a lower throughput. This is because there are W/w RAMs per unit, where w = 5 in the implementation. With a large W, it becomes time-critical to bitwise-AND a large number of bit vectors within each unit. Again this can be amended by trading latency for throughput. We increase the number of stages per unit so that each stage handles a smaller number of RAMs. As shown in Figure 9, the throughput is improved by increasing H by 1. This, on the other hand, increases the latency by L = 4 clock cycles.
In both of the above experiments, the resource usage is linear with the TCAM size (N × W). The estimated power consumption is sublinear in the TCAM depth and linear in the TCAM width.
Figure 9: Increasing the TCAM width (W). [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) versus the word width W, for H = 1 and H = 2.]
4.3 Impact of Unit Size
Each TCAM unit in our architecture stores U TCAM words. It is desirable to have a small U so that the local bit-vector bitwise-ANDing and priority encoding within each unit do not become the critical path. On the other hand, a smaller U leads to a larger L when R is fixed for a given N. Thus we can tune the latency-throughput trade-off by changing U. In this experiment, we fix R = 4, H = 1 and vary U in implementing a 1024 × 150 TCAM. As expected, Figure 10 shows that a larger U results in a lower throughput as well as a lower latency. Such a trade-off can be exploited for some latency-sensitive applications where the latency is measured in terms of nanoseconds instead of the number of clock cycles. Based on the results shown in Figure 10, when U is doubled from 64 to 128, the throughput is slightly degraded while the latency is reduced from 6 × 5 = 30 ns to 4 × 5.27 ≈ 21 ns. The change of U has little impact on the other performance metrics, which are thus not shown here.
Figure 10: Increasing the unit size (U). [Throughput (Mpps) and latency (in clock cycles) versus the unit size U = 64, 128, 256.]
4.4 Distributed vs Block RAMs
As discussed in Section 4.1, distributed RAMs are more efficient than BRAMs in implementing the RAM-based TCAM on the target FPGA. But it is usually desirable to integrate the RAM-based TCAM with other engines (such as a packet parser) in a single FPGA device to comprise a complete packet processing system. Then the choice of the RAM type may depend not only on efficiency but also on the resource budget. BRAMs will be preferred for implementing the RAM-based TCAM in case the other engines require a lot of Slices but few BRAMs. Hence we conduct experiments to characterize the performance of RAM-based TCAMs implemented using the two different RAM types. In these experiments, W = 150, L = 4, U = 64, and H = 1. Each TCAM unit contains its own update logic. As shown in Table 6, distributed RAM-based implementations achieve higher clock rates and lower power consumption than BRAM-based implementations. This is due to the fact that a BRAM is deeper and larger, and thus requires a longer access time and dissipates more power than a distributed RAM. Because distributed RAMs are built from Slices (SliceM), the distributed RAM-based implementations require much more logic resource (in terms of Slices) than BRAM-based implementations.
Table 6: Implementation results based on different RAM types

TCAM size: N × W         1024×150 bits          2048×150 bits          4096×150 bits
RAM type                 Distributed  Block     Distributed  Block     Distributed  Block
# of Slices              20526        12138     40239        23560     80622        45632
(Utilization)            (6.72%)      (3.97%)   (13.18%)     (7.71%)   (26.40%)     (14.94%)
BRAM (Utilization)       (0.00%)      (21.05%)  (0.00%)      (42.11%)  (0.00%)      (84.21%)
Estimated Power (Watts)  1.933        3.211     3.448        5.73      6.135        10.757

4.5 Impact of Update Engine Layout
As discussed in Section 3.2.3, we can have flexible associations between lookup units and update engines by decoupling the update logic from each unit. We conduct experiments to evaluate the impact of different update engine (UE) layouts on the performance of the architecture. The evaluated update engine layouts include:
• All: Each unit contains its own update logic.
• Square: The four neighboring units forming a square share the same update engine (Figure 7(c)).
• Row: The units in the same row share the same update engine (Figure 7(a)).
• Column: The units in the same column share the same update engine (Figure 7(b)).
• None: No update logic for any unit; the TCAM is not updatable.

Figure 11: Impact of the update engine (UE) layout. [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) for the All, Square, Row, Column, and None UE layouts.]
In these experiments, N = 1024, W = 150, R = 4, L = 4, U = 64, H = 1, and w = 5. So the architecture consists of 4 by 4 units, basically the same as illustrated in Figure 7. The implementation results are shown in Figure 11. Comparing the Slice results of the All and the None layouts, we can infer that the update logic accounts for more than half of the total logic usage of the architecture in the All layout. In the Square, Row, and Column layouts, by sharing the update engine, the logic resource usage is reduced by roughly 25% compared with the All layout. These three layouts achieve similar logic resource savings because all of them have each update engine shared by four lookup units. The costs of sharing the update engine include slightly degraded throughput and slightly increased power consumption. Such costs are mainly due to the wide mux/demux and the stretched signal routing between lookup units and update engines. Higher throughput could be obtained by careful chip floor planning. Also note that the update engine layout has no effect on the memory requirement, which is determined only by the lookup units.
4.6 Cost of Explicit Range Matching
As discussed in Section 3.3, we provide the capability to add explicit range matching logic to the TCAM architecture so that range-to-ternary conversion can be avoided for some search applications such as ACL. Such explicit range matching logic relies on a heavy use of registers. We conduct experiments to understand the performance cost of the explicit range matching logic. We fix W = 150 and increase the number of 16-bit fields that are specified as ranges. The other parameters take their default values: N = 1024, R = 4, L = 4, U = 64, H = 1, and w = 5 (distributed RAM). Each TCAM unit has its own update logic. Table 7 shows that adding the explicit range matching logic for every 16-bit range-based field requires 5K more Slices and 30K more registers. The increased usage of logic also results in higher power consumption. Whether to enable the explicit range matching should be decided based on the characteristics of the ruleset used in the search application. Consider a ruleset whose expansion ratio (due to range-to-ternary conversion) is a, while it requires b times more logic resource to add the explicit range matching logic. Then it is better not to enable the explicit range matching if a < b.