Scalable Ternary Content Addressable Memory Implementation Using FPGAs
Weirong Jiang Xilinx Research Labs San Jose, CA, USA weirongj@acm.org
ABSTRACT
Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Array (FPGA). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study of RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units, which saves logic resources. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.
Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures;
C.2.6 [Computer Communication Networks]: Internetworking
General Terms
Algorithms, Design, Performance
Keywords
FPGA; RAM; TCAM
1. INTRODUCTION
Ternary Content Addressable Memory (TCAM) is a specialized associative memory where each bit can be 0, 1, or "don't care" (i.e., "∗"). TCAM has been widely used in network infrastructure for various search functions including longest prefix matching (LPM), multi-field packet classification, etc. For each input key, TCAM performs a parallel search over all stored words and finds the matching word(s) in a single clock cycle. A priority encoder is needed to obtain the index of the matching word with the highest priority. In a TCAM, the physical location normally determines the priority, e.g., the top word has the highest priority.
Most current TCAMs are implemented as standalone application-specific integrated circuits (ASICs). We call them native TCAMs. Native TCAMs are expensive, power-hungry, and not scalable with respect to clock rate or circuit area, especially compared with Random Access Memories (RAMs). The limited configurability of native TCAMs does not fit the requirements of some network applications (e.g., OpenFlow [1]) where the width and/or the depth of different lookup tables can be variable [2]. Various algorithmic solutions have been proposed as alternatives to native TCAMs, but none of them is exactly equivalent to TCAM. The success of algorithmic solutions is limited to a few specific search functions such as exact matching and LPM. For some other search functions such as multi-field packet classification, the algorithmic solutions [3, 4] employ various heuristics, leading to nondeterministic performance that is often dependent on the characteristics of the data set.
On the other hand, reconfigurable hardware such as the field-programmable gate array (FPGA) combines the flexibility of software with near-ASIC performance. State-of-the-art FPGA devices such as Xilinx Virtex-7 [5] and Altera Stratix-V [6] provide high clock rates, low power dissipation, rich on-chip resources, and large amounts of embedded memory with configurable word width. Due to their increasing capacity, modern FPGAs have become an attractive option for implementing various networking functions [7, 8, 9, 10]. Compared with ASIC, FPGA technology is increasingly favorable because of its shorter time to market, lower development cost, and the shrinking performance gap between the two. Due to the demand for TCAM to be flexible to configure and easy to integrate, there has been a growing interest in employing FPGAs to implement TCAM or TCAM-equivalent search engines.
While there exist several FPGA-based TCAM designs, most of them are brute-force implementations that mimic the native TCAM architecture. Their resource usage is inefficient, which makes them less interesting in practice. On the other hand, some recent work [11, 12, 13] shows that RAMs can be employed to emulate/implement a TCAM.
But none of them gives a correctness proof or a thorough study of efficient FPGA implementation. Their architectures are monolithic and do not scale well in implementing large TCAMs. A goal of this paper is to advance FPGA-based TCAM designs by investigating both the theory and the architecture for RAM-based TCAM implementation. The main contributions include:
• We give an in-depth introduction to the RAM-based TCAM. We formalize the key ideas and the algorithms behind it. We analyze thoroughly the theoretical performance of the RAM-based TCAM and identify the key challenges in implementing a large RAM-based TCAM.
• We propose a modular and scalable architecture that consists of arrays of small-size RAM-based TCAM units. By decoupling the update logic from each unit, such a modular architecture enables each update engine to be shared among multiple units, thus saving logic resources.
• We share our experience in implementing the proposed architecture on a state-of-the-art FPGA. The post-place-and-route results show that our design can support a large TCAM of up to 2.4 Mbits while sustaining a high throughput of 150 million packets per second (Mpps). To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.
• We conduct comprehensive experiments to characterize the various performance trade-offs offered by the configurable architecture. We also discuss the support of range matching without range-to-ternary conversion.
The rest of the paper is organized as follows. Section 2 gives a detailed introduction to the theoretic aspects of the RAM-based TCAM. Section 3 discusses the hardware architectures for scalable RAM-based TCAM. Section 4 presents the comprehensive evaluation results based on the implementation on a state-of-the-art FPGA. Section 5 reviews the related work on FPGA-based TCAM designs. Section 6 concludes the paper.
We first have the following definitions:
• The Depth of a TCAM (or RAM) is the number of words in the TCAM (or RAM), denoted as N.
• The Width of a TCAM (or RAM) is the width (i.e., the number of bits) of each TCAM (or RAM) word, denoted as W.
• The Size of a TCAM (or RAM) is the total number of bits of the TCAM (or RAM). It equals N × W.
• The address width of a RAM is the number of bits of the RAM address, denoted as d. Note that N = 2^d for a RAM.
We describe the organization of a TCAM or RAM as Depth × Width, i.e., N × W. For example, a 2 × 1 RAM consists of 2 words where each word is 1 bit. We call a TCAM or RAM wide (or narrow) if its width is large (or small). We call a TCAM or RAM deep (or shallow) if its depth is large (or small).
We also use the notation shown in Table 1.

Table 1: Notation

Notation   Description
k          An input key or a binary number
t          A ternary word
A          An alphabet for 1-bit characters
A^n        The set of all n-bit strings over A
|s|        The length of a string s ∈ A^n; |s| = n
A TCAM can be divided into two logical areas: (1) TCAM words and (2) the priority encoder. Each TCAM word consists of a row of matching cells attached to the same match line. During lookup, each input key goes to all the N words in parallel and retrieves an N-bit match vector. The i-th bit of the match vector indicates whether the key matches the i-th word, i = 1, 2, · · · , N. In this section, for ease of discussion, we consider a TCAM without the priority encoder. Thus the output of the considered TCAM is an N-bit match vector instead of the index of the matching word with the highest priority.
Looking up an N × W TCAM is basically mapping a W-bit binary input key to an N-bit binary match vector. The same mapping can be achieved by using a 2^W × N RAM, where the W-bit input key is used as the address to access the RAM and each RAM word stores an N-bit vector. Figure 1(a) shows a 1 × 1 TCAM and its corresponding RAM-based implementation. As the TCAM word stores a "don't care" bit, the match vector is always 1 no matter whether the input 1-bit key is 0 or 1.
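To make this mapping concrete, the following minimal Python sketch (a software emulation of the concept, not part of the hardware design; the helper names are ours) builds the 2^W-entry, 1-bit-wide RAM for a single ternary word written as a string over {0, 1, ∗}, anticipating Principle 1 below:

def matches(key_bits, t):
    # key_bits and t are W-character strings; '*' in t matches either key bit
    return all(tc in ('*', kc) for kc, tc in zip(key_bits, t))

def build_ram(t):
    # RAM[k] = 1 iff the W-bit key k matches the ternary word t
    W = len(t)
    return [1 if matches(format(k, '0' + str(W) + 'b'), t) else 0 for k in range(2 ** W)]

print(build_ram('*'))    # [1, 1]: the 1 x 1 TCAM of Figure 1(a), both keys match
print(build_ram('0*'))   # [1, 1, 0, 0]: keys 00 and 01 match, 10 and 11 do not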
2.2.1 Depth Extension
The depth of a native TCAM is increased by stacking words with the same width vertically. Correspondingly, in the RAM-based implementation, the depth of a TCAM is extended by increasing the width of the RAM. Each column of the RAM represents the match vector for a word. Figure 1(b) shows a 2 × 1 TCAM which adds a word to the TCAM shown in Figure 1(a). Correspondingly, the RAM-based implementation adds a column to the RAM shown in Figure 1(a). We see that the memory requirement of either the native TCAM or its RAM-based implementation is linear in the depth.
We can also view the depth extension as concatenating the match vectors from multiple "shallower" TCAMs. For instance, an N × W TCAM can be horizontally divided into two TCAMs: one is N_1 × W and the other N_2 × W, where N = N_1 + N_2. Then there are two RAMs in the corresponding RAM-based implementation: one is 2^W × N_1 and the other 2^W × N_2. The outputs of the two RAMs are concatenated to obtain the final N-bit match vector. This is essentially equivalent to building a wider RAM by concatenating two RAMs with the same depth. For the sake of simplicity, we consider a wide RAM built by concatenating multiple RAMs to be a single RAM.
Figure 1: (a) Matching a 1-bit key with a 1 × 1 TCAM; (b) Matching a 1-bit key with a 2 × 1 TCAM; (c) Matching a 2-bit key with a 1 × 2 TCAM.
2.2.2 Width Extension
A wider TCAM deals with a wider input key. When implementing the TCAM in a single RAM, a wider input key (which is used as the address to access the RAM) indicates a wider address width for the RAM. This results in a deeper RAM whose depth is 2^W. Figure 1(c) shows a 1 × 2 TCAM which extends the width of the TCAM shown in Figure 1(a). As the width of the input key is increased by 1 bit, the depth of the RAM in the corresponding RAM-based TCAM is doubled. Such a design cannot scale well for wide input keys.
An alternative solution is using multiple "narrow" TCAMs to implement a wide TCAM. For example, an N × W TCAM can be vertically divided into two TCAMs: one is N × W_1 and the other N × W_2, where W = W_1 + W_2. During lookup, a W-bit input key is divided into two segments accordingly: one is W_1 bits and the other W_2 bits. Each of the two "narrower" TCAMs matches the corresponding segment of the key and outputs an N-bit match vector. The two match vectors are then bitwise ANDed to obtain the final match vector. The two "narrow" TCAMs map to two "shallow" RAMs in the corresponding RAM-based implementation. The total number of RAM words becomes 2^{W_1} + 2^{W_2} instead of 2^W = 2^{W_1} · 2^{W_2}. For example, splitting a 32-bit key into two 16-bit segments requires 2 · 2^16 = 131072 RAM words in total rather than 2^32 words. Figure 2 shows how a 1 × 2 TCAM is built based on two 1 × 1 TCAMs.
2.2.3 Populating the RAM
Given a set of ternary words, we need to populate the RAMs so that the RAM-based implementation fulfills the same search function as the native TCAM. As shown in Figure 1(a), it is easy to populate the RAM for the RAM-based implementation of a 1 × 1 TCAM. Table 2 shows the content of the 2 × 1 RAM populated for the 1 × 1 TCAM, where RAM[k] denotes the RAM word at address k, k = 0, 1.
Figure 2: Building a 1 × 2 TCAM using two 1 × 1 TCAMs.
Table 2: Representing a ternary bit in RAM

The value of the   The value stored at
ternary bit        RAM[0]    RAM[1]
0                  1         0
1                  0         1
don't care         1         1
Principle 1 gives the rule for populating the 2 × 1 RAM to represent a 1-bit TCAM, where k ∈ {0, 1} and t ∈ {0, 1, ∗}.

Principle 1. RAM[k] = 1 if and only if k matches t; otherwise RAM[k] = 0.

Theorem 1. The RAM populated following Principle 1 achieves the equivalent function as the TCAM that stores t.

Proof. In the TCAM, the output for an input k is 1 ⇔ k matches t; otherwise, the output is 0. In the populated RAM, the output for an input k is 1 ⇔ RAM[k] = 1 ⇔ k matches t (by Principle 1); otherwise, the output is 0. Thus the populated RAM is equivalent to the represented TCAM.
Both Principle 1 and Theorem 1 are directly applicable to the case of a 1 × W TCAM where k ∈ A^W with A = {0, 1}, and t ∈ A^W with A = {0, 1, ∗}.
Principle 1 can be extended to the case of an N × W TCAM that is implemented in a 2^W × N RAM. Let RAM[k][i] denote the i-th bit of the k-th word in the RAM. Let t_i denote the i-th word in the TCAM, i = 1, 2, · · · , N. So we have Principle 2 for populating the 2^W × N RAM to represent an N × W TCAM, where k ∈ A^W with A = {0, 1} and t_i ∈ A^W with A = {0, 1, ∗}.

Principle 2. RAM[k][i] = 1 if and only if k matches t_i; otherwise RAM[k][i] = 0, i = 1, 2, · · · , N.

When a wide TCAM is built using multiple narrower TCAMs, the RAM corresponding to each narrow TCAM is populated individually by following Principle 2.
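As a concrete illustration of Principle 2, the short Python sketch below (an emulation with our own helper names, not the paper's implementation; word indices are 0-based here) populates the 2^W × N bit table for a list of N ternary words:

def populate(tcam_words):
    # tcam_words: N ternary strings of equal width W over {'0', '1', '*'}
    N = len(tcam_words)
    W = len(tcam_words[0])
    ram = [[0] * N for _ in range(2 ** W)]    # ram[k][i], per Principle 2
    for k in range(2 ** W):
        key = format(k, '0' + str(W) + 'b')
        for i, t in enumerate(tcam_words):
            if all(c in ('*', kc) for kc, c in zip(key, t)):
                ram[k][i] = 1
    return ram

ram = populate(['0*', '11'])
print(ram[0b01])    # [1, 0]: key 01 matches the word '0*' but not '11'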
2.3 Algorithms and Analysis
This section formalizes the algorithms for using a RAM-based TCAM that is built according to the discussion in Section 2.2.
2.3.1 General Model
Based on the discussion in Section 2.2.2, an N × W TCAM can be constructed using P narrow TCAMs, P = 1, 2, · · · , W. The size of the i-th TCAM is N × W_i, i = 1, 2, · · · , P, and W = Σ_{i=1}^{P} W_i. Let RAM_i denote the RAM corresponding to the i-th narrow TCAM, i = 1, 2, · · · , P. The size of RAM_i is 2^{W_i} × N, i = 1, 2, · · · , P. Hence the N × W TCAM can be implemented using these P RAMs.
2.3.2 Lookup
Algorithm 1 shows the algorithm to search a key over the RAM-based TCAM. It takes O(1) time to access each RAM. Since the P RAMs are accessed in parallel, the overall time complexity for lookup is O(1).
Algorithm 1 Lookup
Input: A W-bit key k
Input: {RAM_i}, i = 1, 2, · · · , P
Output: An N-bit match vector m
1: Divide k into P segments: k → {k_1, k_2, · · · , k_P}, |k_i| = W_i, i = 1, 2, · · · , P
2: Initialize m to be all 1's: m ← 11 · · · 1
3: for i ← 1 to P do {bitwise AND}
4:   m ← m & RAM_i[k_i]
5: end for
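A direct software rendering of Algorithm 1 (an illustrative sketch; the data layout, with each RAM row stored as an N-bit Python integer, is our own choice):

def lookup(key_bits, rams, widths):
    # key_bits: the W-bit key as a string; widths[i] = W_i
    # rams[i]: list of 2^{W_i} integers, each an N-bit match vector for one address
    m = None
    pos = 0
    for ram, w in zip(rams, widths):
        k_i = int(key_bits[pos:pos + w], 2)              # segment k_i used as the RAM address
        m = ram[k_i] if m is None else (m & ram[k_i])    # bitwise AND of the match vectors
        pos += w
    return m                                             # final N-bit match vector

# Example: the 2 x 2 TCAM {'0*', '11'} split into two 1-bit segments.
# Bit 0 of each vector corresponds to the word '0*', bit 1 to the word '11'.
ram1 = [0b01, 0b10]   # first key bit:  '0' matches '0*', '1' matches '11'
ram2 = [0b01, 0b11]   # second key bit: '0' matches '*',  '1' matches '*' and '1'
print(bin(lookup('01', [ram1, ram2], [1, 1])))   # 0b1: only '0*' matches key 01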
2.3.3 Update
Updating a TCAM can be either adding or deleting a specific TCAM word. Algorithm 2 shows the algorithm to add or delete the n-th word of the TCAM in the RAM-based implementation, where n = 1, 2, · · · , N. It takes O(2^{W_i}) time to update RAM_i, i = 1, 2, · · · , P. As the P RAMs are updated in parallel, the overall time complexity for update is determined by the RAM that takes the longest time to update, which is O(max_{i=1}^{P} 2^{W_i}) = O(2^{max_{i=1}^{P} W_i}).
Algorithm 2 Updating a TCAM word
Input: A W-bit ternary word t
Input: The index of t: n
Input: The update operation: op ∈ {add, delete}
Output: Updated {RAM_i}, i = 1, 2, · · · , P
1: Divide t into P segments: t → {t_1, t_2, · · · , t_P}, |t_i| = W_i, i = 1, 2, · · · , P
2: for i ← 1 to P do {Update each RAM}
3:   for k ← 0 to 2^{W_i} − 1 do
4:     if k matches t_i and op == add then
5:       RAM_i[k][n] = 1
6:     else
7:       RAM_i[k][n] = 0
8:     end if
9:   end for
10: end for
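The same operation in Python (again an emulation sketch using the integer-vector layout of the earlier lookup example): adding or deleting word n sets or clears bit n in every RAM row whose address matches the corresponding segment of t, and clears it everywhere else, exactly as in Algorithm 2.

def update(rams, widths, t, n, op):
    # rams[i]: list of 2^{W_i} N-bit integers; t: ternary string; op: 'add' or 'delete'
    pos = 0
    for ram, w in zip(rams, widths):
        t_i = t[pos:pos + w]                          # segment of the ternary word
        for k in range(2 ** w):
            key = format(k, '0' + str(w) + 'b')
            if op == 'add' and all(c in ('*', kc) for kc, c in zip(key, t_i)):
                ram[k] |= (1 << n)                    # set bit n
            else:
                ram[k] &= ~(1 << n)                   # clear bit n
        pos += w

rams = [[0, 0], [0, 0]]
update(rams, [1, 1], '0*', 0, 'add')
print(rams)    # [[1, 0], [1, 1]]: the bit-0 columns of ram1 and ram2 above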
2.3.4 Space Analysis
The size of RAM_i is 2^{W_i} × N, i = 1, 2, · · · , P. Hence the overall memory requirement is Σ_{i=1}^{P} (2^{W_i} × N) = N Σ_{i=1}^{P} 2^{W_i}. To minimize the overall memory requirement, we formulate the problem as:

  min Σ_{i=1}^{P} 2^{W_i} = min_{P=1,...,W} ( min_{{W_1, W_2, · · · , W_P}} Σ_{i=1}^{P} 2^{W_i} )    (1)

subject to

  Σ_{i=1}^{P} W_i = W.    (2)

For a given P, min_{{W_1, W_2, · · · , W_P}} Σ_{i=1}^{P} 2^{W_i} = P · 2^{W/P}, achieved when W_i = W/P, i = 1, 2, · · · , P. Hence the overall memory requirement is minimum when all the P RAMs have the same address width, denoted as w = W/P. The depth of each RAM is then 2^w, and the overall memory requirement is

  Σ_{i=1}^{P} (2^{W_i} · N) = Σ_{i=1}^{W/w} (2^w · N) = (W/w) · 2^w · N = N W 2^w / w.    (3)

We define the RAM/TCAM ratio as the number of RAM bits needed to implement a TCAM bit. According to Equation (3), the RAM/TCAM ratio is 2^w / w when all the RAMs employ the same address width w. Basically, a larger w results in a larger RAM/TCAM ratio, which indicates lower memory efficiency. The minimum RAM/TCAM ratio is 2, achieved when w = 1 (P = W) or w = 2 (P = W/2). In other words, when the depth of each RAM is 2 (w = 1) or 4 (w = 2), the overall memory requirement achieves its minimum of 2NW, i.e., twice the size of the corresponding native TCAM.
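As a quick numeric check of the ratio 2^w / w (illustrative arithmetic only), here are the address widths that matter later in the paper:

for w in (1, 2, 3, 5, 9):
    print(w, 2 ** w / w)
# 1 -> 2.0, 2 -> 2.0, 3 -> 2.67, 5 -> 6.4 (distributed RAM in Section 4), 9 -> 56.9 (BRAM)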
2.3.5 Comparison with Native TCAM
Table 3 summarizes the difference between the native TCAM and its corresponding implementation using P RAMs, with respect to time and space complexities. Here we assume all the RAMs employ the same address width w, so that both the update time and the space complexity achieve the optimum for the RAM-based TCAM (as discussed in Sections 2.3.3 and 2.3.4).
Table 3: Native TCAM vs. RAM-based TCAM

              Native TCAM   RAM-based TCAM
Lookup time   O(1)          O(1)
Update time   O(1)          O(2^w)
Space         NW            (2^w / w) NW
We are interested in implementing the RAM-based TCAM on FPGA. While the theoretical discussion in Section 2 excludes the priority encoder, the hardware architecture of the RAM-based TCAM must consider the priority encoder.
3.1 Basic Architecture
The theoretical model of the RAM-based TCAM implementation (discussed in Section 2.3.1) can be directly mapped to the hardware architecture shown in Figure 3. An N × W TCAM is implemented using P RAMs, where the size of the i-th RAM is 2^{W_i} × N and Σ_{i=1}^{P} W_i = W.
As illustrated in Algorithm 1, a lookup is performed by dividing the input W-bit key into P segments, where the length of the i-th segment is W_i, i = 1, 2, · · · , P. Each segment of the key is then used as the address to access the corresponding RAM. Each RAM outputs an N-bit vector. The P N-bit vectors are bitwise ANDed to generate the final match vector. The match vector is finally fed into the priority encoder to obtain the index of the matching word with the highest priority. A 1-bit Match signal is also generated to indicate whether there is any match.
Figure 3: Basic architecture (without update logic).
We add update logic to the RAM-based TCAM so that it can complete any update by itself at run time. In accordance with Algorithm 2, Figure 4 shows the logic for updating the RAM-based TCAM, where W_max = max_{i=1}^{P} W_i. We use two W-bit binary numbers, denoted as Data and Mask, to represent the W-bit ternary word t that is to be updated. The i-th bit of t is a "don't care" bit if and only if the i-th bit of Mask is set to 0, i = 1, 2, · · · , W. For example, the 2-bit ternary word 0∗ can be represented by Data = 00 or 01, and Mask = 10. Id specifies the index of the ternary word to be updated. Op indicates whether the ternary word is to be added (Op = 0) or deleted (Op = 1).
Figure 4: Update logic. "CMP": compare; "CHG": change.
Adding or deleting the Id-th ternary word t is accomplished by setting or clearing the Id-th bit in all the RAM words whose addresses match t. Meanwhile, we must keep the rest of the bits of these RAM words unchanged. Hence we need to read the original content of the RAM words, change only the Id-th bit, and then write the updated RAM words back to the RAM. This requires 2 · 2^w clock cycles to update a single-port RAM whose address width is w. To reduce the update latency, we utilize a simple dual-port RAM and perform the read and the write in the same clock cycle. A simple dual-port RAM has two input (address) ports and one output port; one input port is for Read only and the other is for Write only. At each clock cycle during an update, the update logic writes the updated RAM word to address k while reading the content of the RAM word at address k + 1. Hence the update latency becomes 2^w + 1 clock cycles, where the extra clock cycle is consumed to fetch the content of the first RAM word. Another part of the update logic is a state machine (not shown in Figure 4) that switches the state of the TCAM between lookup and update. During an update, no lookup is permitted and any match result is invalid.
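The cycle-by-cycle behavior can be sketched as follows (a software model of the read-modify-write schedule described above, with hypothetical names; the actual design is RTL). At every cycle the engine writes back the word fetched in the previous cycle while reading the next address, so a RAM of depth 2^w is updated in 2^w + 1 cycles:

def update_ram_word(ram, w, t_seg, n, op):
    # ram: list of 2^w N-bit integers; t_seg: the w-bit ternary segment for this RAM
    def new_bit(k):
        key = format(k, '0' + str(w) + 'b')
        match = all(c in ('*', kc) for kc, c in zip(key, t_seg))
        return 1 if (op == 'add' and match) else 0

    depth = 2 ** w
    prev = ram[0]                                    # cycle 0: read address 0
    for k in range(depth):                           # cycles 1 .. 2^w
        nxt = ram[k + 1] if k + 1 < depth else None  # read address k+1 ...
        if new_bit(k):
            ram[k] = prev | (1 << n)                 # ... while writing back address k
        else:
            ram[k] = prev & ~(1 << n)
        prev = nxt
    # total latency: 2^w + 1 cycles, matching the text above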
In implementing a large-scale RAM-based TCAM on FPGA, there are two main challenges:
• Throughput: When the TCAM is deeper or wider, the logic and routing complexities become larger, especially for bitwise-ANDing many wide bit vectors and for priority encoding a deep match vector. This results in significant degradation of the achievable clock rate, which determines the maximum throughput of the RAM-based TCAM.
• Resource usage: The on-chip resources of an FPGA device are limited. Hence we must optimize the architecture to save resources or use them efficiently. We need to find the best memory configuration based on the physical capability. It is also desirable to enable resource sharing between subsystems.
We propose a scalable and modular architecture that employs configurable small-size RAM-based TCAM units as building blocks. Both bit-vector bitwise-ANDing and priority encoding are performed in a localized and pipelined fashion so that high throughput is sustained for large TCAMs. We decouple the update logic from each unit so that a single update engine can be shared flexibly by multiple TCAM units. On-chip logic resources are thus saved. Note that such resource sharing is only possible in a modular architecture.
3.2.1 Overview
The top-level design consists of a grid of units, which are organized in multiple rows. Figure 5 shows the top-level architecture with R rows, each of which contains L units. The TCAM words with higher priority are stored in the units with lower index. The units within a row are searched sequentially in a pipelined fashion. Priority is resolved locally within each unit. After each row outputs a matching result, a global priority encoder is needed to select the one with the globally highest priority.
3.2.2 Unit Design
A TCAM unit is basically a U × W TCAM implemented in RAMs, where U is the number of TCAM words per unit. Figure 6 depicts the architecture of a TCAM unit. Each unit performs the local TCAM lookup and combines the local match result with the result from the preceding unit.
Figure 5: Top-level architecture.
As the unit index determines the priority, a matching TCAM word stored in a preceding unit always has a higher priority than the local matching one. The U × W TCAM is constructed using P RAMs based on the basic architecture shown in Section 3.1. We use the same address width w for all the P RAMs to achieve the maximum memory efficiency, as discussed in Section 2.3.4.
Figure 6: A unit.
When W is large, there are many RAMs, each of which outputs a U-bit vector. The throughput may degrade when bitwise-ANDing a large number of bit vectors. We divide a unit into multiple pipelined stages. Let H denote the number of stages in a unit. Then each stage contains P/H RAMs. Within each stage, the bit vectors generated by the P/H RAMs are bitwise-ANDed. The resulting U-bit vector is combined with the bit vector passed from the previous stage and then passed to the next stage. The last stage of the unit performs the local priority encoding.
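A behavioral sketch of this staged reduction (our own illustration; in hardware the H loop iterations below execute as H pipeline stages, one per clock cycle):

def unit_match(vectors, U, H):
    # vectors: the P U-bit match vectors (as integers) output by the unit's RAMs
    P = len(vectors)
    per_stage = P // H                 # assumes H divides P evenly
    m = (1 << U) - 1                   # all-ones U-bit partial result
    for s in range(H):                 # stage s ANDs its P/H vectors into m
        for v in vectors[s * per_stage:(s + 1) * per_stage]:
            m &= v
    return m                           # the unit's local U-bit match vector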
3.2.3 Update Engine
We make the following observations:
• Updating a TCAM word involves updating only one unit.
• The update logic is identical for units with the same memory organization.
To save logic resources, it is desirable to share the update logic between units. We decouple the update logic from the units and build multiple update engines. Each update engine contains the update logic and serves multiple units. An update engine maintains a single state machine and decides which unit to update based on the index (Id) of the TCAM word to be updated. A unit receives from its update engine the addresses and the write enable signals for its RAMs. The unit also interacts with its update engine to exchange the bit vectors used to update each RAM word. Due to the decoupling of the update logic from the units, the association between the lookup units (LUs) and the update engines (UEs) is flexible. The only constraint is that the units served by the same update engine must have the same memory organization (i.e., P and w). Figure 7 shows three different example layouts of the update engines in a 4-row, 4-unit-per-row architecture.
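One way such a shared engine could dispatch an update (a sketch under our own assumption that TCAM words are numbered consecutively across units in priority order, which the paper does not spell out):

def dispatch(Id, U, served_units):
    # Id: global index of the TCAM word to update; U: TCAM words per unit
    # served_units: global indices of the lookup units handled by this engine
    unit = Id // U          # the unit holding word Id
    local_n = Id % U        # the bit position inside that unit's RAM words
    assert unit in served_units, "word belongs to a unit served by another engine"
    return unit, local_n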
3.3 Explicit Range Matching
In some network search applications such as access control lists (ACLs), a packet is matched against a set of rules. An ACL-like rule specifies the match condition on each of multiple packet header fields. Some fields such as TCP ports are specified using ranges rather than ternary strings. Taking 5-field ACL as an example, the two 16-bit port fields are normally specified as ranges. The ranges must be converted into ternary strings so that such rules can be stored in a TCAM. However, a range may be converted into multiple ternary strings. An r-bit range can be expanded to 2(r − 1) prefixes or 2(r − 2) ternary strings. If there are D such fields in a rule, then the rule can be expanded to (2r − 4)^D ternary words in the worst case. Such a problem is called "rule expansion" [14]. Various range encoding methods have been proposed to minimize rule expansion. Even with the optimal range encoding [14], it still needs r ternary words to represent an r-bit range. In such a case, a rule with D range fields will occupy O(r^D) TCAM words.
Figure 7: Example layouts of update engines (UEs): (a) row; (b) column; (c) square. LU: lookup unit.
An attractive advantage of FPGA over ASIC is that we can reprogram the hardware on the fly to add customized logic. So for the ACL-like search problems, we adopt a similar idea to [15] and augment the TCAM design with explicit range match support instead of converting ranges into ternary strings. This is achieved by storing the lower and upper bounds of each range explicitly in registers. Hence, if there are N rules each containing D r-bit port fields, then we require a total of N · D · r · 2 bits of registers to store the lower and upper bounds of all the ranges. On the other hand, the size of the TCAM that needs to be stored in RAMs is reduced to N × (W − D · r).
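A behavioral sketch of the explicit range match (illustrative only; in hardware the bounds live in registers and the comparisons are combinational): each range field is compared against its stored lower and upper bounds, and the result is ANDed with the ternary match of the remaining bits.

def rule_matches(ternary_hit, field_values, bounds):
    # ternary_hit: match result for the non-range bits from the RAM-based TCAM
    # field_values: the D range-based header fields extracted from the packet
    # bounds: the (lo, hi) pair stored for each of the rule's D range fields
    return ternary_hit and all(lo <= v <= hi for v, (lo, hi) in zip(field_values, bounds))

# Example: a rule whose 16-bit destination port must fall in [1024, 65535]
print(rule_matches(True, [8080], [(1024, 65535)]))    # True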
According to the theoretical analysis in Section 2.3.4, the RAM-based implementation of an N × W TCAM requires the minimum memory when employing shallow RAMs with the same depth of 2 or 4. However, real hardware has limitations on the minimum depth of physical RAMs. For example, each block RAM (BRAM) available on a Xilinx Virtex 7 FPGA can be configured as 512 × 72, 1K × 36, 2K × 18, 4K × 9, 8K × 4, 16K × 2, or 32K × 1 in simple dual-port mode. In other words, the minimum depth for a BRAM is 512 = 2^9. Let d_min denote the minimum address width of the physical RAM. An N × W (logical) RAM where N ≤ 2^{d_min} will be mapped to a 2^{d_min} × W physical RAM. Thus the RAM/TCAM ratio becomes 2^{max(w, d_min)} / w instead of 2^w / w.
A trick that can be played is to map multiple shallow (logical) RAMs to a deep physical RAM. For example, two 2^d × W (logical) RAMs can be mapped to a single 2^{d+1} × W physical RAM. But the throughput will be halved unless the physical RAM has two sets of input/output ports used independently for the two (logical) RAMs. While some multi-port RAM designs [16] are available, they bring extra complications and are beyond the scope of this paper.
Hence, when implementing the RAM-based TCAM in real hardware, the address width of each RAM, i.e., w, should be carefully chosen based on the available physical configuration.
We implement our modular RAM-based TCAM architecture on a Xilinx Virtex 7 XC7V2000T device with -2 speed grade. We evaluate the performance based on the post-place-and-route results from the Xilinx Vivado 2013.1 development toolset. To recap, we list the key parameters of the architecture in Table 4. Note that N = R · L · U.
Table 4: Architectural parameters

Parameter   Description
R           The number of rows
L           The number of units per row
U           The number of TCAM words per unit
H           The number of stages per unit
w           The address width of the RAM
4.1 Analysis and Estimation
Due to its pipelined architecture, our RAM-based TCAM implementation processes one packet every clock cycle. Thus the throughput is F million packets per second (Mpps) when the clock rate of the implementation achieves F MHz. During lookup, each packet traverses the R rows in parallel. It takes L · H clock cycles to go through each row. One clock cycle is needed for final priority encoding when the architecture consists of more than one row. Thus the lookup latency, in terms of the number of clock cycles, is L · H if R = 1, and L · H + 1 if R > 1.
The address width of the RAMs, i.e., w, is a critical parameter in our RAM-based TCAM. The update latency is 2^w + 1 clock cycles, while the memory requirement for implementing an N × W TCAM is (2^w / w) N W. To determine the optimal w,
we examine the physical memory resources available on the FPGA device. There are two types of memory resources in Xilinx Virtex FPGAs: distributed RAM and block RAM (BRAM). While BRAMs are provided as standalone RAMs, distributed RAM is coupled with logic resources. The basic logic resource unit of an FPGA is usually called a Slice. Only a certain type of Slice, named SliceM, can be used to build distributed RAM. As required by our architecture, we consider the RAMs only in simple dual-port (SDP) mode. Table 5 summarizes the total amount and the minimum address width (d_min) of the memory resources available on our target FPGA device.

Figure 8: Increasing the TCAM depth (N). [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) versus the number of words N, for L = 4, 8, 16, 32.]
Table 5: Memory resources on a XC7V2000T

RAM type (in SDP mode)   Total size (bits)   d_min
Distributed RAM          16550400            5
Block RAM                47628288            9
Either distributed RAM or BRAM can be employed to implement the RAM-based TCAM architecture. In either case, we set w = d_min of the employed RAM type to achieve the highest memory efficiency. Based on the information in Table 5, we can estimate the maximum size of the TCAM that can be implemented on the target device. When the architecture is implemented using distributed RAM, the RAM/TCAM ratio is 2^5/5 and the maximum TCAM size is 16550400/(32/5) = 2586000 bits. When using BRAM, the RAM/TCAM ratio is 2^9/9 and the maximum TCAM size is 47628288/(512/9) = 837216 bits. We can see that, though the total amount of BRAM bits is nearly triple that of distributed RAM bits, the BRAM-based implementation supports a much smaller TCAM due to the higher RAM/TCAM ratio. Moreover, the update latency of the distributed RAM-based implementation is 33 clock cycles, while the update latency of the BRAM-based implementation is 513 clock cycles. Hence in most of our experiments, distributed RAMs (w = 5) are employed. Also note that our architecture is modular, where each unit may independently select the RAM type for TCAM implementation. Thus the maximum TCAM size would be 3423216 bits when both distributed RAMs and BRAMs are utilized.
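These capacity numbers follow directly from the ratio formula of Section 2.3.4; a two-line check (illustrative arithmetic):

for total_bits, w in ((16550400, 5), (47628288, 9)):
    print(total_bits / (2 ** w / w))    # 2586000.0 (distributed RAM), 837216.0 (BRAM)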
4.2 Scalability
We are interested in how the performance scales when the TCAM depth (i.e., N) or the TCAM width (W) is increased. The key performance metrics include the throughput, the memory requirement, the power consumption estimates, and the resource usage. In these experiments, the default parameter settings are L = 4, U = 64, H = 1, and w = 5. Each unit contains its own update logic.
First, we fix W = 150 and increase N by doubling R. Figure 8 shows the results, where the memory and the Slice results are drawn using a logarithmic scale. The throughput is measured in Mpps. As expected, the throughput degrades for a deeper TCAM. This is because a larger R results in a deeper final priority encoder, which becomes the critical path. Also, with a larger TCAM, the resource utilization approaches 100%. This makes it difficult to route signals, which further lowers the achievable clock rate. Fortunately, because of the configurable architecture, we can trade latency for throughput. Since N = R · L · U, we can increase L to reduce R for a given N while keeping other parameters fixed. As shown in Figure 8, a larger L results in a higher throughput, though at the expense of a larger latency. By tuning the latency-throughput trade-off, our design can sustain a 150 MHz clock rate for large TCAMs up to 16K × 150 bits = 2.4 Mbits. Such a clock rate allows the design to process 150 million packets per second (Mpps), which translates to 100 Gbps throughput for minimum-size Ethernet packets.
Second, we fix the TCAM depth N = 4096 and increase the TCAM width W. Figure 9 shows that a larger TCAM width results in a lower throughput. This is because there are W/w RAMs per unit, where w = 5 in the implementation. With a large W, it becomes time-critical to bitwise-AND a large number of bit vectors within each unit. Again this can be amended by trading latency for throughput. We increase the number of stages per unit so that each stage handles a smaller number of RAMs. As shown in Figure 9, the throughput is improved by increasing H by 1. This, on the other hand, increases the latency by L = 4 clock cycles.
In both of the above experiments, the resource usage is linear with the TCAM size (N × W). The estimated power consumption is sublinear in the TCAM depth and linear in the TCAM width.
Figure 9: Increasing the TCAM width (W). [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) versus the word width W, for H = 1 and H = 2.]
4.3 Impact of Unit Size
Each TCAM unit in our architecture stores U TCAM words. It is desirable to have a small U so that the local bit-vector bitwise-ANDing and priority encoding within each unit do not become the critical path. On the other hand, a smaller U leads to a larger L when R is fixed for a given N. Thus we can tune the latency-throughput trade-off by changing U. In this experiment, we fix R = 4, H = 1 and vary U in implementing a 1024 × 150 TCAM. As expected, Figure 10 shows that a larger U results in a lower throughput as well as a lower latency. Such a trade-off can be exploited for some latency-sensitive applications where the latency is measured in terms of nanoseconds instead of the number of clock cycles. Based on the results shown in Figure 10, when U is doubled from 64 to 128, the throughput is slightly degraded while the latency is reduced from 6 × 5 = 30 ns to 4 × 5.27 ≈ 21 ns. The change of U has little impact on the other performance metrics, which are thus not shown here.
Figure 10: Increasing the unit size (U). [Throughput (Mpps) and latency (in clock cycles) versus the unit size U = 64, 128, 256.]
4.4 Distributed vs Block RAMs
As discussed in Section 4.1, distributed RAMs are more efficient than BRAMs in implementing the RAM-based TCAM on the target FPGA. But it is usually desirable to integrate the RAM-based TCAM with other engines (such as a packet parser) in a single FPGA device to comprise a complete packet processing system. Then the choice of the RAM type may depend not only on efficiency but also on the resource budget. BRAMs will be preferred for implementing the RAM-based TCAM in case the other engines require a lot of Slices but few BRAMs. Hence we conduct experiments to characterize the performance of RAM-based TCAMs implemented using the two different RAM types. In these experiments, W = 150, L = 4, U = 64, and H = 1. Each TCAM unit contains its own update logic. As shown in Table 6, distributed RAM-based implementations achieve higher clock rates and lower power consumption than BRAM-based implementations. This is due to the fact that a BRAM is deeper and larger, and thus requires a longer access time and dissipates more power than a distributed RAM. Because distributed RAMs are built from Slices (SliceM), the distributed RAM-based implementations require much more logic resource (in terms of Slices) than BRAM-based implementations.
Table 6: Implementation results based on different RAM types

TCAM size: N × W         1024×150 bits          2048×150 bits          4096×150 bits
RAM type                 Distributed  Block     Distributed  Block     Distributed  Block
# of Slices              20526        12138     40239        23560     80622        45632
(Utilization)            (6.72%)      (3.97%)   (13.18%)     (7.71%)   (26.40%)     (14.94%)
BRAM (Utilization)       (0.00%)      (21.05%)  (0.00%)      (42.11%)  (0.00%)      (84.21%)
Estimated Power (Watts)  1.933        3.211     3.448        5.73      6.135        10.757

4.5 Impact of Update Engine Layout
As discussed in Section 3.2.3, we can have flexible associations between lookup units and update engines by decoupling the update logic from each unit. We conduct experiments to evaluate the impact of different update engine (UE) layouts on the performance of the architecture. The evaluated update engine layouts include:
• All: Each unit contains its own update logic.
• Square: The four neighboring units forming a square share the same update engine (Figure 7(c)).
• Row: The units in the same row share the same update engine (Figure 7(a)).
• Column: The units in the same column share the same update engine (Figure 7(b)).
• None: No update logic for any unit; the TCAM is not updatable.

Figure 11: Impact of the update engine (UE) layout. [Panels: Memory (Kbits), Throughput (Mpps), # Slices, and Power (Watts) for the All, Square, Row, Column, and None UE layouts.]
In these experiments, N = 1024, W = 150, R = 4, L = 4, U = 64, H = 1, and w = 5. So the architecture consists of 4 by 4 units, basically the same as illustrated in Figure 7. The implementation results are shown in Figure 11. Comparing the Slice results of the All and the None layouts, we can infer that the update logic accounts for more than half of the total logic usage of the architecture in the All layout. In the Square, Row, and Column layouts, by sharing the update engine, the logic resource usage is reduced by roughly 25% compared with the All layout. These three layouts achieve similar logic resource savings because all of them have each update engine shared by four lookup units. The costs of sharing the update engine include slightly degraded throughput and slightly increased power consumption. Such costs are mainly due to the wide mux/demux and the stretched signal routing between lookup units and update engines. Higher throughput could be obtained by careful chip floor planning. Also note that the update engine layout has no effect on the memory requirement, which is determined only by the lookup units.
4.6 Cost of Explicit Range Matching
As discussed in Section 3.3, we provide the capability to add explicit range matching logic to the TCAM architecture so that range-to-ternary conversion can be avoided for some search applications such as ACL. Such explicit range matching logic relies on a heavy use of registers. We conduct experiments to understand the performance cost of the explicit range matching logic. We fix W = 150 and increase the number of 16-bit fields that are specified as ranges. The other parameters take their default values: N = 1024, R = 4, L = 4, U = 64, H = 1, and w = 5 (distributed RAM). Each TCAM unit has its own update logic. Table 7 shows that adding the explicit range matching logic for every 16-bit range-based field requires 5K more Slices and 30K more registers. The increased usage of logic also results in higher power consumption. Whether to enable the explicit range matching should be decided based on the characteristics of the ruleset used in the search application. Consider a ruleset whose expansion ratio (due to range-to-ternary conversion) is a, while it requires b times more logic resource to add the explicit range matching logic. Then it is better not to enable the explicit range matching if a < b.