A 45nm High-Throughput and Low Latency AES
Encryption for Real-Time Applications
Pham-Khoi Dong, Hung K. Nguyen, Xuan-Tu Tran*
SISLAB, VNU University of Engineering and Technology – 144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam
*Corresponding author's email: tutx@vnu.edu.vn
Abstract— In this paper, we propose a high-throughput and
low-latency AES architecture for wideband and real-time
applications such as surveillance cameras, video conferencing,
motion detection, IoT gateways, and data storage encryption. Our
design uses an outer round pipeline technique to achieve high
throughput. The design has been modelled in RTL VHDL and
then synthesized with a 45nm CMOS technology using Synopsys
Design Compiler. The implementation results show that the
proposed architecture achieves a throughput of 111.3Gbps and
a latency of 12.6ns at the maximum operating frequency of
870MHz. With the same 45nm CMOS technology, our design has
2.4 times higher area efficiency and 4.7 times higher energy
efficiency than other related works.
Keywords— AES, cryptography, high throughput, low latency,
real-time applications
The Advanced Encryption Standard (AES) was published
by the National Institute of Standards and Technology (NIST)
in 2001. It is a symmetric block cipher that is intended to
replace DES as the approved standard for a wide range of
applications [1]. In AES, the number of cipher rounds
depends on the size of the key: 10, 12, or 14 rounds for
128-, 192-, or 256-bit keys, respectively. Each AES encryption
round applies four primary operations in sequence:
SubBytes, ShiftRows, MixColumns, and AddRoundKey.
AES implementations can be broadly classified into
software and hardware implementations. Compared to
software implementations, hardware implementations of
AES, by nature, provide more physical security and higher
speed. In general, the hardware implementation can be
realized in either FPGA or ASIC technologies [2]. In
hardware implementation design, there are two main trends:
low-power designs with limited performance, and
high-performance designs of AES cores.
D.-H. Bui et al. [3] present hardware optimization
strategies for high-speed ultralow-power AES architectures.
First, the authors used a 32-bit AES datapath to optimize area cost.
Next, they utilized eight S-Boxes to improve throughput.
Finally, they applied a clock gating strategy to the data storage
registers to reduce power consumption. The test chip was
verified on ST FDSOI 28nm technology. It achieved a power
consumption of less than 20µW for all key configurations,
with an energy consumption of less than 1pJ/b and a
throughput of 28Mb/s at a 10MHz operating frequency.
In [4], the design of an ultra-low power AES encryption core that combines optimized architectures with a clock gating technique is presented. This AES encryption core has been implemented in silicon on thin buried oxide (SOTB) 65nm technology. The implementation results show that, by using two S-Boxes, the AES encryption core requires the smallest number of clock cycles and achieves the lowest power consumption of 0.4µW/MHz. Moreover, the proposed one-S-Box AES encryption core occupies very low hardware resources of 2.4 kilo gate equivalents (kGEs).
Zhao et al. [5] present the architectural exploration of lightweight AES accelerators with the goal of minimizing energy consumption. The number of cycles per encryption in lightweight AES designs is estimated as a function of the number of available S-Boxes. This AES architecture was implemented in a 65nm test chip and achieved 0.83pJ/bit energy at 0.32 V with a throughput of 376kbps.
Works [3], [4], [5] propose AES designs with extremely low area and low power consumption, but because they use a loop architecture and a low operating frequency, their throughput is limited and their latency is long. Therefore, these architectures are not suitable for applications that require high throughput and low latency.
In the second trend, pipelining and sub-pipelining techniques can be applied to increase the operating frequency and throughput of the AES. Hodjat et al. [6] propose AES-128 core architectures with throughputs of 30 to 70Gbps, corresponding to area costs between 180 and 275kGEs, implemented in CMOS 180nm technology. For the 30 Gbps throughput, the architecture uses outer round pipelining (one pipeline stage per round) and takes 11 cycles to encrypt a 128-bit block, so the corresponding latency is 47 ns. For the 70 Gbps throughput, the authors used a 4-stage pipeline in each round, which takes 41 cycles for each 128-bit block, corresponding to a latency of 74.9 ns. The design in [7] proposes a reconfigurable AES-128/192/256 encryption engine targeted at on-die acceleration of real-time encryption/decryption of media content on high-performance microprocessor platforms. It was fabricated in CMOS 45nm technology. The design achieves a high throughput of 53Gbps and a maximum operating frequency of 2100MHz. It spends 55 clock cycles per encryption, so its latency is 26.2 ns.
The AES core in [8], running at 1000MHz, achieves the highest throughput of 128Gbps. This architecture has 20 pipeline stages, so it needs 20 clock cycles to encrypt one block of data. Therefore, the latency of this AES core is 20ns.
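The latency figures quoted above follow from dividing the cycle count by the clock frequency. The short Python sketch below reproduces that arithmetic; the function names are illustrative, and the clock frequency for [6] is not given in the text, so it is inferred here under the assumption that a fully pipelined core emits one 128-bit block per cycle.

```python
def latency_ns(cycles: int, freq_mhz: float) -> float:
    """Latency in nanoseconds of a block that needs `cycles` clock cycles."""
    return cycles / freq_mhz * 1e3


def clock_mhz_from_throughput(throughput_gbps: float, block_bits: int = 128) -> float:
    """Clock frequency implied by one `block_bits`-bit block per cycle (assumption)."""
    return throughput_gbps * 1e3 / block_bits


# Designs [7] and [8]: cycle count and frequency are both quoted in the text.
print("[7]:", round(latency_ns(55, 2100), 1), "ns")
print("[8]:", round(latency_ns(20, 1000), 1), "ns")

# Design [6]: frequency inferred from the quoted throughput.
print("[6] outer pipelining:", round(latency_ns(11, clock_mhz_from_throughput(30)), 1), "ns")
print("[6] 4 stages per round:", round(latency_ns(41, clock_mhz_from_throughput(70)), 1), "ns")
```

The results agree with the quoted 26.2 ns, 20 ns, 47 ns and 74.9 ns up to rounding.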
Despite achieving high throughput, the designs in [8], [9], and [10] have large latency, and they are inefficient in terms of hardware resources and power consumption due to
excessive use of pipelining.
In real-time applications, latency is an important factor.
Delay in encryption and decryption, together with other types of delay,
can affect the quality of service. Latency in AES encryption is
defined by the number of cycles that each data sample
takes to go through the encryption datapath before the
encrypted output is generated. Inner round pipelining of
the AES algorithm reduces the area while the same throughput
is maintained, but the cost is an increase in latency. When
there is only outer round pipelining (one pipeline stage per
round), the latency is 11 cycles. In designs with two pipeline
stages per round, the latency is 21 cycles. For the fully inner
and outer round pipelined designs with three or four pipeline
stages per round, the latencies are 31 and 41 cycles,
respectively [6]. Our target is to design a high-throughput and
low-latency AES architecture; therefore, we focus on outer
round pipelining architectures.
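The cycle counts listed above (11, 21, 31 and 41) fit a simple pattern: ten AES-128 rounds times the number of pipeline stages per round, plus one more cycle. A minimal Python sketch of that relation follows; the function name is hypothetical, and reading the extra cycle as the initial key-addition or input stage is an assumption rather than something stated in the text.

```python
def latency_cycles(stages_per_round: int, rounds: int = 10) -> int:
    """Cycles for one block to traverse a pipelined AES-128 core.

    The "+ 1" models one additional pipeline stage outside the rounds
    (assumed here to be the initial key-addition/input stage).
    """
    return rounds * stages_per_round + 1


for stages in (1, 2, 3, 4):
    print(stages, "stage(s) per round ->", latency_cycles(stages), "cycles")
```

With one stage per round (outer round pipelining only) the count is the smallest, which is why that organisation is the natural choice when latency is the priority.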
A. Proposed AES architecture
The proposed AES core architecture is
presented in Fig. 3. In this architecture, Cipher Round 1
through Cipher Round 9 and the Final Round are combinational
logic blocks of the AES-128 encryption. To achieve high
throughput, we insert registers between the cipher rounds.
The pipeline architecture ensures that, once the pipeline stages are
fully filled with data, one 128-bit block is
encrypted every clock cycle. In our design, each encryption round is a
single block of combinational logic: to minimize the latency,
we do not pipeline inside the rounds.
We apply parallel techniques to reduce the number of clock cycles
per encryption.
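To illustrate the fill-then-stream behaviour of an outer round pipeline, the following Python sketch models the pipeline registers as a simple shift chain and records the clock cycle at which each block leaves it. The function name `simulate_pipeline` and the default of 11 stages are illustrative assumptions; the round logic itself is not modelled, only the movement of blocks through the registers.

```python
def simulate_pipeline(num_blocks: int, stages: int = 11) -> list:
    """Return the 1-based clock cycle at which each block exits the pipeline."""
    registers = [None] * stages   # one register per pipeline stage
    exit_cycle = {}
    next_block = 0
    cycle = 0
    while len(exit_cycle) < num_blocks:
        cycle += 1
        # A new block enters on every clock edge while input is available.
        incoming = next_block if next_block < num_blocks else None
        if incoming is not None:
            next_block += 1
        # Every register takes the value held by the previous stage.
        registers = [incoming] + registers[:-1]
        finished = registers[-1]
        if finished is not None:
            exit_cycle[finished] = cycle
    return [exit_cycle[i] for i in range(num_blocks)]


print(simulate_pipeline(4))
```

In this model the first block appears after as many cycles as there are stages, and each later block follows exactly one cycle behind the previous one.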
B. Proposed CipherRound architecture
The micro architecture of each cipher round is presented
in Fig. 1. There are four main operations in each cipher round:
SubMatrix, ShiftMatrix, MixMatrix, and AddRoundKey. To
provide sub-keys for the ten Cipher Round transformations, we
design a KeyExpansion architecture that includes ten RoundKey
transformations. Between the RoundKey transformations, we
insert registers to store the sub-keys for each cipher round. The
details of these operations are presented in the following
sub-sections.
[Figure: block diagram of a Cipher Round with SubMatrix, ShiftMatrix, MixMatrix, and AddRoundKey blocks]
Fig. 1. The micro architecture of CipherRound
C. Proposed SubMatrix transformation
To speed up the encryption process, we apply parallelization to each SubMatrix transformation. The input per round is 128 bits (16 bytes) assembled into a 4 × 4 byte matrix. Therefore, the SubMatrix transformation uses 16 S-Boxes in each round, one per state byte; each S-Box is a 16 × 16-byte look-up table. The micro architecture of SubMatrix is shown in Fig. 2.
[Figure: 128-bit input feeding sixteen parallel S-Box blocks, producing a 128-bit output]
Fig. 2. The micro architecture of parallel S-Boxes in the SubMatrix transformation
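As a software illustration of the sixteen parallel look-ups, the Python sketch below builds the standard AES S-Box as a 256-entry table (multiplicative inverse in GF(2^8) followed by the affine transformation) and applies it to every byte of a 16-byte state. The helper names are hypothetical, and the table is computed rather than hard-coded only to keep the example short; in hardware each of the 16 S-Boxes would be its own copy of this table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), computed as a^254 (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """Build the 256-entry AES S-Box look-up table."""
    table = []
    for value in range(256):
        inv = gf_inv(value)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()


def sub_matrix(state: bytes) -> bytes:
    """Apply the S-Box to all 16 state bytes (16 independent look-ups)."""
    return bytes(SBOX[b] for b in state)


state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
print(sub_matrix(state).hex())
```

Because the 16 look-ups do not depend on each other, they can all be evaluated at the same time, which is what the parallel arrangement exploits.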
D. Proposed ShiftMatrix transformation
The ShiftMatrix transformation is implemented through simple signal wiring.
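Because the row shifts only move bytes to new positions, the transformation is a fixed permutation of the 16 state bytes, which is why it costs no logic in hardware. The Python sketch below expresses that permutation as an index table; it assumes the column-major byte ordering of the AES standard (byte index = row + 4 × column), and the names are illustrative.

```python
# Output byte at (row, col) is taken from input byte at (row, (col + row) mod 4).
SHIFT_INDEX = [row + 4 * ((col + row) % 4) for col in range(4) for row in range(4)]


def shift_matrix(state: bytes) -> bytes:
    """Cyclically shift row r of the 4 x 4 state left by r bytes (pure rewiring)."""
    return bytes(state[i] for i in SHIFT_INDEX)


state = bytes.fromhex("d42711aee0bf98f1b8b45de51e415230")
print(shift_matrix(state).hex())
```

The table `SHIFT_INDEX` plays the role of the wiring: each output position is connected to exactly one input position.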
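E. Proposed MixMatrix transformation
The MixColumns transformation multiplies each column of the input matrix by the matrix M, as in (1). Multiplication operations are performed in GF(2^8) modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$.

\[
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix}
\tag{1}
\]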
[Fig. 3: top-level AES architecture (AESTOP) with Cipher Round blocks separated by pipeline registers (D/Clk/Q), a KeyExpansion block supplying the round keys starting from RoundKey (0), and a Control block. Ports: PlainText_IN (128 bits), CipherKey_IN (128 bits), CipherText_OUT (128 bits), Clk, START, RESET, NewCipherKey_SI.]
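Therefore, we first consider the product {02} ⋅ f(x), where a byte is written as the polynomial

\[ f(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 \]

where the $b_i$ are bits (each taking a value of '0' or '1'), and {02} = x.

If $b_7 = 0$ then:

\[ x f(x) = b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x \tag{2} \]

If $b_7 = 1$ then:

\[
\begin{aligned}
x f(x) &= (x^8 + b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) \bmod m(x) \\
&= [(x^8 + x^4 + x^3 + x + 1) + (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= [m(x) + (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= (b_6 x^7 + b_5 x^6 + b_4 x^5 + b_3 x^4 + b_2 x^3 + b_1 x^2 + b_0 x) + (x^4 + x^3 + x + 1)
\end{aligned}
\tag{3}
\]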
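From (2) and (3), we have:

\[
b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000010 =
\begin{cases}
b_6 b_5 b_4 b_3 b_2 b_1 b_0 0 & \text{if } b_7 = 0 \\
b_6 b_5 b_4 b_3 b_2 b_1 b_0 0 \oplus 00011011 & \text{if } b_7 = 1
\end{cases}
\tag{4}
\]

So, we can write:
{02} ∙ Byte = ((b6 b5 b4 b3 b2 b1 b0 & '0') xor x"1B")
when b7 = '1' else (b6 b5 b4 b3 b2 b1 b0 & '0');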
Secondly, we calculate {03} ⋅ $S_{i,c}$.
Because 0x03 = 0x02 ⨁ 0x01, we have:

\[
b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000011 = (b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0 \times 00000010) \oplus b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0
\tag{5}
\]

In other words, {03} × {Byte} = ({02} × {Byte}) ⨁ {Byte}.
So, we propose the micro architecture of the MixColumn
process as in Fig. 4.
[Figure: for each input byte In_i (i = 0 to 3), In_i(6:0)&'0' is XORed with 0b00011011 and a multiplexer selected by In_i(7) yields 2 x In_i; 3 x In_i is formed from 2 x In_i and 1 x In_i; XOR gates combine these terms into Out_0(7:0) to Out_3(7:0)]
Fig. 4. The architecture of MixColumn transformation
In this figure, we use 2-input XOR gates and multiplexers
to obtain {02} × {Byte_In} and {03} × {Byte_In}, and then calculate Byte_Out using 4-input XOR gates.
Because we use a parallel architecture, four MixColumn blocks operate together to form the MixMatrix.
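The shift-and-conditional-XOR rule for {02} and the identity {03} × {Byte} = ({02} × {Byte}) ⨁ {Byte} can be checked in software. The Python sketch below mirrors the structure described for Fig. 4 under the assumption of column-major state ordering; the function names (`xtime`, `mul3`, `mix_column`, `mix_matrix`) are illustrative and do not come from the RTL.

```python
def xtime(byte: int) -> int:
    """{02} x byte: shift left, then XOR 0x1B when the dropped bit b7 was 1."""
    shifted = (byte << 1) & 0xFF
    return shifted ^ 0x1B if byte & 0x80 else shifted


def mul3(byte: int) -> int:
    """{03} x byte = ({02} x byte) XOR byte."""
    return xtime(byte) ^ byte


def mix_column(col):
    """Multiply one 4-byte column by the MixColumns matrix of equation (1)."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


def mix_matrix(state: bytes) -> bytes:
    """Apply mix_column to the four columns of a column-major 16-byte state."""
    out = []
    for c in range(4):
        out.extend(mix_column(state[4 * c:4 * c + 4]))
    return bytes(out)


print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
print(mix_matrix(bytes.fromhex("d4bf5d30e0b452aeb84111f11e2798e5")).hex())
```

Each output byte is the XOR of four terms, matching the 4-input XOR gates, and the four calls to `mix_column` are independent, matching the four parallel MixColumn blocks.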
F. Proposed AddRoundKey transformation
The AddRoundKey transformation combines the State with one of the sub-keys. The operation is a simple XOR between each byte of the State and the corresponding byte of the sub-key.
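A minimal Python sketch of this byte-wise XOR is given below. The function name is hypothetical, and the sample state and sub-key are taken from the worked cipher example in the AES standard, not from this design's test bench.

```python
def add_round_key(state: bytes, sub_key: bytes) -> bytes:
    """XOR each state byte with the sub-key byte at the same position."""
    if len(state) != 16 or len(sub_key) != 16:
        raise ValueError("state and sub-key must both be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, sub_key))


state = bytes.fromhex("046681e5e0cb199a48f8d37a2806264c")
sub_key = bytes.fromhex("a0fafe1788542cb123a339392a6c7605")
print(add_round_key(state, sub_key).hex())
```

Applying the same sub-key twice restores the original state, since XOR is its own inverse.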
G. Proposed Key Expansion transformation
We propose a micro architecture that creates the sub-keys for each AES round, as described in Fig. 5. In this architecture, the 128-bit key of the previous round is divided into four 32-bit words and brought to the input of the Key Expansion round, which performs the RootWord and SubWord transforms and an XOR with the RCon register to generate the 128-bit key for the next round.
[Figure: KeyExp Round with 32-bit input words W_in0 to W_in3 (128 bits in total), RootWord and SubWord (four S-Boxes) blocks, the RCon register, and 32-bit output words W_out0 to W_out3 (128 bits in total)]
Fig. 5. The architecture of Key Expansion round
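The key expansion round can be sketched in Python as follows. This is an illustration of the standard AES-128 key schedule step using the signal names of Fig. 5 (W_in, W_out, RootWord, SubWord, RCon), not a transcription of the RTL; the helper functions and the computed S-Box table are assumptions made to keep the example self-contained.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a = ((a << 1) & 0xFF) ^ (0x1B if a & 0x80 else 0)
    return result


def _sbox_entry(value: int) -> int:
    """AES S-Box entry: GF(2^8) inverse followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = _gf_mul(inv, value)
    inv = inv if value else 0
    rot = lambda x, n: ((x << n) | (x >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(v) for v in range(256)]


def key_exp_round(prev_key: bytes, rcon: int) -> bytes:
    """Derive the next 128-bit round key from the previous one."""
    w_in = [prev_key[4 * i:4 * i + 4] for i in range(4)]
    rotated = w_in[3][1:] + w_in[3][:1]              # RootWord: rotate left by one byte
    substituted = bytes(SBOX[b] for b in rotated)    # SubWord: four S-Box look-ups
    temp = bytes([substituted[0] ^ rcon]) + substituted[1:]  # XOR with RCon
    w_out = []
    previous = temp
    for word in w_in:
        # W_out0 = W_in0 ^ temp, then W_out(i) = W_in(i) ^ W_out(i-1)
        previous = bytes(a ^ b for a, b in zip(word, previous))
        w_out.append(previous)
    return b"".join(w_out)


cipher_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
print(key_exp_round(cipher_key, 0x01).hex())
```

Chaining ten such rounds, each with its own RCon value, yields the ten sub-keys needed by the ten cipher rounds.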
The proposed high-throughput outer round pipelined AES core has been modelled in RTL VHDL and then implemented using Synopsys tools with a CMOS 45nm technology from NANGATE. The experimental results are shown in TABLE I. From the obtained results, the proposed design achieves an ultra-high throughput of 111.3Gbps at the operating frequency of 870 MHz. The power consumption is only 56.3mW. The area cost is 0.13mm2 (equivalent to 164.5kGEs). Therefore, the design achieves an energy efficiency of 1977Gbps/W.
From the experimental results, our design can run at a maximum operating frequency of 870MHz. When the pipeline stages of our design are fully filled with data, 128 output bits are produced every clock cycle. Thus, the maximum throughput that our design can achieve is:

\[ \text{Throughput} = 0.87\,\text{GHz} \times 128\,\text{bits} \approx 111.3\,\text{Gbps} \]

In this design, we used 11 pipeline stages, so the latency is 11 clock cycles (equivalent to 12.6ns):

\[ \text{Latency} = \frac{11}{0.87\,\text{GHz}} \approx 12.6\,\text{ns} \]
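The two figures can be reproduced with a few lines of Python. The function names below are illustrative; the only inputs are the 870 MHz clock, the 128-bit block size and the 11 pipeline stages stated in the text.

```python
def throughput_gbps(freq_mhz: float, block_bits: int = 128) -> float:
    """Throughput of a full pipeline that emits one block per clock cycle."""
    return freq_mhz * 1e6 * block_bits / 1e9


def latency_ns(stages: int, freq_mhz: float) -> float:
    """Time for one block to pass through `stages` pipeline stages."""
    return stages / freq_mhz * 1e3


print(f"throughput = {throughput_gbps(870):.2f} Gbps")
print(f"latency    = {latency_ns(11, 870):.2f} ns")
```

The unrounded values are about 111.36 Gbps and 12.64 ns, which the text reports as 111.3 Gbps and 12.6 ns.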
TABLE I. AES ENCRYPTION IMPLEMENTATION
TABLE I summarizes our design implementation results
on NANGATE CMOS 45nm technology. The comparison of
our architecture with the related works is shown in TABLE
II. Compared to the works in [7] and [8], which were
implemented at the same technology node, our design
achieves a throughput similar to that of [8] and twice
that of [7], but has a lower latency and occupies less
area. Compared to the other works, our design has the lowest
latency and the highest efficiency in using hardware resources
(area efficiency in terms of Gbps/mm2). It has 2.4x better area
efficiency and 2x lower latency than the design in [7], and
its power consumption is also 2.2x lower. In [8], although the throughput was higher, the hardware resource efficiency is 400 times lower and the power consumption is 109 times higher than in our design. From the area point of view, our AES architecture is 48 times smaller than the design in [8]. Therefore, our proposed design is suitable for real-time applications with a low-cost hardware implementation. As shown in Fig. 6, the energy efficiency and area efficiency of our design are much higher than those of other related works.
Fig. 6. Comparisons with the related works
TABLE II. COMPARISON OF THE PROPOSED DESIGN AND DIFFERENT AES ARCHITECTURES
Design | CLK (MHz) | Cycles per encryption | Tech (nm) | Area (mm2) | Area (kGate) | Power (mW) | Throughput (Gbps) | Latency (ns) | Energy Efficiency (Gbps/W) | Area Efficiency (Gbps/mm2)
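The two efficiency metrics named in the table header are simple ratios of throughput to power and to area. The Python sketch below evaluates them for the proposed design from the rounded figures given in the text (111.3 Gbps, 56.3 mW, 0.13 mm2); the function names are illustrative, and values for the other designs are not reproduced here because their table rows are not available.

```python
def energy_efficiency_gbps_per_w(throughput_gbps: float, power_mw: float) -> float:
    """Throughput per watt of power consumption."""
    return throughput_gbps / (power_mw / 1000.0)


def area_efficiency_gbps_per_mm2(throughput_gbps: float, area_mm2: float) -> float:
    """Throughput per square millimetre of silicon area."""
    return throughput_gbps / area_mm2


print(f"energy efficiency = {energy_efficiency_gbps_per_w(111.3, 56.3):.0f} Gbps/W")
print(f"area efficiency   = {area_efficiency_gbps_per_mm2(111.3, 0.13):.0f} Gbps/mm2")
```

The energy efficiency obtained this way matches the 1977 Gbps/W quoted for the design; the area efficiency is computed from the rounded 0.13 mm2 and should be read as approximate.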
Broadband communications (5G networks, IoT gateways,
optical transmission systems, surveillance cameras, video
conferencing) require increasing quality of service and data
security. Therefore, designing security cores with high
throughput and low latency is always a challenge, especially
under hardware cost and power consumption constraints. In
this paper, we proposed an AES core architecture for
high-throughput and real-time applications. The outer round pipeline and
fully parallel architecture allow us to increase the operating
frequency and reduce the latency. Our design operates at
870MHz on NANGATE CMOS 45nm technology and achieves a
high throughput of 111.3Gbps, a low latency of 12.6 ns, and an
energy efficiency of 1977Gbps/W.
V. ACKNOWLEDGMENT
This research is partly funded by the Ministry of Science and Technology (MOST) of Vietnam under grant number KC.01.21/16-20.
[1] […] https://www.pearson.com/us/higher-education/program/Stallings-Cryptography-and-Network-Security-Principles-and-Practice-7th-Edition/PGM334401.html [Accessed: 21-Sep-2018].
[2] […] and fully pipelined implementation of AES algorithm on FPGA,” Microprocessors and Microsystems, vol. 39, no. 7, pp. 480–493, Oct. 2015.
[3] […] Low-Power Low-Energy Multisecurity-Level Internet-of-Things Applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 12, pp. 3281–3290, Dec. 2017.
[4] […] ultra-low power AES encryption cores with silicon demonstration in SOTB CMOS process,” Electronics Letters, vol. 53, no. 23, pp. 1512–1514, 2017.
[5] […] minimum-energy operation and silicon demonstration in 65nm with lowest energy per encryption,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015, pp. 2349–2352.
[6] […] trade-offs for fully pipelined 30 to 70 Gbits/s AES processors,” IEEE Transactions on Computers, vol. 55, no. 4, pp. 366–372, Apr. 2006.
[7] […] Composite-Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-Performance Microprocessors,” IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 767–776, Apr. 2011.
[8] […] throughput reconfigurable cryptographic processor,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 155–161.
[9] […] an ultra high speed AES processor for next generation IT security,” Comput. Electr. Eng., vol. 37, no. 6, pp. 1160–1170, Nov. 2011.
[10] B. Erbagci, N. E. C. Akkaya, C. Teegarden, and K. Mai, “A 275 Gbps AES encryption accelerator using ROM-based S-boxes in 65nm,” in 2015 IEEE Custom Integrated Circuits Conference (CICC), 2015, pp. 1–4.
[11] P. Liu, J. Hsiao, H. Chang, and C. Lee, “A 2.97 Gb/s DPA-resistant AES engine with self-generated random sequence,” in 2011 Proceedings of the ESSCIRC (ESSCIRC), 2011, pp. 71–74.
[12] K. Rahimunnisa, P. Karthigaikumar, N. Christy, S. Kumar, and J. Jayakumar, “PSP: Parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC,” Open Comput. Sci., vol. 3, no. 4, pp. 173–186, 2013.