
A 45nm High-Throughput and Low Latency AES Encryption for Real-Time Applications

Pham-Khoi Dong, Hung K. Nguyen, Xuan-Tu Tran*
SISLAB, VNU University of Engineering and Technology – 144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam
*) Corresponding author’s email: tutx@vnu.edu.vn

Abstract— In this paper, we propose a high-throughput and low-latency AES architecture for wideband and real-time applications such as surveillance cameras, video conferencing, motion detection, IoT gateways, and data storage encryption. Our design uses an outer round pipeline technique to achieve high throughput. The design has been modelled in RTL VHDL and then synthesized with a 45nm CMOS technology using Synopsys Design Compiler. The implementation results show that the proposed architecture achieves a throughput of 111.3 Gbps and a latency of 12.6 ns at the maximum operating frequency of 870 MHz. With the same 45nm CMOS technology, our design has 2.4 times higher area efficiency and 4.7 times higher energy efficiency than other related works.

Keywords— AES, cryptography, high throughput, low latency, real-time applications

The Advanced Encryption Standard (AES) was published by the National Institute of Standards and Technology (NIST) in 2001. It is a symmetric block cipher that is intended to replace DES as the approved standard for a wide range of applications [1]. In AES, the number of cipher rounds depends on the size of the key: it is 10, 12, or 14 for 128-, 192-, or 256-bit keys, respectively. Each AES encryption round applies four primary operations in sequence: SubBytes, ShiftRows, MixColumns, and AddRoundKey.

AES implementations can be broadly classified into software and hardware implementations. Compared to a software implementation, a hardware implementation of AES, by nature, provides more physical security and higher speed. In general, the hardware implementation can be performed in either FPGA or ASIC technology [2]. In hardware implementations, there are two main trends: low-power designs with limited performance, and high-performance AES cores.

D.-H. Bui et al. [3] present hardware optimization strategies for high-speed ultralow-power AES architectures. First, the authors used a 32-bit AES datapath to optimize area cost. Next, they utilized eight S-Boxes to improve throughput. Finally, they applied a clock gating strategy to the data storage registers to reduce power consumption. The test chip was verified on ST FDSOI 28nm technology. It achieved a power consumption of less than 20 µW for all key configurations, with an energy consumption of less than 1 pJ/b and a throughput of 28 Mb/s at a 10 MHz operating frequency.

In [4], the design of an ultra-low power AES encryption core that combines optimized architectures with the clock gating technique is presented. This AES encryption core has been implemented on silicon on thin buried oxide (SOTB) 65nm technology. The implementation results show that, by using two S-Boxes, the AES encryption core requires the smallest number of clock cycles and achieves the lowest power consumption of 0.4 µW/MHz. Moreover, the proposed one S-Box AES encryption core occupies very low hardware resources of 2.4 kilo gate equivalents (kGEs).

Zhao et al. [5] present the architectural exploration of lightweight AES accelerators with the goal of minimizing energy consumption. The number of cycles per encryption in lightweight AES designs is estimated as a function of the number of available S-Boxes. This AES architecture was implemented in a 65nm test chip and achieved 0.83 pJ/bit energy at 0.32 V with a throughput of 376 kbps.

The works in [3], [4], and [5] propose AES designs with extremely low area and low power consumption. However, because they use a loop architecture and a low operating frequency, their throughput is limited and their latency is long. Therefore, these architectures are not suitable for applications that require high throughput and low latency.

In the second trend, pipelining and sub-pipelining techniques can be applied to increase the operating frequency and throughput of the AES. Hodjat et al. [6] propose AES-128 core architectures with throughputs of 30 to 70 Gbps, corresponding to an area cost between 180 and 275 kGEs, implemented on CMOS 180nm technology. At 30 Gbps throughput, the architecture uses outer round pipelining (one pipeline stage per round) and takes 11 cycles to encrypt a 128-bit block; the corresponding latency is 47 ns. At 70 Gbps throughput, the authors used a 4-stage pipeline in each round, which takes 41 cycles for each 128-bit block, corresponding to a latency of 74.9 ns.

The design in [7] proposed a reconfigurable AES-128/192/256 encryption engine targeted at the on-die acceleration of real-time encryption/decryption of media content on high-performance microprocessor platforms. It was fabricated in CMOS 45nm technology. The design achieves a high throughput of 53 Gbps and a maximum operating frequency of 2100 MHz. It spends 55 clock cycles per encryption, so its latency is 26.2 ns.

The AES core in [8], running at 1000 MHz, achieves the highest throughput of 128 Gbps. This architecture has 20 pipeline stages, so it needs 20 clock cycles to encrypt one block of data. Therefore, the latency of this AES core is 20 ns.

Despite achieving high throughput, the designs in [8], [9], and [10] have a large latency, and they are inefficient in terms of hardware resources and power consumption due to excessive use of pipelining.

In real-time applications, latency is an important factor. Delay in encryption and decryption, added to other types of delay, can affect the quality of service. Latency in AES encryption is defined as the number of cycles that each data sample takes to go through the encryption datapath before the encrypted output is generated. Inner round pipelining of the AES algorithm reduces the area while maintaining the same throughput, but at the cost of increased latency. When there is only outer round pipelining (one pipeline stage per round), the latency is 11 cycles. In designs with two pipeline stages per round, the latency is 21 cycles. For the fully inner and outer round pipelined designs with three or four pipeline stages per round, the latencies are 31 and 41 cycles, respectively [6]. Our target is to design a high-throughput and low-latency AES architecture; therefore, we focus on outer round pipelining architectures.
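The cycle counts quoted above fit a simple pattern for AES-128: ten rounds of k pipeline stages each, plus one additional stage. The Python sketch below reproduces them and converts cycles to nanoseconds. The formula 10k + 1 and the function names are assumptions made for illustration; they are inferred from the quoted figures and are not taken from [6].

```python
def latency_cycles(stages_per_round: int, rounds: int = 10) -> int:
    """Cycles for one block to traverse the pipeline.

    Assumption: `rounds` rounds of `stages_per_round` stages each,
    plus one extra stage, which matches the 11/21/31/41 figures.
    """
    return rounds * stages_per_round + 1


def latency_ns(cycles: int, f_mhz: float) -> float:
    """Convert a cycle count to nanoseconds at a clock of f_mhz MHz."""
    return cycles * 1000.0 / f_mhz


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(k, "stage(s) per round:", latency_cycles(k), "cycles")
    # Figures quoted for the related works: 55 cycles at 2100 MHz [7],
    # 20 cycles at 1000 MHz [8].
    print(round(latency_ns(55, 2100), 1), "ns")
    print(round(latency_ns(20, 1000), 1), "ns")
```

The same conversion explains why a design with a much higher clock frequency, such as [7], can still have a longer latency: the number of cycles per block grows faster than the clock period shrinks.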

A. Proposed AES architecture

The proposed architecture of the AES core is presented in Fig. 3. In this architecture, Cipher Round 1 through Cipher Round 9 and the Final Round are combinational logic blocks of the AES-128 encryption. To achieve high throughput, we insert registers between the cipher rounds. The pipeline architecture ensures that, once the pipeline stages are fully filled with data, a 128-bit block is encrypted after each clock cycle. In our design, each encryption round is a combinational logic block. To minimize the latency, the proposed architecture does not use pipelining inside the rounds. We apply parallel techniques to reduce the number of clock cycles per encryption.
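The timing behaviour of an outer round pipeline can be illustrated with a small model. The Python sketch below is only a timing model under stated assumptions: it shifts opaque block labels through a chain of registers and performs no encryption, the function name `simulate_outer_pipeline` is hypothetical, and the default of 11 stages is the stage count reported in the implementation results of this paper.

```python
def simulate_outer_pipeline(blocks, n_stages=11):
    """Return (cycle, block) pairs showing when each block leaves the pipeline.

    Each clock edge shifts every register one position forward and loads
    the next input block, if any, into the first register.
    """
    regs = [None] * n_stages
    pending = list(blocks)
    finished = []
    cycle = 0
    # Keep clocking while inputs remain or blocks are still in flight.
    while pending or any(r is not None for r in regs[:-1]):
        cycle += 1
        new_block = pending.pop(0) if pending else None
        regs = [new_block] + regs[:-1]
        if regs[-1] is not None:
            finished.append((cycle, regs[-1]))
    return finished


if __name__ == "__main__":
    for cycle, block in simulate_outer_pipeline(["B0", "B1", "B2"]):
        print(f"cycle {cycle}: {block} done")
```

In this model the first block appears after 11 cycles and every following block one cycle later, which is the fill-then-stream behaviour described above.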

B. Proposed CipherRound architecture

The micro architecture of each cipher round is presented in Fig. 1. There are four main operations in each cipher round: SubMatrix, ShiftMatrix, MixMatrix, and AddRoundKey. To provide sub-keys for the ten Cipher Round transformations, we design a KeyExpansion architecture that includes ten RoundKey transformations. Between RoundKey transformations, we insert registers to store the sub-keys for each cipher round. The details of these operations are presented in the following sub-sections.

[Figure: block diagram of one Cipher Round, in which the state columns S0,c to S3,c pass through SubMatrix, ShiftMatrix, MixMatrix, and AddRoundKey to give S'0,c to S'3,c]

Fig. 1. The micro architecture of CipherRound

C. Proposed SubMatrix transformation

To speed up the encryption process, we apply parallelization to each SubMatrix transformation. The input of each round is 128 bits (16 bytes) arranged as a 4 × 4 byte matrix. Therefore, the SubMatrix transformation uses 16 S-Boxes in each round; each S-Box is a 16 × 16-byte look-up table. The micro architecture of SubMatrix is shown in Fig. 2.

[Figure: the 128-bit input is split into 16 bytes, each fed to one of 16 parallel S-Boxes]

Fig. 2. The micro architecture of Parallel S-Boxes in SubMatrix transformation
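As an illustration of the 16 × 16-byte look-up table and its parallel use, the Python sketch below builds the standard AES S-Box (multiplicative inverse in GF(2^8) followed by the affine transformation) and applies it to all 16 state bytes. It is a software model only; the names `build_sbox`, `SBOX_TABLE`, and `sub_matrix` are hypothetical, and the paper's hardware stores the table rather than computing it.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by AES convention."""
    if a == 0:
        return 0
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Inverse followed by the AES affine transformation (constant 0x63)."""
    sbox = []
    for a in range(256):
        v = gf_inv(a)
        sbox.append(v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3) ^ rotl8(v, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
# 16 x 16 view: row = high nibble of the input byte, column = low nibble.
SBOX_TABLE = [SBOX[16 * row:16 * row + 16] for row in range(16)]


def sub_matrix(state):
    """Apply the S-Box to all 16 state bytes (16 independent look-ups)."""
    assert len(state) == 16
    return bytes(SBOX[b] for b in state)


if __name__ == "__main__":
    print(hex(SBOX_TABLE[0x5][0x3]))
```

Because the 16 look-ups do not depend on one another, hardware can perform them all in the same clock cycle with 16 table copies, which is the trade of area for latency that this sub-section describes.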

D. Proposed ShiftMatrix transformation

The ShiftMatrix transformation is implemented through simple signal wiring.
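Since the row shifts are a fixed permutation of byte positions, they need no logic gates, only routing. The Python sketch below expresses that permutation as an index table. It assumes the standard AES column-major state layout (byte index = row + 4 × column) and the standard ShiftRows offsets; the names `SHIFT_WIRING` and `shift_matrix` are hypothetical.

```python
# Output byte at (row r, column c) is wired from input (row r, column (c + r) mod 4).
# Assumption: column-major layout, i.e. byte index = r + 4 * c.
SHIFT_WIRING = [r + 4 * ((c + r) % 4) for c in range(4) for r in range(4)]


def shift_matrix(state):
    """Permute 16 state bytes; row r is rotated left by r positions."""
    assert len(state) == 16
    return bytes(state[src] for src in SHIFT_WIRING)


if __name__ == "__main__":
    print(SHIFT_WIRING)
```

In an RTL description the same table would become a list of signal assignments, one per output byte.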

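E. Proposed MixMatrix transformation

The MixColumns transformation multiplies each column of the input matrix by the matrix M, as shown in (1). Multiplication operations are performed in GF(2^8) modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$.

$$\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix} \tag{1}$$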

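[Fig. 3: block diagram of the top-level AES core (AESTOP), showing the Control and KeyExpansion blocks, RoundKey (0), and the Cipher Round stages separated by D registers, with the ports PlainText_IN (128), CipherKey_IN, CipherText_OUT (128), START, NewCipherKey_SI, RESET, and Clk; caption not recovered]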


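Firstly, we calculate $\{02\} \cdot S_{r,c}$. A byte $b_7 b_6 b_5 b_4 b_3 b_2 b_1 b_0$ is represented by the polynomial

$$f(x) = b_7x^7 + b_6x^6 + b_5x^5 + b_4x^4 + b_3x^3 + b_2x^2 + b_1x + b_0$$

where $b_i$ are bits (taking a value of '0' or '1'), and $\{02\} = x$. Therefore:

If $b_7 = 0$ then:

$$xf(x) = b_6x^7 + b_5x^6 + b_4x^5 + b_3x^4 + b_2x^3 + b_1x^2 + b_0x \tag{2}$$

If $b_7 = 1$ then:

$$\begin{aligned}
xf(x) &= (x^8 + b_6x^7 + b_5x^6 + b_4x^5 + b_3x^4 + b_2x^3 + b_1x^2 + b_0x) \bmod m(x) \\
&= [(x^8 + x^4 + x^3 + x + 1) + (b_6x^7 + b_5x^6 + b_4x^5 + b_3x^4 + b_2x^3 + b_1x^2 + b_0x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= [m(x) + (b_6x^7 + b_5x^6 + b_4x^5 + b_3x^4 + b_2x^3 + b_1x^2 + b_0x) + (x^4 + x^3 + x + 1)] \bmod m(x) \\
&= (b_6x^7 + b_5x^6 + b_4x^5 + b_3x^4 + b_2x^3 + b_1x^2 + b_0x) + (x^4 + x^3 + x + 1)
\end{aligned} \tag{3}$$

From (2) and (3), we have:

$$b_7b_6b_5b_4b_3b_2b_1b_0 \times 00000010 = \begin{cases} b_6b_5b_4b_3b_2b_1b_00 & \text{if } b_7 = 0 \\ b_6b_5b_4b_3b_2b_1b_00 \oplus 00011011 & \text{if } b_7 = 1 \end{cases} \tag{4}$$

So, we can write:

{02} ∙ Byte = ((b6 b5 b4 b3 b2 b1 b0 & '0') xor x"1B") when b7 = '1' else (b6 b5 b4 b3 b2 b1 b0 & '0');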
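The shift-and-conditional-XOR rule for multiplying a byte by {02} can be checked in software. The Python sketch below is a minimal model of that rule; the function name `xtime` follows common AES usage and is not a name used in this paper.

```python
def xtime(byte):
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    shifted = (byte << 1) & 0xFF           # b6..b0 followed by '0'
    if byte & 0x80:                        # b7 = 1: reduce by XOR with 0x1B
        return shifted ^ 0x1B
    return shifted                         # b7 = 0: the shift alone is enough


if __name__ == "__main__":
    value = 0x57
    for _ in range(4):
        value = xtime(value)
        print(hex(value))
```

Starting from {57}, repeated doubling gives {ae}, {47}, {8e}, and {07}, which matches the worked multiplication example in the AES standard.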

Secondly, we calculate $\{03\} \cdot S_{r,c}$. Because 0x03 = 0x02 ⊕ 0x01, we have:

$$b_7b_6b_5b_4b_3b_2b_1b_0 \times 00000011 = (b_7b_6b_5b_4b_3b_2b_1b_0 \times 00000010) \oplus b_7b_6b_5b_4b_3b_2b_1b_0 \tag{5}$$

In other words, $\{03\} \times \{Byte\} = (\{02\} \times \{Byte\}) \oplus \{Byte\}$.

So, we propose the micro architecture of the MixColumn process shown in Fig. 4.

[Figure: for each input byte In_i(7:0), i = 0 to 3, the value In_i(6:0)&'0' and its XOR with 0b00011011 feed a multiplexer selected by In_i(7) to give 2 x In_i; 3 x In_i is formed from 2 x In_i and 1 x In_i; the 1 x, 2 x, and 3 x terms are combined into Out_0(7:0) to Out_3(7:0)]

Fig. 4. The architecture of MixColumn transformation

In this figure, we use 2-input XOR gates and multiplexers to obtain {02} × {Byte_In} and {03} × {Byte_In}, and then calculate Byte_Out using 4-input XOR gates.

Because the architecture is parallel, we use four MixColumn blocks together to form the MixMatrix.
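The structure of one MixColumn block, and of four of them side by side, can be modelled in a few lines. The Python sketch below assumes the standard AES MixColumns matrix and a column-major 16-byte state; the names `mul2`, `mul3`, `mix_column`, and `mix_matrix` are hypothetical.

```python
def mul2(b):
    """{02} x b: shift left, XOR with 0x1B when the top bit was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mul3(b):
    """{03} x b = ({02} x b) XOR b."""
    return mul2(b) ^ b


def mix_column(col):
    """Multiply one 4-byte column by the MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [
        mul2(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ mul2(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ mul2(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ mul2(a3),
    ]


def mix_matrix(state):
    """Apply mix_column to the four columns of a column-major 16-byte state."""
    assert len(state) == 16
    out = []
    for c in range(4):
        out.extend(mix_column(state[4 * c:4 * c + 4]))
    return bytes(out)


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Each output byte is the XOR of four terms, which corresponds to the 4-input XOR gates mentioned above; the four columns share no data, so they can be computed at the same time.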

F. Proposed AddRoundKey transformation

The AddRoundKey transformation performs an operation on the State with one of the sub-keys. The operation is a simple XOR function between each byte of the State and the corresponding byte of the sub-key.
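For completeness, the byte-wise XOR can be sketched as follows. The function name `add_round_key` is hypothetical, and the example values are the plaintext and cipher key from the worked example in the AES standard, used here only to show the operation.

```python
def add_round_key(state, sub_key):
    """XOR each state byte with the corresponding sub-key byte."""
    assert len(state) == 16 and len(sub_key) == 16
    return bytes(s ^ k for s, k in zip(state, sub_key))


if __name__ == "__main__":
    plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
    cipher_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
    print(add_round_key(plaintext, cipher_key).hex())
```

Applying the same sub-key twice restores the original state, which is why the operation needs no separate inverse in decryption.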

G. Proposed Key Expansion transformation

We propose a micro architecture that creates the sub-key for each AES round, as described in Fig. 5. In this architecture, the 128-bit key of the previous round is divided into four 32-bit words and brought to the input of the Key Expansion round. The round then applies the RootWord and SubWord transformations and an XOR with the RCon register to generate the 128-bit key for the next round.

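[Figure: one KeyExp Round with a 128-bit key split into the 32-bit input words W_in0 to W_in3, the RootWord block, the SubWord block (four S-Boxes), the RCon register, and the 32-bit output words W_out0 to W_out3]

Fig. 5. The architecture of Key Expansion round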
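One Key Expansion round can be modelled in software as below. This Python sketch assumes the standard AES-128 key schedule, in which the block labelled RootWord in Fig. 5 is taken to be the standard one-byte word rotation; the names `key_exp_round` and `expand_key` are hypothetical, and the S-Box is computed rather than stored to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x11B) if a & 0x80 else (a << 1)
        b >>= 1
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    sbox = []
    for a in range(256):
        v = 0 if a == 0 else next(b for b in range(1, 256) if gf_mul(a, b) == 1)
        sbox.append(v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3) ^ rotl8(v, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def sub_word(word):
    """Apply the S-Box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def key_exp_round(w_in, rcon):
    """Derive the next four 32-bit key words from the previous four."""
    w0, w1, w2, w3 = w_in
    t = ((w3 << 8) | (w3 >> 24)) & 0xFFFFFFFF   # rotate left by one byte
    t = sub_word(t) ^ (rcon << 24)              # S-Boxes, then round constant
    out0 = w0 ^ t
    out1 = w1 ^ out0
    out2 = w2 ^ out1
    out3 = w3 ^ out2
    return (out0, out1, out2, out3)


def expand_key(cipher_key_words):
    """Return the 11 round keys of AES-128, starting with the cipher key."""
    keys = [tuple(cipher_key_words)]
    for rcon in RCON:
        keys.append(key_exp_round(keys[-1], rcon))
    return keys


if __name__ == "__main__":
    key = (0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C)
    print([f"{w:08x}" for w in expand_key(key)[1]])
```

Chaining ten such rounds, with a register after each one, mirrors the KeyExpansion block that supplies one sub-key per cipher round.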

The proposed high-throughput outer round pipelined AES core has been modelled in RTL VHDL and then implemented using Synopsys tools with a CMOS 45nm technology from NANGATE. The experimental results are shown in TABLE I. From the obtained results, the proposed design achieves an ultra-high throughput of 111.3 Gbps at the operating frequency of 870 MHz. The power consumption is only 56.3 mW. The area cost is 0.13 mm2 (equivalent to 164.5 kGEs). Therefore, the design

From the experimental results, our design can run at a maximum operating frequency of 870 MHz. When the pipeline stages of our design are fully filled with data, each clock cycle produces 128 output bits. Thus, the maximum throughput that our design can achieve is:

$$\text{Throughput} = 0.87\ \text{GHz} \times 128\ \text{bits} \approx 111.3\ \text{Gbps}$$

In this design, we used 11 pipeline stages, so the latency is 11 clock cycles (equivalent to 12.6 ns):

$$\text{Latency} = \frac{11}{0.87\ \text{GHz}} \approx 12.6\ \text{ns}$$
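The throughput figure, and the efficiency metrics that follow from it, are simple ratios of the reported numbers. The Python sketch below recomputes them; the constant and function names are hypothetical, the efficiency values are derived here from the reported 111.3 Gbps, 56.3 mW, and 0.13 mm2, and the area efficiency in particular is a derived value rather than one quoted in the text.

```python
F_MAX_MHZ = 870.0        # reported maximum operating frequency
BLOCK_BITS = 128         # bits produced per cycle once the pipeline is full
POWER_MW = 56.3          # reported power consumption
AREA_MM2 = 0.13          # reported area
REPORTED_GBPS = 111.3    # throughput as reported in the paper


def peak_throughput_gbps(f_mhz, bits_per_cycle=BLOCK_BITS):
    """One block per cycle: MHz x bits gives Mbps, divided by 1000 for Gbps."""
    return f_mhz * bits_per_cycle / 1000.0


def energy_efficiency_gbps_per_w(throughput_gbps, power_mw):
    return throughput_gbps / (power_mw / 1000.0)


def area_efficiency_gbps_per_mm2(throughput_gbps, area_mm2):
    return throughput_gbps / area_mm2


if __name__ == "__main__":
    print(round(peak_throughput_gbps(F_MAX_MHZ), 2), "Gbps")
    print(round(energy_efficiency_gbps_per_w(REPORTED_GBPS, POWER_MW)), "Gbps/W")
    print(round(area_efficiency_gbps_per_mm2(REPORTED_GBPS, AREA_MM2), 1), "Gbps/mm2")
```

The exact product 0.87 × 128 is 111.36 Gbps; the paper reports it as 111.3 Gbps, and using that reported value gives an energy efficiency that rounds to 1977 Gbps/W.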


TABLE I AES ENCRYPTION IMPLEMENTATION

TABLE I summarizes our design implementation results on NANGATE CMOS 45nm technology. The comparison of our architecture with the related works is shown in TABLE II. Compared to the works in [7] and [8], which were implemented at the same technology node, our design achieves a throughput similar to that of [8] and twice that of [7], but has a lower latency and occupies less area. Compared to the other works, our design has the lowest latency and the highest efficiency in using hardware resources (area efficiency in terms of Gbps/mm2). It has 2.4x better area efficiency and 2x lower latency than the one in [7], while its power consumption is also 2.2x lower. In [8], although the throughput is higher, the area efficiency is 400 times lower and the power consumption is 109 times higher than in our design. From the area point of view, our AES architecture is 48 times smaller than the design in [8]. Therefore, our proposed design is suitable for real-time applications with a low-cost hardware implementation. As shown in Fig. 6, the energy efficiency and area efficiency of our design are much higher than those of the other related works.

Fig. 6. Comparisons with the related works

TABLE II. COMPARISON OF THE PROPOSED DESIGN AND DIFFERENT AES ARCHITECTURES
Design | CLK (MHz) | Cycles per encryption | Tech (nm) | Area (mm2) | Area (kGate) | Power (mW) | Throughput (Gbps) | Latency (ns) | Energy Efficiency (Gbps/W) | Area Efficiency (Gbps/mm2)

Broadband communications (5G networks, IoT gateways, optical transmission systems, surveillance cameras, video conferencing) require increasing quality of service and data security. Therefore, designing security cores with high throughput and low latency is always a challenge, especially under hardware cost and power consumption constraints. In this paper, we proposed an AES core architecture for high-throughput and real-time applications. The outer pipeline and fully parallel architecture allow us to increase the operating frequency and reduce the latency. Our design operates at 870 MHz on NANGATE CMOS 45nm technology and achieves a high throughput of 111.3 Gbps, a low latency of 12.6 ns, and an energy efficiency of 1977 Gbps/W.

V. ACKNOWLEDGMENT

This research is partly funded by the Ministry of Science and Technology (MOST) of Vietnam under grant number KC.01.21/16-20.

[1] … https://www.pearson.com/us/higher-education/program/Stallings-Cryptography-and-Network-Security-Principles-and-Practice-7th-Edition/PGM334401.html [Accessed: 21-Sep-2018].
[2] … and fully pipelined implementation of AES algorithm on FPGA,” Microprocessors and Microsystems, vol. 39, no. 7, pp. 480–493, Oct. 2015.


[3] … Low-Power Low-Energy Multisecurity-Level Internet-of-Things Applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 12, pp. 3281–3290, Dec. 2017.
[4] … ultra-low power AES encryption cores with silicon demonstration in SOTB CMOS process,” Electronics Letters, vol. 53, no. 23, pp. 1512–1514, 2017.
[5] … minimum-energy operation and silicon demonstration in 65nm with lowest energy per encryption,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015, pp. 2349–2352.
[6] … trade-offs for fully pipelined 30 to 70 Gbits/s AES processors,” IEEE Transactions on Computers, vol. 55, no. 4, pp. 366–372, Apr. 2006.
[7] … Composite-Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-Performance Microprocessors,” IEEE Journal of Solid-State Circuits, vol. 46, no. 4, pp. 767–776, Apr. 2011.
[8] … throughput reconfigurable cryptographic processor,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 155–161.
[9] … an ultra high speed AES processor for next generation IT security,” Comput. Electr. Eng., vol. 37, no. 6, pp. 1160–1170, Nov. 2011.
[10] B. Erbagci, N. E. C. Akkaya, C. Teegarden, and K. Mai, “A 275 Gbps AES encryption accelerator using ROM-based S-boxes in 65nm,” in 2015 IEEE Custom Integrated Circuits Conference (CICC), 2015, pp. 1–4.
[11] P. Liu, J. Hsiao, H. Chang, and C. Lee, “A 2.97 Gb/s DPA-resistant AES engine with self-generated random sequence,” in 2011 Proceedings of the ESSCIRC (ESSCIRC), 2011, pp. 71–74.
[12] K. Rahimunnisa, P. Karthigaikumar, N. Christy, S. Kumar, and J. Jayakumar, “PSP: Parallel sub-pipelined architecture for high throughput AES on FPGA and ASIC,” Open Comput. Sci., vol. 3, no. 4, pp. 173–186, 2013.
