
A Reconfigurable Multi-function DMA Controller for High-Performance Computing Systems

Hung K. Nguyen, Khoi P. Dong, Xuan-Tu Tran

SISLAB, VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

Email: kiemhung@vnu.edu.vn

Abstract—Huge bandwidth demand, along with the requirement to synchronize data structures between different processing structures in a multiprocessor system-on-chip (MPSoC), leads to the need for dedicated memory access controllers. This paper presents the design of a reconfigurable multi-function direct memory access controller (ReDMAC) for high-performance MPSoCs. The ReDMAC supports dynamic reconfiguration: its hardware fabrics can be reconfigured into various functions even while the system is running. It supports four operating modes: direct memory access, matrix transposing, data sorting, and matrix merging. The ReDMAC has been modeled at the Register Transfer Level (RTL) in VHDL. The controller has been simulated to evaluate its ability to be reconfigured into each individual function. It has also been synthesized with the Synopsys Design Compiler tool to compare its hardware cost with that of independent implementations of each function. Simulation and synthesis results indicate that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores.

Keywords—ReDMAC, reconfigurable direct memory access controller, multiprocessor system-on-chip, high-performance computing, reconfigurable fabrics

I. INTRODUCTION

Recently, the research trend in the design of high-performance computing systems has shifted toward hybrid reconfigurable multiprocessor systems-on-chip (MPSoCs), e.g., MUSRA [1], Zynq Ultrascale [2], ADRES [3], REMUS [4], and CPSoC [5]. These systems normally integrate many heterogeneous processing resources such as software-programmable microprocessors (PP), hardwired IP (Intellectual Property) cores, and reconfigurable hardware architectures. To program such a system, a target application is first partitioned into a set of tasks, which are then mapped onto the heterogeneous computational and routing resources of the system. Partitioning and mapping the application so that it can be executed on several smaller processors in a parallel or pipelined fashion is more efficient than executing it on a single processor. In particular, computation-intensive kernel functions of the application are mapped onto the reconfigurable hardware so that they can achieve performance approximately equivalent to that of an ASIC while maintaining a degree of flexibility close to that of DSP processors [6]. Moreover, by dynamically reconfiguring hardware, reconfigurable computing systems allow many hardware tasks to be mapped onto the same hardware platform, thus reducing the area and power consumption of the design [7].

However, designing such high-performance computing systems also poses some challenges. One of them is the communication and synchronization of data between different processing structures. Parallel processing architectures usually require a huge data bandwidth. Therefore, sufficient system bandwidth is necessary to ensure that data is always available, so that all resources can run concurrently without idle states. Moreover, because the processing structures have different execution models, the data structures exchanged between them need to be transformed to ensure compatibility.

A common method for data communication between processing units is a shared memory assisted by a direct memory access controller (DMAC). Here, the DMAC transfers data between the shared memory and the parallel processing arrays without the participation of the central processing unit (CPU). Hence, the DMAC is a very important component that helps to increase the data transfer rate and reduce the CPU load in computing systems. Unfortunately, a conventional DMAC [8] in a general-purpose computer usually supports only simple operations that copy contiguous data blocks from a source storage area to a destination one. This architecture is not efficient for accessing the complex data structures used by parallel processing architectures. Because of these limitations, traditional DMAC architectures cannot provide enough throughput to keep up with new technology trends. The role of DMACs becomes more complicated in parallel computation architectures. Improving and optimizing the functionality of the DMAC has become a key issue in designing high-performance computing systems [9]. Many DMACs ([10]-[14]) have been proposed with unique features that are dedicated to a specific application domain.

In this paper, we propose and implement a reconfigurable multi-function DMA controller (ReDMAC) for the coarse-grained reconfigurable architecture named MUSRA [1]. Because MUSRA is designed to accelerate the computation of loops in multimedia processing applications, some loop-transformation techniques have to be applied when mapping a specific loop onto MUSRA. As a result, the data transferred between software modules running on microprocessors and loops executing on MUSRA also need proper transformations such as tiling, fusion, splitting, skewing, sectioning, etc. [15]. Therefore, the proposed DMAC not only takes charge of moving data from the system's memory to the parallel processing array, but also has to convert data structures into formats compatible with the execution model of the parallel processing array of MUSRA. The DMAC supports four modes:

- Basic DMA mode moves a data block from one place to another;

- Fusing DMA mode merges an M×N matrix with an M×L matrix into an M×(N+L) matrix and then moves it to another position;

- Transposing DMA mode copies an M×N matrix from one specified place and transposes it before moving it to another place;

- Sorting DMA mode copies a data block from one place and sorts it before moving it to another place.
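The four modes can be summarized by a small behavioral model. The following Python sketch is only a functional illustration, assuming matrices are held as lists of rows; the function names are hypothetical, the ascending sort order is an assumption, and nothing here reflects how the hardware implements each mode.

```python
from typing import List

Matrix = List[List[int]]


def basic_dma(block: List[int]) -> List[int]:
    """Basic mode: copy a data block unchanged."""
    return list(block)


def fusing_dma(a: Matrix, b: Matrix) -> Matrix:
    """Fusing mode: merge an MxN and an MxL matrix into an Mx(N+L) matrix."""
    if len(a) != len(b):
        raise ValueError("both matrices must have the same number of rows")
    # Each output row is the row of `a` followed by the matching row of `b`.
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def transposing_dma(a: Matrix) -> Matrix:
    """Transposing mode: turn an MxN matrix into an NxM matrix."""
    return [list(column) for column in zip(*a)]


def sorting_dma(block: List[int]) -> List[int]:
    """Sorting mode: copy and sort (ascending order is an assumption)."""
    return sorted(block)
```

For example, fusing a 2×2 matrix with a 2×1 matrix yields a 2×3 matrix whose rows are the concatenated input rows.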

The rest of this paper is organized as follows. The operation principle and architecture of the proposed DMAC are presented in Section II. In Section III, experimental results and the evaluation of flexibility, performance and implementation cost are reported and discussed. Finally, some conclusions are given in Section IV.

II. PROPOSED ARCHITECTURE

A. Principle Overview

The ReDMAC is designed to act as an adapter between ARM AMBA-based processing systems and hardware accelerators. Fig. 1 shows the ReDMAC's interface and connectivity in a system-on-chip. The interface between the ReDMAC and the processing system complies with the AMBA AHB protocol specification [16]. It includes an AHB master interface for accessing the system's memory and an AHB slave interface for receiving DMA commands from the CPU. In addition, the ReDMAC has another interface for handshaking with the CPU or peripherals that request a DMA session. From a structural perspective, the ReDMAC includes two parts: the DMAC wrapper and the DMAC core. The wrapper makes the interface of the DMAC core compatible with the AHB bus and the accelerator interface, thereby allowing the DMAC core to transfer data between memory and accelerator.

Fig. 1. ReDMAC interface and interconnection in a SoC. (Blocks shown: CPU, FLASH/SDRAM controller and memory, AMBA AHB, DMAC wrapper with AHB master, AHB slave and accelerator interfaces, DMAC core, and accelerator; handshaking signals Dreq, Dack, Hreq and HldA.)

B. DMAC Core

The proposed architecture of the DMAC core is shown in Fig. 2. The DMAC core consists of three main blocks: the Control Register File, the Configuration Context Generator (CCG), and the Control Unit (CU). To offer reconfigurability in real time, the CU is in turn composed of a parameterized FSM (Finite State Machine), Reconfigurable Fabrics, and a Context Register File (CRF).

Fig. 2. Functional block diagram of the DMAC core. (Blocks shown: Control Register File including CMR and DADR_REG, Configuration Context Generator with the handshaking interface, and Control Unit with the parameterized FSM, reconfigurable fabrics built from routing and processing blocks, and CRF; pipeline stages 1 and 2; AHB slave bus, AHB master bus and CGRA bus; Start and Done signals.)

Fig. 3. FSMD flowchart of the DMAC core. (CCG part: clear all registers on reset, wait for Dreq = ‘1’, decode the mode codes “0001”, “0010”, “0100” and “1000” to set Context1 to Context4, assert Hreq, wait for Hlda = ‘1’, assert Dack and Start; parameterized FSM part: execute, assert Done, and clear Done once Start returns to ‘0’.)

The operation of the DMAC core is described by the FSMD (Finite State Machine with Data-path) flowchart in Fig. 3. The separation of the control unit from the configuration context generator isolates the functional operation of the DMAC core from the configuration process. This structure avoids interference between the two sections, thus ensuring design stability. In addition, it creates a two-stage pipelined mechanism (as shown in Fig. 2) between these sections, which reduces the time overhead caused by configuration. Right after the CCG finishes the configuration process, a new DMA command can be written to the control register file.
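The benefit of the two-stage organization can be illustrated with simple cycle arithmetic. The sketch below is a hypothetical timing model, not taken from the paper: it assumes that each DMA session has a setup part (command write, handshaking, context generation) and a transfer part, and that the setup of the next session may run while the current transfer is still in progress. The function name and all cycle counts are illustrative only.

```python
from typing import List, Tuple


def total_cycles(sessions: List[Tuple[int, int]], overlapped: bool) -> int:
    """Return the cycles needed to finish all sessions.

    Each session is a (setup_cycles, transfer_cycles) pair.
    """
    if not overlapped:
        # Without pipelining, every session pays setup and transfer in turn.
        return sum(setup + transfer for setup, transfer in sessions)

    setup_done = 0     # time at which stage 1 finishes the current setup
    transfer_done = 0  # time at which stage 2 finishes the current transfer
    for setup, transfer in sessions:
        # Stage 1 starts the next setup as soon as it finished the previous one.
        setup_done += setup
        # Stage 2 needs both its own previous transfer and this setup to be done.
        transfer_done = max(setup_done, transfer_done) + transfer
    return transfer_done
```

With three sessions of 10 setup cycles and 100 transfer cycles each, this model gives 330 cycles without overlap and 310 cycles with overlap: only the first setup remains visible.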

1) Control Register File

The control register file contains the registers that determine the function and control parameters of the DMAC core. These registers are written by an external CPU via the AHB slave interface and are read by the CCG to generate configuration information for the DMAC core. There are six registers, as follows:

- CMR holds the control commands (e.g., function, single/burst transfer mode, data width, etc.) sent by the CPU;

- the source address register holds the starting address of the source data block in the memory that the DMAC core needs to read data from;

- the source offset register holds the address offset between two rows of the source data block in the memory that the DMAC core needs to read data from;

- the destination address register holds the starting address of the destination data block in the memory that the DMAC core needs to write data to;

- the destination offset register holds the address offset between two rows of the destination data block in the memory that the DMAC core needs to write data to;

- the data size register holds the size of the data to be processed. This register includes two separate registers: RIR (Row Index Register) indicates the number of rows of the data block; CIR (Column Index Register) indicates the number of columns of the data block.
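To make the role of these registers concrete, the sketch below shows how a starting address, a row offset, and the row and column counts could describe a rectangular block inside a larger memory. It is a minimal illustration under stated assumptions: the field names are hypothetical (only CMR, RIR and CIR are named in the text), elements are assumed to be one byte wide, and the row offset is assumed to be the address distance between the starts of two consecutive rows.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class DmaCommand:
    cmr: int             # control commands (function, transfer mode, data width)
    src_addr: int        # starting address of the source block (hypothetical name)
    src_row_offset: int  # address distance between source rows (hypothetical name)
    dst_addr: int        # starting address of the destination block (hypothetical name)
    dst_row_offset: int  # address distance between destination rows (hypothetical name)
    rir: int             # number of rows in the data block
    cir: int             # number of columns in the data block


def source_addresses(cmd: DmaCommand) -> Iterator[int]:
    """Yield the byte addresses of the source block in row-major order."""
    for i in range(cmd.rir):
        row_start = cmd.src_addr + i * cmd.src_row_offset
        for j in range(cmd.cir):
            yield row_start + j
```

For a 2×3 block starting at 0x100 with a row offset of 16, the generator yields 0x100 to 0x102 followed by 0x110 to 0x112, so the block does not need to be contiguous in memory.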

2) Configuration Context Generator (CCG)

The CCG takes charge of two tasks in the DMAC core. First, it receives a DMA request and performs the handshaking protocol to gain access to the AHB master bus. Second, it decodes the information contained in the register CMR and then generates configuration information and control parameters for the parameterized FSM and the reconfigurable fabrics. A set of such information is called the configuration context of the ReDMAC and is stored in the Context Register File (CRF).

Fig. 4. Timing diagram of handshaking signals: Dreq (in), Hreq (out), Hlda (in), Dack (out), and WrCRF. The diagram marks the point from which a new command can be written to the control register file.

The CCG is designed to allow the handshaking process and configuration context generation to happen in parallel. Fig. 4 shows the timing diagram of the handshaking signals generated by the CCG. After detecting a transition from 0 to 1 on the signal Dreq, the CCG starts handshaking and context generation concurrently. As a result, it takes only one clock cycle to latch a configuration context into the CRF.
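The handshaking sequence can be approximated by a small cycle-stepped model. The sketch below is an assumption-laden illustration rather than the authors' RTL: the state names are hypothetical, the exact cycle alignment of the signals is not given in the text, and the model simply returns to idle after Dack instead of tracking the data transfer that follows.

```python
from typing import List, Tuple


def ccg_handshake(dreq: List[int], hlda: List[int]) -> List[Tuple[int, int, int]]:
    """Return one (Hreq, Dack, WrCRF) tuple per clock cycle.

    dreq and hlda are equal-length lists of 0/1 input samples.
    """
    state = "IDLE"
    prev_dreq = 0
    trace = []
    for d, h in zip(dreq, hlda):
        # Outputs depend only on the current state (Moore-style model).
        hreq = 1 if state in ("REQUEST", "ACK") else 0
        dack = wr_crf = 1 if state == "ACK" else 0
        trace.append((hreq, dack, wr_crf))

        # Next-state logic.
        if state == "IDLE" and d == 1 and prev_dreq == 0:
            state = "REQUEST"   # rising edge on Dreq starts the handshake
        elif state == "REQUEST" and h == 1:
            state = "ACK"       # bus granted: acknowledge and latch the context
        elif state == "ACK":
            state = "IDLE"      # context is latched for a single cycle
        prev_dreq = d
    return trace
```

In this model WrCRF is high for exactly one cycle per session, which mirrors the single-cycle context latch described above.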

3) Control unit (CU)

Fig. 5. Flowchart of the FSM.

The control unit generates the addresses used to read data from the source memory area, converts the data structure, and moves and writes the data to the target memory area. The CU includes two parts:

- The parameterized FSM is responsible for generating the signals that control the operation of the reconfigurable fabrics. The operation of the parameterized FSM is described by the flowchart in Fig. 5.

- The reconfigurable fabrics consist of routing blocks and basic building blocks that enable them to be physically altered into a control circuit that handles the required DMA transfer and transformation. The routing blocks consist of wires and programmable switches that establish connections between basic building blocks to build up the address generator as well as the data converter according to a specific requirement.

In addition, the CU includes a Context Register File (CRF) that contains the configuration information for the reconfigurable fabrics as well as parameters for the FSM. The CRF is set up by the CCG based on the content of the register CMR. The values of these registers are kept while the DMAC core operates in a particular mode and are changed only when the DMAC core changes its operating mode.

Fig. 6 shows one of the reconfigurable fabrics, which can be configured to build various write address generators depending on the required DMA function. The basic building blocks are shown in grey, the routing blocks in orange, and the registers of the CRF in green. The CRF can hold parameters that specify the address range or information bits that set the state of the switches.

Fig. 6. A reconfigurable fabric. (Write address generator built from the Addr_W register, an adder, and 2×1 multiplexers, with inputs i and j, CRF entries, and select signals such as B_sel and AddrW_sel.)
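The idea of one datapath serving several write address patterns can be sketched in software. The example below is a hypothetical model and does not reproduce the exact wiring of Fig. 6: it assumes a single address register updated by one adder, with a context bit (here the `transpose` parameter) choosing the row start value and the increment in the way a multiplexer would.

```python
from typing import List


def write_addresses(rows: int, cols: int, base: int, transpose: bool) -> List[int]:
    """Write addresses for a rows x cols source block read in row-major order."""
    addrs = []
    for i in range(rows):
        # Multiplexer 1: choose the address loaded at the start of source row i.
        addr = base + (i if transpose else i * cols)
        # Multiplexer 2: choose the increment applied after every element.
        step = rows if transpose else 1
        for _ in range(cols):
            addrs.append(addr)
            addr += step  # the single adder updates the address register
    return addrs
```

With `transpose` set, element (i, j) of the source lands at `base + j * rows + i`, which is the row-major position of element (j, i) in the transposed matrix; with it cleared, the same loop produces a plain sequential copy.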

III. RESULTS AND EVALUATION

A. Synthesis Results

The proposed reconfigurable DMAC was modeled at the Register-Transfer Level (RTL) in VHDL and successfully synthesized into gate-level circuits by Synopsys Design Compiler with the NANGATE 45 nm open cell library [17].

In addition, in order to evaluate the effectiveness as well as the area cost of the proposed ReDMAC, we also implemented five other DMAC versions (as shown in TABLE I). Here, the basic DMAC, sorting DMAC, transposing DMAC, and fusing DMAC each adopt only one of the functions supported by the ReDMAC, as follows:

- Basic DMAC only supports transferring data between memory areas;

- Sorting DMAC supports the basic DMA function with the capability of data sorting;

- Transposing DMAC supports the basic DMA function with the capability of matrix transposing;

- Fusing DMAC supports the basic DMA function with the capability of matrix fusing.

The function-select DMAC also supports all four functions by integrating all of the above function cores into the same design, with each function selected by switching between cores.

The synthesis results of the DMAC versions are shown in TABLE I. The maximum frequency of the ReDMAC is about 625 MHz, which is the lowest among the DMACs. This decrease in frequency is due to the delays introduced by the routing blocks. However, the ReDMAC supports all four functions with an implementation cost of just 1407 μm², which is about three times lower than that of the function-select DMAC. Note also that the ReDMAC's implementation cost is only slightly higher than that of the sorting DMAC, the most complex single-function DMAC.

TABLE I. SYNTHESIS RESULTS OF DIFFERENT DMAC DESIGNS

B. Simulation Results

The proposed reconfigurable DMAC is evaluated in terms of performance, flexibility and configuration overhead using an HDL-based simulator. To do so, the evaluation testbench shown in Fig. 7 has been built around the RTL model of the ReDMAC.

Fig. 8 shows the simulation result of the ReDMAC in the ModelSim simulator. Each DMA session includes three phases: (1) initializing: the CPU writes a DMA command to the ReDMAC and starts a DMA session by asserting the signal dreq = ‘1’; (2) handshaking and configuring: the ReDMAC handshakes with the CPU to become the bus master and configures the DMAC core at the same time; (3) DMA processing: the DMAC transfers data between the system memory and the accelerator memory. The waveform in Fig. 8 illustrates the operation of the ReDMAC. At 2845 ns, the CPU writes the first DMA command into the DMAC. After detecting that the signal dreq transitions from ‘0’ to ‘1’, the ReDMAC performs the handshaking protocol to gain access to the AHB master bus. The ReDMAC confirms that the DMA session has started by asserting the signal Dack = ‘1’. At 3315 ns, right after Dack = ‘1’, the CPU can start the initializing phase of the next DMA session by writing a new DMA command to the control register file of the ReDMAC. The simulation results show that the ReDMAC design allows the initialization of the next DMA session to be hidden under the DMA processing of the current session. In addition, it takes only one clock cycle to switch to the next configuration context.

Fig. 7. Simulation testbench. (Blocks shown: a CPU model that generates test vectors, initializes a DMA session and validates the result after the session; data memory banks; clock and reset generators; and the DMA core with its control registers, CCG, CRF, parameterized FSM and reconfigurable fabrics.)

TABLE 2 summarizes the execution time (in cycles) of the DMAC designs as a function of the size of the input data block, where the execution time is defined as the latency of the DMA process. The input data block is a 2D array of R×C bytes (R = 1 when verifying the basic DMA and sorting DMA functions). The results in the table have been derived from the FSM flowchart of the DMAC core (Fig. 5) and verified by simulation with blocks of random data of many different sizes. Note that, besides depending on the size of the input data block, the execution time of the sorting function also depends on the content of the data. Therefore, the execution time shown in the table for the sorting function is the worst-case latency. As shown in TABLE 2, the ReDMAC can be reconfigured flexibly to support all functions with a slight increase in execution time. This increase results from each function being built up from the reconfigurable fabrics instead of a dedicated architecture designed for that function.
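The dependence of sorting time on data content can be seen in a small software experiment. The sketch below assumes an exchange sort with early termination (a bubble sort); the text does not name the sorting algorithm used by the ReDMAC, so this is only an illustration of why a worst-case figure is reported, and the comparison counts are not the cycle counts of the hardware.

```python
from typing import List, Tuple


def bubble_sort_steps(data: List[int]) -> Tuple[List[int], int]:
    """Sort a copy of data and return it with the number of comparisons made."""
    a = list(data)
    compares = 0
    for end in range(len(a) - 1, 0, -1):
        swapped = False
        for k in range(end):
            compares += 1
            if a[k] > a[k + 1]:
                a[k], a[k + 1] = a[k + 1], a[k]
                swapped = True
        if not swapped:
            break  # already sorted: stop early
    return a, compares
```

For four elements, an already sorted block needs 3 comparisons while a reversed block needs 6, so the same block size can lead to different latencies.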

IV. CONCLUSION

This paper presents the design of a reconfigurable multi-function DMAC for high-performance computing systems. In addition to the basic DMA function, the proposed ReDMAC supports three data transformation functions that are widely used in digital signal processing and multimedia processing. The ReDMAC also supports dynamic reconfiguration, enabling the hardware fabrics to be reconfigured into different functions even while the system is running. To reduce the time overhead caused by reconfiguration, a DMA session is partitioned into phases and implemented by a two-stage pipelined architecture.

The proposed architecture has been modeled at RTL in VHDL, and then simulated and synthesized in order to validate the flexibility, cost and performance of the architecture. The experimental results show that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores. The proposed ReDMAC can be applied to reconfigurable high-performance SoCs.

Fig. 8. Simulation result.

TABLE 2. EXECUTION TIME (CYCLES) OF KERNEL LOOPS ON VARIOUS COMPUTATION PLATFORMS

+25×C-26)/2


ACKNOWLEDGMENT

This work has been supported by Vietnam National University, Hanoi under Project No. QG.16.33.

REFERENCES

[1] Kiem Hung Nguyen and Thi Minh Phan (2017). RTL Design of a Dynamically Reconfigurable Cell Array for Multimedia Processing. In Proceedings of the 4th NAFOSTED Conference on Information and Computer Science (NICS), 24-25 November 2017, Hanoi, Vietnam.

[2] Santarini, M. "Xilinx 16nm ultrascale+ devices yield 2-5X performance/watt advantage." XCell Journal 90 (2015): 8-15.

[3] B. Mei, M. Berekovic and J.Y. Mignolet: "ADRES & DRESC: Architecture and Compiler for Coarse-Grain Reconfigurable Processors", Fine- and Coarse-Grain Reconfigurable Computing, chapter 6, pp. 255-297, 2007.

[4] KiemHung Nguyen, Peng Cao, Xuexiang Wang, Jun Yang and Longxing Shi (2013). Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip. IEICE Transactions on Information and Systems, E96-D (3), pp. 601-615. ISSN 0916-8532.

[5] N. Dutt, A. Jantsch, S. Sarma, "Toward Smart Embedded Systems: A Self-aware System-on-Chip (SoC) Perspective", ACM TECS, Vol. 15, No. 2, Article 22, February 2016.

[6] João M. P. Cardoso, Pedro C. Diniz: "Compilation Techniques for Reconfigurable Architectures", Springer, 2009.

[7] A. Shoa and S. Shirani, "Run-Time Reconfigurable Systems for Digital Signal Processing Applications: A Survey", Journal of VLSI Signal Processing, Vol. 39, pp. 213–235, 2005, Springer Science.

[8] Datasheet of Intel 8257 Programmable DMA Controller.

[9] Tehre, Vaishali, and Ravindra Kshirsagar. "Survey on coarse grained reconfigurable architectures." International Journal of Computer Applications 48.16 (2012): 1-7.

[10] Lattice Semiconductor Corporation. Scatter-Gather Direct Memory Access Controller IP Core Users Guide. October 2010.

[11] Altera Corporation. Scatter-Gather DMA Controller Core, Quartus II 9.1. November 2009.

[12] Xilinx. Channelized Direct Memory Access and Scatter Gather. February 2010.

[13] Hussain, Tassadaq, et al. "PPMC: a programmable pattern based Reconfigurable Computing. Springer, Berlin, Heidelberg, 2012.

[14] Nilsson, Emelie. "DMA Controller for LEON3 SoC:s Using AMBA." (2013).

[15] João M. P. Cardoso, Pedro C. Diniz: Compilation Techniques for Reconfigurable Architectures, 2009, Springer.

[16] AMBA Specification (Rev 2.0). http://www.arm.com

[17] http://www.nangate.com/
