A Reconfigurable Multi-function DMA Controller
for High-Performance Computing Systems
Hung K. Nguyen, Khoi P. Dong, Xuan-Tu Tran
SISLAB, VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
Email: kiemhung@vnu.edu.vn
Abstract—Huge bandwidth demand along with the
requirement to synchronize data structures between different
processing structures in multiprocessor system-on-chip
(MPSoC) lead to the need to design dedicated memory access
controllers. This paper presents the design of a reconfigurable
multi-function direct memory access controller (ReDMAC)
for high-performance MPSoCs. The ReDMAC supports
dynamic reconfiguration: its hardware fabrics can be
configured into different functions even while the system is
running. The ReDMAC supports four operating modes:
direct memory access, matrix transposing, data sorting, and
matrix merging. The ReDMAC has been modeled at the
Register Transfer Level (RTL) in VHDL. The controller has
been simulated to verify that it can be reconfigured to perform
each individual function. It has also been synthesized with the
Synopsys Design Compiler tool to compare its hardware cost
with that of independent implementations of the individual
functions. Simulation and synthesis results indicate that the
proposed design meets the required functionality, while the
area of the controller is about three times smaller than the
total area of the independent function cores.
Keywords—ReDMAC, reconfigurable memory direct memory
controller, multiprocessor system-on-chip, high-performance
computing, reconfigurable fabrics
I. INTRODUCTION
Recently, the research trend in the design of high-performance computing systems has shifted toward hybrid reconfigurable multiprocessor systems-on-chip (MPSoCs) (e.g. MUSRA [1], Zynq UltraScale [2], ADRES [3], REMUS [4], CPSoC [5], etc.). These systems normally integrate many heterogeneous processing resources such as software-programmable microprocessors (µP), hardwired IP (Intellectual Property) cores, reconfigurable hardware architectures, etc. To program such a system, a target application is first partitioned into a set of tasks, which are then mapped onto the heterogeneous computational and routing resources of the system. Partitioning and mapping the application so that it can be executed on several smaller processors in a parallel or pipelined fashion is more efficient than executing it on a single processor. In particular, computation-intensive kernel functions of the application are mapped onto the reconfigurable hardware so that they achieve performance approaching that of an ASIC while maintaining a degree of flexibility close to that of DSP processors [6]. Moreover, by dynamically reconfiguring hardware, reconfigurable computing systems allow many hardware tasks to be mapped onto the same hardware platform, thus reducing the area and power consumption of the design [7].
However, designing such high-performance computing systems also poses some challenges. One of them is the communication and synchronization of data between different processing structures. Parallel processing architectures usually require a huge data bandwidth. Sufficient system bandwidth is therefore necessary to ensure that data is always available, so that all resources can run concurrently without idle states. Moreover, because the processing structures have different execution models, the data structures exchanged between them need to be transformed to ensure compatibility.
A common method for data communication between processing units is a shared memory assisted by a direct memory access controller (DMAC). Here, the DMAC transfers data between the shared memory and the parallel processing arrays without the participation of the central processing unit (CPU). Hence, the DMAC is a very important component that helps to increase the data transfer rate and reduce the load on the CPU in computing systems. Unfortunately, a conventional DMAC [8] in a general-purpose computer usually supports only simple operations that copy contiguous data blocks from a source storage area to a destination one. This architecture is not efficient for accessing the complex data structures used by parallel processing architectures. Because of these limitations, traditional DMAC architectures cannot provide enough throughput to keep up with new technology trends. The role of DMACs becomes more complicated in parallel computation architectures. Improving and optimizing the functionality of the DMAC has become a key issue in designing high-performance computing systems [9]. Many DMACs ([10]-[14]) have been proposed with unique features dedicated to a specific domain of applications.
In this paper, we propose and implement a reconfigurable multi-function DMA controller (ReDMAC) for the coarse-grained reconfigurable architecture named MUSRA [1]. Because MUSRA is designed to accelerate the computation of loops in multimedia processing applications, some loop-transformation techniques have to be applied when mapping a specific loop onto MUSRA. As a result, the data transferred between software modules running on microprocessors and loops executing on MUSRA also needs to undergo suitable transformations such as tiling, fusion, splitting, skewing, sectioning, etc. [15]. Therefore, the proposed DMAC not only moves data from the system's memory to the parallel processing array, but also converts data structures into formats that are compatible with the execution model of the parallel processing array of MUSRA. The DMAC supports four modes:
- Basic DMA mode moves a data block from one place to another;
- Fusing DMA mode merges an M×N matrix with an M×L matrix into an M×(N+L) matrix and then moves it to another place;
- Transposing DMA mode copies an M×N matrix from a specified place and transposes it before moving it to another place;
- Sorting DMA mode copies a data block from one place and sorts it before moving it to another place.
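As an illustration of what the four modes compute, the following is a minimal behavioral sketch in Python. It models only the data transformations, not the hardware or the memory addressing. The function names are hypothetical, and the ascending sort order is an assumption, since the sorting direction is not stated here.

```python
# Behavioral reference models of the four DMA modes (illustrative only).
# Matrices are lists of rows; all function names are hypothetical.

def basic_dma(block):
    """Basic mode: copy a data block unchanged."""
    return list(block)

def fusing_dma(a, b):
    """Fusing mode: merge an MxN and an MxL matrix into an Mx(N+L) matrix."""
    if len(a) != len(b):
        raise ValueError("both matrices must have the same number of rows")
    return [row_a + row_b for row_a, row_b in zip(a, b)]

def transposing_dma(m):
    """Transposing mode: turn an MxN matrix into its NxM transpose."""
    return [list(col) for col in zip(*m)]

def sorting_dma(block):
    """Sorting mode: copy a data block and sort it (ascending order assumed)."""
    return sorted(block)
```

For example, fusing a 2×2 matrix with a 2×1 matrix yields a 2×3 matrix whose rows are the concatenated rows of the inputs.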
The rest of this paper is organized as follows. The
operation principle and architecture of the proposed DMAC
are presented in Section II. In Section III, experimental
results and the evaluation of flexibility, performance and
implementation cost are reported and discussed. Finally,
some conclusions are given in Section IV.
II. PROPOSED ARCHITECTURE
A. Principle Overview
The ReDMAC is designed to act as an adapter
between ARM AMBA-based processing systems and
hardware accelerators. Fig. 1 shows the ReDMAC's interface
and connectivity in a system-on-chip. The interface between
the ReDMAC and the processing system complies with the
AMBA AHB protocol specification [16]. It includes an AHB
master interface for accessing the system's memory and an
AHB slave interface for receiving DMA commands from the
CPU. In addition, the ReDMAC has another interface for
handshaking with the CPU or with peripherals that request a DMA
session. From a structural perspective, the ReDMAC
includes two parts: the DMAC wrapper and the DMAC core. The
wrapper makes the interface of the DMAC core compatible
with the AHB bus and the accelerator interface and therefore allows
the DMAC core to transfer data between memory and the
accelerator.
Fig. 1. ReDMAC interface and interconnection in a SoC (blocks: CPU, FLASH/SDRAM controller and memory, AMBA AHB, DMAC wrapper with AHB master, AHB slave and accelerator interfaces, DMAC core, accelerator; handshaking signals: Dreq, Dack, Hreq, HldA).
B. DMAC core
The proposed architecture of the DMAC core is shown in
Fig. 2. The DMAC core consists of three main blocks:
the Control Register File, the Configuration Context
Generator (CCG), and the Control Unit (CU). To offer reconfigurability at run time, the CU is in turn composed of a parameterized FSM (Finite State Machine), Reconfigurable Fabrics, and a Context Register File (CRF).
Fig. 2. Functional block diagram of DMAC core (Stage 1: Control Register File and Configuration Context Generator with handshaking interface; Stage 2: Control Unit with parameterized FSM, reconfigurable fabrics made of routing and processing blocks, and CRF; buses: AHB slave, AHB master, CGRA).
Fig. 3. FSMD flowchart of DMAC core (CCG: reset and clear all registers, wait for Dreq = '1', mode decoding on the codes "0001", "0010", "0100" and "1000", setting of Context1 to Context4, Hreq <= '1', wait for Hlda = '1', Dack <= '1', Start = '1'; parameterized FSM: executing, Done <= '1', wait for Start = '0', Done <= '0').
The operation of the DMAC core is described by the FSMD (Finite State Machine with Data-path) flowchart in Fig. 3.
The separation of the control unit from the configuration
context generator aims at isolating the functional operation
of the DMAC core from the configuration process. This
structure avoids interference between the two sections, thus
ensuring design stability. In addition, it creates a two-stage
pipelined mechanism (as shown in Fig. 2) between these
sections, which reduces the time overhead caused by
configuration. Right after the CCG finishes the configuration
process, a new DMA command can immediately be written
to the control register file.
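The benefit of the two-stage pipeline can be illustrated with a small timing estimate. The sketch below, in Python, assumes that the configuration of the next session (stage 1) runs concurrently with the transfer of the current session (stage 2), and that a transfer starts as soon as both its own configuration and the previous transfer have finished. The function name and the cycle counts used in the example are hypothetical and are not measurements from this design.

```python
def total_cycles(sessions, pipelined):
    """Estimate total cycles for a list of (config_cycles, transfer_cycles).

    Hypothetical timing model: in pipelined mode, the configuration of
    session k+1 overlaps the transfer of session k.
    """
    if not sessions:
        return 0
    if not pipelined:
        # Every session pays its configuration time in full.
        return sum(cfg + xfer for cfg, xfer in sessions)
    total = sessions[0][0]  # the first configuration cannot be hidden
    for k, (_, xfer) in enumerate(sessions):
        next_cfg = sessions[k + 1][0] if k + 1 < len(sessions) else 0
        # The next transfer starts when both this transfer and the
        # next configuration are complete.
        total += max(xfer, next_cfg)
    return total
```

With two sessions of 5 configuration cycles and 100 transfer cycles each, this model gives 210 cycles without overlap and 205 cycles with overlap, because the second configuration is hidden behind the first transfer.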
1) Control Register File
The control register file contains the registers that
determine the function and control parameters of the DMAC
core. These registers are written by an external CPU via the
AHB slave interface and are read by the CCG to generate
configuration information for the DMAC core. There are six
registers, which hold the following:
- the control commands (e.g. function, single/burst transfer mode, data width, etc.) sent by the CPU (register CMR);
- the address of the source data block in the memory that the DMAC core needs to read data from;
- the address offset between two rows of the source data block in the memory that the DMAC core needs to read data from;
- the starting address of the destination data block in the memory that the DMAC core needs to write data to;
- the address offset between two rows of the destination data block in the memory that the DMAC core needs to write data to;
- the size of the data to be processed. This register consists of two separate registers: RIR (Row Index Register) indicates the number of rows of the data block, and CIR (Column Index Register) indicates the number of columns of the data block.
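To make the role of these registers concrete, the following Python sketch shows how the source address, the offset between rows, and the row and column counts could together describe a two-dimensional block in memory. It is a minimal sketch under stated assumptions: the field names are hypothetical, elements are assumed to be one byte wide, and the row offset is assumed to be the address distance from the start of one row to the start of the next.

```python
from dataclasses import dataclass

@dataclass
class ControlRegisters:
    """Hypothetical software view of the control register file."""
    command: int       # control commands (function, transfer mode, data width, ...)
    src_addr: int      # address of the source data block
    src_row_step: int  # assumed: address distance between two source rows
    dst_addr: int      # starting address of the destination data block
    dst_row_step: int  # assumed: address distance between two destination rows
    rows: int          # RIR: number of rows of the data block
    cols: int          # CIR: number of columns of the data block

def source_addresses(regs):
    """Yield the byte address of every source element in row-major order."""
    for i in range(regs.rows):
        for j in range(regs.cols):
            yield regs.src_addr + i * regs.src_row_step + j
```

A row step larger than the number of columns describes a sub-block of a wider matrix, which is the kind of non-contiguous access that a conventional block-copy DMAC does not handle.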
2) Configuration Context Generator (CCG)
The CCG takes charge of two tasks in the DMAC core. Firstly,
it receives a DMA request and performs the handshaking
protocol to get access to the AHB master bus. Secondly, the CCG
decodes the information contained in the register CMR
and then generates configuration information and control
parameters for the parameterized FSM and the reconfigurable
fabrics. A set of such information is called the
configuration context of the ReDMAC and is stored in the
context register file (CRF).
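The flowchart in Fig. 3 branches on four one-hot mode codes ("0001", "0010", "0100", "1000"), each of which selects one configuration context. The decoding step can be sketched in Python as follows. The function name is hypothetical, and the assignment of codes to context numbers is an assumption, since the mapping between codes and DMA functions is not given in the text.

```python
def decode_mode(mode_bits):
    """Map a 4-bit one-hot mode field to a context number 1..4.

    The code-to-context assignment below is assumed for illustration.
    """
    contexts = {0b0001: 1, 0b0010: 2, 0b0100: 3, 0b1000: 4}
    try:
        return contexts[mode_bits & 0xF]
    except KeyError:
        # Zero bits set or more than one bit set: not a valid one-hot code.
        raise ValueError(f"mode field {mode_bits:#06b} is not one-hot") from None
```

One-hot coding keeps the decoder trivial in hardware, at the cost of leaving most of the sixteen possible field values unused.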
Fig. 4. Timing diagram of handshaking signals: Dreq (in), Hreq (out), Hlda (in), Dack (out), WrCRF. A new command can be written to the control register file once Dack has been asserted.
The CCG is designed so that the handshaking process and the generation of the configuration context happen in parallel. Fig. 4 shows the timing diagram of the handshaking signals generated by the CCG. After detecting a transition from 0 to 1 on the signal Dreq, the CCG starts handshaking and context generation concurrently. As a result, it takes only one clock cycle to latch a configuration context into the CRF.
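The handshaking sequence (a rising edge on Dreq, then Hreq, then Hlda, then Dack) can be modeled cycle by cycle. The Python sketch below is a simplified behavioral model under assumptions: the state names, the one-cycle Dack pulse, and the exact cycle in which each output changes are hypothetical, and context generation is not modeled.

```python
def ccg_handshake(dreq_trace, hlda_trace):
    """Return the per-cycle (hreq, dack) outputs for the given input traces.

    Simplified model: a rising edge on dreq starts a bus request, hreq is
    held until hlda is seen, then dack is pulsed for one cycle.
    """
    state = "IDLE"
    prev_dreq = 0
    outputs = []
    for dreq, hlda in zip(dreq_trace, hlda_trace):
        hreq = dack = 0
        if state == "IDLE":
            if dreq == 1 and prev_dreq == 0:  # 0 -> 1 transition on Dreq
                state = "REQUEST"
        elif state == "REQUEST":
            hreq = 1                          # ask for the AHB master bus
            if hlda == 1:
                state = "GRANTED"
        elif state == "GRANTED":
            hreq = 1
            dack = 1                          # acknowledge the DMA request
            state = "IDLE"
        prev_dreq = dreq
        outputs.append((hreq, dack))
    return outputs
```

Because the model is edge triggered, holding Dreq high after the acknowledgement does not start a second session; a new session needs a new 0 to 1 transition.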
3) Control unit (CU)
Fig. 5. Flowchart of the FSM.
The control unit generates the addresses used to read data from the source memory area, converts the data structure, and moves and writes the data to the target memory area. The CU includes two parts:
- The parameterized FSM is responsible for generating the signals that control the operation of the reconfigurable fabrics. The operation of the parameterized FSM is described by the flowchart in Fig. 5.
- The reconfigurable fabrics consist of routing blocks and basic building blocks, which allow the fabrics to be physically altered into a control circuit that handles the required DMA transfer and transformation. The routing blocks consist of wires and programmable switches that establish connections between the basic building blocks to build up the address generator as well as the data converter according to a specific requirement.
In addition, the CU includes a Context Register File
(CRF) that contains the configuration information for the
reconfigurable fabrics as well as the parameters for the FSM.
The CRF is set up by the CCG based on the content of
the register CMR. The values of these registers are kept
while the DMAC core operates in a particular mode
and are changed only when the DMAC core changes its
operating mode.
Fig. 6 shows one of the reconfigurable fabrics, which can be
configured to build various write address generators
depending on the required DMA function. The basic building
blocks are shown in grey, the routing blocks in orange, and
the registers in the CRF in green. The CRF can hold parameters
that specify the address range, or information bits that
set the state of the switches.
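The effect of configuring the write address generator for different functions can be sketched behaviorally. The Python function below is an illustration only and does not reproduce the multiplexer structure of Fig. 6. It assumes that source elements are read in row-major order with counters i and j, that elements are one byte wide, and that the destination is densely packed unless a row step is given. The function and parameter names are hypothetical.

```python
def write_addresses(dst_addr, rows, cols, transpose=False, row_step=None):
    """Yield the write address of each source element read in row-major order.

    With transpose=False the block is copied as is; with transpose=True
    element (i, j) is written to position (j, i) of the destination.
    """
    if row_step is None:
        # Densely packed destination: a transposed block has `rows` columns.
        row_step = rows if transpose else cols
    for i in range(rows):
        for j in range(cols):
            if transpose:
                yield dst_addr + j * row_step + i
            else:
                yield dst_addr + i * row_step + j
```

For a 2×3 source block the plain configuration produces the addresses 0, 1, 2, 3, 4, 5, while the transposing configuration produces 0, 2, 4, 1, 3, 5. Only the roles of the two counters are exchanged, which suggests why one fabric with a few switches can serve several functions.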
Fig. 6. A reconfigurable fabric (write address generator built from the Addr_W register, an adder, 2×1 multiplexers, the counters i and j, and CRF parameters).
III. RESULTS AND EVALUATION
A. Synthesis Results
The proposed reconfigurable DMAC was modeled at the
Register-Transfer Level (RTL) in VHDL and
successfully synthesized into gate-level circuits by Synopsys Design Compiler with the NANGATE 45nm open cell library [17].
In addition, in order to evaluate the effectiveness as well as the area cost of the proposed ReDMAC, we also implemented five other DMAC versions (as shown in TABLE I). Here, the basic DMAC, sorting DMAC, transposing DMAC, and fusing DMAC each adopt only one of the functions supported by the ReDMAC, as follows:
- Basic DMAC only supports transferring data between memory areas;
- Sorting DMAC supports the basic DMA function with the capability of data sorting;
- Transposing DMAC supports the basic DMA function with the capability of matrix transposing;
- Fusing DMAC supports the basic DMA function with the capability of matrix fusing.
The Function-select DMAC also supports all four functions by integrating all of the above function cores into the same design, but each function is selected by switching between cores.
The synthesis results of the DMAC versions are shown in TABLE I. The maximum frequency of the ReDMAC is about 625 MHz, which is the lowest among the DMACs. This decrease in frequency is due to the delays introduced by the routing blocks. However, the ReDMAC supports all four functions with an implementation cost of just 1407 μm², which is three times lower than that of the Function-select DMAC. Also note that the ReDMAC's implementation cost is only slightly higher than that of the Sorting DMAC, which is the most complex single-function DMAC.
TABLE I. SYNTHESIS RESULTS OF DIFFERENT DMAC DESIGNS
B. Simulation Results
The proposed reconfigurable DMAC is evaluated in terms of performance, flexibility and configuration overhead using an HDL-based simulator. To do that, the evaluation testbench platform shown in Fig. 7 has been built around the RTL model of the ReDMAC.
Fig. 8 shows the simulation result of the ReDMAC in the ModelSim simulator. Each DMA session includes three phases: (1) initializing: the CPU writes a DMA command to the ReDMAC and starts a DMA session by asserting the signal dreq = '1'; (2) handshaking and configuring: the ReDMAC handshakes with the CPU to become the bus master and configures the DMAC core at the same time; (3) DMA processing: the DMAC transfers data between the system memory and the accelerator memory. The waveform in Fig. 8 illustrates the operation of the ReDMAC. At 2845 ns, the CPU writes the first DMA command into the DMAC. After detecting that the signal dreq transits from '0' to '1', the ReDMAC performs the handshaking protocol to get access to the AHB master bus. The ReDMAC confirms that the DMA session has started by asserting the signal Dack = '1'. At 3315 ns, right after Dack = '1', the CPU can start the initializing phase of the next DMA session by writing a new DMA command to the control register file of the ReDMAC. The simulation results show that our ReDMAC design allows the initialization of the next DMA session to be hidden under the DMA processing of the current DMA session. In addition, it takes only one clock cycle to switch to the next configuration context.
Fig. 7. Simulation testbench (CPU model that generates test vectors, initializes a DMA session and validates the result after the DMA session; data memory banks; DMA with control registers, CCG, and CU containing the parameterized FSM, reconfigurable fabrics and CRF; clock generator; reset generator).
TABLE 2 summarizes the execution time (in cycles) of the
DMAC designs as a function of the size of the input data block.
Here, the execution time is defined as the latency of the DMA
process. The input data block is a 2D array of R×C bytes (R = 1 when verifying the basic DMA function and the sorting DMA function). The results in the table have been derived from the FSM flowchart of the DMAC core (Fig. 5) and verified by simulation with blocks of random data of many different sizes. Note that, besides depending on the size of the input data block, the execution time of the sorting function also depends on the content of the data. Therefore, the execution time shown in the table for the sorting function is the worst-case latency. As shown in TABLE 2, the ReDMAC can be reconfigured flexibly to support all functions with a slight increase in execution time. This increase results from each function being built up from the reconfigurable fabrics instead of from a dedicated architecture designed for that function.
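The dependence of the sorting latency on the data content can be illustrated with a software analogy. The Python sketch below assumes a bubble-sort-style compare-and-swap scheme with early termination. This is an assumption made only for illustration: the sorting algorithm implemented by the ReDMAC is not described in the text, and the function name is hypothetical.

```python
def bubble_sort_passes(data):
    """Sort a copy of data in ascending order.

    Returns (sorted_list, number_of_passes). Illustrative model only,
    not the hardware algorithm.
    """
    a = list(data)
    passes = 0
    n = len(a)
    swapped = True
    while swapped and n > 1:
        swapped = False
        passes += 1
        for k in range(n - 1):
            if a[k] > a[k + 1]:  # compare neighbours and swap if out of order
                a[k], a[k + 1] = a[k + 1], a[k]
                swapped = True
        n -= 1                   # the largest remaining element is now in place
    return a, passes
```

In this model an already sorted block of four elements needs a single pass, whereas a block in reverse order needs three passes, which is why a data-dependent sorter is best characterized by its worst-case latency.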
IV. CONCLUSION
This paper presents the design of a reconfigurable multi-function DMAC for high-performance computing systems. In addition to the basic DMA function, the proposed ReDMAC supports three data transformation functions that are widely used in digital signal processing and multimedia processing. The ReDMAC also supports dynamic reconfiguration: its hardware fabrics can be reconfigured into different functions even while the system is running. To reduce the time overhead caused by reconfiguration, a DMA session is partitioned into phases and implemented by a two-stage pipelined architecture.
The proposed architecture has been modeled at RTL in VHDL, and then simulated and synthesized in order to validate the flexibility, cost and performance of the architecture. The experimental results show that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores. The proposed ReDMAC can be applied to reconfigurable high-performance SoCs.
Fig. 8. Simulation result.
TABLE 2. EXECUTION TIME (CYCLES) OF KERNEL LOOPS ON VARIOUS COMPUTATION PLATFORMS
+25×C-26)/2
ACKNOWLEDGMENT
This work has been supported by Vietnam National
University, Hanoi under Project No. QG.16.33.
REFERENCES
[1] Kiem Hung Nguyen and Thi Minh Phan, "RTL Design of a Dynamically Reconfigurable Cell Array for Multimedia Processing," in Proceedings of the 4th NAFOSTED Conference on Information and Computer Science (NICS), 24-25 November 2017, Hanoi, Vietnam.
[2] M. Santarini, "Xilinx 16nm UltraScale+ devices yield 2-5X performance/watt advantage," Xcell Journal, 90 (2015): 8-15.
[3] B. Mei, M. Berekovic and J. Y. Mignolet, "ADRES & DRESC: Architecture and Compiler for Coarse-Grain Reconfigurable Processors," Fine- and Coarse-Grain Reconfigurable Computing, chapter 6, pp. 255-297, 2007.
[4] KiemHung Nguyen, Peng Cao, Xuexiang Wang, Jun Yang and Longxing Shi, "Hardware Software Co-design of H.264 Baseline Encoder on Coarse-Grained Dynamically Reconfigurable Computing System-on-Chip," IEICE Transactions on Information and Systems, E96-D (3), pp. 601-615, 2013. ISSN 0916-8532.
[5] N. Dutt, A. Jantsch and S. Sarma, "Toward Smart Embedded Systems: A Self-aware System-on-Chip (SoC) Perspective," ACM TECS, Vol. 15, No. 2, Article 22, February 2016.
[6] João M. P. Cardoso and Pedro C. Diniz, "Compilation Techniques for Reconfigurable Architectures," Springer, 2009.
[7] A. Shoa and S. Shirani, "Run-Time Reconfigurable Systems for Digital Signal Processing Applications: A Survey," Journal of VLSI Signal Processing, Vol. 39, pp. 213-235, 2005, Springer Science.
[8] Datasheet of Intel 8257 Programmable DMA Controller.
[9] Vaishali Tehre and Ravindra Kshirsagar, "Survey on coarse grained reconfigurable architectures," International Journal of Computer Applications 48.16 (2012): 1-7.
[10] Lattice Semiconductor Corporation, Scatter-Gather Direct Memory Access Controller IP Core Users Guide, October 2010.
[11] Altera Corporation, Scatter-Gather DMA Controller Core, Quartus II 9.1, November 2009.
[12] Xilinx, Channelized Direct Memory Access and Scatter Gather, February 2010.
[13] Hussain, Tassadaq, et al. "PPMC: a programmable pattern based Reconfigurable Computing Springer, Berlin, Heidelberg, 2012.
[14] Emelie Nilsson, "DMA Controller for LEON3 SoC:s Using AMBA," 2013.
[15] João M. P. Cardoso and Pedro C. Diniz, Compilation Techniques for Reconfigurable Architectures, Springer, 2009.
[16] AMBA Specification (Rev 2.0), http://www.arm.com
[17] http://www.nangate.com/