A Novel Hardware Architecture for Human Detection using HOG-SVM Co-Optimization

Ngo-Doanh Nguyen, Duy-Hieu Bui, Xuan-Tu Tran
SISLAB, VNU University of Engineering and Technology,
144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam

Corresponding author’s email: tutx@vnu.edu.vn

Abstract: Histogram of Oriented Gradient (HOG) in combination with Support Vector Machine (SVM) has been used as an efficient method for object detection in general and human detection in particular. Human detection using HOG-SVM in hardware shows a high classification rate at higher throughput when compared with deep learning methods. However, data dependencies and complicated arithmetic in HOG feature generation and SVM classification limit the maximum throughput of these applications. In this paper, we propose a novel high-throughput hardware architecture for human detection by co-optimizing HOG feature generation and SVM classification. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization. The proposed architecture has been implemented in TSMC 65nm technology with a maximum operating frequency of 500MHz and a throughput of 139fps for Full-HD resolution. The hardware area cost is about 145kGEs along with 242kb SRAMs.

Index Terms: Artificial Intelligence, Histogram of Oriented Gradient, Support Vector Machine, HOG, SVM

I. INTRODUCTION

Human detection is a crucial part of surveillance and automotive systems such as security cameras and smart cars. To detect a human, each frame from the camera is analyzed to decide whether it contains a person. These embedded-vision applications are often implemented in hardware to maximize the throughput and meet real-time requirements. However, current hardware implementations of human detection, such as HOG-SVM and deep Convolutional Neural Networks (CNN), are limited in throughput because of data dependencies and the large number of operations required. Between the two methods, HOG-SVM has an advantage over CNN because it requires fewer operations and has fewer data dependencies [1]. In contrast, CNN provides more robust solutions with increasing accuracy [1]. Therefore, HOG-SVM is more suitable for embedded-vision applications than CNN, since it offers high throughput and low power consumption in hardware implementations.

Human detection using HOG-SVM first describes the detection windows using HOG descriptors [2], then applies SVM classification to decide whether each window contains a person. The original HOG feature generation process contains many complicated arithmetic functions such as inverse tangent, square, square root, and division. In addition, SVM classification can only start when the HOG features are available. This data dependency further limits the throughput of the system and increases the memory size needed for storage and reuse of HOG features [3].

There have been many research works focusing on optimizing HOG feature generation and SVM for hardware implementation of human detection. To increase the throughput of HOG feature generation, the trigonometric functions can be approximated by using CORDIC [4], [5] or other approximate algorithms such as the works in [5]–[8]. Approximate computations can increase the throughput and reduce hardware complexity, but they reduce the accuracy of the HOG features. Throughput can also be increased by reusing the generated HOG features of the overlapped cells of the detection windows [4], [5]. Finally, a multi-core setup along with parallel SVM computation can be used to increase the throughput further [4], [9]. However, in these works, HOG feature generation and SVM are still calculated separately.

In this paper, we propose a novel architecture for calculating HOG features and performing detection using SVM by co-optimizing the two processes to increase the throughput. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization with data reuse. The SVM calculation for a block is performed on the unnormalized HOG features in parallel with the HOG normalization. Seven SVM modules are utilized for fast classification on the reused data of the overlapped cells of two windows. In addition, the area is optimized by using Sequential Multiply-and-Accumulate (SMAC) units for SVM and HOG normalization. The proposed architecture has been implemented using TSMC 65nm technology with a maximum operating frequency of 500MHz and a maximum throughput of 139fps for Full-HD video, with a hardware cost of 145kGEs and about 242kb SRAMs.

The rest of this paper is organized as follows. Section II reviews the current state of the art of HOG-SVM for hardware implementation. Our proposed hardware architecture to improve the throughput of HOG-SVM is described in Section III. Section IV presents our hardware implementation results. Finally, conclusions and perspectives are given in Section V.

II. STATE OF THE ART

A. Overview of HOG-SVM algorithm

Human detection using HOG-SVM was first proposed by Dalal and Triggs in [2]. Their processing flow is presented in Fig. 1. The algorithm works on a detection window of 64 × 128 pixels. After pre-processing, the detection window is divided into multiple cells with a size of 8 × 8 pixels. A cell histogram is generated based on the gradient of each pixel. Each gradient consists of an angle and a magnitude. The magnitudes of the gradients in each cell are accumulated into nine bins, selected by their angles, to construct the cell histogram. Four adjacent cells (2 × 2) are grouped into one block. Cell histograms are normalized based on block data.
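The sizes above determine the dimensions of the descriptor. The short Python sketch below works through the arithmetic. The one-cell block stride is an assumption that this section does not state; it is chosen because it reproduces the 3780 features per window reported in Fig. 1. The function name is illustrative only.

```python
# Window geometry for the HOG descriptor described above.
WINDOW_W, WINDOW_H = 64, 128   # detection window size in pixels (from the text)
CELL = 8                       # cell size in pixels (from the text)
BINS = 9                       # bins per cell histogram (from the text)
BLOCK_CELLS = 2                # a block is 2 x 2 cells (from the text)
BLOCK_STRIDE_CELLS = 1         # ASSUMPTION: blocks overlap with a one-cell stride


def hog_window_geometry():
    """Return the number of cells, blocks and features in one detection window."""
    cells_x = WINDOW_W // CELL
    cells_y = WINDOW_H // CELL
    blocks_x = (cells_x - BLOCK_CELLS) // BLOCK_STRIDE_CELLS + 1
    blocks_y = (cells_y - BLOCK_CELLS) // BLOCK_STRIDE_CELLS + 1
    bins_per_block = BLOCK_CELLS * BLOCK_CELLS * BINS
    return {
        "cells": cells_x * cells_y,
        "blocks": blocks_x * blocks_y,
        "bins_per_block": bins_per_block,
        "features": blocks_x * blocks_y * bins_per_block,
    }


if __name__ == "__main__":
    print(hog_window_geometry())
```

Under this assumption a window holds 8 × 16 cells and 7 × 15 overlapping blocks of 36 bins each.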

[Figure: processing flow of HOG-SVM. Labeled stages: image inputs, pre-processing (image level), gradient vector generation with angle and magnitude and cell histogram generation with 9 bins per 8x8-pixel cell (cell level), histogram normalization over blocks of 4 cells (block level), and SVM classification on the 3780 HOG features collected in a single 64x128-pixel detection window (window level).]

Fig. 1. HOG-SVM algorithm for human detection [9].

One of the most complicated parts of implementing the cell histogram is the gradient computation, which uses inverse tangent, square and square root as in equations (1) and (2):

$$m(x, y) = \sqrt{f_x(x, y)^2 + f_y(x, y)^2} \tag{1}$$

$$\theta(x, y) = \arctan\frac{f_y(x, y)}{f_x(x, y)} \tag{2}$$

where $f_x(x, y)$ and $f_y(x, y)$ are the horizontal and vertical gradients; $m(x, y)$ and $\theta(x, y)$ are the magnitude and angle, respectively.

Block normalization is done for four adjacent cells. L2 normalization is applied to the 36 bins to create the block histogram as described in (3):

$$v_i^n = \frac{v_i}{\sqrt{\|v\|_2^2 + \epsilon^2}} \tag{3}$$

where $v_i$ is a bin in a block histogram and $\epsilon$ is used to avoid division by zero.
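As an illustration of equations (1) to (3), the following Python sketch computes the gradient magnitude and angle, accumulates a nine-bin cell histogram, and applies L2 normalization to a block. It is a plain software reference under stated assumptions, not the hardware method of this paper: unsigned angles folded into [0, 180) degrees, evenly spaced bins, hard binning without interpolation, and an illustrative value for epsilon are all assumptions. `atan2` is used in place of a bare arctangent of the ratio so that a zero horizontal gradient does not cause a division by zero.

```python
import math

BINS = 9
BIN_WIDTH = 180.0 / BINS   # ASSUMPTION: unsigned angles in [0, 180), even bins


def gradient(fx, fy):
    """Equations (1) and (2): magnitude and angle of one gradient vector."""
    m = math.sqrt(fx * fx + fy * fy)
    theta = math.degrees(math.atan2(fy, fx)) % 180.0   # fold into [0, 180)
    return m, theta


def cell_histogram(gradients):
    """Accumulate magnitudes into nine bins selected by angle (hard binning)."""
    hist = [0.0] * BINS
    for fx, fy in gradients:
        m, theta = gradient(fx, fy)
        index = min(int(theta // BIN_WIDTH), BINS - 1)   # guard against rounding
        hist[index] += m
    return hist


def l2_normalize(block, eps=1e-3):
    """Equation (3) applied to the bins of one block (36 bins for 2 x 2 cells)."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]


if __name__ == "__main__":
    hist = cell_histogram([(3.0, 4.0), (0.0, 2.0)])
    print(hist)
    print(l2_normalize(hist * 4))
```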

Finally, an SVM classifier is used to decide the presence of a person in a detection window using the block histograms.

B. Related works

Hardware implementation of the complicated arithmetic functions used in calculating HOG features requires a large area and high power consumption. Many researchers have tried to optimize HOG generation for hardware. For example, the inverse tangent can be calculated by using CORDIC as proposed in [4], [5]. Some other works [6]–[8] avoid the inverse tangent by using an approximation function and converting the computation into a comparison of angles. Ho et al. [6] present a very fast and low-cost approximation method to calculate the angles and the corresponding magnitudes with an area of 3.5 kGEs. For block normalization, the L2-norm can be avoided by using the L1-norm [9], which is much simpler than the L2-norm, as described in (4):

$$v_i^n = \frac{v_i}{\|v\|_1 + \epsilon} \tag{4}$$

For the L2-norm implementation, Chen et al. [7] use the fast inverse square root based on the IEEE-754 number format. The square root function can also be implemented using the Newton-Raphson method [5]. In general, these approximations help reduce hardware area and increase throughput, but lower the accuracy of the generated HOG features.
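To give a feel for the kind of approximation mentioned above, the sketch below shows the widely published software form of the fast inverse square root on IEEE-754 single-precision values, followed by one Newton-Raphson refinement. It is only an illustration of the general technique: the constant 0x5F3759DF and the single refinement step are assumptions taken from the common software version, and the circuits in [5] and [7] may differ.

```python
import struct


def fast_inv_sqrt(x, newton_steps=1):
    """Approximate 1/sqrt(x) for x > 0 using the IEEE-754 float32 bit trick."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the exponent and negating it gives a rough first estimate.
    i = 0x5F3759DF - (i >> 1)   # ASSUMPTION: widely published magic constant
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)   # Newton-Raphson step for 1/sqrt(x)
    return y


if __name__ == "__main__":
    for x in (0.01, 1.0, 2.0, 36.0, 10000.0):
        exact = 1.0 / x ** 0.5
        print(x, fast_inv_sqrt(x), exact)
```

The trade-off is the one stated in the text: the estimate avoids a true square root and division at the cost of a small relative error in the normalized features.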

In addition, throughput has been increased by reusing cell histograms that overlap in multiple detection windows. Works such as Mizuno et al. [5] and Takagi et al. [4] use data-reuse methods to improve the throughput of the system. Finally, parallelism can be used to push the throughput further. For example, Takagi et al. used two HOG-SVM cores [4]. Suleiman et al. [9] utilized three HOG-SVM detectors for multi-scale support. Their work also uses multiple MAC modules in parallel for SVM to improve the throughput. They also presented a pre-processing step using a gradient function to reduce the bit width before calculating the histogram.

The aforementioned works have investigated various aspects of hardware implementations of HOG-SVM. However, HOG feature generation and SVM classification are optimized separately. In this work, we propose a novel architecture with the co-optimization of HOG feature generation and SVM classification. After the block histogram is generated, SVM classification is executed directly on unnormalized data and in parallel with the L2-norm function. Then, the SVM results are divided by the normalization coefficients. The hardware area is minimized for L2-norm and SVM by using SMAC. The proposed architecture is presented in the next section.

III. PROPOSED HARDWARE ARCHITECTURE

As stated in Section II, HOG-SVM contains many complicated arithmetic operations in cell histogram generation and block histogram normalization. Conventionally, SVM classification is performed on the normalized block data. To increase the throughput of HOG-SVM for human detection, we propose a novel hardware architecture which uses the co-optimization of HOG feature normalization and SVM classification.

The proposed architecture is summarized in Fig. 2. The pixels of each window are stored in an SRAM. In each clock cycle, 8 pixels are read to generate 8 gradient vectors. These vectors are accumulated to create one cell histogram every 8 clock cycles. The cell histogram buffer is used to form the block histograms. In our architecture, SVM and block normalization are executed in parallel. Block data containing 36 histogram values of 4 cells are sent to both SVM and L2-norm. The SVM results are normalized by the square root values. The final results are accumulated into the window accumulation registers to decide if there is a person in the search windows.

A. Cell Histogram Generation

The first step in HOG feature generation is to calculate the angles and the magnitudes of the gradient vectors. In this work, we use the method in [6] to generate the gradient vectors. This low-cost and high-throughput method enables the parallel processing of 8 pixels in one clock cycle. The gradient vectors are then used by the bin accumulator to generate cell histograms.

[Figure: block diagram of the proposed architecture. Labeled elements: window SRAM (64x128), Dx and Dy, gradient generators 0 to 7, Square & Acc, block SAC, Sequential MACs 0 to 6, SVM SRAM, divider, "Contain person?" output, and pipeline stages 1 to 5.]

Fig. 2. Proposed block diagram of object detection.

Our architecture can generate a cell histogram every 8 clock cycles. The outputs are then stored into the cell histogram buffer. The cell histogram buffer is a circular buffer which stores the 128 cells of a window for reuse. It provides a block histogram for SVM calculation.

B. Parallel computation of SVM and block normalization

In conventional HOG-SVM classification, SVM is performed on the normalized HOG features of the search window as described in equation (5):

$$D = \sum_{i=1}^{n} \left(\omega_i \times v_i^n\right) \tag{5}$$

where $\omega_i$ are the weights obtained after the training process and $v_i^n$ are the normalized HOG features.

However, the normalization process, especially when using the L2-norm (equation (3)), is time-consuming because it needs to collect all the information of a block and perform the square and square root operations. In this work, we propose to parallelize SVM and HOG block normalization. SVM is performed on unnormalized data and then divided by the block normalization coefficient. Equation (6) is the new SVM equation on unnormalized block data. By using this equation, the normalization process can be done in parallel with SVM classification.

$$D = \frac{\sum_{i=1}^{w} \omega_i \times v_i}{\sqrt{\|v\|_2^2 + \epsilon^2}} \tag{6}$$
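The reordering works because every feature in a block shares the same normalization coefficient, so the division can be factored out of the sum. The Python sketch below checks this numerically for one block and then accumulates per-block results into a window score. It is a software illustration only: the function names, the value of epsilon, and the omission of any SVM bias term are assumptions.

```python
import math


def l2_coefficient(block, eps=1e-3):
    """Square-root term of the L2 norm for one block (the shared denominator)."""
    return math.sqrt(sum(v * v for v in block) + eps * eps)


def svm_normalize_first(weights, block, eps=1e-3):
    """Conventional order: normalize every feature, then multiply-accumulate."""
    coeff = l2_coefficient(block, eps)
    return sum(w * (v / coeff) for w, v in zip(weights, block))


def svm_divide_last(weights, block, eps=1e-3):
    """Reordered form: multiply-accumulate on raw features, one final division."""
    partial = sum(w * v for w, v in zip(weights, block))   # independent of the norm
    return partial / l2_coefficient(block, eps)


def window_score(weights_per_block, blocks, eps=1e-3):
    """Accumulate the per-block results into a single window score."""
    return sum(svm_divide_last(w, b, eps) for w, b in zip(weights_per_block, blocks))


if __name__ == "__main__":
    block = [float((3 * i) % 7) for i in range(36)]
    weights = [((i % 5) - 2) * 0.25 for i in range(36)]
    print(svm_normalize_first(weights, block), svm_divide_last(weights, block))
```

Since `partial` does not depend on the norm, the two computations have no data dependency until the final division, which is what allows them to run side by side.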

To optimize the hardware area and speed further, we propose to use SMACs to calculate SVM and the sum of squared values (SAC). The SMAC with highly parallel inputs enables the computation of the SAC of a cell histogram in twelve cycles, while SVM performed on a block takes 8 cycles. The details of this MAC architecture are presented in the next subsection.

C. SMAC architecture for SAC and SVM

Instead of using a normal MAC module, this work uses SMACs to reduce the hardware cost and to increase the operating frequency. The architecture of the SMAC is described in Fig. 3. Our design is based on bit-serial multipliers and a parallel accumulator, which have been used in convolutional neural networks [10]. The number of multiplicand pairs can be changed at design time. This architecture loops through each bit of $w_i$ using a shift register. If the current bit is '1', the second multiplicand ($A_i$) is shifted, then accumulated into the result. An adder tree is used to add all $A_i$ whose $w_i$ has the current bit equal to '1'. A barrel shifter shifts the final result instead of each $A_i$ separately to save hardware area. The throughput of the SMAC depends on the number of input pairs and the number of bits used to represent $w_i$. SVM modules use a SMAC with 36 input pairs, which correspond to a block histogram. To calculate SAC, 9 input pairs are used, with both inputs of each pair set to the same value.
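The bit-serial scheme described above can be modeled in a few lines of Python. This is a behavioral sketch rather than the RTL of this work: it assumes unsigned integer weights, scans the weight bits from least to most significant, and does not model sign handling or register widths. The names `smac` and `sac` are illustrative.

```python
def smac(weights, values, weight_bits=8):
    """Bit-serial sum of products: one weight bit position per loop iteration.

    ASSUMPTIONS: weights are unsigned integers below 2**weight_bits and bits
    are scanned from least to most significant. Sign handling is not modeled.
    """
    assert len(weights) == len(values)
    assert all(0 <= w < (1 << weight_bits) for w in weights)
    acc = 0
    for bit in range(weight_bits):            # one "cycle" per weight bit
        # Adder tree: add every A_i whose weight has the current bit set.
        partial = sum(a for w, a in zip(weights, values) if (w >> bit) & 1)
        # One shared shift of the partial sum instead of shifting each A_i.
        acc += partial << bit
    return acc


def sac(values, bits=8):
    """Sum of squares: feed the same value to both inputs of every pair."""
    return smac(values, values, weight_bits=bits)


if __name__ == "__main__":
    w = [3, 0, 255, 17]
    a = [5, 7, 2, 9]
    print(smac(w, a), sum(x * y for x, y in zip(w, a)))
    print(sac([1, 2, 3]))
```

In this model the loop count equals the number of weight bits regardless of how many input pairs are summed, which mirrors the trade of multiplier area for a fixed number of cycles.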

[Figure: SMAC datapath. Labeled elements: weight inputs W1 to Wp (n bits each) with shift registers, registers holding m-bit inputs, an adder tree, a bit counter, a Done flag, and a sum-of-products (SoP) output of m + n + log(p) bits.]

Fig. 3. The architecture of the Sequential MAC (SMAC).

D. Data reuse and pipeline

For high-speed designs, data reuse is an important factor. In this work, we reuse the generated cell histograms by storing 128 × 9 bin values in the cell histogram buffer. After the generation of the first ten cell histograms of the first window, a block histogram is generated in 8 clock cycles. The square-root values for block normalization are reused through the SQRT buffer. At the frame level, after the first window is processed, the second window is processed by calculating only the cell histograms of the non-overlapped cells. When the window reaches the frame boundary, it is moved down and then moved to the left. As a result, SVM has to process the overlapped windows again. To solve this problem, we use 6 additional SMACs to calculate the SVM of the overlapped area, and the normalization is done by reusing the SQRT values in the SQRT buffers and the pipelined divider.

In our proposed architecture, data processing is performed at different levels. For example, the cell histogram generation processes 8 gradient vectors per clock cycle, and a cell histogram is generated in 8 clock cycles. In contrast, SAC works on a cell histogram and needs 12 clock cycles to finish. Therefore, to increase the throughput and the data utilization of the design, we double the units where necessary. For instance, the SAC and SQRT modules are doubled because they cannot process a cell histogram and a square root of a block SAC in 8 clock cycles. The two units work alternately to meet the timing of the system. The timing and activation of each unit are described in Fig. 4.
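The reason for doubling can be checked with a small scheduling model: when a new item arrives every 8 cycles but a unit is busy for 12 cycles, the number of units needed is the latency divided by the arrival interval, rounded up. The Python sketch below assumes simple round-robin assignment and uses the 12-cycle SAC latency from the text; the function names are illustrative and the model ignores all other pipeline stages.

```python
import math


def units_needed(latency_cycles, issue_interval_cycles):
    """Minimum number of identical units that keeps up with the arrival rate."""
    return math.ceil(latency_cycles / issue_interval_cycles)


def alternate_schedule(num_items, latency_cycles=12, issue_interval_cycles=8):
    """Assign items to units in round-robin order and list their busy windows.

    Returns tuples of (item, unit, start_cycle, end_cycle).
    """
    units = units_needed(latency_cycles, issue_interval_cycles)
    free_at = [0] * units
    schedule = []
    for item in range(num_items):
        start = item * issue_interval_cycles
        unit = item % units
        # The chosen unit must have finished its previous item by now.
        assert free_at[unit] <= start
        free_at[unit] = start + latency_cycles
        schedule.append((item, unit, start, start + latency_cycles))
    return schedule


if __name__ == "__main__":
    print(units_needed(12, 8))
    for row in alternate_schedule(4):
        print(row)
```

With a 12-cycle latency and an 8-cycle interval, two units alternate without ever being asked to start while still busy.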

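[Figure: pipeline timing diagram. Labeled elements: bin accumulation, SAC of cells (for example cells 8 and 12), sum of SAC per block, two square-root units (SQRT 0 and SQRT 1) working on successive blocks, SVM per block, and the divider (DIV), with stage lengths of 8 and 12 cycles.]

Fig. 4. Data pipeline of the proposed architecture for the first window.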

With the proposed method, our design needs 1.12K cycles to compute HOG-SVM for the first window. For other windows, the data reuse scheme is activated. In the worst case, only 15 new blocks require HOG feature generation. In this case, the first SVM SMAC is assigned to these blocks. The other 6 SVM SMACs are utilized for recalculating the SVM classification of the overlapped blocks. With the data pipeline, the total number of cycles for a new window with data reuse is 128. For a Full-HD image, our design requires about 3.58M cycles to finish, which leads to a peak throughput of 139fps at a frequency of 500MHz.
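The throughput figure follows directly from the cycle count and the clock frequency. The Python sketch below reproduces that arithmetic and also estimates the cycles per frame from the 128 cycles per window. The 8-pixel window stride and the 1920 × 1080 frame size are assumptions not stated in this section, and the estimate ignores the longer first window.

```python
CLOCK_HZ = 500_000_000          # 500MHz (from the text)
CYCLES_PER_FRAME = 3_580_000    # "about 3.58M cycles" (from the text)
CYCLES_PER_WINDOW = 128         # window with data reuse (from the text)
WINDOW_W, WINDOW_H = 64, 128    # detection window in pixels (from the text)
FRAME_W, FRAME_H = 1920, 1080   # ASSUMPTION: Full-HD frame size
STRIDE = 8                      # ASSUMPTION: the window moves one cell at a time


def frames_per_second(clock_hz, cycles_per_frame):
    """Peak frame rate when every frame takes the same number of cycles."""
    return clock_hz / cycles_per_frame


def estimated_cycles_per_frame(stride=STRIDE):
    """Window positions per frame times cycles per window (startup ignored)."""
    windows_x = (FRAME_W - WINDOW_W) // stride + 1
    windows_y = (FRAME_H - WINDOW_H) // stride + 1
    return windows_x * windows_y * CYCLES_PER_WINDOW


if __name__ == "__main__":
    print(frames_per_second(CLOCK_HZ, CYCLES_PER_FRAME))
    print(estimated_cycles_per_frame())
```

Under these assumptions the estimate gives 233 × 120 window positions and roughly 3.58M cycles, in line with the figure quoted above.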

IV. HARDWARE IMPLEMENTATION RESULTS

The proposed architecture has been implemented in Matlab with the training and test datasets from [2]. The proposed hardware architecture has been modeled in VHDL, then simulated and synthesized using Synopsys VCS and Design Compiler. The hardware model shows only minor differences in accuracy when compared with the software model in Matlab using the INRIA person dataset. The final RTL has been synthesized by Design Compiler using the TSMC 65nm standard cell library and an SRAM model from ARM.

The implementation results are summarized in Table I. With our optimizations, this design can run at 500MHz with a hardware area of 145kGEs. At this frequency, the proposed design can process Full-HD images at 139fps. The total size of SRAMs in this work is about 242kb. Our design achieves the highest operating frequency and the highest frame rate when compared with the previous works in [4], [8] and [9]. Our design also uses fewer SRAMs for data reuse.

TABLE I

538 Mbit 0.242Mbit

* The HOG core area is calculated from the original paper to the best of our knowledge.

V. CONCLUSION

Human detection has a wide range of applications such as robotics, automotive systems, and video surveillance. One of the efficient methods to perform human detection is HOG-SVM. However, the complexity and data dependencies of HOG-SVM limit its throughput in hardware implementations. In this paper, we proposed a novel hardware architecture with the co-optimization of HOG normalization and SVM classification. Along with the data reuse strategy, the proposed hardware architecture can run at a maximum frequency of 500MHz in TSMC 65nm technology with a throughput of 139fps for Full-HD resolution. This hardware design can be used for very-high-speed human detection systems.

ACKNOWLEDGMENT

This research is partly supported by the Ministry of Science and Technology (MoST) of Vietnam under grant number 28/2018/TL.CN-CNC.

REFERENCES

[1] A. Suleiman, Y.-H. Chen, J. Emer, and V. Sze, “Towards closing the energy gap between HOG and CNN features for embedded vision,” in ISCAS, May 2017, pp. 1–4.

[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE-CVPR, vol. 1, June 2005, pp. 886–893.

[3] M. Hiromoto and R. Miyamoto, “Hardware architecture for high-accuracy real-time pedestrian detection with CoHOG features,” in IEEE-ICCV Workshops, Sep. 2009, pp. 894–899.

[4] K. Takagi, K. Mizuno, S. Izumi, H. Kawaguchi, and M. Yoshimoto, “A sub-100-milliwatt dual-core HOG accelerator VLSI for real-time multiple object detection,” in IEEE-ICASSP, May 2013, pp. 2533–2537.

[5] K. Mizuno, Y. Terachi, K. Takagi, S. Izumi, H. Kawaguchi, and M. Yoshimoto, “Architectural study of HOG feature extraction processor for real-time object detection,” in IEEE SiPS, 2012, pp. 197–202.

[6] H.-H. Ho, N.-S. Nguyen, D.-H. Bui, and X.-T. Tran, “Accurate and low complex cell histogram generation by bypass the gradient of pixel computation,” in IEEE-NICS, Nov. 2017, pp. 201–206.

[7] P.-Y. Chen, C.-C. Huang, C.-Y. Lien, and Y.-H. Tsai, “An efficient hardware implementation of HOG feature extraction for human detection,” IEEE-TITS, vol. 15, no. 2, pp. 656–662, April 2014.

[8] F. An, X. Zhang, A. Luo, L. Chen, and H. J. Mattausch, “A hardware architecture for cell-based feature-extraction and classification using dual-feature space,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3086–3098, Oct. 2018.

[9] A. Suleiman and V. Sze, “Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support,” in IEEE-SiPS, Oct. 2014, pp. 1–6.

[10] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in IEEE/ACM MICRO, Oct. 2016, pp. 1–12.
