A Novel Hardware Architecture for Human Detection using HOG-SVM Co-Optimization
Ngo-Doanh Nguyen, Duy-Hieu Bui, Xuan-Tu Tran
SISLAB, VNU University of Engineering and Technology,
144 Xuan Thuy road, Cau Giay, Hanoi, Vietnam
Corresponding author’s email: tutx@vnu.edu.vn
Abstract—Histogram of Oriented Gradients (HOG) in combination with Support Vector Machine (SVM) has been used as an efficient method for object detection in general and human detection in particular. Human detection using HOG-SVM in hardware shows a high classification rate at higher throughput when compared with deep learning methods. However, data dependencies and complicated arithmetic in HOG feature generation and SVM classification limit the maximum throughput of these applications. In this paper, we propose a novel high-throughput hardware architecture for human detection by co-optimizing HOG feature generation and SVM classification. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization. The proposed architecture has been implemented in TSMC 65nm technology with a maximum operating frequency of 500MHz and a throughput of 139fps for Full-HD resolution. The hardware area cost is about 145kGEs along with 242kb SRAMs.
Index Terms—Artificial Intelligence, Histogram of Oriented Gradients, Support Vector Machine, HOG, SVM
I. INTRODUCTION
Human detection has been a crucial part of surveillance and automotive systems such as security cameras and smart cars. To detect a human, each frame from the camera is analyzed to decide whether it contains a person. These embedded-vision applications are often implemented in hardware to maximize the throughput and meet the real-time requirement. However, the current hardware implementations of human detection, such as HOG-SVM and Deep Convolutional Neural Networks (CNN), are limited in throughput because of data dependencies and the large number of operations required. Between the two methods, HOG-SVM has an advantage over CNN because it requires fewer operations and has fewer data dependencies [1]. In contrast, CNN provides more robust solutions with increasing accuracy [1]. Therefore, HOG-SVM is more suitable for embedded-vision applications than CNN since it offers high throughput and low power consumption in hardware.
Human detection using HOG-SVM first describes the detection windows using HOG descriptors [2], then applies SVM classification to decide whether each window contains a person. The original HOG feature generation process contains many complicated arithmetic functions such as inverse tangent, square, square root, and division. SVM classification can only start when the HOG features are available. This data dependency further limits the throughput of the system and increases the memory size needed for storage and reuse of HOG features [3].
Many research works have focused on optimizing HOG feature generation and SVM for hardware implementation of human detection. To increase the throughput of HOG feature generation, trigonometric functions can be approximated using CORDIC [4], [5] or other approximate algorithms such as those in [5]–[8]. Approximate computation can increase the throughput and reduce hardware complexity, but it reduces the accuracy of the HOG features. Throughput can also be increased by reusing the generated HOG features [4], [5] of the overlapped cells of the detection windows. Finally, a multi-core setup along with parallel SVM computation can be used to increase the throughput further [4], [9]. However, in these works, HOG feature generation and SVM are still calculated separately.
In this paper, we propose a novel architecture for calculating HOG features and performing detection using SVM by co-optimizing the two processes to increase the throughput. The throughput is improved by using a fast, highly parallel and low-cost HOG feature generation in combination with a modified datapath for parallel computation of SVM and HOG feature normalization with data reuse. The SVM calculation for a block is performed on the unnormalized HOG features in parallel with the HOG normalization. Seven SVM modules are utilized for fast classification on the reused data of the overlapped cells of two windows. In addition, the area is optimized by using Sequential Multiply-and-Accumulate (SMAC) units for SVM and HOG normalization. The proposed architecture has been implemented using TSMC 65nm technology with a maximum operating frequency of 500MHz and a maximum throughput of 139fps for Full-HD video, with a hardware cost of 145kGEs and about 242kb SRAMs.
The rest of this paper is organized as follows. Section II reviews the current state of the art of HOG-SVM for hardware implementation. Our proposed hardware architecture to improve the throughput of HOG-SVM is described in Section III. Section IV presents our hardware implementation results. Finally, conclusions and perspectives are given in Section V.
II. STATE OF THE ART
A. Overview of HOG-SVM algorithm
Human detection using HOG-SVM was first proposed by Dalal and Triggs in [2]. Their processing flow is presented in Fig. 1. The algorithm works on a detection window of 64 × 128 pixels. After pre-processing, the detection window is divided into multiple cells with a size of 8 × 8 pixels. A cell histogram is generated based on the gradient of each pixel, where each gradient consists of an angle and a magnitude. The magnitudes of the gradients in each cell are accumulated into nine bins, selected by their angles, to construct the cell histogram. Four adjacent cells (2 × 2) are grouped into one block. Cell histograms are normalized based on block data.
[Figure 1: processing flow from image inputs through pre-processing, gradient vector generation (angle and magnitude for each pixel of an 8x8-pixel cell), cell histogram generation (9 bins), block (4 cells) histogram normalization, and collection of 3780 HOG features in a single 64x128-pixel detection window, to SVM classification. The stages are grouped into image, cell, block and window levels.]
Fig. 1. HOG-SVM algorithm for human detection [9].
One of the most complicated parts of implementing the cell histogram is the gradient computation, which uses inverse tangent, square and square root as in equations (1) and (2).

$$m(x, y) = \sqrt{f_x(x, y)^2 + f_y(x, y)^2} \tag{1}$$

$$\theta(x, y) = \arctan\frac{f_y(x, y)}{f_x(x, y)} \tag{2}$$

where $f_x(x, y)$ and $f_y(x, y)$ are the horizontal and vertical gradients; $m(x, y)$ and $\theta(x, y)$ are the magnitude and angle, respectively.
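As an illustration of equations (1) and (2), the following Python sketch accumulates the gradient magnitudes of one cell into nine orientation bins. It is a minimal software reference, not the hardware method of this paper: the centered-difference gradient, the unsigned 0 to 180 degree range, the hard (non-interpolated) bin assignment, and the name `cell_histogram` are all assumptions made for the example.

```python
import math

def cell_histogram(patch, bins=9):
    """Accumulate gradient magnitudes of the patch interior into orientation bins.

    `patch` is a list of rows of grayscale values that includes a one-pixel
    border, so a 10x10 patch covers one 8x8 cell (assumption of this sketch).
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            fx = patch[y][x + 1] - patch[y][x - 1]  # horizontal gradient
            fy = patch[y + 1][x] - patch[y - 1][x]  # vertical gradient
            m = math.sqrt(fx * fx + fy * fy)        # magnitude, equation (1)
            # Angle, equation (2), folded into the unsigned range [0, 180).
            theta = math.degrees(math.atan2(fy, fx)) % 180.0
            # Clamp guards against theta rounding up to exactly 180.0.
            hist[min(int(theta // bin_width), bins - 1)] += m
    return hist
```

For a purely horizontal intensity ramp all the magnitude lands in the first bin, and for a vertical ramp it lands in the middle bin, which is a quick sanity check of the angle convention.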
Block normalization is done over four adjacent cells. L2 normalization is applied to the 36 bins to create the block histogram as described in (3).

$$v_i^n = \frac{v_i}{\sqrt{\|v\|_2^2 + \epsilon^2}} \tag{3}$$

where $v_i$ is a bin in a block histogram and $\epsilon$ is a small constant used to avoid division by zero.
Finally, an SVM classifier is used to decide the presence of a person in a detection window using the block histograms.
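The block-level steps above can be sketched in Python as follows. The sketch assumes the common form of the L2 norm with $\epsilon^2$ added under the square root and a block stride of one cell (overlapping blocks); the function name, the constant names and the value of `eps` are hypothetical. Under these assumptions the feature count matches the 3780 HOG features per detection window shown in Fig. 1.

```python
import math

def l2_normalize_block(block, eps=1e-3):
    """L2-normalize the 36 bins of one block (4 cells x 9 bins)."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]

# Feature count for one 64x128 window with 8x8 cells and 2x2-cell blocks,
# assuming the block slides by one cell in each direction.
CELLS_X, CELLS_Y = 64 // 8, 128 // 8      # 8 x 16 = 128 cells
BLOCKS = (CELLS_X - 1) * (CELLS_Y - 1)    # 7 x 15 = 105 blocks
FEATURES = BLOCKS * 4 * 9                 # 105 x 36 = 3780 features
```

The 128 cells per window also agree with the size of the cell histogram buffer described in Section III.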
B. Related works
Hardware implementation of the complicated arithmetic functions used in calculating HOG features requires a large area and high power consumption. Many researchers have tried to optimize HOG generation for this purpose. For example, the inverse tangent can be calculated using CORDIC as proposed in [4], [5]. Other works [6]–[8] avoid the inverse tangent by using an approximation function and converting the calculation into a comparison of angles. Ho et al. [6] show a very fast and low-cost approximation method to calculate the angles and the corresponding magnitudes with an area of 3.5 kGEs. For block normalization, the L2-norm function can be avoided by using the L1-norm [9], which is much simpler than the L2-norm, as described in (4).

$$v_i^n = \frac{v_i}{\|v\|_1 + \epsilon} \tag{4}$$

For the L2-norm implementation, Chen et al. [7] use the fast inverse square root based on the IEEE-754 number format. The square root function can also be implemented using the Newton-Raphson method [5]. In general, these approximations help reduce hardware area and increase throughput, but lower the accuracy of the generated HOG features.
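To make the Newton-Raphson option concrete, here is a minimal Python sketch of an integer square root computed by Newton-Raphson iteration. It is a generic textbook formulation and is not taken from [5]; the function name and the choice of initial guess are assumptions of this example.

```python
def isqrt_newton(n):
    """Return floor(sqrt(n)) for a non-negative integer n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Power-of-two initial guess that is never below sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) // 2   # Newton-Raphson update for x^2 - n = 0
        if y >= x:              # the sequence stopped decreasing: converged
            return x
        x = y
```

Each iteration needs one division, which is why hardware designs often bound the number of iterations or replace the division with a cheaper approximation.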
In addition, throughput has been increased by reusing the cell histograms that overlap in multiple detection windows. Many works, such as Mizuno et al. [5] and Takagi et al. [4], use data-reuse methods to improve the throughput of the system. Finally, parallelism can be used to push the throughput further. For example, Takagi et al. used two HOG-SVM cores [4]. Suleiman et al. [9] utilized three HOG-SVM detectors for multi-scale support. Their work also uses multiple MAC modules in parallel for SVM to improve the throughput. They also presented a pre-processing step using a gradient function to reduce the bit width before calculating the histogram.
The aforementioned works have investigated various aspects of hardware implementations of HOG-SVM. However, HOG feature generation and SVM classification are optimized separately. In this work, we propose a novel architecture with the co-optimization of HOG feature generation and SVM classification. After the block histogram is generated, SVM classification is executed directly on unnormalized data, in parallel with the L2-norm function. Then, the SVM results are divided by the normalization coefficients. The hardware area is minimized for L2-norm and SVM by using SMAC. The proposed architecture is presented in the next section.
III. PROPOSED HARDWARE ARCHITECTURE
As stated in Section II, HOG-SVM contains many complicated arithmetic operations in cell histogram generation and block histogram normalization. Conventionally, SVM classification is performed on the normalized block data. To increase the throughput of HOG-SVM for human detection, we propose a novel hardware architecture which uses the co-optimization of HOG feature normalization and SVM classification. The proposed architecture is summarized in Fig. 2. The pixels of each window are stored in an SRAM. In each clock cycle, 8 pixels are read to generate 8 gradient vectors. These vectors are accumulated to create one cell histogram every 8 clock cycles. The cell histogram buffer is used to form the block histograms. In our architecture, SVM and block normalization are executed in parallel. Block data containing the 36 histogram values of 4 cells are sent to both the SVM and the L2-norm units. The SVM results are normalized by the square-root values. The final results are accumulated into the window accumulation registers to decide if there is a person in the search window.
A. Cell Histogram Generation
The first step in HOG feature generation is to calculate the angles and the magnitudes of the gradient vectors. In this work, we use the method in [6] to generate the gradient vectors. This low-cost and high-throughput method enables the parallel processing of 8 pixels in one clock cycle. The gradient vectors are then used by the bin accumulator to generate cell histograms.
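[Figure 2: block diagram showing the window SRAM (64x128), the Dx/Dy inputs to gradient generators 0 to 7, the square-and-accumulate and block SAC units, sequential MACs 0 to 6 with the SVM SRAM, and the divider producing the "contain person?" decision, organized in pipeline stages 1 to 5.]
Fig. 2. Proposed block diagram of object detection.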
Our architecture can generate a cell histogram every 8 clock cycles. The outputs are then stored into the cell histogram buffer. The cell histogram buffer is a circular buffer which stores the 128 cells of a window for reuse. It provides a block histogram for the SVM calculation.
B. Parallel computation of SVM and block normalization
In conventional HOG-SVM classification, SVM is performed on the normalized HOG features of the search window as described in equation (5).

$$D = \sum_{i=1}^{n} \left(\omega_i \times v_i^n\right) \tag{5}$$

where $\omega_i$ are the weights obtained from the training process and $v_i^n$ are the normalized HOG features.
However, the normalization process, especially when using the L2-norm (equation (3)), is time-consuming because it needs to collect all the information of a block and perform the square and square root operations. In this work, we propose to parallelize SVM and HOG block normalization. SVM is performed on unnormalized data and then divided by the normalization coefficient of the block. Equation (6) is the new SVM equation on unnormalized block data. By using this equation, the normalization process can be done in parallel with SVM classification.

$$D = \frac{\sum_{i=1}^{w} \omega_i \times v_i}{\sqrt{\|v\|_2^2 + \epsilon^2}} \tag{6}$$

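The reordering works because every bin of a block is divided by the same norm, so the division can be factored out of the weighted sum. The following Python sketch checks this numerically for one block. It assumes the L2 norm with $\epsilon^2$ under the square root and omits any SVM bias term; the function names and the value of `eps` are hypothetical.

```python
import math

def svm_normalized(weights, block, eps=1e-3):
    """Conventional order: normalize the block first, then apply the SVM weights."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return sum(w * (v / norm) for w, v in zip(weights, block))

def svm_co_optimized(weights, block, eps=1e-3):
    """Reordered: dot product and sum of squares are independent, one divide at the end."""
    dot = sum(w * v for w, v in zip(weights, block))  # SVM path on raw bins
    sac = sum(v * v for v in block)                   # sum-of-squares path
    return dot / math.sqrt(sac + eps * eps)
```

In the second function `dot` and `sac` do not depend on each other, which is what allows the two computations to run in parallel in hardware, with a single division per block instead of one per bin.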
To optimize the hardware area and speed further, we propose to use SMAC to calculate both the SVM and the sum of squared values (SAC). The SMAC with highly parallel inputs enables the computation of the SAC of a cell histogram in 12 cycles, while the SVM performed on a block takes 8 cycles. The details of this MAC architecture are presented in the next subsection.
C. SMAC architecture for SAC and SVM
Instead of using a normal MAC module, this work uses SMACs to reduce the hardware cost and to increase the operating frequency. The architecture of the SMAC is described in Fig. 3. Our design is based on bit-serial multipliers and a parallel accumulator, which have been used in convolutional neural networks [10]. The number of multiplicand pairs can be changed at design time. This architecture loops through each bit of $w_i$ using a shift register. If the current bit is '1', the second multiplicand ($A_i$) is shifted, then accumulated into the result. An adder tree is used to add all $A_i$ whose current bit of $w_i$ is equal to '1'. A barrel shifter shifts the final result instead of each $A_i$ separately to save hardware area. The throughput of the SMAC depends on the number of input pairs and the number of bits used to represent $w_i$. The SVM modules use an SMAC with 36 input pairs, which correspond to a block histogram. To calculate the SAC, 9 input pairs are used, with the same value applied to both inputs of each pair.
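A behavioral model helps to see why shifting the adder-tree output once per bit gives the same sum of products as shifting every operand. The Python sketch below models the bit-serial scheme for unsigned weights processed from the least significant bit; the handling of signed weights, the bit order, the default width `n_bits=8` and the function name are assumptions of this example and are not specified in the text.

```python
def smac(weights, operands, n_bits=8):
    """Bit-serial sum of products for unsigned weights of n_bits each.

    One loop iteration models one clock cycle: the adder tree sums every
    operand whose weight has the current bit set, and that single sum is
    shifted by the bit position before it is accumulated.
    """
    acc = 0
    for bit in range(n_bits):
        tree_sum = sum(a for w, a in zip(weights, operands) if (w >> bit) & 1)
        acc += tree_sum << bit   # one shift of the tree output per cycle
    return acc
```

Calling `smac(v, v)` with the same list on both inputs gives the sum of squares used for the SAC, and the cycle count equals the weight bit width regardless of the number of input pairs.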
[Figure 3: SMAC datapath with p weight inputs W1 to Wp (n bits each) loaded into shift registers, p operand registers (m bits each), an adder tree, a bit counter, and a sum-of-products (SoP) output of m+n+log(p) bits with a Done flag.]
Fig. 3. The architecture of the Sequential MAC (SMAC).
D. Data reuse and pipeline
For high-speed designs, data reuse is an important factor. In this work, we reuse the generated cell histograms by storing 128 × 9 bin values in the cell histogram buffer. After the generation of the first ten cell histograms of the first window, a block histogram is generated every 8 clock cycles. The square-root values for block normalization are reused through the SQRT buffer. At the frame level, after the first window is processed, the second window is processed by calculating only the cell histograms of the non-overlapped cells. When the window reaches the frame boundary, it is moved down and then moves to the left. As a result, the SVM has to process the overlapped windows again. To solve this problem, we use 6 additional SMACs to calculate the SVM of the overlapped area, and the normalization is done by reusing the SQRT values in the SQRT buffers and the pipelined divider.
In our proposed architecture, data processing is performed at different levels. For example, the cell histogram generation processes 8 gradient vectors per clock cycle, and a cell histogram is generated in 8 clock cycles. In contrast, SAC works on a cell histogram and needs 12 clock cycles to finish. Therefore, to increase the throughput and the data utilization of the design, we double the units where necessary. For instance, the SAC and SQRT modules are doubled because they cannot process a cell histogram and the square root of a block SAC, respectively, within 8 clock cycles. The two units work alternately to meet the timing of the system. The timing and activation of each unit are described in Fig. 4.
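[Figure 4: timing diagram showing the bin accumulation, the SAC of individual cells, the block SAC sum, the two alternating SQRT units (SQRT 0 and SQRT 1) working on consecutive blocks, the divider (DIV) and the block SVM, with stage durations annotated as 12 and 8 cycles.]
Fig. 4. Data pipeline of the proposed architecture for the first window.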
With the proposed method, our design needs 1.12K cycles to compute HOG-SVM for the first window. For the other windows, the data reuse scheme is activated. In the worst case, only 15 new blocks require HOG feature generation. In this case, the first SVM SMAC is assigned to these blocks. The other 6 SVM SMACs are used to recalculate the SVM classification for the overlapped blocks. With the data pipeline, a new window with data reuse takes 128 cycles in total. For a Full-HD image, our design requires about 3.58M cycles to finish, which leads to a peak throughput of 139fps at a frequency of 500MHz.
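The cycle budget can be cross-checked with a short calculation. The Python sketch below assumes a single-scale scan in which the 64 × 128 window slides by 8 pixels (one cell) in both directions and every window costs 128 cycles; the stride and the constant names are assumptions of this example, chosen because they reproduce the reported figure of about 3.58M cycles.

```python
FRAME_W, FRAME_H = 1920, 1080   # Full-HD frame
WIN_W, WIN_H = 64, 128          # detection window
STRIDE = 8                      # assumed window stride: one cell
CYCLES_PER_WINDOW = 128         # reported cost of a window with data reuse
CLOCK_HZ = 500e6                # reported maximum operating frequency

windows_x = (FRAME_W - WIN_W) // STRIDE + 1    # 233 horizontal positions
windows_y = (FRAME_H - WIN_H) // STRIDE + 1    # 120 vertical positions
WINDOWS = windows_x * windows_y                # 27960 windows per frame
CYCLES_PER_FRAME = WINDOWS * CYCLES_PER_WINDOW # 3578880, about 3.58M cycles
FPS = CLOCK_HZ / CYCLES_PER_FRAME              # slightly above 139 frames/s
```

The extra cycles of the first window (1.12K instead of 128) change the total by less than 0.03 percent, so they do not affect the 139fps figure.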
IV. HARDWARE IMPLEMENTATION RESULTS
The proposed architecture has been implemented in Matlab with the training and test datasets from [2]. The proposed hardware architecture has been modeled in VHDL, then simulated and synthesized using Synopsys VCS and Design Compiler. The hardware model shows minor differences in accuracy when compared with the software model in Matlab using the INRIA person dataset. The final RTL has been synthesized by Design Compiler using the TSMC 65nm standard cell library and an SRAM model from ARM.
The implementation results are summarized in Table I. With our optimizations, this design can run at 500MHz with a hardware area of 145kGEs. At this frequency, the proposed design can process Full-HD images at 139fps. The total size of the SRAMs in this work is about 242kb. Our design achieves the highest operating frequency and the highest frame rate when compared with the previous works in [4], [8] and [9]. Our design also uses fewer SRAMs for data reuse.
TABLE I
538 Mbit 0.242Mbit
* The HOG core area is calculated from the original paper to the best of our knowledge.
V. CONCLUSION
Human detection has a wide range of applications such as robotics, automotive systems, and video surveillance. One of the efficient methods to perform human detection is HOG-SVM. However, the complexity and data dependencies of HOG-SVM limit its throughput in hardware implementations. In this paper, we proposed a novel hardware architecture with the co-optimization of HOG normalization and SVM classification. Along with the data reuse strategy, the proposed hardware architecture can run at a maximum frequency of 500MHz in TSMC 65nm technology with a throughput of 139fps for Full-HD resolution. This hardware design can be used for very-high-speed human detection systems.
ACKNOWLEDGMENT
This research is partly supported by the Ministry of Science and Technology (MoST) of Vietnam under grant number 28/2018/TL.CN-CNC.
REFERENCES
[1] A. Suleiman, Y.-H. Chen, J. Emer, and V. Sze, “Towards closing the energy gap between HOG and CNN features for embedded vision,” in ISCAS, May 2017, pp. 1–4.
[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE-CVPR, vol. 1, June 2005, pp. 886–893.
[3] M. Hiromoto and R. Miyamoto, “Hardware architecture for high-accuracy real-time pedestrian detection with CoHOG features,” in IEEE-ICCV Workshops, Sep. 2009, pp. 894–899.
[4] K. Takagi, K. Mizuno, S. Izumi, H. Kawaguchi, and M. Yoshimoto, “A sub-100-milliwatt dual-core HOG accelerator VLSI for real-time multiple object detection,” in IEEE-ICASSP, May 2013, pp. 2533–2537.
[5] K. Mizuno, Y. Terachi, K. Takagi, S. Izumi, H. Kawaguchi, and M. Yoshimoto, “Architectural study of HOG feature extraction processor for real-time object detection,” in IEEE SIPS, 2012, pp. 197–202.
[6] H.-H. Ho, N.-S. Nguyen, D.-H. Bui, and X.-T. Tran, “Accurate and low complex cell histogram generation by bypass the gradient of pixel computation,” in IEEE-NICS, Nov. 2017, pp. 201–206.
[7] P.-Y. Chen, C.-C. Huang, C.-Y. Lien, and Y.-H. Tsai, “An efficient hardware implementation of HOG feature extraction for human detection,” IEEE-TITS, vol. 15, no. 2, pp. 656–662, April 2014.
[8] F. An, X. Zhang, A. Luo, L. Chen, and H. J. Mattausch, “A hardware architecture for cell-based feature-extraction and classification using dual-feature space,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3086–3098, Oct. 2018.
[9] A. Suleiman and V. Sze, “Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support,” in IEEE-SiPS, Oct. 2014, pp. 1–6.
[10] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in IEEE/ACM MICRO, Oct. 2016, pp. 1–12.