A Variable Precision Approach for Deep Neural Networks
Xuan-Tuyen Tran, Duy-Anh Nguyen, Duy-Hieu Bui, Xuan-Tu Tran*
SISLAB, VNU University of Engineering and Technology – 144 Xuan Thuy Road, Cau Giay, Hanoi, Vietnam
* Corresponding author's email: tutx@vnu.edu.vn
Abstract— Deep Neural Network (DNN) architectures have recently been considered a major breakthrough for a variety of applications. Because of the high computing capabilities required, DNNs have been unsuitable for many embedded applications. Many works have tried to optimize the key operations, which are multiply-and-add, in hardware for smaller area, higher throughput, and lower power consumption. One way to optimize these factors is to use reduced bit accuracy; for example, Google's TPU uses only 8-bit integer operations for DNN inference. Based on the characteristics of different layers in a DNN, the bit accuracy can be further adjusted to save hardware area and power consumption while maintaining throughput. In this work, we investigate a hardware implementation of multiply-and-add with variable bit precision, which can be adjusted at computation time. The proposed design can calculate the sum of several products with a bit precision ranging from 1 to 16 bits. The hardware implementation results on the Xilinx Virtex-7 FPGA VC707 development kit show that our design occupies smaller hardware and can run at a higher frequency of 310 MHz, while the same functionality implemented with and without DSP48 blocks can only run at a frequency of 102 MHz. In addition, to demonstrate that the proposed design is effectively applicable to deep neural network architectures, the new design has also been integrated into an MNIST network. The simulation and verification results show that the proposed system can achieve an accuracy of up to 88%.
Keywords: Deep learning, neural network,
variable-weight-bit, throughput, power consumption
I. INTRODUCTION
Deep learning using neural network architectures has exploded with fascinating and promising results. With many achievements in image classification [1], speech recognition [2], and genomics [3], deep learning has considerably surpassed traditional algorithms based on handcrafted features. Its outstanding ability in feature extraction makes the deep neural network a promising candidate for many artificial intelligence applications. Nevertheless, one of the most serious concerns is the high computational complexity, which poses a real difficulty for the application of this approach. While the essential requirements of high speed and energy efficiency for real-time applications have not been met on general-purpose platforms such as GPUs or CPUs, specifically designed hardware makes the FPGA a promising platform for implementing neural network algorithms. To realize such algorithms effectively on this computational platform, various accelerator techniques have been introduced.
Previous works have tried to reduce the computational complexity and data storage by using quantization [4] or approximation [5]. These works show that DNNs can be accelerated by using different bit precisions for the operations in different layers. For example, Google's TPU shows that, for the inference mode of DNNs, 8-bit precision in fixed-point format using integer arithmetic is good enough for a large number of applications. However, reducing the bit precision carelessly will affect the network accuracy through accumulated errors. Therefore, this reduction should be considered as an integrated part not only of the inference mode but also of the training phase and the backpropagation steps. Bit precision in fixed-point format comprises a number of bits representing the integer part and the fractional part of the data in a DNN and of their corresponding arithmetic operations. In addition, reducing the bit width (i.e., the total number of bits used to represent the integer part and the fractional part) will decrease the memory bandwidth and the data storage space when compared with the standard computation using 32-bit or 16-bit data widths.
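As a minimal illustration of this fixed-point format (our own sketch, not code from the paper; the function and parameter names are ours), the snippet below quantizes a real value onto a signed fixed-point grid with a chosen split between integer and fractional bits. Reducing the number of fractional bits shrinks both the storage per value and the width of the arithmetic operators, at the cost of a coarser grid.

```python
def to_fixed_point(x, int_bits, frac_bits):
    """Quantize a real value to a signed fixed-point number with int_bits
    integer bits and frac_bits fractional bits (plus an implicit sign bit)."""
    scale = 1 << frac_bits
    q = round(x * scale)                          # round to the nearest 2^-frac_bits step
    max_q = (1 << (int_bits + frac_bits)) - 1     # saturate to the representable range
    min_q = -(1 << (int_bits + frac_bits))
    q = max(min_q, min(max_q, q))
    return q / scale                              # the value the hardware actually stores

# Example: 0.7431 with 1 integer bit and 6 fractional bits -> 0.75
print(to_fixed_point(0.7431, 1, 6))
```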
In conventional approaches, where the bit precision is fixed, the calculation is done with a specific accuracy even when the required accuracy is lower than the actual one. More specifically, individual DNNs might have different optimal bit precisions in different layers; therefore, using the worst-case numerical precision for all values in the network leads to wasted energy, lower performance, and higher power consumption.
This work implements a MAC (Multiply-and-ACcumulate) unit with variable bit precision. The multiplier is eliminated by processing each bit of the weights serially, and the accuracy is adjustable by ending the calculation once the desired precision is reached. This module enables run-time adjustment of the accuracy of a DNN. It can be used in both the inference data-path and the backpropagation data-path to test DNNs with different bit accuracies.
Subsequently, to prove that the proposed design can be applied effectively in DNN architectures, this work has successfully integrated it into a specific feed-forward neural network used for handwritten digit recognition. The system has been evaluated on the MNIST database and achieved an accuracy of 88%.
The three main contributions of the proposed computational method are:
• Firstly, it provides the capability to change the precision depending on different requirements.
• Secondly, it allows parallel computation, which speeds up the computation time.
• Thirdly, the proposed MAC unit integrates a "zero-skipping" mechanism, which makes the design adaptive and more effective when processing data containing many zero values.
The remaining parts of this paper are organized as follows. Section II investigates the variation in precision requirements that motivates this work. The algorithm and the hardware architecture of the proposed computation unit are presented in Section III. In addition, to demonstrate its applicability, the proposed module has been successfully integrated into a specific neural network architecture for recognizing handwritten digits; the overall system architecture is described in Section IV. After that, the simulation and implementation results are presented in Section V, including verification experiments with numerous test cases and the hardware implementation results on the Xilinx Virtex-7 FPGA VC707 development kit. Finally, Section VI concludes the paper and discusses potential future research directions.
II. RESEARCH ON NUMERICAL PRECISION IN DNNS
A. The variability of precision requirements across and within DNNs
Different DNNs can have different fixed-point bit-width requirements for representing their data. In fact, each DNN is trained to obtain its own weights to operate effectively. Using the same fixed bit precision for various DNN systems may reduce flexibility and waste energy, and it cannot exploit the precision variability among different architectures.
Many experiments conducted in [6] show that different networks require different numbers of weight bits. Figure 1 compares the accuracy obtained with fixed-point representations of different bit widths on four neural networks: AlexNet [7], SqueezeNet [8], GoogLeNet [9], and VGG-16 [10].
Figure 1: The accuracy of different DNNs with different bit
precision [11]
It is clear from Figure 1 that AlexNet and SqueezeNet achieve nearly full accuracy with 7-bit fixed-point numbers, while GoogLeNet and VGG-16 need 10 bits and 11 bits, respectively, to achieve reliable results. In other words, the minimum weight bit precision that keeps the accuracy loss of a given DNN architecture minimal varies among networks. This inconsistency in bit-width requirements across different DNNs poses a challenge for hardware implementation. Exploiting this observation by calculating data with variable bit widths will generally improve performance and make the system adaptable to situations where lower accuracy can be tolerated.
On the other hand, the works in [11] and [12] show that different layers can use different bit widths to represent the data with minimal accuracy loss. For example, AlexNet needs 10 bits for its first layer but only 4 bits for its eighth layer with 1% accuracy loss. The bit widths of different layers of AlexNet and GoogLeNet, with their corresponding accuracies, are illustrated in Figure 2. This means that hardware accelerators should provide different bit precisions for different layers in these networks.
Figure 2: Different layers in a DNN have different optimal data widths [12]
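To make the per-layer idea concrete, the short sketch below applies a different fixed-point grid to each layer's weights before evaluating the network; the layer names and bit widths here are illustrative assumptions, not values taken from Figure 2.

```python
import numpy as np

def quantize(w, frac_bits):
    """Round a weight tensor onto a fixed-point grid with 2^-frac_bits resolution."""
    scale = 2.0 ** frac_bits
    return np.round(w * scale) / scale

# Hypothetical per-layer fractional bit widths (illustrative only).
frac_bits_per_layer = {"layer1": 9, "layer2": 6, "layer8": 3}

def quantize_network(weights, frac_bits_per_layer):
    """Quantize each layer's weights with its own precision."""
    return {name: quantize(w, frac_bits_per_layer[name]) for name, w in weights.items()}

# Usage: weights = {"layer1": np.random.randn(64, 784), ...}
```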
B. The desired accuracy depends on the specific requirements
In some specific applications, the required accuracy can be relaxed in order to trade it off against other factors such as speed or energy consumption. In such cases, it is valuable for the accuracy of the DNN to be adjustable and controllable to an optimal point.
The desired accuracy of each layer in a DNN can be simulated and decided using software or accelerated hardware. This accuracy might be adjusted not only after the training process but also during training with backpropagation. With this information, a hardware-accelerated DNN can be more efficient, with a lower hardware area, higher throughput, and lower power consumption.
III. VARIABLE BIT PRECISION APPROACH
A. Basic implementation
The procedure for calculating a sum of products with the variable bit precision method is illustrated in Figure 3.
Figure 3: Variable bit precision implementation [12]
In this example, A, B, and C are activation values that should be multiplied by the weights Wa, Wb, and Wc, respectively. These three weights are represented with eight bits in binary radix, and each bit of each weight is processed in one clock cycle, from the LSB to the MSB. The value '1' or '0' of these bits decides whether the corresponding activation value or zero is added to the partial sum. An individual partial sum is generated in each clock cycle. With three activation inputs, 2^3 different partial-sum values can be formed. Finally, to obtain the final result, all partial sums are simply accumulated, noting that the partial sum corresponding to the i-th bit (counted from right to left) has to be shifted left by i positions. In the case of negative numbers, we first have to check the sign bit of each weight value. The sign of the corresponding activation value is changed if this sign bit is 1, and left unchanged otherwise. The details of the hardware implementation of this algorithm are presented in the next section.
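A minimal behavioural sketch of this bit-serial procedure is given below (our own Python model, not the paper's RTL; weights are assumed non-negative here, and negative weights are handled by the pre-processing step of Section III.B):

```python
def bit_serial_sum_of_products(activations, weights, precision_bits):
    """Compute sum(a_i * w_i) by processing one weight bit per 'clock cycle',
    from LSB to MSB, as in the Figure 3 example (weights assumed non-negative)."""
    total = 0
    for i in range(precision_bits):                  # one iteration per clock cycle
        partial = sum(a for a, w in zip(activations, weights)
                      if (w >> i) & 1)               # add a_i only where bit i of w_i is 1
        total += partial << i                        # shift the partial sum left by i
    return total

# Three activations and 8-bit weights, as in Figure 3.
A, B, C = 3, 5, 7
Wa, Wb, Wc = 0b00010110, 0b01000001, 0b00001100
assert bit_serial_sum_of_products([A, B, C], [Wa, Wb, Wc], 8) == A*Wa + B*Wb + C*Wc
```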
B. Processing in the case of negative numbers
To achieve correct results when the data contain negative values, the aforementioned algorithm requires an additional pre-processing stage. While the procedure above can still work properly when activation values are negative, any negative weight would produce an incorrect result. This is because negative numbers in hardware are represented by the two's complement of their absolute value, so their bit pattern no longer corresponds to the magnitude. Because the behavior of the algorithm in each clock cycle depends on whether a '0' or '1' bit appears in the weight value, processing this representation directly causes a wrong result. The problem can be solved by initially checking the sign bit of each weight value. If the weight is negative, it is changed to a positive number; at the same time, the sign of the corresponding activation value is also changed. Changing the sign twice guarantees that the final result remains correct. As illustrated in Figure 4, a binary representation of the weight's absolute value is thereby obtained, so that the algorithm can work properly.
Figure 4: Proposed solution to process negative numbers
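A hedged sketch of the sign pre-processing illustrated in Figure 4 (our own code, not the paper's implementation):

```python
def preprocess_signs(activations, weights):
    """If a weight is negative, negate both the weight and its activation so the
    bit-serial loop only ever sees non-negative weight bit patterns."""
    acts, ws = [], []
    for a, w in zip(activations, weights):
        if w < 0:
            a, w = -a, -w          # two sign changes leave the product a*w unchanged
        acts.append(a)
        ws.append(w)
    return acts, ws

acts, ws = preprocess_signs([3, 5, 7], [22, -65, 12])
# Every product is preserved, e.g. 5 * (-65) == (-5) * 65.
assert [a * w for a, w in zip(acts, ws)] == [3 * 22, 5 * -65, 7 * 12]
```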
Keeping the weights positive is beneficial because it makes a "zero-skipping" mechanism possible, which effectively accelerates the computation and makes the algorithm adaptive to a wide range of data. The detailed implementation is introduced in the next section.
C. "Zero skipping" implementation
Keeping the weights positive results, in most situations, in a number of '0' bits on the right and left sides of their binary representations. In a clock cycle in which the MAC unit processes such bits, the partial sum equals zero and nothing is accumulated into the final result. Therefore, the computation in this clock cycle is unnecessary and only slows down the overall operation. To exploit this property, the proposed MAC unit first detects the number of '0' bits on both the right and left sides of the weights' binary representation. After that, the counter that determines the number of iterations is adjusted based on the result acquired in the previous step. As a result, instead of processing a number of clock cycles equal to the weight bit width, the proposed architecture skips the zero bits on the left and right of the binary representation. This is especially advantageous when the weight data have many '0' bits in their binary representation; in such cases, the computation time is reduced considerably and the MAC operation is effectively accelerated.
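One possible software model of this zero-skipping mechanism is sketched below (our interpretation: bit positions where every weight has a '0' on the right or left end are simply not iterated over):

```python
def active_bit_range(weights):
    """Return (lo, hi) such that all set bits of all (non-negative) weights lie in
    bit positions lo .. hi-1; everything outside this range can be skipped."""
    mask = 0
    for w in weights:
        mask |= w                                    # OR of all weight bit patterns
    if mask == 0:
        return 0, 0                                  # all-zero weights: nothing to do
    lo = (mask & -mask).bit_length() - 1             # trailing zeros common to all weights
    hi = mask.bit_length()                           # one past the highest set bit
    return lo, hi

def bit_serial_with_zero_skipping(activations, weights):
    total = 0
    lo, hi = active_bit_range(weights)
    for i in range(lo, hi):                          # fewer cycles than the full bit width
        partial = sum(a for a, w in zip(activations, weights) if (w >> i) & 1)
        total += partial << i
    return total

assert bit_serial_with_zero_skipping([3, 5, 7], [0b0001100, 0b0000100, 0b0011000]) \
       == 3 * 0b0001100 + 5 * 0b0000100 + 7 * 0b0011000
```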
D. Proposed MAC unit hardware architecture
The high computational complexity of most DNN architectures, which is attributed to the large number of multiply-and-accumulate operations, can be handled by the proposed hardware architecture. The goal of the module in Figure 5 is to calculate the value:

Sum = a × w1 + b × w2 + c × w3

In the above equation, w1, w2, and w3 have binary representations of variable length. First, each activation value goes through a multiplexer, which decides whether the value itself or its opposite is accumulated. Then, eight candidate partial sums are formed and, in each clock cycle, one of them is selected based on the current bits of the three weight values w1, w2, and w3. In addition, at the end of each clock cycle, the three weight representations are shifted to the right so that the next bit can be used in the following cycle. Accordingly, each selected partial sum is shifted left before being accumulated to obtain the final result.
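A behavioural model of this selection scheme is sketched below (our own code, based on the description of Figure 5; sign handling is omitted for brevity). With three activations, the 2^3 = 8 candidate partial sums are formed once, and the current bit triple of (w1, w2, w3) picks one of them each cycle:

```python
def mac3_variable_precision(a, b, c, w1, w2, w3, precision_bits):
    """Sum = a*w1 + b*w2 + c*w3, one weight bit per cycle, selecting among the
    2**3 = 8 candidate partial sums {0, a, b, a+b, c, a+c, b+c, a+b+c}."""
    candidates = [(a if k & 1 else 0) + (b if k & 2 else 0) + (c if k & 4 else 0)
                  for k in range(8)]
    total = 0
    for i in range(precision_bits):
        k = ((w1 >> i) & 1) | (((w2 >> i) & 1) << 1) | (((w3 >> i) & 1) << 2)
        total += candidates[k] << i                  # shift left before accumulation
    return total

assert mac3_variable_precision(3, 5, 7, 22, 65, 12, 8) == 3 * 22 + 5 * 65 + 7 * 12
```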
Figure 5: Proposed hardware architecture implementing the MAC operation according to the variable precision approach
IV. DEMONSTRATION APPLICATION: FEED-FORWARD NEURAL NETWORK
To demonstrate the applicability of the proposed approach, the hardware architecture deployed in this work implements the feed-forward stage of a fully-connected artificial neural network. The ANN uses offline learning, and this process is accomplished in Matlab 2015 using the open-source Caffe framework. The output of the training process is a set of weight and bias values. These data are saved in binary files and become the parameters for the neural network hardware implementation.
A. Overview of the system architecture
The architecture overview is illustrated in Figure 6. Four RAMs are used to store all data: the image pixels of each test, the weights, and the bias parameters. The result RAM stores the intermediate outputs of the neuron module during the calculation process. The proposed neuron is the main computational unit; almost all feed-forward neural network operations are calculated sequentially in this module. Prior to the multiplication and addition, a multiplexer selects the inputs from either the image RAM or the result RAM. This selection depends on which layer is being calculated: the image pixels are used only at the first hidden layer, and in the following layers the proposed neuron uses the data from the result RAM as its input.
Figure 6: General architecture of the data-path of the design
Having finished the calculation for all neurons in the network, the system uses a Max component to read the data at the output layer. This module chooses the output with the highest value and considers it as the predicted class.
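A simplified software model of this data path is given below (our own abstraction of Figure 6; the layer sizes, random parameters, and ReLU placement are assumptions for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feed_forward(image_pixels, weights, biases):
    """Layer 0 reads from the image RAM; later layers read the previous results
    from the result RAM. The Max component then picks the predicted class."""
    result_ram = None
    for layer, (W, b) in enumerate(zip(weights, biases)):
        layer_input = image_pixels if layer == 0 else result_ram   # input multiplexer
        z = W @ layer_input + b                                    # neuron MAC operations
        result_ram = relu(z) if layer < len(weights) - 1 else z    # ReLU on hidden layers
    return int(np.argmax(result_ram))                              # Max component

# Example shapes: 784 image pixels, one 64-neuron hidden layer, 10 output classes.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 784)), rng.standard_normal((10, 64))]
biases = [rng.standard_normal(64), rng.standard_normal(10)]
print(feed_forward(rng.standard_normal(784), weights, biases))
```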
B. The proposed variable-precision-bit neuron block
To integrate it conveniently into the neural network system, a variable precision MAC unit that can calculate the sum of four products simultaneously has been deployed. The computational neuron unit is then constructed as depicted in Figure 7.
Figure 7: Architecture of neuron applied to the neural network
The proposed neuron module used in this system is composed of four such MAC units. As a result, the module can calculate the sum of sixteen products simultaneously. All calculations are performed sequentially in this neuron.
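A behavioural sketch of how four such 4-product MAC units could be composed into one neuron processing sixteen products per pass (our interpretation of Figure 7; sign handling is again omitted):

```python
def mac4(acts, weights, precision_bits):
    """Bit-serial sum of four products, one weight bit per cycle."""
    total = 0
    for i in range(precision_bits):
        partial = sum(a for a, w in zip(acts, weights) if (w >> i) & 1)
        total += partial << i
    return total

def neuron(acts, weights, bias, precision_bits):
    """Process the dot product in chunks of 16 = 4 MAC units x 4 products each,
    then add the bias and apply the ReLU activation."""
    acc = bias
    for start in range(0, len(acts), 16):
        chunk_a, chunk_w = acts[start:start + 16], weights[start:start + 16]
        acc += sum(mac4(chunk_a[j:j + 4], chunk_w[j:j + 4], precision_bits)
                   for j in range(0, len(chunk_a), 4))
    return max(acc, 0)

acts = list(range(1, 33))                       # 32 inputs -> two passes of 16 products
weights = [(i * 7) % 16 for i in range(32)]     # small non-negative 4-bit weights
assert neuron(acts, weights, 0, 4) == sum(a * w for a, w in zip(acts, weights))
```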
V. SIMULATION AND IMPLEMENTATION RESULTS
A. Results of the ANN hardware implementation
In this system design, the output max_index is the result with the highest probability predicted by the system. In this work, the hardware architecture is composed of three layers with 64 neurons in the hidden layers. The test images are taken from the MNIST database, and the accuracy of the system using 6-bit data representation is 89%.
In order to examine how the variation in the number of precision bits affects the accuracy, different experiments have been conducted and reported in Figure 8. It can be seen that the accuracy improves significantly as the number of precision bits increases up to five.
Figure 8: Effect of the number of precision bits on accuracy
However, increasing beyond five bits only slightly raises the number of correct predictions, so five precision bits can be considered an optimal choice in this case.
B. Results of the proposed variable-precision-bit neuron block
All the computations, from the multiply-and-add operations to deriving the results of the activation functions, are implemented in this neuron.
Table 1: Error of the proposed design with different numbers of precision bits

Precision bits used   MSE              Maximum error
4 bits                2.8805 × 10^-3   2.7031 × 10^-1
6 bits                1.47 × 10^-4     5.782 × 10^-2
To evaluate the reliability of the proposed neuron, the Mean Square Error (MSE) is calculated over 1000 random numbers with different numbers of precision bits. This error is the difference between the output of the proposed design and the actual real number represented in floating point. In addition, the maximum error observed during the verification process is also reported in Table 1.
MSE = (1/N) Σ (yᵢ − ŷᵢ)², where N = 1000, yᵢ is the floating-point reference value, and ŷᵢ is the corresponding output of the proposed design.
As shown in Table 1, both the MSE and the maximum error decrease gradually when more bits are used to represent the data. While the accuracy improves, the proposed module also requires more clock cycles to complete the operations, which is the trade-off of the proposed approach.
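A sketch of the kind of check described above (our own code: a reduced-precision evaluation of a sum of products is compared against a double-precision reference over random inputs; the input ranges and the quantization scheme are assumptions):

```python
import random

def quantize(x, frac_bits):
    scale = 1 << frac_bits
    return round(x * scale) / scale

random.seed(1)
N, frac_bits = 1000, 6
squared_errors = []
for _ in range(N):
    a = [random.uniform(-1, 1) for _ in range(3)]
    w = [random.uniform(-1, 1) for _ in range(3)]
    reference = sum(ai * wi for ai, wi in zip(a, w))                   # floating point
    approx = sum(quantize(ai, frac_bits) * quantize(wi, frac_bits)     # reduced precision
                 for ai, wi in zip(a, w))
    squared_errors.append((reference - approx) ** 2)

mse = sum(squared_errors) / N
max_err = max(e ** 0.5 for e in squared_errors)
print(f"MSE = {mse:.3e}, max error = {max_err:.3e}")
```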
C. Results of the FPGA implementation
1) Implementation results of the MAC block
This work has implemented the two designs on a Xilinx Virtex-7 FPGA VC707 Evaluation Kit. The results are summarized in Table 2. The DSP48 blocks are disabled during synthesis to obtain a fair comparison, because the multiplication in the reference module could otherwise be mapped onto 3 DSP slices.
Figure 9: Conventional implementation for MAC operation
The synthesis results are reported in Table 2. It can be noticed that the fixed 16-bit precision module has a lower maximum operating frequency of 102 MHz and a fixed throughput of 306M multiply-and-add operations per second. In contrast, the variable bit precision module has a maximum frequency of 310 MHz and a variable throughput, from 20M multiply-and-add operations per second with 15-bit precision up to 310M multiply-and-add operations per second with 1-bit precision. The conventional module provides the sum of three multiplications after each clock cycle, while the proposed design uses a variable number of clock cycles, from 1 up to 16, to generate the result. The advantage of the proposed module is the ability to change the precision at a small hardware cost.
Table 2: Synthesis results of the proposed design and the conventional one

Criterion     Conventional design             Proposed design
Bit width     Fixed 16 bits                   1-16 bits
Throughput    306M multiply-and-add ops/s     20M-310M multiply-and-add ops/s (depending on precision)
2) Implementation results of neuron block
We have also implemented the proposed variable-precision-bit neuron block on this FPGA platform. The implementation results show that, when the proposed approach is applied, the design can reach a higher operating frequency than both the conventional binary design and the approach using the SC technique introduced in [13] (see Table 3). However, the proposed design needs more area to store the reference values.
Table 3: Implementation results for neuron blocks of different approaches (the binary neuron block [13], the SC neuron block [13], and the proposed neuron block)
3) Implementation results of the ANN architecture
The proposed neuron block using the variable precision technique has been integrated into a specific ANN architecture to demonstrate its applicability. The implementation results are reported in Table 4. The architecture uses 64-neuron hidden layers and the ReLU activation function. The area cost is 4117 LUTs and 1515 registers, which is relatively small compared to the available FPGA resources.
Table 4: Hardware resource utilization report of the FPGA implementation

Resource     Utilization
LUT          4117
Register     1515
VI. CONCLUSION
Intelligent systems have attracted growing interest and become a center of attention in recent years. While conventional approaches using task-specific handcrafted features have shown their limitations, DNN architectures, with their ability to extract high-level features from the available data, have gained great advantages. However, the high computational complexity required by most DNN architectures remains a serious challenge. Therefore, a variety of state-of-the-art works on neural network accelerators have been introduced, some of which try to take advantage of the differences in the precision required in DNNs.
In this work, a variable-bit-precision module has been proposed to exploit the variability of precision requirements among different layers and network architectures. An optimal bit length per network and per layer translates directly into performance and energy benefits. This paper has presented the basic concept as well as the hardware architecture of the proposed module, and has provided a comparison between the proposed design and the conventional design. The implementation results show that the proposed MAC unit can work at a much higher frequency. The proposed design can be used in both the forward data-path and the backpropagation data-path to investigate different accuracies on FPGA, to accelerate the training phase, and to find the optimal data width for an ASIC implementation.
In addition, this work has also integrated the proposed design into an MNIST network architecture to demonstrate its applicability. The results show that the variable-bit MAC unit works effectively and achieves an accuracy of up to 88%.
ACKNOWLEDGMENT
This work is supported by Vietnam National University, Hanoi under grant number TXTCN.19.07.
REFERENCES
[1] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A Simple Deep Learning Baseline for Image Classification?," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, Dec. 2015.
[2] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8599–8603.
[3] Y. Park and M. Kellis, "Deep learning for regulatory genomics," Nature Biotechnology, vol. 33, no. 8, pp. 825–826, Aug. 2015.
[4] P. Judd et al., "Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets," arXiv:1511.05236 [cs], Nov. 2015.
[5] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized Convolutional Neural Networks for Mobile Devices," arXiv:1512.06473 [cs], Dec. 2015.
[6] L. Lai, N. Suda, and V. Chandra, "Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations," arXiv:1703.03073 [cs], Mar. 2017.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," p. 13, 2017.
[9] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 1–9.
[10] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556 [cs], Sep. 2014.
[11] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-Serial Deep Neural Network Computing," p. 12.
[12] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2018, pp. 218–220.
[13] D.-A. Nguyen, H.-H. Ho, D.-H. Bui, and X.-T. Tran, "An Efficient Hardware Implementation of Artificial Neural Network based on Stochastic Computing," in 2018 5th NAFOSTED Conference on Information and Computer Science (NICS), Ho Chi Minh City, 2018, pp. 237–242.