A Variable Precision Approach for Deep Neural Networks
Xuan-Tuyen Tran, Duy-Anh Nguyen, Duy-Hieu Bui, Xuan-Tu Tran*
SISLAB, VNU University of Engineering and Technology – 144 Xuan Thuy Road, Cau Giay, Hanoi, Vietnam
* Corresponding author's email: tutx@vnu.edu.vn
Abstract— Deep Neural Network (DNN) architectures have recently been considered a major breakthrough for a variety of applications. Because of the high computing capabilities required, DNNs have been unsuitable for many embedded applications. Many works have tried to optimize the key operations, which are multiply-and-add, in hardware for smaller area, higher throughput, and lower power consumption. One way to optimize these factors is to use reduced bit accuracy; for example, Google's TPU uses only 8-bit integer operations for DNN inference. Based on the characteristics of different layers in a DNN, the bit accuracy can be further adjusted to save hardware area and power consumption while maintaining throughput. In this work, we investigate a hardware implementation of multiply-and-add with variable bit precision, which can be adjusted at computation time. The proposed design can calculate the sum of several products with a bit precision ranging from 1 to 16 bits. The hardware implementation results on the Xilinx Virtex-7 FPGA VC707 development kit show that our design occupies smaller hardware and can run at a higher frequency of 310 MHz, while the same functionality implemented with and without DSP48 blocks can only run at a frequency of 102 MHz. In addition, to demonstrate that the proposed design is effectively applicable to deep neural network architectures, the new design has also been integrated into an MNIST network. The simulation and verification results show that the proposed system can achieve an accuracy of up to 88%.
Keywords: Deep learning, neural network,
variable-weight-bit, throughput, power consumption
I. INTRODUCTION
Deep learning using neural network architectures has exploded with fascinating and promising results. With many achievements in image classification [1], speech recognition [2], and genomics [3], deep learning has considerably surpassed traditional algorithms based on handcrafted features. Its outstanding ability in feature extraction makes the deep neural network a promising candidate for many artificial intelligence applications. Nevertheless, one of the most serious concerns is the high computational complexity, which poses a real difficulty for the application of this approach. While the essential requirements of high speed and energy efficiency for real-time applications have not been met on general-purpose platforms such as GPUs or CPUs, specifically designed hardware makes the FPGA a promising platform for implementing neural network algorithms. To realize such algorithms effectively on this computational platform, various accelerator techniques have been introduced.
Previous works have tried to reduce the computational complexity and data storage by using quantization [4] or approximation [5]. These works show that DNNs can be accelerated by using different bit precisions for the operations in different layers. For example, Google's TPU shows that, for the inference mode of DNNs, 8-bit precision in fixed-point format using integer arithmetic is good enough for a large number of applications. However, reducing the bit precision carelessly will affect the network accuracy through accumulated errors. Therefore, this reduction should be considered as an integrated part not only of the inference mode but also of the training phase and the backpropagation steps. Bit precision in fixed-point format comprises a number of bits representing the integer part and the fractional part of the data in a DNN and of their corresponding arithmetic operations. In addition, reducing the bit width (i.e., the total number of bits used to represent the integer part and the fractional part) will decrease the memory bandwidth and the data storage space when compared with the standard computation using 32-bit or 16-bit data widths.
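As a minimal illustration of this fixed-point format (our own sketch, not code from the paper; the function and parameter names are ours), the snippet below quantizes a real value onto a signed fixed-point grid with a chosen split between integer and fractional bits. Reducing the number of fractional bits shrinks both the storage per value and the width of the arithmetic operators, at the cost of a coarser grid.

```python
def to_fixed_point(x, int_bits, frac_bits):
    """Quantize a real value to a signed fixed-point number with int_bits
    integer bits and frac_bits fractional bits (plus an implicit sign bit)."""
    scale = 1 << frac_bits
    q = round(x * scale)                          # round to the nearest 2^-frac_bits step
    max_q = (1 << (int_bits + frac_bits)) - 1     # saturate to the representable range
    min_q = -(1 << (int_bits + frac_bits))
    q = max(min_q, min(max_q, q))
    return q / scale                              # the value the hardware actually stores

# Example: 0.7431 with 1 integer bit and 6 fractional bits -> 0.75
print(to_fixed_point(0.7431, 1, 6))
```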
In conventional approaches, where the bit precision is fixed, the calculation is done with a specific accuracy even when the required accuracy is lower than the actual one. More specifically, individual DNNs might have different optimal bit precisions in different layers; therefore, using the worst-case numerical precision for all values in the network leads to wasted energy, lower performance, and higher power consumption.
This work implements a MAC (Multiply-and-ACcumulate) unit with variable bit precision. The multiplier is eliminated by processing each bit of the weights serially, and the accuracy is adjustable by ending the calculation once the desired precision is reached. This module enables run-time adjustment of the accuracy of a DNN. It can be used in both the inference data-path and the backpropagation data-path to test DNNs with different bit accuracies.
Subsequently, to prove that the proposed design can be applied effectively in DNN architectures, this work has successfully integrated it into a specific feed-forward neural network used for handwritten digit recognition. The system has been evaluated on the MNIST database and achieved an accuracy of 88%.
The three main contributions of the proposed computational method are:
• Firstly, it provides the capability to change the precision depending on different requirements.
• Secondly, it allows parallel computation, which speeds up the computation time.
• Thirdly, the proposed MAC unit integrates a "zero-skipping" mechanism, which makes the design adaptive and more effective when processing data containing many zero values.
The remaining parts of this paper are organized as follows. Section II investigates the variation in precision requirements that motivates this work. The algorithm and the hardware architecture of the proposed computation unit are presented in Section III. In addition, to demonstrate its applicability, the proposed module has been successfully integrated into a specific neural network architecture for recognizing handwritten digits; the overall system architecture is described in Section IV. After that, the simulation and implementation results are presented in Section V, including verification experiments with numerous test cases and the hardware implementation results on the Xilinx Virtex-7 FPGA VC707 development kit. Finally, Section VI concludes the paper and discusses potential future research directions.
II. RESEARCH ON NUMERICAL PRECISION IN DNNS
A. The variability of precision requirements across and within DNNs
Different DNNs can have different fixed-point bit-width requirements for representing their data. In fact, each DNN is trained to obtain its own weights to operate effectively. Using the same fixed bit precision for various DNN systems may reduce flexibility and waste energy, and it cannot exploit the precision variability among different architectures.
Many experiments conducted in [6] show that different networks require different numbers of weight bits. Figure 1 compares the accuracy obtained with fixed-point representations of different bit widths on four neural networks: AlexNet [7], SqueezeNet [8], GoogLeNet [9], and VGG-16 [10].
Figure 1: The accuracy of different DNNs with different bit
precision [11]
It is clear from Figure 1 that AlexNet and SqueezeNet achieve nearly full accuracy with 7-bit fixed-point numbers, while GoogLeNet and VGG-16 need 10 bits and 11 bits, respectively, to achieve reliable results. In other words, the minimum weight bit precision that keeps the accuracy loss of a given DNN architecture minimal varies among networks. This inconsistency in bit-width requirements across different DNNs poses a challenge for hardware implementation. Exploiting this observation by calculating data with variable bit widths will generally improve performance and make the system adaptable to situations where lower accuracy can be tolerated.
On the other hand, the works in [11] and [12] show that different layers can use different bit widths to represent the data with minimal accuracy loss. For example, AlexNet needs 10 bits for its first layer but only 4 bits for its eighth layer with 1% accuracy loss. The bit widths of different layers of AlexNet and GoogLeNet, with their corresponding accuracies, are illustrated in Figure 2. This means that hardware accelerators should provide different bit precisions for different layers in these networks.
Figure 2: Different layers in a DNN have different optimal data widths [12]
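To make the per-layer idea concrete, the short sketch below applies a different fixed-point grid to each layer's weights before evaluating the network; the layer names and bit widths here are illustrative assumptions, not values taken from Figure 2.

```python
import numpy as np

def quantize(w, frac_bits):
    """Round a weight tensor onto a fixed-point grid with 2^-frac_bits resolution."""
    scale = 2.0 ** frac_bits
    return np.round(w * scale) / scale

# Hypothetical per-layer fractional bit widths (illustrative only).
frac_bits_per_layer = {"layer1": 9, "layer2": 6, "layer8": 3}

def quantize_network(weights, frac_bits_per_layer):
    """Quantize each layer's weights with its own precision."""
    return {name: quantize(w, frac_bits_per_layer[name]) for name, w in weights.items()}

# Usage: weights = {"layer1": np.random.randn(64, 784), ...}
```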
B. The desired accuracy depends on the specific requirements
In some specific applications, the required accuracy can be relaxed in order to trade it off against other factors such as speed or energy consumption. In such cases, it is valuable for the accuracy of the DNN to be adjustable and controllable to an optimal point.
The desired accuracy of each layer in a DNN can be simulated and decided using software or accelerated hardware. This accuracy might be adjusted not only after the training process but also during training with backpropagation. With this information, a hardware-accelerated DNN can be more efficient, with a lower hardware area, higher throughput, and lower power consumption.
III. VARIABLE BIT PRECISION APPROACH
A. Basic implementation
The procedure for calculating a sum of products with the variable bit precision method is illustrated in Figure 3.
Figure 3: Variable bit precision implementation [12]
In this example, A, B, and C are activation values that should be multiplied by the weights Wa, Wb, and Wc, respectively. These three weights are represented with eight bits in binary radix, and each bit of each weight is processed in one clock cycle, from the LSB to the MSB. The value '1' or '0' of these bits decides whether the corresponding activation value or zero is added to the partial sum. An individual partial sum is generated in each clock cycle. With three activation inputs, 2^3 different partial-sum values can be formed. Finally, to obtain the final result, all partial sums are simply accumulated, noting that the partial sum corresponding to the i-th bit (counted from right to left) has to be shifted left by i positions. In the case of negative numbers, we first have to check the sign bit of each weight value. The sign of the corresponding activation value is changed if this sign bit is 1, and left unchanged otherwise. The details of the hardware implementation of this algorithm are presented in the next section.
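A minimal behavioural sketch of this bit-serial procedure is given below (our own Python model, not the paper's RTL; weights are assumed non-negative here, and negative weights are handled by the pre-processing step of Section III.B):

```python
def bit_serial_sum_of_products(activations, weights, precision_bits):
    """Compute sum(a_i * w_i) by processing one weight bit per 'clock cycle',
    from LSB to MSB, as in the Figure 3 example (weights assumed non-negative)."""
    total = 0
    for i in range(precision_bits):                  # one iteration per clock cycle
        partial = sum(a for a, w in zip(activations, weights)
                      if (w >> i) & 1)               # add a_i only where bit i of w_i is 1
        total += partial << i                        # shift the partial sum left by i
    return total

# Three activations and 8-bit weights, as in Figure 3.
A, B, C = 3, 5, 7
Wa, Wb, Wc = 0b00010110, 0b01000001, 0b00001100
assert bit_serial_sum_of_products([A, B, C], [Wa, Wb, Wc], 8) == A*Wa + B*Wb + C*Wc
```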
B. Processing in the case of negative numbers
To achieve correct results when the data contain negative values, the aforementioned algorithm requires an additional pre-processing stage. While the procedure above can still work properly when activation values are negative, any negative weight would produce an incorrect result. This is because negative numbers in hardware are represented by the two's complement of their absolute value, so their bit pattern no longer corresponds to the magnitude. Because the behavior of the algorithm in each clock cycle depends on whether a '0' or '1' bit appears in the weight value, processing this representation directly causes a wrong result. The problem can be solved by initially checking the sign bit of each weight value. If the weight is negative, it is changed to a positive number; at the same time, the sign of the corresponding activation value is also changed. Changing the sign twice guarantees that the final result remains correct. As illustrated in Figure 4, a binary representation of the weight's absolute value is thereby obtained, so that the algorithm can work properly.
Figure 4: Proposed solution to process negative numbers
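A hedged sketch of the sign pre-processing illustrated in Figure 4 (our own code, not the paper's implementation):

```python
def preprocess_signs(activations, weights):
    """If a weight is negative, negate both the weight and its activation so the
    bit-serial loop only ever sees non-negative weight bit patterns."""
    acts, ws = [], []
    for a, w in zip(activations, weights):
        if w < 0:
            a, w = -a, -w          # two sign changes leave the product a*w unchanged
        acts.append(a)
        ws.append(w)
    return acts, ws

acts, ws = preprocess_signs([3, 5, 7], [22, -65, 12])
# Every product is preserved, e.g. 5 * (-65) == (-5) * 65.
assert [a * w for a, w in zip(acts, ws)] == [3 * 22, 5 * -65, 7 * 12]
```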
Keeping the weights positive is beneficial because it makes a "zero-skipping" mechanism possible, which effectively accelerates the computation and makes the algorithm adaptive to a wide range of data. The detailed implementation is introduced in the next section.
C. "Zero skipping" implementation
Keeping the weights positive results, in most situations, in a number of '0' bits on the right and left sides of their binary representations. In a clock cycle in which the MAC unit processes such bits, the partial sum equals zero and nothing is accumulated into the final result. Therefore, the computation in this clock cycle is unnecessary and only slows down the overall operation. To exploit this property, the proposed MAC unit first detects the number of '0' bits on both the right and left sides of the weights' binary representation. After that, the counter that determines the number of iterations is adjusted based on the result acquired in the previous step. As a result, instead of processing a number of clock cycles equal to the weight bit width, the proposed architecture skips the zero bits on the left and right of the binary representation. This is especially advantageous when the weight data have many '0' bits in their binary representation; in such cases, the computation time is reduced considerably and the MAC operation is effectively accelerated.
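One possible software model of this zero-skipping mechanism is sketched below (our interpretation: bit positions where every weight has a '0' on the right or left end are simply not iterated over):

```python
def active_bit_range(weights):
    """Return (lo, hi) such that all set bits of all (non-negative) weights lie in
    bit positions lo .. hi-1; everything outside this range can be skipped."""
    mask = 0
    for w in weights:
        mask |= w                                    # OR of all weight bit patterns
    if mask == 0:
        return 0, 0                                  # all-zero weights: nothing to do
    lo = (mask & -mask).bit_length() - 1             # trailing zeros common to all weights
    hi = mask.bit_length()                           # one past the highest set bit
    return lo, hi

def bit_serial_with_zero_skipping(activations, weights):
    total = 0
    lo, hi = active_bit_range(weights)
    for i in range(lo, hi):                          # fewer cycles than the full bit width
        partial = sum(a for a, w in zip(activations, weights) if (w >> i) & 1)
        total += partial << i
    return total

assert bit_serial_with_zero_skipping([3, 5, 7], [0b0001100, 0b0000100, 0b0011000]) \
       == 3 * 0b0001100 + 5 * 0b0000100 + 7 * 0b0011000
```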
D. Proposed MAC unit hardware architecture
The high computational complexity of most DNN architectures, which is attributed to the large number of multiply-and-accumulate operations, can be handled by the proposed hardware architecture. The goal of the module in Figure 5 is to calculate the value:

Sum = a × w1 + b × w2 + c × w3

In the above equation, w1, w2, and w3 have binary representations of variable length. First, each activation value goes through a multiplexer, which decides whether the value itself or its opposite is accumulated. Then, eight candidate partial sums are formed and, in each clock cycle, one of them is selected based on the current bits of the three weight values w1, w2, and w3. In addition, at the end of each clock cycle, the three weight representations are shifted to the right so that the next bit can be used in the following cycle. Accordingly, each selected partial sum is shifted left before being accumulated to obtain the final result.
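A behavioural model of this selection scheme is sketched below (our own code, based on the description of Figure 5; sign handling is omitted for brevity). With three activations, the 2^3 = 8 candidate partial sums are formed once, and the current bit triple of (w1, w2, w3) picks one of them each cycle:

```python
def mac3_variable_precision(a, b, c, w1, w2, w3, precision_bits):
    """Sum = a*w1 + b*w2 + c*w3, one weight bit per cycle, selecting among the
    2**3 = 8 candidate partial sums {0, a, b, a+b, c, a+c, b+c, a+b+c}."""
    candidates = [(a if k & 1 else 0) + (b if k & 2 else 0) + (c if k & 4 else 0)
                  for k in range(8)]
    total = 0
    for i in range(precision_bits):
        k = ((w1 >> i) & 1) | (((w2 >> i) & 1) << 1) | (((w3 >> i) & 1) << 2)
        total += candidates[k] << i                  # shift left before accumulation
    return total

assert mac3_variable_precision(3, 5, 7, 22, 65, 12, 8) == 3 * 22 + 5 * 65 + 7 * 12
```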
Figure 5: Proposed hardware architecture implementing the MAC operation according to the variable precision approach
IV. DEMONSTRATION APPLICATION: FEED-FORWARD NEURAL NETWORK
To demonstrate the applicability of the proposed approach, the hardware architecture deployed in this work implements the feed-forward stage of a fully-connected artificial neural network. The ANN uses offline learning, and this process is accomplished in Matlab 2015 using the open-source Caffe framework. The output of the training process is a set of weight and bias values. These data are saved in binary files and become the parameters for the neural network hardware implementation.
A. Overview of the system architecture
The architecture overview is illustrated in Figure 6. Four RAMs are used to store all data: the image pixels of each test, the weights, and the bias parameters. The result RAM stores the intermediate outputs of the neuron module during the calculation process. The proposed neuron is the main computational unit; almost all feed-forward neural network operations are calculated sequentially in this module. Prior to the multiplication and addition, a multiplexer selects the inputs from either the image RAM or the result RAM. This selection depends on which layer is being calculated: the image pixels are used only at the first hidden layer, and in the following layers the proposed neuron uses the data from the result RAM as its input.
Figure 6: General architecture of the data-path of the design
Having finished the calculation for all neurons in the network, the system uses a Max component to read the data at the output layer. This module chooses the output with the highest value and considers it as the predicted class.
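A simplified software model of this data path is given below (our own abstraction of Figure 6; the layer sizes, random parameters, and ReLU placement are assumptions for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feed_forward(image_pixels, weights, biases):
    """Layer 0 reads from the image RAM; later layers read the previous results
    from the result RAM. The Max component then picks the predicted class."""
    result_ram = None
    for layer, (W, b) in enumerate(zip(weights, biases)):
        layer_input = image_pixels if layer == 0 else result_ram   # input multiplexer
        z = W @ layer_input + b                                    # neuron MAC operations
        result_ram = relu(z) if layer < len(weights) - 1 else z    # ReLU on hidden layers
    return int(np.argmax(result_ram))                              # Max component

# Example shapes: 784 image pixels, one 64-neuron hidden layer, 10 output classes.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 784)), rng.standard_normal((10, 64))]
biases = [rng.standard_normal(64), rng.standard_normal(10)]
print(feed_forward(rng.standard_normal(784), weights, biases))
```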
B. The proposed variable-precision-bit neuron block
To integrate it conveniently into the neural network system, a variable precision MAC unit that can calculate the sum of four products simultaneously has been deployed. The computational neuron unit is then constructed as depicted in Figure 7.
Figure 7: Architecture of neuron applied to the neural network
The proposed neuron module used in this system is composed of four such MAC units. As a result, the module can calculate the sum of sixteen products simultaneously. All calculations are performed sequentially in this neuron.
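A behavioural sketch of how four such 4-product MAC units could be composed into one neuron processing sixteen products per pass (our interpretation of Figure 7; sign handling is again omitted):

```python
def mac4(acts, weights, precision_bits):
    """Bit-serial sum of four products, one weight bit per cycle."""
    total = 0
    for i in range(precision_bits):
        partial = sum(a for a, w in zip(acts, weights) if (w >> i) & 1)
        total += partial << i
    return total

def neuron(acts, weights, bias, precision_bits):
    """Process the dot product in chunks of 16 = 4 MAC units x 4 products each,
    then add the bias and apply the ReLU activation."""
    acc = bias
    for start in range(0, len(acts), 16):
        chunk_a, chunk_w = acts[start:start + 16], weights[start:start + 16]
        acc += sum(mac4(chunk_a[j:j + 4], chunk_w[j:j + 4], precision_bits)
                   for j in range(0, len(chunk_a), 4))
    return max(acc, 0)

acts = list(range(1, 33))                       # 32 inputs -> two passes of 16 products
weights = [(i * 7) % 16 for i in range(32)]     # small non-negative 4-bit weights
assert neuron(acts, weights, 0, 4) == sum(a * w for a, w in zip(acts, weights))
```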
V. SIMULATION AND IMPLEMENTATION RESULTS
A. Results of the ANN hardware implementation
In this system design, the output max_index is the result with the highest probability predicted by the system. In this work, the hardware architecture is composed of three layers with 64 neurons in the hidden layers. The test images are taken from the MNIST database, and the accuracy of the system using 6-bit data representation is 89%.
In order to examine how the variation in the number of precision bits affects the accuracy, different experiments have been conducted and reported in Figure 8. It can be seen that the accuracy improves significantly as the number of precision bits increases up to five.
Figure 8: Effect of the number of precision bits on accuracy
However, increasing beyond five bits only slightly raises the number of correct predictions, so five precision bits can be considered an optimal choice in this case.
B. Results of the proposed variable-precision-bit neuron block
All the computations, from the multiply-and-add operations to deriving the results of the activation functions, are implemented in this neuron.
Table 1: Error of the proposed design with different numbers of precision bits

Precision bits used   MSE              Maximum error
4 bits                2.8805 × 10^-3   2.7031 × 10^-1
6 bits                1.47 × 10^-4     5.782 × 10^-2
To evaluate the reliability of the proposed neuron, the Mean Square Error (MSE) is calculated over 1000 random numbers with different numbers of precision bits. This error is the difference between the output of the proposed design and the actual real number represented in floating point. In addition, the maximum error observed during the verification process is also reported in Table 1.
MSE = (1/N) Σ (yᵢ − ŷᵢ)², where N = 1000, yᵢ is the floating-point reference value, and ŷᵢ is the corresponding output of the proposed design.
As shown in Table 1, both the MSE and the maximum error decrease gradually when more bits are used to represent the data. While the accuracy improves, the proposed module also requires more clock cycles to complete the operations, which is the trade-off of the proposed approach.
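A sketch of the kind of check described above (our own code: a reduced-precision evaluation of a sum of products is compared against a double-precision reference over random inputs; the input ranges and the quantization scheme are assumptions):

```python
import random

def quantize(x, frac_bits):
    scale = 1 << frac_bits
    return round(x * scale) / scale

random.seed(1)
N, frac_bits = 1000, 6
squared_errors = []
for _ in range(N):
    a = [random.uniform(-1, 1) for _ in range(3)]
    w = [random.uniform(-1, 1) for _ in range(3)]
    reference = sum(ai * wi for ai, wi in zip(a, w))                   # floating point
    approx = sum(quantize(ai, frac_bits) * quantize(wi, frac_bits)     # reduced precision
                 for ai, wi in zip(a, w))
    squared_errors.append((reference - approx) ** 2)

mse = sum(squared_errors) / N
max_err = max(e ** 0.5 for e in squared_errors)
print(f"MSE = {mse:.3e}, max error = {max_err:.3e}")
```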
C. Results of the FPGA implementation
1) Implementation results of the MAC block
This work has implemented the two designs on a Xilinx Virtex-7 FPGA VC707 Evaluation Kit. The results are summarized in Table 2. The DSP48 blocks are disabled during synthesis to obtain a fair comparison, because the multiplication in the reference module could otherwise be mapped onto 3 DSP slices.
Figure 9: Conventional implementation for MAC operation
The synthesis results are reported in Table 2. It can be noticed that the fixed 16-bit precision module has a lower maximum operating frequency of 102 MHz and a fixed throughput of 306M multiply-and-add operations per second. In contrast, the variable bit precision module has a maximum frequency of 310 MHz and a variable throughput, from 20M multiply-and-add operations per second with 15-bit precision up to 310M multiply-and-add operations per second with 1-bit precision. The conventional module provides the sum of three multiplications after each clock cycle, while the proposed design uses a variable number of clock cycles, from 1 up to 16, to generate the result. The advantage of the proposed module is the ability to change the precision at a small hardware cost.
Table 2: Synthesis results of the proposed design and the conventional one

Criterion     Conventional design             Proposed design
Bit width     Fixed 16 bits                   1-16 bits
Throughput    306M multiply-and-add ops/s     20M-310M multiply-and-add ops/s (depending on precision)
2) Implementation results of neuron block
We have also implemented the proposed variable-precision-bit neuron block on this FPGA platform. The implementation results show that, when the proposed approach is applied, the design can reach a higher operating frequency than both the conventional binary design and the approach using the SC technique introduced in [13] (see Table 3). However, the proposed design needs more area to store the reference values.
Table 3: Implementation results for neuron blocks of different approaches (the binary neuron block [13], the SC neuron block [13], and the proposed neuron block)
3) Implementation results of the ANN architecture
The proposed neuron block using the variable precision technique has been integrated into a specific ANN architecture to demonstrate its applicability. The implementation results are reported in Table 4. The architecture uses 64-neuron hidden layers and the ReLU activation function. The area cost is 4117 LUTs and 1515 registers, which is relatively small compared to the available FPGA resources.
Table 4: Hardware resource utilization report of the FPGA implementation

Resource     Utilization
LUT          4117
Register     1515
VI. CONCLUSION
Intelligent systems have attracted growing interest and become a center of attention in recent years. While conventional approaches using task-specific handcrafted features have shown their limitations, DNN architectures, with their ability to extract high-level features from the available data, have gained great advantages. However, the high computational complexity required by most DNN architectures remains a serious challenge. Therefore, a variety of state-of-the-art works on neural network accelerators have been introduced, some of which try to take advantage of the differences in the precision required in DNNs.
In this work, a variable-bit-precision module has been proposed to exploit the variability of precision requirements among different layers and network architectures. An optimal bit length per network and per layer translates directly into performance and energy benefits. This paper has presented the basic concept as well as the hardware architecture of the proposed module, and has provided a comparison between the proposed design and the conventional design. The implementation results show that the proposed MAC unit can work at a much higher frequency. The proposed design can be used in both the forward data-path and the backpropagation data-path to investigate different accuracies on FPGA, to accelerate the training phase, and to find the optimal data width for an ASIC implementation.
In addition, this work has also integrated the proposed design into an MNIST network architecture to demonstrate its applicability. The results show that the variable-bit MAC unit works effectively and achieves an accuracy of up to 88%.
ACKNOWLEDGMENT
This work is supported by Vietnam National University, Hanoi under grant number TXTCN.19.07.
REFERENCES
[1] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, "PCANet: A Simple Deep Learning Baseline for Image Classification?," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5017–5032, Dec. 2015.
[2] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 2013, pp. 8599–8603.
[3] Y. Park and M. Kellis, "Deep learning for regulatory genomics," Nature Biotechnology, vol. 33, no. 8, pp. 825–826, Aug. 2015.
[4] P. Judd et al., "Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets," arXiv:1511.05236 [cs], Nov. 2015.
[5] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized Convolutional Neural Networks for Mobile Devices," arXiv:1512.06473 [cs], Dec. 2015.
[6] L. Lai, N. Suda, and V. Chandra, "Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations," arXiv:1703.03073 [cs], Mar. 2017.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," p. 13, 2017.
[9] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 1–9.
[10] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556 [cs], Sep. 2014.
[11] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-Serial Deep Neural Network Computing," p. 12.
[12] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2018, pp. 218–220.
[13] D.-A. Nguyen, H.-H. Ho, D.-H. Bui, and X.-T. Tran, "An Efficient Hardware Implementation of Artificial Neural Network based on Stochastic Computing," in 2018 5th NAFOSTED Conference on Information and Computer Science (NICS), Ho Chi Minh City, 2018, pp. 237–242.