A TRAFFIC SIGN RECOGNITION SYSTEM WITH CONVOLUTIONAL NEURAL NETWORK
Luong Cong Duan1,*, Nguyen Hong Kiem2, Nguyen Ngoc Minh1
Abstract: In this research, we applied a Convolutional Neural Network (CNN) [1][2] to the task of traffic sign recognition. This work is the foundation for our continuing research on self-driving vehicles. A Convolutional Neural Network is a multi-stage architecture that can learn features automatically. We used the Tensorflow library and Python as the main tools to test our approach. After research and testing, the architecture reached 91.1% accuracy.
Keywords: Traffic Sign Recognition, Convolutional Neural Network, CNN, Self-Driving
1 INTRODUCTION
Our long-term goal is self-driving vehicles, and traffic sign identification is one of our first studies toward it. Traffic sign identification can be applied in many areas of traffic, such as notifying drivers of changes in sign information on the road, warning about violations while in traffic, and automated driving. Traffic signs usually differ clearly from one another, but the number of sign types is quite large. In addition, the quality of sign images is strongly affected by the viewing angle, lighting, occlusion, color fading and speed of movement. In this paper, our aim is to build a test recognizer that ignores conditions that are too difficult; those will be addressed in further research.
We used a basic dataset called German Traffic Sign [3]. This dataset was used in the GTSRB (German Traffic Sign Recognition Benchmark) competition. It provides more than 50,000 sample pictures in 43 different classes: speed limits, dangerous curves, slippery road… The competition was held a few years ago. The best result in the competition, 99.46% of signs correctly recognized, was achieved by the IDSIA team using a committee of CNNs [3].
Traditional methods for traffic sign recognition generally consist of two tasks: detection and classification. Detection is first handled with computationally inexpensive, hand-crafted algorithms. Classification is subsequently performed on the detected candidates with more expensive, but more accurate, algorithms. Hand-crafted features, also called shallow features, are not discriminative enough as databases become larger and larger, and generic deep features should push recognition performance even further. Classification has been approached with a number of popular methods such as Neural Networks [4], Support Vector Machines [5]… In one such approach, global sign shapes are first detected with various heuristics and color thresholding, then the detected windows are classified using a different Multi-Layer Neural Net for each type of outer shape. These neural nets take 32x32 inputs and have at most 30, 15 and 10 hidden units for each of their 3 layers. While using a similar input size, the networks used in the present work have orders of magnitude more parameters.
Current popular algorithms mainly use convolutional neural networks to perform both feature extraction and classification [6]. Experiments have shown that CNNs have many advantages in recognition problems. A variety of CNN variants have been proposed for GTSRB. Pierre Sermanet and Yann LeCun [7] fed both the high-level and low-level features extracted by different convolutional layers to the fully-connected layers. This method combined global invariant features with local detailed ones and reached an accuracy of 99.17%.
Based on this information, we chose CNN as the basic method for the traffic sign recognition task. A CNN is a biologically-inspired, multilayer feed-forward architecture that can learn multiple stages of invariant features using a combination of supervised and unsupervised learning. Each stage is composed of a (convolutional) filter bank layer, a non-linear transform layer, and a spatial feature pooling layer. The spatial pooling layers lower the spatial resolution of the representation, thereby making the representation robust to small shifts and geometric distortions, similarly to “complex cells” in standard models of the visual cortex [8]. CNNs are generally composed of one to three stages, capped by a classifier composed of one or two additional layers.
Figure 1 Typical CNN architecture (Wikipedia)
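The stage structure described above (filter bank, non-linear transform, spatial pooling) can be illustrated with a minimal pure-Python sketch. It is not the Tensorflow code used in this work: the function names, the toy 6x6 image and the 3x3 edge kernel are assumptions made for illustration, and ReLU stands in for the non-linear transform.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a single-channel image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Non-linear transform: clamp negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; lowers the spatial resolution by `size`."""
    out_h = len(fmap) // size
    out_w = len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def cnn_stage(image, kernel, pool_size=2):
    """One stage: filter bank (a single filter here) -> ReLU -> pooling."""
    return max_pool(relu(conv2d_valid(image, kernel)), pool_size)


# Hypothetical toy input: a 6x6 image with a vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hypothetical 3x3 filter that responds to dark-to-bright vertical edges.
edge_kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

features = cnn_stage(image, edge_kernel)  # 6x6 image -> 4x4 map -> 2x2 map
```

In a real network each stage holds many learned filters rather than one hand-written kernel; the sketch only shows how the three operations chain together and how pooling shrinks the representation.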
After building the architecture, we used an optimization method to minimize the loss function. One of the most popular methods is Gradient Descent [1][9]. Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model’s parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ with respect to the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
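The update described above can be written as $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, with $\theta$ the parameters and $\eta$ the learning rate. The following is a minimal sketch of that rule on a toy quadratic objective. The objective, the starting point, the learning rate and the step count are assumptions chosen for illustration and are not the settings used to train the network.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=100):
    """Repeatedly apply theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t - learning_rate * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical objective J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# whose single minimum (the "valley") is at theta = (3, -1).
def toy_gradient(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_min = gradient_descent(toy_gradient, [0.0, 0.0])
```

With this objective, each step shrinks the distance to the minimum by a constant factor, so `theta_min` ends up very close to (3, -1). A learning rate that is too large would make the iterates overshoot instead.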
Currently, there are many libraries and programming languages that support programming and training machine learning models. Drawing on its machine learning background, Google has created an open source library called Tensorflow. It has a flexible architecture that allows users to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API [10]. We decided to use this library for our project.
2 NETWORK ARCHITECTURE
The architecture used in the present work departs from traditional CNNs [5] by the use of connections that skip layers, and by the use of pooling layers with different subsampling ratios for the connections that skip layers and for those that do not.
Figure 2 Gradient descent on a series of level sets
We ran the test a number of times and have, for now, selected an architecture with 4 stages as follows:
1. Input data: [batch, 32, 32, 3], YUV data
2. 1st stage
   Input = input data
   Conv1 + ReLU: kernel size = 5, layer width = 108
      channel Y connects to 100 kernels
      channel UV connects to 8 kernels
   Max pooling: kernel size = 2
   Output = “conv1”
3. 2nd stage
   Input = “conv1”
   Conv2 + ReLU: kernel size = 3, layer width = 200
   Max pooling: kernel size = 2
   Output = “conv2”
4. 3rd stage
   Combine “conv1(flatten)” with “conv2(flatten)”
   Input = concat “conv1(flatten)” and “conv2(flatten)”
   Fully connected layer + ReLU: layer width = 300
   Output = “fc1”
5. Output - 4th stage
   Input = “fc1”
   Out: layer width = 43
Figure 3 Network architecture
Figure 4 Diagram of network architecture
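To make the sizes in the listing concrete, the sketch below works out the feature-map shapes and the length of the concatenated vector fed to the fully connected layer. The listing does not state the convolution padding or stride, so stride 1 and a choice between "valid" and "same" padding are assumptions here, as are all function names. Biases are left out of the weight counts.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution (stride 1 is an assumption)."""
    if padding == "same":
        return size
    if padding == "valid":
        return size - kernel + 1
    raise ValueError("padding must be 'same' or 'valid'")


def pool_out(size, kernel):
    """Spatial size after non-overlapping max pooling (stride = kernel size)."""
    return size // kernel


def feature_sizes(padding="valid"):
    """Shapes for a 32x32 input under the assumed padding mode."""
    # 1st stage: 5x5 convolution with 108 feature maps, then 2x2 max pooling.
    s1 = pool_out(conv_out(32, 5, padding), 2)
    conv1_flat = s1 * s1 * 108
    # 2nd stage: 3x3 convolution with 200 feature maps, then 2x2 max pooling.
    s2 = pool_out(conv_out(s1, 3, padding), 2)
    conv2_flat = s2 * s2 * 200
    # 3rd stage: both flattened outputs are concatenated and fed to "fc1".
    fc1_inputs = conv1_flat + conv2_flat
    return {
        "conv1": (s1, s1, 108),
        "conv2": (s2, s2, 200),
        "fc1_inputs": fc1_inputs,
        "fc1_weights": fc1_inputs * 300,  # fully connected layer, width 300
        "output_weights": 300 * 43,       # output layer, 43 classes
    }


valid_sizes = feature_sizes("valid")
same_sizes = feature_sizes("same")
```

Under "valid" padding the concatenated vector has 28,368 entries, and under "same" padding 40,448. In both cases the fully connected layer holds by far the largest share of the weights.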
3 EXPERIMENT
A Data Preparation
Currently, the GTSRB dataset has about 50,000 sample pictures in 43 classes. However, the number of images per class is uneven. The distribution of the dataset is shown below:
Figure 5 Number of inputs per class before balancing data
It can be seen that there are differences between the classes. We should create additional data to balance the number of inputs. We used a simple method to increase the number of images: rotating images by a few degrees. This is the distribution after this operation:
Figure 6 Number of inputs per class after balancing data
The data is more balanced, and each class has at least 500 images. This new dataset will help to train our network better.
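The balancing step can be planned with a short sketch like the one below. It only computes how many rotated copies each class needs to reach the 500-image minimum and draws a rotation angle for each copy. The class counts, the ±15 degree range and the function names are hypothetical, and the rotation itself would be done with an image library that is not shown here.

```python
import random


def augmentation_plan(class_counts, min_per_class=500):
    """Number of extra (rotated) images each class needs to reach the minimum."""
    return {label: max(0, min_per_class - count)
            for label, count in class_counts.items()}


def rotation_angles(n_extra, max_degrees=15.0, seed=0):
    """Draw one small rotation angle per extra image (the range is an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-max_degrees, max_degrees) for _ in range(n_extra)]


# Hypothetical per-class image counts, for illustration only.
counts = {0: 210, 1: 2220, 2: 480}

plan = augmentation_plan(counts)    # extra images needed per class
angles = rotation_angles(plan[0])   # one angle per extra image of class 0
```

Classes that already have 500 or more images get a plan entry of zero and are left unchanged.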
Additionally, all images are down-sampled or up-sampled to 32x32 (dataset sample sizes vary from 15x15 to 250x250) and converted to YUV space. The Y channel is then preprocessed with global and local contrast normalization, while the U and V channels are left unchanged.
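A minimal sketch of the color conversion and the global part of the contrast normalization follows. The paper does not say which YUV variant was used, so the BT.601 coefficients are an assumption. Global normalization is taken here to mean zero mean and unit standard deviation over the whole Y channel, which is also an assumption. Local contrast normalization and resizing are not shown.

```python
import math


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (values in [0, 1]) to YUV with BT.601 coefficients."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b
    v = 0.615 * r - 0.51499 * g - 0.10001 * b
    return y, u, v


def global_contrast_normalize(channel, eps=1e-8):
    """Shift a channel to zero mean and scale it to unit standard deviation."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / (std + eps) for v in row] for row in channel]


# Hypothetical 2x2 RGB image: black pixels on the left, white on the right.
rgb_image = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]

yuv_image = [[rgb_to_yuv(*pixel) for pixel in row] for row in rgb_image]
y_channel = [[pixel[0] for pixel in row] for row in yuv_image]
y_normalized = global_contrast_normalize(y_channel)
```

For gray pixels (equal R, G and B) the U and V values are close to zero, so all the information sits in the Y channel.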
B Network optimization
After preparing the input data, we trained with Gradient Descent optimization on a simple dataset in order to optimize our network. We used 200 training epochs to test and calibrate it.
During training, we changed the order of “Batch Normalization” and “Max Pooling” to compare the difference in training speed (BP means “Conv → Batch Normalization → Max Pooling” and PB means “Conv → Max Pooling → Batch Normalization”). The results for the two arrangements are as follows:
Figure 7 Comparison between BP and PB
The chart clearly shows that the PB architecture is better than the BP architecture, so in this paper we use PB to design our architecture. After that, we tested how the network changes with different numbers of fully connected layers. We had assumed that a network with one more fully connected layer would be better, but in reality the opposite is true.
Figure 8 Comparison of the number of fully connected layers
With our data, the network with one fully connected layer is better than the networks with none or two. This suggests that a more complex architecture does not always mean better results; we need to test to find a suitable architecture. After optimizing the network, we selected the network architecture described in Section 2.
C Training and Results
After choosing the architecture and parameters, we trained on the dataset prepared above. The program was trained with 39,209 labeled samples and tested with 12,630 unlabeled samples. The final result is as follows:
>> Time to trainning: 4673.0710661411285s
>> Validation accuracy: 0.9854
>> Test accuracy: 0.9260
>> Time to process a picture: 0.253s
Figure 9 Loss and Accuracy of training process
The result shows that, after training and testing, our architecture reaches a validation accuracy of 98.54% and a test accuracy of 92.6%. Training ran for nearly 40,000 steps, but the graph shows that from about the 10,000th step the loss and accuracy of the network change very slowly; this is the phase in which the coefficients are fine-tuned. Sometimes the loss increases and the accuracy decreases very quickly, then both return to their previous range. This is an anomaly, so during training the programmer should monitor these values to ensure they are stable before training stops, for the best training results.
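One simple way to automate the stability check suggested above is to look at the spread of the most recent loss values before stopping. The sketch below is only an illustration: the window length, the tolerance and the function name are assumptions, and this check was not part of the training program described here.

```python
def is_stable(losses, window=5, tolerance=0.1):
    """True if the last `window` losses vary by at most `tolerance` of their mean."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return all(v == 0 for v in recent)
    return (max(recent) - min(recent)) / abs(mean) <= tolerance


# Hypothetical loss histories: one settled, one with a sudden spike near the end.
settled = [0.90, 0.50, 0.21, 0.20, 0.20, 0.21, 0.20]
spiking = [0.90, 0.50, 0.20, 0.20, 0.21, 0.95, 0.22]

stop_now = is_stable(settled)        # spread is small, safe to stop
keep_training = not is_stable(spiking)  # the spike is still inside the window
```

The same test can be applied to the accuracy curve, so that training is not stopped in the middle of one of the short excursions described above.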
In this paper, we conducted the experiments on a machine without a GPU. The results show that the processing time for each image is about 0.253s (3.95 fps). That is a good figure for our next research. GPUs support parallel computing, so the current processing speed could be raised to real-time processing.
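The frame rate quoted above follows directly from the per-image time, as the short calculation below shows. The 25 fps real-time target is an assumption used only to estimate the speed-up that would be needed; it is not a figure from this paper.

```python
seconds_per_image = 0.253   # measured processing time per image (CPU only)
test_images = 12630         # number of test samples

# Frames per second is the inverse of the time per image.
fps = 1.0 / seconds_per_image

# Time to process the whole test set one image at a time.
total_seconds = test_images * seconds_per_image


def speedup_needed(target_fps, current_fps):
    """Factor by which processing must get faster to reach target_fps."""
    return target_fps / current_fps


# Assumed real-time target of 25 fps (hypothetical, for illustration).
required_speedup = speedup_needed(25.0, fps)
```

Under that assumed target, processing would have to become a little over six times faster, and the full test set takes roughly 53 minutes at the measured speed.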
4 SUMMARY
In this paper, a simple architecture for traffic sign recognition is proposed. We conducted trials that changed the order of operations to find the best choice. With the same number of elements, the arrangement of elements is very important for a CNN. In addition, more complexity is not always better; for each type of data we need to adapt the design to obtain the most appropriate network architecture. Although the designed architecture is simple, it gives a good result. This architecture has the following advantages: it is simple, easy to deploy in both high-level and low-level languages, uses few system resources, and has high processing speed.
The accuracy of our architecture is 92.6%. This result is not very high, but the architecture is much simpler than other architectures. We can use it on low-end computers such as embedded computers or FPGAs. However, before doing so, we will use filters and image processing tools as a pre-processing step to improve input quality.
In the next phase of research, we will rebuild our architecture in C/C++, which is better optimized for speed, continue to optimize the architecture, and go on to solve the next problems, such as sensor problems, case handling and automatic control…, to build a model of self-driving vehicles.
Finally, after solving the component problems, we will try to deploy it on embedded computers and FPGAs to run a test device and evaluate performance.
REFERENCES
[1] Ian Goodfellow, Yoshua Bengio and Aaron Courville, “Deep Learning”, MIT Press, 2016
[2] Jianxin Wu, LAMDA Group, National Key Lab for Novel Software Technology, “Introduction to Convolutional Neural Networks”, May 2017
[3] http://benchmark.ini.rub.de/?section=gtsrb&subsection=news
[4] J. Torresen, J. W. Bakke and L. Sekanina, “Efficient recognition of speed limit signs”, Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No.04TH8749), 2004, pp. 652-656
[5] A. De la Escalera, L. Moreno, M. Salichs and J. Armingol, “Road traffic sign detection and classification”, IEEE Transactions on Industrial Electronics, pp. 848-859, 1997
[6] R. Girshick, J. Donahue, T. Darrell and J. Malik, “Region-Based Convolutional Networks for Accurate Object Detection and Segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142-158, Jan. 2016
[7] Pierre Sermanet and Yann LeCun, “Traffic sign recognition with multi-scale convolutional networks”, The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, 2011
[8] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 86(11):2278-2324, November 1998
[9] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton and Greg Hullender, “Learning to Rank using Gradient Descent”, Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 89-96, August 2005
[10] https://www.tensorflow.org/
ABSTRACT
TRAFFIC SIGN RECOGNITION WITH CONVOLUTIONAL NEURAL NETWORKS
In this research, we use a convolutional neural network (CNN) to build a traffic sign recognition program. This is the foundation for our subsequent research on self-driving cars. A convolutional network is a neural network with a multi-layer architecture that additionally applies convolution operations between layers. This network can automatically learn the features of objects. After designing the network architecture, we used the Tensorflow library and the Python programming language as the main tools for our experiments. The experimental results show that, although the network architecture is simple and consists of only 4 layers, it achieves an accuracy of 92.6%.
Keywords: CNN, Traffic sign recognition, Convolutional network, Self-driving car
Published 26th February, 2018
Author affiliations:
1 Post and Telecommunication Institute of Technology, Km10, Nguyen Trai, Ha Đong, Ha Noi;
2 Telecommunication University, No.11 Mai Xuan Thuong, Nha Trang, Khanh Hoa
* Corresponding author: duanlc@ptit.edu.vn.