A traffic sign recognition system with convolutional neural network


Luong Cong Duan1,*, Nguyen Hong Kiem2, Nguyen Ngoc Minh1

Abstract: In this research, we applied a Convolutional Neural Network (CNN) [1][2] to the task of traffic sign recognition. This research is the foundation for our continued work on self-driving vehicles. A CNN is a multi-stage architecture that can learn features automatically. We used the TensorFlow library and Python as the main tools for testing. In our tests, the architecture reached 91.1% accuracy.

Keywords: Traffic Sign Recognition, Convolutional Neural Network, CNN, Self-Driving

1 INTRODUCTION

Our long-term goal is self-driving vehicles, and traffic sign identification is one of our first research steps toward it. Traffic sign identification can be applied in many areas of traffic, such as notifying drivers of changes in road signage, warning about violations while in traffic, and automated driving. Traffic signs usually differ clearly from one another, but the number of sign types is quite large. In addition, the quality of sign images is greatly affected by the viewing angle, lighting, occlusion, color fading, and speed of movement. In this paper, our aim is to build a test recognizer that ignores conditions that are too difficult; those will be the subject of further research. We used a basic dataset, the German Traffic Sign dataset [3], which was used in the GTSRB (German Traffic Sign Recognition Benchmark) competition. It provides more than 50,000 sample images in 43 different classes: speed limits, dangerous curves, slippery road, and so on. The best result in the competition, 99.46% of signs correctly classified, was achieved by the IDSIA team using a committee of CNNs [3].

Traditional methods for traffic sign recognition generally consist of two tasks: detection and classification. Detection is first handled with computationally inexpensive, hand-crafted algorithms. Classification is subsequently performed on the detected candidates with more expensive, but more accurate, algorithms. Hand-crafted features, also called shallow features, are not discriminative enough as databases become larger and larger, and generic deep features should push recognition performance even further. Classification has been approached with a number of popular methods such as Neural Networks [4] and Support Vector Machines [5]. In one such approach, global sign shapes are first detected with various heuristics and color thresholding, then the detected windows are classified using a different multi-layer neural net for each type of outer shape. These neural nets take 32x32 inputs and have at most 30, 15 and 10 hidden units for each of their 3 layers. While using a similar input size, the networks used in the present work have orders of magnitude more parameters.

Current popular algorithms mainly use convolutional neural networks to perform both feature extraction and classification [6]. Experiments have shown that CNNs have many advantages in recognition problems. A variety of CNN variants have been proposed for GTSRB. Pierre Sermanet and Yann LeCun [7] fed both the high-level and low-level features extracted by different convolutional layers to the fully connected layers. This method combined global invariant features with local detailed ones and achieved a record accuracy of 99.17%.


Based on this information, we chose CNN as the basic method for the traffic sign recognition task. A CNN is a biologically inspired, multilayer feed-forward architecture that can learn multiple stages of invariant features using a combination of supervised and unsupervised learning. Each stage is composed of a (convolutional) filter bank layer, a non-linear transform layer, and a spatial feature pooling layer. The spatial pooling layers lower the spatial resolution of the representation, thereby making the representation robust to small shifts and geometric distortions, similarly to "complex cells" in standard models of the visual cortex [8]. CNNs are generally composed of one to three stages, capped by a classifier composed of one or two additional layers.

Figure 1 Typical CNN architecture (Wikipedia)

After building the architecture, we need a method to optimize the loss function. One of the most popular methods is gradient descent [1][9]. Gradient descent is a way to minimize an objective function $J(\theta)$, parameterized by a model's parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the opposite direction of the gradient of the objective function, $\nabla_\theta J(\theta)$, with respect to the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
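As an illustration of the update rule described above (each step moves the parameters, here called theta, against the gradient, scaled by the learning rate, here called eta), the following is a minimal sketch in plain Python. The toy quadratic objective, the starting point, the step count, and the names `gradient_descent`, `toy_objective` and `toy_gradient` are assumptions made for this example only; it is not the training code used in this work.

```python
def gradient_descent(grad_j, theta, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad_j(theta)."""
    theta = list(theta)
    for _ in range(steps):
        gradient = grad_j(theta)
        theta = [t - eta * g for t, g in zip(theta, gradient)]
    return theta


def toy_objective(theta):
    # Assumed objective for illustration: J(theta) = (x - 3)^2 + (y + 1)^2
    x, y = theta
    return (x - 3.0) ** 2 + (y + 1.0) ** 2


def toy_gradient(theta):
    # Analytic gradient of the toy objective above
    x, y = theta
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


# Start at the origin and walk downhill toward the minimum at (3, -1)
theta_min = gradient_descent(toy_gradient, [0.0, 0.0], eta=0.1, steps=100)
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor, so `theta_min` ends up very close to (3, -1). A learning rate that is too large would make the same loop overshoot and diverge.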

Currently, many libraries and programming languages support programming and training machine learning models. Drawing on its machine learning background, Google has created an open-source library called TensorFlow. It has a flexible architecture that allows users to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API [10]. We decided to use this library for our project.

2 NETWORK ARCHITECTURE

The architecture used in the present work departs from traditional CNNs [5] by using connections that skip layers, and by using pooling layers with different subsampling ratios for the connections that skip layers and for those that do not.

Figure 2 Gradient descent on a series of level sets


We ran the test a number of times and, for now, have selected an architecture with four stages, as follows:

1. Input data: [batch, 32, 32, 3], YUV data

2. 1st stage
   Input = input data
   Conv1 + ReLU: kernel size = 5, layer width = 108
      channel Y connects to 100 kernels
      channels UV connect to 8 kernels
   Max pooling: kernel size = 2
   Output = "conv1"

3. 2nd stage
   Input = "conv1"
   Conv2 + ReLU: kernel size = 3, layer width = 200
   Max pooling: kernel size = 2
   Output = "conv2"

4. 3rd stage
   Input = concatenation of "conv1 (flatten)" and "conv2 (flatten)"
   Fully connected + ReLU: layer width = 300
   Output = "fc1"

5. Output (4th stage)
   Input = "fc1"
   Out: layer width = 43

Figure 3 Network architecture

Figure 4 Diagram of network architecture
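To make the layer widths listed in this section concrete, the sketch below traces the tensor shapes through the four stages in plain Python. Several details are assumptions, because the listing does not state them: stride-1 convolutions with "same" padding (so the kernel size does not change the spatial size), non-overlapping 2x2 pooling, and no extra subsampling on the skip connection from "conv1". The helper names are hypothetical.

```python
def conv_same(shape, filters):
    """Output shape of a stride-1 convolution with 'same' padding (assumed)."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, size=2):
    """Output shape of non-overlapping max pooling (stride = kernel size, assumed)."""
    height, width, channels = shape
    return (height // size, width // size, channels)


def flat_size(shape):
    """Number of elements after flattening a (height, width, channels) tensor."""
    height, width, channels = shape
    return height * width * channels


inputs = (32, 32, 3)                            # one YUV image
conv1 = max_pool(conv_same(inputs, 100 + 8))    # 100 Y kernels + 8 UV kernels
conv2 = max_pool(conv_same(conv1, 200))
concat = flat_size(conv1) + flat_size(conv2)    # skip connection: conv1 joins conv2
fc1 = 300
out = 43                                        # one unit per GTSRB class

# Weights plus biases of the two dense layers under these assumptions
fc1_params = concat * fc1 + fc1
out_params = fc1 * out + out
```

Under these assumptions, "conv1" has shape (16, 16, 108), "conv2" has shape (8, 8, 200), and the concatenated vector fed to the fully connected layer has 40,448 elements, so most of the parameters sit in that layer. Different padding or an extra pooling step on the skip connection would change these numbers.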

3 EXPERIMENT

A Data Preparation

Currently, the GTSRB dataset has about 50,000 sample images in 43 classes. However, the number of images per class is uneven. The distribution of the dataset is shown below:

Figure 5 Number of inputs per class before balancing data


It can be seen that there are large differences between the classes. We should therefore generate additional data to balance the number of inputs. We used a simple method to increase the number of images: rotating existing images by a few degrees. The distribution after this operation is:

Figure 6 Number of inputs per class after balancing data

The data is now more balanced, and each class has at least 500 images. This new dataset helps to train our network better.
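The balancing step described above can be sketched as a small planning routine: for every class with fewer than 500 images, it lists which existing image to rotate and by how many degrees. This is a plain-Python illustration only. The function name `plan_rotations`, the set of angles, and the example class counts are assumptions, not values from our experiments, and the actual image rotation is left to an image library.

```python
import itertools


def plan_rotations(class_counts, target=500, angles=(-10, -5, 5, 10)):
    """For each under-represented class, list (source_index, angle) pairs
    describing the rotated copies needed to reach `target` images."""
    plan = {}
    for label, count in class_counts.items():
        # A class with no images has nothing to rotate, so it gets no copies.
        missing = max(0, target - count) if count > 0 else 0
        # Cycle through the existing images and the allowed angles in turn.
        sources = itertools.cycle(range(count))
        rotations = itertools.cycle(angles)
        plan[label] = [(next(sources), next(rotations)) for _ in range(missing)]
    return plan


# Hypothetical counts for three classes (not the real GTSRB distribution)
counts = {0: 210, 1: 2220, 2: 360}
plan = plan_rotations(counts)
balanced = {label: counts[label] + len(extra) for label, extra in plan.items()}
```

Classes that already meet the target receive no extra images, so this procedure only raises the floor of the distribution; it does not make all classes equal in size.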

Additionally, all images are down-sampled or up-sampled to 32x32 (dataset sample sizes vary from 15x15 to 250x250) and converted to YUV space. The Y channel is then preprocessed with global and local contrast normalization, while the U and V channels are left unchanged.
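The color conversion and the global part of the contrast normalization can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: pixel values are floats in [0, 1], the RGB to YUV conversion uses the common BT.601 weights, and global contrast normalization is taken to mean shifting to zero mean and scaling to unit standard deviation. Local contrast normalization is not shown, and the function names and sample pixels are hypothetical.

```python
import math


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (floats in [0, 1]) to YUV using BT.601 weights."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def global_contrast_normalize(values, eps=1e-8):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]


# A tiny 2x2 "image" given as a flat list of RGB pixels (made-up values)
pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
yuv = [rgb_to_yuv(*p) for p in pixels]

# Only the Y channel is normalized; U and V are kept as converted
y_normalized = global_contrast_normalize([y for y, _, _ in yuv])
```

Normalizing only the luminance channel keeps the color information in U and V on its original scale, which matches the description above.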

B Network optimization

After preparing the input data, we trained with Gradient Descent optimization on a simple dataset in order to optimize our network. We used 200 training epochs to test and calibrate it.

During training, we changed the order of "Batch Normalization" and "Max Pooling" to compare the difference in training speed (BP means "Conv → Batch Normalization → Max Pooling" and PB means "Conv → Max Pooling → Batch Normalization"). The results for the two orderings are as follows:

Figure 7 Comparison between BP and PB

The chart clearly shows that the PB architecture is better than the BP architecture, so in this paper we use PB to design our architecture. After that, we examined how the network behaves with different numbers of fully connected layers. We had assumed that a network with one more fully connected layer would perform better, but the reality is the opposite.

Figure 8 Comparison of the number of fully connected layers

With our data, the network with one fully connected layer is better than the networks with none or two. This suggests that a more complex architecture does not necessarily give better results; we need to test in order to find a suitable architecture. After optimizing the network, we selected the network architecture described in Section 2.
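The two orderings compared in this section, BP (batch normalization before max pooling) and PB (max pooling before batch normalization), can be illustrated on a single channel with the plain-Python sketch below. It is an illustration only: the normalization omits the learned scale and shift, the 4x4 activation map is made up, and the function names are hypothetical. It shows that the two orderings produce different activations and that PB normalizes fewer values; it does not by itself explain the result in Figure 7.

```python
import math


def batch_norm(feature_map, eps=1e-5):
    """Normalize one channel to zero mean and unit variance (no learned scale/shift)."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a map with even height and width."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]), 2)]
        for i in range(0, len(feature_map), 2)
    ]


# Made-up 4x4 activation map standing in for one convolution output channel
conv_out = [[1.0, 2.0, 0.0, 4.0],
            [3.0, 0.0, 1.0, 1.0],
            [0.0, 5.0, 2.0, 2.0],
            [1.0, 1.0, 6.0, 0.0]]

bp = max_pool_2x2(batch_norm(conv_out))   # Conv -> Batch Normalization -> Max Pooling
pb = batch_norm(max_pool_2x2(conv_out))   # Conv -> Max Pooling -> Batch Normalization

values_normalized_bp = len(conv_out) * len(conv_out[0])   # 16 values per channel
values_normalized_pb = len(pb) * len(pb[0])               # 4 values per channel
```

In the PB ordering the statistics are computed on the pooled maxima, so the output of the stage is zero-mean; in the BP ordering the pooled output is biased toward positive values because pooling keeps the largest normalized activations.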

C Training and Results

After choosing the architecture and parameters, we trained the network on the dataset prepared above. The network was trained with 39,209 labeled samples and tested with 12,630 unlabeled samples. The final result is as follows:

>> Time to trainning: 4673.0710661411285s

>> Validation accuracy: 0.9854

>> Test accuracy: 0.9260

>> Time to process a picture: 0.253s

Figure 9 Loss and Accuracy of training process

The result shows that, after training and testing, the accuracy of our architecture is 98.54% on the validation data and 92.6% on the test data. The training process ran for nearly 40,000 steps, but the graph shows that from about the 10,000th step the loss and accuracy of the network change very slowly; this is the phase in which the coefficients are fine-tuned. Occasionally, the loss increases and the accuracy decreases very quickly, and then both return to their previous range. This is an anomaly, so during training the programmer should monitor these values and make sure they are stable before training stops, in order to obtain the best training results.

In this paper, we conducted the experiments on a machine without a GPU. The results show that the processing time for each image is about 0.253 s (3.95 fps). This is a good figure for our next research. GPUs support parallel computing, so the current processing speed could be raised to real-time processing.
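The throughput figure quoted above follows directly from the measured time per image, as the short Python calculation below shows. The 25 fps real-time target is an assumption chosen for illustration; it is not a requirement stated in this work.

```python
seconds_per_image = 0.253             # measured time to process one picture
training_seconds = 4673.0710661411285  # measured training time

frames_per_second = 1.0 / seconds_per_image
training_minutes = training_seconds / 60.0

# Assumed real-time target, used only to estimate the speed-up still needed
target_fps = 25
required_speedup = target_fps * seconds_per_image

print(f"{frames_per_second:.2f} fps")
print(f"{training_minutes:.1f} minutes of training")
print(f"{required_speedup:.2f}x speed-up needed for {target_fps} fps")
```

Under that assumed target, the per-image time would need to drop by a factor of roughly six, which indicates the size of the gain that GPU execution or an optimized implementation would have to deliver.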

4 SUMMARY

In this paper, a simple architecture for traffic sign recognition is proposed. We conducted trials that changed the order of operations in order to find the best choice. With the same number of elements, the arrangement of the elements is very important for a CNN. In addition, complexity is not always beneficial; for each type of data, the network must be adapted to obtain the most appropriate architecture. Although the designed architecture is simple, it gives a good result. The architecture has the following advantages: it is simple, easy to deploy in both high-level and low-level languages, uses few system resources, and has a high processing speed.

The accuracy of our architecture is 92.6%. This result is not especially high, but the architecture is much simpler than other architectures. We can use it on low-end computers such as embedded computers or FPGAs. Before doing so, however, we will apply some filters and image processing tools as pre-processing to improve input quality.

In the next phase of research, we will rebuild our architecture in C/C++, which is better optimized for speed, continue to optimize the architecture, and address the next problems, such as sensor problems, case handling, and automatic control, in order to build a model of a self-driving vehicle.

Finally, after solving the component problems, we will deploy the system on embedded computers and FPGAs to run test devices and evaluate performance.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning", MIT Press, 2016.

[2] J. Wu, "Introduction to Convolutional Neural Networks", LAMDA Group, National Key Lab for Novel Software Technology, May 2017.

[3] http://benchmark.ini.rub.de/?section=gtsrb&subsection=news

[4] J. Torresen, J. W. Bakke, and L. Sekanina, "Efficient recognition of speed limit signs", Proceedings. The 7th International IEEE Conference on Intelligent Transportation Systems (IEEE Cat. No.04TH8749), 2004, pp. 652-656.

[5] A. de la Escalera, L. Moreno, M. Salichs, and J. Armingol, "Road traffic sign detection and classification", IEEE Transactions on Industrial Electronics, pp. 848–859, 1997.

[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-Based Convolutional Networks for Accurate Object Detection and Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142-158, Jan. 2016.

[7] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks", The 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, 2011.

[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[9] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, "Learning to Rank using Gradient Descent", Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 89–96, August 2005.

[10] https://www.tensorflow.org/

ABSTRACT

TRAFFIC SIGN RECOGNITION WITH A CONVOLUTIONAL NEURAL NETWORK

In this research, we use a convolutional neural network (CNN) to build a traffic sign recognition program. This is the foundation for further research on self-driving vehicles. A convolutional network is a neural network with a multi-layer architecture that additionally applies convolution operations between layers. This network can automatically learn the features of objects. After designing the network architecture, we used the TensorFlow library and the Python programming language as the main tools for experimentation. The experimental results show that, although the network architecture is simple, with only 4 layers, it can reach an accuracy of 92.6%.

Keywords: CNN, Traffic sign recognition, Convolutional network, Self-driving vehicles

Published 26th February, 2018

Author affiliations:

1 Post and Telecommunication Institute of Technology, Km10, Nguyen Trai, Ha Đong, Ha Noi;

2 Telecommunication University, No.11 Mai Xuan Thuong, Nha Trang, Khanh Hoa

* Corresponding author: duanlc@ptit.edu.vn.
