COMPARING CONVOLUTIONAL NEURAL NETWORKS IN VIETNAMESE SCENE TEXT RECOGNITION
Le Ngoc Thuy*
Abstract: Scene text recognition is a challenging task for the research community, especially for scripts with diacritical marks such as Vietnamese. In this paper, two different convolutional network architectures for recognising Vietnamese text in natural scenes are presented. Experiments are conducted to compare the performance of the two networks in reading Vietnamese restaurant signs. Experimental results show that the deeper network outperforms the other in both recognition accuracy and computational time.
Keywords: Scene text recognition, Optical character recognition, Convolutional neural networks.
1. INTRODUCTION
Reading text in natural scene images refers to the problem of converting image regions into strings. Scene text recognition is a crucial issue in many useful applications, including automatic sign translation, text detection systems for the blind, intelligent driving assistance, and content-based image/video retrieval. Hence, scene text recognition has received increasing interest from the research and industrial communities in recent years.
Although scene text recognition seems similar to optical character recognition (OCR), reading text in natural scene images is much more challenging. One of the leading commercial OCR engines, ABBYY FineReader, claims that it can transform scanned documents into text with an accuracy of 99.8%. However, its character recognition accuracy is as low as 21% in scene text applications [1]. The difficulty of scene text recognition results from the following three facts. Firstly, the appearance of characters often varies drastically in font, colour and size, even within the same image. Secondly, the text in captured images is affected by various factors, such as blur, distortion, non-uniform illumination, occlusion and complex backgrounds. Lastly, other objects in the captured image make the problem more challenging.
Numerous studies have dealt with scene text detection and recognition during the last two decades, but most existing methods and benchmarks have focused on English text. There have been only a few efforts addressing scene text detection and recognition for language scripts with diacritics [2]. The results of the ICDAR 2013 Robust Reading Competition showed that the participating methods usually failed to detect the dots of the letters “i” and “j” [3]. Therefore, if applied directly to other languages, most current methods in scene text detection and recognition are likely to miss the tiny atoms of language scripts with diacritics, such as Vietnamese, Thai and Arabic (Figure 1). For instance, commercial OCR software works well with scanned English documents but still makes significant errors in transforming scanned Vietnamese documents into text. The errors are mainly due to letters with diacritics. Moreover, some Vietnamese words may contain a letter with two diacritics above or below it. This distinctive characteristic makes Vietnamese script recognition more challenging than that of most other scripts.
As numerous researchers have devoted themselves to detecting and recognising scene text, many papers have provided comprehensive surveys of these problems [4-11]. The most comprehensive survey [4] reviews more than 200 papers, which are classified into two groups. The first group includes stepwise methodologies, which address the problem of reading scene text in four separate steps: localization, verification, segmentation and recognition. The advantages of stepwise methodologies are computational efficiency and the capability of processing oriented text. However, their disadvantages are the complexity of integrating different techniques from all four steps and the difficulty of optimizing parameters for all steps at the same time. The other group includes integrated methodologies, which identify specific words in images with character and language models. While integrated methodologies have a clear advantage in optimizing parameters for the whole solution, they are often computationally expensive and limited to a small lexicon of words.
Figure 1. The same sentence in different languages: English, Arabic, Slovakian, Vietnamese, Urdu, Japanese and Thai.
Another valuable survey [5] gives an overview of recent advances in scene text detection and recognition for static images, referring to around 100 papers.
Y. Zhu et al. [5] categorise the related works on scene text detection into three types of methods: texture-based methods, component-based methods and hybrid methods. The paper not only analyses the strengths and weaknesses of comparative methods but also provides a useful discussion of state-of-the-art algorithms and future trends in scene text detection and recognition.
The above papers emphasize the strong performance of deep learning methods in scene text detection and recognition. They also suggest that further improvements in detection and recognition accuracy can be achieved if a deep learning framework is employed and combined with language knowledge. Among the studies using deep learning and big data, Google PhotoOCR [12] is a remarkably successful work which won the ICDAR Robust Reading Competition in 2013. It takes advantage of substantial progress in deep learning and large-scale language modeling. Its deep neural network (DNN) character classifier is trained on two million examples, while its language model is built from a corpus of more than a trillion tokens. Many other methods using DNNs have achieved the top scores in ICDAR Robust Reading Competitions.
To the best of our knowledge, there has not been any study of word-level scene text recognition for Vietnamese. Hence, this paper explores this area by comparing the performance of two neural networks in recognising Vietnamese words on restaurant signs. The concept of convolutional neural networks (CNNs) is introduced in the next section. Then, two network architectures with different complexity levels are presented. Section 3 discusses the experimental results obtained when the presented networks are used for Vietnamese text recognition.
2. SCENE TEXT RECOGNITION USING CNNs
2.1. Background theory
Convolutional neural networks are specific feed-forward multilayer neural networks which combine the following three architectural ideas: (i) local receptive fields, used to detect elementary visual features in images, such as oriented edges, end points or corners; (ii) shared weights, to extract the same set of elementary features from the whole input image and to reduce the computational cost; (iii) sub-sampling operations, to reduce the computational cost and the sensitivity to affine transformations such as shifts and rotations [3].
A convolutional neural network consists of many layers, including the input layer, the output layer, and hidden layers. The hidden layers of convolutional networks include convolutional layers and pooling layers. Each unit in a convolutional layer is locally connected to a set of units located in a small neighborhood of the previous layer. The outputs of convolutional layers are called feature maps because they help to extract the visual features in images. The output features at one layer may be used to build higher-order features in the next layers. Unfortunately, no algorithm is able to automatically determine the optimal architecture of a CNN for a given classification task. The architecture of the network, such as the number of layers, the number of units in each layer, and the network parameters, must be determined through experiments. This section presents the two convolutional network architectures used in the experiments on Vietnamese scene text recognition in Section 3.
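To make the three architectural ideas concrete, the following minimal PyTorch sketch (an illustration, not part of the original experiments; the layer sizes here are arbitrary) applies one convolutional layer and one pooling layer to a 32x32 colour image:

```python
import torch
import torch.nn as nn

# One 5x5 convolution: every output unit sees only a local 5x5x3 receptive
# field, and the same 5x5x3 weights are shared across all spatial positions.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # sub-sampling halves each dimension

x = torch.randn(1, 3, 32, 32)         # a batch with one 32x32 colour image
maps = conv(x)                        # 8 feature maps of size 32x32
sub = pool(torch.relu(maps))          # sub-sampled to 16x16
print(maps.shape, sub.shape)          # [1, 8, 32, 32] and [1, 8, 16, 16]

# Weight sharing keeps the parameter count independent of the image size:
# 8 filters x (5*5*3 weights + 1 bias) = 608 parameters.
print(sum(p.numel() for p in conv.parameters()))  # 608
```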
2.2. Network architecture 1
The first proposed network consists of three convolutional layers, as shown in Figure 2. The input of the network is a colour image of size 32x32x3. The first convolutional layer has 32 feature maps corresponding to 32 convolutional filters. The size of each convolutional filter in the first layer is 5x5x3. The second and third convolutional layers have 32 and 64 feature maps, respectively. The outputs of the convolutional layers are sub-sampled using the max pooling function and normalised by the rectified linear unit (ReLU). The receptive field of the pooling layers is a 3x3 matrix with a stride of 2. The last two layers are fully connected in order to combine the features learned by the previous convolutional and pooling layers. The number of filters in the last layer equals the number of classes to be recognised. This architecture has 12,399,306 connections in total but only 145,578 parameters, thanks to the weight-sharing characteristic.
[Figure 2 diagram: input image 32x32x3 → conv layer 32x32x32 → pooling layer 16x16x32 → conv layer 16x16x32 → pooling layer 8x8x32 → conv layer 8x8x64 → pooling layer 4x4x64 → fully connected layers → 1x1x64]
Figure 2. The first convolutional network architecture.
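A possible PyTorch reconstruction of this architecture is sketched below. The paper does not state the filter sizes of the second and third convolutional layers or the width of the first fully connected layer; assuming 5x5 filters throughout and a 64-unit fully connected layer reproduces the reported total of 145,578 parameters, so the sketch is consistent with, though not guaranteed to match, the original network:

```python
import torch
import torch.nn as nn

class Net1(nn.Module):
    """Hypothetical reconstruction of the first architecture (Figure 2)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),        # -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 16x16x32
            nn.Conv2d(32, 32, kernel_size=5, padding=2),       # -> 16x16x32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 8x8x32
            nn.Conv2d(32, 64, kernel_size=5, padding=2),       # -> 8x8x64
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),  # -> 4x4x64
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 4 * 64, 64),   # first fully connected layer (1x1x64)
            nn.ReLU(),
            nn.Linear(64, num_classes),  # second fully connected layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = Net1()
print(sum(p.numel() for p in model.parameters()))  # 145,578, as reported
```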
2.3. Network architecture 2
The second network architecture is simpler than the first one. It consists of only one convolutional layer and one pooling layer (Figure 3). To obtain more information from the input data, a larger input image size is used (64x64x3). The convolutional layer is created using 400 kernel filters of size 8x8x3. The outputs of the convolutional layer are sub-sampled using the average pooling function and normalised by the sigmoid function. The receptive field of the pooling layer is a 3x3 matrix with a stride of 3, so that the sub-sampled areas are non-overlapping. This architecture has 250,822,800 connections in total but only 77,200 parameters.
[Figure 3 diagram: input image 64x64x3 → conv layer 57x57x400 → pooling layer 19x19x400]
Figure 3. The second convolutional network architecture.
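This architecture follows directly from the figures given above, so the following sketch (again a hypothetical reconstruction) should be close to the original: a valid 8x8 convolution maps a 64x64x3 input to 57x57x400, and non-overlapping 3x3 average pooling reduces this to 19x19x400, giving exactly the 77,200 parameters reported:

```python
import torch
import torch.nn as nn

class Net2(nn.Module):
    """Single convolutional layer with sigmoid activation and average pooling."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 400, kernel_size=8)       # 64x64x3 -> 57x57x400
        self.pool = nn.AvgPool2d(kernel_size=3, stride=3)  # 57x57x400 -> 19x19x400

    def forward(self, x):
        # The paper normalises the convolution outputs with the sigmoid function.
        return self.pool(torch.sigmoid(self.conv(x)))

model = Net2()
print(sum(p.numel() for p in model.parameters()))  # 400*(8*8*3 + 1) = 77,200
```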
3. EXPERIMENTS AND RESULTS
3.1. Training dataset
Since no labeled dataset of Vietnamese scene text could be found on the internet, a dataset of Vietnamese restaurant signs was built by collecting images from the internet and by capturing shop signs on the street (Figure 4). The collected dataset consisted of 1,301 images containing 464 instances of the word “bún” (rice noodle), 409 of “phở” (noodle soup), and 428 of “cơm” (rice). This dataset was split into two subsets: two thirds of the images were used for training the networks, and the rest were used for validation.
Figure 4. Images from the dataset.
Convolutional neural networks often require a large amount of data so that they can learn the features of objects by themselves. Hence, images of other objects were added to the training dataset. The final training set consists of about 3,000 resized images of 10 object classes.
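As a sketch of how such a training set could be assembled and split two to one (the directory name, image size and class layout below are assumptions for illustration, not details from the paper):

```python
import torch
from torchvision import datasets, transforms

# Hypothetical directory layout: one sub-folder per class, e.g.
# signs_dataset/bun, signs_dataset/pho, signs_dataset/com, plus the
# folders for the added object classes (10 classes in total).
tfm = transforms.Compose([transforms.Resize((32, 32)), transforms.ToTensor()])
data = datasets.ImageFolder("signs_dataset", transform=tfm)

# Two thirds of the images for training, the rest for validation.
n_train = (2 * len(data)) // 3
train_set, val_set = torch.utils.data.random_split(
    data, [n_train, len(data) - n_train])
```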
3.2. Experimental results
Our experiments utilised the softmax classifier, a well-known multiclass classification method, for recognising text. The output of each of the above neural networks was used as the input of the softmax classifier. It should be noted that the input of the neural networks in our experiments was produced directly from the original captured images. Hence, the networks did not need a pre-processing step to crop words from the original images, as some other networks do.
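A minimal training loop in this spirit might look as follows (a sketch reusing the hypothetical Net1 and train_set from the earlier snippets; in PyTorch, nn.CrossEntropyLoss applies the softmax internally, so the network outputs raw class scores):

```python
import torch
import torch.nn as nn

model = Net1(num_classes=10)           # sketch from Section 2.2
criterion = nn.CrossEntropyLoss()      # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(20):                # the number of epochs is an assumption
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```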
The accuracy of recognising each word (bún, phở, cơm) and the average accuracy for Vietnamese words are shown in Table 1. Although the input images of network 2 had four times the resolution of those of network 1, the accuracy of network 1 in recognising words was higher than that of network 2. This is attributable to the deeper architecture of network 1.
Table 1. Recognition accuracy

                     Network 1    Network 2
Vietnamese words     86.7%        69.7%
Figures 5 and 6 show some randomly selected images which were recognised correctly and incorrectly. The results are promising because the networks can correctly recognise blurred words in images with non-uniform illumination and complex backgrounds.
Figure 5. Correctly recognised words.
Another remark in comparing these two networks concerns computational complexity. Although the number of parameters in network 1 is about double that in network 2, the number of connections in network 2 is twenty times greater than that in network 1. Hence, the second network needed much more time to compute the forward propagation. This makes the first network faster in the recognition task.
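The reported totals can be checked by counting connections as output units times fan-in plus one bias term per unit, under the same filter-size assumptions as the sketch in Section 2.2:

```python
# Connections = number of output units x (fan-in + 1 bias term per unit).
net1_connections = (
    32 * 32 * 32 * (5 * 5 * 3 + 1)      # conv1:  2,490,368
    + 16 * 16 * 32 * (5 * 5 * 32 + 1)   # conv2:  6,561,792
    + 8 * 8 * 64 * (5 * 5 * 32 + 1)     # conv3:  3,280,896
    + (4 * 4 * 64) * 64 + 64            # fc1:       65,600
    + 64 * 10 + 10                      # fc2:          650
)
net2_connections = 57 * 57 * 400 * (8 * 8 * 3 + 1)   # 250,822,800

print(net1_connections)                      # 12,399,306, as reported
print(net2_connections // net1_connections)  # ~20x more work per forward pass
```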
Figure 6. Incorrectly recognised words.
4. CONCLUSIONS
Two convolutional neural networks for Vietnamese scene text recognition have been compared. The results show that the deeper network achieves better performance in both recognition accuracy and computational time.
The current results are obtained by using raw image pixels as the input of the CNNs. To achieve higher accuracy, further investigation should focus on using specific image features as the input of the CNNs. The performance of the above CNNs on Vietnamese scene text recognition should also improve with a larger labeled dataset.
REFERENCES
[1] Wang K., Babenko B., Belongie S., “End-to-End Scene Text Recognition”, IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[2] Le N. T., “Các giải thuật phát hiện chữ viết đối với các ngôn ngữ có dấu” (Text detection algorithms for scripts with diacritics), Journal of Military Science and Technology, vol. 46 (2016), pp. 163-169.
[3] Karatzas D., Shafait F., Uchida S., Iwamura M., Gomez i Bigorda L., Robles Mestre S., Mas J., Fernandez Mota D., Almazán J., de las Heras L. P., “ICDAR 2013 Robust Reading Competition”, Proceedings of the ICDAR (2013).
[4] Ye Q. and Doermann D., “Text detection and recognition in imagery: A survey”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7 (2014), pp. 1480-1500.
[5] Zhu Y., Yao C. and Bai X., “Scene text detection and recognition: Recent advances and future trends”, Frontiers of Computer Science, vol. 10, issue 1 (2015), pp. 19-36.
[6] Chen C., Wang D.-H., Wang H., “Scene Character and Text Recognition: The State-of-the-Art”, in Image and Graphics, Lecture Notes in Computer Science, vol. 9219 (2015), pp. 310-320.
[7] Karanje U. B. and Dagade R., “Survey on Text Detection, Segmentation and Recognition from a Natural Scene Images”, International Journal of Computer Applications, vol. 108, no. 13 (2014).
[8] Patil P. and Nipanikar S. I., “A Survey on Scene Text Detection and Text Recognition”, International Journal of Advanced Research in Computer and Communication Engineering, vol. 5, issue 3 (2016), pp. 887-889.
[9] Shi C.-Z., Gao S., Liu M.-T., Qi C.-Z., “Stroke Detector and Structure Based Models for Character Recognition: A Comparative Study”, IEEE Transactions on Image Processing, vol. 24, issue 12 (2015), pp. 4952-4964.
[10] Kaur T. and Neeru N., “Text Detection and Extraction from Natural Scene: A Survey”, International Journal of Advance Research in Computer Science and Management Studies, vol. 3, issue 3 (2015), pp. 331-336.
[11] Sharma N., Pal U. and Blumenstein M., “Recent advances in video based document processing: A review”, Proc. DAS (2012), pp. 63-68.
[12] Bissacco A., Cummins M., Netzer Y., Neven H., “PhotoOCR: Reading Text in Uncontrolled Conditions”, IEEE International Conference on Computer Vision (2013), pp. 785-792.
Published 1st November 2017.
Author affiliation:
Posts and Telecommunications Institute of Technology;
* Email: thuyln@ptit.edu.vn.