A Comparison of Neural Network Architectures for
Handwritten Digit Recognition
Eman M. El-Sheikh1, Bradley A. Swain1, and Mohamed A. Khabou2
1Department of Computer Science, University of West Florida, Pensacola, FL, USA
2Department of Electrical & Computer Engineering, University of West Florida, Pensacola, FL, USA
Abstract - In this paper, we describe the development and use of an artificial neural network architecture for recognizing handwritten digit data. The feed-forward neural network, which was implemented in Java, was designed to be modular and parameterized so that various parameters and settings of the network could be easily modified. The network was used to recognize a large set of data representing handwritten digits. Various configurations of the network were run on the data set to determine the best settings for this type of data set. A correct classification rate of 93.77% was achieved on a testing set containing 3500 images not used in the training phase. We present a summary and analysis of the results.
Keywords: Machine learning, neural network,
handwritten digit recognition, classification, Java
application
1 Introduction
The focus of this work is the development, use, and analysis of an artificial neural network architecture for the recognition of handwritten digit data. Neural networks have a long track record of successful use for pattern recognition problems, including character recognition. The application of recognizing handwritten digit data was selected for this study due to the numerous potential applications, including mail sorting, automated check reading, and data entry for hand-held devices.

In this paper, we describe the development of a software neural network architecture. The system, which was implemented using the Java programming language, was designed to be modular and parameterized to allow easy modification of the network's size, structure, and parameters. The overall motivation for our study is the analysis of various machine learning techniques for real-world applications with the purpose of determining the appropriateness and usefulness of each technique. Specifically, for the study described in this paper, we focused on the use of neural network learning techniques for handwritten digit recognition. Our objective was two-fold: to test the neural network implementation on a large-scale, real-world data set using a variety of network parameters and sizes, and to determine the best network structure and settings for the handwritten digit data set. The results provide evidence for the use of neural network learning techniques for this type of application and data set, and insight about the appropriate design, use, and parameterization of the network to achieve acceptable results.
In the next section, we provide a brief summary of the literature relevant to neural network techniques and applications. Section 3 describes the design, implementation, and use of the neural network. Section 4 describes the methods used for our tests, including a description of the data set, the experiments we conducted, and the results of our study. The last section includes the conclusions derived and lessons learned from analyzing the results, as well as plans for future work.
2 Literature Review
The use of neural networks in pattern recognition problems dates back to the late 1950s, when the early perceptron models were introduced by Rosenblatt. In the 1960s and 1970s, progress in neural networks slowed due to limited computing capabilities and unfounded perceptions concerning the limitations of neural networks. The enhancement of computing capabilities in the 1980s and the emphatic support of Hopfield, Rumelhart, and other researchers accelerated the development of new neural network models and theories. Today, numerous neural network models and algorithms are available, including back-propagation learning, competitive learning, Kohonen learning, Hopfield networks, Boltzmann machines, shared-weight neural networks, and self-organizing neural networks [4].
Neural networks have been used in a variety of applications ranging from stock price prediction to automatic target detection and recognition. They are especially useful as pattern classifiers and/or recognizers because of their robustness, fault tolerance, and universal function approximation capability. In theory, a multi-layer feed-forward neural network with enough nodes and hidden layers can approximate any bounded function and its derivative to any arbitrary accuracy, given a "large enough" training set [5]. Results published in [1] state that a network with N inputs, one hidden layer of H units, and a total of W weights will require on the order of W/ε training patterns to yield an error less than ε on the test set.
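As a rough back-of-the-envelope illustration of this bound (our own arithmetic, not taken from [1]; bias weights ignored and ε = 0.1 chosen arbitrarily), consider the largest network tested in Section 4.2, with 60 inputs, hidden layers of 25 and 20 units, and 10 outputs:

```latex
W = 60 \cdot 25 + 25 \cdot 20 + 20 \cdot 10 = 2200,
\qquad
\frac{W}{\epsilon} = \frac{2200}{0.1} = 22{,}000
\ \text{training patterns for a target test error of } \epsilon = 0.1.
```

The 2000-pattern training set used in this study is an order of magnitude smaller than this figure, yet test accuracies above 90% were still achieved, which suggests the bound is conservative for this particular data set.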
3 Neural Network Architecture
The neural network application used for this paper was written using the Sun Java 2 Runtime Environment (v 1.5.0), mainly on Mac OS X 10.5.1, but has been successfully used on several Linux distributions as well. Theoretically, the program should work on any platform for which a Java virtual machine exists, as it uses no platform-specific features or calls. The design of the application attempted to make as few assumptions as possible. All attributes, input, and output values are stored as real numbers, allowing for both continuous and discrete attributes. Missing attribute values, however, are not handled by the implementation. The application uses a simple feed-forward network trained with back-propagation, employs the sigmoid activation function, and is designed to allow the substitution of a different activation function. The system was designed to be modular and easily re-configured. Configurable attributes of the network include the number of input neurons, the number of hidden layers, the number of neurons in each hidden layer, the number of output neurons, the learning rate (α), the threshold at which the value of the sigmoid function will cause the neurons to fire, and the number of epochs for which to train the network, all of which are passed into the application by way of command-line parameters. The layers are fully connected, with every node in a layer connected to every node in the following layer, and the initial weights for these connections are initialized to random values in the [-1, 1] range. A minimal sketch of this structure follows.
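The paper's source code is not reproduced here; the following Java sketch is our own illustration, with hypothetical class and method names, of the kind of parameterized structure described above: layer sizes supplied at construction, sigmoid activation, full connectivity between adjacent layers, weights initialized uniformly in [-1, 1], and a plain back-propagation update with learning rate α. Bias terms and the firing threshold are omitted for brevity.

```java
import java.util.Random;

/** Minimal, hypothetical sketch of the parameterized feed-forward network described above. */
public class FeedForwardNet {
    private final double[][][] weights; // weights[layer][toNode][fromNode]
    private final int[] layerSizes;     // e.g. {60, 25, 20, 10}
    private final double alpha;         // learning rate

    public FeedForwardNet(int[] layerSizes, double alpha, long seed) {
        this.layerSizes = layerSizes;
        this.alpha = alpha;
        Random rnd = new Random(seed);
        weights = new double[layerSizes.length - 1][][];
        for (int l = 0; l < weights.length; l++) {
            weights[l] = new double[layerSizes[l + 1]][layerSizes[l]];
            for (double[] row : weights[l])
                for (int j = 0; j < row.length; j++)
                    row[j] = rnd.nextDouble() * 2.0 - 1.0; // uniform in [-1, 1]
        }
    }

    private static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** Forward pass; returns the activations of every layer. */
    public double[][] forward(double[] input) {
        double[][] act = new double[layerSizes.length][];
        act[0] = input;
        for (int l = 0; l < weights.length; l++) {
            act[l + 1] = new double[layerSizes[l + 1]];
            for (int j = 0; j < layerSizes[l + 1]; j++) {
                double sum = 0.0;
                for (int i = 0; i < layerSizes[l]; i++)
                    sum += weights[l][j][i] * act[l][i];
                act[l + 1][j] = sigmoid(sum);
            }
        }
        return act;
    }

    /** One back-propagation step for a single (input, target) pair. */
    public void train(double[] input, double[] target) {
        double[][] act = forward(input);
        int L = weights.length;
        double[] delta = new double[layerSizes[L]];
        for (int j = 0; j < delta.length; j++) {
            double o = act[L][j];
            delta[j] = (target[j] - o) * o * (1.0 - o); // output error times sigmoid derivative
        }
        for (int l = L - 1; l >= 0; l--) {
            double[] prevDelta = new double[layerSizes[l]];
            for (int j = 0; j < layerSizes[l + 1]; j++) {
                for (int i = 0; i < layerSizes[l]; i++) {
                    // accumulate the previous layer's delta before updating the weight
                    prevDelta[i] += weights[l][j][i] * delta[j] * act[l][i] * (1.0 - act[l][i]);
                    weights[l][j][i] += alpha * delta[j] * act[l][i]; // gradient step
                }
            }
            delta = prevDelta;
        }
    }
}
```

A network matching the best configuration in Section 4 would then be constructed as, for example, `new FeedForwardNet(new int[]{60, 25, 20, 10}, 0.05, 42L)`.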
4 Methods and Results
In this section, we describe the data set that we used for the study, the configurations of the neural network architecture that we used for our experiments, and the results of our test runs.
4.1 Data
The problem of handwritten character recognition has been studied intensively for over thirty years [7]. This research has been driven by the great number of potential applications, including the recognition of handwritten addresses for mail sorting purposes. Unlike the problem of machine-printed character recognition, the difficulties in handwritten character recognition come from the fact that people write differently, as can be seen in Figure 1.

Numerous methods have been proposed to capture the distinctive features of handwritten characters [2, 3, 6, 8]. These approaches can be classified into two categories: global analysis and structural analysis. In the first category, we find techniques such as template matching, moment invariants, and mathematical transforms (Fourier, Walsh, Hadamard, etc.). In the second category, the main goal is to capture the essential shape features of the characters, mainly from their skeletons or contours. Such features include loops, endpoints, junctions, arcs, concavities, convexities, and strokes.

The main challenges in creating a feature-based classifier are: (1) what type of features to use, (2) how many features to use, (3) how to select the "best" features, and (4) how to define criteria for selecting the "best" features.

The use of feature vectors as input to handwritten digit classifiers has many advantages over the use of the two-dimensional digit images. The most important advantage is the reduction of the dimension of the input space. The smallest "reasonable" dimensions of a digit image are 16×16, which corresponds to an input vector of dimension 256. Such input dimensions would produce a huge neural network if multiple hidden layers are used. The use of features in this specific case reduces the dimension of the input space significantly.
Figure 1. Representative samples from the data set.
In our experiments we used a data set that consisted of 6000 unconstrained binary images of handwritten digits (600 images from each digit class) from actual USPS mail pieces. The data was collected at the Environmental Research Institute of Michigan (ERIM) and has been used by many researchers to test their handwriting recognition systems [3]. All images were moment-normalized to a size of 24×18 as described in [2]. A subset of this data set (2000 images) was used for training our system and a different subset (3500 images) was used for testing. The results we report in this paper were achieved with the testing data set.

Three sets of features were used to train and test the neural networks. Each set contains 60 features, which have values in the [0, 1] range and represent the degree of correlation between a feature's template and the input image. A value of 0 indicates that a feature template did not match the digit image; a value of 1 indicates a complete match; and values in between these two extremes indicate a partial match.
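The exact template-matching computation is defined in [3]; purely as an illustration of how a feature value in [0, 1] can be obtained, the following hypothetical Java sketch (our own, assuming binary images and binary feature templates of identical dimensions) computes the fraction of template pixels matched by the image:

```java
/**
 * Hypothetical illustration of a correlation-style feature value in [0, 1].
 * Assumes a binary digit image and a binary feature template of the same
 * size; the actual feature computation used in the paper is given in [3].
 */
public final class TemplateFeature {
    /** Fraction of template pixels matched by the image: 0 = no match, 1 = complete match. */
    public static double correlate(int[][] image, int[][] template) {
        int matched = 0, templatePixels = 0;
        for (int r = 0; r < template.length; r++) {
            for (int c = 0; c < template[r].length; c++) {
                if (template[r][c] == 1) {
                    templatePixels++;
                    if (image[r][c] == 1) matched++;
                }
            }
        }
        return templatePixels == 0 ? 0.0 : (double) matched / templatePixels;
    }
}
```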
These 60 "best" features were selected using three measures: an information measure, an orthogonality measure, and a combination of the two. The information measure is based on Shannon's entropy. It is a statistical measurement of the capability of a feature to separate digit classes. If the information measure of a feature is high, the feature is able to separate at least some of the classes well, since the conditional probabilities of the classes given that the feature is either present or not present are significantly different from each other. The orthogonality measure is based on the known mathematical property that the best basis for representing an n-dimensional space consists of n orthogonal vectors (i.e., vectors whose dot products equal 0). The orthogonality measure yields features that do not respond similarly to an input image, i.e., the response of one feature is high when the other is low on some classes, and vice-versa on all the other classes. Thus the features selected based on the orthogonality measure provide discrimination capability, since they behave differently on each class. The third measure we used is a combination of the information and orthogonality measures. More details about the measures and selection process can be found in [3].
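For concreteness, the following Java sketch is our own illustration of what these two measures could look like; the definitions actually used in the study are those in [3]. The information measure is rendered here as the mutual information between the ten digit classes and a feature binarized by a presence threshold, and the orthogonality measure as one minus the absolute cosine similarity of two feature-response vectors:

```java
/** Our own illustrative renditions of the two selection measures; see [3] for the real definitions. */
public final class FeatureMeasures {
    private static double h(double p) { // one -p*log2(p) term of an entropy sum
        return p <= 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2.0));
    }

    /** Entropy of a discrete distribution given event counts. */
    private static double entropy(int[] counts, int total) {
        double e = 0.0;
        for (int c : counts) e += h((double) c / total);
        return e;
    }

    /**
     * Mutual information I(class; feature) between the 10 digit classes and a
     * feature thresholded into present (value >= t) or absent. values[i] is the
     * feature response on sample i, labels[i] its class in 0..9. High values mean
     * the class distributions given "present" vs "absent" differ markedly.
     */
    public static double informationMeasure(double[] values, int[] labels, double t) {
        int n = values.length, present = 0;
        int[] classCounts = new int[10], presentCounts = new int[10], absentCounts = new int[10];
        for (int i = 0; i < n; i++) {
            classCounts[labels[i]]++;
            if (values[i] >= t) { presentCounts[labels[i]]++; present++; }
            else absentCounts[labels[i]]++;
        }
        int absent = n - present;
        double hCond = 0.0; // H(class | feature)
        if (present > 0) hCond += (double) present / n * entropy(presentCounts, present);
        if (absent > 0)  hCond += (double) absent / n * entropy(absentCounts, absent);
        return entropy(classCounts, n) - hCond; // H(class) - H(class | feature)
    }

    /** Orthogonality of two feature-response vectors: 1 - |cosine similarity|. */
    public static double orthogonalityMeasure(double[] f1, double[] f2) {
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (int i = 0; i < f1.length; i++) {
            dot += f1[i] * f2[i];
            n1 += f1[i] * f1[i];
            n2 += f2[i] * f2[i];
        }
        return 1.0 - Math.abs(dot) / (Math.sqrt(n1) * Math.sqrt(n2) + 1e-12);
    }
}
```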
4.2 Experiments
We ran the networks on the handwritten digit data, training on the first 2000 digits in each file and testing on the final 3500 digits. For a comparison between architectures, we ran tests using two hidden layers with 25 and 20 nodes, two hidden layers with 25 and 15 nodes, one hidden layer with 25 nodes, and one hidden layer with 15 nodes, respectively. The other parameters of the network were fixed for all runs of all architectures. After trial runs with different values, a learning rate of 0.05 was used, as it was small enough to avoid saturating the network but large enough to provide ample learning. The number of training epochs was set at 500, as experimentation showed this to provide the best testing results, compared to 200 epochs (under-trained) and 750 and 1000 epochs (over-trained). Finally, for all tests the firing threshold for the activation function of the hidden neurons was set at 0.9, which experimentally gave the best results.
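The paper states that these settings are passed as command-line parameters but does not give the exact syntax; a hypothetical invocation reflecting the settings above (all flag and file names are our own invention) might look like:

```
java NeuralNet --inputs 60 --hidden 25,20 --outputs 10 \
     --learning-rate 0.05 --threshold 0.9 --epochs 500 \
     --train digits-train.dat --test digits-test.dat
```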
Table 1. Summary of performance results for each architecture configuration and each feature data set.

Hidden     Info      Time     Orthogonal  Time     Combination  Time
Layer(s)   Features  (mm:ss)  Features    (mm:ss)  Features     (mm:ss)
25/20      90.63%    04:18    93.77%      04:20    93.74%       04:15
25/15      90.83%    03:56    93.34%      03:59    93.34%       03:59
25         90.54%    03:18    93.57%      03:19    93.43%       03:20
15         89.91%    02:08    92.86%      02:08    92.83%       02:08
4.3 Results
The results for the experiments are summarized in Table 1. The table shows the percentage of handwritten digit samples correctly classified for different configurations of the architecture. For each row, the first column shows the architecture of the network used. The first two rows both used two hidden layers; the last two used only one hidden layer. The remaining columns consist of accuracy and time pairs, giving the percentage of the test data correctly classified and the amount of time that the network took to train and test, respectively. Each architecture and data set combination was run 5 times, and the best result is presented for each configuration of the architecture that was tested.

Based on previous work done on the data sets, the two hidden layer architecture with 25 nodes in the first layer and 15 nodes in the second layer was expected to perform best. For the information features data set, this held true, but for the other two data sets, it actually performed worse than both the 25/20 two hidden layer architecture and the single hidden layer architecture with 25 nodes. For those data sets, the 25/20 architecture outperformed the other network architectures, with only a minor increase in time required over the 25/15 architecture. With slightly reduced performance, however, the single hidden layer with 25 nodes performs nearly as well as the 25/20 architecture, but reduces the running time by as much as a minute.
5 Conclusions
Overall, the implemented system was able to successfully classify the test data using each of the three feature sets. The single hidden layer network with 15 nodes performed better on the orthogonal and combination features data sets, achieving prediction accuracies of 92.86% and 92.83%, respectively, than on the information features set. The information set did not perform as well because some of its features were almost identical and hence did not provide more information to the network. Adding more nodes to the single hidden layer improved the accuracy slightly. With 25 nodes in the hidden layer, the accuracy improved to 90.54% for the information features, 93.57% for the orthogonal features, and 93.43% for the combination features data set. This slight improvement in accuracy comes at the price of an increase in time requirements, since the running time increased by over a minute. It is interesting to note that expanding the architecture to include two hidden layers with 25 nodes in the first layer and 15 nodes in the second layer increases the running time but does not improve the classification accuracy for the orthogonal and combination features data sets, and has only a slight benefit for the information features data set, improving the accuracy from 90.54% to 90.83%. Increasing the number of nodes in the second hidden layer from 15 to 20 provides the best classification accuracy for the orthogonal features data set and the combined features data set, namely 93.77% and 93.74% respectively, but does not improve the prediction rate for the information features data.

The orthogonal features data set and the combined features data set provided better results for all configurations of the network than the information features data set. This was somewhat expected, since the selection process can theoretically yield very similar information features which do not provide extra class discrimination to the neural network. This possibility is not present in the orthogonality and combined features, since the feature selection process using these two measures inherently discourages similar features. Both the orthogonal and combination features data sets provided fairly comparable results, in terms of both accuracy and running time. The single hidden layer with 25 nodes performs nearly as well as the two hidden layer architecture with 25 and 20 nodes on all three data sets, but reduces the running time by as much as a minute.
We would like to continue testing the developed system with other large data sets for handwritten digit and character recognition. We are also interested in exploring the combination of different machine learning techniques for handwritten digit recognition and similar character recognition problems. More specifically, we are interested in developing a system that integrates a decision tree learning technique with a neural network architecture, and testing it with the three data sets used in this study to compare the performance of such a system with the results of using a decision tree or neural network separately. We anticipate that such a technique, fusing both machine learning approaches, will yield better results than using each technique individually, and will allow the recognition of larger, more complex data sets.
6 References
[1] E. Baum and D. Haussler, "What Size Net Gives Valid Generalization?" Neural Computation, vol. 1, no. 1, pp. 151-160, 1989.

[2] P. D. Gader, B. Forester, M. Ganzberger, A. Gillies, B. Mitchell, M. Whalen, and T. Yocum, "Recognition of Handwritten Digits Using Template and Model Matching," Pattern Recognition, vol. 24, pp. 421-432, 1991.

[3] P. D. Gader and M. A. Khabou, "Automatic Feature Generation for Handwritten Digit Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1256-1262, 1996.

[4] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Publishing Co., 1994.

[5] K. Hornik, M. Stinchcombe, and H. White, "Universal Approximation of an Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks," Neural Networks, vol. 3, pp. 551-560, 1990.

[6] C. Y. Suen, "Distinctive Features in the Automatic Recognition of Handprinted Characters," Signal Processing, vol. 4, pp. 193-207, 1982.

[7] C. Y. Suen, "Character Recognition by Computer and Application," Handbook of Pattern Recognition and Image Processing, Academic Press, pp. 569-586, 1986.

[8] C. Y. Suen, C. Nadal, R. Legault, T. A. Mai, and L. Lam, "Computer Recognition of Unconstrained Handwritten Numerals," Proceedings of the IEEE, vol. 80, no. 7, pp. 1162-1180, 1992.