A Comparison of Neural Network Architectures for
Handwritten Digit Recognition
Eman M. El-Sheikh1, Bradley A. Swain1, and Mohamed A. Khabou2
1Department of Computer Science, University of West Florida, Pensacola, FL, USA
2Department of Electrical & Computer Engineering, University of West Florida, Pensacola, FL, USA
Abstract - In this paper, we describe the development and use of an artificial neural network architecture for recognizing handwritten digit data. The feed-forward neural network, which was implemented in Java, was designed to be modular and parameterized so that various parameters and settings of the network could be easily modified. The network was used to recognize a large set of data representing handwritten digits. Various configurations of the network were run on the data set to determine the best settings for this type of data set. A correct classification rate of 93.77% was achieved on a testing set containing 3500 images not used in the training phase. We present a summary and analysis of the results.
Keywords: Machine learning, neural network,
handwritten digit recognition, classification, Java
application
1 Introduction
The focus of this work is the development, use, and analysis of an artificial neural network architecture for the recognition of handwritten digit data. Neural networks have a long track record of successful use for pattern recognition problems, including character recognition. The application of recognizing handwritten digit data was selected for this study due to the numerous potential applications, including mail sorting, automated check reading, and data entry for hand-held devices.

In this paper, we describe the development of a software neural network architecture. The system, which was implemented using the Java programming language, was designed to be modular and parameterized to allow easy modification of the network's size, structure, and parameters. The overall motivation for our study is the analysis of various machine learning techniques for real-world applications with the purpose of determining the appropriateness and usefulness of each technique. Specifically, for the study described in this paper, we focused on the use of neural network learning techniques for handwritten digit recognition. Our objective was two-fold: to test the neural network implementation on a large-scale, real-world data set using a variety of network parameters and sizes, and to determine the best network structure and settings for the handwritten digit data set. The results provide evidence for the use of neural network learning techniques for this type of application and data set, and insight about the appropriate design, use, and parameterization of the network to achieve acceptable results.
In the next section, we provide a brief summary of the literature relevant to neural network techniques and applications. Section 3 describes the design, implementation, and use of the neural network. Section 4 describes the methods used for our tests, including a description of the data set, the experiments we conducted, and the results of our study. The last section includes the conclusions derived and lessons learned from analyzing the results, as well as plans for future work.
2 Literature Review
The use of neural networks in pattern recognition problems dates back to the late 1950s, when the early perceptron models were introduced by Rosenblatt. In the 1960s and 1970s, progress in neural networks slowed due to limited computing capabilities and unfounded perceptions concerning the limitations of neural networks. The enhancement of computing capabilities in the 1980s and the emphatic support of Hopfield, Rumelhart, and other researchers accelerated the development of new neural network models and theories. Today, numerous neural network models and algorithms are available, including back-propagation learning, competitive learning, Kohonen learning, Hopfield networks, Boltzmann machines, shared-weight neural networks, and self-organizing neural networks [4].
Neural networks have been used in a variety of applications ranging from stock price prediction to automatic target detection and recognition. They are especially useful as pattern classifiers and/or recognizers because of their robustness, fault tolerance, and universal function approximation capability. In theory, a multi-layer feed-forward neural network with enough nodes and hidden layers can approximate any bounded function and its derivative to any arbitrary accuracy, given a "large enough" training set [5]. Results published in [1] state that a network with N inputs, one hidden layer of H units, and a total of W weights will require on the order of W/ε training patterns to yield an error less than ε on the test set.
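As a rough back-of-the-envelope illustration of this bound (our own arithmetic, not taken from [1]; bias weights ignored and ε = 0.1 chosen arbitrarily), consider the largest network tested in Section 4.2, with 60 inputs, hidden layers of 25 and 20 units, and 10 outputs:

```latex
W = 60 \cdot 25 + 25 \cdot 20 + 20 \cdot 10 = 2200,
\qquad
\frac{W}{\epsilon} = \frac{2200}{0.1} = 22{,}000
\ \text{training patterns for a target test error of } \epsilon = 0.1.
```

The 2000-pattern training set used in this study is an order of magnitude smaller than this figure, yet test accuracies above 90% were still achieved, which suggests the bound is conservative for this particular data set.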
3 Neural Network Architecture
The neural network application used for this paper was written using the Sun Java 2 Runtime Environment (v 1.5.0), mainly on Mac OS X 10.5.1, but has been successfully used on several Linux distributions as well. Theoretically, the program should work on any platform for which a Java virtual machine exists, as it uses no platform-specific features or calls. The design of the application attempted to make as few assumptions as possible. All attributes, input, and output values are stored as real numbers, allowing for both continuous and discrete attributes. Missing attribute values, however, are not handled by the implementation. The application uses a simple feed-forward network trained with back-propagation, employs the sigmoid activation function, and is designed to allow the substitution of a different activation function. The system was designed to be modular and easily re-configured. Configurable attributes of the network include the number of input neurons, the number of hidden layers, the number of neurons in each hidden layer, the number of output neurons, the learning rate (α), the threshold at which the value of the sigmoid function will cause the neurons to fire, and the number of epochs for which to train the network, all of which are passed into the application by way of command-line parameters. The layers are fully connected, with every node in a layer connected to every node in the following layer, and the initial weights for these connections are initialized to random values in the [-1, 1] range. A minimal sketch of this structure follows.
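The paper's source code is not reproduced here; the following Java sketch is our own illustration, with hypothetical class and method names, of the kind of parameterized structure described above: layer sizes supplied at construction, sigmoid activation, full connectivity between adjacent layers, weights initialized uniformly in [-1, 1], and a plain back-propagation update with learning rate α. Bias terms and the firing threshold are omitted for brevity.

```java
import java.util.Random;

/** Minimal, hypothetical sketch of the parameterized feed-forward network described above. */
public class FeedForwardNet {
    private final double[][][] weights; // weights[layer][toNode][fromNode]
    private final int[] layerSizes;     // e.g. {60, 25, 20, 10}
    private final double alpha;         // learning rate

    public FeedForwardNet(int[] layerSizes, double alpha, long seed) {
        this.layerSizes = layerSizes;
        this.alpha = alpha;
        Random rnd = new Random(seed);
        weights = new double[layerSizes.length - 1][][];
        for (int l = 0; l < weights.length; l++) {
            weights[l] = new double[layerSizes[l + 1]][layerSizes[l]];
            for (double[] row : weights[l])
                for (int j = 0; j < row.length; j++)
                    row[j] = rnd.nextDouble() * 2.0 - 1.0; // uniform in [-1, 1]
        }
    }

    private static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    /** Forward pass; returns the activations of every layer. */
    public double[][] forward(double[] input) {
        double[][] act = new double[layerSizes.length][];
        act[0] = input;
        for (int l = 0; l < weights.length; l++) {
            act[l + 1] = new double[layerSizes[l + 1]];
            for (int j = 0; j < layerSizes[l + 1]; j++) {
                double sum = 0.0;
                for (int i = 0; i < layerSizes[l]; i++)
                    sum += weights[l][j][i] * act[l][i];
                act[l + 1][j] = sigmoid(sum);
            }
        }
        return act;
    }

    /** One back-propagation step for a single (input, target) pair. */
    public void train(double[] input, double[] target) {
        double[][] act = forward(input);
        int L = weights.length;
        double[] delta = new double[layerSizes[L]];
        for (int j = 0; j < delta.length; j++) {
            double o = act[L][j];
            delta[j] = (target[j] - o) * o * (1.0 - o); // output error times sigmoid derivative
        }
        for (int l = L - 1; l >= 0; l--) {
            double[] prevDelta = new double[layerSizes[l]];
            for (int j = 0; j < layerSizes[l + 1]; j++) {
                for (int i = 0; i < layerSizes[l]; i++) {
                    // accumulate the previous layer's delta before updating the weight
                    prevDelta[i] += weights[l][j][i] * delta[j] * act[l][i] * (1.0 - act[l][i]);
                    weights[l][j][i] += alpha * delta[j] * act[l][i]; // gradient step
                }
            }
            delta = prevDelta;
        }
    }
}
```

A network matching the best configuration in Section 4 would then be constructed as, for example, `new FeedForwardNet(new int[]{60, 25, 20, 10}, 0.05, 42L)`.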
4 Methods and Results
In this section, we describe the data set that we used for the study, the configurations of the neural network architecture that we used for our experiments, and the results of our test runs.
4.1 Data
The problem of handwritten character recognition has been studied intensively for over thirty years [7]. This research has been driven by the great number of potential applications, including the recognition of handwritten addresses for mail sorting purposes. Unlike the problem of machine-printed character recognition, the difficulties in handwritten character recognition come from the fact that people write differently, as can be seen in Figure 1.

Numerous methods have been proposed to capture the distinctive features of handwritten characters [2, 3, 6, 8]. These approaches can be classified into two categories: global analysis and structural analysis. In the first category, we find techniques such as template matching, moment invariants, and mathematical transforms (Fourier, Walsh, Hadamard, etc.). In the second category, the main goal is to capture the essential shape features of the characters, mainly from their skeletons or contours. Such features include loops, endpoints, junctions, arcs, concavities, convexities, and strokes.

The main challenges in creating a feature-based classifier are: (1) what type of features to use, (2) how many features to use, (3) how to select the "best" features, and (4) how to define criteria for selecting the "best" features.

The use of feature vectors as input to handwritten digit classifiers has many advantages over the use of the two-dimensional digit images. The most important advantage is the reduction of the dimension of the input space. The smallest "reasonable" dimensions of a digit image are 16×16, which corresponds to an input vector of dimension 256. Such input dimensions would produce a huge neural network if multiple hidden layers are used. The use of features in this specific case reduces the dimension of the input space significantly.
Figure 1. Representative samples from the data set.
In our experiments we used a data set that consisted of 6000 unconstrained binary images of handwritten digits (600 images from each digit class) from actual USPS mail pieces. The data was collected at the Environmental Research Institute of Michigan (ERIM) and has been used by many researchers to test their handwriting recognition systems [3]. All images were moment-normalized to a size of 24×18 as described in [2]. A subset of this data set (2000 images) was used for training our system and a different subset (3500 images) was used for testing. The results we report in this paper were achieved with the testing data set.

Three sets of features were used to train and test the neural networks. Each set contains 60 features, which have values in the [0, 1] range and represent the degree of correlation between a feature's template and the input image. A value of 0 indicates that a feature template did not match the digit image; a value of 1 indicates a complete match; and values in between these two extremes indicate a partial match.
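The exact template-matching computation is defined in [3]; purely as an illustration of how a feature value in [0, 1] can be obtained, the following hypothetical Java sketch (our own, assuming binary images and binary feature templates of identical dimensions) computes the fraction of template pixels matched by the image:

```java
/**
 * Hypothetical illustration of a correlation-style feature value in [0, 1].
 * Assumes a binary digit image and a binary feature template of the same
 * size; the actual feature computation used in the paper is given in [3].
 */
public final class TemplateFeature {
    /** Fraction of template pixels matched by the image: 0 = no match, 1 = complete match. */
    public static double correlate(int[][] image, int[][] template) {
        int matched = 0, templatePixels = 0;
        for (int r = 0; r < template.length; r++) {
            for (int c = 0; c < template[r].length; c++) {
                if (template[r][c] == 1) {
                    templatePixels++;
                    if (image[r][c] == 1) matched++;
                }
            }
        }
        return templatePixels == 0 ? 0.0 : (double) matched / templatePixels;
    }
}
```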
These 60 "best" features were selected using three measures: an information measure, an orthogonality measure, and a combination of the two. The information measure is based on Shannon's entropy. It is a statistical measurement of the capability of a feature to separate digit classes. If the information measure of a feature is high, the feature is able to separate at least some of the classes well, since the conditional probabilities of the classes given that the feature is either present or not present are significantly different from each other. The orthogonality measure is based on the known mathematical property that the best basis for representing an n-dimensional space consists of n orthogonal vectors (i.e., vectors whose dot products equal 0). The orthogonality measure yields features that do not respond similarly to an input image, i.e., the response of one feature is high when the other is low on some classes, and vice-versa on all the other classes. Thus the features selected based on the orthogonality measure provide discrimination capability, since they behave differently on each class. The third measure we used is a combination of the information and orthogonality measures. More details about the measures and selection process can be found in [3].
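For concreteness, the following Java sketch is our own illustration of what these two measures could look like; the definitions actually used in the study are those in [3]. The information measure is rendered here as the mutual information between the ten digit classes and a feature binarized by a presence threshold, and the orthogonality measure as one minus the absolute cosine similarity of two feature-response vectors:

```java
/** Our own illustrative renditions of the two selection measures; see [3] for the real definitions. */
public final class FeatureMeasures {
    private static double h(double p) { // one -p*log2(p) term of an entropy sum
        return p <= 0.0 ? 0.0 : -p * (Math.log(p) / Math.log(2.0));
    }

    /** Entropy of a discrete distribution given event counts. */
    private static double entropy(int[] counts, int total) {
        double e = 0.0;
        for (int c : counts) e += h((double) c / total);
        return e;
    }

    /**
     * Mutual information I(class; feature) between the 10 digit classes and a
     * feature thresholded into present (value >= t) or absent. values[i] is the
     * feature response on sample i, labels[i] its class in 0..9. High values mean
     * the class distributions given "present" vs "absent" differ markedly.
     */
    public static double informationMeasure(double[] values, int[] labels, double t) {
        int n = values.length, present = 0;
        int[] classCounts = new int[10], presentCounts = new int[10], absentCounts = new int[10];
        for (int i = 0; i < n; i++) {
            classCounts[labels[i]]++;
            if (values[i] >= t) { presentCounts[labels[i]]++; present++; }
            else absentCounts[labels[i]]++;
        }
        int absent = n - present;
        double hCond = 0.0; // H(class | feature)
        if (present > 0) hCond += (double) present / n * entropy(presentCounts, present);
        if (absent > 0)  hCond += (double) absent / n * entropy(absentCounts, absent);
        return entropy(classCounts, n) - hCond; // H(class) - H(class | feature)
    }

    /** Orthogonality of two feature-response vectors: 1 - |cosine similarity|. */
    public static double orthogonalityMeasure(double[] f1, double[] f2) {
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (int i = 0; i < f1.length; i++) {
            dot += f1[i] * f2[i];
            n1 += f1[i] * f1[i];
            n2 += f2[i] * f2[i];
        }
        return 1.0 - Math.abs(dot) / (Math.sqrt(n1) * Math.sqrt(n2) + 1e-12);
    }
}
```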
4.2 Experiments
We ran the networks on the handwritten digit data, training on the first 2000 digits in each file and testing on the final 3500 digits. For a comparison between architectures, we ran tests using two hidden layers with 25 and 20 nodes, two hidden layers with 25 and 15 nodes, one hidden layer with 25 nodes, and one hidden layer with 15 nodes, respectively. The other parameters of the network were fixed for all runs of all architectures. After trial runs with different values, a learning rate of 0.05 was used, as it was small enough to avoid saturating the network but large enough to provide ample learning. The number of training epochs was set at 500, as experimentation showed this to provide the best testing results, compared to 200 epochs (under-trained) and 750 and 1000 epochs (over-trained). Finally, for all tests the firing threshold for the activation function of the hidden neurons was set at 0.9, which experimentally gave the best results.
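The paper states that these settings are passed as command-line parameters but does not give the exact syntax; a hypothetical invocation reflecting the settings above (all flag and file names are our own invention) might look like:

```
java NeuralNet --inputs 60 --hidden 25,20 --outputs 10 \
     --learning-rate 0.05 --threshold 0.9 --epochs 500 \
     --train digits-train.dat --test digits-test.dat
```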
Table 1. Summary of performance results for each architecture configuration and each feature data set.

Hidden     Info      Time     Orthogonal  Time     Combination  Time
Layer(s)   Features  (mm:ss)  Features    (mm:ss)  Features     (mm:ss)
25/20      90.63%    04:18    93.77%      04:20    93.74%       04:15
25/15      90.83%    03:56    93.34%      03:59    93.34%       03:59
25         90.54%    03:18    93.57%      03:19    93.43%       03:20
15         89.91%    02:08    92.86%      02:08    92.83%       02:08
4.3 Results
The results for the experiments are summarized in Table 1. The table shows the percentage of handwritten digit samples correctly classified for different configurations of the architecture. For each row, the first column shows the architecture of the network used. The first two rows both used two hidden layers; the last two used only one hidden layer. The remaining columns consist of accuracy and time pairs, giving the percentage of the test data correctly classified and the amount of time that the network took to train and test, respectively. Each architecture and data set combination was run 5 times, and the best result is presented for each configuration of the architecture that was tested.

Based on previous work done on the data sets, the two hidden layer architecture with 25 nodes in the first layer and 15 nodes in the second layer was expected to perform best. For the information features data set, this held true, but for the other two data sets, it actually performed worse than both the 25/20 two hidden layer architecture and the single hidden layer architecture with 25 nodes. For those data sets, the 25/20 architecture outperformed the other network architectures, with only a minor increase in time required over the 25/15 architecture. With slightly reduced performance, however, the single hidden layer with 25 nodes performs nearly as well as the 25/20 architecture, but reduces the running time by as much as a minute.
5 Conclusions
Overall, the implemented system was able to successfully classify the test data using each of the three feature sets. The single hidden layer network with 15 nodes performed better on the orthogonal and combination features data sets, achieving prediction accuracies of 92.86% and 92.83%, respectively, than on the information features set. The information set did not perform as well because some of its features were almost identical and hence did not provide more information to the network. Adding more nodes to the single hidden layer improved the accuracy slightly. With 25 nodes in the hidden layer, the accuracy improved to 90.54% for the information features, 93.57% for the orthogonal features, and 93.43% for the combination features data set. This slight improvement in accuracy comes at the price of an increase in time requirements, since the running time increased by over a minute. It is interesting to note that expanding the architecture to include two hidden layers with 25 nodes in the first layer and 15 nodes in the second layer increases the running time but does not improve the classification accuracy for the orthogonal and combination features data sets, and has only a slight benefit for the information features data set, improving the accuracy from 90.54% to 90.83%. Increasing the number of nodes in the second hidden layer from 15 to 20 provides the best classification accuracy for the orthogonal features data set and the combined features data set, namely 93.77% and 93.74% respectively, but does not improve the prediction rate for the information features data.

The orthogonal features data set and the combined features data set provided better results for all configurations of the network than the information features data set. This was somewhat expected, since the selection process can theoretically yield very similar information features which do not provide extra class discrimination to the neural network. This possibility is not present in the orthogonality and combined features, since the feature selection process using these two measures inherently discourages similar features. Both the orthogonal and combination features data sets provided fairly comparable results, in terms of both accuracy and running time. The single hidden layer with 25 nodes performs nearly as well as the two hidden layer architecture with 25 and 20 nodes on all three data sets, but reduces the running time by as much as a minute.
We would like to continue testing the developed system with other large data sets for handwritten digit and character recognition. We are also interested in exploring the combination of different machine learning techniques for handwritten digit recognition and similar character recognition problems. More specifically, we are interested in developing a system that integrates a decision tree learning technique with a neural network architecture, and testing it with the three data sets used in this study to compare the performance of such a system with the results of using a decision tree or neural network separately. We anticipate that such a technique, fusing both machine learning approaches, will yield better results than using each technique individually, and will allow the recognition of larger, more complex data sets.
6 References
[1] E. Baum and D. Haussler, "What Size Net Gives Valid Generalization?" Neural Computation, vol. 1, no. 1, pp. 151-160, 1989.

[2] P. D. Gader, B. Forester, M. Ganzberger, A. Gillies, B. Mitchell, M. Whalen, and T. Yocum, "Recognition of Handwritten Digits Using Template and Model Matching," Pattern Recognition, vol. 24, pp. 421-432, 1991.

[3] P. D. Gader and M. A. Khabou, "Automatic Feature Generation for Handwritten Digit Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1256-1262, 1996.

[4] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Publishing Co., 1994.

[5] K. Hornik, M. Stinchcombe, and H. White, "Universal Approximation of an Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks," Neural Networks, vol. 3, pp. 551-560, 1990.

[6] C. Y. Suen, "Distinctive Features in the Automatic Recognition of Handprinted Characters," Signal Processing, vol. 4, pp. 193-207, 1982.

[7] C. Y. Suen, "Character Recognition by Computer and Application," Handbook of Pattern Recognition and Image Processing, Academic Press, pp. 569-586, 1986.

[8] C. Y. Suen, C. Nadal, R. Legault, T. A. Mai, and L. Lam, "Computer Recognition of Unconstrained Handwritten Numerals," Proceedings of the IEEE, vol. 80, no. 7, pp. 1162-1180, 1992.