
Mode Detection in Online Handwritten Documents Using BLSTM Neural Networks

Emanuel Indermühle∗, Volkmar Frinken† and Horst Bunke∗

∗Institute of Computer Science and Applied Mathematics, University of Bern, CH-3012 Bern, Switzerland
Email: {eindermu, bunke}@iam.unibe.ch

†Computer Vision Center, Autonomous University of Barcelona, Edifici O, 08193 Bellaterra, Barcelona, Spain
Email: vfrinken@cvc.uab.es

Abstract

Mode detection in online handwritten documents refers to the process of distinguishing different types of content, such as text, formulas, diagrams, or tables, from one another. In this paper a new approach to mode detection is proposed that uses bidirectional long short-term memory (BLSTM) neural networks. The BLSTM neural network is a novel type of recurrent neural network that has been successfully applied in speech and handwriting recognition. In this paper we show that it has the potential to significantly outperform traditional methods for mode detection, which are usually based on stroke classification. As a further advantage over previous approaches, the proposed system is trainable and does not rely on user-defined heuristics. Moreover, it can be easily adapted to new or additional types of modes by just providing the system with new training data.

1 Introduction

Mode detection in online handwritten documents refers to the identification of the content type the writer is drawing at every point in the process of document creation. Although drawing modes can be defined arbitrarily, the most common ones are text and non-text. The detection of different writing modes in online handwriting enables the selection of an appropriate system to further process the information. For example, handwritten text can be passed on to a handwriting recognizer, while non-text, e.g. graphical symbols or mathematical formulas, can be further processed by specialized recognition engines.

Mode detection in online handwritten documents is of growing significance due to the use of tablet computers, tablet-based input devices, and digital pens. The understanding and interpretation of such documents is a highly valuable goal, e.g. for the scenario of a smart meeting room [15], where it is desired to search, browse, and organize handwritten notes taken with digital pens during a meeting. One important difference to the offline modality is the linear character of the data, which binds elements related in a temporal context more than the spatial arrangement does.

A common approach to mode detection found in the literature is the analysis of individual strokes [12, 16, 18]. Jain et al. [12] proposed a linear classifier to distinguish between text and non-text strokes represented by only two features, viz. length and curvature. An accuracy of 98% is reported on their data set. The same method applied to the IAMonDo-DB [11], which is used in this paper, resulted in an accuracy of 91% [9]. Rossignol et al. [16] presented a system distinguishing text and three classes of non-textual elements on a database containing floorplans of bathrooms. In this work, too, the two features proposed by Jain et al. were used. However, classification is done with a partially linear decision function. Willems et al. [18] introduced a set of 12 features and showed that, depending on the classes selected for classification, a different subset of features works best. In a text vs. non-text distinction task, an accuracy of 99.2% was reached. The authors conducted their experiments on the data set used in [16] combined with text strokes from the UNIPEN database [8] and non-text strokes from Fonseca et al. [2]. In [9], only offline information was used to classify textual and non-textual connected components, achieving 94.4% on the IAMonDo-DB. In [1], not only features from the individual strokes are considered, but also the class of the previous stroke. In addition, information about the gaps between strokes was used. The accuracy of 95% on a private database shows the potential that lies in considering context information. In [14], a system based on the features proposed in [19] has been used with a kNN classifier to distinguish between text and non-text strokes. The system has been incorporated into a software development kit to build pen-based applications. The authors extended their system in [17] to use a multiple classifier system with different types of classifiers. New features and a fully worked out feature selection strategy are applied. Interestingly, the system is tested on the same database on which the experiments in this paper are run. This allows direct comparison. The best result achieved with the multiple classifier system is 97.0%. The best individual classifier, a kNN with k = 5 using the Mahalanobis distance, also achieves a classification accuracy of 97.0%.

2012 International Conference on Frontiers in Handwriting Recognition

One of the main problems that becomes evident when reviewing the literature is the use of different data sets, which prevents a fair comparison. In the current paper, we use the IAMonDo-DB, which has been made publicly available recently and might become a common ground for the analysis of online handwritten documents in Latin script.¹

In this paper we also propose the use of a BLSTM neural network for mode detection. Originally, this kind of neural network was used for speech recognition [7]. Recently it has been applied with remarkable success in the field of handwriting recognition [6]. Automatic transcription of online and offline handwriting could be improved without the need for word segmentation in either the test or the training phase. The flexibility of this system is also demonstrated by its application to keyword spotting [3, 10].

To apply the BLSTM neural network, the online handwriting data is not presented as a set of individual strokes, but as a stream of feature vectors. The neural network is then trained on this data to recognize patterns in the document and translate them into sequences of labels representing characters and non-text data. This makes it possible to predict the class of a stroke by considering these labels and the positions of their activation in the output stream.

¹ The IAMonDo-database is available online at http://www.iapr-tc11.org/mediawiki/index.php/IAM_Online_Document_Database_(IAMonDo-database)

Figure 1: Sample documents from the data set. Text ink is black, non-text ink is gray.

The rest of this paper is structured as follows. In Section 2 the database, its format, and the ground truth are introduced. In Section 3 we present the novel application of BLSTM to mode detection in online handwritten documents. Next, the experiments and their results are presented in Section 4. Finally, we draw conclusions in Section 5.

2 Data

The proposed procedure for mode detection was experimentally evaluated on the IAMonDo-database [11]. The data set consists of roughly 1,000 documents produced by 200 writers. The documents contain text in text blocks, lists, tables, formulas, and diagrams, as well as non-text in drawings and diagrams. About 72% of all strokes belong to text. Examples of these documents can be seen in Figure 1. Some of the documents are quite challenging regarding the proper extraction of text.

The digital ink is stored in terms of groups of successive sample points described by x and y coordinates, time, and pressure. Every time the pen is lifted from the paper, the points recorded so far are grouped together to build a stroke. On average, a document of the database contains 370 strokes and a stroke consists of 14 sample points.

In this paper the intention is to measure the ability to distinguish between text and non-text strokes. Hence a corresponding ground truth must be provided. The detailed annotation of the IAMonDo-database allows us to derive ground truth in a straightforward manner: the text class is assigned to strokes that are part of text blocks, lists, labels in diagrams, table content, and formulas. The remaining strokes are considered non-text.
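The stroke representation and the ground-truth rule just described can be sketched as follows. This is only an illustration under stated assumptions: the annotation names in `TEXT_ANNOTATIONS` and the `Stroke` class are hypothetical and do not reflect the actual identifiers or file format of the IAMonDo-database.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Annotation types that count as text, following the rule stated above.
# The string names are hypothetical; the database's own identifiers may differ.
TEXT_ANNOTATIONS = {"text_block", "list", "diagram_label", "table_content", "formula"}


@dataclass
class Stroke:
    # Sample points (x, y, time, pressure) recorded between pen-down and pen-up.
    points: List[Tuple[float, float, float, float]]
    # Type of the annotated element this stroke belongs to.
    annotation: str


def stroke_mode(stroke: Stroke) -> str:
    """Return 'text' for strokes inside text-bearing elements, else 'non-text'."""
    return "text" if stroke.annotation in TEXT_ANNOTATIONS else "non-text"


def text_fraction(strokes: List[Stroke]) -> float:
    """Share of strokes whose derived ground-truth mode is 'text'."""
    return sum(stroke_mode(s) == "text" for s in strokes) / len(strokes)
```

Applied to the whole database, `text_fraction` would correspond to the figure of about 72% text strokes reported above.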


3 BLSTM Based Mode Detection System

3.1 Preprocessing and Feature Extraction

The digital ink we are dealing with was generated by Anoto pens.² In order to save disk space, the digital pen compresses the ink by removing sample points holding redundant information. To get a data stream with a uniform sample rate, these points must be recovered from the compressed representation. Between two strokes, the pen does not touch the paper, and no data is recorded. Such gaps must be filled to obtain a continuous sequence of sampling points. This step is done by interpolating a straight line.
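A minimal sketch of the gap-filling step is given below. It assumes a constant sampling interval `dt` and points stored as `(x, y, t, force)` tuples; setting the force of interpolated points to 0 is an assumption that follows the convention that 0 means no contact between pen and paper. The function names are illustrative only.

```python
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (x, y, t, force)


def fill_gap(last: Point, first: Point, dt: float) -> List[Point]:
    """Interpolate pen-up samples on a straight line between two strokes.

    `last` is the final point of one stroke, `first` the initial point of the
    next, and `dt` the (assumed constant) sampling interval.
    """
    x0, y0, t0, _ = last
    x1, y1, t1, _ = first
    n = int(round((t1 - t0) / dt)) - 1  # number of missing samples
    gap = []
    for k in range(1, n + 1):
        a = k / (n + 1)  # fraction of the way from `last` to `first`
        # Assumption: interpolated points carry force 0 (pen not on paper).
        gap.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0), t0 + k * dt, 0.0))
    return gap


def to_stream(strokes: List[List[Point]], dt: float) -> List[Point]:
    """Concatenate strokes into one uniformly sampled point sequence."""
    stream: List[Point] = []
    for stroke in strokes:
        if stream:
            stream.extend(fill_gap(stream[-1], stroke[0], dt))
        stream.extend(stroke)
    return stream
```

Recovering the sample points that the pen removed during compression could be handled the same way, by interpolating inside a stroke instead of between strokes.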

Commonly used features for handwriting recognition of online documents, as described, for example, in [13], depend on text line segmentation. Features of this type do not fit our requirements since no segmentation can be performed beforehand. What we actually need are features extracted in the original writing order. We use seven features which satisfy this need. They are extracted from each sampling point $i$ using the following four properties: the force $f_i$, the coordinates $x_i$ and $y_i$, and the time stamp $t_i$. The list of the features is given below:

1. The pen force $f_i$, where 0 indicates no contact between pen and paper and 1 is the maximal force recorded. This feature is directly delivered by the Anoto pen, distinguishing 256 different values.

2. $\Delta x$ of the segment between point $i-1$ and $i+1$:

$$\Delta x_i = \frac{x_{i+1} - x_{i-1}}{d(i-1,\, i+1)} \qquad (1)$$

where $d(i, j)$ is the Euclidean distance between sample points $i$ and $j$.

3. $\Delta y$ of the segment between point $i-1$ and $i+1$:

$$\Delta y_i = \frac{y_{i+1} - y_{i-1}}{d(i-1,\, i+1)} \qquad (2)$$

4. Change of angle at point $i$: $\Delta\phi = \phi_i - \phi_{i-1}$, where

$$\phi_i = \arccos(\Delta x_i) + \pi\, I_{\Delta y_i > 0} \qquad (3)$$

and $I_{\Delta y_i > 0} \in \{0, 1\}$ is the indicator function which specifies whether $\Delta y_i > 0$.

5. The speed is given by the Euclidean distance between points $i-1$ and $i$ divided by the time in terms of sampling intervals:

$$\frac{d(i-1,\, i)}{(t_i - t_{i-1})\, r}$$

where $r$ is the sampling rate. This value is normalized to $[-1, 1]$ using the hyperbolic tangent.

6. Distance from the current sample point $i$ to the nearest point $nx_i$ where the digital ink crosses itself. As feature value, $\frac{1}{d(i,\, nx_i)}$ is chosen and normalized to $[-1, 1]$ by the hyperbolic tangent.

7. Number of such crossing points on the segment between points $i-1$ and $i$.

² http://www.anoto.com

These features have proven to work well for the handwriting recognition task, as we could demonstrate in [4]. As this method is based on a handwriting recognizer, it is an appropriate feature set. The seven features provide little more than what is needed to reconstruct the pen trajectory.
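To make the geometric features concrete, the following sketch computes features 2 to 5 for a list of `(x, y)` points. It is an illustration, not the implementation used in the experiments: the handling of zero-length segments and the `intervals` parameter (elapsed time measured in sampling intervals, 1 for a uniformly sampled stream) are assumptions.

```python
import math


def dist(p, q):
    """Euclidean distance d(i, j) between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def direction(points, i):
    """Features 2 and 3: normalized (dx, dy) of the segment from i-1 to i+1."""
    d = dist(points[i - 1], points[i + 1])
    if d == 0.0:  # assumption: a zero-length segment has no direction
        return 0.0, 0.0
    return ((points[i + 1][0] - points[i - 1][0]) / d,
            (points[i + 1][1] - points[i - 1][1]) / d)


def angle(points, i):
    """Eq. (3): phi_i = arccos(dx_i) + pi * [dy_i > 0]."""
    dx, dy = direction(points, i)
    # Clamp against rounding errors before taking the arc cosine.
    return math.acos(max(-1.0, min(1.0, dx))) + (math.pi if dy > 0 else 0.0)


def angle_change(points, i):
    """Feature 4: phi_i - phi_{i-1}; requires 2 <= i <= len(points) - 2."""
    return angle(points, i) - angle(points, i - 1)


def speed(points, i, intervals=1.0):
    """Feature 5: distance per sampling interval, squashed by tanh."""
    return math.tanh(dist(points[i - 1], points[i]) / intervals)
```

The crossing-based features 6 and 7 need a segment intersection search over the whole ink and are left out of this sketch.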

3.2 BLSTM Neural Networks

The considered system is based on a recently developed recurrent neural network, termed bidirectional long short-term memory (BLSTM) neural network [6]. Instead of simple nodes, the hidden layers are made up of so-called long short-term memory (LSTM) blocks. These memory blocks are specifically designed to address the vanishing gradient problem, which refers to the exponential increase or decay of values as they cycle through recurrent network layers. This is done by nodes that control the information flow into and out of each memory block. The input layer contains one node for each of the seven features, while the hidden layer consists of the LSTM cells and the output layer contains one node for each possible output label.

The network is bidirectional, which means that the input data sequence is fed into the network both ways, forward and backward. This is a great advantage because the mode of a stroke not only depends on the previous, but also on the following data. The bidirectional architecture is realized by two input and two hidden layers. One input and one hidden layer deal with the forward sequence, and the other input and hidden layer with the backward sequence. The output layer sums up the activation levels from both hidden layers at each position in the text. The output activations of the nodes in the output layer are then normalized to sum up to 1. Hence they can be treated as a vector indicating the probability for each label to occur at a particular position. A path through this probability vector sequence therefore corresponds to a sequence of labels. For more details about BLSTM networks we refer to [5, 6].
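The combination of the two directions in the output layer can be illustrated with a small sketch. It assumes that the contributions of the forward and backward hidden layers have already been projected onto the output labels, and it uses a softmax for the normalization; the text only states that the outputs sum to 1, so the softmax is an assumption here.

```python
import math


def softmax(z):
    """Normalize a list of activations so that they are positive and sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def output_layer(fwd, bwd):
    """Combine both directions into one probability vector per time step.

    fwd[t][k] and bwd[t][k] are the contributions of the forward and the
    backward hidden layer to output label k at time step t (both lists are
    assumed to be aligned to the same, forward, time axis).
    """
    return [softmax([f + b for f, b in zip(ft, bt)]) for ft, bt in zip(fwd, bwd)]
```

Each returned vector can then be read as the label probabilities at one position of the input stream.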


3.3 Training of BLSTM Neural Networks

For mode detection, the training of the BLSTMs is similar to the training for text or speech recognition. There is one difference, however, which consists in the generation of the training sequences. For text recognition the document is segmented into text lines and their feature vector sequences are used for training. Text lines, however, are not suited for mode detection since non-text elements must be part of the training data as well. Therefore, the documents are split into so-called slices, each containing 40 consecutive strokes. In order to have a valid label sequence as ground truth for the slices in the training set, strokes at the beginning and end of each slice are removed or added, respectively, until a slice contains only complete words. To further improve the training, the slices overlap each other by half of their strokes.
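The slicing scheme can be sketched as below. The function only computes stroke index ranges for slices of 40 strokes with 50% overlap; the adjustment of slice boundaries to complete words is omitted, and the function name and parameters are illustrative.

```python
def make_slices(n_strokes, size=40, overlap=0.5):
    """Return (start, end) stroke index pairs for overlapping slices.

    `end` is exclusive. The last slice may be shorter than `size`.
    """
    step = max(1, int(size * (1 - overlap)))  # 20 strokes for the default values
    slices = []
    start = 0
    while start < n_strokes:
        end = min(start + size, n_strokes)
        slices.append((start, end))
        if end == n_strokes:
            break
        start += step
    return slices
```

For an average document of 370 strokes this yields slices starting at strokes 0, 20, 40, and so on, so that most strokes appear in two training sequences.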

The labels used for training are the same as those used for handwriting recognition, i.e. every text character is represented by a label, and an additional ε label is introduced during the training process. Additionally, a non-text label is introduced for each non-text stroke. By using the ε label, the network tends to activate the other labels only for one or two sample points, and in between those sample points the ε label is activated. This results in a simpler label string where there is only one peak per recognized label. The setup described here is actually the same as the one used for keyword spotting in online handwritten documents [10].

3.4 Mode detection using BLSTM Neural Network

In handwriting or speech recognition the possible label sequence created by the system is restricted by the vocabulary and influenced by the language model. In mode detection no dynamic programming based decoding is needed. Instead, the label sequence can be retrieved directly. In addition, the time of a label's activation is stored as part of each label instance.

The algorithm for mode detection, which is described in the following, is also illustrated in Fig. 2. In the first step the sequence of labels is extracted by taking the label with the highest activation value at each time step. Then runs of the same label are replaced by just a single instance. Its position value is set to the position of the last label in the run. In the next step, the ε label is discarded, as its only purpose is to separate multiple instances of the same label. Then, runs of the white-space label are, again, replaced by their last instance. The labels of the remaining sequence are divided into three groups, viz. the text labels (every label which stands for a character), non-text labels, and white-space labels.

Figure 2: The different steps of the mode detection procedure using BLSTM. Legend for the label sequences: '#' is a non-text label, '.' is an ε label, ' ' is a run of ε labels of undefined length, ' ' is a whitespace label, '|' is a mode switch, 'ttt' denotes text mode, and '###' denotes non-text mode.

The points in time where the writing mode changes between text and non-text (referred to as mode-switches) can now be placed at the positions of white-space labels which lie between two labels of different modes. If two adjacent labels of different modes have no white-space label in between, then a mode-switch is placed at the sample point in the middle of the two activations. The mode-switches thus divide the digital ink into segments which are written in one single writing mode, text or non-text. The mode whose segments cover the majority of an individual stroke is chosen to be the predicted mode of that stroke.
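The decoding procedure of this section can be sketched in a few lines. The sketch makes several assumptions that are not part of the original description: the label names `"eps"`, `" "` and `"#"`, the representation of the network output as one dictionary of activations per time step, the use of `None` for interpolated pen-up samples, and the fallback to text mode when no label is activated at all.

```python
from collections import Counter
from itertools import groupby

EPS, SPACE, NON_TEXT = "eps", " ", "#"  # hypothetical label names


def mode_of(label):
    return "non-text" if label == NON_TEXT else "text"


def collapse(items):
    """Replace each run of equal labels by its last (label, position) instance."""
    return [list(run)[-1] for _, run in groupby(items, key=lambda it: it[0])]


def mode_switches(activations):
    """Return the initial mode and the sample positions of all mode-switches."""
    # Step 1: best label per time step; step 2: collapse runs; step 3: drop eps.
    best = [(max(a, key=a.get), t) for t, a in enumerate(activations)]
    seq = [it for it in collapse(best) if it[0] != EPS]
    # Step 4: merge runs of white-space labels, keeping the last instance.
    merged = []
    for label, t in seq:
        if label == SPACE and merged and merged[-1][0] == SPACE:
            merged[-1] = (label, t)
        else:
            merged.append((label, t))
    # Step 5: place a switch at the white-space between two labels of
    # different modes, or halfway between them if there is none.
    switches, first_mode, prev, space_pos = [], None, None, None
    for label, t in merged:
        if label == SPACE:
            space_pos = t
            continue
        if prev is None:
            first_mode = mode_of(label)
        elif mode_of(label) != mode_of(prev[0]):
            switches.append(space_pos if space_pos is not None else (prev[1] + t) // 2)
        prev, space_pos = (label, t), None
    return first_mode, switches


def classify_strokes(activations, stroke_ids):
    """Majority vote per stroke; stroke_ids[t] is None for pen-up samples."""
    first_mode, switches = mode_switches(activations)
    mode = first_mode or "text"  # assumption: default to text if nothing fired
    switch_set = set(switches)
    votes = {}
    for t, sid in enumerate(stroke_ids):
        if sid is not None:
            votes.setdefault(sid, Counter())[mode] += 1
        if t in switch_set:  # the mode changes after this sample point
            mode = "non-text" if mode == "text" else "text"
    return {sid: c.most_common(1)[0][0] for sid, c in votes.items()}
```

Because each stroke receives the mode that covers most of its sample points, a mode-switch that falls slightly inside a stroke does not change that stroke's predicted mode.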


4 Experiments and Results

4.1 Setup

From the database, 403 documents are used for training, 200 for validation, and 203 for testing. The division of the data into these three subsets was introduced in [11].

In the training phase, the following configuration was applied. Ten neural networks were trained, each with 100 hidden nodes. The training documents are divided into slices as described in Section 3.3. The slices of the documents in the validation set are used to stop the training iterations before over-fitting effects appear. The stopping criterion for the training is met at the epoch in which the label error rate on the validation set has not decreased for five epochs. This takes 27 epochs on average. More details on the training of BLSTM neural networks can be found in [6].
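The stopping criterion can be sketched as a generic early-stopping loop. The callables `train_epoch` and `validation_error` as well as the `max_epochs` safeguard are hypothetical placeholders; only the patience of five epochs is taken from the setup described above.

```python
def train_with_early_stopping(train_epoch, validation_error, patience=5, max_epochs=1000):
    """Train until the validation error has not decreased for `patience` epochs.

    `train_epoch()` runs one pass over the training slices and
    `validation_error()` returns the current label error rate on the
    validation slices. Returns the best epoch and its validation error.
    """
    best_error, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_error
```

In practice the network weights of the best epoch would be stored and restored after the loop ends; this bookkeeping is left out of the sketch.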

4.2 Results

With an accuracy of 97.01%, the BLSTM-based recognizer significantly improves on the recognition rate of the stroke-based method presented in [11], and it is slightly better than the results described in [17].

Fig. 3 shows parts of documents in which the system successfully solved difficult examples of the mode detection problem.

4.3 Common errors

Common errors of the BLSTM-based mode detection system are shown in Fig. 4. Errors mostly arise from non-text strokes that look like individual characters and are interpreted as text. Conversely, individual characters in diagram labels may be classified as non-text if their shape is similar to common non-text elements like arrows or other primitive geometrical shapes. This problem is hard to overcome, as often the right decision can only be made with contextual and semantic knowledge, which is outside the scope of the system.

Another problem is text that has been rotated. The system is not rotation invariant, but this can potentially be tackled in the future by using artificially rotated slices for training.

The third problem concerns formulas. In the experimental setup, formulas are considered to be text. The system, however, recognizes root symbols, fraction bars, and other large symbols (correctly) as non-text. As the database does not offer a more detailed annotation for formulas, this problem cannot be overcome.

Figure 3: Examples (a) and (b) with successfully detected writing mode. Grey refers to content written in non-text mode, while black denotes text mode.

Figure 4: Examples of errors in mode detection: (a) in diagrams, (b) rotated text, (c) formulas. In 4a small strokes in diagrams get confused, in 4b rotated text poses a problem, and in 4c symbols in formulas get mixed up. Grey refers to correctly classified content, while black denotes errors.


5 Conclusion

In this paper we present a system for mode detection in online handwritten documents based on BLSTM neural networks. We compared the accuracy of mode detection to an approach proposed previously in the literature. The error rate was reduced by 34%, which is a substantial improvement. The advantage of the proposed approach is that no heuristically defined values are needed. Instead, the system is completely based on training data.

The system presented in this paper is one specific application of BLSTM neural networks. As it requires only little effort to change the label sequence used for training, the system can be easily adapted to detecting other content types like gestures, arrows, or boxes. As the BLSTM technology allows one to train the amount of context that is to be taken into account, a future version might even be able to distinguish between tables, lists, labels in diagrams, and text blocks, for which more context has to be considered. Text line extraction by a specific line-break label also seems feasible.

Acknowledgement

We thank Alex Graves for kindly providing us with the BLSTM Neural Network source code. This work has been supported by the European project FP7-PEOPLE-2008-IAPP: 230653, the Spanish project TIN2009-14633-C03-03, and the Spanish MICINN under the MIPRCV "Consolider Ingenio 2010" CSD2007-00018 project.

References

[1] C. M. Bishop, M. Svensen, and G. E. Hinton. Distinguishing text from graphics in on-line handwritten ink. In Proc. 9th Int. Workshop on Frontiers in Handwriting Recognition, pages 142–147, Washington, DC, USA, 2004. IEEE Computer Society.

[2] M. J. Fonseca and J. A. Jorge. Experimental evaluation of an on-line scribble recognizer. Pattern Recognition Letters, 22:1311–1319, 2001.

[3] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A novel word spotting method based on recurrent neural networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 34(2):211–224, 2012.

[4] [...] Handschrifterkennung. Bachelor's thesis, University of Bern, 2010. (In German.)

[5] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequential data with recurrent neural networks. In Proc. 23rd Int. Conf. on Machine Learning, pages 369–376, 2006.

[6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 31(5):855–869, 2009.

[7] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(6):602–610, 2005.

[8] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In Proc. 12th Int. Conf. on Pattern Recognition, volume 2, pages 29–33, 1994.

[9] E. Indermühle, H. Bunke, F. Shafait, and T. Breuel. Text vs. non-text distinction in online handwritten documents. In Proc. of the 25th Annual ACM Symposium on Applied Computing, volume 1, pages 3–7, 2010.

[10] E. Indermühle, V. Frinken, A. Fischer, and H. Bunke. Keyword spotting in online handwritten documents containing text and non-text using BLSTM neural networks. In Proc. 11th Int. Conf. on Document Analysis and Recognition, 2011.

[11] E. Indermühle, M. Liwicki, and H. Bunke. IAMonDo-database: an online handwritten document database with non-uniform contents. In Proc. 9th Int. Workshop on Document Analysis Systems, pages 97–104, 2010.

[12] A. K. Jain, A. M. Namboodiri, and J. Subrahmonia. Structure in on-line documents. In Proc. 6th Int. Conf. on Document Analysis and Recognition, pages 844–848, 2001.

[13] M. Liwicki and H. Bunke. HMM-based on-line recognition of handwritten whiteboard notes. In Proc. 10th Int. Workshop on Frontiers in Handwriting Recognition, pages 595–599, 2006.

[14] M. Liwicki, M. Weber, and A. Dengel. Online mode detection for pen-enabled multi-touch interfaces. In Proc. 15th Conf. of the International Graphonomics Society, 2011.

[15] D. Moore. The IDIAP smart meeting room. Technical report, IDIAP-Com, 2002.

[16] S. Rossignol, D. Willems, A. Neumann, and L. Vuurpijl. Mode detection and incremental recognition. In Proc. 9th Int. Workshop on Frontiers in Handwriting Recognition, pages 597–602, 2004.

[17] M. Weber, M. Liwicki, Y. Schelske, C. Schoelzel, F. Strauß, and A. Dengel. MCS for online mode detection: Evaluation on pen-enabled multi-touch interfaces. In Proc. 11th Int. Conf. on Document Analysis and Recognition, pages 957–961, 2011.

[18] D. Willems, S. Rossignol, and L. Vuurpijl. Features for mode detection in natural online pen input. In Proc. of 12th Biennial Conf. of the Int. Graphonomics Society, pages 113–117, 2005.

[19] D. Willems and L. Vuurpijl. A Bayesian network approach to mode detection for interactive maps. In Proc. 9th Int. Conf. on Document Analysis and Recognition, volume 2, pages 869–873, 2007.
