ACKNOWLEDGEMENT
First and foremost, I would like to thank my supervisors, Dr. Pham Bao Son and Dr. Nguyen Phuong Thai, for their patient guidance, continuous support and encouragement through the years. They were always available and answered my questions carefully whenever I needed help. I am very grateful for their advice and teaching.
I would like to thank Nguyen Ba Dat for his constant help and his contributions to this work. I would like to thank the following people for reviewing parts of this thesis: Nguyen Quoc Dai and Hoang Duc Tam. Their tireless help, suggestions and comments were invaluable. I greatly appreciate the Human Machine Interaction Laboratory, UET/Coltech, for its support during my time there.
I would also like to thank my friend, Le Hong Thuan, for her kind help and encouragement.
Finally, I would like to thank my family for their love and understanding, which helped me to finish this thesis successfully. Thank you!
ABSTRACT
Human knowledge is huge and expanding every day. However, because almost all of it is available only in unstructured natural language documents, there is a great need for computing systems that extract information automatically. In such problem domains, Named Entities Recognition (NER) plays a central role in a successful information extractor. Approaches to NER can be divided into three groups: statistical approaches, grammar-based approaches and hybrid approaches. Statistical and hybrid approaches require a large annotated training corpus to achieve an acceptable result, and obtaining such corpora takes a lot of time and effort. Grammar-based approaches use experts’ knowledge to overcome the shortage of annotated corpora. Nonetheless, grammar-based approaches have problems of their own. Firstly, the system becomes difficult to maintain when a large number of rules are added. Secondly, because language changes day by day, grammar-based approaches become expensive to adapt to new domains or to extend with new knowledge.
In this thesis, we introduce an incremental knowledge acquisition method for NER. Although NER differs from a traditional classification problem, with this method we have successfully applied Ripple Down Rules (RDR), which is known as a favourable solution for classification problems. As a result, the method takes advantage of RDR by acquiring knowledge incrementally without breaking the consistency of the existing system. With the RDR structure, our system can be adapted to other domains more easily and effectively. It can also keep up with changes in language.
Moreover, this thesis introduces an implementation on the GATE framework that uses JAPE grammars to reduce the effort of creating a new knowledge base. Experiments show that knowledge is acquired continuously without breaking the consistency of the existing knowledge base. The current knowledge base achieves an F-measure of 82% on an existing Vietnamese corpus.
Keywords: incremental knowledge acquisition, Named Entities Recognition, Ripple Down Rules
LIST OF FIGURES
Figure 2.1: An SCRDR example
Figure 2.2: GATE’s architecture
Figure 2.3: An annotation graph
Figure 3.1: Our system’s overview
Figure 3.2: An example of Tokenizer
Figure 3.3: An example of Gazetteer
Figure 3.4: An example of NE annotations
Figure 3.5: The NER Module
Figure 3.6: An example of all received NE annotations
Figure 3.7: The structure of RDR Module
Figure 4.1: The change in F-measure between layers
Figure 4.2: The performance of our system after every 20 rules
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Van Bong
INCREMENTAL KNOWLEDGE ACQUISITION FOR NAMED ENTITIES RECOGNITION
Major: Computer Science
Supervisor: Dr. Pham Bao Son
Co-Supervisor: Dr. Nguyen Phuong Thai
HA NOI - 2012
“I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference or acknowledgement is made.”
Signature:………
SUPERVISOR’S APPROVAL
“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Bachelor of Computer Science degree at the University of Engineering and Technology.”
Signature:………
LIST OF ABBREVIATIONS
CRF Conditional Random Fields
GATE General Architecture for Text Engineering
NER Named Entities Recognition
NLP Natural Language Processing
POS Part of Speech
Chapter 1 INTRODUCTION
1.1 Named Entities Recognition
Human knowledge is enormous and expanding every day. With the explosion of the Internet, this resource is now widely shared and easier to access. However, almost all of it is unstructured and stored as natural language documents, so it is difficult to collect and use effectively.
As a result, the problem of automatically extracting information from such documents and storing it in a structured form has attracted many researchers in natural language processing.
Named Entities Recognition (NER) plays a central role in a successful information extractor. It consists of two smaller problems: locating and categorizing all mentions of named entities in a textual document. Some popular kinds of named entities are [4,14]:
- Names of people, locations and organizations
- Dates and times
- Currency amounts and percentages
Furthermore, depending on the problem and domain, there are some additional domain-specific named entities.
As with many other natural language processing (NLP) problems, there are three common kinds of approaches:
- Grammar-based approaches [4,15]
- Statistical approaches [14]
- Hybrid approaches [8]
However, statistical and hybrid approaches require a large annotated corpus to achieve a good result. Grammar-based approaches overcome this difficulty by using experts’ observations, but the system requires a great deal of time and effort to maintain and expand once the number of rules is relatively large. In this thesis, we introduce a method that addresses those difficulties with a well-structured system of rules.
1.2 Ripple Down Rules
In terms of maintaining a grammar-based system, Ripple Down Rules (RDR) is known as one of the best solutions for reducing unwanted effects when adding new rules [22]. RDR was originally invented to solve the problem of maintaining a large rule-based knowledge system for classification problems. Before the invention of RDR, little attention was paid to the structure of a rule-based system or to how it should be maintained. As a result, a lot of time and effort was wasted on adding new knowledge or adapting an existing system to new domains. RDR is different: it maintains the system by acquiring new knowledge incrementally, and it strictly organizes rules in the form of a tree. Consequently, RDR has been successful in handling single classification problems across a wide range of domains such as help-desk support, email classification, RoboCup and natural language processing. However, there has not been any research on applying RDR to the NER problem, because NER involves not only classifying a named entity but also locating its boundary.
1.3 The contributions of the thesis
In this thesis, we introduce a new method for recognizing named entities by building an incremental knowledge acquisition system. The method addresses the above difficulties as follows:
- By turning the NER problem into a classification problem, the RDR structure can be applied successfully.
- As a result, the method takes advantage of RDR: the system keeps its consistency and avoids unwanted interactions between rules while acquiring new knowledge.
- Because it acquires knowledge incrementally, the system is easily adapted to new domains and can keep up with changes in natural language.
In addition, we present an implementation using JAPE grammars on the GATE (General Architecture for Text Engineering) framework [7]. This implementation is a language processing resource in GATE that attempts to recognize some common kinds of named entities in a Vietnamese sentence.
1.4 Thesis organization
The rest of this thesis consists of four chapters.
In chapter 2, we review the literature, including the NER problem and related work. This chapter also gives an overview of RDR, focusing on the advantages of this structure. The last part of the chapter reviews GATE and JAPE grammars, the framework on which our implementation is built.
In chapter 3, we propose our incremental knowledge acquisition method for named entities recognition. This chapter also focuses on the rule language used in our system.
Chapter 4 presents our experiments on Vietnamese. The experimental results and the errors are analyzed in this chapter.
Finally, chapter 5 summarizes the contributions of this thesis and outlines future research on our method.
Chapter 2 LITERATURE REVIEW
In this chapter, we review the literature related to our work. Section 2.1 describes the named entities recognition problem and related work, and section 2.2 provides an overview of Ripple Down Rules. Finally, section 2.3 presents the GATE framework and the JAPE grammars that we work with.
2.1 Named Entities Recognition
2.1.1 Introduction
Named Entities Recognition is a phase in an Information Extraction (IE) system. It tries to locate and identify all mentions of names and quantities in natural language documents. The types of named entities depend on the problem domain; however, the following are some common categories [4,14]:
- Names of people, locations and organizations: Person, Location, Organization
- Dates and times: Date, Time
- Currency amounts and percentages: Money, Percent
Because the latter two kinds of NE (Date, Time, Money and Percent) are clear and have few ambiguities, they are often easier to recognize. However, the definition of NE is domain dependent. For example, chemical named entity recognizers handle additional named entities such as drugs and diseases [21]. In advertising magazines, named entities can be product names.
As with many other natural language processing problems, there are three traditional groups of approaches:
- Statistical approaches (machine learning) [14]
- Grammar-based approaches [4,15]
- Hybrid approaches, which combine the two approaches above [8]
2.1.2 Evaluation Metrics
Before continuing with the current approaches, we first introduce the measures used in this thesis, including precision and recall.
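As a minimal sketch, assuming the standard exact-match definitions (precision is the share of predicted entities that are correct, recall is the share of gold entities that are found, and the F-measure is their harmonic mean), these measures could be computed as follows. The function name and the (start, end, label) span representation are illustrative assumptions, and the experiments in this thesis may count matches differently.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (start token, end token, label)
gold = {(0, 2, "Person"), (5, 6, "Location"), (9, 10, "Organization")}
predicted = {(0, 2, "Person"), (5, 6, "Organization")}
p, r, f = precision_recall_f1(gold, predicted)
```

Note that in this sketch an entity with the right boundary but the wrong label, such as the second prediction, counts as an error for both precision and recall.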
2.1.3.1 Statistical Approaches
In the IOB model, each word is labelled “I”, “O” or “B”:
- “I” if the word is inside the named entity under consideration
- “O” if the word is outside the named entity under consideration
- “B” if the word is the start of the named entity under consideration
After that, the IOB labels, the POS tags, the categories of named entities, etc. are used as the input of the machine learning system.
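The IOB labelling described above can be sketched in a few lines. This is only an illustration: the function name, the token list and the convention that entity spans are (start, end) token indices with an exclusive end are assumptions made for the example.

```python
def to_iob(tokens, entities):
    """Label each token "B", "I" or "O" given entity spans.

    entities: list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entities:
        tags[start] = "B"              # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining words inside the entity
    return tags


tokens = ["Mr.", "Bush", "is", "the", "President", "of", "the", "United", "States"]
tags = to_iob(tokens, [(1, 2), (7, 9)])
```

Here "Bush" and "United" receive "B", "States" receives "I", and every other token receives "O".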
Statistical approaches use three kinds of learning: supervised, unsupervised and semi-supervised. However, the latter two kinds (unsupervised and semi-supervised) are rarely used for named entities recognition, and only a few studies apply them, for example Unsupervised Models for Named Entities Classification [5] and Unsupervised Named Entities Classification Models and their Ensembles [11].
Supervised machine learning is the most popular approach. Examples include Bikel’s system using a Hidden Markov Model [1], Borthwick’s system using Maximum Entropy [2] and Wu’s system using an SVM [23]. In 2008, Mansouri introduced a system using a new fuzzy support vector machine [14]. In those systems, named entities are divided into three groups:
- ENAMEX (Organization, Person, Location)
- TIMEX (Date, Time)
- NUMEX (Money, Percent)
The reported results are strong, with an F-measure of about 93% [14]. However, the results are quite different when those approaches are applied to an under-resourced language such as Vietnamese. For example:
- Nguyễn’s system used a Conditional Random Fields (CRF) model to recognize 8 types of entities, including Person, Location, Organization, Time, Number and Money. The reported results are in the range of 80% to 81% [18].
- Phạm’s system used a Support Vector Machine. However, because of the lack of annotated corpora, the result is an F-measure of about 83.56% [21].
To conclude, statistical approaches have the advantage of building a system automatically with relatively high accuracy. However, they require a large amount of annotated data for training, which takes a great deal of time and effort to obtain for under-resourced languages.
2.1.3.2 Grammar-based Approaches
This is a traditional approach in which rules are designed by humans after a long period of observation. A rule often includes POS tags (verb, noun, adjective, etc.), contexts and other word attributes (all capitalized or first letter capitalized). Sometimes, a dictionary is also used to build a rule [3]. For example:
Mr. Bean is a British comedy television programme series of 14 half-hour episodes written by and starring Rowan Atkinson as the title character.
In this example, Bean, which is placed after “Mr.”, should be recognized as a named entity with the label Person. To write a rule for this kind of named entity, “Mr.” is first labelled as a “titleperson” by the gazetteer using a dictionary [10]. After that, a rule is simply designed like this:
If: Named Noun Phrase is placed after “titleperson”
Then: Named Noun Phrase is labeled as a Person
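The rule above can be illustrated with a short sketch. It is not the rule language used in this thesis: the gazetteer is reduced to a small hypothetical set of titles, and a "named noun phrase" is approximated, as an assumption, by a run of capitalized tokens.

```python
TITLE_PERSON = {"Mr.", "Mrs.", "Ms.", "Dr."}  # hypothetical "titleperson" list


def label_persons(tokens):
    """Return (start, end, "Person") spans for capitalized runs after a title."""
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i] in TITLE_PERSON:
            j = i + 1
            # Extend over the capitalized tokens that follow the title.
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            if j > i + 1:
                spans.append((i + 1, j, "Person"))
            i = j
        else:
            i += 1
    return spans


tokens = "Mr. Bean is a British comedy".split()
spans = label_persons(tokens)
```

In this sketch "Bean" is labelled Person because it follows a title, while "British" is left alone even though it is capitalized.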
To be more effective, Morgan used an advanced language analysis module and achieved an F-measure of over 90% [16]. As can be seen from this example, a rule created from a simple case is able to recognize a number of named entities. This is one of the advantages that grammar-based approaches gain from experts’ observations.
Although some projects have achieved high results, they are not structured carefully, so it is really difficult to adapt them to new problem domains. Moreover, building such rule systems often requires a lot of time and effort.
In 2001, Maynard introduced MUSE (Multi-Source Entity finder), which was developed on GATE. It is highly flexible because the rule set as well as the gazetteer can be reused in other domains. This system has been evaluated with an F-measure of around 93% on the following categories: Entity (Organization, Person, Location), Time (Date, Time), Number (Money, Percent), Address (Email, URL, Telephone, IP) and Identifier [15].
2.1.3.3 Hybrid Approaches
The hybrid approach is a mixture of the statistical approach and the grammar-based approach that uses the advantages of both. Sirhari and Fang introduced a hybrid system for Chinese documents which achieved a relatively high result [8].
2.2 Ripple Down Rules
2.2.1 Introduction
Traditionally, little thought was given to the structure of a rule-based system or to how it should be maintained. As a result, a lot of time and effort was wasted on adding new knowledge to existing systems. Ripple Down Rules (RDR) is different: it maintains the system by acquiring new knowledge incrementally. RDR was originally invented to solve the problem of maintaining a large rule-based knowledge system [22]. Rules are added by experts, without programming skills or knowledge engineering support, while the system is being used.
Note that, in RDR approaches, cases play an integral part in the knowledge acquisition process by motivating the capture of new knowledge. Cases also provide the context for deciding whether new knowledge applies, which ensures the consistency of the system when new knowledge is added.
RDR has been successful in handling single classification problems across a wide range of domains such as help-desk support, email classification, RoboCup and natural language processing.
There are three main forms of RDR [22]:
- Single Classification Ripple Down Rules (SCRDR)
- Multiple Classification Ripple Down Rules
- Nested Classification Ripple Down Rules
Our system uses SCRDR as the structure of the knowledge base. Therefore, in this thesis we discuss only the SCRDR structure.
2.2.2 Single Classification Ripple Down Rules
A Single Classification Ripple Down Rules (SCRDR) [22,6] tree is a binary tree with two branches that are typically called except and if-not. Each node contains a rule of the form “If condition then conclusion” and a cornerstone case. The cornerstone case is used to keep the system consistent whenever a new rule is added. In SCRDR, only one conclusion is allowed for each input case.
An SCRDR tree is executed by passing a case in at the root. At each node, the rule corresponding to that node is checked. If the condition is true, the case is passed on to the except branch. Otherwise, the case is passed on to the if-not branch, if it exists. The conclusion of the last rule whose condition is satisfied is taken as the result for the case. To ensure that a conclusion is always fired, the first node in the tree contains a rule that is always true; it is called the default node.
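The evaluation procedure just described can be sketched as follows. The `Node` class, its field names and the three-rule tree are hypothetical and serve only to illustrate the traversal; they are not the knowledge base of this thesis.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    condition: Callable[[dict], bool]
    conclusion: str
    cornerstone: Optional[dict] = None
    except_branch: Optional["Node"] = None
    if_not_branch: Optional["Node"] = None


def classify(root, case):
    """Return the conclusion of the last satisfied rule on the path."""
    node, conclusion = root, None
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion   # remember the latest fired rule
            node = node.except_branch
        else:
            node = node.if_not_branch
    return conclusion


# Hypothetical tree; the root is the always-true default node.
root = Node(lambda c: True, "Class 0")
root.except_branch = Node(lambda c: c.get("a") == 1, "Class 1")
root.except_branch.if_not_branch = Node(lambda c: c.get("b") == 2, "Class 2")
```

Because the default node always fires, `classify` returns "Class 0" for a case that satisfies no other rule.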
Figure 2.1 is an example of SCRDR:
Figure 2.1: An SCRDR example
In the given example, the default rule is Rule 0 with the condition “1=1”, which is always true. In Rule 3, since Class 3 is not considered by the expert to be correct, an exception rule (Rule 4) is added to Rule 3 and overrides it. In SCRDR, the conditions of all rules on the pathway are included in the explanation of the conclusion. For example, Rule 6 is explained as IF “attribute3=c” AND NOT “attribute4=d” AND “attribute6=f” THEN Class 2. To ensure that the previous knowledge remains valid when Rule 4 and Rule 5 are added, the cornerstone case of Rule 3 is checked against Rule 4 and Rule 5. If Rule 4 or Rule 5 fires, or gives a conclusion, on the cornerstone case of Rule 3, the system will not be consistent.
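The cornerstone check can be sketched as a simple test applied before a new exception rule is accepted. The function name, the cornerstone case and the two candidate conditions below are hypothetical; only the idea that a new rule must not fire on the stored cornerstone case is taken from the text.

```python
def can_add_exception(parent_cornerstone, new_condition):
    """Accept a new exception rule only if it does not fire on the
    cornerstone case of the rule it is attached to."""
    return not new_condition(parent_cornerstone)


# Hypothetical cornerstone case stored with Rule 3 and two candidate rules.
rule3_cornerstone = {"attribute3": "c", "attribute4": "x"}


def fires_on_cornerstone(case):
    return case.get("attribute3") == "c"


def leaves_cornerstone(case):
    return case.get("attribute4") == "d"
```

With these definitions, the first candidate would be rejected because it changes the conclusion for the cornerstone case, while the second would be accepted.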
2.3 GATE Overview
2.3.1 Introduction
GATE [7] is the General Architecture for Text Engineering, a language-engineering framework developed by the University of Sheffield. It has been developed since 1996 and used in a number of IE problems as well as many other applications [15].
GATE provides a friendly framework for natural language processing, which contains three main components:
- An architecture describing the components that make up a language-processing system
- A framework (Java class library and API) that can be used as a basis for building language-processing systems
- A graphical development interface (on top of the framework) consisting of a set of tools and components for working with language processing visually
Figure 2.2: GATE’s architecture
2.3.2 Resources in GATE
GATE distinguishes the following types of resources:
- Language Resources (LR), representing documents, lexicons, corpora, ontologies, etc.
- Processing Resources (PR), representing natural language processes such as tokenizers, POS taggers, etc.
- Visual Resources (VR), representing visual tools for processing natural language documents
2.3.3 GATE’s annotations
When processing resources (such as parsers, tokenizers, taggers, etc.) are executed on a text, they produce information about that text. For example, the tokeniser generates the list of tokens and the type of each token (word, number, punctuation, etc.), and the POS tagger produces the part of speech of each word. GATE uses annotations to describe this information.
A GATE annotation consists of:
- ID, which is unique in the document containing this annotation
- type, which represents the type of the annotation
- start and end node, which denote the span of the annotation
- a set of features (attribute/value pairs) that provide additional information about the annotation
For example, with the following sentence:
Mr. Bush is the President of the United States.
An annotation graph like the one in Figure 2.3 will be created.
Figure 2.3: An annotation graph
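The structure of an annotation can be mirrored in a small data class. This is an illustration only, not GATE's Java API: the class, the use of plain character offsets in place of start and end nodes, and the feature values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    id: int
    type: str
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    features: dict = field(default_factory=dict)


text = "Mr. Bush is the President of the United States"
annotations = [
    Annotation(0, "Token", 0, 3, {"kind": "word", "string": "Mr."}),
    Annotation(1, "Token", 4, 8, {"kind": "word", "string": "Bush"}),
    Annotation(2, "Person", 4, 8, {"rule": "example"}),
]
covered = [text[a.start:a.end] for a in annotations]
```

Several annotations of different types may cover the same span, as the Token and Person annotations over "Bush" do here.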
2.3.4 JAPE Overview
JAPE is the Java Annotation Patterns Engine in GATE [7]. It provides finite-state transduction based on regular expressions over annotations. Each JAPE transducer is loaded with a JAPE grammar. If a rule in the grammar is satisfied by the annotations generated in the system, the transducer performs the action defined in the grammar for that rule.
A JAPE grammar consists of a set of phases, which are executed sequentially
A phase consists of:
- a name, unique within the grammar
- input annotations, which define the annotation types that are considered as the input for the phase
- options, which define the behaviour of the JAPE engine when executing this phase
- one or more macros (optional)
- one or more rules
A rule consists of:
- a unique name
- an optional priority
- a left-hand side (LHS), defining a regular expression to be matched against annotations
- a right-hand side (RHS), defining the action to be performed if the rule is satisfied
Because the LHS of a rule is a regular expression, it can contain regular expression operators such as "*", "?", "|" or "+". The RHS can contain a valid block of Java statements, which makes it quite powerful and flexible.
The following is a sample JAPE grammar:
:orgName.Organization = {rule = "OrgUni"}
The sample grammar begins with the definition of a single phase containing one rule and no macros. In this sample, only Token annotations are used for the regular expression pattern. Therefore, all other annotations that have already been generated in the system are ignored.
Additionally, the options of the phase set the JAPE engine to run in "appelt" mode, which means that if several rules are satisfied by the input at some point, the longest matching rule is applied. If several rules match a region of the input with the same length, the rule with the highest priority is applied. The "Rule01" rule in the sample has its priority set to 25 (priorities are useful only for the "appelt" style of phase). The default priority is -1.
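The selection policy described for "appelt" mode can be sketched as a comparison on match length first and priority second. This is a simplified illustration rather than the JAPE engine itself; the tuple layout and the rule names other than "Rule01" are hypothetical.

```python
def select_appelt(matches):
    """Pick one match: longest first, then highest priority.

    matches: list of (rule_name, matched_length, priority).
    """
    return max(matches, key=lambda m: (m[1], m[2]))


candidates = [
    ("ShortRule", 2, 50),   # high priority but shorter match
    ("Rule01", 3, 25),
    ("OtherRule", 3, -1),   # same length, default priority
]
winner = select_appelt(candidates)
```

In this example "Rule01" wins: it matches as many annotations as "OtherRule" but has the higher priority, and the high priority of "ShortRule" does not help because its match is shorter.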
The LHS of the rule (the block in brackets preceding the "-->" symbol) contains a regular expression that is matched over the input annotations. The rule says that the sequence of tokens "University", "of" and one or more tokens with the category “NNP” should be matched. The matched sequence of input annotations can later be referred to as "orgName".
Finally, the statements in the RHS (the block in brackets following the "-->" symbol) are executed whenever the LHS is satisfied by the input. The RHS creates a new annotation of type Organization for the matched annotations.
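The effect of this rule can be imitated outside GATE with a small matcher over (string, category) pairs. This is a sketch of the pattern as described in the text, not JAPE code; the function name and the token representation are assumptions.

```python
def match_org_uni(tokens):
    """Find spans matching: "University" "of" (category NNP)+.

    tokens: list of (string, category) pairs.
    Returns (start, end, "Organization") spans, end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        if (i + 2 < len(tokens)
                and tokens[i][0] == "University"
                and tokens[i + 1][0] == "of"):
            j = i + 2
            # Consume one or more proper nouns after "University of".
            while j < len(tokens) and tokens[j][1] == "NNP":
                j += 1
        if j > i + 2:
            spans.append((i, j, "Organization"))
            i = j
        else:
            i += 1
    return spans


tokens = [("the", "DT"), ("University", "NNP"), ("of", "IN"),
          ("Sheffield", "NNP"), ("is", "VBZ")]
spans = match_org_uni(tokens)
```

The returned span covers "University of Sheffield", which corresponds to the annotations bound to "orgName" in the rule.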
3.1 Introduction
In our approach, we turn NER into a kind of classification problem. Firstly, we divide NER into two smaller problems: locating the position of a named entity and categorizing that named entity. For the locating problem, instead of finding both the starting and ending positions, we choose a starting position and then find the ending position, or equivalently the length, of the named entity. The ending position of a named entity can be identified by the conclusion of the RDR rule. As a result, the NER problem can be treated as a classification problem, which is easily solved by the RDR approach.
Our system receives a sentence in the form of a string and returns a set of annotations that locate all identified named entities in the given sentence. In more detail, our system includes 3 modules:
- The pre-processing module, which receives a sentence and returns a set of basic annotations that are used to identify named entities in the given sentence.
- The NER module, which receives a set of basic annotations and works together with the RDR module to identify all named entities in the given sentence.
- The RDR module, which receives a starting position locating a candidate named entity. From the given position and the set of basic annotations from the pre-processing module, this module returns an NE annotation representing a classified candidate named entity.
Those modules are organized as shown in Figure 3.1:
Figure 3.1: Our system’s overview
As can be seen from Figure 3.1, our system receives a sentence and returns a set of NE annotations (Named Entity annotations, which are described in section 3.3). The process is divided into the following steps:
- Firstly, the sentence is processed by the pre-processing module, which includes the Tokenizer and the Gazetteer, to return a set of basic annotations; this module is described in section 3.2.
- After that, Phase 1 of the NER module receives this set of basic annotations from the pre-processing module and creates a set of starting positions of all potential named entities in the given sentence. This set of starting positions is sent to Phase 2 of the NER module.
- Next, for each received starting position, Phase 2 of the NER module sends the position to the RDR module to receive an NE annotation.
- From each given position and the basic annotations created by the pre-processing module, the RDR module returns an NE annotation. This NE annotation represents a classified candidate named entity which starts from the given position.
- Finally, Phase 2 of the NER module checks all received NE annotations and returns a set of NE annotations which locate all named entities in the given sentence.
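The flow of these steps can be sketched end to end. Every function below is a stand-in written for illustration: the whitespace tokenizer, the heuristic that a capitalized token may start a named entity, the single hard-coded rule inside `rdr_module` and the overlap check in `phase2` are all assumptions, not the behaviour of the actual modules.

```python
def preprocess(sentence):
    """Stand-in for the Tokenizer and Gazetteer: whitespace tokens only."""
    return sentence.split()


def phase1_start_positions(tokens):
    """Assumed heuristic: every capitalized token may start a named entity."""
    return [i for i, tok in enumerate(tokens) if tok[:1].isupper()]


def rdr_module(tokens, start):
    """Stand-in for the RDR module: returns (start, end, label) or None.

    The conclusion fixes both the label and the end position.
    """
    if start > 0 and tokens[start - 1] == "Mr.":
        end = start + 1
        while end < len(tokens) and tokens[end][:1].isupper():
            end += 1
        return (start, end, "Person")
    return None


def phase2(tokens, starts):
    """Query the RDR module per start and skip overlapping results (assumed)."""
    result, covered_until = [], 0
    for start in starts:
        if start < covered_until:
            continue
        ne = rdr_module(tokens, start)
        if ne is not None:
            result.append(ne)
            covered_until = ne[1]
    return result


tokens = preprocess("Yesterday Mr. Rowan Atkinson visited London")
entities = phase2(tokens, phase1_start_positions(tokens))
```

The point of the sketch is the division of labour: Phase 1 only proposes starting positions, while the class and the end position of each candidate both come from the RDR module.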
3.2 Pre-processing
In the pre-processing stage, we use WS4VN [20] for word segmentation and part-of-speech tagging (Tokenizer). We also manually create a gazetteer, which is a set of dictionaries [15]. For convenience, these two modules are implemented as two GATE processing resources.
3.2.1 Tokenizer
This process is designed to segment a given sentence into words. Because Vietnamese is a monosyllabic language, a Token might contain two or more single words. The following attributes are added so that rules can be written more effectively.
- The attribute ‘kind’, for which five values are defined:
• “word”: a word is defined as a contiguous sequence of upper-case and lower-case letters, possibly including hyphens and spaces
• “number”: a number is defined as any combination of consecutive digits
• “punctuation”: including starting and ending punctuation
• “symbol”: a symbol that is not included in the above categories
• “other”: anything that is not included in the above categories
- The attribute ‘orth’ for which five values are defined: