
Towards a framework for building an annotated named entities corpus



1 Introduction
1.1 Overview of Named Entity Recognition (NER)
1.2 NER Approaches
1.2.1 Rule-based approach
1.2.2 Machine learning approach
1.2.3 Comparison
1.3 Thesis contribution
1.4 Thesis structure
2 Related Work
2.1 Overview of our problem
2.2 Research on building NER corpora
2.3 Research on corpus building processes
2.4 Overview of annotation tools
2.5 Summary
3 Corpus building process
3.1 Corpus building process
3.1.1 Objective
3.1.2 Building the annotation guideline
3.1.3 Annotating documents
3.1.4 Quality control
3.2 Building Vietnamese NER corpus by offline tools
3.2.1 Building the annotation guideline
3.2.2 Annotating documents
3.2.3 Quality control
3.3 Discussion of the Vietnamese NER corpus building process
3.4 Conclusion


4 Online Annotation Framework
4.1 Introduction
4.2 Training session
4.3 Annotating documents
4.3.1 Online annotation interface
4.3.2 Automatic file distribution to annotators
4.3.3 Automatic file saving and management
4.4 Quality control
4.4.1 Document level
4.4.2 Corpus level
4.4.3 Explaining unusual entities
4.5 Conclusion
5 Evaluation
5.1 Introduction
5.2 Corpus evaluation
5.2.1 Inter-annotator agreement
5.2.2 Offline corpus evaluation
5.2.3 Online corpus evaluation
5.3 Time cost
5.3.1 Overview
5.3.2 Offline process
5.3.3 Online framework
5.4 Named entity recognition system
5.4.1 Preprocessing
5.4.2 Gazetteer
5.4.3 Transducer
5.4.4 Experiment
5.5 Summary
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future work
6.2.1 Creating a bigger, higher quality corpus
6.2.2 Improving the online annotation framework
6.2.3 Building a statistical NER system


A Named Entity guideline
A.1 Basic concepts
A.1.1 Entity and entity name
A.1.2 Instance of an entity
A.1.3 List of entities
A.1.4 Entity recognition rules
A.2 Entity classification
A.2.1 Person
A.2.2 Organization
A.2.3 Location
A.2.4 Facility
A.2.5 Religion


3.1 Process of building the annotation guideline
3.2 Callisto formatting
3.3 Callisto interface
3.4 Comparing two users' corpora
4.1 Online annotation process
4.2 Online annotation tool interface
4.3 Annotation guideline form interface
4.4 Review tool interface
4.5 Compare two documents interface
5.1 Inter-annotator agreement results of two users
5.2 Accuracy rate for each entity kind
5.3 Online corpus accuracy rate for each entity kind
5.4 Named entity recognition system architecture
5.5 JAPE rule to recognize Person entities
5.6 Performance on the training data using strict criteria
5.7 Performance on the test data using strict criteria
5.8 Performance on the test data using lenient criteria


5.1 An example of a pair corpus annotated by two users (User A and User B)
5.2 Frequency of annotated documents
5.3 Inter-annotator agreement in online annotation
5.4 User corpus accuracy rate in the online method
5.5 Time spent on quality control of the corpus
5.6 Time spent during the annotation process
5.7 Quality control time in the online framework


1.1 Overview of Named Entity Recognition (NER)

The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. The term "Named Entity", widely used in Natural Language Processing (NLP), was coined for the Sixth Message Understanding Conference (MUC-6). At the time, MUC was focusing on Information Extraction (IE) tasks, where structured information about company activities and defense-related activities is extracted from unstructured text such as newspaper articles. In defining the tasks, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions, including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called "Named Entity Recognition and Classification".

The computational research aiming at automatically identifying named entities in texts forms a vast and heterogeneous pool of strategies, methods and representations. One of the first research papers in the field was presented by Lisa F. Rau (1991) at the Seventh IEEE Conference on Artificial Intelligence Applications. In general, each NER study has to address four problems: language, input format, kind of entity, and learning method.

Languages:

NER has been applied to several languages. There is much good research on English NER, which has addressed language independence and multilingualism problems. German is well studied in CoNLL-2003 and in earlier works. Similarly, Spanish and Dutch are strongly represented, boosted by a major devoted conference: CoNLL-2002 (Collins, 2002). Chinese is studied in some research (Wang et al., 1992), (Computer et al., 1996), (Yu et al., 1998), and so are French (Petasis et al., 2001), (And, 2003), Greek (Karkaletsis et al., 1999) and Italian (Black et al., 1998), (Cucchiarelli & Velardi). Many other languages received some attention as well: Basque (Whitelaw & Patrick, 2003), Bulgarian (Silva et al., 2004), Catalan (Carreras et al., 2003), Hindi (Cucerzan & Yarowsky, 1999), Romanian (Cucerzan & Yarowsky, 1999), Swedish (Kokkinakis, 1998) and Turkish (Cucerzan & Yarowsky, 1999). Portuguese was examined by (Palmer et al., 1997).

For Vietnamese, some NER research has been carried out, for example the VN-KIM IE system (Nguyen & Cao, 2007).

Input format

NER research has been applied to many document formats (general text, email, scientific text, journalistic text, etc.) and many domains (sport, business, literature, etc.). Each system is usually directed at a specific format and domain: (Maynard et al., 2001) designed a system for email, scientific texts and religious texts, and (Minkov & Wang, 2005) created a system specifically designed for email documents. Nowadays, studies aim to apply NER to newer kinds of formats and domains. For example, the MUC-6 collection is composed of newswire texts and of a proprietary corpus made of manual translations of phone conversations and technical emails.

Kind of Entity

Although the list of entities depends on the kind of problem and the specific domain, NER systems usually record some common entities: Person, Location, Organization, Date, Time, Money and Percent. Ambiguity appears mostly with Person, Location and Organization, and less with the others. In each domain, NER systems target some specific entities; for instance, in the medical domain an entity can be the name of a disease or the name of a medicine.

1.2 NER Approaches

Similar to other NLP problems, NER research has developed along two main approaches:

• Rule-based approach

• Machine learning approach

Using an expert system to build a rule system is the traditional approach, and such systems have been applied in NLP in general and NER in particular. A rule system is a set of rules built by people (ordinarily experts) for a particular target. Rules are created from some features: part of speech, context (the words and phrases in front of and behind a word, etc.), some properties (uppercase, lowercase, ...) and some special dictionaries. For example:


President Busto leave Iraq said Monday's talks will include discussion on security, a timetable for U.S. forces

In this example, "Busto" appears after the word "President"; for this reason "Busto" is annotated as a Person entity. Similarly, "Iraq" follows the verb "leave", so it is taken to be a Location entity. In this approach we do not need an annotated corpus: the system can identify and classify entities immediately by its set of rules. The advantage of the approach is that it is easy to build a rule-based system, so many NER systems have been rule-based since the earliest period. However, it is difficult to improve the accuracy rate, because organizing the set of rules is difficult: if we do not organize them appropriately, the rules overlap each other and the system cannot identify and classify correctly.
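To make the idea concrete, here is a minimal rule-based sketch in Python; the trigger words, regular expressions and tag names are illustrative assumptions, not the rules of any system described in this thesis.

import re

# A minimal sketch of contextual rules like the ones described above.
# The trigger lists are illustrative assumptions, not a real rule set.
PERSON_TRIGGERS = r"(?:President|Mr\.|Chủ tịch)"
LOCATION_VERBS = r"(?:leave|visit|arrive in)"

def rule_based_ner(text):
    entities = []
    # Rule 1: a capitalized word right after a person trigger -> Person
    for m in re.finditer(PERSON_TRIGGERS + r"\s+([A-Z][a-z]+)", text):
        entities.append((m.group(1), "Person"))
    # Rule 2: a capitalized word right after a motion verb -> Location
    for m in re.finditer(LOCATION_VERBS + r"\s+([A-Z][a-z]+)", text):
        entities.append((m.group(1), "Location"))
    return entities

print(rule_based_ner("President Busto leave Iraq said Monday's talks ..."))
# [('Busto', 'Person'), ('Iraq', 'Location')]

As the paragraph above notes, the hard part is not writing one such rule but keeping a large set of rules from conflicting with each other.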

Nowadays, machine learning is the common approach for solving NLP problems. In NER, it is used to improve accuracy. Several models have been applied: support vector machines, Hidden Markov models, decision trees, etc. Three kinds of learning method have been applied in machine learning: unsupervised, supervised, and semi-supervised. However, unsupervised and semi-supervised systems are not common for NER; only a few studies apply these methods, for example Collins with a system using an unannotated corpus (Collins & Singer, 1999), and Kim with a system using proper names and an unannotated corpus. Supervised systems are used far more widely for NER, for example Bikel with a hidden Markov model (Black et al., 1998) and Borthwick with Maximum Entropy (Borthwick et al., 1998). In machine learning systems, we must build three sets: a training set, a test set and a practice set.

• A training set consists of input vectors and answer vectors, and is used together with a supervised learning method to train the knowledge of the system. In NER, a training set is a corpus which has been annotated with standard labels.

• A test set is similar to a training set, but the purpose of the test set is to check and evaluate the system's accuracy. In the NER problem, the test set is a corpus similar to the training set.

• A practice set is the set to which the trained machine learning system is applied to automatically identify and classify entities. Processing the practice set is the goal of building the system. (A small sketch of these three sets follows this list.)
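The following is a small sketch of how these three sets might be prepared in Python, assuming the annotated corpus is a list of (sentence, labels) pairs; the 80/20 split and the function name are illustrative assumptions, not values taken from the thesis.

import random

# A minimal sketch of the three sets described above. The annotated corpus
# is assumed to be a list of (sentence, labels) pairs; file layout and the
# 80/20 split are illustrative assumptions.
def make_sets(annotated_corpus, unannotated_docs, test_ratio=0.2, seed=42):
    docs = list(annotated_corpus)
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_ratio)
    test_set = docs[:n_test]               # annotated, used only for evaluation
    training_set = docs[n_test:]           # annotated, used to train the model
    practice_set = list(unannotated_docs)  # raw text the trained system will tag
    return training_set, test_set, practice_set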

Annotation-based learning has some advantages over manual rule writing:

• Annotation-based learning can continue indefinitely, over weeks and months, with relatively self-contained annotation decisions at each point. In contrast, rule writing must remain cognizant of potential interdependencies with previous rules when adding and revising rules, ultimately bounding continued rule system growth by a cognitive load factor.

• Annotation-based learning can more effectively combine the effort of multiple people. The tagged sentences from different data sets can simply be concatenated to form larger data sets with broader coverage.

• Users who write rules require considerable skill, including not only the linguistic knowledge needed for annotation, but also competence with regular expressions and the ability to grasp the complex interactions within the rule list. In machine learning approaches, annotators are only required to use the language fluently.


• The performance of systems built by rule writers tends to exhibit considerably more variance, while machine learning systems tend to give much more consistent results.

Although the machine learning approach has many advantages, we meet a main barrier: machine learning needs a high quality corpus. So the problem is how to build a high quality corpus.

For Vietnamese, no NER corpus has been published. Although some systems have been built based on the machine learning approach, they do not share their corpora, so it is difficult for other researchers to improve the accuracy of NER systems. For this reason, my thesis focuses on:

• Solutions for building a Vietnamese NER corpus

• Quality control and evaluation of the corpus

• Applying the corpus to the NER problem

1.3 Thesis contribution

The thesis contributions include:

• We release a corpus building process based on a review of existing corpus-building research.

• We apply the process to build a NER corpus by the offline tools method. The offline tools method is a manual way of working that uses desktop programs, for example the Callisto tool.

• To overcome the disadvantages of offline tools, we build an online annotation framework. The online framework has some features:

– Annotation is executed through the Internet (annotate anytime, anywhere).

– All steps in the process are automated: managing files, distributing files to annotators, etc.

– A larger number of annotators is supported.

– Quality control of the corpus is performed at many levels.

• We apply the corpus to evaluate our NER system.

1.4 Thesis structure

Accordingly, my thesis is structured into the following chapters:

• Chapter one, Introduction: an overview of NER research and some approaches to building NER systems, and a statement of our problem.

• Chapter two, Related work: an overview of research around the world on building NLP corpora in general and NER corpora in particular, which situates our study.

• Chapter three, Corpus building process: describes a process for building a general corpus; we then apply it to build a Vietnamese corpus with offline tools.

• Chapter four, Online annotation framework: based on the corpus building process, we build an online framework for annotating, which overcomes the disadvantages of offline tools.

• Chapter five, Evaluation: presents my experiments and evaluation results, and describes our NER system using the corpus we built.


Related Work

By learning from existing research, we build our own strategy for solving our problem.

2.1 Overview of our problem

As we saw in the last chapter, building a high quality NER corpus is very important. The corpus will be used in many ways in a NER system:

• Testing the system: the corpus is used to evaluate the system's accuracy rate.

• Training the system: the corpus is used to build the system's knowledge (machine learning approach).


However, building a high quality corpus is not easy. Without a suitable method, you only obtain a low accuracy corpus. So our problem is:

How to build a high quality NER corpus, and how to quality-control that corpus.

We need to do three things to solve the problem: release a corpus building process, supply supporting tools, and carry out quality control.

2.2 Research on building NER corpora

When surveying the problem, we see that building a NER corpus is not a new problem in the world. For example:

Kravalová, Jana and Žabokrtský, Zdeněk built a Czech Named Entity corpus (Kravalová & Žabokrtský, 2009). In this research, 6,000 sentences were manually annotated with named entities, yielding about 33,000 entities. They use the corpus to train and evaluate a named entity recognizer based on the Support Vector Machine classification technique. The presented recognizer outperforms the results previously reported for NE recognition in Czech.

Furthermore, Asif Ekbal and Sivaji Bandyopadhyay built a Bengali Named Entity Tagged Corpus (Asif Ekbal, 2008). A Bengali news corpus was developed from the web archive of a widely read Bengali newspaper. They used the tool "Sanchay Editor" to annotate manually; Sanchay Editor is a text editor for Indian languages. Their corpus has been used to develop NER systems in Bengali using pattern-directed shallow parsing approaches, including Hidden Markov Models (HMM), Maximum Entropy (ME) models, Conditional Random Fields (CRF) and Support Vector Machines (SVM).


No NER corpus has been published for the Vietnamese language. As a result, some NER systems have been based on the rule-creating approach, for example VN-KIM (Nguyen & Cao, 2007) (using JAPE grammar). Releasing a Vietnamese NER corpus will be useful for developing automatic NER research.

2.3 Research on corpus building processes

Much corpus building research has been published, and many corpora have been created: POS corpora, treebank corpora, and even newer corpora such as parallel language corpora, opinion corpora, etc. For example:

• The research "Towards the national corpus of Polish" (Adam Przepiorkowski & Lazinski, 2008) studies building the National Corpus of Polish, which is used to build a Polish dictionary. The corpus is very big, about a billion words. It is being built by four partners, who annotate various features; the entire corpus will be annotated linguistically, structurally and with metadata. During construction, they plan to carefully consider the recommendations of the ISO/TC 37/SC 4 subcommittee, the TEI guidelines, and any future recommendations of the CLARIN project.¹

• The research "Building a Greek corpus for Textual Entailment" (Evi Marzelou & Piperidis, 2008) studies building a Greek corpus. The annotation process in this research includes several steps: create guidelines, annotate (by expert and non-expert human annotators), then compare and release the gold entailment annotation.

• The research "Opinion annotation in On-line Chinese Product Reviews" (Ruifeng Xu & Li, 2008) focuses on opinion annotation. The research explains the annotation schema, which includes seven steps.

¹ More information on the web: http://www.clarin.eu/

In summary, after reviewing some annotated corpus creation research, we see that a corpus building schema includes three main steps: build the annotation guideline, annotate, and quality-control the corpus. Our corpus building will follow these steps.

2.4 Overview of annotation tools

Many annotation tools exist; we can reference them:

• GATE: a framework written in the Java language. It includes many functions; annotation is one of GATE's functions.

• Callisto: the Callisto annotation tool was developed to support linguistic annotation of textual sources for any Unicode-supported language.

• EasyRef: a web service to handle (view, edit, import, export, report bugs in) syntactic annotations.

• SACODEYL Annotator: an open source application to annotate documents on the desktop; furthermore, it can run as a web application.

• WordFreak: a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as to employ active learning for human correction of automatically annotated data.

We reference all of these tools when building our own tools for the annotation process.


2.5 Summary

In this chapter, we focused on some related work around the thesis, including corpus building processes and annotation tools. This helps direct our work towards a framework for building an annotated named entities corpus. In the next chapters, we explain our work to build the Vietnamese NER corpus.


Corpus building process

In this chapter, we present the corpus building process. Like other annotation processes, the corpus building process includes three steps: build the annotation guideline, annotate documents, and quality control. We then apply the process to build the Vietnamese NER corpus, using some offline tools, and discuss the advantages and disadvantages.

3.1 Corpus building process

In this subsection, we explain the importance of the building process. If you want to build a small corpus (a corpus containing only a few documents), you do not need a corpus building process: you simply annotate each document with an annotation tool, and if you want the corpus to be more accurate, the documents are annotated several times. However, when you want to build a large corpus, the work becomes complex and many people need to join the job. That is why a corpus building process is defined. Based on the corpus building process, people know what work they have to do, and the manager can more easily manage all the work and the corpus quality. The requirements of the corpus building process are listed below:

• Many people can take part in building the corpus.

• Each document has to be annotated multiple times.

• The administrator can control and evaluate the quality of the corpus as well as the quality of each annotator's work.

As in the research we studied in chapter two, such as the National Corpus of Polish (Adam Przepiorkowski & Lazinski, 2008), building a Greek corpus for Textual Entailment (Evi Marzelou & Piperidis, 2008), and opinion annotation in On-line Chinese Product Reviews (Ruifeng Xu & Li, 2008), the corpus building process includes three steps:

• Building the annotation guideline

• Annotating documents

• Quality control of the corpus

In the next sections we present each step.

The annotation guideline is essentially a user manual for annotators. They rely on the instructions contained in the guideline to find and classify entities. In building a NER corpus, the guideline includes several contents: the definition of a named entity, the classification of entities, and the signs of an entity in documents. The annotation guideline is very important because:

• Annotators treat the guideline as their user manual for annotating correctly. Before the annotation process, they have to read and study the guideline carefully. They have to know which words or phrases can be considered entities and how to identify each entity kind. If they do not understand it clearly, they face many problems when they annotate, and many errors will be made.

• When facing an ambiguous case (a case that can be understood in several ways), the annotator decides which reading is the most correct based on the rules in the guideline. For example, when annotating the sentence:

Trưởng công an huyện Kỳ Sơn dẫn tôi đi tới kho chứa hàng trăm khẩu súng tự chế được gom lại trong chiến dịch thu hồi vũ khí và vật liệu nổ trái phép

(The Ky Son district police chief took me to a storehouse of hundreds of home-made guns gathered in the campaign to recover illegal weapons and explosives.)

there are two ways to annotate this sentence. First, "Kỳ Sơn" is a "Location" entity because it is a district name. Alternatively, "Công an huyện Kỳ Sơn" is an "Organization" entity because it is the name of an office. Which way do we choose? In the annotation guideline we state: "Entities are not annotated overlapping, and the most correct entity is the longest entity." So the second way is applied.

• Because there is only one correct entity in a given context, when we compare two documents which have been annotated by two different people and we find differences, this demonstrates that one is correct and the other is wrong (or even that both of them are wrong). To repair them we have to rely on the guideline. For example, when annotating the sentence:


Bốn mươi năm trước chợ Mường Xén chưa xuất hiện thép Thái Lan

(Forty years ago, Thai steel had not yet appeared in Muong Xen market.)

some people annotate Mường Xén as "Location", others as "Facility". Because the preceding word is chợ (market), based on the annotation guideline Mường Xén has to be a "Facility" entity.
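As a minimal illustration of the "longest entity" rule applied in the first example above, the following Python sketch resolves overlapping candidate annotations by keeping the longest span; the (start, end, label) candidate format is an illustrative assumption, not the format used by the annotation tools.

# A small sketch of the "prefer the longest, non-overlapping entity" rule.
# Candidates are (start, end, label) character spans; this format is an
# illustrative assumption, not the format of the thesis's tools.
def resolve_overlaps(candidates):
    # Longest spans first, so they win any overlap.
    ordered = sorted(candidates, key=lambda c: c[1] - c[0], reverse=True)
    chosen = []
    for start, end, label in ordered:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)

# "Công an huyện Kỳ Sơn" (Organization) overlaps "Kỳ Sơn" (Location);
# the longer span is kept, as the guideline rule requires.
print(resolve_overlaps([(15, 20, "Location"), (0, 20, "Organization")]))
# [(0, 20, 'Organization')]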

In summary, the annotation guideline is very important to the quality of the corpus, so we have to build the guideline as correctly as possible and in correspondence with our language. In general, an annotation guideline is built in one of two ways: building a new one or adapting an existing one.

• Each annotator receives their own group of documents. They use a tool to annotate the documents. For accuracy and to avoid bias, annotators work independently. During the annotation period, they consult the annotation guideline to decide which tag to use. The more carefully annotators read the guideline, the more exactly they annotate; and the more friendly and convenient the annotation tool is, the higher the annotation performance.

• After the annotators finish their work, they give the annotated documents back to the manager. The manager must organize and save them.


This is an important step in the corpus building process. If people annotate carefully, the volume of work in the next steps is reduced considerably. Otherwise, the next step becomes very hard; sometimes annotated documents even have to be re-annotated, which costs a lot of time in the process.

For example, in this sentence the entity shown in bold has not been annotated:

Đi từ Hà Nội theo đường một cũ khoảng 30 km là tới xã Thống Nhất

(Going from Ha Noi along the old Route One for about 30 km, you reach Thong Nhat commune.)

• Annotating a word or phrase which is not an entity. For example, in this sentence a non-entity (in bold) has been annotated:

chiếc <Facility>Inova</Facility> đang đi trên đường

(An Innova car is running on the road.)

• Annotating with an incorrect tag. For example, in this sentence Long Biên is a Facility instead of a Location:

nó đi đến chợ <Location>Long Biên</Location>

(He goes to Long Bien market.)


• Annotating an incorrect word or phrase. For example, in this sentence Long Biên is the entity (a Facility) but chợ Long Biên is not an entity:

nó đi đến <Facility>chợ Long Biên</Facility>

(He goes to Long Bien market.)

We check errors at two levels: the document level and the corpus level. After correcting all errors, we obtain a set of annotated documents without errors; this is the standard corpus we want.

Document level:

At this level, the supervisor finds errors by comparing double-annotated documents (documents annotated independently from the same root document by two annotators). If there are differences between the two documents, the supervisor checks each difference. Based on the annotation guideline and the context around the difference, the supervisor decides which document is correct and which is wrong, or whether both are wrong (notice that in the same context, only one reading is correct). In general, a tool is applied to automatically find all differences between double-annotated documents.
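A minimal sketch of this comparison, assuming each annotated document has been reduced to a set of (start, end, label) spans; the function name and data layout are illustrative assumptions, not those of the quality tool described in the next section.

# A small sketch of finding differences between two annotations of the same
# root document. Annotations are sets of (start, end, label) spans; this
# layout is an illustrative assumption, not the quality tool's real format.
def annotation_differences(user_a, user_b):
    only_a = sorted(set(user_a) - set(user_b))   # annotated by A, not by B
    only_b = sorted(set(user_b) - set(user_a))   # annotated by B, not by A
    return only_a, only_b

a = {(0, 6, "Location"), (20, 28, "Person")}
b = {(0, 6, "Location"), (20, 28, "Organization")}
print(annotation_differences(a, b))
# ([(20, 28, 'Person')], [(20, 28, 'Organization')])

Each such difference is then explained against the guideline and the surrounding context, as described above.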


Corpus level:

For example, the phrase Hồ Chí Minh is annotated as a Person entity many times in the corpus; however, in one document, Hồ Chí Minh is annotated as a Location entity.

To solve the problem, we show the entity and its context. Based on the annotation guideline and the context, we explain whether it is correct or wrong. For example, this sentence is annotated correctly:

<Location>Hồ Chí Minh</Location> là thành phố lớn nhất cả nước

(Ho Chi Minh is the biggest city in the country.)

To find unusual entities, we list all entities in the corpus together with their frequencies. Entities with low frequency are unusual entities, and we explain them based on their context.

By correcting all errors of this kind, the precision of the corpus is increased.

In the second case, a word or phrase A is recognized as an entity of kind B many times in the corpus, but the same word in some document C is not annotated. We need to find such cases and explain them. Although it is difficult to explore them, doing so increases the corpus recall. We list all words or phrases in all documents that are similar to annotated words or phrases in the corpus, and we explain them based on their context and the annotation guideline. For example, in this sentence TOYOTA is not annotated as an entity although it is an Organization elsewhere in the corpus:

chiếc xe TOYOTA của tôi lại bị hỏng

(My TOYOTA car has broken down again.)
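The two corpus-level checks can be sketched as follows, assuming the corpus is held as simple Python lists; the data layout and the frequency threshold are illustrative assumptions, not the implementation of the offline or online tools.

from collections import Counter

# A minimal sketch of the two corpus-level checks described above. Each
# annotated document is assumed to be a list of (phrase, label) pairs, with
# its raw text kept alongside; this layout is illustrative only.
def corpus_level_checks(annotations, raw_texts, min_freq=2):
    # Check 1 (precision): flag (phrase, label) pairs that occur rarely,
    # e.g. "Hồ Chí Minh"/Location when it is usually annotated as a Person.
    counts = Counter(pair for doc in annotations for pair in doc)
    unusual = [pair for pair, freq in counts.items() if freq < min_freq]

    # Check 2 (recall): known entity phrases that appear in a document's raw
    # text but were never annotated in that document (the TOYOTA case).
    known_phrases = {phrase for phrase, _ in counts}
    missed = []
    for doc, text in zip(annotations, raw_texts):
        annotated_here = {phrase for phrase, _ in doc}
        for phrase in known_phrases:
            if phrase in text and phrase not in annotated_here:
                missed.append((phrase, text))
    return unusual, missed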

Through all the tasks in the quality control step, we hope to find and correct all errors in the corpus and thus build a high quality corpus. In summary, in this section we presented a corpus building process. It includes three steps: build the annotation guideline, annotate documents, and quality control. It can be applied to NER corpus building and to other corpora. In the next section, we apply the process to build a Vietnamese NER corpus using offline tools: the Callisto tool and a quality tool.

3.2 Building Vietnamese NER corpus by offline tools

In this section, we apply the corpus building process to build the Vietnamese NER corpus. We carry out the three steps manually: build the annotation guideline, annotate documents, and quality control.

The Vietnamese NER annotation guideline is built based on an existing one for English NER. We reference the Simple Named Entity Guidelines V6.4 (Strassel, 2006) to build our raw annotation guideline. The guideline addresses:

• What is an entity? Which words or phrases do we identify as entities?

• How many kinds of entity are there?

• Which cases are not entities?

• How to identify each kind of entity.

Firstly, there are seven entity kinds in the raw annotation guideline: Person (PER), Organization (ORG), Location (LOC), Geo-Political Entity (GPE), Facility (FAC), Religion (REL) and Nationality (NAT).

We have to adapt the guideline to fit Vietnamese NER corpus building. We annotate a set of documents based on the guideline and find errors; we explain each error and repair the guideline. Then we annotate the documents again and repair the guideline again. The guideline building process finishes when the guideline is suitable for annotating documents. These tasks are shown in figure 3.1.

Figure 3.1: Process of building the annotation guideline

After repair, the annotation guideline includes only five entity kinds: Person (PER), Organization (ORG), Location (LOC), Facility (FAC), and Religion (REL). The annotation guideline details are shown in the appendix.


Figure 3.2: Callisto formatting

After completely building the annotation guideline, we annotate the documents to build the corpus. In this subsection, we give an overview of the input files, the output files and the annotation tool.

Overview of input and output files

The input data are general text documents encoded in UTF-8. They are Vietnamese articles whose contents are about culture and society.

The output data format is the Callisto format, which is based on XML. We add an XML tag before and after each entity; the format is shown in figure 3.2. All input and output files have the ".seg" extension.
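As a small illustration of this inline-tag form (the real .seg files follow Callisto's own schema, which may differ), the following Python sketch extracts tagged entities from a string like the examples shown earlier in this chapter.

import re

# A sketch of reading entities out of inline-tagged text such as
# "nó đi đến chợ <Location>Long Biên</Location>".
# Callisto itself uses standoff annotation, so this only illustrates the
# inline form shown in the examples, not the tool's internal format.
TAG_RE = re.compile(r"<(Person|Organization|Location|Facility|Religion)>(.*?)</\1>")

def extract_entities(tagged_text):
    return [(m.group(2), m.group(1)) for m in TAG_RE.finditer(tagged_text)]

print(extract_entities("nó đi đến chợ <Location>Long Biên</Location>"))
# [('Long Biên', 'Location')]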

Overview of the annotation tool

We use the Callisto tool to annotate documents. Callisto was developed to support linguistic annotation of textual sources for any Unicode-supported language. Callisto has been built with a modular design and utilizes standoff annotation, allowing for unique tag-set definitions and domain-dependent interfaces. Standoff annotation support, provided by jATLAS, allows nearly any annotation task to be represented.

Figure 3.3: Callisto interface

The Callisto interface is shown in figure 3.3. When documents are imported, each annotated entity is highlighted in a different color according to its tag. At the bottom of the interface, the list of entities is classified by tag, with each tag in its own tab, so it is easy to find the entities that annotators have tagged. Because we classify entities into five categories (Person, Location, Facility, Organization, Religion), there are five tabs at the bottom and five colors: red, yellow, green, blue and pink.

Using Callisto, we define the entities in a DTD file, the configuration file of Callisto, whose format is based on XML. To define an entity tag, we use the grammar:

<!ELEMENT tag-name (#PCDATA)>     (3.1)


After defining the tags, we can import them into the tool. In our research, two people take part in the process; each person annotates all the files of the corpus, so each document is annotated twice. We divide the corpus into groups, each group containing ten files; we have sixty files, so we have six groups. After the annotators finish their work, they give the files back to the supervisor, and the annotate-documents step finishes.

After the annotation process, the supervisor quality-controls the annotated documents. For quality control, we use the quality tool. As discussed in the last section, we check errors at two levels: the document level and the corpus level.

At the document level:

We compare the two documents that were annotated from the same root document by two users (we call these "pair documents"); the quality tool finds all differences that occur between the pair of documents. The interface of the tool is shown in figure 3.4. Each pair of differing entities is presented on one line. We compare entities on several parameters: word, entity kind, start word position and end word position; each parameter is shown in a column. For example:

The first case shows that user1 identified the word Đông Nam Á as a Location entity while user2 did not annotate it. The second case shows that the word Trung is annotated as a Location in user1's document while user2 annotated the phrase Miền Trung as a Location.

The supervisors explain the differing entities using the context and the annotation guideline. The context is shown in the bottom textbox of the tool. They use the Callisto tool to re-annotate erroneous entities. When we finish the quality control process, we get a high quality corpus.

Figure 3.4: Comparing two users' corpora

At the corpus level:

We use the tool to list all entities and their frequencies. We focus on the low frequency entities and then inspect their context. If the entities are wrong, we use the Callisto tool to correct them. After this step, the corpus precision increases.

In summary, we have presented the building of the Vietnamese NER corpus with offline tools. The process includes three steps: build the annotation guideline, annotate documents, and quality control. Except for building the annotation guideline, we use tools to support the execution of each step.


3.3 Discussion of the Vietnamese NER corpus building process

After applying the process to build the Vietnamese NER corpus, we have some comments.

We have a quality corpus because:

• The annotation guideline is built very carefully.

• Each document is annotated multiple times and quality-controlled at many levels.

All steps in the process are executed manually:

As we presented in the last section, there are three steps in the process. We have to do many manual tasks in these steps, especially the annotation step, such as:

• The supervisor separates the set of files into groups.

• The supervisor assigns groups to the two annotators. Annotators have to manage their files.

• Annotators manage all annotation versions during annotation.

• The supervisor receives all annotated documents from the annotators and then saves the files into folders; each user's files are saved into a separate folder.

Only two users annotate:

Only two users annotate the documents, so the steps take a lot of time. If more people annotated, the supervisor's management would become much more complex. If we use offline tools, the system performance is low.

In summary, to increase the performance of the corpus building process, we will build the Vietnamese NER corpus online. The online process will have some features:


• Many people annotate documents.

• All manual tasks are automated.

• The quality of the corpus is increased.

3.4 Conclusion

In this chapter we presented the corpus building process. It includes three steps: build the annotation guideline, annotate documents, and quality-control the corpus. We applied the process to build the Vietnamese NER corpus; the Callisto tool is used to annotate documents and the quality tool is used to find errors. During annotation, the process showed some disadvantages:

• A limited number of users take part in building the corpus.

• The supervisor and annotators manage the corpus manually.

• The tools are not convenient for quality-controlling the corpus.

To solve all these disadvantages, we built an online annotation framework based on the corpus building process. The framework allows many people to join in annotating, and it automatically manages the corpus and all the activities needed to build it. The framework is presented in the next chapter: "Online annotation framework".


Online Annotation Framework

4.1 Introduction

There are many disadvantages to applying offline tools in the corpus building process, as mentioned in the last chapter: a limited number of annotators, manually executed process steps, etc.

We build a framework based on the process to overcome all these disadvantages. The online framework achieves the following targets:

• Facilitate the annotation process (annotate anytime and anywhere).

• Enable a larger number of annotators.

• Manage files automatically (both annotated and un-annotated files).

• Make quality control of the corpus easier.

The process in the framework is based on the corpus building process (described in the last chapter), so it includes the same main steps: build the annotation guideline, annotate documents, and quality control. Because we re-use the annotation guideline, that step is not described again. To enhance corpus quality, each annotator is trained on the annotation guideline before annotating, in the training session step. The online annotation process therefore includes: training session, annotating documents, and quality control. Each step is presented in a section below. Figure 4.1 illustrates the process.

Figure 4.1: Online annotation process

4.2 Training session

The target of this step is to train annotators on the annotation guideline. After this step, annotators have knowledge about the entities and how to distinguish between them. The training session is a very important step: the quality of the annotated documents depends greatly on the annotators' knowledge. If they do not understand clearly, their annotated documents have many errors, and it is difficult for the supervisor to do quality control.

We did not have this step when we used offline tools: no such module is integrated into the Callisto tool. Furthermore, annotation there is independent of quality control, so training is not easy. This is an advantage of the online framework.


The training session includes some steps:

• Reading the annotation guideline: the system shows the annotation guideline to the annotator, who has to read it carefully.

• Annotating a sample document: after the annotator has read and understood the guideline, they practice annotating a sample document. The document includes some representative sentences which reflect the guideline contents, and a reference version of it has already been annotated exactly.

• Examining and explaining errors: when the annotator finishes annotating, the system compares their annotation with the reference and releases the result. If there are no errors, the annotator finishes the training session and moves to the annotate-documents step. If the annotated document has errors, the system lists them for the annotator and explains each error when the annotator clicks on it.

When the annotator understands the annotation guideline, the training step stops and they can annotate documents.

4.3 Annotating documents

In the online annotation framework, all tasks are automated. The annotator only focuses on the annotation content; every other task is automated. The framework has some features:

• The framework interface is friendly.

• Automatic file distribution to annotators.

• Automatic file management for both root files and annotated files.

This section explains these features.


Figure 4.2: Online annotation tool interface

The online annotation interface is shown in figure 4.2.

The online annotation interface includes several parts:

• Login form (at the top of the interface): the user has to input their username and password to log in to the system. Beside the login button is a sign-up button to create a new user.

• Annotation guideline form: the annotation guideline form opens when its button is clicked. The user opens it to consult the guideline (figure 4.3).

• Annotation form: includes a text box (which contains the file name), a set of buttons (each button represents a kind of entity), and a textarea box (containing the document content).

• A "Save" button to save the annotated document and open a new one, and a "Cancel" button used to exit.

Figure 4.3: Annotation guideline form interface

In the offline tools, the file distribution task is executed manually by the administrator, but in the online framework it is automated. We have several features for distributing files to annotators (a small sketch follows the list below):

• Random distribution: the system randomly chooses a file from the set of root files for the user.

• Sequential distribution: the system chooses files in sequence from the set of root files for the user.


• Quality-based distribution: the system evaluates the quality of each user's annotation. If a user makes many mistakes, their files are redistributed to other users; if a user annotates well, they are given more of the error-prone files.

In our framework, the first and second features have been implemented; the third will be applied in future work.
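As referenced above, here is a minimal sketch of the first two distribution modes, assuming the root files are tracked in simple in-memory lists; the class and method names are illustrative assumptions, not the framework's actual implementation.

import random

# A sketch of the random and sequential distribution modes described above.
# The in-memory list and the names are illustrative assumptions.
class FileDistributor:
    def __init__(self, root_files):
        self.pending = list(root_files)   # un-annotated root files
        self.next_index = 0

    def random_file(self):
        return random.choice(self.pending) if self.pending else None

    def sequential_file(self):
        if self.next_index >= len(self.pending):
            return None
        f = self.pending[self.next_index]
        self.next_index += 1
        return f

dist = FileDistributor(["doc01.seg", "doc02.seg", "doc03.seg"])
print(dist.sequential_file())   # doc01.seg
print(dist.random_file())       # any pending file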

In our framework, there are three kinds of file: root files, annotated files, and reviewed files.

• Root files: the initial files which have not been annotated. They are kept in the root folder; the system loads them into the annotation frame.

• Annotated files: files that have been annotated by one user. Each annotated file is saved in the user's corpus folder, which is named with the format "username + corpus" (user A has the folder "Acorpus"). If user A has annotated document B, document B is saved in the folder "Acorpus".

• Reviewed files: the repaired files obtained after the supervisor reviews and corrects all errors. They are treated as documents of the final corpus and lie in the corpus folder.

The system automatically loads a root file for the user to annotate. When the user clicks the save button, the system automatically saves the file into the corresponding folder (for example, it saves user A's annotated files into the folder "Acorpus") and loads a new file. When the user finishes an annotation session, the system saves the state of the session; if the user logs in next time, the state is loaded and they continue their work.
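A small sketch of this save step, assuming annotated files are written under a per-user folder named "username + corpus"; the paths and function name are illustrative assumptions, not the framework's real code.

from pathlib import Path

# A sketch of saving an annotated document into the per-user corpus folder
# ("username + corpus", e.g. user A -> "Acorpus"). Paths are illustrative.
def save_annotated(username, filename, annotated_text, base_dir="."):
    user_folder = Path(base_dir) / f"{username}corpus"
    user_folder.mkdir(parents=True, exist_ok=True)
    out_path = user_folder / filename
    out_path.write_text(annotated_text, encoding="utf-8")
    return out_path

print(save_annotated("A", "doc01.seg", "<Location>Hà Nội</Location> ..."))
# ./Acorpus/doc01.seg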

In summary, in the annotating-documents step, all tasks are executed automatically. Users do not have to care about file names or where annotated files are saved; they only focus on annotating documents. So the system removes most of the user's routine tasks.


4.4 Quality control

After the users annotate documents, the supervisor quality-controls the corpus. In the online framework, many people can supervise the corpus, and we have several levels of quality control:

• Document level: we have two modes at this level:

– Review document: for documents that have been annotated only once.

– Compare pair documents (pair documents are two documents annotated from the same root document by two separate annotators): for root documents that have been annotated multiple times.

• Corpus level: finding erroneous entities in the corpus (to increase the precision rate) and un-annotated entities (entities in documents that have not been annotated, to increase the recall rate).

• Explain erroneous entities: explain why an entity is an error; we use these explanations as a basis for revising the annotation guideline.

We present each of these levels below.

Both modes (review and compare) are supported by the online quality tool, which we built in the framework for quality control of the corpus at the document level. The tool interface is shown in figure 4.4. When we choose a file to review, the tool shows a list of the users who annotated this file. If only one user annotated it, the tool displays the file content; otherwise, if we choose two users, the tool displays all differing entities in the pair documents (figure 4.5). To show the context, we click the row representing the entity.
