Towards a Framework for Building an Annotated Named Entities Corpus
Hoang Huu Son
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
Supervised by Doctor Pham Bao Son
A thesis submitted in fulfillment of the requirements for the degree of
Master of Information Technology
June, 2010
Table of Contents
1.1 Overview of Named Entity Recognition (NER)
1.2 NER Approaches
1.2.1 Rule-based approach
1.2.2 Machine learning approach
1.2.3 Comparison
1.3 Thesis contribution
1.4 Thesis structure
2 Related Work
2.1 Overview of our problem
2.2 Research on building NER corpora
2.3 Research on the corpus building process
2.4 Overview of annotation tools
2.5 Summary
3 Corpus building process
3.1 Corpus building process
3.1.1 Objective
3.1.2 Building the annotation guideline
3.1.3 Annotating documents
3.1.4 Quality control
3.2 Building a Vietnamese NER corpus with offline tools
3.2.1 Building the annotation guideline
3.2.2 Annotating documents
3.2.3 Quality control
3.3 Discussion of the Vietnamese NER corpus building process
3.4 Conclusion
4.1 Introduction
4.2 Training section
4.3 Annotating documents
4.3.1 Online annotation interface
4.3.2 Automatic file distribution to annotators
4.3.3 Automatic saving and management of files
4.4 Quality control
4.4.1 Document level
4.4.2 Corpus level
4.4.3 Explaining unusual entities
4.5 Conclusion
5 Evaluation
5.1 Introduction
5.2 Corpus evaluation
5.2.1 Inter-annotator agreement
5.2.2 Offline corpus evaluation
5.2.3 Online corpus
5.3 Time cost
5.3.1 Overview
5.3.2 Offline process
5.3.3 Online framework
5.4 Named entity recognition system
5.4.1 Preprocessing
5.4.2 Gazetteer
5.4.3 Transducer
5.4.4 Experiment
5.5 Summary
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future work
6.2.1 Creating a bigger, higher-quality corpus
6.2.2 Improving the online annotation framework
6.2.3 Building a statistical NER system
A.1 Basic concepts
A.1.1 Entity and entity name
A.1.2 Instance of an entity
A.1.3 List of entities
A.1.4 Entity recognition rules
A.2 Entity classification
A.2.1 Person
A.2.2 Organization
A.2.3 Location
A.2.4 Facility
A.2.5 Religion
Toward a Framework for Building a Named Entity Corpus
Hoang Huu Son
University of Engineering and Technology, Vietnam National University, Hanoi
144, Xuan Thuy, Cau Giay, Hanoi, Vietnam
Abstract
Named entity recognition (NER) is one of the most interesting
problems in natural language processing. However, a main
barrier to NER research is that building a NER corpus is
difficult, and no NER corpus has been published. Therefore,
in this thesis, we present a corpus building process and a
framework for building NER corpora, in particular a
Vietnamese named entity corpus.
1 Introduction
Please note the following points.
- The context of the research and its role/importance.
- Related studies and their methods/solutions/approaches.
- The remaining problems and the objective of this study/thesis.
- Your proposal: what will be carried out?
2
- You can arrange one or more sections after the
Introduction.
- You can use subsections.
- Show how the problem is formulated. You may give
some background if necessary.
- Show different aspects of the problem, for example
feature selection, learning algorithms, etc.
- Show your proposal. It is good if you can present the
differences between your proposal and previous studies. It
is also important to present and analyze the solution in a
reasonable way.
- Show how features are selected/built, and the
algorithms/methods you will use.
3 Experiments
Jana Kravalová and Zdeněk Žabokrtský built the Czech Named
Entity Corpus, which is presented in [?]. This recently
released corpus consists of Czech sentences with manually
annotated named entities, for which a rich two-level
classification scheme was used.
You should give the information as follows:
- How are the models designed? You can design different models/parameters, so please describe them in detail.
- How are the data prepared?
- The results should be presented in tables and graphs.
- It is important to discuss the experimental results after obtaining them.
4 Conclusions
- With regard to the objective of this study as stated
in the introduction, what has been done?
- The contribution of your work and the meaning of the obtained results.
- Present future work if needed.
Publications
- List here your publications during this master's course.
- You can also list your submissions and their status (e.g., submitted, revising, in press).