Towards a framework for building an annotated named entities corpus



Hoang Huu Son

Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi

Supervised by Doctor Pham Bao Son

A thesis submitted in fulfillment of the requirements for the degree of

Master of Information Technology

June, 2010


Table of Contents

1.1 Overview of Named Entity Recognition (NER) 1

1.2 NER Approaches 3

1.2.1 Rule-based approach 3

1.2.2 Machine learning approach 4

1.2.3 Comparison 5

1.3 Thesis contribution 6

1.4 Thesis structure 7

2 Related Work 8

2.1 Overview of our problem 8

2.2 Research on building NER corpora 9

2.3 Research on the corpus building process 10

2.4 Overview of annotation tools 11

2.5 Summary 12

3 Corpus building process 13

3.1 Corpus building process 13

3.1.1 Objective 13

3.1.2 Building the annotation guideline 14

3.1.3 Annotating documents 16

3.1.4 Quality control 17

3.2 Building a Vietnamese NER corpus with offline tools 20

3.2.1 Building the annotation guideline 20

3.2.2 Annotating documents 22

3.2.3 Quality control 24

3.3 Discussion of the Vietnamese NER corpus building process 26

3.4 Conclusion 27


4.1 Introduction 28

4.2 Training section 29

4.3 Annotating documents 30

4.3.1 Online annotation interface 31

4.3.2 Automatic file distribution to annotators 32

4.3.3 Automatic file saving and management 33

4.4 Quality control 34

4.4.1 Document level 34

4.4.2 Corpus level 35

4.4.3 Explaining unusual entities 37

4.5 Conclusion 38

5 Evaluation 39

5.1 Introduction 39

5.2 Corpus evaluation 40

5.2.1 Inter-annotator agreement 41

5.2.2 Offline corpus evaluation 42

5.2.3 Online corpus 45

5.3 Time cost 47

5.3.1 Overview 47

5.3.2 Offline process 48

5.3.3 Online framework 49

5.4 Named entity recognition system 51

5.4.1 Preprocessing 52

5.4.2 Gazetteer 54

5.4.3 Transducer 54

5.4.4 Experiment 56

5.5 Summary 58

6 Conclusion and Future Work 60

6.1 Conclusion 60

6.2 Future work 62

6.2.1 Creating a larger, higher-quality corpus 62

6.2.2 Improving the online annotation framework 63

6.2.3 Building a statistical NER system 63


A.1 Basic concepts 64

A.1.1 Entity and Entity Name 64

A.1.2 Instance of entity 64

A.1.3 List of Entities 64

A.1.4 Entity recognition rules 65

A.2 Entity classification 65

A.2.1 Person 65

A.2.2 Organization 67

A.2.3 Location 68

A.2.4 Facility 69

A.2.5 Religion 69


Toward a Framework for Building a Named Entity Corpus

Hoang Huu Son University of Engineering and Technology Vietnam National University, Hanoi

144, Xuan Thuy, Cau Giay, Hanoi, Vietnam

Abstract

Named entity recognition (NER) is one of the most interesting problems in the natural language processing domain. However, a main barrier to NER research is the difficulty of building an NER corpus, and hardly any NER corpora have been published. In this thesis, we therefore present a corpus building process and frameworks for building NER corpora, in particular a Vietnamese named entity corpus.

1 Introduction

Please note the following points.

- The context of the research and its role/importance

- Related studies and their methods/solutions/approaches

- The remaining problems and the objective of this study/thesis

- Your proposal: what will be carried out?

2

- You can arrange one or more sections after the Introduction.

- You can use subsections.

- Show how the problems are formulated. You may give some foundations if necessary.

- Show different aspects of the problems, for example: feature selection, learning algorithms, etc.

- Show your proposal; it is good if you can present the differences between your proposal and previous studies. It is also important to show/analyze the solution in a reasonable way.

- Show how features are selected/built, and the algorithms/methods you will use.

3 Experiments

Jana Kravalová and Zdeněk Žabokrtský built the Czech Named Entity Corpus, which is presented in paper [?]. This recently released corpus consists of Czech sentences with manually annotated named entities, for which a rich two-level classification scheme was used.

You should give the information as follows:

- How are the models designed? You can design different models/parameters, so please describe them in detail.

- How are the data prepared?

- The results should be presented in tables and graphs.

- It is important to give a discussion after obtaining the experimental results.

4 Conclusions

- With regard to the objectives of this study as stated in the introduction, which have been achieved?

- The contribution of your work and the meaning of the obtained results.

- Present future work if needed.

Publications

- List here your publications during this master's course.

- You can also list here your submissions and their status (e.g., submitted, revising, in press).

