
DSpace at VNU: Towards a framework for building an annotated named entities corpus



Towards a framework for building an annotated named entities corpus

Hoàng Hữu Sơn

University of Engineering and Technology (Trường Đại học Công nghệ). Master's thesis, major: Computer Science; Code: 60 48 01

Supervisor: Assoc. Prof. Dr. Phạm Bảo Sơn

Year of defense: 2010

Keywords: Information networks; Information technology; Natural language; Artificial intelligence

Content

Table of Contents

1.1 Overview of Named Entity Recognition (NER)

1.2 NER approaches

1.2.1 Rule-based approach

1.2.2 Machine learning approach

1.2.3 Comparison

1.3 Thesis contribution

1.4 Thesis structure

2 Related Work

2.1 Overview of our problem

2.2 Research on building NER corpora

2.3 Research on the corpus building process

2.4 Overview of annotation tools

2.5 Summary

3 Corpus building process

3.1 Corpus building process

3.1.1 Objective

3.1.2 Building annotation guidelines

3.1.3 Annotating documents

3.1.4 Quality control

3.2 Building a Vietnamese NER corpus with offline tools

3.2.1 Building annotation guidelines

3.2.2 Annotating documents

3.2.3 Quality control

3.3 Discussion of the Vietnamese NER corpus building process

3.4 Conclusion

4 Online Annotation Framework

4.1 Introduction

4.2 Training section

4.3 Annotating documents

4.3.1 Online annotation interface

4.3.2 Automated file distribution to annotators

4.3.3 Automated file saving and management

4.4 Quality control

4.4.1 Document level

4.4.2 Corpus level

4.4.3 Explaining unusual entities

4.5 Conclusion

5 Evaluation

5.1 Introduction

5.2 Corpus evaluation

5.2.1 Inter-annotator agreement

5.2.2 Offline corpus evaluation

5.2.3 Online corpus

5.3 Time cost

5.3.1 Overview

5.3.2 Offline process

5.3.3 Online framework

5.4 Named entity recognition system

5.4.1 Preprocessing

5.4.2 Gazetteer

5.4.3 Transducer

5.4.4 Experiment

5.5 Summary

6 Conclusion and Future Work

6.1 Conclusion

6.2 Future work

6.2.1 Creating a larger, higher-quality corpus

6.2.2 Improving the online annotation framework

6.2.3 Building a statistical NER system

A Named Entity guideline

A.1 Basic concepts

A.1.1 Entity and Entity Name

A.1.2 Instance of entity

A.1.3 List of Entities

A.1.4 Entity recognition rules

A.2 Entity classification

A.2.1 Person

A.2.2 Organization

A.2.3 Location

A.2.4 Facility

A.2.5 Religion



Date posted: 18/12/2017, 02:46
