Towards a framework for building an annotated named entities corpus
Hoàng Hữu Sơn
University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Assoc. Prof. Dr. Phạm Bảo Sơn
Year of defense: 2010
Keywords: Information networks; Information technology; Natural language; Artificial intelligence
Content
Table of Contents
1.1 Overview of Named Entity Recognition (NER) 1
1.2 NER Approach 3
1.2.1 Rule based approach 3
1.2.2 Machine learning Approach 4
1.2.3 Comparing 5
1.3 Thesis contribution 6
1.4 Thesis structure 7
2 Related Work 8
2.1 Overview of our problem 8
2.2 Research on building NER corpora 9
2.3 Research on the corpus building process 10
2.4 Overview of annotation tools 11
2.5 Summary 12
3 Corpus building process 13
3.1 Corpus building process 13
3.1.1 Objective 13
3.1.2 Build annotation guideline 14
3.1.3 Annotate documents 16
3.1.4 Quality control 17
3.2 Building Vietnamese NER corpus by off-line tools 20
3.2.1 Build annotation guideline 20
3.2.2 Annotate documents 22
3.2.3 Quality control 24
3.3 Discussion of the Vietnamese NER corpus building process 26
3.4 Conclusion 27
4 Online Annotation Framework 28
4.1 Introduction 28
4.2 Training section 29
4.3 Annotation documents 30
4.3.1 Online annotation interface 31
4.3.2 Automated file distribution for annotators 32
4.3.3 Automated file saving and management 33
4.4 Quality control 34
4.4.1 Document level 34
4.4.2 Corpus level 35
4.4.3 Explain unusual entity 37
4.5 Conclusion 38
5 Evaluation 39
5.1 Introduction 39
5.2 Corpus evaluation 40
5.2.1 Inter-annotator agreement 41
5.2.2 Offline corpus evaluation 42
5.2.3 Online corpus 45
5.3 Time cost 47
5.3.1 Overview 47
5.3.2 Offline process 48
5.3.3 Online framework 49
5.4 Named entity recognition system 51
5.4.1 Preprocessing 52
5.4.2 Gazetteer 54
5.4.3 Transducer 54
5.4.4 Experiment 56
5.5 Summary 58
6 Conclusion and Future Work 60
6.1 Conclusion 60
6.2 Future work 62
6.2.1 Create a bigger, higher-quality corpus 62
6.2.2 Improve the online annotation framework 63
6.2.3 Build a statistical NER system 63
A Named Entity guideline 64
A.1 Basic concepts 64
A.1.1 Entity and Entity Name 64
A.1.2 Instance of entity 64
A.1.3 List of Entities 64
A.1.4 Entity recognition rules 65
A.2 Entity classification 65
A.2.1 Person 65
A.2.2 Organization 67
A.2.3 Location 68
A.2.4 Facility 69
A.2.5 Religion 69