Vietnamese Word Clustering and Antonym Identification
Nguyễn Kim Anh
University of Engineering and Technology. Major: Computer Science; Code: 60 48 01. Supervisor: Dr. Nguyễn Phương Thái
Year of defense: 2013
Abstract: Automatically computing word similarity and clustering words has many important applications in Natural Language Processing (NLP) tasks such as dictionary construction, statistical machine translation, named-entity recognition, functional labeling, and word segmentation. In recent years, word clustering has been widely studied for languages such as English, German, and Chinese. For Vietnamese, however, the task is more recent. In this thesis, I use a large unlabeled Vietnamese corpus of about 15 million words, equivalent to approximately 700 thousand sentences. The data was extracted from the newspapers Lao dong, PC World, and Tuoi tre and then part-of-speech tagged. I investigated several approaches to constructing word clusters in Vietnamese, focusing mainly on two methods, those of Brown and Dekang Lin. I use the same Vietnamese corpus and the same evaluation tool for both methods so that I can compare and evaluate their effects in certain NLP tasks. In addition, I use a statistical method to propose 20 antonym frames that can be used to identify antonym classes within clusters.
Keywords: Computer science; Natural language processing; Word clusters; Antonyms
Table of Contents
Acknowledgements
Abstract
Chapter I - Introduction
1.1 Word Similarity
1.2 Hierarchical Clustering of Word
1.3 Function tags
1.4 Objectives of the Thesis
1.5 Our Contributions
1.6 Thesis structure
Chapter II - Related Works
2.1 Word Clustering
2.1.1 The Brown algorithm
2.1.2 Sticky Pairs and Semantic Classes
2.2 Word Similarity
2.2.1 Approach
2.2.2 Grammar Relationships
2.2.3 Results
2.3 Clustering By Committee
2.3.1 Motivation
2.3.2 Algorithm
2.3.3 Results
Chapter III - Our approach
3.1 Word clustering in Vietnamese
3.1.1 Brown's algorithm
3.1.2 Word similarity
3.2 Evaluating Methodology
3.3 Antonym classes
3.3.1 Ancillary antonym
3.3.2 Coordinated antonym
3.3.3 Minor classes
Chapter IV - Experiment
4.1 Results and Comparison
4.2 Antonym frames
4.3 Effectiveness of Word Cluster feature in Vietnamese Functional labeling
4.4 Error analyses
4.5 Summarization
Chapter V - Conclusion and Future works
5.1 Conclusion
5.2 Future works
Bibliography
[1] A. L. Berger, S. A. Della Pietra, V. J. Della Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 1996.
[2] Abney, S. (2004). Understanding the Yarowsky Algorithm. Computational Linguistics, 30(3).
[3] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, Tu-Bao Ho. 2009. An Experimental Study on Lexicalized Statistical Parsing for Vietnamese. Proceedings of KSE 2009, pp. 162-167.
[4] Blum, A. and Chawla, S. (2001). Learning from Labeled and Unlabeled Data Using Graph Mincuts. In Proceedings of ICML 2001.
[5] Blum, A. and Mitchell, T. (1998). Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Workshop on Computational Learning Theory.
[6] Caixia Yuan, Fuji Ren, and Xiaojie Wang. Accurate Learning for Chinese Function Tags from Minimal Features. 2009.
[7] Collins, M. and Singer, Y. (1999). Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
[8] Dekang Lin, Xiaoyun Wu. Phrase Clustering for Discriminative Learning. ACL/AFNLP 2009: 1030-1038.
[9] Dekang Lin, Patrick Pantel. Induction of Semantic Classes from Natural Language Text. KDD 2001: 317-322.
[10] D. Lin. Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada, August 1998.
[11] D. Lin. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL-97, Madrid, Spain, July 1997.
[12] Dekang Lin. 1997. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL/EACL-97, pages 64-71, Madrid, Spain, July.
[13] Don Blaheta. Function Tagging. PhD thesis, 2003.
[14] Don Blaheta, Eugene Charniak. Assigning Function Tags to Parsed Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.
[15] Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1993. Contextual Word Similarity and Estimation from Sparse Data. In Proceedings of ACL-93, pages 164-171, Columbus, Ohio, June.
[16] Jones, S. Antonymy: A Corpus-Based Perspective. Routledge, 2002. eScholarID: 4b966.
[17] Katz, Fodor. The Structure of a Semantic Theory. 1963.
[18] Lafferty, J., McCallum, A., Pereira, F. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML 2001, pages 282-289, Williamstown, USA.
[19] Miller, S., Guinness, J., and Zamanian, A. (2004). Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 2004, pages 337-342.
[20] Nianwen Xue, Martha Palmer, CIS Department, University of Pennsylvania. Automatic Semantic Role Labeling for Chinese Verbs. 2004.
[21] Nguyen Thanh Huy, Nguyen Kim Anh, Nguyen Phuong Thai. Building an Efficient Functional-Tag Labeling System for Vietnamese. KSE 2011: 92-97.
[22] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467-479.
[23] Patrick Pantel. 2003. Clustering by Committee. Ph.D. Dissertation, Department of Computing Science, University of Alberta.
[24] Percy Liang. Semi-Supervised Learning for Natural Language. Massachusetts Institute of Technology, 2005.
[25] Phuong-Thai Nguyen, Xuan-Luong Vu, Minh-Huyen Nguyen, Van-Hiep Nguyen, Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP 2009.
[26] T. Koo, X. Carreras, and M. Collins. Simple Semi-supervised Dependency Parsing. In Proc. ACL, 2008, pp. 595-603.