Vietnamese Word Clustering and Antonym Identification

Nguyễn Kim Anh

University of Engineering and Technology (Trường Đại học Công nghệ)
Major: Computer Science; Code: 60 48 01
Supervisor: Dr. Nguyễn Phương Thái
Year of defense: 2013

Abstract: Automatically constructing and clustering similar words has many important applications in Natural Language Processing (NLP) tasks such as dictionary construction, statistical machine translation, named-entity recognition, functional labeling, and word segmentation. In recent years, word clustering has been widely studied for languages such as English, German, and Chinese. For Vietnamese, however, the task is more recent. In this thesis, I use a large unlabeled Vietnamese corpus of about 15 million words, equivalent to approximately 700 thousand sentences. This data was extracted from the newspapers Lao dong, PC World, and Tuoi tre, and then part-of-speech tagged. I investigated several approaches to constructing word clusters in Vietnamese, focusing on two main methods, by Brown and by Dekang Lin. I use the same Vietnamese corpus and the same evaluation tool for both methods so that I can compare them and evaluate their effects on certain NLP tasks. In addition, I use a statistical method to propose 20 antonym frames that can be used to identify antonym classes within clusters.

Keywords: Computer science; Natural language processing; Word clusters; Antonyms
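The abstract mentions a statistical method for proposing antonym frames but does not spell it out here. As a rough illustration only, the Python sketch below shows one way such frames might be counted: given a few seed antonym pairs, it tallies the short word patterns that link the two members of a pair inside a sentence. The function name `count_frames`, the `max_gap` parameter, the `X`/`Y` placeholders, and the toy sentences are all assumptions made for this example, not the procedure used in the thesis.

```python
from collections import Counter


def count_frames(sentences, seed_pairs, max_gap=3):
    """Count the patterns ("frames") that link known antonym pairs.

    sentences  -- list of token lists (already word-segmented)
    seed_pairs -- iterable of (word, word) antonym pairs (hypothetical seeds)
    max_gap    -- maximum number of tokens allowed between the two words
    """
    seeds = {frozenset(pair) for pair in seed_pairs}
    frames = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            # Look a few tokens ahead for the antonym partner of `left`.
            for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
                if frozenset((left, tokens[j])) in seeds:
                    # Replace the pair with placeholders, keep what lies between.
                    frame = ("X",) + tuple(tokens[i + 1:j]) + ("Y",)
                    frames[frame] += 1
    return frames


# Toy data invented for illustration (not taken from the thesis corpus).
toy_sentences = [
    ["nhà", "cao", "hay", "thấp"],
    ["người", "giàu", "hay", "nghèo"],
    ["từ", "giàu", "đến", "nghèo"],
]
toy_seeds = [("cao", "thấp"), ("giàu", "nghèo")]

print(count_frames(toy_sentences, toy_seeds).most_common())
```

Under this sketch, frames that recur across many seed pairs would be the candidates worth keeping; how the thesis actually selects its 20 frames is not stated in this summary.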

Table of Contents

Acknowledgements
Abstract

Chapter I - Introduction
1.1 Word Similarity
1.2 Hierarchical Clustering of Word
1.3 Function tags
1.4 Objectives of the Thesis
1.5 Our Contributions
1.6 Thesis structure

Chapter II - Related Works
2.1 Word Clustering
2.1.1 The Brown algorithm
2.1.2 Sticky Pairs and Semantic Classes
2.2 Word Similarity
2.2.1 Approach
2.2.2 Grammar Relationships
2.2.3 Results
2.3 Clustering By Committee
2.3.1 Motivation
2.3.2 Algorithm
2.3.3 Results

Chapter III - Our approach
3.1 Word clustering in Vietnamese
3.1.1 Brown's algorithm
3.1.2 Word similarity
3.2 Evaluating Methodology
3.3 Antonym classes
3.3.1 Ancillary antonym
3.3.2 Coordinated antonym
3.3.3 Minor classes

Chapter IV - Experiment
4.1 Results and Comparison
4.2 Antonym frames
4.3 Effectiveness of Word Cluster feature in Vietnamese Functional labeling
4.4 Error analyses
4.5 Summarization

Chapter V - Conclusion and Future works
5.1 Conclusion
5.2 Future works

Bibliography

Bibliography

[1] A. L. Berger, S. A. D. Pietra, V. J. D. Pietra. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 1996.

[2] Abney, S. (2004). Understanding the Yarowsky Algorithm. Computational Linguistics, 30(3).

[3] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, Tu-Bao Ho. 2009. An Experimental Study on Lexicalized Statistical Parsing for Vietnamese. Proceedings of KSE 2009, pp. 162-167.

[4] Blum, A. and Chawla, S. (2001). Learning from Labeled and Unlabeled Data Using Graph Mincuts. In Proceedings of ICML 2001.

[5] Blum, A. and Mitchell, T. (1998). Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Workshop on Computational Learning Theory.

[6] Caixia Yuan, Fuji Ren, and Xiaojie Wang. Accurate Learning for Chinese Function Tags from Minimal Features. 2009.

[7] Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

[8] Dekang Lin, Xiaoyun Wu. Phrase Clustering for Discriminative Learning. ACL/AFNLP 2009: 1030-1038.

[9] Dekang Lin, Patrick Pantel. Induction of semantic classes from natural language text. KDD 2001: 317-322.

[10] D. Lin. Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada, August 1998.

[11] D. Lin. Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL-97, Madrid, Spain, July 1997.

[12] Dekang Lin. 1997. Using syntactic dependency as local context to resolve word sense ambiguity. In Proceedings of ACL/EACL-97, pages 64–71, Madrid, Spain, July.

[13] Don Blaheta. Function tagging. PhD thesis, 2003.

[14] Don Blaheta, Eugene Charniak. Assigning Function Tags to Parsed Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.

[15] Ido Dagan, Shaul Marcus, and Shaul Markovitch. 1993. Contextual word similarity and estimation from sparse data. In Proceedings of ACL-93, pages 164–171, Columbus, Ohio, June.

[16] Jones, S. Antonymy: a corpus-based perspective. Routledge, 2002. eScholarID: 4b966.

[17] Katz, Fodor. The Structure of a Semantic Theory. 1963.

[18] Lafferty, J., McCallum, A., Pereira, F. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of ICML 2001, pages 282-289, Williamstown, USA.

[19] Miller, S., Guinness, J., and Zamanian, A. (2004). Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 2004, pages 337–342.

[20] Nianwen Xue, Martha Palmer, CIS Department, University of Pennsylvania. Automatic Semantic Role Labeling for Chinese Verbs. 2004.

[21] Nguyen Thanh Huy, Nguyen Kim Anh, Nguyen Phuong Thai. Building an Efficient Functional-Tag Labeling System for Vietnamese. KSE 2011: 92-97.

[22] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479.

[23] Patrick Pantel. 2003. Clustering by Committee. Ph.D. Dissertation, Department of Computing Science, University of Alberta.

[24] Percy Liang. Semi-supervised learning for natural language. Massachusetts Institute of Technology, 2005.

[25] Phuong-Thai Nguyen, Xuan-Luong Vu, Minh-Huyen Nguyen, Van-Hiep Nguyen, Hong-Phuong Le. Building a Large Syntactically-Annotated Corpus of Vietnamese. The 3rd Linguistic Annotation Workshop (LAW), ACL-IJCNLP 2009.

[26] T. Koo, X. Carreras, and M. Collins. Simple Semi-supervised Dependency Parsing. In Proc. ACL, 2008, pp. 595-603.

Posted: 25/08/2015, 16:21
