Extraction of Vietnamese collocation from text corpora
Đỗ Thị Ngọc Quỳnh
University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Dr. Lê Anh Cường
Year of defense: 2011
Abstract: Collocations are widely applied in linguistics, dictionary compilation, and natural language processing. Extracting the collocations of each language is therefore necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. In Vietnam, however, the study of collocations is still quite a new field. This thesis investigates several collocation extraction methods in order to find an efficient model for extracting Vietnamese collocations. The methods are based on commonly used classical statistical measures: frequency, the t-test, chi-square, and mutual information. We also propose a general method that uses a linguistic measure to increase the accuracy of extraction. The input data consist of POS-tagged text and parsed text. By running the program with individual methods and with combinations of several methods, and comparing their accuracy, we identify an efficient method for extracting Vietnamese collocations from text corpora.
Keywords: Language processing; Data processing; Natural language; Artificial intelligence
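As an illustration of the statistical measures named in the abstract, the following Python sketch scores one candidate bi-gram by frequency, t-test, chi-square, and point-wise mutual information. It uses the standard textbook formulations of these measures; the function name `association_scores`, its parameters, and the example counts are assumptions made for illustration and are not taken from the thesis.

```python
import math


def association_scores(c12, c1, c2, n):
    """Score a candidate bi-gram (w1, w2) from raw corpus counts.

    c12: occurrences of the bi-gram "w1 w2" (assumed > 0)
    c1:  occurrences of w1
    c2:  occurrences of w2
    n:   total number of bi-grams in the corpus
    All names are hypothetical; they are not from the thesis.
    """
    p1, p2, p12 = c1 / n, c2 / n, c12 / n

    # t-test: observed bi-gram probability against the independence
    # hypothesis p1 * p2, with the sample variance approximated by p12.
    t = (p12 - p1 * p2) / math.sqrt(p12 / n)

    # Chi-square on the 2x2 contingency table of (w1 / not w1) x (w2 / not w2).
    o11 = c12                  # w1 followed by w2
    o12 = c1 - c12             # w1 followed by something else
    o21 = c2 - c12             # something else followed by w2
    o22 = n - c1 - c2 + c12    # neither word in its position
    chi2 = (n * (o11 * o22 - o12 * o21) ** 2) / (
        (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    )

    # Point-wise mutual information in bits.
    pmi = math.log2(p12 / (p1 * p2))

    return {"frequency": c12, "t": t, "chi2": chi2, "pmi": pmi}


if __name__ == "__main__":
    # Illustrative counts only, not data from the thesis.
    print(association_scores(c12=8, c1=15828, c2=4675, n=14307668))
```

Ranking all candidate bi-grams by one of these scores, or by a combination of them, gives the kind of comparison between methods that the abstract describes.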
Content
Table of Contents
1.1 Definitions
1.2 Related works and motivation
1.3 Contribution of the thesis
2 Collocation: concept, roles and applications
2.1 Collocations’ characteristics
2.1.1 Recurrent
2.1.2 Arbitrary
2.1.3 Domain-dependent
2.1.4 Non-substitutability (closely linked in terms of vocabulary)
2.2 Classification of collocations
2.2.1 Idiomatic Phrases
2.2.2 Support Verb Construction
2.2.3 Fixed Phrases
2.3 Applications
2.4 Vietnamese collocations
3 Basic methods in Collocation extraction
3.1 Frequency
3.2 Hypothesis testing
3.2.1 T-Test
3.2.2 Chi-Square
3.3 Point-wise Mutual Information (PMI)
4 Our proposal for extracting Vietnamese collocation
4.1 Patterns for Vietnamese collocation
4.2 The Linguistic Measure
4.3 Designed model
5 Experiments
5.1 Data preparation
5.1.1 Collecting corpora
5.1.2 Extracting bi-grams
5.1.3 Adding syntactic information to bi-grams
5.2 The test models
5.3 Experimental results with statistical methods
5.3.1 Bi-grams with syntactic information
5.4 The experiments of our proposal
Bibliography
Thesis-related publications:
Le Anh Cuong, Do Thi Ngoc Quynh, and Cao Van Viet. Building and Evaluating Vietnamese Language Models. VNU Journal of Science (under revision).
Le Anh Cuong and Do Thi Ngoc Quynh. Vietnamese collocation extraction (to be submitted).