
DSpace at VNU: Extraction of Vietnamese collocation from text corpora




Extraction of Vietnamese collocation from text corpora

Đỗ Thị Ngọc Quỳnh

Trường Đại học Công nghệ (University of Engineering and Technology). Master's thesis, major: Computer Science; Code: 60 48 01

Supervisor: Dr. Lê Anh Cường

Year of defense: 2011

Abstract: Collocations are widely applied in linguistics, in dictionary compilation, and in natural language processing. Extracting the collocations of each language is therefore necessary, both to improve the accuracy and naturalness of natural language processing applications and to make learning a new language easier. In Vietnam, however, the study of collocations is still quite a new field. This work examines several collocation extraction methods in order to find an efficient model for extracting Vietnamese collocations. The methods are based on commonly used classic statistical measures: frequency, the t-test, chi-square, and mutual information. We also propose a general method that uses a linguistic measure to increase the accuracy of extraction. The input data consists of POS-tagged data and parsed data. By running the program with individual methods and with combinations of several methods, and comparing their accuracy, we identify an efficient method for extracting Vietnamese collocations from text corpora.
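The statistical measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook formulations of the t-score, the 2x2 chi-square statistic, and point-wise mutual information over adjacent word pairs; the function name `bigram_scores` and the toy token list are illustrative assumptions, not the thesis's own implementation or data.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score each adjacent word pair with frequency, t-score, chi-square and PMI.

    Hypothetical helper for illustration; assumes `tokens` is an already
    word-segmented list of strings with at least two items.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_count = Counter(bigrams)
    first_count = Counter(w1 for w1, _ in bigrams)    # w1 in first position
    second_count = Counter(w2 for _, w2 in bigrams)   # w2 in second position

    scores = {}
    for (w1, w2), c12 in pair_count.items():
        c1, c2 = first_count[w1], second_count[w2]
        observed = c12 / n                 # P(w1 w2)
        expected = (c1 / n) * (c2 / n)     # P(w1) * P(w2) under independence

        # t-score, using the common approximation s^2 ~ observed probability
        t = (observed - expected) / math.sqrt(observed / n)

        # chi-square on the 2x2 contingency table of (w1 / not w1) x (w2 / not w2)
        o11, o12, o21 = c12, c1 - c12, c2 - c12
        o22 = n - c1 - c2 + c12
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

        # point-wise mutual information in bits
        pmi = math.log2(observed / expected)

        scores[(w1, w2)] = {"freq": c12, "t": t, "chi2": chi2, "pmi": pmi}
    return scores


# Toy input (assumption): rank candidate pairs by one of the measures.
toy_tokens = "a b a b c d".split()
ranked = sorted(bigram_scores(toy_tokens).items(),
                key=lambda item: item[1]["chi2"], reverse=True)
for pair, measures in ranked:
    print(pair, measures)
```

Each measure ranks candidate pairs differently: raw frequency favours common pairs, while PMI tends to favour rare ones. That difference is one reason comparing and combining several measures, as the abstract describes, can be worthwhile.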

Keywords: Language processing; Data processing; Natural language; Artificial intelligence

Content

Table of Contents

1.1 Definitions
1.2 Related works and motivation
1.3 Contribution of the thesis

2 Collocation: concept, roles and applications
2.1 Collocations’ characteristics
2.1.1 Recurrent
2.1.2 Arbitrary
2.1.3 Domain-dependent
2.1.4 Non-substitutability (closely linked in terms of vocabulary)
2.2 Classification of collocations
2.2.1 Idiomatic Phrases
2.2.2 Support Verb Construction
2.2.3 Fixed Phrases
2.3 Applications
2.4 Vietnamese collocations

3 Basic methods in Collocation extraction
3.1 Frequency
3.2 Hypothesis testing
3.2.1 T-Test
3.2.2 Chi-Square
3.3 Point-wise Mutual Information (PMI)

4 Our proposal for extracting Vietnamese collocation
4.1 Patterns for Vietnamese collocation
4.2 The Linguistic Measure
4.3 Designed model

5 Experiments
5.1 Data preparation
5.1.1 Collecting corpora
5.1.2 Extracting bi-grams
5.1.3 Adding syntactic information to bi-grams
5.2 The test models
5.3 Experimental results with statistical methods
5.3.1 Bi-grams with syntactic information
5.4 The experiments of our proposal

Bibliography


Thesis-related publications:

Le Anh Cuong, Do Thi Ngoc Quynh, and Cao Van Viet. Building and Evaluating Vietnamese Language Models. VNU Journal of Science (under revision).

Le Anh Cuong and Do Thi Ngoc Quynh. Vietnamese collocation extraction (to be submitted).

Posted: 15/12/2017, 08:25
