Identifying coordinated compound words for Vietnamese word segmentation


VIETNAM NATIONAL UNIVERSITY, HA NOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY

ENTRY FOR THE STUDENT SCIENTIFIC RESEARCH AWARD, 2012

Title:

Identifying coordinated compound words for Vietnamese word segmentation

Supervisors: Dr. Nguyễn Phương Thái

MSc. Trần Ngọc Anh


Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages use blanks to separate words, which makes word segmentation easy for a tokenizer. Vietnamese words, by contrast, can consist of one, two, or more syllables. In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. Most Vietnamese dictionaries define only a small number of coordinated compound words. Since most natural language processing systems depend heavily on a dictionary in the word segmentation step, many problems appear when the tokenizer has to detect coordinated compound words. We are building a coordinated compound word dictionary with a large number of entries, which we hope will help improve the accuracy of Vietnamese word segmentation.




Chapter 1

Introduction

Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages use blanks to separate words, which makes word segmentation easy for a tokenizer. Vietnamese words can consist of one, two, or more syllables. In general, the meaning of a Vietnamese compound word is created by combining the meanings of the syllables that make it up, and blanks separate syllables rather than words. This creates problems for all natural language processing tasks. The main problems include word ambiguity, unknown word detection, and proper name recognition.


Chapter 2

Vietnamese word segmentation

2.1 Coordinated Compound Word

2.1.1 Definition

Coordinated compound words are made up of two or more syllables, and the meaning of each word is a combination of the meanings of its syllables, which are similar in meaning. The syllables that make up a coordinated compound word stand in an equal relation. In other words, the meaning of a coordinated compound word is more general than that of each syllable and is based equally on the meanings of all of them.

The order of the syllables in a coordinated compound word can often be changed. For example: “quần áo”, “áo quần”, “chung riêng”, “riêng chung”, “đen đỏ”, “đỏ đen”, “ốm đau”, “đau ốm”, …

2.1.2 Types of coordinated compound words

There are three types of coordinated compound words:

• All syllables are of Vietnamese origin: “đất nước”, “trời đất”, “đất cát”, “ruộng đất”, “ruộng vườn”, “ruộng nương”, “ấm chén”, “bát đĩa”, “đỏ đen”, “trắng đen”, “may rủi”, etc.

• All syllables are borrowed from Chinese: “ân nghĩa”, “nam nữ”, “đầu não”, “đấu tranh”, “học tập”, “lợi lộc”, “thuận lợi”, etc.

• One syllable is of Vietnamese origin and the other is borrowed from Chinese: “binh lính”, “bụng dạ”, “lính tráng”, “nuôi dưỡng”, “gan dạ”, etc.

2.2 VCL Dictionary

In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. A good-quality lexicon should provide the language processing system with linguistic information at many different levels, such as morphology, grammar, and semantics, and should be usable in both monolingual and multilingual processing systems.

VCL (Vietnamese Computational Lexicon) is a dictionary from Vietlex with 35,000 words, created for natural language processing purposes. Each word in the dictionary is represented with morphological, syntactic, and semantic information.

• Morphology:

Morphological information includes HeadWord and WordType.

Figure 1 Basic information and morphology of “bàn” (noun)

• Syntax:

Syntactic information includes category (noun, verb, adverb, adjective, etc.), subcategory (proper noun, countable noun, abstract noun, etc.), frame set, forward, and backward.

Figure 2 Syntactic information of “bàn” (verb) – frame set

Figure 3 Syntactic information of “ăn” (verb) – forward, backward

• Semantics:

Semantic information includes logical constraints and semantic constraints.


- Logical constraints include categorial meaning, synonyms, and antonyms. Categorial meaning can be understood as a “semantic word type”; for example, “tướng sĩ” and “tướng tá” belong to “People”, while “trâu” and “bê” belong to “Mammal”. Synonyms and antonyms help with analysing and using words correctly.

Figure 4: Semantics tree

- Semantic constraints: information about the “semantic role” a word plays in a sentence: agent, experiencer, possessor, force, patient, recipient, reference, concomitant, etc.


Figure 5 Semantics information of “bắt” (verb)

Figure 6 VCL in XML format
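To make the entry layout more concrete, the sketch below parses a single entry with Python's standard `xml.etree.ElementTree`. Only the `<def>` and `<synonym>` field names are mentioned in this report; the other tag names (`entry`, `headword`, `category`, `subcategory`, `categorial_meaning`), the `Entry` class, and the sample content are assumptions made for illustration and may differ from the actual VCL schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical sample entry; only <def> and <synonym> are field names
# taken from the report, everything else is an illustrative assumption.
SAMPLE = """
<entry>
  <headword>quần áo</headword>
  <category>N</category>
  <subcategory>Nc</subcategory>
  <categorial_meaning>Artifact</categorial_meaning>
  <def>quần và áo; đồ mặc [nói khái quát]</def>
  <synonym>áo quần</synonym>
</entry>
"""

@dataclass
class Entry:
    headword: str
    category: str = ""
    subcategory: str = ""
    categorial_meaning: str = ""
    definition: str = ""
    synonyms: list = field(default_factory=list)

def parse_entry(xml_text):
    """Parse one dictionary entry from an XML string."""
    node = ET.fromstring(xml_text.strip())

    def text(tag):
        # findtext returns None when the tag is missing
        return (node.findtext(tag) or "").strip()

    return Entry(
        headword=text("headword"),
        category=text("category"),
        subcategory=text("subcategory"),
        categorial_meaning=text("categorial_meaning"),
        definition=text("def"),
        synonyms=[s.text.strip() for s in node.findall("synonym") if s.text],
    )

entry = parse_entry(SAMPLE)
print(entry.headword, entry.synonyms)
```

Keeping each entry as a small record like this makes the later steps (filtering, pairing, reversing) simple operations over a list of entries.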


Chapter 3

Building a Coordinated Compound Word Dictionary

Vietnamese word segmentation relies heavily on the words defined in the dictionary, so a good dictionary is very important. The VCL dictionary contains only a small number of coordinated compound words. The purpose of building a coordinated compound word dictionary is to increase the accuracy of Vietnamese word segmentation when detecting coordinated compound words.
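To illustrate why dictionary coverage matters, here is a minimal sketch of greedy longest-matching segmentation over syllables. This is a generic dictionary-based technique chosen for illustration, not necessarily the segmenter used in this work; the function name `segment`, the tiny lexicons, and the sample sentence are assumptions.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest matching over a list of syllables."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; fall back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

sentence = "tôi mua áo quần mới".split()

# Lexicon that only knows the order "quần áo"
base = {"tôi", "mua", "áo", "quần", "mới", "quần áo"}
# Lexicon extended with the reversed form "áo quần"
extended = base | {"áo quần"}

print(segment(sentence, base))      # "áo" and "quần" are split apart
print(segment(sentence, extended))  # "áo quần" is kept as one word
```

With the base lexicon the compound is broken into two single-syllable words; once the missing form is in the lexicon, the same algorithm keeps it together.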

Building the coordinated compound word dictionary on the basis of the VCL dictionary involves several steps.

3.1 Finding coordinated compound words already defined in the VCL dictionary

This step is supported by a small web-based system. After this step, the dictionary contains more than 1,600 coordinated compound words.

• Use the Rails 3.1 framework with a MongoDB database (through Mongoid).

• Read the VCL dictionary and store it in the database.

• Check each entry against the following conditions:

- The <def> field contains the string “[nói khái quát]” (“generally speaking”)

- The <def> field contains the string “[nói gộp]” (“collectively speaking”)

- The <def> field contains the syllables of the word and the word “và” (“and”)

- The <synonym> field contains the reversed form of the headword

• Query all entries that meet at least one condition and sort them so that entries meeting the most conditions come first.

Write a script that lets the reviewer select the correct coordinated compound words with a single click (a selected word is then displayed in italics and its flag is set to true).

Figure 1 Example of coordinated compound word
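The filtering and sorting described above can be sketched as follows. This is a simplified Python illustration rather than the Rails and Mongoid code of the actual system; the dictionary keys, the helper names (`nfc`, `tokens`, `score`, `rank_candidates`), and the sample definitions are assumptions made for the example.

```python
import re
import unicodedata

def nfc(text):
    """Normalize to NFC so accented Vietnamese letters compare reliably."""
    return unicodedata.normalize("NFC", text)

def tokens(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", nfc(text))

def reverse_word(word):
    return " ".join(reversed(word.split()))

def score(entry):
    """Count how many of the four conditions an entry meets."""
    head = nfc(entry["headword"])
    definition = nfc(entry.get("def", ""))
    synonyms = [nfc(s) for s in entry.get("synonym", [])]
    def_tokens = set(tokens(definition))
    conditions = [
        "[nói khái quát]" in definition,
        "[nói gộp]" in definition,
        "và" in def_tokens and all(s in def_tokens for s in tokens(head)),
        reverse_word(head) in synonyms,
    ]
    return sum(conditions)

def rank_candidates(entries):
    """Keep entries meeting at least one condition, best matches first."""
    scored = [(score(e), e["headword"]) for e in entries]
    hits = [(s, h) for s, h in scored if s > 0]
    return sorted(hits, key=lambda pair: -pair[0])

# Made-up sample entries for illustration only
SAMPLE_ENTRIES = [
    {"headword": "máy tính", "def": "máy điện tử dùng để tính toán", "synonym": []},
    {"headword": "bàn ghế", "def": "bàn và ghế [nói gộp]", "synonym": []},
    {"headword": "quần áo", "def": "quần và áo; đồ mặc [nói khái quát]",
     "synonym": ["áo quần"]},
]

print(rank_candidates(SAMPLE_ENTRIES))
```

Sorting by the number of conditions met puts the most likely coordinated compounds at the top of the review list, so the manual check starts with the strongest candidates.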


3.2 Classifying compound words and simple words by categorial meaning

Classify these compound words and the simple words from the dictionary by categorial meaning (semantic word type). Within each class, pair two simple words that belong to the same categorial meaning to form new coordinated compound words. For example:

màn chiếu


Figure 2 Classifying simple words by categorial meaning
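A minimal sketch of this pairing step is shown below, assuming each simple word comes with a single categorial meaning label. The function name `candidate_pairs` and the label “Household item” are assumptions for illustration; “Mammal” for “trâu” and “bê” follows the example in Section 2.2.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(simple_words):
    """Pair simple words that share a categorial meaning.

    simple_words: iterable of (word, categorial_meaning) tuples.
    Returns a list of (candidate_compound, categorial_meaning) tuples.
    """
    classes = defaultdict(list)
    for word, meaning in simple_words:
        classes[meaning].append(word)

    candidates = []
    for meaning, words in classes.items():
        # Each unordered pair is generated once; reversed forms are
        # handled separately in the reversal step.
        for first, second in combinations(words, 2):
            candidates.append((f"{first} {second}", meaning))
    return candidates

# Illustrative input; the "Household item" label is hypothetical
words = [
    ("trâu", "Mammal"),
    ("bê", "Mammal"),
    ("màn", "Household item"),
    ("chiếu", "Household item"),
]
print(candidate_pairs(words))
```

Pairing within a class generates many combinations that are not real words, so the output should be treated as a candidate list that still needs manual review.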

3.3 Finding new coordinated compound words by reversing existing ones

quần áo => áo quần

chung thủy => thủy chung

đỏ đen => đen đỏ

rừng núi => núi rừng

bay lượn => lượn bay

Create all possible reversed words from the coordinated compound words that have already been reviewed. Each newly created word has the same categorial meaning, category, subcategory, and definition as the original word.


Figure 3 Listing all reversed forms of coordinated compound words for checking
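The reversal step can be sketched as below, assuming entries are plain dictionaries. The function name `reversed_entries`, the field names, and the `reviewed` flag are hypothetical; the flag only marks that a generated word still has to be checked by hand.

```python
def reversed_entries(entries):
    """Create a reversed-order entry for each reviewed compound.

    Each new entry copies every field of the original (category,
    subcategory, categorial meaning, definition) and changes only
    the headword. Forms that already exist are skipped.
    """
    known = {e["headword"] for e in entries}
    new_entries = []
    for e in entries:
        flipped = " ".join(reversed(e["headword"].split()))
        if flipped in known:
            continue  # already in the dictionary, or identical to the original
        known.add(flipped)
        new_entries.append({**e, "headword": flipped, "reviewed": False})
    return new_entries

# Illustrative entries with placeholder definitions
reviewed = [
    {"headword": "quần áo", "category": "N", "def": "..."},
    {"headword": "áo quần", "category": "N", "def": "..."},
    {"headword": "rừng núi", "category": "N", "def": "..."},
]
print([e["headword"] for e in reversed_entries(reviewed)])
```

Checking against the set of known headwords avoids creating duplicates when both orders, such as “quần áo” and “áo quần”, are already present.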

3.4 Review and estimate the accuracy of the dictionary

The new coordinated compound words (about 3,000) have the same format as the VCL dictionary entries, so they can easily be used to improve the accuracy of Vietnamese word segmentation.


3.5 Future work

For several reasons (limited time and limited knowledge of Vietnamese vocabulary), the dictionary is still small. Work continues on finding more words to enlarge the dictionary.


