Identifying coordinated compound words for Vietnamese word segmentation
VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
ENTRY FOR THE STUDENT SCIENTIFIC RESEARCH AWARD
2012
Project title:
Identifying coordinated compound words for Vietnamese
word segmentation
Supervisors: Dr. Nguyễn Phương Thái
MSc. Trần Ngọc Anh
Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages use blanks to separate words, which makes word segmentation easy for a tokenizer. Vietnamese words, in contrast, can consist of one, two, or more syllables. In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. Most Vietnamese dictionaries define only a small number of coordinated compound words. Because most natural language processing systems depend heavily on a dictionary in the word segmentation step, many problems appear when the tokenizer has to detect coordinated compound words. We build a coordinated compound word dictionary with a large number of entries, which we hope will help improve the accuracy of Vietnamese word segmentation.
Chapter 1
Introduction
Word segmentation is considered the first step in most natural language processing applications. Vietnamese word segmentation faces difficulties that Western languages do not. English and many other languages use blanks to separate words, which makes word segmentation easy for a tokenizer. Vietnamese words can consist of one, two, or more syllables. In general, the meaning of a Vietnamese compound word is created by combining the meanings of the syllables that make it up, and blanks separate syllables rather than words. This creates problems for all natural language processing tasks. The main problems include word ambiguity, unknown word detection, and proper name recognition.
Chapter 2
Vietnamese word segmentation
2.1 Coordinated Compound Word
2.1.1 Definition
A coordinated compound word is made up of two or more syllables that have similar meanings, and the meaning of the word is a combination of the meanings of its syllables. The syllables that make up a coordinated compound word stand in an equal relation. In other words, the meaning of a coordinated compound word is more general than that of each syllable and is based equally on all of them.
The order of the syllables in a coordinated compound word is often interchangeable. For example: “quần áo”, “áo quần”, “chung riêng”, “riêng chung”, “đen đỏ”, “đỏ đen”, “ốm đau”, “đau ốm”, …
2.1.2 Types of coordinated compound words
There are three types of coordinated compound words:
• All syllables are of native Vietnamese origin: “đất nước”, “trời đất”, “đất cát”, “ruộng đất”, “ruộng vườn”, “ruộng nương”, “ấm chén”, “bát đĩa”, “đỏ đen”, “trắng đen”, “may rủi”, etc.
• All syllables are borrowed from Chinese: “ân nghĩa”, “nam nữ”, “đầu não”, “đấu tranh”, “học tập”, “lợi lộc”, “thuận lợi”, etc.
• One syllable is of native Vietnamese origin and the other is borrowed from Chinese: “binh lính”, “bụng dạ”, “lính tráng”, “nuôi dưỡng”, “gan dạ”, etc.
2.2 VCL Dictionary
In natural language processing, a dictionary is an essential resource for analyzing language problems, from simple to complex. A good-quality lexicon should provide the language processing system with linguistic information at many different levels, such as morphology, grammar, and semantics, and should be usable in both monolingual and multilingual processing systems.
VCL (Vietnamese Computational Lexicon) is a dictionary from Vietlex with 35,000 words, created for natural language processing purposes. Each word in the dictionary is annotated with morphological, syntactic, and semantic information.
Morphology:
Morphological information includes HeadWord and WordType.
Figure 1 Basic information and morphology of “bàn” (noun)
Syntax:
Syntactic information includes category (noun, verb, adverb, adjective, etc.), subcategory (proper noun, countable noun, abstract noun, etc.), frame set, forward, and backward.
Figure 2 Syntax of “bàn” (verb) – frameset
Figure 3 Syntax of “ăn” (verb) – forward, backward
Semantics:
Semantic information includes logical constraints and semantic constraints.
- Logical constraints include categorial meaning, synonyms, and antonyms. Categorial meaning can be understood as a “semantic-wordtype”; for example, “tướng sĩ” and “tướng tá” belong to “People”, while “trâu” and “bê” belong to “Mammal”. Synonyms and antonyms help with analysing and using words correctly.
Figure 4: Semantics tree
- Semantic constraints: information about the “semantic role” of a word in a sentence: agent, experiencer, possessor, force, patient, recipient, reference, concomitant, etc.
Figure 5 Semantic information of “bắt” (verb)
Figure 6 VCL in xml format
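To make the entry layout described in this section concrete, the following is a minimal Python sketch of one VCL entry held in memory. The class name `VclEntry`, its attribute names, and the sample values are assumptions made for illustration; they are not the actual VCL schema, which is only shown in the figures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VclEntry:
    """Hypothetical in-memory view of one VCL entry (names are assumptions)."""
    head_word: str                 # morphology: HeadWord
    word_type: str                 # morphology: WordType
    category: str                  # syntax: noun, verb, adverb, ...
    subcategory: str = ""          # syntax: proper noun, countable noun, ...
    categorial_meaning: str = ""   # semantics: e.g. "People", "Mammal"
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)
    definition: str = ""           # text of the definition field

    def syllables(self) -> List[str]:
        # Vietnamese syllables are separated by blanks.
        return self.head_word.split()


# Illustrative entry; "compound" is a placeholder WordType value.
entry = VclEntry(
    head_word="tướng sĩ",
    word_type="compound",
    category="noun",
    categorial_meaning="People",
)
print(entry.syllables())
```

Keeping the syllables behind a small helper is convenient because the later steps (condition checks and reversal) all work on the syllable list of a headword.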
Chapter 3
Building a Coordinated Compound Word Dictionary
Vietnamese word segmentation relies heavily on the words defined in the dictionary, so a good dictionary is very important. The VCL dictionary contains only a small number of coordinated compound words. The purpose of building a coordinated compound word dictionary is to increase the accuracy of Vietnamese word segmentation when detecting coordinated compound words.
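The dependence of segmentation on the dictionary can be illustrated with a small sketch. It uses greedy longest matching, which is an assumption chosen for brevity and not necessarily the segmenter used in this work; the function name `segment`, the sample sentence, and the tiny lexicons are also illustrative.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy longest-match segmentation over a list of syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; fall back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


sentence = "tôi mua quần áo".split()
base_lexicon = {"tôi", "mua", "quần", "áo"}

# Without the compound, "quần" and "áo" come out as two separate words.
without_compound = segment(sentence, base_lexicon)
# With "quần áo" in the lexicon, the compound is kept as one word.
with_compound = segment(sentence, base_lexicon | {"quần áo"})
print(without_compound)
print(with_compound)
```

The only difference between the two calls is one extra dictionary entry, which is why a larger list of coordinated compound words is expected to help a dictionary-based segmenter.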
Building the coordinated compound word dictionary on top of the VCL dictionary involves several steps.
3.1 Finding coordinated compound words already defined in the VCL dictionary
This step is supported by a small web-based system. After this step, the coordinated compound word dictionary contains more than 1,600 words.
Use the Rails 3.1 framework with a MongoDB database (accessed through Mongoid).
Read the VCL dictionary and store it in the database.
Select candidate entries that meet any of the following conditions:
• The <def> field contains the string “[nói khái quát]” (“generally speaking”)
• The <def> field contains the string “[nói gộp]” (“collectively speaking”)
• The <def> field contains the syllables of the word and the word “và” (“and”)
• The <synonym> field contains the reversed form of the headword
Query all candidates and sort them so that entries meeting the most conditions come first.
Write a script that lets a reviewer confirm a correct coordinated compound word with a single click (the chosen word is then displayed in italics and its flag is set to true).
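The four conditions and the ranking by number of conditions met can be sketched as follows. This is a Python illustration under stated assumptions, not the Rails code of the original system: the function `score_candidate`, the dictionary keys, and the sample definitions are hypothetical, and a simple whitespace split stands in for real tokenization of the definition text.

```python
def score_candidate(head_word, definition, synonyms):
    """Count how many of the four heuristic conditions an entry meets."""
    syllables = head_word.split()
    reversed_word = " ".join(reversed(syllables))
    def_tokens = definition.split()
    conditions = [
        "[nói khái quát]" in definition,
        "[nói gộp]" in definition,
        # Definition mentions every syllable of the word plus "và".
        "và" in def_tokens and all(s in def_tokens for s in syllables),
        # Synonym list contains the headword in reversed order.
        reversed_word in synonyms,
    ]
    return sum(conditions)


# Illustrative entries; the definitions are invented for this sketch.
entries = [
    {"head": "máy tính", "def": "máy điện tử dùng để tính toán", "syn": []},
    {"head": "bàn ghế", "def": "bàn và ghế [nói khái quát]", "syn": []},
    {"head": "quần áo", "def": "quần và áo [nói khái quát]", "syn": ["áo quần"]},
]

# Entries meeting the most conditions come first, ready for manual review.
ranked = sorted(
    entries,
    key=lambda e: score_candidate(e["head"], e["def"], e["syn"]),
    reverse=True,
)
ranked_heads = [e["head"] for e in ranked]
print(ranked_heads)
```

Sorting by the score only orders the review queue; the final decision for each word still belongs to the human reviewer.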
Figure 1 Example of coordinated compound word
3.2 Classifying compound words and simple words by categorial meaning
Classify these compound words and the other simple words from the dictionary by ‘categorial meaning’ (semantic-wordtype). Within each class, pair two simple words that belong to the same ‘categorial meaning’ to form new coordinated compound words. For example:
màn chiếu
Figure 2 Classifying the simple words by ‘categorial meaning’
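The pairing step can be sketched in Python as below. The function name `propose_pairs` is hypothetical, the “Mammal” class for “trâu” and “bê” follows the example in Section 2.2, and the remaining words and the “Furniture” class are assumptions added only to make the sketch runnable.

```python
from collections import defaultdict
from itertools import combinations


def propose_pairs(simple_words):
    """Pair simple words that share a categorial meaning.

    simple_words: iterable of (word, categorial_meaning) tuples.
    Returns a list of (candidate_compound, categorial_meaning) tuples.
    """
    by_class = defaultdict(list)
    for word, meaning in simple_words:
        by_class[meaning].append(word)

    proposals = []
    for meaning, words in by_class.items():
        # Every unordered pair inside one class is a candidate compound.
        for first, second in combinations(words, 2):
            proposals.append((f"{first} {second}", meaning))
    return proposals


simple_words = [
    ("trâu", "Mammal"),
    ("bê", "Mammal"),
    ("bàn", "Furniture"),   # illustrative class name
    ("ghế", "Furniture"),
    ("bò", "Mammal"),
]
proposals = propose_pairs(simple_words)
print(proposals)
```

The output is only a candidate list: many generated pairs are not real Vietnamese words, so each proposal still has to be checked before it enters the dictionary.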
3.3 Finding new coordinated compound words by reversing existing words
quần áo => áo quần
chung thủy => thủy chung
đỏ đen => đen đỏ
rừng núi => núi rừng
bay lượn => lượn bay
Generate every possible reversed word from the coordinated compound words that have already been reviewed. Each newly created word has the same ‘categorial meaning’, category, subcategory, and definition as the original word.
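A minimal Python sketch of this reversal step is shown below. The function name `reverse_entries`, the dictionary keys, and the sample field values (including the “Clothing” and “Nature” classes and the placeholder definitions) are assumptions for illustration, not the real VCL data.

```python
def reverse_entries(entries, known_words):
    """Create reversed-order entries that are not in the dictionary yet."""
    new_entries = []
    for entry in entries:
        syllables = entry["head_word"].split()
        reversed_word = " ".join(reversed(syllables))
        # Skip words that read the same reversed or are already defined.
        if reversed_word == entry["head_word"] or reversed_word in known_words:
            continue
        # Copy every other field (meaning, category, ...) unchanged.
        new_entries.append(dict(entry, head_word=reversed_word))
    return new_entries


reviewed = [
    {"head_word": "quần áo", "categorial_meaning": "Clothing",
     "category": "noun", "subcategory": "", "definition": "(placeholder)"},
    {"head_word": "rừng núi", "categorial_meaning": "Nature",
     "category": "noun", "subcategory": "", "definition": "(placeholder)"},
]
# "núi rừng" is assumed to exist already, so only "áo quần" is created.
known_words = {"quần áo", "rừng núi", "núi rừng"}
new_entries = reverse_entries(reviewed, known_words)
print([e["head_word"] for e in new_entries])
```

Checking against the known words avoids duplicate entries, and the generated list can then be reviewed manually in the same way as the other candidates.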
Figure 3 Listing all reversed forms of coordinated compound words for checking
3.4 Review and estimate the accuracy of the dictionary
The new coordinated compound words (about 3,000 words) have the same format as the VCL dictionary, so they can easily be used to improve the accuracy of Vietnamese word segmentation.
3.5 Future work
For several reasons (limited time and limited knowledge of Vietnamese vocabulary), the dictionary is still small. Work is continuing to find more words and enlarge the dictionary.