IEEE RIVF’09, 16 July 2009
Vietnamese Language Processing: Issues and Challenges
Ho Tu Bao
Japan Advanced Institute of Science and Technology
Institute of Information Technology, Vietnamese Academy of Science & Technology
Our VLSP project (Vietnamese Language and Speech Processing)
Natural language processing?
Psychological view: Understand human language processing
Alan Turing: Proposed to consider the question: “Can machines think?”
Engineering view: Build systems to process language
More languages than you might have thought
and speech.
langue et de parole vietnamienne.
tiếng nói tiếng Việt.
6912 distinct languages (230 spoken in Europe,
2197 in Asia)
54 ethnic groups in Vietnam
English websites and Vietnamese?
Translation and machine translation
Translate the following sentence into English
“Ông già đi nhanh quá”?
Many possible translations
1. [Ông già] [đi] [nhanh quá]: The old man walks too fast / My father walks too fast
2. [Ông già] [đi] [nhanh quá]: The old man died too fast / My father died too fast
3. [Ông] [già đi] [nhanh quá]: You get old too fast / Grandfather gets old too fast
Ambiguity of language
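The ambiguity starts at word segmentation, before any translation is attempted. A minimal sketch, assuming a tiny hand-made lexicon (not the VLSP dictionary), that enumerates every way to split the syllables into known words:

```python
from typing import List, Set

# Toy lexicon: an assumption for illustration only.
LEXICON: Set[str] = {
    "ông", "già", "ông già", "đi", "già đi", "nhanh", "quá", "nhanh quá",
}

def segmentations(syllables: List[str], lexicon: Set[str] = LEXICON) -> List[List[str]]:
    """Return every split of the syllable sequence into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = " ".join(syllables[:end])
        if word in lexicon:
            # Keep this word and segment the remaining syllables recursively.
            for rest in segmentations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results

for seg in segmentations("ông già đi nhanh quá".split()):
    print(" | ".join(seg))
```

Both [ông già] [đi] [nhanh quá] and [ông] [già đi] [nhanh quá] appear among the results. Choosing between them needs context that a lexicon alone does not provide.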
Two approaches to machine translation
Linguistic rule-based machine translation: using linguistic rules about the two languages
Statistical machine translation: generate translations using statistical learning methods based on bilingual text corpora (statistically similar). Requires large, high-quality bilingual text corpora. DOMINATING!
Lexical / Morphological Analysis
Syntactic Analysis
Semantic Analysis
Discourse Analysis
Tagging Chunking
Word Sense Disambiguation
Grammatical Relation Finding
Named Entity Recognition
Reference Resolution
Shallow parsing
The woman will give Mary a book
POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
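The step from POS tags to chunks can be sketched with a few hand-written rules. This is only an illustration on the example sentence: the rules (an optional Det before NN, an optional MD before VB) are assumptions, not the VLSP chunker, which is learned from data.

```python
from typing import List, Tuple

TAGGED = [("The", "Det"), ("woman", "NN"), ("will", "MD"), ("give", "VB"),
          ("Mary", "NNP"), ("a", "Det"), ("book", "NN")]

def chunk(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group tagged words into NP / VP chunks using toy rules."""
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "Det" and next_tag == "NN":      # Det NN -> NP
            chunks.append(("NP", [word, tagged[i + 1][0]]))
            i += 2
        elif tag in ("NN", "NNP"):                 # bare noun -> NP
            chunks.append(("NP", [word]))
            i += 1
        elif tag == "MD" and next_tag == "VB":     # MD VB -> VP
            chunks.append(("VP", [word, tagged[i + 1][0]]))
            i += 2
        elif tag == "VB":                          # bare verb -> VP
            chunks.append(("VP", [word]))
            i += 1
        else:                                      # anything else stays outside
            chunks.append(("O", [word]))
            i += 1
    return chunks

print(chunk(TAGGED))
```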
From text to the meaning
Natural Language Processing (NLP)
Archeology of natural language processing
1990s–2000s: Statistical learning (algorithms, evaluation, corpora)
1980s: Standard resources and tasks (Penn Treebank, WordNet, MUC)
1970s: Kernel (vector) spaces (clustering, information retrieval (IR))
1960s: Representation transformation (finite state machines (FSM) and augmented transition networks (ATNs))
Natural language processing
Information retrieval and Information extraction
some ML/Stat no ML/Stat
(Pages 11-12 from Marie Claire, ECML/PKDD 2005)
ML and statistical methods in NLP
Recent learning methods in NLP
Large investment from the government and industry
National Institute of Standards and Technology (NIST), ATR, NICT
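USA, China, Singapore, etc.
NLP & CL organizations
ACL (Association for Computational Linguistics)
NAACL (North American Chapter of the ACL)
EACL (European Chapter of the ACL)
PACLIC (Pacific Asia Conference on Language, Information and Computation)
ICCL (International Committee on Computational Linguistics)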
Many NLP people
Rich resources and tools
NLP R&D in other countries
Linguistic Data Consortium
Vietnamese language was established a long time ago
used for a long time
Vietnam called Chu Nom
represent the Quốc Ngữ
since the beginning of the
Vietnamese language
Vietnamese is an analytic language (words are
composed of a single morpheme).
ngôn ngữ (analytic), lang-gua-ge (synthetic), 言言 (synthetic)
Vietnamese does not use morphological marking of case, gender, number, and tense.
Trưa nay tôi ăn ba thằng tôm
Syntax conforms to Subject Verb Object word order
Cái thằng chồng em nó chẳng ra gì.
FOCUS CLASSIFIER husband I he not turn.out what
“That husband of mine, he is good for nothing.”
Most work aims at machine translation or other tasks at the top layers, but there is very little basic work at the lower layers.
to do their work from scratch, without sharing and collaboration; no standards.
Vietnamese Language and Speech
VLSP
National project with eleven active research groups from Ho Chi Minh City to Hanoi, with two
Pragmatics: Speech, text and Web data mining
Tools, corpora, resources
Project target products
SP1 Application-oriented systems based on Vietnamese speech recognition & synthesis
SP2 Speech recognition system with large vocabulary
SP3 English-Vietnamese translation system
SP4 IREST: Internet use support system
SP5 Vietnamese spelling checker
SP6.1 Corpora for speech recognition
SP6.2 Corpora for speech synthesis
SP6.3 Corpora for specific words
SP7.1 English-Vietnamese dictionary
SP7.2 Viet dictionary
SP7.3 Vietnamese treebank
SP7.4 E-V corpora of aligned sentences
SP8.1 Speech analysis tools
SP8.2 Vietnamese word segmentation
SP8.3 Vietnamese POS tagger
SP8.4 Vietnamese chunker
SP8.5 Vietnamese syntax analyser
To be standard for long term development
VLSP website: open soon to the
public
SP7.2: Viet Machine Readable Dictionary
The macroscopic structure
The microscopic structure
The content and VCL
structure
Tool and VCL construction
Japanese EDR: Institute of Electronic Dictionary, 1980s-1990s
SP7.2: Viet Machine Readable
Dictionary
Microscopic structure
two kinds of verb
and semantic constraints,
definition, context
VCL content and structure
Tool for the construction
35,000 commonly used words
SP7.3: Viet Treebank
Treebank: a corpus in which each sentence has been parsed, i.e., annotated with syntactic structure
English: Penn Treebank (4.5M words) and many
others;
Chinese: Penn Chinese Treebank (507K words),
Sinica Treebank (61,087 trees, 361K words) ;
Japanese: ATR Dependency corpus, Kyoto Text
Corpus, Verbmobil treebanks;
Korean: Korean Treebank
c parser
Viet chunker
Viet POS tagger
Viet word segmenter
syntax and Vietnamese language
“Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả” (“the house is in a jumble” and “at home the door is not closed”)
“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn” (“she keeps her beauty” and “this painting has better color”)
Labeling: agreement between labelers (95%)
SP7.3: Viet Treebank
SP7.4: English-Vietnamese parallel corpus

Parallel Corpus (L1-L2)   Sentences   L1 Words     L2 Words
German-English            1,313,096   34,700,362   36,663,083
Greek-English               662,090   18,834,758   18,827,241
Spanish-English           1,304,116   37,870,751   36,429,274
Finnish-English           1,257,720   24,895,790   34,802,617
French-English            1,334,080   41,573,117   37,436,222
Italian-English           1,251,315   36,411,166   36,510,033
Dutch-English             1,326,412   36,784,168   36,690,392
Portuguese-English        1,287,757   37,342,426   36,355,907
Swedish-English           1,164,536   28,882,142   32,053,628
(http://www.euromatrix.net)
sentences in English and
Vietnamese (size & quality)
time, money and human
resources (boring job)
parallel text discovery
are translated from
English source in other
web sites from the
Internet
Setting up the “standards” for VLSP
Importance of “standards” in VLSP: choose an
appropriate view from different schools on
Vietnamese language
Guide for word recognition and description: morphological, syntactic, semantic criteria
Guide for constituent labeling: noun phrase, verb phrase, clause, etc.
Guide for sentence split
Others
Challenge: Standards for sustainable development
Example: Guideline for POS tagging
VLSP tools for the public
All the tools are
methods to build the
tools with the corpora.
Tools and resources are
to be given to the
public.
SP7.1 English-Vietnamese dictionary
SP7.2 Viet dictionary
SP7.3 Vietnamese treebank
SP7.4 E-V corpora of aligned sentences
SP8.2 Vietnamese word segmentation
SP8.3 Vietnamese POS tagger
SP8.4 Vietnamese chunker
SP8.5 Vietnamese syntax analyser
MEMMs and CRFs are
emerging techniques in NLP & machine learning
CRF dynamic programming
- O(|L|^2 T): first-order
- O(|L|^3 T): second-order
(L is the set of labels, T is the sequence length)
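The first-order cost comes from Viterbi-style decoding: at each of the T positions, every current label is compared against every previous label. A minimal sketch, assuming additive (log-domain) scores already computed by some model; the score tables below are made-up toy values, not CRF output.

```python
from typing import Dict, List, Tuple

def viterbi(emit: List[Dict[str, float]],
            trans: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Best label sequence; emit[t][y] and trans[(prev, y)] are additive scores."""
    if not emit:
        return []
    best = [{y: emit[0][y] for y in labels}]
    back = []
    for t in range(1, len(emit)):                 # T positions
        scores, ptr = {}, {}
        for y in labels:                          # |L| current labels
            # |L| previous labels are compared here: |L|^2 work per position.
            prev = max(labels, key=lambda p: best[-1][p] + trans[(p, y)])
            scores[y] = best[-1][prev] + trans[(prev, y)] + emit[t][y]
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

LABELS = ["B", "I"]
EMIT = [{"B": 2.0, "I": 0.0}, {"B": 0.0, "I": 2.0}, {"B": 0.0, "I": 1.0}]
TRANS = {("B", "B"): 0.0, ("B", "I"): 1.0, ("I", "B"): 0.0, ("I", "I"): 1.0}
print(viterbi(EMIT, TRANS, LABELS))
```

A second-order model conditions on the two previous labels, which adds one more nested loop over L and gives the O(|L|^3 T) figure.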
Using machine learning in creating
tools
Finite state machines
CRFs Online Learning
Data
Chunking models
Vietnamese
Sentences Decoding output
Anh ấy đang ăn
SP8.5: HGSP grammar for syntax
analysis
Syntax tree
Text
Word segmentation module
Analysis module
Word feature dictionary
Rules with constraints
Attribute elimination
SP3: Machine translation and
EVSMT1.0
Statistical Analysis Vietnamese
died The old man died too
fast Old man died the too
The old man died too fast
(Slides 31-32 adapted from tutorial on SMT, K Knight and P Koehn)
English Bilingual Text
Vietnamese-English Text
SMT:
statistical
machine
translation
Translation Model
Language Model
Decoding Algorithm: argmax_e P(v|e) × P(e)
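A minimal sketch of how the three components fit together, assuming v is the Vietnamese source sentence and e ranges over English candidates. The candidate list and all probabilities are invented toy numbers, not EVSMT1.0 output. The translation model may slightly prefer the broken word order, but the language model penalizes it heavily.

```python
import math
from typing import Dict

# Toy scores (assumptions): "tm" stands for P(v|e), "lm" stands for P(e).
CANDIDATES: Dict[str, Dict[str, float]] = {
    "The old man died too fast": {"tm": 0.20, "lm": 0.010},
    "Old man died the too fast": {"tm": 0.25, "lm": 0.0001},
}

def decode(candidates: Dict[str, Dict[str, float]]) -> str:
    """Return the candidate e that maximizes log P(v|e) + log P(e)."""
    return max(
        candidates,
        key=lambda e: math.log(candidates[e]["tm"]) + math.log(candidates[e]["lm"]),
    )

print(decode(CANDIDATES))
```

A real decoder searches a very large hypothesis space rather than a fixed list; the fixed list here only shows the scoring rule.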
SP3: Machine translation and
EVSMT1.0
Statistical Analysis Vietnamese
Statistical Analysis
Broken
English Bilingual Text
Vietnamese-English Text
SMT:
statistical
machine
translation
SP3: Machine translation and EVSMT1.0
English
sentence
Vietnamese sentence
SMT core
- Standardization
- Word segmentation (VNsegmenter)
- POS tagger (CRF Postagger, VnQtag)
- Morphological analyser (morpha)
processing
Corpus collecting and building
Google: English-Vietnamese
translation
26.9.08 (translate.google.com, 35 languages)
Machine translation issues and challenges
SMT major difficulties: word choice, word order,
tense and aspect, pronoun, idioms
Target: Improve phrase-based SMT in two
aspects of word order and word choice
Combination of tree-to-string SMT and phrase-based SMT (N.P. Thai et al., Machine Translation, Vol. 20, No. 3 (2006); IJCPOL, Vol. 20, No. 2 (2007))
Focus on translating long and complex sentences
by introducing CRF-based clause splitting and
chunking parsing (N.V Vinh, IJCPOL, 2009)
in English
List of websites in Vietnamese
Search on Internet for Webpages having information related to the query
Selected Website in English
Web page translated into Vietnamese
Check each Website
Extract news related to the query
Text related to the query
Summarize the text
Summarized text in English
Summarized news translated into Vietnamese
Extract information related to the query
Summarize the text for its gist
Translate the gist into Vietnamese
Translate the list of retrieved Webpages into Vietnamese
Translate the selected Website into Vietnamese
Vietnamese named-entities on the web
Ho Chi Minh University of Technology
Small parsed tree
A rule shows relation of a
context and an action
{a, b, c, d, e}
{b, e, a}
Input list, CSTACK, RSTACK
Actions: SHIFT, REDUCE, DROP, ASSIGN TYPE, RESTORE
sequence of actions
(Minh et al., COLING 2004, J CPOL’05, ACM Trans ALP’05, IEICE’06)
their increase over time?
Topic representation
Which features are necessary to characterize topics (interest and utility over time)?
Topic identification
How to extract these features from the corpus for each topic?
Emerging trend detection
(Le Minh Hoang, KSS journal, 2006)
ETD: Topic representation
ETD: Detecting topics that are growing in interest and utility over time from a corpus
Topic representation
Which features are necessary to characterize topics (interest and utility over time)?
ETD: Topic identification
How to extract these features from the corpus for each topic?
Build 6 models corresponding to 6 types of citation
Using HMM, MEMM, and CRF to extract features
ETD: Topic verification
their increase over time?
[Equations: Interest f(t_i) and Utility g(t_i) are each built from Growth(t_i, j) with a factor 1/4; the index set {1, 3, 5, 6} is legible, the remaining limits are not recoverable]
Growth(t_i, j): growth of the time series {t_s(j)} along the time axis
Speed: the speed of growing at x = k
Acceleration: the acceleration of growing at x = k
Speed > 0, Acceleration > 0
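The verification condition (speed > 0 and acceleration > 0) can be sketched on a discrete time series. The slides do not say how speed and acceleration are computed; the first and second finite differences at the last point are an assumption made here for illustration, and the input values are made up.

```python
from typing import Sequence

def is_emerging(series: Sequence[float]) -> bool:
    """True if the series is growing and the growth is speeding up at its last point.

    Assumption: speed is the first finite difference and acceleration is the
    second finite difference of the per-period values.
    """
    if len(series) < 3:
        raise ValueError("need at least three time points")
    speed = series[-1] - series[-2]
    acceleration = series[-1] - 2 * series[-2] + series[-3]
    return speed > 0 and acceleration > 0

print(is_emerging([1, 2, 4]))   # growing, and growing faster
print(is_emerging([1, 3, 4]))   # growing, but slowing down
```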
Conclusion
infrastructure.
processing of other languages, especially
statistical learning from large corpora.
industry for the next phase, and for
collaboration.
The national project KC01.01.05/06-10
Project members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others
VLSP meeting, 21-25 Nov 2005, JAIST
Korean and Arabic
Kyou, wareware wa kokoni atsumari, Betonamu-go to speech shori ni tsuite giron shimasu
(“Today we gather here to discuss Vietnamese language and speech processing.”)
Search for parallel document
translated from English source in other web
sites from the Internet
during translation process E.g term, NE, number
Framework
• Queries are generated
and executed from high
ranks to low ranks
• Filtering
  • Length-based
  • TID-based
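A length-based filter can be sketched as a check on the ratio of document lengths. The slides give no details, so the whitespace tokenization and the 0.5 to 2.0 ratio bounds are assumptions chosen only for illustration.

```python
def length_ratio_ok(en_text: str, vi_text: str,
                    low: float = 0.5, high: float = 2.0) -> bool:
    """Keep a candidate pair only if its length ratio is plausible.

    Assumptions: length is counted in whitespace-separated tokens, and the
    bounds `low` and `high` are hypothetical thresholds.
    """
    en_len = len(en_text.split())
    vi_len = len(vi_text.split())
    if en_len == 0 or vi_len == 0:
        return False
    return low <= vi_len / en_len <= high

print(length_ratio_ok("The old man walks too fast", "Ông già đi nhanh quá"))
```

Pairs rejected here never reach the more expensive content-based check, which is the usual reason to run a length filter first.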