Page 1
Vietnamese-English Cross-Language Search
Information Retrieval (CLIR): Discovering Noun Phrases for Translation
CSC 177 Presentation
Page 2
• Motivations
• Crosslingual Query
• Noun phrase translation extraction
• Experiments and results
• Conclusion and next steps
Page 3
Motivations – Unknown Translations
• Brand names, place names, personal names
• Titles (music, book, video)
• Terminology (science, computing, medicine, space, farming, etc.)
• Meaning might not be inferable from individual components
• Might require expert knowledge for translation
• Might have multiple correct translations
• Cross-language Information Retrieval (CLIR)
• Machine Translation (MT)
• Machine-Readable Dictionary (MRD)
Page 4
[Slide image; surviving text: "software)"]
Page 5
[Slide image; surviving text: "Quang Dung)"]
Page 6
Searching the web for translation?
• Parallel data on the web:
Vietnamese to English Translation
Page 7
Searching the web for translation?
• Comparable corpus on the web:
Page 8
Searching the web for translation?
• Mixed-language web pages:
English Translation
Page 9
Our Approach
• Extends the 2005 paper by Ying Zhang of CMU (credit)
• Addresses issues specific to Vietnamese-English out-of-vocabulary (OOV) translation
• Proper-name translation uses a pattern-recognition technique rather than phonetic similarity and string alignment
• Detects borrowed English words
• Improves translation suggestions by using contextual information
Page 10
Crosslingual Query to Obtain Mixed-Language Web Pages
• Extend the source query, VS, with extended words/phrases, VEX (terms that tend to co-occur frequently with VS)
Page 11
How to Find VEX?
• Find co-occurring terms in the web search log
• Use co-occurring terms in the search query (in CLIR)
• Search Google with VS and select the Vietnamese words, VEX, that occur with high frequency
Overture Search Log
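The selection of VEX described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenters' implementation: the function names `select_extension_terms` and `build_extended_query`, the `top_k` cutoff, the optional stopword set, and the quoting of VS in the final query are all assumptions, and the snippets are assumed to have been fetched already.

```python
import re
import unicodedata
from collections import Counter


def select_extension_terms(source_query, snippets, top_k=3, stopwords=frozenset()):
    """Pick the words that co-occur most often with the source query VS.

    Hypothetical helper: snippets is a list of result texts already fetched
    for VS. Only snippets containing every query word are counted.
    """
    normalize = lambda s: unicodedata.normalize("NFC", s).lower()
    query_words = set(normalize(source_query).split())
    counts = Counter()
    for snippet in snippets:
        words = re.findall(r"\w+", normalize(snippet))
        if not query_words.issubset(words):
            continue  # snippet does not mention the full query
        counts.update(
            w for w in words if w not in query_words and w not in stopwords
        )
    return [word for word, _ in counts.most_common(top_k)]


def build_extended_query(source_query, extension_terms):
    """Combine VS (kept as an exact phrase) with the selected VEX terms."""
    return f'"{source_query}" ' + " ".join(extension_terms)


snippets = ["dân ca quan họ", "dân ca quan họ bắc ninh", "nhạc trẻ"]
terms = select_extension_terms("dân ca", snippets, top_k=2)
extended_query = build_extended_query("dân ca", terms)
```

Counting single whitespace-separated tokens is a simplification, since many Vietnamese words span several syllables; a real system would likely count phrases instead.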
Page 12
Original Source Query
Page 13
Crosslingual Query
Page 14
Our Approach: Noun Phrase Translation
Page 15
Yahoo Search API: Returned XML Data
Snippet
Page 16
Proper Name Recognition & Transliteration
• Extract and concatenate Title, Summary, and URL
• Recognize that a proper name is likely to appear with its first letter capitalized
• Compute the likelihood that a query text is a proper name:
$$\text{Likelihood}(V \text{ is a proper name}) = \frac{\text{Occurrences of First\_Letter\_In\_Cap}(V) \text{ in TextSnippet}}{\text{All occurrences of } V \text{ in TextSnippet}}$$
• Suggest a translation candidate VN: Quang Dũng → Eng: Quang Dung
• Compute and assign a weight to a translation candidate
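A minimal Python sketch of the two steps above (capitalization ratio, then a diacritic-free candidate) might look as follows. The function names are hypothetical, "first letter in capitals" is assumed to mean that every word of the match starts with an uppercase letter, and stripping diacritics is only one plausible way to obtain a candidate such as Quang Dung; the slides do not say how the candidate or its weight is actually computed.

```python
import re
import unicodedata


def proper_name_likelihood(query, snippet_text):
    """Share of the query's occurrences that are written with initial capitals."""
    matches = re.findall(re.escape(query), snippet_text, flags=re.IGNORECASE)
    if not matches:
        return 0.0
    capitalized = sum(
        1 for match in matches
        if all(word[0].isupper() for word in match.split())
    )
    return capitalized / len(matches)


def strip_diacritics(text):
    """Drop Vietnamese tone and vowel marks to get an ASCII candidate."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # The letter đ/Đ has no Unicode decomposition, so it is mapped by hand.
    return without_marks.replace("đ", "d").replace("Đ", "D")


text = "Ca sĩ Quang Dũng hát. quang dũng; Quang Dũng"
likelihood = proper_name_likelihood("Quang Dũng", text)
candidate = strip_diacritics("Quang Dũng")
```

In this sketch the likelihood could serve directly as the weight of the candidate, though that pairing is an assumption as well.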
Page 17
Preprocessing (Query: Thuật toán genetic)
– Extract and concatenate Title, Summary, and URL
Thuật toán-Cấu trúc dữ liệu (Reserve Polish Notation – RPN), một thuật toán "kinh điển" trong lĩnh vực trình biên dịch THUẬT GIẢI DI TRUYỀN – GENETIC ALGORITHM -
Kỳ 2 ity.vnuit.edu.vn/thuattoan/index.htm
– Mark query, normalize text, remove noise text
~123456789 cấu trúc dữ liệu reserve polish notation – rpn một ~123456789 kinh điển trong lĩnh vực trình biên dịch thuẬt giẢi di truyỀn – ~987654321 algorithm kỳ 2 ity
vnuit edu vn thuattoan index htm
– Mark recognized Vietnamese text with VNW tag
~123456789 VNW VNW VNW VNW reserve polish notation VNW rpn ~123456789 VNW VNW trong VNW VNW VNW VNW VNW VNW di VNW VNW ~987654321 algorithm
VNW ity vnuit edu vn thuattoan index htm
– Group contiguous English words and build a word list
['~ 123456789 ', 'VNW', 'VNW', 'VNW', 'VNW', '', '', 'reserve_polish_notation', 'VNW', 'rpn', '~ 123456789 ', 'VNW', 'VNW', 'trong', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'VNW', 'di', 'VNW', 'VNW', '~ 987654321 ', 'algorithm', 'VNW', 'ity', 'vnuit', 'edu', 'vn', 'thuattoan', 'index',
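The preprocessing steps on this slide can be approximated with the Python sketch below. It is an illustration under stated assumptions rather than the presenters' code: the marker strings are copied from the slide, but the function names are hypothetical, a token is treated as Vietnamese simply because it contains a non-ASCII character, and every run of adjacent untagged tokens is joined, which is cruder than whatever test the original program used.

```python
import re
import unicodedata

QUERY_MARK = "~123456789"    # marks the Vietnamese part of the query (from the slide)
ENGLISH_MARK = "~987654321"  # marks the English hint word (from the slide)


def preprocess(snippet, vn_query, en_hint):
    """Mark the query, normalize the text, strip noise, and tag Vietnamese words."""
    text = unicodedata.normalize("NFC", snippet).lower()
    vn_query = unicodedata.normalize("NFC", vn_query).lower()
    text = text.replace(vn_query, f" {QUERY_MARK} ")
    text = text.replace(en_hint.lower(), f" {ENGLISH_MARK} ")
    text = re.sub(r"[^\w~]+", " ", text)  # punctuation and symbols become spaces
    # Assumption: any token with a non-ASCII character is a Vietnamese word.
    return ["VNW" if not tok.isascii() else tok for tok in text.split()]


def group_english(tokens):
    """Join runs of adjacent untagged tokens with underscores."""
    grouped, run = [], []
    for tok in tokens:
        if tok == "VNW" or tok.startswith("~"):
            if run:
                grouped.append("_".join(run))
                run = []
            grouped.append(tok)
        else:
            run.append(tok)
    if run:
        grouped.append("_".join(run))
    return grouped


tagged = preprocess(
    "Cấu trúc dữ liệu (Reverse Polish Notation – RPN), một thuật toán",
    "thuật toán",
    "genetic",
)
word_list = group_english(tagged)
```

Because the non-ASCII test misses Vietnamese words that happen to be plain ASCII (such as "trong" or "di"), a dictionary lookup would be a more reliable tagger.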
Page 18
[Equation not recoverable from the source]
• Example: Thuật toán genetic
Page 19
Contextual Ordering Model & Result Ranking
• Estimate the closeness probability
• Compute an overall score for each candidate
• Sort by score and present the top 5 suggestions
[Equation garbled in the source; only the symbols e, c, E, EX, and ADJ survive]
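The score-sort-truncate flow on this slide can be illustrated with a short Python sketch. The slide's own closeness and scoring formulas are not reproduced here; as a stand-in assumption, each candidate occurrence contributes the inverse of its token distance to the nearest query marker, and `rank_candidates` is a hypothetical name.

```python
from collections import defaultdict


def rank_candidates(tokens, query_mark="~123456789", top_n=5):
    """Score candidates by proximity to the query marker and return the best few.

    Assumption: closeness is 1 / token distance to the nearest query marker,
    summed over all occurrences of the candidate.
    """
    query_positions = [i for i, tok in enumerate(tokens) if tok == query_mark]
    if not query_positions:
        return []
    scores = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok == "VNW" or tok.startswith("~"):
            continue  # skip tagged Vietnamese words and markers
        distance = min(abs(i - q) for q in query_positions)
        scores[tok] += 1.0 / distance
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


tokens = [
    "~123456789", "genetic_algorithm", "VNW", "VNW",
    "index", "~123456789", "genetic_algorithm",
]
top_suggestions = rank_candidates(tokens)
```

Summing over occurrences rewards candidates that appear both often and close to the query, which is one simple way to combine frequency with context.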
Page 20
Sample Program Output #1
(dân ca -> folk or traditional music)
Page 21
Sample Program Output #2
(Quang Dũng -> Quang Dung)
Page 22
Sample of Translation Results
Category | Vietnamese Phrase/Word | Vietnamese-English Web-mining Translation | Vdict (Machine Translation) | Vietdict (Online Dictionary)
Organization Name | WTO là gì? | What is world trade organization? | What is WTO? | No definition found
Science & Tech | thuật toán di truyền | Genetic algorithms | Heredity algorism | No definition found
Location Name | Thừa Thiên Huế | Thua Thien Hue | Partial Excess Hue | No definition found
Person Name | ca sĩ Quang Dũng | Singer Quang Dung | Optical singer Dũng | N/A
Medical Term | viêm màng não | Meningitis brain infection | meningitis | No definition found
Geographical Name | Đại dương Bắc Băng Dương | Arctic ocean | Đạtôi glacial ocean Boreal Yang | No definition found
Education | học vị Tiến sỹ | Phd degree | Advance academical degree sỹ | No definition found
Music | dân ca | Folk music | folk-song | folk-song
Music | nhạc hip hop | Hip hop music or Rap music | music cây hu-blông hông | No definition found
Space | phi hành gia Sally Ride | Former astronaut Sally Ride | air-man sự Phá vây cưỡi | No definition found
Plant | cây kiểng vườn Nhật | Bonsai Japanese garden | Japanese garden plant kiểng | No definition found
Farming | những nghề cá thủy sản | Aquaculture fisheries | seafood fisheries | No definition found
Laws | cư trú thường trực | permanent resident | populate permanent | No definition found
Astrological | Thuật chiêm tinh phong thủy | feng shui astrology | Geomancy astrology | Geomancy
Page 23
Conclusion and Next Steps
• Contributions
– Recognize and translate important phrases
– Translate: persons, locations, concepts
– Low implementation cost with reasonable performance
• Future work
– Experiment with a larger set of test data
– Integrate with Vietnamese-English CLIR work
– Automate the generation of extended words/phrases to derive English extended words
– Experiment on “Refine Result” concept for