Vietnamese Language Processing: Issues and Challenges


Ho Tu Bao
Japan Advanced Institute of Science and Technology
Institute of Information Technology, Vietnamese Academy of Science & Technology
IEEE RIVF’09, 16 July 2009

Page 3
• Our VLSP project (Vietnamese Language and Speech Processing)

Page 4
Natural language processing?
• Psychological view: understand human language processing
• Alan Turing proposed considering the question: “Can machines think?”
• Engineering view: build systems to process language

Page 5
More languages than you might have thought
… and speech.
… langue et de parole vietnamienne.
… tiếng nói tiếng Việt.
• 6912 distinct languages (230 spoken in Europe, 2197 in Asia)

Page 6

54 ethnic groups in Vietnam

Page 7

English websites and Vietnamese?

Page 8
Translation and machine translation
• Translate the following sentence into English: “Ông già đi nhanh quá”
• Many possible translations:
  1. [Ông già] [đi] [nhanh quá] → The old man walks too fast / My father walks too fast
  2. [Ông già] [đi] [nhanh quá] → The old man died too fast / My father died too fast
  3. [Ông] [già đi] [nhanh quá] → You get old too fast / Grandfather gets old too fast
• Ambiguity of language
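The segmentation side of this ambiguity can be illustrated with a small sketch that enumerates every way to split the syllables of “Ông già đi nhanh quá” into dictionary words. The lexicon below is a hypothetical toy list built only from the words shown on the slide, and the function name `segmentations` is an illustrative choice, not part of any VLSP tool.

```python
def segmentations(syllables, lexicon, max_len=3):
    """Return every way to split `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            # Keep this word and segment the remaining syllables recursively.
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon: single syllables plus the compounds from the slide.
LEXICON = {"ông", "già", "đi", "nhanh", "quá", "ông già", "già đi", "nhanh quá"}

sentence = "ông già đi nhanh quá".split()
all_segs = segmentations(sentence, LEXICON)
for seg in all_segs:
    print(" | ".join(seg))
```

Both bracketings from the slide, [ông già] [đi] [nhanh quá] and [ông] [già đi] [nhanh quá], appear among the results. Choosing between them, and then between the senses of “đi” (walk or die), requires context that a lexicon alone does not supply.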

Page 9
Two approaches to machine translation
• Linguistic rule-based machine translation: using linguistic rules about the two languages
• Statistical machine translation: generate translations using statistical learning methods based on bilingual text corpora (statistically similar)
  - Requires large, high-quality bilingual text corpora
  - Dominating!

Page 10
Natural Language Processing (NLP): from text to the meaning
• Layers: Lexical / Morphological Analysis, Syntactic Analysis, Semantic Analysis, Discourse Analysis
• Tasks: Tagging, Chunking, Word Sense Disambiguation, Grammatical Relation Finding, Named Entity Recognition, Reference Resolution
• Example: The woman will give Mary a book
  - POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
  - Shallow parsing: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
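The step from POS tags to chunks in the example can be sketched with a few hand-written rules. This is only a toy illustration: the tag groupings and the rule that a determiner always opens a new chunk are assumptions chosen to reproduce the slide's example, not a general chunking method.

```python
NP_TAGS = {"Det", "NN", "NNP"}   # tags treated as noun-phrase material (assumption)
VP_TAGS = {"MD", "VB"}           # tags treated as verb-phrase material (assumption)


def chunk(tagged):
    """Group (word, tag) pairs into (label, words) chunks with simple rules."""
    chunks = []
    for word, tag in tagged:
        label = "NP" if tag in NP_TAGS else "VP" if tag in VP_TAGS else "O"
        # Open a new chunk when the phrase type changes or a determiner appears.
        starts_new = not chunks or chunks[-1][0] != label or tag == "Det"
        if starts_new:
            chunks.append((label, [word]))
        else:
            chunks[-1][1].append(word)
    return chunks


tagged = [("The", "Det"), ("woman", "NN"), ("will", "MD"), ("give", "VB"),
          ("Mary", "NNP"), ("a", "Det"), ("book", "NN")]
chunks = chunk(tagged)
print(" ".join("[" + " ".join(words) + "]" + label for label, words in chunks))
```

For the tagged sentence from the slide this yields the same four chunks, NP VP NP NP. Hand-written rules like these do not scale, which is why statistical sequence labelers are used for chunking in practice.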

Page 11
Archeology of natural language processing
• 1990s–2000s: Statistical learning (algorithms, evaluation, corpora)
• 1980s: Standard resources and tasks (Penn Treebank, WordNet, MUC)
• 1970s: Kernel (vector) spaces (clustering, information retrieval (IR))
• 1960s: Representation transformation (finite state machines (FSM) and augmented transition networks (ATNs))
Natural language processing
Information retrieval and information extraction

Page 12
ML and statistical methods in NLP
Figure legend: some ML/Stat, no ML/Stat
(Pages 11-12 from Marie Claire, ECML/PKDD 2005)

Page 13

Recent learning methods in NLP

Page 14
NLP R&D in other countries
• Large investment from the government and industry
• National Institute of Standards and Technology (NIST), ATR, NICT
• USA, China, Singapore, etc.
• NLP & CL organizations
  - ACL (Association for Computational Linguistics)
  - NAACL (North American Chapter of the ACL)
  - EACL (European Chapter of the ACL)
  - PACLIC (Pacific Asia Conference on Language, Information and Computation)
  - ICCL (International Committee on Computational Linguistics)
• Many NLP people
• Rich resources and tools
Linguistic Data Consortium

Page 15
• The Vietnamese language was established a long time ago
• … used for a long time
• … Vietnam called Chu Nom
• … represent the Quốc Ngữ
• … since the beginning of the …

Page 16
Vietnamese language
• Vietnamese is an analytic language (words are composed of a single morpheme).
  - ngôn ngữ (analytic), lang-gua-ge (synthetic), 言言 (synthetic)
• Vietnamese does not use morphological marking of case, gender, number, and tense.
  - Trưa nay tôi ăn ba thằng tôm
• Syntax conforms to Subject-Verb-Object word order.
  - Cái thằng chồng em nó chẳng ra gì.
    FOCUS CLASSIFIER husband I he not turn.out what
    “That husband of mine, he is good for nothing.”

Page 17
Vietnamese Language and Speech
• Most work aims at machine translation or other tasks at the top layers, but there is very little basic work at the lower layers.
• … do their work from scratch without sharing or collaborating → no standards.

Page 18
VLSP
• National project with eleven active research groups from Ho Chi Minh City to Hanoi, with two …
• Pragmatics: speech, text and Web data mining
• Tools, corpora, resources

Page 19
Project target products (to be standard for long-term development)
• SP1 Application-oriented systems based on Vietnamese speech recognition & synthesis
• SP2 Speech recognition system with large vocabulary
• SP3 English-Vietnamese translation system
• SP4 IREST: Internet use support system
• SP5 Vietnamese spelling checker
• SP6.1 Corpora for speech recognition
• SP6.2 Corpora for speech synthesis
• SP6.3 Corpora for specific words
• SP7.1 English-Vietnamese dictionary
• SP7.2 Viet dictionary
• SP7.3 Vietnamese treebank
• SP7.4 E-V corpora of aligned sentences
• SP8.1 Speech analysis tools
• SP8.2 Vietnamese word segmentation
• SP8.3 Vietnamese POS tagger
• SP8.4 Vietnamese chunker
• SP8.5 Vietnamese syntax analyser

Page 20
VLSP website: opening to the public soon

Page 21
SP7.2: Viet Machine Readable Dictionary
• The macroscopic structure
• The microscopic structure
• The content and VCL structure
• Tool and VCL construction
Japanese EDR, Institute of Electronic Dictionary, 1980s-1990s

Page 22
SP7.2: Viet Machine Readable Dictionary
• Microscopic structure
  - … two kinds of verb
  - … and semantic constraints, definition, context
• VCL content and structure
• Tool for the construction
• 35,000 commonly used words

Page 23
SP7.3: Viet Treebank
• Treebank: a corpus in which each sentence has been parsed, i.e., annotated with syntactic structure
• English: Penn Treebank (4.5M words) and many others
• Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
• Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
• Korean: Korean Treebank
Figure labels: Viet word segmenter, Viet POS tagger, Viet chunker, …c parser; example phrase “nhanh quá”
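To make "annotated with syntactic structure" concrete, here is a minimal sketch that reads one bracketed annotation into a nested tree and recovers the words. The bracketed string reuses the English sentence from the shallow-parsing example, with a hypothetical top-level `S` node added. It does not reflect the actual Viet Treebank annotation scheme, which these slides do not show.

```python
def parse_brackets(text):
    """Parse a string like "(NP (Det a) (NN book))" into (label, children)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Collect the words of a tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


annotated = ("(S (NP (Det The) (NN woman)) (VP (MD will) (VB give)) "
             "(NP (NNP Mary)) (NP (Det a) (NN book)))")
tree = parse_brackets(annotated)
print(tree[0], leaves(tree))
```

A treebank is a large collection of such annotated sentences, which is what makes it usable as training data for taggers, chunkers, and parsers.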

Page 24
SP7.3: Viet Treebank
… syntax and Vietnamese language
• “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả” (“the house is in a jumble” and “at home the door is not closed”)
• “Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn” (“She keeps her beauty” and “this painting has better color”)
• Labeling: agreement between labelers (95%)
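The first pair of examples shows why word segmentation is hard for annotators and tools alike: “nhà cửa” is one word in the first sentence, but “nhà” and “cửa ngõ” are separate words in the second. The sketch below uses greedy longest matching, a common baseline that is not claimed to be the VLSP method, with a hypothetical toy lexicon. It shows the baseline succeeding on one sentence and failing on the other.

```python
def longest_match(syllables, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # Fall back to a single syllable when nothing longer matches.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon built from the words in the slide's examples.
LEXICON = {"nhà cửa", "cửa ngõ", "bề bộn", "nhà", "cửa", "ngõ", "ở", "quá"}

first = longest_match("nhà cửa bề bộn quá".split(), LEXICON)
second = longest_match("ở nhà cửa ngõ".split(), LEXICON)
print(first)
print(second)
```

On the first sentence the greedy result is nhà cửa | bề bộn | quá, which matches the intended reading. On the second it produces ở | nhà cửa | ngõ, whereas the slide's gloss (“at home the door …”) corresponds to ở | nhà | cửa ngõ. Resolving such cases needs context, which motivates the statistical segmenters trained on annotated corpora.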

Page 25
SP7.4: English-Vietnamese parallel corpus

Parallel Corpus (L1-L2)   Sentences    L1 Words      L2 Words
German-English            1,313,096    34,700,362    36,663,083
Greek-English               662,090    18,834,758    18,827,241
Spanish-English           1,304,116    37,870,751    36,429,274
Finnish-English           1,257,720    24,895,790    34,802,617
French-English            1,334,080    41,573,117    37,436,222
Italian-English           1,251,315    36,411,166    36,510,033
Dutch-English             1,326,412    36,784,168    36,690,392
Portuguese-English        1,287,757    37,342,426    36,355,907
Swedish-English           1,164,536    28,882,142    32,053,628
(http://www.euromatrix.net)

• … sentences in English and Vietnamese (size & quality)
• … time, money and human resources (boring job)

Page 26
… parallel text discovery
… are translated from English sources on other web sites on the Internet

Page 27
Setting up the “standards” for VLSP
• Importance of “standards” in VLSP: choose an appropriate view from different schools on Vietnamese language
• Guide for word recognition and description: morphological, syntactic, semantic criteria
• Guide for constituent labeling: noun phrase, verb phrase, clause, etc.
• Guide for sentence splitting
• Others
• Challenge: standards for sustainable development

Page 28

Example: Guideline for POS tagging

Page 29
VLSP tools for the public
• All the tools are … methods to build the tools with the corpora.
• Tools and resources are to be given to the public:
  - SP7.1 English-Vietnamese dictionary
  - SP7.2 Viet dictionary
  - SP7.3 Vietnamese treebank
  - SP7.4 E-V corpora of aligned sentences
  - SP8.2 Vietnamese word segmentation
  - SP8.3 Vietnamese POS tagger
  - SP8.4 Vietnamese chunker
  - SP8.5 Vietnamese syntax analyser

Page 30
Using machine learning in creating tools
• MEMMs and CRFs are emerging techniques in NLP & machine learning
• CRF dynamic programming (L is the set of labels):
  - O(|L|^2 T): first-order
  - O(|L|^3 T): second-order
• Finite state machines
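The first-order bound O(|L|^2 T) comes from the dynamic program used for decoding: at each of the T positions, every current label is compared against every previous label. A minimal Viterbi sketch follows. The scores are hypothetical toy log-scores rather than the output of a trained CRF, and the chunk labels are chosen for illustration only.

```python
def viterbi(emissions, transitions):
    """Best label sequence for a first-order chain.

    emissions:   T x |L| log-scores (one row per position)
    transitions: |L| x |L| log-scores, transitions[prev][cur]
    """
    n_labels = len(transitions)
    best = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):              # T positions
        new_best, ptrs = [], []
        for cur in range(n_labels):                 # |L| current labels
            prev = max(range(n_labels),             # |L| previous labels
                       key=lambda p: best[p] + transitions[p][cur])
            new_best.append(best[prev] + transitions[prev][cur] + emissions[t][cur])
            ptrs.append(prev)
        best = new_best
        backptr.append(ptrs)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda l: best[l])]
    for ptrs in reversed(backptr):
        path.append(ptrs[path[-1]])
    return path[::-1]


LABELS = ["B-NP", "I-NP", "B-VP", "I-VP"]           # illustrative chunk tags
NEG = -10.0                                         # penalty for an invalid transition
transitions = [[0, 0, 0, NEG], [0, 0, 0, NEG], [0, NEG, 0, 0], [0, NEG, 0, 0]]
emissions = [[-0.1, -3.0, -2.0, -3.0],              # hypothetical scores, 4 tokens
             [-2.0, -0.3, -2.0, -3.0],
             [-2.0, -2.5, -0.4, -2.0],
             [-2.5, -2.5, -1.0, -0.5]]
path = viterbi(emissions, transitions)
print([LABELS[i] for i in path])
```

A second-order model conditions on the two previous labels, so the inner comparison runs over label pairs and the cost grows to O(|L|^3 T), as stated on the slide.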

Page 31
Figure labels: Data; CRFs; Online Learning; Chunking models; Vietnamese sentences; Decoding; Output. Example sentence: “Anh ấy đang ăn”

Page 32
SP8.5: HGSP grammar for syntax analysis
Figure labels: Text; Word segmentation module; Analysis module; Word feature dictionary; Rules with constraints; Attribute elimination; Syntax tree

Page 33
SP3: Machine translation and EVSMT1.0
SMT: statistical machine translation
Figure labels: Vietnamese-English bilingual text; English text; Statistical analysis; Vietnamese
Example candidate word orders: “Old man died the too …”, “The old man died too fast”; output: “The old man died too fast”
(Slides 31-32 adapted from tutorial on SMT, K Knight and P Koehn)

Page 34
SP3: Machine translation and EVSMT1.0
SMT: statistical machine translation
• Translation model P(v|e): statistical analysis of Vietnamese-English bilingual text
• Language model P(e): statistical analysis of English text
• Decoding algorithm: argmax_e P(v|e) × P(e)
Figure labels: Vietnamese; Broken English; English
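The decoding rule argmax_e P(v|e) × P(e) can be sketched over a handful of candidate translations. All probabilities below are invented toy numbers for illustration. A real system estimates the translation model from bilingual text and the language model from English text, and searches a far larger candidate space.

```python
import math


def decode(candidates, translation_logprob, language_logprob):
    """Pick the candidate e maximizing log P(v|e) + log P(e)."""
    return max(candidates,
               key=lambda e: translation_logprob[e] + language_logprob[e])


candidates = [
    "The old man died too fast",
    "Old man died the too fast",
    "The old man walks too fast",
]
# Hypothetical scores for the source sentence "Ông già đi nhanh quá".
translation_logprob = {                 # log P(v | e): adequacy
    candidates[0]: math.log(0.2),
    candidates[1]: math.log(0.2),
    candidates[2]: math.log(0.3),
}
language_logprob = {                    # log P(e): fluency of the English
    candidates[0]: math.log(0.05),
    candidates[1]: math.log(0.0001),
    candidates[2]: math.log(0.02),
}
best = decode(candidates, translation_logprob, language_logprob)
print(best)
```

With these toy numbers the scrambled word order is ruled out by the language model even though its translation-model score equals that of the fluent candidate. This is the division of labor between the two models that the slide describes.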

Page 35
SP3: Machine translation and EVSMT1.0
Figure labels: English sentence; SMT core; Vietnamese sentence; corpus collecting and building
… processing:
- Standardization
- Word segmentation (VNsegmenter)
- POS tagger (CRF Postagger, VnQtag)
- Morphological analyser (morpha)

Page 36
Google: English-Vietnamese translation
26.9.08 (translate.google.com, 35 languages)

Page 37
Machine translation issues and challenges
• SMT major difficulties: word choice, word order, tense and aspect, pronouns, idioms
• Target: improve phrase-based SMT in two aspects, word order and word choice
• Combination of tree-to-string SMT and phrase-based SMT (N.P Thai et al., Machine translation, Vol 20 No 3 (2006); IJCPOL, Vol 20, No 2 (2007))
• Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N.V Vinh, IJCPOL, 2009)

Page 38
1. Search on the Internet for web pages having information related to the query → list of websites in English
2. Translate the list of retrieved web pages into Vietnamese → list of websites in Vietnamese
3. Check each website → selected website in English
4. Translate the selected website into Vietnamese → web page translated into Vietnamese
5. Extract information (news) related to the query → text related to the query
6. Summarize the text for its gist → summarized text in English
7. Translate the gist into Vietnamese → summarized news translated into Vietnamese

Page 39

Vietnamese named-entities on the web

Ho Chi Minh University of Technology

Page 40
• Small parsed tree
• A rule shows the relation of a context and an action
• {a, b, c, d, e}, {b, e, a}
• Input list, CSTACK, RSTACK
• Actions: SHIFT, REDUCE, DROP, ASSIGN TYPE, RESTORE
• … sequence of actions
(Minh et al., COLING 2004, J CPOL’05, ACM Trans ALP’05, IEICE’06)

Page 41
Emerging trend detection (Le Minh Hoang, KSS journal, 2006)
• Topic representation: Which features are necessary to characterize topics (interest and utility over time)?
• Topic identification: How to extract these features from the corpus for each topic?
• … their increase over time?

Page 42
ETD: Topic representation
• ETD: detecting topics that are growing in interest and utility over time from a corpus
• Topic representation: Which features are necessary to characterize topics (interest and utility over time)?

Page 43
ETD: Topic identification
• How to extract these features from the corpus for each topic?
• Build 6 models corresponding to 6 types of citation
• Using HMM, MEMM, and CRF to extract features

Page 44
ETD: Topic verification
• … their increase over time?
• [Formulas garbled in extraction; recoverable symbols: Interest f(t), Utility g(t), the index set {1, 3, 5, 6}, and Growth(t, j), the growth of the time series {t(j)} along the time axis]
• The speed of growing at x = k; the acceleration of growing at x = k
• Speed > 0, Acceleration > 0
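The verification criterion, positive speed and positive acceleration at a point k, can be sketched with finite differences on a time series. The slide's exact definitions are not recoverable, so the sketch assumes that speed is the first difference and acceleration the second difference of the series. The yearly counts are hypothetical.

```python
def speed(series, k):
    """Backward first difference at index k (assumed proxy for growth speed)."""
    return series[k] - series[k - 1]


def acceleration(series, k):
    """Backward second difference at index k (assumed proxy for acceleration)."""
    return series[k] - 2 * series[k - 1] + series[k - 2]


def is_emerging(series, k):
    """True when both speed and acceleration are positive at index k."""
    if k < 2:
        raise ValueError("need at least two earlier points")
    return speed(series, k) > 0 and acceleration(series, k) > 0


accelerating = [2, 3, 5, 9, 16]   # hypothetical yearly counts for one topic
steady = [5, 6, 7, 8, 9]          # grows, but at a constant rate

print(is_emerging(accelerating, 4), is_emerging(steady, 4))
```

Under this assumption a topic with steady linear growth is not flagged, because its acceleration is zero. Only growth that is itself speeding up counts as an emerging trend.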

Page 46
Conclusion
• … infrastructure.
• … processing of other languages, especially statistical learning from large corpora.
• … industry for the next phase, and for collaboration.

Page 47
• The national project KC01.01.05/06-10
• Project members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others
VLSP meeting, 21-25 Nov 2005, JAIST

Page 48
… Korean and Arabic
Kyou, wareware wa kokoni atsumari, Betonamu-go to speech shori ni tsuite giron shimasu

Page 49
• Search for parallel documents translated from English sources on other web sites on the Internet
• … during translation process, e.g., term, NE, number

Page 50
Framework
• Queries are generated and executed from high ranks to low ranks
• Filtering
  • Length-based
  • TID-based
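The slide names length-based filtering without giving details. One common reading, assumed here, is that a candidate sentence pair is kept only when the ratio of token counts falls in a plausible range. The thresholds `low` and `high` and the function name are hypothetical.

```python
def length_ratio_ok(src, tgt, low=0.5, high=2.0):
    """Keep a candidate pair only if target/source token ratio is plausible.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return low <= n_tgt / n_src <= high


candidate_pairs = [
    ("The old man walks too fast", "Ông già đi nhanh quá"),  # 6 vs 5 tokens
    ("Home", "Ông già đi nhanh quá"),                        # 1 vs 5 tokens
]
kept = [pair for pair in candidate_pairs if length_ratio_ok(*pair)]
print(len(kept), "of", len(candidate_pairs), "pairs kept")
```

A filter of this kind is cheap and removes grossly mismatched candidates before more expensive checks. It cannot confirm that a surviving pair is actually a translation.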
