Improving the Quality of Statistical Machine Translation Using Dependency Syntax Information

Objectives • Propose and improve methods for solving the phrase reordering problem in phrase-based statistical machine translation, following a preprocessing approach based on dependency parse trees.


VIETNAM NATIONAL UNIVERSITY, HANOI

FINAL REPORT

ON THE RESULTS OF THE SCIENCE AND TECHNOLOGY PROJECT

AT THE VIETNAM NATIONAL UNIVERSITY LEVEL

Project title: “Improving the Quality of Statistical Machine Translation Using Dependency Syntax Information”

Project code: QG.15.23

Principal investigator: Dr. Nguyễn Văn Vinh


PART I. GENERAL INFORMATION

1.1 Project title: Improving the Quality of Statistical Machine Translation Using Dependency Syntax Information

1.2 Project code: QG.15.23

1.3 List of the principal investigator and members participating in the project

(Table not recoverable; surviving cells: “… nghệ - ĐHQGHN”, “Principal investigator”, “… - Kỹ thuật Công nghiệp”.)

1.4 Host institution: University of Engineering and Technology, Vietnam National University, Hanoi

1.5 Implementation period:

1.5.1 Under the contract: 24 months, from February 2015 to February 2017

1.5.2 Extension (if any): from February 2017 to August 2017

1.5.3 Actual implementation: from February 2015 to December 2017

1.6 Changes from the original proposal (if any):

(Regarding objectives, content, methods, research results and organization of implementation; reasons; opinion of the managing agency)

1.7 Total approved budget of the project: 200 million VND.

1 Problem statement

The boom in machine translation approaches has produced commercial products that are widely used around the world (the translation systems of Google, Microsoft (http://www.microsofttranslator.com), ...). One of the important problems of phrase-based statistical machine translation is how to generate the correct order of words (phrases) in the target language. In Phrase-Based Statistical Machine Translation (PBSMT), phrase reordering is still simple and its quality is not high. In addition, because languages differ in many characteristics (especially in word order), reordering cannot be modeled accurately during translation [Och and Ney, 2004]. This has motivated many lines of research on the word reordering problem inside phrase-based statistical machine translation systems. Several studies that follow the preprocessing approach to the word reordering problem have reported good results [Peng Xu et al., 2009; Jason Katz-Brown et al., …].

The main idea of preprocessing-based phrase reordering is to rearrange the sentence in the source language (English) so that its word order is as close as possible to that of the target language (Vietnamese). The two main research directions for solving this problem through preprocessing are constituency parsing of the source sentence and dependency parsing of the source sentence.

A number of studies use syntactic information to solve the word reordering problem. One such method is to parse the source language and apply reordering rules as preprocessing steps. The main idea is to transform the source sentences so that their word order is as close as possible to that of the target sentences; training then becomes easier and word alignment quality also improves.

2 Objectives

• Propose and improve methods for solving the phrase reordering problem in phrase-based statistical machine translation, following a preprocessing approach based on dependency parse trees.

• Find ways to integrate dependency parse tree information into a statistical machine translation system (selecting syntactic information, building manual and automatic reordering rules between the two languages of a pair), with experiments and evaluation focused on the English-Vietnamese language pair.

• Build an experimental Vietnamese-to-English translation program that integrates the techniques proposed and improved in the project.

3 Research methods

We apply the following methods and techniques:

- Use the dependency tree of the source language together with linguistic information to provide an effective solution to the phrase reordering problem in the preprocessing step, applied to the English-Vietnamese language pair.

- Use machine learning techniques to integrate dependency syntax information effectively into the statistical machine translation system. Study, mine and extend manual (hand-built) rules and automatic rules (extracted automatically from a corpus) for transforming between the two languages, and apply them to improve statistical machine translation quality.

- Propose techniques for effectively integrating linguistic knowledge (dependency syntax) into the statistical machine translation system.


4 Summary of research results

The project carried out the following research content:

Content 1: Study methods for phrase reordering based on the preprocessing approach

- Study models and techniques for phrase reordering between the two languages based on preprocessing.

- Implement and test preprocessing-based phrase reordering techniques.

Content 2: Study how to integrate dependency tree information into a statistical machine translation system

- Study techniques for integrating dependency tree information into a statistical machine translation system.

- Implement and test the integrated models.

Content 3: Collect and preprocess resources for mining bilingual data.

- Identify and collect suitable text resources.

- Study and build a tokenization (lexical analysis) module for English.

- Study and apply word segmentation techniques for Vietnamese.

Content 4: Build an experimental Vietnamese-English statistical machine translation system

- Build a baseline Vietnamese-English translation system.

5 Evaluation of the results achieved and conclusions

Results achieved:

Implemented and tested the integrated models.

Products delivered:

An experimental Vietnamese-English statistical machine translation system.

6 Summary of results (Vietnamese and English)

Vietnamese summary

Improving the Quality of Statistical Machine Translation Using Dependency Syntax Information

Summary:


The boom in machine translation approaches has produced commercial products that are widely used around the world (the translation systems of Google, Microsoft, ...). One of the important problems of phrase-based statistical machine translation is how to generate the correct order of words (phrases) in the target language. In Phrase-Based Statistical Machine Translation (PBSMT), phrase reordering is still simple and its quality is not high. In addition, because languages differ in many characteristics (especially in word order), reordering cannot be modeled accurately during translation. This has motivated many lines of research on the word reordering problem inside phrase-based statistical machine translation systems.

The main idea of preprocessing-based phrase reordering is to rearrange the sentence in the source language (English) so that its word order is as close as possible to that of the target language (Vietnamese). The two main research directions for solving this problem through preprocessing are constituency parsing of the source sentence and dependency parsing of the source sentence.

There have been some studies on phrase-based statistical machine translation systems for the English-Vietnamese language pair. Studies on phrase-based statistical machine translation that use preprocessing with dependency trees are still few. Research on preprocessing-based phrase reordering has mainly addressed the English-to-Vietnamese direction using constituency trees. The challenges are:

- Existing studies mainly address the English-to-Vietnamese direction; the Vietnamese-to-English direction has not been covered.

- Some studies have applied dependency-tree-based word reordering to the English-to-Vietnamese direction. However, these studies mainly use hand-written rules and have not applied automatically extracted rules to the translation task.

- Few studies use dependency-tree-based preprocessing for the Vietnamese-to-English direction, and those that exist have many limitations that must be addressed to improve quality.

The project “Improving the Quality of Statistical Machine Translation Using Dependency Syntax Information” focuses on these challenges in order to improve the quality of statistical machine translation; a number of research efforts that bring dependency parse trees into statistical translation have been applied.


English summary

One of the important problems of phrase-based statistical machine translation is how to generate the correct sequence of words in the target language. In Phrase-Based Statistical Machine Translation (PBSMT), phrase reordering is still simple and its quality is not high. In addition, since languages have many different characteristics (especially differences in word order), reordering cannot be modeled accurately during translation. This has led to many research directions for solving the word reordering problem within phrase-based statistical machine translation systems.

The main idea of the approach is to preprocess (pre-order) the sentence in the source language (English) so that its word order is as close as possible to that of the target language (Vietnamese). The two main research directions for addressing the problem through preprocessing are constituency parsing of the source sentence and dependency parsing of the source sentence.

There have been some studies on phrase-based statistical machine translation systems for the English-Vietnamese language pair. Research on statistical machine translation that uses preprocessing with dependency trees is still limited. Research on phrase reordering with preprocessing has mainly addressed the English-to-Vietnamese direction using constituency trees. Some of the challenges:

- Much research addresses English-to-Vietnamese translation, but none addresses Vietnamese-to-English translation.

- Some studies have applied dependency-based pre-ordering to English-Vietnamese statistical machine translation, but they use manual rules and do not apply automatic rules.

PART III. PRODUCTS, PUBLICATIONS AND TRAINING RESULTS OF THE PROJECT

3.1 Research results

Scientific requirements and/or economic-technical targets

(Table not recoverable; surviving cells: “… phrase reordering techniques” (twice), “… experimental Vietnamese-English …”.)

3.2 Form and level of publication of results

Table column headers: Status (printed / accepted for printing / application filed / application accepted as valid / intellectual property certificate granted / product use confirmed); Address stated and VNU funding acknowledged in accordance with regulations; Overall assessment (pass / fail)

… national [journals] or scientific reports published in international conference proceedings

5.1 Viet-Hong Tran, Huyen Vu Thuong, Vinh Van Nguyen and Minh Le Nguyen, “Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation”, In VNU Journal of Science.

5.2 Viet Tran Hong, Vinh Van Nguyen and Minh Le Nguyen, “Improving …

5.3 Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, “The English-Vietnamese Machine Translation System for IWSLT 2015”, Proceedings of IWSLT 2015.

5.4 Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le, “A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation”, Proceedings of CICLing 2016 (ISI/Scopus).

5.5 Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, “A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information”, Proceedings of RIVF 2016 (IEEE).

5.6 Luan Nghia Pham, Vinh Van Nguyen and Huy Quang Nguyen, “Translation model adaptation for Statistical Machine Translation with domain classifier”, Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation.

Scientific and technological products column: list the information on the S&T products in the following order:

<author names, title of the work, journal/publisher name, issue number, year of publication, pages, identifier of the work in the journal/monograph (DOI), journal type ISI/Scopus>

Scientific publications (articles, scientific reports, monographs, ...) are accepted only if they state the address and acknowledge VNU funding in accordance with regulations.

Full-text photocopies of these publications must be included in the appendix of supporting evidence of the report. For monographs, photocopies of the cover, the first page and the last page showing the publication code are required.

3.3 Training results

Table column headers: Duration and funding of participation in the project (number of months / amount); Related publications

PhD students

Việt

1 Viet Tran Hong, Vinh Van Nguyen and Minh Le Nguyen, “Improving English-Vietnamese Statistical Machine Translation Using Pre-processing Dependency Syntactic”, Proceedings of … Computational Linguistics 2015, p115-p121.

2 Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, “The English-Vietnamese Machine Translation System for IWSLT 2015”, Proceedings of the 12th International Workshop on Spoken Language Translation, 2015, p80-p84.

3 Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le, “A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation”, Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016. Available: http://site.cicling.org/2016/accepted.html

4 Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, “A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information”, Proceedings of the International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2016.

5 Viet-Hong Tran, Huyen Vu Thuong, Vinh Van Nguyen and Minh Le Nguyen, “Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation”, In VNU Journal of Science: Computer Science and Communication Engineering, pages 175-179, 2017.

Luân

1 Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, “The English-Vietnamese Machine Translation System for IWSLT 2015”, Proceedings of the 12th International Workshop on Spoken Language Translation, 2015, p80-p84. Available: http://workshop2015.iwslt.org

2 Luan Nghia Pham, Vinh Van Nguyen and Huy Quang Nguyen, “Translation model adaptation for Statistical Machine Translation with domain classifier”, Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 2017.

Master's students

1 Vũ Thương Huyền (status: defended)

1 Viet Tran Hong, Huyen Vu Thuong, Pham Nghia Luan, Vinh Nguyen Van and Trung Le Tien, “The English-Vietnamese Machine Translation System for IWSLT 2015”, Proceedings of the 12th International Workshop on Spoken Language Translation, 2015, p80-p84. Available: http://workshop2015.iwslt.org

2 Viet Tran Hong, Huyen Vu Thuong, Vinh Nguyen Van and Minh Nguyen Le, “A Classifier-based Preordering Approach for English-Vietnamese Statistical Machine Translation”, Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016. Available: http://site.cicling.org/2016/accepted.html

3 Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, “A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information”, Proceedings of the International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2016.

4 Master's thesis: “Nghiên cứu mô hình ngôn ngữ dựa trên mạng neural” (a study of neural-network-based language models)

Notes:

Attach a photocopy of the cover page of the dissertation/thesis/graduation paper, and of the degree or certificate, for each PhD or master's student who has successfully defended the dissertation/thesis.

The related-publications column is filled in as in section III.1.

PART IV. SUMMARY OF THE S&T PRODUCTS AND TRAINING RESULTS OF THE PROJECT

(Table not recoverable; surviving header and row fragments: “registered”, “Number completed”, “ISI/Scopus”, “… national specialized scientific journals or scientific reports published in international conference proceedings”, “… of the user organization”, “policy … or S&T application facility”.)

PART V. USE OF FUNDS

(Table not recoverable; surviving column headers: “Approved budget (million VND)”, “Actual expenditure (million VND)”, “Notes”.)

PART VI. APPENDIX (evidence for the products listed in Part III)

1 Conference papers

[1] Viet Tran Hong, Vinh Van Nguyen and Minh Le Nguyen, “Improving English-Vietnamese Statistical Machine Translation Using Pre-processing Dependency Syntactic”, Proceedings of the …

… of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016. Available: http://site.cicling.org/2016/accepted.html (ISI)

[4] Viet Tran Hong, Huyen Vu Thuong, Thu Pham Hoai, Vinh Nguyen Van and Nguyen Le Minh, “A Reordering Model For Vietnamese-English Statistical Machine Translation Using Dependency Information”, Proceedings of the International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2016 (IEEE).

[5] Luan Nghia Pham, Vinh Van Nguyen and Huy Quang Nguyen, “Translation model adaptation for Statistical Machine Translation with domain classifier”, Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 31), 2017. Available: http://paclic31.national-u.edu.ph/?page_id=302 (Scopus).


2 Papers in national specialized scientific journals (specialized journals on the list accepted by the professorial title council, such as the VNU Hanoi specialized journal on information and communication technology / Tạp chí Tin học điều khiển / Chuyên san Bưu chính Viễn thông / ...)

[1] Viet-Hong Tran, Huyen Vu Thuong, Vinh Van Nguyen and Minh Le Nguyen, “Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation”, In VNU Journal of Science: Computer Science and Communication Engineering, pages 175-179, 2017.

Host institution of the project

(Signature and seal of the head of the institution)

On behalf of the Rector

Head of the Science and Technology Office

Hanoi, day … month … 2017.

Accepted Manuscript

Available Online: 31 May, 2017

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. Articles in Press are accepted, peer reviewed articles that are not yet assigned to volumes/issues, but are citable using DOI.

Dependency-based Pre-ordering For English-Vietnamese Statistical Machine Translation

Tran Hong Viet 1,2, Nguyen Van Vinh 2, Vu Thuong Huyen 3, Nguyen Le Minh 4

1 University of Economic and Technical Industries, Hanoi, Vietnam

2 University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam

3 ThuyLoi University, Hanoi, Vietnam

4 Japan Advanced Institute of Science and Technology

Email: thviet@uneti.edu.vn, vinhnv@vnu.edu.vn, huyenvt@tlu.edu.vn, nguyenml@jaist.ac.jp

Abstract

Reordering is a major challenge in machine translation (MT) between two languages with significant differences in word order. In this paper, we present a pre-processing approach for phrase-based statistical machine translation (SMT) that uses a dependency parser together with manual and automatically learned reordering rules from English to Vietnamese. The dependency parse trees and transformation rules are used to reorder the source sentences and are applied in systems translating from English to Vietnamese. We evaluated our approach on English-Vietnamese machine translation tasks and showed that it outperforms the baseline phrase-based SMT system.

Keywords: Natural Language Processing, Machine Translation, Phrase-based Statistical Machine Translation.

1 Introduction

Phrase-based statistical machine translation [1] is the state of the art in SMT because of its power in modelling short reordering and local context. However, with phrase-based SMT, long-distance reordering is still problematic. The reordering problem (global reordering) is one of the major problems, since different languages have different word order requirements. In recent years, many reordering methods have been proposed to tackle the long-distance reordering problem.

Many solutions to the reordering problem have been proposed, such as syntax-based models [2] and lexicalized reordering [3]. Chiang [2] shows significant improvements by keeping the strengths of phrases while incorporating syntax into SMT. Some approaches were applied at the word level [4]. They are useful for languages with rich morphology, for reducing data sparseness. Other kinds of syntax reordering methods require parse trees, such as the work in [4]. The parse tree is more powerful in capturing the sentence structure. However, it is expensive to create tree structures and to build a good-quality parser. All the above approaches require much decoding time, which is expensive.

The approach that we are interested in balances the quality of translation with decoding time. Reordering approaches as a preprocessing step [5, 6, 7] are very effective (significant improvement over state-of-the-art phrase-based and hierarchical machine translation systems, and separate quality evaluation of each reordering model).

The end-to-end neural MT (NMT) approach [8] has recently been proposed for MT. However, the NMT method has some limitations that may jeopardize its ability to generate better translations. An NMT system usually suffers from a serious out-of-vocabulary (OOV) problem, which badly hurts translation quality. The NMT decoder lacks a mechanism to guarantee that all the source words are translated and usually favors short translations. It is difficult for an NMT system to benefit from a target language model trained on a target monolingual corpus, which is proven to be useful for improving translation quality in statistical machine translation (SMT). NMT also needs much more training time: in [9], NMT requires a longer time to train (18 days) compared to their best SMT system (3 days).

(b) Preordering for English-Vietnamese translation

Figure 1: An example of preordering for English-Vietnamese translation.

Inspired by these preprocessing approaches, we propose a combined approach which preserves the strength of phrase-based SMT in reordering and decoding time as well as the strength of integrating syntactic information in reordering. Firstly, the proposed method uses dependency parsing as a preprocessing step for both training and testing. Secondly, transformation rules are applied to reorder the source sentences. The experimental results on the English-Vietnamese pair show that our approach achieves improvements in BLEU scores [10] when translating from English, compared to MOSES [11], which is the state-of-the-art phrase-based SMT system.

This paper is structured as follows: Section 1 introduces the reordering problem. Section 2 reviews the related works. Section 3 introduces phrase-based SMT. Section 4 describes how to apply transformation rules for reordering the source sentences. Section 5 presents the learning model used to transform the word order of an input sentence into an order that is natural in the target language. Section 6 describes experimental results; Section 7 discusses the experimental results. Conclusions are given in Section 8.

* Corresponding author. Email: thviet@uneti.edu.vn

2 Related works

The difference in word order between source and target languages is the major problem in phrase-based statistical machine translation. Fig. 1 shows an example in which a reordering approach modifies the word order of an input sentence in the source language (English) in order to generate the word order of the target language (Vietnamese).

Many preordering methods using syntactic information have been proposed to solve the reordering problem. Collins (2005) and Xu (2009) [4, 5] presented preordering methods which used manually created rules on parse trees. In addition, linguistic knowledge for a language pair is necessary to create such rules. Other preordering methods using automatically created reordering rules or a statistical classifier were studied in [12, 7].

Collins [4] developed a clause detection step and used some handwritten rules to reorder words in the clause. Habash (2007) [13] built automatically extracted syntactic rules. Xu [5] described a method using a dependency parse tree and flexible rules to perform the reordering of subject, object, etc. These rules were written by hand, but [5] showed that an automatic rule learner can be used.

Bach [14] proposed a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models.

Genzel (2010) and Lerner and Petrov (2013) [6, 7] described methods using discriminative classifiers to directly predict the final word order. Cai [15] introduced a novel pre-ordering approach based on dependency parsing for Chinese-English SMT.

Isao Goto [16] described a preordering method using a target-language parser via cross-language syntactic projection for statistical machine translation.

Joachim Daiber [17] examined the relationship between preordering and word order freedom in machine translation. Chenchen Ding [18] proposed extra-chunk pre-ordering of morphemes, which allows Japanese functional morphemes to move across chunk boundaries.

Christian Hadiwinoto presented a novel reordering approach utilizing sparse features based on dependency word pairs [19], and a novel reordering approach utilizing a neural network and dependency-based embeddings to predict whether the translations of two source words linked by a dependency relation should remain in the same order or should be swapped in the translated sentence [9]. This approach is complex and takes much time to process.

However, there have not been many studies on English-Vietnamese SMT tasks. To our knowledge, no research addresses reordering models for English-Vietnamese SMT based on dependency parsing. Compared with the approaches mentioned above, our proposed method differs as follows. We investigate reordering models for English-Vietnamese SMT using dependency information. We study English and Vietnamese as SVO languages in order to recognize the differences between their word labels, phrase labels and dependency labels. We use a dependency parse of the English sentence for translating from English to Vietnamese. Based on the above studies, we utilize English-Vietnamese transformation rules (manual rules and automatic rules extracted from an English-Vietnamese parallel corpus) that directly predict the target-side word order as a preprocessing step in phrase-based machine translation. As in [13], we also apply preprocessing at both training and decoding time.

3 Brief Description of the Baseline Phrase-based SMT

Figure 2: An example with POS tags and a dependency parse for the sentence “I 'm looking at a new jewelry site .”

In this section, we describe the phrase-based SMT system which was used for the experiments. Phrase-based SMT, as described by [1], translates a source sentence into a target sentence by decomposing the source sentence into a sequence of source phrases, which can be any contiguous sequences of words (or tokens treated as words) in the source sentence. For each source phrase, a target phrase translation is selected, and the target phrases are arranged in some order to produce the target sentence. A set of possible translation candidates created in this way is scored according to a weighted linear combination of feature values, and the highest scoring translation candidate is selected as the translation of the source sentence. Symbolically,

$$\hat{t} = \arg\max_{t,a} \sum_{i=1}^{n} \lambda_i f_i(s, t, a)$$

where $s$ is the source sentence, $t$ is the target sentence, $a$ is the phrasal alignment, and the $f_i(s, t, a)$ are feature functions. The weights $\lambda_i$ associated with each feature $f_i$ are tuned to maximize the quality of the translation hypothesis selected by the decoding procedure that computes the argmax. The log-linear model is a natural framework to integrate many features. The probabilities of source phrases given target phrases, and of target phrases given source phrases, are estimated from the bilingual corpus.
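As a rough illustration of the scoring step described in this section, the sketch below ranks candidate translations by a weighted linear combination of feature values. The feature names (`tm`, `lm`, `distortion`), the weights and the candidate strings are made-up assumptions for illustration only; they are not taken from the paper or from Moses.

```python
def score(features, weights):
    """Weighted linear combination: sum over i of lambda_i * f_i."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the candidate with the highest score (the argmax)."""
    return max(candidates, key=lambda c: score(c["features"], weights))


# Hypothetical weights and log-domain feature values, chosen only for this example.
weights = {"tm": 1.0, "lm": 0.5, "distortion": 0.3}
candidates = [
    {"text": "candidate A", "features": {"tm": -2.0, "lm": -4.0, "distortion": -1.0}},
    {"text": "candidate B", "features": {"tm": -1.5, "lm": -6.0, "distortion": 0.0}},
]
best = best_candidate(candidates, weights)
```

In a real system the weights would be tuned on held-out data rather than set by hand, as the text notes.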

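Koehn [1] used the following distortion model (reordering model), which simply penalizes non-monotonic phrase alignment based on the word distance of successively translated source phrases with an appropriate value for the parameter $\alpha$:

$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$$

where $a_i$ is the start position of the source phrase translated into the $i$-th target phrase and $b_{i-1}$ is the end position of the source phrase translated into the $(i-1)$-th target phrase.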
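A minimal sketch of such a distance-based distortion penalty follows. It assumes the common form in which each translated phrase contributes α raised to |start - previous_end - 1|, with 1-based inclusive source spans; the function name and the value α = 0.5 are arbitrary choices for this example, not values from the paper.

```python
def distortion_penalty(phrase_spans, alpha=0.5):
    """Product of alpha ** |start_i - end_(i-1) - 1| over phrases in translation order.

    phrase_spans: (start, end) source positions, 1-based and inclusive,
    listed in the order in which the phrases are translated.
    """
    penalty = 1.0
    prev_end = 0  # position before the first source word
    for start, end in phrase_spans:
        penalty *= alpha ** abs(start - prev_end - 1)
        prev_end = end
    return penalty


monotone = distortion_penalty([(1, 2), (3, 3), (4, 6)])   # no jumps, penalty stays 1.0
reordered = distortion_penalty([(3, 3), (1, 2), (4, 6)])  # jumps of 2, 3 and 1 words
```

A monotone translation is not penalized, while every jump shrinks the score, which is why long-distance reordering is hard for the baseline system.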

T.H. Viet et al. / VNU Journal of Science: Comp. Science & Com. Eng., Vol. 31, No. 3 (2017) 1-13

Moses [11] is an open-source toolkit for statistical machine translation that allows translation models to be trained automatically for any language pair. Given a trained model, an efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices. In our work, we also used Moses to evaluate on English-Vietnamese machine translation tasks.

4 Dependency Syntactic Preprocessing For SMT

Reordering approaches for the English-Vietnamese translation task have limitations. In this paper, we first produce a parse tree using dependency parser tools [20]. Figure 3 shows an example of a parsed English sentence.

Figure 3: Example of a dependency parse of an English sentence using the Stanford Parser.

Then, we utilize some dependency relations extracted from a statistical dependency parser to create the dependency-based reordering rules. Dependency parses among words typed with grammatical relations have proven to be useful information in some applications related to syntactic processing.

Figure 4: Representation of the Stanford Dependencies for the English source sentence.

We use the dependency grammars and the differences in word order between Vietnamese and English to create a set of reordering rules. There are approximately 50 grammatical relations in English, while there are 27 in Vietnamese according to [21]. Based on these rules, we propose a method which is capable of applying and combining them simultaneously. We utilize the word labels in [21] to analyze and extract POS tags and head-modifier dependencies.

In addition, we focus on analyzing some popular structures of English when translating to Vietnamese. This analysis can achieve remarkable improvements in translation performance. Because English and Vietnamese are both SVO languages, the order of the verb rarely changes; we therefore focus mainly on some typical relations such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Inspired by [5], our study employs dependency syntax and transformation rules to reorder the source sentences and applies them to an English-Vietnamese translation system. For example, in a noun phrase there always exists a head noun with components before and after it. These auxiliary components will move to new positions according to the Vietnamese translation order.

Let us consider the examples in Figure 6 and Figure 7, which show the difference in word order between English and Vietnamese noun phrases and adjectival and adverbial phrases.

4.1 Transformation Rule

In this section, we describe a transformation rule.

Figure 5: An example of using dependency syntactic preprocessing, before and after our preprocessing.

Figure 6: An example of the word reordering phenomenon in a noun phrase with an adjectival modifier (amod) and a determiner modifier (det). English sentence: “a personal computer”; reordered sentence: “a computer personal”; Vietnamese sentence: “một máy tính cá nhân”. In this example, the noun “computer” is swapped with the adjective “personal”.

Figure 7: An example of the word reordering phenomenon in an adjectival phrase with an adverbial modifier (advmod) and a determiner modifier (det).

Our rule set is for English-Vietnamese phrase-based SMT. Table 1 shows handwritten rules using dependency syntactic preprocessing to reorder from English to Vietnamese.

In the proposed approach, a transform rule is a mapping from T to a set of tuples (L, W, O):

• T is the part-of-speech (POS) tag of the head in a dependency parse tree node.

• L is a dependency label for a child node.

• W is a weight indicating the order of that child node.

• O is the type of order (either NORMAL or REVERSE).

Our rule set provides a valuable resource for preordering in English-Vietnamese phrase-based SMT.

4.2 Dependency Syntactic Processing

We aim to reorder an English sentence to obtain a new English sentence in which some words are arranged in Vietnamese word order. The weight is used to determine the relative order of the children, going from the largest to the smallest. The weight can be any real-valued number. The type of order is only used when multiple children have the same weight. The order type NORMAL means we preserve the original order of the children, while REVERSE means we flip the order. We reserve a special label, self, to refer to the head node itself so that we can apply a weight to the head, too. We will call this tuple a precedence tuple in later discussions. In this study, we use manually created rules only.

Suppose we have a reordering rule: NNS → (prep, 0, NORMAL), (rcmod, 1, NORMAL), (self, 0, NORMAL), (poss, -1, NORMAL), (admod, -2, REVERSE). For the example shown in Figure 4, we would apply it to the ROOT node, resulting in "songwriter that wrote many songs romantic."

We apply the rules in a dependency tree recursively, starting from the root node. If the POS tag of a node matches the left-hand side of a rule, the rule is applied and the order of the sentence is changed. We go through all the children of the node and get the precedence weights for them from the set of precedence tuples. If we encounter a child node that has a dependency label not listed in the set of tuples, we give it a default weight of 0 and a default order type of NORMAL. The child nodes are sorted according to their weights from highest to lowest, and nodes with the same weight are ordered according to the type of order defined in the rule.

T                   (L, W, O)
JJ or JJS or JJR    (advcl, 1, NORMAL), (self, -1, NORMAL), (aux, -2, REVERSE), (auxpass, -2, REVERSE), (neg, -2, REVERSE), (cop, 0, REVERSE)
NN or NNS           (prep, 0, NORMAL), (rcmod, 1, NORMAL), (self, 0, NORMAL), (poss, -1, NORMAL), (admod, -2, REVERSE)
IN or TO            (pobj, 1, NORMAL), (self, 2, NORMAL)

Table 1: Handwritten rules for reordering English to Vietnamese using dependency syntactic preprocessing.
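The sorting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `order_family`, the `RULES` dictionary layout and the one-entry rule set are hypothetical; the label `amod` follows Figure 6 (Table 1 prints the label as `admod`); and the way a group of equal-weight children with mixed order types is handled is an assumption, since the text does not specify it.

```python
from itertools import groupby

NORMAL, REVERSE = "NORMAL", "REVERSE"

# Hypothetical rule set: head POS tag -> {child label: (weight, order type)}.
RULES = {
    "NN": {"self": (0, NORMAL), "amod": (-2, REVERSE)},
}

def order_family(head_pos, family, rules=RULES):
    """Reorder one family (a head and its direct children).

    family: list of (word, label) pairs in source order; the head itself
    carries the reserved label "self".
    """
    tuples = rules.get(head_pos)
    if tuples is None:  # no rule for this head: keep the source order
        return [word for word, _ in family]
    # Labels missing from the rule get the default weight 0 and NORMAL.
    scored = [(word, *tuples.get(label, (0, NORMAL))) for word, label in family]
    # Python's sort is stable, so equal weights keep their source order.
    scored.sort(key=lambda item: -item[1])
    result = []
    for _, group in groupby(scored, key=lambda item: item[1]):
        group = list(group)
        # Assumption: a tie group is flipped if any member is REVERSE.
        if any(order == REVERSE for _, _, order in group):
            group.reverse()
        result.extend(word for word, _, _ in group)
    return result

print(order_family("NN", [("a", "det"), ("personal", "amod"), ("computer", "self")]))
```

With this illustrative rule, the noun phrase "a personal computer" of Figure 6 comes out as "a computer personal": the determiner and the head share the default weight 0 and keep their order, while the adjectival modifier, with weight -2, moves after the head.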

Figure 5 gives examples of original and preprocessed phrases in English. The first line is the original English sentence: "that songwriter wrote many songs romantic.", and the fourth line is the target Vietnamese reordering: "Nhạc sĩ đó đã viết nhiều bài hát lãng mạn." This sentence is arranged in Vietnamese word order. We aim to preprocess sentences as in Figure 5. The Vietnamese sentence is the output of our method. As can be seen, after reordering, the English line has the same word order as the Vietnamese sentence.

5 Classifier-based Preordering for Phrase-based SMT

Currently, the state-of-the-art phrase-based SMT system uses the lexicalized reordering model in the Moses toolkit. In our work, we also used Moses to evaluate on English-Vietnamese machine translation tasks.

5.1 Classifier-based Preordering

In this section, we describe the learning model that can transform the word order of an input sentence into an order that is natural in the target language. English is used as the source language, while Vietnamese is used as the target language in our discussion of word order.

For example, when translating the English sentence:

I 'm looking at a new jewelry site.

to Vietnamese, we would like to reorder it as:

I 'm looking at a site new jewelry.

This model is then used in combination with the translation model.

The feature built for the "site, a, new, jewelry" family in Figure 2 is:

NN, DT, det, JJ, amod, NN, nn, 1230, 1023

We use dependency grammars and the differences in word order between English and Vietnamese to create a set of reordering rules. We first part-of-speech (POS) tag and parse the input sentence, producing the POS tags and head-modifier dependencies shown in Figure 2. We then traverse the dependency tree, starting at the root, to reorder it. We determine the order of the head and its children (independently of other decisions) for each head word and continue the traversal recursively in that order. In the above example, we need to decide the order of the head "looking" and the children "I", "'m", and "site".

[Table 2: corpus statistics; columns: Corpus, Sentence pairs, Training Set, Development Set, Test Set]

Feature   Description
T         The head's POS tag
1T        The first child's POS tag
1L        The first child's syntactic label
2T        The second child's POS tag
2L        The second child's syntactic label
3T        The third child's POS tag
3L        The third child's syntactic label
4T        The fourth child's POS tag
4L        The fourth child's syntactic label
Oi        The sequence of head and its children

Table 3: List of features.

The words in a sentence are reordered according to a new sequence learned from training data using a multi-class classifier model. We use an SVM classification model [22] that supports multi-class prediction. The class labels correspond to reordering sequences, so the model is able to select the best one from many possible sequences.

5.2 Features

The features extracted from the dependency tree include POS tags and alignment information. We traverse the tree from the top, and in each family we create features with the following information:

• The head's POS tag.

• The first child's POS tag, the first child's syntactic label.

• The second child's POS tag, the second child's syntactic label.

• The third child's POS tag, the third child's syntactic label.

• The fourth child's POS tag, the fourth child's syntactic label.

• The sequence of head and its children in source alignment.

• The sequence of head and its children in target alignment. It is the class label for the SVM classifier model.
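To make the feature layout concrete, the following Python sketch builds one training instance for a family. It is a minimal illustration under stated assumptions: the function name `family_feature` and the tuple layout are hypothetical, each word is assumed to have exactly one aligned target position (unaligned or multiply aligned words are not handled), and the target positions in the usage example are made up so that they reproduce the "site, a, new, jewelry" feature quoted earlier.

```python
def family_feature(head, children):
    """Build the feature (pattern, source sequence, target sequence) of a family.

    head:     (pos_tag, source_index, target_index)
    children: list of (pos_tag, label, source_index, target_index),
              given in source order.
    Member 0 is the head; members 1..n are the children.
    """
    positions = [(head[1], head[2])] + [(c[2], c[3]) for c in children]
    ids = range(len(positions))
    # Member ids sorted by source position, then by aligned target position.
    src_seq = "".join(str(i) for i in sorted(ids, key=lambda i: positions[i][0]))
    tgt_seq = "".join(str(i) for i in sorted(ids, key=lambda i: positions[i][1]))
    pattern = [head[0]]
    for pos, label, _, _ in children:
        pattern += [pos, label]
    return pattern, src_seq, tgt_seq

# "I 'm looking at a new jewelry site": a=4, new=5, jewelry=6, site=7.
# The target indices below are hypothetical alignment points.
pattern, src_seq, tgt_seq = family_feature(
    ("NN", 7, 4),
    [("DT", "det", 4, 3), ("JJ", "amod", 5, 5), ("NN", "nn", 6, 6)],
)
print(", ".join(pattern + [src_seq, tgt_seq]))
```

Under these assumptions, the source sequence is 1230 (the three children precede the head in English) and the target sequence, which serves as the class label, is 1023 (the head moves directly after the determiner).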

We limited ourselves to processing families that have fewer than five children, based on counting the total number of families in each group: 1 head and 1 child, 1 head and 2 children, 1 head and 3 children, and 1 head and 4 children. We found that most families (80%) in our training sentences have four or fewer children.

Pattern                          Order     Example
NN, DT, det, JJ, amod, NN, nn    1,0,2,3   I 'm looking at a new jewelry site → I 'm looking at a site new jewelry
NNS, JJ, amod, cc, cc, NNS, con  2,1,0,3   it faced a blank wall → it faced a wall blank
NNP, NNP, nn, NNP, nn            2,1,0     it 's a social phenomenon → it 's a phenomenon social

Table 4: Examples of rules and reordered source sentences.

Algorithm 1 Extract rules
input: dependency trees of source sentences and alignment pairs;
output: set of automatic rules;
for each family in dependency trees of subset and alignment pairs of sentences do
    generate feature (pattern + order);
end for
Build model from set of features;
for each family in dependency trees in the rest of the sentences do
    generate pattern for prediction;
    get predicted order from model;
    add (pattern, order) as new rule in set of rules;
end for

We trained a separate classifier for each number of possible children. Hence, the classifiers learn to trade off between a rich set of overlapping features. The list of features is given in Table 3.

We use the SVM classification model in the WEKA tools [23], since it naturally supports multi-class prediction and can therefore be used to select one out of many possible permutations. The learning algorithm produces a sparse set of features. In our experiments, the models were based on features generated from 100k English-Vietnamese sentence pairs.

When extracting the features, every word can be represented by its word identity, its POS tag from the treebank, and its syntactic label. We also include pairs of these features, resulting in potentially bilexical features.

Algorithm 2 Apply rule
input: source-side dependency trees, set of rules;
output: set of new sentences;
for each dependency tree do
    for each family in tree do
        generate pattern;
        get order from set of rules based on pattern;
        apply transform;
    end for
end for

For example, the English sentence in Figure 2:

I 'm looking at a new jewelry site

is transformed into Vietnamese order:

I 'm looking at a site new jewelry.

For this approach, we first do preprocessing to encode some special words and parse the sentences into dependency trees using the Stanford Parser [24]. Then, we use the target-to-source alignment and the dependency tree to generate features. We add the source and target alignment, POS tag, and syntactic label of each word to its node in the dependency tree. For each family in the tree, we generate a training instance if it has four or fewer children. If a family has five or more children, we discard this family but still keep traversing at each child.

Each rule consists of a pattern and an order. For every node in the dependency tree, from the top down, we match the node against the patterns, and if a match is found, the associated order is applied. We arrange the words in the English sentence that are covered by the matching node in Vietnamese word order. Then we do the same for each child of this node. If no rule is applied, we use the order of the original sentence. These rules are learnt automatically from bilingual corpora. The outline of our algorithm is given as Alg. 1 and Alg. 2.

Algorithm 1 automatically extracts the rules, taking as input the dependency trees of the source sentences and the alignment pairs.

Algorithm 2 proceeds by considering all rules produced by Algorithm 1, together with the source-side dependency trees, to build the new sentences.
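The rule application step can be sketched as a recursive traversal. The code below is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the function names and the rule dictionary (pattern tuple mapped to a permutation in which 0 denotes the head and 1..n the children in source order) are hypothetical. Families with more than four children, or with no matching pattern, keep their original order, while the traversal still continues into each child.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    pos: str
    index: int                      # position in the source sentence
    label: str = "root"             # dependency label to the parent
    children: list = field(default_factory=list)

def pattern_of(node):
    """Head POS tag followed by (POS tag, label) of each child in source order."""
    parts = [node.pos]
    for child in sorted(node.children, key=lambda c: c.index):
        parts += [child.pos, child.label]
    return tuple(parts)

def linearize(node, rules, max_children=4):
    """Return the words under `node`, reordered family by family."""
    kids = sorted(node.children, key=lambda c: c.index)
    members = [node] + kids         # id 0 is the head, ids 1..n the children
    order = rules.get(pattern_of(node)) if len(kids) <= max_children else None
    if order is None:               # no rule: keep the original source order
        ordered = sorted(members, key=lambda m: m.index)
    else:
        ordered = [members[i] for i in order]
    words = []
    for member in ordered:
        if member is node:
            words.append(node.word)
        else:                       # recurse into the child's own family
            words.extend(linearize(member, rules, max_children))
    return words

site = Node("site", "NN", 3, children=[
    Node("a", "DT", 0, "det"),
    Node("new", "JJ", 1, "amod"),
    Node("jewelry", "NN", 2, "nn"),
])
rules = {("NN", "DT", "det", "JJ", "amod", "NN", "nn"): (1, 0, 2, 3)}
print(" ".join(linearize(site, rules)))
```

For the noun phrase "a new jewelry site" and the first rule of Table 4, the sketch yields "a site new jewelry"; with an empty rule set the phrase is returned unchanged.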

5.4 Classification Model

The reordering decisions are made by multi-class classifiers (corresponding to the number of permutations: 2, 6, 24, 120), where the class labels correspond to permutation sequences. We train a separate classifier for each number of possible children. Crucially, we do not learn explicit tree transformation rules, but let the classifiers learn to trade off between a rich set of overlapping features. To build a classification model, we use the SVM classification model in the WEKA tools. The following results are obtained using 10-fold cross-validation.
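The class counts quoted above follow from the family size: a head with n children has (n + 1)! possible orders. The short Python sketch below enumerates the labels for n = 1 to 4; the function name `class_labels` and the encoding of a permutation as a digit string (0 for the head, 1..n for the children, as in the feature example "1023") are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial

def class_labels(num_children):
    """All orderings of a head (id 0) and its children (ids 1..n)."""
    size = num_children + 1
    return ["".join(map(str, p)) for p in permutations(range(size))]

for n in range(1, 5):
    labels = class_labels(n)
    assert len(labels) == factorial(n + 1)
    print(n, "children:", len(labels), "classes")
```

This gives 2, 6, 24 and 120 classes for one to four children, which is why a separate classifier per family size keeps each label set small.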

We apply the rules in a dependency tree recursively, starting from the root node. If the POS tags of a node match the left-hand side of a rule, the rule is applied and the order of the sentence is changed. We go through all the children of the node and match rules for them from the set of automatic rules.

Table 4 gives examples of original and preprocessed phrases in English. The first line is the original English sentence, "I 'm looking at a new jewelry site", and the target Vietnamese reordering is "Tôi đang xem một trang web mới về nữ_trang". This sentence is arranged in Vietnamese word order. Vietnamese sentences are the output of our method. As can be seen, after reordering, the English line has the same word order as the Vietnamese sentence: "I 'm looking at a site new jewelry" in Figure 1.

6 Experimental Results

6.1 Data Set and Experimental Setup

For evaluation, we used a Vietnamese-English corpus [25], including about 131236 pairs for training, 1000 pairs for testing and 400 pairs for the development test set. Table 2 gives more statistical information about our corpora. We conducted experiments with the SMT Moses decoder [11] and SRILM [26]. We trained a trigram language model using interpolation and Kneser-Ney discounting (the interpolate and kndiscount options) on a Vietnamese monolingual corpus. Before extracting the phrase table, we used GIZA++ [3] to build word alignments with the grow-diag-final-and algorithm. Besides using preprocessing, we also used the default reordering model in the Moses decoder: word-based extraction (wbe), splitting the type of reordering orientation into three classes (monotone, swap and discontinuous: msd), combining backward and forward directions (bidirectional), and modeling based on both source and target language (fe) [11]. For comparison, we tried preprocessing the source sentences with manual rules and with automatic rules.

We implemented the system as follows:

• We used the Stanford Parser [24] to parse the source sentences and applied preprocessing to the source sentences (English sentences).

• We used classifier-based preordering, with the SVM classification model [22] in the Weka tools [23], to train feature-rich discriminative classifiers, extract automatic rules and apply them to reorder words in English sentences according to Vietnamese word order.

• We implemented the preprocessing step during both training and decoding time.

• We used the SMT Moses decoder [11] for decoding.

We give some definitions for our experiments:

• Baseline: the baseline phrase-based SMT system using the lexicalized reordering model in the Moses toolkit.

• Manual Rules: the phrase-based SMT systems applying manual rules [27].

• Auto Rules: the phrase-based SMT systems applying automatic rules [28].

• Auto Rules + Manual Rules: the phrase-based SMT systems applying automatic rules, then applying manual rules.

6.2 Using Manual Rules

In this section, we present our experiments on translating from English to Vietnamese in a statistical machine translation system. We used the Stanford Parser [24] to parse the source sentences and applied preprocessing to the source sentences (English sentences). According to typical differences in word order between English and Vietnamese, we created a set of dependency-based rules for reordering words in an English sentence according to Vietnamese word order. The types of rules include noun phrase, adjectival and adverbial phrase, and preposition rules, which are described in Table 1.

6.3 Using Automatic Rules

We present our experiments on translating from English to Vietnamese in a statistical machine translation system. Hence, the language pair chosen is English-Vietnamese. We used the Stanford Parser [24] to parse the source sentences (English sentences).

We used dependency parsing and rules extracted by training feature-rich discriminative classifiers for reordering source-side sentences. The rules are automatically extracted from the English-Vietnamese parallel corpus and the dependency parses of the English examples. Finally, we used these rules to reorder the source sentences. We evaluated our approach on English-Vietnamese machine translation tasks with the systems in Table 5; the results show that it can outperform the baseline phrase-based SMT system.

6.4 BLEU Score

The results of our experiments in Table 6 show the size of the phrase tables built from the translation model based on our method. With this method, we can find a wider variety of phrases in the translation model, which gives the decoder more options for generating the best translation.

Table 7 describes the BLEU scores of our experiments. As we can see, by applying preprocessing in both training and decoding, the BLEU score of the "Auto Rules" system is 0.49 points lower than that of the "Manual Rules" system. This result is due to the fact that the manual rules have better quality than the automatic rules. However, the "Auto Rules + Manual Rules" system is the best system, because applying the combined rules can cover more linguistic phenomena.

The above results demonstrate the effect of applying transformation rules based on the dependency parse tree.

7 Analysis and Discussion

We found that the results of our experiments are sufficiently correlated with translation quality assessed manually. We also identified some sources of error, such as the quality of the source-sentence parse trees, the word alignment quality and the quality of the corpus. All of these errors can affect the automatic reordering rules. Table 9 shows examples of translations produced by our system that are better than those of the baseline system, for input sentences from the English-Vietnamese test set.

Go here for more examples of translations for input sentences sampled randomly from our corpus. Some phrases in the English source sentence were reordered to correspond to the Vietnamese target sentence order. We focus mainly on some typical relations such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Our study employed dependency syntax and transformation rules to reorder the source sentence and applied them to English-to-Vietnamese translation systems.

For example, within a noun phrase there always exists a head noun and the components before and after it. These auxiliary components will move to new positions according to Vietnamese translational order. These rules can cover popular source linguistic phenomena equivalent to those of the target language. Based on these phenomena, translation quality has significantly improved. We carried out an error analysis on sentences and compared them to the golden reordering. Our analysis also shows the benefits of automatic reordering rules for translation quality.

In combination with the machine learning method in related work [7], it is shown that a classifier method can be applied to solve reordering problems automatically.

Name                        Description
Baseline                    Phrase-based system
Manual Rules                Phrase-based system with corpus preprocessed using manual rules
Auto Rules                  Phrase-based system with corpus preprocessed using automatically learned rules
Auto Rules + Manual Rules   Phrase-based system with corpus preprocessed using automatically learned rules and manual rules

Table 5: Our experimental systems on the English-Vietnamese parallel corpus.

Auto Rules + Manual Rules   1253401

Table 6: Size of phrase tables.

Auto Rules + Manual Rules   37.85

Table 7: Translation performance for the English-Vietnamese task.

Number of children   Number of heads   Description
1                    79142             Family has 1 child
2                    40822             Family has 2 children
3                    26008             Family has 3 children
5                    7442              Family has 5 children
6                    2728              Family has 6 children

Table 8: Statistics on the number of families in the English-Vietnamese corpus.

According to typical differences in word order between English and Vietnamese, we created a set of automatic rules for reordering words in an English sentence according to Vietnamese word order. The types of rules include noun phrase, adjectival and adverbial phrase, and prepositional phrase rules. Table 8 gives statistics on the families in our corpus, including those with four or more children. The number of children in each family is limited to 4 in our approach, so in the target language (Vietnamese), the number of children in each family is the same.

The manual rules have good quality [5, 13], and the phrase-based SMT systems applying manual rules are better than the phrase-based SMT systems applying automatic rules. We believe that the quality of the phrase-based SMT systems applying automatic rules will be better when we have a better corpus.

Input sentence:           The coat was far too big - it completely enveloped him
Translation (Baseline):   Chiếc áo khoác là quá lớn - nó hoàn toàn phủ anh ta
Translation (Auto):       Chiếc áo khoác là quá lớn - nó phủ hoàn toàn anh ta
Translation (human):      Chiếc áo khoác quá lớn - nó hoàn toàn phủ anh ta

Input sentence:           Manh Cuong is a young football player
Translation (fragment):   bóng đá trẻ rất nhiều triển vọng

Table 9: An example of a translation produced by our system for an input sentence sampled from the English-Vietnamese corpus.

8 Conclusion

In this paper, we present a preprocessing approach based on a dependency parser. The proposed approach is applied to an English-Vietnamese translation system. The experimental results show that our approach achieved statistical improvements in BLEU scores over a state-of-the-art phrase-based baseline system. By applying manual rules and automatic rules, the quality of the English-Vietnamese translation system is improved. In our study, our rules cover some linguistic reordering phenomena. These reordering rules benefit the English-Vietnamese language pair.

We will focus much more on word order problems and linguistic reordering phenomena in English-Vietnamese in order to learn better dependency-based reordering rules (manual rules and automatic rules). This is necessary for improving SMT systems and might lead to their wider adoption.

Acknowledgment

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

References

[1] P. Koehn, F. J. Och, D. Marcu, Statistical phrase-based translation, in: Proceedings of HLT-NAACL 2003, Edmonton, Canada, 2003, pp. 127-133.

[2] D. Chiang, A hierarchical phrase-based model for statistical machine translation, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, 2005.

[3] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (1) (2003) 19-51.

[4] M. Collins, P. Koehn, I. Kucerová, Clause restructuring for statistical machine translation, in: Proc. ACL 2005, Ann Arbor, USA, 2005, pp. 531-540.

[5] P. Xu, J. Kang, M. Ringgaard, F. Och, Using a dependency parser to improve SMT for subject-object-verb languages, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Boulder, Colorado, 2009, pp. 245-253.

[6] D. Genzel, Automatically learning source-side reordering rules for large scale machine translation, in: Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, 2010, pp. 376-384.

[7] U. Lerner, S. Petrov, Source-side classifier preordering for machine translation, in: EMNLP, 2013, pp. 513-523.

[8] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144, 2016.

[9] C. Hadiwinoto, H. T. Ng, A dependency-based neural reordering model for statistical machine translation, arXiv preprint arXiv:1702.04510, 2017.

[10] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: ACL, 2002.

[11] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: Proceedings of ACL, Demonstration Session, 2007.

[12] N. Yang, M. Li, D. Zhang, N. Yu, A ranking-based approach to word reordering for statistical machine translation, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, 2012, pp. 912-920.

[13] N. Habash, Syntactic preprocessing for statistical machine translation, Proceedings of the 11th MT Summit, 2007.

[14] N. Bach, Q. Gao, S. Vogel, Source-side dependency tree reordering models with subtree movements and constraints, in: Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII), International Association for Machine Translation, Ottawa, Canada, 2009.

[15] J. Cai, M. Utiyama, E. Sumita, Y. Zhang, Dependency-based pre-ordering for Chinese-English machine translation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

[16] I. Goto, M. Utiyama, E. Sumita, S. Kurohashi, Preordering using a target-language parser via cross-language syntactic projection for statistical machine translation, ACM Transactions on Asian and Low-Resource Language Information Processing 14 (3) (2015) 13.

[17] J. Daiber, M. Stanojevic, W. Aziz, K. Sima'an, Examining the relationship between preordering and word order freedom in machine translation, in: Proceedings of the First Conference on Machine Translation (WMT16), Berlin, Germany, August, Association for Computational Linguistics, 2016.

[18] C. Ding, K. Sakanushi, H. Touji, M. Yamamoto, Inter-, intra-, and extra-chunk pre-ordering for statistical Japanese-to-English machine translation, ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15 (3) (2016) 20:1-20:28. doi:10.1145/2818381. URL http://doi.acm.org/10.1145/2818381

[19] C. Hadiwinoto, Y. Liu, H. T. Ng, To swap or not to swap? Exploiting dependency word pairs for reordering in statistical machine translation, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[20] B. M. de Marneffe, C. D. Manning, Generating typed dependency parses from phrase structure parses, in: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.

[21] T. L. Nguyen, M. L. Ha, V. H. Nguyen, T. M. H. Nguyen, P. Le-Hong, Building a treebank for Vietnamese dependency parsing, in: 2013 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, RIVF 2013, Hanoi, Vietnam, November 10-13, 2013, pp. 147-151.

[22] L. Wang, Support Vector Machines: Theory and Applications, Vol. 177, Springer Science & Business Media, 2005.

[23] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software: An update, SIGKDD Explor. Newsl. 11 (1) (2009) 10-18.

[24] D. Cer, M.-C. de Marneffe, D. Jurafsky, C. D. Manning, Parsing to Stanford dependencies: Trade-offs between speed and accuracy, in: 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.

[25] H. V. Huy, T.-L. N. Phuong-Thai Nguyen, M. Nguyen, Bootstrapping phrase-based statistical machine translation via WSD integration, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP 2013), 2013, pp. 1042-1046.

[26] A. Stolcke, SRILM - an extensible language modeling toolkit, in: Proceedings of International Conference on Spoken Language Processing, Vol. 29, 2002, pp. 901-904.

[27] V. H. Tran, V. V. Nguyen, M. L. Nguyen, Improving English-Vietnamese statistical machine translation using preprocessing dependency syntactic, in: Proceedings of the 2015 Conference of the Pacific Association for Computational Linguistics (PACLING 2015), pp. 115-121.

[28] V. H. Tran, H. T. Vu, V. V. Nguyen, M. L. Nguyen, A classifier-based preordering approach for English-Vietnamese statistical machine translation, 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016).


2015 Pacific Association for Computational Linguistics

etc. These rules were written by hand, but [8] showed that an automatic rule learner can be used.

[5] developed a clause detection method and used some handwritten rules to reorder words in the clause. Partly, [1], [13] built automatically extracted syntactic rules.

[14] introduced a novel pre-ordering approach based on dependency parsing for Chinese-English SMT. They present a set of dependency-based preordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. They also investigate the accuracy of the rule set by conducting human evaluations.

Other approaches have used syntactic parsing to provide multiple source sentence reordering options through word (phrase) lattices [15], [16]. [16] applied some transformation rules, which are learnt automatically from a bilingual corpus, to reorder some words in a chunk. A crucial difference between the existing methods and the proposed method is that they do not perform reordering during training, while the proposed method can solve this problem by using dependency syntax, which is more efficient with a preprocessing step.

In comparison with the approaches mentioned above, the proposed method has some differences, as follows. We have studied an SVO language using manually extracted precedence rules for the English-Vietnamese language pair. The proposed precedence reordering rules are more flexible than other rules [13]. The proposed approach includes a default precedence weight and order type, so the handwritten rules handle unseen children and their order naturally. We do not need to match an often too specific condition. This study applied dependency syntactic preprocessing rather than constituency syntactic preprocessing and compared the proposed approach with a baseline system for translating from English to Vietnamese. As in [1], [13], this study applies preprocessing at both training and decoding time.

III. BRIEF DESCRIPTION OF THE BASELINE PHRASE-BASED SMT

Phrase-based SMT, as described by [2], translates a source sentence into a target sentence by decomposing the source sentence into a sequence of source phrases, which can be any contiguous sequences of words (or tokens treated as words) in the source sentence. For each source phrase, a target phrase translation is selected, and the target phrases are arranged in some order to produce the target sentence. A set of possible translation candidates created in this way is scored according to a weighted linear combination of feature values, and the highest-scoring translation candidate is selected as the translation of the source sentence. Symbolically,

$$\hat{t} = \operatorname*{arg\,max}_{t,\,a} \sum_{i=1}^{n} \lambda_i f_i(s, t, a) \qquad (1)$$

where $s$ is the input sentence, $t$ is a possible output sentence, $a$ is a phrasal alignment that specifies how $t$ is constructed from $s$, and $\hat{t}$ is the selected output sentence. The weights $\lambda_i$ associated with each feature $f_i$ are tuned to maximize the quality of the translation hypothesis selected by the decoding procedure that computes the argmax. The log-linear model is a natural framework for integrating many features. The baseline system uses the following features:

• The probability of each source phrase in the hypothesis given the corresponding target phrase.

• The probability of each target phrase in the hypothesis given the corresponding source phrase.

• The lexical score for each target phrase given the corresponding source phrase.

• The lexical score for each source phrase given the corresponding target phrase.

• The target language model probability for the sequence of target phrases in the hypothesis.

• The word and phrase penalty scores, which ensure that the translation does not get too long or too short.

• The distortion model, which allows for reordering of the source sentence.
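As a small worked illustration of the weighted linear combination in Equation (1), the Python sketch below scores two candidate translations and selects the higher-scoring one. The feature names (`tm`, `lm`, `distortion`), the weights and the feature values are invented for this example and are not taken from the baseline system; a real decoder searches over a very large candidate space rather than a short list.

```python
def score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * f_i(s, t, a)."""
    return sum(weights[name] * value for name, value in features.items())

def best_candidate(candidates, weights):
    """Return the candidate with the highest log-linear score (the argmax)."""
    return max(candidates, key=lambda c: score(c["features"], weights))

# Hypothetical weights and log-domain feature values.
weights = {"tm": 1.0, "lm": 0.5, "distortion": 0.3}
candidates = [
    {"text": "một cá nhân máy tính",
     "features": {"tm": -2.0, "lm": -4.0, "distortion": -1.0}},
    {"text": "một máy tính cá nhân",
     "features": {"tm": -2.5, "lm": -2.0, "distortion": 0.0}},
]
print(best_candidate(candidates, weights)["text"])
```

Here the second candidate wins because its better language model score outweighs its slightly worse translation model score under the assumed weights; tuning the weights changes this trade-off.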

The probabilities of source phrases given target phrases, and of target phrases given source phrases, are estimated from the bilingual corpus. [2] used the following distortion model (reordering model), which simply penalizes non-monotonic phrase alignment based on the word distance of successively translated source phrases with an appropriate value for the

Then, we utilize some dependency relations extracted from a statistical dependency parser to create the dependency-based reordering rules. Dependency parses between words, typed with grammatical relations, have proven to be useful information in some applications relating to syntactic processing.

A. Transformation Rule

In this section, we focus on analyzing some popular structures of the English language when translating into Vietnamese. This is a very important area to study and can lead to remarkable improvements in translation performance. English is used as the source language, while Vietnamese is used as the target language in our discussion of word order.

We use dependency grammars and the differences in word order between English and Vietnamese to create the set


2015 Pacific Association for Computational Linguistics

[Figure 4: English sentence with dependencies, "a personal computer"; Vietnamese sentence, "một máy tính cá nhân"]

Figure 4. An example of the word reordering phenomenon in a noun phrase with adjectival modifier (amod) and determiner modifier (det). In this example, the noun "computer" is swapped with the adjective "personal".

of the reordering rules. Based on these rules, we propose a method which is capable of applying and combining them simultaneously.

Because English and Vietnamese are both SVO languages, the position of the verb rarely changes. We therefore focus mainly on some typical relations, such as noun phrases, adjectival and adverbial phrases, and prepositions, and created a manually written reordering rule set for the English-Vietnamese language pair. Inspired by [18], this study employed dependency syntax and transformation rules to reorder the source sentence and applied them to systems translating English to Vietnamese.

For example, with a noun phrase, there always exists a head noun and the components before and after it. These auxiliary components will move to new positions according to Vietnamese translational order.

Let us consider an example in Figure 4, Figure 5 to the

In the proposed approach, a transform rule is a mapping from T to a set of tuples (L, W, O):

• T is the part-of-speech (POS) tag of the head in a dependency parse tree node.

• L is a dependency label for a child node.

• W is a weight indicating the order of that child node.

• O is the type of order (either NORMAL or REVERSE).

T                   (L, W, O)
JJ or JJS or JJR    (advcl, 1, NORMAL) (self, -1, NORMAL) (aux, -2, REVERSE) (auxpass, -2, REVERSE) (neg, -2, REVERSE) (cop, 0, REVERSE)
NN or NNS           (prep, 0, NORMAL) (rcmod, 1, NORMAL) (self, 0, NORMAL) (poss, -1, NORMAL) (admod, -2, REVERSE)
IN or TO            (pobj, 1, NORMAL) (self, 2, NORMAL)

Table I

Our rule set provides a valuable resource for preordering in English-Vietnamese phrase-based SMT.

B. Dependency Syntactic Processing

This section presents a method to build a translation model for the English-to-Vietnamese pair. We aim to reorder an English sentence to obtain a new English sentence in which some words are arranged in Vietnamese word order. The type of order is only used when we have multiple children with the same weight, while the weight is used to determine the relative order of the children, going from the largest to the smallest. The weight can be any real-valued number. The order type NORMAL means we preserve the original order of the children, while REVERSE means we flip the order. We reserve a special label self to refer to the head node itself so that we can apply a weight to the head, too. We will call this tuple a precedence tuple in later discussions. In this study, we use manually created rules only.

Suppose we have a reordering rule: NNS → (prep, 0, NORMAL), (rcmod, 1, NORMAL), (self, 0, NORMAL), (poss, -1, NORMAL), (admod, -2, REVERSE). For the example shown in Figure 2, we would apply it to the ROOT node and obtain "songwriter that wrote many songs romantic."

We apply the rules to a dependency tree recursively, starting from the root node. If the POS tag of a node matches the left-hand side of a rule, the rule is applied and the order of the sentence is changed. We go through all the children of the node and get the precedence weights for them from the set of precedence tuples. If we encounter a child node that has a dependency label not listed in the set of tuples, we give it a default weight of 0 and a default order type of NORMAL. The children nodes are sorted according to their weights from highest to lowest, and nodes with the same weight are ordered according to the type of order defined in the rule.
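The recursive procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `reorder` function and the hand-built toy tree are hypothetical, only the NN/NNS rule from Table I is included, and a group of equally weighted nodes is flipped whenever any of its members is marked REVERSE (the text does not say how mixed order types within one group are resolved).

```python
from dataclasses import dataclass, field
from itertools import groupby

NORMAL, REVERSE = "NORMAL", "REVERSE"

# Precedence tuples for noun heads, transcribed from Table I: label -> (weight, order).
NOUN_RULE = {
    "prep": (0, NORMAL),
    "rcmod": (1, NORMAL),
    "self": (0, NORMAL),
    "poss": (-1, NORMAL),
    "admod": (-2, REVERSE),
}
RULES = {"NN": NOUN_RULE, "NNS": NOUN_RULE}


@dataclass
class Node:
    """Hypothetical dependency-tree node used only for this sketch."""
    word: str
    pos: str        # POS tag of this word
    label: str      # dependency label linking this word to its head
    index: int      # position in the original sentence
    children: list = field(default_factory=list)


def reorder(node):
    """Return the words of the subtree rooted at `node` in the new order."""
    rule = RULES.get(node.pos)
    # Head and children in their original sentence order.
    items = sorted([node] + node.children, key=lambda n: n.index)
    if rule is not None:
        def precedence(n):
            label = "self" if n is node else n.label
            # Unlisted labels get the default weight 0 and order NORMAL.
            return rule.get(label, (0, NORMAL))

        # Stable sort: highest weight first, ties keep the original order.
        items.sort(key=lambda n: -precedence(n)[0])
        regrouped = []
        for _, group in groupby(items, key=lambda n: precedence(n)[0]):
            group = list(group)
            # Assumption: a tie group is flipped if any member is REVERSE.
            if any(precedence(n)[1] == REVERSE for n in group):
                group.reverse()
            regrouped.extend(group)
        items = regrouped
    words = []
    for n in items:
        words.extend([n.word] if n is node else reorder(n))
    return words


# Toy tree for "a personal computer", labelled by hand with the labels of Table I.
tree = Node("computer", "NN", "root", 2, [
    Node("a", "DT", "det", 0),
    Node("personal", "JJ", "admod", 1),
])
print(" ".join(reorder(tree)))  # a computer personal
```

The determiner keeps the default weight 0 and therefore stays in front of the head, while the modifier with weight -2 moves behind it, which mirrors the Vietnamese order "một máy tính cá nhân".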

Figure 3 gives examples of original and preprocessed phrases in English. The first line is the original English sentence, "that songwriter wrote many songs romantic.", and the second line is the target Vietnamese reordering, "Nhạc sĩ đó đã viết nhiều bài hát lãng mạn." This sentence is arranged in Vietnamese word order. However, we aim to preprocess as in Figure 3. The Vietnamese sentences are the output of our method. Finally, the fourth line is the target Vietnamese sentence. As can be seen, after reordering, the original English line has the same word order.

V. EXPERIMENTS AND RESULTS

In this section, we present our experiments on translating from English to Vietnamese in a statistical machine translation system. We used the Stanford Parser [19] to parse the source sentences and applied the result to preprocess them (English sentences). According to typical differences in word order between English and Vietnamese, we created a set of about 17 dependency-based rules for reordering the words of an English sentence according to Vietnamese word order. The rule types cover noun phrases, adjectival and adverbial phrases, and prepositions, as described in Table I.

A. Dataset and Experimental Setup

We implemented the preprocessing step during both training and decoding, using the SMT Moses decoder [20] for decoding. We used an English-Vietnamese corpus [21], including about 54,642 pairs for training, 500 pairs for testing and 200 pairs for the development test set.

Table II gives more statistical information about our corpora. We conducted some experiments with the SMT Moses decoder [20] and SRILM [22]. We trained a trigram language model with interpolation and Kneser-Ney discounting (interpolate and kndiscount) on an 89M Vietnamese monolingual corpus.

Before extracting the phrase table, we use GIZA++ [23] to build the word alignment with the grow-diag-final-and algorithm.
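The language model and alignment settings named above correspond to standard options of SRILM's `ngram-count` and of the Moses training script `train-model.perl`. The Python sketch below only assembles and prints the command lines; every path is a hypothetical placeholder, and the exact invocation used by the authors is not given in the text.

```python
# Sketch of the toolchain invocations implied by the setup described above.
# All paths are hypothetical placeholders; nothing is executed here.

MONO_VI = "/path/to/mono.vi"       # Vietnamese monolingual text
LM_VI = "/path/to/lm.vi.arpa"      # output trigram language model
CORPUS = "/path/to/train"          # prefix of train.en / train.vi

# SRILM: trigram model with interpolated Kneser-Ney discounting.
lm_cmd = [
    "ngram-count", "-order", "3", "-interpolate", "-kndiscount",
    "-text", MONO_VI, "-lm", LM_VI,
]

# Moses training: GIZA++ alignments symmetrized with grow-diag-final-and.
train_cmd = [
    "train-model.perl", "-root-dir", "/path/to/work",
    "-corpus", CORPUS, "-f", "en", "-e", "vi",
    "-alignment", "grow-diag-final-and",
    "-lm", f"0:3:{LM_VI}:0",       # factor:order:file:type (type 0 = SRILM)
    "-external-bin-dir", "/path/to/giza-bin",
]

for cmd in (lm_cmd, train_cmd):
    print(" ".join(cmd))
```

Because the preprocessing is applied at both training and decoding time, the English side of the training corpus passed to the training script would be the reordered text rather than the original one.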


Corpus     Sentence pairs    Training Set    Development Set    Test Set
General    56642             54642           200                499

                                  English    Vietnamese
Training       Sentences          54620
               Average Length     11.2       10.6
               Word               614578     580754
               Vocabulary         23804      24097
Development    Sentences          200

Name          Size of phrase-table
Baseline      1152216
System I      1218676
System II     1228187
System III    1231365

Table III
SIZE OF PHRASE TABLES

System        BLEU(%)    Description
System I      36.75      applying rules with category JJ or JJS or JJR
System II     37.51      applying rules with category NN or NNS
System III    37.71      applying rules with category IN or TO
Baseline      36.97      phrase-based SMT in Moses toolkit

Table IV
DETAILS OF OUR EXPERIMENTS USING HANDWRITTEN RULES ON THE ENGLISH-VIETNAMESE CORPUS

Besides using preprocessing, we also used the default reordering model in the Moses decoder: using word-based extraction (wbe), splitting the type of reordering orientation into three classes (monotone, swap and discontinuous - msd), combining backward and forward directions (bidirectional), and modelling based on both source and target language (fe) [20].

We give some definitions for our experiments:

• Baseline: the baseline phrase-based SMT system using the distance-based default reordering model in Moses.

The results of our experiments show the effect of applying transformation rules to process the source sentences. The performance of the statistical machine translation systems in our experiments is evaluated by BLEU scores [24].

Table IV displays the BLEU scores of our systems using transformation rules (handwritten rules) on the English-Vietnamese corpus. These systems are compared with the baseline system (the state-of-the-art phrase-based system).

By applying preprocessing in both training and decoding, the BLEU score of our best system (System III) increases by 0.74 points over the Baseline system. An improvement of 0.74 BLEU points is valuable because the baseline system is a strong phrase-based SMT system (integrating lexicalized reordering models). We conducted our dependency-based pre-ordering experiments, and when applying transformation rules in the dependency-syntax preprocessing, we recognize the advantage of our dependency-based reordering approach: it can solve the problem of word movement effectively.
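The reported gain can be checked directly against the BLEU scores in Table IV. The short Python sketch below recomputes the difference of each system from the baseline; the function name is a hypothetical helper, and the scores are copied from the table.

```python
# BLEU scores (%) copied from Table IV.
bleu = {
    "Baseline": 36.97,
    "System I": 36.75,
    "System II": 37.51,
    "System III": 37.71,
}


def delta_over_baseline(scores, baseline="Baseline"):
    """Absolute BLEU difference of each system relative to the baseline."""
    base = scores[baseline]
    return {name: round(score - base, 2)
            for name, score in scores.items() if name != baseline}


deltas = delta_over_baseline(bleu)
for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

With these figures, System III is 0.74 points above the baseline and System II is 0.54 points above it, while System I (the JJ/JJS/JJR rules alone) is 0.22 points below it.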

The results in Table III show the effect of applying transformation rules to process the source sentences. Thanks to this method, we can find more varied phrases in the translation model, which gives the decoder more options for generating the best translation.

We also carried out experiments with handwritten rules. Using some handwritten rules helps the phrase-based translation model generate more of the best translations than the automatic rules do. We focus on some popular structures mentioned in the proposed approach: verb phrases, adjectival and adverbial phrases, prepositions and noun phrases.

The results demonstrate the effect of applying transformation rules on the dependency syntax. The quality of the translation system can be improved when the handwritten rules cover the corpus better. As a result, we obtain some parts of sentences with the best alignment, and we can then extract more and better phrase tables.

The results showed that the improved system (using transform rules) is able to outperform the baseline SMT system in BLEU score on an English-Vietnamese corpus for some rule categories.

VI. CONCLUSIONS

In this study, a preprocessing approach based on a dependency parser is presented. The proposed approach is applied to systems translating English to Vietnamese. The experimental results showed that our approach achieved statistically significant improvements in BLEU scores over a state-of-the-art phrase-based baseline system. An improvement of 0.74 BLEU points is valuable because the baseline system is a strong phrase-based SMT system.

Our rules are flexible and can cover many linguistic reordering phenomena. We believe that such reordering rules benefit the English-Vietnamese language pair.

In the future, we will investigate data-driven approaches which learn the dependency-based reordering rules automatically. In addition, we will focus much more on word order problems with linguistic reordering phenomena on the English-Vietnamese corpus. This is an important step in trying to improve SMT systems and might lead to a wider adoption of them.

We will attempt to create more efficient pre-ordering rules by exploiting the rich information in dependency structures.

ACKNOWLEDGMENT

The work described in this paper has been partially funded by Hanoi National University (QG.15.23 project).

REFERENCES

[1] F. Xia and M. McCord, "Improving a statistical MT system with automatically learned rewrite patterns," in Proceedings of Coling 2004. Geneva, Switzerland: COLING, Aug 23-Aug 27 2004, pp. 508-514.

[2] P. Koehn, F. J. Och, and D. Marcu, "Statistical phrase-based translation," in Proceedings of HLT-NAACL 2003. Edmonton, Canada, 2003, pp. 127-133.

[3] F. J. Och and H. Ney, "The alignment template approach to statistical machine translation," Computational Linguistics, vol. 30, no. 4, pp. 417-449, 2004.

[4] D. Chiang, "A hierarchical phrase-based model for statistical machine translation," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, June 2005, pp. 263-270.

[5] M. Collins, P. Koehn, and I. Kucerová, "Clause restructuring for statistical machine translation," in Proc. ACL 2005. Ann Arbor, USA, 2005, pp. 531-540.

[6] C. Quirk, A. Menezes, and C. Cherry, "Dependency treelet translation: Syntactically informed phrasal SMT," in Proceedings of ACL 2005. Ann Arbor, Michigan, USA, 2005, pp. 271-279.

[7] L. Huang and H. Mi, "Efficient incremental decoding for tree-to-string translation," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA: Association for Computational Linguistics, October 2010, pp. 273-283. [Online]. Available: http://www.aclweb.org/anthology/D10-1027

[8] P. Xu, J. Kang, M. Ringgaard, and F. Och, "Using a dependency parser to improve SMT for subject-object-verb languages," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Boulder, Colorado: Association for Computational Linguistics, June 2009, pp. 245-253. [Online]. Available: http://www.aclweb.org/anthology/N/N09/N09-1028

[9] D. Talbot, H. Kazawa, H. Ichikawa, J. Katz-Brown, M. Seno, and F. Och, "A lightweight evaluation framework for machine translation reordering," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, Scotland: Association for Computational Linguistics, July 2011, pp. 12-21. [Online]. Available: http://www.aclweb.org/anthology/W11-2102

[10] J. Katz-Brown, S. Petrov, R. McDonald, F. Och, D. Talbot, H. Ichikawa, M. Seno, and H. Kazawa, "Training a parser for machine translation reordering," in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK: Association for Computational Linguistics, July 2011, pp. 183-192. [Online]. Available: http://www.aclweb.org/anthology/D11-1017

[11] T. P. Nguyen and A. Shimazu, "Improving phrase-based SMT with morpho-syntactic analysis and transformation," in Proceedings AMTA 2006, 2006.

[12] C. Wang, M. Collins, and P. Koehn, "Chinese syntactic reordering for statistical machine translation," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 737-745. [Online]. Available: http://www.aclweb.org/anthology/D/D07/D07-1077

[13] N. Habash, "Syntactic preprocessing for statistical machine translation," Proceedings of the 11th MT Summit, 2007.

[14] J. Cai, M. Utiyama, E. Sumita, and Y. Zhang, "Dependency-based pre-ordering for Chinese-English machine translation," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.

[15] Y. Zhang, R. Zens, and H. Ney, "Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation," in Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation, 2007, pp. 1-8.


[16] P. T. Nguyen, A. Shimazu, L.-M. Nguyen, and V.-V. Nguyen, "A syntactic transformation model for statistical machine translation," International Journal of Computer Processing of Oriental Languages (IJCPOL), vol. 20, no. 2, pp. 1-20, 2007.

[17] M.-C. de Marneffe, B. MacCartney, and C. D. Manning, "Generating typed dependency parses from phrase structure parses," in Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.

[18] P. Xu, J. Kang, M. Ringgaard, and F. Och, "Using a dependency parser to improve SMT for subject-object-verb languages," in The 2009 Annual Conference of the North American Chapter of the ACL. Boulder, Colorado: Association for Computational Linguistics, June 2009, pp. 245-253.

[19] D. Cer, M.-C. de Marneffe, D. Jurafsky, and C. D. Manning, "Parsing to Stanford dependencies: Trade-offs between speed and accuracy," in 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010.

[20] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proceedings of ACL, Demonstration Session, 2007.

[21] T. P. Nguyen, A. Shimazu, T. B. Ho, M. L. Nguyen, and V. V. Nguyen, "A tree-to-string phrase-based model for statistical machine translation," in Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL 2008). Manchester, England: Coling 2008 Organizing Committee, August 2008, pp. 143-150. [Online]. Available: http://www.aclweb.org/anthology/W08-2119

[22] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of International Conference on Spoken Language Processing, vol. 29, 2002, pp. 901-904.

[23] F. J. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19-51, 2003.

[24] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002.


FORM 05/KHCN

(Issued together with Decision No. 3839/QĐ-ĐHQGHN dated 24 October 2014 of the President of Vietnam National University, Hanoi)

PROJECT DESCRIPTION

SCIENCE AND TECHNOLOGY PROJECT AT THE VNU HANOI (ĐHQGHN) LEVEL

(Please do not change the order of the items or delete the hints given in parentheses)

I GENERAL INFORMATION ABOUT THE PROJECT

1 - Project title

Vietnamese:

Cải tiến chất lượng dịch máy thống kê dựa vào thông tin cú pháp phụ thuộc

English:

Improving Statistical Machine Translation using Dependency Syntax Information

2 - Project code (assigned when the application is selected):

3 - Implementation period: 24 months, from 01/2015 to 12/2016

4 - Information about the principal investigator

Full name: Nguyễn Văn Vinh

Professional qualification: PhD

Academic title: Lecturer

Telephone: 0437549359

Current organization: Department of Computer Science (BM KHMT), Faculty of Information Technology (Khoa CNTT), University of Engineering and Technology (Trường ĐH Công Nghệ), VNU Hanoi

Organization address: 315, E3, 144 Xuân Thủy, Cầu Giấy, Hà Nội

5 - Project secretary (if any)

Current organization:

Faculty of Information Technology - University of Economics - Technology for Industries (Trường Đại học Kinh tế Kỹ thuật Công nghiệp)

Organization address: No. 456 Minh Khai, Hai Bà Trưng District, Hà Nội


6 - Host institution of the project

Name of host institution: University of Engineering and Technology (Trường Đại học Công nghệ), VNU Hanoi

E-mail: coltech@vnu.edu.vn

Website: http://www2.uet.vnu.edu.vn/uet/

Address: Building E3, 144 Xuân Thuỷ, Cầu Giấy, Hà Nội

7 - Origin of the project (reviewed selection, competitive selection, cooperation...)

- Competitive selection

8 - Main units collaborating on the project (if any)

Unit 1 (mandatory for bilateral cooperation S&T projects)

Name of governing unit:

Address:

Unit 2

Name of governing unit:

(Closely follow and concretize the objectives of the commissioned task)

2 One (01) converted month is a working month of 22 days, each working day consisting of 08 hours


• Propose and improve methods for solving the phrase reordering problem in phrase-based statistical machine translation, following a preprocessing approach based on dependency parse trees.

• Find ways to integrate dependency parse tree information into the statistical machine translation system (selecting syntactic information, building manual and automatic reordering rules between the two languages of a pair). Focus on experiments and evaluation on the English-Vietnamese language pair.

• Build an experimental program translating from Vietnamese to English that integrates the techniques proposed and improved in the project.

11 - Overview of domestic and international research and the research proposed by the project

11.1 General assessment of theoretical and practical research in the project's field

International (Analyze and evaluate related works and the latest research results in the project's field; state the advances in S&T level of those results; state the S&T problems that still need to be studied and solved).

In recent years, the boom of the phrase-based statistical machine translation approach [Koehn, 2003; Och and Ney, 2004] has produced commercial products that are widely used around the world

[...] phrase-based translation concerns how to generate the correct order of words (phrases) in the target language. In phrase-based statistical machine translation (PBSMT) systems, phrase reordering is still simple and of limited quality. In addition, because languages differ in many respects (especially in word order), the translation process cannot be modeled accurately [Och and Ney, 2004]. This has led to many research directions aimed at solving the word reordering problem inside phrase-based statistical machine translation systems. Several studies following the preprocessing approach to word reordering give good results [Peng Xu et al., 2009; Jason Katz-Brown et al., 2011; Cai et al., 2014].

The main idea of preprocessing-based phrase reordering is to preprocess sentences in the source language (English) so that their word order is as close as possible to that of the target language (Vietnamese). The two main research directions for solving this problem through preprocessing are constituency parsing of the source sentence and dependency parsing of the source sentence.

1 https://translate.google.com

2 http://www.microsofttranslator.com


Several studies use syntactic information to solve the word reordering problem. One such method is to parse the source language and apply reordering rules as preprocessing steps. The main idea is to transform the source sentences so that their word order is as close as possible to that of the target sentences, which makes training easier and improves word alignment quality. Some studies on this method are given in [N. Habash, 2007; C. Wang, M. Collins and P. Koehn, 2007].

These studies all rearrange word order in the preprocessing step based on syntactic parse trees combined with automatic rules [F. Xia and M. McCord, 2004] or manual rules [M. Collins et al., 2005].

Some extended methods use syntactic information to change the translation model: the syntax-based statistical machine translation approach [Yamada and Knight, 2001] proposes a statistical machine translation model based on the noisy channel model, in which the translation task is converted into a parsing task.

[M. Collins et al., 2005] apply manual rules to reorder sentences. The success of this approach shows that additional syntactic knowledge can yield a statistically significant improvement of 1 to 2% BLEU over the baseline system.

[F. Xia and M. McCord, 2004] present a study applying automatic reordering rules to translation from French to English. The reordering rules operate at the context-free level in the parse tree. Similar approaches [Nguyen and Shimazu, 2006] also propose using rules automatically extracted from data. The study shows that translation performance improves significantly in BLEU score compared with the baseline system.

Other studies use parsing to produce word reordering options for the source sentence through words (phrases) [Y. Zhang, R. Zens and H. Ney, 2007; P. T. Nguyen et al., 2007]. [P. T. Nguyen et al., 2007] apply transformation rules learned automatically from a bilingual corpus to reorder the words within a segment.

There have been several studies on statistical machine translation based on dependency syntax, such as:

[Fox and Heidi, 2005] describe a dependency-based statistical machine translation model built on the noisy channel model, in which the translation model encodes lexical translation probabilities, speech-tag conversion probabilities, head position probabilities and structural change probabilities. A language model based on syntactic constituents is used to support decoding.

[Ding, Yuan and Martha Palmer, 2005] introduce a probabilistic synchronous model in which each source dependency tree is decomposed into a sequence of unspecified elementary trees, and each elementary tree is converted into a target elementary tree. The target elementary trees are finally combined to generate the target sentence, after which a language model is applied.


[Quirk et al., 2005] describe a dependency treelet translation system that uses dependency trees with a tree-based ordering model, combined with a phrase-based statistical machine translation model, to produce good translations. A log-linear model is used to integrate multiple feature models, including a trigram language model.

[Peng Xu et al., 2009], a research group at Google, describe the use of dependency parsing to improve statistical machine translation for SOV (subject-object-verb) languages. They apply preprocessing for phrase reordering based on an analysis of the main issues that arise when translating from an SVO language into SOV languages. The experiments translate from English into five SOV languages (Korean, Japanese, Hindi, Urdu and Turkish).

[Cai et al., 2014] proposed several techniques for efficiently building rule sets based on dependency trees to improve the quality of statistical translation from Chinese to English.

Domestic (Analyze and evaluate the state of domestic research in the project's field and the related results that the participating staff have obtained. If projects in the same field have been or are being carried out at other levels or elsewhere, the content related to this project must be analyzed and stated clearly; if an ongoing project with which research could be coordinated is found, state clearly the project title, the principal investigator and the host institution of that project).

Studies have been carried out mainly for the English-Vietnamese translation direction; few address the Vietnamese-English direction. Moreover, the techniques mainly use constituency tree information to improve the quality of English-Vietnamese statistical machine translation.

[Hoai-Thu Vuong et al., 2012] perform word reordering during preprocessing using shallow syntax trees. The method can solve the problem effectively by using a composite structure, the shallow syntax tree.

[Vu HOANG et al., 2008] present a dependency-based word reordering method that has only been applied to the English-Vietnamese direction. Moreover, the rules applied are manual and few in number, so they cannot cover all cases, and automatic rules have not been studied.

[Nguyễn Lê Minh et al., 2008] use rules to correct the output of the MST method for Vietnamese dependency parsing; however, the test data are small, language resources are limited, and not all situations in the correction process have been evaluated to make the system more consistent.

11.2 Orientation of the research content of the project; justification of its necessity, urgency, and theoretical and practical significance

(Based on the assessment of domestic and international research, the analysis of related works and the latest results in the field, state clearly the remaining problems and, from them, the research objectives, the new directions for solving them and the content to be carried out, answering the questions of which problems the project addresses and which advantages and difficulties need to be handled).

11.3 List of research works and documents related to the project that were cited in the overview

At present, there have been many studies on phrase-based statistical machine translation systems for the English-Vietnamese language pair. Research on phrase-based statistical machine translation using preprocessing with dependency trees is still scarce (to the author's knowledge, only one paper [Vu HOANG et al., 2008] currently addresses this issue, and it goes no further than building a few simple rules and applying them to the English-Vietnamese direction with small data). Research on phrase reordering using preprocessing mainly addresses the English-Vietnamese direction with constituency trees. The challenges posed are:

- Studies mainly address the English-Vietnamese direction; the Vietnamese-English direction has not been covered.

- Some studies have applied word reordering based on dependency trees to the English-Vietnamese direction. However, these studies mainly use hand-written rules and have not applied automatic rules to the translation problem.

Few studies use dependency-tree-based preprocessing for the Vietnamese-English direction, and many limitations remain that need improvement to raise quality.

References:

[1] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer, "The mathematics of statistical machine translation: parameter estimation," Computational Linguistics,

[5] M. Collins, P. Koehn, and I. Kucerová, "Clause restructuring for statistical machine translation," in Proc. ACL 2005. Ann Arbor, USA, 2005, pp. 531-540.

phrasal smt," in Proceedings of ACL 2005. Ann Arbor, Michigan, USA, 2005, pp. 271-279.

[7] L. Huang and H. Mi, "Efficient incremental decoding for tree-to-string translation," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Cambridge, MA: Association for Computational Linguistics, October 2010, pp. 273-283. [Online]. Available: http://www.aclweb.org/anthology/D10-1027

[8] F. Xia and M. McCord, "Improving a statistical MT system with automatically learned rewrite patterns," in Proceedings of Coling 2004. Geneva, Switzerland: COLING, Aug 23-Aug 27 2004, pp. 508-514.

[9] P. Xu, J. Kang, M. Ringgaard, and F. Och, "Using a dependency parser to improve SMT for subject-object-verb languages," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Boulder, Colorado: Association for Computational Linguistics, June 2009, pp. 245-

[10] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning.

[11] T. P. Nguyen and A. Shimazu, "Improving phrase-based SMT with morpho-syntactic analysis and transformation," in Proceedings AMTA 2006, 2006.

translation," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 737-745. [Online]. Available: http://www.aclweb.org/anthology/D/D07/D07-1077

[13] N. Habash, "Syntactic preprocessing for statistical machine translation," Proceedings of the 11th MT Summit, 2007.

[14] Y. Zhang, R. Zens, and H. Ney, "Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation," in Proceedings of SSST, NAACL-HLT 2007/AMTA Workshop on Syntax and Structure in Statistical Translation, 2007, pp. 1-8.

[15] P. T. Nguyen, A. Shimazu, L.-M. Nguyen, and V.-V. Nguyen, "A syntactic transformation model for statistical machine translation," International Journal of Computer Processing of Oriental

[16] Hoai-Thu Vuong, Vinh Van Nguyen, Viet Hong Tran and Akira Shimazu, "Improving Statistical Machine Translation with Processing Shallow Parsing", Proceedings of the 26th Pacific Asia

http://www.aclweb.org/anthology/Y/Y12/Y12-1043.pdf

[17] Vu HOANG, Mai NGO, Dien DINH, "A Dependency-based Word Reordering Approach for Statistical Machine Translation", Research, Innovation and Vision for the Future, 2008. RIVF 2008. IEEE International Conference on

[18] Nguyễn Lê Minh, Hoàng Thị Điệp, Trần Mạnh Kế, "Nghiên cứu luật hiệu chỉnh kết quả dùng

http://www.jaist.ac.jp/~bao/VLSP-text/ICTrda08/ICT08-VLSP-SP84-1.pdf

[19] Kenji Yamada and Kevin Knight, "A syntax-based statistical translation model", Proceedings of the 39th Annual Meeting of the ACL, 2001, pages 523-530.
