Parallel Texts Extraction from the Web
Lê Quang Hùng
University of Engineering and Technology
Master's thesis, major: Information Technology
Supervisor: Dr. Lê Anh Cường
Year of defense: 2010
Keywords: Computer networks; Website; Bilingual texts; Text copying
Contents
1.1 Parallel corpus and its role
1.2 Current studies on automatically extracting parallel corpus
1.3 Objectives of the thesis
1.4 Contributions
1.5 Thesis’ structure
2 Related works
2.1 The general framework
2.2 Structure-based methods
2.3 Content-based methods
2.4 Hybrid methods
2.5 Summary
3.1 The proposed model
3.1.1 Host crawling
3.1.2 Content-based filtering module
3.1.2.1 The method based on cognation
3.1.2.2 The method based on identifying translation segments
3.1.3 Structure analysis module
3.1.4 Classification modeling
3.2 Summary
4 Experiment
4.1 Evaluation measures
4.2 Experimental setup
4.3 Experimental results
4.4 Discussion
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future works
Bibliography
Bibliography
[1] Philip Resnik. Parallel strands: A preliminary investigation into mining the web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA), Langhorne, PA, pages 28-31, 1998.
[2] J. Chen and J.-Y. Nie. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings of ANLP, Seattle, pages 21-28, 2000.
[3] Xiaoyi Ma and Mark Liberman. BITS: A method for bilingual text search over the web. In Machine Translation Summit VII, 1999.
[4] J. Chen, R. Chau, and C.-H. Yeh. Discovering parallel text from the World Wide Web. In Proceedings of the Australasian Workshop on Data Mining and Web Intelligence (DMWI), pages 157-161, 2004.
[5] Dan Tufis. Cross-lingual knowledge induction from parallel corpora. Southern Journal of Linguistics, USA, pages 214-223, 2007.
[6] E. N. Westerhout. A corpus of Dutch aphasic speech: Sketching the design and performing a pilot study. 2005.
[7] A. Frankenberg-Garcia and D. Santos. Introducing COMPARA: the Portuguese-English parallel corpus. Corpora in Translator Education, pages 71-87, 2003.
[8] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 2005.
[9] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. A statistical approach to machine translation. Computational Linguistics, pages 79-85, 1990.
[10] I. Dan Melamed. Word-to-word models of translation equivalence. IRCS technical report, University of Pennsylvania, 1998.
[11] M. Davis and T. Dunning. A TREC evaluation of query translation methods for multi-lingual text retrieval. In Fourth Text Retrieval Conference (TREC-4), NIST, 1995.
[12] Martin Volk, Spela Vintar, and Paul Buitelaar. Ontologies in cross-language information retrieval. In Proceedings of WOW2003, pages 43-50, 2003.
[13] D. W. Oard. Cross-language text retrieval research in the USA. In Third DELOS Workshop, European Research Consortium for Informatics and Mathematics, 1997.
[14] Akira Kumano and Hideki Hirakawa. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings of the 15th COLING, pages 76-81, 1994.
[15] C. McEwan, I. Ounis, and I. Ruthven. Advances in information retrieval. Springer, pages 365-368, 2002.
[16] I. Dan Melamed. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Somerset, New Jersey, pages 97-108, 1997.
[17] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, Berkeley, pages 264-270, 1991.
[18] Philip Resnik. Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, College Park, MD, pages 527-534, 1999.
[19] Christopher C. Yang and Kar Wing Li. Building parallel corpora by automatic title alignment. In 5th International Conference on Asian Digital Libraries, ICADL 2002, pages 328-339, 2002.
[20] P. Resnik and N. A. Smith. The web as a parallel corpus. Computational Linguistics, pages 349-380, 2003.
[21] Ying Zhang, Ke Wu, Jianfeng Gao, and P. Vines. Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of ECIR-06, 2006.
[22] Van B. Dang and Ho Bao-Quoc. Automatic construction of English-Vietnamese parallel corpus through web mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF), Hanoi, Vietnam, 2007.
[23] Jörg Tiedemann and Lars Nygaard. The OPUS corpus - parallel and free. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1183-1186, 2004.
[24] Jan Pomikalek. Building parallel corpora from the web. 2007.
[25] Dragos Munteanu and Daniel Marcu. Extracting parallel sub-sentential fragments from non-parallel corpora. In ACL, pages 81-88, 2006.
[26] Bing Zhao and Stephan Vogel. Adaptive parallel sentences mining from web bilingual news collection. In Proceedings of the IEEE Workshop on Data Mining, 2002.
[27] Pascale Fung and Percy Cheung. Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of COLING, pages 1051-1057, 2004.
[28] Jesus Tomas, Enrique Sanchez-Villamil, Jaime Lloret, and Francisco Casacuberta. Webmining: An unsupervised parallel corpora web retrieval. In The Corpus Linguistics Conference, 2005.
[29] B. Barla Cambazoglu, Evren Karaca, Tayfun Kucukyilmaz, Ata Turk, and Cevdet Aykanat. Architecture of a grid-enabled web search engine. Information Processing and Management, pages 609-623, 2007.
[30] Michel Simard, George F. Foster, and Pierre Isabelle. Using cognates to align sentences in bilingual corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, Canada, 1992.
[31] Dragos Munteanu and Daniel Marcu. Improving machine translation performance by exploiting comparable corpora. Computational Linguistics, pages 477-504, 2005.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, Philadelphia, pages 311-318, 2002.
[33] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, 1989.