Information Extraction for Vietnamese Real-Estate Advertisements
Phạm Vi Liên
University of Engineering and Technology (Trường Đại học Công nghệ)
Master's thesis in Computer Science; Code: 60 48 01
Supervisor: Dr. Phạm Bảo Sơn
Year of defense: 2012
Abstract: In recent years, the real-estate market in Vietnam has grown rapidly, generating a large volume of real-estate information, especially advertisements for buying and selling property. This creates a pressing need for an information extraction system that helps users cope with the increasing number of real-estate advertisements on the Internet. We propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. At the same time, we set up a process for building an annotated corpus which can be used in machine learning approaches at a later stage. Our system achieves promising results, with F-measures above 90%. Our approach is particularly suitable for under-resourced languages where an annotated corpus of a decent size is not readily available.
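To make the rule-based idea and the F-measure concrete, here is a minimal sketch in Python. It is not the thesis system, which is built on GATE gazetteers and JAPE rules. The regular expressions, the sample advertisement, and the function names `extract` and `f_measure` are all assumptions chosen for illustration, and the F-measure shown is the usual balanced form F = 2PR / (P + R) with exact-match scoring.

```python
import re

# Hypothetical hand-written rules, one per entity type.
# These patterns are illustrative only and are not taken from the thesis.
RULES = {
    "Area": re.compile(r"\d+(?:[.,]\d+)?\s*m2\b", re.IGNORECASE),
    "Telephone": re.compile(r"\b0\d{9,10}\b"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract(text):
    """Return the set of (entity_type, matched_text) pairs found by the rules."""
    return {
        (label, match.group(0))
        for label, pattern in RULES.items()
        for match in pattern.finditer(text)
    }


def f_measure(predicted, gold):
    """Balanced F-measure from exact-match precision and recall."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up advertisement and gold annotations, for illustration only.
ad = "Bán nhà 45 m2, liên hệ 0912345678, email lien@example.com"
gold = {
    ("Area", "45 m2"),
    ("Telephone", "0912345678"),
    ("Email", "lien@example.com"),
}
predicted = extract(ad)
score = f_measure(predicted, gold)
```

Comparing `predicted` against a hand-annotated `gold` set is also how an annotated corpus, once built, could be used to score either a rule-based or a machine-learning extractor.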
Keywords: Information technology; Advertising; Real estate; Information extraction
Contents
1.1 Problem and Idea
1.2 Scope of the thesis
1.3 Thesis’ structure
2 Related Work
2.1 Approaches
2.1.1 Rule-based approach
2.1.2 Machine-learning approach
2.1.3 Hybrid approach
2.2 GATE framework
2.2.1 Introduction
2.2.2 General Architecture of GATE
2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
2.2.4 Working with GATE
2.2.5 Gazetteers
2.2.6 JAPE
3 Our Vietnamese Real-Estate Information Extraction system
3.1 Template Definition
3.2 Corpus Development
3.2.1 Criterion of data collection
3.2.2 Data collection
3.2.3 Data normalization
3.2.4 Corpus Annotation
3.3 System Development
3.3.1 Tokenizer
3.3.2 Gazetteer
3.3.3 JAPE Transducer
3.3.3.1 Remove incorrect Lookup annotations
3.3.3.2 Recognizing <TypeEstate> entities
3.3.3.3 Recognizing <CategoryEstate> entities
3.3.3.4 Recognizing <Zone> entities
3.3.3.5 Recognizing <Area>, <Price> and <Telephone> entities
3.3.3.6 Recognizing <Fullname> entities
3.3.3.7 Recognizing <Address> entities
3.3.3.8 Recognizing <Email> entities
3.4 Summary
4 Experiments and Error Analysis
4.1 Evaluation metrics
4.2 Experimental result
4.3 Errors Analysis
5.1 Conclusion
5.2 Future Works
Bibliography