
DSpace at VNU: Information Extraction for Vietnamese Real-Estate Advertisements




Information Extraction for Vietnamese Real-Estate Advertisements

Phạm Vi Liên

University of Engineering and Technology (Trường Đại học Công nghệ). Master's thesis, major: Computer Science; Code: 60 48 01

Supervisor: Dr. Phạm Bảo Sơn

Year of defense: 2012

Abstract. In recent years, the real-estate market in Vietnam has grown rapidly, generating a large volume of real-estate information, especially advertisements for buying and selling property. This creates a pressing need for an information extraction system that helps users cope with the increasing number of real-estate advertisements on the Internet. We propose a rule-based approach to building an information extraction system for online real-estate advertisements in Vietnamese. At the same time, we set up a process for building an annotated corpus which can be used in machine-learning approaches at a later stage. Our system achieves promising results, with F-measures above 90%. Our approach is particularly suitable for under-resourced languages, where an annotated corpus of a decent size is not readily available.
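To make the rule-based idea and the F-measure concrete, here is a minimal Python sketch. It is not the thesis's implementation, which is built on GATE gazetteers and JAPE rules; the regular expressions, the sample advertisement, the gold annotations, and the function names `extract` and `f_measure` are all hypothetical and chosen only for illustration. The entity labels (`Area`, `Price`, `Telephone`, `Email`, `Zone`) follow the entity types named in the thesis outline.

```python
import re

# Hypothetical, simplified rules. The thesis uses GATE gazetteers and JAPE
# grammars; these regular expressions only illustrate the general idea.
RULES = {
    "Telephone": re.compile(r"\b0\d{9,10}\b"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Area": re.compile(r"\b\d+(?:[.,]\d+)?\s?m2\b", re.IGNORECASE),
    "Price": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:tỷ|triệu)", re.IGNORECASE),
}


def extract(text):
    """Return the set of (entity_type, matched_text) pairs found by the rules."""
    found = set()
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.add((label, match.group(0)))
    return found


def f_measure(predicted, gold):
    """Balanced F-measure (F1) over exact-match entity sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up advertisement and gold annotations, for illustration only.
ad = "Bán nhà 45m2 tại Cầu Giấy, giá 2,5 tỷ. Liên hệ: 0912345678, lien@example.com"
gold = {
    ("Area", "45m2"),
    ("Price", "2,5 tỷ"),
    ("Telephone", "0912345678"),
    ("Email", "lien@example.com"),
    ("Zone", "Cầu Giấy"),  # would need a gazetteer of place names; no rule here
}

predicted = extract(ad)
score = f_measure(predicted, gold)
print(sorted(predicted))
print(f"F-measure: {score:.3f}")
```

In this toy run the pattern rules recover the numeric and contact fields but miss the `Zone` entity, which lowers recall. This hints at why a rule-based system of this kind typically pairs pattern rules with gazetteer lookups for names of places and property types.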

Keywords: Information technology; Advertising; Real estate; Information extraction

Content

1.1 Problem and Idea
1.2 Scope of the thesis
1.3 Thesis’ structure
2 Related Work
2.1 Approaches
2.1.1 Rule-based approach
2.1.2 Machine-learning approach
2.1.3 Hybrid approach
2.2 GATE framework
2.2.1 Introduction
2.2.2 General Architecture of GATE
2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
2.2.4 Working with GATE
2.2.5 Gazetteers
2.2.6 JAPE
3 Our Vietnamese Real-Estate Information Extraction system
3.1 Template Definition
3.2 Corpus Development
3.2.1 Criterion of data collection
3.2.2 Data collection
3.2.3 Data normalization
3.2.4 Corpus Annotation
3.3 System Development
3.3.1 Tokenizer
3.3.2 Gazetteer
3.3.3 JAPE Transducer
3.3.3.1 Remove incorrect Lookup annotations
3.3.3.2 Recognizing <TypeEstate> entities
3.3.3.3 Recognizing <CategoryEstate> entities
3.3.3.4 Recognizing <Zone> entities
3.3.3.5 Recognizing <Area>, <Price> and <Telephone> entities
3.3.3.6 Recognizing <Fullname> entities
3.3.3.7 Recognizing <Address> entities
3.3.3.8 Recognizing <Email> entities
3.4 Summary
4 Experiments and Error Analysis
4.1 Evaluation metrics
4.2 Experimental result
4.3 Errors Analysis
5.1 Conclusion
5.2 Future Works


Posted: 18/12/2017, 03:04
