Table of Contents
ORIGINALITY STATEMENT
1 Introduction
1.1 Problem and Idea
1.2 Scope of the thesis
1.3 Thesis’ structure
2 Related Work
2.1 Approaches
2.1.1 Rule-based approach
2.1.2 Machine-learning approach
2.1.3 Hybrid approach
2.2 GATE framework
2.2.1 Introduction
2.2.2 General Architecture of GATE
2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
2.2.4 Working with GATE
2.2.5 Gazetteers
2.2.6 JAPE
3 Our Vietnamese Real-Estate Information Extraction system
3.1 Template Definition
3.2 Corpus Development
3.2.1 Criterion of data collection
3.2.2 Data collection
3.2.3 Data normalization
3.2.4 Corpus Annotation
3.3 System Development
3.3.1 Tokenizer
3.3.2 Gazetteer
3.3.3 JAPE Transducer
3.3.3.1 Remove incorrect Lookup annotations
3.3.3.2 Recognizing <TypeEstate> entities
3.3.3.3 Recognizing <CategoryEstate> entities
3.3.3.4 Recognizing <Zone> entities
3.3.3.5 Recognizing <Area>, <Price> and <Telephone> entities
3.3.3.6 Recognizing <Fullname> entities
3.3.3.7 Recognizing <Address> entities
3.3.3.8 Recognizing <Email> entities
3.4 Summary
4 Experiments and Error Analysis
4.1 Evaluation metrics
4.2 Experimental result
4.3 Errors Analysis
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
List of Figures
1.1 The result for query "cần mua nhà ở Hà Nội" on Google Search
1.2 The expected result of our system
2.1 A screenshot of a GUI in GATE framework
2.2 The general architecture of GATE
3.1 Template of our system
3.2 An example of an original news article before normalization
3.3 An example of a normalized news article
3.4 The process of creating an annotated corpus and system development
3.5 The main code is defined to create a new Callisto task
3.6 A news article annotated by Callisto
3.7 Architecture of our Vietnamese Real-Estate Information Extraction system
3.8 Typical Vietnamese Real-Estate Information Extraction system components
4.1 The performance of our system in three versions
4.2 Using lenient criteria to evaluate the annotation in three versions
4.3 Using strict criteria to evaluate the annotation in three versions
5.1 The screenshot of Real-Estate Information Extraction system
A.1 A code recognize TypeEstate entity
A.2 A code recognize Telephone entity
A.3 A code recognize Email entity
A.4 A code recognize Zone entity
List of Tables
4.1 Performance on the Training3 data using lenient criteria
4.2 Performance on the Training3 data using strict criteria
4.3 Performance on the Test3 data using lenient criteria
4.4 Performance on the Test3 data using strict criteria
Chapter 1
Introduction
As data and information sources grow rapidly every day, dealing with this data has become a big and challenging problem. Popular techniques such as machine learning cannot be easily applied to many language processing tasks in Vietnamese because annotated corpora are lacking. This is indeed the case for processing real-estate advertising information. In this thesis, we propose to build an information extraction system for real-estate advertisements in Vietnamese.
1.1 Problem and Idea
With the advent and development of the Internet, a great amount of data has been posted online. These data include not only text but also images, audio, video, and so on. They appear in most areas of life, from economics, politics, society, and medicine to today's emerging areas such as securities, finance, and real estate. The volume of data keeps growing every day, especially in the age of cloud computing, where almost all user data is stored on web platforms. This huge data source contains a lot of information: as data increase rapidly, the amount of information grows even faster.
With more information, users become more confused because the useful information they need drifts away in the data stream. To help people deal with this situation, many search engines have been created, such as Google¹, Bing², Yahoo³, etc. They have quickly become an indispensable tool that assists people in finding useful information in the huge data sources on the Internet. However, they still have not met users' expectations, especially when the user's query is a question. Take the following example:
We use the phrase "cần mua nhà ở Hà Nội" (buy a house in Hanoi) as a query for Google's search engine (Figure 1.1). The results we obtain are a list of links. These links refer to websites containing one of the words of the query. From Figure 1.1, we can easily see that these results are not what users expect. Users have to spend a lot of time finding an answer to their query in this list of links. Therefore, our desire is that users should get a list of specific answers to the query.
Figure 1.1: The result for query "cần mua nhà ở Hà Nội" on Google Search
In order to solve the above problem, researchers have looked into areas such as information extraction, text summarization, and data mining to deliver more useful and specific information to users.
Information extraction is one of the important tasks in natural language processing. The main idea of an information extraction system is to extract snippets of information from unstructured or semi-structured documents to fill in a structured form called a template. In other words, the system extracts the requisite information from the content of the input documents to fill in the defined output template. The data obtained after the extraction process can be presented directly to users or used as input for third-party applications such as analysis and prediction, information retrieval, data mining, or search engines.
¹ https://www.google.com/
² https://www.bing.com/
³ https://www.yahoo.com/
Around thirty years ago, information extraction started its rapid development. Many studies in many different domains have been published. Most first-generation information extraction systems were research and experimental systems for documents in English. In recent years, studies of this technology have gradually appeared in other languages such as French, Japanese, and Chinese. However, it is still a new problem in Vietnamese, especially in the domain of real-estate advertisements.
Our thesis addresses the problem of information extraction for Vietnamese online real-estate advertisements. We propose a rule-based approach for building this system. At the same time, we also build an annotated corpus for the same task. Several other approaches, such as machine learning and hybrid methods, have been used to tackle this problem. However, no annotated Vietnamese corpus has been made publicly available to the community, especially in the real-estate field. Our rule-based approach is therefore reasonable and appropriate at this moment, and it reduces the labour cost compared to other approaches. Figure 1.2 shows an input sample and its expected output for our system.
Figure 1.2: The expected result of our system
1.2 Scope of the thesis
With the development of the Internet, online advertising has become practical and increasingly popular. It is an effective advertising solution for advertising individuals, agencies, and viewers alike. Thus, the data source formed by these advertisements is extremely large and diverse. Our thesis focuses on processing free online Vietnamese text advertisements in the real-estate domain.
1.3 Thesis’ structure
Our thesis is organized into five chapters as follows:
• Chapter 1: We introduce the problem and the idea of building a system to extract information from online real-estate advertisements in Vietnamese.
• Chapter 2: We present an overview of related research on information extraction methods in general and in the real-estate domain in particular.
• Chapter 3: We describe in detail how we build our Vietnamese Real-Estate Information Extraction system.
• Chapter 4: We present the results of our experiments and an analysis of some failures.
• Chapter 5: We conclude with a discussion of future development of the system.
Chapter 2
Related Work
At the end of 1998¹, the Message Understanding Conferences (MUC) programme had arrived at a definition of information extraction that is split into five subtasks:
• Named Entity recognition (NE): Finds and classifies names, places, etc.
• Coreference resolution (CO): Identifies identity relations between entities.
• Template Element construction (TE): Adds descriptive information to NE results (using CO).
• Template Relation construction (TR): Finds relations between TE entities.
• Scenario Template production (ST): Fits TE and TR results into specified event scenarios.
In this chapter, we present an overview of several approaches to building an information extraction system. Specifically, in section 2.1 we describe three common approaches in use today, namely the rule-based, machine-learning, and hybrid approaches. In section 2.2, we briefly describe the GATE framework [9], which we use to build our system.
¹ The final conference was MUC-7 (1998).
2.1 Approaches
There are various approaches that have been used to build an information recognition system. The following are three popular categories of approaches that have often been used for this problem.
2.1.1 Rule-based approach
In a rule-based system, the set of rules is easy for a human to interpret, develop, and augment. These systems are often based on features such as syntactic information (e.g. part of speech) [2], contextual information [15], morphological information (e.g. uppercase, lowercase, numbers, special symbols, spaces, punctuation, and so on), gazetteers [16, 17], or annotations attached by earlier processing steps.
Up to now, many studies using this method [4, 15, 18] have obtained high performance, including on tasks for Vietnamese [1, 10].
One key advantage of a rule-based system compared to other approaches is that it does not need a large annotated corpus.
Take, for example, a system for person entity recognition. A rule is of the form:
Mr + <Person> (ông + <tên người>)
If we add this rule to the system's rule set, the system can recognize a large number of person entities. In other words, relying solely on the rule, we immediately obtain a result without having to build an annotated corpus.
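The effect of such a rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule engine used in this thesis: the honorific list `HONORIFICS` and the function `find_persons` are hypothetical, and a person name is assumed to be the run of capitalized tokens that directly follows an honorific.

```python
# Hypothetical trigger words; the text only gives "Mr" / "ông" as the pattern.
HONORIFICS = {"Mr", "Mrs", "ông", "bà", "anh", "chị"}
PUNCT = ",.;:"


def find_persons(text):
    """Return names matched by the rule: honorific + capitalized tokens."""
    tokens = text.split()
    persons = []
    i = 0
    while i < len(tokens):
        if tokens[i].rstrip(".") in HONORIFICS:
            j = i + 1
            name = []
            # Collect the capitalized tokens that follow the honorific.
            while j < len(tokens) and tokens[j][0].isupper():
                name.append(tokens[j].rstrip(PUNCT))
                j += 1
                # Trailing punctuation closes the name.
                if tokens[j - 1][-1] in PUNCT:
                    break
            if name:
                persons.append(" ".join(name))
            i = j
        else:
            i += 1
    return persons


print(find_persons("Liên hệ ông Nguyễn Văn An, Hà Nội"))
```

A single hand-written pattern of this kind already covers every name introduced by one of the listed honorifics, which is the point made above; it also shows the cost, since names written in lower case are missed.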
Although rule-based systems are easier to develop, they require domain experts to predefine the extraction patterns/rules. Moreover, manually developing patterns/rules is difficult and tedious. This approach is therefore not as effective as the machine-learning approach in open-ended domains such as fact extraction or opinion extraction [19].
2.1.2 Machine-learning approach
Machine learning is a method that requires a large amount of manually labeled unstructured data (an annotated corpus) to train models. To our knowledge, many models have been applied; the most typical are the Hidden Markov Model (HMM) [20], Maximum Entropy [11], Support Vector Machine (SVM) [3, 12], and Conditional Random Field (CRF) [21].
There are many ways to build a machine-learning system. However, there are three typical categories, namely supervised learning [22], unsupervised learning [23], and semi-supervised learning [24].
Unsupervised and semi-supervised learning have seen little research for entity recognition/extraction tasks. Supervised learning systems that have been widely used for this domain include Bikel's [20], Borthwick's [11], and Wu's [3] systems.
There are also some studies using machine learning for tasks in Vietnamese, such as named entity recognition for Vietnamese documents [25], in which the authors used an SVM model and obtained an overall F-measure of about 87.75%, and Sam [26], who used Conditional Random Fields to extract relations in Vietnamese text.
There is work that extracts information from real-estate advertisements in English [5, 27], but it takes a wrapper-induction approach on HTML documents. This differs greatly from our work, as we focus on free text that has no HTML tags as clues for recognizing entities.
2.1.3 Hybrid approach
Besides the rule-based and machine-learning methods, a few researchers have attempted to combine the advantages of both to achieve higher performance [28, 29]. This is called a hybrid approach.
A number of entity recognition studies for Chinese use this approach and have given very good results, such as Srihari's [14] and Fang's [13] systems. This suggests that entity recognition for Vietnamese may also obtain positive results, as Vietnamese and Chinese have many similarities. However, to the best of our knowledge, no published work has so far used the hybrid approach for Vietnamese.
2 http://reverb.cs.washington.edu/
3 http://code.google.com/p/graph-expression/
4 http://gate.ac.uk/ie/
5 http://uima.apache.org
2.2 GATE framework
2.2.1 Introduction
GATE has been used for many information extraction projects in many languages and problem domains. A typical example of an information extraction system is ANNIE (A Nearly-New Information Extraction System), which is packaged as a plugin in GATE. We provide details in section 2.2.3.
GATE is a suite of Java tools and is free, open-source software under the GNU library license. Users can obtain free support from the user and developer community via the official website⁶.
Figure 2.1: A screenshot of a GUI in GATE framework
⁶ http://gate.ac.uk/download/
2.2.2 General Architecture of GATE
One advantage of GATE is that it divides a system into small, reusable components. This has helped GATE become more complete through contributions from the research community worldwide. Figure 2.2 describes the general architecture of GATE, which contains:
• IDE GUI Layer: This layer interacts with the user through the visual interface of the framework.
Figure 2.2: The general architecture of GATE
• Application Layer: This layer manages plugins, which are either distributed with GATE or developed by a programmer.
• Document Format Layer: This layer manages document formats. GATE supports a variety of formats including XML, RTF, HTML, SGML, email and plain text. In all cases, when a document is created or opened in GATE, the format is analyzed and converted into a single unified annotation model.
• Corpus Layer: This layer manages corpora. Corpora consist of many different components such as Corpus, Document, Document Content, Annotation Set, Annotation, and Feature Map.
• Language Resources Layer: This layer manages several special objects such as Ontology, Ontology Protégé, WordNet, Gazetteers, etc.
• Processing Resource Layer: This layer manages the most important objects such as Part-of-Speech taggers, Named entity recognizers, etc.
• DataStore Layer: This layer manages the storage of data.
2.2.3 An example: ANNIE - A Nearly-New Information Extraction System
This section presents a typical GATE plugin, ANNIE (which stands for "A Nearly-New Information Extraction system"). It is distributed with GATE and is known as an important component for information extraction projects in multiple languages. ANNIE relies on finite-state algorithms and the JAPE language. It is packaged from many components. The following are a few typical ones:
Document Reset PR: It resets the document to its original state.
English Tokeniser: It segments English text, splitting the document into tokens such as words, punctuation, numbers, etc.
Gazetteer: It is a set of dictionaries used to identify entity names or clues for the recognition process.
Sentence Splitter: It relies on punctuation to split a document into sentences.
POS Tagger: It performs POS tagging for the sentences in the document.
2.2.4 Working with GATE
GATE uses three different types of resources:
• Language Resources: Language Resources consist of the documents that need to be processed. They can be a single document or a corpus, and they come in different formats. Language resources also include lists (gazetteers), vocabularies (lexicons), and resources with complex structures such as ontologies.
• Processing Resources: Processing Resources represent entities that are primarily algorithmic, such as parsers, generators or n-gram modelers.
• Visual Resources: Visual Resources represent visualization and editing components that participate in GUIs.
We present details of two typical components, Gazetteers (in section 2.2.5) and JAPE (in section 2.2.6), corresponding to the Language Resource and Processing Resource respectively.
2.2.5 Gazetteers
The gazetteer is a set of lists, known as dictionaries. Each dictionary is a plain text file that contains a set of names (such as names of persons, organizations, companies, etc.) or specific phrases. They are often used as clues to help developers perform typical tasks such as named entity recognition. For instance:
• Gazetteers containing named entities, such as:
+ Person gazetteer: John, Tom, Peter, etc.
+ Organization gazetteer: ASEAN, G8, UEFA, etc.
+ Nationality gazetteer: England, France, Vietnam, etc.
• Gazetteers containing phrases that serve as clues for recognition, such as:
+ Phrases that identify a person name: Mr, Mrs, Sir, etc.
+ Phrases that identify units of currency: dollar, $, USD, ECU, pound, etc.
All dictionaries are declared in a main index file (usually called lists.def). The general structure of a declaration is as follows:
DictionaryName:Major:Minor:Language
In the above syntax, DictionaryName is the name of the dictionary (the phrase file); Major is the MajorType feature; Minor is the MinorType feature; Language is the language of the dictionary. For example: personName.lst:person:personName:vi. Here, 'personName.lst' is the dictionary name; 'person' is the MajorType; 'personName' is the MinorType; 'vi' is the language.
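As a minimal sketch of how one such declaration line could be read, the Python fragment below splits it into its four fields. The names `GazetteerEntry` and `parse_index_line` are hypothetical helpers written for this illustration and are not part of GATE; treating the last two fields as optional is an assumption of the sketch.

```python
from typing import NamedTuple, Optional


class GazetteerEntry(NamedTuple):
    list_file: str               # DictionaryName, e.g. "personName.lst"
    major_type: str              # Major
    minor_type: Optional[str]    # Minor (assumed optional here)
    language: Optional[str]      # Language (assumed optional here)


def parse_index_line(line):
    """Split 'DictionaryName:Major:Minor:Language' into named fields."""
    parts = [p.strip() for p in line.strip().split(":")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("expected at least 'DictionaryName:Major'")
    # Pad missing trailing fields, then keep exactly four.
    parts = (parts + [None, None])[:4]
    return GazetteerEntry(parts[0], parts[1], parts[2] or None, parts[3] or None)


entry = parse_index_line("personName.lst:person:personName:vi")
print(entry.list_file, entry.major_type, entry.minor_type, entry.language)
```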
When a gazetteer is compiled, GATE creates a Lookup annotation whenever a string in the document matches a phrase in the gazetteer. For instance, 'John' is an entry in the person dictionary, and it also appears in the document. GATE therefore creates a Lookup annotation with two features: MajorType = 'person' and MinorType = 'personName'.
We can create the dictionaries with GATE's Gaze tool or with any common editor such as Notepad or Notepad++.
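The Lookup mechanism described in this section can be imitated with a short Python sketch. It is not GATE's implementation (which is not shown in this thesis); `build_index` and `lookup` are hypothetical names, the feature keys follow the spelling used in the text, and matching is done with plain regular expressions on whole words.

```python
import re


def build_index(lists):
    """Map each phrase to its features; `lists` is {(major, minor): [phrases]}."""
    index = {}
    for (major, minor), phrases in lists.items():
        for phrase in phrases:
            index[phrase] = {"MajorType": major, "MinorType": minor}
    return index


def lookup(text, index):
    """Return one Lookup annotation per whole-word occurrence of a phrase."""
    annotations = []
    for phrase, features in index.items():
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            annotations.append({
                "type": "Lookup",
                "start": m.start(),
                "end": m.end(),
                "features": features,
            })
    return sorted(annotations, key=lambda a: a["start"])


index = build_index({("person", "personName"): ["John", "Tom", "Peter"]})
for ann in lookup("John met Tom and Johnny", index):
    print(ann["start"], ann["end"], ann["features"])
```

Each annotation records only character offsets and features, so later rules can test the features without looking at the text again.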
2.2.6 JAPE
JAPE is the Java Annotation Patterns Engine, a component of the open-source GATE platform. JAPE is a finite-state transducer that operates over annotations based on regular expressions. Thus, it is useful for pattern matching, semantic extraction, and many other operations over syntactic trees such as those produced by natural language parsers.
The JAPE Transducer is an important component of GATE. It is responsible for executing JAPE grammars, which are written as files with the '.jape' extension.
A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite-state transducers over annotations. Each rule has the following format:
LHS (left-hand-side) --> RHS (right-hand-side)
The left-hand-side (LHS) is the part preceding the '-->' and the right-hand-side (RHS) is the part following it. The LHS of a rule consists of an annotation pattern that may contain regular expression operators (e.g. *, ?, +). The RHS consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements. Below is a JAPE grammar example:
Chapter 3
Our Vietnamese Real-Estate Information Extraction system
Today, natural language processing in general and information extraction in particular are developing rapidly in many languages, but work on Vietnamese is still at an early stage. Our thesis tackles the information extraction task for online real-estate advertisements in Vietnamese.
We build a Vietnamese Real-Estate Information Extraction system as plugins of the GATE framework, with the goal of releasing it to the public for further development.
3.1 Template Definition
by human experts possessing domain knowledge or by system developers.
Defining templates is a difficult task involving the selection of the information elements required and the definition of their relationships [33]. This task has been identified as one of the challenges in building an information extraction system.
There are three categories of template structure, namely text annotation, flat data templates, and object-oriented templates [32]. Based on the specific requirements and tasks of the IE system, the developer selects a suitable template. For our system, we choose the text annotation template.
After inspecting the collected data, we decided on the template for our system shown in Figure 3.1. This template captures most of the information that posters describe as well as what viewers look for in a real-estate advertisement. The information elements of the template are often called entities.
An entity is usually a noun, a noun phrase, or a sequence of several tokens in the text. The most typical form of an entity is a named entity such as a person, location, or organization name.
In our template (Figure 3.1), we use the following entities: TypeEstate, CategoryEstate, Area, Price, Zone, Fullname, Telephone, etc.
+ Loại tin (TypeEstate)
+ Loại nhà (CategoryEstate)
+ Diện tích (Area)
+ Giá tiền (Price)
+ Khu vực (Zone)
+ Liên hệ (Contact)
• Tên liên hệ (Fullname)
• Điện thoại (Telephone)
• Thư điện tử (Email)
• Địa chỉ (Address)
Figure 3.1: Template of our system
Where:
• The <Loại tin> (<TypeEstate>) entity provides information about the type of advertisement. Inspecting the real-estate ads, we found only four types: "mua" (buy), "bán" (sell), "cho thuê" (lease), "chuyển nhượng" (transfer).
• The <Loại nhà> (<CategoryEstate>) entity provides information about the object being advertised. Common categories include "nhà" (house), "đất" (land), "căn hộ" (apartment), "biệt thự" (villa).
• The <Diện tích> (<Area>) entity provides information about the area of the advertised object. For example: "50 m2", "từ 100 đến 130 m2" (from 100 to 130 m2).
• The <Liên hệ> (<Contact>) entity provides information such as "Tên liên hệ" (Fullname), "Điện thoại" (Telephone), "Thư điện tử" (Email) and "Địa chỉ" (Address) of the person to contact. For example:
Liên hệ (contact):
+ Tên liên hệ (Fullname): anh Thanh (Mr Thanh)
+ Điện thoại (Telephone): 0989858199
+ Thư điện tử (Email): hanhtdb21@gmail.com
+ Địa chỉ (Address): 230 Lê Đức Thọ - Từ Liêm - Hà Nội (230 Le Duc Tho - Tu Liem - Ha Noi)
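To make the shape of the template concrete, it can be written down as a small data structure. The sketch below is an assumption made for illustration: the class names `Contact` and `RealEstateAd` and the snake_case field names are ours and do not come from the system, and the example values are read off the sample advertisement used in this chapter rather than produced by the extractor.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Contact:
    fullname: Optional[str] = None   # Tên liên hệ
    telephone: Optional[str] = None  # Điện thoại
    email: Optional[str] = None      # Thư điện tử
    address: Optional[str] = None    # Địa chỉ


@dataclass
class RealEstateAd:
    type_estate: Optional[str] = None      # Loại tin
    category_estate: Optional[str] = None  # Loại nhà
    area: Optional[str] = None             # Diện tích
    price: Optional[str] = None            # Giá tiền
    zone: Optional[str] = None             # Khu vực
    contact: Contact = field(default_factory=Contact)  # Liên hệ


# One advertisement yields one filled template; unknown slots stay None.
ad = RealEstateAd(
    type_estate="bán",
    category_estate="chung cư cao cấp",
    area="133 m2",
    price="từ 50 đến 60 triệu/m2",
    zone="Mỹ đình - từ liêm - Hà Nội",
    contact=Contact(fullname="anh Thanh", telephone="0989858199",
                    email="hanhtdb21@gmail.com"),
)
print(asdict(ad))
```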
3.2 Corpus Development
3.2.1 Criterion of data collection
The news articles selected for our system must satisfy the following criteria:
• An input data file consists of only one real-estate advertising news article. If an input data file contains more than one advertising news article, we must divide it into several files. In other words, each input data file has only one output template.
• The news article is free text. As the focus of our work is on free-text processing, we strip all HTML tags and retain only the free text of the collected advertisements.
3.2.2 Data collection
In order to develop and test our system, we built a corpus by collecting data from reputable websites that provide free online real-estate advertisements, such as http://vnexpress.net/rao-vat/13/the-house-dat/, http://raovat.thanhnien.com.vn/pages/default.aspx, etc. These websites attract large numbers of posters as well as viewers. Posters can write their advertisements in their own form, and the websites do not modify them. Therefore, ads by the same poster tend to share a similar format, while ads by different posters differ in style and language. This leads to plenty of ambiguity and to ill-formed ads with grammatical errors, which cause trouble for automatic processing tools such as word segmenters or part-of-speech taggers. Some examples of ambiguity and style diversity are: sentences ending without punctuation; person, organization, and place names written in lower case without an initial capital; frequent use of acronyms; etc. Consider the example of an original news article below:
Tôi cần bán CCCC Tòa CT14, Mỹ đình - từ liêm - Hà Nội
DT:133m2, căn nguyên bản, chưa sửa chữa
Gia chủ nào muốn mua xin liên hệ: anh thanh: 098.985.8199,
giá 50-60tr/m2 (MIỄN TRUNG GIAN)
Email: hanhtdb21@gmail.com
Figure 3.2: An example of an original news article before normalization
In the above example we can see some of the ambiguities mentioned earlier. For example:
• The end of sentence without punctuation
• Using the acronyms: CCCC, DT, tr/m2
• The person name (anh thanh) and place name (Mỹ đình - từ liêm - Hà Nội) are in lower case without the first letter capitalized.
• Sentences are written with grammatical errors.
In order to limit the influence of ambiguous data on system performance, we carry out a pre-processing step (normalization) on the collected data before our system processes them.
3.2.3 Data normalization
Normalization is the process of minimizing ambiguity in the data. It must guarantee consistency of the data before and after normalization. In other words, the news articles obtained after this processing step must keep their main content intact. This task is more complex than we originally imagined, because the normalization process runs automatically without supervision. Moreover, many news articles in our collected data have diverse styles. Consequently, this process can introduce unintended errors, e.g. important information being removed.
Given these difficulties, we decided to normalize only where it is reliable, to avoid losing important information. We only aim to reduce a number of common sources of ambiguity in the news articles. Our normalization process consists of the following steps:
• Firstly, we add a punctuation mark at the end of each sentence.
• Secondly, we merge multiple paragraphs into a single paragraph, because most of these news articles are not too long.
• Thirdly, we normalize the punctuation, remove redundant spaces, and capitalize the character after each full stop.
• Fourthly, we normalize Telephone, Price, Area, Fullname, etc. using a common pattern.
• Finally, we replace some abbreviated phrases with their corresponding full forms.
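The first, second, third and last steps above can be sketched as a small Python function. This is a simplified illustration, not the Text-Preprocessing engine itself: `normalize_article` and `ABBREVIATIONS` are hypothetical names, each input line is assumed to be one sentence, and the abbreviation table holds only the two expansions visible in the sample advertisement.

```python
import re

# Assumed abbreviation table, limited to expansions seen in the sample ad.
ABBREVIATIONS = {"CCCC": "chung cư cao cấp", "DT": "Diện tích"}


def normalize_article(text):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Step 1: close every line that has no final punctuation with a full stop.
    lines = [ln if ln[-1] in ".!?,;:" else ln + "." for ln in lines]
    # Step 2: merge all paragraphs into a single paragraph.
    merged = " ".join(lines)
    # Step 3: collapse spaces, drop spaces before punctuation,
    # and capitalize the character that follows a full stop.
    merged = re.sub(r"\s+", " ", merged)
    merged = re.sub(r"\s+([,.;:!?])", r"\1", merged)
    merged = re.sub(r"(\.\s+)(\w)",
                    lambda m: m.group(1) + m.group(2).upper(), merged)
    # Step 5: replace abbreviated phrases by their full forms.
    for short, full in ABBREVIATIONS.items():
        merged = re.sub(r"\b%s\b" % re.escape(short), full, merged)
    return merged


print(normalize_article("Tôi cần bán CCCC\nDT: 133 m2 ,  căn nguyên bản"))
```

The fourth step (Telephone, Price, Area, Fullname) is left out here because it needs entity-specific patterns.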
Of the above steps, the fourth is the most difficult. This step makes an important contribution to improving the recognition rate of our system. We perform it as follows:
We normalize the phone number from the many different styles used by posters. Our goal is to convert these styles into a common format, i.e. a plain number. These styles prevent the part-of-speech tagger from working properly. For example, a poster wrote '098.141.4729' as a phone number, but the part-of-speech tagger does not understand it and labels it 'X'. We rely on the features of phone numbers in Vietnam to build regular expressions capturing all the cases in which a digit string is a phone number. Specifically, a telephone number is usually a sequence of digits. There are two kinds, namely mobile numbers and home-phone numbers. A mobile number usually has 10 or 11 digits, whereas a home-phone number has 7 to 8 digits. A home-phone number may include an area code and/or country code in front of it. A mobile number begins with the string "09" or "01", whereas a home-phone number begins with a string matching "0[2-8]" if it has an area code and/or country code. Posters write telephone numbers in styles such as the following:
Before normalization After normalization
Before normalization After normalization
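The phone-number rules stated above can be turned into a minimal Python sketch. The regular expressions below are our own reading of those rules, not the expressions used in the system: the names `CANDIDATE`, `MOBILE`, `LANDLINE` and `normalize_phones` are hypothetical, the area code is assumed to be two or three digits long, and country codes are ignored.

```python
import re

# A run of 7 to 12 digits, optionally separated by spaces, dots or hyphens.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ .\-]?\d){6,11}(?!\d)")
# Mobile: starts with 09 or 01 and has 10 or 11 digits in total.
MOBILE = re.compile(r"0[19]\d{8,9}")
# Home phone: 7 to 8 digits, optionally preceded by an area code 0[2-8]...
LANDLINE = re.compile(r"(?:0[2-8]\d{1,2})?\d{7,8}")


def normalize_phones(text):
    """Rewrite every digit run that looks like a phone number as plain digits."""
    def to_digits(match):
        digits = re.sub(r"\D", "", match.group(0))
        if MOBILE.fullmatch(digits) or LANDLINE.fullmatch(digits):
            return digits
        return match.group(0)  # not a phone number: leave untouched
    return CANDIDATE.sub(to_digits, text)


print(normalize_phones("anh thanh: 098.985.8199, giá 50-60tr/m2"))
```

One limitation of this sketch is that any bare run of 7 or 8 digits is accepted as a home-phone number, so a price written as "2.500.000" would also be collapsed; deciding such cases would need the surrounding context.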
We normalize the prices as follows:
• adding a space between the number and the monetary unit if they are written next to each other
• replacing acronyms of monetary units with the standard form, for example: tr → triệu (million), ngh → nghìn (thousand), etc.
• converting monetary units from uppercase to lowercase, for example: TỶ (billion) → tỷ (billion), Triệu (million) → triệu (million), etc.
• reformatting price ranges such as 20-30 triệu → từ 20 đến 30 triệu (from 20 to 30 million)
• removing redundant words between the keyword "Giá" (price) and a price number, for example:
Before normalization              After normalization
Giá tiền dự kiến 2,5 tỷ           Giá: 2,5 tỷ
Giá trọn gói 2,5 tỷ               Giá: 2,5 tỷ
Giá cho thuê 2,5 triệu/tháng      Giá: 2,5 triệu/tháng
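The five price rules can be chained as regular-expression substitutions. The Python sketch below is one possible reading of them, written for illustration only: `normalize_price`, `UNIT_ALIASES` and `UNITS` are hypothetical names, only the units named in the text are covered, at most four filler words after "Giá" are dropped (an arbitrary limit chosen here), and the input is assumed to use precomposed (NFC) Vietnamese characters.

```python
import re

NUM = r"\d+(?:[.,]\d+)?"
UNIT_ALIASES = {"tr": "triệu", "ngh": "nghìn"}  # acronyms named in the text
UNITS = "triệu|nghìn|tỷ"


def normalize_price(text):
    # Rules 1+2: expand unit acronyms and separate them from the number.
    text = re.sub(r"(%s)\s*(tr|ngh)\b" % NUM,
                  lambda m: m.group(1) + " " + UNIT_ALIASES[m.group(2).lower()],
                  text, flags=re.IGNORECASE)
    # Rules 1+3: lower-case full unit names and separate them from the number.
    text = re.sub(r"(%s)\s*(%s)\b" % (NUM, UNITS),
                  lambda m: m.group(1) + " " + m.group(2).lower(),
                  text, flags=re.IGNORECASE)
    # Rule 5: drop up to four filler words between "Giá" and the number.
    text = re.sub(r"\b(Giá|giá)\b(?:[ :]*[^\W\d_]+){0,4}[ :]*(?=\d)",
                  r"\1: ", text)
    # Rule 4: rewrite a range "A-B unit" as "từ A đến B unit".
    text = re.sub(r"(?<![\d.,])(%s)\s*-\s*(%s) (%s)\b" % (NUM, NUM, UNITS),
                  r"từ \1 đến \2 \3", text)
    return text


print(normalize_price("giá 50-60tr/m2"))
```

The order matters in this sketch: the filler-word rule runs before the range rule, because otherwise the inserted word "từ" would itself be removed as filler.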
Similarly to prices, we normalize areas as follows:
• adding a space between the number and the unit of area when they are adjacent
• reformatting area ranges such as 100-120 m2 → từ 100 đến 120 m2 (from 100 to 120 m2)
• removing redundant words between the keyword "Diện tích" (Area) and an area number, for example:
Before normalization After normalization
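A compact Python sketch of the three area rules is given below. It is an illustration under assumptions: `normalize_area` is a hypothetical name, only the unit "m2" (or "m²") is handled, and the filler words "sử dụng" in the usage line are an invented example rather than data from the corpus.

```python
import re

NUM = r"\d+(?:[.,]\d+)?"
UNIT = r"m2|m²"  # assumed set of area units


def normalize_area(text):
    # Separate the number from an adjacent unit: "133m2" -> "133 m2".
    text = re.sub(r"(%s)\s*(%s)(?!\w)" % (NUM, UNIT), r"\1 \2", text)
    # Drop up to four filler words between "Diện tích" and the number.
    text = re.sub(r"\b(Diện tích|diện tích)\b(?:[ :]*[^\W\d_]+){0,4}[ :]*(?=\d)",
                  r"\1: ", text)
    # Rewrite a range "A-B m2" as "từ A đến B m2".
    text = re.sub(r"(?<![\d.,])(%s)\s*-\s*(%s) (%s)(?!\w)" % (NUM, NUM, UNIT),
                  r"từ \1 đến \2 \3", text)
    return text


print(normalize_area("Diện tích sử dụng 100-120m2"))
```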
Figure 3.3 shows the normalized result for the news article of Figure 3.2.
Tôi cần bán chung cư cao cấp tòa CT14, Mỹ đình - từ
liêm - Hà Nội Diện tích: 133 m2, căn nguyên bản, chưa
sửa chữa Gia chủ nào muốn mua xin liên hệ: anh Thanh
- 0989858199, giá: từ 50 đến 60 triệu/m2 (MIỄN TRUNG
GIAN) Email: hanhtdb21@gmail.com
Figure 3.3: An example of a normalized news article
3.2.4 Corpus Annotation
Our system's approach is rule-based. Although such systems do not require a large corpus as machine-learning or hybrid methods do, an annotated corpus is still indispensable: it is very important for the later performance evaluation of the system. System development and corpus building are carried out simultaneously.
Figure 3.4: The process of creating an annotated corpus and system development
After the documents are retrieved from the Internet and automatically normalized by the Text-Preprocessing engine (Figure 3.7), we annotate them using the template defined in the previous section (Figure 3.1). We divide the annotation process into two stages. In the first stage, when the system has not yet been developed, we manually annotate some initial documents. This process is done completely by humans. It gives us the opportunity to observe entities and their context in order to create the rules. In the second stage, once the system has been built, we use it to annotate the documents automatically. We then inspect and correct the entities that were incorrectly recognized. At the same time, we use these errors to analyze and modify the system's rules. This process is repeated, and as a result we obtain an annotated corpus and a rule-based system. The process is illustrated in Figure 3.4.
In order to support the manual annotation process, we use Callisto¹ to annotate our corpus. Callisto is an annotation tool developed for linguistic annotation of textual data. It is a free tool with Unicode support, including Vietnamese text. It stores annotations in a stand-off format using the ATLAS² (Architecture and Tools for Linguistic Analysis Systems) data model, and it supports importing/exporting inline annotations in formats such as XML³ (eXtensible Markup Language) or SGML⁴ (Standard Generalized Markup Language).
Our task is to build a Callisto plugin that defines the annotations of our template (Figure 3.1).
<!ELEMENT Loaitin (#PCDATA)>
<!ELEMENT Loainha (#PCDATA)>
<!ELEMENT Khuvuc (#PCDATA)>
<!ELEMENT Dientich (#PCDATA)>
<!ELEMENT Giatien (#PCDATA)>
<!ELEMENT Lienhe (#PCDATA)>
<!ATTLIST Lienhe Hoten (true|false) "false">
<!ATTLIST Lienhe Diachi (true|false) "false">
<!ATTLIST Lienhe Dienthoai (true|false) "false">
<!ATTLIST Lienhe Email (true|false) "false">
Figure 3.5: The main code is defined to create a new Callisto task
Figure 3.6 shows that we have successfully integrated the plugin mentioned above into Callisto. The figure also illustrates the GUI for annotating a news article in Callisto.
1 http://callisto.mitre.org/
2 http://sourceforge.net/projects/jatlas/
3 http://en.wikipedia.org/wiki/XML
4 http://sourceforge.net/projects/jatlas/
Figure 3.6: A news article annotated by Callisto
3.3 System Development
GATE is a collection of methods and software tools for building and developing natural language processing applications, especially information extraction. GATE has been used for many IE projects in many languages and problem domains, and has competed in the Message Understanding Conference (MUC) and Automatic Content Extraction (ACE) evaluations. We therefore chose GATE to build our information extraction system.
Our system is built as plugins in the GATE framework with the architecture shown in Figure 3.7. The system comprises five components, as follows: