Extraction of Disease Events for a Real-time Monitoring System
Minh-Tien Nguyen
Hung Yen University of Technology and Education (UTEHY)
Knowledge Technology Laboratory (KT-Lab)
tiennm@utehy.edu.vn
Tri-Thanh Nguyen
Vietnam National University, Hanoi (VNUH), University of Engineering and Technology (UET)
Knowledge Technology Laboratory (KT-Lab)
ntthanh@vnu.edu.vn
ABSTRACT
In this paper, we propose a method that uses both semantic rules and machine learning to extract infectious disease events from Vietnamese electronic news, which can be used in a real-time system for monitoring the spread of diseases. Our method contains two important steps: detecting disease events in unstructured data and extracting the information of those events. Event detection uses semantic rules and machine learning to detect a disease event; in the latter step, Named Entity Recognition (NER), rules, and dictionaries are used to capture the event's information. The F-score of the detection step is ≈77.33% and the F-score of the extraction step is ≈91.89%. These results are better than those of the experiments in which rules were not used. This indicates that our method is suitable for extracting disease events from Vietnamese text.
Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining
General Terms
Data Mining; Information Extraction
Keywords
Data Mining; Information Extraction; Event Extraction; Disease Event Extraction; Monitoring Systems
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SoICT’13, December 05-06, 2013, Danang, Viet Nam
Copyright 2013 ACM 978-1-4503-2454-0/13/12 $15.00.
http://dx.doi.org/10.1145/2542050.2542084.
1. INTRODUCTION
Information from electronic newspapers provides valuable input for public health surveillance, early outbreak detection, and disease monitoring systems. When the presence of a disease is announced by the government and published on a webpage, it is typically called a disease event or an infectious disease outbreak. Unfortunately, the electronic resources on infectious diseases are multidimensional, chaotic, and not well organized, so extracting useful patterns from these sources is really challenging. "How to detect an infectious disease event?" and "How to extract the information of an infectious disease event?" are the two questions this paper focuses on.
Disease detection and the monitoring of disease spread and outbreaks are extremely important issues for society, especially when the diseases are dangerous and highly infectious. Because an infectious disease normally breaks out in a short time and spreads very quickly over a large area, it can create emergencies not only for citizens but also for the government and the economy. Therefore, monitoring infectious disease outbreaks is crucial for preventing and handling diseases and for helping the authorities make suitable decisions.
In this paper, we propose a model that automatically detects and extracts information about human infectious disease events from Vietnamese webpages, based on semantic rules and machine learning. The model includes two important components: disease event detection and disease event extraction. In the first component, an infectious disease event is detected in free text; in the second component, the information of the event (time, disease name, and locations) is extracted. Subsequently, we combine the extracted information to form an infectious disease event. These infectious disease events can be the input for the visualization in our monitoring system.
Our paper is organized as follows: related work is in Section 2; our method is discussed in Section 3, in which event detection is described in Section 3.3 and event extraction in Section 3.4. Section 4 presents the experiments and results, and explains the sources of some errors in our research. The last section is the conclusion.
2. RELATED WORK
Event extraction was first introduced as an important topic in 1987 at the Message Understanding Conference (MUC) [11]. In MUC, an event is defined as: "an event must have actor, time, place and impact on the surrounding environment". Later, in the Automatic Content Extraction (ACE) program, Doddington George R., et al. gave an event definition: "an event is an activity that was created by participants" and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel and Justice [7]. Moreover, as Allan J., et al. stated, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic), and tense (Past, Present, Future, Unspecified) [10]. Grishman R., et al. gave the definition of a disease event as a template: Disease Name, Date, Location, Victim Number, Victim Descriptor, Victim Status, Victim Type, Parent Event [9].
Hogenboom F., et al. provided a general guideline on how to select a suitable method for event extraction [2]. The guideline classifies event extraction approaches as data-driven, knowledge-driven, or hybrid. Each approach has both advantages and disadvantages. Hogenboom F., et al. compared the benefits and drawbacks of these methods and concluded that the hybrid approach prevails.
Event extraction from unstructured text can be applied in many fields, especially in the disease domain. Grishman R., et al. used 120 linguistic event patterns to analyze sentences and capture the information of a disease event [9]. These linguistic patterns were built on word classes and the relations among them. For example, the pattern "np (DISEASE) vg (KILL) np (VICTIM)" will match a clause like "Cholera killed 23 inhabitants". An event is recognized based on the trigger of two noun phrases: "outbreak of " and "people died from ". These patterns were applied to extract disease events and achieved an F-score of ≈53.98%. Normally, applying linguistic patterns can achieve good results if the patterns cover the whole dataset, but preparing them is always time-consuming and requires domain experts. Moreover, the patterns must be changed when the data change. Finally, because the patterns were built on word classes, the authors must identify word classes (e.g., noun phrase, verb phrase, etc.), which is more challenging in some other languages (e.g., Vietnamese or Chinese). Because of these drawbacks, we do not follow this approach.
Volkova S., et al. combined entity recognition and sentence classification to extract animal disease events [4]. The event recognition consists of three main steps: first, entities are recognized in unstructured text; second, sentences are classified based on these entities; finally, the entities within an event sentence are combined into a structured tuple. In the event recognition, true events should contain a disease name and a disease-related verb. The authors obtained a precision of 75% in event tuple recognition and 65% in sentence classification, with features from WordNet and the Google-Set corpus. However, using a list of verbs to confirm an event can badly affect event extraction in Vietnamese, because of the lack of resources for Natural Language Processing (NLP) (such as a Vietnamese WordNet or a Google-Set-like corpus for Vietnamese) and because the performance of parsing tools is not high enough. Thus, we do not use this method.
Doan S., et al. built the Global Health Monitor system, which shows the state of disease spread around the world [5]. The system includes three main steps: topic classification, Named Entity Recognition (NER), and disease/location detection. A Naïve Bayes classifier is used in topic classification with a precision of ≈88.10%; the entity recognition step, which uses a Support Vector Machine (SVM), achieved an F-score of ≈76.97%; and the final step achieved a precision of ≈93.40% with the BioCaster Ontology. However, this system has some limitations. The first is location ambiguity: because some locations are not mentioned clearly in the input data (only the province or city is given, without the country name), the system cannot recognize the location exactly. Furthermore, the BioCaster system cannot detect new diseases or locations that are not in the ontology.
Our approach uses the advantages of both the semantic rule-based method and machine learning in two main components: event detection and event extraction. In event detection, the semantic rules play the role of a data filter, and the classification model determines whether a news article contains an event or not. Because our rules are used as a filter, they are simpler than those in the research of Grishman R., et al. [9]. A rule in our study is a short phrase composed of a noun phrase and a verb phrase instead of a complete sentence. Moreover, we do not use a list of verbs to confirm events as Volkova S., et al. did [4], because that method typically depends on the coverage of the verbs, and building the verb list always takes much time. In event extraction, our approach is similar to the method of Doan S., et al. [5]. We use rules, a disease dictionary, NER, and a location dictionary to extract the information of a disease event.
In addition, there are several systems that extract events from online news. Grishman R., et al. built the Proteus-BIO system, where users can follow infectious diseases [8]. Data in this system are collected from webpages and disease reports from the World Health Organization¹ and ProMed². Collier N., et al. made the BioCaster system³, which follows several event types, especially disease events around the world. Similarly, HealthMap⁴ was built by Freifeld Clark C., et al., where users can monitor diseases all over the world [6].
3. DETECTION AND EXTRACTION
3.1 Infectious Disease Event Characteristics
An investigation of our data domain indicates that an infectious disease event may contain a disease name, time, locations, and victims. In some cases, it may have additional information such as the method or environment of infection. Although Grishman R., et al. [9] used the disease name, the time and location of the outbreak, the number of affected victims, and the type of victims as the information of a disease event, we focus on only three basic pieces of information: the time of the outbreak, the locations of the outbreak, and the name of the infectious disease. We ignore the method and environment of infection because we collect data from webpages instead of medical reports, so such information is not clearly mentioned in most cases. Moreover, an event in MUC must include an actor [11]; in our study, the actor is equivalent to a disease, so we use the disease name instead of the actor.
In addition, a closer examination of disease news articles showed that a disease name is sometimes similar to a symptom, which is one source of confusion in event extraction. For example, ‘pneumonia’ is a symptom of ‘bird flu’ (A/H5N1), but it was recognized as a disease in some cases.
¹ http://www.who.int/csr/don/en/
² http://www.promedmail.org/
³ http://born.nii.ac.jp
⁴ http://www.healthmap.org
Figure 1: Steps of disease event extraction
Figure 2: Event detector components
3.2 Problem Definition
The infectious disease event extraction problem can be defined as follows:
Input: a news article.
Output: whether the news article contains an infectious disease event or not. If yes, the information of the event is extracted.
In our research, an infectious disease event E is defined as a tuple that has three elements:
E = <name, time, place> (1)
where name is the name of the infectious disease mentioned in the disease news article; time is the time when the disease breaks out; and place is the set of locations where the disease appears.
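As an illustration, the tuple in Formula (1) could be represented as a small immutable record. This is a minimal sketch, not the authors' implementation: the class name `DiseaseEvent`, the field types, and the string form of a location are assumptions made for this example, and the sample values are borrowed from Example 6 of this paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class DiseaseEvent:
    """Hypothetical container for the tuple E = <name, time, place>."""
    name: str              # disease name mentioned in the article
    time: date             # time when the disease breaks out
    place: FrozenSet[str]  # set of locations where the disease appears


# Sample values borrowed from Example 6; the location string format is assumed.
event = DiseaseEvent(
    name="cúm A H5N1",
    time=date(2013, 4, 12),
    place=frozenset({"Song Ve, Tu Nghia, Quang Ngai"}),
)
print(event)
```

Making the record immutable is a design choice for this sketch only; it keeps an extracted event from being altered after it is stored.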
We propose a process to extract the information of a disease event, as illustrated in Figure 1. The extraction process includes five components: the crawler retrieves data from the Internet; the pre-processing component extracts the main content from the web pages returned by the crawler (the detail of this module is described in Section and Table 3); the event detector decides whether a news article contains a disease event or not; the event extractor captures the information of the event in a given news article (if any); finally, the visualization component plots the disease events on an online Geographic Information System (GIS) map.
In this paper, we focus on the two key components, the event detector and the event extractor, which are described in detail in Section 3.3 and Section 3.4.
3.3 Event Detector
The goal of the Event Detector is to judge whether a given news article contains a disease event. When a news article is given, the detector labels it as containing a disease event (EVENT) or not (NOT_EVENT) by using rules (for title filtering) and machine learning (for classification). The process of the event detector is illustrated in Figure 2.
The event detector component consists of two modules: a data filter and a classifier. The filter module receives data from the pre-processing component, where HTML tags are removed to get the main content. This module then filters disease news articles by checking their titles. Subsequently, the data are passed to the classifier, which determines whether a news article contains an event or not.
3.3.1 Filtering Rules
As mentioned above, the event detection component has two modules, a data filter and a classifier, in which the filter uses semantic rules to reduce the number of news articles passed on for classification. We examined the domain data carefully and found that most news titles express the main content of their articles. This means that the title of a news article has enough evidence to signal the existence of a disease event. Therefore, we use rules to filter disease-related news articles.
Table 1: List of frequent-words
We computed statistics on a large dataset of news articles from the "Sức khỏe" (Health)⁵ category of the "Báo mới" news website⁶ to find a set of frequent words (and phrases). The number of frequent words is 34, and some of the most frequent are given in Table 1, where the third column counts the number of articles containing the corresponding words in the second column. We denote this set as the Frequent-words set.
We observed that most news articles contain words in the Frequent-words set that relate to a disease event. Therefore, our idea is to build semantic rules by combining words in the Frequent-words set in order to filter the input data. As a result, we propose two patterns, named Pattern 1 and Pattern 2, which represent all our semantic rules. These patterns are shown below:
Pattern 1 = noun phrase # verb phrase (2)
where noun phrase and verb phrase are in the Frequent-words set.
Pattern 1 is illustrated in Example 1.
Example 1:
bệnh nhân tử vong # nhiễm (the patient died # infected)
dịch tả # bùng phát (cholera # broke out)
Pattern 2 = disease name # verb phrases (3)
where:
• disease name is retrieved from the BioCaster Ontology [3] and the circular of the Ministry of Health of Vietnam⁷, dated June 24th, 2011;
• verb phrases are in the Frequent-words set.
Examples of rules following Pattern 2 are given in Example 2.
Example 2:
tiêu chảy cấp # nhiễm (acute diarrhea # infected)
tiêu chảy cấp # phát hiện (acute diarrhea # discovered)
tiêu chảy cấp # lây lan (acute diarrhea # spread)
tiêu chảy cấp # bùng phát (acute diarrhea # broke out)
tiêu chảy cấp # chết (tử vong) (acute diarrhea # died)
tiêu chảy cấp # dương tính (acute diarrhea # is positive)
⁵ http://www.baomoi.com/Home/SucKhoe.epi
⁶ http://www.baomoi.com
⁷ http://www.moh.gov.vn/

Table 2: List of features

| No | Feature |
|----|---------|
| 1 | Dịch tay chân miệng (hand, foot, and mouth disease) |
| 2 | Tiêu chảy (diarrhea) |
| 3 | Trẻ tử vong (the child died) |
| 4 | Ổ dịch (disease source) |
| 5 | Dương tính (is positive) |
| 6 | Dịch cúm gia cầm (bird flu) |
| 7 | Ca tử vong (deaths) |
| 8 | Bùng phát dịch (outbreak) |
| 9 | Dịch cúm (flu) |
| 10 | Bệnh nhân tử vong (the patient died) |

Both patterns have two elements, which are separated by the character "#". We built 43 rules from Pattern 1 by combining 52 noun phrases and 10 verb phrases. Both the noun phrases and the verb phrases are in the Frequent-words set. Similarly, we used a disease name and a verb phrase to create a rule following Pattern 2. With 186 disease names from the disease dictionary and 6 verb phrases in the Frequent-words set, the number of rules conforming to Pattern 2 is 186. Some verb phrases in Pattern 1 and Pattern 2 are the same.
After building the rule set, we had 229 rules in total. The related articles are retrieved by these rules and passed to the classifier.
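The title filtering step can be sketched as follows. The paper does not spell out the matching procedure, so this sketch assumes that a title matches a rule when both elements of the rule occur in the title, in either order (Example 8 in Section 4.4, where the verb phrase precedes the disease name, suggests that order does not matter). The function names `parse_rule` and `title_matches` are hypothetical, only three of the 229 rules are listed, and the last title is made up for the negative case.

```python
from typing import List, Tuple

Rule = Tuple[str, str]


def parse_rule(text: str) -> Rule:
    """Split a rule such as 'dịch tả # bùng phát' into its two elements."""
    left, right = text.split("#")
    return left.strip().lower(), right.strip().lower()


def title_matches(title: str, rules: List[Rule]) -> bool:
    """Assumed semantics: both elements of some rule occur in the title."""
    lowered = title.lower()
    return any(left in lowered and right in lowered for left, right in rules)


# Three rules quoted in this paper (Example 1, Example 7, Example 8).
RULES = [parse_rule(r) for r in (
    "dịch tả # bùng phát",
    "bệnh nhân # tử vong",
    "tay chân miệng # phát hiện",
)]

titles = [
    "Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong",    # Example 7
    "Phát hiện chủng virus mới gây bệnh tay chân miệng",  # Example 8
    "Giá vàng hôm nay tăng mạnh",                         # made-up, unrelated title
]
for title in titles:
    print(title_matches(title, RULES), title)
```

Both quoted titles pass the filter even though neither reports a disease event, which is the high-recall, low-precision behaviour the classifier is meant to correct.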
3.3.2 Machine Learning Application
The classification model assigns a news article either the EVENT or the NOT_EVENT label. The investigation of the input data suggests that the title and abstract of a disease news article carry enough information to represent its content; therefore, these elements are used to create the feature vector. In the data preparation step, articles are manually tagged with the label EVENT or NOT_EVENT. After that, features are generated by using 2-grams, 3-grams, and 4-grams. As a result, we obtain 4,552 features, which are used for classification. Some features are shown in Table 2.
We used the Maximum Entropy Model⁸ as the classifier. The news articles labeled EVENT become the input for the Event Extractor component.
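A minimal sketch of the n-gram feature generation is given below. It assumes that n-grams are taken over whitespace-separated tokens of the lower-cased title and abstract; the paper does not state the tokenization unit, the function name `ngram_features` is hypothetical, and the sample title is composed from phrases in Table 2 for illustration. Training the Maximum Entropy classifier itself is not shown.

```python
from typing import Iterable, List


def ngram_features(text: str, sizes: Iterable[int] = (2, 3, 4)) -> List[str]:
    """Return all 2-, 3-, and 4-grams over whitespace tokens (assumed unit)."""
    tokens = text.lower().split()
    features = []
    for n in sizes:
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


# Illustrative title built from Table 2 phrases, not taken from the dataset.
title = "Dịch cúm gia cầm bùng phát"
features = ngram_features(title)
for feature in features:
    print(feature)
```

In a full pipeline, each distinct n-gram would become one dimension of the feature vector fed to the classifier.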
3.4 Event Extractor
The Event Extractor is one of the two important components; it extracts the information of a disease event. The event extraction component is illustrated in Figure 3.
Event extraction includes three modules: time extraction, disease extraction, and location extraction. The first module uses rules to extract the time information; the second module uses a disease dictionary to extract the disease information; and the final module combines NER and a location dictionary to capture the place information. Finally, we combine the extracted information to form a disease event and store it in an event database.
⁸ http://www.cs.princeton.edu/maxent
Figure 3: Event extractor component
3.4.1 Time Extraction
The investigation of the dataset suggests that time information can be captured by rules and that it is either absolute or relative. In the absolute case, the time has the format DD/MM/YYYY, so we use a Regular Expression (RE) to extract it. The relative case always contains two elements: a prefix and the time. The prefix is a set of words that indicates relative time, and the time is usually in the Vietnamese date form DD/MM/YYYY. Therefore, we use a rule [1] to calculate the absolute time. The time rule is shown in Formula (4).
TIME = <RELATIVE TIME> + <DATE TIME> (4)
where:
• RELATIVE TIME = vào (on), ngày (date), sáng (morning), hôm nay (today), sáng hôm nay (this morning), chiều (afternoon), hôm qua (yesterday), tối qua (yesterday evening), rạng sáng (early morning), tháng (month)
• DATE TIME has the format DD/MM/YYYY and is either the date expressed in the article content or the published date.
Example 3 and Example 4 illustrate the use of the Regular Expression and the time rule to extract the time information.
Example 3:
“Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1 đã tái phát tại Quảng Ngãi.” (On March 12th, 2012, the Ministry of Health announced that the A H5N1 flu had recurred in Quang Ngai.)
Example 4:
“Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo bệnh nhân đầu tiên nhiễm cúm A/H5N1 đã tử vong” (In the morning of January 15th, 2012, the Hanoi Health Department announced that the first patient infected with A/H5N1 flu had died.)
The time information in Example 3 is captured by the Regular Expression, while in Example 4 it is extracted by Formula (4). As a result, the time information in Example 3 is March 12th, 2012, whereas in Example 4 it is In the morning of January 15th, 2012.
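The two cases above can be sketched with a single regular expression that captures an optional run of RELATIVE TIME prefixes followed by a DD/MM/YYYY date. This is an illustrative sketch under stated assumptions, not the rule from [1]: the function name `extract_time` is hypothetical, expressions that carry no explicit date (for example "hôm qua" alone) are not resolved, and the fallback to the published date follows the convention stated in Section 4.3.

```python
import re
from datetime import date
from typing import Tuple

# Prefix phrases listed under RELATIVE TIME in Formula (4).
RELATIVE_TIME = [
    "sáng hôm nay", "hôm nay", "hôm qua", "tối qua", "rạng sáng",
    "vào", "ngày", "sáng", "chiều", "tháng",
]
# Longer phrases come first so that the alternation prefers them.
_PREFIX = "|".join(sorted(RELATIVE_TIME, key=len, reverse=True))
TIME_RE = re.compile(
    r"(?<!\w)((?:(?:" + _PREFIX + r")\s+)*)(\d{2})/(\d{2})/(\d{4})",
    re.IGNORECASE,
)


def extract_time(text: str, published: date) -> Tuple[str, date]:
    """Return (relative-time prefix, date); fall back to the published date."""
    match = TIME_RE.search(text)
    if match is None:
        return "", published
    prefix, day, month, year = match.groups()
    try:
        found = date(int(year), int(month), int(day))
    except ValueError:  # the digits do not form a real calendar date
        return "", published
    return prefix.strip().lower(), found


published = date(2012, 1, 16)  # made-up published date for the demo
print(extract_time("Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1", published))
print(extract_time("Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo", published))
```

Returning the prefix separately keeps the qualifier ("in the morning of") available to the caller without changing the calendar date.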
3.4.2 Disease Extraction
Disease extraction is the second module; it captures the disease name. As shown in Figure 1, the pre-processing component tokenizes and word-segments the content of articles. As a result, each article has a list of words, which is the input for this module. The disease extraction module uses a disease dictionary of 186 disease names.
The extraction process has two steps: finding the longest phrase that can be a name candidate, and matching the candidate against the original article to check whether it is a correct name. The finding step uses the longest matching method to match a word (in an article) with a disease name (from the disease name dictionary). If a disease name contains a given word, then it is probably a disease name candidate. In the matching step, the candidate is checked to see whether it appears in the article. A correct candidate must appear in the original article. The disease extraction process is illustrated in Example 5.
Example 5:
“Dịch cúm A/H5N1 bùng phát tại Bến Tre” (A/H5N1 flu breaks out in Ben Tre)
After tokenizing and word-segmenting, we obtain two words related to disease: cúm (flu) and A/H5N1. The finding step matches these words against the disease dictionary to find the longest name. As a result, for the word cúm (flu) we retrieve three names: cúm (flu), cúm A/H5N1 (A/H5N1 flu), and cúm gia cầm (bird flu), while for the word A/H5N1 we retrieve only one name: cúm A/H5N1 (A/H5N1 flu). In the next step, the matching process checks these names against the original article to find the correct result. In this example, the longest item is cúm gia cầm (bird flu), but it does not appear in Example 5, so this disease is ignored. The second longest name is cúm A/H5N1 (A/H5N1 flu), and the matching process finds it in the original article. It is therefore the correct disease name, and the value of the disease information is cúm A/H5N1 (A/H5N1 flu).
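The finding and matching steps walked through in Example 5 can be sketched as below. This is a simplified reading of the description, not the authors' code: the function name `extract_disease` is hypothetical, "a disease name contains a given word" is interpreted as a token-level match, the dictionary holds only the three names mentioned in Example 5, and the word segmentation of the sentence is assumed.

```python
from typing import List, Optional


def extract_disease(words: List[str], article: str,
                    dictionary: List[str]) -> Optional[str]:
    lowered_words = {w.lower() for w in words}
    # Finding step: a dictionary name is a candidate if it contains an article word.
    candidates = [
        name for name in dictionary
        if lowered_words & set(name.lower().split())
    ]
    # Matching step: try the longest candidate first and keep the first one
    # that literally appears in the article.
    text = article.lower()
    for name in sorted(candidates, key=lambda n: (-len(n), n)):
        if name.lower() in text:
            return name
    return None


DICTIONARY = ["cúm", "cúm A/H5N1", "cúm gia cầm"]  # 3 of the 186 names
ARTICLE = "Dịch cúm A/H5N1 bùng phát tại Bến Tre"
WORDS = ["Dịch", "cúm", "A/H5N1", "bùng phát", "tại", "Bến Tre"]  # assumed segmentation
print(extract_disease(WORDS, ARTICLE, DICTIONARY))
```

As in Example 5, the longest candidate "cúm gia cầm" is rejected because it does not occur in the sentence, and "cúm A/H5N1" is returned.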
3.4.3 Location Extraction
Building the final module is more challenging than the two previous ones because of the ambiguity among locations. In fact, several places can have the same proper name (e.g., "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces). Therefore, if a news article does not mention locations clearly, the place information can be confused. To deal with this issue, we combined NER and a location dictionary to improve the performance of location extraction.
The location extraction process has three steps: NER, location extraction, and normalization. First, the NER tool⁹ is applied to detect location entities in a given news article. As a result, locations in the article are labeled with a pair of <LOC> and </LOC> tags. Second, we extract the locations based on these tags. In the final step, each location is normalized by looking it up in the location dictionary, which is described in detail below.
⁹ http://jvntextpro.sourceforge.net
Figure 4: The location dictionary taxonomy
We used a location dictionary organized as a taxonomy, which is shown in Figure 4, where:
• T is the abbreviation of town
• C is the abbreviation of commune
In this taxonomy, the highest level is the root node; level 1 represents 63 provinces; level 2 contains 692 districts; and level 3 contains 11,101 towns and communes. If a phrase inside the <LOC> and </LOC> tags matches the value of a node, then that node is marked and the complete location is the path from that node to the root node. This organization makes it efficient to identify the relations between communes, towns, and provinces, and helps to avoid geo-ambiguity. The usefulness of the taxonomy is shown in Example 6.
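The tag extraction and taxonomy lookup can be sketched as follows, using the place names that appear in Example 6. This is a minimal sketch under stated assumptions: the `Node` class, the function names, and the shape of the tagged string are hypothetical, the root node is left implicit (a province has no parent), and only one branch of the taxonomy is built. Because several places can share a name, the lookup returns every matching path; how the authors pick among them is not described here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    level: str                       # "province", "district", or "town"
    parent: Optional["Node"] = None  # None means the parent is the root


def build_index(nodes: List[Node]) -> Dict[str, List[Node]]:
    """Map each name to all nodes carrying it (names may be shared)."""
    index: Dict[str, List[Node]] = {}
    for node in nodes:
        index.setdefault(node.name, []).append(node)
    return index


def extract_locations(tagged: str) -> List[str]:
    """Collect the phrases found between <LOC> and </LOC> tags."""
    return re.findall(r"<LOC>(.*?)</LOC>", tagged)


def normalize(name: str, index: Dict[str, List[Node]]) -> List[List[str]]:
    """Return the path from each matching node up to its province."""
    paths = []
    for node in index.get(name, []):
        path = []
        current: Optional[Node] = node
        while current is not None:
            path.append(current.name)
            current = current.parent
        paths.append(path)
    return paths


quang_ngai = Node("Quang Ngai", "province")
tu_nghia = Node("Tu Nghia", "district", quang_ngai)
song_ve = Node("Song Ve", "town", tu_nghia)
INDEX = build_index([quang_ngai, tu_nghia, song_ve])

# Assumed shape of the NER output for the sentence in Example 6.
tagged = "<ORG>Quang Ngai</ORG> ... <LOC>Song Ve</LOC>"
for location in extract_locations(tagged):
    print(location, "->", normalize(location, INDEX))
```

Storing parent links rather than full paths keeps each place name in one node, so the complete location is recovered by walking upward.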
Example 6:
“Ngày 12/04/2013, Sở y tế Quảng Ngãi thông báo dịch cúm A H5N1 đã bùng phát tại thị trấn Sông Vệ” (On April 12th, 2013, the Department of Health of Quang Ngai announced an A H5N1 flu outbreak in Song Ve town).
This example mentions only the town where the A H5N1 flu breaks out ("Song Ve" town), while the district and the province are absent. In the location extraction process, this sentence is parsed by the NER tool, and "Song Ve" is labeled with the <LOC> and </LOC> tags, while "Quang Ngai" is recognized as an organization entity (<ORG>). In the usual approach, after retrieving the location (inside the <LOC> and </LOC> tags), "Song Ve" would be the location information. But "Song Ve" alone does not carry enough information to be placed on a GIS map, since it is not complete. To solve this problem, we look up "Song Ve" in the location taxonomy. When the node with this value is found, we traverse from this node to the root node to extract the complete information (i.e., "Song Ve" town, "Tu Nghia" district, and "Quang Ngai" province). This step is called location normalization.
Finally, the extracted time, disease name, and locations from the article are combined to create an infectious disease event, in which the set of locations found in this module forms the place component of the event. The event is stored in an event database, which is used by the visualization component of a real-time monitoring system.
4.1 Data Preparation
Our data is retrieved from the "Báo mới" news website¹⁰ because "Báo mới" automatically crawls a large number of news articles per day from most of the well-known Vietnamese websites; hence, it is a good data source. After crawling, we had a dataset (denoted as the raw dataset) of 3,842,137 news articles. The elements of a news article (after the pre-processing step) are shown in Table 3.
After crawling the data, we used Pattern 1 (2) and Pattern 2 (3) to filter it and obtained a set of 1,668 disease-related news articles. We denote this set as the Filtered dataset for later reference.
In our study, experiments are conducted on the two important components, the Event Detector and the Event Extractor, which are described in detail in Section 4.2 and Section 4.3.
¹⁰ http://www.baomoi.com/

Table 3: News article’s elements

| Element | Description |
|---------|-------------|
| Title | The title of the article |
| Abstract | The short paragraph that summarizes the article’s content |
| Published time | The time when the news is published; it supports the time extraction process |
| Content | The content of the article |

Table 4: The error rate of the data filter module

| Incorrect articles | Total | Error Rate (%) |
4.2 Event Detection Evaluation
As mentioned above, the Event Detector has two modules: the data filter and the classifier. Therefore, we evaluate the performance of this component based on these two modules.
4.2.1 Data Filter Evaluation
The data filter is the first module in event detection; it filters the articles coming from the data crawler component. As mentioned above, this module uses Pattern 1 (2) and Pattern 2 (3) to filter articles, so its performance depends on the coverage of the rules of the two patterns. Ideally, we would evaluate the precision of Pattern 1 and Pattern 2 on the whole dataset (3,842,137 news articles), but this approach is very costly because the articles have to be labeled manually.
To evaluate the performance of this module, we randomly selected 486 articles from the raw dataset and manually checked the error rate. The error rate was calculated using Formula (5), and the results are shown in Table 4. The results show that the error rate is high (the accuracy is low) because the module passes all articles related to diseases, a large number of which do not present disease events (this issue is discussed in detail in Section 4.4). We accept this in order to gain high recall; the overall performance is improved by the subsequent phases.
$ErrorRate = \frac{\#incorrect}{total}$ (5)
where:
• #incorrect is the number of articles which are not related to disease
• total is the total number of articles.
4.2.2 Classification Evaluation
We carried out two experiments to evaluate the performance of the classification: Experiment a, which combines rules and machine learning, and Experiment b, which uses only machine learning. The measures used to evaluate the performance of this module are precision, recall, and F-score, based on 10-fold cross validation.
In Experiment a, we randomly selected 686 articles from the Filtered dataset and tagged them as EVENT or NOT_EVENT. We denote this set as the Experiment a dataset. In Experiment b, we selected 50 more articles from the raw dataset and added them to the Experiment a dataset to form the Experiment b dataset.
After preparing the training datasets, we compared the performance of the two experiments. The comparison of the two classifiers is shown in Table 5, where the results of Experiment a are in the three columns on the left and the results of Experiment b are in the three columns on the right. The average F-scores indicate that the classifier in Experiment a is better than that in Experiment b by ≈2.36%. The difference between the two classifiers is not big because we added only 50 articles to the Experiment a dataset. The gap would likely be larger if more raw articles were added to the Experiment b dataset.

Table 5: The comparison of Experiment a and Experiment b

|     | Experiment a: P | Experiment a: R | Experiment a: F | Experiment b: P | Experiment b: R | Experiment b: F |
|-----|-------|-------|-------|-------|-------|-------|
| Avg | 75.07 | 79.76 | 77.33 | 72.35 | 77.84 | 74.97 |

4.3 Event Extraction Evaluation
Because an infectious disease event E is defined as a tuple that includes name, time, and place, as given in Formula (1), a correct event should contain all three elements. When the time of an event is not clearly mentioned in the text, we use the published date of the article as the time of the event. In other cases, if a disease event does not include either a disease name or locations, it is considered a false event.
To evaluate the event extraction step, we carried out two experiments: Experiment c (abbreviated as Expr c), which uses rules, and Experiment d (abbreviated as Expr d), which uses both rules and NER. The dataset used in both experiments consists of 152 news articles selected from the set of articles returned by the event detector.
We use three measures, Precision (P), Recall (R), and F-score (F), to compare the performance of the two experiments. These measures are given by Formulas (6), (7), and (8) as follows:
$P = \frac{\#correct}{\#correct + \#incorrect}$ (6)
where:
• #correct is the number of correct disease events
• #incorrect is the number of incorrect disease events
$R = \frac{\#correct}{\#correct + \#not\_found}$ (7)
where:
• #correct is the number of correct disease events.
• #not_found is the number of disease events which the model did not recognize
$F = \frac{2 \times P \times R}{P + R}$ (8)

Table 6: The comparison of Experiment c and Experiment d

| Name | Correct | Incorrect | P | R | F |
|------|---------|-----------|---|---|---|
| Expr c | 127 | 25 | 83.55 | 92.02 | 87.58 |
| Expr d | 136 | 16 | 89.47 | 94.44 | 91.89 |

precision of Experiment c and Experiment d The
compari-son is showed in Table 6, where the second row is the result
of Experiment c whereas the third row is the result of
Exper-iment d In the ExperExper-iment c, the F-score is ≈87,58%, while
it is ≈91,89% in the later experiment The result shows that
the precision of Experiment d improves by ≈4,31% in
com-parison with that of Experiment c The cause of the
differ-ence between two experiments will be explained in the next
section
4.4 Error Analysis and Discussion
In the Event Detector component, the results in Table 4 suggest that there is confusion in the data filter module. To find the cause of the confusion, we manually checked articles selected from the dataset used in Section 4.2.1. The analysis indicates that, in the error cases, some rules of Pattern 1 (2) and Pattern 2 (3) are not effective at filtering articles. The reason is that several topics can share a verb. For instance, the verb phrase "tử vong" (die) may belong to either the disease topic or the treatment topic. If this verb appears in an article, the data filter module considers the article related to a disease event, although it may in fact belong to the treatment topic, as illustrated in Example 7.
Example 7:
"Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong" (The patient died 30 minutes after taking fever-reducing medication.)
This sentence is captured by a rule of Pattern 1 (2), "bệnh nhân # tử vong" (patient # died), but in fact the cause of death is related to the medication rather than to a disease.
Moreover, some rules of Pattern 2 (3) (which is a combination of a disease name and a verb phrase) confuse a disease event with a topic that is merely related to a disease, as shown in Example 8.
Example 8:
"Phát hiện chủng virus mới gây bệnh tay chân miệng" (A new strain of virus causing hand, foot, and mouth disease has been discovered.)
The rule of Pattern 2 (3) "tay chân miệng # phát hiện" (hand, foot, and mouth # detect) captures this sentence, but the sentence reports the discovery of a new virus strain rather than a disease event.
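The two false positives in Examples 7 and 8 can be reproduced with a minimal matcher. This is a sketch under an assumption, not the data filter module itself: the meaning of "#" is not defined in this section, so it is read here as "both phrases occur in the sentence, in any order" (Example 8 suggests that order is not enforced, since "phát hiện" precedes the disease name). The function name `rule_matches` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lower-case and NFC-normalize so Vietnamese diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def rule_matches(rule: str, sentence: str) -> bool:
    """Assumed semantics of 'A # B': every part occurs somewhere in the sentence."""
    parts = [normalize(part.strip()) for part in rule.split("#")]
    haystack = normalize(sentence)
    return all(part in haystack for part in parts)


example_7 = "Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong"
example_8 = "Phát hiện chủng virus mới gây bệnh tay chân miệng"

# Both sentences are matched although neither reports a disease event.
print(rule_matches("bệnh nhân # tử vong", example_7))
print(rule_matches("tay chân miệng # phát hiện", example_8))
```

A purely lexical co-occurrence test of this kind has no access to the topic of the sentence, which is why a shared verb phrase is enough to trigger the rule.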
For the Event Extractor component, the results in Table 6 indicate that the precision of Experiment d is ≈5,92 percentage points higher than that of Experiment c. At first, we were surprised by this result, because Experiment c uses a rule-based method to capture the information of an event. Normally, using rules (a knowledge-driven method) yields high accuracy.
To find the source of the errors in event extraction, we manually checked the incorrect articles in the two experiments (mentioned in Section 4.3). The results are shown in Table 7 and Table 8, respectively. The statistics in Table 7 and Table 8 indicate that the errors in both experiments originated from location extraction and, in some cases, from disease extraction. In Experiment c, we found that the rules used to extract locations did not cover all cases. In a few cases, if the location information is abbreviated, the rules cannot recognize it, as illustrated in Example 9.
Example 9:
“Phát hiện một trường hợp bệnh nhân nhiễm cúm A H5N1 tại P.7, Q.8, TP HCM.” (A patient infected with A H5N1 flu was discovered in Ward 7, District 8, Ho Chi Minh City.)
In this example, Ward 7, District 8, and Ho Chi Minh City are abbreviated (as P.7, Q.8, and TP HCM); therefore, the rules cannot recognize the location information.
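One conceivable way to handle the abbreviations in Example 9 is to expand them before the location rules run. The sketch below is a hypothetical preprocessing step, not part of the system described here: the table `ABBREVIATIONS` and the function `expand_abbreviations` are illustrative names, and only the three abbreviations from the example are covered.

```python
import re

# Hypothetical expansion table covering only the abbreviations in Example 9.
ABBREVIATIONS = [
    (re.compile(r"\bTP\.?\s*HCM\b"), "Thành phố Hồ Chí Minh"),
    (re.compile(r"\bP\.\s*(\d+)"), r"Phường \1"),  # P.7 -> Phường 7 (ward)
    (re.compile(r"\bQ\.\s*(\d+)"), r"Quận \1"),    # Q.8 -> Quận 8 (district)
]


def expand_abbreviations(text: str) -> str:
    """Rewrite abbreviated administrative units into their full forms."""
    for pattern, replacement in ABBREVIATIONS:
        text = pattern.sub(replacement, text)
    return text


print(expand_abbreviations("nhiễm cúm A H5N1 tại P.7, Q.8, TP HCM."))
```

A real table would need many more entries and would have to avoid expanding strings that only look like abbreviations, so this should be read as an illustration of the failure case rather than a complete remedy.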
In Experiment d, the main factor that reduced the precision of location extraction is the performance of the NER tool. In a few cases, it did not detect locations exactly because place names in the articles were abbreviated (as with the rule-based method). In some other cases, it misrecognized a location as an organization, as shown in Example 10.
Example 10:
“Ngày 12/03/2012, dịch tiêu chảy cấp đã bùng phát tại Hà Nội, Hải Phòng, Quảng Ninh, Bến Tre, Cần Thơ.” (On March 12th, 2012, an acute diarrhea epidemic broke out in Hanoi, Hai Phong, Quang Ninh, Ben Tre, and Can Tho.)
In this example, "Hanoi", "Hai Phong", "Quang Ninh", "Ben Tre", and "Can Tho" are recognized as organizations (tagged with <ORG> and </ORG> pairs), which are ignored during processing.
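The effect described in Example 10 can be illustrated with a small tag reader. This is a sketch under stated assumptions: the tag name <LOC> is assumed (only <ORG> is mentioned above), the function `extract_locations` is hypothetical, and the optional gazetteer fallback is one possible mitigation rather than something the system is reported to do.

```python
import re
import unicodedata

# Matches a <LOC>...</LOC> or <ORG>...</ORG> pair; <LOC> is an assumed tag name.
TAG = re.compile(r"<(LOC|ORG)>(.*?)</\1>")


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def extract_locations(tagged_text, gazetteer=()):
    """Collect <LOC> spans, plus <ORG> spans that appear in the gazetteer."""
    known = {nfc(name) for name in gazetteer}
    locations = []
    for label, span in TAG.findall(tagged_text):
        name = nfc(span.strip())
        # Without the fallback, every ORG-tagged place name is dropped here.
        if label == "LOC" or name in known:
            locations.append(name)
    return locations


tagged = "dịch tiêu chảy cấp đã bùng phát tại <ORG>Hà Nội</ORG>, <LOC>Bến Tre</LOC>"
print(extract_locations(tagged))              # ORG-tagged "Hà Nội" is lost
print(extract_locations(tagged, {"Hà Nội"}))  # recovered via the gazetteer
```

The first call mirrors the reported behaviour, where a place tagged as an organization never reaches the event; the second shows how a location dictionary lookup could recover it.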
In both Experiment c and Experiment d, some disease names were extracted incorrectly because they are not in the disease dictionary. Moreover, the disease dictionary contains some names that are also symptoms of a disease, which confuses the disease extraction module. For instance, in Table 7, the disease name A/H1N flu in the 89th article is detected as pneumonia, although pneumonia is a symptom of the A/H1N flu.
Table 7: The errors in Experiment c (15 of 25 errors)
Correct Information    Extracted Information
3 13 Ward 6, District 8, Ward 14, Ho Chi Minh City District 5, District 8, Ward 7, Binh Thanh District, Hoc Mon District
Đinh
Nam Đinh
Thang, D’Dak Rong
NULL
Quan Ngoai, Lang Chanh village, Lang Mau village, and Nhan Ly
Tam Quan commune

In addition, some other factors adversely affect event extraction. Firstly, typographical errors in location names reduce the performance of location extraction. For instance, "Đắk Lắk" is sometimes written as "Đắc Lắc", but "Đắc Lắc" does not appear in the location dictionary; therefore, the location information can be missed. Secondly, if locations are described vaguely, such as “các huyện phía Tây của tỉnh Bến Tre” (the western districts of Ben Tre province), the NER utility cannot recognize them. Finally, another important cause is geo-ambiguity, which reduces the precision of the event extraction component. One proper name can refer to several places; if a disease news article does not identify the place clearly, the location information can be confused. Geo-ambiguity is illustrated in Example 11:
Example 11:
“Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo đã phát hiện vi khuẩn tả tại thị trấn Đông Hải” (On October 5th, 2012, the Quang Ninh Department of Health announced the detection of cholera bacteria in Dong Hai town.)
In this example, Dong Hai town is a location in both Tra Vinh and Quang Ninh provinces, but the article only mentions Dong Hai town, so the module failed to decide whether the disease outbreak was in Quang Ninh or Tra Vinh.
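The ambiguity in Example 11 can be made concrete with a toy gazetteer. The sketch below is not the location module described here: `GAZETTEER` holds a single hypothetical entry taken from the example, and `candidate_provinces` applies one conceivable heuristic, namely preferring a candidate province whose name appears elsewhere in the article text.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Hypothetical gazetteer entry taken from Example 11: town -> candidate provinces.
GAZETTEER = {nfc("Đông Hải"): [nfc("Quảng Ninh"), nfc("Trà Vinh")]}


def candidate_provinces(place: str, article_text: str) -> list:
    """Return candidate provinces, narrowed to those named in the article if any."""
    candidates = GAZETTEER.get(nfc(place), [])
    text = nfc(article_text)
    mentioned = [province for province in candidates if province in text]
    # If no candidate is mentioned, the ambiguity remains unresolved.
    return mentioned or candidates


sentence = ("Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo đã phát hiện "
            "vi khuẩn tả tại thị trấn Đông Hải")
print(candidate_provinces("Đông Hải", sentence))
```

In this sentence the name of the announcing department happens to contain a province name, so the heuristic narrows the candidates; when no such context exists, both provinces are returned and the event location stays ambiguous.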
Another source of errors is the incomplete recognition of locations, i.e., only some parts of a location were detected, as shown in row 4 of Table 7 (where only Nam Dinh province was detected) and row 11 of Table 8 (where only Binh Duong province was recognized).
The last source of errors is the case in which a location mentioned in the text was not the outbreak place. This misled the location module, which extracted incorrect information, as depicted in row 9 of Table 7 and row 8 of Table 8.
In this paper, we introduced a method that combines semantic rules and machine learning to extract disease events from Vietnamese webpages. The experimental results illustrate that our method is suitable for extracting disease events in Vietnamese. Furthermore, we briefly described our system process, emphasizing two key components: the Event Detector and the Event Extractor. We plan to integrate the event database into the Vn-Loc system (http://vnloc.com/), where users can follow some event types: FIRE, CRIMINAL, and TRANSPORT ACCIDENT.
However, our method needs some improvements to enhance its performance in the future. Firstly, the coverage of the semantic rules and the performance of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, the precision of event extraction can be increased by improving the performance of the NER tool. Besides, the handling of geo-ambiguity and of the confusion between diseases and symptoms should be improved. Finally, relations between disease events should be considered to enhance the quality of the monitoring system.
Table 8: The errors in Experiment d
Correct Information    Extracted Information
Nhon City
Binh Dinh
5 25 4 village, Hoa An commune, Krong Pac district, Dak Lak
Hoa An Commune, Chiem Hoa district, Tuyen Quang
6 26 Ward 8, District 5, Ho Chi Minh City (P.8, Q.5, TP HCM)
NULL
Phu, Chau Thanh Ba Tri, Cho Lach
Ben Tre
Dinh, Hanoi
Hanoi, Vinh Phuc
Town, Binh Duong
Binh Duong
14 84 Thanh Binh Ward, Hai Chau district, Da Nang city, Dak Lak
Ward Thanh Binh, Ninh Binh City, Ninh Binh City
Da Nang, Hai Chau District
Hoan Kiem District, Thanh Tri, Dong Da, Quang Ninh, Bac Giang, Nam Dinh, Thai Binh, Ha Nam, Hung Yen
Hanoi

[1] Mai-Vu Tran, Minh Hoang Nguyen, Sy-Quan Nguyen, Minh-Tien Nguyen, and Xuan-Hieu Phan. "VnLoc: A Real-Time News Event Extraction Framework for Vietnamese". KSE, pp. 161-166, 2012.
[2] Hogenboom Frederik, et al. "An Overview of Event Extraction from Text". Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011) at Tenth International Semantic Web Conference (ISWC 2011), Vol. 779, 2011.
[3] Collier Nigel, et al. "An Ontology-driven System for Detecting Global Health Events". In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 215-222. Association for Computational Linguistics.
[4] Volkova Svitlana, et al. "Animal Disease Event Recognition and Classification". Proceedings of the First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010), 2010.
[5] Doan S., Hung-Ngo Q., Kawazoe A., and Collier N. "Global Health Monitor - a Web-based System for Detecting and Mapping Infectious Diseases". Proc. International Joint Conference on Natural Language Processing (IJCNLP), Companion Volume, Hyderabad, India, January 7-12, pp. 951-956, 2008.
[6] Freifeld Clark C., et al. "HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports". Journal of the American Medical Informatics Association 15.2 (2008): 150-157.
[7] Doddington George R., et al. "The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation". LREC, 2004.
[8] Grishman Ralph, Silja Huttunen, and Roman Yangarber. "Real-Time Event Extraction for Infectious Disease Outbreaks". Proceedings of the Second International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc., 2002.
[9] Grishman Ralph, Silja Huttunen, and Roman Yangarber. "Information extraction for enhanced access to disease outbreak reports". Journal of Biomedical Informatics (JBI), Vol. 35, No. 4, pp. 236-246, 2002.
[10] Allan James, Ron Papka, and Victor Lavrenko. "On-line new event detection and tracking". Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1998.
[11] Grishman Ralph, and Beth Sundheim. "Message understanding conference-6: a brief history". COLING, Vol. 1, pp. 466–471, 1996.