AUTOMATIC TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK
Yueyu Fu
Submitted to the faculty of the University Graduate School
in partial fulfillment of the requirements
for the degree Doctor of Philosophy
in the School of Library and Information Science,
Indiana University October 2006
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
789 East Eisenhower Parkway
P.O. Box 1346, Ann Arbor, MI 48106-1346
Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy.
Charles Davis, Ph.D.
Kiduk Yang, Ph.D.
David Leake (Computer Science, minor), Ph.D.
© 2006 Yueyu Fu. ALL RIGHTS RESERVED
DEDICATION
To my beloved parents Guanghui Fu and Lan Chen, my dear wife Wenjie Sun, and my
grandparents for their unconditional love and encouragement.
ACKNOWLEDGMENTS
I am grateful to the many people who generously provided the guidance, support, and encouragement I needed to complete this dissertation.
First and foremost, I would like to thank Dr. Javed Mostafa, my committee chair, for his
professional and personal guidance, which went far beyond his responsibilities. It is his patient guidance, sharp mind, and gentle encouragement that led me to the achievement I have
today.
Special thanks also go to the rest of my committee, Dr. Charles Davis, Dr. Kiduk Yang, and
Dr. David Leake, for their insightful comments and enduring support during the entire
process of my dissertation research.
I would like to thank my colleagues and the staff at Indiana University, especially Weimao
Ke, Kazuhiro Seki, Mary Kennedy, Arlene Merkel, Erica Bodnar, and Rhonda Spencer, for
their kind help and support throughout all these memorable years in Bloomington.
Finally, I must express my deepest gratitude to my parents, Guanghui Fu and Lan Chen, for opening my eyes to the world and encouraging me to pursue my career abroad, and to my
beloved wife, Wenjie Sun, for making our family full of joy, support, and understanding.
Automatic text classification is an important operational problem in information systems. Most automatic text classification efforts so far have concentrated on developing centralized solutions. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. To overcome the limitations of centralized approaches, an alternative distributed approach based on a multi-agent framework is proposed. Three major challenges associated with distributed text classification are examined: 1) Coordinating classification activities in a distributed environment, 2) Achieving high quality classification, and 3) Minimizing communication overhead. This study presents solutions to these specific challenges and describes a prototype system implementation. As agent coordination is the key component in conducting multi-agent text classification, two agent coordination protocols, namely the blackboard-bidding protocol and the adaptive-blackboard protocol, are proposed in the study.
To analyze the performance of the distributed approach, a comparative evaluation methodology is described, which treats the outcome of a centralized approach as baseline performance. A series of experiments was conducted in a simulation environment. The simulation environment permitted manipulation of independent variables such as scalability and coordination strategy, and investigation of the impact on two critical dependent variables, namely efficiency and effectiveness. There were three critical findings. First, in dealing with automatic text classification, the multi-agent approach can achieve improved system efficiency while maintaining classification effectiveness comparable to a centralized approach. Second, the agent protocols were effective in coordinating the text classification activities of distributed agents. Third, the application
of content-based adaptive learning for acquiring knowledge about the agent community reduced communication cost and improved system efficiency.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 MANUAL CLASSIFICATION
1.2 AUTOMATIC CLASSIFICATION
1.3 MULTI-AGENT PARADIGM
2 PROBLEM STATEMENT
2.1 SPECIFIC CHALLENGES
2.2 VARIABLES
2.3 IMPLICATIONS OF THIS RESEARCH
3 LITERATURE REVIEW
3.1 AUTOMATIC TEXT CLASSIFICATION
3.1.1 Text classification task
3.1.2 Text classification methods
3.1.3 Evaluation metrics for text classification
3.1.4 Test Collections
3.1.5 Centralized Text Classification Procedure
3.2 TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK
3.2.1 Multi-agent paradigm
3.2.2 Differences between multi-agent systems and other concurrent systems
3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm
3.2.4 Recent applications of the multi-agent paradigm
3.2.5 Centralized vs. Multi-agent text classification
3.3 MULTI-AGENT COORDINATION PROTOCOLS
3.3.1 Definition of coordination
3.3.2 Coordination Protocols
3.3.2.1 Organizational Structuring
3.3.2.2 Multi-agent planning
3.3.2.3 Contract net protocol
3.3.2.4 Negotiation
4 METHODOLOGY
4.1 DATA
4.2 DESIGN METHODOLOGY
4.2.1 Multi-Agent Community for Text Classification
4.2.2 Classification Module
4.2.3 Algorithms of Agent Coordination Protocols
4.2.4 Proposed Agent Coordination Protocols
4.2.4.1 Blackboard-bidding Protocol
4.2.4.2 Adaptive-blackboard Protocol
4.3 IMPLEMENTATION
4.3.1 System Architecture
4.3.2 Alternative approach
4.4 EVALUATION METHODOLOGY
4.4.1 Measurements
4.4.1.1 Effectiveness Measurements
4.4.1.2 Efficiency Measurements
4.4.2 Variables
4.4.2.1 Centralized vs. Distributed
4.4.2.2 Coordination Protocols
4.4.2.3 Number of Agents
4.4.3 Experimental Settings
5 RESULTS
5.1 CENTRALIZED VS. DISTRIBUTED
5.2 COORDINATION PROTOCOLS
5.2.1 Effectiveness
5.2.2 Efficiency Measured by Messages
5.2.3 Efficiency Measured by Time
5.3 NUMBER OF AGENTS
5.3.1 Impact of the number of agents on effectiveness
5.3.2 Impact of the number of agents on efficiency
6 CONCLUSIONS
6.1 SUMMARY
6.2 FUTURE RESEARCH
REFERENCES
1 Introduction
Automatic text classification is an important operational problem in information systems.
Many tasks in information systems, such as retrieval, filtering, and indexing, can be
considered classification problems. Most text classification efforts so far have concentrated
on developing centralized solutions, where data and computation are located on a single
computer. However, centralized classification approaches are often limited by
constraints on knowledge and computing resources. In addition, centralized approaches are
more vulnerable to attacks and system failures and less robust in dealing with them. This
research presents an alternative classification approach, distributed text
classification using a multi-agent framework, in which data and computation are distributed
across a network of computers.
1.1 Manual Classification
In library and information science, class/classification and category/categorization are
sometimes considered distinct terms (Jacob, 2004). Although both are used to
organize related entities, the two terms differ in a fundamental way. Classification
groups entities into mutually exclusive classes based on a set of predefined rules
regardless of the context, whereas categorization associates entities solely based on their
similarities within a given context (Jacob, 2004). This distinction makes categorization
more flexible than classification in organizing similar entities. However, for the benefit
of a broader audience, this study uses class/classification and category/categorization
interchangeably.
One approach to text classification is manual classification, which involves human
experts manually classifying documents based on classification rules and subjective
judgment. This approach has been used in library practice for many years to organize,
index, and retrieve documents. Human experts typically assign each book a code
representing a category according to a classification scheme, such as the Dewey
Decimal Classification, the Universal Decimal Classification, or the Library of
Congress Classification. A recent application of this approach on the web is the Yahoo
Directory, which organizes web pages into a hierarchical structure.
The main challenge of manual classification is its demand on resources. Manual
classification is a time-consuming process that relies heavily on domain knowledge. It
requires a significant investment of time from many human experts with knowledge of
different domains. Another associated problem is that subjective judgments can generate
inconsistent classification results. Because of these limitations, manual classification
works best for relatively small document collections.
1.2 Automatic Classification
To address the problems of manual classification, researchers have explored automatic
text classification as an alternative approach. Using machine learning techniques,
automatic text classification assigns documents to a set of pre-defined categories. This
approach has been applied in many areas, such as patent classification, news delivery,
and email spam filtering. In contrast to manual classification, automatic classification
offers the advantages of automation, efficiency, and consistency.
In automatic classification, documents are typically classified by a single classification
software system running on a single machine. This is also called centralized text
classification. Significant efforts have focused on developing document classifiers in a
centralized manner, and various classification algorithms have been developed to improve
the performance of centralized classification systems. The advantages of centralized
classification stem from the centralized architecture. Because data and computing
resources are located in the same place, the classification task is easy to manage
and classification is fast. Since communication in centralized classification
takes place within a single machine, the communication cost is relatively small.
However, as information becomes more distributed and its volume increases
exponentially, several critical disadvantages of centralized classification are revealed.
The effectiveness of a classification system is mostly determined by the artificial
knowledge (knowledge learned from training documents using machine learning
techniques) maintained by the system, which typically comes from training data.
Currently, centralized classification systems suffer from a scarcity of local
knowledge (knowledge maintained by a single classification system). The extent of local
knowledge is limited by the cost and constraints of
storing complete knowledge in a single place, and it is sometimes impossible to collect
all the necessary knowledge and store it in a central location. Since the classification
system can successfully classify only documents that are within the scope of a limited
amount of local knowledge, it is likely to fail when the expansion of its domain (i.e., local
knowledge) does not keep up with growing diversity in knowledge. Another disadvantage
is that, because of its centralized architecture, centralized classification has only a limited
amount of computing power and input/output capacity. When an information system has
to handle a large number of documents, the classification component may become a
performance bottleneck and a single point of failure.
Distributed text classification, which is an alternative approach to automatic centralized
classification, employs a de-centralized architecture for organizing knowledge and
computing resources. This approach allows multiple classification software systems to
collaborate with each other to fulfill the classification task in a distributed computing
environment. Distributed classification has several advantages over centralized
classification. The distributed architecture offers computational scalability for
classification. Mukhopadhyay et al. (2005) demonstrate that classification time decreases
dramatically as the number of collaborating classification software systems increases.
Also, not relying completely on a single classification software system allows the
classification system to avoid the problem of a single point of failure. When one of the
classification software systems fails, its tasks can be carried out by an alternative
classification software system. Lastly, distributed classification fits the web model better.
The Internet is a distributed system, and it offers the opportunity to take advantage of
distributed computing paradigms and distributed knowledge resources for classification.
However, distributed classification has some disadvantages. Unlike centralized
classification, distributed classification, which consists of multiple independent
classification software systems, does not have global control of all the classification
activities. Such global control is essential for achieving coherent system performance.
Without global control, distributed classification activities can produce conflicting and
inconsistent results. Therefore, an alternative mechanism for coordinating distributed
classification activities is needed. Another limitation of distributed classification is the
large communication overhead. In order for classification software systems to collaborate,
they must communicate with each other, and the amount of exchanged information can be
very large. For example, Mukhopadhyay et al. (2003) show that the average response
time for classification increases almost linearly with the number of classification software
systems, which is the direct result of increasing communication overhead. They also
show that the classification performance quickly saturates as the number of
classification software systems increases. This latter result points to the potential of improving
overall performance by reducing communication overhead.
1.3 Multi-Agent Paradigm
This research employs a multi-agent paradigm for conducting distributed text
classification. The multi-agent paradigm evolved from distributed artificial intelligence in
the late [...]0s, where agents are considered as autonomous intelligent computer software.
An agent exhibits three major characteristics, namely reactivity, proactiveness, and social
ability (Wooldridge & Jennings, 1995). Reactivity refers to the capability of sensing
changes in the environment and quickly taking corresponding actions. Proactiveness refers to
the capability of operating in an active fashion according to the agent's design goal. Social ability
refers to the capability of working with other agents. A group of such agents forms a
multi-agent system. Durfee and Montgomery (1989) define a multi-agent system (MAS)
as “a loosely coupled network of problem solvers that work together to solve problems
that are beyond their individual capabilities.”
For text classification, a multi-agent paradigm offers several critical advantages.
According to Sycara (1998), the multi-agent paradigm distributes computing resources
and capabilities across a network of agents, which can avoid the single point of failure
problem. The modular, scalable architecture of the multi-agent paradigm provides
computational scalability and flexibility for agents entering and leaving agent
communities. The multi-agent paradigm can also make efficient use of spatially
distributed information resources and serve as a solution when expertise is distributed.
Because of these advantages, the multi-agent paradigm has been utilized in the design of
information retrieval systems and information management systems. However, the
applicability of the multi-agent paradigm in text classification has not yet been thoroughly
examined.
Agent coordination is a critical component of the multi-agent paradigm. It determines the
relationship among the agents in a multi-agent environment and governs the behaviors of
the interacting agents. The overall system performance, including both quality and
efficiency, depends on the appropriate design of the coordination mechanism. Quality
measures the correctness of the system behavior, which is the collective result of the
coordinated agents’ behaviors. Efficiency measures the timeliness of the system process,
which counts mainly the communication among the coordinated agents. Due to its
importance in system performance optimization, agent coordination has been well studied
in various domains, such as transportation, economics, and management. As the
multi-agent paradigm is applied in text classification, multi-agent coordination will be the focus of
this research.
Evaluation of system performance is an essential aspect of the multi-agent
implementation plan. As the overall performance of a multi-agent system is a collective
result of multiple agents’ behaviors, the results [...] directly understandable.
The evaluation framework typically reflects the system performance at different levels,
including the agent level and the overall system level. Also, the evaluation metric covers
different aspects of the system performance, including effectiveness and efficiency.
Integrating the evaluation metrics of text classification and of the multi-agent paradigm
may provide a powerful tool for validating the approach of automatic text classification
using a multi-agent framework.
2 Problem Statement
The primary purpose of this study is to investigate automatic text classification using a
multi-agent framework. Automatic text classification and the multi-agent paradigm
have each been extensively studied over the years. Although problems within
each area have been investigated, new problems that arise with the introduction of the
multi-agent paradigm into automatic text classification remain mostly unexplored. In this
section, three major challenges associated with distributed text classification are
examined and key variables related to these challenges are discussed.
2.1 Specific Challenges
Distributed text classification is different from centralized text classification because of
its distributed architecture. One of the main challenges in distributed text classification is
coordinating classification activities in a distributed environment. Unlike centralized
classification, which relies on a mediator to ensure the coherence of the overall system
performance, distributed classification lacks centralized control, and thus may produce
conflicting and inconsistent classification results. Consequently, an effective mechanism
for coordinating distributed classification activities is greatly needed. In the multi-agent
paradigm, such mechanisms (e.g., agent coordination protocols) have been extensively
studied. Agent coordination research has drawn on various domains including
artificial intelligence, social science, game theory, and economics. Many agent
coordination protocols, such as blackboard and contracting, have been explored in those
domains. Although these agent coordination protocols have been successfully applied in
many domains, they have not been seriously studied in information science, particularly
for text classification. The question is whether these agent coordination protocols will
work well for the classification task. Different coordination protocols will be explored for
designing suitable coordination mechanisms for text classification.
Another challenge in distributed classification is achieving high quality of classification
in multi-agent environments. In automatic centralized classification, quality of
classification is mainly influenced by the quality of the test collection and the
classification algorithm. Most efforts so far have concentrated on developing new
methods to improve the classification performance. Several classification methods, such
as Support Vector Machines and k-Nearest Neighbor, have been applied in centralized
environments. In multi-agent environments, the classification task is distributed across a
network of classification agents. The classification process involves the actual
classification conducted by individual agents, the interactions among agents, and the
merging of individual classification results. Whether those well-established classification
methods are applicable in multi-agent environments has to be examined. To validate the
performance of these classification methods in multi-agent environments and identify
suitable classification methods for distributed classification, a thorough evaluation of the
quality of distributed classification needs to be conducted. The evaluation may cover the
comparison among different classification methods and the comparison between
distributed classification and centralized classification. The result of such an
evaluation may indicate whether certain distributed classification approaches can
achieve satisfactory quality of classification.
Minimizing communication overhead in distributed classification without compromising
quality of classification is yet another challenge. Communication is a key issue in
distributed classification, where agents exchange information, interact with each other,
and work together through communication to achieve satisfactory quality of
classification. In such an environment, the amount of communication greatly affects
system efficiency. Consequently, an appropriate agent coordination protocol that governs
the agents’ communication behavior in an effective and efficient manner may ensure high
quality of classification and reduce communication overhead. The key objective of the
agent coordination protocol is to balance quality of classification against system
efficiency. To achieve such a balance, an evaluation procedure has to be established to
measure quality of classification and system efficiency in multi-agent environments.
However, there is no standard evaluation framework that fulfills this goal. An evaluation
framework for measuring system efficiency needs to be established. To summarize, the
three main challenges are: 1) Coordinating classification activities in a distributed
environment, 2) Achieving high quality of classification in multi-agent environments, and
3) Minimizing communication overhead in distributed classification without
compromising quality of classification.
2.2 Variables
This research will be conducted using an experimental study design. The study will
explore the applicability of different classification methods in multi-agent environments.
The result of this exploration will help researchers choose appropriate classification
methods in distributed computing environments and design new methods for distributed
classification. The primary focus of this study will be to investigate the coordination of
distributed classification activities in multi-agent environments. A comparative study of
different coordination protocols for multi-agent classification will help in identifying the
best coordination protocol, which can achieve satisfactory classification performance
with acceptable communication overhead. A comparative study between centralized
classification and distributed classification will also be conducted to evaluate the
performance of the distributed classification approach. The evaluation framework will
draw on centralized classification research, and new approaches will be developed that
are uniquely suitable for distributed classification environments. To carry out this study
and address the three challenges discussed above, three variables will be studied: quality
of classification, system efficiency, and agent granularity.
One of the main goals is to achieve satisfactory classification performance in a
multi-agent environment. Therefore, quality of classification must be taken into consideration
throughout the study. The quality of classification refers to the accuracy of a completed
classification task. In contrast to centralized text classification, quality of classification in
a multi-agent context is determined not only by the performance of individual classifiers,
but also by the agent coordination protocol. Researchers in the information retrieval and
machine learning communities have tested various effectiveness measurements for
classification tasks. Lewis (1995) demonstrated the use of different families of single
effectiveness measures for estimating and optimizing the performance of classification
systems. Joachims (2001) summarized the most commonly used effectiveness measures
for evaluating text classification systems. In this study, precision, recall, and F measure
have been chosen to measure quality of classification.
Figure 1.1 Precision
Figure 1.2 Recall
A study by Mukhopadhyay et al. (2005) demonstrates how these evaluation measures can
be applied in a multi-agent environment. The study shows that as the number of
agents increases, precision drops (see Figure 1.1) while recall increases (see Figure 1.2),
and it proposes that quality of classification must be evaluated at both the system level and the
individual agent level. In addition to the overall performance evaluation at the system
level, the classification performance of individual agents will help in understanding each
agent’s behavior and relationship with other agents. In the study, the quality of
classification at the system level is calculated by averaging the corresponding
measurement scores across all agents. For each category, classification decisions can be
represented as a contingency table as follows:
                       Expert decision: Yes   Expert decision: No
System decision: Yes   TP                     FP
System decision: No    FN                     TN
Table 1: Contingency table
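Based on the contingency table, recall and precision are defined as follows, where TP, FP,
and FN denote the numbers of true positives, false positives, and false negatives:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]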
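As a minimal sketch of the definitions above, the following Python function computes recall and precision from the contingency counts. The F measure is mentioned in this study but not defined in this section, so the standard F-beta formula used here, the function name `precision_recall_f`, the zero-denominator convention, and the example counts are all assumptions made for illustration.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from contingency-table counts.

    Assumption: a measure with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Standard F-beta: weighted harmonic mean of precision and recall.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts for a single category.
p, r, f1 = precision_recall_f(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With `beta=1.0` the F measure weights precision and recall equally; other values of `beta` shift the emphasis toward recall (beta above 1) or precision (beta below 1).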
Typically, both micro-averaging and macro-averaging methods are applied to calculate
the average scores. In the micro-averaging method, precision and recall are computed
based on a “global” contingency table, which is the sum of the individual contingency tables.
In the macro-averaging method, precision and recall are computed by averaging the
precision and recall scores of all categories (Sebastiani, 2002). The micro-averaging and
macro-averaging scores reflect the classification performance on different categories.
Yang and Liu (1999) note that micro-averaging gives equal weight to each item (e.g.,
document) and can be dominated by large (common) categories, whereas
macro-averaging gives equal weight to each category, so small (rare) categories can unduly
influence the score.
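The difference between the two averaging methods can be seen in a small worked example. The sketch below is illustrative only: the function names, the two hypothetical categories, and their counts are assumptions, and a zero denominator is treated as a score of 0.0 by convention.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def micro_macro(tables):
    """tables maps category -> (TP, FP, FN); returns averaged precision and recall."""
    # Micro-averaging: sum the per-category tables into one "global" table.
    tp = sum(t[0] for t in tables.values())
    fp = sum(t[1] for t in tables.values())
    fn = sum(t[2] for t in tables.values())
    # Macro-averaging: score each category separately, then average the scores.
    n = len(tables)
    per_cat_p = [safe_div(t[0], t[0] + t[1]) for t in tables.values()]
    per_cat_r = [safe_div(t[0], t[0] + t[2]) for t in tables.values()]
    return {
        "micro_p": safe_div(tp, tp + fp),
        "micro_r": safe_div(tp, tp + fn),
        "macro_p": safe_div(sum(per_cat_p), n),
        "macro_r": safe_div(sum(per_cat_r), n),
    }


# Hypothetical data: one common category classified well, one rare category classified poorly.
tables = {"common": (90, 10, 10), "rare": (1, 3, 1)}
scores = micro_macro(tables)
for name, value in scores.items():
    print(name, round(value, 3))
```

In this made-up example the micro-averaged precision stays close to that of the common category, while the macro-averaged precision is pulled down sharply by the rare category, which matches the behavior described by Yang and Liu (1999).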
In this study, the main goal is not only to achieve high quality of classification, but also
to attain acceptable system efficiency. The efficiency of a multi-agent system largely
depends on its communication overhead. Agent interaction and coordination, and the
agent environment, influence communication overhead in multi-agent systems. Different
agent coordination protocols produce different amounts of communication overhead. This
variable, efficiency, will help in identifying appropriate coordination protocols that can
achieve acceptable communication overhead. Efficiency here refers to the time spent
completing a classification task. A study by Mukhopadhyay et al. (2003)
shows that as the number of agents increases, the communication overhead increases
almost linearly (see Figure 2.1) while the classification performance quickly saturates
(see Figure 2.2). This result shows the possibility of achieving satisfactory classification
performance with reduced communication overhead by interacting with fewer agents.
Efficiency can be measured at both the system level and the individual agent level.
Efficiency at the system level represents the time spent to classify all the documents in
the multi-agent environment. Efficiency at the agent level represents the time that an
individual agent spends to classify its own documents.
Figure 2.1 Average response time
Figure 2.2 Number of successful classifications
Agent granularity refers to the amount of knowledge possessed by an agent. Agent
granularity affects not only the classification capabilities of agents, but also efficiency at
the system level. Each agent possesses a certain amount of knowledge, which is a
proportion of the complete global knowledge. In an extreme case, each agent has
knowledge of only one class. When the total number of classes is fixed, the number of agents
decreases as the number of classes possessed by each agent increases. Theoretically, the
classification capability of each agent is enhanced as its number of classes increases,
because the probability of a document being classified by such an agent increases. Also,
with increased knowledge, the communication overhead decreases because there are
fewer agents and fewer coordination needs (see Figure 3.1 & Figure 3.2).
Figure 3.1 Precision
Figure 3.2 Average response time
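The relationship between agent granularity and the number of agents can be illustrated with a short sketch. It assumes, purely for illustration, that the classes are split into disjoint, evenly sized groups with one group per agent; the section does not say how knowledge is actually distributed among agents, and the function name `partition_classes` and the class labels are hypothetical.

```python
def partition_classes(classes, classes_per_agent):
    """Split a fixed list of classes into disjoint groups, one group per agent.

    Assumption: an even, non-overlapping split; the last agent may hold fewer classes.
    """
    if classes_per_agent < 1:
        raise ValueError("classes_per_agent must be at least 1")
    return [
        classes[i:i + classes_per_agent]
        for i in range(0, len(classes), classes_per_agent)
    ]


# Hypothetical community with a fixed total of 10 classes.
classes = [f"class_{i}" for i in range(10)]
for k in (1, 2, 5, 10):
    agents = partition_classes(classes, k)
    print(f"{k} classes per agent -> {len(agents)} agents")
```

At one class per agent (the extreme case mentioned above) the community has as many agents as classes; at ten classes per agent a single agent holds all the knowledge, which corresponds to the centralized case.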
2.3 Implications of this Research
This research has been developed to investigate automatic text classification using a
multi-agent framework. Automatic text classification has not been seriously tested in
multi-agent environments. Although some preliminary analyses were conducted
(Mukhopadhyay et al., 2003; 2005), they were limited in scope. The findings of this study
will reveal the advantages and disadvantages of conducting text classification using a
multi-agent framework in a more comprehensive manner. These findings may be useful
in choosing between centralized and distributed classification
approaches in different scenarios. For example, distributed classification can serve as an
alternative when centralized classification cannot be realized. Different classification
methods will be tested in the multi-agent classification environment. The results will be
useful for identifying classification methods for distributed classification tasks and for
inspiring researchers to design new classification methods for distributed computing
environments. Different approaches for addressing the agent coordination problem will
be evaluated and compared. The findings may help in designing appropriate coordination
protocols for distributed classification. The evaluation framework based on centralized
classification will be tested in the multi-agent environment. The findings will help in
better understanding the differences between centralized classification and distributed
classification and may contribute to the establishment of an evaluation framework for
distributed classification.
Ultimately, my goal is to make a scholarly contribution to the areas of text classification
and multi-agent systems and to produce findings that will be of interest to both practitioners
and researchers. At the practical level, the proposed approach can facilitate the sharing of
activities among different libraries. For example, if several small libraries adopt the
proposed approach and build a multi-agent community, they can help each other without
having to maintain comprehensive capacity for classifying materials. At the research
level, this proposed research may be of interest to researchers interested in advancing the
state of the art in automated text classification.
3 Literature Review
This research investigates an alternative classification approach, namely distributed text
classification conducted using a multi-agent framework. Three major challenges
associated with distributed text classification are examined: 1) Coordinating classification
activities in a distributed environment, 2) Achieving high quality of classification, and 3)
Minimizing communication overhead. This chapter reviews literature on these and
related problem areas. Different approaches for addressing the three major problem areas
are compared and potential solutions are proposed.
3.1 Automatic text classification
This section draws mostly on centralized text classification research and provides
the basic components for building distributed text classification approaches. Although
distributed classification is different from centralized classification, each classification
agent itself is still a relatively independent classification unit. This section reviews
classification tasks, classification methods, test collections, and the measurements for
quality of classification, with the aim of contributing to a better understanding of the
challenges associated with distributed text classification.
3.1.1 Text classification task
Automatic text classification assigns textual documents to classes using the rules or
patterns learned from a set of pre-classified documents. Sebastiani (2002) defines
automatic text classification as a process of assigning natural language documents to
predefined semantic classes. Generally, the text classification task can be defined as
follows: given a set of pre-classified documents, learn the classification rules or patterns
and find the correct classes for a new document.
Binary classification is the simplest and most fundamental operation in text classification.
In binary classification, there are only two classes. For example, when a user submits a
query term to an online library catalog, the library items are divided into two classes.
Items in one class contain the query term in their indices and items in the other class do
not. Email spam detection is another example: emails are
classified into two classes, spam and non-spam. A more complex task is multi-class
classification. In this scenario, documents are classified into more than two classes. For
example, news documents are classified into one of several pre-defined classes in news
filtering applications. In email routing, emails are directed to one of a few different
folders or recipients based on their content and metadata. Another scenario is called
multi-label classification (Joachims, 2002). In multi-label classification, the mapping
between documents and classes is not restricted to a one-to-one mapping. A document
that belongs to multiple classes can be classified as a member of all those classes at the
same time. In the news filtering example above, one news document could be about
the 2008 Olympics in China and it could be classified into both the International and Sports
categories.
3.1.2 Text classification methods
Most efforts in automatic text classification have concentrated on developing centralized
classifiers to improve the classification performance. Various approaches have been
explored, including probabilistic models, symbolic algorithms, and artificial intelligence
approaches. Several text classification methods are briefly described here, in the hope
that some of these methods will be suitable for multi-agent environments.
Naïve Bayes (NB) classification is based on a probabilistic model of text. This method is
called naïve Bayes because it assumes that the word occurrences in a document are
independent of each other. By applying Bayes' theorem and assuming word
independence, the probabilities of classes for a given document are estimated with a
relatively simple formula. Estimating the word probability distribution for each class,
which is typically done by analyzing the training documents, is a key step. Although the
naïve Bayes classifier is typically used in binary classification, a multinomial model can
be applied to solve the multi-class problem by ranking the classes or thresholding the
probabilities of the class membership given a document (Sebastiani, 2002). Naïve Bayes
classifiers are fast compared to other classification methods. One disadvantage of
naïve Bayes classifiers is their limited discrimination capability, which depends only on the
occurrences of words in the training documents (Chakrabarti, 2002).
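As a minimal sketch of the estimation and ranking steps described above, the following code trains a multinomial naïve Bayes model on tokenized documents and picks the highest-scoring class. The function names, the add-one (Laplace) smoothing, and the choice to ignore words unseen in training are assumptions made for illustration; they are not taken from the works cited.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns the counts the model needs."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify_nb(tokens, model):
    """Return the class with the highest log posterior under word independence."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        log_p = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # assumption: words never seen in training are skipped
                # add-one smoothing (an assumption) avoids zero probabilities
                log_p += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        scores[label] = log_p
    return max(scores, key=scores.get)

training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda attached".split(), "non-spam"),
    ("project meeting tomorrow".split(), "non-spam"),
]
model = train_nb(training)
print(classify_nb("cheap pills".split(), model))
```

Ranking the entries of `scores` instead of taking only the maximum would give the class ranking mentioned above for the multi-class case.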
K-nearest neighbor (kNN) classification is based on the assumption that similar
documents are likely to belong to the same class (Joachims, 2001). Based on the
user-specified similarity measures, this method finds the k nearest neighbors for a given test
document. The candidate classes are scored by the similarity of the k nearest neighbors to
the test document. The class of the test document is determined by the majority of the k
nearest neighbor documents. The top ranked candidate classes can be used for multi-label
classification. kNN classifiers do not have the drawback of linear classifiers, which
divide the document space with linear boundaries. However, classification with
this method is time-consuming. During the classification phase, all training documents have to
be compared with the test document and ranked based on the similarity values (Sebastiani,
2002).
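The kNN procedure above can be sketched as follows, assuming raw term-frequency vectors and cosine similarity as the user-specified similarity measure. The function names and toy documents are hypothetical, and summing neighbor similarities per class is one of several possible scoring schemes.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(test_tokens, training, k=3):
    """Score candidate classes by the summed similarity of the k nearest neighbors."""
    test_vec = Counter(test_tokens)
    # Every training document is compared with the test document, which is
    # why classification time grows with the size of the training set.
    ranked = sorted(
        ((cosine(test_vec, Counter(tokens)), label) for tokens, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = Counter()
    for similarity, label in ranked[:k]:
        scores[label] += similarity
    return scores.most_common(1)[0][0]

training = [
    ("goal match team".split(), "sports"),
    ("team win match".split(), "sports"),
    ("election vote party".split(), "politics"),
    ("party vote senate".split(), "politics"),
]
print(knn_classify("match team goal".split(), training, k=3))
```

Returning several top-scoring entries of `scores` rather than one would correspond to the multi-label use mentioned above.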
Support vector machines (SVMs) are derived from statistical learning theory. The
basic idea is to find a decision surface in the vector space that separates the document
vectors into two classes with a maximum margin (Joachims, 2001). This can be applied
to both linearly separable data sets and linearly non-separable ones. As argued by
Joachims (2001), SVMs are well suited for binary text classification tasks with very small
training sets because of the properties of natural language texts. SVMs can handle the
high dimensionality of textual documents and avoid the term selection process
(Sebastiani, 2002).
The Rocchio method is adapted from Rocchio's algorithm for relevance feedback in
the vector space retrieval model. This method is used for profile-style classifiers. A
profile of a document class, also called a centroid, is the prototypical document of that
class, which can be represented as a list of weighted terms. The classification is conducted
by controlling the relative importance of positive and negative test documents (Sebastiani,
2002). The class membership of a given document is determined by the similarity of the
class centroids and the document vector using an appropriate threshold. Rocchio
classifiers are more easily understood by human experts, but they suffer from the same
drawback of using linear discriminators as other linear classifiers do (Sebastiani, 2002).
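A profile-style classifier of this kind can be sketched as below. The weights `beta` and `gamma`, which control the relative importance of positive and negative example documents, and the similarity `threshold` are illustrative values chosen for this sketch. Dropping terms with non-positive weight is also an assumption.

```python
import math
from collections import Counter, defaultdict

def build_profile(pos_docs, neg_docs, beta=16.0, gamma=4.0):
    """Build a class centroid as a dict of weighted terms.

    beta and gamma are illustrative settings, not prescribed values.
    """
    profile = defaultdict(float)
    for doc in pos_docs:
        for term, freq in Counter(doc).items():
            profile[term] += beta * freq / len(pos_docs)
    for doc in neg_docs:
        for term, freq in Counter(doc).items():
            profile[term] -= gamma * freq / len(neg_docs)
    # Assumption: terms pushed to zero or below are dropped from the profile.
    return {term: w for term, w in profile.items() if w > 0}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def belongs_to_class(doc, profile, threshold=0.3):
    """Class membership is decided by thresholding similarity to the centroid."""
    return cosine(Counter(doc), profile) >= threshold

sports_profile = build_profile(
    pos_docs=["goal match team".split(), "team win match".split()],
    neg_docs=["election vote party".split()],
)
print(belongs_to_class("match team".split(), sports_profile))
```

Because the profile is simply a list of weighted terms, it can be printed and inspected directly, which reflects the interpretability noted above.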
A neural network (NN) classifier is a network of units. This method consists of at least
two layers: an input layer and an output layer. A two-layer configuration is called a
perceptron. In more complex networks, there is at least one additional hidden layer. The
classifier takes term weights of a document as the input and outputs the probabilities of
the classes for the given document. This can solve both binomial and multinomial
classification problems. For a multinomial classification task, this method produces a
network with multiple outputs, each corresponding to one class. Feature selection is
required because neural networks can easily suffer from the problem of overfitting
(Joachims, 2001).
Decision tree classification refers to a class of human-interpretable symbolic algorithms.
A decision tree classifier is a tree in which internal nodes represent document attributes
and leaves represent the classification. If the tree has multiple outputs at each stage, this
tree can deal with multi-class classification. The structure of the tree and the selected
attributes are very important for the performance of the classifier. The selection of
attributes is typically determined by an information gain or entropy criterion. After the
decision trees have been created, the attributes and decision rules can be modified by [...]
Table 2: Comparison of classification methods
The classification methods described above are the most commonly used ones in
automatic text classification. Each of them has some unique advantages. Some of them
are easier to implement, such as kNN and decision tree classifiers. Others are more
complex, but they are more robust and adaptive, such as SVMs. These methods also
have shortcomings. The linear classifiers, such as Rocchio, may have the centroid of
a class falling outside the clustered documents. To further enhance the performance of
centralized classification systems, researchers developed a hybrid classification
method, called classifier committees. The basic assumption is that a combination of
judgments from multiple experts is better than any single one (Sebastiani, 2002). The
effectiveness of a classifier committee depends on how the classifiers are combined. The
combinations include majority voting, weighted linear combination, dynamic classifier
selection, and adaptive classifier combination. A special case of classifier committees,
called boosting, has yielded successful results in recent experiments (Sebastiani, 2002).
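The simplest combination rules can be sketched as follows. Each committee member is modeled as a function that maps a document to a class label. The three toy classifiers and the weights are hypothetical, and `weighted_vote` is a simplified stand-in for weighted linear combination: it adds a fixed weight per member vote rather than combining real-valued classifier scores.

```python
from collections import Counter

def majority_vote(classifiers, document):
    """Each committee member casts one vote; the most frequent label wins."""
    votes = Counter(clf(document) for clf in classifiers)
    return votes.most_common(1)[0][0]

def weighted_vote(classifiers, weights, document):
    """Simplified weighted combination: each vote counts with the member's weight."""
    scores = Counter()
    for clf, weight in zip(classifiers, weights):
        scores[clf(document)] += weight
    return scores.most_common(1)[0][0]

# Hypothetical committee members, for illustration only.
def by_keyword(doc):
    return "sports" if "match" in doc else "other"

def by_length(doc):
    return "sports" if len(doc) > 3 else "other"

def always_other(doc):
    return "other"

committee = [by_keyword, by_length, always_other]
doc = "the match ended late".split()
print(majority_vote(committee, doc))
print(weighted_vote(committee, [0.2, 0.2, 0.9], doc))
```

In practice the weights would reflect each member's measured effectiveness on validation data; here they are arbitrary.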
3.1.3 Evaluation metrics for text classification
Evaluation of text classification is important for the optimization and comparison of
different classification methods. The evaluation process is often conducted through
experiments instead of theoretical analysis because typically the performance of text
classification depends on domain-specific problems (Sebastiani, 2002). The quality of a
text classification method is usually measured by its effectiveness. Researchers in the
information retrieval (IR) and machine learning (ML) communities have explored various
effectiveness measurements for classification tasks. Lewis (1995) estimated and
optimized the performance of classification systems using different families of single
effectiveness measures and concluded that the selection of effectiveness measures
should be determined by the specific classification task. Joachims (2001)
summarized the most commonly used effectiveness measures for evaluating text
classification systems. Sebastiani (2002) discussed various effectiveness measures as well
as some alternatives to effectiveness, such as efficiency and utility. The most
commonly used effectiveness measures include precision, recall, precision/recall
breakeven point, F-measure, and error rate.
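The following discussion about the effectiveness measures is based on binary
classification, and the classification decisions are represented as a contingency table
(Table 3).
                            Expert decision: Yes    Expert decision: No
Classifier decision: Yes    TP (true positives)     FP (false positives)
Classifier decision: No     FN (false negatives)    TN (true negatives)
Table 3: Contingency table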
Precision and recall are two primary effectiveness measures for text classification and
many other measures are built on them. They are adapted from information
retrieval to fit the need for measuring text classification tasks. Recall is the proportion of
documents that the classifier recognizes as class members among all the documents of
that class. Precision is the proportion of documents decided by the classifier as belonging
to individual classes that are truly class members (Lewis, 1995). Based on the
contingency table, recall and precision are defined as follows:
\[ \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP} \]
where TP is the number of documents correctly assigned to the class, FP is the number of
documents incorrectly assigned to the class, and FN is the number of class members that
the classifier failed to assign.
There are always trade-offs between recall and precision. Very high levels of one
measure can be easily achieved without considering the other one. Using just one of the
measures cannot provide a complete view of the effectiveness of classification
performance. One solution is to use a combined effectiveness measure, such as $F_\beta$ or the
precision/recall breakeven point (PRBEP). $F_\beta$ is defined as follows:
\[ F_\beta = \frac{(\beta^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \]
where $\beta$ represents the relative importance of recall and precision. Usually, a value of
$\beta = 1$ is used to give equal weight to recall and precision (Sebastiani, 2002). Another
method of combining the two measures is using a single value called the precision/recall
breakeven point, where recall and precision are equal (Joachims, 2001). There are a
couple of additional measures, namely error rate and eleven-point average precision,
which can also be used for calculating classification effectiveness. However, these
measures are not widely used.
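The measures discussed in this section can be computed directly from the cells of a binary contingency table. The following is a small sketch; the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision(tp, fp):
    """Share of the classifier's positive decisions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of true class members that the classifier found."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_beta(p, r, beta=1.0):
    """Combined measure; beta = 1 weights precision and recall equally."""
    denominator = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denominator if denominator else 0.0

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 40, 10, 20
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))
```

With these counts, precision is 40/50 and recall is 40/60, so the example illustrates how a single combined score sits between the two when they differ.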
3.1.4 Test Collections
Text classification methods are typically evaluated using standard test collections. Text
classification researchers have compiled and published several standard test collections,
among which the most commonly used ones include Reuters-21578, OHSUMED, the
20 Newsgroups collection, and RCV1.
• The Reuters-21578 collection is a newswire corpus of 9,603 training documents
and 3,299 test documents when the ModApte split is used. The distribution of the
documents across the categories is so highly skewed that only 57 of the 135
categories in this collection have at least 20 document occurrences.
• The OHSUMED collection, developed by Hersh et al. (1994), is a MEDLINE subset
consisting of a training set of 54,710 references and a testing set of
293,856 references. The available MEDLINE fields include title, abstract, and
MeSH indexing terms. The relatively large size of this collection makes it suitable
for large-scale experiments, such as TREC. However, since the content of this
collection is highly specialized in the medical domain, domain knowledge is
required to understand the relationships between the documents and the categories.
• The 20 Newsgroups collection consists of about 20,000 messages collected from
20 different newsgroups. Messages from each of the 20 newsgroups were chosen
at random and are almost evenly distributed across the newsgroups.
• The RCV1 collection is so far the latest and largest test collection for text
classification. About 810,000 manually categorized newswire stories from
Reuters, Ltd. are split into a training set of 23,149 documents and a test set of
781,265 documents. Documents from 103 categories cover topics of corporation,
economics, government, and markets.
3.1.5 Centralized Text Classification Procedure
A centralized text classification experiment is typically conducted following a standard
procedure that includes data collection preparation, classifier training, and classifier
evaluation (Yang & Liu, 1999; Sebastiani, 2002; Lewis, Yang, Rose, & Li, 2004).
In the preparation of the data collection, one of the standard test collections discussed above
or a custom-compiled test collection is chosen based on the purpose of the classification
experiment. The data collection is usually split into a set of training documents and a set
of testing documents. Then, the data collection is filtered to remove stop words, digits,
and punctuation. The final step in data preparation is feature selection, which aims to
remove non-informative words based on data collection statistics (Yang & Pedersen,
1997). Feature selection methods including document frequency thresholding (DF),
information gain (IG), mutual information (MI), the χ² statistic (CHI), and term strength (TS)
have shown strong correlation in their performance (Yang & Pedersen, 1997). Since
the details of the feature selection methods are beyond the scope of this study, readers are
referred to the previous research for a more in-depth discussion.
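Two of the feature selection criteria named above can be illustrated with a short sketch: document frequency thresholding and the χ² statistic computed from the two-by-two term/class contingency counts. The function names, the `min_df` value, and the toy documents are assumptions for illustration only.

```python
from collections import Counter

def df_threshold(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term for term, n in doc_freq.items() if n >= min_df}

def chi_square(docs, labels, term, cls):
    """Chi-square statistic measuring the dependence between a term and a class."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if has_term and label == cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif label == cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denominator if denominator else 0.0

docs = [["goal", "match"], ["match", "team"], ["vote", "party"], ["party", "senate"]]
labels = ["sports", "sports", "politics", "politics"]
print(sorted(df_threshold(docs, min_df=2)))
print(chi_square(docs, labels, "match", "sports"))
```

Terms would then be ranked by their score and only the highest-ranked ones retained as features; the cut-off is a tuning decision.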
During classifier training, the chosen classification algorithm is trained using the set
of training documents. This training process usually includes parameter tuning and is
repeated until the minimum training error is obtained. The trained classifier is then used
in the classifier evaluation to classify the set of testing documents. The labels of the
classification results are compared with the pre-defined labels of the testing documents
and the performance is measured by the commonly used precision, recall, and F scores.
For example, Yang and Liu (1999) applied support vector machines (SVMs) to the
Reuters-21578 collection and produced the results shown in Table 4. A more recent experiment
by Lewis and his colleagues (2004) using a much larger collection produced a new
benchmark for centralized text classification (see Table 5).
Micro Recall Micro Precision Micro F1 Macro F1
3.2 Text classification using a multi-agent framework
Text classification using a distributed approach, particularly a multi-agent framework,
offers a solution to address the challenges of centralized text classification and improve
the quality of classification. This section introduces the multi-agent paradigm, its
distinction from other similar computing paradigms, and its recent applications in
information systems. Text classification using a multi-agent framework will be compared
with centralized text classification, and its advantages over centralized text classification
will be pointed out.
3.2.1 Multi-agent paradigm
In recent years, agent-based systems, particularly as implemented in a distributed
framework, have attracted considerable interest. Agent-based systems have evolved
from single-agent systems to multi-agent systems (MASs). Multi-agent frameworks have
been developed by the Distributed Artificial Intelligence community. A multi-agent
framework is a loosely coupled network of problem solvers that work together to solve
problems that are beyond their individual capabilities (Durfee & Montgomery, [year illegible]).
3.2.2 Differences between multi-agent systems and other concurrent systems
Parallel computing, distributed computing, and peer-to-peer (P2P) computing have
offered different means to improve the overall performance of information systems.
Although multi-agent systems share some common characteristics with those traditional
computing paradigms, it is important to point out the differences.
In parallel systems, multiple processors, each working on a different part of the same
problem, work simultaneously to solve a single problem. Different processors are
coordinated through a central processor, which collects the user input and merges the
results (Baeza-Yates and Ribeiro-Neto, 1999). The components of parallel systems,
which are simply homogeneous processors with no distinct expertise, operate in a static
environment (Wooldridge, 2002).
Distributed systems are very similar to parallel systems. Baeza-Yates and Ribeiro-Neto
(1999) define a distributed system as "multiple computers connected by a local or wide
area network cooperate to solve a single problem." Compared to parallel systems, the
communication cost in distributed systems is much higher between processors, which in
nature can be heterogeneous. Accomplishing a task in a distributed system often
involves a subset of the processors instead of all processors in the system (Baeza-Yates
and Ribeiro-Neto, 1999). In contrast to distributed systems, multi-agent systems have two
main distinctions (Wooldridge, 2002). First, synchronization and coordination have to be
done at run-time. Second, agents are self-interested entities, which cannot be assumed to
share a common goal.
The computing paradigm is shifting from the traditional client/server model to the
peer-to-peer (P2P) model. Some IR researchers consider P2P as "a way to leverage vast
amounts of computing power, storage, and connectivity from personal computers
distributed around the world" (Zeinalipour-Yazti, Kalogeraki, & Gunopulos, 2004).
Compared to parallel systems and distributed systems, current P2P systems have the
following two characteristics. First, peers operate in a dynamic environment without any
centralized coordination (Zeinalipour-Yazti, Kalogeraki, & Gunopulos, 2004). Peers can
join and leave the system at any time. Second, many current P2P systems are not completely
decentralized. For example, Napster has to maintain a directory in a central server. However,
there are some emerging P2P systems working in a completely distributed manner, such as
pSearch (Tang, Xu, and Dwarkadas, 2003).
Compared with other concurrent systems, a multi-agent system (MAS) has additional distinct
characteristics (Jennings, Sycara, & Wooldridge, 1998). First, a MAS is composed of
multiple autonomous components. Second, the operating entities, namely agents, are
intelligent. Third, knowledge is decentralized among the agents. Each agent has its own
distinct and incomplete knowledge to solve a problem. Fourth, agents operate in a
completely open and distributed environment. Fifth, there is no global system control.
Sixth, data is decentralized. Finally, computation is asynchronous in a MAS.
3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm
As discussed in section 3.2.2, the multi-agent paradigm and the peer-to-peer (P2P) paradigm
are different decentralized computing paradigms, but in both paradigms the computing
units, either peers or agents, have to interact with each other to complete a task. In the
P2P paradigm, one of the key interactions is to search for the user-requested files among the
connected peers. Search techniques for information retrieval in the P2P paradigm cover a
variety of approaches including centralized indexing, query flooding, breadth-first search
(BFS), document routing search, and selective intelligent search (Tang, Xu, &