
AUTOMATIC TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK

Yueyu Fu

Submitted to the faculty of the University Graduate School

in partial fulfillment of the requirements

for the degree Doctor of Philosophy

in the School of Library and Information Science,

Indiana University

October 2006


Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Charles Davis, Ph.D.

Kiduk Yang, Ph.D.

David Leake (Computer Science, minor), Ph.D.


© 2006 Yueyu Fu ALL RIGHTS RESERVED


DEDICATION

To my beloved parents Guanghui Fu and Lan Chen, my dear wife Wenjie Sun, and my grandparents, for their unconditional love and encouragement.


ACKNOWLEDGMENTS

I feel so grateful to the numerous people who generously provided me the guidance, support, and encouragement to complete this dissertation.

First and foremost, I would like to thank Dr. Javed Mostafa, my committee chair, for his professional and personal guidance that goes far beyond his responsibilities. It is his patient guidance, sharp mind, and gentle encouragement that led me to the achievement I have today.

Special thanks also go to the rest of my committee, Dr. Charles Davis, Dr. Kiduk Yang, and Dr. David Leake, for their insightful comments and enduring support during the entire process of my dissertation research.

I would like to thank my colleagues and the staff at Indiana University, especially Weimao Ke, Kazuhiro Seki, Mary Kennedy, Arlene Merkel, Erica Bodnar, and Rhonda Spencer, for their kind help and support throughout all these memorable years in Bloomington.

Finally, I must express my deepest gratitude to my parents, Guanghui Fu and Lan Chen, for opening my eyes to the world and encouraging me to pursue my career abroad, and to my beloved wife, Wenjie Sun, for making our family full of joy, support, and understanding.


Automatic text classification is an important operational problem in information systems. Most automatic text classification efforts so far have concentrated on developing centralized solutions. However, centralized classification approaches are often limited due to constraints on knowledge and computing resources. To overcome the limitations of centralized approaches, an alternative distributed approach based on a multi-agent framework is proposed. Three major challenges associated with distributed text classification are examined: 1) coordinating classification activities in a distributed environment, 2) achieving high quality classification, and 3) minimizing communication overhead. This study presents solutions to these specific challenges and describes a prototype system implementation. As agent coordination is the key component in conducting multi-agent text classification, two agent coordination protocols, namely the blackboard-bidding protocol and the adaptive-blackboard protocol, are proposed in the study.

To analyze the performance of the distributed approach, a comparative evaluation methodology is described, which treats the outcome of a centralized approach as baseline performance. A series of experiments was conducted in a simulation environment. The simulation environment permitted manipulation of independent variables such as scalability and coordination strategy, and investigation of their impact on two critical dependent variables, namely efficiency and effectiveness. There were three critical findings. First, in dealing with automatic text classification the multi-agent approach can achieve improved system efficiency while maintaining classification effectiveness comparable to a centralized approach. Second, the agent protocols were effective in coordinating the text classification activities of distributed agents. Third, the application of content-based adaptive learning for acquiring knowledge about the agent community reduced communication cost and improved system efficiency.


TABLE OF CONTENTS

1 INTRODUCTION 1

1.1 MANUAL CLASSIFICATION 1

1.2 AUTOMATIC CLASSIFICATION 2

1.3 MULTI-AGENT PARADIGM 5

2 PROBLEM STATEMENT 7

2.1 SPECIFIC CHALLENGES 8

2.2 VARIABLES 10

2.3 IMPLICATIONS OF THIS RESEARCH 16

3 LITERATURE REVIEW 18

3.1 AUTOMATIC TEXT CLASSIFICATION 18

3.1.1 Text classification task 19

3.1.2 Text classification methods 20

3.1.3 Evaluation metrics for text classification 24

3.1.4 Test Collections 26

3.1.5 Centralized Text Classification Procedure 27

3.2 TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK 29

3.2.1 Multi-agent paradigm 29

3.2.2 Differences between multi-agent systems and other concurrent systems 29

3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm 31

3.2.4 Recent applications of the multi-agent paradigm 34

3.2.5 Centralized vs Multi-agent text classification 35

3.3 MULTI-AGENT COORDINATION PROTOCOLS 39

3.3.1 Definition of coordination 40

3.3.2 Coordination Protocols 41

3.3.2.1 Organizational Structuring 41

3.3.2.2 Multi-agent planning 43

3.3.2.3 Contract net protocol 44

3.3.2.4 Negotiation 45

4 METHODOLOGY 50

4.1 DATA 50

4.2 DESIGN METHODOLOGY 51

4.2.1 Multi-Agent Community for Text Classification 51

4.2.2 Classification Module 53

4.2.3 Algorithms of Agent Coordination Protocols 55

4.2.4 Proposed Agent Coordination Protocols 59

4.2.4.1 Blackboard-bidding Protocol 59

4.2.4.2 Adaptive-blackboard Protocol 61

4.3 IMPLEMENTATION 65

4.3.1 System Architecture 65

4.3.2 Alternative approach 67

4.4 EVALUATION METHODOLOGY 67

4.4.1 Measurements 67


4.4.1.1 Effectiveness Measurements 67

4.4.1.2 Efficiency Measurements 68

4.4.2 Variables 70

4.4.2.1 Centralized vs Distributed 70

4.4.2.2 Coordination Protocols 71

4.4.2.3 Number of Agents 71

4.4.3 Experimental Settings 72

5 RESULTS 72

5.1 CENTRALIZED VS. DISTRIBUTED 72

5.2 COORDINATION PROTOCOLS 75

5.2.1 Effectiveness 76

5.2.2 Efficiency Measured by Messages 79

5.2.3 Efficiency Measured by Time 82

5.3 NUMBER OF AGENTS 85

5.3.1 Impact of the number of agents on effectiveness 86

5.3.2 Impact of the number of agents on efficiency 89

6 CONCLUSIONS 91

6.1 SUMMARY 91

6.2 FUTURE RESEARCH 95

REFERENCES 97


1 Introduction

Automatic text classification is an important operational problem in information systems. Many tasks in information systems, such as retrieval, filtering, and indexing, can be considered as classification problems. Most text classification efforts so far have concentrated on developing centralized solutions, where data and computation are located on a single computer. However, centralized classification approaches are often limited due to constraints on knowledge and computing resources. In addition, centralized approaches are more vulnerable to attacks or system failures and less robust in dealing with them. This research presents an alternative classification approach, called distributed text classification using a multi-agent framework, where data and computation are distributed across a network of computers.

1.1 Manual Classification

In library and information science, class/classification and category/categorization are sometimes considered distinct terms (Jacob, 2004). Although both are used to organize related entities, the two terms have a fundamental difference. Classification groups entities into mutually exclusive classes based on a set of predefined rules regardless of the context, whereas categorization associates entities solely based on their similarities within a given context (Jacob, 2004). This distinction makes categorization more flexible than classification in organizing similar entities. However, to address a broader audience, this study uses class/classification and category/categorization interchangeably.


One approach to text classification is manual classification, which involves human experts manually classifying documents based on classification rules and subjective judgment. This approach has been used in library practice for many years to organize, index, and retrieve documents. Human experts typically assign each book a code representing a category according to a classification scheme, such as the Dewey Decimal Classification, the Universal Decimal Classification, or the Library of Congress Classification. A recent application of this approach on the web is the Yahoo Directory, which organizes web pages into a hierarchical structure.

The main challenge of manual classification is its demand on resources. Manual classification is a time-consuming process that relies heavily on domain knowledge. It requires a significant investment of time from many human experts with knowledge of different domains. Another associated problem is that subjective judgments can generate inconsistent classification results. Because of these limitations, manual classification works best for relatively small document collections.

1.2 Automatic Classification

To address the problems of manual classification, researchers have explored automatic text classification as an alternative approach. Using machine learning techniques, automatic text classification assigns documents to a set of pre-defined categories. This approach has been applied in many areas, such as patent classification, news delivery, and email spam filtering. In contrast to manual classification, automatic classification offers the advantages of automation, efficiency, and consistency.


In automatic classification, documents are typically classified by a single classification software system running on a single machine. This is also called centralized text classification. Significant efforts have focused on developing document classifiers in a centralized manner, and various classification algorithms have been developed to improve the performance of centralized classification systems. The advantages of centralized classification stem from the centralized architecture. Because data and computing resources are located in the same place, the management of the classification task is easy and the classification speed is fast. Since the communication in centralized classification takes place on the same machine, the communication cost is relatively small.

However, as information becomes more distributed and its volume increases exponentially, several critical disadvantages of centralized classification are revealed. The effectiveness of a classification system is mostly determined by the artificial knowledge¹ maintained by the system, which typically comes from training data. Currently, centralized classification systems suffer from the problem of scarcity of local knowledge². The extent of local knowledge is limited by the cost and constraints of storing complete knowledge in a single place, and it is sometimes impossible to collect all the necessary knowledge and store it in a central location. Since the classification system can successfully classify only documents that are within the scope of a limited amount of local knowledge, it is likely to fail when the expansion of its domain (i.e., local knowledge) does not keep up with growing diversity in knowledge. Another disadvantage is that, due to its centralized architecture, centralized classification has only a certain amount of computing power and input/output capacity. When an information system has to handle a large number of documents, the classification component may become a performance bottleneck and suffer from the problem of single point of failure.

¹ Knowledge learned from training documents using machine learning techniques.

² Knowledge maintained by a single classification system.

Distributed text classification, which is an alternative approach to automatic centralized classification, employs a decentralized architecture for organizing knowledge and computing resources. This approach allows multiple classification software systems to collaborate with each other to fulfill the classification task in a distributed computing environment. Distributed classification has several advantages over centralized classification. The distributed architecture offers computational scalability for classification. Mukhopadhyay et al. (2005) demonstrate that classification time decreases dramatically with an increasing number of collaborating classification software systems. Also, not relying completely on a single classification software system allows the overall system to avoid the problem of single point of failure: when one of the classification software systems fails, its tasks can be carried out by an alternative classification software system. Lastly, distributed classification fits the web model better. The Internet is a distributed system, and it offers the opportunity to take advantage of distributed computing paradigms and distributed knowledge resources for classification.

However, distributed classification has some disadvantages. Unlike centralized classification, distributed classification, which consists of multiple independent classification software systems, does not have global control of all the classification activities. Such global control is essential for achieving coherent system performance. Without global control, distributed classification activities can produce conflicting and inconsistent results. Therefore, an alternative mechanism for coordinating distributed classification activities is needed. Another limitation of distributed classification is the large communication overhead. In order for classification software systems to collaborate, they must communicate with each other, and the amount of exchanged information can be very large. For example, Mukhopadhyay et al. (2003) show that the average response time for classification increases almost linearly with the number of classification software systems, which is a direct result of increasing communication overhead. They also show that the classification performance quickly saturates with an increasing number of classification software systems. This latter result points to the potential of improving overall performance by reducing communication overhead.

1.3 Multi-Agent Paradigm

This research employs a multi-agent paradigm for conducting distributed text classification. The multi-agent paradigm evolved from distributed artificial intelligence in the late 1970s, where agents are considered as autonomous, intelligent computer software.

An agent exhibits three major characteristics, namely reactivity, proactiveness, and social ability (Wooldridge & Jennings, 1995). Reactivity refers to the capability of sensing changes in its environment and taking fast corresponding actions. Proactiveness refers to the capability of operating in an active fashion according to its design goal. Social ability refers to the capability of working with other agents. A group of such agents forms a multi-agent system. Durfee and Montgomery (1989) define a multi-agent system (MAS) as a loosely coupled network of problem solvers that work together to solve problems that are beyond their individual capabilities.


For text classification, a multi-agent paradigm offers several critical advantages. According to Sycara (1998), the multi-agent paradigm distributes computing resources and capabilities across a network of agents, which can avoid the single point of failure problem. The modular, scalable architecture of the multi-agent paradigm provides computational scalability and flexibility for agents entering and leaving agent communities. The multi-agent paradigm can also make efficient use of spatially distributed information resources and serve as a solution when expertise is distributed. Because of these advantages, the multi-agent paradigm has been utilized in the design of information retrieval systems and information management systems. However, the applicability of the multi-agent paradigm to text classification has not been thoroughly examined yet.

Agent coordination is a critical component of the multi-agent paradigm. It determines the relationships among the agents in a multi-agent environment and governs the behaviors of the interacting agents. The overall system performance, including both quality and efficiency, depends on the appropriate design of the coordination mechanism. Quality measures the correctness of the system behavior, which is the collective result of the coordinated agents' behaviors. Efficiency measures the time and other system resources consumed, which counts mainly the communication among the coordinated agents. Due to its importance in system performance optimization, agent coordination has been well studied in various domains, such as transportation, economics, and management. As the multi-agent paradigm is applied to text classification, multi-agent coordination will be the focus of this research.

Evaluation of system performance is an essential aspect of the multi-agent implementation plan. As the overall performance of a multi-agent system is a collective result of multiple agents' behaviors, the evaluation measurement needs to be carefully designed.

The evaluation framework typically reflects the system performance at different levels, including the agent level and the overall system level. Also, the evaluation metric covers different aspects of the system performance, including effectiveness and efficiency. Integration of the evaluation metrics of text classification and the multi-agent paradigm may provide us a powerful tool to validate the approach of automatic text classification using a multi-agent framework.

2 Problem Statement

The primary purpose of this study is to investigate automatic text classification using a multi-agent framework. Automatic text classification and the multi-agent paradigm have each been extensively studied over the years. Although problems within each area have been investigated, new problems that arise with the introduction of the multi-agent paradigm into automatic text classification remain mostly unexplored. In this section, three major challenges associated with distributed text classification will be examined and key variables related to these challenges will be discussed.


2.1 Specific Challenges

Distributed text classification differs from centralized text classification because of its distributed architecture. One of the main challenges in distributed text classification is coordinating classification activities in a distributed environment. Unlike centralized classification, which relies on a mediator to ensure the coherence of the overall system performance, distributed classification lacks centralized control, and thus may produce conflicting and inconsistent classification results. Consequently, an effective mechanism for coordinating distributed classification activities is greatly needed. In the multi-agent paradigm, such mechanisms (e.g., agent coordination protocols) have been extensively studied. Agent coordination research has drawn on various domains including artificial intelligence, social science, game theory, and economics. Many agent coordination protocols, such as blackboard and contracting, have been explored in those domains. Although these agent coordination protocols have been successfully applied in many domains, they have not been seriously studied in information science, particularly for text classification. The question is whether these agent coordination protocols will work well for the classification task. Different coordination protocols will be explored for designing suitable coordination mechanisms for text classification.
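As a purely illustrative aside, the minimal sketch below shows one generic way a blackboard-style protocol could coordinate classification agents: documents are posted to a shared blackboard, each agent bids according to its own confidence, and the highest bidder performs the classification. The agent classes, the overlap-based bid heuristic, and the highest-bid assignment rule are all assumptions made for this sketch; they are not the blackboard-bidding or adaptive-blackboard protocols proposed later in this study.

```python
# Illustrative sketch only: a generic blackboard-style coordination loop.
# The agent classes, scoring heuristic, and bidding rule are assumptions,
# not the coordination protocols defined in this dissertation.

class ClassifierAgent:
    def __init__(self, name, known_classes):
        self.name = name
        self.known_classes = known_classes   # the slice of global knowledge this agent holds

    def bid(self, document):
        """Return a confidence score for handling this document (toy heuristic)."""
        words = set(document.lower().split())
        # Toy confidence: overlap between the document and the agent's class vocabularies.
        return sum(len(words & vocab) for vocab in self.known_classes.values())

    def classify(self, document):
        words = set(document.lower().split())
        # Pick the known class whose vocabulary overlaps most with the document.
        return max(self.known_classes, key=lambda c: len(words & self.known_classes[c]))


def blackboard_round(blackboard, agents):
    """Each posted document goes to the highest-bidding agent."""
    results = {}
    for doc_id, text in blackboard.items():
        best_agent = max(agents, key=lambda a: a.bid(text))
        results[doc_id] = (best_agent.name, best_agent.classify(text))
    return results


agents = [
    ClassifierAgent("sports-agent", {"sports": {"game", "team", "score"}}),
    ClassifierAgent("finance-agent", {"finance": {"market", "stock", "bank"}}),
]
blackboard = {"d1": "The team won the game with a record score",
              "d2": "The stock market fell as the bank reported losses"}
print(blackboard_round(blackboard, agents))
```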

Another challenge in distributed classification is achieving high quality of classification in multi-agent environments. In automatic centralized classification, quality of classification is mainly influenced by the quality of the test collection and the classification algorithm. Most efforts so far have concentrated on developing new methods to improve classification performance. Several classification methods, such as Support Vector Machines and k-Nearest Neighbor, have been applied in centralized environments. In multi-agent environments, the classification task is distributed across a network of classification agents. The classification process involves the actual classification conducted by individual agents, the interactions among agents, and the merging of individual classification results. Whether those well-established classification methods are applicable in multi-agent environments has to be examined. To validate the performance of these classification methods in multi-agent environments and identify suitable classification methods for distributed classification, a thorough evaluation of the quality of distributed classification needs to be conducted. The evaluation may cover the comparison among different classification methods and the comparison between distributed classification and centralized classification. The result of such an evaluation may tell us whether certain distributed classification approaches can achieve satisfactory quality of classification.

Minimizing communication overhead in distributed classification without compromising quality of classification is yet another challenge. Communication is a key issue in distributed classification, where agents exchange information, interact with each other, and work together through the means of communication to achieve satisfactory quality of classification. In such an environment, the amount of communication greatly affects system efficiency. Consequently, an appropriate agent coordination protocol that governs the agents' communication behavior in an effective and efficient manner may ensure high quality of classification and reduce communication overhead. The key objective of the agent coordination protocol is to balance quality of classification and system efficiency. To achieve such a balance, an evaluation procedure has to be established to measure quality of classification and system efficiency in multi-agent environments. However, there is no standard evaluation framework to fulfill this goal, so an evaluation framework for measuring system efficiency needs to be established. To summarize, the three main challenges are: 1) coordinating classification activities in a distributed environment, 2) achieving high quality of classification in multi-agent environments, and 3) minimizing communication overhead in distributed classification without compromising quality of classification.

2.2 Variables

This research will be conducted using an experimental study design. The study will explore the applicability of different classification methods in multi-agent environments. The result of this exploration will help researchers choose appropriate classification methods in distributed computing environments and design new methods for distributed classification. The primary focus of this study will be to investigate the coordination of distributed classification activities in multi-agent environments. A comparative study of different coordination protocols for multi-agent classification will help in identifying the best coordination protocol, which can achieve satisfactory classification performance with acceptable communication overhead. A comparative study between centralized classification and distributed classification will also be conducted to evaluate the performance of the distributed classification approach. The evaluation framework will draw on centralized classification research, and new approaches will be developed that are uniquely suitable for distributed classification environments. To carry out this study and address the three challenges discussed above, three variables will be studied: quality of classification, system efficiency, and agent granularity.

One of the main goals is to achieve satisfactory classification performance in a multi-agent environment. Therefore, quality of classification must be taken into consideration throughout the study. The quality of classification refers to the accuracy of a completed classification task. In contrast to centralized text classification, quality of classification in a multi-agent context is determined not only by the performances of individual classifiers, but also by the agent coordination protocol. Researchers in the information retrieval and machine learning communities have tested various effectiveness measurements for classification tasks. Lewis (1995) demonstrated using different families of single effectiveness measures to estimate and optimize the performance of classification systems. Joachims (2001) summarized the most commonly used effectiveness measures for evaluating text classification systems. In this study, precision, recall, and the F measure have been chosen to measure quality of classification.

Figure 1.1 Precision


Figure 1.2 Recall

A study by Mukhopadhyay et al. (2005) demonstrates how these evaluation measures can be applied in a multi-agent environment. The study, which shows that as the number of agents increases, precision drops (see Figure 1.1) while recall increases (see Figure 1.2), proposes that quality of classification must be evaluated at both the system level and the individual agent level. In addition to the overall performance evaluation at the system level, the classification performance of individual agents will help in understanding each agent's behavior and relationships with other agents. In the study, the quality of classification at the system level is calculated by averaging the corresponding measurement scores across all agents. For each category, classification decisions can be represented as a contingency table as follows:

                            Expert decision: Yes     Expert decision: No
Classifier decision: Yes    TP (true positives)      FP (false positives)
Classifier decision: No     FN (false negatives)     TN (true negatives)

Table 1: Contingency table


Based on the contingency table, recall and precision are defined as follows:

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

Typically, both micro-averaging and macro-averaging methods are applied to calculate the average scores. In the micro-averaging method, precision and recall are computed based on a "global" contingency table obtained by summing the individual contingency tables of all classes. In the macro-averaging method, precision and recall are computed by averaging the precision and recall scores of all categories (Sebastiani, 2002). The micro-averaging and macro-averaging scores reflect the classification performance on different categories. Yang and Liu (1999) note that micro-averaging gives equal weight to each item (e.g., document) and can be dominated by large (common) categories, whereas macro-averaging gives equal weight to each category, so small (rare) categories can unduly influence the score.
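As a concrete illustration of the two averaging schemes, the short sketch below computes micro- and macro-averaged precision and recall from per-category TP/FP/FN counts; the category names and counts are made-up numbers, not results from this study.

```python
# Minimal sketch: micro- vs. macro-averaged precision and recall.
# The per-category counts below are made-up numbers for illustration only.
counts = {
    "acq":   {"TP": 90, "FP": 10, "FN": 20},   # a large (common) category
    "cocoa": {"TP": 2,  "FP": 1,  "FN": 8},    # a small (rare) category
}

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# Micro-averaging: sum the contingency tables first, then compute the measures.
tp = sum(c["TP"] for c in counts.values())
fp = sum(c["FP"] for c in counts.values())
fn = sum(c["FN"] for c in counts.values())
micro_p, micro_r = precision(tp, fp), recall(tp, fn)

# Macro-averaging: compute the measures per category, then average them.
macro_p = sum(precision(c["TP"], c["FP"]) for c in counts.values()) / len(counts)
macro_r = sum(recall(c["TP"], c["FN"]) for c in counts.values()) / len(counts)

print(f"micro P={micro_p:.3f} R={micro_r:.3f}  macro P={macro_p:.3f} R={macro_r:.3f}")
```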

In this study, the main goal is not only to achieve high quality of classification, but also to attain acceptable system efficiency. The efficiency of a multi-agent system largely depends on its communication overhead. Agent interaction and coordination, and the agent environment, influence communication overhead in multi-agent systems. Different agent coordination protocols produce different amounts of communication overhead. This variable, efficiency, will help in identifying appropriate coordination protocols which can achieve acceptable communication overhead. Efficiency here refers to the time spent during the completion of a classification task. A study by Mukhopadhyay et al. (2003) shows that as the number of agents increases, the communication overhead increases almost linearly (see Figure 2.1) while the classification performance quickly saturates (see Figure 2.2). This result shows the possibility of achieving satisfactory classification performance with reduced communication overhead by interacting with fewer agents. Efficiency can be measured at both the system level and the individual agent level. Efficiency at the system level represents the time spent to classify all the documents in the multi-agent environment. Efficiency at the agent level represents the time that an individual agent spends to classify its own documents.

Figure 2.1 Average response time


Figure 2.2 Number of successful classifications

Agent granularity refers to the amount of knowledge possessed by an agent. Agent granularity has an impact not only on agents' classification capabilities, but also on efficiency at the system level. Each agent possesses a certain amount of knowledge, which is a proportion of the complete global knowledge. In an extreme case, each agent has only the knowledge of one class. When the total number of classes is fixed, the number of agents decreases as the number of classes possessed by each agent increases. Theoretically, the classification capability of each agent is enhanced as the number of classes increases, because the probability of a document being classified by such an agent increases. Also, with increased knowledge, the communication overhead decreases because there are fewer agents and less need for coordination (see Figure 3.1 & Figure 3.2).


Figure 3.1 Precision

Figure 3.2 Average response time

2.3 Implications of this Research

This research has been developed to investigate automatic text classification using a multi-agent framework. Automatic text classification has not been seriously tested in multi-agent environments. Although some preliminary analyses were conducted (Mukhopadhyay et al., 2003; 2005), they were limited in scope. The findings of this study will reveal the advantages and disadvantages of conducting text classification using a multi-agent framework in a more comprehensive manner. These findings may be useful in choosing classification solutions between centralized approaches and distributed approaches in different scenarios. For example, distributed classification can serve as an alternative when centralized classification cannot be realized. Different classification methods will be tested in the multi-agent classification environment. The results will be useful for identifying classification methods for distributed classification tasks and inspiring researchers to design new classification methods for distributed computing environments. Different approaches for addressing the agent coordination problem will be evaluated and compared. The findings may help in designing appropriate coordination protocols for distributed classification. The evaluation framework based on centralized classification will be tested in the multi-agent environment. The findings will help in better understanding the differences between centralized classification and distributed classification and may contribute to the establishment of an evaluation framework for distributed classification.

Ultimately, my goal is to make a scholarly contribution to the areas of text classification and multi-agent systems and produce findings that will be of interest to both practitioners and researchers. At the practical level, the proposed approach can facilitate the sharing of activities among different libraries. For example, if several small libraries adopt the proposed approach and build a multi-agent community, they can help each other without having to maintain comprehensive capacity for classifying materials. At the research level, this proposed research may be of interest to researchers interested in advancing the state of the art in automated text classification.

3 Literature Review

This research investigates an alternative classification approach, namely distributed text classification conducted using a multi-agent framework. Three major challenges associated with distributed text classification are examined: 1) coordinating classification activities in a distributed environment, 2) achieving high quality of classification, and 3) minimizing communication overhead. This chapter reviews literature on these and related problem areas. Different approaches for addressing the three major problem areas are compared and potential solutions are proposed.

3.1 Automatic text classification

The section below mostly draws on centralized text classification research and provides the basic components for building distributed text classification approaches. Although distributed classification is different from centralized classification, each classification agent itself is still a relatively independent classification unit. This section reviews classification tasks, classification methods, test collections, and the measurements for quality of classification, which is hoped to contribute to a better understanding of the challenges associated with distributed text classification.


3.1.1 Text classification task

Automatic text classification assigns textual documents to classes using rules or patterns learned from a set of pre-classified documents. Sebastiani (2002) defines automatic text classification as a process of assigning natural language documents to predefined semantic classes. Generally, the text classification task can be defined as follows: given a set of pre-classified documents, learn the classification rules or patterns and find the correct classes for a new document.

Binary classification is the simplest and most fundamental operation in text classification. In binary classification, there are only two classes. For example, when a user submits a query term to an online library catalog, the library items are divided into two classes: items in one class contain the query term in their indices, and items in the other class do not. Email spam detection is another example; emails are classified into two classes, spam and non-spam. A more complex task is multi-class classification. In this scenario, documents are classified into more than two classes. For example, news documents are classified into one of several pre-defined classes in news filtering applications; in email routing, emails are directed to one of a few different folders or recipients based on their content and metadata. Another scenario is called multi-label classification (Joachims, 2002). In multi-label classification, the mapping between documents and classes is not restricted to a one-to-one mapping. A document, which can belong to multiple classes, can be classified as a member of those classes at the same time. In the news filtering example above, one news document could be about the 2008 Olympics in China and could be classified into both the International and Sports categories.
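The three task variants differ only in the shape of the classifier's output; the toy sketch below makes that distinction explicit. The document, labels, and stand-in classifiers are illustrative assumptions, not drawn from the test collections discussed later.

```python
# Illustrative only: the shape of the classifier's output distinguishes the tasks.
doc = "Beijing prepares stadiums for the 2008 Olympic Games"

def binary_classifier(d):      # one class, a yes/no decision
    return "olympic" in d.lower()

def multiclass_classifier(d):  # exactly one label from a fixed set (stand-in rule)
    return "Sports"

def multilabel_classifier(d):  # any subset of the labels may apply (stand-in rule)
    return {"International", "Sports"}

print(binary_classifier(doc), multiclass_classifier(doc), multilabel_classifier(doc))
```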

3.1.2 Text classification methods

Most efforts in automatic text classification have concentrated on developing centralized classifiers to improve classification performance. Various approaches have been explored, including probabilistic models, symbolic algorithms, and artificial intelligence approaches. Several text classification methods will be briefly described here. It is hoped that some of these methods will be suitable for multi-agent environments.

Naïve Bayes (NB) classification is based on a probabilistic model of text. This method is called naïve Bayes because it assumes that the word occurrences in a document are independent of one another. By applying Bayes' theorem and assuming word independence, the probabilities of classes for a given document are estimated with a relatively simple formula. Estimating the word probability distribution for each class, which is typically done by analyzing the training documents, is a key step. Although the naïve Bayes classifier is typically used in binary classification, a multinomial model can be applied to solve the multi-class problem by ranking the classes or thresholding the probabilities of class membership given a document (Sebastiani, 2002). Naïve Bayes classifiers are fast compared to other classification methods. One disadvantage of naïve Bayes classifiers is their limited discrimination capability, which depends only on the occurrences of words in the training documents (Chakrabarti, 2002).
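A minimal multinomial naïve Bayes scorer is sketched below; the tiny training data, the add-one smoothing constant, and the log-space scoring are illustrative assumptions rather than the exact formulation used in any particular study.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial naive Bayes sketch (add-one smoothing is an assumption).
def train_nb(labeled_docs):
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        # log P(class) + sum of log P(word | class) under the word-independence assumption
        score = math.log(n_docs / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

train = [(["market", "stock", "bank"], "finance"),
         (["team", "game", "score"], "sports")]
model = train_nb(train)
print(classify_nb(["stock", "game", "market"], *model))
```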


K-nearest neighbor (kNN) classification is based on the assumption that similar documents are likely to belong to the same class (Joachims, 2001). Based on user-specified similarity measures, this method finds the k nearest neighbors for a given test document. The candidate classes are scored by the similarity of the k nearest neighbors to the test document, and the class of the test document is determined by the majority of the k nearest neighbor documents. The top-ranked candidate classes can be used for multi-label classification. kNN classifiers do not have the drawback of linear classifiers, which divide the document space with linear boundaries. However, the classification time of this method is expensive: during the classification phase, all training documents have to be compared with the test document and ranked based on the similarity values (Sebastiani, 2002).
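The scheme can be sketched in a few lines; the cosine similarity measure, the value k = 3, and the toy term-frequency vectors below are assumptions made only for illustration.

```python
import math
from collections import Counter

# Minimal kNN sketch: cosine similarity over term-frequency vectors (assumed choices).
def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(test_vec, training, k=3):
    # Rank every training document by similarity to the test document.
    ranked = sorted(training, key=lambda item: cosine(test_vec, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(top_labels).most_common(1)[0][0]

training = [
    (Counter({"stock": 2, "market": 1}), "finance"),
    (Counter({"bank": 1, "market": 2}), "finance"),
    (Counter({"team": 2, "game": 1}), "sports"),
    (Counter({"score": 1, "game": 2}), "sports"),
]
print(knn_classify(Counter({"game": 1, "team": 1}), training, k=3))
```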

Support vector machines (SVM) are derived from statistical learning theory. The basic idea is to find a decision surface in the vector space that separates the document vectors into two classes with a maximum margin (Joachims, 2001). This can be applied to both linearly separable data sets and linearly non-separable ones. As argued by Joachims (2001), SVM is well suited for binary text classification tasks with very small training sets because of the properties of natural language texts. SVMs can handle the high dimensionality of textual documents and avoid the term selection process (Sebastiani, 2002).

The Rocchio method is derived from Rocchio's algorithm for relevance feedback in the vector space retrieval model. This method is used for profile-style classifiers. A profile of a document class, also called a centroid, is the prototypical document of that class, which can be represented as a list of weighted terms. The profile is built by controlling the relative importance of positive and negative training examples (Sebastiani, 2002). The class membership of a given document is determined by the similarity of the class centroids and the document vector, using an appropriate threshold. Rocchio classifiers are more easily understood by human experts, but they suffer from the same drawback of using linear discriminators as other linear classifiers do (Sebastiani, 2002).
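A bare-bones centroid classifier in the Rocchio spirit is sketched below; building each centroid as a simple average of positive training vectors and classifying by nearest centroid is a simplification that deliberately omits the weighting of negative examples mentioned above.

```python
import math
from collections import Counter, defaultdict

# Minimal centroid-classifier sketch (positive examples only; the full Rocchio
# weighting of negative examples is omitted for brevity).
def build_centroids(labeled_docs):
    sums, counts = defaultdict(Counter), Counter()
    for vec, label in labeled_docs:
        sums[label].update(vec)
        counts[label] += 1
    return {label: Counter({t: v / counts[label] for t, v in total.items()})
            for label, total in sums.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(vec, centroids, threshold=0.1):
    label, sim = max(((l, cosine(vec, c)) for l, c in centroids.items()), key=lambda x: x[1])
    return label if sim >= threshold else None   # the threshold decides class membership

training = [(Counter({"stock": 2, "market": 1}), "finance"),
            (Counter({"game": 2, "team": 1}), "sports")]
centroids = build_centroids(training)
print(classify(Counter({"market": 1, "stock": 1}), centroids))
```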

A neural network (NN) classifier is a network of units. This method consists of at least two layers: an input layer and an output layer. A two-layer configuration is called a perceptron. In more complex networks, there is at least one additional hidden layer. The classifier takes the term weights of a document as the input and outputs the probabilities of the classes for the given document. This can solve both binomial and multinomial classification problems. For a multinomial classification task, this method produces a network with multiple outputs, each corresponding to one class. Feature selection is required because neural networks can easily suffer from the problem of overfitting (Joachims, 2001).

Decision tree classification refers to a class of human-interpretable symbolic algorithms. A decision tree classifier is a tree in which internal nodes represent document attributes and leaves represent the classifications. If the tree has multiple outputs at each stage, it can deal with multi-class classification. The structure of the tree and the selected attributes are very important for the performance of the classifier. The selection of attributes is usually determined by an information gain or entropy criterion. After the decision trees have been created, the attributes and decision rules can be modified by human experts.

Table 2: Comparison of classification methods

The classification methods described above are the most commonly used ones in automatic text classification. Each of them has some unique advantages. Some are easier to implement, such as kNN and decision tree classifiers; others are more complex, but are more robust and adaptive, such as SVMs. These methods also suffer from additional shortcomings. The linear classifiers, such as Rocchio, may have the centroid of a class falling outside the clustered documents. To further enhance the performance of centralized classification systems, researchers developed a hybrid classification method, called classifier committees. The basic assumption is that a combination of judgments from multiple experts is better than any single one (Sebastiani, 2002). The effectiveness of a classifier committee depends on how the classifiers are combined. The combinations include majority voting, weighted linear combination, dynamic classifier selection, and adaptive classifier combination. A special case of classifier committees, called boosting, has yielded successful results in recent experiments (Sebastiani, 2002).
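Majority voting, the simplest of the combination rules listed above, can be sketched in a few lines. The committee members below are trivial stand-in rules (any trained classifiers could be plugged in), and breaking ties by first-seen label is an arbitrary assumption.

```python
from collections import Counter

# Minimal classifier-committee sketch using majority voting.
# Each committee member is any callable that maps a document to a class label.
def committee_classify(document, members):
    votes = Counter(member(document) for member in members)
    # most_common(1) breaks ties by insertion order, an arbitrary choice here.
    return votes.most_common(1)[0][0]

# Stand-in members: trivially simple rules in place of trained classifiers.
members = [
    lambda d: "finance" if "market" in d else "sports",
    lambda d: "finance" if "stock" in d else "sports",
    lambda d: "sports" if "game" in d else "finance",
]
print(committee_classify("the stock market game", members))  # finance (2 votes to 1)
```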

3.1.3 Evaluation metrics for text classification

Evaluation of text classification is important for the optimization and comparison of different classification methods. The evaluation process is often conducted through experiments instead of theoretical analysis, because the performance of text classification typically depends on domain-specific problems (Sebastiani, 2002). The quality of a text classification method is usually measured by its effectiveness. Researchers in the information retrieval (IR) and machine learning (ML) communities have explored various effectiveness measurements for classification tasks. Lewis (1995) estimated and optimized the performance of classification systems using different families of single effectiveness measures, and concludes that the selection of effectiveness measures should be determined by the specific classification task (Lewis, 1995). Joachims (2001) summarized the most commonly used effectiveness measures for evaluating text classification systems. Sebastiani (2002) discussed various effectiveness measures as well as some measures alternative to effectiveness, such as efficiency and utility. The most commonly used effectiveness measures include precision, recall, the precision/recall breakeven point, the F-measure, and the error rate.

The following discussion of the effectiveness measures is based on binary classification, and the classification decisions are represented as a contingency table:


                            Expert decision: Yes     Expert decision: No
Classifier decision: Yes    TP (true positives)      FP (false positives)
Classifier decision: No     FN (false negatives)     TN (true negatives)

Table 3: Contingency table

Precision and recall are the two primary effectiveness measures for text classification, and many other measures are built on them. They are adapted from information retrieval to fit the need for measuring text classification tasks. Recall is the proportion of documents that the classifier recognizes as class members among all the documents of that class. Precision is the proportion of documents judged by the classifier as belonging to a class that are truly class members (Lewis, 1995). Based on the contingency table, recall and precision are defined as follows:

Recall = TP / (TP + FN)

Precision = TP / (TP + FP)

There are always trade-offs between recall and precision. Very high levels of one measure can easily be achieved without considering the other one, so using just one of the measures cannot provide a complete view of the effectiveness of classification performance. One solution is to use a combined effectiveness measure, such as F and the precision/recall breakeven point (PRBEP). F is defined as follows:

F_β = ((β² + 1) × Precision × Recall) / (β² × Precision + Recall)

where β represents the relative importance of recall and precision. Usually, a value of β = 1 is used to give equal weight to recall and precision (Sebastiani, 2002). Another method of combining the two measures is to use a single value called the precision/recall breakeven point, where recall and precision are equal (Joachims, 2001). There are a couple of additional measures, namely the error rate and eleven-point average precision, which can also be used for calculating classification effectiveness. However, these measures are not widely used.
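The F measure is straightforward to compute once precision and recall are known; the values plugged in below are arbitrary examples, not results from any experiment in this study.

```python
# F_beta combines precision and recall; beta weights recall relative to precision.
def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Arbitrary example values: F1 balances the two measures, F2 favors recall.
print(f_measure(0.8, 0.6))            # F1  ~ 0.686
print(f_measure(0.8, 0.6, beta=2.0))  # F2  ~ 0.632
```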

3.1.4 Test Collections

Text classification methods are typically evaluated using standard test collections. Text classification researchers have compiled and published several standard test collections, among which the most commonly used ones include the Reuters-21578, the OHSUMED, the 20 Newsgroups collection, and the RCV1.

• The Reuters-21578 collection is a newswire corpus of 9,603 training documents and 3,299 test documents when the ModApte split is used. The distribution of the documents across the categories is highly skewed: only 57 of the 135 categories in this collection have at least 20 document occurrences.

• The OHSUMED collection, developed by Hersh et al. (1994), is a MEDLINE subset consisting of a training set of 54,710 references and a testing set of 293,856 references. The available MEDLINE fields include title, abstract, and MeSH indexing terms. The relatively large size of this collection makes it suitable for large-scale experiments, such as TREC. However, since the content of this collection is highly specialized in the medical domain, domain knowledge is required to understand the relationships between the documents and the categories.

• The 20 Newsgroups collection consists of about 20,000 messages collected from 20 different newsgroups. Messages from each of the 20 newsgroups were chosen at random and are almost evenly distributed across the newsgroups.

• The RCV1 collection is so far the latest and largest test collection for text classification. About 810,000 manually categorized newswire stories from Reuters, Ltd. are split into a training set of 23,149 documents and a test set of 781,265 documents. Documents from 103 categories cover topics of corporations, economics, government, and markets.

3.1.5 Centralized Text Classification Procedure

A centralized text classification experiment is typically conducted following a standard procedure including data collection preparation, classifier training, and classifier evaluation (Yang & Liu, 1999; Sebastiani, 2002; Lewis, Yang, Rose, & Li, 2004).

In the preparation of the data collection, one of the standard test collections discussed above or a custom-compiled test collection is chosen based on the purpose of the classification experiment. The data collection is usually split into a set of training documents and a set of testing documents. Then, the data collection is filtered to remove stop words, digits, and punctuation. The final step in data preparation is feature selection, which aims to remove non-informative words based on data collection statistics (Yang & Pedersen, 1997). Feature selection methods including document frequency thresholding (DF), information gain (IG), mutual information (MI), χ² statistics (CHI), and term strength (TS) have shown strong correlation in their performance (Yang & Pedersen, 1997). Since the details of the feature selection methods are out of the scope of this study, please refer to the previous research for a more in-depth discussion.

During classifier training, the chosen classification algorithm is trained using the set of training documents. This training process usually includes parameter tuning and is repeated until the minimum training error is obtained. The trained classifier is then used in the classifier evaluation to classify the set of testing documents. The labels of the classification results are compared with the pre-defined labels of the testing documents, and the performance is measured by the commonly used precision, recall, and F scores.
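Put together, the standard procedure can be sketched end to end as below. The tokenizer, the stop list, the document-frequency threshold of 2, the toy train/test documents, and the simple profile-based classifier are all illustrative assumptions standing in for whichever collection and algorithm an experiment actually chooses.

```python
import string
from collections import Counter

# Illustrative end-to-end sketch of the standard centralized procedure:
# prepare data, select features, train, classify, evaluate.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}          # assumed stop list

def prepare(text):
    # Remove punctuation and digits, lowercase, drop stop words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation + string.digits))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

train_docs = [("the stock market fell in 2006", "finance"),
              ("banks raise rates in the market", "finance"),
              ("the team won the game", "sports"),
              ("a record score in the game", "sports")]
test_docs = [("the market and the banks", "finance"),
             ("the team lost the game", "sports")]

# Feature selection by document-frequency thresholding (a threshold of 2 is arbitrary).
df = Counter(w for text, _ in train_docs for w in set(prepare(text)))
features = {w for w, n in df.items() if n >= 2}

# "Training": per-class term counts restricted to the selected features.
profiles = {}
for text, label in train_docs:
    profiles.setdefault(label, Counter()).update(w for w in prepare(text) if w in features)

def classify(text):
    words = [w for w in prepare(text) if w in features]
    return max(profiles, key=lambda c: sum(profiles[c][w] for w in words))

# Evaluation: simple accuracy over the test set (precision, recall, and F would
# follow the contingency-table definitions given earlier).
correct = sum(classify(text) == label for text, label in test_docs)
print(f"accuracy = {correct}/{len(test_docs)}")
```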

For example, Yang and Liu (1999) applied Support Vector Machines (SVM) to the Reuters-21578 collection and produced the results shown in Table 4. A more recent experiment by Lewis and his colleagues (2004) using a much larger collection produced a new benchmark for centralized text classification (see Table 5).

Table 4 and Table 5 column headers: Micro Recall, Micro Precision, Micro F1, Macro F1


3.2 Text classification using a multi-agent framework

Text classification using a distributed approach, particularly a multi-agent framework, offers a solution for addressing the challenges of centralized text classification and improving the quality of classification. This section introduces the multi-agent paradigm, its distinction from other similar computing paradigms, and its recent applications in information systems. Text classification using a multi-agent framework will be compared with centralized text classification, and its advantages over centralized text classification will be pointed out.

3.2.1 Multi-agent paradigm

In recent years, agent-based systems, particularly as implemented in a distributed framework, have attracted considerable interest. Agent-based systems have evolved from single-agent systems to multi-agent systems (MASs). Multi-agent frameworks have been developed by the Distributed Artificial Intelligence community. A multi-agent framework is a loosely coupled network of problem solvers that work together to solve problems that are beyond their individual capabilities (Durfee & Montgomery, 1989).

3.2.2 Differences between multi-agent systems and other concurrent systems

Parallel computing, distributed computing, and peer-to-peer (P2P) computing have offered different means of improving the overall performance of information systems. Although multi-agent systems share some common characteristics with these traditional computing paradigms, it is important to point out the differences.


In parallel systems, multiple processors, each working on a different part of the same problem, work simultaneously to solve a single problem. The different processors are coordinated through a central processor, which collects the user input and merges the results (Baeza-Yates and Ribeiro-Neto, 1999). The components of parallel systems, which are simply homogeneous processors with no distinct expertise, operate in a static environment (Wooldridge, 2002).

Distributed systems are very similar to parallel systems. Baeza-Yates and Ribeiro-Neto (1999) define a distributed system as multiple computers, connected by a local or wide area network, that cooperate to solve a single problem. Compared to parallel systems, the communication cost in distributed systems is much higher between processors, which in nature can be heterogeneous. Accomplishing a task in a distributed system often involves a subset of the processors instead of all processors in the system (Baeza-Yates and Ribeiro-Neto, 1999). In contrast to distributed systems, multi-agent systems have two main distinctions (Wooldridge, 2002). First, synchronization and coordination have to be done at run-time. Second, agents are self-interested entities, which cannot be assumed to share a common goal.

The computing paradigm is shifting from the traditional client/server model to the peer-to-peer (P2P) model. Some IR researchers consider P2P as a way "to leverage vast amounts of computing power, storage, and connectivity from personal computers distributed around the world" (Zeinalipour-Yazti, Kalogeraki, & Gunopulos, 2004).


Compared to parallel systems and distributed systems, current P2P systems have the following two characteristics. First, peers operate in a dynamic environment without any centralized coordination (Zeinalipour-Yazti, Kalogeraki, & Gunopulos, 2004); peers can join and leave the system at any time. Second, they are not completely decentralized. For example, Napster has to maintain a directory on a central server. However, there are some emerging P2P systems that work in a completely distributed manner, such as pSearch (Tang, Xu, and Dwarkadas, 2003).

Compared with other concurrent systems, a multi-agent system (MAS) has additional distinct characteristics (Jennings, Sycara, & Wooldridge, 1998). First, a MAS is composed of multiple autonomous components. Second, the operating entities, namely agents, are intelligent. Third, knowledge is decentralized among the agents; each agent has its own distinct and incomplete knowledge for solving a problem. Fourth, agents operate in a completely open and distributed environment. Fifth, there is no global system control. Sixth, data is decentralized. Finally, computation in a MAS is asynchronous.

3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm

As discussed in section 3.2.2, the multi-agent paradigm and the peer-to-peer (P2P) paradigm are different decentralized computing paradigms, but in both paradigms the computing units, whether peers or agents, have to interact with each other to complete a task. In the P2P paradigm, one of the key interactions is searching for user-requested files among the connected peers. Search techniques for information retrieval in the P2P paradigm cover a variety of approaches including centralized indexing, query flooding, breadth-first search (BFS), document routing search, and selective intelligent search (Tang, Xu, &
