automatic text classification using a multi-agent framework


AUTOMATIC TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK

Yueyu Fu

Submitted to the faculty of the University Graduate School

in partial fulfillment of the requirements

for the degree Doctor of Philosophy

in the School of Library and Information Science,

Indiana University

October 2006


ProQuest Information and Learning Company

789 East Eisenhower Parkway

P.O. Box 1346, Ann Arbor, MI 48106-1346


Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Charles Davis, Ph.D.

Kiduk Yang, Ph.D.

David Leake, Ph.D. (Computer Science, minor)


DEDICATION

To my beloved parents, Guanghui Fu and Lan Chen, my dear wife, Wenjie Sun, and my grandparents, for their unconditional love and encouragement.


ACKNOWLEDGMENTS

I am grateful to the many people who generously provided the guidance, support, and encouragement that allowed me to complete this dissertation.

First and foremost, I would like to thank Dr. Javed Mostafa, my committee chair, for his professional and personal guidance, which went far beyond his responsibilities. It is his patient guidance, sharp mind, and gentle encouragement that led me to the achievement I have today.

Special thanks also go to the rest of my committee, Dr. Charles Davis, Dr. Kiduk Yang, and Dr. David Leake, for their insightful comments and enduring support during the entire process of my dissertation research.

I would like to thank my colleagues and the staff at Indiana University, especially Weimao Ke, Kazuhiro Seki, Mary Kennedy, Arlene Merkel, Erica Bodnar, and Rhonda Spencer, for their kind help and support throughout all these memorable years in Bloomington.

Finally, I must express my deepest gratitude to my parents, Guanghui Fu and Lan Chen, for opening my eyes to the world and encouraging me to pursue my career abroad, and to my beloved wife, Wenjie Sun, for making our family full of joy, support, and understanding.


Automatic text classification is an important operational problem in information systems. Most automatic text classification efforts so far have concentrated on developing centralized solutions. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. To overcome these limitations, an alternative distributed approach based on a multi-agent framework is proposed. Three major challenges associated with distributed text classification are examined: 1) Coordinating classification activities in a distributed environment, 2) Achieving high quality classification, and 3) Minimizing communication overhead. This study presents solutions to these specific challenges and describes a prototype system implementation. As agent coordination is the key component in conducting multi-agent text classification, two agent coordination protocols, namely the blackboard-bidding protocol and the adaptive-blackboard protocol, are proposed in the study.

To analyze the performance of the distributed approach, a comparative evaluation methodology is described, which treats the outcome of a centralized approach as baseline performance. A series of experiments was conducted in a simulation environment. The simulation environment permitted manipulation of independent variables such as scalability and coordination strategy, and investigation of their impact on two critical dependent variables, namely efficiency and effectiveness. There were three critical findings. First, in dealing with automatic text classification, the multi-agent approach can achieve improved system efficiency while maintaining classification effectiveness comparable to a centralized approach. Second, the agent protocols were effective in coordinating the text classification activities of distributed agents. Third, the application of content-based adaptive learning for acquiring knowledge about the agent community reduced communication cost and improved system efficiency.


TABLE OF CONTENTS

1 INTRODUCTION
1.1 MANUAL CLASSIFICATION
1.2 AUTOMATIC CLASSIFICATION
1.3 MULTI-AGENT PARADIGM
2 PROBLEM STATEMENT
2.1 SPECIFIC CHALLENGES
2.2 VARIABLES
2.3 IMPLICATIONS OF THIS RESEARCH
3 LITERATURE REVIEW
3.1 AUTOMATIC TEXT CLASSIFICATION
3.1.1 Text classification task
3.1.2 Text classification methods
3.1.3 Evaluation metrics for text classification
3.1.4 Test Collections
3.1.5 Centralized Text Classification Procedure
3.2 TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK
3.2.1 Multi-agent paradigm
3.2.2 Differences between multi-agent systems and other concurrent systems
3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm
3.2.4 Recent applications of the multi-agent paradigm
3.2.5 Centralized vs. Multi-agent text classification
3.3 MULTI-AGENT COORDINATION PROTOCOLS
3.3.1 Definition of coordination
3.3.2 Coordination Protocols
3.3.2.1 Organizational Structuring
3.3.2.2 Multi-agent planning
3.3.2.3 Contract net protocol
3.3.2.4 Negotiation
4 METHODOLOGY
4.1 DATA
4.2 DESIGN METHODOLOGY
4.2.1 Multi-Agent Community for Text Classification
4.2.2 Classification Module
4.2.3 Algorithms of Agent Coordination Protocols
4.2.4 Proposed Agent Coordination Protocols
4.2.4.1 Blackboard-bidding Protocol
4.2.4.2 Adaptive-blackboard Protocol
4.3 IMPLEMENTATION
4.3.1 System Architecture
4.3.2 Alternative approach
4.4 EVALUATION METHODOLOGY
4.4.1 Measurements
4.4.1.1 Effectiveness Measurements
4.4.1.2 Efficiency Measurements
4.4.2 Variables
4.4.2.1 Centralized vs. Distributed
4.4.2.2 Coordination Protocols
4.4.2.3 Number of Agents
4.4.3 Experimental Settings
5 RESULTS
5.1 CENTRALIZED VS. DISTRIBUTED
5.2 COORDINATION PROTOCOLS
5.2.1 Effectiveness
5.2.2 Efficiency Measured by Messages
5.2.3 Efficiency Measured by Time
5.3 NUMBER OF AGENTS
5.3.1 Impact of the number of agents on effectiveness
5.3.2 Impact of the number of agents on efficiency
6 CONCLUSIONS
6.1 SUMMARY
6.2 FUTURE RESEARCH
REFERENCES


1 Introduction

Automatic text classification is an important operational problem in information systems. Many tasks in information systems, such as retrieval, filtering, and indexing, can be considered classification problems. Most text classification efforts so far have concentrated on developing centralized solutions, where data and computation are located on a single computer. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. In addition, centralized approaches are more vulnerable to attacks or system failures and less robust in dealing with them. This research presents an alternative classification approach, called distributed text classification using a multi-agent framework, where data and computation are distributed across a network of computers.

1.1 Manual Classification

In library and information science, class/classification and category/categorization are sometimes considered distinct terms (Jacob, 2004). Although both are used to organize related entities, the two terms differ fundamentally. Classification groups entities into mutually exclusive classes based on a set of predefined rules regardless of the context, whereas categorization associates entities solely based on their similarities within a given context (Jacob, 2004). This distinction makes categorization more flexible than classification in organizing similar entities. However, to reach a broader audience, this study uses class/classification and category/categorization interchangeably.

One approach to text classification is manual classification, in which human experts classify documents based on classification rules and subjective judgment. This approach has been used in library practice for many years to organize, index, and retrieve documents. Human experts typically assign each book a code representing a category according to a classification scheme, such as the Dewey Decimal Classification, the Universal Decimal Classification, or the Library of Congress Classification. A recent application of this approach on the web is the Yahoo Directory, which organizes web pages into a hierarchical structure.

The main challenge of manual classification is its demand on resources. Manual classification is a time-consuming process that relies heavily on domain knowledge. It requires a significant investment of time from many human experts with knowledge of different domains. Another associated problem is that subjective judgments can generate inconsistent classification results. Because of these limitations, manual classification works best for relatively small document collections.

1.2 Automatic Classification

To address the problems of manual classification, researchers have explored automatic text classification as an alternative approach. Using machine learning techniques, automatic text classification assigns documents to a set of pre-defined categories. This approach has been applied in many areas, such as patent classification, news delivery, and email spam filtering. In contrast to manual classification, automatic classification offers the advantages of automation, efficiency, and consistency.

In automatic classification, documents are typically classified by a single classification software system running on a single machine. This is also called centralized text classification. Significant efforts have focused on developing document classifiers in a centralized manner, and various classification algorithms have been developed to improve the performance of centralized classification systems. The advantages of centralized classification stem from the centralized architecture. Because data and computing resources are located in the same place, the classification task is easy to manage and classification is fast. Since communication in centralized classification takes place within a single machine, the communication cost is relatively small.

However, as information becomes more distributed and its volume increases exponentially, several critical disadvantages of centralized classification are revealed. The effectiveness of a classification system is mostly determined by the artificial knowledge [1] maintained by the system, which typically comes from training data. Currently, centralized classification systems suffer from a scarcity of local knowledge [2]. The extent of local knowledge is limited by the cost and constraints of storing complete knowledge in a single place, and it is sometimes impossible to collect all the necessary knowledge and store it in a central location. Since the classification system can successfully classify only documents that are within the scope of its limited local knowledge, it is likely to fail when the expansion of its domain (i.e., local knowledge) does not keep up with the growing diversity of knowledge. Another disadvantage is that, due to its centralized architecture, centralized classification has only a certain amount of computing power and input/output capacity. When an information system has to handle a large number of documents, the classification component may become a performance bottleneck and a single point of failure.

[1] Knowledge learned from training documents using machine learning techniques.

[2] Knowledge maintained by a single classification system.

Distributed text classification, an alternative to centralized automatic classification, employs a decentralized architecture for organizing knowledge and computing resources. This approach allows multiple classification software systems to collaborate with each other to fulfill the classification task in a distributed computing environment. Distributed classification has several advantages over centralized classification. The distributed architecture offers computational scalability for classification. Mukhopadhyay et al. (2005) demonstrate that classification time decreases dramatically as the number of collaborating classification software systems increases. Also, not relying completely on a single classification software system allows the classification system to avoid a single point of failure. When one of the classification software systems fails, its tasks can be carried out by an alternative classification software system. Lastly, distributed classification fits the web model better. The Internet is a distributed system, and it offers the opportunity to take advantage of distributed computing paradigms and distributed knowledge resources for classification.

However, distributed classification has some disadvantages. Unlike centralized classification, distributed classification, which consists of multiple independent classification software systems, does not have global control of all the classification activities. Such global control is essential for achieving coherent system performance. Without global control, distributed classification activities can produce conflicting and inconsistent results. Therefore, an alternative mechanism for coordinating distributed classification activities is needed. Another limitation of distributed classification is its large communication overhead. In order for classification software systems to collaborate, they must communicate with each other, and the amount of exchanged information can be very large. For example, Mukhopadhyay et al. (2003) show that the average response time for classification increases almost linearly with the number of classification software systems, which is the direct result of increasing communication overhead. They also show that classification performance quickly saturates as the number of classification software systems increases. This latter result points to the potential of improving overall performance by reducing communication overhead.

1.3 Multi-Agent Paradigm

This research employs a multi-agent paradigm for conducting distributed text classification. The multi-agent paradigm evolved from distributed artificial intelligence in the late [...], where agents are considered as autonomous intelligent computer software. An agent exhibits three major characteristics, namely reactivity, proactiveness, and social ability (Wooldridge & Jennings, 1995). Reactivity refers to the capability of sensing changes in its environment and quickly taking corresponding actions. Proactiveness refers to the capability of operating in an active fashion according to its design goal. Social ability refers to the capability of working with other agents. A group of such agents forms a multi-agent system. Durfee and Montgomery (1989) define a multi-agent system (MAS) as “a loosely coupled network of problem solvers that work together to solve problems that are beyond their individual capabilities.”


For text classification, a multi-agent paradigm offers several critical advantages. According to Sycara (1998), the multi-agent paradigm distributes computing resources and capabilities across a network of agents, which avoids the single-point-of-failure problem. The modular architecture of the multi-agent paradigm provides computational scalability and flexibility for agents entering and leaving agent communities. The multi-agent paradigm can also make efficient use of spatially distributed information resources and serve as a solution when expertise is distributed. Because of these advantages, the multi-agent paradigm has been used in the design of information retrieval systems and information management systems. However, the applicability of the multi-agent paradigm to text classification has not yet been thoroughly examined.

Agent coordination is a critical component of the multi-agent paradigm. It determines the relationships among the agents in a multi-agent environment and governs the behaviors of the interacting agents. The overall system performance, including both quality and efficiency, depends on the appropriate design of the coordination mechanism. Quality measures the correctness of the system behavior, which is the collective result of the coordinated agents’ behaviors. Efficiency measures the timeliness of the system process, which counts mainly the communication among the coordinated agents. Due to its importance in system performance optimization, agent coordination has been well studied in various domains, such as transportation, economics, and management. As the multi-agent paradigm is applied to text classification, multi-agent coordination will be the focus of this research.

Evaluation of system performance is an essential aspect of the multi-agent implementation plan. As the overall performance of a multi-agent system is a collective result of multiple agents’ behaviors, [...]. The evaluation framework typically reflects the system performance at different levels, including the agent level and the overall system level. Also, the evaluation metric covers different aspects of the system performance, including effectiveness and efficiency. Integrating the evaluation metrics of text classification and the multi-agent paradigm may provide a powerful tool for validating the approach of automatic text classification using a multi-agent framework.

2 Problem Statement

The primary purpose of this study is to investigate automatic text classification using a multi-agent framework. Automatic text classification and the multi-agent paradigm have each been extensively studied over the years. Although problems within each area have been investigated, new problems that arise with the introduction of the multi-agent paradigm into automatic text classification remain mostly unexplored. In this section, three major challenges associated with distributed text classification will be examined and key variables related to these challenges will be discussed.


2.1 Specific Challenges

Distributed text classification differs from centralized text classification because of its distributed architecture. One of the main challenges in distributed text classification is coordinating classification activities in a distributed environment. Unlike centralized classification, which relies on a mediator to ensure the coherence of the overall system performance, distributed classification lacks centralized control and thus may produce conflicting and inconsistent classification results. Consequently, an effective mechanism for coordinating distributed classification activities is greatly needed. In the multi-agent paradigm, such mechanisms (e.g., agent coordination protocols) have been extensively studied. Agent coordination research has drawn on various domains, including artificial intelligence, social science, game theory, and economics. Many agent coordination protocols, such as blackboard and contracting, have been explored in those domains. Although these agent coordination protocols have been successfully applied in many domains, they have not been seriously studied in information science, particularly for text classification. The question is whether these agent coordination protocols will work well for the classification task. Different coordination protocols will be explored in order to design suitable coordination mechanisms for text classification.

Another challenge in distributed classification is achieving high quality of classification in multi-agent environments. In automatic centralized classification, quality of classification is mainly influenced by the quality of the test collection and the classification algorithm. Most efforts so far have concentrated on developing new methods to improve classification performance. Several classification methods, such as Support Vector Machines and k-Nearest Neighbor, have been applied in centralized environments. In multi-agent environments, the classification task is distributed across a network of classification agents. The classification process involves the actual classification conducted by individual agents, the interactions among agents, and the merging of individual classification results. Whether those well-established classification methods are applicable in multi-agent environments has to be examined. To validate the performance of these classification methods in multi-agent environments and identify suitable classification methods for distributed classification, a thorough evaluation of the quality of distributed classification needs to be conducted. The evaluation may cover comparisons among different classification methods and comparisons between distributed classification and centralized classification. The results of such an evaluation may tell us whether certain distributed classification approaches can achieve satisfactory quality of classification.

Minimizing communication overhead in distributed classification without compromising quality of classification is yet another challenge. Communication is a key issue in distributed classification, where agents exchange information, interact with each other, and work together by means of communication to achieve satisfactory quality of classification. In such an environment, the amount of communication greatly affects system efficiency. Consequently, an appropriate agent coordination protocol that governs the agents’ communication behavior in an efficient and effective manner may ensure high quality of classification and reduce communication overhead. The key objective of the agent coordination protocol is to balance quality of classification and system efficiency. To achieve such a balance, an evaluation procedure has to be established to measure quality of classification and system efficiency in multi-agent environments. However, there is no standard evaluation framework to fulfill this goal. An evaluation framework for measuring system efficiency needs to be established. To summarize, the three main challenges are: 1) Coordinating classification activities in a distributed environment, 2) Achieving high quality of classification in multi-agent environments, and 3) Minimizing communication overhead in distributed classification without compromising quality of classification.

2.2 Variables

This research will be conducted using an experimental study design. The study will explore the applicability of different classification methods in multi-agent environments. The results of this exploration will help researchers choose appropriate classification methods in distributed computing environments and design new methods for distributed classification. The primary focus of this study will be to investigate the coordination of distributed classification activities in multi-agent environments. A comparative study of different coordination protocols for multi-agent classification will help identify the best coordination protocol, one that achieves satisfactory classification performance with acceptable communication overhead. A comparative study between centralized classification and distributed classification will also be conducted to evaluate the performance of the distributed classification approach. The evaluation framework will draw on centralized classification research, and new approaches will be developed that are uniquely suited to distributed classification environments. To carry out this study and address the three challenges discussed above, three variables will be studied: quality of classification, system efficiency, and agent granularity.

One of the main goals is to achieve satisfactory classification performance in a multi-agent environment. Therefore, quality of classification must be taken into consideration throughout the study. Quality of classification refers to the accuracy of a completed classification task. In contrast to centralized text classification, quality of classification in a multi-agent context is determined not only by the performance of individual classifiers but also by the agent coordination protocol. Researchers in the information retrieval and machine learning communities have tested various effectiveness measurements for classification tasks. Lewis (1995) demonstrated the use of different families of single effectiveness measures to estimate and optimize the performance of classification systems. Joachims (2001) summarized the most commonly used effectiveness measures for evaluating text classification systems. In this study, precision, recall, and the F measure have been chosen to measure quality of classification.

Figure 1.1 Precision


Figure 1.2 Recall

A study by Mukhopadhyay et al. (2005) demonstrates how these evaluation measures can be applied in a multi-agent environment. The study, which shows that as the number of agents increases, precision drops (see Figure 1.1) while recall increases (see Figure 1.2), proposes that quality of classification must be evaluated at both the system level and the individual agent level. In addition to the overall performance evaluation at the system level, the classification performance of individual agents will help in understanding each agent’s behavior and relationship with other agents. In the study, the quality of classification at the system level is calculated by averaging the corresponding measurement scores across all agents. For each category, classification decisions can be represented as a contingency table as follows:

                        Expert decision: Yes    Expert decision: No
System decision: Yes    TP                      FP
System decision: No     FN                      TN

Table 1: Contingency table

Based on the contingency table, recall and precision are defined as follows:

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

Typically, both micro-averaging and macro-averaging methods are applied to calculate the average scores. In the micro-averaging method, precision and recall are computed based on a “global” contingency table, which is the sum of individual contingency tables. In the macro-averaging method, precision and recall are computed by averaging the precision and recall scores of all categories (Sebastiani, 2002). The micro-averaging and macro-averaging scores reflect the classification performance on different categories. Yang and Liu (1999) note that micro-averaging gives equal weight to each item (e.g., document) and can be dominated by large (common) categories, whereas macro-averaging gives equal weight to each category, so small (rare) categories can unduly influence the score.
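The difference between the two averaging methods can be illustrated with a minimal Python sketch. It is not taken from the study's implementation: the `Contingency` class, the function names, and the counts are hypothetical, and the balanced F measure (F1) is assumed here because the text does not specify which F measure variant is used.

```python
from dataclasses import dataclass


@dataclass
class Contingency:
    """Per-category counts: true positives, false positives, false negatives."""
    tp: int
    fp: int
    fn: int


def precision(tp: int, fp: int) -> float:
    # Precision = TP / (TP + FP); defined as 0.0 when nothing was assigned.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    # Recall = TP / (TP + FN); defined as 0.0 when the category is empty.
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    # Assumption: balanced F measure, the harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


def micro_average(tables):
    # Sum the per-category tables into one "global" table, then score it.
    tp = sum(t.tp for t in tables)
    fp = sum(t.fp for t in tables)
    fn = sum(t.fn for t in tables)
    return precision(tp, fp), recall(tp, fn)


def macro_average(tables):
    # Score each category separately, then take the unweighted mean.
    ps = [precision(t.tp, t.fp) for t in tables]
    rs = [recall(t.tp, t.fn) for t in tables]
    return sum(ps) / len(ps), sum(rs) / len(rs)


# Hypothetical counts: one large (common) and one small (rare) category.
tables = [Contingency(tp=90, fp=10, fn=10), Contingency(tp=1, fp=3, fn=4)]
micro_p, micro_r = micro_average(tables)
macro_p, macro_r = macro_average(tables)
print(f"micro: P={micro_p:.3f} R={micro_r:.3f} F1={f1(micro_p, micro_r):.3f}")
print(f"macro: P={macro_p:.3f} R={macro_r:.3f} F1={f1(macro_p, macro_r):.3f}")
```

With these made-up counts, the micro-averaged scores stay close to those of the large category, while the macro-averaged scores are pulled down by the small one, which is the effect Yang and Liu (1999) describe.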

In this study, the main goal is not only to achieve high quality of classification but also to attain acceptable system efficiency. The efficiency of a multi-agent system largely depends on its communication overhead. Agent interaction and coordination, as well as the agent environment, influence communication overhead in multi-agent systems. Different agent coordination protocols produce different amounts of communication overhead. This variable, efficiency, will help identify appropriate coordination protocols that achieve acceptable communication overhead. Efficiency here refers to the time spent completing a classification task. A study by Mukhopadhyay et al. (2003) shows that as the number of agents increases, the communication overhead increases almost linearly (see Figure 2.1) while the classification performance quickly saturates (see Figure 2.2). This result shows the possibility of achieving satisfactory classification performance with reduced communication overhead by interacting with fewer agents. Efficiency can be measured at both the system level and the individual agent level. Efficiency at the system level represents the time spent classifying all the documents in the multi-agent environment. Efficiency at the agent level represents the time that an individual agent spends classifying its own documents.

Figure 2.1 Average response time

Figure 2.2 Number of successful classifications

Agent granularity refers to the amount of knowledge possessed by an agent. Agent granularity affects not only an agent’s classification capability but also efficiency at the system level. Each agent possesses a certain amount of knowledge, which is a portion of the complete global knowledge. In an extreme case, each agent has knowledge of only one class. When the total number of classes is fixed, the number of agents decreases as the number of classes possessed by each agent increases. Theoretically, the classification capability of each agent is enhanced as its number of classes increases, because the probability of a document being classified by such an agent increases. Also, with increased knowledge, the communication overhead decreases because there are fewer agents and less need for coordination (see Figure 3.1 & Figure 3.2).
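The relationship between granularity and the number of agents can be sketched in a few lines of Python. This is an illustration only: it assumes that classes are divided among agents without overlap and that every agent holds the same number of classes (except possibly the last), neither of which is stated in the text. The function names and class labels are hypothetical.

```python
import math


def number_of_agents(total_classes: int, classes_per_agent: int) -> int:
    """Agents needed when each holds at most `classes_per_agent` classes."""
    if total_classes < 1 or classes_per_agent < 1:
        raise ValueError("both arguments must be positive")
    # Assumption: classes are split without overlap; the last agent may hold fewer.
    return math.ceil(total_classes / classes_per_agent)


def partition_classes(classes: list, classes_per_agent: int) -> list:
    """Assign consecutive slices of the class list to agents."""
    if classes_per_agent < 1:
        raise ValueError("classes_per_agent must be positive")
    return [classes[i:i + classes_per_agent]
            for i in range(0, len(classes), classes_per_agent)]


classes = [f"class_{i}" for i in range(10)]  # hypothetical class names
for size in (1, 2, 5, 10):
    agents = partition_classes(classes, size)
    print(size, "classes per agent ->", len(agents), "agents")
```

At one class per agent (the extreme case) the ten hypothetical classes need ten agents; at ten classes per agent a single agent holds the complete knowledge, which corresponds to the centralized setting.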


Figure 3.1 Precision

Figure 3.2 Average response time

2.3 Implications of this Research

This research has been developed to investigate automatic text classification using a multi-agent framework. Automatic text classification has not been seriously tested in multi-agent environments. Although some preliminary analyses were conducted (Mukhopadhyay et al., 2003; 2005), they were limited in scope. The findings of this study will reveal the advantages and disadvantages of conducting text classification using a multi-agent framework in a more comprehensive manner. These findings may be useful in choosing between centralized and distributed classification solutions in different scenarios. For example, distributed classification can serve as an alternative when centralized classification cannot be realized. Different classification methods will be tested in the multi-agent classification environment. The results will be useful for identifying classification methods for distributed classification tasks and for inspiring researchers to design new classification methods for distributed computing environments. Different approaches to addressing the agent coordination problem will be evaluated and compared. The findings may help in designing appropriate coordination protocols for distributed classification. The evaluation framework based on centralized classification will be tested in the multi-agent environment. The findings will help in better understanding the differences between centralized classification and distributed classification and may contribute to the establishment of an evaluation framework for distributed classification.

Ultimately, my goal is to make a scholarly contribution to the areas of text classification and multi-agent systems and to produce findings that will be of interest to both practitioners and researchers. At the practical level, the proposed approach can facilitate the sharing of activities among different libraries. For example, if several small libraries adopt the proposed approach and build a multi-agent community, they can help each other without having to maintain comprehensive capacity for classifying materials. At the research level, this proposed research may be of interest to researchers interested in advancing the state of the art in automated text classification.

3 Literature Review

This research investigates an alternative classification approach, namely distributed text classification conducted using a multi-agent framework. Three major challenges associated with distributed text classification are examined: 1) Coordinating classification activities in a distributed environment, 2) Achieving high quality of classification, and 3) Minimizing communication overhead. This chapter reviews the literature on these and related problem areas. Different approaches for addressing the three major problem areas are compared and potential solutions are proposed.

3.1 Automatic text classification

This section draws mostly on centralized text classification research and provides the basic components for building distributed text classification approaches. Although distributed classification is different from centralized classification, each classification agent is itself still a relatively independent classification unit. This section reviews classification tasks, classification methods, test collections, and the measurements for quality of classification, in the hope of contributing to a better understanding of the challenges associated with distributed text classification.


3.1.1 Text classification task

Automatic text classification assigns textual documents to classes using the rules or patterns learned from a set of pre-classified documents. Sebastiani (2002) defines automatic text classification as a process of assigning natural language documents to predefined semantic classes. Generally, the text classification task can be defined as follows: given a set of pre-classified documents, learn the classification rules or patterns and find the correct classes for a new document.
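As a concrete instance of this task definition, the following Python sketch classifies a new document by a majority vote among its most similar pre-classified documents (a small k-Nearest Neighbor classifier over bag-of-words vectors). It is illustrative only and is not the classifier used in this study; the function names, the whitespace tokenization, and the toy training documents are all assumptions.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    # Bag-of-words term frequencies; lowercased whitespace tokenization.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(training, document: str, k: int = 3) -> str:
    # training: list of (text, label) pairs, i.e. the pre-classified documents.
    query = vectorize(document)
    ranked = sorted(training,
                    key=lambda pair: cosine(vectorize(pair[0]), query),
                    reverse=True)
    # Majority vote among the k most similar training documents.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical pre-classified documents.
training = [
    ("cheap pills buy now", "spam"),
    ("win money now click here", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("project report attached for review", "non-spam"),
]
label = classify(training, "buy cheap pills and win money", k=3)
print(label)
```

Here "learning" amounts to storing the labelled examples, which keeps the sketch short; other methods instead fit an explicit model from the same kind of pre-classified input.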

Binary classification is the simplest and most fundamental operation in text classification. In binary classification, there are only two classes. For example, when a user submits a query term to an online library catalog, the library items are divided into two classes. Items in one class contain the query term in their indices, and items in the other class do not. Email spam detection is another example. Emails are classified into two classes: spam and non-spam. A more complex task is multi-class classification. In this scenario, documents are classified into more than two classes. For example, in news filtering applications, news documents are classified into one of several pre-defined classes. In email routing, emails are directed to one of a few different folders or recipients based on their content and metadata. Another scenario is called multi-label classification (Joachims, 2002). In multi-label classification, the mapping between documents and classes is not restricted to a one-to-one mapping. A document that belongs to multiple classes can be classified as a member of all those classes at the same time. In the news filtering example above, one news document could be about the 2008 Olympics in China and it could be classified into both International and Sports
