A New Term Weighting Method for Text Categorization
By Man Lan
Submitted For The Degree Of Doctor of Philosophy
at the Department of Computer Science, School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543
September, 2006
© Copyright 2006 by Man Lan (lanman.sg@gmail.com)
Name: Man Lan
Degree: Doctor of Philosophy
Department: Department of Computer Science
Thesis Title: A New Term Weighting Method for Text Categorization

Abstract: Text representation is the task of transforming the content of a textual document into a compact representation so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on the development of an effective and efficient term weighting method for the text categorization task. We selected the single token as the unit of feature, because previous research showed that this simple type of feature outperformed other, more complicated types of features. We have investigated several widely-used unsupervised and supervised term weighting methods on several popular data collections, in combination with the SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection and an analysis of the term's discriminating power, we have proposed a new term weighting scheme, namely tf.rf. The controlled experimental results showed that the term weighting methods give mixed performance across data sets with different category distributions and across different learning algorithms. Most of the supervised term weighting methods based on information theory have not shown satisfactory performance in our experiments. However, the newly proposed tf.rf method shows a consistently better performance than other term weighting methods. On the other hand, the popularly used tf.idf method has not shown a uniformly good performance with respect to data sets with different category distributions.
Keywords: Text Categorization, Term Weighting Method, Support Vector Machine, kNN.
To my parents and my husband.
ACKNOWLEDGEMENT
I would first like to thank my advisors Prof. Chew Lim Tan and Dr. Hwee Boon Low for their deep insights and dedication in guiding and helping me through this thesis research. Without their creative, valuable supervision, this work would have encountered a lot of difficulties.

I also sincerely appreciate the suggestions and insights I obtained from my former academic advisors: Professor Sam Yuan Sung, for his suggestions on my preliminary thesis report in the Center for Information Mining and Extraction (CHIME) lab of the School of Computing, National University of Singapore; Dr. Ah Hwee Tan, currently with Nanyang Technological University, for giving me many useful suggestions during my work in the Text Mining lab of the A-STAR Institute for Infocomm Research; and Prof. Kang Lin Xie, of Shanghai Jiao Tong University, for encouraging me to further my education and research.

The former staff member in the CHIME lab of the School of Computing, National University of Singapore, Dr. Ji He, helped me with discussions, cooperation and encouragement, making the research life in Singapore a very interesting and exciting experience. Last but not least, to my loving parents and my husband, for their support and encouragement through all these years in the Ph.D. program.
TABLE OF CONTENTS
1 Introduction
1.1 Motivation
1.2 Structure of the Thesis
2 A Brief Review of Text Categorization
2.1 A Definition of Text Categorization
2.2 Relationship With Information Retrieval and Machine Learning
2.3 Various Subcases of Text Categorization Tasks
2.3.1 Single-label and Multilabel Text Categorization
2.3.2 Flat and Hierarchical Text Categorization
2.4 A Variety of Applications of Text Categorization Technology
2.4.1 Automatic Document Indexing for IR Systems
2.4.2 Documentation Organization
2.4.3 Text Filtering System
2.4.4 Word Sense Disambiguation
2.4.5 Hierarchical Categorization of Web Pages
2.5 Approaches to Effectively Learning Text Classifiers from Labelled Corpora
2.5.1 The Rocchio Method From Information Retrieval
2.5.2 k Nearest Neighbor
2.5.3 Naïve Bayes Method
2.5.4 Decision Tree
2.5.5 Support Vector Machines
2.5.6 A Summary of These Approaches
3 Text Representation for Text Categorization
3.1 Introduction
3.2 The Prerequisites of Text Representation
3.2.1 Stop Words
3.2.2 Stemming
3.2.3 Feature Selection
3.3 What Should a Term Be?
3.3.1 Sub-Word Level
3.3.2 Word Level
3.3.3 Multi-Word Level
3.3.4 Semantic and Syntactic Representations
3.3.5 Other Knowledge-based Text Representations
3.3.6 Remarks on the Term Types
3.4 How to Weigh a Term?
3.4.1 Term Frequency Factor
3.4.2 Collection Frequency Factor
3.4.3 Normalization Factor
3.4.4 Traditional Term Weighting Methods from IR
3.5 Supervised Term Weighting Methods
3.5.1 Combined with Information-Theory Functions or Statistical Metrics
3.5.2 Based on Statistical Confidence Intervals
3.6 Analysis of Term's Discriminating Power
3.7 A New Proposed Supervised Term Weighting Scheme — RF
3.8 Empirical Observation of Term's Discriminating Power
4 Methodology of Research
4.1 Machine Learning Algorithms Applied in This Thesis
4.1.1 Support Vector Machines
4.1.2 k Nearest Neighbors
4.2 Benchmark Data Collections
4.2.1 Text Preprocessing
4.2.2 Reuters News Corpus
4.2.3 20 Newsgroups Corpus
4.2.4 Ohsumed Corpus
4.2.5 18 Journals Corpus
4.3 Evaluation Methodology
4.3.1 Precision and Recall
4.3.2 F1 Function
4.3.3 Breakeven Point
4.3.4 Accuracy
4.4 Statistical Significance Tests
5 Experimental Research
5.1 Experiment Set 1: Exploring the Best Term Weighting Method for SVM-based Text Categorization
5.1.1 Term Weighting Methods
5.1.2 Results and Discussion
5.1.3 Concluding Remarks
5.2 Experiment Set 2: Investigating Supervised Term Weighting Methods and Their Relationship with Machine Learning Algorithms
5.2.1 Methodology
5.2.2 Results and Discussion
5.2.3 Further Analysis
5.2.4 Concluding Remarks
5.3 Experiment Set 3: Application to Biomedical Data Collections
5.3.1 Motivation
5.3.2 Examples of Terms' Discriminating Power
5.3.3 Results and Discussion
5.3.4 Concluding Remarks
6 Contributions and Future Directions
6.1 Contributions
6.2 Future Work
6.2.1 Extending Term Weighting Methods on Feature Types other than Words
6.2.2 Applying Term Weighting Methods to Other Text-related Applications
LIST OF TABLES
2.1 A Rule-based classifier for the wheat category of Reuters Corpus in the construe system
3.1 Term frequency component
3.2 Collection frequency component
3.3 The first three terms which share the same idf but have different ratios of a and c
3.4 The rf values with different a and c values
3.5 Comparison of six weighting values of four features in category 00 acq
3.6 Comparison of six weighting values of four features in category 03 earn
4.1 Statistical information of the 18 Journals Corpus
4.2 Statistical information of three subsets of the 18 Journals Corpus
4.3 McNemar's test contingency table
5.1 Summary of 10 term weighting methods studied in experiment set 1
5.2 Statistical significance test results on Reuters-21578 at different numbers of features
5.3 Statistical significance test results on the subset of 20 Newsgroups at different numbers of features
5.4 Summary of 8 supervised and unsupervised term weighting methods
5.5 Statistical significance test results on the two data corpora and two learning algorithms at certain numbers of features in terms of the micro-averaged F1 measure
5.6 Statistics of the top 10 largest categories in the 18 Journals Collection and the top 3 terms with the largest feature selection metric χ2
5.7 Comparison of the weighting values of four terms with respect to category chemistry
5.8 Comparison of the weighting values of four terms with respect to category genetics
5.9 The best performance of SVM with four term weighting schemes on the Ohsumed Corpus
LIST OF FIGURES
2.1 A Two-Level Hierarchy in Text Categorization
2.2 A decision tree equivalent to the DNF rule of Table 2.1. Edges are labelled by terms (underlining denotes negation) and leaves are labelled by categories (wheat in this example)
3.1 An example of the vector space model
3.2 Examples of document distributions with respect to six terms in the whole corpus
5.1 Micro-averaged break-even point results for the Reuters-21578 top ten categories using ten term weighting schemes at different numbers of features
5.2 Micro-averaged break-even point results for the subset of the 20 Newsgroups corpus using ten term weighting schemes at different numbers of features
5.3 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the linear SVM algorithm with different numbers of features
5.4 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the linear SVM algorithm with different numbers of features
5.5 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the linear SVM algorithm with different numbers of features
5.6 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the linear SVM algorithm with different numbers of features
5.7 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the kNN algorithm with different numbers of features
5.8 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the kNN algorithm with different numbers of features
5.9 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the kNN algorithm with different numbers of features
5.10 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the kNN algorithm with different numbers of features
5.11 F1 measure of the four term weighting methods on each category of the Reuters-21578 corpus using the SVM algorithm at the full vocabulary
5.12 F1 measure of the four term weighting methods on each category of the Reuters-21578 corpus using the kNN algorithm at a feature set size of 405
5.13 F1 measure of the four term weighting methods on each category of 20 Newsgroups using the SVM algorithm at a feature set size of 13456
5.14 F1 measure of the four term weighting methods on each category of 20 Newsgroups using the kNN algorithm at a feature set size of 494
5.15 Micro-averaged break-even point results for the Ohsumed Data Collection using four term weighting schemes at different numbers of features
5.16 Micro-averaged F1 value of the top 10 categories in the 18 Journals Data Collection
5.17 Macro-averaged F1 value of the top 10 categories in the 18 Journals Data Collection
5.18 Micro-averaged F1 value of different numbers of categories in the 18 Journals Data Collection
CHAPTER 1

INTRODUCTION

1.1 Motivation
Automatic text categorization plays a crucial role in many applications to sort, direct, classify, and provide the proper documents in a timely and correct manner. It is a basic building block in a wide range of contexts, ranging from document indexing, to document filtering, word sense disambiguation, and the population of hierarchical catalogues of Web resources.

The term "text categorization" (TC) is also called supervised text classification: the labels of the training data set are known in advance, and documents are automatically assigned to a predefined set of categories. This is the main topic of this thesis. In contrast, the term "text clustering" denotes unsupervised text classification, which performs without any known labelled data set. Therefore, aside from the above meaning of text categorization, the term "text clustering" has also been used to mean the automatic identification of such a set of categories and the grouping of documents under them.
Generally, building an automated TC system consists of two key subtasks. The first task is text representation, which converts the content of documents into a compact format so that they can be further processed by the text classifiers. The other task is to learn the model of a text classifier, which is used to classify the unlabelled documents.
The algorithms which have been applied to the TC task have been studied extensively in recent decades, and most of them are borrowed from the traditional machine learning (ML) domain, such as Support Vector Machines (SVMs), kNN, Decision Tree (DT), Naïve Bayes (NB), Neural Network (NN), Linear Regression (LR), etc. As a relatively new algorithm, SVM has a better performance than other methods due to its ability to efficiently handle relatively high dimensional and large-scale data sets without decreasing classification accuracy. In essence, kNN makes its predictions based on the k training patterns which are closest to the unlabelled (test) pattern. It is very simple and effective, but not efficient in the case of high dimensional and large-scale data sets. The DT algorithm is sometimes quite effective, but the consequent overfitting problem is intractable and needs to be handled manually case by case. The NB method assumes that the terms in one document are independent, even though this is not the case in the real world. The NN method, usually used in the artificial intelligence (AI) field, has shown lower classification accuracy than other machine learning methods.
Textual information is stored in many kinds of machine-readable forms, such as PDF, DOC, PostScript, HTML, XML and so on. Before the computer applies the text classifier to label an unknown document, the content of the document must be transformed into a compact and interpretable format so that it can be further recognized and classified by a computer or a classifier. This indexing procedure is called text representation. Apart from the inductive learning algorithms, text representation also has a crucial influence on how well text classifiers can generalize. Moreover, once a learning algorithm is given, choosing the text representation becomes the central modelling tool in building a text classifier, for several reasons. First, excellent algorithms are few. Second, since the rationale is inherent to each algorithm, the method is usually fixed for a given algorithm. Third, tuning the parameters of an algorithm yields lower improvement than one might expect for the complexity of the algorithm. Furthermore, not all algorithms have such parameters to tune. Therefore, the focus of this thesis is on text representation for text categorization.
In the traditional vector space model, the content of a document $d$ is represented as a vector in the term space, $d = (w_1, \ldots, w_{|T|})$, where $|T|$ is the size of the term (sometimes called feature) set and the value $w_k$, between 0 and 1, represents how much the $k$-th term contributes to the semantics of the document $d$. Thus, there are two important issues for text representation: (1) what a term should be, and (2) how to weigh a term. As the basic indexing units for representing the content of documents, terms can be at different levels, such as the sub-word level (syllables), the word level (single tokens), the multi-word level (phrases, sentences), etc. Different terms have different importance in a text; thus an important indicator $w_k$ (i.e. the term weight, usually between 0 and 1) associated with each term represents how much that term contributes to the semantics of the document for TC. The widely-used term weighting methods are borrowed from the traditional information retrieval (IR) field, such as tf.idf, binary, term frequency and so on.
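As an illustration of this weighting step, the sketch below (Python; the toy corpus and the particular tf.idf variant are illustrative assumptions, not the experimental setup of this thesis) builds cosine-normalized tf.idf document vectors in the vector space model just described:

```python
import math
from collections import Counter

# A toy tokenized corpus; any document collection would do.
docs = [
    "wheat prices rise as wheat exports grow".split(),
    "the team won the football match".split(),
    "grain and wheat tonnes shipped for export".split(),
]

vocab = sorted({t for d in docs for t in d})
N = len(docs)

# df(t): number of documents in which term t occurs.
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tfidf_vector(doc):
    """Represent doc as (w_1, ..., w_|T|): raw term frequency times
    idf = log(N / df), then cosine-normalized so weights lie in [0, 1]."""
    tf = Counter(doc)
    w = [tf[t] * math.log(N / df[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

for d in docs:
    print([round(x, 2) for x in tfidf_vector(d)])
```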
As we just mentioned, among the promising approaches to TC, SVM has a better performance than the others ([DPHS98], [Joa98], [LK02], [YL99], [DP03]). Generally, SVMs are classified into two categories, i.e. linear and non-linear, based on the different kernel functions. Usually, the kernel functions of SVMs are crucial to the classifier's performance. However, for the TC task, Leopold [LK02] points out that it is the text representation scheme which dominates the performance of TC rather than the kernel functions of SVM. Therefore, choosing an appropriate text representation is more important than choosing and tuning the kernel functions of SVM for TC. However, even given these previous studies, we could not definitely draw a conclusion as to which term weighting method is better than the others for an SVM-based text classifier, "because we have to bear in mind that comparisons are reliable only when based on experiments performed under carefully controlled conditions". Indeed, different data preparation (stemming, stop words removal, feature selection, different term weighting methods, etc.), different benchmark data collections, different classifiers with various parameters, and even different evaluation methods (micro- and macro-averaged precision, recall, accuracy, error, break-even point, ROC, etc.) have been adopted by different researchers. Therefore, a question surfaces here: "Among the various term weighting methods, which is the best term weighting method for an SVM-based text classifier?"
The traditional term weighting methods, such as binary, tf.idf and its variants, are usually borrowed from the IR domain. In contrast to IR, TC is a form of supervised learning, as it makes use of prior information on the membership of training documents in predefined categories. This known information is effective and has been widely used in the feature selection step [YP97] and in the supervised learning of text classifiers. Recently, researchers have proposed to incorporate this prior information into the term weighting methods themselves. Since these supervised term weighting methods take the document distribution into consideration, they are naturally expected to be superior to the unsupervised (traditional) term weighting methods. However, not much work has been done on a comprehensive comparison with unsupervised term weighting methods. Although there are partial comparisons in [DS03] and [SM05], these supervised term weighting methods have been shown to give mixed results with respect to tf.idf, and in most cases they have no superiority over tf.idf. On the other hand, another work [DTY+04] has replaced the idf factor with the χ2 factor as a term weighting method and drawn the conclusion that χ2 is better than idf, which is quite contrary to the finding in [DS03]. Therefore, two fundamental questions arise here: "Are supervised term weighting methods based on known information able to lead to better performance than unsupervised ones for text categorization?" and "Can we propose a new effective and efficient term weighting method by making use of this prior information given by the training data set?"

One goal of this thesis is to propose a new effective and efficient term weighting method for the TC task by using the prior information given by the training data set. Meanwhile, we also give an analytical explanation of the terms' discriminating power with empirical observation, and explore the classification performance of several widely-used term weighting methods for an SVM-based text classifier. Another goal is to examine the superiority of supervised term weighting methods and to investigate the relationship between term weighting methods and learning algorithms. There are three critical questions that this thesis will address:
1. How can we propose a new effective term weighting method by using the important prior information given by the training data set?

2. Among the various term weighting methods, which is the best term weighting method for an SVM-based text classifier?

3. Are supervised term weighting methods able to lead to better performance than unsupervised ones for text categorization? What kinds of relationships can we find between term weighting methods and the two widely-used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?
To address these three questions, this thesis is divided into three subtasks:

• First, we will analyze and investigate the terms' discriminating power in order to improve the performance of TC. From these investigations and analyses, we will gain insights into a better understanding concerning the intuitive idea of our newly-proposed supervised term weighting method.

• The second task is to explore term weighting methods for an SVM-based text classifier and provide insights into the differences between various traditional term weighting methods and their variants for the TC task. The empirical answer to this question is definitely interesting to researchers who would like to choose an appropriate term weighting method for SVM-based TC.

• Finally, we will examine the superiority of supervised term weighting methods and investigate the relationship between different term weighting methods and different learning algorithms under more general experimental conditions. This work will answer the third question with empirical evidence and give practical guidance on how to choose term weighting methods in terms of different learning algorithms. Moreover, we will also extend our study to a new application domain, i.e. biomedical literature classification.
1.2 Structure of the Thesis
Each of the remaining chapters of the thesis captures different aspects of the work, including approaches to text categorization, text representation and term weighting methods, analysis of the term's discriminating power, and details of the research infrastructure. Below is a roadmap of the remaining chapters of this thesis.

Chapter 2 provides a review of techniques for the task of TC, including the definition, its relationship with IR and ML, its taxonomy, its applications, and the most popular inductive learning approaches.

Chapter 3 discusses text representation and term weighting methods and analyzes the term's discriminating power; we then propose a new term weighting method, i.e. tf.rf. From a further quantitative comparison of the terms' discriminating power in real cases, we gain an insight into a better understanding regarding the basic idea behind tf.rf. This chapter provides a detailed qualitative analysis and explanation of our newly proposed supervised term weighting method, and partial answers, from the qualitative analysis aspect, to the first question that this thesis will address.
Chapter 4 lays out the methodology of research for the experiments in this thesis, including the inductive learning algorithms, the benchmark data collections, text preprocessing, performance evaluation, and statistical significance tests. This chapter provides a detailed description of all the experimental settings in this thesis.
Chapter 5 presents a series of experiments to investigate various widely-used traditional and state-of-the-art supervised term weighting methods under various experimental conditions. The purpose of this chapter is to seek answers to the three questions with more general experimental evidence. To accomplish this, we build a fixed universal platform to compare a variety of traditional and supervised term weighting methods with our tf.rf using cross-classifier, cross-corpus and even cross-domain validation. This chapter serves an important role in this thesis since it not only examines the performance of various term weighting methods with more experimental evidence, but also provides us with deeper insights and practical guidance on choosing term weighting methods in terms of different learning algorithms and corpora.
Chapter 6 summarizes the contributions of this thesis and outlines some possible directions for future research.

In this study, we focus only on the study of text representations rather than on improvements of inductive learning for TC. The default language studied is English; whether this new term weighting method would lead to different results for different languages is not clear. In addition, as we focus on the study of term weighting schemes for TC, we only change the term weighting schemes under the bag-of-words approach, while the remaining background conditions, such as data preparation, classifier and evaluation measures, remain unchanged.
The results of this study could be useful for researchers to choose an appropriate term weighting method for TC; to find the relationship between term weighting methods and various widely-used learning algorithms; and thus finally to improve the performance of automatic TC from the text representation aspect.
CHAPTER 2
A BRIEF REVIEW OF TEXT CATEGORIZATION
This chapter presents background knowledge of the TC task and the techniques for building text classifiers. We begin by introducing the definition of TC. Generally, TC is considered a field at the crossroads of machine learning (ML) and information retrieval (IR), since it shares a number of characteristics with these two fields. This chapter gives a broad overview of the TC system from the ML aspect. In the next chapter, we will discuss text representation from the IR aspect.

This chapter is organized as follows. Section 2.1 describes the concept of TC. Section 2.2 discusses its relationship with IR and ML. Section 2.3 reviews various subcases of TC tasks. Section 2.4 introduces its most important applications. Finally, Section 2.5 reviews several of the most popular inductive learning algorithms in TC.
2.1 A Definition of Text Categorization
Text categorization is the task of assigning unlabelled documents to predefined categories. Assume $D$ is a domain of documents and $C = \{c_1, c_2, \ldots, c_{|C|}\}$ is a set of predefined categories. Then the task is, for each document $d_j \in D$, to assign a decision to file $d_j$ under $c_i$ or a decision not to file $d_j$ under $c_i$ ($c_i \in C$) by virtue of a function $\Phi$, where the function $\Phi$ is also called the classifier (or model, or rule). As is common in the TC literature, the following constraint is assumed:

• Only endogenous knowledge extracted from the semantics of the documents is available. Other exogenous knowledge, such as publication date, document type, publication source and other metadata, is inaccessible.
Moreover, other types of TC tasks that do not depend on semantics only will not be discussed in this thesis. For example, text sentiment classification is the task of classifying a document according to the positive or negative polarity of its opinion (favorable or unfavorable; see [PLV02]). Another example is text genre classification, which differs from text classification in that it discriminates between the styles of the documents, as opposed to the latter, which discriminates between the topics of the documents (see [LM02]).
2.2 Relationship With Information Retrieval and Machine Learning

Since TC is a content-based document management task, it heavily relies on the basic machinery of IR and shares characteristics with IR in the following steps:
1. IR-style indexing, which is performed on the training documents and on those to be classified during the later inductive learning step;

2. IR-style induction, which is used to construct the inductive text classifier;

3. IR-style evaluation, which is performed to assess the performance of the classifier.
Dating back to the ’80s, the most popular approach to building automatic document classifiers consisted of manually constructing, by means of knowledge engineering (KE) techniques, an expert system capable of taking text classification decisions. The most famous example of this approach is the construe system [HANS90], built by the Carnegie Group for the Reuters news agency. Table 2.1 shows a sample rule of the type used in the construe system; key words are indicated in italic, and categories are indicated in small caps.
Table 2.1: A Rule-based classifier for the wheat category of the Reuters Corpus in the construe system

if ((wheat & farm) or
(wheat & commodity) or
(bushels & export) or
(wheat & tonnes) or
(wheat & winter & ¬soft))
then wheat
else ¬wheat
Clearly, the drawback of this approach is the knowledge acquisition bottleneck, a well-known built-in problem of expert systems: the rules must be manually defined by a knowledge engineer with the aid of a domain expert (an expert in the membership of documents in the chosen set of categories). Thus, if the set of categories is updated, these two professionals must intervene again; and if the classifier is ported to a completely different domain (i.e. a different set of categories), a different domain expert is needed and the work has to be repeated from scratch.
Since the early ’90s, the machine learning approach to TC has gained popularity and has eventually become the dominant one. In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category $c_i$ by observing the characteristics of a set of documents manually classified under $c_i$ or under $\bar{c}_i$ by a domain expert; from these characteristics, the inductive process gathers the characteristics that a new unseen document should have in order to be classified under $c_i$.
The advantages of the ML approach over the KE approach are quite evident. The KE approach puts its effort into the construction of a classifier with the aid of a domain expert, so this construction is done manually, not automatically. The ML approach, on the other hand, endeavors to construct an automatic classifier. This means that if an inductive learning process is available off-the-shelf, all that one needs to do is to inductively and automatically construct a classifier from a set of manually classified documents, namely, the training data set. The advantage is even more evident if the classifier already exists and the original set of categories is updated, or if the classifier is ported to a completely different domain.
Classifiers built by means of ML techniques nowadays achieve impressive effectiveness and efficiency, making automatic classification a qualitatively viable alternative to manual classification. We will discuss several promising methods that have been most popular in TC in Section 2.5.

2.3 Various Subcases of Text Categorization Tasks
Usually, the inductive approaches to building text classifiers cannot be applied to TC directly, because several constraints may be enforced on TC tasks according to different applications. Next we describe the techniques for two such subcases of TC tasks.
2.3.1 Single-label and Multilabel Text Categorization
Since semantics is a subjective notion, the membership of a document in a category cannot be decided deterministically. In fact, this inconsistency happens with very high frequency in the real world when two human experts decide whether to classify document $d_j$ under category $c_i$. For example, a news article on President Bush attending a WTO conference could be filed under Politics, or under Economics, or under both, or even under neither, depending on the subjective judgement of the expert.
Thus, the case in which exactly one category must be assigned to each document $d_j$ is often called single-label classification, while the case in which any number of categories from 0 to $|C|$ may be assigned to the same document $d_j$ is called multilabel classification. A special case of single-label text categorization is binary classification, in which each document $d_j$ must be assigned either to category $c_i$ or to its complement $\bar{c}_i$.
Binary classification is more general than multilabel classification, since an algorithm for binary classification can also be used for multilabel classification. To do this, one needs only transform the problem of multilabel classification under $\{c_1, c_2, \ldots, c_{|C|}\}$ into $|C|$ independent problems of binary classification under $\{c_i, \bar{c}_i\}$, for $i = 1, \ldots, |C|$. That is, for each given positive category $c_i$, when we build a classifier for $c_i$, all the other categories are combined together as the negative category $\bar{c}_i$. This transformation requires that these $|C|$ categories be stochastically independent of each other; that is, for any $c_m$ and $c_n$ ($m, n \in [1, |C|]$), the value of the model for category $c_m$ does not depend on the value of the model for category $c_n$, and vice versa. Typically this is assumed to be the case. (This is not the case in hierarchical classification, which we will discuss next.)
However, the converse transformation is not possible: an algorithm for multilabel classification cannot be used for either binary or single-label classification. There are two cases that need to be considered. Given a document $d_j$ to classify, (i) the classifier might attribute $k > 1$ categories to $d_j$, and it might not be obvious how to choose a "most appropriate" category from them; or (ii) the classifier might attribute no category at all to $d_j$, and it might not be obvious how to choose a "least appropriate" category from $C$. Thus, assigning exactly one, most appropriate category to each document in the corpus is not straightforward in that setting.
In this thesis we also adopt this splitting technique to deal with the binary case, for two reasons: (1) many important TC applications consist of binary classification problems, and (2) a solution to the binary case can be extended to the multilabel case; a minimal sketch of this decomposition is given below. Note that since handling multilabel classification is also a research area in its own right, naively combining binary classifiers is only one widely-adopted technique in the current TC literature.
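The sketch below (Python; the train_binary learner and its scoring interface are hypothetical placeholders for any of the binary methods of Section 2.5) shows how the |C| binary problems can be generated and their decisions recombined:

```python
def one_vs_rest_train(train_binary, documents, labels, categories):
    """Train one binary classifier per category c_i: documents filed under
    c_i are the positives; all remaining documents form the negative
    category (the complement of c_i)."""
    classifiers = {}
    for c in categories:
        y = [c in doc_labels for doc_labels in labels]  # one boolean per document
        classifiers[c] = train_binary(documents, y)
    return classifiers

def one_vs_rest_classify(classifiers, document, threshold=0.5):
    """Multilabel decision: assign every category whose binary classifier
    returns a score of at least the preassigned threshold."""
    return {c for c, clf in classifiers.items() if clf(document) >= threshold}
```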
2.3.2 Flat and Hierarchical Text Categorization
Imagine that there is a hierarchy with two top-level categories, Computers and Sports, and three subcategories within each, namely Computers/Hardware, Computers/Software, Computers/Chat, Sports/Football, Sports/Basketball and Sports/Chat, as Figure 2.1 shows.
In the flat, non-hierarchical classification case, a model corresponding to a positive category is learned to distinguish the target category from all the other categories.

Figure 2.1: A Two-Level Hierarchy in Text Categorization

However, in the hierarchical classification case, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same top level. In the example shown in Figure 2.1, the text classifiers corresponding to the top-level categories, Computers and Sports, distinguish them from each other; this is the same as flat, non-hierarchical TC. On the other hand, the model corresponding to each second-level category is learned to distinguish that second-level category from the other categories within the same top-level category. Specifically, the model built for category Computers/Hardware distinguishes it from the other two categories under the Computers category, namely Computers/Software and Computers/Chat.
Hierarchical TC has recently aroused a lot of interest, also for its possible application in automatically classifying web pages under hierarchical catalogues. Since the categories in a hierarchical structure are not independent of each other, the binary classifiers discussed in the previous subsection are not applicable. To solve this, the hierarchical text classification problem is usually decomposed into a set of smaller problems corresponding to hierarchical splits in the tree, by using the known hierarchical structure; a minimal sketch of this scheme follows below. That is, one first learns to distinguish among the categories at the top level; then lower-level distinctions are learned only within the appropriate top level of the tree. Each of these sub-problems can be solved much more efficiently, and hopefully more accurately as well. Techniques exploiting this intuition in a TC context have been presented by [DC00].
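As a minimal sketch of this top-down decomposition (Python; the per-node classifiers and the dictionary encoding of the hierarchy are hypothetical illustrations):

```python
def classify_hierarchical(node_classifiers, children, doc, root="ROOT"):
    """Route a document down the tree: at each internal node, a local
    classifier chooses among that node's children only, so lower-level
    distinctions are made only within the chosen top-level branch."""
    node = root
    while children.get(node):                 # descend until a leaf category
        node = node_classifiers[node](doc)    # local decision among siblings
    return node

# A hypothetical encoding of the two-level hierarchy of Figure 2.1:
children = {
    "ROOT": ["Computers", "Sports"],
    "Computers": ["Computers/Hardware", "Computers/Software", "Computers/Chat"],
    "Sports": ["Sports/Football", "Sports/Basketball", "Sports/Chat"],
}
```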
2.4 A Variety of Applications of Text Categorization Technology
TC techniques have been used for a number of different applications. Although we group these applications into different cases, the borders between them are fuzzy and somewhat artificial; that is, some of these cases could be considered special cases of others. Here we only discuss the most important ones. Other applications in combination with the availability of multimedia resources and/or other information extraction techniques will not be discussed in this thesis, for example, speech categorization by means of a combination of speech recognition and TC techniques ([KMW00], [SS00]), image categorization with textual titles [SH00], etc.
2.4.1 Automatic Document Indexing for IR Systems
The earliest research on TC techniques originated from automatic document indexing for IR systems. In this case each document is assigned one or more key words or key phrases (from a finite word set called a controlled dictionary) describing its content. The controlled dictionary often consists of a thematic hierarchical thesaurus, e.g. the MESH thesaurus for the biomedical literature. Automatic indexing with a controlled dictionary is also closely related to automated metadata generation. In digital libraries, one is usually interested in tagging documents with metadata that describes them under a variety of aspects (e.g. creation date, document type, author, availability, etc.). Some of this metadata is thematic; that is, its role is to describe the semantics of the document by means of bibliographic codes, key words or key phrases.

Usually, this work is done by trained human indexers, and is thus a costly activity. However, if the entries in the controlled vocabulary or the thematic metadata are viewed as categories, document indexing is actually an instance of TC, and may thus be addressed by the general automatic techniques. Various text classifiers explicitly conceived for text indexing have been described in the literature; see [TH93], [RH84], [FK84].
2.4.2 Documentation Organization
Document organization may be the most general application of TC techniques, covering many kinds of textual information, such as ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts and so on. Examples include the classification of incoming newspaper "classified" advertisements under categories such as Apartments or Houses for Rent/Sale, Cars for Sale, Job Hunting, Cheap Airfare and Vacation Packages; the organization of patents into categories for making their search easier [Lar99]; the automatic filing of newspaper articles under the appropriate sections (e.g., Politics, Home News, Lifestyles, etc.); and the automatic grouping of conference papers into sessions related to themes.
2.4.3 Text Filtering System
Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer (see [BC92]). One typical example is a news feed, where the producer is a news agency and the consumer is a newspaper (see [HANS90]). In this case, the filtering system should block the delivery of the documents the consumer is likely not interested in (e.g., all news not concerning sports in the case of a sports newspaper). In addition, a text filtering system may also further classify the documents deemed relevant to the consumer into different thematic categories. For example, all articles about sports, that is, the relevant documents, should be further classified according to which sport they deal with, so as to allow journalists specialized in individual sports to access only documents of prospective interest to them. Another example is a junk e-mail filtering system. In recent years, junk e-mails have become an increasingly important problem with great economic impact. Similarly, a junk e-mail filtering system may be trained to discard "spam" mails (see [AKCS00] and [HDW99]) and further classify non-spam mails into topical categories of interest to the user.

Information filtering by machine learning techniques has been widely discussed in the literature; see [AC99], [ILS+00], [KHZ00], [TKSK00] and [YL98].
2.4.4 Word Sense Disambiguation
Resolving natural language ambiguities is an important problem in computational linguistics, as polysemous and homonymous words commonly exist in various types of articles from different domains, whether in English or other languages. For instance, the word bank may have many different meanings in English; the two most common senses are a financial institute (as in the Bank of National Development) and a hydraulic engineering artifact (as in the bank of the river Thames). Thus, identifying the meanings of words in given contexts is quite important for many linguistics applications, such as natural language processing (NLP), and for indexing documents by word senses rather than by words for IR purposes. Word sense disambiguation (WSD) is the activity of finding the sense of an ambiguous word, given an occurrence of this particular word in a text. Although a number of IE techniques have been adopted in WSD, another possible solution to WSD is to adopt TC techniques, once we view word occurrence contexts as documents and word senses as categories (see [GY93] and [EMR00]), based on the assumption of one sense per discourse.

Other issues regarding resolving natural language ambiguities may all be tackled by means of TC techniques along the lines discussed for WSD, including context-sensitive spelling correction, prepositional phrase attachment, part of speech tagging, and lexical choice in machine translation (see [Rot98] for an introduction).
2.4.5 Hierarchical Categorization of Web Pages
When documents are catalogued in this hierarchical way, a researcher may find it easier to first navigate the hierarchy of categories and restrict his search to a particular category of interest. Therefore, many real-world web classification systems have been built on complex hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart and so on. This hierarchical web page classification may be solved by the hierarchical TC techniques we discussed in Section 2.3.2. Previous research works exploiting the hierarchical structure in a TC context have been discussed by [DC00], [WWP99], [RS02], [MRMN98] and [CDAR98]. In practice, as a rich resource of information, links have also been exploited in web page classification by [OML00], [GLF99], [Fü99], [CDI98] and [Att98], and experimentally compared by [YSG02].
2.5 Approaches to Effectively Learning Text Classifiers from Labelled Corpora
As we mentioned in Section 2.2, over recent decades machine learning approaches to effectively learning text classifiers have been pursued in a wide variety of ways. In this section, we deal only with the methods that have been most popularly applied in TC. Apart from the ML approaches, the Rocchio method is a unique approach borrowed from the traditional IR field, and thus we also include it.
Usually, the construction of a classifier for each category $c_i \in C$ consists in the definition of a target function $\Phi : (D, C) \rightarrow [0, 1]$ which returns a value for a given document $d_j$. The returned value, usually between 0 and 1, roughly represents the evidence for the fact that $d_j \in c_i$. Commonly, there is a preassigned threshold $\tau$ such that if $\Phi(d_j, c_i) \geq \tau$, the document $d_j$ is assigned to the positive category $c_i$, and vice versa.
The target function of the classifier can be a model, a hypothesis, or a rule, depending on the approach applied. For example, the Rocchio method builds an explicit profile of each category $c$, which is a weighted list of the discriminative terms, whether present or absent, under this category; k Nearest Neighbor is an example-based (sample- or instance-based) classifier; the C4.5 algorithm learns rules by constructing a decision tree; the Naïve Bayes classifier uses a probabilistic model of text; Support Vector Machines find the hyperplane which separates the positive and negative samples with the maximum margin; and so on.
Note that other, less standard or less popular approaches exist, such as regression methods, Neural Networks, genetic algorithms and maximum entropy modelling, but they are not included in this thesis because, in the TC domain, their performance is not comparable to that of the above promising ones and/or they have not been widely used in recent years. Moreover, there are several techniques that have been applied to improve classification performance effectively and efficiently, such as majority voting (namely, classifier committees), boosting, bootstrapping and so on. These techniques are not covered in this thesis either.
2.5.1 The Rocchio Method From Information Retrieval
The Rocchio method may be the only TC method rooted in the conventional IR field rather than in the ML field. It is used for inducing linear, profile-style classifiers, by means of an adaptation to TC of the well-known Rocchio formula for relevance feedback in the vector space model. The classifier built from the initial corpus is in fact an explicit profile; that is, for each category $c_i$, it is a weighted list of the terms whose presence or absence is most useful for discriminating $c_i$. This adaptation was first proposed by Hull [Hul94], and has been used by many authors since then, either as an object of research in its own right ([Joa97]), or as a baseline classifier ([Joa98], [CS96]), or as a member of a classifier committee.

The Rocchio method computes a classifier $\vec{c}_i = (w_{1i}, \ldots, w_{|T|i})$ ($|T|$ is the term set size) for category $c_i$, given an initial corpus $Tr = \{d_1, \ldots, d_{|Tr|}\} \subset D$, by means of the formula

$$ w_{ki} = \beta \cdot \frac{1}{|POS_i|} \sum_{d_j \in POS_i} w_{kj} \;-\; \gamma \cdot \frac{1}{|NEG_i|} \sum_{d_j \in NEG_i} w_{kj}, \qquad (2.1) $$

where $w_{kj}$ is the weight of term $t_k$ in document $d_j$, $POS_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = True\}$, and $NEG_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = False\}$. In this formula, $\beta$ and $\gamma$ are
two control parameters that allow setting the relative importance of positive and
negative examples in the training data set. For instance, if $\beta$ is set to 1 and $\gamma$ to 0 (as in [DPHS98], [Hul94] and [Joa98]), the profile of $c_i$ is the centroid of its positive training examples; thus, the centroid-based text classifier is actually a special case of the Rocchio method. Clearly, a classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training examples, and its distance from the centroid of the negative training examples. Sometimes, the role of the negative examples is deemphasized by setting $\beta$ to a high value and $\gamma$ to a low one, e.g. $\beta = 16$ and $\gamma = 4$ as in [Joa97] and [CS96].
One issue in the application of the Rocchio formula to profile extraction is whether the set $NEG_i$ should be considered in its entirety, or whether a well-chosen sample of it, such as the set $NPOS_i$ of near-positives (defined as "the most positive among the negative training examples"), should be selected from it. The $NPOS_i$ factor is more significant than $NEG_i$, since near-positives are the most difficult documents to tell apart from the positives. This method originates from the observation that, when the original Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather than generic negatives, since the documents on which user judgements are available are among the ones that have scored highest in the previous ranking. Regarding this issue, see [SSS98], [RS99], [WWP99].
The obvious advantage of this method is interpretability, as such a profile is more easily understandable by a human than neural network classifiers, probabilistic classifiers or high-dimensional SVM classifiers. Another advantage is its ease of implementation, and it is also quite efficient, since learning a classifier basically comes down to averaging term weights; a minimal sketch is given below. On the other hand, in terms of effectiveness, a drawback of this method is that if the documents in the category tend to occur in disjoint clusters, such a classifier may miss most of them, as the centroid of these documents may fall outside all of these clusters. More generally, a classifier built by the Rocchio method, as all linear classifiers, has the disadvantage that it divides the space of documents linearly; in the disjoint-cluster case, even many of the positive training examples would not be classified correctly by the linear classifier. Generally, the Rocchio classifier has always been considered an underperformer that cannot achieve an effectiveness comparable to that of a state-of-the-art machine learning method ([SSS98] improved its effectiveness to be comparable to that of a boosting method by using other enhancements).
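As a minimal sketch of formula (2.1) (Python; document vectors are assumed to be precomputed weight lists, e.g. tf.idf vectors, and the defaults β = 16, γ = 4 follow [Joa97]):

```python
def rocchio_profile(pos_vectors, neg_vectors, beta=16.0, gamma=4.0):
    """Formula (2.1): beta times the positive centroid minus gamma times
    the negative centroid, computed coordinate by coordinate."""
    dim = len(pos_vectors[0])
    profile = []
    for k in range(dim):
        pos_mean = sum(v[k] for v in pos_vectors) / len(pos_vectors)
        neg_mean = sum(v[k] for v in neg_vectors) / len(neg_vectors)
        profile.append(beta * pos_mean - gamma * neg_mean)
    return profile

def rocchio_score(profile, doc_vector):
    """Score a test document by its dot product with the profile; a
    preassigned threshold on this score yields the binary decision."""
    return sum(p * w for p, w in zip(profile, doc_vector))
```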
2.5.2 k Nearest Neighbor
k Nearest Neighbor (kNN) is a kind of example-based classifier which does not build an explicit, declarative representation of the category $c_i$, but relies on the category labels attached to the training documents similar to the test document. Other example-based methods exist, but kNN is the most widely-used one. In essence, kNN makes its prediction based on the $k$ training patterns that are closest to the unlabelled (test) pattern, according to a distance metric. The commonly used metrics that measure the similarity between two normalized patterns include the Euclidean distance
$$ Dis(p, q) = \sqrt{\sum_i (p_i - q_i)^2}, \qquad (2.2) $$

the inner product

$$ Sim(p, q) = \sum_i p_i \, q_i, \qquad (2.3) $$

and the cosine similarity

$$ Sim(p, q) = \frac{\sum_i p_i \, q_i}{\sqrt{\sum_i p_i^2} \cdot \sqrt{\sum_i q_i^2}}. \qquad (2.4) $$

For deciding whether to file a test document $d_j$ under category $c_i$, kNN checks whether the $k$ training documents most similar to $d_j$ are also in $c_i$; if the answer is positive for a large enough proportion of them, a positive decision is taken, and a negative decision is taken otherwise. The kNN used in [YC94] is actually a distance-weighted version, so thresholding methods need to be used to convert the real-valued distances into binary categorization decisions. [YP97] and [YL99] used kNN based on the cosine similarity metric to measure the similarity between two documents.
The construction of a kNN classifier also involves determining a threshold $k$ that indicates how many top-ranked training documents have to be considered for computing the distance. [LC96] used $k = 20$, while [YL99] and [YC94] found $30 \leq k \leq 45$ to yield the best effectiveness. [Joa98] also achieved the best performance for kNN when $30 \leq k \leq 45$.
Unlike linear classifiers, kNN does not divide the document space linearly, and thus does not suffer from the problem discussed at the end of Subsection 2.5.1. A number of different experiments have shown kNN to be quite effective. However, its most significant drawback is its inefficiency at classification time, which results from its very rationale in the case of high dimensional and large-scale data sets. Unlike a linear classifier, where only a dot product needs to be computed to classify a test document, kNN requires the entire set of training documents to be ranked for similarity with the test document, which is much more expensive. Actually, the kNN method may not even be called an inductive learner, as it does not have a true training (learning) phase and thus postpones all the computation to classification time; a minimal sketch is given below.
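As a minimal sketch of such a classifier (Python; the simple majority vote over the top k neighbors is one illustrative variant, not the distance-weighted version of [YC94]; k = 30 falls in the 30–45 range reported above):

```python
import math

def cosine(p, q):
    """Cosine similarity, equation (2.4)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def knn_decide(train_vectors, train_is_positive, test_vector, k=30):
    """All computation happens at classification time: rank every training
    document by similarity to the test document, then vote over the top k."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda j: cosine(train_vectors[j], test_vector),
                    reverse=True)
    top = ranked[:k]
    votes = sum(1 for j in top if train_is_positive[j])
    return votes > len(top) / 2   # positive iff most neighbors are positive
```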
2.5.3 Naïve Bayes Method
The Naïve Bayes classifier is a probabilistic classifier which views the target function $\breve{\Phi}(d_j, c_i)$ in terms of the conditional probability $P(c_i \mid \vec{d}_j)$; that is, it computes the probability that a document represented by a vector $\vec{d}_j = (w_{1j}, \ldots, w_{|T|j})$ of terms belongs to $c_i$. By an application of Bayes' theorem, this probability is given by

$$ P(c_i \mid \vec{d}_j) = \frac{P(c_i) \, P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)}, \qquad (2.5) $$

where $P(\vec{d}_j)$ is the probability that a randomly picked document has vector $\vec{d}_j$ as its representation, and $P(c_i)$ is the probability that a randomly picked document belongs to $c_i$.
In order to make the estimation of $P(\vec{d}_j \mid c_i)$ in (2.5) practical, it is common to make the assumption that any two coordinates of the document vector are statistically independent of each other when they are viewed as random variables; this independence assumption is encoded by the equation

$$ P(\vec{d}_j \mid c_i) = \prod_{k=1}^{|T|} P(w_{kj} \mid c_i). \qquad (2.6) $$

The probabilistic classifiers that use this assumption are called Naïve Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature, for example, [Joa98] and [Lew98].

Without the independence assumption, the estimation of $P(\vec{d}_j \mid c_i)$ would be an impossible mission, since the number of possible vectors $\vec{d}_j$ is too high. Although the naive character of this classifier makes the computation possible, this assumption does not hold in practice. In addition, non-binary term weights are not applicable to this method.¹
To calculate $\prod_{k=1}^{|T|} P(w_{kj} \mid c_i)$, two models are used. One is the multi-variate Bernoulli model, which is a Bayesian network with no dependencies between words and with binary word features; the other is the multinomial model, that is, a unigram language model with integer word counts. [MN98] empirically compared their classification performance and found that the multi-variate Bernoulli model performs well at small vocabulary sizes, but the multinomial model usually performs even better at larger vocabulary sizes. One prominent characteristic of the multinomial model is that it relaxes the constraint that document vectors should be binary representations; a minimal sketch of this model is given after the footnote below.
¹ [MN98] used a multinomial event model for Naïve Bayes text classification, which can relax the constraint that document vectors should be binary-valued.
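As a minimal sketch of the multinomial model (Python; add-one Laplace smoothing is an assumption added here to avoid zero probabilities, and log-probabilities replace the raw product of (2.6) to avoid numerical underflow):

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_category, vocab):
    """Estimate log P(c_i) and log P(t_k | c_i) from integer token counts,
    with add-one smoothing; docs_by_category maps each category to a list
    of tokenized documents (lists of tokens)."""
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    model = {}
    for c, docs in docs_by_category.items():
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + len(vocab)
        model[c] = (
            math.log(len(docs) / total_docs),                       # log P(c)
            {t: math.log((counts[t] + 1) / denom) for t in vocab},  # log P(t|c)
        )
    return model

def classify_nb(model, doc):
    """Pick argmax over categories of log P(c) + sum_k log P(w_k | c),
    i.e. (2.5) with assumption (2.6) and the constant P(d_j) dropped."""
    def score(c):
        log_prior, log_cond = model[c]
        return log_prior + sum(log_cond[t] for t in doc if t in log_cond)
    return max(model, key=score)
```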