A New Term Weighting Method for Text Categorization
By Man Lan
Submitted For The Degree Of Doctor of Philosophy
at the Department of Computer Science, School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543
September, 2006
© Copyright 2006 by Man Lan (lanman.sg@gmail.com)
Name: Man Lan
Degree: Doctor of Philosophy
Department: Department of Computer Science
Thesis Title: A New Term Weighting Method for Text Categorization

Abstract: Text representation is the task of transforming the content of a textual document into a compact representation so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on the development of an effective and efficient term weighting method for the text categorization task. We selected the single token as the unit of feature, because previous research showed that this simple type of feature outperformed other, more complicated types of features. We have investigated several widely-used unsupervised and supervised term weighting methods on several popular data collections, in combination with the SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection and an analysis of the term's discriminating power, we have proposed a new term weighting scheme, namely tf.rf. The controlled experimental results showed that the term weighting methods give mixed performance across data sets with different category distributions and across different learning algorithms. Most of the supervised term weighting methods based on information theory have not shown satisfactory performance in our experiments. However, the newly proposed tf.rf method shows a consistently better performance than other term weighting methods. On the other hand, the popularly used tf.idf method has not shown a uniformly good performance with respect to data sets with different category distributions.
Keywords: Text Categorization, Term Weighting Method, Support Vector Machine, kNN.
To my parents and my husband.
ACKNOWLEDGEMENT
I would first like to thank my advisors Prof. Chew Lim Tan and Dr. Hwee Boon Low for their deep insights and dedication in guiding and helping me through this thesis research. Without their creative, valuable supervision, this work would have encountered a lot of difficulties.

I also sincerely appreciate the suggestions and insights I obtained from my former academic advisors: Professor Sam Yuan Sung, for his suggestions on my preliminary thesis report in the Center for Information Mining and Extraction (CHIME) lab of the School of Computing, National University of Singapore; Dr. Ah Hwee Tan, currently with Nanyang Technological University, for giving me many useful suggestions during my work in the Text Mining lab of the A-STAR Institute for Infocomm Research; and Prof. Kang Lin Xie, of Shanghai Jiao Tong University, for encouraging me to further my education and research.

The former staff member in the CHIME lab of the School of Computing, National University of Singapore, Dr. Ji He, helped me with discussions, cooperation and encouragement, making the research life in Singapore a very interesting and exciting experience. Last but not least, to my loving parents and my husband, for their support and encouragement through all these years in the Ph.D. program.
TABLE OF CONTENTS
1 Introduction
1.1 Motivation
1.2 Structure of the Thesis
2 A Brief Review of Text Categorization
2.1 A Definition of Text Categorization
2.2 Relationship With Information Retrieval and Machine Learning
2.3 Various Subcases of Text Categorization Tasks
2.3.1 Single-label and Multilabel Text Categorization
2.3.2 Flat and Hierarchical Text Categorization
2.4 A Variety of Applications of Text Categorization Technology
2.4.1 Automatic Document Indexing for IR Systems
2.4.2 Documentation Organization
2.4.3 Text Filtering System
2.4.4 Word Sense Disambiguation
2.4.5 Hierarchical Categorization of Web Pages
2.5 Approaches to Effectively Learning Text Classifiers from Labelled Corpora
2.5.1 The Rocchio Method From Information Retrieval
2.5.2 k Nearest Neighbor
2.5.3 Naïve Bayes Method
2.5.4 Decision Tree
2.5.5 Support Vector Machines
2.5.6 A Summary of These Approaches
3 Text Representation for Text Categorization
3.1 Introduction
3.2 The Prerequisites of Text Representation
3.2.1 Stop Words
3.2.2 Stemming
3.2.3 Feature Selection
3.3 What Should a Term Be?
3.3.1 Sub-Word Level
3.3.2 Word Level
3.3.3 Multi-Word Level
3.3.4 Semantic and Syntactic Representations
3.3.5 Other Knowledge-based Text Representations
3.3.6 Remarks on the Term Types
3.4 How to Weigh a Term?
3.4.1 Term Frequency Factor
3.4.2 Collection Frequency Factor
3.4.3 Normalization Factor
3.4.4 Traditional Term Weighting Methods from IR
3.5 Supervised Term Weighting Methods
3.5.1 Combined with Information-Theory Functions or Statistical Metrics
3.5.2 Based on Statistical Confidence Intervals
3.6 Analysis of Term's Discriminating Power
3.7 A New Proposed Supervised Term Weighting Scheme — RF
3.8 Empirical Observation of Term's Discriminating Power
4 Methodology of Research
4.1 Machine Learning Algorithms Applied in This Thesis
4.1.1 Support Vector Machines
4.1.2 k Nearest Neighbors
4.2 Benchmark Data Collections
4.2.1 Text Preprocessing
4.2.2 Reuters News Corpus
4.2.3 20 Newsgroups Corpus
4.2.4 Ohsumed Corpus
4.2.5 18 Journals Corpus
4.3 Evaluation Methodology
4.3.1 Precision and Recall
4.3.2 F1 Function
4.3.3 Breakeven Point
4.3.4 Accuracy
4.4 Statistical Significance Tests
5 Experimental Research
5.1 Experiment Set 1: Exploring the Best Term Weighting Method for SVM-based Text Categorization
5.1.1 Term Weighting Methods
5.1.2 Results and Discussion
5.1.3 Concluding Remarks
5.2 Experiment Set 2: Investigating Supervised Term Weighting Methods and Their Relationship with Machine Learning Algorithms
5.2.1 Methodology
5.2.2 Results and Discussion
5.2.3 Further Analysis
5.2.4 Concluding Remarks
5.3 Experiment Set 3: Application to Biomedical Data Collections
5.3.1 Motivation
5.3.2 Examples of Terms' Discriminating Power
5.3.3 Results and Discussion
5.3.4 Concluding Remarks
6 Contributions and Future Directions
6.1 Contributions
6.2 Future Work
6.2.1 Extending Term Weighting Methods on Feature Types other than Words
6.2.2 Applying Term Weighting Methods to Other Text-related Applications
LIST OF TABLES
2.1 A Rule-based classifier for the wheat category of Reuters Corpus in the construe system
3.1 Term frequency component
3.2 Collection frequency component
3.3 The first three terms which share the same idf but have different ratios of a and c
3.4 The rf values with different a and c values
3.5 Comparison of six weighting values of four features in category 00 acq
3.6 Comparison of six weighting values of four features in category 03 earn
4.1 Statistical information of the 18 Journals Corpus
4.2 Statistical information of three subsets of the 18 Journals Corpus
4.3 McNemar's test contingency table
5.1 Summary of 10 term weighting methods studied in experiment set 1
5.2 Statistical significance test results on Reuters-21578 at different numbers of features
5.3 Statistical significance test results on the subset of 20 Newsgroups at different numbers of features
5.4 Summary of 8 supervised and unsupervised term weighting methods
5.5 Statistical significance test results on the two data corpora and two learning algorithms at certain numbers of features in terms of the micro-averaged F1 measure
5.6 Statistics of the top 10 largest categories in the 18 Journals Collection and the top 3 terms with the largest feature selection metric χ2
5.7 Comparison of the weighting values of four terms with respect to category chemistry
5.8 Comparison of the weighting values of four terms with respect to category genetics
5.9 The best performance of SVM with four term weighting schemes on the Ohsumed Corpus
LIST OF FIGURES
2.1 A Two-Level Hierarchy in Text Categorization
2.2 A decision tree equivalent to the DNF rule of Table 2.1. Edges are labelled by terms (underlining denotes negation) and leaves are labelled by categories (wheat in this example)
3.1 An example of the vector space model
3.2 Examples of document distributions with respect to six terms in the whole corpus
5.1 Micro-averaged break-even point results for the Reuters-21578 top ten categories using ten term weighting schemes at different numbers of features
5.2 Micro-averaged break-even point results for the subset of the 20 Newsgroups corpus using ten term weighting schemes at different numbers of features
5.3 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the linear SVM algorithm with different numbers of features
5.4 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the linear SVM algorithm with different numbers of features
5.5 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the linear SVM algorithm with different numbers of features
5.6 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the linear SVM algorithm with different numbers of features
5.7 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the kNN algorithm with different numbers of features
5.8 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the Reuters-21578 top ten categories using the kNN algorithm with different numbers of features
5.9 Micro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the kNN algorithm with different numbers of features
5.10 Macro-averaged F1 measure of the eight unsupervised and supervised term weighting approaches on the 20 Newsgroups Corpus using the kNN algorithm with different numbers of features
5.11 F1 measure of the four term weighting methods on each category of the Reuters-21578 corpus using the SVM algorithm at the full vocabulary
5.12 F1 measure of the four term weighting methods on each category of the Reuters-21578 corpus using the kNN algorithm at a feature set size of 405
5.13 F1 measure of the four term weighting methods on each category of 20 Newsgroups using the SVM algorithm at a feature set size of 13456
5.14 F1 measure of the four term weighting methods on each category of 20 Newsgroups using the kNN algorithm at a feature set size of 494
5.15 Micro-averaged break-even point results for the Ohsumed Data Collection using four term weighting schemes at different numbers of features
5.16 Micro-averaged F1 value of the top 10 categories in the 18 Journals Data Collection
5.17 Macro-averaged F1 value of the top 10 categories in the 18 Journals Data Collection
5.18 Micro-averaged F1 value of different numbers of categories in the 18 Journals Data Collection
CHAPTER 1

INTRODUCTION

1.1 Motivation
Automatic text categorization plays a crucial role in many applications to sort, direct, classify, and provide the proper documents in a timely and correct manner. It is a basic building block in a wide range of contexts, ranging from document indexing, to document filtering, word sense disambiguation, and the population of hierarchical catalogues of Web resources.

The term "text categorization" (TC) is also called supervised text classification: the labels of the training data set are known in advance, and documents are automatically assigned to a predefined set of categories. This is the main topic of this thesis. In contrast, the term "text clustering" denotes unsupervised text classification, which performs without any known labelled data set. Therefore, aside from the above meaning of text categorization, the term "text clustering" has also been used to mean the automatic identification of such a set of categories and the grouping of documents under them.
Generally, building an automated TC system consists of two key subtasks. The first task is text representation, which converts the content of documents into a compact format so that they can be further processed by the text classifiers. The other task is to learn the model of a text classifier, which is used to classify the unlabelled documents.
The algorithms which have been applied to the TC task have been studied extensively in recent decades, and most of them are borrowed from the traditional machine learning (ML) domain, such as Support Vector Machines (SVMs), kNN, Decision Tree (DT), Naïve Bayes (NB), Neural Network (NN), Linear Regression (LR), etc. As a relatively new algorithm, SVM has a better performance than other methods due to its ability to efficiently handle relatively high dimensional and large-scale data sets without decreasing classification accuracy. In essence, kNN makes its predictions based on the k training patterns which are closest to the unlabelled (test) pattern. It is very simple and effective, but not efficient in the case of high dimensional and large-scale data sets. The DT algorithm is sometimes quite effective, but the consequent overfitting problem is intractable and needs to be handled manually case by case. The NB method assumes that the terms in one document are independent, even though this is not the case in the real world. The NN method, usually used in the artificial intelligence (AI) field, has shown lower classification accuracy than other machine learning methods.
Textual information is stored in many kinds of machine-readable forms, such as PDF, DOC, PostScript, HTML, XML and so on. Before the computer applies the text classifier to label an unknown document, the content of the document must be transformed into a compact and interpretable format so that it can be further recognized and classified by a computer or a classifier. This indexing procedure is called text representation. Apart from the inductive learning algorithms, text representation also has a crucial influence on how well text classifiers can generalize. Moreover, once a learning algorithm is given, choosing the text representation becomes the central modelling tool in building a text classifier, for several reasons. First, excellent algorithms are few. Second, since the rationale is inherent to each algorithm, the method is usually fixed for a given algorithm. Third, tuning the parameters of an algorithm yields lower improvement than one might expect for the complexity of the algorithm. Furthermore, not all algorithms have such parameters to tune. Therefore, the focus of this thesis is on text representation for text categorization.
In the traditional vector space model, the content of a document $d$ is represented as a vector in the term space, $d = (w_1, \ldots, w_{|T|})$, where $|T|$ is the size of the term (sometimes called feature) set and the value $w_k$, between 0 and 1, represents how much the $k$-th term contributes to the semantics of the document $d$. Thus, there are two important issues for text representation: (1) what a term should be, and (2) how to weigh a term. As the basic indexing units for representing the content of documents, terms can be at different levels, such as the sub-word level (syllables), the word level (single tokens), the multi-word level (phrases, sentences), etc. Different terms have different importance in a text; thus an important indicator $w_k$ (i.e. the term weight, usually between 0 and 1) associated with each term represents how much that term contributes to the semantics of the document for TC. The widely-used term weighting methods are borrowed from the traditional information retrieval (IR) field, such as tf.idf, binary, term frequency and so on.
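As an illustration of this weighting step, the sketch below (Python; the toy corpus and the particular tf.idf variant are illustrative assumptions, not the experimental setup of this thesis) builds cosine-normalized tf.idf document vectors in the vector space model just described:

```python
import math
from collections import Counter

# A toy tokenized corpus; any document collection would do.
docs = [
    "wheat prices rise as wheat exports grow".split(),
    "the team won the football match".split(),
    "grain and wheat tonnes shipped for export".split(),
]

vocab = sorted({t for d in docs for t in d})
N = len(docs)

# df(t): number of documents in which term t occurs.
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tfidf_vector(doc):
    """Represent doc as (w_1, ..., w_|T|): raw term frequency times
    idf = log(N / df), then cosine-normalized so weights lie in [0, 1]."""
    tf = Counter(doc)
    w = [tf[t] * math.log(N / df[t]) for t in vocab]
    norm = math.sqrt(sum(x * x for x in w)) or 1.0
    return [x / norm for x in w]

for d in docs:
    print([round(x, 2) for x in tfidf_vector(d)])
```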
As we just mentioned, among the promising approaches to TC, SVM has a better performance than the others ([DPHS98], [Joa98], [LK02], [YL99], [DP03]). Generally, SVMs are classified into two categories, i.e. linear and non-linear, based on the different kernel functions. Usually, the kernel functions of SVMs are crucial to the classifier's performance. However, for the TC task, Leopold [LK02] points out that it is the text representation scheme which dominates the performance of TC rather than the kernel functions of SVM. Therefore, choosing an appropriate text representation is more important than choosing and tuning the kernel functions of SVM for TC. However, even given these previous studies, we could not definitely draw a conclusion as to which term weighting method is better than the others for an SVM-based text classifier, "because we have to bear in mind that comparisons are reliable only when based on experiments performed under carefully controlled conditions". Indeed, different data preparation (stemming, stop words removal, feature selection, different term weighting methods, etc.), different benchmark data collections, different classifiers with various parameters, and even different evaluation methods (micro- and macro-averaged precision, recall, accuracy, error, break-even point, ROC, etc.) have been adopted by different researchers. Therefore, a question surfaces here: "Among the various term weighting methods, which is the best term weighting method for an SVM-based text classifier?"
The traditional term weighting methods, such as binary, tf.idf and its variants, are usually borrowed from the IR domain. In contrast to IR, TC is a form of supervised learning, as it makes use of prior information on the membership of training documents in predefined categories. This known information is effective and has been widely used in the feature selection step [YP97] and in the supervised learning of text classifiers. Recently, researchers have proposed to incorporate this prior information into the term weighting methods themselves. Since these supervised term weighting methods take the document distribution into consideration, they are naturally expected to be superior to the unsupervised (traditional) term weighting methods. However, not much work has been done on a comprehensive comparison with unsupervised term weighting methods. Although there are partial comparisons in [DS03] and [SM05], these supervised term weighting methods have been shown to give mixed results with respect to tf.idf, and in most cases they have no superiority over tf.idf. On the other hand, another work [DTY+04] has replaced the idf factor with the χ2 factor as a term weighting method and drawn the conclusion that χ2 is better than idf, which is quite contrary to the finding in [DS03]. Therefore, two fundamental questions arise here: "Are supervised term weighting methods based on known information able to lead to better performance than unsupervised ones for text categorization?" and "Can we propose a new effective and efficient term weighting method by making use of this prior information given by the training data set?"

One goal of this thesis is to propose a new effective and efficient term weighting method for the TC task by using the prior information given by the training data set. Meanwhile, we also give an analytical explanation of the terms' discriminating power with empirical observation, and explore the classification performance of several widely-used term weighting methods for an SVM-based text classifier. Another goal is to examine the superiority of supervised term weighting methods and to investigate the relationship between term weighting methods and learning algorithms. There are three critical questions that this thesis will address:
1. How can we propose a new effective term weighting method by using the important prior information given by the training data set?

2. Among the various term weighting methods, which is the best term weighting method for an SVM-based text classifier?

3. Are supervised term weighting methods able to lead to better performance than unsupervised ones for text categorization? What kinds of relationships can we find between term weighting methods and the two widely-used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?
To address these three questions, this thesis is divided into three subtasks:

• First, we will analyze and investigate the terms' discriminating power in order to improve the performance of TC. From these investigations and analyses, we will gain insights into a better understanding concerning the intuitive idea of our newly-proposed supervised term weighting method.

• The second task is to explore term weighting methods for an SVM-based text classifier and provide insights into the differences between various traditional term weighting methods and their variants for the TC task. The empirical answer to this question is definitely interesting to researchers who would like to choose an appropriate term weighting method for SVM-based TC.

• Finally, we will examine the superiority of supervised term weighting methods and investigate the relationship between different term weighting methods and different learning algorithms under more general experimental conditions. This work will answer the third question with empirical evidence and give practical guidance on how to choose term weighting methods in terms of different learning algorithms. Moreover, we will also extend our study to a new application domain, i.e. biomedical literature classification.
1.2 Structure of the Thesis
Each of the remaining chapters of the thesis captures different aspects of the work, including approaches to text categorization, text representation and term weighting methods, analysis of the term's discriminating power, and details of the research infrastructure. Below is a roadmap of the remaining chapters of this thesis.

Chapter 2 provides a review of techniques for the task of TC, including the definition, its relationship with IR and ML, its taxonomy, its applications, and the most popular inductive learning approaches.

Chapter 3 discusses text representation and term weighting methods and analyzes the term's discriminating power; we then propose a new term weighting method, i.e. tf.rf. From a further quantitative comparison of the terms' discriminating power in real cases, we gain an insight into a better understanding regarding the basic idea behind tf.rf. This chapter provides a detailed qualitative analysis and explanation of our newly proposed supervised term weighting method, and partial answers, from the qualitative analysis aspect, to the first question that this thesis will address.
Chapter 4 lays out the methodology of research for the experiments in this thesis, including the inductive learning algorithms, the benchmark data collections, text preprocessing, performance evaluation, and statistical significance tests. This chapter provides a detailed description of all the experimental settings in this thesis.
Chapter 5 presents a series of experiments to investigate various widely-used traditional and state-of-the-art supervised term weighting methods under various experimental conditions. The purpose of this chapter is to seek answers to the three questions with more general experimental evidence. To accomplish this, we build a fixed universal platform to compare a variety of traditional and supervised term weighting methods with our tf.rf using cross-classifier, cross-corpus and even cross-domain validation. This chapter serves an important role in this thesis since it not only examines the performance of various term weighting methods with more experimental evidence, but also provides us with deeper insights and practical guidance on choosing term weighting methods in terms of different learning algorithms and corpora.
Chapter 6 summarizes the contributions of this thesis and outlines some possible directions for future research.

In this study, we focus only on the study of text representations rather than on improvements of inductive learning for TC. The default language studied is English; whether this new term weighting method would lead to different results for different languages is not clear. In addition, as we focus on the study of term weighting schemes for TC, we only change the term weighting schemes under the bag-of-words approach, while the remaining background conditions, such as data preparation, classifier and evaluation measures, remain unchanged.
The results of this study could be useful for researchers to choose an appropriate term weighting method for TC; to find the relationship between term weighting methods and various widely-used learning algorithms; and thus finally to improve the performance of automatic TC from the text representation aspect.
CHAPTER 2
A BRIEF REVIEW OF TEXT CATEGORIZATION
This chapter presents background knowledge of the TC task and the techniques for building text classifiers. We begin by introducing the definition of TC. Generally, TC is considered a field at the crossroads of machine learning (ML) and information retrieval (IR), since it shares a number of characteristics with these two fields. This chapter gives a broad overview of the TC system from the ML aspect. In the next chapter, we will discuss text representation from the IR aspect.

This chapter is organized as follows. Section 2.1 describes the concept of TC. Section 2.2 discusses its relationship with IR and ML. Section 2.3 reviews various subcases of TC tasks. Section 2.4 introduces its most important applications. Finally, Section 2.5 reviews several of the most popular inductive learning algorithms in TC.
2.1 A Definition of Text Categorization
Text categorization is the task of assigning unlabelled documents to predefined categories. Assume $D$ is a domain of documents and $C = \{c_1, c_2, \ldots, c_{|C|}\}$ is a set of predefined categories. Then the task is, for each document $d_j \in D$, to assign a decision to file $d_j$ under $c_i$ or a decision not to file $d_j$ under $c_i$ ($c_i \in C$) by virtue of a function $\Phi$, where the function $\Phi$ is also called the classifier (or model, or rule). As is common in the TC literature, the following constraint is assumed:

• Only endogenous knowledge extracted from the semantics of the documents is available. Other exogenous knowledge, such as publication date, document type, publication source and other metadata, is inaccessible.
Moreover, other types of TC tasks that do not depend on semantics only will not be discussed in this thesis. For example, text sentiment classification is the task of classifying a document according to the positive or negative polarity of its opinion (favorable or unfavorable; see [PLV02]). Another example is text genre classification, which differs from text classification in that it discriminates between the styles of the documents, as opposed to the latter, which discriminates between the topics of the documents (see [LM02]).
2.2 Relationship With Information Retrieval and Machine Learning

Since TC is a content-based document management task, it heavily relies on the basic machinery of IR and shares characteristics with IR in the following steps:
1. IR-style indexing, which is performed on the training documents and on those to be classified during the later inductive learning step;

2. IR-style induction, which is used to construct the inductive text classifier;

3. IR-style evaluation, which is performed to assess the performance of the classifier.
Dating back to the ’80s, the most popular approach to building automatic document classifiers consisted of manually constructing, by means of knowledge engineering (KE) techniques, an expert system capable of taking text classification decisions. The most famous example of this approach is the construe system [HANS90], built by the Carnegie Group for the Reuters news agency. Table 2.1 shows a sample rule of the type used in the construe system; key words are indicated in italic, and categories are indicated in small caps.
Table 2.1: A Rule-based classifier for the wheat category of the Reuters Corpus in the construe system

if ((wheat & farm) or
(wheat & commodity) or
(bushels & export) or
(wheat & tonnes) or
(wheat & winter & ¬soft))
then wheat
else ¬wheat
Clearly, the drawback of this approach is the knowledge acquisition bottleneck, a well-known built-in problem of expert systems: the rules must be manually defined by a knowledge engineer with the aid of a domain expert (an expert in the membership of documents in the chosen set of categories). Thus, if the set of categories is updated, these two professionals must intervene again; and if the classifier is ported to a completely different domain (i.e. a different set of categories), a different domain expert is needed and the work has to be repeated from scratch.
Since the early ’90s, the machine learning approach to TC has gained popularity and has eventually become the dominant one. In this approach, a general inductive process (also called the learner) automatically builds a classifier for a category $c_i$ by observing the characteristics of a set of documents manually classified under $c_i$ or under $\bar{c}_i$ by a domain expert; from these characteristics, the inductive process gathers the characteristics that a new unseen document should have in order to be classified under $c_i$.
The advantages of the ML approach over the KE approach are quite evident. The KE approach puts its effort into the construction of a classifier with the aid of a domain expert, so this construction is done manually, not automatically. The ML approach, on the other hand, endeavors to construct an automatic classifier. This means that if an inductive learning process is available off-the-shelf, all that one needs to do is to inductively and automatically construct a classifier from a set of manually classified documents, namely, the training data set. The advantage is even more evident if the classifier already exists and the original set of categories is updated, or if the classifier is ported to a completely different domain.
Classifiers built by means of ML techniques nowadays achieve impressive effectiveness and efficiency, making automatic classification a qualitatively viable alternative to manual classification. We will discuss several promising methods that have been most popular in TC in Section 2.5.

2.3 Various Subcases of Text Categorization Tasks
Usually, the inductive approaches to building text classifiers cannot be applied to TC directly, because several constraints may be enforced on TC tasks according to different applications. Next we describe the techniques for two such subcases of TC tasks.
2.3.1 Single-label and Multilabel Text Categorization
Since semantics is a subjective notion, the membership of a document in a category cannot be decided deterministically. In fact, this inconsistency happens with very high frequency in the real world when two human experts decide whether to classify document $d_j$ under category $c_i$. For example, a news article on President Bush attending a WTO conference could be filed under Politics, or under Economics, or under both, or even under neither, depending on the subjective judgement of the expert.
Thus, the case in which exactly one category must be assigned to each document $d_j$ is often called single-label classification, while the case in which any number of categories from 0 to $|C|$ may be assigned to the same document $d_j$ is called multilabel classification. A special case of single-label text categorization is binary classification, in which each document $d_j$ must be assigned either to category $c_i$ or to its complement $\bar{c}_i$.
Binary classification is more general than multilabel classification, since an algorithm for binary classification can also be used for multilabel classification. To do this, one needs only transform the problem of multilabel classification under $\{c_1, c_2, \ldots, c_{|C|}\}$ into $|C|$ independent problems of binary classification under $\{c_i, \bar{c}_i\}$, for $i = 1, \ldots, |C|$. That is, for each given positive category $c_i$, when we build a classifier for $c_i$, all the other categories are combined together as the negative category $\bar{c}_i$. This transformation requires that these $|C|$ categories be stochastically independent of each other; that is, for any $c_m$ and $c_n$ ($m, n \in [1, |C|]$), the value of the model for category $c_m$ does not depend on the value of the model for category $c_n$, and vice versa. Typically this is assumed to be the case. (This is not the case in hierarchical classification, which we will discuss next.)
However, the converse transformation is not possible: an algorithm for multilabel classification cannot be used for either binary or single-label classification. There are two cases that need to be considered. Given a document $d_j$ to classify, (i) the classifier might attribute $k > 1$ categories to $d_j$, and it might not be obvious how to choose a "most appropriate" category from them; or (ii) the classifier might attribute no category at all to $d_j$, and it might not be obvious how to choose a "least appropriate" category from $C$. Thus, assigning exactly one, most appropriate category to each document in the corpus is not straightforward in that setting.
In this thesis we also adopt this splitting technique to deal with the binary case, for two reasons: (1) many important TC applications consist of binary classification problems, and (2) a solution to the binary case can be extended to the multilabel case; a minimal sketch of this decomposition is given below. Note that since handling multilabel classification is also a research area in its own right, naively combining binary classifiers is only one widely-adopted technique in the current TC literature.
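The sketch below (Python; the train_binary learner and its scoring interface are hypothetical placeholders for any of the binary methods of Section 2.5) shows how the |C| binary problems can be generated and their decisions recombined:

```python
def one_vs_rest_train(train_binary, documents, labels, categories):
    """Train one binary classifier per category c_i: documents filed under
    c_i are the positives; all remaining documents form the negative
    category (the complement of c_i)."""
    classifiers = {}
    for c in categories:
        y = [c in doc_labels for doc_labels in labels]  # one boolean per document
        classifiers[c] = train_binary(documents, y)
    return classifiers

def one_vs_rest_classify(classifiers, document, threshold=0.5):
    """Multilabel decision: assign every category whose binary classifier
    returns a score of at least the preassigned threshold."""
    return {c for c, clf in classifiers.items() if clf(document) >= threshold}
```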
2.3.2 Flat and Hierarchical Text Categorization
Imagine that there is a hierarchy with two top-level categories, Computers and Sports, and three subcategories within each, namely Computers/Hardware, Computers/Software, Computers/Chat, Sports/Football, Sports/Basketball and Sports/Chat, as Figure 2.1 shows.
In the flat, non-hierarchical classification case, a model corresponding to a positive category is learned to distinguish the target category from all the other categories.

Figure 2.1: A Two-Level Hierarchy in Text Categorization

However, in the hierarchical classification case, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same top level. In the example shown in Figure 2.1, the text classifiers corresponding to the top-level categories, Computers and Sports, distinguish them from each other; this is the same as flat, non-hierarchical TC. On the other hand, the model corresponding to each second-level category is learned to distinguish that second-level category from the other categories within the same top-level category. Specifically, the model built for category Computers/Hardware distinguishes it from the other two categories under the Computers category, namely Computers/Software and Computers/Chat.
Hierarchical TC has recently aroused a lot of interest, also for its possible application in automatically classifying web pages under hierarchical catalogues. Since the categories in a hierarchical structure are not independent of each other, the binary classifiers discussed in the previous subsection are not applicable. To solve this, the hierarchical text classification problem is usually decomposed into a set of smaller problems corresponding to hierarchical splits in the tree, by using the known hierarchical structure; a minimal sketch of this scheme follows below. That is, one first learns to distinguish among the categories at the top level; then lower-level distinctions are learned only within the appropriate top level of the tree. Each of these sub-problems can be solved much more efficiently, and hopefully more accurately as well. Techniques exploiting this intuition in a TC context have been presented by [DC00].
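As a minimal sketch of this top-down decomposition (Python; the per-node classifiers and the dictionary encoding of the hierarchy are hypothetical illustrations):

```python
def classify_hierarchical(node_classifiers, children, doc, root="ROOT"):
    """Route a document down the tree: at each internal node, a local
    classifier chooses among that node's children only, so lower-level
    distinctions are made only within the chosen top-level branch."""
    node = root
    while children.get(node):                 # descend until a leaf category
        node = node_classifiers[node](doc)    # local decision among siblings
    return node

# A hypothetical encoding of the two-level hierarchy of Figure 2.1:
children = {
    "ROOT": ["Computers", "Sports"],
    "Computers": ["Computers/Hardware", "Computers/Software", "Computers/Chat"],
    "Sports": ["Sports/Football", "Sports/Basketball", "Sports/Chat"],
}
```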
2.4 A Variety of Applications of Text Categorization Technology
TC techniques have been used for a number of different applications. Although we group these applications into different cases, the borders between them are fuzzy and somewhat artificial; that is, some of these cases could be considered special cases of others. Here we only discuss the most important ones. Other applications in combination with the availability of multimedia resources and/or other information extraction techniques will not be discussed in this thesis, for example, speech categorization by means of a combination of speech recognition and TC techniques ([KMW00], [SS00]), image categorization with textual titles [SH00], etc.
2.4.1 Automatic Document Indexing for IR Systems
The earliest research on TC techniques originated from automatic document indexing for IR systems. In this case each document is assigned one or more key words or key phrases (from a finite word set called a controlled dictionary) describing its content. The controlled dictionary often consists of a thematic hierarchical thesaurus, e.g. the MESH thesaurus for the biomedical literature. Automatic indexing with a controlled dictionary is also closely related to automated metadata generation. In digital libraries, one is usually interested in tagging documents with metadata that describes them under a variety of aspects (e.g. creation date, document type, author, availability, etc.). Some of this metadata is thematic; that is, its role is to describe the semantics of the document by means of bibliographic codes, key words or key phrases.

Usually, this work is done by trained human indexers, and is thus a costly activity. However, if the entries in the controlled vocabulary or the thematic metadata are viewed as categories, document indexing is actually an instance of TC, and may thus be addressed by the general automatic techniques. Various text classifiers explicitly conceived for text indexing have been described in the literature; see [TH93], [RH84], [FK84].
2.4.2 Documentation Organization
Document organization may be the most general application of TC techniques, covering many kinds of textual information, such as ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts and so on. Examples include the classification of incoming newspaper "classified" advertisements under categories such as Apartments or Houses for Rent/Sale, Cars for Sale, Job Hunting, Cheap Airfare and Vacation Packages; the organization of patents into categories for making their search easier [Lar99]; the automatic filing of newspaper articles under the appropriate sections (e.g., Politics, Home News, Lifestyles, etc.); and the automatic grouping of conference papers into sessions related to themes.
2.4.3 Text Filtering System
Text filtering is the activity of classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer (see [BC92]). One typical example is a news feed, where the producer is a news agency and the consumer is a newspaper (see [HANS90]). In this case, the filtering system should block the delivery of the documents the consumer is likely not interested in (e.g., all news not concerning sports in the case of a sports newspaper). In addition, a text filtering system may also further classify the documents deemed relevant to the consumer into different thematic categories. For example, all articles about sports, that is, the relevant documents, should be further classified according to which sport they deal with, so as to allow journalists specialized in individual sports to access only documents of prospective interest to them. Another example is a junk e-mail filtering system. In recent years, junk e-mails have become an increasingly important problem with great economic impact. Similarly, a junk e-mail filtering system may be trained to discard "spam" mails (see [AKCS00] and [HDW99]) and further classify non-spam mails into topical categories of interest to the user.

Information filtering by machine learning techniques has been widely discussed in the literature; see [AC99], [ILS+00], [KHZ00], [TKSK00] and [YL98].
2.4.4 Word Sense Disambiguation
Resolving natural language ambiguities is an important problem in computational linguistics, as polysemous and homonymous words commonly exist in various types of articles from different domains, whether in English or other languages. For instance, the word bank may have many different meanings in English; the two most common senses are a financial institute (as in the Bank of National Development) and a hydraulic engineering artifact (as in the bank of the river Thames). Thus, identifying the meanings of words in given contexts is quite important for many linguistics applications, such as natural language processing (NLP), and for indexing documents by word senses rather than by words for IR purposes. Word sense disambiguation (WSD) is the activity of finding the sense of an ambiguous word, given an occurrence of this particular word in a text. Although a number of IE techniques have been adopted in WSD, another possible solution to WSD is to adopt TC techniques, once we view word occurrence contexts as documents and word senses as categories (see [GY93] and [EMR00]), based on the assumption of one sense per discourse.

Other issues regarding resolving natural language ambiguities may all be tackled by means of TC techniques along the lines discussed for WSD, including context-sensitive spelling correction, prepositional phrase attachment, part of speech tagging, and lexical choice in machine translation (see [Rot98] for an introduction).
2.4.5 Hierarchical Categorization of Web Pages
When documents are catalogued in this hierarchical way, a researcher may find it easier to first navigate the hierarchy of categories and restrict his search to a particular category of interest. Therefore, many real-world web classification systems have been built on complex hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart and so on. This hierarchical web page classification may be solved by the hierarchical TC techniques we discussed in Section 2.3.2. Previous research works exploiting the hierarchical structure in a TC context have been discussed by [DC00], [WWP99], [RS02], [MRMN98] and [CDAR98]. In practice, as a rich resource of information, links have also been exploited in web page classification by [OML00], [GLF99], [Fü99], [CDI98] and [Att98], and experimentally compared by [YSG02].
2.5 Approaches to Effectively Learning Text Classifiers from Labelled Corpora
As we mentioned in Section 2.2, over recent decades machine learning approaches to effectively learning text classifiers have been pursued in a wide variety of ways. In this section, we deal only with the methods that have been most popularly applied in TC. Apart from the ML approaches, the Rocchio method is a unique approach borrowed from the traditional IR field, and thus we also include it.
Usually, the construction of a classifier for each category $c_i \in C$ consists in the definition of a target function $\Phi : (D, C) \rightarrow [0, 1]$ which returns a value for a given document $d_j$. The returned value, usually between 0 and 1, roughly represents the evidence for the fact that $d_j \in c_i$. Commonly, there is a preassigned threshold $\tau$ such that if $\Phi(d_j, c_i) \geq \tau$, the document $d_j$ is assigned to the positive category $c_i$, and vice versa.
The target function of the classifier can be a model, a hypothesis, or a rule, depending on the approach applied. For example, the Rocchio method builds an explicit profile of each category $c$, which is a weighted list of the discriminative terms, whether present or absent, under this category; k Nearest Neighbor is an example-based (sample- or instance-based) classifier; the C4.5 algorithm learns rules by constructing a decision tree; the Naïve Bayes classifier uses a probabilistic model of text; Support Vector Machines find the hyperplane which separates the positive and negative samples with the maximum margin; and so on.
Note that other, less standard or less popular approaches exist, such as regression methods, Neural Networks, genetic algorithms and maximum entropy modelling, but they are not included in this thesis because, in the TC domain, their performance is not comparable to that of the above promising ones and/or they have not been widely used in recent years. Moreover, there are several techniques that have been applied to improve classification performance effectively and efficiently, such as majority voting (namely, classifier committees), boosting, bootstrapping and so on. These techniques are not covered in this thesis either.
2.5.1 The Rocchio Method From Information Retrieval
The Rocchio method may be the only TC method rooted in the conventional IR field rather than in the ML field. It is used for inducing linear, profile-style classifiers, by means of an adaptation to TC of the well-known Rocchio formula for relevance feedback in the vector space model. The classifier built from the initial corpus is in fact an explicit profile; that is, for each category $c_i$, it is a weighted list of the terms whose presence or absence is most useful for discriminating $c_i$. This adaptation was first proposed by Hull [Hul94], and has been used by many authors since then, either as an object of research in its own right ([Joa97]), or as a baseline classifier ([Joa98], [CS96]), or as a member of a classifier committee.

The Rocchio method computes a classifier $\vec{c}_i = (w_{1i}, \ldots, w_{|T|i})$ ($|T|$ is the term set size) for category $c_i$, given an initial corpus $Tr = \{d_1, \ldots, d_{|Tr|}\} \subset D$, by means of the formula

$$ w_{ki} = \beta \cdot \frac{1}{|POS_i|} \sum_{d_j \in POS_i} w_{kj} \;-\; \gamma \cdot \frac{1}{|NEG_i|} \sum_{d_j \in NEG_i} w_{kj}, \qquad (2.1) $$

where $w_{kj}$ is the weight of term $t_k$ in document $d_j$, $POS_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = True\}$, and $NEG_i = \{d_j \in Tr \mid \breve{\Phi}(d_j, c_i) = False\}$. In this formula, $\beta$ and $\gamma$ are
two control parameters that allow setting the relative importance of positive and
negative examples in the training data set. For instance, if $\beta$ is set to 1 and $\gamma$ to 0 (as in [DPHS98], [Hul94] and [Joa98]), the profile of $c_i$ is the centroid of its positive training examples; thus, the centroid-based text classifier is actually a special case of the Rocchio method. Clearly, a classifier built by means of the Rocchio method rewards the closeness of a test document to the centroid of the positive training examples, and its distance from the centroid of the negative training examples. Sometimes, the role of the negative examples is deemphasized by setting $\beta$ to a high value and $\gamma$ to a low one, e.g. $\beta = 16$ and $\gamma = 4$ as in [Joa97] and [CS96].
One issue in the application of the Rocchio formula to profile extraction is whether the set $NEG_i$ should be considered in its entirety, or whether a well-chosen sample of it, such as the set $NPOS_i$ of near-positives (defined as "the most positive among the negative training examples"), should be selected from it. The $NPOS_i$ factor is more significant than $NEG_i$, since near-positives are the most difficult documents to tell apart from the positives. This method originates from the observation that, when the original Rocchio formula is used for relevance feedback in IR, near-positives tend to be used rather than generic negatives, since the documents on which user judgements are available are among the ones that have scored highest in the previous ranking. Regarding this issue, see [SSS98], [RS99], [WWP99].
The obvious advantage of this method is interpretability, as such a profile is more easily understandable by a human than neural network classifiers, probabilistic classifiers or high-dimensional SVM classifiers. Another advantage is its ease of implementation, and it is also quite efficient, since learning a classifier basically comes down to averaging term weights; a minimal sketch is given below. On the other hand, in terms of effectiveness, a drawback of this method is that if the documents in the category tend to occur in disjoint clusters, such a classifier may miss most of them, as the centroid of these documents may fall outside all of these clusters. More generally, a classifier built by the Rocchio method, as all linear classifiers, has the disadvantage that it divides the space of documents linearly; in the disjoint-cluster case, even many of the positive training examples would not be classified correctly by the linear classifier. Generally, the Rocchio classifier has always been considered an underperformer that cannot achieve an effectiveness comparable to that of a state-of-the-art machine learning method ([SSS98] improved its effectiveness to be comparable to that of a boosting method by using other enhancements).
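As a minimal sketch of formula (2.1) (Python; document vectors are assumed to be precomputed weight lists, e.g. tf.idf vectors, and the defaults β = 16, γ = 4 follow [Joa97]):

```python
def rocchio_profile(pos_vectors, neg_vectors, beta=16.0, gamma=4.0):
    """Formula (2.1): beta times the positive centroid minus gamma times
    the negative centroid, computed coordinate by coordinate."""
    dim = len(pos_vectors[0])
    profile = []
    for k in range(dim):
        pos_mean = sum(v[k] for v in pos_vectors) / len(pos_vectors)
        neg_mean = sum(v[k] for v in neg_vectors) / len(neg_vectors)
        profile.append(beta * pos_mean - gamma * neg_mean)
    return profile

def rocchio_score(profile, doc_vector):
    """Score a test document by its dot product with the profile; a
    preassigned threshold on this score yields the binary decision."""
    return sum(p * w for p, w in zip(profile, doc_vector))
```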
2.5.2 k Nearest Neighbor
k Nearest Neighbor (kNN) is a kind of example-based classifier which does not build an explicit, declarative representation of the category $c_i$, but relies on the category labels attached to the training documents similar to the test document. Other example-based methods exist, but kNN is the most widely-used one. In essence, kNN makes its prediction based on the $k$ training patterns that are closest to the unlabelled (test) pattern, according to a distance metric. The commonly used metrics that measure the similarity between two normalized patterns include the Euclidean distance
$$ Dis(p, q) = \sqrt{\sum_i (p_i - q_i)^2}, \qquad (2.2) $$

the inner product

$$ Sim(p, q) = \sum_i p_i \, q_i, \qquad (2.3) $$

and the cosine similarity

$$ Sim(p, q) = \frac{\sum_i p_i \, q_i}{\sqrt{\sum_i p_i^2} \cdot \sqrt{\sum_i q_i^2}}. \qquad (2.4) $$

For deciding whether to file a test document $d_j$ under category $c_i$, kNN checks whether the $k$ training documents most similar to $d_j$ are also in $c_i$; if the answer is positive for a large enough proportion of them, a positive decision is taken, and a negative decision is taken otherwise. The kNN used in [YC94] is actually a distance-weighted version, so thresholding methods need to be used to convert the real-valued distances into binary categorization decisions. [YP97] and [YL99] used kNN based on the cosine similarity metric to measure the similarity between two documents.
The construction of a kNN classifier also involves determining a threshold $k$ that indicates how many top-ranked training documents have to be considered for computing the distance. [LC96] used $k = 20$, while [YL99] and [YC94] found $30 \leq k \leq 45$ to yield the best effectiveness. [Joa98] also achieved the best performance for kNN when $30 \leq k \leq 45$.
Unlike linear classifiers, kNN does not divide the document space linearly, and thus does not suffer from the problem discussed at the end of Subsection 2.5.1. A number of different experiments have shown kNN to be quite effective. However, its most significant drawback is its inefficiency at classification time, which results from its very rationale in the case of high dimensional and large-scale data sets. Unlike a linear classifier, where only a dot product needs to be computed to classify a test document, kNN requires the entire set of training documents to be ranked for similarity with the test document, which is much more expensive. Actually, the kNN method may not even be called an inductive learner, as it does not have a true training (learning) phase and thus postpones all the computation to classification time; a minimal sketch is given below.
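As a minimal sketch of such a classifier (Python; the simple majority vote over the top k neighbors is one illustrative variant, not the distance-weighted version of [YC94]; k = 30 falls in the 30–45 range reported above):

```python
import math

def cosine(p, q):
    """Cosine similarity, equation (2.4)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def knn_decide(train_vectors, train_is_positive, test_vector, k=30):
    """All computation happens at classification time: rank every training
    document by similarity to the test document, then vote over the top k."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda j: cosine(train_vectors[j], test_vector),
                    reverse=True)
    top = ranked[:k]
    votes = sum(1 for j in top if train_is_positive[j])
    return votes > len(top) / 2   # positive iff most neighbors are positive
```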
2.5.3 Naïve Bayes Method
The Naïve Bayes classifier is a probabilistic classifier which views the target function $\breve{\Phi}(d_j, c_i)$ in terms of the conditional probability $P(c_i \mid \vec{d}_j)$; that is, it computes the probability that a document represented by a vector $\vec{d}_j = (w_{1j}, \ldots, w_{|T|j})$ of terms belongs to $c_i$. By an application of Bayes' theorem, this probability is given by

$$ P(c_i \mid \vec{d}_j) = \frac{P(c_i) \, P(\vec{d}_j \mid c_i)}{P(\vec{d}_j)}, \qquad (2.5) $$

where $P(\vec{d}_j)$ is the probability that a randomly picked document has vector $\vec{d}_j$ as its representation, and $P(c_i)$ is the probability that a randomly picked document belongs to $c_i$.
In order to make the estimation of $P(\vec{d}_j \mid c_i)$ in (2.5) practical, it is common to make the assumption that any two coordinates of the document vector are statistically independent of each other when they are viewed as random variables; this independence assumption is encoded by the equation

$$ P(\vec{d}_j \mid c_i) = \prod_{k=1}^{|T|} P(w_{kj} \mid c_i). \qquad (2.6) $$

The probabilistic classifiers that use this assumption are called Naïve Bayes classifiers, and account for most of the probabilistic approaches to TC in the literature, for example, [Joa98] and [Lew98].

Without the independence assumption, the estimation of $P(\vec{d}_j \mid c_i)$ would be an impossible mission, since the number of possible vectors $\vec{d}_j$ is too high. Although the naive character of this classifier makes the computation possible, this assumption does not hold in practice. In addition, non-binary term weights are not applicable to this method.¹
To calculate $\prod_{k=1}^{|T|} P(w_{kj} \mid c_i)$, two models are used. One is the multi-variate Bernoulli model, which is a Bayesian network with no dependencies between words and with binary word features; the other is the multinomial model, that is, a unigram language model with integer word counts. [MN98] empirically compared their classification performance and found that the multi-variate Bernoulli model performs well at small vocabulary sizes, but the multinomial model usually performs even better at larger vocabulary sizes. One prominent characteristic of the multinomial model is that it relaxes the constraint that document vectors should be binary representations; a minimal sketch of this model is given after the footnote below.
¹ [MN98] used a multinomial event model for Naïve Bayes text classification, which can relax the constraint that document vectors should be binary-valued.
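As a minimal sketch of the multinomial model (Python; add-one Laplace smoothing is an assumption added here to avoid zero probabilities, and log-probabilities replace the raw product of (2.6) to avoid numerical underflow):

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_category, vocab):
    """Estimate log P(c_i) and log P(t_k | c_i) from integer token counts,
    with add-one smoothing; docs_by_category maps each category to a list
    of tokenized documents (lists of tokens)."""
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    model = {}
    for c, docs in docs_by_category.items():
        counts = Counter(t for d in docs for t in d)
        denom = sum(counts.values()) + len(vocab)
        model[c] = (
            math.log(len(docs) / total_docs),                       # log P(c)
            {t: math.log((counts[t] + 1) / denom) for t in vocab},  # log P(t|c)
        )
    return model

def classify_nb(model, doc):
    """Pick argmax over categories of log P(c) + sum_k log P(w_k | c),
    i.e. (2.5) with assumption (2.6) and the constant P(d_j) dropped."""
    def score(c):
        log_prior, log_cond = model[c]
        return log_prior + sum(log_cond[t] for t in doc if t in log_cond)
    return max(model, key=score)
```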