An Improved Term Weighting Scheme for Text Categorization
Pham Xuan Nguyen
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Dr Le Quang Hieu
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
August 2014
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work. To the best of my knowledge, it contains no materials previously published by another person, or substantial proportions of material which have been accepted for the award of any other degrees or diplomas at the University of Engineering and Technology (UET/Coltech) or any other educational institutions, except where due acknowledgement is made in the thesis. Any contributions made to the research by others are explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception, or in style, presentation and linguistic expression, is acknowledged.’
ABSTRACT

In text categorization, term weighting is the task of assigning weights to terms during the document representation phase; thus, it affects the classification performance. In addition to producing high text categorization performance, an effective term weighting scheme should be easy to use.
Term weighting methods can be divided into two categories, namely, supervised and unsupervised [27]. The traditional term weighting schemes, such as binary, tf and tf.idf [38], belong to the unsupervised term weighting methods. Other schemes (for example, tf.χ2 [12]) that make use of prior information about the category membership of training documents belong to the supervised term weighting methods.
The supervised term weighting method tf.rf [27] is one of the most effective schemes to date: it has shown better performance than many others [27]. However, tf.rf is not the best in some cases. Moreover, tf.rf requires many rf values for each term.
In this thesis, we present an improved term weighting scheme derived from tf.rf, called logtf.rfmax. Our new scheme uses logtf = log2(1.0 + tf) instead of tf. Furthermore, our scheme is simpler than tf.rf because it uses only the maximum value of rf for each term. Our experimental results show that our scheme is consistently better than tf.rf and other schemes.
To my family ♥
ACKNOWLEDGEMENTS

First, I would like to express my gratitude to my supervisor, Dr Le Quang Hieu. He guided me throughout the years and gave me much useful advice about study methods. He was very patient with me, and his words strongly influenced me. I would also like to give my honest appreciation to my colleagues at Hoalu University and the University of Engineering and Technology (UET/Coltech) for their great support. Thank you all!
Table of Contents
1 Introduction
1.1 Motivation
1.2 Structure of this Thesis
2 Overview of Text Categorization
2.1 Introduction
2.2 Text Representation
2.3 Text Categorization Tasks
2.3.1 Single-label and Multi-label Text Categorization
2.3.2 Flat and Hierarchical Text Categorization
2.4 Applications of Text Categorization
2.4.1 Automatic Document Indexing for IR Systems
2.4.2 Documentation Organization
2.4.3 Word Sense Disambiguation
2.4.4 Text Filtering System
2.4.5 Hierarchical Categorization of Web Pages
2.5 Machine Learning Approaches to Text Categorization
2.5.1 k Nearest Neighbor
2.5.2 Decision Tree
2.5.3 Support Vector Machines
2.6 Performance Measures
3 Term Weighting Schemes
3.1 Introduction
3.2 Previous Term Weighting Schemes
3.2.1 Unsupervised Term Weighting Schemes
3.2.2 Supervised Term Weighting Schemes
3.3 Our New Term Weighting Scheme
4 Experiments
4.1 Term Weighting Methods
4.2 Machine Learning Algorithm
4.3 Corpora
4.3.1 Reuters News Corpus
4.3.2 20 Newsgroups Corpus
4.4 Evaluation Measures
4.5 Results and Discussion
4.5.1 Results on the 20 Newsgroups corpus
4.5.2 Results on the Reuters News corpus
4.5.3 Discussion
4.5.4 Further Analysis
List of Figures
2.1 An example of the vector space model
2.2 An example of transforming a multi-label problem into 3 binary classification problems
2.3 A hierarchy with two top-level categories
2.4 Text Categorization using machine learning techniques
2.5 An example of a decision tree [source: [27]]
4.1 Linear Support Vector Machine [source: [14]]
4.2 The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.3 The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.4 The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.5 The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.6 The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary
4.7 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 1 to 10
4.8 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 11 to 20
List of Tables
3.1 Traditional Term Weighting Schemes
3.2 Examples of two terms having different tf and log2(1 + tf)
4.1 Experimental Term Weighting Schemes
5.1 Examples of two term weights when using rf and rfmax
Chapter 1
Introduction
1.1 Motivation

In recent decades, there has been a huge growth in the number of textual documents, especially on the World Wide Web. As a result, the need to categorize documents has increased rapidly, and the text categorization (TC) field has attracted many researchers.
In the text representation phase, the content of documents is transformed into a compact format. Specifically, each document is presented as a vector of terms in the vector space model; each vector component contains a value representing how much a term contributes to the discriminative semantics of the document. A term weighting scheme (TWS) performs the task of assigning weights to terms in this phase.
TWS is a well-studied field. The traditional term weighting methods, such as binary, tf and tf.idf, are borrowed from the information retrieval (IR) domain. These term weighting schemes do not use prior information about the category membership of training documents. Other schemes that use this information are called supervised term weighting schemes, for example, tf.χ2 [12].
To date, the supervised term weighting scheme tf.rf [27] is one of the best methods. It achieves better performance than many others in a series of thorough experiments using two commonly-used algorithms (SVM and kNN) as well as two benchmark data collections (Reuters News and 20 Newsgroups). However, the performance of tf.rf is not stable: tf.rf shows considerably better performance than all other schemes in the experiments on the Reuters News data set, while its performance is worse than that of rf (a term weighting scheme that does not use the tf factor) and only slightly better than that of tf.idf (a common term weighting method) in the experiments on the 20 Newsgroups corpus. Furthermore, for each term, tf.rf requires N (the total number of categories) rf values in a multi-label classification problem. This raises the question of whether there is a single typical rf value for each term.
In this thesis, we propose an improved term weighting scheme that applies two improvements to tf.rf. First, we replace tf by logtf = log2(1.0 + tf). Second, we use only the maximum rf value (rfmax) for each term in a multi-label classification problem. The formula for our scheme is logtf.rfmax.
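To make the scheme concrete, here is a minimal sketch in Python. It assumes the rf definition from [27], rf = log2(2 + a/max(1, c)), where a and c are the numbers of positive- and negative-category documents containing the term; the function names and example numbers are illustrative only.

```python
import math

def rf(a, c):
    # relevance frequency as defined in [27]: rf = log2(2 + a / max(1, c)),
    # where a (resp. c) is the number of documents of the positive
    # (resp. negative) category that contain the term
    return math.log2(2.0 + a / max(1.0, c))

def logtf_rfmax(tf, category_counts):
    # category_counts: one (a, c) pair per category for this term;
    # rfmax keeps a single rf value per term: the maximum over categories
    rf_max = max(rf(a, c) for a, c in category_counts)
    # logtf = log2(1.0 + tf) dampens the raw term frequency
    return math.log2(1.0 + tf) * rf_max

# example: a term occurring 3 times in a document, with hypothetical
# per-category document frequencies (a, c)
print(logtf_rfmax(3, [(30, 10), (5, 35), (8, 22)]))
```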
We conducted experiments with the experimental settings described in [27], where tf.rf was proposed. We use two standard measures (micro-F1 and macro-F1) as well as a linear SVM. To assess our work, we carefully selected eight term weighting schemes, including two common methods, two schemes used in [27], and four methods applying our improvements. The experimental results show that logtf.rfmax consistently outperforms tf.rf as well as the other schemes on both data sets.
1.2 Structure of this Thesis

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of text categorization. Chapter 3 reviews term weighting schemes for text categorization and describes our improved term weighting scheme. Chapter 4 describes our experiments, including the algorithms used, the data sets, the measures, and the results and discussion. Chapter 5 presents the conclusion.

In this study, the default studied language is English. In addition, we apply only the bag-of-words approach to represent a document, and the data sets used are flat. The results of the study can provide a valuable term weighting method for TC.
Chapter 2
Overview of Text Categorization
This chapter gives an overview of TC. We begin by introducing TC, then present some applications and tasks of TC. The rest of this chapter is about the approaches to TC, especially SVM, which is applied in this thesis.
2.1 Introduction

Automated text categorization (or text classification) is the supervised learning task of assigning documents to predefined categories. TC differs from text clustering, where the set of categories is not known in advance.
TC has been studied since the early 1960s, but it has received wide attention only in recent decades, due to the need to categorize the large number of documents on the World Wide Web. Generally, TC relates to the machine learning (ML) and information retrieval (IR) fields.
In the 1980s, the popular approach to TC was to construct an expert system capable of making text classification decisions based on knowledge engineering techniques. The famous example of this method is the CONSTRUE system [22]. Since the early 1990s, machine learning approaches to TC have become popular.
2.2 Text Representation

In the vector space model (VSM), each document is converted to a vector in the term space (each term usually corresponds to a word). In detail, a document d is represented as (w1, ..., wn), where n is the total number of terms. The value of wk represents how much the term tk contributes to the classification of the document d. Figure 2.1 illustrates the way documents are represented in VSM: five documents are represented as five vectors in the 3-dimensional space (System, Class, Text).
In the process of transforming documents according to the VSM, the word sequence in a document is not considered, and each dimension in the vector space is associated with a word in the vocabulary built during the text preprocessing phase. In this phase, the words assumed to carry no information content (such as stop words, numbers, and so on) are removed from the documents; then words can be stemmed. Finally, the remaining words in all documents are sorted alphabetically and numbered consecutively. Stop words are common words that are not useful for TC, such as articles (for example, “the”, “a”), prepositions (for example, “of”, “in”) and conjunctions (for example, “and”, “or”). Stemming algorithms are used to map several morphological forms of a word to a single term (for instance, “computers” is mapped to “computer”). To reduce the dimension of the feature space, a feature selection process is usually applied: each term is assigned a score representing how “important” the term is for the TC task, and only the top terms with the highest scores are used to represent all documents.
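As a concrete illustration of this pipeline, the sketch below uses a toy stop-word list, a deliberately crude one-line stemmer in place of a real one (such as Porter's), and collection frequency as a placeholder for a proper feature selection score; all names are illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "or"}  # tiny illustrative list

def stem(word):
    # crude suffix stripping stands in for a real stemmer (e.g. Porter's)
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())       # drop numbers, punctuation
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(documents, top_k=None):
    # score each term by collection frequency (a placeholder for a real
    # feature selection score such as chi-square or information gain)
    counts = Counter(t for doc in documents for t in preprocess(doc))
    terms = sorted(counts, key=counts.get, reverse=True)[:top_k]
    return {t: i for i, t in enumerate(sorted(terms))}  # alphabetical ids

def to_vector(document, vocabulary):
    vec = [0] * len(vocabulary)
    for term in preprocess(document):
        if term in vocabulary:
            vec[vocabulary[term]] += 1                  # raw tf; weights come later
    return vec

docs = ["the text system", "a class of text systems"]
vocab = build_vocabulary(docs)
print(vocab, to_vector(docs[1], vocab))
```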
Two key issues considered in the text representation phase are term types and term weights. A term (or a feature) can be a sub-word, a word, a phrase, a sentence, and so on. The most common type of term is a word, and a document is then treated as a group of words with different frequencies. This representation method is called the bag-of-words approach, and it performs well in practice. The bag-of-words approach is simple, but it discards a lot of useful information about the semantics between words. For example, the two words of a phrasal verb are considered as two independent ones. To solve this problem, many researchers have used phrases (for instance, noun phrases) or sentences as terms. These phrases often include syntactic and/or statistical information [29], [6]. Furthermore, the term type can be a combination [10] of different types, for example, the word-level type and the 3-gram type [10]. Term weights, the values stored in the vector representation, will be discussed in Chapter 3.
2.3 Text Categorization Tasks

Text categorization can be classified into many different types according to the number of categories assigned to a document, the total number of categories, and the category structure.
2.3.1 Single-label and Multi-label Text Categorization

Based on the number of categories that a document can belong to, text categorization is classified into two types, namely, single-label and multi-label.

Single-label Text Categorization. Single-label classification is the case where each document is assigned to exactly one category, and there are two or more categories. Binary classification is a special case of single-label text categorization in which the number of categories is two.

Multi-label Text Categorization. In multi-label classification, a document can be assigned to more than one category, and the task involves two or more categories. Multi-label classification differs from multi-class single-label classification, where the number of categories is also more than one but a document is assigned to only one category.
To solve the multi-label problem, we can apply either problem transformation methods or algorithm adaptation methods. The problem transformation methods transform the multi-label problem into a set of binary classification problems, each of which can be solved by a single-label classifier.

An example of a transformation method is OneVsAll. This approach transforms a multi-label classification problem with N categories into N binary classification problems, each of which corresponds to a different category. To determine which categories are assigned to a document, each binary classifier is used to determine whether the document belongs to its corresponding category.

Figure 2.2: An example of transforming a multi-label problem into 3 binary classification problems
To build a binary classifier for a given category C, all training documents are divided into two categories: the positive category contains the documents belonging to category C, while all documents in the other categories belong to the negative category. Figure 2.2 illustrates a 3-category problem transformed into three binary problems: for the binary classifier corresponding to class 1, the documents in this class belong to the positive category, and all documents in class 2 and class 3 together belong to the negative category.
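A sketch of the OneVsAll transformation follows; `train_binary` is a hypothetical stand-in for any single-label learner (for example, a linear SVM), assumed here to return a callable classifier, and `train_majority` is just a trivial learner used to make the example runnable.

```python
def one_vs_all(documents, labels, train_binary):
    """Transform a multi-label problem with N categories into N binary ones.

    documents: list of feature vectors
    labels: list of label sets, e.g. [{1, 3}, {2}, ...]
    train_binary: any binary learner taking (X, y) with y in {0, 1}
    """
    categories = sorted(set(c for ls in labels for c in ls))
    classifiers = {}
    for c in categories:
        # positive category: documents labelled with c; the rest are negative
        y = [1 if c in ls else 0 for ls in labels]
        classifiers[c] = train_binary(documents, y)
    return classifiers

def predict(classifiers, document):
    # a document receives every category whose binary classifier accepts it
    return {c for c, clf in classifiers.items() if clf(document) == 1}

def train_majority(X, y):
    # trivial stand-in learner: always predicts the majority class
    majority = 1 if sum(y) * 2 >= len(y) else 0
    return lambda doc: majority

clfs = one_vs_all([[1, 0], [0, 1], [1, 1]], [{1}, {2}, {1, 2}], train_majority)
print(predict(clfs, [1, 0]))
```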
2.3.2 Flat and Hierarchical Text Categorization

According to the category structure, text categorization can be divided into two categories. The former is flat categorization, where each category is separate from the others. The latter is hierarchical categorization, in which there is a hierarchical category structure. An example of a hierarchy with two top-level categories, Cars and Sports, and three subcategories within each, namely Cars/Lorry, Cars/Truck, Cars/Taxi, Sports/Football, Sports/Skiing and Sports/Tennis, is shown in Figure 2.3.

Figure 2.3: A hierarchy with two top-level categories
In the flat classification case, a model corresponding to a positive category is learned to distinguish the target category from all other categories. However, in hierarchical classification, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same level. In Figure 2.3, the text classifiers corresponding to the top-level categories, Cars and Sports, distinguish these two categories from each other; this is the same as flat TC. Meanwhile, the model corresponding to each second-level category is learned to distinguish that category from the other second-level categories within the same top-level category. Specifically, the model built for the category Cars/Lorry distinguishes it from the other two categories under Cars, namely Cars/Taxi and Cars/Truck.
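The sketch below illustrates this training regime for a two-level hierarchy with category paths such as "Cars/Lorry"; `train_binary` is again a hypothetical stand-in for any binary learner.

```python
def top(path):
    # top-level category of a path such as "Cars/Lorry"
    return path.split("/")[0]

def train_hierarchical(documents, labels, train_binary):
    # labels are full paths such as "Cars/Lorry", one per document
    clfs = {}
    # top level: flat one-vs-rest over all documents
    for t in sorted({top(l) for l in labels}):
        clfs[t] = train_binary(documents, [1 if top(l) == t else 0 for l in labels])
    # second level: distinguish a category only from its siblings
    for c in sorted(set(labels)):
        sib = [(d, 1 if l == c else 0)
               for d, l in zip(documents, labels) if top(l) == top(c)]
        clfs[c] = train_binary([d for d, _ in sib], [y for _, y in sib])
    return clfs
```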
2.4 Applications of Text Categorization

There are a large number of applications of text categorization. In this section, we discuss the important ones.
2.4.1 Automatic Document Indexing for IR Systems

Automatic document indexing for IR systems is the activity in which each document is assigned some keywords or key phrases describing its content, drawn from a dictionary. Generally, this work is done by trained human indexers. However, if we treat the entries in the dictionary as categories, document indexing becomes an application of TC, and it may be solved by computers. Several ways of using TC techniques for automatic document indexing have been described in [41], [35]. The dictionary usually consists of a thematic hierarchical thesaurus, for example, the NASA thesaurus for the aerospace discipline, or the MeSH thesaurus for the biomedical literature.

Automatic indexing with a controlled dictionary and automated metadata generation are closely related to each other. In digital libraries, documents are tagged with metadata (for example, creation date, document type, author, availability, and so on). Some of this metadata is thematic, and its role is to describe the documents by means of bibliographic codes, keywords or key phrases.
2.4.2 Documentation Organization

Documentation organization might be the most general application of TC, because there is a huge number of documents that need to be classified. Textual information can be found in ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts and so on. A system classifying newspaper advertisements under different categories such as Cars for Sale and Job Hunting, or a system grouping conference papers into sessions related to themes, are two examples of documentation organization.
2.4.3 Word Sense Disambiguation

The task of word sense disambiguation (WSD) is to find the sense of an ambiguous word (for instance, bank may mean a financial institution or the land along the side of a river), given an occurrence of this particular word in a context. Although a number of other techniques have been used for WSD, another solution is to apply TC techniques, treating the occurrence contexts of the word as documents and the word senses as categories [19], [15].
2.4.4 Text Filtering System

Text filtering is the activity of classifying a stream of incoming documents, dispatched in an asynchronous way by an information producer to an information consumer [4]. One typical instance is a news feed, in which the consumer is a newspaper and the producer is a news agency [22]. In this case, the filtering system should block the delivery of documents that the consumer is likely not interested in (for example, all news not concerning sports, in the case of a sports newspaper). Moreover, a text filtering system might further categorize the documents considered relevant to the consumer into different thematic categories; for instance, the relevant documents (news about sports) could be further classified according to which sport they involve. A junk e-mail filtering system is another instance: it may be trained to get rid of spam mails and to further categorize non-spam mails into different categories [2], [20]. Information filtering based on machine learning techniques has been discussed in [1], [24].
2.4.5 Hierarchical Categorization of Web Pages

When documents are catalogued hierarchically, it is easier for a researcher to first navigate the hierarchy of categories and then limit the search to a category of interest. Therefore, many real-world web classification systems have been built on complicated hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart and so on. Hierarchical web page classification may be dealt with using hierarchical TC techniques. Prior works related to hierarchical structure in a TC context have been discussed in [13], [42]. In practice, links have also been used in web page classification [34], [20].
2.5 Machine Learning Approaches to Text Categorization

Figure 2.4: Text Categorization using machine learning techniques
2.5.1 k Nearest Neighbor

k Nearest Neighbor (kNN) is a kind of example-based classifier: it relies on the category labels assigned to the training documents that are similar to the test document. Specifically, a classifier using the kNN algorithm categorizes an unlabelled document under a class based on the categories of the k training documents that are most similar to this document. Distance metrics measuring the similarity between two documents include the Euclidean distance:
$$\mathrm{Dis}(P, Q) = \sqrt{\sum_i (p_i - q_i)^2} \qquad (2.3)$$

where P and Q are two samples, and p_i and q_i are the attributes of the two samples, respectively.
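A minimal sketch of kNN classification using the Euclidean distance of equation (2.3) and a simple majority vote; the value of k and the toy vectors are illustrative.

```python
import math
from collections import Counter

def euclidean(p, q):
    # Dis(P, Q) = sqrt(sum_i (p_i - q_i)^2), as in equation (2.3)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_vec, k=3):
    # train: list of (vector, category) pairs
    nearest = sorted(train, key=lambda item: euclidean(item[0], test_vec))[:k]
    # majority vote among the k most similar training documents
    votes = Counter(cat for _, cat in nearest)
    return votes.most_common(1)[0][0]

train = [([1, 0, 2], "sports"), ([0, 3, 1], "cars"), ([1, 1, 2], "sports")]
print(knn_classify(train, [1, 0, 1], k=3))
```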
kNN has been shown to be quite effective, but its significant drawback is the classification time in the case of huge, high-dimensional data sets. Furthermore, kNN requires the entire set of training samples to be ranked for similarity with each test document, which is expensive. Actually, the kNN method cannot be called an inductive learner, because it does not have a training phase.
2.5.2 Decision Tree

A decision tree (DT) text classifier is a tree in which each internal node is labelled with a term, each branch corresponds to a term weight, and each leaf node is labelled with a category. To categorize a test document, the classifier starts at the root of the tree and moves through the tree until it reaches a leaf node, which provides the category. At each internal node, the classifier tests whether the document contains the term labelling that node; if so, the moving direction follows the weight of this term in the document. Most such classifiers apply binary text representations and binary trees. Figure 2.5 is an example of a binary tree where edges are labelled with terms (underlining denotes negation) and leaves are labelled with categories (WHEAT in this example).

Figure 2.5: An example of a decision tree [source: [27]]
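A minimal sketch of this traversal over a binary (term present/absent) representation; the tree built below is hypothetical, not the tree of Figure 2.5.

```python
class Node:
    def __init__(self, term=None, present=None, absent=None, category=None):
        # internal nodes carry a term and two children; leaves carry a category
        self.term, self.present, self.absent, self.category = term, present, absent, category

def classify(node, document_terms):
    # walk from the root to a leaf, branching on term presence
    while node.category is None:
        node = node.present if node.term in document_terms else node.absent
    return node.category

tree = Node(term="wheat",
            present=Node(category="WHEAT"),
            absent=Node(term="farm",
                        present=Node(category="WHEAT"),
                        absent=Node(category="NOT-WHEAT")))
print(classify(tree, {"farm", "price"}))  # -> WHEAT
```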
One important issue with DTs is overfitting, which occurs when some branches are too specific to the training samples. Thus, most decision tree learning methods include a mechanism for growing and pruning the tree (for example, discarding overly specific branches). Among the standard packages for DT learning, the popular ones are ID3 [17], C4.5 [7] and C5 [31].
2.5.3 Support Vector Machines

The support vector machine (SVM) algorithm was first introduced by Vapnik. It was originally applied to text categorization by Joachims and Dumais [25], [14]. Among all the surfaces dividing the training examples into two classes in the |W|-dimensional term space, the SVM finds the one that separates the positive examples from the negative examples by the widest possible margin.

SVMs are usually grouped into linear SVMs and non-linear SVMs based on the kernel function used. For instance, common kernel functions are the linear function $K(x_i, x_j) = x_i^{\top} x_j$, the polynomial function $K(x_i, x_j) = (\gamma\, x_i^{\top} x_j + \tau)^d$, the radial basis function $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ and the sigmoid function $K(x_i, x_j) = \tanh(\gamma\, x_i^{\top} x_j + \tau)$ for vectors $x_i$ and $x_j$, where γ, τ and d are kernel parameters.
In recent years, SVMs have been widely used and have shown better performance than other machine learning algorithms due to their ability to handle high-dimensional and large-scale training sets [25], [14]. There are a number of software packages implementing the SVM algorithm with different kernel functions, such as SVM-Light, LIBSVM, TinySVM, LIBLINEAR, and so on.
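As a usage illustration, the sketch below trains a linear SVM text classifier, assuming scikit-learn rather than one of the packages named above; the tf.idf step merely stands in for whatever weighting scheme is under study, and the toy documents are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["wheat prices rose", "the football match ended", "corn and wheat exports"]
labels = ["grain", "sports", "grain"]

vec = TfidfVectorizer()            # tf.idf weighting; a custom scheme such as
X = vec.fit_transform(docs)        # logtf.rfmax would replace this step
clf = LinearSVC().fit(X, labels)   # linear-kernel SVM

print(clf.predict(vec.transform(["wheat harvest report"])))
```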
2.6 Performance Measures

In this section, we describe the measures used to evaluate TC effectiveness.
For each category, let TP_i denote the true positives (the number of documents that belong to this category and are correctly assigned to it); FP_i denote the false positives (the number of documents that do not belong to this category but are incorrectly assigned to it); TN_i denote the true negatives (the number of documents that do not belong to this category and are correctly not assigned to it); and FN_i denote the false negatives (the number of documents that belong to this category but are incorrectly not assigned to it). We define five measures as follows:
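In terms of these counts, the per-category precision, recall and F1 take their standard forms (a sketch of the usual definitions, stated here because the averaged measures below build on them; F1_i is the harmonic mean of precision and recall):

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 P_i R_i}{P_i + R_i}$$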
Measures for multi-label classification. To assess the performance over m categories in a multi-label classification task, we have two averaging methods, namely macro-F1 and micro-F1. The formula for macro-F1 is given below.