An Improved Term Weighting Scheme for Text Categorization
Pham Xuan Nguyen
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Dr Le Quang Hieu
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
August 2014
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work. To the best of my knowledge, it contains no materials previously published by another person, or substantial proportions of material which have been accepted for the award of any other degrees or diplomas at the University of Engineering and Technology (UET/Coltech) or any other educational institutions, except where due acknowledgement is made in the thesis. Any contributions made to the research by others are explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception, or in style, presentation and linguistic expression, is acknowledged.’
ABSTRACT

In text categorization, term weighting is the task of assigning weights to terms during the document representation phase; thus, it affects the classification performance. In addition to producing high text categorization performance, an effective term weighting scheme should be easy to use.
Term weighting methods can be divided into two categories, namely, supervised and unsupervised [27]. The traditional term weighting schemes, such as binary, tf and tf.idf [38], belong to the unsupervised term weighting methods. Other schemes (for example, tf.χ2 [12]) that make use of prior information about the category membership of training documents belong to the supervised term weighting methods.
The supervised term weighting method tf.rf [27] is one of the most effective schemes to date: it has shown better performance than many others [27]. However, tf.rf is not the best in some cases. Moreover, tf.rf requires many rf values for each term.
In this thesis, we present an improved term weighting scheme derived from tf.rf, called logtf.rfmax. Our new scheme uses logtf = log2(1.0 + tf) instead of tf. Furthermore, our scheme is simpler than tf.rf because it uses only the maximum value of rf for each term. Our experimental results show that our scheme is consistently better than tf.rf and other schemes.
To my family ♥
ACKNOWLEDGEMENTS

First, I would like to express my gratitude to my supervisor, Dr Le Quang Hieu. He guided me throughout the years and gave me much useful advice about study methods. He was very patient with me, and his words strongly influenced me. I would also like to give my honest appreciation to my colleagues at Hoalu University and the University of Engineering and Technology (UET/Coltech) for their great support. Thank you all!
Table of Contents
1 Introduction
1.1 Motivation
1.2 Structure of this Thesis
2 Overview of Text Categorization
2.1 Introduction
2.2 Text Representation
2.3 Text Categorization Tasks
2.3.1 Single-label and Multi-label Text Categorization
2.3.2 Flat and Hierarchical Text Categorization
2.4 Applications of Text Categorization
2.4.1 Automatic Document Indexing for IR Systems
2.4.2 Documentation Organization
2.4.3 Word Sense Disambiguation
2.4.4 Text Filtering System
2.4.5 Hierarchical Categorization of Web Pages
2.5 Machine Learning Approaches to Text Categorization
2.5.1 k Nearest Neighbor
2.5.2 Decision Tree
2.5.3 Support Vector Machines
2.6 Performance Measures
3 Term Weighting Schemes
3.1 Introduction
3.2 Previous Term Weighting Schemes
3.2.1 Unsupervised Term Weighting Schemes
3.2.2 Supervised Term Weighting Schemes
3.3 Our New Term Weighting Scheme
4 Experiments
4.1 Term Weighting Methods
4.2 Machine Learning Algorithm
4.3 Corpora
4.3.1 Reuters News Corpus
4.3.2 20 Newsgroups Corpus
4.4 Evaluation Measures
4.5 Results and Discussion
4.5.1 Results on the 20 Newsgroups corpus
4.5.2 Results on the Reuters News corpus
4.5.3 Discussion
4.5.4 Further Analysis
List of Figures
2.1 An example of the vector space model
2.2 An example of transforming a multi-label problem into 3 binary classification problems
2.3 A hierarchy with two top-level categories
2.4 Text Categorization using machine learning techniques
2.5 An example of a decision tree [source: [27]]
4.1 Linear Support Vector Machine [source: [14]]
4.2 The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.3 The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
4.4 The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.5 The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
4.6 The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary
4.7 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 1 to 10
4.8 The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories 11 to 20
List of Tables
3.1 Traditional Term Weighting Schemes
3.2 Examples of two terms having different tf and log2(1 + tf)
4.1 Experimental Term Weighting Schemes
5.1 Examples of two term weights when using rf and rfmax
Chapter 1
Introduction
1.1 Motivation

In recent decades, there has been a huge growth in the number of textual documents, especially on the World Wide Web. As a result, the need to categorize documents has increased rapidly, and the text categorization (TC) field has attracted many researchers.
In the text representation phase, the content of documents is transformed into a compact format. Specifically, each document is presented as a vector of terms in the vector space model; each vector component contains a value representing how much a term contributes to the discriminative semantics of the document. A term weighting scheme (TWS) performs the task of assigning weights to terms in this phase.
TWS is a well-studied field. The traditional term weighting methods, such as binary, tf and tf.idf, are borrowed from the information retrieval (IR) domain. These term weighting schemes do not use prior information about the category membership of training documents. Other schemes that use this information are called supervised term weighting schemes, for example, tf.χ2 [12].
To date, the supervised term weighting scheme tf.rf [27] is one of the best methods. It achieves better performance than many others in a series of thorough experiments using two commonly-used algorithms (SVM and kNN) as well as two benchmark data collections (Reuters News and 20 Newsgroups). However, the performance of tf.rf is not stable: tf.rf shows considerably better performance than all other schemes in the experiments on the Reuters News data set, while its performance is worse than that of rf (a term weighting scheme that does not use the tf factor) and only slightly better than that of tf.idf (a common term weighting method) in the experiments on the 20 Newsgroups corpus. Furthermore, for each term, tf.rf requires N (the total number of categories) rf values in a multi-label classification problem. This raises the question of whether there is a single typical rf value for each term.
In this thesis, we propose an improved term weighting scheme that applies two improvements to tf.rf. First, we replace tf by logtf = log2(1.0 + tf). Second, we use only the maximum rf value (rfmax) for each term in a multi-label classification problem. The formula for our scheme is logtf.rfmax.
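To make the scheme concrete, here is a minimal sketch in Python. It assumes the rf definition from [27], rf = log2(2 + a/max(1, c)), where a and c are the numbers of positive- and negative-category documents containing the term; the function names and example numbers are illustrative only.

```python
import math

def rf(a, c):
    # relevance frequency as defined in [27]: rf = log2(2 + a / max(1, c)),
    # where a (resp. c) is the number of documents of the positive
    # (resp. negative) category that contain the term
    return math.log2(2.0 + a / max(1.0, c))

def logtf_rfmax(tf, category_counts):
    # category_counts: one (a, c) pair per category for this term;
    # rfmax keeps a single rf value per term: the maximum over categories
    rf_max = max(rf(a, c) for a, c in category_counts)
    # logtf = log2(1.0 + tf) dampens the raw term frequency
    return math.log2(1.0 + tf) * rf_max

# example: a term occurring 3 times in a document, with hypothetical
# per-category document frequencies (a, c)
print(logtf_rfmax(3, [(30, 10), (5, 35), (8, 22)]))
```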
We conducted experiments with the experimental settings described in [27], where tf.rf was proposed. We use two standard measures (micro-F1 and macro-F1) as well as a linear SVM. To assess our work, we carefully selected eight term weighting schemes, including two common methods, two schemes used in [27], and four methods applying our improvements. The experimental results show that logtf.rfmax consistently outperforms tf.rf as well as the other schemes on both data sets.
1.2 Structure of this Thesis

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of text categorization. Chapter 3 reviews term weighting schemes for text categorization and describes our improved term weighting scheme. Chapter 4 describes our experiments, including the algorithms used, the data sets, the measures, and the results and discussion. Chapter 5 presents the conclusion.

In this study, the default studied language is English. In addition, we apply only the bag-of-words approach to represent a document, and the data sets used are flat. The results of the study can provide a valuable term weighting method for TC.
Chapter 2
Overview of Text Categorization
This chapter gives an overview of TC. We begin by introducing TC, then present some applications and tasks of TC. The rest of this chapter is about the approaches to TC, especially SVM, which is applied in this thesis.
2.1 Introduction

Automated text categorization (or text classification) is the supervised learning task of assigning documents to predefined categories. TC differs from text clustering, where the set of categories is not known in advance.
TC has been studied since the early 1960s, but it has received wide attention only in recent decades, due to the need to categorize the large number of documents on the World Wide Web. Generally, TC relates to the machine learning (ML) and information retrieval (IR) fields.
In the 1980s, the popular approach to TC was to construct an expert system capable of making text classification decisions based on knowledge engineering techniques. The famous example of this method is the CONSTRUE system [22]. Since the early 1990s, machine learning approaches to TC have become popular.
2.2 Text Representation

In the vector space model (VSM), each document is converted to a vector in the term space (each term usually corresponds to a word). In detail, a document d is represented as (w1, ..., wn), where n is the total number of terms. The value of wk represents how much the term tk contributes to the classification of the document d. Figure 2.1 illustrates the way documents are represented in VSM: five documents are represented as five vectors in the 3-dimensional space (System, Class, Text).
In the process of transforming documents according to the VSM, the word sequence in a document is not considered, and each dimension in the vector space is associated with a word in the vocabulary built during the text preprocessing phase. In this phase, the words assumed to carry no information content (such as stop words, numbers, and so on) are removed from the documents; then words can be stemmed. Finally, the remaining words in all documents are sorted alphabetically and numbered consecutively. Stop words are common words that are not useful for TC, such as articles (for example, “the”, “a”), prepositions (for example, “of”, “in”) and conjunctions (for example, “and”, “or”). Stemming algorithms are used to map several morphological forms of a word to a single term (for instance, “computers” is mapped to “computer”). To reduce the dimension of the feature space, a feature selection process is usually applied: each term is assigned a score representing how “important” the term is for the TC task, and only the top terms with the highest scores are used to represent all documents.
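As a concrete illustration of this pipeline, the sketch below uses a toy stop-word list, a deliberately crude one-line stemmer in place of a real one (such as Porter's), and collection frequency as a placeholder for a proper feature selection score; all names are illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "or"}  # tiny illustrative list

def stem(word):
    # crude suffix stripping stands in for a real stemmer (e.g. Porter's)
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())       # drop numbers, punctuation
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_vocabulary(documents, top_k=None):
    # score each term by collection frequency (a placeholder for a real
    # feature selection score such as chi-square or information gain)
    counts = Counter(t for doc in documents for t in preprocess(doc))
    terms = sorted(counts, key=counts.get, reverse=True)[:top_k]
    return {t: i for i, t in enumerate(sorted(terms))}  # alphabetical ids

def to_vector(document, vocabulary):
    vec = [0] * len(vocabulary)
    for term in preprocess(document):
        if term in vocabulary:
            vec[vocabulary[term]] += 1                  # raw tf; weights come later
    return vec

docs = ["the text system", "a class of text systems"]
vocab = build_vocabulary(docs)
print(vocab, to_vector(docs[1], vocab))
```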
Two key issues considered in the text representation phase are term types and term weights. A term (or a feature) can be a sub-word, a word, a phrase, a sentence, and so on. The most common type of term is a word, and a document is then treated as a group of words with different frequencies. This representation method is called the bag-of-words approach, and it performs well in practice. The bag-of-words approach is simple, but it discards a lot of useful information about the semantics between words. For example, the two words of a phrasal verb are considered as two independent ones. To solve this problem, many researchers have used phrases (for instance, noun phrases) or sentences as terms. These phrases often include syntactic and/or statistical information [29], [6]. Furthermore, the term type can be a combination [10] of different types, for example, the word-level type and the 3-gram type [10]. Term weights, the values stored in the vector representation, will be discussed in Chapter 3.
2.3 Text Categorization Tasks

Text categorization can be classified into many different types according to the number of categories assigned to a document, the total number of categories, and the category structure.
2.3.1 Single-label and Multi-label Text Categorization

Based on the number of categories that a document can belong to, text categorization is classified into two types, namely, single-label and multi-label.

Single-label Text Categorization. Single-label classification is the case where each document is assigned to exactly one category, and there are two or more categories. Binary classification is a special case of single-label text categorization in which the number of categories is two.

Multi-label Text Categorization. In multi-label classification, a document can be assigned to more than one category, and the task involves two or more categories. Multi-label classification differs from multi-class single-label classification, where the number of categories is also more than one but a document is assigned to only one category.
To solve the multi-label problem, we can apply either problem transformation methods or algorithm adaptation methods. The problem transformation methods transform the multi-label problem into a set of binary classification problems, each of which can be solved by a single-label classifier.

An example of a transformation method is OneVsAll. This approach transforms a multi-label classification problem with N categories into N binary classification problems, each of which corresponds to a different category. To determine which categories are assigned to a document, each binary classifier is used to determine whether the document belongs to its corresponding category.

Figure 2.2: An example of transforming a multi-label problem into 3 binary classification problems
To build a binary classifier for a given category C, all training documents are divided into two categories: the positive category contains the documents belonging to category C, while all documents in the other categories belong to the negative category. Figure 2.2 illustrates a 3-category problem transformed into three binary problems: for the binary classifier corresponding to class 1, the documents in this class belong to the positive category, and all documents in class 2 and class 3 together belong to the negative category.
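A sketch of the OneVsAll transformation follows; `train_binary` is a hypothetical stand-in for any single-label learner (for example, a linear SVM), assumed here to return a callable classifier, and `train_majority` is just a trivial learner used to make the example runnable.

```python
def one_vs_all(documents, labels, train_binary):
    """Transform a multi-label problem with N categories into N binary ones.

    documents: list of feature vectors
    labels: list of label sets, e.g. [{1, 3}, {2}, ...]
    train_binary: any binary learner taking (X, y) with y in {0, 1}
    """
    categories = sorted(set(c for ls in labels for c in ls))
    classifiers = {}
    for c in categories:
        # positive category: documents labelled with c; the rest are negative
        y = [1 if c in ls else 0 for ls in labels]
        classifiers[c] = train_binary(documents, y)
    return classifiers

def predict(classifiers, document):
    # a document receives every category whose binary classifier accepts it
    return {c for c, clf in classifiers.items() if clf(document) == 1}

def train_majority(X, y):
    # trivial stand-in learner: always predicts the majority class
    majority = 1 if sum(y) * 2 >= len(y) else 0
    return lambda doc: majority

clfs = one_vs_all([[1, 0], [0, 1], [1, 1]], [{1}, {2}, {1, 2}], train_majority)
print(predict(clfs, [1, 0]))
```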
2.3.2 Flat and Hierarchical Text Categorization

According to the category structure, text categorization can be divided into two categories. The former is flat categorization, where each category is separate from the others. The latter is hierarchical categorization, in which there is a hierarchical category structure. An example of a hierarchy with two top-level categories, Cars and Sports, and three subcategories within each, namely Cars/Lorry, Cars/Truck, Cars/Taxi, Sports/Football, Sports/Skiing and Sports/Tennis, is shown in Figure 2.3.

Figure 2.3: A hierarchy with two top-level categories
In the flat classification case, a model corresponding to a positive category is learned to distinguish the target category from all other categories. However, in hierarchical classification, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same level. In Figure 2.3, the text classifiers corresponding to the top-level categories, Cars and Sports, distinguish these two categories from each other; this is the same as flat TC. Meanwhile, the model corresponding to each second-level category is learned to distinguish that category from the other second-level categories within the same top-level category. Specifically, the model built for the category Cars/Lorry distinguishes it from the other two categories under Cars, namely Cars/Taxi and Cars/Truck.
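The sketch below illustrates this training regime for a two-level hierarchy with category paths such as "Cars/Lorry"; `train_binary` is again a hypothetical stand-in for any binary learner.

```python
def top(path):
    # top-level category of a path such as "Cars/Lorry"
    return path.split("/")[0]

def train_hierarchical(documents, labels, train_binary):
    # labels are full paths such as "Cars/Lorry", one per document
    clfs = {}
    # top level: flat one-vs-rest over all documents
    for t in sorted({top(l) for l in labels}):
        clfs[t] = train_binary(documents, [1 if top(l) == t else 0 for l in labels])
    # second level: distinguish a category only from its siblings
    for c in sorted(set(labels)):
        sib = [(d, 1 if l == c else 0)
               for d, l in zip(documents, labels) if top(l) == top(c)]
        clfs[c] = train_binary([d for d, _ in sib], [y for _, y in sib])
    return clfs
```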
2.4 Applications of Text Categorization

There are a large number of applications of text categorization. In this section, we discuss the important ones.
2.4.1 Automatic Document Indexing for IR Systems

Automatic document indexing for IR systems is the activity in which each document is assigned some keywords or key phrases describing its content, drawn from a dictionary. Generally, this work is done by trained human indexers. However, if we treat the entries in the dictionary as categories, document indexing becomes an application of TC, and it may be solved by computers. Several ways of using TC techniques for automatic document indexing have been described in [41], [35]. The dictionary usually consists of a thematic hierarchical thesaurus, for example, the NASA thesaurus for the aerospace discipline, or the MeSH thesaurus for the biomedical literature.

Automatic indexing with a controlled dictionary and automated metadata generation are closely related to each other. In digital libraries, documents are tagged with metadata (for example, creation date, document type, author, availability, and so on). Some of this metadata is thematic, and its role is to describe the documents by means of bibliographic codes, keywords or key phrases.
2.4.2 Documentation Organization

Documentation organization might be the most general application of TC, because there is a huge number of documents that need to be classified. Textual information can be found in ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts and so on. A system classifying newspaper advertisements under different categories such as Cars for Sale and Job Hunting, or a system grouping conference papers into sessions related to themes, are two examples of documentation organization.
2.4.3 Word Sense Disambiguation

The task of word sense disambiguation (WSD) is to find the sense of an ambiguous word (for instance, bank may mean a financial institution or the land along the side of a river), given an occurrence of this particular word in a context. Although a number of other techniques have been used for WSD, another solution is to apply TC techniques, treating the occurrence contexts of the word as documents and the word senses as categories [19], [15].
2.4.4 Text Filtering System

Text filtering is the activity of classifying a stream of incoming documents, dispatched in an asynchronous way by an information producer to an information consumer [4]. One typical instance is a news feed, in which the consumer is a newspaper and the producer is a news agency [22]. In this case, the filtering system should block the delivery of documents that the consumer is likely not interested in (for example, all news not concerning sports, in the case of a sports newspaper). Moreover, a text filtering system might further categorize the documents considered relevant to the consumer into different thematic categories; for instance, the relevant documents (news about sports) could be further classified according to which sport they involve. A junk e-mail filtering system is another instance: it may be trained to get rid of spam mails and to further categorize non-spam mails into different categories [2], [20]. Information filtering based on machine learning techniques has been discussed in [1], [24].
2.4.5 Hierarchical Categorization of Web Pages

When documents are catalogued hierarchically, it is easier for a researcher to first navigate the hierarchy of categories and then limit the search to a category of interest. Therefore, many real-world web classification systems have been built on complicated hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart and so on. Hierarchical web page classification may be dealt with using hierarchical TC techniques. Prior works related to hierarchical structure in a TC context have been discussed in [13], [42]. In practice, links have also been used in web page classification [34], [20].
2.5 Machine Learning Approaches to Text Categorization

Figure 2.4: Text Categorization using machine learning techniques
2.5.1 k Nearest Neighbor

k Nearest Neighbor (kNN) is a kind of example-based classifier: it relies on the category labels assigned to the training documents that are similar to the test document. Specifically, a classifier using the kNN algorithm categorizes an unlabelled document under a class based on the categories of the k training documents that are most similar to this document. Distance metrics measuring the similarity between two documents include the Euclidean distance:
$$\mathrm{Dis}(P, Q) = \sqrt{\sum_i (p_i - q_i)^2} \qquad (2.3)$$

where P and Q are two samples, and p_i and q_i are the attributes of the two samples, respectively.
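A minimal sketch of kNN classification using the Euclidean distance of equation (2.3) and a simple majority vote; the value of k and the toy vectors are illustrative.

```python
import math
from collections import Counter

def euclidean(p, q):
    # Dis(P, Q) = sqrt(sum_i (p_i - q_i)^2), as in equation (2.3)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_vec, k=3):
    # train: list of (vector, category) pairs
    nearest = sorted(train, key=lambda item: euclidean(item[0], test_vec))[:k]
    # majority vote among the k most similar training documents
    votes = Counter(cat for _, cat in nearest)
    return votes.most_common(1)[0][0]

train = [([1, 0, 2], "sports"), ([0, 3, 1], "cars"), ([1, 1, 2], "sports")]
print(knn_classify(train, [1, 0, 1], k=3))
```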
kNN has been shown to be quite effective, but its significant drawback is the classification time in the case of huge, high-dimensional data sets. Furthermore, kNN requires the entire set of training samples to be ranked for similarity with each test document, which is expensive. Actually, the kNN method cannot be called an inductive learner, because it does not have a training phase.
2.5.2 Decision Tree

A decision tree (DT) text classifier is a tree in which each internal node is labelled with a term, each branch corresponds to a term weight, and each leaf node is labelled with a category. To categorize a test document, the classifier starts at the root of the tree and moves through the tree until it reaches a leaf node, which provides the category. At each internal node, the classifier tests whether the document contains the term labelling that node; if so, the moving direction follows the weight of this term in the document. Most such classifiers apply binary text representations and binary trees. Figure 2.5 is an example of a binary tree where edges are labelled with terms (underlining denotes negation) and leaves are labelled with categories (WHEAT in this example).

Figure 2.5: An example of a decision tree [source: [27]]
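A minimal sketch of this traversal over a binary (term present/absent) representation; the tree built below is hypothetical, not the tree of Figure 2.5.

```python
class Node:
    def __init__(self, term=None, present=None, absent=None, category=None):
        # internal nodes carry a term and two children; leaves carry a category
        self.term, self.present, self.absent, self.category = term, present, absent, category

def classify(node, document_terms):
    # walk from the root to a leaf, branching on term presence
    while node.category is None:
        node = node.present if node.term in document_terms else node.absent
    return node.category

tree = Node(term="wheat",
            present=Node(category="WHEAT"),
            absent=Node(term="farm",
                        present=Node(category="WHEAT"),
                        absent=Node(category="NOT-WHEAT")))
print(classify(tree, {"farm", "price"}))  # -> WHEAT
```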
One important issue with DTs is overfitting, which occurs when some branches are too specific to the training samples. Thus, most decision tree learning methods include a mechanism for growing and pruning the tree (for example, discarding overly specific branches). Among the standard packages for DT learning, the popular ones are ID3 [17], C4.5 [7] and C5 [31].
2.5.3 Support Vector Machines

The support vector machine (SVM) algorithm was first introduced by Vapnik. It was originally applied to text categorization by Joachims and Dumais [25], [14]. Among all the surfaces dividing the training examples into two classes in the |W|-dimensional term space, the SVM finds the one that separates the positive examples from the negative examples by the widest possible margin.

SVMs are usually grouped into linear SVMs and non-linear SVMs based on the kernel function used. For instance, common kernel functions are the linear function $K(x_i, x_j) = x_i^{\top} x_j$, the polynomial function $K(x_i, x_j) = (\gamma\, x_i^{\top} x_j + \tau)^d$, the radial basis function $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ and the sigmoid function $K(x_i, x_j) = \tanh(\gamma\, x_i^{\top} x_j + \tau)$ for vectors $x_i$ and $x_j$, where γ, τ and d are kernel parameters.
In recent years, SVMs have been widely used and have shown better performance than other machine learning algorithms due to their ability to handle high-dimensional and large-scale training sets [25], [14]. There are a number of software packages implementing the SVM algorithm with different kernel functions, such as SVM-Light, LIBSVM, TinySVM, LIBLINEAR, and so on.
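As a usage illustration, the sketch below trains a linear SVM text classifier, assuming scikit-learn rather than one of the packages named above; the tf.idf step merely stands in for whatever weighting scheme is under study, and the toy documents are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["wheat prices rose", "the football match ended", "corn and wheat exports"]
labels = ["grain", "sports", "grain"]

vec = TfidfVectorizer()            # tf.idf weighting; a custom scheme such as
X = vec.fit_transform(docs)        # logtf.rfmax would replace this step
clf = LinearSVC().fit(X, labels)   # linear-kernel SVM

print(clf.predict(vec.transform(["wheat harvest report"])))
```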
2.6 Performance Measures

In this section, we describe the measures used to evaluate TC effectiveness.
For each category, let TP_i denote the true positives (the number of documents that belong to this category and are correctly assigned to it); FP_i denote the false positives (the number of documents that do not belong to this category but are incorrectly assigned to it); TN_i denote the true negatives (the number of documents that do not belong to this category and are correctly not assigned to it); and FN_i denote the false negatives (the number of documents that belong to this category but are incorrectly not assigned to it). We define five measures as follows:
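In terms of these counts, the per-category precision, recall and F1 take their standard forms (a sketch of the usual definitions, stated here because the averaged measures below build on them; F1_i is the harmonic mean of precision and recall):

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 P_i R_i}{P_i + R_i}$$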
Measures for multi-label classification. To assess the performance over m categories in a multi-label classification task, we have two averaging methods, namely macro-F1 and micro-F1. The formula for macro-F1 is given below.