An Improved Term Weighting Scheme for
Text Categorization
Pham Xuan Nguyen
Faculty of Information Technology
University of Engineering and Technology, Vietnam National University, Hanoi
Supervised by
Dr Le Quang Hieu
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
August 2014
ORIGINALITY STATEMENT
"I hereby declare that this submission is my own work. To the best of my knowledge, it contains no materials previously published by another person, or substantial proportions of material which have been accepted for the award of any other degrees or diplomas at University of Engineering and Technology (UET/Coltech) or any other educational institutions, except where due acknowledgement is made in the thesis. Any contributions made to the research by others are explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."
ABSTRACT
In text categorization, term weighting is the task of assigning weights to terms during the document representation phase. Thus, it affects the classification performance. In addition to resulting in a high performance of text categorization, an effective term weighting scheme should be easy to use.

Term weighting methods can be divided into two categories, namely, supervised and unsupervised [27]. The traditional term weighting schemes, such as binary, tf, and tf.idf [38], belong to the unsupervised term weighting methods. Other schemes (for example, tf.χ² [12]) that make use of the prior information about the membership of training documents belong to the supervised term weighting methods.

The supervised term weighting method tf.rf [27] is one of the most effective schemes to date. It showed better performance than many others [27]. However, tf.rf is not the best in some cases. Moreover, tf.rf requires many rf values for each term.

In this thesis, we present an improved term weighting scheme derived from tf.rf, called logtf.rf_max. Our new scheme uses logtf = log₂(1.0 + tf) instead of tf. Furthermore, our scheme is simpler than tf.rf because it only uses the maximum rf value for each term. Our experimental results showed that our scheme is consistently better than tf.rf and the other schemes.
To my family
ACKNOWLEDGEMENTS
First, I would like to express my gratitude to my supervisor, Dr Le Quang Hieu. He guided me throughout the years and gave me much useful advice about study methods. He was very patient with me, and his words influenced me strongly. I also would like to give my honest appreciation to my colleagues at Hoa Lu University and the University of Engineering and Technology (UET/Coltech) for their great support. Thank you all!
Contents

1 Introduction
   1.2 Structure of this Thesis
2 Overview of Text Categorization
   2.2 Text Representation
   2.3 Text Categorization Tasks
      2.3.1 Single-label and Multi-label Text Categorization
      2.3.2 Flat and Hierarchical Text Categorization
   2.4 Applications of Text Categorization
      2.4.1 Automatic Document Indexing for IR Systems
      2.4.2 Documentation Organization
      2.4.3 Word Sense Disambiguation
      2.4.4 Text Filtering System
      2.4.5 Hierarchical Categorization of Web Pages
   2.5 Machine Learning Approaches to Text Categorization
      2.5.1 k Nearest Neighbor
      2.5.2 Decision Tree
      2.5.3 Support Vector Machines
   2.6 Performance Measures
3 Term Weighting Schemes
   3.2 Previous Term Weighting Schemes
      3.2.1 Unsupervised Term Weighting Schemes
      3.2.2 Supervised Term Weighting Schemes
4 Experiments
   4.3 Corpora
      4.3.1 20 Newsgroups Corpus
      4.3.2 Reuters News Corpus
   Results on the 20 Newsgroups corpus
   Results on the Reuters News corpus
   Discussion
   Further Analysis
5 Conclusion
List of Figures

An example of transforming a multi-label problem into 3 binary classification problems
A hierarchy with two top-level categories
Text Categorization using machine learning techniques
An example of a decision tree [source: [27]]
Linear Support Vector Machine [source: [14]]
The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary
The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories from 1
List of Tables

3.1 Traditional Term Weighting Schemes
3.2 Examples of two terms having different tf and log₂(1.0 + tf)
4.1 Experimental Term Weighting Schemes
5.1 Examples of two term weights using rf and rf_max
Abbreviations

SVM    Support Vector Machine
Chapter 1

Introduction

In the text representation phase, the content of documents is transformed into a compact format. Specifically, each document is presented as a vector of terms in the vector space model. Each vector component contains a value presenting how much a term contributes to the discriminative semantics of the document. A term weighting scheme (TWS) assigns weights to terms in this phase.
TWS is a well-studied field. The traditional term weighting methods such as binary, tf and tf.idf are borrowed from the information retrieval (IR) domain. These term weighting schemes do not use the prior information about the membership of training documents. Other schemes using this information are called the supervised term weighting schemes, for example, tf.χ² [12].
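To make the unsupervised schemes concrete, the minimal Python sketch below computes tf.idf weights under the common formulation tf.idf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; the toy documents and function names are illustrative assumptions, and the exact tf.idf variant used in this thesis's experiments may differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Weight each term of each document by tf * log(N / df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights

docs = [["wheat", "price", "wheat"], ["ship", "price"], ["wheat"]]
print(tfidf_weights(docs))
```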
To date, the supervised term weighting scheme tf.rf [27] is one of the best methods. It achieves better performance than many others in a series of thorough experiments using two commonly-used algorithms (SVM and kNN) as well as two benchmark data collections (Reuters News and 20 Newsgroups). However, the performance of tf.rf is not stable: tf.rf shows considerably better performance than all other schemes in the experiments on the Reuters News data set, while its performance is worse than that of rf (a term weighting scheme that does not use the tf factor) and only slightly better than that of tf.idf (a common term weighting method) in the experiments on the 20 Newsgroups corpus. Furthermore, for each term, tf.rf requires N (the total number of categories) rf values in a multi-label classification problem. This raises the question of whether there is a typical rf value for each term.
In this thesis, we propose an improved term weighting scheme that applies two improvements to tf.rf. First, we replace tf by logtf = log₂(1.0 + tf). Second, we only use the maximum rf value (rf_max) for each term in a multi-label classification problem. The formula for our scheme is logtf.rf_max.
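As an illustration, the sketch below computes logtf.rf_max weights for a toy collection. It assumes the rf definition of Lan et al. [27], rf = log₂(2 + a / max(1, c)), where a and c are the numbers of positive and negative training documents containing the term; the helper names and example counts are our own assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def rf(a, c):
    """Relevance frequency of a term for one category (after [27]):
    a / c = positive / negative training documents containing the term."""
    return math.log2(2.0 + a / max(1.0, c))

def logtf_rfmax_weights(doc_tokens, term_rf_max):
    """Weight each term of a document by log2(1 + tf) * rf_max."""
    tf = Counter(doc_tokens)
    return {t: math.log2(1.0 + f) * term_rf_max.get(t, rf(0, 0))
            for t, f in tf.items()}

# Toy per-category counts: rf is computed per category and only the
# maximum over all categories (rf_max) is kept for each term.
per_category_counts = {
    "wheat": [(30, 2), (1, 29)],   # term -> [(a, c) for each category]
    "price": [(12, 10), (11, 11)],
}
term_rf_max = {t: max(rf(a, c) for a, c in counts)
               for t, counts in per_category_counts.items()}

print(logtf_rfmax_weights(["wheat", "wheat", "price"], term_rf_max))
```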
We conducted experiments with the experimental settings described in [27], where tf.rf was proposed. We use two standard measures (micro-F1 and macro-F1) as well as linear SVM. We carefully selected eight term weighting schemes, including two common methods, two schemes used in [27], and four methods applying our improvements, in order to assess our work. The experimental results show that logtf.rf_max consistently outperforms tf.rf as well as the other schemes on the two data sets.
1.2 Structure of this Thesis

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of text categorization. Chapter 3 reviews the term weighting schemes for text categorization and describes our improved term weighting scheme. Chapter 4 describes our experiments, including the used algorithms, data sets, measures, results and discussion. Chapter 5 presents the conclusion.
In this study, the default studied language is English. In addition, we only apply the bag-of-words approach to represent a document, and the used data sets are flat. The results of the study can provide a valuable term weighting method for TC.
Chapter 2
Overview of Text Categorization
This chapter gives an overview of TC. We begin by introducing TC, then present some applications and tasks of TC. The rest of this chapter is about the approaches to TC, especially SVM, which is applied in this thesis.
Automated text categorization (or text classification) is the supervised learning task of assigning documents to predefined categories. TC differs from text clustering, where we cannot know the set of categories in advance.

TC has been studied since the early 1960s, but it has only become a major focus in recent decades due to the need to categorize the large number of documents on the World Wide Web. Generally, TC relates to the machine learning (ML) and information retrieval (IR) fields.
In the 1980s, the popular approach to TC was constructing an expert system capable of taking text classification decisions based on knowledge engineering techniques. The famous example of this method is the CONSTRUE system [22]. Since the early 1990s, the machine learning approaches to TC have become popular.
2.2 Text Representation

The content of documents must be transformed into a suitable representation so as to be recognized and categorized by classifiers.
One way to represent text is to use the vector space model (VSM) based on words (a technique in the IR domain). In VSM, the content of a textual document is converted to a vector in the term space (each term usually associates with a word). In detail, the document d is represented as (w₁, ..., wₘ), where m is the total number of terms. The value of wᵢ represents how much the term tᵢ contributes to classifying the document d. Figure 2.1 illustrates the way of representing documents in VSM. Five documents are represented as five vectors in the 3-dimensional space (System, Class, Text).
In the process of transforming documents according to VSM, the word sequence in a document is not considered, and each dimension in the vector space associates with a word in the vocabulary that is built after the text preprocessing phase. In this phase, the words assumed to have no information content (such as stop words, numbers, and so on) are removed from a document. Then words can be stemmed. Finally, the remaining words in all of the documents are sorted alphabetically and numbered consecutively. Stop words are common words that are not useful to TC, such as articles (for example, "the", "a"), prepositions (for example, "of", "in"), and conjunctions (for example, "and", "or"). Stemming algorithms are used to map several morphological forms of a word to a term (for instance, "computers" is mapped to "computer"). To reduce the dimension of the feature space, a feature selection process is usually applied. In this process, each term is assigned a score presenting the "importance" of this term for the TC task. Then only the top terms with the highest scores are used to represent all documents.
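A minimal sketch of this preprocessing and vectorization pipeline is given below; the tiny stop-word list, the crude one-rule stemmer, and the example documents are illustrative stand-ins (a real system would use a full stop-word list and, for example, the Porter stemmer).

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "or"}  # toy stop-word list

def stem(word):
    """Crude one-rule stemmer, for illustration only."""
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    """Lowercase, remove stop words and numbers, then stem."""
    return [stem(w) for w in text.lower().split()
            if w not in STOP_WORDS and not w.isdigit()]

docs = ["The system of classes", "Text classification systems"]
tokenized = [preprocess(d) for d in docs]

# Vocabulary: remaining words sorted alphabetically, numbered consecutively.
vocab = {t: i for i, t in enumerate(sorted({w for d in tokenized for w in d}))}

# Each document becomes a term-frequency vector in the VSM.
vectors = []
for toks in tokenized:
    tf = Counter(toks)
    vectors.append([tf.get(t, 0) for t in vocab])
print(vocab)
print(vectors)
```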
Two key issues considered in the text representation phase are term types and term weights. A term (or a feature) can be a sub-word, a word, a phrase, a sentence, and so on. The common type of term is a word, and a document is treated as a group of words with different frequencies. This representation method is called the bag-of-words approach, and it performs well in practice. The bag-of-words approach is simple, but it discards a lot of useful information about the semantics between words. For example, the two words in a phrasal verb are considered as two independent ones. To solve this problem, many researchers used phrases (for instance, noun phrases) or sentences as terms. These phrases often include syntactic and/or statistical information [29], [6]. Furthermore, the term type can be a combination [10] of different types, for example, the word-level type and the 3-gram type [10]. Term weights will be discussed in Chapter 3.
2.3 Text Categorization Tasks
Text categorization can be classified into many different types according to the number of categories assigned to a document, the total number of categories, and the category structure.
2.3.1 Single-label and Multi-label Text Categorization
Based on the number of categories that a document can belong to, text categorization is classified into two types, namely, single-label and multi-label.

Single-label Text Categorization Single-label classification is the case where each document is assigned to only one category, and there are two or more categories. Binary classification is a special case of single-label text categorization, in which the number of categories is two.
Multi-label Text Categorization In multi-label classification, a document can be assigned to more than one category, and it involves two or more categories. Multi-label classification differs from multi-class single-label classification, where the number of categories is also more than one but a document is assigned to only one category.
To solve the multi-label problem, we can apply either the problem transformation methods or the algorithm adaptation methods. The problem transformation methods transform a multi-label problem into a set of binary classification problems, each of which can be solved by a single-label classifier.
An example of the transformation method is One-Vs-All. This approach transforms the multi-label classification problem of N categories into N binary classification problems, each of which corresponds to a different category. To determine the training set of each binary problem, the documents of the corresponding category are treated as positive examples, while all other documents belong to the negative category.
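A minimal sketch of this One-Vs-All transformation follows; the label sets and category names are illustrative assumptions.

```python
def one_vs_all(doc_labels, categories):
    """Transform a multi-label problem into one binary problem per category.

    doc_labels: list of label sets, one per document.
    Returns {category: list of 0/1 targets}, one binary task per category.
    """
    return {c: [1 if c in labels else 0 for labels in doc_labels]
            for c in categories}

labels = [{"wheat"}, {"wheat", "corn"}, {"ship"}]
print(one_vs_all(labels, ["wheat", "corn", "ship"]))
# {'wheat': [1, 1, 0], 'corn': [0, 1, 0], 'ship': [0, 0, 1]}
```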
2.3.2 Flat and Hierarchical Text Categorization
According to the category structure, text categorization can be divided into two categories. The former is flat categorization, where a category is separate from the others. The latter is hierarchical categorization, in which there is a hierarchical category structure. An example of a hierarchy with two top-level categories, Cars and Sports, and three subcategories within each, namely, Cars/Lorry, Cars/Truck, Cars/Taxi, Sports/Football, Sports/Skiing, and Sports/Tennis, is shown in Figure 2.3.

Figure 2.3: A hierarchy with two top-level categories
In the flat classification case, a model corresponding to a positive category is learned to distinguish the target category from all other categories. However, in hierarchical classification, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same top level. In Figure 2.3, the text classifiers corresponding to each top-level category, Cars and Sports, distinguish them from each other. This is the same as flat TC. Meanwhile, the model corresponding to each second-level category is learned to distinguish a second-level category from the other second-level categories within the same top-level category. Specifically, the model built on category Cars/Lorry distinguishes itself from the other two categories under the Cars category, namely, Cars/Taxi and Cars/Truck.
2.4 Applications of Text Categorization

There are a large number of applications of text categorization. In this section, we discuss the important ones.
2.4.1 Automatic Document Indexing for IR Systems
Automatic document indexing for IR systems is the activity in which each document is assigned some key words or key phrases describing its content, drawn from a dictionary. Generally, this work is done by trained human indexers. However, if we treat the entries in the dictionary as categories, document indexing is an application of TC, and it may be solved by computers. Several ways of using TC techniques for automatic document indexing have been described in [41], [35]. The dictionary usually consists of a thematic hierarchical thesaurus, for example, the NASA thesaurus for the aerospace discipline, or the MESH thesaurus for the biomedical literature.

Automatic indexing with a controlled dictionary and automated metadata generation are closely related to each other. In digital libraries, documents are tagged by metadata (for example, creation date, document type, author, availability, and so on). Some of this metadata is thematic, and the role of the metadata is to describe the documents by means of bibliographic codes, key words or key phrases.
2.4.2 Documentation Organization

Documentation organization might be the most general application of TC, because there is a huge number of documents that need to be classified. Textual information can be in ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts, and so on. A system classifying newspaper advertisements under different categories such as Cars for Sale and Job Hunting, or a system grouping conference papers into sessions related to themes, are two examples of documentation organization.
2.4.3 Word Sense Disambiguation

The task of word sense disambiguation (WSD) is to find the sense of an ambiguous word (for instance, bank may mean a financial institution or the land alongside a river), given the occurrence context of this particular word. Although a number of other techniques have been used in WSD, another solution is to apply TC techniques, in which we treat the word occurrence contexts as documents and the word senses as categories [19], [15].
2.4.4 Text Filtering System

Text filtering is the activity of categorizing a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer [4]. One typical instance is a news feed, in which the consumer is a newspaper and the producer is a news agency [22]. In this case, the filtering system should block the delivery of the documents that the consumer is likely not interested in (for example, all news not concerning sports, in the case of a sports newspaper). Moreover, a text filtering system might also further categorize the documents considered relevant to the consumer into different thematic categories. For instance, the relevant documents (news about sports) could be further classified according to which sport they involve. A junk e-mail filtering system is another instance; it may be trained to get rid of spam mails and to further categorize non-spam mails into different categories [2], [20]. Information filtering based on machine learning techniques has been discussed in [1], [24].
2.4.5 Hierarchical Categorization of Web Pages

When documents are catalogued hierarchically, it is easier for a researcher to first navigate in the hierarchy of categories and limit his search to a category of interest. Therefore, many real-world web classification systems have been built on complicated hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart, and so on. This hierarchical web page classification may be dealt with by hierarchical TC techniques. Prior works related to the hierarchical structure in a TC context have been discussed in [13], [42]. In practice, links have also been used in web page classification by [34], [20].
2.5 Machine Learning Approaches to Text Categorization

Figure 2.4: Text Categorization using machine learning techniques
As mentioned at the beginning of this chapter, machine learning approaches to text categorization have been widely studied since the 1990s. Figure 2.4 illustrates the process of TC based on a machine learning algorithm. The goal of a classifier is to learn a model from the training samples so as to predict the target categories of the test documents. In this section, we present some popular methods.
2.5.1 k Nearest Neighbor

k Nearest Neighbor (kNN) is a kind of example-based classifier. It relies on the category labels assigned to the training documents that are similar to the test document. Specifically, a classifier using the kNN algorithm categorizes an unlabelled document under a class based on the categories of the k training documents that are most similar to this document. The distance metrics measuring the similarity between two documents $d_i$ and $d_j$ include the Euclidean distance

$dist(d_i, d_j) = \sqrt{\sum_{k=1}^{m} (w_{ik} - w_{jk})^2}$

and the inner product

$sim(d_i, d_j) = \sum_{k=1}^{m} w_{ik} w_{jk}.$

Since kNN defers all computation to the classification phase, classifying a test document is much more expensive than with trained classifiers. Actually, the kNN method cannot be called an inductive learner because it does not have a training phase.
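A minimal kNN sketch using the inner-product similarity defined above follows; the toy term-weight vectors and labels are illustrative assumptions.

```python
from collections import Counter

def inner_product(u, v):
    """Inner-product similarity between two term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Assign the majority category among the k most similar training docs."""
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: inner_product(test_vec, train_vecs[i]),
                     reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_vecs = [[2, 0, 1], [1, 0, 3], [0, 4, 0], [0, 3, 1]]
train_labels = ["wheat", "wheat", "ship", "ship"]
print(knn_classify([1, 0, 2], train_vecs, train_labels, k=3))  # -> 'wheat'
```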
2.5.2 Decision Tree

A decision tree (DT) text classifier is a tree in which each internal node is labelled by a term, each branch corresponds to a term weight, and each leaf node is labelled by a category. To categorize a test document, the classifier starts at the root of the tree and moves through the tree until it reaches a leaf node, which provides a category. At each internal node, the classifier tests whether the document contains the term labelling this node or not. If yes, the moving direction follows the weight of this term in the document. Most such classifiers apply binary text representations and binary trees. Figure 2.5 shows an example of a binary tree where edges are labelled by terms (underlining denotes negation) and leaves are labelled by categories (WHEAT in this example).
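A minimal sketch of classifying with such a binary tree over term presence follows; the node layout and the WHEAT example tree are illustrative assumptions, not the tree from the figure.

```python
def classify(node, document_terms):
    """Walk from the root to a leaf; leaves are category labels (str)."""
    while isinstance(node, dict):
        branch = "yes" if node["term"] in document_terms else "no"
        node = node[branch]
    return node

# Toy binary tree: internal nodes test term presence, leaves are categories.
tree = {"term": "wheat",
        "yes": {"term": "farm", "yes": "WHEAT", "no": "NOT WHEAT"},
        "no": "NOT WHEAT"}
print(classify(tree, {"wheat", "farm", "price"}))  # -> 'WHEAT'
```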
One important issue of DTs is overfitting, when some branches may be too specific to the training samples. Thus, most decision tree learning methods contain a method for growing and pruning the tree (for example, discarding overly specific branches). Among the standard packages for DT learning, the popular ones are ID3 [17], C4.5 [7] and C5 [31].
2.5.3 Support Vector Machines

The support vector machine (SVM) algorithm was first introduced by Vapnik. It was originally applied to text categorization by Joachims and Dumais [14]. Among all the surfaces dividing the training examples into two classes in the |W|-dimensional space (|W| is the number of terms), SVM seeks the surface (the decision surface) that separates the positives from the negatives by the widest possible margin, based on the structural risk minimization principle from computational learning theory. The training examples used to determine the best decision surface are known as support vectors, and not all examples in the training data set are needed to optimize the decision surface [43]. This property makes the SVM algorithm different from many other methods.
SVMs are usually grouped into linear SVMs and non-linear SVMs based on the different kernel functions. Common kernel functions include the linear function $K(x_i, x_j) = x_i \cdot x_j$, the polynomial function $K(x_i, x_j) = (\gamma\, x_i \cdot x_j + r)^d$, the radial basis function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, and the sigmoid function $K(x_i, x_j) = \tanh(\gamma\, x_i \cdot x_j + r)$, for vectors $x_i$ and $x_j$, where $\gamma$, $r$ and $d$ are kernel parameters.
In recent years, SVM has been widely used and has shown better performance than other machine learning algorithms due to its ability to handle high-dimensional and large-scale training sets [25], [14]. There are a number of software packages implementing the SVM algorithm with the different kernel functions, such as SVM-Light, LIBSVM, TinySVM, LIBLINEAR, and so on.
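As a usage illustration, the sketch below trains a linear SVM with scikit-learn's LinearSVC, which wraps LIBLINEAR (one of the packages named above); scikit-learn itself and the toy vectors are our assumptions, not tools used in the thesis.

```python
from sklearn.svm import LinearSVC

X_train = [[2.1, 0.0, 1.3],   # term-weight vectors (e.g., logtf.rf_max)
           [1.0, 0.2, 3.0],
           [0.0, 4.2, 0.1],
           [0.3, 3.1, 0.0]]
y_train = [1, 1, 0, 0]        # 1 = positive category, 0 = negative

clf = LinearSVC(C=1.0)        # linear kernel; C controls the soft margin
clf.fit(X_train, y_train)
print(clf.predict([[1.5, 0.1, 2.0]]))  # -> [1]
```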
2.6 Performance Measures
Measures for a category According to [39], for a category $i$, let $TP_i$ denote the true positives (the number of the documents that belong to this category and are correctly assigned to this category); $FP_i$ denote the false positives (the number of the documents that do not belong to this category, but are incorrectly assigned to this category); $TN_i$ denote the true negatives (the number of the documents that do not belong to this category and are correctly not assigned to this category); and $FN_i$ denote the false negatives (the number of the documents that belong to this category, but are incorrectly not assigned to this category). We define the following measures:

$p_i = \frac{TP_i}{TP_i + FP_i}$ (precision)

$r_i = \frac{TP_i}{TP_i + FN_i}$ (recall)

$F_{1i} = \frac{2\, p_i\, r_i}{p_i + r_i}$

$F_{1i}$ is called the harmonic mean of $p_i$ and $r_i$. With $p_i + r_i$ fixed, the more balanced $p_i$ and $r_i$ are, the higher $F_{1i}$ is.
Measures for multi-label classification To assess the performance over $N$ categories in a multi-label classification task, we have two averaging methods, namely, macro-$F_1$ and micro-$F_1$. The formula for macro-$F_1$ is:

$macro\text{-}F_1 = \frac{1}{N} \sum_{i=1}^{N} F_{1i}$
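The sketch below computes macro-$F_1$ as above and micro-$F_1$ under the standard pooled definition (TP, FP and FN are summed over all categories before computing $F_1$); the per-category counts are illustrative assumptions.

```python
def f1(tp, fp, fn):
    """F1 from raw counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = [(80, 10, 20), (5, 3, 15)]   # (TP_i, FP_i, FN_i) per category

# Macro-F1: average the per-category F1 values.
macro_f1 = sum(f1(*c) for c in counts) / len(counts)

# Micro-F1: pool the counts over categories, then compute F1 once.
micro_f1 = f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))
print(round(macro_f1, 3), round(micro_f1, 3))
```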