An Improved Term Weighting Scheme for
Text Categorization
Pham Xuan Nguyen
Faculty of Information Technology
University of Engineering and Technology, Vietnam National University, Hanoi
Supervised by
Dr Le Quang Hieu
A thesis submitted in fulfillment of the requirements
for the degree of Master of Science in Computer Science
August 2014
ORIGINALITY STATEMENT
"I hereby declare that this submission is my own work. To the best of my knowledge, it contains no materials previously published by another person, or substantial proportions of material which have been accepted for the award of any other degrees or diplomas at University of Engineering and Technology (UET/Coltech) or any other educational institutions, except where due acknowledgement is made in the thesis. Any contributions made to the research by others are explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."
ABSTRACT
In text categorization, term weighting is the task of assigning weights to terms during the document representation phase. Thus, it affects the classification performance. In addition to resulting in a high performance of text categorization, an effective term weighting scheme should be easy to use.

Term weighting methods can be divided into two categories, namely, supervised and unsupervised [27]. The traditional term weighting schemes, such as binary, tf, and tf.idf [38], belong to the unsupervised term weighting methods. Other schemes (for example, tf.χ² [12]) that make use of the prior information about the membership of training documents belong to the supervised term weighting methods.

The supervised term weighting method tf.rf [27] is one of the most effective schemes to date. It showed better performance than many others [27]. However, tf.rf is not the best in some cases. Moreover, tf.rf requires many rf values for each term.

In this thesis, we present an improved term weighting scheme derived from tf.rf, called logtf.rf_max. Our new scheme uses logtf = log₂(1.0 + tf) instead of tf. Furthermore, our scheme is simpler than tf.rf because it only uses the maximum rf value for each term. Our experimental results showed that our scheme is consistently better than tf.rf and the other schemes.
To my family
ACKNOWLEDGEMENTS
First, I would like to express my gratitude to my supervisor, Dr Le Quang Hieu. He guided me throughout the years and gave me much useful advice about study methods. He was very patient with me, and his words influenced me strongly. I also would like to give my honest appreciation to my colleagues at Hoa Lu University and the University of Engineering and Technology (UET/Coltech) for their great support. Thank you all!
Contents

1 Introduction
   1.2 Structure of this Thesis
2 Overview of Text Categorization
   2.2 Text Representation
   2.3 Text Categorization Tasks
      2.3.1 Single-label and Multi-label Text Categorization
      2.3.2 Flat and Hierarchical Text Categorization
   2.4 Applications of Text Categorization
      2.4.1 Automatic Document Indexing for IR Systems
      2.4.2 Documentation Organization
      2.4.3 Word Sense Disambiguation
      2.4.4 Text Filtering System
      2.4.5 Hierarchical Categorization of Web Pages
   2.5 Machine Learning Approaches to Text Categorization
      2.5.1 k Nearest Neighbor
      2.5.2 Decision Tree
      2.5.3 Support Vector Machines
   2.6 Performance Measures
3 Term Weighting Schemes
   3.2 Previous Term Weighting Schemes
      3.2.1 Unsupervised Term Weighting Schemes
      3.2.2 Supervised Term Weighting Schemes
4 Experiments
   4.3 Corpora
      4.3.1 20 Newsgroups Corpus
      4.3.2 Reuters News Corpus
   Results on the 20 Newsgroups corpus
   Results on the Reuters News corpus
   Discussion
   Further Analysis
5 Conclusion
List of Figures

An example of transforming a multi-label problem into 3 binary classification problems
A hierarchy with two top-level categories
Text Categorization using machine learning techniques
An example of a decision tree [source: [27]]
Linear Support Vector Machine [source: [14]]
The micro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
The macro-F1 measure of eight term weighting schemes on the 20 Newsgroups corpus with different numbers of features
The micro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
The macro-F1 measure of eight term weighting schemes on the Reuters News corpus with different numbers of features
The F1 measure of four methods on each category of the Reuters News corpus using the SVM algorithm at the full vocabulary
The F1 measure of four methods on each category of the 20 Newsgroups corpus using the SVM algorithm at the full vocabulary, categories from 1
List of Tables

3.1 Traditional Term Weighting Schemes
3.2 Examples of two terms having different tf and log₂(1.0 + tf)
4.1 Experimental Term Weighting Schemes
5.1 Examples of two term weights using rf and rf_max
Abbreviations

SVM    Support Vector Machine
Chapter 1

Introduction

In the text representation phase, the content of documents is transformed into a compact format. Specifically, each document is presented as a vector of terms in the vector space model. Each vector component contains a value presenting how much a term contributes to the discriminative semantics of the document. A term weighting scheme (TWS) assigns weights to terms in this phase.
TWS is a well-studied field. The traditional term weighting methods such as binary, tf and tf.idf are borrowed from the information retrieval (IR) domain. These term weighting schemes do not use the prior information about the membership of training documents. Other schemes using this information are called the supervised term weighting schemes, for example, tf.χ² [12].
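To make the unsupervised schemes concrete, the minimal Python sketch below computes tf.idf weights under the common formulation tf.idf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; the toy documents and function names are illustrative assumptions, and the exact tf.idf variant used in this thesis's experiments may differ.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Weight each term of each document by tf * log(N / df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights

docs = [["wheat", "price", "wheat"], ["ship", "price"], ["wheat"]]
print(tfidf_weights(docs))
```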
To date, the supervised term weighting scheme tf.rf [27] is one of the best methods. It achieves better performance than many others in a series of thorough experiments using two commonly-used algorithms (SVM and kNN) as well as two benchmark data collections (Reuters News and 20 Newsgroups). However, the performance of tf.rf is not stable: tf.rf shows considerably better performance than all other schemes in the experiments on the Reuters News data set, while its performance is worse than that of rf (a term weighting scheme that does not use the tf factor) and only slightly better than that of tf.idf (a common term weighting method) in the experiments on the 20 Newsgroups corpus. Furthermore, for each term, tf.rf requires N (the total number of categories) rf values in a multi-label classification problem. This raises the question of whether there is a typical rf value for each term.
In this thesis, we propose an improved term weighting scheme that applies two improvements to tf.rf. First, we replace tf by logtf = log₂(1.0 + tf). Second, we only use the maximum rf value (rf_max) for each term in a multi-label classification problem. The formula for our scheme is logtf.rf_max.
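As an illustration, the sketch below computes logtf.rf_max weights for a toy collection. It assumes the rf definition of Lan et al. [27], rf = log₂(2 + a / max(1, c)), where a and c are the numbers of positive and negative training documents containing the term; the helper names and example counts are our own assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def rf(a, c):
    """Relevance frequency of a term for one category (after [27]):
    a / c = positive / negative training documents containing the term."""
    return math.log2(2.0 + a / max(1.0, c))

def logtf_rfmax_weights(doc_tokens, term_rf_max):
    """Weight each term of a document by log2(1 + tf) * rf_max."""
    tf = Counter(doc_tokens)
    return {t: math.log2(1.0 + f) * term_rf_max.get(t, rf(0, 0))
            for t, f in tf.items()}

# Toy per-category counts: rf is computed per category and only the
# maximum over all categories (rf_max) is kept for each term.
per_category_counts = {
    "wheat": [(30, 2), (1, 29)],   # term -> [(a, c) for each category]
    "price": [(12, 10), (11, 11)],
}
term_rf_max = {t: max(rf(a, c) for a, c in counts)
               for t, counts in per_category_counts.items()}

print(logtf_rfmax_weights(["wheat", "wheat", "price"], term_rf_max))
```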
We conducted experiments with the experimental settings described in [27], where tf.rf was proposed. We use two standard measures (micro-F1 and macro-F1) as well as linear SVM. We carefully selected eight term weighting schemes, including two common methods, two schemes used in [27], and four methods applying our improvements, in order to assess our work. The experimental results show that logtf.rf_max consistently outperforms tf.rf as well as the other schemes on the two data sets.
1.2 Structure of this Thesis

The remainder of this thesis is organized as follows. Chapter 2 provides an overview of text categorization. Chapter 3 reviews the term weighting schemes for text categorization and describes our improved term weighting scheme. Chapter 4 describes our experiments, including the used algorithms, data sets, measures, results and discussion. Chapter 5 presents the conclusion.
In this study, the default studied language is English. In addition, we only apply the bag-of-words approach to represent a document, and the used data sets are flat. The results of the study can provide a valuable term weighting method for TC.
Chapter 2
Overview of Text Categorization
This chapter gives an overview of TC. We begin by introducing TC, then present some applications and tasks of TC. The rest of this chapter is about the approaches to TC, especially SVM, which is applied in this thesis.
Automated text categorization (or text classification) is the supervised learning task of assigning documents to predefined categories. TC differs from text clustering, where we cannot know the set of categories in advance.

TC has been studied since the early 1960s, but it has only become a major focus in recent decades due to the need to categorize the large number of documents on the World Wide Web. Generally, TC relates to the machine learning (ML) and information retrieval (IR) fields.
In the 1980s, the popular approach to TC was constructing an expert system capable of taking text classification decisions based on knowledge engineering techniques. The famous example of this method is the CONSTRUE system [22]. Since the early 1990s, the machine learning approaches to TC have become popular.
2.2 Text Representation

The content of documents must be transformed into a suitable representation so as to be recognized and categorized by classifiers.
One way to represent text is to use the vector space model (VSM) based on words (a technique in the IR domain). In VSM, the content of a textual document is converted to a vector in the term space (each term usually associates with a word). In detail, the document d is represented as (w₁, ..., wₘ), where m is the total number of terms. The value of wᵢ represents how much the term tᵢ contributes to classifying the document d. Figure 2.1 illustrates the way of representing documents in VSM. Five documents are represented as five vectors in the 3-dimensional space (System, Class, Text).
In the process of transforming documents according to VSM, the word sequence in a document is not considered, and each dimension in the vector space associates with a word in the vocabulary that is built after the text preprocessing phase. In this phase, the words assumed to have no information content (such as stop words, numbers, and so on) are removed from a document. Then words can be stemmed. Finally, the remaining words in all of the documents are sorted alphabetically and numbered consecutively. Stop words are common words that are not useful to TC, such as articles (for example, "the", "a"), prepositions (for example, "of", "in"), and conjunctions (for example, "and", "or"). Stemming algorithms are used to map several morphological forms of a word to a term (for instance, "computers" is mapped to "computer"). To reduce the dimension of the feature space, a feature selection process is usually applied. In this process, each term is assigned a score presenting the "importance" of this term for the TC task. Then only the top terms with the highest scores are used to represent all documents.
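A minimal sketch of this preprocessing and vectorization pipeline is given below; the tiny stop-word list, the crude one-rule stemmer, and the example documents are illustrative stand-ins (a real system would use a full stop-word list and, for example, the Porter stemmer).

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "or"}  # toy stop-word list

def stem(word):
    """Crude one-rule stemmer, for illustration only."""
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    """Lowercase, remove stop words and numbers, then stem."""
    return [stem(w) for w in text.lower().split()
            if w not in STOP_WORDS and not w.isdigit()]

docs = ["The system of classes", "Text classification systems"]
tokenized = [preprocess(d) for d in docs]

# Vocabulary: remaining words sorted alphabetically, numbered consecutively.
vocab = {t: i for i, t in enumerate(sorted({w for d in tokenized for w in d}))}

# Each document becomes a term-frequency vector in the VSM.
vectors = []
for toks in tokenized:
    tf = Counter(toks)
    vectors.append([tf.get(t, 0) for t in vocab])
print(vocab)
print(vectors)
```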
Two key issues considered in the text representation phase are term types and term weights. A term (or a feature) can be a sub-word, a word, a phrase, a sentence, and so on. The common type of term is a word, and a document is treated as a group of words with different frequencies. This representation method is called the bag-of-words approach, and it performs well in practice. The bag-of-words approach is simple, but it discards a lot of useful information about the semantics between words. For example, the two words in a phrasal verb are considered as two independent ones. To solve this problem, many researchers used phrases (for instance, noun phrases) or sentences as terms. These phrases often include syntactic and/or statistical information [29], [6]. Furthermore, the term type can be a combination [10] of different types, for example, the word-level type and the 3-gram type [10]. Term weights will be discussed in Chapter 3.
2.3 Text Categorization Tasks
Text categorization can be classified into many different types according to the number of categories assigned to a document, the total number of categories, and the category structure.
2.3.1 Single-label and Multi-label Text Categorization
Based on the number of categories that a document can belong to, text categorization is classified into two types, namely, single-label and multi-label.

Single-label Text Categorization Single-label classification is the case where each document is assigned to only one category, and there are two or more categories. Binary classification is a special case of single-label text categorization, in which the number of categories is two.
Multi-label Text Categorization In multi-label classification, a document can be assigned to more than one category, and it involves two or more categories. Multi-label classification differs from multi-class single-label classification, where the number of categories is also more than one but a document is assigned to only one category.
To solve the multi-label problem, we can apply either the problem transformation methods or the algorithm adaptation methods. The problem transformation methods transform a multi-label problem into a set of binary classification problems, each of which can be solved by a single-label classifier.
An example of the transformation method is One-Vs-All. This approach transforms the multi-label classification problem of N categories into N binary classification problems, each of which corresponds to a different category. To determine the training set of each binary problem, the documents of the corresponding category are treated as positive examples, while all other documents belong to the negative category.
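A minimal sketch of this One-Vs-All transformation follows; the label sets and category names are illustrative assumptions.

```python
def one_vs_all(doc_labels, categories):
    """Transform a multi-label problem into one binary problem per category.

    doc_labels: list of label sets, one per document.
    Returns {category: list of 0/1 targets}, one binary task per category.
    """
    return {c: [1 if c in labels else 0 for labels in doc_labels]
            for c in categories}

labels = [{"wheat"}, {"wheat", "corn"}, {"ship"}]
print(one_vs_all(labels, ["wheat", "corn", "ship"]))
# {'wheat': [1, 1, 0], 'corn': [0, 1, 0], 'ship': [0, 0, 1]}
```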
2.3.2 Flat and Hierarchical Text Categorization
According to the category structure, text categorization can be divided into two categories. The former is flat categorization, where a category is separate from the others. The latter is hierarchical categorization, in which there is a hierarchical category structure. An example of a hierarchy with two top-level categories, Cars and Sports, and three subcategories within each, namely, Cars/Lorry, Cars/Truck, Cars/Taxi, Sports/Football, Sports/Skiing, and Sports/Tennis, is shown in Figure 2.3.

Figure 2.3: A hierarchy with two top-level categories
In the flat classification case, a model corresponding to a positive category is learned to distinguish the target category from all other categories. However, in hierarchical classification, a model corresponding to a positive category is learned to distinguish the target category from the other categories within the same top level. In Figure 2.3, the text classifiers corresponding to each top-level category, Cars and Sports, distinguish them from each other. This is the same as flat TC. Meanwhile, the model corresponding to each second-level category is learned to distinguish a second-level category from the other second-level categories within the same top-level category. Specifically, the model built on category Cars/Lorry distinguishes itself from the other two categories under the Cars category, namely, Cars/Taxi and Cars/Truck.
2.4 Applications of Text Categorization

There are a large number of applications of text categorization. In this section, we discuss the important ones.
2.4.1 Automatic Document Indexing for IR Systems
Automatic document indexing for IR systems is the activity in which each document is assigned some key words or key phrases describing its content, drawn from a dictionary. Generally, this work is done by trained human indexers. However, if we treat the entries in the dictionary as categories, document indexing is an application of TC, and it may be solved by computers. Several ways of using TC techniques for automatic document indexing have been described in [41], [35]. The dictionary usually consists of a thematic hierarchical thesaurus, for example, the NASA thesaurus for the aerospace discipline, or the MESH thesaurus for the biomedical literature.

Automatic indexing with a controlled dictionary and automated metadata generation are closely related to each other. In digital libraries, documents are tagged by metadata (for example, creation date, document type, author, availability, and so on). Some of this metadata is thematic, and the role of the metadata is to describe the documents by means of bibliographic codes, key words or key phrases.
2.4.2 Documentation Organization

Documentation organization might be the most general application of TC, because there is a huge number of documents that need to be classified. Textual information can be in ads, newspapers, emails, patents, conference papers, abstracts, newsgroup posts, and so on. A system classifying newspaper advertisements under different categories such as Cars for Sale and Job Hunting, or a system grouping conference papers into sessions related to themes, are two examples of documentation organization.
2.4.3 Word Sense Disambiguation

The task of word sense disambiguation (WSD) is to find the sense of an ambiguous word (for instance, bank may mean a financial institution or the land alongside a river), given the occurrence context of this particular word. Although a number of other techniques have been used in WSD, another solution is to apply TC techniques, in which we treat the word occurrence contexts as documents and the word senses as categories [19], [15].
2.4.4 Text Filtering System

Text filtering is the activity of categorizing a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer [4]. One typical instance is a news feed, in which the consumer is a newspaper and the producer is a news agency [22]. In this case, the filtering system should block the delivery of the documents that the consumer is likely not interested in (for example, all news not concerning sports, in the case of a sports newspaper). Moreover, a text filtering system might also further categorize the documents considered relevant to the consumer into different thematic categories. For instance, the relevant documents (news about sports) could be further classified according to which sport they involve. A junk e-mail filtering system is another instance; it may be trained to get rid of spam mails and to further categorize non-spam mails into different categories [2], [20]. Information filtering based on machine learning techniques has been discussed in [1], [24].
2.4.5 Hierarchical Categorization of Web Pages

When documents are catalogued hierarchically, it is easier for a researcher to first navigate in the hierarchy of categories and limit his search to a category of interest. Therefore, many real-world web classification systems have been built on complicated hierarchical structures, such as Yahoo!, MeSH, U.S. Patents, LookSmart, and so on. This hierarchical web page classification may be dealt with by hierarchical TC techniques. Prior works related to the hierarchical structure in a TC context have been discussed in [13], [42]. In practice, links have also been used in web page classification by [34], [20].
2.5 Machine Learning Approaches to Text Categorization

Figure 2.4: Text Categorization using machine learning techniques
As mentioned at the beginning of this chapter, machine learning approaches to text categorization have been widely studied since the 1990s. Figure 2.4 illustrates the process of TC based on a machine learning algorithm. The goal of a classifier is to learn a model from the training samples so as to predict the target categories of the test documents. In this section, we present some popular methods.
2.5.1 k Nearest Neighbor

k Nearest Neighbor (kNN) is a kind of example-based classifier. It relies on the category labels assigned to the training documents that are similar to the test document. Specifically, a classifier using the kNN algorithm categorizes an unlabelled document under a class based on the categories of the k training documents that are most similar to this document. The distance metrics measuring the similarity between two documents $d_i$ and $d_j$ include the Euclidean distance

$dist(d_i, d_j) = \sqrt{\sum_{k=1}^{m} (w_{ik} - w_{jk})^2}$

and the inner product

$sim(d_i, d_j) = \sum_{k=1}^{m} w_{ik} w_{jk}.$

Since kNN defers all computation to the classification phase, classifying a test document is much more expensive than with trained classifiers. Actually, the kNN method cannot be called an inductive learner because it does not have a training phase.
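A minimal kNN sketch using the inner-product similarity defined above follows; the toy term-weight vectors and labels are illustrative assumptions.

```python
from collections import Counter

def inner_product(u, v):
    """Inner-product similarity between two term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Assign the majority category among the k most similar training docs."""
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: inner_product(test_vec, train_vecs[i]),
                     reverse=True)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_vecs = [[2, 0, 1], [1, 0, 3], [0, 4, 0], [0, 3, 1]]
train_labels = ["wheat", "wheat", "ship", "ship"]
print(knn_classify([1, 0, 2], train_vecs, train_labels, k=3))  # -> 'wheat'
```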
2.5.2 Decision Tree

A decision tree (DT) text classifier is a tree in which each internal node is labelled by a term, each branch corresponds to a term weight, and each leaf node is labelled by a category. To categorize a test document, the classifier starts at the root of the tree and moves through the tree until it reaches a leaf node, which provides a category. At each internal node, the classifier tests whether the document contains the term labelling this node or not. If yes, the moving direction follows the weight of this term in the document. Most such classifiers apply binary text representations and binary trees. Figure 2.5 shows an example of a binary tree where edges are labelled by terms (underlining denotes negation) and leaves are labelled by categories (WHEAT in this example).
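A minimal sketch of classifying with such a binary tree over term presence follows; the node layout and the WHEAT example tree are illustrative assumptions, not the tree from the figure.

```python
def classify(node, document_terms):
    """Walk from the root to a leaf; leaves are category labels (str)."""
    while isinstance(node, dict):
        branch = "yes" if node["term"] in document_terms else "no"
        node = node[branch]
    return node

# Toy binary tree: internal nodes test term presence, leaves are categories.
tree = {"term": "wheat",
        "yes": {"term": "farm", "yes": "WHEAT", "no": "NOT WHEAT"},
        "no": "NOT WHEAT"}
print(classify(tree, {"wheat", "farm", "price"}))  # -> 'WHEAT'
```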
One important issue of DTs is overfitting, when some branches may be too specific to the training samples. Thus, most decision tree learning methods contain a method for growing and pruning the tree (for example, discarding overly specific branches). Among the standard packages for DT learning, the popular ones are ID3 [17], C4.5 [7] and C5 [31].
2.5.3 Support Vector Machines

The support vector machine (SVM) algorithm was first introduced by Vapnik. It was originally applied to text categorization by Joachims and Dumais [14]. Among all the surfaces dividing the training examples into two classes in the |W|-dimensional space (|W| is the number of terms), SVM seeks the surface (the decision surface) that separates the positives from the negatives by the widest possible margin, based on the structural risk minimization principle from computational learning theory. The training examples used to determine the best decision surface are known as support vectors, and not all examples in the training data set are needed to optimize the decision surface [43]. This property makes the SVM algorithm different from many other methods.
SVMs are usually grouped into linear SVMs and non-linear SVMs based on the different kernel functions. Common kernel functions include the linear function $K(x_i, x_j) = x_i \cdot x_j$, the polynomial function $K(x_i, x_j) = (\gamma\, x_i \cdot x_j + r)^d$, the radial basis function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, and the sigmoid function $K(x_i, x_j) = \tanh(\gamma\, x_i \cdot x_j + r)$, for vectors $x_i$ and $x_j$, where $\gamma$, $r$ and $d$ are kernel parameters.
In recent years, SVM has been widely used and has shown better performance than other machine learning algorithms due to its ability to handle high-dimensional and large-scale training sets [25], [14]. There are a number of software packages implementing the SVM algorithm with the different kernel functions, such as SVM-Light, LIBSVM, TinySVM, LIBLINEAR, and so on.
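As a usage illustration, the sketch below trains a linear SVM with scikit-learn's LinearSVC, which wraps LIBLINEAR (one of the packages named above); scikit-learn itself and the toy vectors are our assumptions, not tools used in the thesis.

```python
from sklearn.svm import LinearSVC

X_train = [[2.1, 0.0, 1.3],   # term-weight vectors (e.g., logtf.rf_max)
           [1.0, 0.2, 3.0],
           [0.0, 4.2, 0.1],
           [0.3, 3.1, 0.0]]
y_train = [1, 1, 0, 0]        # 1 = positive category, 0 = negative

clf = LinearSVC(C=1.0)        # linear kernel; C controls the soft margin
clf.fit(X_train, y_train)
print(clf.predict([[1.5, 0.1, 2.0]]))  # -> [1]
```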
2.6 Performance Measures
Measures for a category According to [39], for a category $i$, let $TP_i$ denote the true positives (the number of the documents that belong to this category and are correctly assigned to this category); $FP_i$ denote the false positives (the number of the documents that do not belong to this category, but are incorrectly assigned to this category); $TN_i$ denote the true negatives (the number of the documents that do not belong to this category and are correctly not assigned to this category); and $FN_i$ denote the false negatives (the number of the documents that belong to this category, but are incorrectly not assigned to this category). We define the following measures:

$p_i = \frac{TP_i}{TP_i + FP_i}$ (precision)

$r_i = \frac{TP_i}{TP_i + FN_i}$ (recall)

$F_{1i} = \frac{2\, p_i\, r_i}{p_i + r_i}$

$F_{1i}$ is called the harmonic mean of $p_i$ and $r_i$. With $p_i + r_i$ fixed, the more balanced $p_i$ and $r_i$ are, the higher $F_{1i}$ is.
Measures for multi-label classification To assess the performance over $N$ categories in a multi-label classification task, we have two averaging methods, namely, macro-$F_1$ and micro-$F_1$. The formula for macro-$F_1$ is:

$macro\text{-}F_1 = \frac{1}{N} \sum_{i=1}^{N} F_{1i}$
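The sketch below computes macro-$F_1$ as above and micro-$F_1$ under the standard pooled definition (TP, FP and FN are summed over all categories before computing $F_1$); the per-category counts are illustrative assumptions.

```python
def f1(tp, fp, fn):
    """F1 from raw counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = [(80, 10, 20), (5, 3, 15)]   # (TP_i, FP_i, FN_i) per category

# Macro-F1: average the per-category F1 values.
macro_f1 = sum(f1(*c) for c in counts) / len(counts)

# Micro-F1: pool the counts over categories, then compute F1 once.
micro_f1 = f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))
print(round(macro_f1, 3), round(micro_f1, 3))
```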