INCORPORATING LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES INTO DOCUMENT CLASSIFICATION
GOH JIE MEIN
NATIONAL UNIVERSITY OF SINGAPORE
2004
INCORPORATING LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES INTO DOCUMENT CLASSIFICATION

GOH JIE MEIN
BSc (Hons 1, NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF INFORMATION SYSTEMS
NATIONAL UNIVERSITY OF SINGAPORE
2004
ACKNOWLEDGEMENTS

My deep appreciation also goes out to all my friends, peers and colleagues who have helped me in one way or another:

I wish to thank Klarissa Chang, Koh Wei Chern, Cheo Puay Ling and Wong Foong Yin for their listening ears and uplifting remarks.

Colleagues and friends such as Michelle Gwee, Wang Xinwei, Liu Xiao, Koh Chung Haur, Li Yan, Li Huixian, Santosa Paulus, Wan Wen, Bryan Low, Chua Teng Chwan, Tok Wee Hyong, Indriyati Atmosukarto, Colin Tan, Julian Lin and the pioneering batch of Schemers, for keeping life pleasurable in the office.

I would also like to express my sincere thanks to A/P Chan Hock Chuan and A/P Stanislaw Jarzabek for evaluating my research.

And to all professors, teaching staff, administrative staff, friends and students.

Last but not least, I would like to thank my family, especially my parents, sisters and Melvin Lye, for their relentless moral support, motivation, advice, love and understanding.
TABLE OF CONTENTS

1.1 Background & Motivation
1.2 Aims and Objectives
2.2 Feature Selection Methods
2.3 Machine Learning Algorithms
2.3.6 Instance Based Learning – k-Nearest Neighbour
2.4 Natural Language Processing (NLP) in Document Classification
4.3.2 Handling Multiple Categories Training Examples
4.4 Evaluation Measures
5.2 Contribution of Different Linguistically Motivated Knowledge Sources to Classification of Reuters-21578 Corpus
5.2.3 Word Sense (Part of Speech Tags)
5.3 Contribution of Linguistically Motivated Knowledge Sources to Classification Accuracy of WebKB Corpus
SUMMARY

This thesis describes an empirical study of the effects of linguistically motivated knowledge sources with different learning algorithms. Using up to nine different linguistically motivated knowledge sources and six different learning algorithms, we examined classification accuracy with each combination of knowledge source and learning algorithm on two benchmark corpora: Reuters-21578 and WebKB. The use of one linguistically motivated knowledge source, nouns, outperformed the traditional bag-of-words classifiers on Reuters-21578; the best results for this corpus were obtained using nouns with support vector machines. On the other hand, experiments with WebKB showed that classifiers built using the novel combinations of linguistically motivated knowledge sources were as competitive as those built using the conventional bag-of-words technique.
LIST OF TABLES

Table 1 Summary of Related Studies
Table 2 Distribution of Categories in Reuters-21578
Table 3 Top 20 Categories in ModApte Split
Table 4 Breakdown of Documents in WebKB Categories
Table 11 Results using Adjectives
Table 12 Results using both Linguistically Motivated Knowledge Sources and Words
Table 13 Contribution of Knowledge Sources on Reuters-21578 (F1 Measures)
Table 14 Contribution of Knowledge Sources on Reuters-21578 (Precision)
Table 15 Contribution of Knowledge Sources on Reuters-21578 (Recall)
Table 21 Results using Adjectives
Table 22 Results using Nouns & Words
Table 23 Results using Phrases & Words
Table 24 Results using Adjectives & Words
Table 25 Consolidated Results of WebKB
LIST OF FIGURES

Figure 8 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 values)
Figure 9 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 values)
Figure 10 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)
Figure 11 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)
Figure 12 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Micro F1 values)
Figure 13 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Macro F1 values)
CHAPTER 1
INTRODUCTION
1.1 Background & Motivation
With the emerging importance of knowledge management, research in areas such as document classification, information retrieval and information extraction plays a critical role in the success of knowledge management initiatives. Studies have shown that perceived output quality is an essential factor in the successful implementation and adoption of knowledge management technologies (Kankanhalli, et al., 2001). Large document archives such as electronic knowledge repositories offer a huge wealth of information, and methods from the fields of information retrieval and document classification are used to derive knowledge from them.

Coupled with access to the voluminous amounts of information available on the World Wide Web, this information explosion has brought about other problems. Users are often overwhelmed by the deluge of information and suffer from a decreased ability to assimilate it. Research has suggested that users feel bored or frustrated when they receive too much information (Roussinov and Chen, 1999), which can lead to a state where the individual is no longer able to effectively process the amount of information to which he is exposed, giving rise to lower decision quality in a given time frame. This problem is exacerbated by the proliferation of information in organizational electronic repositories and on the World Wide Web (Farhoomand and Drury, 2002).
Document classification has been applied to the categorization of search results and has been shown to alleviate the problem of information overload. The presentation of documents in categories works better than a list of results because it enables users to disambiguate the categories quickly and then focus on the relevant category (Dumais, Cutrell & Chen, 2001). This is also useful for distinguishing documents containing words with multiple meanings, or polysemy, a characteristic predominant in English words.

Experiments on supervised document classification techniques have predominantly used the bag-of-words technique, whereby the words of the documents are used as features. Alternate formulations of a meaning can also be introduced through linguistic variation, such as the syntax which determines the association of words. Although some studies have employed alternate features such as linguistic sources, these studies have only employed a subset of linguistic sources and learning algorithms (Lewis, 1992; Arampatzis et al., 2000; Kongovi et al., 2002). Thus, this study extends previous studies relating to document classification and aims to find ways to improve the classification accuracy of documents.
1.2 Aims and Objectives
Differences among previous empirical studies could be introduced by the tagging tools used, the learning algorithms, the parameters tuned for each learning algorithm, the feature selection methods employed and the datasets involved (Yang, 1999). Thus, it is difficult to offer a sound conclusion based on previous works. Since previous works documenting results based on linguistically motivated features with learning algorithms produced inconsistent and sometimes conflicting results, we propose to conduct a systematic study on multiple learning algorithms and linguistically motivated knowledge sources as features. Some of these features are novel combinations of linguistically motivated knowledge sources that were not explored in previous studies.

Through a systematic and controlled study, we can resolve some of these ambiguities and offer a sound conclusion. In our study, consistency in the dataset, learning algorithms, tagging tools and feature selection was maintained so that we can make a valid assessment of the effectiveness of linguistically motivated features.

The aim of this thesis is also to provide a solid foundation for research on feature representations in text classification and to study the factors affecting machine learning algorithms used in document classification. One of the factors we want to look into is the effect of feature engineering by utilizing linguistically motivated knowledge sources as features in our study.
Thus, the objectives of this thesis are:

1. To examine the approach of using linguistically motivated knowledge sources, based on concepts derived from natural language processing, as features with popular learning algorithms, systematically varying both the learning algorithms and the feature representations. We base our evaluation on the Reuters-21578 and WebKB corpora, benchmark corpora that have been widely used in previous research.

2. To examine the feasibility of applying novel combinations of linguistically motivated knowledge sources, and to explore the performance of these combinations as features on the accuracy of document classification.
1.3 Thesis Plan
This thesis is composed of the following chapters:
Chapter 1 provides the background, motivation and objectives of this research.

Chapter 2 provides a literature review of document classification research. Here we bring together literature from different fields (document classification, machine learning techniques and natural language processing) and give detailed coverage of the algorithms chosen for this study. In addition, this chapter also overviews the rudimentary knowledge required in later chapters.

Chapter 3 describes the types of linguistically motivated knowledge sources and the novel combinations used in our study.

Chapter 4 provides a description of the experiment setup. It also gives a brief overview of the performance measures used to evaluate the classifiers and the tools employed to conduct the study.

Chapter 5 provides an analysis of the results and suggests the implications for practice.

Chapter 6 concludes with the contributions, findings and limitations of our study. Suggestions for future research that can extend this work are also provided.
CHAPTER 2
LITERATURE REVIEW
Document classification has traditionally been carried out using the bag-of-words paradigm. Research on natural language processing has shown fruitful results that can be applied to document classification, introducing an avenue for improving accuracy through a different set of features. This chapter reviews the related literature underpinning this research. Section 2.1 gives an overview of document classification and the focus of previous research. Section 2.2 overviews common feature selection techniques used in previous studies. Section 2.3 introduces the machine learning algorithms adopted in our study. Section 2.4 presents the concepts of natural language processing employed to derive linguistically motivated knowledge sources.
2.1 Document Classification
Document classification, the focus of this work, refers to the task of assigning a document to categories based on its underlying content. Although this task was carried out effectively using knowledge engineering approaches in the 1980s, machine learning approaches have superseded knowledge engineering for document classification in recent years. While the knowledge engineering approaches produced effective classification accuracy, the machine learning approach offers many advantages, such as cost effectiveness, portability and accuracy competitive with that of human experts, while producing considerable efficiency (Sebastiani, 2002). Thus, supervised machine learning techniques are employed in this study.
A supervised approach involves three main phases: feature extraction, training and testing. The entire process of document classification using machine learning methods is illustrated in Figure 1.
Figure 1: Document Classification Process
There are two phases involved in learning-based document classification: the training phase and the testing phase. In the training phase, pre-labeled documents are collected. This set of pre-labeled documents is called the training corpus, training data or training set; these terms are used interchangeably throughout this thesis. Each document in the training corpus is transformed into a feature vector. These feature vectors are passed to a learning algorithm, which builds a model from the training set. The model is then used in the testing phase to label a set of documents that is new to the classifier, called the test corpus, test data or test set.
The classification problem can be formally represented as follows:

    fc(d) -> {true, false}, where d ∈ D

Given a set of training documents D and a set of categories C, the classification problem is defined as the function fc, for each category c ∈ C, that maps documents in the test set T to a boolean value, where 'true' indicates that the document is categorized under c and 'false' indicates that it is not, based on the characteristics of the training documents D.
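As a concrete, minimal illustration of this per-category binary framing, the following Python sketch builds one boolean decision function per category (the toy keyword classifiers and category names are purely illustrative, not the classifiers used in this study):

    # One binary decision function f_c per category; a document receives
    # every category whose function returns true.
    def make_keyword_classifier(keywords):
        """Toy f_c: 'true' if the document mentions any of the keywords."""
        def f_c(document):
            tokens = set(document.lower().split())
            return any(k in tokens for k in keywords)
        return f_c

    classifiers = {
        "earn": make_keyword_classifier({"profit", "dividend", "earnings"}),
        "crude": make_keyword_classifier({"oil", "barrel", "opec"}),
    }

    doc = "OPEC raised oil output while earnings fell"
    print({c for c, f_c in classifiers.items() if f_c(doc)})
    # prints both 'earn' and 'crude' for this toy document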
2.2 Feature Selection Methods
With a large number of documents and features, the document classification process usually involves a feature selection step. This is done to reduce the feature dimension. Feature selection methods retain only the important features derived from the original set of features. Traditional feature selection methods that have been commonly employed in previous studies include document frequency (Dumais et al., 1998; Yang and Pedersen, 1997), chi-square (Schutze et al., 1997; Yang and Pedersen, 1997), information gain (Lewis and Ringuette, 1994; Yang and Pedersen, 1997) and mutual information (Dumais et al., 1998; Larkey and Croft, 1996), among others.
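For instance, the chi-square measure mentioned above scores each term by the dependence between term occurrence and category membership in a 2x2 contingency table. A minimal sketch (the counts and their naming are illustrative):

    def chi_square(n11, n10, n01, n00):
        """Chi-square score of a term for a category.
        n11: category documents containing the term
        n10: non-category documents containing the term
        n01: category documents without the term
        n00: non-category documents without the term"""
        n = n11 + n10 + n01 + n00
        numerator = n * (n11 * n00 - n10 * n01) ** 2
        denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        return numerator / denominator if denominator else 0.0

    # A term occurring in most category documents and few others scores highly.
    print(chi_square(40, 5, 10, 45))  # ~49.49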
After the feature reduction step, many supervised techniques can then be employed in document classification. The following section presents a review of the techniques that were employed in this study.
2.3 Machine Learning Algorithms
This section reviews state-of-the-art learning algorithms used as text classifiers, giving background on the different methods and an analysis of the advantages and disadvantages of each learning method used in our empirical study. Past research in the field of automatic document classification has focused on improving document classification techniques through various learning algorithms (Yang and Liu, 1999), such as support vector machines (Joachims, 1998), and through various feature selection methods (Yang and Pedersen, 1997; Luigi et al., 2000). To make the study worthwhile, we used popular learning algorithms that have reported significant improvements in classification accuracy in previous studies. These included a wide variety of supervised learning algorithms: naïve Bayes, support vector machines, k-nearest neighbours, C4.5, RIPPER, AdaBoost with decision stumps and alternating decision trees.
2.3.1 Naïve Bayes (NB)
Bayesian classification has been a popular technique in recent years. The simplest Bayesian classifier is the widely used Naïve Bayes classifier, which assumes that features are independent. Despite this inaccurate assumption of feature independence, Naïve Bayes is surprisingly successful in practice and has proven effective in text classification, medical diagnosis and computer performance management, among other applications.

The Naïve Bayes classifier uses a probabilistic model of text to estimate the probability that a document d is in class y, Pr(y|d). This model assumes conditional independence of features, i.e. words are assumed to occur independently of the other words in the document given its class. Despite this assumption, Naïve Bayes has performed well.
Bayes' rule says that to achieve the highest classification accuracy, d should be assigned to the class y ∈ {-1, +1} for which Pr(y|d) is the highest:

    h(d) = argmax_{y ∈ {-1, +1}} Pr(y|d)    (1)

Expanding Pr(y|d) with Bayes' theorem, conditioned on the document length l', gives:

    Pr(y|d) = Pr(d|y, l') Pr(y|l') / Σ_{y' ∈ {-1, +1}} Pr(d|y', l') Pr(y'|l')    (2)

Pr(d|y, l') is the probability of observing document d in class y given its length l', and Pr(y|l') is the prior probability that a document of length l' is in class y. In the following we will assume that the category of a document does not depend on its length, so Pr(y|l') = Pr(y). An estimate of Pr(y) is the fraction of training documents in that class:

    Pr'(y) = |y| / |D|    (3)

where |y| denotes the number of training documents in class y ∈ {-1, +1} and |D| is the total number of documents.
Despite the unrealistic independence assumption, the Naïve Bayes classifier is remarkably successful in practice. Researchers have shown that the Naïve Bayes classifier is competitive with other learning algorithms such as decision tree and neural network algorithms. Experimental results for Naïve Bayes classifiers can be found in several studies (Lewis, 1992; Lewis and Ringuette, 1994; Lang, 1995; Pazzani, 1996; Joachims, 1998; McCallum & Nigam, 1998; Sahami, 1998). These studies have shown that Bayesian classifiers can produce reasonable results with high precision and recall values. Hence, we have chosen this learning algorithm for learning to classify text documents. A second reason this Bayesian method is important to our study of machine learning is that it provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
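To make the model concrete, here is a minimal multinomial Naïve Bayes sketch over bag-of-words counts with Laplace smoothing (a simplified stand-in for the length-conditioned model above; the data and names are illustrative):

    import math
    from collections import Counter

    def train_nb(docs, labels):
        """docs: list of token lists; labels: parallel list of classes."""
        classes = set(labels)
        priors = {y: labels.count(y) / len(labels) for y in classes}
        counts = {y: Counter() for y in classes}
        for tokens, y in zip(docs, labels):
            counts[y].update(tokens)
        vocab = {t for c in counts.values() for t in c}
        return priors, counts, vocab

    def predict_nb(tokens, priors, counts, vocab):
        """Assign the class y maximizing log Pr(y) + sum of log Pr(w|y)."""
        best, best_score = None, float("-inf")
        for y, prior in priors.items():
            total = sum(counts[y].values())
            score = math.log(prior)
            for t in tokens:
                # Laplace smoothing keeps unseen words from zeroing the product.
                score += math.log((counts[y][t] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = y, score
        return best

    docs = [["oil", "barrel"], ["profit", "dividend"], ["opec", "oil"]]
    labels = ["crude", "earn", "crude"]
    model = train_nb(docs, labels)
    print(predict_nb(["oil", "opec"], *model))  # crude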
2.3.2 Support Vector Machines (SVM)
Support vector machines were developed by Vapnik et al. (1995) based on the structural risk minimization principle from statistical learning theory. The idea of structural risk minimization is to find a hypothesis h from a space H that guarantees the lowest probability of error E(h) for a given training sample S consisting of n examples. Equation (5) gives the upper bound that connects the true error of a hypothesis h with the error E_train(h) of h on the training set and the complexity of h, reflecting the well-known trade-off between the complexity of the hypothesis space and the training error:
    E(h) ≤ E_train(h) + O( sqrt( (d ln(n/d) - ln η) / n ) )    (5)

where d denotes the VC-dimension of the hypothesis space H and the bound holds with probability at least 1 - η.
A simple hypothesis space will most likely not contain good approximation functions and will lead to high training and true error. On the other hand, a large hypothesis space will lead to a small training error, but the second term on the right-hand side of equation (5) will be large. This reflects the fact that for a hypothesis space with high VC-dimension, a hypothesis with low training error may result from overfitting. Thus it is crucial to find the right hypothesis space.
The simplest representation of a support vector machine, a linear SVM, is a hyperplane that separates a set of positive examples from a set of negative examples with maximum distance from the hyperplane to the nearest positive and negative examples. Figure 2 shows the graphical representation of a linear SVM.
Figure 2: Graphical Representation of a Linear SVM
Unlike conventional generative models, SVM does not involve unreasonable parametric or independence assumptions. The discriminative model focuses on those properties of the text classification task that are sufficient for good generalization performance, avoiding much of the complexity of natural language. This makes SVM suitable for achieving good classification performance despite the high-dimensional feature spaces in text classification. High redundancy, high discriminative power of term sets and discriminative features in the high-frequency range are sufficient conditions for good generalization. SVM is therefore chosen as one of the learning algorithms in this study.
We used Platt's (1999) sequential minimal optimization algorithm to train the linear SVM more efficiently. This algorithm decomposes the large quadratic programming problem into smaller sub-problems. Document classification using support vector machines can be done through either binary or multi-class classification; we have adopted the binary approach, which is discussed in a later chapter.
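For illustration, a minimal sketch of a linear SVM text classifier using scikit-learn, a modern library standing in for the SMO-based implementation used in this study (the documents and labels are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    docs = ["opec cut oil output", "quarterly profit rose", "crude oil prices fell"]
    labels = [1, 0, 1]  # 1 = 'crude', 0 = not, in the binary one-vs-rest setting

    vectorizer = TfidfVectorizer()        # bag-of-words features, tf-idf weighted
    X = vectorizer.fit_transform(docs)    # documents -> sparse feature vectors
    clf = LinearSVC().fit(X, labels)      # maximum-margin separating hyperplane
    print(clf.predict(vectorizer.transform(["oil output rose"])))  # expected: [1]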
2.3.3 Alternating Decision Tree (ADTree)
Although a variety of decision tree learning methods have been developed, with somewhat differing capabilities and requirements, we have chosen one of the more recent methods, the alternating decision tree (Freund & Mason, 1999), because it has often been applied to classification problems such as learning to classify text or documents.

The alternating decision tree learning algorithm is a new combination of decision trees with boosting that generates classification rules that are small and often easy to interpret. A general alternating tree defines a classification rule through a set of paths in the tree. As in standard decision trees, when a path reaches a decision node it continues with the child that corresponds to the outcome of the decision associated with that node. When a prediction node is reached, however, the path continues with all of the children of the node: the path splits into a set of paths, where each path corresponds to one of the children of the prediction node.

The difference between ADTree and conventional decision trees is that classification is based on traversing paths of the tree rather than on the single final leaf node reached.

Alternating decision trees have several key features. Firstly, compared to C5.0 with boosting, ADTree provides classifiers that are smaller and easier to interpret. In addition, ADTree gives a measure of confidence, called the classification margin, that can be used to improve accuracy at the cost of abstaining from predicting examples that are hard to classify, instead of always outputting a class. However, the disadvantage of ADTree is its susceptibility to overfitting on small data sets.
Figure 3: An Example of an ADTree (Freund & Mason, 1999)
2.3.4 C4.5
A decision tree text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. In this classification scheme, a text document d is categorized by recursively testing the weights that the terms labeling the internal nodes have in the vector of d, until a leaf node is reached; the label of this node is then assigned to d. Most of these classifiers use binary document representations, represented as a binary tree. There are a number of decision tree learners, and among the most popular is C4.5 (Cohen and Hirsh, 1998); thus we have chosen this learning method.
The most popular decision-tree algorithm, which has shown good results on a variety of problems, is the C4.5 algorithm (Quinlan, 1993). Previous works based on this technique are reported in Lewis and Ringuette (1994), Moulinier et al. (1996), Apte and Damerau (1994), Cohen (1995) and Cohen (1996). The underlying approach of C4.5 is to learn decision trees by constructing them top-down, from the root of the tree. Each instance feature is evaluated using a statistical test, such as information gain, to determine how well it alone classifies the training examples. Information gain is defined in terms of entropy from information theory. The entropy of a collection S is measured as follows:
    Entropy(S) = -p+ log2 p+ - p- log2 p-    (6)
where p+ is the proportion of positive instances in the collection S and p- is the proportion of negative instances. The best feature is selected and employed as the root node of the tree. For each possible value of this attribute, a descendant of the root node is created, and the training examples are sorted to the appropriate descendant node. C4.5 performs a greedy search for a suitable decision tree in which no backtracking is allowed.
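A minimal sketch of equation (6) and the greedy root-selection step it drives, assuming binary labels and binary term-occurrence features (all data illustrative):

    import math

    def entropy(examples):
        """Entropy(S) = -p+ log2 p+ - p- log2 p-, over (features, label) pairs."""
        pos = sum(1 for _, label in examples if label)
        probs = [pos / len(examples), 1 - pos / len(examples)]
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def best_feature(examples, features):
        """Pick the feature whose split yields the largest entropy reduction."""
        def gain(f):
            split = {}
            for x, label in examples:
                split.setdefault(x[f], []).append((x, label))
            remainder = sum(len(s) / len(examples) * entropy(s)
                            for s in split.values())
            return entropy(examples) - remainder
        return max(features, key=gain)

    # Toy data: does a document mention 'oil'? does it mention 'profit'?
    data = [({"oil": 1, "profit": 0}, True), ({"oil": 0, "profit": 1}, False),
            ({"oil": 1, "profit": 1}, True), ({"oil": 0, "profit": 0}, False)]
    print(best_feature(data, ["oil", "profit"]))  # oil (splits labels perfectly)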
2.3.5 RIPPER
This learning algorithm is a propositional rule learner, RIPPER (Repeated Incremental Pruning to Produce Error Reduction), proposed by Cohen (1995). The algorithm is characterized by a few major phases: grow, prune and optimize. RIPPER was developed from the repeated application of Furnkranz and Widmer's (1994) IREP algorithm, followed by two new global optimization procedures. Like other rule-based learners, RIPPER grows rules in a greedy fashion guided by information-theoretic heuristics.
Firstly, rules are grown by a greedy process which adds conditions to a rule until the rule is 100% accurate. The algorithm tries every possible value of each attribute and selects the condition with the highest information gain. The rules are then incrementally pruned. Finally, in the optimization stage, two variants of each rule in the pruned rule set are generated: one variant is grown from an empty rule, while the other is generated by greedily adding antecedents to the original rule. The smallest possible description length for each variant and for the original rule is computed, and the version with the minimal description length is selected as the final representative of the rule in the rule set. Rules that would increase the description length of the rule set if they were included are deleted, and the resultant rules are added to the rule set.
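As an illustration of the grow phase alone, the following sketch greedily adds conditions until the rule covers only positive examples; the pruning and optimization phases are omitted, and the precision-based condition scoring is a simplification of RIPPER's information-gain heuristic:

    def grow_rule(examples):
        """Greedily add (feature, value) conditions until only positives remain.
        examples: list of (feature_dict, label); grows a rule for label=True."""
        conditions = {}
        covered = examples
        while any(not label for _, label in covered):
            candidates = {(f, v) for x, _ in covered for f, v in x.items()
                          if f not in conditions}
            def precision(cond):
                f, v = cond
                subset = [(x, y) for x, y in covered if x.get(f) == v]
                return (sum(1 for _, y in subset if y) / len(subset)
                        if subset else 0.0)
            f, v = max(candidates, key=precision)
            conditions[f] = v
            covered = [(x, y) for x, y in covered if x.get(f) == v]
        return conditions

    data = [({"oil": 1, "opec": 1}, True), ({"oil": 1, "opec": 0}, False),
            ({"oil": 0, "opec": 0}, False)]
    print(grow_rule(data))  # {'opec': 1}: covers only the positive example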
RIPPER has already been applied to a number of standard problems in text classification with rather promising results (Cohen, 1995). Thus, it is chosen as one of the candidate learning algorithms in our empirical study.
2.3.6 Instance Based Learning – k-Nearest Neighbour
The basic idea behind the k-Nearest Neighbours (k-NN) classifier is the assumption that examples located close to each other, according to a user-defined similarity metric, are highly likely to belong to the same class. This algorithm can also be derived from Bayes' rule. The technique has shown good performance on text categorization in Yang & Liu (1999), Yang & Pedersen (1997) and Masand (1992). The algorithm assumes that all instances correspond to points in an n-dimensional space, and the nearest neighbours of an instance are defined in terms of the standard Euclidean distance.
An arbitrary instance x is described as a feature vector (a1(x), a2(x), ..., an(x)), where ai(x) denotes the value of the ith attribute of the instance x. The distance between two instances xi and xj is then defined as:

    d(xi, xj) = sqrt( Σ_{r=1..n} (ar(xi) - ar(xj))² )
The target function can be either discrete or real-valued. In our study, we will assume a discrete-valued target function; the k-NN training and classification procedures are shown in Figure 4.
Training algorithm:
    For each training example (x, f(x)), add the example to the list training_examples.

Classification algorithm:
    Given a query instance xq to be classified,
    let x1 ... xk denote the k instances from training_examples that are nearest to xq, and return

        f'(xq) = argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi))

    where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.

Figure 4: k-NN Algorithm (Mitchell, 1997)
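A minimal runnable counterpart to Figure 4, using Euclidean distance and a majority vote over the k nearest neighbours (the training data is illustrative):

    import math
    from collections import Counter

    def knn_classify(query, training_examples, k):
        """training_examples: list of (vector, label); returns the majority
        label among the k nearest neighbours of `query`."""
        neighbours = sorted(
            training_examples,
            key=lambda ex: math.dist(query, ex[0]))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    train = [((0.0, 0.1), "earn"), ((0.1, 0.0), "earn"), ((0.9, 1.0), "crude")]
    print(knn_classify((0.05, 0.05), train, k=3))  # earn (2 of 3 neighbours)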
The key advantage of instance-based learning is that instead of estimating the target function once for the entire instance space, it can estimate it locally and differently for each new instance to be classified. This method is a conceptually straightforward approach to approximating real-valued or discrete-valued target functions. In general, one disadvantage of instance-based approaches is that the cost of classifying new instances can be high, because nearly all computation takes place at classification time rather than when the training examples are first encountered. A second disadvantage is that they consider all attributes of the instances when attempting to retrieve similar training examples from memory; if the target concept depends on only a few of the many available attributes, then the instances that are truly most "similar" may be a large distance apart. However, as previous attempts to classify text with this approach have shown it to be effective (Yang, 1999), we have decided to include it in our experiment.
2.4 Natural Language Processing (NLP) in Document Classification
Most information retrieval (IR) systems are not linguistically motivated. Similarly, in document classification research, most experiments are not linguistically motivated (Cullingford, 1986). Closely related to the research on document classification is the research on natural language processing and cognitive science. Traditionally, document classification techniques have been directed primarily at classification accuracy and hold little regard for linguistic phenomena. Much of the current document classification systems are built upon techniques that represent text as a collection of terms such as words. This has been done successfully using quantitative methods based on word or character counts. However, it has been emphasized that vector space models cannot capture critical semantic aspects of document content. In this case, the representation is only superficially related to content, since language is more than simply a collection of words. Thus, natural language processing is a key technology for building the information retrieval systems of the future (Strzalkowski, 1999).
In order to study the effects of linguistically motivated knowledge sources on document classification, it is imperative to learn about grammar through natural language processing, so as to apply concepts from cognitive science to document classification techniques. Natural language processing research attempts to enhance the ability of the computer to analyze, understand and generate the languages in use. This is performed through some type of computational or conceptual analysis to form meaningful structure or semantics from a document. The inherently ambiguous nature of natural language makes this even more difficult, and a variety of research disciplines are involved in the successful development of NLP systems. The mapping of words into meaningful representations is driven by morphological, syntactic, semantic and contextual cues available in words (Cullingford, 1986). With the advancement of NLP techniques, we hope to incorporate linguistic cues into document classification. This can be done by using NLP techniques to extract different representations of the documents, which are then used in the classification process.

As defined by Medin (2000), other concepts identified are verbs, count nouns, mass nouns, and isolated and interrelated concepts. We define such concepts as linguistically motivated knowledge sources. They can be used to derive more complex linguistically motivated features in the process of classification. It appears that the centrality of using linguistic knowledge sources as features in the process of classification can serve as an important step towards a good classification scheme. For example, besides the individual words, the relationships between words within a sentence and a document, together with the context of what is already known of the world, help to deliver the actual meaning of a text.
Research has focused on using nouns in the process of categorization, modeling the process of categorization in the real world (Chen et al., 1992; Lewis, 1992; Arampatzis et al., 2000; Basili, 2001). However, the significant differences among these results have led us to examine these features with alternate representations.

2.5 Conclusion

The bag-of-words paradigm has been the dominant feature representation in supervised classification studies. This could be due to the results of early attempts (Lewis, 1992), which were negative. With the advent of NLP techniques, there is a compelling reason to examine the use of linguistically motivated knowledge sources. Although there have been separate attempts to study the effects of linguistically motivated knowledge sources on supervised document classification techniques, it is difficult to generalize a conclusion from these separate attempts because of the variations introduced across studies; in some cases, conflicting results were also reported. Thus, there is a need to fill the gap with a systematic study that covers an extensive variety of linguistically motivated knowledge sources.
CHAPTER 3

LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES

By bringing together the fields of document classification and natural language processing, we hope to shed light on the effects of linguistically motivated knowledge sources with different learning algorithms. Section 3.1 discusses the shortcomings of previous research. Section 3.2 explores the linguistically motivated knowledge sources employed to resolve these issues. Finally, Section 3.3 presents the technique used to derive the features.
3.1 Considerations
Much research in the area of document classification has focused mainly on developing techniques or on improving the accuracy of such techniques. While the underlying algorithm is an essential factor for classification accuracy, the way in which texts are represented is also an important factor that should be examined. However, attempts to produce text representations that improve effectiveness have shown inconsistent results.
The classic work of Lewis (1992) showed the low effectiveness of syntactic phrase indexing in terms of its characteristics as a text representation, but recent works by Kongovi (2002) and Basili (2001) have shown improvements using the same representation. Table 1 shows the conclusions drawn by some related works. For example, noun phrases seem to behave differently with different learning algorithms. The results differ due to the inconsistencies introduced across these studies through the various datasets, taggers, learning algorithms, parameters of the learning algorithms and feature selection methods used.
Feature         Learning Algorithm       Corpus          Study
Noun Phrase     Statistical clustering   Reuters-22173   Lewis (1992)
Noun Phrase     RIPPER                   Reuters-21578   Scott & Matwin (1999)
Noun Phrase     Clustering               Reuters-21578   Kongovi (2002)
Noun Phrase     SOM                      CANCERLIT       Tolle & Chen (2000)
Nouns           Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)
Proper Nouns    Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)
Tags            Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)

Worse Performance than Words / Better Performance than Words

Table 1: Summary of Previous Studies
To address the issues discussed in the previous section and the limitations of previous work, a systematic study of the effects of linguistically motivated knowledge sources with various machine learning approaches for automatic document classification is necessary. In contrast to previous work, this research conducts a comparative study and analysis of learning methods, among which are some of the most effective and popular techniques available, and reports on the accuracies of linguistically motivated knowledge sources and novel combinations of them, using a systematic methodology to resolve the issues we have identified in previous work.
Additionally, we try to see if we can break away from the traditional bag-of-words paradigm. Bag-of-words refers to representing a document using its words, the smallest meaningful units of a document with little ambiguity. Word-based representations have been the most common representation used in previous works related to document classification, and they are the basis for most work in text classification. The obvious advantage of words lies in their simplicity and the straightforward process of obtaining the representation. However, the problem with bag-of-words is that the logical structure, layout and sequence of words are usually ignored.

A basic observation about using bag-of-words representations for classification is that a great deal of information from the original document, associated with its logical structure and sequence, is discarded. The major limitation is the implicit assumption that the order of words in a sentence is not relevant: paragraph, sentence and word orderings are disrupted, and syntactic structures are ignored. However, this assumption may not always hold, as words alone do not always represent true atomic units of meaning. For example, the phrase "learning algorithm" could be interpreted in another manner when broken up into the two separate words "learning" and "algorithm". Thus, we utilize linguistically motivated knowledge sources as features, to see if we can resolve these limitations associated with the bag-of-words paradigm. Novel combinations of linguistically motivated knowledge sources are also proposed and presented in the next section.
3.2 Linguistically Motivated Knowledge Sources
Machine learning methods require each example in the corpus to be described by a vector of fixed dimensionality, where each component of the vector represents the value of one feature of the example. As a linguistic knowledge source may provide contextual cues about a document that are useful as a feature representation for distinguishing the category of the document, we are interested in studying whether the choice of different feature representations, using different linguistic knowledge sources as the input vectors to the learning algorithm, has a significant impact on document classification. We consider the following linguistic knowledge sources in our research:

1. Word: this will be used as the baseline for a comparative analysis with the other linguistically motivated knowledge sources;
The description of the above features and an analysis of the advantages and disadvantages of each feature representation are discussed in the following subsections.
3.2.2 Phrase
Phrases have been found to be useful indexing units in previous research. Kongovi, Guzman & Dasigi (2002) showed that phrases were salient features when used with category profiles. We consider one class of phrases, syntactic phrases: any set of words that satisfies certain syntactic relations or constitutes a specified syntactic structure.
Phrase here refers to the noun phrases identified by our parser. The data set is first parsed into the appropriate format before the phrases are extracted and segmented. A noun phrase is defined as a sequence of words that terminates with a noun; more specifically,

    NP = {A, N}* N

where NP stands for noun phrase, A for adjective and N for noun. For example, in the sentence "The limping old man walks across the long bridge", the noun phrases identified are "limping old man" and "long bridge". In our work, we do not attempt to separate a noun phrase into its component noun phrases.
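A minimal sketch of extracting such NP = {A, N}*N chunks from part-of-speech-tagged text (the simplified one-letter tag set and the tagged input are assumed to come from an upstream tagger):

    def extract_noun_phrases(tagged):
        """tagged: list of (word, tag) pairs with 'A' = adjective, 'N' = noun.
        Returns maximal {A, N}*N runs, i.e. adjective/noun sequences
        that end with a noun."""
        phrases, run = [], []
        for word, tag in tagged + [("", "")]:      # sentinel flushes the last run
            if tag in ("A", "N"):
                run.append((word, tag))
            else:
                while run and run[-1][1] != "N":   # trim trailing adjectives
                    run.pop()
                if len(run) > 1:                   # keep multi-word phrases only
                    phrases.append(" ".join(w for w, _ in run))
                run = []
        return phrases

    sent = [("The", "D"), ("limping", "A"), ("old", "A"), ("man", "N"),
            ("walks", "V"), ("across", "P"), ("the", "D"),
            ("long", "A"), ("bridge", "N")]
    print(extract_noun_phrases(sent))  # ['limping old man', 'long bridge']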
The advantage of phrases is that the assumption that word ordering is irrelevant is not imposed: the logical structure, layout and sequence of words are retained, keeping some information from the original document. On the other hand, the major limitation is the greater degree of complexity in processing and extracting phrases as features.
Although phrase-based representation has been used in information retrieval, conclusions from studies reporting the retrieval effectiveness of linguistic phrase-based representations have been inconsistent. Linguistic phrase identification was noted as improving retrieval effectiveness by Fagan (1987), but Mitra et al. (1997) reported little benefit in using phrase-based representations, and Smeaton (1999) reported that the benefit of phrase-based representation varied with users. Lewis (1992) undertook a major study of the use of noun phrases for statistical classification and found that phrase representation did not produce any improvement on the Reuters-22173 corpus.
As we are using a different corpus in our work, we decided to continue with the use of phrase-based representations in our experiment, as they have not been studied before with some of the learning algorithms that we have chosen.
3.2.3 Word Sense (Part of Speech Tagging)
Word sense refers to the incorporation of part-of-speech tags with the word so that the exact word sense within a document is identified. The part of speech of a word provides a syntactic source for the word, such as adjective, adverb, determiner, noun, verb, preposition, pronoun or conjunction. As this feature incorporates both the tag and the word, it provides the word class or lexical tag to the classifier.
The intuition for using word sense is to capture additional information that helps to distinguish homographs, which can be differentiated based on the syntactic role of the word. Homographs refer to words with more than one meaning. For example, the word "patient" has different meanings when used in different syntactic roles, such as noun or adjective: when used as a noun, a patient refers to an individual who is not feeling well or is sick, but when used as an adjective, it refers to the character of a person as being tolerant.
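A minimal sketch of deriving such features by fusing each word with its part-of-speech tag (tagged input assumed, with illustrative one-letter tags):

    def word_sense_features(tagged):
        """Fuse word and part-of-speech tag into a single feature token,
        so 'patient_N' and 'patient_A' become distinct features."""
        return [f"{word.lower()}_{tag}" for word, tag in tagged]

    print(word_sense_features([("patient", "N"), ("patient", "A")]))
    # ['patient_N', 'patient_A']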
3.2.4 Nouns
Gentner (1981) explored the differences between nouns and verbs and suggested that nouns differ from verbs in the relational density of their representations: the semantic components of noun meanings are more strongly interconnected than those of verbs and other parts of speech. Hence, the meanings of nouns seem less mutable than the meanings of verbs. Nouns have been a common candidate for distinguishing among different concepts; they are often called "substantive words" in the field of computational linguistics and "content words" in information science.
3.2.5 Verbs
Verbs are associated with motions involving relations between objects (Kersten, 1998). From an information-seeking perspective, verbs do not appear to contribute to classification accuracy. In order to validate this hypothesis, verbs are included as one of the linguistically motivated knowledge sources examined in our study.
3.2.6 Adjectives
Bruce and Wiebe's (1999) work established a positive correlation between the presence of adjectives and subjectivity: the presence of one or more adjectives is useful for predicting that a sentence is subjective. Subjectivity tagging refers to distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. There are numerous applications for which subjectivity tagging is relevant, including information retrieval and information extraction, and the task is essential to forums and news reporting. For a complete study of the use of linguistically motivated knowledge sources, we have included adjectives as one source of linguistic knowledge in our experiment.
3.2.7 Combination of Sources
Each linguistic knowledge source generates a feature vector from the context of the document. However, we also examine the combination of two linguistic knowledge sources, which is a novel technique. When sources are combined, the features generated from each knowledge source are concatenated, with each source contributing half of the total number of features, and a dataset with all these features is generated. Here we combine words with the linguistically motivated knowledge sources nouns, noun phrases and adjectives as novel combinations, to see if there are any improvements. Given that the original features are retained while some syntactic structure is captured by this model, there appears to be an advantage in using a combination of techniques.
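A minimal sketch of this concatenation, with equal-sized count-vector blocks per source (the vocabularies are illustrative):

    def vectorize(tokens, vocabulary):
        """Count-vector of `tokens` over a fixed, ordered vocabulary."""
        return [tokens.count(term) for term in vocabulary]

    def combined_features(doc_words, doc_nouns, word_vocab, noun_vocab):
        """Concatenate word features and noun features into one vector,
        each source contributing its own block of components."""
        return vectorize(doc_words, word_vocab) + vectorize(doc_nouns, noun_vocab)

    word_vocab = ["oil", "profit", "rose"]
    noun_vocab = ["oil", "profit", "price"]
    print(combined_features(["oil", "rose"], ["oil"], word_vocab, noun_vocab))
    # [1, 0, 1, 1, 0, 0] -> word block [1, 0, 1] followed by noun block [1, 0, 0]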
3.3 Obtaining Linguistically Motivated Classifiers
Our technique has the following steps (Figure 5). The input to the technique is a document, D. Below is an outline of the generic process proposed and employed to use the linguistically motivated knowledge sources as features; a code sketch of the pipeline is given after the list:
1. First, the document is broken up into sentences.
2. Morphological descriptions, or tags, are assigned to each term. This NLP component performs linguistic processing on the contents and attaches a tag to every term.
3. The processed terms are parsed.
4. Linguistically motivated knowledge sources are then extracted based on the tagging requirements discussed earlier.
5. Features are combined in the binder phase if combinations of features are required.
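A skeletal, runnable sketch of the pipeline; every helper below is a toy stand-in for the actual sentence splitter, tagger, parser and extractors used in the study:

    # Hypothetical stand-ins for the real NLP tools.
    def split_sentences(doc):
        return [s.strip() for s in doc.split(".") if s.strip()]

    def pos_tag(sentence):
        return [(w, "N") for w in sentence.split()]   # toy: tag everything 'N'

    def parse(tagged):
        return tagged                                 # toy: identity parse

    def extract_nouns(parsed_sentences):
        return [w for sent in parsed_sentences for w, tag in sent if tag == "N"]

    def build_features(document, extractors):
        """Steps 1-5: split sentences, tag, parse, extract per source, bind."""
        sentences = split_sentences(document)               # step 1
        tagged = [pos_tag(s) for s in sentences]            # step 2
        parsed = [parse(t) for t in tagged]                 # step 3
        per_source = [ex(parsed) for ex in extractors]      # step 4
        return [f for feats in per_source for f in feats]   # step 5 (binder)

    print(build_features("Opec cut output. Prices rose.", [extract_nouns]))
    # ['Opec', 'cut', 'output', 'Prices', 'rose'] under the toy tagger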