INCORPORATING LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES INTO DOCUMENT CLASSIFICATION
GOH JIE MEIN
NATIONAL UNIVERSITY OF SINGAPORE
2004
INCORPORATING LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES INTO DOCUMENT CLASSIFICATION

GOH JIE MEIN
BSc (Hons 1, NUS)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF INFORMATION SYSTEMS
NATIONAL UNIVERSITY OF SINGAPORE
2004
ACKNOWLEDGEMENTS

My deep appreciation also goes out to all my friends, peers and colleagues who have helped me in one way or another:

I wish to thank Klarissa Chang, Koh Wei Chern, Cheo Puay Ling and Wong Foong Yin for their listening ears and uplifting remarks.

Colleagues and friends such as Michelle Gwee, Wang Xinwei, Liu Xiao, Koh Chung Haur, Li Yan, Li Huixian, Santosa Paulus, Wan Wen, Bryan Low, Chua Teng Chwan, Tok Wee Hyong, Indriyati Atmosukarto, Colin Tan, Julian Lin and the pioneering batch of Schemers, for keeping life pleasurable in the office.

I would also like to express my sincere thanks to A/P Chan Hock Chuan and A/P Stanislaw Jarzabek for evaluating my research.

And to all professors, teaching staff, administrative staff, friends and students.

Last but not least, I would like to thank my family, especially my parents, sisters and Melvin Lye, for their relentless moral support, motivation, advice, love and understanding.
TABLE OF CONTENTS

1.1 Background & Motivation
1.2 Aims and Objectives
2.2 Feature Selection Methods
2.3 Machine Learning Algorithms
2.3.6 Instance Based Learning – k-Nearest Neighbour
2.4 Natural Language Processing (NLP) in Document Classification
4.3.2 Handling Multiple Categories Training Examples
4.4 Evaluation Measures
5.2 Contribution of Different Linguistically Motivated Knowledge Sources to Classification of Reuters-21578 Corpus
5.2.3 Word Sense (Part of Speech Tags)
5.3 Contribution of Linguistically Motivated Knowledge Sources to Classification Accuracy of WebKB Corpus
SUMMARY

This thesis describes an empirical study of the effects of linguistically motivated knowledge sources with different learning algorithms. Using up to nine different linguistically motivated knowledge sources and six different learning algorithms, we examined classification accuracy with each combination of knowledge source and learning algorithm on two benchmark corpora: Reuters-21578 and WebKB. The use of one linguistically motivated knowledge source, nouns, outperformed the traditional bag-of-words classifiers on Reuters-21578; the best results for this corpus were obtained using nouns with support vector machines. On the other hand, experiments with WebKB showed that classifiers built using the novel combinations of linguistically motivated knowledge sources were as competitive as those built using the conventional bag-of-words technique.
LIST OF TABLES

Table 1 Summary of Related Studies
Table 2 Distribution of Categories in Reuters-21578
Table 3 Top 20 Categories in ModApte Split
Table 4 Breakdown of Documents in WebKB Categories
Table 11 Results using Adjectives
Table 12 Results using both Linguistically Motivated Knowledge Sources and Words
Table 13 Contribution of Knowledge Sources on Reuters-21578 (F1 Measures)
Table 14 Contribution of Knowledge Sources on Reuters-21578 (Precision)
Table 15 Contribution of Knowledge Sources on Reuters-21578 (Recall)
Table 21 Results using Adjectives
Table 22 Results using Nouns & Words
Table 23 Results using Phrases & Words
Table 24 Results using Adjectives & Words
Table 25 Consolidated Results of WebKB
LIST OF FIGURES

Figure 8 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Micro F1 values)
Figure 9 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Macro F1 values)
Figure 10 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Precision)
Figure 11 Comparison of Different Linguistically Motivated Knowledge Sources on Reuters-21578 (Recall)
Figure 12 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Micro F1 values)
Figure 13 Comparison of Different Linguistically Motivated Knowledge Sources on WebKB (Macro F1 values)
CHAPTER 1
INTRODUCTION
1.1 Background & Motivation
With the emerging importance of knowledge management, research in areas such as document classification, information retrieval and information extraction plays a critical role in the success of knowledge management initiatives. Studies have shown that perceived output quality is an essential factor in the successful implementation and adoption of knowledge management technologies (Kankanhalli, et al., 2001). Large document archives such as electronic knowledge repositories offer a huge wealth of information, and methods from the fields of information retrieval and document classification are used to derive knowledge from them.

Coupled with access to the voluminous amounts of information available on the World Wide Web, this information explosion has brought about other problems. Users are often overwhelmed by the deluge of information and suffer from a decreased ability to assimilate it. Research has suggested that users feel bored or frustrated when they receive too much information (Roussinov and Chen, 1999), which can lead to a state where the individual is no longer able to effectively process the amount of information to which he is exposed, giving rise to lower decision quality in a given time frame. This problem is exacerbated by the proliferation of information in organizational electronic repositories and on the World Wide Web (Farhoomand and Drury, 2002).
Document classification has been applied to the categorization of search results and has been shown to alleviate the problem of information overload. The presentation of documents in categories works better than a list of results because it enables users to disambiguate the categories quickly and then focus on the relevant category (Dumais, Cutrell & Chen, 2001). This is also useful for distinguishing documents containing words with multiple meanings, or polysemy, a characteristic predominant in English words.

Experiments on supervised document classification techniques have predominantly used the bag-of-words technique, whereby the words of the documents are used as features. Alternate formulations of a meaning can also be introduced through linguistic variation, such as the syntax which determines the association of words. Although some studies have employed alternate features such as linguistic sources, these studies have only employed a subset of linguistic sources and learning algorithms (Lewis, 1992; Arampatzis et al., 2000; Kongovi et al., 2002). Thus, this study extends previous studies relating to document classification and aims to find ways to improve the classification accuracy of documents.
1.2 Aims and Objectives
Differences among previous empirical studies could be introduced by the tagging tools used, the learning algorithms, the parameters tuned for each learning algorithm, the feature selection methods employed and the datasets involved (Yang, 1999). Thus, it is difficult to offer a sound conclusion based on previous works. Since previous works documenting results based on linguistically motivated features with learning algorithms produced inconsistent and sometimes conflicting results, we propose to conduct a systematic study on multiple learning algorithms and linguistically motivated knowledge sources as features. Some of these features are novel combinations of linguistically motivated knowledge sources that were not explored in previous studies.

Through a systematic and controlled study, we can resolve some of these ambiguities and offer a sound conclusion. In our study, consistency in the dataset, learning algorithms, tagging tools and feature selection was maintained so that we can make a valid assessment of the effectiveness of linguistically motivated features.

The aim of this thesis is also to provide a solid foundation for research on feature representations in text classification and to study the factors affecting machine learning algorithms used in document classification. One of the factors we want to look into is the effect of feature engineering by utilizing linguistically motivated knowledge sources as features in our study.
Thus, the objectives of this thesis are:

1. To examine the approach of using linguistically motivated knowledge sources, based on concepts derived from natural language processing, as features with popular learning algorithms, systematically varying both the learning algorithms and the feature representations. We base our evaluation on the Reuters-21578 and WebKB corpora, benchmark corpora that have been widely used in previous research.

2. To examine the feasibility of applying novel combinations of linguistically motivated knowledge sources, and to explore the performance of these combinations as features on the accuracy of document classification.
1.3 Thesis Plan
This thesis is composed of the following chapters:
Chapter 1 provides the background, motivation and objectives of this research.

Chapter 2 provides a literature review of document classification research. Here we bring together literature from different fields (document classification, machine learning techniques and natural language processing) and give detailed coverage of the algorithms chosen for this study. In addition, this chapter also overviews the rudimentary knowledge required in later chapters.

Chapter 3 describes the types of linguistically motivated knowledge sources and the novel combinations used in our study.

Chapter 4 provides a description of the experiment setup. It also gives a brief overview of the performance measures used to evaluate the classifiers and the tools employed to conduct the study.

Chapter 5 provides an analysis of the results and suggests the implications for practice.

Chapter 6 concludes with the contributions, findings and limitations of our study. Suggestions for future research that can extend this work are also provided.
CHAPTER 2
LITERATURE REVIEW
Document classification has traditionally been carried out using the bag-of-words paradigm. Research on natural language processing has shown fruitful results that can be applied to document classification, introducing an avenue for improving accuracy through a different set of features. This chapter reviews the related literature underpinning this research. Section 2.1 gives an overview of document classification and the focus of previous research. Section 2.2 overviews common feature selection techniques used in previous studies. Section 2.3 introduces the machine learning algorithms adopted in our study. Section 2.4 presents the concepts of natural language processing employed to derive linguistically motivated knowledge sources.
2.1 Document Classification
Document classification, the focus of this work, refers to the task of assigning a document to categories based on its underlying content. Although this task was carried out effectively using knowledge engineering approaches in the 1980s, machine learning approaches have superseded knowledge engineering for document classification in recent years. While the knowledge engineering approaches produced effective classification accuracy, the machine learning approach offers many advantages, such as cost effectiveness, portability and accuracy competitive with that of human experts, while producing considerable efficiency (Sebastiani, 2002). Thus, supervised machine learning techniques are employed in this study.
A supervised approach involves three main phases: feature extraction, training and testing. The entire process of document classification using machine learning methods is illustrated in Figure 1.
Figure 1: Document Classification Process
There are two phases involved in learning-based document classification: the training phase and the testing phase. In the training phase, pre-labeled documents are collected. This set of pre-labeled documents is called the training corpus, training data or training set; these terms are used interchangeably throughout this thesis. Each document in the training corpus is transformed into a feature vector. These feature vectors are passed to a learning algorithm, which builds a model from the training set. The model is then used in the testing phase to label a set of documents that is new to the classifier, called the test corpus, test data or test set.
The classification problem can be formally represented as follows:

    fc(d) -> {true, false}, where d ∈ D

Given a set of training documents D and a set of categories C, the classification problem is defined as the function fc, for each category c ∈ C, that maps documents in the test set T to a boolean value, where 'true' indicates that the document is categorized under c and 'false' indicates that it is not, based on the characteristics of the training documents D.
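As a concrete, minimal illustration of this per-category binary framing, the following Python sketch builds one boolean decision function per category (the toy keyword classifiers and category names are purely illustrative, not the classifiers used in this study):

    # One binary decision function f_c per category; a document receives
    # every category whose function returns true.
    def make_keyword_classifier(keywords):
        """Toy f_c: 'true' if the document mentions any of the keywords."""
        def f_c(document):
            tokens = set(document.lower().split())
            return any(k in tokens for k in keywords)
        return f_c

    classifiers = {
        "earn": make_keyword_classifier({"profit", "dividend", "earnings"}),
        "crude": make_keyword_classifier({"oil", "barrel", "opec"}),
    }

    doc = "OPEC raised oil output while earnings fell"
    print({c for c, f_c in classifiers.items() if f_c(doc)})
    # prints both 'earn' and 'crude' for this toy document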
2.2 Feature Selection Methods
With a large number of documents and features, the document classification process usually involves a feature selection step. This is done to reduce the feature dimension. Feature selection methods retain only the important features derived from the original set of features. Traditional feature selection methods that have been commonly employed in previous studies include document frequency (Dumais et al., 1998; Yang and Pedersen, 1997), chi-square (Schutze et al., 1997; Yang and Pedersen, 1997), information gain (Lewis and Ringuette, 1994; Yang and Pedersen, 1997) and mutual information (Dumais et al., 1998; Larkey and Croft, 1996), among others.
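For instance, the chi-square measure mentioned above scores each term by the dependence between term occurrence and category membership in a 2x2 contingency table. A minimal sketch (the counts and their naming are illustrative):

    def chi_square(n11, n10, n01, n00):
        """Chi-square score of a term for a category.
        n11: category documents containing the term
        n10: non-category documents containing the term
        n01: category documents without the term
        n00: non-category documents without the term"""
        n = n11 + n10 + n01 + n00
        numerator = n * (n11 * n00 - n10 * n01) ** 2
        denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
        return numerator / denominator if denominator else 0.0

    # A term occurring in most category documents and few others scores highly.
    print(chi_square(40, 5, 10, 45))  # ~49.49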
After the feature reduction step, many supervised techniques can then be employed in document classification. The following section presents a review of the techniques that were employed in this study.
2.3 Machine Learning Algorithms
This section reviews state-of-the-art learning algorithms used as text classifiers, giving background on the different methods and an analysis of the advantages and disadvantages of each learning method used in our empirical study. Past research in the field of automatic document classification has focused on improving document classification techniques through various learning algorithms (Yang and Liu, 1999), such as support vector machines (Joachims, 1998), and through various feature selection methods (Yang and Pedersen, 1997; Luigi et al., 2000). To make the study worthwhile, we used popular learning algorithms that have reported significant improvements in classification accuracy in previous studies. These included a wide variety of supervised learning algorithms: naïve Bayes, support vector machines, k-nearest neighbours, C4.5, RIPPER, AdaBoost with decision stumps and alternating decision trees.
2.3.1 Naïve Bayes (NB)
Bayesian classification has been a popular technique in recent years. The simplest Bayesian classifier is the widely used Naïve Bayes classifier, which assumes that features are independent. Despite this inaccurate assumption of feature independence, Naïve Bayes is surprisingly successful in practice and has proven effective in text classification, medical diagnosis and computer performance management, among other applications.

The Naïve Bayes classifier uses a probabilistic model of text to estimate the probability that a document d is in class y, Pr(y|d). This model assumes conditional independence of features, i.e. words are assumed to occur independently of the other words in the document given its class. Despite this assumption, Naïve Bayes has performed well.
Bayes' rule says that to achieve the highest classification accuracy, d should be assigned to the class y ∈ {-1, +1} for which Pr(y|d) is the highest:

    h(d) = argmax_{y ∈ {-1, +1}} Pr(y|d)    (1)

Expanding Pr(y|d) with Bayes' theorem, conditioned on the document length l', gives:

    Pr(y|d) = Pr(d|y, l') Pr(y|l') / Σ_{y' ∈ {-1, +1}} Pr(d|y', l') Pr(y'|l')    (2)

Pr(d|y, l') is the probability of observing document d in class y given its length l', and Pr(y|l') is the prior probability that a document of length l' is in class y. In the following we will assume that the category of a document does not depend on its length, so Pr(y|l') = Pr(y). An estimate of Pr(y) is the fraction of training documents in that class:

    Pr'(y) = |y| / |D|    (3)

where |y| denotes the number of training documents in class y ∈ {-1, +1} and |D| is the total number of documents.
Despite the unrealistic independence assumption, the Naïve Bayes classifier is remarkably successful in practice. Researchers have shown that the Naïve Bayes classifier is competitive with other learning algorithms such as decision tree and neural network algorithms. Experimental results for Naïve Bayes classifiers can be found in several studies (Lewis, 1992; Lewis and Ringuette, 1994; Lang, 1995; Pazzani, 1996; Joachims, 1998; McCallum & Nigam, 1998; Sahami, 1998). These studies have shown that Bayesian classifiers can produce reasonable results with high precision and recall values. Hence, we have chosen this learning algorithm for learning to classify text documents. A second reason this Bayesian method is important to our study of machine learning is that it provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
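To make the model concrete, here is a minimal multinomial Naïve Bayes sketch over bag-of-words counts with Laplace smoothing (a simplified stand-in for the length-conditioned model above; the data and names are illustrative):

    import math
    from collections import Counter

    def train_nb(docs, labels):
        """docs: list of token lists; labels: parallel list of classes."""
        classes = set(labels)
        priors = {y: labels.count(y) / len(labels) for y in classes}
        counts = {y: Counter() for y in classes}
        for tokens, y in zip(docs, labels):
            counts[y].update(tokens)
        vocab = {t for c in counts.values() for t in c}
        return priors, counts, vocab

    def predict_nb(tokens, priors, counts, vocab):
        """Assign the class y maximizing log Pr(y) + sum of log Pr(w|y)."""
        best, best_score = None, float("-inf")
        for y, prior in priors.items():
            total = sum(counts[y].values())
            score = math.log(prior)
            for t in tokens:
                # Laplace smoothing keeps unseen words from zeroing the product.
                score += math.log((counts[y][t] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = y, score
        return best

    docs = [["oil", "barrel"], ["profit", "dividend"], ["opec", "oil"]]
    labels = ["crude", "earn", "crude"]
    model = train_nb(docs, labels)
    print(predict_nb(["oil", "opec"], *model))  # crude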
2.3.2 Support Vector Machines (SVM)
Support vector machines were developed by Vapnik et al. (1995) based on the structural risk minimization principle from statistical learning theory. The idea of structural risk minimization is to find a hypothesis h from a space H that guarantees the lowest probability of error E(h) for a given training sample S consisting of n examples. Equation (5) gives the upper bound that connects the true error of a hypothesis h with the error E_train(h) of h on the training set and the complexity of h, reflecting the well-known trade-off between the complexity of the hypothesis space and the training error:
    E(h) ≤ E_train(h) + O( sqrt( (d ln(n/d) - ln η) / n ) )    (5)

where d denotes the VC-dimension of the hypothesis space H and the bound holds with probability at least 1 - η.
A simple hypothesis space will most likely not contain good approximation functions and will lead to high training and true error. On the other hand, a large hypothesis space will lead to a small training error, but the second term on the right-hand side of equation (5) will be large. This reflects the fact that for a hypothesis space with high VC-dimension, a hypothesis with low training error may result from overfitting. Thus it is crucial to find the right hypothesis space.
The simplest representation of a support vector machine, a linear SVM, is a hyperplane that separates a set of positive examples from a set of negative examples with maximum distance from the hyperplane to the nearest positive and negative examples. Figure 2 shows the graphical representation of a linear SVM.
Figure 2: Graphical Representation of a Linear SVM
Unlike conventional generative models, SVM does not involve unreasonable parametric or independence assumptions. The discriminative model focuses on those properties of the text classification task that are sufficient for good generalization performance, avoiding much of the complexity of natural language. This makes SVM suitable for achieving good classification performance despite the high-dimensional feature spaces in text classification. High redundancy, high discriminative power of term sets and discriminative features in the high-frequency range are sufficient conditions for good generalization. SVM is therefore chosen as one of the learning algorithms in this study.
We used Platt's (1999) sequential minimal optimization algorithm to train the linear SVM more efficiently. This algorithm decomposes the large quadratic programming problem into smaller sub-problems. Document classification using support vector machines can be done through either binary or multi-class classification; we have adopted the binary approach, which is discussed in a later chapter.
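For illustration, a minimal sketch of a linear SVM text classifier using scikit-learn, a modern library standing in for the SMO-based implementation used in this study (the documents and labels are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    docs = ["opec cut oil output", "quarterly profit rose", "crude oil prices fell"]
    labels = [1, 0, 1]  # 1 = 'crude', 0 = not, in the binary one-vs-rest setting

    vectorizer = TfidfVectorizer()        # bag-of-words features, tf-idf weighted
    X = vectorizer.fit_transform(docs)    # documents -> sparse feature vectors
    clf = LinearSVC().fit(X, labels)      # maximum-margin separating hyperplane
    print(clf.predict(vectorizer.transform(["oil output rose"])))  # expected: [1]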
2.3.3 Alternating Decision Tree (ADTree)
Although a variety of decision tree learning methods have been developed, with somewhat differing capabilities and requirements, we have chosen one of the more recent methods, the alternating decision tree (Freund & Mason, 1999), because it has often been applied to classification problems such as learning to classify text or documents.

The alternating decision tree learning algorithm is a new combination of decision trees with boosting that generates classification rules that are small and often easy to interpret. A general alternating tree defines a classification rule through a set of paths in the tree. As in standard decision trees, when a path reaches a decision node it continues with the child that corresponds to the outcome of the decision associated with that node. When a prediction node is reached, however, the path continues with all of the children of the node: the path splits into a set of paths, where each path corresponds to one of the children of the prediction node.

The difference between ADTree and conventional decision trees is that classification is based on traversing paths of the tree rather than on the single final leaf node reached.

Alternating decision trees have several key features. Firstly, compared to C5.0 with boosting, ADTree provides classifiers that are smaller and easier to interpret. In addition, ADTree gives a measure of confidence, called the classification margin, that can be used to improve accuracy at the cost of abstaining from predicting examples that are hard to classify, instead of always outputting a class. However, the disadvantage of ADTree is its susceptibility to overfitting on small data sets.
Figure 3: An Example of an ADTree (Freund & Mason, 1999)
2.3.4 C4.5
A decision tree text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories. In this classification scheme, a text document d is categorized by recursively testing the weights that the terms labeling the internal nodes have in the vector of d, until a leaf node is reached; the label of this node is then assigned to d. Most of these classifiers use binary document representations, represented as a binary tree. There are a number of decision tree learners, and among the most popular is C4.5 (Cohen and Hirsh, 1998); thus we have chosen this learning method.
The most popular decision-tree algorithm, which has shown good results on a variety of problems, is the C4.5 algorithm (Quinlan, 1993). Previous works based on this technique are reported in Lewis and Ringuette (1994), Moulinier et al. (1996), Apte and Damerau (1994), Cohen (1995) and Cohen (1996). The underlying approach of C4.5 is to learn decision trees by constructing them top-down, from the root of the tree. Each instance feature is evaluated using a statistical test, such as information gain, to determine how well it alone classifies the training examples. Information gain is defined in terms of entropy from information theory. The entropy of a collection S is measured as follows:
    Entropy(S) = -p+ log2 p+ - p- log2 p-    (6)
where p+ is the proportion of positive instances in the collection S and p- is the proportion of negative instances. The best feature is selected and employed as the root node of the tree. For each possible value of this attribute, a descendant of the root node is created, and the training examples are sorted to the appropriate descendant node. C4.5 performs a greedy search for a suitable decision tree in which no backtracking is allowed.
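A minimal sketch of equation (6) and the greedy root-selection step it drives, assuming binary labels and binary term-occurrence features (all data illustrative):

    import math

    def entropy(examples):
        """Entropy(S) = -p+ log2 p+ - p- log2 p-, over (features, label) pairs."""
        pos = sum(1 for _, label in examples if label)
        probs = [pos / len(examples), 1 - pos / len(examples)]
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def best_feature(examples, features):
        """Pick the feature whose split yields the largest entropy reduction."""
        def gain(f):
            split = {}
            for x, label in examples:
                split.setdefault(x[f], []).append((x, label))
            remainder = sum(len(s) / len(examples) * entropy(s)
                            for s in split.values())
            return entropy(examples) - remainder
        return max(features, key=gain)

    # Toy data: does a document mention 'oil'? does it mention 'profit'?
    data = [({"oil": 1, "profit": 0}, True), ({"oil": 0, "profit": 1}, False),
            ({"oil": 1, "profit": 1}, True), ({"oil": 0, "profit": 0}, False)]
    print(best_feature(data, ["oil", "profit"]))  # oil (splits labels perfectly)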
2.3.5 RIPPER
This learning algorithm is a propositional rule learner, RIPPER (Repeated Incremental Pruning to Produce Error Reduction), proposed by Cohen (1995). The algorithm is characterized by a few major phases: grow, prune and optimize. RIPPER was developed from the repeated application of Furnkranz and Widmer's (1994) IREP algorithm, followed by two new global optimization procedures. Like other rule-based learners, RIPPER grows rules in a greedy fashion guided by information-theoretic heuristics.
Firstly, rules are grown by a greedy process which adds conditions to a rule until the rule is 100% accurate. The algorithm tries every possible value of each attribute and selects the condition with the highest information gain. The rules are then incrementally pruned. Finally, in the optimization stage, two variants of each rule in the pruned rule set are generated: one variant is grown from an empty rule, while the other is generated by greedily adding antecedents to the original rule. The smallest possible description length for each variant and for the original rule is computed, and the version with the minimal description length is selected as the final representative of the rule in the rule set. Rules that would increase the description length of the rule set if they were included are deleted, and the resultant rules are added to the rule set.
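As an illustration of the grow phase alone, the following sketch greedily adds conditions until the rule covers only positive examples; the pruning and optimization phases are omitted, and the precision-based condition scoring is a simplification of RIPPER's information-gain heuristic:

    def grow_rule(examples):
        """Greedily add (feature, value) conditions until only positives remain.
        examples: list of (feature_dict, label); grows a rule for label=True."""
        conditions = {}
        covered = examples
        while any(not label for _, label in covered):
            candidates = {(f, v) for x, _ in covered for f, v in x.items()
                          if f not in conditions}
            def precision(cond):
                f, v = cond
                subset = [(x, y) for x, y in covered if x.get(f) == v]
                return (sum(1 for _, y in subset if y) / len(subset)
                        if subset else 0.0)
            f, v = max(candidates, key=precision)
            conditions[f] = v
            covered = [(x, y) for x, y in covered if x.get(f) == v]
        return conditions

    data = [({"oil": 1, "opec": 1}, True), ({"oil": 1, "opec": 0}, False),
            ({"oil": 0, "opec": 0}, False)]
    print(grow_rule(data))  # {'opec': 1}: covers only the positive example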
RIPPER has already been applied to a number of standard problems in text classification with rather promising results (Cohen, 1995). Thus, it is chosen as one of the candidate learning algorithms in our empirical study.
2.3.6 Instance Based Learning – k-Nearest Neighbour
The basic idea behind the k-Nearest Neighbours (k-NN) classifier is the assumption that examples located close to each other, according to a user-defined similarity metric, are highly likely to belong to the same class. This algorithm can also be derived from Bayes' rule. The technique has shown good performance on text categorization in Yang & Liu (1999), Yang & Pedersen (1997) and Masand (1992). The algorithm assumes that all instances correspond to points in an n-dimensional space, and the nearest neighbours of an instance are defined in terms of the standard Euclidean distance.
An arbitrary instance x is described as a feature vector (a1(x), a2(x), ..., an(x)), where ai(x) denotes the value of the ith attribute of the instance x. The distance between two instances xi and xj is then defined as:

    d(xi, xj) = sqrt( Σ_{r=1..n} (ar(xi) - ar(xj))² )
The target function can be either discrete or real-valued. In our study, we will assume a discrete-valued target function; the k-NN training and classification procedures are shown in Figure 4.
Training algorithm:
    For each training example (x, f(x)), add the example to the list training_examples.

Classification algorithm:
    Given a query instance xq to be classified,
    let x1 ... xk denote the k instances from training_examples that are nearest to xq, and return

        f'(xq) = argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi))

    where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.

Figure 4: k-NN Algorithm (Mitchell, 1997)
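A minimal runnable counterpart to Figure 4, using Euclidean distance and a majority vote over the k nearest neighbours (the training data is illustrative):

    import math
    from collections import Counter

    def knn_classify(query, training_examples, k):
        """training_examples: list of (vector, label); returns the majority
        label among the k nearest neighbours of `query`."""
        neighbours = sorted(
            training_examples,
            key=lambda ex: math.dist(query, ex[0]))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    train = [((0.0, 0.1), "earn"), ((0.1, 0.0), "earn"), ((0.9, 1.0), "crude")]
    print(knn_classify((0.05, 0.05), train, k=3))  # earn (2 of 3 neighbours)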
The key advantage of instance-based learning is that instead of estimating the target function once for the entire instance space, it can estimate it locally and differently for each new instance to be classified. This method is a conceptually straightforward approach to approximating real-valued or discrete-valued target functions. In general, one disadvantage of instance-based approaches is that the cost of classifying new instances can be high, because nearly all computation takes place at classification time rather than when the training examples are first encountered. A second disadvantage is that they consider all attributes of the instances when attempting to retrieve similar training examples from memory; if the target concept depends on only a few of the many available attributes, then the instances that are truly most "similar" may be a large distance apart. However, as previous attempts to classify text with this approach have shown it to be effective (Yang, 1999), we have decided to include it in our experiment.
2.4 Natural Language Processing (NLP) in Document Classification
Most information retrieval (IR) systems are not linguistically motivated. Similarly, in document classification research, most experiments are not linguistically motivated (Cullingford, 1986). Closely related to the research on document classification is the research on natural language processing and cognitive science. Traditionally, document classification techniques have been directed primarily at classification accuracy and hold little regard for linguistic phenomena. Much of the current document classification systems are built upon techniques that represent text as a collection of terms such as words. This has been done successfully using quantitative methods based on word or character counts. However, it has been emphasized that vector space models cannot capture critical semantic aspects of document content. In this case, the representation is only superficially related to content, since language is more than simply a collection of words. Thus, natural language processing is a key technology for building the information retrieval systems of the future (Strzalkowski, 1999).
In order to study the effects of linguistically motivated knowledge sources on document classification, it is imperative to learn about grammar through natural language processing, so as to apply concepts from cognitive science to document classification techniques. Natural language processing research attempts to enhance the ability of the computer to analyze, understand and generate the languages in use. This is performed through some type of computational or conceptual analysis to form meaningful structure or semantics from a document. The inherently ambiguous nature of natural language makes this even more difficult, and a variety of research disciplines are involved in the successful development of NLP systems. The mapping of words into meaningful representations is driven by morphological, syntactic, semantic and contextual cues available in words (Cullingford, 1986). With the advancement of NLP techniques, we hope to incorporate linguistic cues into document classification. This can be done by using NLP techniques to extract different representations of the documents, which are then used in the classification process.

As defined by Medin (2000), other concepts identified are verbs, count nouns, mass nouns, and isolated and interrelated concepts. We define such concepts as linguistically motivated knowledge sources. They can be used to derive more complex linguistically motivated features in the process of classification. It appears that the centrality of using linguistic knowledge sources as features in the process of classification can serve as an important step towards a good classification scheme. For example, besides the individual words, the relationships between words within a sentence and a document, together with the context of what is already known of the world, help to deliver the actual meaning of a text.
Research has focused on using nouns in the process of categorization, modeling the process of categorization in the real world (Chen et al., 1992; Lewis, 1992; Arampatzis et al., 2000; Basili, 2001). However, the significant differences among these results have led us to examine these features with alternate representations.

2.5 Conclusion

The bag-of-words paradigm has been the dominant feature representation in supervised classification studies. This could be due to the results of early attempts (Lewis, 1992), which were negative. With the advent of NLP techniques, there is a compelling reason to examine the use of linguistically motivated knowledge sources. Although there have been separate attempts to study the effects of linguistically motivated knowledge sources on supervised document classification techniques, it is difficult to generalize a conclusion from these separate attempts because of the variations introduced across studies; in some cases, conflicting results were also reported. Thus, there is a need to fill the gap with a systematic study that covers an extensive variety of linguistically motivated knowledge sources.
CHAPTER 3

LINGUISTICALLY MOTIVATED KNOWLEDGE SOURCES

By bringing together the fields of document classification and natural language processing, we hope to shed light on the effects of linguistically motivated knowledge sources with different learning algorithms. Section 3.1 discusses the shortcomings of previous research. Section 3.2 explores the linguistically motivated knowledge sources employed to resolve these issues. Finally, Section 3.3 presents the technique used to derive the features.
3.1 Considerations
Much research in the area of document classification has focused mainly on developing techniques or on improving the accuracy of such techniques. While the underlying algorithm is an essential factor for classification accuracy, the way in which texts are represented is also an important factor that should be examined. However, attempts to produce text representations that improve effectiveness have shown inconsistent results.
The classic work of Lewis (1992) showed the low effectiveness of syntactic phrase indexing in terms of its characteristics as a text representation, but recent works by Kongovi (2002) and Basili (2001) have shown improvements using the same representation. Table 1 shows the conclusions drawn by some related works. For example, noun phrases seem to behave differently with different learning algorithms. The results differ due to the inconsistencies introduced across these studies through the various datasets, taggers, learning algorithms, parameters of the learning algorithms and feature selection methods used.
Feature         Learning Algorithm       Corpus          Study
Noun Phrase     Statistical clustering   Reuters-22173   Lewis (1992)
Noun Phrase     RIPPER                   Reuters-21578   Scott & Matwin (1999)
Noun Phrase     Clustering               Reuters-21578   Kongovi (2002)
Noun Phrase     SOM                      CANCERLIT       Tolle & Chen (2000)
Nouns           Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)
Proper Nouns    Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)
Tags            Rocchio                  Reuters-21578   Basili, Moschitti & Pazienza (2001)

Worse Performance than Words / Better Performance than Words

Table 1: Summary of Previous Studies
To address the issues discussed in the previous section and the limitations of previous work, a systematic study of the effects of linguistically motivated knowledge sources with various machine learning approaches for automatic document classification is necessary. In contrast to previous work, this research conducts a comparative study and analysis of learning methods, among which are some of the most effective and popular techniques available, and reports on the accuracies of linguistically motivated knowledge sources and novel combinations of them, using a systematic methodology to resolve the issues we have identified in previous work.
Additionally, we try to see if we can break away from the traditional bag-of-words paradigm. Bag-of-words refers to representing a document using its words, the smallest meaningful units of a document with little ambiguity. Word-based representations have been the most common representation used in previous works related to document classification, and they are the basis for most work in text classification. The obvious advantage of words lies in their simplicity and the straightforward process of obtaining the representation. However, the problem with bag-of-words is that the logical structure, layout and sequence of words are usually ignored.

A basic observation about using bag-of-words representations for classification is that a great deal of information from the original document, associated with its logical structure and sequence, is discarded. The major limitation is the implicit assumption that the order of words in a sentence is not relevant: paragraph, sentence and word orderings are disrupted, and syntactic structures are ignored. However, this assumption may not always hold, as words alone do not always represent true atomic units of meaning. For example, the phrase "learning algorithm" could be interpreted in another manner when broken up into the two separate words "learning" and "algorithm". Thus, we utilize linguistically motivated knowledge sources as features, to see if we can resolve these limitations associated with the bag-of-words paradigm. Novel combinations of linguistically motivated knowledge sources are also proposed and presented in the next section.
3.2 Linguistically Motivated Knowledge Sources
Machine learning methods require each example in the corpus to be described by a vector of fixed dimensionality, where each component of the vector represents the value of one feature of the example. As a linguistic knowledge source may provide contextual cues about a document that are useful as a feature representation for distinguishing the category of the document, we are interested in studying whether the choice of different feature representations, using different linguistic knowledge sources as the input vectors to the learning algorithm, has a significant impact on document classification. We consider the following linguistic knowledge sources in our research:

1. Word: this will be used as the baseline for a comparative analysis with the other linguistically motivated knowledge sources;
The description of the above features and an analysis of the advantages and disadvantages of each feature representation are discussed in the following subsections.
3.2.2 Phrase
Phrases have been found to be useful indexing units in previous research. Kongovi, Guzman & Dasigi (2002) showed that phrases were salient features when used with category profiles. We consider one class of phrases, syntactic phrases: any set of words that satisfies certain syntactic relations or constitutes a specified syntactic structure.
Phrase here refers to the noun phrases identified by our parser. The data set is first parsed into the appropriate format before the phrases are extracted and segmented. A noun phrase is defined as a sequence of words that terminates with a noun; more specifically,

    NP = {A, N}* N

where NP stands for noun phrase, A for adjective and N for noun. For example, in the sentence "The limping old man walks across the long bridge", the noun phrases identified are "limping old man" and "long bridge". In our work, we do not attempt to separate a noun phrase into its component noun phrases.
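A minimal sketch of extracting such NP = {A, N}*N chunks from part-of-speech-tagged text (the simplified one-letter tag set and the tagged input are assumed to come from an upstream tagger):

    def extract_noun_phrases(tagged):
        """tagged: list of (word, tag) pairs with 'A' = adjective, 'N' = noun.
        Returns maximal {A, N}*N runs, i.e. adjective/noun sequences
        that end with a noun."""
        phrases, run = [], []
        for word, tag in tagged + [("", "")]:      # sentinel flushes the last run
            if tag in ("A", "N"):
                run.append((word, tag))
            else:
                while run and run[-1][1] != "N":   # trim trailing adjectives
                    run.pop()
                if len(run) > 1:                   # keep multi-word phrases only
                    phrases.append(" ".join(w for w, _ in run))
                run = []
        return phrases

    sent = [("The", "D"), ("limping", "A"), ("old", "A"), ("man", "N"),
            ("walks", "V"), ("across", "P"), ("the", "D"),
            ("long", "A"), ("bridge", "N")]
    print(extract_noun_phrases(sent))  # ['limping old man', 'long bridge']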
The advantage of phrases is that the assumption that word ordering is irrelevant is not imposed: the logical structure, layout and sequence of words are retained, keeping some information from the original document. On the other hand, the major limitation is the greater degree of complexity in processing and extracting phrases as features.
Although phrase-based representation has been used in information retrieval, conclusions from studies reporting the retrieval effectiveness of linguistic phrase-based representations have been inconsistent. Linguistic phrase identification was noted as improving retrieval effectiveness by Fagan (1987), but Mitra et al. (1997) reported little benefit in using phrase-based representations, and Smeaton (1999) reported that the benefit of phrase-based representation varied with users. Lewis (1992) undertook a major study of the use of noun phrases for statistical classification and found that phrase representation did not produce any improvement on the Reuters-22173 corpus.
As we are using a different corpus in our work, we decided to continue with the use of phrase-based representations in our experiment, as they have not been studied before with some of the learning algorithms that we have chosen.
3.2.3 Word Sense (Part of Speech Tagging)
Word sense refers to the incorporation of part-of-speech tags with the word so that the exact word sense within a document is identified. The part of speech of a word provides a syntactic source for the word, such as adjective, adverb, determiner, noun, verb, preposition, pronoun or conjunction. As this feature incorporates both the tag and the word, it provides the word class or lexical tag to the classifier.
The intuition for using word sense is to capture additional information that helps to distinguish homographs, which can be differentiated based on the syntactic role of the word. Homographs refer to words with more than one meaning. For example, the word "patient" has different meanings when used in different syntactic roles, such as noun or adjective: when used as a noun, a patient refers to an individual who is not feeling well or is sick, but when used as an adjective, it refers to the character of a person as being tolerant.
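A minimal sketch of deriving such features by fusing each word with its part-of-speech tag (tagged input assumed, with illustrative one-letter tags):

    def word_sense_features(tagged):
        """Fuse word and part-of-speech tag into a single feature token,
        so 'patient_N' and 'patient_A' become distinct features."""
        return [f"{word.lower()}_{tag}" for word, tag in tagged]

    print(word_sense_features([("patient", "N"), ("patient", "A")]))
    # ['patient_N', 'patient_A']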
3.2.4 Nouns
Gentner (1981) explored the differences between nouns and verbs and suggested that nouns differ from verbs in the relational density of their representations: the semantic components of noun meanings are more strongly interconnected than those of verbs and other parts of speech. Hence, the meanings of nouns seem less mutable than the meanings of verbs. Nouns have been a common candidate for distinguishing among different concepts; they are often called "substantive words" in the field of computational linguistics and "content words" in information science.
3.2.5 Verbs
Verbs are associated with motions involving relations between objects (Kersten, 1998). From an information-seeking perspective, verbs do not appear to contribute to classification accuracy. In order to validate this hypothesis, verbs are included as one of the linguistically motivated knowledge sources examined in our study.
3.2.6 Adjectives
Bruce and Wiebe's (1999) work established a positive correlation between the presence of adjectives and subjectivity: the presence of one or more adjectives is useful for predicting that a sentence is subjective. Subjectivity tagging refers to distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. There are numerous applications for which subjectivity tagging is relevant, including information retrieval and information extraction, and the task is essential to forums and news reporting. For a complete study of the use of linguistically motivated knowledge sources, we have included adjectives as one source of linguistic knowledge in our experiment.
3.2.7 Combination of Sources
Each linguistic knowledge source generates a feature vector from the context of the document. However, we also examine the combination of two linguistic knowledge sources, which is a novel technique. When sources are combined, the features generated from each knowledge source are concatenated, with each source contributing half of the total number of features, and a dataset with all these features is generated. Here we combine words with the linguistically motivated knowledge sources nouns, noun phrases and adjectives as novel combinations, to see if there are any improvements. Given that the original features are retained while some syntactic structure is captured by this model, there appears to be an advantage in using a combination of techniques.
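A minimal sketch of this concatenation, with equal-sized count-vector blocks per source (the vocabularies are illustrative):

    def vectorize(tokens, vocabulary):
        """Count-vector of `tokens` over a fixed, ordered vocabulary."""
        return [tokens.count(term) for term in vocabulary]

    def combined_features(doc_words, doc_nouns, word_vocab, noun_vocab):
        """Concatenate word features and noun features into one vector,
        each source contributing its own block of components."""
        return vectorize(doc_words, word_vocab) + vectorize(doc_nouns, noun_vocab)

    word_vocab = ["oil", "profit", "rose"]
    noun_vocab = ["oil", "profit", "price"]
    print(combined_features(["oil", "rose"], ["oil"], word_vocab, noun_vocab))
    # [1, 0, 1, 1, 0, 0] -> word block [1, 0, 1] followed by noun block [1, 0, 0]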
3.3 Obtaining Linguistically Motivated Classifiers
Our technique has the following steps (Figure 5). The input to the technique is a document, D. Below is an outline of the generic process proposed and employed to use the linguistically motivated knowledge sources as features; a code sketch of the pipeline is given after the list:
1. First, the document is broken up into sentences.
2. Morphological descriptions, or tags, are assigned to each term. This NLP component performs linguistic processing on the contents and attaches a tag to every term.
3. The processed terms are parsed.
4. Linguistically motivated knowledge sources are then extracted based on the tagging requirements discussed earlier.
5. Features are combined in the binder phase if combinations of features are required.
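A skeletal, runnable sketch of the pipeline; every helper below is a toy stand-in for the actual sentence splitter, tagger, parser and extractors used in the study:

    # Hypothetical stand-ins for the real NLP tools.
    def split_sentences(doc):
        return [s.strip() for s in doc.split(".") if s.strip()]

    def pos_tag(sentence):
        return [(w, "N") for w in sentence.split()]   # toy: tag everything 'N'

    def parse(tagged):
        return tagged                                 # toy: identity parse

    def extract_nouns(parsed_sentences):
        return [w for sent in parsed_sentences for w, tag in sent if tag == "N"]

    def build_features(document, extractors):
        """Steps 1-5: split sentences, tag, parse, extract per source, bind."""
        sentences = split_sentences(document)               # step 1
        tagged = [pos_tag(s) for s in sentences]            # step 2
        parsed = [parse(t) for t in tagged]                 # step 3
        per_source = [ex(parsed) for ex in extractors]      # step 4
        return [f for feats in per_source for f in feats]   # step 5 (binder)

    print(build_features("Opec cut output. Prices rose.", [extract_nouns]))
    # ['Opec', 'cut', 'output', 'Prices', 'rose'] under the toy tagger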