COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS
MASTER THESIS
HANOI - 2008
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS
Major: Information Technology
Specialization: Information Systems
Code: 60 48 05
MASTER THESIS
SUPERVISOR: Prof. Dr. Ha Quang Thuy
HANOI - 2008
Acknowledgements
My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has been an endless source of inspiration in scientific research and who led me to this research area. I particularly appreciate his unconditional support and advice, in both the academic environment and daily life, during the last four years.
Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments; this work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.
My thanks also go to all members of the "data mining" seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for helping me a great deal in collecting data and running experiments.
I gratefully acknowledge the invaluable support and advice, both technical and personal, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.
I also want to acknowledge the support of the Project QC.06.07 "Vietnamese Named Entity Resolution and Tracking crossover Web Documents", Vietnam National University, Hanoi; the Project 203906 "Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages" of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ-ĐTCT-KC.01/06-10 "Developing content filter systems to support management and implementation public security – ensure policy".
Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are truly an endless source of encouragement in my life.
Nguyen Cam Tu
Assurance
I certify that the achievements in this thesis are my own and are not copied from anyone else's results. Throughout the dissertation, everything presented is either my own proposal or is summarized from cited sources. All references have clear origins and are properly quoted. I am responsible for this statement.
Hanoi, November 15, 2007
Nguyen Cam Tu
Table of Contents

Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
  1.1 Introduction
  1.2 The Early Methods
    1.2.1 Latent Semantic Analysis
    1.2.2 Probabilistic Latent Semantic Analysis
  1.3 Latent Dirichlet Allocation
    1.3.1 Generative Model in LDA
    1.3.2 Likelihood
    1.3.3 Parameter Estimation and Inference via Gibbs Sampling
    1.3.4 Applications
  1.4 Summary
Chapter 2. Frameworks of Learning with Hidden Topics
  2.1 Learning with External Resources: Related Works
  2.2 General Learning Frameworks
    2.2.1 Frameworks for Learning with Hidden Topics
    2.2.2 Large-Scale Web Collections as Universal Dataset
  2.3 Advantages of the Frameworks
  2.4 Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
  3.1 Some Characteristics of Vietnamese
    3.1.1 Sound
    3.1.2 Syllable Structure
    3.1.3 Vietnamese Word
  3.2 Preprocessing and Transformation
    3.2.1 Sentence Segmentation
    3.2.2 Sentence Tokenization
    3.2.3 Word Segmentation
    3.2.4 Filters
    3.2.5 Remove Non Topic-Oriented Words
  3.3 Topic Analysis for VnExpress Dataset
  3.4 Topic Analysis for Vietnamese Wikipedia Dataset
  3.5 Discussion
  3.6 Summary
Chapter 4. Deployments of General Frameworks
  4.1 Classification with Hidden Topics
    4.1.1 Classification Method
    4.1.2 Experiments
  4.2 Clustering with Hidden Topics
    4.2.1 Clustering Method
    4.2.2 Experiments
  4.3 Summary
Conclusion
  Achievements throughout the thesis
  Future Works
References
  Vietnamese References
  English References
Appendix: Some Clustering Results
List of Figures

Figure 1.1 Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [55]
Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [55]
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for Latent Dirichlet Allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of Latent Dirichlet Allocation
Figure 1.6 Gibbs sampling algorithm for Latent Dirichlet Allocation
Figure 2.1 Classification with Hidden Topics
Figure 2.2 Clustering with Hidden Topics
Figure 3.1 Pipeline of Data Preprocessing and Transformation
Figure 4.1 Classification with VnExpress topics
Figure 4.2 Combination of one snippet with its topics: an example
Figure 4.3 Learning with different topic models of the VnExpress dataset, and the baseline (without topics)
Figure 4.4 Test-out-of-train with increasing numbers of training examples; the number of topics is set at 60
Figure 4.5 F1-measure for each class and the average over all classes in learning with 60 topics
Figure 4.6 Clustering with Hidden Topics
Figure 4.7 Dendrogram in Agglomerative Hierarchical Clustering
Figure 4.8 Precision of the top 5 (and 10, 20) in the best clusters for each query
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query
List of Tables

Table 3.1 Vowels in Vietnamese
Table 3.2 Tones in Vietnamese
Table 3.3 Consonants of the Hanoi variety
Table 3.4 Structure of Vietnamese syllables
Table 3.5 Functional words in Vietnamese
Table 3.6 Statistics of topics assigned by humans in the VnExpress dataset
Table 3.7 Statistics of the VnExpress dataset
Table 3.8 Most likely words for sample topics; here, we conduct topic analysis with 100 topics
Table 3.9 Statistics of the Vietnamese Wikipedia dataset
Table 3.10 Most likely words for sample topics; here, we conduct topic analysis with 200 topics
Table 4.1 Google search results as training and testing dataset; the search phrases for training and test data are designed to be exclusive
Table 4.2 Experimental results of the baseline (learning without topics)
Table 4.3 Experimental results of learning with 60 topics of the VnExpress dataset
Table 4.4 Some collocations with the highest values of the chi-square statistic
Table 4.5 Queries submitted to Google
Table 4.6 Parameters for clustering web search results
Notations & Abbreviations

Word or phrase: Abbreviation
Information Retrieval: IR
Latent Semantic Analysis: LSA
Probabilistic Latent Semantic Analysis: PLSA
Latent Dirichlet Allocation: LDA
Dynamic Topic Models: DTM
Correlated Topic Models: CTM
Singular Value Decomposition: SVD
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well-structured forms, a situation which causes great challenges to those seeking information on the Web. Consequently, many tasks enabling users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page ranking, web clustering, and text classification. To this end, there have been many success stories, such as Google, Yahoo, the Open Directory Project (Dmoz), and Clusty, to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack the data needed to search or learn effectively. Take web search clustering as an example: in order to meet the real-time condition, that is, a sufficiently short response time, most online clustering systems only work with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images based only on their captions. Because image captions are very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support or semi-supervised learning, in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another, even in the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis, based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a given domain, we collect a very large external data collection called the "universal dataset", and then build the learner on both the original data (such as snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we have analyzed a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining: sparse data such as snippets returned from a search engine can be expanded and enriched with hidden topics, so that better performance can be achieved. Moreover, because the method learns with smaller data (the meaningful hidden topics rather than all the unlabeled data), it requires less computational resources than semi-supervised learning.
Roadmap: The organization of this thesis is as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text, and can be used to develop more sophisticated and application-oriented models such as hierarchical models, author-role models, entity models, and so on. They are also key components of our proposals in the subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key phase shared between the two frameworks is topic analysis of large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of the subsequent stages.
Chapter 3 summarizes some major issues in analyzing collections of Vietnamese documents and web pages. We first review some characteristics of Vietnamese which are significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming the data in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also present the results of topic analysis using LDA on the clean, preprocessed dataset.
Chapter 4 describes the deployment of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic models estimated from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements of the previous four chapters. Some future research topics are also mentioned in this section.
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1 Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model (Salton and McGill, 1983) [48], a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (word). The i-th coordinate of a vector is some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates of each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
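To make the vector space representation concrete, the following minimal Python sketch builds a term-by-document count matrix for a tiny toy corpus. The corpus, the raw-count weighting, and all variable names are illustrative assumptions of this sketch; real systems typically use a weighting such as tf-idf.

```python
from collections import Counter

# A toy corpus: each document is a list of tokens (assumed already tokenized).
docs = [
    "hidden topic models for web search".split(),
    "clustering web search snippets".split(),
    "vietnamese word segmentation".split(),
]

# Vocabulary: one axis (row) per distinct term.
vocab = sorted({t for d in docs for t in d})
index = {t: i for i, t in enumerate(vocab)}

# Term-by-document matrix X: X[i][j] = number of times term i occurs in document j.
X = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for term, count in Counter(d).items():
        X[index[term]][j] = count

for term, row in zip(vocab, X):
    print(f"{term:15s} {row}")
```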
While the vector space model has some appealing features, notably its basic identification of sets of words that are discriminative for documents in the collection, the approach provides only a relatively small reduction in description length and reveals little of the inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model and the topic-based vector space model, among which latent semantic analysis (LSA; Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term-weight features that captures most of the variance in the collection. This approach can achieve considerable compression of large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy and polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora in order to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSA approach; we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI; Hofmann, 1999) [21][22] is a significant step in this regard. pLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a "reduced description" associated with the document.
While Hofmann's work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems: the number of parameters grows linearly with the number of documents. Additionally, although pLSI is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA is based on the "bag of words" assumption, namely that the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable: the specific ordering of the documents in a corpus can also be neglected. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time-series data analysis thanks to the non-exchangeable nature of its document modeling. CTM, on the other hand, is capable of revealing topic correlations; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Though CTM gives a better fit to the data than LDA, it is complicated by the fact that it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss the issues behind these modeling methods in more detail, with particular attention to LDA, a well-known model that has shown its efficiency and success in many applications.
1.2 The Early Methods
1.2.1 Latent Semantic Analysis
The main challenge for machine learning systems is to capture the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what was intended" or "what was referred to" in a text or utterance. This problem is twofold: (i) polysemy, i.e., a word may have multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar sense and, at least in certain contexts, denote the same concept, or the same topic in a weaker sense.
Latent semantic analysis (LSA; Deerwester et al., 1990) [13][24][26] is a well-known technique which partially addresses this problem. The key idea is to map documents from vectors in the word space to a lower-dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.
a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix in which element $(i, j)$ describes the occurrences of term $i$ in document $j$. Let $X$ be such a matrix; it has the form

$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$$

A row of this matrix is a vector corresponding to a term, giving its relation to each document: $\vec{t}_i^{\,T} = [x_{i,1} \cdots x_{i,n}]$. Likewise, a column is a vector corresponding to a document, giving its relation to each term: $\vec{d}_j = [x_{1,j} \cdots x_{m,j}]^T$.

The dot product $\vec{t}_i^{\,T}\vec{t}_p$ between two term vectors gives their correlation over the documents. The matrix product $XX^T$ contains all such dot products: element $(i, p)$ (which equals element $(p, i)$ due to symmetry) contains the dot product $\vec{t}_i^{\,T}\vec{t}_p$. Similarly, the matrix $X^TX$ contains the dot products between all document vectors, giving their correlation over the terms: $\vec{d}_j^{\,T}\vec{d}_q$.

In the next step, we perform the standard SVD of the matrix $X$ and obtain $X = U\Sigma V^T$, where $U$ and $V$ are orthogonal matrices ($U^TU = V^TV = I$) and the diagonal matrix $\Sigma$ contains the singular values of $X$. The matrix products giving the term and document correlations then become

$$XX^T = U\Sigma\Sigma^T U^T \qquad \text{and} \qquad X^TX = V\Sigma^T\Sigma V^T$$

respectively. The values $\sigma_1, \ldots, \sigma_l$ are called the singular values, and $u_1, \ldots, u_l$ and $v_1, \ldots, v_l$ the left and right singular vectors. Note that the only part of $U$ that contributes to $\vec{t}_i$ is the $i$-th row; let this row vector be called $\hat{t}_i$. Likewise, the only part of $V^T$ that contributes to $\vec{d}_j$ is the $j$-th column, $\hat{d}_j$. These are not the eigenvectors, but they depend on all the singular vectors.

The LSA approximation of $X$ is computed by selecting the $k$ largest singular values and their corresponding singular vectors from $U$ and $V$. This results in the rank-$k$ approximation to $X$ with the smallest error. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector $\hat{t}_i$ then has $k$ entries, each giving the relation of term $i$ to one of the $k$ concepts. Similarly, the vector $\hat{d}_j$ gives the relation between document $j$ and each concept. We write this approximation as $X_k = U_k\Sigma_k V_k^T$. Based on this approximation, we can now do the following:

- See how related documents $j$ and $q$ are in the concept space by comparing the vectors $\hat{d}_j$ and $\hat{d}_q$ (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms $i$ and $p$ by comparing the vectors $\hat{t}_i$ and $\hat{t}_p$, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.

To do the latter, we must first translate the query into the concept space with the same transformation used on the documents, i.e., $\hat{d}_j = \Sigma_k^{-1}U_k^T\vec{d}_j$; hence, for a query vector $\vec{q}$ we compute $\hat{q} = \Sigma_k^{-1}U_k^T\vec{q}$ before comparing it to the document vectors in the concept space.
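As a small illustration of the machinery above, the sketch below (using numpy, which is an assumption of this example rather than a tool referenced by the thesis) computes a rank-k LSA approximation of a toy term-by-document matrix and folds a query into the concept space via $\hat{q} = \Sigma_k^{-1}U_k^T\vec{q}$.

```python
import numpy as np

# Toy term-by-document count matrix X (rows: terms, columns: documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # number of latent concepts to keep

# Full SVD: X = U @ diag(s) @ Vt, then rank-k truncation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Document j's coordinates in the concept space are the j-th column of Vt_k,
# i.e. d_hat_j = S_k^{-1} U_k^T d_j.
doc_concepts = Vt_k

# Fold a query (a "mini document" over the same vocabulary) into the concept space.
q = np.array([1, 0, 0, 1], dtype=float)
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_hat, doc_concepts[:, j]) for j in range(X.shape[1])]
print("query-document similarities:", np.round(scores, 3))
```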
b. Applications
The new concept space can typically be used to:
- Compare the documents in the latent semantic space. This is useful for typical learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing:
  o Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
  o Polysemy is the phenomenon where the same word has multiple meanings, so a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, translate it into the concept space and find matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in

  {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}

  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to

  {(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)}

  will occur. This leads to results which can be justified on the mathematical level but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match the observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
1.2.2 Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (PLSA) [21][22] is a statistical technique for the analysis of two-mode and co-occurrence data, with applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard LSA, PLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.
a. The Aspect Model
Suppose that we are given a collection of text documents $D = \{d_1, \ldots, d_N\}$ with terms from a vocabulary $W = \{w_1, \ldots, w_M\}$. The starting point for PLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable $z \in Z = \{z_1, \ldots, z_K\}$ is introduced to capture the hidden topics implied in the documents. Here, $N$, $M$ and $K$ are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over $D \times W$ by the mixture

$$P(d, w) = P(d)\, P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \qquad (1.1)$$

Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., $d$ and $w$ are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)).
Figure 1.1 Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [53]
It is worth noting that the aspect model can be equivalently parameterized by (cf. Figure 1.1(b))

$$P(d, w) = \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z) \qquad (1.2)$$

which is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the standard procedure for maximum likelihood estimation, i.e., Expectation Maximization (EM). EM iterates two coupled steps: (i) an expectation (E) step, in which posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, in which the parameters are updated. Standard calculations give the E-step formula

$$P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z' \in Z} P(z')\, P(d \mid z')\, P(w \mid z')} \qquad (1.3)$$

and the M-step re-estimation formulas

$$P(w \mid z) \propto \sum_{d \in D} n(d, w)\, P(z \mid d, w) \qquad (1.4)$$

$$P(d \mid z) \propto \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.5)$$

$$P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.6)$$

where $n(d, w)$ denotes the number of times term $w$ occurred in document $d$.
c. Probabilistic Latent Semantic Space
Let us consider the topic-conditional multinomial distributions $P(w \mid z)$ over the vocabulary as points on the $M-1$ dimensional simplex of all possible multinomials. Via their convex hull, the $K$ points define an $L \le K-1$ dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distributions $P(w \mid d)$ for all documents are approximated by a multinomial representable as a convex combination of the $P(w \mid z)$, in which the mixture components $P(z \mid d)$ uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.

Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [53]

In order to clarify the relation to LSA, it is useful to reformulate the aspect model as parameterized by (1.2) in matrix notation. By defining the matrices $\hat{U} = (P(d_i \mid z_k))_{i,k}$, $\hat{V} = (P(w_j \mid z_k))_{j,k}$ and $\hat{\Sigma} = \mathrm{diag}(P(z_k))_k$, we can write the joint probability model $P$ as the matrix product $P = \hat{U}\hat{\Sigma}\hat{V}^T$. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of $\hat{U}$ and $\hat{V}$ reflect conditional independence in PLSA, and (ii) the mixture proportions in PLSA substitute for the singular values. Nevertheless, the main difference between PLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the $L_2$ or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side, this offers crucial advantages; for example, the mixture approximation $P$ of the co-occurrence table is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that $d$ is a dummy index into the list of documents in the training set. Consequently, $d$ is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures $p(z \mid d)$ only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents; there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSI, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model to obtain acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al., 2001, [41]).
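As a purely hypothetical illustration of this growth (the numbers are ours, not taken from any experiment in the thesis): with K = 100 topics, a vocabulary of V = 20,000 terms and M = 100,000 training documents, a pLSI model has KV + KM = 2,000,000 + 10,000,000 = 12,000,000 free parameters, and every additional training document adds another K = 100 parameters. By contrast, the number of parameters in LDA, discussed next, does not grow with the number of documents, since the per-document topic proportions are treated as latent variables drawn from a single K-dimensional Dirichlet prior.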
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA.
1.3.1 Generative Model in LDA
Given a corpus of $M$ documents denoted by $D = \{d_1, d_2, \ldots, d_M\}$, in which document $m$ consists of $N_m$ words $w_i$ drawn from a vocabulary of terms $\{t_1, \ldots, t_V\}$, the goal of LDA is to find the latent structure of "topics" or "concepts" which captures the meaning of the text, which is imagined to be obscured by "word choice" noise. Though the terminology of "hidden topics" or "latent concepts" has already been encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.

Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words $w_{m,n}$, partitioned into documents $\vec{d}_m$. For each of these documents, a topic proportion $\vec{\vartheta}_m$ is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator $z_{m,n}$ is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution $\vec{\varphi}_{z_{m,n}}$ is used to draw a word. The topics $\vec{\varphi}_k$ are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4, and Figure 1.5 gives a list of all the quantities involved.
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for Latent Dirichlet Allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of Latent Dirichlet Allocation
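To make the generative story of Figures 1.3-1.5 concrete, here is a minimal Python sketch of the process; the vocabulary size, hyperparameter values, and Poisson document-length parameter are illustrative assumptions only, not settings used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, M = 5, 50, 3          # topics, vocabulary size, documents (toy values)
alpha = np.full(K, 0.5)      # Dirichlet prior on per-document topic proportions
beta = np.full(V, 0.1)       # Dirichlet prior on per-topic term distributions
xi = 20                      # Poisson parameter for document length

# Topic plate: sample one term distribution phi_k per topic (once for the corpus).
phi = rng.dirichlet(beta, size=K)            # shape (K, V)

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)           # document-specific topic proportions
    N_m = rng.poisson(xi)                    # document length
    doc = []
    for n in range(N_m):
        z_mn = rng.choice(K, p=theta_m)      # topic indicator for this word
        w_mn = rng.choice(V, p=phi[z_mn])    # word drawn from that topic's distribution
        doc.append(w_mn)
    corpus.append(doc)

print([len(d) for d in corpus])
```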
1.3.2 Likelihood
According to the model, the probability that a word $w_{m,n}$ instantiates a particular term $t$, given the LDA parameters, is

$$p(w_{m,n}=t \mid \vec{\vartheta}_m, \underline{\Phi}) = \sum_{k=1}^{K} p(w_{m,n}=t \mid \vec{\varphi}_k)\, p(z_{m,n}=k \mid \vec{\vartheta}_m) \qquad (1.7)$$

which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:

$$p(\vec{w}_m, \vec{z}_m, \vec{\vartheta}_m, \underline{\Phi} \mid \vec{\alpha}, \vec{\beta}) = \underbrace{\prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)}_{\text{word plate}} \cdot \underbrace{p(\vec{\vartheta}_m \mid \vec{\alpha})}_{\text{document plate (1 document)}} \cdot \underbrace{p(\underline{\Phi} \mid \vec{\beta})}_{\text{topic plate}} \qquad (1.8)$$

Specifying this distribution is often simple and useful as a basis for other derivations. We can then obtain the likelihood of a document $\vec{d}_m$, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions $\vec{\vartheta}_m$ and $\underline{\Phi}$ and summing over $z_{m,n}$:

$$p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)\; d\underline{\Phi}\, d\vec{\vartheta}_m \qquad (1.9)$$

$$\hphantom{p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta})} = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\vartheta}_m, \underline{\Phi})\; d\underline{\Phi}\, d\vec{\vartheta}_m \qquad (1.10)$$

Finally, the likelihood of the complete corpus $\mathcal{W} = \{\vec{d}_m\}_{m=1}^{M}$ is determined by the product of the likelihoods of the independent documents:

$$p(\mathcal{W} \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) \qquad (1.11)$$
1.3.3 Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20].
a. Gibbs Sampling

Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) simulation which obtains samples from a high-dimensional distribution $p(\vec{x})$ by sampling each dimension in turn from its univariate conditional $p(x_i \mid \vec{x}_{\neg i})$, where $\vec{x}_{\neg i}$ denotes all components of $\vec{x}$ except $x_i$:

$$p(x_i \mid \vec{x}_{\neg i}) = \frac{p(\vec{x})}{p(\vec{x}_{\neg i})} = \frac{p(\vec{x})}{\int p(\vec{x})\, dx_i} \qquad (1.12)$$

For models containing hidden variables $\vec{z}$, their posterior given the evidence, $p(\vec{z} \mid \vec{x})$, is the distribution of interest, and the sampler iterates over the full conditionals of the hidden variables:

$$p(z_i \mid \vec{z}_{\neg i}, \vec{x}) = \frac{p(\vec{z}, \vec{x})}{p(\vec{z}_{\neg i}, \vec{x})} = \frac{p(\vec{z}, \vec{x})}{\int p(\vec{z}, \vec{x})\, dz_i} \qquad (1.13)$$

b. Gibbs Sampling for LDA

In LDA, the hidden variables are the topic assignments $z_{m,n}$ of the individual words; the multinomial parameters $\underline{\Theta}$ and $\underline{\Phi}$ can be integrated out (collapsed), since the joint distribution factorizes as

$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) \qquad (1.14)$$

From this joint distribution, the full conditional for the topic $z_i$ of the $i$-th word, whose term is $t$ and which belongs to document $m$, can be derived:

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_m^{(z)} + \alpha_z\right] - 1} \qquad (1.15)$$

where $n_k^{(t)}$ is the number of times term $t$ has been assigned to topic $k$, $n_m^{(k)}$ is the number of words in document $m$ assigned to topic $k$, and the subscript $\neg i$ indicates that the current assignment of $z_i$ is excluded from the counts.
The other hidden variables of LDA, i.e., the multinomial parameters $\underline{\Phi}$ and $\underline{\Theta}$, can then be calculated from the count statistics as follows:

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \beta_v} \qquad (1.16)$$

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z} \qquad (1.17)$$

The complete procedure is summarized in Figure 1.6.
    - initialization: zero all count variables n_m^(k), n_m, n_k^(t), n_k
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            sample topic index z_{m,n} ~ Mult(1/K)
            increment document-topic count: n_m^(k) + 1
            increment document-topic sum: n_m + 1
            increment topic-term count: n_k^(t) + 1
            increment topic-term sum: n_k + 1
        end for
    end for
    - Gibbs sampling over burn-in period and sampling period
    while not finished do
        for all documents m ∈ [1, M] do
            for all words n ∈ [1, N_m] in document m do
                - for the current assignment of k to a term t for word w_{m,n}:
                decrement counts and sums: n_m^(k) − 1, n_m − 1, n_k^(t) − 1, n_k − 1
                - sample a new topic index k' ~ p(z_i | z_¬i, w) according to Eq. 1.15
                - use the new assignment of k' to the term t for word w_{m,n} to:
                increment counts and sums: n_m^(k') + 1, n_m + 1, n_k'^(t) + 1, n_k' + 1
            end for
        end for
        - check convergence and read out parameters
        if converged and L sampling iterations since last read out then
            - the different parameter read-outs are averaged
            read out parameter set Φ according to Eq. 1.16
            read out parameter set Θ according to Eq. 1.17
        end if
    end while

Figure 1.6 Gibbs sampling algorithm for Latent Dirichlet Allocation
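A compact Python rendering of the algorithm in Figure 1.6 is sketched below. It operates on a toy corpus of word-id lists; the count arrays follow the $n_m^{(k)}$, $n_k^{(t)}$ notation above, while the symmetric hyperparameter values, iteration count, and merged burn-in/sampling loop are simplifying assumptions of this sketch.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA on `docs`, a list of lists of word ids."""
    rng = rng or np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))   # document-topic counts  n_m^(k)
    n_kt = np.zeros((K, V))           # topic-term counts      n_k^(t)
    n_k = np.zeros(K)                 # topic sums             n_k
    z = []                            # topic assignment for every word

    # Initialization: assign each word a random topic and update the counts.
    for m, doc in enumerate(docs):
        z_m = rng.integers(0, K, size=len(doc))
        z.append(z_m)
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    # Gibbs sweeps (burn-in and sampling are merged in this sketch).
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts.
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                # Full conditional p(z_i = k | z_-i, w), Eq. (1.15).
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # Add the new assignment back.
                z[m][i] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    # Read out Phi and Theta according to Eqs. (1.16) and (1.17).
    phi = (n_kt + beta) / (n_k[:, None] + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta, n_kt, n_k

# Tiny toy corpus of word ids over a vocabulary of size 6.
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 4, 5]]
phi, theta, n_kt, n_k = lda_gibbs(docs, K=2, V=6)
print(np.round(theta, 2))
```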
c. Inference

Given an estimated LDA model, we can do topic inference for unknown documents by a similar sampling procedure. A new document $\tilde{m}$ is a vector of words $\vec{\tilde{w}}$; our goal is to estimate the posterior distribution of topics $\vec{\tilde{z}}$ given the word vector of the query $\vec{\tilde{w}}$ and the LDA model $\mathcal{L}(\underline{\Theta}, \underline{\Phi})$, i.e., $p(\vec{\tilde{z}} \mid \vec{\tilde{w}}, \mathcal{L}) = p(\vec{\tilde{z}} \mid \vec{\tilde{w}}, \vec{w}, \vec{z})$. In order to find the required counts for the complete new document, similar reasoning yields the Gibbs sampling update:

$$p(\tilde{z}_i = k \mid \vec{\tilde{z}}_{\neg i}, \vec{\tilde{w}}; \vec{z}, \vec{w}) \propto \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \tilde{n}_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{\tilde{n}_{\tilde{m},\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} \tilde{n}_{\tilde{m}}^{(z)} + \alpha_z\right] - 1} \qquad (1.18)$$

where the new variable $\tilde{n}_k^{(t)}$ counts the observations of term $t$ and topic $k$ in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: the high estimated word-topic associations $n_k^{(t)}$ will dominate the multinomial masses compared to the contributions of $\tilde{n}_k^{(t)}$ and $\tilde{n}_{\tilde{m}}^{(k)}$. Applying Eq. 1.17 then gives the topic distribution for the unknown document:

$$\vartheta_{\tilde{m},k} = \frac{\tilde{n}_{\tilde{m}}^{(k)} + \alpha_k}{\sum_{z=1}^{K} \tilde{n}_{\tilde{m}}^{(z)} + \alpha_z}$$
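The inference update (1.18) can be implemented by freezing the topic-term counts of the estimated model and resampling only the assignments of the new document. A minimal sketch, reusing the count arrays returned by the estimation sketch above (hyperparameters again illustrative), is:

```python
import numpy as np

def infer_theta(new_doc, n_kt, n_k, alpha=0.1, beta=0.01, iters=100, rng=None):
    """Sample topic assignments for `new_doc` (a list of word ids) against a
    fixed model given by the topic-term counts n_kt and topic sums n_k."""
    rng = rng or np.random.default_rng(0)
    K, V = n_kt.shape
    nt_k = np.zeros(K)                # new-document topic counts      n~_m^(k)
    nt_kt = np.zeros((K, V))          # new-document topic-term counts n~_k^(t)
    z = rng.integers(0, K, size=len(new_doc))
    for t, k in zip(new_doc, z):
        nt_k[k] += 1; nt_kt[k, t] += 1

    for _ in range(iters):
        for i, t in enumerate(new_doc):
            k = z[i]
            nt_k[k] -= 1; nt_kt[k, t] -= 1
            # Eq. (1.18): old counts dominate, new counts adapt to the query.
            p = (n_kt[:, t] + nt_kt[:, t] + beta) \
                / (n_k + nt_kt.sum(axis=1) + V * beta) * (nt_k + alpha)
            k = rng.choice(K, p=p / p.sum())
            z[i] = k
            nt_k[k] += 1; nt_kt[k, t] += 1

    # Topic distribution of the unseen document (analogue of Eq. 1.17).
    return (nt_k + alpha) / (nt_k.sum() + K * alpha)

# Usage, with n_kt and n_k from the lda_gibbs sketch above:
# theta_new = infer_theta([0, 2, 5], n_kt, n_k)
```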
1.4 Summary
This chapter has presented some typical topic analysis methods, namely LSA, PLSA, and LDA. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models. They can also be seen as key components of our proposals in the subsequent chapters.
Among the topic analysis methods, we pay particular attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven successful in many applications. Given the data, the goal is to reverse the generative process to estimate the model parameters. However, exact inference or estimation for even a not-so-complex model like LDA is intractable. Consequently, many attempts have been made to apply approximate approaches to this task, among which Gibbs sampling is one of the most suitable. Gibbs sampling, which is also described in this chapter, is a special case of Markov-chain Monte Carlo (MCMC) simulation and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA.
Chapter 2. Frameworks of Learning with Hidden Topics
2.1 Learning with External Resources: Related Works
In recent years, there have been many attempts to make use of external resources to enhance learning performance. Depending on the type of external resource, these methods can be roughly classified into two categories: those that make use of unlabeled data, and those that exploit structured or semi-structured data.
The first category is commonly referred to under the name of semi-supervised learning. The key argument is that unlabeled examples are significantly easier to collect than labeled ones. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the Computer Science faculty pages, or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g., deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection systems for images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. In addition, there has been a great deal of work applying Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers work exploiting resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible, in the sense that it depends heavily on the external resource and the application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of the application. It also helps the learning process to capture the hidden topics (of the domain) and the relationships between topics and words, as well as between words and words, thus partially overcoming the problem of different word choices in text.
2.2 General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is to gain benefit from huge sources of online data in order to enhance the quality of text/web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/web data analysis, based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the "universal dataset", and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
2.2.1 Frameworks for Learning with Hidden Topics
Corresponding to the two typical learning problems, i.e., classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1 Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge amount of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, and support vector machines. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and the new coming data with hidden topics from an available large dataset, so as to enhance the performance of text classification.
Classification with hidden topics is illustrated in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA or LDA is applied to this dataset. The result of this step is an estimated topic model, which consists of the hidden topics and the probability distributions of words over these topics. Upon this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics (the topics analyzed in the estimation phase) given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
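A minimal sketch of the combination step in this framework is given below: each document is represented by its own words plus its inferred hidden-topic distribution, and a classifier is trained on the concatenated features. The use of scikit-learn, tf-idf word features, a logistic regression learner, and the placeholder infer function are all assumptions of this sketch; the concrete classification method used in the thesis is described in Chapter 4.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def topic_features(texts, infer):
    """`infer` maps a text to its K-dim topic distribution (e.g. LDA inference)."""
    return csr_matrix(np.vstack([infer(t) for t in texts]))

def train_with_topics(train_texts, train_labels, infer):
    vec = TfidfVectorizer()
    X_words = vec.fit_transform(train_texts)         # sparse word features
    X_topics = topic_features(train_texts, infer)    # inferred topic features
    X = hstack([X_words, X_topics])                   # "training data with hidden topics"
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
    return vec, clf

def classify(texts, vec, clf, infer):
    # "New data with hidden topics": same word + topic combination as in training.
    X = hstack([vec.transform(texts), topic_features(texts, infer)])
    return clf.predict(X)

# Example with a dummy inference function standing in for real LDA inference:
# dummy_infer = lambda text: np.full(10, 0.1)
# vec, clf = train_with_topics(["mua bán nhà đất", "bóng đá ngoại hạng"],
#                              ["realestate", "sports"], dummy_infer)
```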
b. Framework for Clustering
Figure 2.2 Clustering with Hidden Topics
Text clustering is the task of automatically generating groups (clusters) of documents based on the similarity or distance among the documents. Unlike classification, the clusters are not known in advance. The user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones. Divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is a key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, to name but a few.
Web clustering, which is text clustering specific to web pages, can be offline or online. Offline clustering clusters the whole repository of available web documents and does not have a response-time constraint. In online clustering, the algorithms need to meet the "real-time condition", i.e., the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose a framework to enrich them with hidden topics for clustering (Figure 2.2). This framework and its topic analysis are similar to the one for classification; the differences are only due to the essential differences between classification and clustering.
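As a sketch of how snippets could be enriched and clustered under this framework, the following concatenates tf-idf word features with inferred topic distributions and applies agglomerative (bottom-up) clustering over cosine distances. The libraries, the topic_weight parameter, and the placeholder infer function are assumptions of this sketch; the actual clustering method of the thesis is described in Chapter 4.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_snippets(snippets, infer, n_clusters=5, topic_weight=1.0):
    """Cluster short snippets using both their words and their hidden-topic
    distributions; `infer` maps a snippet to a K-dim topic distribution."""
    word_vecs = TfidfVectorizer().fit_transform(snippets).toarray()
    topic_vecs = np.vstack([infer(s) for s in snippets])
    X = np.hstack([word_vecs, topic_weight * topic_vecs])   # enriched snippets

    # Agglomerative hierarchical clustering (average linkage) on cosine distances.
    Z = linkage(pdist(X, metric="cosine"), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```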
2.2.2 Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, there is a key phase shared between them: the phase of analyzing topics for a previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of the application.
- Preprocessing: this step is very important for obtaining good analysis results. Although there is no general procedure for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, or too frequent or too rare words.
- Methods for topic analysis: some applicable analysis methods have been mentioned in Chapter 1. The tradeoff between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as fast as possible to meet the "real-time" condition.
2.3 Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain or language. Once we have estimated a topic model for a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- This is particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be enriched with hidden topics; thus, enhanced performance can be achieved.
- Because they learn with smaller data, the presented methods require less computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is a real advantage over the heuristic-based mappings in previous approaches [16][3][10].
2.4 Summary
This chapter described two general frameworks for learning with hidden topics, one for classification and one for clustering, together with their advantages. The main advantages of our frameworks are that they are flexible and general enough to apply to any domain or language, and that they are able to deal with sparse data. The key phase shared between the two frameworks is topic analysis of a large-scale web collection called the "universal dataset". The quality of the topic model estimated from this data will greatly influence the performance of learning in the later phases.
Chapter 3. Topics Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis of a universal dataset is a key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter considers the problem of topic analysis for large-scale web datasets in Vietnamese.
3.1 Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and originally it was written in a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
3.1.1 Sound
a. Vowels
Like other Southeast Asian languages, Vietnamese has a comparatively large number of vowels. Below is a chart of the vowels in Vietnamese:
Table 3.1 Vowels in Vietnamese
The correspondence between the orthography and pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference lies in the quality of the preceding vowel. For instance, "tai" (ear) is [tāi] while "tay" (hand/arm) is [tāj].
In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus "a"; these are "ia", "ua", and "ưa" (when followed by a consonant, they become "iê", "uô", and "ươ", respectively). The other diphthongs