VIETNAMESE WEB DOCUMENTS
Major: Information Technology    Specialty: Information Systems    Code: 60 48 05
1.2.2 Probabilistic Latent Semantic Analysis
1.3.1 Generative Model in LDA
2.3 Advantages of the Frameworks
2.4 Summary
3.1 Some Characteristics of Vietnamese
3.2.2 Sentence Tokenization
3.2.3 Word Segmentation
3.2.5 Remove Non Topic-Oriented Words
3.3 Topic Analysis for VnExpress Dataset
3.4 Topic Analysis for Vietnamese Wikipedia Dataset
3.5 Discussion
3.6 Summary
Chapter 4 Deployments of General Frameworks
4.1 Classification with Hidden Topics
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for latent Dirichlet allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of latent Dirichlet allocation
Figure 1.6 Gibbs sampling algorithm for latent Dirichlet allocation
Figure 4.1 Classification with VnExpress topics
Figure 4.2 Combination of one snippet with its topics: an example
Figure 4.3 Learning with different topic models of the VnExpress dataset and the baseline (without topics)
Figure 4.4 Test-out-of-train with increasing numbers of training examples
Figure 4.5 F1-measure for classes and average (over all classes) in learning with 60 topics
Figure 4.7 Dendrogram in agglomerative hierarchical clustering
List of Tables
Table 3.1 Vowels in Vietnamese
Table 3.3 Consonants of the Hanoi variety
Table 3.6 Statistics of topics assigned by humans in the VnExpress dataset
Table 3.7 Statistics of the VnExpress dataset
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics
Table 4.4 Some collocations with highest values of the chi-square statistic
Table 4.6 Parameters for clustering web search results
Notations and Abbreviations
Probabilistic Latent Semantic Analysis    PLSA
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well-structured forms, a situation which causes great challenges to those seeking information on the Web. Consequently, a lot of tasks which enable users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page ranking, web clustering, text classification, etc. To this end, there have been a lot of success stories like Google, Yahoo, Open Directory Project (Dmoz), and Clusty, just to name a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which can overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack data to search or learn efficiently. Let us take web search clustering as an example. In order to meet the real-time condition, that is, the response time must be short enough, most online clustering systems only work with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images based only on captions. Because image captions are only very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support, semi-supervised learning, etc., in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one, even in the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a domain, we collect a very large external data collection called the "universal dataset", and then build the learner on both the original data (like snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we analyze a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be expanded and enriched with hidden topics, so that better performance can be achieved. Moreover, because the method can learn with smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less computational resources than semi-supervised learning.
Roadmap: The organization of this thesis is as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, entity models, and so on. They can also be considered key components of our proposals in subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key common phase between the two frameworks is topic analysis for large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of subsequent stages.
Chapter 3 summarizes several major issues in analyzing data collections of Vietnamese documents/web pages. We first review some characteristics of Vietnamese which are considered significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming data in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also demonstrate the results of topic analysis using LDA for the clean, preprocessed dataset.
Chapter 4 describes the deployments of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic model analyzed from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements of the previous four chapters. Some future research topics are also mentioned in this section.
Chapter 1 The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1 Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model [48] (Salton and McGill, 1983), a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (or word). The i-th coordinate of a vector is some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates for each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
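As an illustration of this representation, the following minimal sketch builds a small term-by-document count matrix with scikit-learn; the toy documents and the choice of raw counts (rather than, say, tf-idf weights) are assumptions made only for the example.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three short "documents" (hypothetical examples).
docs = [
    "web search and web clustering",
    "image captions for image retrieval",
    "text classification of web documents",
]

# fit_transform gives a document-by-term matrix; transposing it yields the
# term-by-document matrix X, where entry (i, j) counts term i in document j.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).T.toarray()

for term, row in zip(vectorizer.get_feature_names_out(), X):
    print(f"{term:15s} {row}")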
While the vector space model has some appealing features, notably its identification of sets of words that are discriminative for documents in the collection, the approach provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model, the topic-based vector space model, etc., among which latent semantic analysis (LSA, Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term-weight features that captures most of the variance in the collection. This approach can achieve considerable reduction for large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy or polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSI approach; we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI, Hofmann, 1999) [21][22] is a significant step in this regard. pLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a "reduced description" associated with the document.
While Hofmann's work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems: the number of parameters grows linearly with the number of documents. Additionally, although pLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA relies on the "bag of words" assumption, that is, the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable: the specific ordering of the documents in a corpus can also be ignored. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time series data analysis thanks to its non-exchangeable modeling of documents. On the other hand, CTM is capable of revealing topic correlation; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Though CTM gives a better fit of the data in comparison to LDA, it is complicated by the fact that it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss these modeling methods in more detail, with particular attention to LDA, a well-known model that has shown its efficiency and success in many applications.
1.2 The Early Methods
1.2.1 Latent Semantic Analysis
A main challenge for machine learning systems is to capture the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what is intended" or "what was referred to" in a text or utterance. The problem is twofold: (i) polysemy, i.e., a word has multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar sense; at least in certain contexts they specify the same concept or, in a weaker sense, the same topic.
Latent semantic analysis (LSA, Deerwester et al., 1990) [13][24][26] is a well-known technique which partially addresses this problem. The key idea is to map the document vectors from word space to a lower-dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.
a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix whose element (i, j) describes the occurrences of term i in document j. Let X be such a matrix; X will look like this:

X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}

A row of this matrix, t_i^T = [x_{i,1} ... x_{i,n}], relates term i to all documents, while a column, d_j = [x_{1,j} ... x_{m,j}]^T, relates document j to all terms.
Now, the dot product t_i^T t_p between two term vectors gives the correlation between the terms over the documents. The matrix product XX^T contains all these dot products: element (i, p), which equals element (p, i) due to symmetry, contains the dot product t_i^T t_p = t_p^T t_i. Similarly, the matrix X^T X contains the dot products between all the document vectors, giving their correlation over the terms: d_j^T d_q = d_q^T d_j.
In the next step, we conduct the standard SVD of the matrix X and get X = U Σ V^T, where U and V are orthogonal matrices (U^T U = V^T V = I) and the diagonal matrix Σ contains the singular values of X. The matrix products giving us the term and document correlations then become XX^T = U Σ Σ^T U^T and X^T X = V Σ^T Σ V^T respectively.
Since Σ Σ^T and Σ^T Σ are diagonal, we see that U must contain the eigenvectors of XX^T, while V must contain the eigenvectors of X^T X. Both products have the same non-zero eigenvalues, given by the non-zero entries of Σ Σ^T, or equally, the non-zero entries of Σ^T Σ. Writing the decomposition in terms of the singular values and the corresponding left and right singular vectors, the only part of U that contributes to the term vector t_i is the i-th row of U. Let this row vector be called t̂_i. Likewise, the only part of V^T that contributes to d_j is the j-th column, d̂_j. These are not the eigenvectors, but depend on all the eigenvectors.
The LSA approximation of X is computed by selecting the k largest singular values and their corresponding singular vectors from U and V. This results in the rank-k approximation to X with the smallest error. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector t̂_i then has k entries, each giving the occurrence of term i in one of the k concepts. Similarly, the vector d̂_j gives the relation between document j and each concept. We write this approximation as X_k = U_k Σ_k V_k^T. Based on this approximation, we can now do the following:
- See how related documents j and q are in the concept space by comparing the vectors d̂_j and d̂_q (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms i and p by comparing the vectors t̂_i and t̂_p, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.
To do the latter, we must first translate the query into the concept space with the same transformation used on the documents, i.e., d_j = U_k Σ_k d̂_j and d̂_j = Σ_k^{-1} U_k^T d_j. This means that if we have a query vector q, we must apply the translation q̂ = Σ_k^{-1} U_k^T q before comparing it to the document vectors in the concept space.
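To make these transformations concrete, here is a minimal NumPy sketch that computes a rank-k LSA approximation and folds a query vector into the concept space; the small random count matrix and the choice k = 2 are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(8, 5)).astype(float)   # toy term-by-document counts

# Full SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                                # number of latent concepts (assumed)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_k = U_k @ S_k @ Vt_k                               # rank-k approximation X_k = U_k S_k V_k^T

doc_concepts = Vt_k                                  # column j is the document vector d_hat_j

# Fold a query (a term-count vector) into the same space: q_hat = S_k^{-1} U_k^T q
q = X[:, 0]                                          # pretend the first document is the query
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Cosine similarity between the query and each document in concept space
sims = (doc_concepts.T @ q_hat) / (
    np.linalg.norm(doc_concepts, axis=0) * np.linalg.norm(q_hat) + 1e-12
)
print("similarities:", np.round(sims, 3))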
b. Applications
The new concept space can typically be used to:
- Compare documents in the latent semantic space. This is useful for typical learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing:
  o Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
  o Polysemy is the phenomenon where the same word has multiple meanings. A search may therefore retrieve irrelevant documents containing the desired words with the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, translate it into the concept space and find matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in
  {(car), (truck), (flower)} -> {(1.3452 * car + 0.2828 * truck), (flower)}
  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to
  {(car), (bottle), (flower)} -> {(1.3452 * car + 0.2828 * bottle), (flower)}
  will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
1.2.2 Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (pLSA) [21][22] is a statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard LSA, pLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.
a. The Aspect Model
Suppose that we are given a collection of text documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. The starting point for pLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable z ∈ Z = {z_1, ..., z_K} is introduced to capture the hidden topics implied in the documents. Here, N, M and K are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over D × W by the mixture

P(d, w) = P(d) P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z) P(z \mid d) \qquad (1.1)

Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., d and w are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)). An equivalent parameterization of the joint probability is

P(d, w) = \sum_{z \in Z} P(z) P(d \mid z) P(w \mid z)

This is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the standard procedure for maximum likelihood estimation, i.e., Expectation Maximization (EM). EM iterates two coupled steps: (i) an expectation (E) step in which posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step in which parameters are updated. Standard calculations give us the E-step formula

P(z \mid d, w) = \frac{P(z) P(d \mid z) P(w \mid z)}{\sum_{z'} P(z') P(d \mid z') P(w \mid z')}

and the M-step re-estimates P(w \mid z), P(d \mid z) and P(z) from the expected counts computed in the E-step.
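As a concrete illustration of these alternating steps, the following NumPy sketch runs EM for the aspect model in its symmetric parameterization; the toy count matrix, the number of topics K and the fixed iteration count are assumptions for the example, not part of the original text.

import numpy as np

rng = np.random.default_rng(1)
n_dw = rng.integers(0, 4, size=(6, 10)).astype(float)  # toy document-word counts
N, M = n_dw.shape
K = 3  # number of latent topics (assumed)

# Random initialization of P(z), P(d|z), P(w|z)
p_z = np.full(K, 1.0 / K)
p_d_z = rng.random((K, N)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior P(z|d,w) for every (d, w) pair, shape (K, N, M)
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)

    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w)
    weighted = n_dw[None, :, :] * post
    p_w_z = weighted.sum(axis=1); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=2); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()

# Reduced description of each document: P(z|d) proportional to P(z) P(d|z)
p_z_d = p_z[:, None] * p_d_z
p_z_d /= p_z_d.sum(axis=0, keepdims=True)
print(np.round(p_z_d.T, 3))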
Let us consider the topic-conditional multinomial distributions P(·|z) over the vocabulary as points on the M−1 dimensional simplex of all possible multinomials. Via their convex hull, the K points define an L ≤ K−1 dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of the P(w|z), in which the mixture components P(z|d) uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.
Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [53]
In order to clarify the relation to LSA, it is useful to reformulate the aspect model in matrix notation. Defining the matrices U = (P(d_i|z_k))_{i,k}, V = (P(w_j|z_k))_{j,k} and Σ = diag(P(z_k))_k, we can write the joint probability model P as a matrix product P = U Σ V^T. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of U and V reflect conditional independence in pLSA, and (ii) the mixture proportions in pLSA substitute for the singular values. Nevertheless, the main difference between pLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the L2 or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any kind of squared deviation. On the modeling side, this offers crucial advantages; for example, the mixture approximation P of the term-by-document matrix is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the pLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that d is a dummy index into the list of documents in the training set. Consequently, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures P(z|d) only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents: there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSA, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model for acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al. [41]).
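To give a feel for this growth, suppose (purely as an illustrative assumption) a 100-topic pLSI model over a 20,000-term vocabulary trained on 10,000 documents: it needs 100 x 20,000 = 2,000,000 topic-word parameters plus 100 x 10,000 = 1,000,000 document-topic parameters, and the latter term keeps growing with every additional training document, whereas the LDA model described next replaces those per-document parameters with a single K-dimensional hidden random variable per document governed by only K hyperparameters.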
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA.
1.3.1 Generative Model in LDA
Given a corpus of M documents denoted by D = {d_1, d_2, ..., d_M}, in which document m consists of N_m words drawn from a vocabulary of terms V = {t_1, ..., t_V}, the goal of LDA is to find the latent structure of "topics" or "concepts" which capture the meaning of the text, imagined to be obscured by "word choice" noise. Though the terminology of "hidden topics" or "latent concepts" was already encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.
Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words w_{m,n}, partitioned into documents d_m. For each of these documents, a topic proportion ϑ_m is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator z_{m,n} is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution φ_{z_{m,n}} is used to draw a word. The topics φ_k are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4; Figure 1.5 gives a list of all involved quantities.
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
// topic plate
for all topics k ∈ [1, K] do
    sample mixture components φ_k ~ Dir(β)
end for
// document plate
for all documents m ∈ [1, M] do
    sample mixture proportion ϑ_m ~ Dir(α)
    sample document length N_m ~ Poiss(ξ)
    // word plate
    for all words n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ~ Mult(ϑ_m)
        sample term for word w_{m,n} ~ Mult(φ_{z_{m,n}})
    end for
end for
Figure 1.4 Generative model for latent Dirichlet allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
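As a concrete companion to Figure 1.4, the following sketch simulates this generative process with NumPy; the corpus size, vocabulary size, and the symmetric hyperparameters α, β and Poisson mean ξ are assumed values chosen only for illustration.

import numpy as np

rng = np.random.default_rng(7)

K, V, M = 4, 50, 10                 # topics, vocabulary size, documents (assumed)
alpha, beta, xi = 0.5, 0.1, 20.0    # symmetric hyperparameters and mean length (assumed)

# Topic plate: one term distribution phi_k per topic
phi = rng.dirichlet(np.full(V, beta), size=K)        # shape (K, V)

corpus = []
for m in range(M):                                   # document plate
    theta_m = rng.dirichlet(np.full(K, alpha))       # topic proportions for document m
    N_m = rng.poisson(xi)                            # document length
    doc = []
    for _ in range(N_m):                             # word plate
        z = rng.choice(K, p=theta_m)                 # sample topic index z_{m,n}
        w = rng.choice(V, p=phi[z])                  # sample term w_{m,n} from topic z
        doc.append(w)
    corpus.append(doc)

print("first document (term ids):", corpus[0])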
M: number of documents to generate (constant scalar)
K: number of topics / mixture components (constant scalar)
V: number of terms t in the vocabulary (constant scalar)
α: hyperparameter on the mixing proportions (K-vector, or scalar if symmetric)
β: hyperparameter on the mixture components (V-vector, or scalar if symmetric)
ϑ_m: parameter notation for p(z|d=m), the topic mixture proportion for document m; one proportion for each document, Θ = {ϑ_m}, an M x K matrix
φ_k: parameter notation for p(t|z=k), the mixture component of topic k; one component for each topic, Φ = {φ_k}, a K x V matrix
N_m: document length (document-specific), here modelled with a Poisson distribution [2] with constant parameter ξ
z_{m,n}: mixture indicator that chooses the topic for the n-th word in document m
w_{m,n}: term indicator for the n-th word in document m
Figure 1.5 Quantities in the model of latent Dirichlet allocation
According to the model, the probability that a word w_{m,n} instantiates a particular term t given the LDA parameters is:

p(w_{m,n} = t \mid \vartheta_m, \Phi) = \sum_{k=1}^{K} p(w_{m,n} = t \mid \varphi_k)\, p(z_{m,n} = k \mid \vartheta_m) \qquad (1.7)
which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:

p(d_m, z_m, \vartheta_m, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta)\, p(\vartheta_m \mid \alpha) \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \vartheta_m) \qquad (1.8)

Specifying this distribution is often simple and useful as a basis for other derivations. We can then obtain the likelihood of a document d_m, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions ϑ_m and Φ and summing over z_{m,n}:

p(d_m \mid \alpha, \beta) = \iint p(\vartheta_m \mid \alpha)\, p(\Phi \mid \beta) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \vartheta_m)\, d\Phi\, d\vartheta_m \qquad (1.9)

= \iint p(\vartheta_m \mid \alpha)\, p(\Phi \mid \beta) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vartheta_m, \Phi)\, d\Phi\, d\vartheta_m \qquad (1.10)

Finally, the likelihood of the complete corpus W = {d_m}, m = 1, ..., M, is determined by the product of the likelihoods of the independent documents:

p(W \mid \alpha, \beta) = \prod_{m=1}^{M} p(d_m \mid \alpha, \beta) \qquad (1.11)
1.3.3 Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20].
Gibbs sampling is a special case of Markov chain Monte Carlo simulation in which the dimensions x_i of the target distribution are sampled alternately one at a time, conditioned on the values of all other dimensions, which we denote x_{-i}. The algorithm works as follows:
1. Choose dimension i (randomly or by permutation).
2. Sample x_i from p(x_i | x_{-i}).
Heinrich [20] has shown a sequence of calculations that leads to the full conditional for LDA:

p(z_i = k \mid z_{-i}, w) = \frac{n_{k,-i}^{(t)} + \beta}{\sum_{t=1}^{V} n_{k,-i}^{(t)} + V\beta} \cdot \frac{n_{m,-i}^{(k)} + \alpha}{\big[\sum_{k=1}^{K} n_{m}^{(k)}\big] - 1 + K\alpha} \qquad (1.15)

where n_k^{(t)} is the number of times term t is assigned to topic k, n_m^{(k)} is the number of words in document m assigned to topic k, and the subscript -i indicates that the current assignment of z_i is excluded from the counts.
The other hidden variables of LDA can then be calculated from these count statistics as follows:

\varphi_{k,t} = \frac{n_k^{(t)} + \beta}{\sum_{t=1}^{V} n_k^{(t)} + V\beta} \qquad (1.16)

\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha}{\sum_{k=1}^{K} n_m^{(k)} + K\alpha} \qquad (1.17)

The complete Gibbs sampling procedure is summarized in Figure 1.6:
// initialization
zero all count variables: n_m^(k), n_m, n_k^(t), n_k
for all documents m ∈ [1, M] do
    for all words n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ~ Mult(1/K)
        increment document-topic count: n_m^(k) + 1
        increment document-topic sum: n_m + 1
        increment topic-term count: n_k^(t) + 1
        increment topic-term sum: n_k + 1
    end for
end for
// Gibbs sampling over burn-in period and sampling period
while not finished do
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            // for the current assignment of k to a term t for word w_{m,n}:
            decrement counts and sums: n_m^(k) - 1, n_m - 1, n_k^(t) - 1, n_k - 1
            // multinomial sampling according to Eq. 1.15 (using the decremented counts):
            sample topic index k̃ ~ p(z_i | z_{-i}, w)
            // use the new assignment of k̃ to the term t for word w_{m,n} to:
            increment counts and sums: n_m^(k̃) + 1, n_m + 1, n_k̃^(t) + 1, n_k̃ + 1
        end for
    end for
    // check convergence and read out parameters
    if converged and L sampling iterations since last read out then
        // the different parameter read-outs are averaged
        read out parameter set Φ according to Eq. 1.16
        read out parameter set Θ according to Eq. 1.17
    end if
end while
Figure 1.6 Gibbs sampling algorithm for latent Dirichlet allocation
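The following NumPy sketch implements this collapsed Gibbs sampler for a toy corpus; the corpus, the symmetric hyperparameters and the number of iterations are assumptions for illustration, and the count arrays mirror n_m^(k) and n_k^(t) from Figure 1.6. The document-side denominator in Eq. 1.15 is constant across topics, so the sketch simply renormalizes.

import numpy as np

rng = np.random.default_rng(3)

# Toy corpus: each document is a list of term ids (assumed data).
corpus = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 2, 3]]
V = 5              # vocabulary size
K = 2              # number of topics (assumed)
alpha, beta = 0.5, 0.1
M = len(corpus)

n_mk = np.zeros((M, K))        # document-topic counts n_m^(k)
n_kt = np.zeros((K, V))        # topic-term counts n_k^(t)
n_k = np.zeros(K)              # topic sums n_k
z = []                         # topic assignment for every word

# initialization: random topic for each word
for m, doc in enumerate(corpus):
    z_m = []
    for t in doc:
        k = rng.integers(K)
        z_m.append(k)
        n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    z.append(z_m)

# Gibbs sampling over burn-in and sampling iterations
for _ in range(200):
    for m, doc in enumerate(corpus):
        for n, t in enumerate(doc):
            k = z[m][n]
            # decrement counts for the current assignment
            n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
            # full conditional, Eq. 1.15 (document denominator dropped, then renormalized)
            p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
            k = rng.choice(K, p=p / p.sum())
            # increment counts for the new assignment
            z[m][n] = k
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

# Read out parameters, Eqs. 1.16 and 1.17
phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
print(np.round(theta, 2))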
c. Inference
Given an estimated LDA model, we can do topic inference for unseen documents by a similar sampling procedure. A new document m̃ is represented as a vector of words w̃; our goal is to estimate the posterior distribution of topics z̃ given the word vector w̃ of the query and the LDA model L(Θ, Φ), i.e., p(z̃ | w̃, L). Starting from the joint distribution of the complete new document, reasoning similar to the derivation of Eq. 1.15 yields the Gibbs sampling update:

p(\tilde{z}_i = k \mid \tilde{w}_i = t, \tilde{z}_{-i}, \tilde{w}_{-i}; L) \propto \frac{n_k^{(t)} + \tilde{n}_{k,-i}^{(t)} + \beta}{\sum_{t=1}^{V} \big(n_k^{(t)} + \tilde{n}_{k,-i}^{(t)}\big) + V\beta} \cdot \frac{\tilde{n}_{\tilde{m},-i}^{(k)} + \alpha}{\big[\sum_{k=1}^{K} \tilde{n}_{\tilde{m}}^{(k)}\big] - 1 + K\alpha}

where the new variables ñ count the observations of term t and topic k in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: the high estimated word-topic associations n_k^{(t)} dominate the multinomial masses compared to the contributions of ñ_k^{(t)} and ñ_m̃^{(k)}, which are initialized randomly. Consequently, by repeatedly sampling from the distribution and updating ñ_m̃^{(k)}, the masses of topic-word associations are propagated into document-topic associations. Note the smoothing influence of the Dirichlet hyperparameters. Applying Eq. 1.17 then gives the topic distribution for the unknown document.
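Continuing the Gibbs sampler sketch above, a simplified fold-in for a new document can reuse the trained counts; the new document and the iteration count are assumed for illustration, and for brevity the new document's own topic-term counts (the ñ_k^(t) terms in the update above) are omitted from the numerator, a common simplification when the trained counts dominate.

# Fold in a new document using the counts learned in the previous sketch
# (this block assumes np, rng, K, V, alpha, beta, n_kt and n_k from that sketch).
new_doc = [1, 2, 2, 4]                  # term ids of the unseen document (assumed)
nn_k = np.zeros(K)                      # document-topic counts for the new document
zz = []
for t in new_doc:                       # random initialization
    k = rng.integers(K)
    zz.append(k); nn_k[k] += 1

for _ in range(100):                    # Gibbs updates with the trained model fixed
    for n, t in enumerate(new_doc):
        k = zz[n]; nn_k[k] -= 1
        # trained topic-term counts dominate; new-document topic counts are added on top
        p = (n_kt[:, t] + beta) / (n_k + V * beta) * (nn_k + alpha)
        k = rng.choice(K, p=p / p.sum())
        zz[n] = k; nn_k[k] += 1

theta_new = (nn_k + alpha) / (nn_k.sum() + K * alpha)   # Eq. 1.17 for the new document
print(np.round(theta_new, 2))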
1.3.4 Applications
LDA has been successfully applied to text modeling and feature reduction in text classification [5]. Recent work has also used LDA as a building block in more sophisticated topic models such as author-document models [42], abstract-reference models [15], syntax-semantic models [18] and image-caption models [6]. Additionally, the same kinds of modeling tools have been used in a variety of non-text settings, such as image processing [46] and the modeling of user profiles [17].
1.4 Chapter Summary
This chapter has presented some typical topic analysis methods such as LSA, pLSA, and LDA. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models. These models can also be seen as key components in our proposals in subsequent chapters.
Among the topic analysis methods, we pay particular attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven successful in many applications. Given the data, the goal is to reverse the generative process to estimate the model parameters. However, exact inference or estimation even for a not-so-complex model like LDA is intractable. Consequently, there have been many attempts to apply approximate approaches to this task, among which Gibbs sampling is one of the most suitable. Gibbs sampling, which was also described in this chapter, is a special case of Markov chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA.
Chapter 2 Frameworks of Learning with Hidden Topics
2.1 Learning with External Resources: Related Works
In recent years, there have been many attempts to make use of external resources to enhance learning performance. Depending on the type of external resource, these methods can be roughly divided into two categories: semi-supervised learning, which exploits unlabeled data, and learning with the support of external repositories such as Wikipedia.
Consider, for example, the problem of classifying web pages, such as all the Computer Science faculty pages or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to be able to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g., deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection systems from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. Besides, there has been a lot of work applying Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers works exploiting resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] have demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends heavily on the external resource and the application.
Trang 27This chapter describes frameworks for leaning with the support of topic model estimated
from a large universal dataset This topic model can be considered background knowledge
for the domain of application It also helps the learning process to capture hidden topics (of the domain), the relationships between topics and words as well as words and words,
thus partially overcome the limitations of different word choices in text
2.2 General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is how to benefit from huge sources of online data in order to enhance the quality of text/web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/web data analysis based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the "universal dataset", and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
2.2.1 Frameworks for Learning with Hidden Topics
Corresponding to two typical learning problems, i.e., classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1 Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and the newly arriving data with hidden topics from an available large dataset so as to enhance the performance of text classification.
Classification with hidden topics is depicted in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA, LDA, etc. is applied to this dataset. The result of this step is an estimated topic model which consists of hidden topics and the probability distributions of words over these topics. With this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics (the topics analyzed in the estimation phase) given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
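A minimal sketch of this pipeline is given below, using Gensim for the topic model and scikit-learn for the classifier; the toy universal dataset, the number of topics, the topic_features helper, and the choice of TF-IDF features with a linear SVM are our own illustrative assumptions, not prescriptions from the thesis.

import numpy as np
from gensim import corpora, models
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Step 1: estimate the topic model on a (tiny, assumed) tokenized universal dataset.
universal_docs = [
    ["web", "search", "clustering", "engine"],
    ["image", "caption", "retrieval", "photo"],
    ["stock", "market", "finance", "news"],
]
dictionary = corpora.Dictionary(universal_docs)
bows = [dictionary.doc2bow(d) for d in universal_docs]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=10)  # 10 topics (assumed)

def topic_features(texts):
    """Infer a dense topic-distribution vector for each whitespace-tokenized text."""
    feats = np.zeros((len(texts), lda.num_topics))
    for i, text in enumerate(texts):
        bow = dictionary.doc2bow(text.split())
        for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
            feats[i, k] = p
    return csr_matrix(feats)

# Step 2: combine original word features with inferred topics and train a classifier.
train_texts = ["football match results tonight", "stock market news and finance"]
train_labels = ["sports", "business"]
vectorizer = TfidfVectorizer()
X_train = hstack([vectorizer.fit_transform(train_texts), topic_features(train_texts)])
clf = LinearSVC().fit(X_train, train_labels)

# Step 3: enrich new data the same way before classification.
new_texts = ["web search for market news"]
X_new = hstack([vectorizer.transform(new_texts), topic_features(new_texts)])
print(clf.predict(X_new))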
b. Framework for Clustering
Figure 2.2 Clustering with Hidden Topics
Text clustering is the task of automatically generating groups (clusters) of documents based on the similarity or distance between documents. Unlike classification, the clusters are not known previously. The user can optionally specify the required number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones; divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and further away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, just to name a few.
Web clustering, which is text clustering specialized for web pages, can be offline or online. Offline clustering clusters the whole repository of available web documents and has no response-time constraint. In online clustering, the algorithms need to meet the "real-time condition", i.e., the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose the framework that enriches them with hidden topics for clustering (Figure 2.2). This framework shares its topic analysis with the one for classification; the difference is only due to the differences between classification and clustering.
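A corresponding sketch for the clustering framework is given below; it reuses the hypothetical topic_features helper defined in the classification sketch and applies agglomerative clustering to snippets enriched with their topic distributions, with the snippet list and the number of clusters chosen purely for illustration.

from scipy.sparse import hstack
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Short search-result snippets returned by a search engine (assumed examples).
snippets = [
    "giá vàng tăng mạnh trong phiên giao dịch",
    "đội tuyển bóng đá giành chiến thắng",
    "thị trường chứng khoán giảm điểm",
    "trận đấu bóng đá kết thúc với tỷ số hòa",
]

# Enrich the sparse snippet vectors with their hidden-topic distributions
# (topic_features is the helper from the classification sketch above).
tfidf = TfidfVectorizer().fit_transform(snippets)
X = hstack([tfidf, topic_features(snippets)]).toarray()

# Agglomerative (bottom-up) hierarchical clustering into two groups.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)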
2.2.2 Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, there is a key phase shared between them: the phase of analyzing topics for the previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of application.
- Preprocessing: this step is very important for getting good analysis results. Although there is no general recipe for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, and too frequent or too rare words (a minimal preprocessing sketch is given after this list).
- Methods for topic analysis: some analysis methods which can be applied have been mentioned in Chapter 1. The tradeoff between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as fast as possible to meet the "real-time" condition.
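The following sketch illustrates this kind of frequency-based filtering on a tokenized corpus; the stop-word list and the frequency thresholds are assumptions for the example and would need to be tuned for a real Vietnamese collection.

from collections import Counter

# Tokenized documents (word segmentation for Vietnamese is assumed to be done already).
docs = [
    ["giá", "vàng", "tăng", "trong", "phiên", "giao_dịch"],
    ["đội", "bóng_đá", "giành", "chiến_thắng", "trong", "trận", "đấu"],
    ["thị_trường", "chứng_khoán", "giảm", "trong", "phiên", "giao_dịch"],
]

stop_words = {"trong", "và", "của", "là"}     # tiny illustrative stop-word list
min_df, max_df_ratio = 1, 0.9                 # frequency thresholds (assumed)

# Document frequency of each term
df = Counter(term for doc in docs for term in set(doc))
max_df = max_df_ratio * len(docs)

def keep(term):
    # drop stop words, very rare terms, and terms occurring in almost every document
    return term not in stop_words and min_df <= df[term] <= max_df

cleaned = [[t for t in doc if keep(t)] for doc in docs]
print(cleaned)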
2.3 Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain or language. Once a topic model has been estimated from a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- This is particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be enriched with hidden topics, so that enhanced performance can be achieved.
- Because they learn with smaller data, the presented methods require less computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is a real advantage over the heuristic-based mappings in previous approaches [16][3][10].
2.4 Summary
In this chapter, we have described two general frameworks, and their advantages, for learning with hidden topics: one for classification and one for clustering. The key common phase between the two frameworks is topic analysis for a large-scale web collection called the "universal dataset". The quality of the topic model estimated from this data strongly influences the performance of learning in the later phases.
Chapter 3 Topic Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis for a universal dataset is key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter addresses the problem of topic analysis for large-scale web datasets in Vietnamese.
3.1 Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and the language was originally written with a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
Table 3.1 Vowels in Vietnamese

              Front      Central     Back
High          i  [i]     ư  [ɨ]      u  [u]
Upper Mid     ê  [e]     ơ  [ə]      ô  [o]
Lower Mid     e  [ɛ]     a  [a]      o  [ɔ]
The correspondence between the orthography and pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference is in the quality of the preceding vowel. For instance, "tai" (ear) is [tai] while "tay" (hand/arm) is [taj].
In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus a: these are "ia", "ua", and "ưa" (when followed by a consonant, they become "iê", "uô", and "ươ", respectively). The other diphthongs consist of a vowel plus a semivowel. There are two of these semivowels: /j/ (written i or y) and /w/ (written o or u). A majority of diphthongs in Vietnamese are formed this way.
Furthermore, these semivowels may also follow the first three diphthongs ("ia", "ua", "ưa"), resulting in triphthongs.
b. Tones
Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour melody, intensity, and glottalization (with or without accompanying constricted vocal cords).
Tone is indicated by diacritics written above or below the vowel (most of the tone diacritics appear above the vowel; however, the "nặng" tone dot diacritic goes below the vowel). The six tones in Vietnamese are:
Table 3.2 Tones in Vietnamese

Name               Description        Diacritic            Example
ngang  'level'     high level         (no mark)            ma  'ghost'
huyền  'hanging'   low falling        ` (grave accent)     mà  'but'
sắc    'sharp'     high rising        ´ (acute accent)     má  'cheek, mother (southern)'
hỏi    'asking'    dipping-rising     (hook)               mả  'tomb, grave'
ngã    'tumbling'  breaking-rising    ~ (tilde)            mã  'horse (Sino-Vietnamese), code'
nặng   'heavy'     constricted        (dot below)          mạ  'rice seedling'
c. Consonants
The consonants of the Hanoi variety are listed below in the Vietnamese orthography, except for the bilabial approximant, which is written here as "w" (in the writing system it is written the same as the vowels "o" and "u"). Some consonant sounds are written with only one letter (like "p"), other consonant sounds are written with a two-letter digraph (like "ph"), and others are written with more than one letter or digraph (the velar stop is written variously as "c", "k", or "q").