VIETNAMESE WEB DOCUMENTS
Major: Information Technology    Specialty: Information Systems    Code: 60 48 05
1.2.2 Probabilistic Latent Semantic Analysis
1.3.1 Generative Model in LDA
2.3 Advantages of the Frameworks
2.4 Summary
3.1 Some Characteristics of Vietnamese
3.2.2 Sentence Tokenization
3.2.3 Word Segmentation
3.2.5 Remove Non Topic-Oriented Words
3.3 Topic Analysis for VnExpress Dataset
3.4 Topic Analysis for Vietnamese Wikipedia Dataset
3.5 Discussion
3.6 Summary
Chapter 4 Deployments of General Frameworks
4.1 Classification with Hidden Topics
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for latent Dirichlet allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of latent Dirichlet allocation
Figure 1.6 Gibbs sampling algorithm for latent Dirichlet allocation
Figure 4.1 Classification with VnExpress topics
Figure 4.2 Combination of one snippet with its topics: an example
Figure 4.3 Learning with different topic models of the VnExpress dataset and the baseline (without topics)
Figure 4.4 Test-out-of-train with increasing numbers of training examples
Figure 4.5 F1-measure for classes and average (over all classes) in learning with 60 topics
Figure 4.7 Dendrogram in agglomerative hierarchical clustering
List of Tables
Table 3.1 Vowels in Vietnamese
Table 3.3 Consonants of the Hanoi variety
Table 3.6 Statistics of topics assigned by humans in the VnExpress dataset
Table 3.7 Statistics of the VnExpress dataset
Table 3.8 Most likely words for sample topics. Here, we conduct topic analysis with 100 topics
Table 4.4 Some collocations with highest values of the chi-square statistic
Table 4.6 Parameters for clustering web search results
Notations and Abbreviations
Probabilistic Latent Semantic Analysis    PLSA
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well-structured forms, a situation which causes great challenges to those seeking information on the Web. Consequently, a lot of tasks which enable users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page ranking, web clustering, text classification, etc. To this end, there have been a lot of success stories like Google, Yahoo, Open Directory Project (Dmoz), and Clusty, just to name a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which can overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack data to search or learn efficiently. Let us take web search clustering as an example. In order to meet the real-time condition, that is, the response time must be short enough, most online clustering systems only work with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images based only on captions. Because image captions are only very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support, semi-supervised learning, etc., in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another one, even in the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a domain, we collect a very large external data collection called the "universal dataset", and then build the learner on both the original data (like snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we analyze a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be expanded and enriched with hidden topics, so that better performance can be achieved. Moreover, because the method can learn with smaller data (the meaningful hidden topics rather than all unlabeled data), it requires less computational resources than semi-supervised learning.
Roadmap: The organization of this thesis is as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models, such as hierarchical models, author-role models, entity models, and so on. They can also be considered key components of our proposals in subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key common phase between the two frameworks is topic analysis for large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of subsequent stages.
Chapter 3 summarizes several major issues in analyzing data collections of Vietnamese documents/web pages. We first review some characteristics of Vietnamese which are considered significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming data in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also demonstrate the results of topic analysis using LDA for the clean, preprocessed dataset.
Chapter 4 describes the deployments of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic model analyzed from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements of the previous four chapters. Some future research topics are also mentioned in this section.
Chapter 1 The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1 Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model [48] (Salton and McGill, 1983), a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (or word). The i-th coordinate of a vector is some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates for each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
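As an illustration of this representation, the following minimal sketch builds a small term-by-document count matrix with scikit-learn; the toy documents and the choice of raw counts (rather than, say, tf-idf weights) are assumptions made only for the example.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: three short "documents" (hypothetical examples).
docs = [
    "web search and web clustering",
    "image captions for image retrieval",
    "text classification of web documents",
]

# fit_transform gives a document-by-term matrix; transposing it yields the
# term-by-document matrix X, where entry (i, j) counts term i in document j.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).T.toarray()

for term, row in zip(vectorizer.get_feature_names_out(), X):
    print(f"{term:15s} {row}")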
While the vector space model has some appealing features, notably its identification of sets of words that are discriminative for documents in the collection, the approach provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model, the topic-based vector space model, etc., among which latent semantic analysis (LSA, Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term-weight features that captures most of the variance in the collection. This approach can achieve considerable reduction for large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy or polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSI approach; we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI, Hofmann, 1999) [21][22] is a significant step in this regard. pLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a "reduced description" associated with the document.
While Hofmann's work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems: the number of parameters grows linearly with the number of documents. Additionally, although pLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA relies on the "bag of words" assumption, that is, the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable: the specific ordering of the documents in a corpus can also be ignored. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time series data analysis thanks to its non-exchangeable modeling of documents. On the other hand, CTM is capable of revealing topic correlation; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Though CTM gives a better fit of the data in comparison to LDA, it is complicated by the fact that it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss these modeling methods in more detail, with particular attention to LDA, a well-known model that has shown its efficiency and success in many applications.
1.2 The Early Methods
1.2.1 Latent Semantic Analysis
A main challenge for machine learning systems is to capture the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what is intended" or "what was referred to" in a text or utterance. The problem is twofold: (i) polysemy, i.e., a word has multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar sense; at least in certain contexts they specify the same concept or, in a weaker sense, the same topic.
Latent semantic analysis (LSA, Deerwester et al., 1990) [13][24][26] is a well-known technique which partially addresses this problem. The key idea is to map the document vectors from word space to a lower-dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.
a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix whose element (i, j) describes the occurrences of term i in document j. Let X be such a matrix; X will look like this:

X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}

A row of this matrix, t_i^T = [x_{i,1} ... x_{i,n}], relates term i to all documents, while a column, d_j = [x_{1,j} ... x_{m,j}]^T, relates document j to all terms.
Now, the dot product t_i^T t_p between two term vectors gives the correlation between the terms over the documents. The matrix product XX^T contains all these dot products: element (i, p), which equals element (p, i) due to symmetry, contains the dot product t_i^T t_p = t_p^T t_i. Similarly, the matrix X^T X contains the dot products between all the document vectors, giving their correlation over the terms: d_j^T d_q = d_q^T d_j.
In the next step, we conduct the standard SVD of the matrix X and get X = U Σ V^T, where U and V are orthogonal matrices (U^T U = V^T V = I) and the diagonal matrix Σ contains the singular values of X. The matrix products giving us the term and document correlations then become XX^T = U Σ Σ^T U^T and X^T X = V Σ^T Σ V^T respectively.
Since Σ Σ^T and Σ^T Σ are diagonal, we see that U must contain the eigenvectors of XX^T, while V must contain the eigenvectors of X^T X. Both products have the same non-zero eigenvalues, given by the non-zero entries of Σ Σ^T, or equally, the non-zero entries of Σ^T Σ. Writing the decomposition in terms of the singular values and the corresponding left and right singular vectors, the only part of U that contributes to the term vector t_i is the i-th row of U. Let this row vector be called t̂_i. Likewise, the only part of V^T that contributes to d_j is the j-th column, d̂_j. These are not the eigenvectors, but depend on all the eigenvectors.
The LSA approximation of X is computed by selecting the k largest singular values and their corresponding singular vectors from U and V. This results in the rank-k approximation to X with the smallest error. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector t̂_i then has k entries, each giving the occurrence of term i in one of the k concepts. Similarly, the vector d̂_j gives the relation between document j and each concept. We write this approximation as X_k = U_k Σ_k V_k^T. Based on this approximation, we can now do the following:
- See how related documents j and q are in the concept space by comparing the vectors d̂_j and d̂_q (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms i and p by comparing the vectors t̂_i and t̂_p, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.
To do the latter, we must first translate the query into the concept space with the same transformation used on the documents, i.e., d_j = U_k Σ_k d̂_j and d̂_j = Σ_k^{-1} U_k^T d_j. This means that if we have a query vector q, we must apply the translation q̂ = Σ_k^{-1} U_k^T q before comparing it to the document vectors in the concept space.
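To make these transformations concrete, here is a minimal NumPy sketch that computes a rank-k LSA approximation and folds a query vector into the concept space; the small random count matrix and the choice k = 2 are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(8, 5)).astype(float)   # toy term-by-document counts

# Full SVD: X = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                                # number of latent concepts (assumed)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

X_k = U_k @ S_k @ Vt_k                               # rank-k approximation X_k = U_k S_k V_k^T

doc_concepts = Vt_k                                  # column j is the document vector d_hat_j

# Fold a query (a term-count vector) into the same space: q_hat = S_k^{-1} U_k^T q
q = X[:, 0]                                          # pretend the first document is the query
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Cosine similarity between the query and each document in concept space
sims = (doc_concepts.T @ q_hat) / (
    np.linalg.norm(doc_concepts, axis=0) * np.linalg.norm(q_hat) + 1e-12
)
print("similarities:", np.round(sims, 3))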
b. Applications
The new concept space can typically be used to:
- Compare documents in the latent semantic space. This is useful for typical learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing:
  o Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
  o Polysemy is the phenomenon where the same word has multiple meanings. A search may therefore retrieve irrelevant documents containing the desired words with the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, translate it into the concept space and find matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in
  {(car), (truck), (flower)} -> {(1.3452 * car + 0.2828 * truck), (flower)}
  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to
  {(car), (bottle), (flower)} -> {(1.3452 * car + 0.2828 * bottle), (flower)}
  will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
1.2.2 Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (pLSA) [21][22] is a statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard LSA, pLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.
a. The Aspect Model
Suppose that we are given a collection of text documents D = {d_1, ..., d_N} with terms from a vocabulary W = {w_1, ..., w_M}. The starting point for pLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable z ∈ Z = {z_1, ..., z_K} is introduced to capture the hidden topics implied in the documents. Here, N, M and K are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over D × W by the mixture

P(d, w) = P(d) P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z) P(z \mid d) \qquad (1.1)

Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., d and w are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)). An equivalent parameterization of the joint probability is

P(d, w) = \sum_{z \in Z} P(z) P(d \mid z) P(w \mid z)

This is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the standard procedure for maximum likelihood estimation, i.e., Expectation Maximization (EM). EM iterates two coupled steps: (i) an expectation (E) step in which posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step in which parameters are updated. Standard calculations give us the E-step formula

P(z \mid d, w) = \frac{P(z) P(d \mid z) P(w \mid z)}{\sum_{z'} P(z') P(d \mid z') P(w \mid z')}

and the M-step re-estimates P(w \mid z), P(d \mid z) and P(z) from the expected counts computed in the E-step.
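As a concrete illustration of these alternating steps, the following NumPy sketch runs EM for the aspect model in its symmetric parameterization; the toy count matrix, the number of topics K and the fixed iteration count are assumptions for the example, not part of the original text.

import numpy as np

rng = np.random.default_rng(1)
n_dw = rng.integers(0, 4, size=(6, 10)).astype(float)  # toy document-word counts
N, M = n_dw.shape
K = 3  # number of latent topics (assumed)

# Random initialization of P(z), P(d|z), P(w|z)
p_z = np.full(K, 1.0 / K)
p_d_z = rng.random((K, N)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior P(z|d,w) for every (d, w) pair, shape (K, N, M)
    joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
    post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)

    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w)
    weighted = n_dw[None, :, :] * post
    p_w_z = weighted.sum(axis=1); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_d_z = weighted.sum(axis=2); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()

# Reduced description of each document: P(z|d) proportional to P(z) P(d|z)
p_z_d = p_z[:, None] * p_d_z
p_z_d /= p_z_d.sum(axis=0, keepdims=True)
print(np.round(p_z_d.T, 3))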
Let us consider the topic-conditional multinomial distributions P(·|z) over the vocabulary as points on the M−1 dimensional simplex of all possible multinomials. Via their convex hull, the K points define an L ≤ K−1 dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distributions P(w|d) for all documents are approximated by a multinomial representable as a convex combination of the P(w|z), in which the mixture components P(z|d) uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.
Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [53]
In order to clarify the relation to LSA, it is useful to reformulate the aspect model in matrix notation. Defining the matrices U = (P(d_i|z_k))_{i,k}, V = (P(w_j|z_k))_{j,k} and Σ = diag(P(z_k))_k, we can write the joint probability model P as a matrix product P = U Σ V^T. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of U and V reflect conditional independence in pLSA, and (ii) the mixture proportions in pLSA substitute for the singular values. Nevertheless, the main difference between pLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the L2 or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any kind of squared deviation. On the modeling side, this offers crucial advantages; for example, the mixture approximation P of the term-by-document matrix is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the pLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that d is a dummy index into the list of documents in the training set. Consequently, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures P(z|d) only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents: there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSA, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model for acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al. [41]).
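To give a feel for this growth, suppose (purely as an illustrative assumption) a 100-topic pLSI model over a 20,000-term vocabulary trained on 10,000 documents: it needs 100 x 20,000 = 2,000,000 topic-word parameters plus 100 x 10,000 = 1,000,000 document-topic parameters, and the latter term keeps growing with every additional training document, whereas the LDA model described next replaces those per-document parameters with a single K-dimensional hidden random variable per document governed by only K hyperparameters.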
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA.
1.3.1 Generative Model in LDA
Given a corpus of M documents denoted by D = {d_1, d_2, ..., d_M}, in which document m consists of N_m words drawn from a vocabulary of terms V = {t_1, ..., t_V}, the goal of LDA is to find the latent structure of "topics" or "concepts" which capture the meaning of the text, imagined to be obscured by "word choice" noise. Though the terminology of "hidden topics" or "latent concepts" was already encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.
Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words w_{m,n}, partitioned into documents d_m. For each of these documents, a topic proportion ϑ_m is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator z_{m,n} is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution φ_{z_{m,n}} is used to draw a word. The topics φ_k are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4; Figure 1.5 gives a list of all involved quantities.
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
// topic plate
for all topics k ∈ [1, K] do
    sample mixture components φ_k ~ Dir(β)
end for
// document plate
for all documents m ∈ [1, M] do
    sample mixture proportion ϑ_m ~ Dir(α)
    sample document length N_m ~ Poiss(ξ)
    // word plate
    for all words n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ~ Mult(ϑ_m)
        sample term for word w_{m,n} ~ Mult(φ_{z_{m,n}})
    end for
end for
Figure 1.4 Generative model for latent Dirichlet allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
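As a concrete companion to Figure 1.4, the following sketch simulates this generative process with NumPy; the corpus size, vocabulary size, and the symmetric hyperparameters α, β and Poisson mean ξ are assumed values chosen only for illustration.

import numpy as np

rng = np.random.default_rng(7)

K, V, M = 4, 50, 10                 # topics, vocabulary size, documents (assumed)
alpha, beta, xi = 0.5, 0.1, 20.0    # symmetric hyperparameters and mean length (assumed)

# Topic plate: one term distribution phi_k per topic
phi = rng.dirichlet(np.full(V, beta), size=K)        # shape (K, V)

corpus = []
for m in range(M):                                   # document plate
    theta_m = rng.dirichlet(np.full(K, alpha))       # topic proportions for document m
    N_m = rng.poisson(xi)                            # document length
    doc = []
    for _ in range(N_m):                             # word plate
        z = rng.choice(K, p=theta_m)                 # sample topic index z_{m,n}
        w = rng.choice(V, p=phi[z])                  # sample term w_{m,n} from topic z
        doc.append(w)
    corpus.append(doc)

print("first document (term ids):", corpus[0])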
M: number of documents to generate (constant scalar)
K: number of topics / mixture components (constant scalar)
V: number of terms t in the vocabulary (constant scalar)
α: hyperparameter on the mixing proportions (K-vector, or scalar if symmetric)
β: hyperparameter on the mixture components (V-vector, or scalar if symmetric)
ϑ_m: parameter notation for p(z|d=m), the topic mixture proportion for document m; one proportion for each document, Θ = {ϑ_m}, an M x K matrix
φ_k: parameter notation for p(t|z=k), the mixture component of topic k; one component for each topic, Φ = {φ_k}, a K x V matrix
N_m: document length (document-specific), here modelled with a Poisson distribution [2] with constant parameter ξ
z_{m,n}: mixture indicator that chooses the topic for the n-th word in document m
w_{m,n}: term indicator for the n-th word in document m
Figure 1.5 Quantities in the model of latent Dirichlet allocation
According to the model, the probability that a word w_{m,n} instantiates a particular term t given the LDA parameters is:

p(w_{m,n} = t \mid \vartheta_m, \Phi) = \sum_{k=1}^{K} p(w_{m,n} = t \mid \varphi_k)\, p(z_{m,n} = k \mid \vartheta_m) \qquad (1.7)
which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:

p(d_m, z_m, \vartheta_m, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta)\, p(\vartheta_m \mid \alpha) \prod_{n=1}^{N_m} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \vartheta_m) \qquad (1.8)

Specifying this distribution is often simple and useful as a basis for other derivations. We can then obtain the likelihood of a document d_m, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions ϑ_m and Φ and summing over z_{m,n}:

p(d_m \mid \alpha, \beta) = \iint p(\vartheta_m \mid \alpha)\, p(\Phi \mid \beta) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \varphi_{z_{m,n}})\, p(z_{m,n} \mid \vartheta_m)\, d\Phi\, d\vartheta_m \qquad (1.9)

= \iint p(\vartheta_m \mid \alpha)\, p(\Phi \mid \beta) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vartheta_m, \Phi)\, d\Phi\, d\vartheta_m \qquad (1.10)

Finally, the likelihood of the complete corpus W = {d_m}, m = 1, ..., M, is determined by the product of the likelihoods of the independent documents:

p(W \mid \alpha, \beta) = \prod_{m=1}^{M} p(d_m \mid \alpha, \beta) \qquad (1.11)
1.3.3 Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20].
Gibbs sampling is a special case of Markov chain Monte Carlo simulation in which the dimensions x_i of the target distribution are sampled alternately one at a time, conditioned on the values of all other dimensions, which we denote x_{-i}. The algorithm works as follows:
1. Choose dimension i (randomly or by permutation).
2. Sample x_i from p(x_i | x_{-i}).
Heinrich [20] has shown a sequence of calculations that leads to the full conditional for LDA:

p(z_i = k \mid z_{-i}, w) = \frac{n_{k,-i}^{(t)} + \beta}{\sum_{t=1}^{V} n_{k,-i}^{(t)} + V\beta} \cdot \frac{n_{m,-i}^{(k)} + \alpha}{\big[\sum_{k=1}^{K} n_{m}^{(k)}\big] - 1 + K\alpha} \qquad (1.15)

where n_k^{(t)} is the number of times term t is assigned to topic k, n_m^{(k)} is the number of words in document m assigned to topic k, and the subscript -i indicates that the current assignment of z_i is excluded from the counts.
The other hidden variables of LDA can then be calculated from these count statistics as follows:

\varphi_{k,t} = \frac{n_k^{(t)} + \beta}{\sum_{t=1}^{V} n_k^{(t)} + V\beta} \qquad (1.16)

\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha}{\sum_{k=1}^{K} n_m^{(k)} + K\alpha} \qquad (1.17)

The complete Gibbs sampling procedure is summarized in Figure 1.6:
// initialization
zero all count variables: n_m^(k), n_m, n_k^(t), n_k
for all documents m ∈ [1, M] do
    for all words n ∈ [1, N_m] in document m do
        sample topic index z_{m,n} ~ Mult(1/K)
        increment document-topic count: n_m^(k) + 1
        increment document-topic sum: n_m + 1
        increment topic-term count: n_k^(t) + 1
        increment topic-term sum: n_k + 1
    end for
end for
// Gibbs sampling over burn-in period and sampling period
while not finished do
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            // for the current assignment of k to a term t for word w_{m,n}:
            decrement counts and sums: n_m^(k) - 1, n_m - 1, n_k^(t) - 1, n_k - 1
            // multinomial sampling according to Eq. 1.15 (using the decremented counts):
            sample topic index k̃ ~ p(z_i | z_{-i}, w)
            // use the new assignment of k̃ to the term t for word w_{m,n} to:
            increment counts and sums: n_m^(k̃) + 1, n_m + 1, n_k̃^(t) + 1, n_k̃ + 1
        end for
    end for
    // check convergence and read out parameters
    if converged and L sampling iterations since last read out then
        // the different parameter read-outs are averaged
        read out parameter set Φ according to Eq. 1.16
        read out parameter set Θ according to Eq. 1.17
    end if
end while
Figure 1.6 Gibbs sampling algorithm for latent Dirichlet allocation
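The following NumPy sketch implements this collapsed Gibbs sampler for a toy corpus; the corpus, the symmetric hyperparameters and the number of iterations are assumptions for illustration, and the count arrays mirror n_m^(k) and n_k^(t) from Figure 1.6. The document-side denominator in Eq. 1.15 is constant across topics, so the sketch simply renormalizes.

import numpy as np

rng = np.random.default_rng(3)

# Toy corpus: each document is a list of term ids (assumed data).
corpus = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 2, 3]]
V = 5              # vocabulary size
K = 2              # number of topics (assumed)
alpha, beta = 0.5, 0.1
M = len(corpus)

n_mk = np.zeros((M, K))        # document-topic counts n_m^(k)
n_kt = np.zeros((K, V))        # topic-term counts n_k^(t)
n_k = np.zeros(K)              # topic sums n_k
z = []                         # topic assignment for every word

# initialization: random topic for each word
for m, doc in enumerate(corpus):
    z_m = []
    for t in doc:
        k = rng.integers(K)
        z_m.append(k)
        n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    z.append(z_m)

# Gibbs sampling over burn-in and sampling iterations
for _ in range(200):
    for m, doc in enumerate(corpus):
        for n, t in enumerate(doc):
            k = z[m][n]
            # decrement counts for the current assignment
            n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
            # full conditional, Eq. 1.15 (document denominator dropped, then renormalized)
            p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
            k = rng.choice(K, p=p / p.sum())
            # increment counts for the new assignment
            z[m][n] = k
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

# Read out parameters, Eqs. 1.16 and 1.17
phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
print(np.round(theta, 2))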
c. Inference
Given an estimated LDA model, we can do topic inference for unseen documents by a similar sampling procedure. A new document m̃ is represented as a vector of words w̃; our goal is to estimate the posterior distribution of topics z̃ given the word vector w̃ of the query and the LDA model L(Θ, Φ), i.e., p(z̃ | w̃, L). Starting from the joint distribution of the complete new document, reasoning similar to the derivation of Eq. 1.15 yields the Gibbs sampling update:

p(\tilde{z}_i = k \mid \tilde{w}_i = t, \tilde{z}_{-i}, \tilde{w}_{-i}; L) \propto \frac{n_k^{(t)} + \tilde{n}_{k,-i}^{(t)} + \beta}{\sum_{t=1}^{V} \big(n_k^{(t)} + \tilde{n}_{k,-i}^{(t)}\big) + V\beta} \cdot \frac{\tilde{n}_{\tilde{m},-i}^{(k)} + \alpha}{\big[\sum_{k=1}^{K} \tilde{n}_{\tilde{m}}^{(k)}\big] - 1 + K\alpha}

where the new variables ñ count the observations of term t and topic k in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: the high estimated word-topic associations n_k^{(t)} dominate the multinomial masses compared to the contributions of ñ_k^{(t)} and ñ_m̃^{(k)}, which are initialized randomly. Consequently, by repeatedly sampling from the distribution and updating ñ_m̃^{(k)}, the masses of topic-word associations are propagated into document-topic associations. Note the smoothing influence of the Dirichlet hyperparameters. Applying Eq. 1.17 then gives the topic distribution for the unknown document.
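Continuing the Gibbs sampler sketch above, a simplified fold-in for a new document can reuse the trained counts; the new document and the iteration count are assumed for illustration, and for brevity the new document's own topic-term counts (the ñ_k^(t) terms in the update above) are omitted from the numerator, a common simplification when the trained counts dominate.

# Fold in a new document using the counts learned in the previous sketch
# (this block assumes np, rng, K, V, alpha, beta, n_kt and n_k from that sketch).
new_doc = [1, 2, 2, 4]                  # term ids of the unseen document (assumed)
nn_k = np.zeros(K)                      # document-topic counts for the new document
zz = []
for t in new_doc:                       # random initialization
    k = rng.integers(K)
    zz.append(k); nn_k[k] += 1

for _ in range(100):                    # Gibbs updates with the trained model fixed
    for n, t in enumerate(new_doc):
        k = zz[n]; nn_k[k] -= 1
        # trained topic-term counts dominate; new-document topic counts are added on top
        p = (n_kt[:, t] + beta) / (n_k + V * beta) * (nn_k + alpha)
        k = rng.choice(K, p=p / p.sum())
        zz[n] = k; nn_k[k] += 1

theta_new = (nn_k + alpha) / (nn_k.sum() + K * alpha)   # Eq. 1.17 for the new document
print(np.round(theta_new, 2))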
1.3.4 Applications
LDA has been successfully applied to text modeling and feature reduction in text classification [5]. Recent work has also used LDA as a building block in more sophisticated topic models such as author-document models [42], abstract-reference models [15], syntax-semantic models [18] and image-caption models [6]. Additionally, the same kinds of modeling tools have been used in a variety of non-text settings, such as image processing [46] and the modeling of user profiles [17].
1.4 Chapter Summary
This chapter has presented some typical topic analysis methods such as LSA, pLSA, and LDA. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models. These models can also be seen as key components in our proposals in subsequent chapters.
Among the topic analysis methods, we pay particular attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven successful in many applications. Given the data, the goal is to reverse the generative process to estimate the model parameters. However, exact inference or estimation even for a not-so-complex model like LDA is intractable. Consequently, there have been many attempts to apply approximate approaches to this task, among which Gibbs sampling is one of the most suitable. Gibbs sampling, which was also described in this chapter, is a special case of Markov chain Monte Carlo (MCMC) and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA.
Chapter 2 Frameworks of Learning with Hidden Topics
2.1 Learning with External Resources: Related Works
In recent years, there have been many attempts to make use of external resources to enhance learning performance. Depending on the type of external resource, these methods can be roughly divided into two categories: semi-supervised learning, which exploits unlabeled data, and learning with the support of external repositories such as Wikipedia.
Consider, for example, the problem of classifying web pages, such as all the Computer Science faculty pages or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to be able to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g., deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection systems from images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. Besides, there has been a lot of work applying Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers works exploiting resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] have demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible in the sense that it depends heavily on the external resource and the application.
Trang 27This chapter describes frameworks for leaning with the support of topic model estimated
from a large universal dataset This topic model can be considered background knowledge
for the domain of application It also helps the learning process to capture hidden topics (of the domain), the relationships between topics and words as well as words and words,
thus partially overcome the limitations of different word choices in text
2.2 General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is how to benefit from huge sources of online data in order to enhance the quality of text/web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/web data analysis based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the "universal dataset", and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
2.2.1 Frameworks for Learning with Hidden Topics
Corresponding to two typical learning problems, i.e., classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1 Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge number of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine learning methods have been applied to text classification, including decision trees, neural networks, support vector machines, etc. In typical applications of machine learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and the newly arriving data with hidden topics from an available large dataset so as to enhance the performance of text classification.
Classification with hidden topics is depicted in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA, LDA, etc. is applied to this dataset. The result of this step is an estimated topic model which consists of hidden topics and the probability distributions of words over these topics. With this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics (the topics analyzed in the estimation phase) given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
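A minimal sketch of this pipeline is given below, using Gensim for the topic model and scikit-learn for the classifier; the toy universal dataset, the number of topics, the topic_features helper, and the choice of TF-IDF features with a linear SVM are our own illustrative assumptions, not prescriptions from the thesis.

import numpy as np
from gensim import corpora, models
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Step 1: estimate the topic model on a (tiny, assumed) tokenized universal dataset.
universal_docs = [
    ["web", "search", "clustering", "engine"],
    ["image", "caption", "retrieval", "photo"],
    ["stock", "market", "finance", "news"],
]
dictionary = corpora.Dictionary(universal_docs)
bows = [dictionary.doc2bow(d) for d in universal_docs]
lda = models.LdaModel(bows, id2word=dictionary, num_topics=10)  # 10 topics (assumed)

def topic_features(texts):
    """Infer a dense topic-distribution vector for each whitespace-tokenized text."""
    feats = np.zeros((len(texts), lda.num_topics))
    for i, text in enumerate(texts):
        bow = dictionary.doc2bow(text.split())
        for k, p in lda.get_document_topics(bow, minimum_probability=0.0):
            feats[i, k] = p
    return csr_matrix(feats)

# Step 2: combine original word features with inferred topics and train a classifier.
train_texts = ["football match results tonight", "stock market news and finance"]
train_labels = ["sports", "business"]
vectorizer = TfidfVectorizer()
X_train = hstack([vectorizer.fit_transform(train_texts), topic_features(train_texts)])
clf = LinearSVC().fit(X_train, train_labels)

# Step 3: enrich new data the same way before classification.
new_texts = ["web search for market news"]
X_new = hstack([vectorizer.transform(new_texts), topic_features(new_texts)])
print(clf.predict(X_new))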
b. Framework for Clustering
Figure 2.2 Clustering with Hidden Topics
Text clustering is the task of automatically generating groups (clusters) of documents based on the similarity or distance between documents. Unlike classification, the clusters are not known previously. The user can optionally specify the required number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones; divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and further away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, just to name a few.
Web clustering, which is text clustering specialized for web pages, can be offline or online. Offline clustering clusters the whole repository of available web documents and has no response-time constraint. In online clustering, the algorithms need to meet the "real-time condition", i.e., the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose the framework that enriches them with hidden topics for clustering (Figure 2.2). This framework shares its topic analysis with the one for classification; the difference is only due to the differences between classification and clustering.
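A corresponding sketch for the clustering framework is given below; it reuses the hypothetical topic_features helper defined in the classification sketch and applies agglomerative clustering to snippets enriched with their topic distributions, with the snippet list and the number of clusters chosen purely for illustration.

from scipy.sparse import hstack
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Short search-result snippets returned by a search engine (assumed examples).
snippets = [
    "giá vàng tăng mạnh trong phiên giao dịch",
    "đội tuyển bóng đá giành chiến thắng",
    "thị trường chứng khoán giảm điểm",
    "trận đấu bóng đá kết thúc với tỷ số hòa",
]

# Enrich the sparse snippet vectors with their hidden-topic distributions
# (topic_features is the helper from the classification sketch above).
tfidf = TfidfVectorizer().fit_transform(snippets)
X = hstack([tfidf, topic_features(snippets)]).toarray()

# Agglomerative (bottom-up) hierarchical clustering into two groups.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)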
2.2.2 Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, there is a key phase shared between them: the phase of analyzing topics for the previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of application.
- Preprocessing: this step is very important for getting good analysis results. Although there is no general recipe for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, and too frequent or too rare words (a minimal preprocessing sketch is given after this list).
- Methods for topic analysis: some analysis methods which can be applied have been mentioned in Chapter 1. The tradeoff between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as fast as possible to meet the "real-time" condition.
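The following sketch illustrates this kind of frequency-based filtering on a tokenized corpus; the stop-word list and the frequency thresholds are assumptions for the example and would need to be tuned for a real Vietnamese collection.

from collections import Counter

# Tokenized documents (word segmentation for Vietnamese is assumed to be done already).
docs = [
    ["giá", "vàng", "tăng", "trong", "phiên", "giao_dịch"],
    ["đội", "bóng_đá", "giành", "chiến_thắng", "trong", "trận", "đấu"],
    ["thị_trường", "chứng_khoán", "giảm", "trong", "phiên", "giao_dịch"],
]

stop_words = {"trong", "và", "của", "là"}     # tiny illustrative stop-word list
min_df, max_df_ratio = 1, 0.9                 # frequency thresholds (assumed)

# Document frequency of each term
df = Counter(term for doc in docs for term in set(doc))
max_df = max_df_ratio * len(docs)

def keep(term):
    # drop stop words, very rare terms, and terms occurring in almost every document
    return term not in stop_words and min_df <= df[term] <= max_df

cleaned = [[t for t in doc if keep(t)] for doc in docs]
print(cleaned)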
2.3 Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain or language. Once a topic model has been estimated from a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- This is particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be enriched with hidden topics, so that enhanced performance can be achieved.
- Because they learn with smaller data, the presented methods require less computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is a real advantage over the heuristic-based mappings in previous approaches [16][3][10].
2.4 Summary
In this chapter, we have described two general frameworks, and their advantages, for learning with hidden topics: one for classification and one for clustering. The key common phase between the two frameworks is topic analysis for a large-scale web collection called the "universal dataset". The quality of the topic model estimated from this data strongly influences the performance of learning in the later phases.
Chapter 3 Topic Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis for a universal dataset is key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter addresses the problem of topic analysis for large-scale web datasets in Vietnamese.
3.1 Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and the language was originally written with a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
Table 3.1 Vowels in Vietnamese

              Front      Central     Back
High          i  [i]     ư  [ɨ]      u  [u]
Upper Mid     ê  [e]     ơ  [ə]      ô  [o]
Lower Mid     e  [ɛ]     a  [a]      o  [ɔ]
The correspondence between the orthography and pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference is in the quality of the preceding vowel. For instance, "tai" (ear) is [tai] while "tay" (hand/arm) is [taj].
In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus a: these are "ia", "ua", and "ưa" (when followed by a consonant, they become "iê", "uô", and "ươ", respectively). The other diphthongs consist of a vowel plus a semivowel. There are two of these semivowels: /j/ (written i or y) and /w/ (written o or u). A majority of diphthongs in Vietnamese are formed this way.
Furthermore, these semivowels may also follow the first three diphthongs ("ia", "ua", "ưa"), resulting in triphthongs.
b. Tones
Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, contour melody, intensity, and glottalization (with or without accompanying constricted vocal cords).
Tone is indicated by diacritics written above or below the vowel (most of the tone diacritics appear above the vowel; however, the "nặng" tone dot diacritic goes below the vowel). The six tones in Vietnamese are:
Table 3.2 Tones in Vietnamese

Name               Description        Diacritic            Example
ngang  'level'     high level         (no mark)            ma  'ghost'
huyền  'hanging'   low falling        ` (grave accent)     mà  'but'
sắc    'sharp'     high rising        ´ (acute accent)     má  'cheek, mother (southern)'
hỏi    'asking'    dipping-rising     (hook)               mả  'tomb, grave'
ngã    'tumbling'  breaking-rising    ~ (tilde)            mã  'horse (Sino-Vietnamese), code'
nặng   'heavy'     constricted        (dot below)          mạ  'rice seedling'
c. Consonants
The consonants of the Hanoi variety are listed below in the Vietnamese orthography, except for the bilabial approximant, which is written here as "w" (in the writing system it is written the same as the vowels "o" and "u"). Some consonant sounds are written with only one letter (like "p"), other consonant sounds are written with a two-letter digraph (like "ph"), and others are written with more than one letter or digraph (the velar stop is written variously as "c", "k", or "q").