COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS
MASTER THESIS
HANOI - 2008
COLLEGE OF TECHNOLOGY
NGUYEN CAM TU
HIDDEN TOPIC DISCOVERY TOWARD CLASSIFICATION AND CLUSTERING IN VIETNAMESE WEB DOCUMENTS
Major: Information Technology
Specialization: Information Systems
Code: 60 48 05
MASTER THESIS
SUPERVISOR: Prof. Dr. Ha Quang Thuy
HANOI - 2008
Acknowledgements
My deepest thanks must first go to my research advisor, Prof. Dr. Ha Quang Thuy, who has been an endless source of inspiration in scientific research and who led me to this research area. I particularly appreciate his unconditional support and advice, in both the academic environment and daily life, during the last four years.
Many thanks go to Dr. Phan Xuan Hieu, who has given me much advice and many comments; this work would not have been possible without his support. I would also like to thank him for being my friend and older brother, who has taught me many lessons in both scientific research and daily life.
My thanks also go to all members of the "data mining" seminar group. In particular, I would like to thank BSc. Nguyen Thu Trang for helping me a great deal in collecting data and running experiments.
I gratefully acknowledge the invaluable support and advice, both technical and personal, of my teachers and colleagues in the Department of Information Systems, Faculty of Technology, Vietnam National University, Hanoi.
I also want to acknowledge the support of the Project QC.06.07 "Vietnamese Named Entity Resolution and Tracking crossover Web Documents", Vietnam National University, Hanoi; the Project 203906 "Information Extraction Models for finding Entities and Semantic Relations in Vietnamese Web Pages" of the Ministry of Science and Technology, Vietnam; and the National Project 02/2006/HĐ-ĐTCT-KC.01/06-10 "Developing content filter systems to support management and implementation public security – ensure policy".
Finally, from the bottom of my heart, I would especially like to thank all the members of my family and all my friends. They are truly an endless source of encouragement in my life.
Nguyen Cam Tu
Assurance
I certify that the achievements in this thesis are my own and are not copied from anyone else's results. Throughout the dissertation, everything presented is either my own proposal or is summarized from cited sources. All references have clear origins and are properly quoted. I am responsible for this statement.
Hanoi, November 15, 2007
Nguyen Cam Tu
Table of Contents

Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
  1.1 Introduction
  1.2 The Early Methods
    1.2.1 Latent Semantic Analysis
    1.2.2 Probabilistic Latent Semantic Analysis
  1.3 Latent Dirichlet Allocation
    1.3.1 Generative Model in LDA
    1.3.2 Likelihood
    1.3.3 Parameter Estimation and Inference via Gibbs Sampling
    1.3.4 Applications
  1.4 Summary
Chapter 2. Frameworks of Learning with Hidden Topics
  2.1 Learning with External Resources: Related Works
  2.2 General Learning Frameworks
    2.2.1 Frameworks for Learning with Hidden Topics
    2.2.2 Large-Scale Web Collections as Universal Dataset
  2.3 Advantages of the Frameworks
  2.4 Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
  3.1 Some Characteristics of Vietnamese
    3.1.1 Sound
    3.1.2 Syllable Structure
    3.1.3 Vietnamese Word
  3.2 Preprocessing and Transformation
    3.2.1 Sentence Segmentation
    3.2.2 Sentence Tokenization
    3.2.3 Word Segmentation
    3.2.4 Filters
    3.2.5 Remove Non Topic-Oriented Words
  3.3 Topic Analysis for VnExpress Dataset
  3.4 Topic Analysis for Vietnamese Wikipedia Dataset
  3.5 Discussion
  3.6 Summary
Chapter 4. Deployments of General Frameworks
  4.1 Classification with Hidden Topics
    4.1.1 Classification Method
    4.1.2 Experiments
  4.2 Clustering with Hidden Topics
    4.2.1 Clustering Method
    4.2.2 Experiments
  4.3 Summary
Conclusion
  Achievements throughout the thesis
  Future Works
References
  Vietnamese References
  English References
Appendix: Some Clustering Results
List of Figures

Figure 1.1 Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [55]
Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [55]
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for Latent Dirichlet Allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of Latent Dirichlet Allocation
Figure 1.6 Gibbs sampling algorithm for Latent Dirichlet Allocation
Figure 2.1 Classification with Hidden Topics
Figure 2.2 Clustering with Hidden Topics
Figure 3.1 Pipeline of Data Preprocessing and Transformation
Figure 4.1 Classification with VnExpress topics
Figure 4.2 Combination of one snippet with its topics: an example
Figure 4.3 Learning with different topic models of the VnExpress dataset, and the baseline (without topics)
Figure 4.4 Test-out-of-train with increasing numbers of training examples; the number of topics is set at 60
Figure 4.5 F1-measure for each class and the average over all classes in learning with 60 topics
Figure 4.6 Clustering with Hidden Topics
Figure 4.7 Dendrogram in Agglomerative Hierarchical Clustering
Figure 4.8 Precision of the top 5 (and 10, 20) in the best clusters for each query
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query
List of Tables

Table 3.1 Vowels in Vietnamese
Table 3.2 Tones in Vietnamese
Table 3.3 Consonants of the Hanoi variety
Table 3.4 Structure of Vietnamese syllables
Table 3.5 Functional words in Vietnamese
Table 3.6 Statistics of topics assigned by humans in the VnExpress dataset
Table 3.7 Statistics of the VnExpress dataset
Table 3.8 Most likely words for sample topics; here, we conduct topic analysis with 100 topics
Table 3.9 Statistics of the Vietnamese Wikipedia dataset
Table 3.10 Most likely words for sample topics; here, we conduct topic analysis with 200 topics
Table 4.1 Google search results as training and testing dataset; the search phrases for training and test data are designed to be exclusive
Table 4.2 Experimental results of the baseline (learning without topics)
Table 4.3 Experimental results of learning with 60 topics of the VnExpress dataset
Table 4.4 Some collocations with the highest values of the chi-square statistic
Table 4.5 Queries submitted to Google
Table 4.6 Parameters for clustering web search results
Notations & Abbreviations

Word or phrase: Abbreviation
Information Retrieval: IR
Latent Semantic Analysis: LSA
Probabilistic Latent Semantic Analysis: PLSA
Latent Dirichlet Allocation: LDA
Dynamic Topic Models: DTM
Correlated Topic Models: CTM
Singular Value Decomposition: SVD
Introduction
The World Wide Web has influenced many aspects of our lives, changing the way we communicate, conduct business, shop, entertain, and so on. However, a large portion of the Web data is not organized in systematic and well-structured forms, a situation which causes great challenges to those seeking information on the Web. Consequently, many tasks enabling users to search, navigate and organize web pages in a more effective way have been posed in the last decade, such as searching, page ranking, web clustering, and text classification. To this end, there have been many success stories, such as Google, Yahoo, the Open Directory Project (Dmoz), and Clusty, to name but a few.
Inspired by this trend, the aim of this thesis is to develop efficient systems which are able to overcome the difficulties of dealing with sparse data. The main motivation is that, while being overwhelmed by a huge amount of online data, we sometimes lack the data needed to search or learn effectively. Take web search clustering as an example: in order to meet the real-time condition, that is, a sufficiently short response time, most online clustering systems only work with the small pieces of text returned from search engines. Unfortunately, those pieces are not long and rich enough to build a good clustering system. A similar situation occurs in the case of searching images based only on their captions. Because image captions are very short and sparse chunks of text, most current image retrieval systems still fail to achieve high accuracy. As a result, much effort has been made recently to take advantage of external resources, such as learning with knowledge-base support or semi-supervised learning, in order to improve accuracy. These approaches, however, have some difficulties: (1) constructing a knowledge base is very time-consuming and labor-intensive, and (2) the results of semi-supervised learning in one application cannot be reused in another, even in the same domain.
In this thesis, we introduce two general frameworks for learning with hidden topics discovered from large-scale data collections: one for clustering and another for classification. Unlike semi-supervised learning, we approach this issue from the point of view of text/web data analysis, based on recently successful topic analysis models such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. The underlying idea of the frameworks is that, for a given domain, we collect a very large external data collection called the "universal dataset", and then build the learner on both the original data (such as snippets or image captions) and a rich set of hidden topics discovered from the universal data collection. The frameworks are flexible and general enough to apply to a wide range of domains and languages. Once we have analyzed a universal dataset, the resulting hidden topics can be used for several learning tasks in the same domain. This is also particularly useful for sparse data mining: sparse data such as snippets returned from a search engine can be expanded and enriched with hidden topics, so that better performance can be achieved. Moreover, because the method learns with smaller data (the meaningful hidden topics rather than all the unlabeled data), it requires less computational resources than semi-supervised learning.
Roadmap: The organization of this thesis is as follows.
Chapter 1 reviews some typical topic analysis methods such as Latent Semantic Analysis, Probabilistic Latent Semantic Analysis, and Latent Dirichlet Allocation. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text, and can be used to develop more sophisticated and application-oriented models such as hierarchical models, author-role models, entity models, and so on. They are also key components of our proposals in the subsequent chapters.
Chapter 2 introduces two general frameworks for learning with hidden topics: one for classification and one for clustering. These frameworks are flexible and general enough to apply in many application domains. The key phase shared between the two frameworks is topic analysis of large-scale collections of web documents. The quality of the hidden topics described in this chapter strongly influences the performance of the subsequent stages.
Chapter 3 summarizes some major issues in analyzing collections of Vietnamese documents and web pages. We first review some characteristics of Vietnamese which are significant for data preprocessing and transformation in the subsequent processes. Next, we discuss each step of preprocessing and transforming the data in more detail. Important notes, including specific characteristics of Vietnamese, are highlighted. We also present the results of topic analysis using LDA on the clean, preprocessed dataset.
Chapter 4 describes the deployment of the general frameworks proposed in Chapter 2 for two tasks: search result classification and search result clustering. The two implementations are based on the topic models estimated from a universal dataset as shown in Chapter 3.
The Conclusion sums up the achievements of the previous four chapters. Some future research topics are also mentioned in this section.
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1 Introduction
The goal of modeling text corpora and other collections of discrete data is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, clustering, summarization, and similarity and relevance judgments.
Significant achievements have been made on this problem by researchers in the context of information retrieval (IR). The vector space model (Salton and McGill, 1983) [48], a methodology successfully deployed in modern search technologies, is a typical approach proposed by IR researchers for modeling text corpora. In this model, documents are represented as vectors in a multidimensional Euclidean space. Each axis in this space corresponds to a term (word). The i-th coordinate of a vector is some function of the number of times the i-th term occurs in the document represented by the vector. The end result is a term-by-document matrix X whose columns contain the coordinates of each of the documents in the corpus. Thus, this model reduces documents of arbitrary length to fixed-length lists of numbers.
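To make the vector space representation concrete, the following minimal Python sketch builds a term-by-document count matrix for a tiny toy corpus. The corpus, the raw-count weighting, and all variable names are illustrative assumptions of this sketch; real systems typically use a weighting such as tf-idf.

```python
from collections import Counter

# A toy corpus: each document is a list of tokens (assumed already tokenized).
docs = [
    "hidden topic models for web search".split(),
    "clustering web search snippets".split(),
    "vietnamese word segmentation".split(),
]

# Vocabulary: one axis (row) per distinct term.
vocab = sorted({t for d in docs for t in d})
index = {t: i for i, t in enumerate(vocab)}

# Term-by-document matrix X: X[i][j] = number of times term i occurs in document j.
X = [[0] * len(docs) for _ in vocab]
for j, d in enumerate(docs):
    for term, count in Counter(d).items():
        X[index[term]][j] = count

for term, row in zip(vocab, X):
    print(f"{term:15s} {row}")
```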
While the vector space model has some appealing features, notably its basic identification of sets of words that are discriminative for documents in the collection, the approach provides only a relatively small reduction in description length and reveals little of the inter- or intra-document statistical structure. To overcome these shortcomings, IR researchers have proposed other modeling methods such as the generalized vector space model and the topic-based vector space model, among which latent semantic analysis (LSA; Deerwester et al., 1990) [13][26] is the most notable. LSA uses a singular value decomposition of the term-by-document matrix X to identify a linear subspace in the space of term-weight features that captures most of the variance in the collection. This approach can achieve considerable compression of large collections. Furthermore, Deerwester et al. argue that this method can reveal some aspects of basic linguistic notions such as synonymy and polysemy.
In 1998, Papadimitriou et al. [40] developed a generative probabilistic model of text corpora in order to study the ability of the LSA approach to recover aspects of the generative model from data. However, once we have a generative model in hand, it is not clear why we should follow the LSA approach; we can attempt to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
Probabilistic LSI (pLSI; Hofmann, 1999) [21][22] is a significant step in this regard. pLSI models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics". Consequently, each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a probability distribution over a fixed set of topics. This distribution can be considered a "reduced description" associated with the document.
While Hofmann's work is a useful step toward probabilistic text modeling, it suffers from severe overfitting problems: the number of parameters grows linearly with the number of documents. Additionally, although pLSI is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Latent Dirichlet Allocation (LDA) [5][20], proposed by Blei et al. (2003), is one solution to these problems. Like all of the above methods, LDA is based on the "bag of words" assumption, namely that the order of words in a document can be neglected. In addition, although less often stated formally, these methods also assume that documents are exchangeable: the specific ordering of the documents in a corpus can also be neglected. According to de Finetti (1990), any collection of exchangeable random variables can be represented as a mixture distribution, in general an infinite mixture. Thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents. This is the key idea of the LDA model, which we will consider carefully in Section 1.3.
More recently, Blei et al. have developed two extensions to LDA: Dynamic Topic Models (DTM, 2006) [7] and Correlated Topic Models (CTM, 2007) [8]. DTM is suitable for time-series data analysis thanks to the non-exchangeable nature of its document modeling. CTM, on the other hand, is capable of revealing topic correlations; for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. Though CTM gives a better fit to the data than LDA, it is complicated by the fact that it loses the conjugate relationship between the prior distribution and the likelihood.
In the following sections, we discuss the issues behind these modeling methods in more detail, with particular attention to LDA, a well-known model that has shown its efficiency and success in many applications.
1.2 The Early Methods
1.2.1 Latent Semantic Analysis
The main challenge for machine learning systems is to capture the distinction between the lexical level of "what actually has been said or written" and the semantic level of "what was intended" or "what was referred to" in a text or utterance. This problem is twofold: (i) polysemy, i.e., a word may have multiple meanings and multiple types of usage in different contexts, and (ii) synonymy and semantically related words, i.e., different words may have a similar sense and, at least in certain contexts, denote the same concept, or the same topic in a weaker sense.
Latent semantic analysis (LSA; Deerwester et al., 1990) [13][24][26] is a well-known technique which partially addresses this problem. The key idea is to map documents from vectors in the word space to a lower-dimensional representation in the so-called concept space or latent semantic space. Mathematically, LSA relies on singular value decomposition (SVD), a well-known factorization method in linear algebra.
a. Latent Semantic Analysis by SVD
In the first step, we represent the text corpus as a term-by-document matrix in which element $(i, j)$ describes the occurrences of term $i$ in document $j$. Let $X$ be such a matrix; it has the form

$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$$

A row of this matrix is a vector corresponding to a term, giving its relation to each document: $\vec{t}_i^{\,T} = [x_{i,1} \cdots x_{i,n}]$. Likewise, a column is a vector corresponding to a document, giving its relation to each term: $\vec{d}_j = [x_{1,j} \cdots x_{m,j}]^T$.

The dot product $\vec{t}_i^{\,T}\vec{t}_p$ between two term vectors gives their correlation over the documents. The matrix product $XX^T$ contains all such dot products: element $(i, p)$ (which equals element $(p, i)$ due to symmetry) contains the dot product $\vec{t}_i^{\,T}\vec{t}_p$. Similarly, the matrix $X^TX$ contains the dot products between all document vectors, giving their correlation over the terms: $\vec{d}_j^{\,T}\vec{d}_q$.

In the next step, we perform the standard SVD of the matrix $X$ and obtain $X = U\Sigma V^T$, where $U$ and $V$ are orthogonal matrices ($U^TU = V^TV = I$) and the diagonal matrix $\Sigma$ contains the singular values of $X$. The matrix products giving the term and document correlations then become

$$XX^T = U\Sigma\Sigma^T U^T \qquad \text{and} \qquad X^TX = V\Sigma^T\Sigma V^T$$

respectively. The values $\sigma_1, \ldots, \sigma_l$ are called the singular values, and $u_1, \ldots, u_l$ and $v_1, \ldots, v_l$ the left and right singular vectors. Note that the only part of $U$ that contributes to $\vec{t}_i$ is the $i$-th row; let this row vector be called $\hat{t}_i$. Likewise, the only part of $V^T$ that contributes to $\vec{d}_j$ is the $j$-th column, $\hat{d}_j$. These are not the eigenvectors, but they depend on all the singular vectors.

The LSA approximation of $X$ is computed by selecting the $k$ largest singular values and their corresponding singular vectors from $U$ and $V$. This results in the rank-$k$ approximation to $X$ with the smallest error. The appealing thing about this approximation is that not only does it have minimal error, but it also translates the term and document vectors into a concept space. The vector $\hat{t}_i$ then has $k$ entries, each giving the relation of term $i$ to one of the $k$ concepts. Similarly, the vector $\hat{d}_j$ gives the relation between document $j$ and each concept. We write this approximation as $X_k = U_k\Sigma_k V_k^T$. Based on this approximation, we can now do the following:

- See how related documents $j$ and $q$ are in the concept space by comparing the vectors $\hat{d}_j$ and $\hat{d}_q$ (usually by cosine similarity). This gives us a clustering of the documents.
- Compare terms $i$ and $p$ by comparing the vectors $\hat{t}_i$ and $\hat{t}_p$, giving us a clustering of the terms in the concept space.
- Given a query, view it as a mini document and compare it to the documents in the concept space.

To do the latter, we must first translate the query into the concept space with the same transformation used on the documents, i.e., $\hat{d}_j = \Sigma_k^{-1}U_k^T\vec{d}_j$; hence, for a query vector $\vec{q}$ we compute $\hat{q} = \Sigma_k^{-1}U_k^T\vec{q}$ before comparing it to the document vectors in the concept space.
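As a small illustration of the machinery above, the sketch below (using numpy, which is an assumption of this example rather than a tool referenced by the thesis) computes a rank-k LSA approximation of a toy term-by-document matrix and folds a query into the concept space via $\hat{q} = \Sigma_k^{-1}U_k^T\vec{q}$.

```python
import numpy as np

# Toy term-by-document count matrix X (rows: terms, columns: documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2  # number of latent concepts to keep

# Full SVD: X = U @ diag(s) @ Vt, then rank-k truncation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Document j's coordinates in the concept space are the j-th column of Vt_k,
# i.e. d_hat_j = S_k^{-1} U_k^T d_j.
doc_concepts = Vt_k

# Fold a query (a "mini document" over the same vocabulary) into the concept space.
q = np.array([1, 0, 0, 1], dtype=float)
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_hat, doc_concepts[:, j]) for j in range(X.shape[1])]
print("query-document similarities:", np.round(scores, 3))
```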
b. Applications
The new concept space can typically be used to:
- Compare the documents in the latent semantic space. This is useful for typical learning tasks such as data clustering or document classification.
- Find similar documents across languages, after analyzing a base set of translated documents.
- Find relations between terms (synonymy and polysemy). Synonymy and polysemy are fundamental problems in natural language processing:
  o Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
  o Polysemy is the phenomenon where the same word has multiple meanings, so a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
- Given a query of terms, translate it into the concept space and find matching documents (information retrieval).
c. Limitations
LSA has two drawbacks:
- The resulting dimensions might be difficult to interpret. For instance, in

  {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}

  the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to

  {(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)}

  will occur. This leads to results which can be justified on the mathematical level but have no interpretable meaning in natural language.
- The probabilistic model of LSA does not match the observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
1.2.2 Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (PLSA) [21][22] is a statistical technique for the analysis of two-mode and co-occurrence data, with applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. Compared to standard LSA, PLSA is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.
a. The Aspect Model
Suppose that we are given a collection of text documents $D = \{d_1, \ldots, d_N\}$ with terms from a vocabulary $W = \{w_1, \ldots, w_M\}$. The starting point for PLSA is a statistical model called the aspect model. The aspect model is a latent variable model for co-occurrence data in which an unobserved variable $z \in Z = \{z_1, \ldots, z_K\}$ is introduced to capture the hidden topics implied in the documents. Here, $N$, $M$ and $K$ are the numbers of documents, words, and topics respectively. Hence, we model the joint probability over $D \times W$ by the mixture

$$P(d, w) = P(d)\, P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \qquad (1.1)$$

Like virtually all statistical latent variable models, the aspect model relies on a conditional independence assumption, i.e., $d$ and $w$ are independent conditioned on the state of the associated latent variable (the graphical model representing this is shown in Figure 1.1(a)).
Figure 1.1 Graphical model representation of the aspect model in the asymmetric (a) and symmetric (b) parameterization [53]
It is worth noting that the aspect model can be equivalently parameterized by (cf. Figure 1.1(b))

$$P(d, w) = \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z) \qquad (1.2)$$

which is perfectly symmetric with respect to both documents and words.
b. Model Fitting with the Expectation Maximization Algorithm
The aspect model is estimated by the standard procedure for maximum likelihood estimation, i.e., Expectation Maximization (EM). EM iterates two coupled steps: (i) an expectation (E) step, in which posterior probabilities are computed for the latent variables, and (ii) a maximization (M) step, in which the parameters are updated. Standard calculations give the E-step formula

$$P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z' \in Z} P(z')\, P(d \mid z')\, P(w \mid z')} \qquad (1.3)$$

and the M-step re-estimation formulas

$$P(w \mid z) \propto \sum_{d \in D} n(d, w)\, P(z \mid d, w) \qquad (1.4)$$

$$P(d \mid z) \propto \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.5)$$

$$P(z) \propto \sum_{d \in D} \sum_{w \in W} n(d, w)\, P(z \mid d, w) \qquad (1.6)$$

where $n(d, w)$ denotes the number of times term $w$ occurred in document $d$.
c. Probabilistic Latent Semantic Space
Let us consider the topic-conditional multinomial distributions $P(w \mid z)$ over the vocabulary as points on the $M-1$ dimensional simplex of all possible multinomials. Via their convex hull, the $K$ points define an $L \le K-1$ dimensional sub-simplex. The modeling assumption expressed by (1.1) is that the conditional distributions $P(w \mid d)$ for all documents are approximated by a multinomial representable as a convex combination of the $P(w \mid z)$, in which the mixture components $P(z \mid d)$ uniquely define a point on the spanned sub-simplex, which can be identified with a concept space. A simple illustration of this idea is shown in Figure 1.2.

Figure 1.2 Sketch of the probability sub-simplex spanned by the aspect model [53]

In order to clarify the relation to LSA, it is useful to reformulate the aspect model as parameterized by (1.2) in matrix notation. By defining the matrices $\hat{U} = (P(d_i \mid z_k))_{i,k}$, $\hat{V} = (P(w_j \mid z_k))_{j,k}$ and $\hat{\Sigma} = \mathrm{diag}(P(z_k))_k$, we can write the joint probability model $P$ as the matrix product $P = \hat{U}\hat{\Sigma}\hat{V}^T$. Comparing this with SVD, we can draw the following observations: (i) outer products between rows of $\hat{U}$ and $\hat{V}$ reflect conditional independence in PLSA, and (ii) the mixture proportions in PLSA substitute for the singular values. Nevertheless, the main difference between PLSA and LSA lies in the objective function used to specify the optimal approximation. While LSA uses the $L_2$ or Frobenius norm, which corresponds to an implicit additive Gaussian noise assumption on counts, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model. As is well known, this corresponds to a minimization of the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model, which is very different from any type of squared deviation. On the modeling side, this offers crucial advantages; for example, the mixture approximation $P$ of the co-occurrence table is a well-defined probability distribution. In contrast, LSA does not define a properly normalized probability distribution, and the approximation of the term-by-document matrix may contain negative entries. In addition, there is no obvious interpretation of the directions in the LSA latent space, while the directions in the PLSA space are interpretable as multinomial word distributions. The probabilistic approach can also take advantage of the well-established statistical theory for model selection and complexity control, e.g., to determine the optimal number of latent space dimensions. Choosing the number of dimensions in LSA, on the other hand, is typically based on ad hoc heuristics.
d. Limitations
In the aspect model, notice that $d$ is a dummy index into the list of documents in the training set. Consequently, $d$ is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures $p(z \mid d)$ only for those documents on which it is trained. For this reason, pLSI is not a well-defined generative model of documents; there is no natural way to assign probability to a previously unseen document.
A further difficulty with pLSI, which also originates from the use of a distribution indexed by training documents, is that the number of parameters grows linearly with the number of training documents. The parameters for a K-topic pLSI model are K multinomial distributions of size V and M mixtures over the K hidden topics. This gives KV + KM parameters and therefore linear growth in M. The linear growth in parameters suggests that the model is prone to overfitting and, empirically, overfitting is indeed a serious problem. In practice, a tempering heuristic is used to smooth the parameters of the model to obtain acceptable predictive performance. It has been shown, however, that overfitting can occur even when tempering is used (Popescul et al., 2001, [41]).
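As a purely hypothetical illustration of this growth (the numbers are ours, not taken from any experiment in the thesis): with K = 100 topics, a vocabulary of V = 20,000 terms and M = 100,000 training documents, a pLSI model has KV + KM = 2,000,000 + 10,000,000 = 12,000,000 free parameters, and every additional training document adds another K = 100 parameters. By contrast, the number of parameters in LDA, discussed next, does not grow with the number of documents, since the per-document topic proportions are treated as latent variables drawn from a single K-dimensional Dirichlet prior.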
Latent Dirichlet Allocation (LDA), which is described in Section 1.3, overcomes both of these problems by treating the topic mixture weights as a K-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.
1.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) [7][20] is a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003. By nature, LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. In the following sections, we discuss the generative model, parameter estimation, and inference in LDA.
1.3.1 Generative Model in LDA
Given a corpus of $M$ documents denoted by $D = \{d_1, d_2, \ldots, d_M\}$, in which document $m$ consists of $N_m$ words $w_i$ drawn from a vocabulary of terms $\{t_1, \ldots, t_V\}$, the goal of LDA is to find the latent structure of "topics" or "concepts" which captures the meaning of the text, which is imagined to be obscured by "word choice" noise. Though the terminology of "hidden topics" or "latent concepts" has already been encountered in LSA and pLSA, LDA provides a complete generative model that has shown better results than the earlier approaches.

Consider the graphical model representation of LDA shown in Figure 1.3. The generative process can be interpreted as follows: LDA generates a stream of observable words $w_{m,n}$, partitioned into documents $\vec{d}_m$. For each of these documents, a topic proportion $\vec{\vartheta}_m$ is drawn, and from this, topic-specific words are emitted. That is, for each word, a topic indicator $z_{m,n}$ is sampled according to the document-specific mixture proportion, and then the corresponding topic-specific term distribution $\vec{\varphi}_{z_{m,n}}$ is used to draw a word. The topics $\vec{\varphi}_k$ are sampled once for the entire corpus. The complete (annotated) generative model is presented in Figure 1.4, and Figure 1.5 gives a list of all the quantities involved.
Figure 1.3 Graphical model representation of LDA. The boxes are "plates" representing replicates: the outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document [20]
Figure 1.4 Generative model for Latent Dirichlet Allocation; here, Dir, Poiss and Mult stand for the Dirichlet, Poisson and Multinomial distributions respectively
Figure 1.5 Quantities in the model of Latent Dirichlet Allocation
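To make the generative story of Figures 1.3-1.5 concrete, here is a minimal Python sketch of the process; the vocabulary size, hyperparameter values, and Poisson document-length parameter are illustrative assumptions only, not settings used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, M = 5, 50, 3          # topics, vocabulary size, documents (toy values)
alpha = np.full(K, 0.5)      # Dirichlet prior on per-document topic proportions
beta = np.full(V, 0.1)       # Dirichlet prior on per-topic term distributions
xi = 20                      # Poisson parameter for document length

# Topic plate: sample one term distribution phi_k per topic (once for the corpus).
phi = rng.dirichlet(beta, size=K)            # shape (K, V)

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)           # document-specific topic proportions
    N_m = rng.poisson(xi)                    # document length
    doc = []
    for n in range(N_m):
        z_mn = rng.choice(K, p=theta_m)      # topic indicator for this word
        w_mn = rng.choice(V, p=phi[z_mn])    # word drawn from that topic's distribution
        doc.append(w_mn)
    corpus.append(doc)

print([len(d) for d in corpus])
```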
1.3.2 Likelihood
According to the model, the probability that a word $w_{m,n}$ instantiates a particular term $t$, given the LDA parameters, is

$$p(w_{m,n}=t \mid \vec{\vartheta}_m, \underline{\Phi}) = \sum_{k=1}^{K} p(w_{m,n}=t \mid \vec{\varphi}_k)\, p(z_{m,n}=k \mid \vec{\vartheta}_m) \qquad (1.7)$$

which corresponds to one iteration on the word plate of the graphical model. From the topology of the graphical model, we can further specify the complete-data likelihood of a document, i.e., the joint distribution of all known and hidden variables given the hyperparameters:

$$p(\vec{w}_m, \vec{z}_m, \vec{\vartheta}_m, \underline{\Phi} \mid \vec{\alpha}, \vec{\beta}) = \underbrace{\prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)}_{\text{word plate}} \cdot \underbrace{p(\vec{\vartheta}_m \mid \vec{\alpha})}_{\text{document plate (1 document)}} \cdot \underbrace{p(\underline{\Phi} \mid \vec{\beta})}_{\text{topic plate}} \qquad (1.8)$$

Specifying this distribution is often simple and useful as a basis for other derivations. We can then obtain the likelihood of a document $\vec{d}_m$, i.e., of the joint event of all word occurrences, as one of its marginal distributions by integrating out the distributions $\vec{\vartheta}_m$ and $\underline{\Phi}$ and summing over $z_{m,n}$:

$$p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n=1}^{N_m} \sum_{z_{m,n}} p(w_{m,n} \mid \vec{\varphi}_{z_{m,n}})\, p(z_{m,n} \mid \vec{\vartheta}_m)\; d\underline{\Phi}\, d\vec{\vartheta}_m \qquad (1.9)$$

$$\hphantom{p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta})} = \iint p(\vec{\vartheta}_m \mid \vec{\alpha})\, p(\underline{\Phi} \mid \vec{\beta}) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\vartheta}_m, \underline{\Phi})\; d\underline{\Phi}\, d\vec{\vartheta}_m \qquad (1.10)$$

Finally, the likelihood of the complete corpus $\mathcal{W} = \{\vec{d}_m\}_{m=1}^{M}$ is determined by the product of the likelihoods of the independent documents:

$$p(\mathcal{W} \mid \vec{\alpha}, \vec{\beta}) = \prod_{m=1}^{M} p(\vec{d}_m \mid \vec{\alpha}, \vec{\beta}) \qquad (1.11)$$
1.3.3 Parameter Estimation and Inference via Gibbs Sampling
Exact estimation for LDA is generally intractable. The common solution is to use approximate inference algorithms such as mean-field variational expectation maximization, expectation propagation, and Gibbs sampling [20].
a. Gibbs Sampling

Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) simulation which obtains samples from a high-dimensional distribution $p(\vec{x})$ by sampling each dimension in turn from its univariate conditional $p(x_i \mid \vec{x}_{\neg i})$, where $\vec{x}_{\neg i}$ denotes all components of $\vec{x}$ except $x_i$:

$$p(x_i \mid \vec{x}_{\neg i}) = \frac{p(\vec{x})}{p(\vec{x}_{\neg i})} = \frac{p(\vec{x})}{\int p(\vec{x})\, dx_i} \qquad (1.12)$$

For models containing hidden variables $\vec{z}$, their posterior given the evidence, $p(\vec{z} \mid \vec{x})$, is the distribution of interest, and the sampler iterates over the full conditionals of the hidden variables:

$$p(z_i \mid \vec{z}_{\neg i}, \vec{x}) = \frac{p(\vec{z}, \vec{x})}{p(\vec{z}_{\neg i}, \vec{x})} = \frac{p(\vec{z}, \vec{x})}{\int p(\vec{z}, \vec{x})\, dz_i} \qquad (1.13)$$

b. Gibbs Sampling for LDA

In LDA, the hidden variables are the topic assignments $z_{m,n}$ of the individual words; the multinomial parameters $\underline{\Theta}$ and $\underline{\Phi}$ can be integrated out (collapsed), since the joint distribution factorizes as

$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) \qquad (1.14)$$

From this joint distribution, the full conditional for the topic $z_i$ of the $i$-th word, whose term is $t$ and which belongs to document $m$, can be derived:

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} n_m^{(z)} + \alpha_z\right] - 1} \qquad (1.15)$$

where $n_k^{(t)}$ is the number of times term $t$ has been assigned to topic $k$, $n_m^{(k)}$ is the number of words in document $m$ assigned to topic $k$, and the subscript $\neg i$ indicates that the current assignment of $z_i$ is excluded from the counts.
The other hidden variables of LDA, i.e., the multinomial parameters $\underline{\Phi}$ and $\underline{\Theta}$, can then be calculated from the count statistics as follows:

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \beta_v} \qquad (1.16)$$

$$\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{z=1}^{K} n_m^{(z)} + \alpha_z} \qquad (1.17)$$

The complete procedure is summarized in Figure 1.6.
    - initialization: zero all count variables n_m^(k), n_m, n_k^(t), n_k
    for all documents m ∈ [1, M] do
        for all words n ∈ [1, N_m] in document m do
            sample topic index z_{m,n} ~ Mult(1/K)
            increment document-topic count: n_m^(k) + 1
            increment document-topic sum: n_m + 1
            increment topic-term count: n_k^(t) + 1
            increment topic-term sum: n_k + 1
        end for
    end for
    - Gibbs sampling over burn-in period and sampling period
    while not finished do
        for all documents m ∈ [1, M] do
            for all words n ∈ [1, N_m] in document m do
                - for the current assignment of k to a term t for word w_{m,n}:
                decrement counts and sums: n_m^(k) − 1, n_m − 1, n_k^(t) − 1, n_k − 1
                - sample a new topic index k' ~ p(z_i | z_¬i, w) according to Eq. 1.15
                - use the new assignment of k' to the term t for word w_{m,n} to:
                increment counts and sums: n_m^(k') + 1, n_m + 1, n_k'^(t) + 1, n_k' + 1
            end for
        end for
        - check convergence and read out parameters
        if converged and L sampling iterations since last read out then
            - the different parameter read-outs are averaged
            read out parameter set Φ according to Eq. 1.16
            read out parameter set Θ according to Eq. 1.17
        end if
    end while

Figure 1.6 Gibbs sampling algorithm for Latent Dirichlet Allocation
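A compact Python rendering of the algorithm in Figure 1.6 is sketched below. It operates on a toy corpus of word-id lists; the count arrays follow the $n_m^{(k)}$, $n_k^{(t)}$ notation above, while the symmetric hyperparameter values, iteration count, and merged burn-in/sampling loop are simplifying assumptions of this sketch.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA on `docs`, a list of lists of word ids."""
    rng = rng or np.random.default_rng(0)
    n_mk = np.zeros((len(docs), K))   # document-topic counts  n_m^(k)
    n_kt = np.zeros((K, V))           # topic-term counts      n_k^(t)
    n_k = np.zeros(K)                 # topic sums             n_k
    z = []                            # topic assignment for every word

    # Initialization: assign each word a random topic and update the counts.
    for m, doc in enumerate(docs):
        z_m = rng.integers(0, K, size=len(doc))
        z.append(z_m)
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    # Gibbs sweeps (burn-in and sampling are merged in this sketch).
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts.
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                # Full conditional p(z_i = k | z_-i, w), Eq. (1.15).
                p = (n_kt[:, t] + beta) / (n_k + V * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())
                # Add the new assignment back.
                z[m][i] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

    # Read out Phi and Theta according to Eqs. (1.16) and (1.17).
    phi = (n_kt + beta) / (n_k[:, None] + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta, n_kt, n_k

# Tiny toy corpus of word ids over a vocabulary of size 6.
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 4, 5]]
phi, theta, n_kt, n_k = lda_gibbs(docs, K=2, V=6)
print(np.round(theta, 2))
```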
c. Inference

Given an estimated LDA model, we can do topic inference for unknown documents by a similar sampling procedure. A new document $\tilde{m}$ is a vector of words $\vec{\tilde{w}}$; our goal is to estimate the posterior distribution of topics $\vec{\tilde{z}}$ given the word vector of the query $\vec{\tilde{w}}$ and the LDA model $\mathcal{L}(\underline{\Theta}, \underline{\Phi})$, i.e., $p(\vec{\tilde{z}} \mid \vec{\tilde{w}}, \mathcal{L}) = p(\vec{\tilde{z}} \mid \vec{\tilde{w}}, \vec{w}, \vec{z})$. In order to find the required counts for the complete new document, similar reasoning yields the Gibbs sampling update:

$$p(\tilde{z}_i = k \mid \vec{\tilde{z}}_{\neg i}, \vec{\tilde{w}}; \vec{z}, \vec{w}) \propto \frac{n_k^{(t)} + \tilde{n}_{k,\neg i}^{(t)} + \beta_t}{\sum_{v=1}^{V} n_k^{(v)} + \tilde{n}_{k,\neg i}^{(v)} + \beta_v} \cdot \frac{\tilde{n}_{\tilde{m},\neg i}^{(k)} + \alpha_k}{\left[\sum_{z=1}^{K} \tilde{n}_{\tilde{m}}^{(z)} + \alpha_z\right] - 1} \qquad (1.18)$$

where the new variable $\tilde{n}_k^{(t)}$ counts the observations of term $t$ and topic $k$ in the unseen document. This equation gives a colorful example of the workings of Gibbs posterior sampling: the high estimated word-topic associations $n_k^{(t)}$ will dominate the multinomial masses compared to the contributions of $\tilde{n}_k^{(t)}$ and $\tilde{n}_{\tilde{m}}^{(k)}$. Applying Eq. 1.17 then gives the topic distribution for the unknown document:

$$\vartheta_{\tilde{m},k} = \frac{\tilde{n}_{\tilde{m}}^{(k)} + \alpha_k}{\sum_{z=1}^{K} \tilde{n}_{\tilde{m}}^{(z)} + \alpha_z}$$
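The inference update (1.18) can be implemented by freezing the topic-term counts of the estimated model and resampling only the assignments of the new document. A minimal sketch, reusing the count arrays returned by the estimation sketch above (hyperparameters again illustrative), is:

```python
import numpy as np

def infer_theta(new_doc, n_kt, n_k, alpha=0.1, beta=0.01, iters=100, rng=None):
    """Sample topic assignments for `new_doc` (a list of word ids) against a
    fixed model given by the topic-term counts n_kt and topic sums n_k."""
    rng = rng or np.random.default_rng(0)
    K, V = n_kt.shape
    nt_k = np.zeros(K)                # new-document topic counts      n~_m^(k)
    nt_kt = np.zeros((K, V))          # new-document topic-term counts n~_k^(t)
    z = rng.integers(0, K, size=len(new_doc))
    for t, k in zip(new_doc, z):
        nt_k[k] += 1; nt_kt[k, t] += 1

    for _ in range(iters):
        for i, t in enumerate(new_doc):
            k = z[i]
            nt_k[k] -= 1; nt_kt[k, t] -= 1
            # Eq. (1.18): old counts dominate, new counts adapt to the query.
            p = (n_kt[:, t] + nt_kt[:, t] + beta) \
                / (n_k + nt_kt.sum(axis=1) + V * beta) * (nt_k + alpha)
            k = rng.choice(K, p=p / p.sum())
            z[i] = k
            nt_k[k] += 1; nt_kt[k, t] += 1

    # Topic distribution of the unseen document (analogue of Eq. 1.17).
    return (nt_k + alpha) / (nt_k.sum() + K * alpha)

# Usage, with n_kt and n_k from the lda_gibbs sketch above:
# theta_new = infer_theta([0, 2, 5], n_kt, n_k)
```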
1.4 Summary
This chapter has presented some typical topic analysis methods, namely LSA, PLSA, and LDA. These models can be considered the basic building blocks of a general framework for probabilistic modeling of text and can be used to develop more sophisticated and application-oriented models. They can also be seen as key components of our proposals in the subsequent chapters.
Among the topic analysis methods, we pay particular attention to LDA, a generative probabilistic model for collections of discrete data such as text corpora. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003 and has proven successful in many applications. Given the data, the goal is to reverse the generative process to estimate the model parameters. However, exact inference or estimation for even a not-so-complex model like LDA is intractable. Consequently, many attempts have been made to apply approximate approaches to this task, among which Gibbs sampling is one of the most suitable. Gibbs sampling, which is also described in this chapter, is a special case of Markov-chain Monte Carlo (MCMC) simulation and often yields relatively simple algorithms for approximate inference in high-dimensional models like LDA.
Chapter 2. Frameworks of Learning with Hidden Topics
2.1 Learning with External Resources: Related Works
In recent years, there have been many attempts to make use of external resources to enhance learning performance. Depending on the type of external resource, these methods can be roughly classified into two categories: those that make use of unlabeled data, and those that exploit structured or semi-structured data.
The first category is commonly referred to under the name of semi-supervised learning. The key argument is that unlabeled examples are significantly easier to collect than labeled ones. One example of this is web-page classification. Suppose that we want a program to electronically visit some web site and download all the web pages of interest to us, such as all the Computer Science faculty pages, or all the course home pages at some university. To train such a system to automatically classify web pages, one would typically rely on hand-labeled web pages. Unfortunately, these labeled examples are fairly expensive to obtain because they require human effort. In contrast, the web has hundreds of millions of unlabeled web pages that can be inexpensively gathered using a web crawler. Therefore, we would like the learning algorithms to take as much advantage of the unlabeled data as possible.
Semi-supervised learning has received a lot of attention in the last decade. Yarowsky (1995) uses self-training for word sense disambiguation, e.g., deciding whether the word "plant" means a living organism or a factory in a given context. Rosenberg et al. (2005) apply it to object detection systems for images, and show that the semi-supervised technique compares favorably with a state-of-the-art detector. In 2000, Nigam and Ghani [30] performed extensive empirical experiments to compare co-training with generative mixture models and Expectation Maximization (EM). Jones (2005) used co-training, co-EM and other related methods for information extraction from text. In addition, there has been a great deal of work applying Transductive Support Vector Machines (TSVMs), which use unlabeled data to determine the optimal decision boundary.
The second category covers work exploiting resources like Wikipedia to support the learning process. Gabrilovich et al. (2007) [16] demonstrated the value of using Wikipedia as an additional source of features for text classification and for determining the semantic relatedness between texts. Banerjee et al. (2007) [3] also extract titles of Wikipedia articles and use them as features for clustering short texts. Unfortunately, this approach is not very flexible, in the sense that it depends heavily on the external resource and the application.
This chapter describes frameworks for learning with the support of a topic model estimated from a large universal dataset. This topic model can be considered background knowledge for the domain of the application. It also helps the learning process to capture the hidden topics (of the domain) and the relationships between topics and words, as well as between words and words, thus partially overcoming the problem of different word choices in text.
2.2 General Learning Frameworks
This section presents general frameworks for learning with the support of hidden topics. The main motivation is to gain benefit from huge sources of online data in order to enhance the quality of text/web clustering and classification. Unlike previous studies of learning with external resources, we approach this issue from the point of view of text/web data analysis, based on recently successful latent topic analysis models like LSA, pLSA, and LDA. The underlying idea of the frameworks is that, for each learning task, we collect a very large external data collection called the "universal dataset", and then build a learner on both the learning data and a rich set of hidden topics discovered from that data collection.
2.2.1 Frameworks for Learning with Hidden Topics
Corresponding to the two typical learning problems, i.e., classification and clustering, we describe two frameworks with some differences in their architectures.
a. Framework for Classification
Figure 2.1 Classification with Hidden Topics
Nowadays, the continuous development of the Internet has created a huge amount of documents which are difficult to manage, organize and navigate. As a result, the task of automatic classification, which is to categorize textual documents into two or more predefined classes, has received a lot of attention.
Several machine-learning methods have been applied to text classification, including decision trees, neural networks, and support vector machines. In typical applications of machine-learning methods, the training data is passed to a learning phase. The result of the learning step is an appropriate classifier capable of categorizing new documents. However, in cases where the training data is not as plentiful as expected or the data to be classified is too sparse [52], learning with only the training data cannot provide a satisfactory classifier. Inspired by this fact, we propose a framework that enables us to enrich both the training data and the new coming data with hidden topics from an available large dataset, so as to enhance the performance of text classification.
Classification with hidden topics is illustrated in Figure 2.1. We first collect a very large external data collection called the "universal dataset". Next, a topic analysis technique such as pLSA or LDA is applied to this dataset. The result of this step is an estimated topic model, which consists of the hidden topics and the probability distributions of words over these topics. Upon this model, we can do topic inference for the training dataset and for new data. For each document, the output of topic inference is a probability distribution over the hidden topics (the topics analyzed in the estimation phase) given the document. The topic distributions of the training dataset are then combined with the training dataset itself for learning the classifier. In a similar way, new documents that need to be classified are combined with their topic distributions to create the so-called "new data with hidden topics" before being passed to the learned classifier.
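A minimal sketch of the combination step in this framework is given below: each document is represented by its own words plus its inferred hidden-topic distribution, and a classifier is trained on the concatenated features. The use of scikit-learn, tf-idf word features, a logistic regression learner, and the placeholder infer function are all assumptions of this sketch; the concrete classification method used in the thesis is described in Chapter 4.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def topic_features(texts, infer):
    """`infer` maps a text to its K-dim topic distribution (e.g. LDA inference)."""
    return csr_matrix(np.vstack([infer(t) for t in texts]))

def train_with_topics(train_texts, train_labels, infer):
    vec = TfidfVectorizer()
    X_words = vec.fit_transform(train_texts)         # sparse word features
    X_topics = topic_features(train_texts, infer)    # inferred topic features
    X = hstack([X_words, X_topics])                   # "training data with hidden topics"
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
    return vec, clf

def classify(texts, vec, clf, infer):
    # "New data with hidden topics": same word + topic combination as in training.
    X = hstack([vec.transform(texts), topic_features(texts, infer)])
    return clf.predict(X)

# Example with a dummy inference function standing in for real LDA inference:
# dummy_infer = lambda text: np.full(10, 0.1)
# vec, clf = train_with_topics(["mua bán nhà đất", "bóng đá ngoại hạng"],
#                              ["realestate", "sports"], dummy_infer)
```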
b. Framework for Clustering
Figure 2.2 Clustering with Hidden Topics
Text clustering is the task of automatically generating groups (clusters) of documents based on the similarity or distance among the documents. Unlike classification, the clusters are not known in advance. The user can optionally specify the desired number of clusters. The documents are then organized into clusters, each of which contains "close" documents.
Clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger ones. Divisive algorithms begin with the whole set and divide it into smaller ones.
The distance measure, which determines how the similarity of two documents is calculated, is a key to the success of any text clustering algorithm. Some documents may be close to one another according to one distance and farther away according to another. Common distance functions are the Euclidean distance, the Manhattan distance (also called the taxicab norm or 1-norm), and the maximum norm, to name but a few.
Web clustering, which is text clustering specific to web pages, can be offline or online. Offline clustering clusters the whole repository of available web documents and does not have a response-time constraint. In online clustering, the algorithms need to meet the "real-time condition", i.e., the system needs to perform clustering as fast as possible. For example, the algorithm should take the document snippets instead of the whole documents as input, since downloading the original documents is time-consuming. The question here is how to enhance the quality of clustering for such document snippets in online web clustering. Inspired by the fact that those snippets are only small pieces of text (and thus poor in content), we propose a framework to enrich them with hidden topics for clustering (Figure 2.2). This framework and its topic analysis are similar to the one for classification; the differences are only due to the essential differences between classification and clustering.
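As a sketch of how snippets could be enriched and clustered under this framework, the following concatenates tf-idf word features with inferred topic distributions and applies agglomerative (bottom-up) clustering over cosine distances. The libraries, the topic_weight parameter, and the placeholder infer function are assumptions of this sketch; the actual clustering method of the thesis is described in Chapter 4.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_snippets(snippets, infer, n_clusters=5, topic_weight=1.0):
    """Cluster short snippets using both their words and their hidden-topic
    distributions; `infer` maps a snippet to a K-dim topic distribution."""
    word_vecs = TfidfVectorizer().fit_transform(snippets).toarray()
    topic_vecs = np.vstack([infer(s) for s in snippets])
    X = np.hstack([word_vecs, topic_weight * topic_vecs])   # enriched snippets

    # Agglomerative hierarchical clustering (average linkage) on cosine distances.
    Z = linkage(pdist(X, metric="cosine"), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```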
2.2.2 Large-Scale Web Collections as Universal Dataset
Despite the obvious differences between the two learning frameworks, there is a key phase shared between them: the phase of analyzing topics for a previously collected dataset. Here are some important considerations for this phase:
- The degree of coverage of the dataset: the universal dataset should be large enough to cover the topics underlying the domain of the application.
- Preprocessing: this step is very important for obtaining good analysis results. Although there is no general procedure for all languages, the common advice is to remove as many noise words as possible, such as functional words, stop words, or too frequent or too rare words.
- Methods for topic analysis: some applicable analysis methods have been mentioned in Chapter 1. The tradeoff between the quality of topic analysis and time complexity should be taken into account. For example, topic analysis for snippets in online clustering should be as fast as possible to meet the "real-time" condition.
2.3 Advantages of the Frameworks
- The general frameworks are flexible and general enough to apply to any domain or language. Once we have estimated a topic model for a universal dataset, its hidden topics can be reused for several learning tasks in the same domain.
- This is particularly useful for sparse data mining. Sparse data like snippets returned from a search engine can be enriched with hidden topics; thus, enhanced performance can be achieved.
- Because they learn with smaller data, the presented methods require less computational resources than semi-supervised learning.
- Thanks to the generative model for analyzing topics of new documents (in the case of LDA), we have a natural way to map documents from the term space into the topic space. This is a real advantage over the heuristic-based mappings in previous approaches [16][3][10].
2.4 Summary
This chapter described two general frameworks for learning with hidden topics, one for classification and one for clustering, together with their advantages. The main advantages of our frameworks are that they are flexible and general enough to apply to any domain or language, and that they are able to deal with sparse data. The key phase shared between the two frameworks is topic analysis of a large-scale web collection called the "universal dataset". The quality of the topic model estimated from this data will greatly influence the performance of learning in the later phases.
Chapter 3. Topics Analysis of Large-Scale Web Dataset
As mentioned earlier, topic analysis of a universal dataset is a key to the success of our proposed methods. Thus, toward Vietnamese text mining, this chapter considers the problem of topic analysis for large-scale web datasets in Vietnamese.
3.1 Some Characteristics of Vietnamese
Vietnamese is the national and official language of Vietnam [48]. It is the mother tongue of the Vietnamese people, who constitute 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by some ethnic minorities of Vietnam. Many words in Vietnamese are borrowed from Chinese, and originally it was written in a Chinese-like writing system. The current writing system of Vietnamese is a modification of the Latin alphabet, with additional diacritics for tones and certain letters.
3.1.1 Sound
a. Vowels
Like other Southeast Asian languages, Vietnamese has a comparatively large number of vowels. Below is a chart of the vowels in Vietnamese:
Table 3.1 Vowels in Vietnamese
The correspondence between the orthography and pronunciation is rather complicated. For example, the vowel i is often written as y; both may represent [i], in which case the difference lies in the quality of the preceding vowel. For instance, "tai" (ear) is [tāi] while "tay" (hand/arm) is [tāj].
In addition to single vowels (or monophthongs), Vietnamese has diphthongs (âm đôi). Three diphthongs consist of a vowel plus "a"; these are "ia", "ua", and "ưa" (when followed by a consonant, they become "iê", "uô", and "ươ", respectively). The other diphthongs