
HANOI UNIVERSITY OF TECHNOLOGY


THÂN QUANG KHOÁT

TOPIC MODELING AND ITS APPLICATIONS

MAJOR: INFORMATION TECHNOLOGY

THESIS FOR THE DEGREE OF MASTER OF SCIENCE

SUPERVISOR: Prof. HỒ TÚ BẢO

HANOI, 2009


I promise that the content of this thesis was written solely by me. Any of the content was written based on reliable references, such as papers published in distinguished international conferences and journals, and books published by widely known publishers. Many parts and discussions of the thesis are new, not previously published by any other authors.

First and foremost, I would like to present my gratitude to my supervisor, Professor Ho Tu Bao, for introducing me to this attractive research area, for his willingness to promptly support me in completing the thesis, and for much invaluable advice from the starting point of my thesis.

I would like to sincerely thank Nguyen Phuong Thai and Nguyen Cam Tu for sharing some data sets and for pointing me to some sources on the network where I could find the implementations of some topic models.

Thanks are also due to Phung Trung Nghia for spending his valuable days helping me load the data for my experiments.

Finally, I would like to thank David Blei and Thomas Griffiths for their insightful discussions on Topic Modeling and for providing the C implementation of one of their topic models.

TABLE OF CONTENTS

List of Phrases
List of Tables
List of Figures
Chapter 1 INTRODUCTION
Chapter 2 MODERN PROGRESS IN TOPIC MODELING
2.1 Linear algebra based models
2.2 Statistical topic models
2.3 Discussion and notes
Chapter 3 LINEAR ALGEBRA BASED TOPIC MODELS
3.1 An overview
3.2 Latent Semantic Analysis
3.3 QR factorization
3.4 Discussion
Chapter 4 PROBABILISTIC TOPIC MODELS
4.1 An overview
4.2 Probabilistic Latent Semantic Analysis
4.3 Latent Dirichlet Allocation
4.4 Hierarchical Latent Dirichlet Allocation
4.5 Bigram Topic Model
Chapter 5 SOME APPLICATIONS OF TOPIC MODELS
5.1 Classification
5.2 Analyzing research trends over times
5.3 Semantic representation
5.4 Information retrieval
5.5 More applications
5.6 Experimenting with some topic models
CONCLUSION
REFERENCES

LIST OF PHRASES

HDP-RE Hierarchical Dirichlet Processes with random effects
hLDA Hierarchical Latent Dirichlet Allocation

NetSTM Network Regularized Statistical Topic Model

PLSV Probabilistic Latent Semantic Visualization

Spatial LDA Spatial Latent Dirichlet Allocation

LIST OF TABLES

Table 2.1 Some selected probabilistic topic models
Table 5.1 DiscLDA for Classification
Table 5.2 Comparison of query likelihood retrieval (QL), cluster-based retrieval (CBDM) and retrieval with the LDA-based document models (LBDM)
Table 5.3 The most probable topics from NIPS and VnExpress collections
Table 5.4 Finding the topics of a document
Table 5.5 Finding topics of a report
Table 5.6 Selected topics found by HMM-LDA
Table 5.7 Classes of function words found by HMM-LDA

LIST OF FIGURES

Figure 1.1 Some approaches to representing knowledge
Figure 2.1 A general view on Topic Modeling
Figure 2.2 Probabilistic topic models in view of the bag-of-words assumption
Figure 2.3 Viewing generative models in terms of Topics
Figure 2.4 A parametric view on generative models
Figure 3.1 A corpus consisting of 8 documents
Figure 3.2 An illustration of finding topics by LSA using cosine
Figure 3.3 A geometric illustration of representing items in 2-dimensional space
Figure 3.4 Finding relevant documents using QR-based method
Figure 4.1 Graphical model representation of pLSA
Figure 4.2 A geometric interpretation of pLSA
Figure 4.3 Graphical model representation of LDA
Figure 4.4 A geometric interpretation of LDA
Figure 4.5 A variational inference algorithm for LDA
Figure 4.6 A geometric illustration of document generation process
Figure 4.7 An example of hierarchy of topics [8]
Figure 4.8 A graphical model representation of BTM
Figure 5.1 LDA for Classification
Figure 5.2 The dynamics of the three hottest and three coldest topics
Figure 5.3 Evolution of topics through decades

Chapter 1 INTRODUCTION

Information Retrieval (IR) has long been a very active area. The development of IR is often associated with increasingly huge corpora, such as collections of Web pages and collections of scientific papers accumulated over years. Therefore, it poses many hard questions that have received much attention from researchers. One of the most famous questions, which seems never-ending, is how to automatically index the documents of a given corpus or database. Another substantial question is how to find, in a semantic manner, the documents from the Internet or a given corpus that are most relevant to a given user's query.

Finding and ranking are usually important tasks in IR. Many tools for supporting these tasks are available now, for example, Google and Yahoo. However, most of these available tools are only able to search for documents via word matching instead of semantic matching. Semantics is well known to be complicated, so finding and ranking documents in the presence of semantics are extremely hard. Despite this fact, these tasks potentially have many important applications, which in my opinion are future web service technologies, for instance, semantic searching, semantic advertising, academic recommending, and intelligent controlling.

Semantics is a hot topic not only in the IR community but also in the Artificial Intelligence (AI) community. In particular, in the field of knowledge representation it is crucial to know how to effectively represent natural knowledge gathered from the surrounding environment so that reusing it or integrating new knowledge is easy and efficient. To obtain a good knowledge database, semantics cannot be absent, since any word has its own meanings and has semantic relations to some other words. As we know, a word may have multiple senses and play different roles in

different contexts. So taking these facts into account in representing knowledge is extremely complicated and seems to be a never-ending debate.

One can easily point out a real application of knowledge representation. Let us see an example that often arises in Robotics. Imagine we want to make an intelligent robot that is able to classify rubbish into different kinds. To make such a robot, we must be able to efficiently represent the information that describes many types of rubbish so that the robot can immediately interpret to which types a given piece of garbage belongs, and can classify which things are reusable. Furthermore, the robot should be able to recognize which is rubbish among many things placed closely together. Thus the amount of information for describing real things must be very large to make sure the robot has enough knowledge. If the information were organized unsuitably, the robot could do its work prohibitively slowly and could not learn new knowledge from the surrounding environment. This example illustrates that knowledge representation is very important in artificial intelligence.

Many approaches to representing knowledge have been proposed so far. One direction for this task is based on high-dimensional semantic spaces, where each word is a vector in those spaces; see [20], [39], [58], and [27]. Another one is based on probabilistic topic models, which represent the latent structure of words using topics [58], [27], and [11]. Also, we can use semantic networks to represent knowledge by placing words in nodes and using edges to connect pairs of related words. For more discussions about these and other approaches, we refer to the surveys in [27] and [58]. Figure 1.1 illustrates some of the mentioned approaches.

Figure 1.1 Some approaches to representing knowledge

(a) Semantic network, (b) Semantic space, (c) Topic models

Automatically discovering the needed information and interpreting a given conversation or document are also challenging tasks in AI. In fact, these tasks play crucial roles in the problem of finding and ranking the gathered information as mentioned earlier. These tasks are so important that a large number of studies have been launched either to find efficient methods or to apply existing methods to specific real applications. To support this argument, we can easily check by using the Google tool1 that the work of Deerwester et al. in [20] receives more than 4200 citations, the work of Blei et al. in [11] receives more than 1200 citations, and the work of Landauer and Dumais in [39] receives more than 1800 citations.

One preferred direction for discovering latent structures hidden in a document or a collection of documents and interpreting them clearly is based on the approaches in Topic Modeling. The main contributions of topic modeling to IR are many methods for extracting the gist of a given document, conversation, or collection of documents. Many topic models have shown wide ranges of applications, some of which are Latent Semantic Analysis (LSA) [20], Probabilistic Latent Semantic Analysis (pLSA) [31], [30], Latent Dirichlet Allocation (LDA) [11], Hierarchical Latent Dirichlet Allocation (hLDA) [8], and CorrLDA2 [53]. Due to the ability to uncover latent structures (e.g., topics), topic models have been successfully applied to automatically index the documents of a given corpus [20], [11], [72], to find topical communities from collections of scientific papers [45], to support the spam filtering task [7], to reveal the development of Science over years [10], to discover hot and cold topics in the research community [25], to identify function and content words from text corpora [26], to discover different groups with their corresponding roles only by using text corpora [70], [44], and to explain statistically the inference process in human memory [58], [63], [50]. For other attractive applications, we refer to [5], [12], [14], [15], [18], [19], [21], [22], [23], [35], [39], [41], [43], [46], [50], [52], [65], [69], [73], [75], [49], and [29].

Motivated by the many amazing and potential applications of Topic Modeling, this thesis is devoted to surveying the modern development of the field. Since the number of studies relating to topic models constantly and quickly increases, we cannot hope the thesis uncovers all of them; instead, we focus on the most appealing characteristics and the main directions from which new topic models were or will be developed. The thesis also attempts to reveal the advantages and disadvantages of each considered model. Possible extensions of some models will be discussed in detail after presenting them. Finally, the thesis reports some important applications in AI and some experiments of the author on a collection of papers from the NIPS conferences2 up to volume 12 and a collection of reports of VnExpress3 – an electronic Vietnamese newspaper.

1 http://www.google.com.vn

ORGANIZATION OF THE THESIS: Chapter 2 presents an extensive survey on the recent progress of Topic Modeling. We shall see a general picture and many partial views of the field up to now. In the two subsequent chapters, we go into the details of some topic models which are the most typical for each view on topic modeling. Extensive discussions about the (dis)advantages and possible extensions will be pointed out after presenting a model. Some interesting applications of topic modeling are discussed in Chapter 5. In addition, Chapter 5 also contains some reports about the author's experiments on some corpora.

2 Advances in Neural Information Processing Systems (NIPS): http://books.nips.cc/

3 http://vnexpress.net

Chapter 2 MODERN PROGRESS IN TOPIC MODELING

The wide range of potential applications of Topic Modeling has motivated much research, including that of this thesis. While many other subfields of IR, such as Relational Databases, have been intensively studied and thus have solid foundations, Topic Modeling has received remarkable consideration from researchers only within the last two decades, especially in the last decade. Loosely speaking, Topic Modeling can be regarded as having been born when Deerwester et al. [20] first proposed an efficient and reliable method for automatically indexing the documents of a certain collection. About ten years later, Topic Modeling gained a breakthrough development when the pLSA model was proposed by Hofmann [30], [31]. Within two more years, Blei et al. [11] further opened the door for an increasing evolution of Topic Modeling by equipping it with a solid foundation from Statistics.

In Topic Modeling, the central task is to extract the gist of a given document or corpus. As we can see, a document may have several different topics; multiple documents may together describe a certain event or person. In addition, the meaning of a certain word must be taken into account in discovering the gist. A word may have many senses, and how to select the correct meaning in a given context is not a simple task. This phenomenon is called polysemy [39]. Besides, many words can have the same meaning, which leads to another problem when dealing with semantics, called synonymy. Consequently, solving the central task in Topic Modeling is not simple, but rather extremely hard.

Many topic models have been proposed so far, some of which are simple to understand clearly while others are not. Some models are based on linear algebra, but most are based on statistics. Although each model has its own assumptions on the given data, we can generally summarize the development of Topic Modeling as in Figure 2.1.

Figure 2.1 A general view on Topic Modeling.

Researchers may have different views on Topic Modeling, and Figure 2.1 demonstrates the author's view on the field. While there are a few topic models based on linear algebra, almost all models are based on statistics in the hope of having a solid foundation, and are thus known as Generative Models or Probabilistic Topic Models. An overview of each direction is presented in the next sections.

2.1 Linear algebra based models

Latent Semantic Analysis (LSA) [20] is the first proposed method for the task of extracting topics from a corpus. The method was constructed based on the idea that each word can be represented as a point in a certain high-dimensional space, called a semantic space. To find the gist of a given document, the method assumes that the order of the words in the document can be ignored. This is known as the bag-of-words assumption.

When the bag-of-words assumption is satisfied, the LSA model projects each word into a semantic space of low dimensionality by using a technique from linear algebra known as Singular Value Decomposition (SVD). To discover the gist of a given document, it first projects the document into the semantic space as a point, and then collects some points close to that point by using a certain similarity measure, e.g., cosine or inner product. The words associated with the collected points compose the topic of the document. The mathematical details of this method will be presented in Chapter 3.

Since a document can have many words, and a word may appear in many documents of the given corpus, the dimension of the semantic space chosen in LSA should not be large, for the sake of practical implementations. In fact, the dimension of the semantic space is usually much smaller than the number of words (or documents) in the corpus. Therefore, LSA can be seen as a dimensionality reduction technique. Differently from other techniques such as [42], [57], and [64], LSA is known as a linear technique since words and documents are projected into the semantic space via a linear transformation.

Although the steps in LSA are simple, the method has been shown to be very successful in many applications. Landauer and Dumais [39] argued that the ability of LSA in similarity judgment is as good as that of foreign students. They also demonstrated that LSA is a good tool for devising a new induction method or method of knowledge representation. León et al. [43] used LSA to grade very short summaries of texts, and showed that LSA is able to make accurate evaluations of summaries even when they are no longer than 50 words. LSA was also used to derive predication methods [35], to assess coherence between words [23], and to construct a reliable method for indexing documents [20]. For more discussions, see the surveys in [5], [40], and [46].

Apart from LSA based on SVD, there are some more methods for extracting the topics from a corpus. The method proposed in [46] uses the QR factorization technique to find the representations of words and documents in semantic spaces. Some other variants of LSA can be found in [46] and [5].

2.2 Statistical topic models

Another direction to attack the central problem in Topic Modeling is based on Statistics, i.e., using statistical tools to devise topic models and to explain their performance. As is well known [54], the methods based on linear algebra lack a solid theory for explaining their successful experiments. Hence the need for a topic model that possesses both good performance and a solid foundation has motivated much research.

The first successful study in this direction is from Hofmann [31]. He made a breakthrough for Topic Modeling by introducing a topic model called Probabilistic Latent Semantic Analysis (pLSA), or the Aspect Model. The intriguing property of pLSA is that it is a generative model; specifically, the model sees the words in a corpus as being generated from some probability distribution. This property helps pLSA outperform classical methods such as LSA in many real applications. Following Hofmann's result, a large number of topic models have been proposed, and some selected models are listed in Table 2.1. Most of the new models are fully generative, probabilistic models at the level of documents. Some models consider the number of topics hidden in a given document or corpus to be fixed, but some others do not. Some models take the bag-of-words assumption into account, others do not. Each model has its own assumptions and interesting properties. However, the state of the art in Topic Modeling can be summarized in the views depicted in Figures 2.2, 2.3, and 2.4.

2.2.1 Bag-of-words versus non-Bag-of-words assumption

In many real applications, such as document indexing and clustering, the order of the words in a document or the order of the documents in a corpus plays little role. We can then ignore it in analyzing documents or corpora. Many topic models work in this manner, including LSA, pLSA, LDA, the Correlated Topic Model (CTM) [10], Dirichlet Enhanced LSA (DELSA) [74], and Discriminative LDA (DiscLDA) [38]. In such models, the input documents are represented by a vocabulary of unique words, a frequency matrix showing the number of occurrences of words in documents, and the number of documents considered.

Nonetheless, for most applications in Natural Language Processing, the order of words is very important. For example, to predict the correct meaning of a word in a context, the previous words in the same sentence should be taken into account. In a scientific paper, some prior sentences could influence the implication of the next sentence, since they can altogether comprise a logical paragraph. From these observations, we should keep an eye on grammar when designing a topic model for a concrete natural language processing application. Along this line are many topic models, including the Syntactic Topic Model (STM), the Bigram Topic Model (BTM), and the Hidden Topic Markov Model (HTMM).

Table 2.1 Some selected probabilistic topic models.

(The first column lists the abbreviations of the models, the second gives the full names, and the last shows where a model can be found.)

Figure 2.2 shows the two directions for the development of probabilistic topic models in Topic Modeling from the perspective of the bag-of-words assumption. We note that, to the best of our knowledge, the number of models using this assumption is much larger than the number of the others, and that Figure 2.2 only lists some representatives.

Figure 2.2 Probabilistic topic models in view of the bag-of-words assumption.

2.2.2 Static versus Dynamic topics

As we can observe, the evolution of Science has a long history, and research topics evolve constantly over time in the academic community. New topics often arise in the community to capture or explain new events, demands, or problems from the real world. It is thus very useful and interesting for many applications to model exactly the evolution of topics over time hidden in a collection of documents.

Many topic models have been introduced for this task, for instance, the Discrete Dynamic Topic Model (dDTM) [13], the Continuous Dynamic Topic Model (cDTM) [67], and Hierarchical Latent Dirichlet Allocation (hLDA) [8]. The most subtle characteristic of these models is that they consider the number of topics as unknown and allow learning new topics online when processing the data. These models are often equipped with some complicated random processes, e.g., the Dirichlet process and the nested Chinese restaurant process.

Viewing the generative models along the line of whether or not the number of topics is known a priori, the evolution of the generative models can be depicted as in Figure 2.3.

Figure 2.3 Viewing generative models in terms of Topics.

(The models grouped in “Static Topics” assume the number of topics is known a priori; the others, in the “Dynamic Topics” group, learn new topics online from data.)

2.2.3 Parametric versus non-Parametric models

Parametric techniques are a preferred choice for many researchers when solving statistical problems in Machine Learning. A typical parametric model (or method) often assumes a corpus to be described by a fixed set of parameters that will be estimated by a certain inference algorithm. Some other models, however, assume the number of parameters can grow as the corpus grows, and new parameters can be found online.

In Topic Modeling, a large number of models assume that the set of parameters is known in advance, such as pLSA, LDA, Supervised Latent Dirichlet Allocation (sLDA) [14], Spatial Latent Dirichlet Allocation (Spatial LDA) [69], and the Author-Topic Model (AT) [56]. These models assume that a document is a sample from a probability distribution which is in turn a mixture of topics, and a topic is a distribution over words. This means the number of topics and the distributions generating topics are known in advance, except for some hyperparameters, and the remaining task in modeling the data is to estimate the coefficients in the mixture.
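To make this mixture-of-topics assumption concrete, here is a minimal sketch of the generative view such parametric models take; the vocabulary, the number of topics, the document length, and the Dirichlet hyperparameters are all illustrative assumptions of the sketch, not values prescribed by any model in Table 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: a 6-word vocabulary and K = 2 topics.
vocab = ["inference", "dirichlet", "topic", "camera", "image", "scene"]
V, K, doc_length = len(vocab), 2, 10

# Each topic is a distribution over words (each row sums to 1).
phi = rng.dirichlet(alpha=[0.5] * V, size=K)

# A document is a sample from a mixture of topics:
# 1) draw the per-document topic proportions theta,
theta = rng.dirichlet(alpha=[1.0] * K)
# 2) for every word position, draw a topic z from theta,
#    then draw the word itself from that topic's distribution over words.
words = []
for _ in range(doc_length):
    z = rng.choice(K, p=theta)
    w = rng.choice(V, p=phi[z])
    words.append(vocab[w])

print("topic proportions:", np.round(theta, 2))
print("generated document:", " ".join(words))
```

Fitting such a model then amounts to reversing this process: given only the observed words, estimate the topic proportions (and, depending on the model, the topic-word distributions).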

Some topic models consider the number of parameters and the parameters themselves to be unknown. They are known as non-parametric models. These models often allow the parameters to be learned from a corpus, and allow new parameters to be found as the corpus grows. Nevertheless, this ability always makes the models very complicated compared with their parametric counterparts.

Figure 2.4 illustrates a parametric view on the generative models. The detailed descriptions of some models appear in Chapter 4.

Figure 2.4 A parametric view on generative models.

2.3 Discussion and notes

We have seen various views on the state of the art of Topic Modeling. However, these views are somewhat inflexible in the sense that the models are classified into separate types and the correlations among those types have not been shown clearly. A topic model may belong to different types; for instance, LSA is not only based on linear algebra but also takes the bag-of-words assumption and is parameterized by the number of topics. Thus our views seem to be incomplete. One can figure out some other views on Topic Modeling based on some other characteristics of the topic models. Indeed, some classifications of topic models can be based on the following observations.

• Correlations between topics: in the research community, a scientific paper can discuss an existing topic or a new topic developed from some old ones. This means a collection of scientific papers may contain some topics which correlate with some others. Many sorts of data can have this property, such as collections of images from a camera over time and collections of political news. As a result, modeling documents with an eye on the correlations among latent topics is important. Some models concerning this fact include CTM and HTMM.

• Short-range versus Long-range dependency: some words of a sentence may dominate or affect the meaning of the others. For example, "away" changes the meaning of "run" between "He is running away" and "He is running". This sort of correlation between words is known as short-range dependency (or local dependency). Some models taking this fact into account include STM, BTM, HMM-LDA, CBTM, and HTMM. Some other models, which treat a document as a bag of words, can be classified as modeling long-range dependency.

• Bounded versus Unbounded memory: imagine a scenario in which we must deal with very large collections of documents, such as series of images from a camera over years or collections of personal web pages. In this situation, we cannot load all documents into memory to process them at once, since the available memory is limited and may be much smaller than the size of the collection. Thus the ability to work with blocks of the data is favorable. Some topic models with this ability include MBTM, PF-LDA, and IG-LDA.

• Parallel (distributed) versus Sequential learning of topics: various topic models were designed without considering the potential for parallel (or distributed) computation. This fact makes those models less preferred for the task of processing data stored across various workstations. Recently, some authors have proposed interesting models for learning topics which are suitable for parallel computation or distributed environments. See more in [3], [16], and [51].

• Supervised versus Unsupervised learning of topics: almost all topic models proposed so far learn topics from data in an unsupervised manner. In other words, these models automatically extract topics from documents without any prior knowledge. Even though the existing models have been shown to be useful in many contexts, it is reasonable to expect a topic model that can exploit some prior knowledge efficiently. Such a model is worth studying since it may be used to make a service more intelligent in processing texts. To the best of our awareness, few topic models work in a supervised manner. See for example [14], [65], [17], and [76].

• Topic hierarchy: the kind of topic model that impresses me the most can uncover the hierarchy of topics in a corpus. Models of this kind may be very useful for the task of document classification or clustering. Imagine the documents are distributed on the nodes of a certain hierarchy of topics. Then a topic model like hLDA can discover this hierarchy and thus can classify documents into various classes at various levels. The problem, however, is that all topics in the hierarchy must be found simultaneously. Thus it is worth studying a model that can find all nodes up to a desired level, rather than the complete hierarchy. For some existing topic models of this kind, see [8] and [62].

Chapter 3 LINEAR ALGEBRA BASED TOPIC MODELS

3.1 An overview

• A word (or term) is the basic unit of discrete data, defined to be an item from a vocabulary indexed by $\{1, \dots, V\}$.

• A document is a sequence of $N$ words, denoted by $d = (w_1, w_2, \dots, w_N)$, where $w_i$ is the $i$th word in the sequence.

• A corpus is a collection of $M$ documents, denoted by $\mathcal{D} = \{d_1, d_2, \dots, d_M\}$. The $j$th word in the document $d_i$ is denoted by $d_{i,j}$.

• A term-by-document matrix $A$ of the corpus $\mathcal{D}$ is a $V \times M$ matrix, where the $(i, j)$ entry is the number of times word $i$ occurs in the $j$th document.

• The $j$th column of $A$ represents the $j$th document in count numbers, and is called the document vector. The $i$th row shows the appearances of word $i$ in the documents of the corpus. The value of the $(i, j)$ entry is the number of times word $i$ occurs in the $j$th document.

• Bag-of-words assumption: the order of the words in a document and the order of the documents in a corpus are ignored. That is, the order of words (documents) does not affect the representation of the document containing them.
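As a small illustration of these definitions, the sketch below builds the term-by-document count matrix A for a toy two-document corpus under the bag-of-words assumption; the corpus and the whitespace tokenization are made up for the example.

```python
import numpy as np

# A toy corpus of M = 2 documents (made up for illustration).
corpus = [
    "latent dirichlet allocation is a topic model",
    "a topic model is a model of topics",
]

# Build the vocabulary {1, ..., V} from the unique words.
tokenized = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

V, M = len(vocab), len(corpus)

# A[i, j] = number of times word i occurs in document j.
A = np.zeros((V, M), dtype=int)
for j, doc in enumerate(tokenized):
    for w in doc:
        A[index[w], j] += 1

print(vocab)
print(A)   # column j is the document vector of document j
```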

Taking the bag-of-words assumption into account, we would like to represent documents not by terms but by the latent (hidden) concepts referred to by the terms. The hidden structure may not be a fixed mapping, but heavily depends on the corpus and the correlations among terms.

LSA is one of the first methods that efficiently represent documents and terms in a semantic space, in the sense that the rate of losing information is small. It uses the Singular Value Decomposition (SVD) technique to project documents and terms into a low-dimensional space. Then the latent concepts hidden in a document can be observed by collecting some terms close to the document in that space. Here, closeness can be measured by the cosine of the angle between two vectors or by their inner product.

A similar idea is used in some other methods. We shall see this in the subsequent sections.

3.2 Latent Semantic Analysis

Assume we are given a corpus $\mathcal{D}$ represented by a $V \times M$ term-by-document matrix $A$, and a vocabulary of unique words $\{1, 2, \dots, V\}$. The $j$th document in $\mathcal{D}$ is $d_j$, and the $i$th word in $d_j$ is $d_{j,i}$. Then LSA finds new representations of documents and words as follows.

3.2.1 New representations of documents and words

• Find the decomposition $A = T_0 \cdot S_0 \cdot D_0^t$ by a singular value decomposition technique, where $T_0$ and $D_0$ are orthonormal matrices, $S_0$ is the diagonal matrix composed of the singular values of $A$, and $D_0^t$ is the transpose of $D_0$.

• Keep the $k$ largest singular values of $A$ to form the diagonal matrix $S$, and keep the corresponding columns of $T_0$ and $D_0$ to form $T$ and $D$, so that the product $A_k = T \cdot S \cdot D^t$ is a rank-$k$ approximation of $A$.

• The new representations of words and documents are as follows: the columns of $A_k$ represent the documents, and the rows of $A_k$ represent the words.
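A minimal sketch of these steps with NumPy, assuming A is a term-by-document matrix as defined above and the rank k is chosen by hand:

```python
import numpy as np

def lsa_decompose(A, k):
    """Return the truncated factors T, S, D and the rank-k approximation A_k."""
    # Full SVD: A = T0 * S0 * D0^t, with T0 and D0 orthonormal.
    T0, s0, D0t = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and the matching columns.
    T = T0[:, :k]              # V x k
    S = np.diag(s0[:k])        # k x k diagonal matrix of singular values
    D = D0t.T[:, :k]           # M x k
    A_k = T @ S @ D.T          # rank-k approximation of A
    return T, S, D, A_k

# Example usage on a random stand-in for a 10 x 8 term-by-document matrix.
A = np.random.default_rng(0).integers(0, 3, size=(10, 8))
T, S, D, A_k = lsa_decompose(A, k=4)
print(A_k.shape)   # (10, 8): rows correspond to words, columns to documents
```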

Figure 3.1 A corpus consisting of 8 documents.

(A is the term-by-document matrix of the corpus.)

d1: On Tight Approximate Inference of the Logistic-Normal Topic Admixture Model
d2: Asynchronous Distributed Learning of Topic Models
d3: Pattern Recognition and Machine Learning
d4: Hierarchical topic models and the nested Chinese restaurant process
d5: Variational inference for Dirichlet process mixtures
d6: Online Inference of Topics with Latent Dirichlet Allocation
d7: A Bayesian hierarchical model for learning natural scene categories
d8: Unsupervised Learning by Probabilistic Latent Semantic Analysis

We see that finding the new representations of the items is quite simple. Nonetheless, some important remarks should be kept in mind. Firstly, when removing the $i$th singular value (i.e., removing the $i$th row and column of $S_0$), the $i$th columns of both $T_0$ and $D_0$ must be removed. Secondly, while the new representations of

words and documents are vectors, they are often of different dimensionality. Thus a direct comparison between a word and a document cannot be implemented. We shall see how to remedy this issue after considering the following example.

Example 3.1: Consider a corpus containing 8 documents, each of which is the title of a certain paper mentioned by this thesis. Figure 3.1 contains the details of each document. For this corpus, we only keep the words which appear in at least two documents (except "topics", which is kept in order to see how it is treated differently from "topic") and remove the others. Some common words, called stop words,4 such as "the", are also removed. The final vocabulary consists of 10 words. The term-by-document matrix $A$ is of size $10 \times 8$, also appearing in Figure 3.1.

Using SVD, we can find the representation $A = T_0 \cdot S_0 \cdot D_0^t$.

Choose $k = 4$, and take the 4 rows and 4 columns of $S_0$ associated with the 4 largest singular values of $A$ to form the matrix $S$. The resulting rank-4 approximation of the corpus is $A_4 = T \cdot S \cdot D^t$.

3.2.2 Measuring the similarity

As we have seen, the rows and columns of $A_k$ are representations of terms and documents. However, those representations are often of different dimensionality. Thus, to measure the similarity between items (term-term, term-document, document-document), we need some other observations.

a) Comparing two terms:

Each row of $A_k$ represents a term in the vocabulary. So comparing two terms is equivalent to comparing two rows. However, comparing two rows of $A_k$ may be inefficient, since each row belongs to the $M$-dimensional space. We shall see another way to do this task.

A careful observation reveals that $A_k A_k^t$ is the square symmetric matrix containing all term-term inner products. That is, the cell $(i, j)$ of $A_k A_k^t$ is the inner product of rows $i$ and $j$ of $A_k$. Remember that $D_0$ is orthonormal,5 so $D$ is also orthonormal. As a result, $A_k A_k^t = (T \cdot S \cdot D^t)(T \cdot S \cdot D^t)^t = T \cdot S \cdot D^t \cdot D \cdot S \cdot T^t = (T \cdot S)(T \cdot S)^t$. Thus the inner product of two rows of $A_k$ is exactly the inner product of the corresponding two rows of $T \cdot S$. This implies we may regard the rows of $T \cdot S$ as the representations of terms; each row is a vector in the $k$-dimensional space. Hence, comparing two terms in these representations is much more efficient.

In short, to measure the similarity of two terms, we deal with the two corresponding rows of $T \cdot S$.
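A small sketch of this term-term comparison, assuming T and S are the truncated factors produced as in Section 3.2.1; the cosine helper is our own addition:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def term_similarity(T, S, i, j):
    """Compare terms i and j through the corresponding rows of T*S."""
    TS = T @ S                  # each row is a k-dimensional term representation
    return cosine(TS[i], TS[j])
```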

Example 3.2: Consider the corpus in Example 3.1. We would like to measure the closeness of the two terms "inference" and "learning".

5 A matrix is said to be orthonormal if all columns of the matrix are mutually orthogonal, and of unit length.

In the 4-dimensional space, the representation of "inference" is $T_2 \cdot S$, where $T_i$ is the $i$th row of $T$. Similarly, the representation of "learning" is $T_5 \cdot S$. So the similarity of these two terms is measured via the similarity of the two vectors $T_2 \cdot S$ and $T_5 \cdot S$.

b) Comparing two documents:

If we use the columns of $A_k$ for comparing two documents, it may be inefficient since those vectors live in a high-dimensional space. To reduce the computation, we need to find another representation for the documents.

Note that $A_k^t A_k$ contains all document-document inner products. Furthermore, $A_k^t A_k = (T \cdot S \cdot D^t)^t (T \cdot S \cdot D^t) = D \cdot S \cdot T^t \cdot T \cdot S \cdot D^t = (D \cdot S)(D \cdot S)^t$. Thus the inner product of two columns of $A_k$ is exactly the inner product of the corresponding two rows of $D \cdot S$. This implies one may regard the rows of $D \cdot S$ as the representations of documents. With these representations, measuring the similarity of two documents is very efficient since all vectors are of $k$ dimensions.
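The document side is analogous; a sketch assuming the same truncated factors D and S:

```python
import numpy as np

def document_similarity(D, S, i, j):
    """Compare documents i and j through the corresponding rows of D*S."""
    DS = D @ S                  # each row is a k-dimensional document representation
    u, v = DS[i], DS[j]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```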

Example 3.3: Return to Example 3.1. We would like to measure the similarity of the two documents d1 and d2. The similarity of these documents can be measured by inspecting the following vectors:

$D_1 \cdot S = (-1.1167, -0.24, 0.2712, -1.1002)$ and $D_2 \cdot S = (-1.2217, 0.8318, -0.1165, 0.4568)$,

where $D_j$ is the $j$th row of $D$.

c) Comparing a term and a document:

This comparison can be done by inspecting the corresponding row and column of $A_k$. Nevertheless, it is inefficient.

Note that $A_k = T \cdot S \cdot D^t = (T \cdot S^{1/2})(D \cdot S^{1/2})^t$, where $S^{1/2}$ is the diagonal matrix such that $S^{1/2} \cdot S^{1/2} = S$. This formulation tells us that each cell of $A_k$ is the inner product between a row of $T \cdot S^{1/2}$ and a row of $D \cdot S^{1/2}$. We thus can regard the rows of $T \cdot S^{1/2}$ and $D \cdot S^{1/2}$ as representations of terms and documents, respectively. Comparing a term with a document can be done by using the corresponding rows of $T \cdot S^{1/2}$ and $D \cdot S^{1/2}$.
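A sketch of this mixed term-document comparison under the same assumptions, with the square root of the diagonal matrix S taken entrywise:

```python
import numpy as np

def term_document_similarity(T, S, D, term_i, doc_j):
    """Compare term term_i with document doc_j via rows of T*S^(1/2) and D*S^(1/2)."""
    S_half = np.sqrt(S)         # S is diagonal, so the entrywise square root is S^(1/2)
    u = (T @ S_half)[term_i]
    v = (D @ S_half)[doc_j]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```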

Example 3.4: We want to measure the similarity of the term "inference" with the document d1. This task can be done by inspecting the two vectors $T_2 \cdot S^{1/2}$ and $D_1 \cdot S^{1/2}$.

3.2.3 Finding the topic of a document

From the previous section, we have already discussed how to compare a given term and a given document. To extract the topic of a document $i$, we do the following steps:

- Choose a threshold $a$, e.g., $a = 0.8$.
- For each term $j$ in the vocabulary, compute the cosine of the angle between the two vectors $T_j \cdot S^{1/2}$ and $D_i \cdot S^{1/2}$.
- Select the terms $j$ whose cosines are not less than $a$; these terms comprise the topic of the document.

Figure 3.2 An illustration of finding topics by LSA using cosine.

Example 3.5: Assume we want to find the topic of the document d5 of the corpus in Example 3.1. We first choose $a = 0.6$, and then compute all $\cos\theta_{j,5}$ ($j = 1, \dots, 10$). Figure 3.2 contains the results of these computations; 4 of the values (marked bold in the figure) are not less than $a$. Looking up the vocabulary to find the terms associated with these values (see Figure 3.1), we obtain: Dirichlet, inference, process, topics. These terms comprise the topic of the document d5. We see that even though "topics" does not appear in d5, the term still plays an important role in the

document. This fact can be explained as follows: both terms "Dirichlet" and "inference" appear in d5 and d6, and "topics" appears in d6. Thus LSA suggests that these three terms may reasonably correlate.
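The whole topic-extraction procedure of this subsection can be sketched as follows, assuming the truncated factors T, S, D and the vocabulary list come from the earlier decomposition; the default threshold is simply the value used in Example 3.5:

```python
import numpy as np

def topic_of_document(T, S, D, vocab, doc_i, a=0.6):
    """Return the terms whose cosine with document doc_i is not less than threshold a."""
    S_half = np.sqrt(S)
    term_vecs = T @ S_half                 # one k-dimensional row per term
    doc_vec = (D @ S_half)[doc_i]          # the k-dimensional document vector
    cosines = term_vecs @ doc_vec / (
        np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(doc_vec)
    )
    return [word for word, c in zip(vocab, cosines) if c >= a]
```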


Figure 3.3 A geometric illustration of representing items in 2-dimensional space

(Items covered by the two dashed lines may be similar to the document d5 in the cosine measure.)

Figure 3.3 depicts the distribution, in 2-dimensional space, of the terms and documents of the corpus we are considering. Note that the four terms which compose the topic of d5 are all contained in the small area covered by the two dashed lines. Note further that "latent" is also contained in this area, but does not appear in the topic of d5. This seeming contradiction can be explained as follows. The points in Figure 3.3 are the projections of the term vectors and document vectors from the high-dimensional space onto the 2-dimensional space. In our example, the terms and documents are represented by 4-dimensional vectors. So, information may be lost when projecting these vectors onto a lower-dimensional space, and this is what happens with our vectors. This fact suggests that "latent" lies much farther from d5 in the 4-dimensional space.

3.2.4 Finding relevant documents to a query

A central task in IR is how to find the documents most relevant to a user query. A document may not contain any word of the query even though it is closely related to the query. For example, the document d2 in Example 3.1 is quite relevant to the term "hierarchical", but does not contain it. Therefore this task cannot be so simple.

Assume $q$ is the given query of a user. To find some relevant documents to this query, it is desirable to have a suitable representation of $q$ so that it can be compared with the document vectors. In some real instances, $q$ may contain some terms that do not appear in the vocabulary of our corpus. If this situation happens, we remove all such new terms from the query. Let $d_q$ be the document vector of the resulting query. We shall see how to compare $d_q$ with the existing vectors of the documents.

Keep in mind that we have approximated the term-by-document matrix $A$ by $A_k$. The columns of $A_k$ are the representations of the documents. These facts suggest that we can find all relevant documents by comparing each column of $A_k$ with $d_q$. Nonetheless, these simple comparisons are inefficient since all vectors are of dimensionality $V$. We thus need to find other alternatives.

Let $\theta_j$ be the angle between $d_q$ and the document $d_j$. Writing $a_j$ for the $j$th column of $A_k$, we have $a_j = T \cdot S \cdot D_j^t$, and hence

$$\cos\theta_j = \frac{a_j^t \, d_q}{\|a_j\| \cdot \|d_q\|} = \frac{b_j^t \, (T^t d_q)}{\|b_j\| \cdot \|d_q\|}, \qquad (3.1)$$

where $b_j^t = D_j \cdot S$. Computing $\cos\theta_j$ this way is cheap, since $b_j$ and $T^t d_q$ are of dimension $k$. Furthermore, $b_j$ is exactly the $j$th row of $D \cdot S$, and can be pre-computed once for all queries.

In short, to find relevant documents to a query $q$, we essentially do the following steps (a minimal code sketch follows the list):

- Remove all terms of $q$ that do not appear in the vocabulary.
- Form the document vector $d_q$ of the resulting query.
- For each document $j$ of the corpus, compute $\cos\theta_j$ by (3.1).
- Return any document $j$ such that $\cos\theta_j$ is not less than a chosen threshold.
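A sketch of these retrieval steps, assuming the truncated factors from Section 3.2.1 and a dictionary index mapping vocabulary terms to row numbers; the threshold value is an arbitrary choice of the sketch:

```python
import numpy as np

def relevant_documents(T, S, D, index, query_terms, threshold=0.5):
    """Return indices j of documents with cos(theta_j) >= threshold, following (3.1)."""
    d_q = np.zeros(T.shape[0])             # query vector over the V vocabulary terms
    for w in query_terms:
        if w in index:                     # terms outside the vocabulary are dropped
            d_q[index[w]] += 1
    if not d_q.any():
        return []
    B = D @ S                              # row j is b_j; pre-computable once for all queries
    proj = T.T @ d_q                       # T^t d_q, a k-dimensional vector
    cosines = B @ proj / (np.linalg.norm(B, axis=1) * np.linalg.norm(d_q))
    return [j for j, c in enumerate(cosines) if c >= threshold]
```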

Example 3.6: Consider again the corpus of Example 3.1. Assuming the corpus was previously processed to find the representations of terms and documents for $k = 4$, we would like to find some relevant documents to the query "topic model". The query vector is $d_q = (0,0,0,0,1,0,0,0,1,0)^t$, and we apply the formula in (3.1) to all documents.

3.3 QR factorization

In Section 3.2 we discussed how LSA deals with the term-by-document matrix $A$ of a given corpus. LSA expects to remove some uncertainties in the corpus by removing some redundant information; specifically, it approximates $A$ with another matrix $A_k$ by the use of singular values. However, the resulting matrices $A_k$ and $A$ still have the same size. In addition, the comparisons between items, such as document-document and document-query, are essentially unnatural.

LSA uses SVD as the main tool to deal with the corpus. Nevertheless, as pointed out in [46], some other techniques in linear algebra behave like SVD. In this section, we shall discuss how the QR factorization of a matrix can be employed in IR.

The underlying idea of the QR-based method is very simple. Since a corpus is represented by a term-by-document matrix, and some matrix columns may be linearly dependent on some others, we can obtain a reduced but efficient representation for the corpus by removing some dependent columns. This means the important task is to find some key columns of $A$.

Assume we are given the V M term-by-document matrix A of a corpus Then, as well-known, it can be decomposed into two matrixes Q and R such that

A Q R   (3.2)where Q is a V V orthonormal matrix, R is a V M upper triangular matrix.The representation in (3.2) shows that the columns of A are linear combinations of the columns of Q Thus we can approximate A by doing so for Q For the aim of losing as little information as possible, we shall approximate Q by removing some of its columns associated with some sparse rows of R

Let Qk be the V k matrix derived from Q by removing the columns associated with the (V k ) sparsest rows of R Then the representation of the

Trang 37

corpus is approximated by

k k k

where Rk is the matrix obtained from R by removing the (V k ) sparsest rows

To find all relevant documents to a given query in this new representation, we can do the same steps as in the LSA method provided that the quantity cos qj has a little different form

where rj is the jth column of Rk
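A sketch of the QR-based retrieval under the same assumptions; NumPy's qr routine provides the factorization, and the row-density measure (counting non-negligible entries) is our own simple choice:

```python
import numpy as np

def qr_retrieval(A, index, query_terms, k, threshold=0.5):
    """Rank documents against a query using the QR-based approximation A ~ Q_k R_k."""
    Q, R = np.linalg.qr(A, mode="complete")   # Q: V x V orthonormal, R: V x M upper triangular
    # Keep the k rows of R with the most non-negligible entries (the "densest" rows)
    # and the columns of Q associated with them.
    density = (np.abs(R) > 1e-10).sum(axis=1)
    keep = np.sort(np.argsort(density)[::-1][:k])
    Q_k, R_k = Q[:, keep], R[keep, :]
    # Form the query vector d_q over the vocabulary, dropping unknown terms.
    d_q = np.zeros(A.shape[0])
    for w in query_terms:
        if w in index:
            d_q[index[w]] += 1
    if not d_q.any():
        return []
    proj = Q_k.T @ d_q                        # Q_k^t d_q, a k-dimensional vector
    cosines = R_k.T @ proj / (np.linalg.norm(R_k, axis=0) * np.linalg.norm(d_q) + 1e-12)
    return [j for j, c in enumerate(cosines) if c >= threshold]
```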

Figure 3.4 Finding relevant documents using the QR-based method.

(First, A is decomposed into two other matrices Q and R using QR factorization; then the 3 columns of Q associated with the 3 densest rows of R are kept to form the new matrix $Q_3$.)

Example 3.7: Consider the corpus in Example 3.1. We want to find the documents most relevant to the query "topic model". The result appears in Figure 3.4.

3.4 Discussion

We have demonstrated that some linear algebra techniques can be used to find efficient representations for a given corpus. In particular, they have been shown to be useful in finding the topic of a document and in finding the documents most relevant to a query. In summary, some of their obvious advantages are as follows.

- The computation is simple, since SVD and QR factorization are elementary techniques in linear algebra.
- These methods are able to provide explicit representations of terms and documents in the semantic space.
- They are able to preserve the correlations among terms by placing correlated terms close together. Thanks to this ability, they are good candidates for helping us solve some hard problems in Natural Language Processing such as polysemy and synonymy.
- These methods are applicable to any discrete data, not only text corpora.

Despite many appealing properties, linear algebra based methods have their own serious drawbacks, some of which are:

- The dimension of the semantic space is chosen empirically. For LSA, it is suggested that a good choice of the dimension is from 100 to 300 [39], [5]. If we choose it to be too small or too large, the result of uncovering latent structures may be poor. Thus, an open question is how to automatically choose the best dimension for a concrete application.

- LSA and its counterparts lack a sound foundation for explaining their interesting successes in many applications. A reasonable explanation for those successes is that LSA, for example, approximates the term-by-document matrix $A$ by the best matrix $A_k$ among all matrices of rank $k$. See more in [54].

- It is unclear and unnatural how they deal with the corpus when a new document is added to or removed from the corpus. This situation occasionally happens with real corpora, and needs to be studied further. Some proposals have been introduced in [5] and [46], but they are complex.

- Another drawback that seems serious but has been little studied is how LSA, for example, deals with a corpus having only one document (or very few documents). This situation may arise in some applications in which the observable data are very few. As one may recognize, LSA finds the topic of a document by inspecting every word in the vocabulary, and the more data, the better the performance LSA can achieve. These facts imply that LSA may get into trouble with a corpus having very few documents. To the best of my knowledge, there has not been any attention to this problem so far. Hence, in my opinion, this situation is worth studying further.

Chapter 4 PROBABILISTIC TOPIC MODELS

4.1 An overview

Loosely speaking, any probabilistic topic model always tries to approximate (or model) the stochastic process that generates the data. Thus, a typical probabilistic topic model is often comprised of the following components:

• Assumptions: each model has some assumptions on the given data. For example, some models such as pLSA and LDA assume that each word in the vocabulary is a sample from a certain probability distribution; some models such as LDA, CTM, hLDA and HTMM assume the documents of a corpus were generated from a certain probability distribution or stochastic process. Assumptions seem to be the key ingredient for developing a probabilistic topic model.

• Generative process: holding some assumptions in hand, the topic model further assumes how a document, corpus, or collection of documents was generated. For instance, the LDA model assumes each document is a sample from a mixture of topics, where each topic is a sample drawn from a probability distribution over words. Having a clear understanding of the


REFERENCES

1. Aldous, D. (1985), "Exchangeability and Related Topics", in École d'Été de Probabilités de Saint-Flour XIII–1983, Springer, Berlin, pp. 1–198.
2. Andrieu C., Freitas N. D., Doucet A., Jordan M. I. (2003), "An Introduction to MCMC for Machine Learning", Machine Learning, 50, pp. 5–43.
3. Asuncion A., Smyth P., Welling M. (2008), "Asynchronous Distributed Learning of Topic Models", Advances in Neural Information Processing Systems, 20, pp. 81–88.
4. Beal M. J., Ghahramani Z., Rasmussen C. E. (2002), "The infinite hidden Markov model", Advances in Neural Information Processing Systems, 14.
5. Berry M. W., Dumais S. T., O'Brien G. W. (1994), "Using Linear Algebra for Intelligent Information Retrieval", SIAM Review, 37, pp. 573–595.
7. Biro I., Szabo J., Benczur A. (2008), "Latent Dirichlet Allocation in Web Spam Filtering", In Proceedings of the Fourth International Workshop on Adversarial Information Retrieval on the Web -WWW, pp. 29–32.
8. Blei D. M., Griffiths T. L., Jordan M. I. (2007), "The nested Chinese restaurant process and Bayesian inference of topic hierarchies", http://arxiv.org/abs/0710.0845. Shorter version appears in NIPS, 16, pp. 17–24.
9. Blei D. M., Jordan M. I. (2006), "Variational inference for Dirichlet process mixtures", Bayesian Analysis, 1(1), pp. 121–144.
10. Blei D. M., Lafferty J. (2007), "A correlated topic model of Science", The Annals of Applied Statistics, 1(1), pp. 17–35.
11. Blei D. M., Ng A. Y., Jordan M. I. (2003), "Latent Dirichlet allocation", Journal of Machine Learning Research, 3, pp. 993–1022.
12. Blei D., Jordan M. (2003), "Modeling annotated data", In Proceedings of the 26th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134.
13. Blei D., Lafferty J. (2006), "Dynamic Topic Models", In Proceedings of the 23rd International Conference on Machine Learning -ICML, pp. 113–120.
14. Blei D., McAuliffe J. (2007), "Supervised topic models", Advances in Neural Information Processing Systems, 19.
15. Boyd-Graber J., Blei D. (2008), "Syntactic topic models", Advances in Neural Information Processing Systems, 20.
16. Canini K. R., Shi L., Griffiths T. (2009), "Online Inference of Topics with Latent Dirichlet Allocation", In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics -AISTATS, 5, pp. 65–72.
17. Chemudugunta C., Holloway A., Smyth P., Steyvers M. (2008), "Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning", In Proceedings of the International Semantic Web Conference.
18. Chemudugunta C., Smyth P., Steyvers M. (2006), "Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model", Advances in Neural Information Processing Systems, 18.
19. Cohn D., Hofmann T. (2000), "The missing link - a probabilistic model of document content and hypertext connectivity", Advances in Neural Information Processing Systems, 12.
20. Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R. (1990), "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41, pp. 391–407.
21. Dietz L., Bickel S., Scheffer T. (2007), "Unsupervised Prediction of Citation Influences", In Proceedings of the 24th International Conference on Machine Learning -ICML, pp. 233–240.
