
DOCUMENT INFORMATION

Title: Finding the Semantic Similarity in Vietnamese
Author: Nguyen Tien Dat
Supervisor: PhD. Phạm Bảo Sơn
University: Vietnam National University, Ha Noi
Major: Computer Science
Document type: Graduation thesis
Year: 2010
City: Ha Noi
Pages: 64
File size: 615.95 KB


Nguyen Tien Dat

FINDING THE SEMANTIC SIMILARITY IN VIETNAMESE

GRADUATION THESIS

Major Field: Computer Science

Supervisor: PhD. Phạm Bảo Sơn

Our thesis examines the quality of semantic vector representations built with random projection and the Hyperspace Analogue to Language (HAL) model in the context of research on Vietnamese. The main goal is to find semantic similarity, that is, to study synonyms in Vietnamese. We are also interested in the stability of our approach, which uses Random Indexing and HAL to represent the semantics of words and documents. We build a system to find synonyms in Vietnamese, called the Semantic Similarity Finding System. We also evaluate the synonyms produced by our system.

Keywords: Semantic vector, Word space model, Random projection, Apache Lucene

Acknowledgments

First of all, I wish to express my respect and my deepest thanks to my advisor Pham Bao Son, University of Engineering and Technology, Viet Nam National University, Ha Noi, for his enthusiastic guidance, warm encouragement and useful research experience.

I would like to gratefully thank all the teachers of the University of Engineering and Technology, VNU, for the invaluable knowledge they provided me during the past four academic years.

I would also like to send my special thanks to my friends in the K51CA class and the HMI Lab.

Last, but not least, my family is really the biggest motivation for me. My parents and my brother always encourage me when I face stress and difficulty. I would like to send them great love and gratefulness.

Ha Noi, May 19, 2010
Nguyen Tien Dat


Chapter 1

Introduction

Finding semantic similarity is an interesting project in Natural Language Processing (NLP). Determining the semantic similarity of a pair of words is an important problem in many NLP applications such as web mining [18] (search and recommendation systems), targeted advertisement and domains that need semantic content matching, word sense disambiguation, and text categorization [28][30]. There is not much research done on semantic similarity for Vietnamese, while semantic similarity plays a crucial role in human categorization [11] and reasoning; computational similarity measures have also been applied to many fields such as semantics-based information retrieval [4][29], information filtering [9] and ontology engineering [19].

Nowadays, the word space model is often used in current research on semantic similarity. Specifically, there are many well-known approaches for representing the context vector of words, such as Latent Semantic Indexing (LSI) [17], Hyperspace Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches have been introduced and have proven useful in implementing the word space model [27].

In our thesis, we carry on the word space model and its implementation for computing semantic similarity. We have studied each method and investigated its advantages and disadvantages to select a suitable technique to apply to Vietnamese text data. Then, we built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System. Our system is a …

… word. Our experimental results on the task of finding synonyms are promising. Our thesis is organized as follows. First, in Chapter 2, we introduce the background knowledge about the word space model and also review some of the solutions that have been proposed for word space implementation. In Chapter 3, we then describe our Semantic Similarity Finding System for finding synonyms in Vietnamese. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusion and future work.

2.1.1 Synonym and Hyponymy

Synonymy is the equality, or at least the similarity, of the meaning of different linguistic expressions. Two words are synonymous if they have the same meaning [15]. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. For example, in English, the words “car” and “automobile” are synonyms. In the figurative sense, two words are often said to be synonyms if they have the same extended meaning or connotation.

Synonyms can be any part of speech (e.g., noun, verb, adjective or pronoun), as long as the two words of a pair are the same part of speech. Some examples of Vietnamese synonyms:

độc giả – bạn đọc (noun)

chung quanh – xung quanh (pronoun)

bồi thường – đền bù (verb)


In the linguistics dictionary, the synonym is defined by three concepts:

1. A word having the same or nearly the same meaning as another word or other words in a language.

2. A word or an expression that serves as a figurative or symbolic substitute for another.

3. Biology: a scientific name of an organism or of a taxonomic group that has been superseded by another name at the same rank.

In linguistics, a hyponym is a word whose meaning is included in that of another word.

2.1.2 Antonym and Opposites

In lexical semantics, opposites are words that stand in a relationship of binary incompatibility, such as: female – male, long – short, and to love – to hate. The notion of incompatibility refers to the fact that one word in an opposite pair entails that it is not the other pair member. For example, “something that is long” entails that “it is not short”. There are two members in a set of opposites, so it is referred to as a binary relationship. The relationship between opposites is called opposition.

Opposites are simultaneously similar and different in meaning [12]. Usually, they differ in only one dimension of meaning but are similar in most other aspects, including grammatical behavior and general semantics. Some words are non-opposable: for example, animal or plant species have no binary opposites or antonyms. Opposites may be viewed as a special type of incompatibility. For example, incompatibility is also found in the opposite pair “fast – slow”: “It’s fast” entails “It’s not slow”.

Some features of opposites are given by Cruse (2004): binarity, inherentness and patency. In this section we have introduced antonyms, which are gradable opposites: they are located at opposite ends of a continuous spectrum of meaning.

Words can have several different antonyms, depending on the meaning or context of the word. We study antonyms to clarify a fundamental part of a language, in contrast to synonyms.

2.2 Word-space model

The word-space model is an algebraic model for representing text documents or any other objects (phrases, paraphrases, terms, …). It uses a mathematical model, the vector, to identify or index terms in text documents. The model is useful in information retrieval; see [29] for more about the vector space model for information retrieval. The term is due to Hinrich Schütze (1993):

“Vector similarity is the only information present in Word Space: semantically related words are close, unrelated words are distant.” (page 896)

2.2.1 Definition

Word-space models comprise a family of related methods for representing concepts in a high-dimensional vector space. In this thesis, we suggest a name for them: semantic vector models. The models include some well-known approaches such as Latent Semantic Indexing [17] and Hyperspace Analogue to Language [21].

Documents and queries are represented as vectors. To weight the vector components, we study one famous scheme that has been developed: tf-idf weighting (see Section 2.2.4 below).

The core principle is that semantic similarity can be represented as proximity in an n-dimensional vector space, where n can be 1 or a very large number. We consider the 1-dimensional and 2-dimensional word spaces in Figure 2.1:

Figure 2.1: Geometric representation of words

The geometric representation above shows a few simple Vietnamese words. In both semantic spaces, “ô_tô” is closer in meaning to “xe_hơi” than to “xe_đạp” and “xe_máy”.

The definition of a term depends on the application. Typically, terms are single words or longer phrases. If words are chosen as terms, the dimensionality of the vector is the number of words in the vocabulary.

2.2.2 Semantic similarity

As we have seen in the definition, the word-space model is a model of semantic similarity built on the geometric metaphor of meaning: meanings are locations in a semantic space, and semantic similarity is proximity between those locations. The term-document vector represents the context of a term at low granularity. Alternatively, a term vector can be created from the words surrounding the target term to compute a semantic vector [21]; this is another kind of semantic vector model. To compare semantic similarity in a semantic vector space, one can measure the angle between two vectors.

In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle itself:

$$\cos(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|}$$

A cosine value of zero means that the query vector and the document vector are orthogonal: there is no match. The higher the cosine similarity, the closer the semantic similarity of the two terms.
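To make the measure concrete, the following is a minimal Python sketch of the cosine measure (our own illustration; the function and variable names are not from the thesis):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention: an all-zero vector matches nothing
    return dot / (norm1 * norm2)

# Two toy context vectors: same direction -> similarity 1.0
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (orthogonal, no match)
```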

2.2.3 Document-term matrix

A document-term matrix and a term-document matrix are mathematical matrices that record the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. In a term-document matrix, rows correspond to words or terms and columns correspond to documents. To determine the values of these matrices, one weighting scheme is tf-idf.

A simple example of a document-term matrix:

D1 = “tôi thích chơi ghita.”

D2 = “tôi ghét ghét chơi ghita.”

Then the document-term matrix is:

       tôi  thích  ghét  chơi  ghita
  D1    1     1     0     1     1
  D2    1     0     2     1     1

Table 2.1: An example of a document-term matrix

The matrix shows how many times each term appears in each document. We describe the more detailed tf-idf scheme in the next part.
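As an illustration, the matrix above can be computed with a few lines of Python (a sketch of ours, not code from the thesis):

```python
from collections import Counter

documents = {
    "D1": "tôi thích chơi ghita .",
    "D2": "tôi ghét ghét chơi ghita .",
}

# Vocabulary: all unique terms, in order of first appearance
vocab = []
for text in documents.values():
    for tok in text.split():
        if tok != "." and tok not in vocab:
            vocab.append(tok)

# One row of term counts per document
matrix = {}
for name, text in documents.items():
    counts = Counter(t for t in text.split() if t != ".")
    matrix[name] = [counts[term] for term in vocab]

print(vocab)          # ['tôi', 'thích', 'chơi', 'ghita', 'ghét']
print(matrix["D2"])   # [1, 0, 1, 1, 2]
```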


2.2.4 Example: tf-idf weights

In the classic semantic vector model [31], the term weights in the document vectors are products of local and global parameters. This is called the term frequency–inverse document frequency (tf-idf) model:

$$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$

where $\mathrm{tf}_{t,d}$ is the term frequency of term t in document d, and

$$\mathrm{idf}_t = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

is the inverse document frequency; |D| is the number of documents and $|\{d' \in D : t \in d'\}|$ is the number of documents in which term t occurs.

The distance between a document $d_j$ and a query $q$ can be calculated with the cosine measure introduced above:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|}$$
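As a worked sketch (ours, not the thesis's code), the weighting can be computed directly from the formula:

```python
import math

docs = [
    "tôi thích chơi ghita".split(),
    "tôi ghét ghét chơi ghita".split(),
]

def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`: term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)        # documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# 'ghét' occurs twice in D2 and in no other document, so it is weighted highly
print(tf_idf("ghét", docs[1], docs))  # 2 * log(2/1) ≈ 1.386
print(tf_idf("tôi", docs[1], docs))   # 1 * log(2/2) = 0.0
```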


2.2.5 Applications

Over the past 20 years, the semantic vector model has developed strongly; it is useful for performing many important natural language processing tasks. Applications in which semantic vector models play a great role include:

Information retrieval [7]: It is the basic foundation for creating applications that are fully automatic and widely applicable across different languages or cross-language. The system has flexible input and output options: typically, the user queries any combination of words or documents, and the system returns documents or words. Therefore, it is very easy to build a web interface for users. Regarding cross-language information retrieval, semantic vector models are more convenient than other systems for querying in one language and matching relevant documents or articles in the same or other languages, because they rely on fully automatic corpus analysis, while machine translation requires vast lexical resources. Machine translation systems are very expensive to develop and lack coverage of the full lexicon of a language.

Information filtering [9]: Information retrieval needs a relatively stable database and depends on user queries, while information filtering (IF) handles relatively stable information needs over a rapidly changing data stream. IF also uses further techniques such as information routing and text categorization or classification.

Word sense discrimination and disambiguation [28]: The main idea is clustering the weighted sum of vectors for words found in a paragraph of text, called the context vector of a word. It also uses the co-occurrence matrix (see Section 2.3.2); the appearance of an ambiguous word can then be mapped to one of these word senses.

Document segmentation [3]: Computing the context vector of a region of text helps categorize the kind of document the text belongs to. Given a document, the system can determine whether it belongs to, say, a sports, politics or law topic.

Lexical and ontology acquisition [19]: Starting from the knowledge of a few given words, called seed words, and their relationships, one can acquire many other similar words whose semantic vectors are nearby.

2.3 Word space model algorithms

In this section, we discuss common word space model algorithms. There are two kinds of approaches to implementing the word space model: probabilistic approaches and context vector approaches. We pay attention to the context vector approach, which is the common way to compute semantic similarity. We introduce word co-occurrence matrices, which represent the context vectors of words. Then, we study some similarity measures for calculating the distance between two context vectors.

2.3.1 Context vector

Context plays an important role in NLP. The quality of contextual information is heavily dependent on the size of the training corpus: with less data available, extracting contextual information for any given phenomenon becomes less reliable [24]. But extracting semantics depends not only on the training corpus but also on the algorithms we use. There are many methods to extract context from a data set, but the results are often very different.

Formally, a context relation is a tuple (w, r, w′) where w is a headword occurring in some relation type r with another word w′ in one or more sentences. Each occurrence extracted from raw text is an instance of a context; that is, a context relation/instance follows the type/token distinction. We refer to the tuple (r, w′) as an attribute of w.

The context instances are extracted from the raw text, counted and stored in attribute vectors. Comparing attribute vectors gives us a way to compare the contexts of words, and thus to deduce semantic similarity.

2.3.2 Word Co-occurrence Matrices

The approach developed by Schütze and by Qiu & Frei has become standard practice for word-space algorithms [16]. The context of a word is defined as the rows or columns of a co-occurrence matrix: data is collected in a matrix of co-occurrence counts.

Definition

Formally, the co-occurrence matrix can be a words-by-words matrix: a square W × W matrix, where W is the number of unique words in the parsed free-text corpus (for Vietnamese, we need a word segmentation step before this). A cell m_{i,j} holds the number of times word w_i co-occurs with word w_j within a specific context, a natural unit such as a sliding window of m words. Note that we process the upper and lower words before performing this step.

Another co-occurrence matrix is the words-by-documents matrix W × D, where D is the number of documents in the corpus. A cell f_{i,j} of this matrix shows the frequency of appearance of word w_i in document j. A simple example of words-by-words co-occurrence data is the following:

tôi: (lãng mạn 1), (kiểm tra 1), (nói 1)

sinh viên: (bồi thường 1), (lãng mạn 1), (kiểm tra 1), (nói 1)

bồi thường: (sinh viên 1), (nhân viên 1)

lãng mạn: (tôi 1), (sinh viên 1)

nhân viên: (bồi thường 1), (kiểm tra 1)

kiểm tra: (tôi 1), (sinh viên 1), (nhân viên 1)

nói: (tôi 1), (sinh viên 1)

Table 2.2: Word co-occurrence table

Table 2.3: Co-occurrence matrix
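The following sketch shows how such a words-by-words matrix can be populated with a symmetric sliding window (our own illustration; the window size and the toy sentences are assumptions, not the thesis's exact settings):

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """For each word, count how often every other word appears
    within `window` positions on either side of it."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

# Word-segmented Vietnamese sentences (multi-word lexical items joined by '_')
sents = [["tôi", "lãng_mạn"], ["sinh_viên", "nói"], ["tôi", "nói"]]
m = cooccurrence_counts(sents, window=2)
print(dict(m["tôi"]))  # {'lãng_mạn': 1, 'nói': 1}
```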

Both documents and windows can be used to compute semantic vectors, but it is easy to see that the quality of windows (high granularity) is greater than that of the low-granularity document context.

Instantiations of the vector-space model

The vector-space model was developed by Gerald Salton and colleagues in the 1960s in the SMART information-retrieval system [13]. Nowadays, many information retrieval systems use both types of weighting. One is the weighting of the traditional vector space model, introduced by Robertson & Sparck Jones (1997); the other is known as the TF-IDF family of weighting schemes. This is true for semantic vector algorithms that use a words-by-documents co-occurrence matrix.

A words-by-words co-occurrence matrix counts the number of times word i co-occurs with another word j. Co-occurrence is often counted in a context window spanning some number of words. When we count in both directions (one word to the left and one to the right, two words to the left and two to the right, …) of the target word, the matrix is called a symmetric words-by-words co-occurrence matrix, in which each row equals the corresponding column. In this thesis, we use a context window extending to both sides of the target word in our experiments, and we evaluate results for a few different context sizes. However, if we count in only one direction (left or right side), the matrix is called a directional words-by-words co-occurrence matrix. We can refer to the former as a left-directional words-by-words matrix and to the latter as a right-directional words-by-words matrix. The table above describes the right-directional words-by-words matrix for the example data; the row and column vectors of a word are different. The row vector contains co-occurrence counts with words that have occurred to the right of the word, while the column vector contains co-occurrence counts with words that have occurred to its left.


2.3.3 Similarity Measure

This section describes how to define and evaluate word similarity. Measuring semantic similarity involves devising a function for measuring the similarity between two context vectors. The context extractor returns a set of context relations with their instance frequencies, which can be represented in nested form (w, (r, w′)).

For a high-level comparison, measuring functions from several families can be used: geometric distances, information retrieval measures, set generalizations, information theory and distributional measures [24]. We only do research on geometric distances, because they are very popular and easy to understand. We define the similarity between words w1 and w2 by some geometric distances:

Euclidean distance:

$$\mathrm{dist}_{L_2}(w_1, w_2) = \sqrt{\sum_{i=1}^{n} (w_{1,i} - w_{2,i})^2}$$

Manhattan (city-block) distance:

$$\mathrm{dist}_{L_1}(w_1, w_2) = \sum_{i=1}^{n} |w_{1,i} - w_{2,i}|$$

Other geometric measures, such as the cosine measure introduced in Section 2.2.2, can be used as well.
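The two distances can be sketched directly (our own illustration):

```python
import math

def euclidean(v1, v2):
    """L2 distance between two context vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def manhattan(v1, v2):
    """L1 (city-block) distance between two context vectors."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

w1, w2 = [1, 0, 2], [0, 0, 2]
print(euclidean(w1, w2))  # 1.0
print(manhattan(w1, w2))  # 1
```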


2.4 Implementation of word space model

In this section, we discuss some problems with word space algorithms: the high-dimensional matrix and data sparseness. We then look at different implementations of the word space model and discuss some of their advantages and disadvantages.

2.4.1 Problems

High dimensional matrix

This is a great problem in the process of building the co-occurrence matrix. When writing an algorithm for the word-space model, the choice of vector space similarity is not the only design choice; another important issue is how to handle high-dimensional data for the context vectors.

If we do not have enough data, we will not have a platform on which to build models of word distribution. At the same time, the co-occurrence matrix will become prohibitively large for any reasonably sized data set, affecting the scalability and efficiency of the algorithm. This leads to a delicate dilemma: we need much data to build the co-occurrence matrix, but we are then limited by the computational capacity needed for a high-dimensional matrix.

Data sparseness

Another problem in creating vectors in the word-space model is that many cells in the co-occurrence matrix will be equal to zero. This is called data sparseness: only a small fraction of the co-occurrence events (of two words, or of a word in a document) that are possible in the matrix will actually occur, regardless of the data size. The vast majority of words occur in only a very limited number of contexts with other words. This behavior is well known and is associated with Zipf's law [34]. To solve the data sparseness problem, dimension reduction techniques can be used.

Dimension reductions

The solution to the high-dimensionality problem is usually called dimensionality reduction. A matrix built for representing terms or documents has high dimensionality. There are many methods to restructure high-dimensional data in a low-dimensional space so that both the dimensionality and the sparseness of the data are decreased. It then becomes much easier to compute or compare context vectors.

In this thesis, we introduce only one approach to this: Singular Value Decomposition (SVD) [10]. The SVD method is used especially in numerical mathematics, for example for solving many linear systems with computationally reasonable accuracy.

Some modern image compression methods are based on applying an SVD to the image (a matrix of color values). This is one possible application of the reduction model.

In particle physics, one uses the singular value decomposition to diagonalize the mass matrices of Dirac particles. The singular values give the masses of the particles in their mass eigenstates. From the transformation matrices U and V one constructs, for example, the CKM matrix, which expresses the mass eigenstates of particles as a mixture of flavor eigenstates.

The singular value decomposition has a complexity of $O(n^2 k^3)$, where n is the number of documents plus the number of terms and k is the number of dimensions.
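A minimal sketch of SVD-based reduction with NumPy (our own illustration; the toy matrix and the choice k = 2 are arbitrary assumptions):

```python
import numpy as np

# Toy term-document matrix: 5 terms x 4 documents
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                    # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                 # singular values, largest first
print(np.round(A_k, 2))  # rank-k approximation of A
```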

In addition, we introduce another technique to solve the high dimensionality problem: Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003). LDA provides a probabilistic generative model that treats documents as probabilistic mixtures of underlying topics [2]. The model then uses an EM algorithm to estimate the k topic parameters and to compute the model for each document, where N is the number of words in the document.

2.4.2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a useful method for information retrieval [7]. Techniques such as LSI are particularly relevant to search over large data sets, such as documents on the Internet. The goal of LSI is to find the major components of documents. These principal components can be thought of as general concepts. For example, in English, "building" is such a concept, covering terms such as "house" and "tower". Thus, this method is suitable, for example, for very many documents and articles (such as those on the Internet) that are about cars, even if the word "auto" does not explicitly occur in them. In addition, LSI can help find articles that are really about cars and distinguish them from those in which the word "car" is merely mentioned, for example on sites where a car is offered as a prize.

The semantics determined by LSI is calculated and stored in a matrix, which in this case is called the semantic space: a multi-dimensional table into which LSI enters semantic relationships. Newly added content must be incorporated, which constantly requires new calculations. In the process of LSI, the dimensions of the matrix can be reduced, since semantically related contents are grouped and categorized. Because of the reduced matrix, calculations are simplified. The open question is how far one should reduce the dimensions.

Mathematical Background

LSI approximates the term-frequency matrix [20] by a singular value decomposition. It is a method that reduces the dimension of the matrix to the semantic units of a document, further simplifying the calculation.

LSI is an additional procedure built on top of vector space retrieval. The familiar term-document (TD) matrix from vector space retrieval is further processed by LSI in order to shrink it. This is useful in particular for larger document collections, since the TD matrices are generally very large. The TD matrix is decomposed by singular value decomposition, which helps decrease the complexity of computing. At the end of the algorithm there is a new, smaller TD matrix, in which the terms of the original TD matrix are generalized into concepts.

Algorithm

The main steps of latent semantic indexing:

1. The term-document matrix is calculated and, where appropriate, weighted.

2. The term-document matrix A is then decomposed into three components by singular value decomposition:

$$A = U S V^T$$

The two orthogonal matrices U and V contain the eigenvectors of $A A^T$ and $A^T A$ respectively, and S is a diagonal matrix with the square roots of the eigenvalues of $A^T A$, also called the singular values.

3. The eigenvalues in the resulting matrix S control the dimension reduction. This is done by successively omitting the smallest singular values, up to a chosen limit k.

4. To process a search query q, it is mapped into the semantic space; the query is treated as a special case of a document. The (possibly weighted) query vector q is mapped with the following formula:

$$q_k = S_k^{-1} U_k^T \, q$$

where $S_k$ contains the first k diagonal elements of S and $U_k$ the first k columns of U.

5. Each document is represented, like q, in the semantic space. After that, q is compared with the document vectors using, for example, the cosine similarity or the scalar product.
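The steps above can be sketched in a few lines of NumPy (our own illustration of the textbook procedure, not the thesis's implementation):

```python
import numpy as np

# Step 1: a toy (possibly weighted) term-document matrix, 5 terms x 4 documents
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Step 2: singular value decomposition A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: keep only the k largest singular values
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Step 4: fold a query (a term-count vector) into the semantic space
q = np.array([1, 0, 0, 0, 1], dtype=float)  # query containing terms 0 and 4
q_k = np.linalg.inv(S_k) @ U_k.T @ q

# Step 5: compare the query with every document by cosine similarity
docs_k = Vt_k.T  # document coordinates in the semantic space
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.round(sims, 3))  # one similarity score per document
```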

Advantages and Disadvantages

The semantic space (that is, the TD matrix reduced to meanings) reflects the underlying structure of the documents in terms of semantics. The approximate position in the vector space of vector space retrieval is kept. The projections onto the eigenvalues then give the membership of a concept (steps 4 and 5 of the algorithm). Latent Semantic Indexing elegantly solves the synonym problem, but only partly solves polysemy, i.e., the fact that the same word can have different meanings. The algorithm is computationally very intensive: the complexity of singular value decomposition is $O(n^2 k^3)$, where n is the number of documents plus the number of terms and k is the number of dimensions. This problem can be bypassed by computing economically from the start on a reduced TD matrix, using the Lanczos method. The singular value decomposition must also be constantly repeated when new terms or documents arrive. Another problem is the dimension problem: into how many dimensions should the term-document matrix be reduced, i.e., how big should k be?

Applications

An increasingly important future technology for search engines is Latent Semantic Indexing, which is already partly in use. Previously, search engines listed in their results only web pages that contained the exact keyword or keyword combination; Latent Semantic Indexing extends this service. This technological development makes it possible for search engines to understand the meaning and importance of the text content of web sites, and thus to classify pages related to a given subject as relevant even though the search term does not occur in them in its exact form. Pages on topics related to the requested keyword are then also shown in the search results. At this point, semantics is needed. Semantics is a field of language research concerned with the meaning and significance of, say, words and their relationships. This research is important for Latent Semantic Indexing, so that a search engine can learn which text contents from various websites fit together thematically and should therefore be taken into account for a search input, even though the keyword is found only latently. So the keyword occurs "in appearance, but is not available", which is the definition of "latent".

2.4.3 Hyperspace Analogue to Language

The Hyperspace Analogue to Language (HAL) [21] model creates context vectors using only the words that immediately surround a target word. The HAL model builds a words-by-words co-occurrence matrix which, in contrast to LSI, was developed specially for semantic vector research.

HAL uses a word-by-word co-occurrence matrix populated by counting word co-occurrences. HAL computes an N × N matrix, where N is the number of words in its lexicon, using a directional context window 10 words wide. The co-occurrence counts are used together with the distance between words. Building the directional co-occurrence matrix enables calculating the directional context of words.

The size of the HAL matrix yields a very high-dimensional context vector, sometimes twice as large as the size of the lexicon in the corpus. In contrast to LSI, which uses SVD to reduce the size, HAL normalizes the dimensionality by computing the variances of the row and column vectors for each word and discarding the elements with the lowest variance, leaving only the 100 to 200 most variant vector elements; the matrix size reduction step is thus not an original part of HAL. HAL takes the following steps: … (with the target word).
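The variance-based reduction can be sketched as follows (our own NumPy illustration with toy sizes; HAL itself keeps the 100 to 200 most variant elements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy HAL matrix over 6 words: rows count one direction, columns the other
hal = rng.integers(0, 5, size=(6, 6)).astype(float)

# Full context vector of a word = its row concatenated with its column
vectors = np.hstack([hal, hal.T])       # shape (6, 12)

# Keep only the n most variant components across the vocabulary
n_keep = 4
variances = vectors.var(axis=0)
keep = np.argsort(variances)[-n_keep:]  # indices of highest-variance components
reduced = vectors[:, keep]              # shape (6, 4)
print(reduced.shape)
```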

To improve the context vectors and the results in semantic similarity, a new probabilistic technique based on the principle of HAL has been proposed, called the probabilistic hyperspace analogue to language [33].

2.4.4 Random Indexing

… Generate the new matrix of documents in the new dimensionality by adding the corresponding term vector each time a document contains a term.

The above steps are summarized in the equation:

$$M'_{t \times N} = R_{t \times d} \, M_{d \times N}$$

where $R_{t \times d}$ is the random document (projection) matrix and $M_{d \times N}$ is the term matrix, spanning N terms in d documents. The term matrix is reduced to t dimensions through the formula above.
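A minimal random projection sketch (our own illustration; we use a dense Gaussian random matrix for simplicity, whereas Random Indexing proper uses sparse ternary random vectors):

```python
import numpy as np

rng = np.random.default_rng(42)

d, N, t = 1000, 50, 20  # d documents, N terms, t reduced dimensions
M = rng.integers(0, 3, size=(d, N)).astype(float)  # term matrix, d x N

# Random projection matrix R, t x d
R = rng.standard_normal((t, d)) / np.sqrt(t)

M_reduced = R @ M       # t x N: one t-dimensional context vector per term
print(M_reduced.shape)  # (20, 50)
```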


Figure 2.3: The processes of Random indexing

Applications in information retrieval

Whether textual web content is indexed as "semantically close" or "semantically distant" to a keyword depends on the semantic words, phrases, word combinations, synonyms, antonyms and the like found on the various web pages. Of course, words that appear regularly in every text, for example words like "and" and "the", remain unnoticed. After comparing the websites, they are rated semantically close if many content words match, and semantically distant if there are hardly any matching content words. Semantically close websites are listed in the search results earlier than less relevant websites, even if the keyword does not appear on them. Search engines can use RI to display more relevant web pages for a search, without performing a variety of extra exact-match queries. RI will continue to develop and is a beginning for managing more and more information on the Web by using content categories.


Chapter 3

Semantic Similarity Finding System

We built a complete system to find synonyms in Vietnamese. Our system operates the word-space model based on the Random Indexing (RI) approach. RI can be used to produce both LSI- and HAL-type word space model implementations: RI produces an LSI-type model when it creates the context vectors of terms according to the documents they occur in, and it produces a HAL-type model when it builds the co-occurrence matrix of all terms in the free-text corpus within a narrow window size. In the latter case, the context vectors are built from the words that immediately surround the target word.

3.1 System Description

The Semantic Similarity Finding System contains three components:

1. Word segmentation:

This component is used in pre-processing. Lexical items in Vietnamese can consist of one or two words. We used the WS4VN package [8] to segment all Vietnamese lexical items in the documents.

2. Lucene indexing component:

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The source code and documentation of Lucene are available at http://lucene.apache.org.

3. Semantic Vectors package:

The Semantic Vectors package is open source and efficiently builds semantic vectors, i.e., context vectors of words and documents, from a raw free-text corpus. It implements the RI approach. In the following, we present an overview of the Semantic Vectors package and introduce the functions of this package that our system uses for finding synonyms in Vietnamese.

3.2 System Processes Flow

Figure 3.1: The processes of Semantic Similarity Finding System


References
[1] D. Appelt. An Introduction to Information Extraction. Artificial Intelligence Communications, 12, 1999.

[2] David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003) 993–1022.

[3] Thorsten Brants, Francine Chen, and Ioannis Tsochantaridis. Topic-based document segmentation with probabilistic latent semantic analysis. In Conference on Information and Knowledge Management (CIKM), pages 211–218, 2002.

[4] M.W. Berry, S.T. Dumais & G.W. O'Brien (1994). Using linear algebra for intelligent information retrieval. Computer Science Department.

[5] J. Cowie and W. Lehnert. Information Extraction. In Communications of the ACM, 39, 1996.

[6] H. Cunningham. Information Extraction: a User Guide (revised version). Research Memorandum CS-99-07, Department of Computer Science, University of Sheffield, May 1999.

[7] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(16):391–407.

[8] Dang Duc Pham, Giang Binh Tran, Son Pham Bao (2009). A hybrid approach to Vietnamese Word Segmentation using Part of Speech tags. International Conference on Knowledge and Systems Engineering.

[9] Mohammad Emtiyaz Khan. Matrix Inversion Lemma and Information Filter. Honeywell Technology Solutions Lab, Bangalore, India.

[10] Edel Garcia (2006). Singular Value Decomposition (SVD): A Fast Track Tutorial. First published September 11, 2006; last updated September 12, 2006.

[11] Katherine Heller, Adam Sanborn, Nick Chater. Hierarchical Learning of Dimensional Biases in Human Categorization. Department of Engineering, University of Cambridge, Cambridge CB2 1PZ.

[16] Khoo, C., & Na, J.C. (2006). Semantic Relations in Information Science. Annual Review of Information Science and Technology, 40, 157–228.

[17] Thomas K. Landauer (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259–284.

[18] Raymond Kosala, Hendrik Blockeel (2001). Web Mining Research: A Survey. Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium.

[19] Sergei Nirenburg, Victor Raskin and Svetlana Sheremetyeva. Lexical Acquisition. Computing Research Laboratory, New Mexico State University.

[20] Claes Neuefeind, Fabian Steeg (2009). Information-Retrieval: Vektorraum-Modell. Text-Engineering I - Information-Retrieval - Wintersemester 2009/2010 - Informationsverarbeitung - Universität zu Köln.

[21] Ulrik Petersen (2009). Emdros HAL example (Hyperspace Analogue to Language).

[22] Lund, Kevin and Curt Burgess (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments and Computers, Volume 28, number 2, pp. 203–208.

[23] Robertson, S., & Sparck Jones, K. (1997). Simple, proven approaches to text retrieval (Technical report No. 356). Computer Laboratory, University of Cambridge.

[24] James Richard Curran (2004). From Distributional to Semantic Similarity. Doctor of Philosophy, Institute for Communicating and Collaborative Systems.
