Nguyen Tien Dat
FINDING THE SEMANTIC SIMILARITY IN
VIETNAMESE
GRADUATION THESIS
Major Field: Computer Science
Supervisor: Dr. Phạm Bảo Sơn
Our thesis examines the quality of semantic vector representations built with random projection and the Hyperspace Analogue to Language (HAL) model in the context of research on Vietnamese. The main goal is to find semantic similarity, that is, to study synonyms in Vietnamese. We are also interested in the stability of our approach, which uses Random Indexing and HAL to represent the semantics of words and documents. We built a system to find synonyms in Vietnamese, called the Semantic Similarity Finding System. In particular, we also evaluate the synonyms returned by our system.
Keywords: Semantic vector, Word space model, Random projection, Apache Lucene
Acknowledgments
First of all, I wish to express my respect and my deepest thanks to my advisor, Dr. Pham Bao Son, University of Engineering and Technology, Viet Nam National University, Ha Noi, for his enthusiastic guidance, warm encouragement and useful research experience.
I would like to gratefully thank all the teachers of the University of Engineering and Technology, VNU, for the invaluable knowledge which they provided me during the past four academic years.
I would also like to send my special thanks to my friends in the K51CA class and the HMI Lab.
Last, but not least, my family is really the biggest motivation for me. My parents and my brother always encourage me when I face stress and difficulty. I would like to send them great love and gratefulness.
Ha Noi, May 19, 2010 Nguyen Tien Dat
Chapter 1
Introduction
Finding semantic similarity is an interesting problem in Natural Language Processing (NLP). Determining the semantic similarity of a pair of words is important in many NLP applications such as: web mining [18] (search and recommendation systems), targeted advertisement and domains that need semantic content matching, word sense disambiguation, and text categorization [28][30]. There is not much research done on semantic similarity for Vietnamese, while semantic similarity plays a crucial role in human categorization [11] and reasoning; computational similarity measures have also been applied to many fields such as: semantics-based information retrieval [4][29], information filtering [9] and ontology engineering [19].
Nowadays, the word space model is often used in research on semantic similarity. Specifically, there are many well-known approaches for representing the context vector of a word, such as: Latent Semantic Indexing (LSI) [17], Hyperspace Analogue to Language (HAL) [21] and Random Indexing (RI) [26]. These approaches have been introduced and have proven useful in implementing the word space model [27].
In our thesis, we take up the word space model and its implementation for computing semantic similarity. We have studied each method and investigated its advantages and disadvantages in order to select a suitable technique to apply to Vietnamese text data. Then, we built a complete system for finding synonyms in Vietnamese, called the Semantic Similarity Finding System. Our experimental results on the task of finding synonyms are promising.
Our thesis is organized as follows. First, in Chapter 2, we introduce the background knowledge about the word space model and also review some of the solutions that have been proposed for implementing it. In Chapter 3, we then describe our Semantic Similarity Finding System for finding synonyms in Vietnamese. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusions and future work.
2.1.1 Synonym and Hyponymy
Synonymy is the sameness, or at least similarity, of the meaning of different linguistic expressions. Two words are synonymous if they have the same meaning [15]. Words that are synonyms are said to be synonymous, and the state of being a synonym is called synonymy. For example, in English, the words "car" and "automobile" are synonyms. In the figurative sense, two words are often said to be synonyms if they have the same extended meaning or connotation.
Synonyms can be any part of speech (e.g. noun, verb, adjective or pronoun), as long as the two words of a pair are the same part of speech. Some examples of Vietnamese synonyms:
độc giả - bạn đọc (noun)
chung quanh – xung quanh (pronoun)
bồi thường – đền bù (verb)
In the linguistics dictionary, the synonym is defined by three concepts:
1. A word having the same or nearly the same meaning as another word or other words in a language.
2. A word or an expression that serves as a figurative or symbolic substitute for another.
3. Biology: A scientific name of an organism or of a taxonomic group that has been superseded by another name at the same rank.
In linguistics, a hyponym is a word whose meaning is included in that of another word. For example, "dog" is a hyponym of "animal".
2.1.2 Antonym and Opposites
In lexical semantics, opposites are words that stand in a relationship of binary incompatibility, as in: female – male, long – short, and to love – to hate. The notion of incompatibility refers to the fact that one word in an opposite pair entails that it is not the other pair member. For example, "something that is long" entails that "it is not short". Since there are two members in a set of opposites, it is referred to as a binary relationship. The relationship between opposites is called opposition.
Opposites are simultaneously similar and different in meaning [12]. Usually, they differ in only one dimension of meaning, but are similar in most other aspects, such as grammar and distribution. Some words are non-opposable; for example, animal or plant species have no binary opposites or antonyms. Opposites may be viewed as a special type of incompatibility:
For example, incompatibility is also found in the opposite pair "fast – slow": "It's fast" entails "It's not slow".
Some features of opposites are given by Cruse (2004): binarity, inherentness, and patency. In this section, we also introduced antonyms, which are gradable opposites: they are located at the opposite ends of a continuous spectrum of meaning.
Words can have several different antonyms, depending on the meaning or context of the word. We study antonyms to clarify their role as a fundamental part of a language, in contrast to synonyms.
2.2 Word-space model
The word-space model is an algebraic model for representing text documents, or any objects (phrases, paraphrases, terms, ...), as vectors. It uses a mathematical model, the vector, to identify or index terms in text documents. The model is useful in information retrieval; see [29] for more about the vector space model for information retrieval. The term word space is due to Hinrich Schütze (1993):
"Vector similarity is the only information present in Word Space: semantically related words are close, unrelated words are distant." (p. 896)
2.2.1 Definition
Word-space models comprise a family of related methods for representing concepts in a high-dimensional vector space. In this thesis, we suggest one name, semantic vector model, throughout our work. The models include some well-known approaches such as: Latent Semantic Indexing [17] and Hyperspace Analogue to Language [21].
Documents and queries are represented as vectors, with one component per index term:

$$d_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$

To determine the component weights $w_{i,j}$, we study one famous scheme that has been developed: tf-idf weighting (see Section 2.2.4 below).
The core principle is that semantic similarity can be represented as proximity in an n-dimensional vector space, where n can be 1 or a very large number. We consider a 1-dimensional and a 2-dimensional word space in Figure 2.1.
Figure 2.1: Word geometric representation
The geometric representation above shows some simple Vietnamese words. As the example shows, in both semantic spaces "ô_tô" is closer in meaning to "xe_hơi" than to "xe_đạp" and "xe_máy".
The definition of a term depends on the application. Typically, terms are single words or longer phrases. If words are chosen to be terms, the dimensionality of the vector is the number of words in the vocabulary.
2.2.2 Semantic similarity
As we have seen in the definition, the word-space model is a model of semantic similarity. Put differently, it rests on the geometric metaphor of meaning: meanings are locations in a semantic space, and semantic similarity is proximity between the locations. The term-document vector represents the context of a term at low granularity. Alternatively, a term vector can be built from the few words surrounding the target word to compute the semantic vector [21]; this is another kind of semantic vector model. To compare semantic similarity in a semantic vector model, we measure the angle between the two context vectors.
In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle itself:

$$\cos\theta = \frac{\vec{d_1} \cdot \vec{d_2}}{\lVert \vec{d_1} \rVert \, \lVert \vec{d_2} \rVert}$$

A cosine value of zero means that the query and document vectors are orthogonal and do not match. The higher the cosine value is, the closer the semantic similarity of the two terms is.
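To make this measure concrete, here is a minimal Java sketch of the cosine computation between two context vectors; the class and method names are our own illustration, not part of any library used later in this thesis.

    // Sketch: cosine similarity between two equal-length context vectors.
    public class CosineSimilarity {
        public static double cosine(double[] v1, double[] v2) {
            double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
            for (int i = 0; i < v1.length; i++) {
                dot   += v1[i] * v2[i];
                norm1 += v1[i] * v1[i];
                norm2 += v2[i] * v2[i];
            }
            if (norm1 == 0.0 || norm2 == 0.0) return 0.0; // zero vectors never match
            return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
        }
    }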
2.2.3 Document-term matrix
A document-term matrix and a term-document matrix are mathematical matrices that record the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. In a term-document matrix, rows correspond to words or terms and columns correspond to documents. To determine the values of these matrices, one weighting scheme is tf-idf.
A simple example of a document-term matrix:
D1 = “tôi thích chơi ghita.”
D2 = “tôi ghét ghét chơi ghita.”
Then the document-term matrix is:

          tôi   thích   ghét   chơi   ghita
    D1     1      1      0      1       1
    D2     1      0      2      1       1

Table 2.1: An example of a document-term matrix

The matrix shows how many times each term appears in the documents. For a more refined weighting scheme, we describe tf-idf in the next part.
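As an illustration, the following Java sketch builds such a count table from already word-segmented documents (the class name and whitespace tokenization are our own assumptions, relying on the segmentation step described in Chapter 3):

    import java.util.*;

    // Sketch: document-term counts from word-segmented text.
    public class DocumentTermMatrix {
        // counts.get(term)[d] = frequency of term in document d
        public static Map<String, int[]> build(String[] documents) {
            Map<String, int[]> counts = new LinkedHashMap<>();
            for (int d = 0; d < documents.length; d++) {
                for (String term : documents[d].split("\\s+")) {
                    counts.computeIfAbsent(term, t -> new int[documents.length])[d]++;
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            String[] docs = { "tôi thích chơi ghita", "tôi ghét ghét chơi ghita" };
            build(docs).forEach((t, c) -> System.out.println(t + " " + Arrays.toString(c)));
        }
    }

Running it on the two example documents reproduces the counts of Table 2.1 (transposed: one output line per term).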
2.2.4 Example: tf-idf weights
In the classic semantic vector model [31], the term weights in the document vectors are products of local and global parameters. It is called the term frequency-inverse document frequency (tf-idf) model:

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}$$

where $\mathrm{tf}_{t,d}$ is the term frequency of term t in document d, and $\log\frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency; $|D|$ is the number of documents and $|\{d' \in D : t \in d'\}|$ is the number of documents in which the term t occurs.

The distance between a document $d_j$ and a query $q$ can then be calculated with the cosine measure of Section 2.2.2:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{\lVert \vec{d_j} \rVert \, \lVert \vec{q} \rVert}$$
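A direct Java translation of the weight formula might look like this (a sketch with natural logarithm and no smoothing; production systems usually add both):

    // Sketch: tf-idf weight w_{t,d} = tf_{t,d} * log(|D| / df_t).
    public class TfIdf {
        // tf: frequency of term t in document d; df: documents containing t (df > 0);
        // nDocs: total number of documents |D|.
        public static double weight(int tf, int df, int nDocs) {
            return tf * Math.log((double) nDocs / df);
        }
    }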
2.2.5 Applications
Over the past 20 years, the semantic vector model has been developed strongly; it is useful for performing many important tasks of natural language processing. Applications in which semantic vector models play a great role include:
Information retrieval [7]: This is the basic foundation for creating applications that are fully automatic and widely applicable to different languages or across languages. The system has flexible input and output options: typically, the user queries with any combination of words or documents, while the system returns documents or words. Therefore, it is very easy to build a web interface for users. Regarding cross-language information retrieval, semantic vector models are more convenient than other systems for queries in one language that should match relevant documents or articles in the same or other languages, because they rely on fully automatic corpus analysis, while machine translation requires vast lexical resources. Machine translation systems are very expensive to develop and lack coverage of the full lexicon of a language.
Information filtering [9]: This is a very interesting application. Information retrieval needs a relatively stable database and depends on user queries, while information filtering (IF) serves relatively stable information needs over a rapidly changing data stream. IF also uses further techniques such as: information routing, and text categorization or classification.
Word sense discrimination and disambiguation [28]: The main idea is clustering the weighted sums of vectors for the words found in a paragraph of text, which form the context vector of a word. It also relies on the co-occurrence matrix (see Section 2.3.2); the appearance of an ambiguous word can then be mapped to one of these word senses.
Document segmentation [3]: Computing the context vector of a region of text helps categorize the kind of document this text belongs to. Given a document, the system can tell whether it belongs to, for instance, a sports, politics or law topic.
Lexical and ontology acquisition [19]: Starting from the knowledge of a few given words, called seed words, and their relationships, one acquires many other similar words whose semantic vectors are nearby.
2.3 Word space model algorithms
In this section, we discuss common word space model algorithms. There are two kinds of approaches to implementing the word space model: probabilistic approaches and context vector approaches; we pay attention to the context vector approach, the common way to compute semantic similarity. We introduce word co-occurrence matrices, which represent the context vectors of words. Then, we study some similarity measures for calculating the distance between two context vectors.
2.3.1 Context vector
Context plays an important role in NLP. The quality of contextual information is heavily dependent on the size of the training corpus: with less data available, extracting contextual information for any given phenomenon becomes less reliable [24]. But extracting semantics depends not only on the training corpus but also on the algorithms we use. There are many methods to extract context from a data set, but their results are often very different.
Formally, a context relation is a tuple (w, r, w′) where w is a headword occurring in some relation of type r with another word w′ in one or more sentences. Each occurrence extracted from raw text is an instance of a context relation; that is, the context relation/instance distinction is the type/token distinction. We refer to the tuple (r, w′) as an attribute of w.
The context instances are extracted from the raw text, counted and stored in attribute vectors. Comparing attribute vectors gives us factors for comparing the contexts of words, from which semantic similarity is deduced.
2.3.2 Word co-occurrence matrices
The approach developed by Schutze and by Qiu & Frei has become standard practice for word-space algorithms [16]. The context of a word is defined as the rows or columns of a co-occurrence matrix: data is collected in a matrix of co-occurrence counts.
Definition
Formally, the co-occurrence matrix can be a words-by-words matrix, which is a square W × W matrix, where W corresponds to the number of unique words in the parsed free-text corpus (for Vietnamese, we need a word segmentation step before this job). A cell m_{i,j} is the number of times word w_i co-occurs or appears with word w_j within a specific context – a natural unit such as a sliding window of m words. Note that words are case-normalized before performing this step.
    tôi           (lãng mạn, 1), (kiểm tra, 1), (nói, 1)
    sinh viên     (bồi thường, 1), (lãng mạn, 1), (kiểm tra, 1), (nói, 1)
    bồi thường    (sinh viên, 1), (nhân viên, 1)
    lãng mạn      (tôi, 1), (sinh viên, 1)
    nhân viên     (bồi thường, 1), (kiểm tra, 1)
    kiểm tra      (tôi, 1), (sinh viên, 1), (nhân viên, 1)
    nói           (tôi, 1), (sinh viên, 1)

Table 2.2: Word co-occurrence table
Another co-occurrence matrix is the words-by-documents matrix W × D, where D is the number of documents in the corpus. A cell f_{i,j} of this matrix shows the frequency of appearance of word w_i in document j. The words-by-words co-occurrence matrix corresponding to Table 2.2 is the following:

                 tôi  sinh viên  bồi thường  lãng mạn  nhân viên  kiểm tra  nói
    tôi           0       0          0          1          0          1      1
    sinh viên     0       0          1          1          0          1      1
    bồi thường    0       1          0          0          1          0      0
    lãng mạn      1       1          0          0          0          0      0
    nhân viên     0       0          1          0          0          1      0
    kiểm tra      1       1          0          0          1          0      0
    nói           1       1          0          0          0          0      0

Table 2.3: Co-occurrence matrix
Both documents and windows will be used to compute semantic vectors, but it is easy to see that the quality of window contexts (high granularity) is greater than that of the low-granularity document contexts.
Instantiations of the vector-space model
The vector space model was developed by Gerard Salton and colleagues in the 1960s for the SMART information retrieval system [13]. Nowadays, many information retrieval systems perform both types of weighting: one is the traditional vector space model weighting, introduced by Robertson & Sparck Jones (1997); the other is known as the TFIDF family of weighting schemes. The same holds for semantic vector algorithms that use a words-by-documents co-occurrence matrix.
A words-by-words co-occurrence matrix counts the number of times word i co-occurs with another word j. The co-occurrence is often counted in a context window spanning some number of words. When we count in both directions (one word to the left and one to the right, two words to the left and two to the right, ...) of the target word, the matrix is called a symmetric words-by-words co-occurrence matrix, in which the rows equal the columns. In this thesis, we use a context window extending to both sides of the target word in our experiments, and we evaluate results for a few different context sizes. However, if we count in only one direction (left or right), the matrix is called a directional words-by-words co-occurrence matrix: we refer to counting on the left as a left-directional words-by-words matrix, and to counting on the right as a right-directional words-by-words matrix. In a right-directional words-by-words matrix for the example data, the row and the column vectors of a word are different: the row vector contains co-occurrence counts with words that occurred to the right of the word, while the column vector contains co-occurrence counts with words that occurred to its left.
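A minimal Java sketch of counting symmetric co-occurrences with a window of contextSize words on each side of the target word (names are our own; the input is assumed to be already word-segmented):

    import java.util.*;

    // Sketch: symmetric words-by-words co-occurrence counts, window = +/- contextSize words.
    public class CooccurrenceMatrix {
        public static Map<String, Map<String, Integer>> count(List<String> tokens, int contextSize) {
            Map<String, Map<String, Integer>> m = new HashMap<>();
            for (int i = 0; i < tokens.size(); i++) {
                int from = Math.max(0, i - contextSize);
                int to = Math.min(tokens.size() - 1, i + contextSize);
                for (int j = from; j <= to; j++) {
                    if (j == i) continue; // skip the target word itself
                    m.computeIfAbsent(tokens.get(i), k -> new HashMap<>())
                     .merge(tokens.get(j), 1, Integer::sum);
                }
            }
            return m; // m.get(w1).get(w2) = times w2 appeared within w1's window
        }
    }

Counting only the half-window j > i (or j < i) instead would produce the directional variants described above.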
2.3.3 Similarity Measure
This section describes how to define and evaluate word similarity. The similarity measure is often based on the similarity between two context vectors: measuring semantic similarity involves devising a function for measuring the similarity between context vectors. The context extractor returns a set of context relations with their instance frequencies, which can be represented in the nested form (w, (r, w′)).
For a high-level comparison, one can use measuring functions such as: geometric distances, information retrieval measures, set generalizations, information theory and distributional measures [24]. We only do research on geometric distances because they are very popular and easy to understand. We define the similarity between words w1 and w2 by some geometric distances:
Euclidean distance:

$$\mathrm{dist}(w_1, w_2) = \sqrt{\sum_i \left(v_1(i) - v_2(i)\right)^2}$$

Manhattan (city-block) distance:

$$\mathrm{dist}(w_1, w_2) = \sum_i \left| v_1(i) - v_2(i) \right|$$

where $v_1$ and $v_2$ are the context vectors of $w_1$ and $w_2$. Other geometric measures, such as the cosine measure introduced in Section 2.2.2, can also be used.
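Both distances translate directly into Java (a sketch over plain double arrays, mirroring the cosine example of Section 2.2.2):

    // Sketch: the two geometric distances above, for equal-length context vectors.
    public class GeometricDistances {
        public static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        public static double manhattan(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
            return sum;
        }
    }

Note that these are distances, not similarities: smaller values mean more similar words, the opposite of the cosine measure.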
2.4 Implementation of word space model
In this section, we discuss some problems with word space algorithms: the high-dimensional matrix and data sparseness. Then we look at different implementations of the word space model, and we also discuss some of their advantages and disadvantages.
2.4.1 Problems
High dimensional matrix
This is a great problem in the process of building the co-occurrence matrix. When writing an algorithm for the word-space model, the choice of vector space similarity is not the only design decision to make. Another important issue is how to handle high-dimensional data for the context vectors.
On the one hand, if we do not have enough data, we will not have a platform on which to build models of word distribution. On the other hand, the co-occurrence matrix will become prohibitively large for any reasonably sized data set, affecting the scalability and efficiency of the algorithm. This leads to a delicate dilemma: we need a lot of data to build the co-occurrence matrix, but we are then limited by the computational capacity for high-dimensional matrices.
Data sparseness
Another problem in creating the vectors in the word-space model is that, in a simple approach, many cells in the co-occurrence matrix will equal zero. This is called data sparseness: only a fraction of the co-occurrence events (of two words, or of a word in a document) that are possible in the matrix will actually occur, regardless of the data size. The vast majority of words occur in only a very limited number of contexts with other words. This behaviour is well known and is associated with the general Zipf's law [34]. To solve the data sparseness problem, the dimension reduction techniques described next are commonly applied.
Dimension reduction
The usual solution to the high dimensionality problem is called dimensionality reduction. A matrix built to represent terms or documents has high dimensionality. There are many methods to restructure high-dimensional data in a low-dimensional space, so that both the dimensionality and the sparseness of the data are decreased. It then becomes much easier to compute and compare context vectors.
In this thesis, we introduce only one approach to do this, called Singular Value Decomposition (SVD) [10]. The SVD method is used especially in numerical mathematics, for example to solve many linear systems with computationally reasonable accuracy.
Some modern image compression methods are based on an SVD of the image (the matrix of color values). This is one possible application of the reduction model.
In particle physics, one uses the singular value decomposition to diagonalize the mass matrices of Dirac particles. The singular values give the masses of the particles in their mass eigenstates. From the transformation matrices U and V one constructs, for example, the CKM matrix, which expresses the mass eigenstates of particles as a mixture of flavor eigenstates.
The singular value decomposition has a complexity of $O(n^2 k^3)$, where n is the number of documents plus the number of terms and k is the number of dimensions.
In addition, we introduce another technique to tackle high dimensionality: Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003). LDA provides a probabilistic generative model that treats documents as probabilistic mixtures of underlying topics [2]. The model then uses an EM algorithm to estimate the parameters of the k topics and to compute, for each document of N words, its mixture over the k topics.
2.4.2 Latent Semantic Indexing
Latent Semantic Indexing (LSI) is a useful method for information retrieval [7]. Techniques such as LSI are particularly relevant to search over large data sets, such as documents on the Internet. The goal of LSI is to find the major components of documents. These principal components (concepts) can be thought of as general concepts. For example, in English, building is such a concept, covering terms such as house and tower. Thus, this method is suitable, for example, for finding, among very many documents or articles (such as on the Internet), those which deal with cars, even if the word auto does not occur in them explicitly. In addition, LSI can help to distinguish articles that are really about cars from those in which the word car is merely mentioned, for example on sites where a car is offered as a prize.
The semantics determined by LSI is calculated and represented in a matrix, which in this case is called a semantic space. The matrix is a table into which the multi-dimensional semantic relationships are entered. Newly added content must be incorporated, which constantly requires new calculations. In the process of LSI, the dimensions of the matrix can be reduced, since semantically related contents are grouped and categorized. Because of the reduced matrix, the calculations are simplified. The question is how far one should reduce the dimensions.
Mathematical background
LSI approximates the term frequency matrix [20] by singular values. It is a method that reduces the dimension of the matrix to the semantic units of a document, which further simplifies the calculation.
LSI is an additional procedure that builds on vector space retrieval. The well-known term-document (TD) matrix is further processed by LSI in order to shrink it. This is useful in particular for larger document collections, since the TD matrices are generally very large. The TD matrix is decomposed by singular value decomposition, which helps decrease the complexity of the computation. At the end of the algorithm there is a new, smaller TD matrix, in which the terms of the original TD matrix are generalized into concepts.
Algorithm
The main steps of latent semantic indexing are:
• The term-document matrix is calculated and, where appropriate, weighted.
• The term-document matrix A is then decomposed into three components (singular value decomposition):

$$A = U S V^T$$

The two orthogonal matrices U and V contain the eigenvectors of $A A^T$ and $A^T A$, respectively; S is a diagonal matrix containing the square roots of the eigenvalues of $A A^T$, also called the singular values.
• Via the eigenvalues in the resulting matrix S, one can now control the dimension reduction. This is done by successively omitting the smallest singular values, down to a chosen limit of k values.
• To process a search query q, it is mapped into the semantic space; a query is treated as a special case of a document. The (possibly weighted) query vector q is mapped with the following formula:

$$q_k = S_k^{-1} U_k^T \, q$$

where $S_k$ contains the first k diagonal elements of S and $U_k$ the first k columns of U.
• Each document is represented in the semantic space in the same way as q. After that, q can be compared with the documents using, for example, the cosine similarity or the scalar product.
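As a sketch of steps 2–4, assuming the Apache Commons Math library (commons-math3, not a dependency of this thesis) is available; the library class names below are real, while the wrapper itself is our own illustration:

    import org.apache.commons.math3.linear.*;

    // Sketch: decompose a term-document matrix A and fold a query into k dimensions.
    public class LsiSketch {
        public static RealVector foldInQuery(RealMatrix a, double[] query, int k) {
            SingularValueDecomposition svd = new SingularValueDecomposition(a);
            // U_k: first k columns of U; S_k: top-left k x k block of S
            RealMatrix uk = svd.getU().getSubMatrix(0, a.getRowDimension() - 1, 0, k - 1);
            RealMatrix sk = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
            // q_k = S_k^{-1} * U_k^T * q (the query formula above)
            RealMatrix skInv = new LUDecomposition(sk).getSolver().getInverse();
            return skInv.multiply(uk.transpose()).operate(new ArrayRealVector(query));
        }
    }

Documents are folded in the same way, after which query and documents are compared with the cosine measure.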
Advantages and disadvantages
The semantic space (that is, the TD matrix reduced to meanings) reflects the underlying structure of the documents in terms of their semantics. The approximate positioning from the vector space of vector space retrieval is kept. The projections onto the eigenvalues then give the membership of a concept (steps 4 and 5 of the algorithm). Latent Semantic Indexing elegantly solves the synonym problem, but only partly the polysemy problem, i.e. the fact that the same word can have different meanings. The algorithm is computationally very intensive: the complexity of singular value decomposition is $O(n^2 k^3)$, where n is the number of documents plus the number of terms and k is the number of dimensions. This problem can be bypassed by economically calculating a reduced TD matrix from the start, for which the Lanczos method is used. The singular value decomposition must also be constantly repeated when new terms or documents arrive. Another problem is the dimension problem: to how many dimensions should the term-document matrix be reduced, i.e. how big should k be?
Applications
An increasingly important future technology for search engines is Latent Semantic Indexing, which is already being used in part. Previously, search engines only listed web pages in the search results that contained the exact keyword or keyword combination; Latent Semantic Indexing extends this service. This technological development makes it possible for search engines to understand the meaning and importance of the text content of web sites and thus to classify pages as relevant to a given subject, even though the search term does not occur in them in its exact form. For the requested keyword, thematically related pages are then also shown in the search results. At this point, semantics is needed. Semantics is a field of language research concerned with the meaning and significance of, say, words and their relationships. This research is important for Latent Semantic Indexing: it lets a search engine learn which text contents from various websites fit together thematically and thus should be taken into account for a search input, even though the keyword is found only latently. So the keyword occurs "in appearance, but is not visible", which is the definition of "latent".
2.4.3 Hyperspace Analogue to Language
The Hyperspace Analogue to Language (HAL) model [21] creates context vectors using only the words that immediately surround a target word. The HAL model builds a words-by-words co-occurrence matrix and, in contrast to LSI, was developed specially for semantic vector research.
HAL uses a word-by-word co-occurrence matrix, which is populated by counting word co-occurrences. HAL computes an N×N matrix, where N is the number of words in its lexicon, using a directional context window 10 words wide. The co-occurrence counts are used to measure the distance between words; building the directional co-occurrence matrix enables calculating the directional context of words.
The size of the HAL matrix yields very high-dimensional context vectors, sometimes two times larger than the size of the lexicon of the corpus. In contrast to LSI, which uses SVD to reduce the size, HAL normalizes the dimensionality by computing the variances of the row and column vectors for each word and discarding the elements with the lowest variance, leaving only the 100 to 200 most variant vector elements; the matrix size reduction step is thus not an original part of HAL. Following [21], HAL takes steps such as the following:
• Slide the context window over the corpus and record weighted co-occurrence counts, where the weight is inversely proportional to the distance of a co-occurring word from the target word (nearer words co-occur more strongly with the target word).
• Form the context vector of each word by concatenating its row and column vectors.
• Compute the variance of each vector element and keep only the most variant elements.
To improve the context vectors and the semantic similarity results, a new probabilistic technique based on the principles of HAL has been proposed, called probabilistic hyperspace analogue to language (pHAL) [33].
2.4.4 Random Indexing

Random Indexing (RI) [26] builds semantic vectors by random projection rather than by explicit matrix decomposition. Its final step is:

• Generate the new matrix of documents in the reduced dimension by adding the corresponding term vector each time a document contains a term.

The above steps are summarized in the equation:

$$M'_{t \times N} = R_{t \times d} \, M_{d \times N}$$

where $M_{d \times N}$ is the term matrix spanning N terms in d documents, $R_{t \times d}$ is the random projection matrix, and $M'_{t \times N}$ is the resulting document matrix, reduced to t dimensions through the above formula.
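A minimal Java sketch of the projection step (a dense Gaussian random matrix is used here for clarity; Random Indexing in practice accumulates sparse ternary index vectors instead, and all names are our own):

    import java.util.Random;

    // Sketch: reduce a d x N matrix M to t dimensions via M' = R * M.
    public class RandomProjection {
        public static double[][] project(double[][] m, int t, long seed) {
            int d = m.length, n = m[0].length;
            Random rnd = new Random(seed);
            double[][] r = new double[t][d]; // random projection matrix R
            for (int i = 0; i < t; i++)
                for (int j = 0; j < d; j++)
                    r[i][j] = rnd.nextGaussian() / Math.sqrt(t);
            double[][] out = new double[t][n]; // M' = R * M
            for (int i = 0; i < t; i++)
                for (int k = 0; k < d; k++)
                    for (int j = 0; j < n; j++)
                        out[i][j] += r[i][k] * m[k][j];
            return out;
        }
    }

The Johnson-Lindenstrauss lemma guarantees that such a projection approximately preserves the distances between the column vectors.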
Figure 2.3: The processes of Random Indexing
Applications in information retrieval
Whether textual Web content is indexed as "semantically close" or "semantically distant" to a keyword depends on the semantics of the words, phrases, word combinations, synonyms, antonyms and the like found on the various web pages. Of course, words that appear regularly in every text, for example words like "and" or "the", remain unnoticed. After comparing web pages, they are rated semantically close if many content words coincide, and semantically distant if hardly any content words are shared. Semantically close websites are then listed in the search results ahead of less relevant websites, even if the keyword does not appear on them. Search engines can use RI to display more relevant web pages for a search without performing a variety of extra exact-match queries. RI is still developing, and it is a first step towards managing more and more information on the Web by using content categories.
Chapter 3
Semantic Similarity Finding System
We built a complete system to find synonyms in Vietnamese. Our system operates the word-space model based on the Random Indexing (RI) approach. RI can be used to produce both the LSI and the HAL types of word space model implementation. RI produces the LSI type when it creates the context vectors of terms according to the documents in which they occur. On the other hand, RI produces the HAL type when it is used to build the co-occurrence matrix, within a narrow window size, of all terms in the free-text corpus; in that case, the context vectors are built from the words that immediately surround the target word.
3.1 System Description
The Semantic Similarity Finding System contains three components:
1. Word segmentation:
This component is used in pre-processing. Lexical items in Vietnamese can be one or two words. We used the package WS4VN [8] to segment all Vietnamese lexical items in the documents.
2. Lucene indexing component:
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The source code and documentation of Lucene are available at http://lucene.apache.org.
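For illustration, indexing one segmented document with a Lucene 3.x-era API might look as follows; the exact classes and signatures vary considerably between Lucene versions, so treat this as a sketch rather than the code of our system:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Sketch: add one word-segmented document to a Lucene index.
    public class IndexOneDocument {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("index")),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("contents", "tôi thích chơi ghita",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }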
3. Semantic Vectors package:
The Semantic Vectors package is an open-source library that efficiently builds semantic vectors, or context vectors, of words and documents from a raw free-text corpus. It implements the RI approach. Next, we present an overview of the Semantic Vectors package and introduce the functions of this package that our system uses for finding synonyms in Vietnamese.
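The package is driven from the command line on top of an existing Lucene index. In versions contemporary with this thesis, building and querying a model looked roughly like the following (the index path is an example and the exact flags vary by release):

    # Build term and document vectors from the Lucene index in index/
    java pitt.search.semanticvectors.BuildIndex index/

    # Find the nearest neighbours of a segmented Vietnamese word
    java pitt.search.semanticvectors.Search ô_tô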
3.2 System Processes Flow
Figure 3.1: The processes of Semantic Similarity Finding System