
THE INDEXING ALGORITHM IN SEARCHING ENGINE FOR VIETNAMESE TEXT

Nguyen Dang Tien

Abstract: This work investigates the indexing algorithm, which is mandatory for searching in large-scale text data sets. The problem consists of dividing the data set into words and building metadata for each word in order to boost the speed of search. For Vietnamese text, the ambiguity of the tokenization process in traditional indexing algorithms leads to large lexicons and low precision of results. In this paper, we apply a tokenizer for Vietnamese text to build the lexicon; hence, each record of the lexicon may contain several single words. On the basis of this method, we decrease the size of the lexicon and improve the precision of search while maintaining the complexity of the process. Consequently, the time consumed by each user query is shortened. Simulation shows that our algorithm provides better performance than the traditional strategy in terms of lexicon size and precision.

Keywords: Tokenization, Search engine, Indexor, Lexicon, Page rank

1 INTRODUCTION

We investigate the indexing algorithm, in which text documents are divided into subsets, each identified by a word, and the index is stored in a cache-based search engine. The purpose of storing an index is to provide precise results for a search query while optimizing the speed of the process. In particular, we focus on the problem of building an information retrieval module for large-scale Vietnamese text data and apply it to divide the corpus into fewer subsets than the traditional method. A challenge in the indexing algorithm is to program the computer to identify what forms an individual or distinct word, referred to as a token.

Information retrieval is not a new problem in the literature. Many approaches have been proposed to implement an effective algorithm to extract information from a large-scale corpus. The first class of algorithms is the string matching strategy, which tries to find a place where one or several strings (also called patterns) occur within a larger string or text [1, 8]. Generally speaking, let Σ be an alphabet (a finite set); both the pattern and the searched text are sequences of elements of Σ. Often, Σ is a usual human alphabet (for example, the letters A through Z in the Latin alphabet). The simplest strategy is brute-force searching, where each character of the query and of the pattern is compared in order. However, the time complexity of the brute-force strategy is not feasible for search engines, where the corpus is large.
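As a minimal illustration (not from the paper), a brute-force matcher in C++ might look as follows; the function name and interface are ours:

```cpp
#include <string>
#include <vector>

// Brute-force matching: slide the pattern over the text and compare
// character by character. Worst-case time is O(n * m) for text length n
// and pattern length m, which is why it does not scale to large corpora.
std::vector<std::size_t> bruteForceSearch(const std::string& text,
                                          const std::string& pattern) {
    std::vector<std::size_t> matches;
    if (pattern.empty() || pattern.size() > text.size()) return matches;
    for (std::size_t i = 0; i + pattern.size() <= text.size(); ++i) {
        std::size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j]) ++j;
        if (j == pattern.size()) matches.push_back(i);  // full match at offset i
    }
    return matches;
}
```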

Another approach to string matching is the Knuth–Morris–Pratt (KMP) algorithm [2]. The algorithm searches for occurrences of a "word" W within a main "text string" S by employing the observation that, when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. However, the performance of the KMP algorithm is limited on natural language text. On the basis of the KMP algorithm, Robert S. Boyer and J. Strother Moore [8] proposed the Boyer–Moore string search algorithm, which uses information gathered during a preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string search algorithms.
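A compact sketch of KMP under our own naming, showing the failure-function idea the paragraph describes:

```cpp
#include <string>
#include <vector>

// Failure function: fail[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of it. On a mismatch, the pattern is
// shifted according to this table instead of re-examining matched text
// characters, giving O(n + m) total running time.
std::vector<std::size_t> buildFailure(const std::string& p) {
    std::vector<std::size_t> fail(p.size(), 0);
    for (std::size_t i = 1, k = 0; i < p.size(); ++i) {
        while (k > 0 && p[i] != p[k]) k = fail[k - 1];
        if (p[i] == p[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

std::vector<std::size_t> kmpSearch(const std::string& text,
                                   const std::string& p) {
    std::vector<std::size_t> matches;
    if (p.empty() || p.size() > text.size()) return matches;
    const std::vector<std::size_t> fail = buildFailure(p);
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != p[k]) k = fail[k - 1];
        if (text[i] == p[k]) ++k;
        if (k == p.size()) {                      // full match ends at i
            matches.push_back(i + 1 - p.size());
            k = fail[k - 1];                      // allow overlapping matches
        }
    }
    return matches;
}
```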

The string matching algorithms, however, are only appropriate for small texts. In order to make a search engine perform effectively, one needs to build an intelligent information retrieval module in addition to a crawler and a page ranker. The purpose of this module is to find a particular phrase in a large-scale corpus. For information retrieval in large-scale data sets, the indexing technique is widely used [3, 9]. Similar to the index of a book, the index of a search engine includes information about words and their positions. This technique provides a significant improvement over string matching methods. However, the traditional indexing algorithm is designed for inflectional languages, where words are separated by whitespace. For isolating languages, a better tokenizer is needed for the search engine to "understand" the text.

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes; computers do not "know" that a space character separates words in a document. During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like "noun" or "verb"), position, sentence number, sentence position, length, and line number.

Motivated by the observations above, our idea is to apply Vietnamese tokenization to the indexing algorithm to improve the quality of the lexicon. Moreover, we propose two data structures to store lexicons in memory. The contributions of our work are threefold:

- Apply a Vietnamese tokenization process to break a Vietnamese corpus into words

- Implement an indexing technique to build the lexicon and store it in an appropriate data structure

- Build a computer simulation to demonstrate how well our method works

2 THE STATE-OF-THE-ART INDEXING ALGORITHM

For large-scale data sets (hundreds of megabytes of text data), string matching algorithms lead to infeasible performance. In these cases, an indexing technique is usually applied before performing the search operation. In the following, we describe the state-of-the-art indexing algorithm in detail.

2.1 The model

In the traditional indexing algorithm, two steps are performed before answering a query:

- The first step is to index the corpus; in other words, an index file (lexicon) is created by an indexor.

- In the second step, each query is parsed to retrieve the words it contains.

As described above, the search operation is not performed on the corpus by traditional string-matching algorithms. Instead, the lexicon is used to find the information of each word, and the final results are composed by combining the information of all words in the query.

2.2 The structure of a lexicon

A lexicon is a file that contains a list of items. Each item corresponds to a word in the corpus and has the following format:

{word} {t_i, {p_j}},   i, j = 0, 1, 2, …


where word is a word in the text; t_i are the identifiers of the documents in which the word appears; and p_j are the positions of the word in document t_i.

To search for a single word, we only perform the search over the lexicon instead of finding the word in the whole corpus. The information of each word, including the numbers of the documents that contain it and its positions in each document, can be extracted entirely from the lexicon.

Note that a word can appear in a document multiple times, not to mention that it can appear in different documents of the corpus. Hence, for each value of t_i, we may have a large set of p_j.
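A minimal C++ sketch of one such lexicon record, mirroring the {word}{t_i, {p_j}} format with a nested map as the posting structure; the type and function names are ours, not the paper's:

```cpp
#include <map>
#include <string>
#include <vector>

// One lexicon record per word: for each document id t_i that contains the
// word, the list of positions p_j of the word inside that document.
using PostingList = std::map<int, std::vector<int>>;    // t_i -> {p_j}
using Lexicon     = std::map<std::string, PostingList>; // word -> postings

// Record one occurrence of `word` at position `pos` in document `docId`.
void addOccurrence(Lexicon& lex, const std::string& word, int docId, int pos) {
    lex[word][docId].push_back(pos);
}
```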

2.3 Searching algorithm

For each query, we need to extract the words it contains and perform the search over the lexicon. The information to extract includes the document numbers and the positions of the word in each document. The following example shows how the traditional algorithm works.

Example: There are two documents in the corpus, each containing one sentence:

- Paris (nicknamed the City of Light) is the capital city of France, and the largest city in that country.

- The Greater Tokyo Area is the most populous metropolitan area in the world.

After performing the traditional indexing algorithm, the information in the lexicon is constructed as shown in Table 1.

Table 1. The information in the lexicon: for each word, the IDs of the documents that contain it and its positions in each document (for example, "the" appears in document 1 at positions 3, 8 and 14, and in document 2 at positions 1, 6 and 11)


On the basis of Table 1, we can accurately answer the query “the capital city” without using the documents themselves. For English text, the indexing algorithm can boost the performance of searching in comparison with traditional string matching methods. However, for Vietnamese text, some inaccurate results can be produced. For instance, if we use the query "đại học", the results may include documents in which the words "đại" and "học" are not located in the same place. An accurate result, however, should contain the whole word "đại học".

In order to solve the issue above, we can perform the following steps: (1) search for the words "đại" and "học" separately and independently; (2) then, on the basis of the results, output only the documents in which the whole word "đại học" appears (in other words, the documents where the word "học" stands immediately after the word "đại"); a sketch of this adjacency check is given after the list below. In this paper, we propose another approach to the issue, which includes four steps:

- Use a Vietnamese tokenizer to split documents into Vietnamese words. Each word represents a record in the lexicon. Thanks to the tokenizer, this step makes the lexicon "understand" Vietnamese compound words

- On the basis of the parsed text, build a lexicon

- For each query, apply the same tokenizer to split it into words

- Search the query in the lexicon
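The adjacency check of step (2) of the first approach can be sketched as follows, reusing the lexicon layout from section 2.2 (all names are ours; position lists are assumed sorted, as they are produced in reading order):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

using PostingList = std::map<int, std::vector<int>>;    // doc id -> positions
using Lexicon     = std::map<std::string, PostingList>; // word -> postings

// Return the documents in which `second` occurs immediately after `first`.
std::vector<int> phraseDocs(const Lexicon& lex,
                            const std::string& first,
                            const std::string& second) {
    std::vector<int> result;
    auto itA = lex.find(first);
    auto itB = lex.find(second);
    if (itA == lex.end() || itB == lex.end()) return result;
    for (const auto& [docId, posFirst] : itA->second) {
        auto doc = itB->second.find(docId);
        if (doc == itB->second.end()) continue;   // `second` absent from doc
        for (int p : posFirst) {
            if (std::binary_search(doc->second.begin(), doc->second.end(),
                                   p + 1)) {
                result.push_back(docId);          // adjacent pair found
                break;
            }
        }
    }
    return result;
}
```

For the query "đại học", phraseDocs(lex, "đại", "học") would return only the documents where the two syllables stand next to each other.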

3 PROBLEM ESTABLISHMENT

In this part, we briefly describe a Vietnamese tokenizer which was introduced in the literature and apply it to the indexing algorithm

3.1 Vietnamese tokenizer

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis.

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

- Punctuation and whitespace may or may not be included in the resulting list of tokens

- All contiguous strings of alphabetic characters are part of one token; likewise with numbers

- Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters

Tokenization is an important process in natural language processing, in particular for East Asian languages that are isolating: Chinese, Japanese, Korean, etc. For this kind of language, whitespace does not delimit words as it does in English. On the other hand, there are connections between single words, i.e., a word may include multiple single words. As a consequence, a good tokenizer has to decrease the ambiguity of words.


For Vietnamese text, a popular algorithm implemented in tokenizers is “minimum weight” [4, 10], in which tokenization is transformed into a graph problem as follows:

1. Create two virtual nodes: a start node and an end node.

2. Compare, sequentially, segments of arbitrary length against a lingual dictionary.

3. A segment that is contained in the dictionary corresponds to a new node in the graph.

4. The weight between two nodes (two continuous segments in the sentence) is calculated by the formula

w(i, j) = f(i, j) / N

5. Find the shortest path from the start node to the end node.

In the formula, f(i, j) is usually calculated from the uni-gram value (the probability that a word appears) and the bi-gram value (the probability that two words appear together). In addition, we can add other features to obtain better values of f: word type, linking words, etc. Previously, these values (except the uni-gram and bi-gram values, which are obtained by statistical analysis of the corpus) were evaluated manually. However, thanks to machine learning models (Markov models, CRFs, etc.), they can now be calculated automatically. In step 5, the shortest path is found by the Viterbi algorithm, with complexity O(n), where n is the length of the input sentence. Implementing this strategy can reach a precision of 97% for Vietnamese text. Refer to the papers of Dien et al. [4] and Tran et al. [10] for a more detailed description of the tokenizer.
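A simplified sketch of minimum-weight segmentation as a shortest-path computation over the word lattice; the dictionary, the toy weight function, and the quadratic dynamic program below are stand-ins for the paper's f(i, j)/N scores and Viterbi pass:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <set>
#include <string>
#include <vector>

// Hypothetical edge weight standing in for f(i, j)/N; it simply favors
// longer dictionary words. A real system would use uni-/bi-gram statistics.
double weight(const std::string& w) {
    return 1.0 / static_cast<double>(w.size());
}

// Node k sits between syllables k-1 and k; an edge (i, j+1) exists when
// syllables i..j form a dictionary word (or a single syllable, as a
// fallback). Dynamic programming over this acyclic lattice finds the
// minimum-weight start-to-end path.
std::vector<std::string> segment(const std::vector<std::string>& syl,
                                 const std::set<std::string>& dict) {
    const std::size_t n = syl.size();
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> best(n + 1, INF);
    std::vector<std::size_t> prev(n + 1, 0);
    best[0] = 0.0;                                   // virtual start node
    for (std::size_t i = 0; i < n; ++i) {
        if (best[i] == INF) continue;
        std::string w;
        for (std::size_t j = i; j < n; ++j) {        // candidate word syl[i..j]
            if (!w.empty()) w += ' ';
            w += syl[j];
            if ((j > i && !dict.count(w)) || best[i] + weight(w) >= best[j + 1])
                continue;
            best[j + 1] = best[i] + weight(w);
            prev[j + 1] = i;
        }
    }
    std::vector<std::string> words;                  // walk back from end node
    for (std::size_t j = n; j > 0; j = prev[j]) {
        std::string w;
        for (std::size_t k = prev[j]; k < j; ++k) {
            if (!w.empty()) w += ' ';
            w += syl[k];
        }
        words.push_back(w);
    }
    std::reverse(words.begin(), words.end());
    return words;
}
```

Under this toy weight, segment({"đại", "học", "quốc", "gia"}, {"đại học", "quốc gia"}) returns the two compound words rather than four single syllables.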

3.2 Search engine model

The model of our proposed system is shown in Fig. 1. The indexing algorithm performs two steps:

- Implement the tokenizer to split Vietnamese text into words

- Build a lexicon on the hard disk. The information to be stored in the lexicon includes the strings that contain the words, the IDs of the documents in which the words appear, and their positions (a minimal sketch follows Fig. 1)

Figure 1. The model of the proposed search engine
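The indexing side of this model can be sketched as follows; the tokenizer itself is assumed given, and all names are ours:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Lexicon = std::map<std::string, std::map<int, std::vector<int>>>;

// Indexing with a tokenizer in front: `tokens` is assumed to be the output
// of the Vietnamese tokenizer for one document, so a compound word such as
// "đại học" arrives as one token and becomes one lexicon record, instead
// of two separate records "đại" and "học".
void indexDocument(Lexicon& lex, int docId,
                   const std::vector<std::string>& tokens) {
    for (std::size_t pos = 0; pos < tokens.size(); ++pos)
        lex[tokens[pos]][docId].push_back(static_cast<int>(pos));
}
```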

For each query, the system performs the following steps:

- Use the Vietnamese tokenizer to split the query into words

- Retrieve the information of the words from the lexicon

- Combine the retrieved information and output the documents that contain the query

How we store the list of words in the lexicon is important to the performance of the algorithm. Since searching and inserting are the two most frequent operations on the lexicon, we divide the lexicon into two types:

- The lexicon for a large-scale Vietnamese data set: in this case, the size of the lexicon is significant in comparison to the size of the internal memory. We need an appropriate data structure that stores the list of words in external memory and supports the searching and inserting processes. Note that the average time to access a byte on a hard disk is 19 ms, while in internal memory it is 0.000113 ms. In this work, we suggest using a B-tree to store and search the lexicon.

- The small lexicons: for a small lexicon, we can store it in internal memory to take advantage of its speed. We use the red-black tree structure [6] to store the lexicon.

For both data structures above, the complexity of searching and insertion is O(log n), where n is the number of words in the corpus.
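A minimal sketch of the in-memory variant, assuming that std::map is implemented as a red-black tree (as it is in common C++ standard libraries); the word-to-document mapping shown here is simplified:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// In-memory lexicon sketch: std::map is a balanced binary search tree
// (a red-black tree in common standard-library implementations), so both
// insertion and lookup are O(log n) in the number of distinct words.
// The external-memory B-tree variant for large corpora is not shown.
int main() {
    std::map<std::string, std::vector<int>> lexicon;  // word -> document ids
    lexicon["hà nội"].push_back(1);                   // O(log n) insert
    lexicon["đại học"].push_back(1);
    lexicon["đại học"].push_back(2);
    auto it = lexicon.find("đại học");                // O(log n) search
    if (it != lexicon.end())
        std::cout << it->first << " appears in " << it->second.size()
                  << " document(s)\n";
    return 0;
}
```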

4 SIMULATION AND RESULTS

4.1 Simulation description

To evaluate the performance of our proposed indexing strategy, we have built a computer simulation in C++ and Python, which is described in Fig. 2.

Figure 2. Simulation description

The crawler module (which collects Vietnamese text from the internet) is written in Python. Its output is a large-scale Vietnamese corpus.

The indexor is written in C++. We store lexicons in external memory using a B-tree [5], which allows us to expand the corpus without erasing the old lexicon. Moreover, in the tokenizer, we implement the “minimum weight” algorithm. The performance of the systems is evaluated in terms of precision and recall, where precision is the fraction of retrieved documents that are relevant to the query, and recall is the fraction of relevant documents that are retrieved.
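These two measures can be computed as in the following sketch (function names are ours):

```cpp
#include <cstddef>

// precision = |relevant AND retrieved| / |retrieved|
double precision(std::size_t relevantRetrieved, std::size_t retrieved) {
    return retrieved ? static_cast<double>(relevantRetrieved) / retrieved : 0.0;
}

// recall = |relevant AND retrieved| / |relevant|
double recall(std::size_t relevantRetrieved, std::size_t relevant) {
    return relevant ? static_cast<double>(relevantRetrieved) / relevant : 0.0;
}
```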

To show the advantages of our proposed approach, we compare the results of our algorithm, in terms of accuracy, with the traditional indexing algorithm for the Vietnamese language presented in [9].

4.2 Results and discussion


Figure 3. Precision for single and compound terms

As can be seen from Fig. 3, our method provides better performance than the traditional strategy in terms of precision. In particular, for compound terms and 200 megabytes of corpus, the precision of our method and of the traditional algorithm is 0.652 and 0.587, respectively. This demonstrates the effect of applying the Vietnamese tokenizer to the indexing algorithm.

Figure 4. Recall for single and compound terms

We remark that the improvement of our method is more significant for compound-term searches than for single-term searches. The reason is that, with the Vietnamese tokenizer, some single words are absorbed into compound terms in the lexicon; hence, searching for these single words gives poorer results.

We also remark that our approach outperforms the traditional method in terms of recall. As shown in Fig. 4, for single terms, the recall of our algorithm is 0.641 compared to 0.530 for the traditional indexor. Similarly, the recalls for compound terms in the system with the tokenizer and the system without it are 0.598 and 0.304, respectively. This demonstrates the advantage of tokenization in our method.

5 CONCLUSION

In this paper, we apply a Vietnamese tokenizer to the traditional indexing algorithm. Through the tokenization process, the indexor "understands" more about the language than the traditional strategy does and hence outputs a better lexicon. Our proposed search engine, which uses a crawler module and an indexor with a tokenization process, provides better performance than the traditional strategy. We consider the application of our approach to other isolating languages as future work.


REFERENCES

[1] Charras, C. and Lecroq, T., "Handbook of exact string matching algorithms", Citeseer, 2004.

[2] Knuth, D. E., Morris, J. H., Jr., and Pratt, V. R., "Fast pattern matching in strings", SIAM Journal on Computing, Vol. 6, No. 2, pp. 323–350, 1977.

[3] Brin, S. and Page, L., "Reprint of: The anatomy of a large-scale hypertextual web search engine", Computer Networks, Vol. 56, No. 18, pp. 3825–3833, 2012.

[4] Dien, D., Kiem, H., and Van Toan, N., "Vietnamese word segmentation", in NLPRS, Vol. 1, 2001, pp. 749–756.

[5] Ferragina, P. and Grossi, R., "The string B-tree: a new data structure for string search in external memory and its applications", Journal of the ACM (JACM), Vol. 46, No. 2, pp. 236–280, 1999.

[6] Hanke, S., "The performance of concurrent red-black tree algorithms", Springer, 1999.

[7] Johnson, C., "Method and system for visual internet search engine", US Patent App. 09/975,755, Oct. 10, 2001.

[8] Boyer, R. S. and Moore, J. S., "A fast string searching algorithm", Communications of the ACM, Vol. 20, No. 10, pp. 762–772, 1977.

[9] Orlando, S., Perego, R., and Silvestri, F., "Design of a parallel and distributed web search engine", arXiv preprint cs/0407053, 2004.

[10] Tran, O. T., Le, C. A., and Ha, T. Q., "Improving Vietnamese word segmentation and POS tagging using MEM with various kinds of resources", Natural Language Processing, Vol. 17, No. 3, pp. 41–60, 2010.

SUMMARY

THE INDEXING ALGORITHM IN SEARCH ENGINES FOR VIETNAMESE TEXT

In this paper, we propose an indexing algorithm for searching Vietnamese-language text over large data sets. The problem to be solved consists of two parts: dividing the data set into individual words and building metadata for each word. For the Vietnamese language, the ambiguity of the tokenization process in traditional indexing algorithms leads to inaccurate search results. In this paper, we apply a new tokenization method for Vietnamese in which each term may comprise several words, thereby increasing the precision of the search process while reducing the search time. Simulation demonstrates the advantage of the proposed algorithm.

Keywords: Tokenization, Search engine, Indexing, Lexicon, Page rank.

Author affiliations:

People's Police University of Technology and Logistics, Bac Ninh, Vietnam

Email: dangtient36@gmail.com
