Indexing method for bioinformatics XML documents based on R-tree (English summary)


MINISTRY OF EDUCATION AND TRAINING

VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY


DINH DUC LUONG

BIOINFORMATICS XML DOCUMENTS INDEX METHOD BASED ON R-TREE METHOD

Major: Mathematical Foundations for Computer Science

Code: 9 46 01 10

SUMMARY OF MATHEMATICS DOCTORAL THESIS

Hanoi, 2019


List of the author's works

1. Dinh Duc Luong, Hoang Do Thanh Tung, “A Survey on Indexing for Gene Database”, International Clustering Workshop: Teaching, Research, Business, December 27-29, 2014, pp. 50-54.

2. Hoang Do Thanh Tung, Dinh Duc Luong, “A proposed Indexing Method for Treefarm database”, International Conference on Information and Convergence Technology for Smart Society, Vol. 2, No. 1, January 19-21, 2016, Ho Chi Minh, Vietnam, pp. 79-81.

3. Vuong Quang Phuong, Le Thi Thuy Giang, Dinh Duc Luong, Ngo Van Binh, Hoang Do Thanh Tung, “Technology solution of managing pig breed”, Proceedings of the XXI National Conference: Some selected issues of Information Technology and Communications, Thanh Hoa, July 27-28, 2018, pp. 110-116.

4. Hoang Do Thanh Tung, Dinh Duc Luong, “An Improved Indexing Method for Xpath Queries”, Indian Journal of Science and Technology, Vol. 9(31), DOI:10.17485/ijst/2016/v9i31/92731, August 2016, pp. 1-7 (SCOPUS).

5. Dinh Duc Luong, Vuong Quang Phuong, Hoang Do Thanh Tung, “A new Indexing technique XR + tree for Bioinformatic XML data compression”, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958 (Online), Volume-8, Issue-5, June 2019, pp. 1-7 (SCOPUS).


INTRODUCTION

XML documents are structured (or semi-structured) text data. They have been popular for decades because they store data flexibly and are easy to share and use over the internet. In the past, XML documents were usually not very large, but in recent years large bioinformatics XML documents have begun to appear, reaching gigabytes or even terabytes, because of the rapid development of biotechnology. Such data can be obtained from reputable sources such as SRA (decoded sequences), NCBI Genome (sequenced species), and ensembl.org (which aggregates a large amount of data into BioMart).

Bioinformatics XML documents contain two kinds of data: biological data (DNA, protein, subspecies, etc.) and data that describes the biological data. Data structures are defined by tags; they are often flexible and may differ from source to source because they are customized by the individuals and organizations working in biology.

Because of their size, such documents must be stored and processed on hard disk, or in a distributed storage system, and only a small portion is loaded into main memory (RAM) when further analysis is needed. Hard disk access is sequential and far more time-consuming than RAM access. Therefore, query methods that access the hard disk always try to minimize the number of disk accesses and to maximize the use of main memory, for example through caches and buffers.

Practical queries rely on query-specific algorithms that are designed to return the desired results in a short time. For example:

1. XPath query on an XML document (exact search): extract all data whose tags are siblings (share the same parent) of a given White Mouse type, or extract all data that is a descendant of the African pig.

2. Homology query on DNA data fragments (approximate search): find all genes homologous to a sample gene of a new species.

The traditional solution for such queries is to select and install indexing methods suited to certain types of data and specific queries. Such methods already exist, but they are of limited use on text data of this size.

With text data, the index is often very large, even much larger than the original data, which causes two problems: (1) storing the index is difficult; (2) compressing the data and querying it at the same time is inefficient. Moreover, if the index itself is text data, query speed remains a hard problem.

Therefore, recent studies on indexing XML documents tend to:

- Separate the XML document into two parts and apply different indexing methods suited to each data type and query type. In detail:

1. Methods for indexing structural data (tag data) that support specific queries such as XPath.

2. Methods for indexing biological data (such as DNA fragments) that support specific queries such as searching for homologous DNA sequences.

- Convert the original text data into numeric form in order to:

1. Reduce the original data size.

2. Apply appropriate indexing methods.

3. Improve query speed.

The problems to be solved are broad, spanning both informatics and biology. The thesis therefore focuses on indexing methods that speed up specific queries by reducing the number of hard disk accesses while still returning the expected results.

The thesis solves the problem of indexing structural data (tag data) in support of XPath queries. For the problem of indexing biological data (such as DNA fragments) in support of queries such as searching for homologous DNA sequences, the thesis surveys existing methods and sets directions for further research. The objectives and results of the thesis are as follows.

The objectives of the thesis

- Study an R-tree-based indexing method that makes XPath queries on XML data more efficient, using intermediate data in which tags are converted into numeric coordinates. The target XML data comes from bioinformatics XML documents.

- Convert structured XML text data into numeric data that can be represented in 2-dimensional space (extensible to more dimensions). The aim is to reduce the size of the original data and to apply the proposed indexing method.

The results achieved by the thesis are as follows:

- Experiments show that converting bioinformatics XML data into spatial data reduces its size at a fairly good rate in general. However, the compression ratio is not uniformly good across experiments with different bioinformatics XML documents (DNA, protein, subspecies, ...).

- The thesis proposes the BioX-tree indexing method and its extension, BioX + tree. Experiments show that the proposed methods (improved R-tree methods) are more effective than the R-tree method when indexing data converted from XML data. In particular, sibling queries, and queries whose algorithms make use of sibling queries, give good results. Theory and experiment show that the queries skip redundant traversal steps on the index tree (stored on hard disk), thereby reducing the number of disk accesses needed to load data into main memory while still returning the desired results.


1.1.2 Data sources

- NCBI database

- EMBL / EBI database

- DDBJ database

1.1.3 Issues with bioinformatics and biological databases

Biological databases contain a huge amount of information, such as DNA sequences, proteins, functions, and subspecies. New data is added continuously, so the databases grow quickly, especially with the development of current biotechnologies. Biological databases can be stored on computers; however, searching or querying large databases is often difficult because of the space and time that queries require.

At present, indexing to speed up the processing of bioinformatics data attracts many researchers and has great practical significance.

1.2 Methods for indexing biological and bioinformatics data

1.2.1 Index and external memory model

Complete databases usually do not fit into the main memory (RAM) of a computer system, so they are stored on hard drives. Access to the drive is about 100,000 times slower than access to main memory, which makes it the bottleneck of database management systems.

The effectiveness of an algorithm is therefore measured by the number of I/O operations it needs to perform an action. An index is a special lookup structure that the database search engine can use to speed up data retrieval by reducing, where possible, the number of storage blocks that have to be accessed.
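As a rough illustration of why the I/O count matters, the sketch below compares the block reads of a full scan with those of an index lookup. The record count, block capacity, and fan-out are hypothetical values chosen for the example; they are not taken from the thesis, and the balanced-index model is a simplifying assumption.

```python
def blocks_for_scan(n_records: int, records_per_block: int) -> int:
    """Blocks read by a full sequential scan (ceiling division)."""
    return -(-n_records // records_per_block)


def index_height(n_blocks: int, fanout: int) -> int:
    """Levels of a balanced index whose nodes each point to `fanout` children."""
    height, reachable = 0, 1
    while reachable < n_blocks:
        reachable *= fanout
        height += 1
    return height


# Hypothetical figures: 1,000,000 records, 100 per block, fan-out 100.
scan_reads = blocks_for_scan(1_000_000, 100)
lookup_reads = index_height(scan_reads, 100) + 1  # index levels + 1 data block
print(scan_reads, lookup_reads)
```

Under these assumptions a point lookup touches a handful of blocks where the scan touches thousands, which is the gap that the indexing methods in this chapter try to exploit.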

1.2.2 Indexing methods for biological data

There are two main groups:

- Group 1: methods that compare sequences by comparing segments within the sequences and optimizing their similarity.

- Group 2: methods that use a special transformation to build the index.


Bioinformatics documents are stored in many different formats. This thesis focuses on large XML data, which is one of the standard output formats that users can download from the data sources mentioned above.

The methods XGrind [78], Xpress [52], XQzip [15], XQueC [7], Arroyuelo et al. [8], Qian et al. [62], Dietz [21], and XISS by Li and Moon [61] have been studied by the author; their advantages and disadvantages are analyzed in more detail in the following sections.

1.3 Methods for indexing XML documents

1.3.1 XML documents and XPath

- XML document: XML (eXtensible Markup Language) [77] is a hierarchical data model derived from SGML; it allows a document to be modeled as a tree structure.

- XPath: The structure of an XML document can be visualized as a tree with many branches and sub-branches. An axis specifies which nodes, relative to the context node, should be included in the search. The XPath specification [11] lists a family of 13 axes, shown in Table 1.1:

Table 1.1: XPath axes

Axis                 Description
self                 The context node itself
parent               Parent of the context node, if it exists
child                Children of the context node, if any
ancestor             Ancestors of the context node, if any
ancestor-or-self     Ancestors of the context node, and the context node itself
descendant           Descendants of the context node
descendant-or-self   Descendants of the context node, and the context node itself
following            Nodes after the context node in document order, not including descendants
following-sibling    Sibling nodes after the context node
preceding            Nodes before the context node in document order, not including ancestors
preceding-sibling    Sibling nodes before the context node
attribute            Attributes of the context node
namespace            Namespace nodes of the context node

A predicate can also be specified at each step of the path to restrict the set of nodes selected at that step. In other words, predicates identify the needed data more precisely, giving smaller and more usable results. Some indexing methods are described below:

1.3.1.1 Numbering scheme

The XML document is built as a tree with parent-child hierarchy relations. Each tag name is then assigned two indexes, its pre-order and post-order traversal values (this pair of values forms the NodeID), and the tag names are serialized.
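The numbering can be sketched as a single depth-first traversal with two counters. This is a minimal illustration that assumes 0-based counters and uses a made-up sample document; the function name `number_nodes` is hypothetical, and the recursive form is only suitable for trees of moderate depth.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Return {element: (pre, post)} using two independent counters."""
    numbering = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        pre = counters["pre"]          # assigned when the node is entered
        counters["pre"] += 1
        for child in elem:
            visit(child)
        post = counters["post"]        # assigned when the node is left
        counters["post"] += 1
        numbering[elem] = (pre, post)

    visit(root)
    return numbering


# Hypothetical sample document (not from the thesis data sets).
doc = ET.fromstring("<genome><gene><exon/><exon/></gene><protein/></genome>")
ids = number_nodes(doc)
for elem, (pre, post) in sorted(ids.items(), key=lambda kv: kv[1][0]):
    print(elem.tag, pre, post)
```

Each (pre, post) pair then serves as the NodeID of its tag and as a point in the plane.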


1.3.1.2 Structural joins

The simplest way to evaluate and speed up path queries is to divide a large expression into several smaller ones (called sub-expressions) and to search for results in these sub-steps. The drawback is that the ancestor-descendant (A-D) relationship must be determined for each pair of nodes, and an element may be looked up and considered repeatedly in different steps, which is costly and time-consuming.
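The cost of pairwise A-D checks can be seen in a deliberately naive join. The sketch below assumes nodes carry pre-order and post-order values, so that a node is an ancestor of another exactly when it is entered earlier and left later; the `Node` type, the function names, and the sample values are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    tag: str
    pre: int
    post: int


def is_ancestor(a: Node, d: Node) -> bool:
    """a is an ancestor of d iff a is entered before d and left after d."""
    return a.pre < d.pre and a.post > d.post


def naive_structural_join(ancestors, descendants):
    """Pairwise A-D join: every candidate pair is tested, O(|A| * |D|)."""
    return [(a, d) for a in ancestors for d in descendants if is_ancestor(a, d)]


# Hypothetical nodes, numbered as in a five-node sample tree.
genes = [Node("gene", 1, 2)]
exons = [Node("exon", 2, 0), Node("exon", 3, 1)]
proteins = [Node("protein", 4, 3)]

# Evaluating //gene//exon as one join step.
pairs = naive_structural_join(genes, exons)
print(len(pairs))
```

A longer path expression repeats such a join for every step, which is where the repeated lookups described above come from.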

1.3.1.3 Conversion into multi-dimensional space

This approach converts paths and A-D relationships from the input XML documents into multi-dimensional data sets. The main idea is to avoid structural joins, which can be suboptimal and slow in many situations, and to take advantage of multi-dimensional data structures such as the R-tree, which are becoming increasingly effective. The works in [37] and [51] propose an indexing method for XML trees based on multi-dimensional data sets, called MDX (Multidimensional Approach to Indexing XML).

1.3.1.4 Mapping to a relational database

[36] presents a method designed specifically for XPath queries and path expressions. It represents each node of the input XML file in 5 dimensions: entry(E) = {pre(E); post(E); par(E); att(E); tag(E)}. For a node E, pre and post are its pre-order and post-order traversal values; par is the pre-order value of the parent of E; att is a status flag; and tag holds the node's tag name. An XPath query is evaluated as a window query and expressed as an SQL (Structured Query Language) query. Because nodes are represented in a 5-dimensional space, this solution uses R-trees for indexing, since many studies have found that they give good results for XPath queries.
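To make the window-query idea concrete, the sketch below filters in-memory entries with one rectangular window per major axis in the (pre, post) plane. It is an illustration only: the `Entry` type, the use of -1 as the parent of the root, and the list-based filter (instead of SQL over an R-tree) are assumptions made for this example.

```python
import math
from typing import NamedTuple


class Entry(NamedTuple):
    pre: int
    post: int
    par: int      # pre-order value of the parent, -1 for the root (assumption)
    att: bool
    tag: str


def axis_window(axis: str, v: Entry):
    """Open intervals (pre_lo, pre_hi), (post_lo, post_hi) for a major axis."""
    inf = math.inf
    windows = {
        "descendant": ((v.pre, inf), (-inf, v.post)),
        "ancestor": ((-inf, v.pre), (v.post, inf)),
        "following": ((v.pre, inf), (v.post, inf)),
        "preceding": ((-inf, v.pre), (-inf, v.post)),
    }
    return windows[axis]


def window_query(entries, axis: str, v: Entry):
    """Return the entries that fall inside the window of `axis` for node v."""
    (pre_lo, pre_hi), (post_lo, post_hi) = axis_window(axis, v)
    return [e for e in entries
            if pre_lo < e.pre < pre_hi and post_lo < e.post < post_hi]


# Hypothetical five-node document.
entries = [
    Entry(0, 4, -1, False, "genome"),
    Entry(1, 2, 0, False, "gene"),
    Entry(2, 0, 1, False, "exon"),
    Entry(3, 1, 1, False, "exon"),
    Entry(4, 3, 0, False, "protein"),
]
gene = entries[1]
print([e.tag for e in window_query(entries, "descendant", gene)])
```

The four windows partition the plane around the context node, which is why a spatial index such as the R-tree can answer these axes with a range search.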

1.4 R-tree method

1.4.1 Concept of R-tree

The R-tree method was designed for fast access to spatial regions: it divides the space into regions, builds an index over these smaller regions, and manages them with a tree structure. The R-tree partitions the data space into minimum bounding rectangles (Minimum Bounding Rectangle - MBR), the smallest rectangular blocks that contain the data. The tree stores the MBRs (a form of metadata) rather than the data itself, so searches are performed on the tree nodes.
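A minimal 2-dimensional sketch of the MBR concept follows. The `Rect` type and the `mbr` helper are hypothetical names introduced for illustration; `mbr` assumes a non-empty collection of points.

```python
from typing import Iterable, NamedTuple, Tuple


class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def intersects(self, other: "Rect") -> bool:
        """True when the two rectangles share at least one point."""
        return (self.x_min <= other.x_max and other.x_min <= self.x_max and
                self.y_min <= other.y_max and other.y_min <= self.y_max)


def mbr(points: Iterable[Tuple[float, float]]) -> Rect:
    """Smallest axis-aligned rectangle containing all points."""
    xs, ys = zip(*points)
    return Rect(min(xs), min(ys), max(xs), max(ys))


box = mbr([(2, 0), (3, 1), (1, 2)])
print(box)
print(box.intersects(Rect(0, 0, 1, 1)), box.intersects(Rect(4, 4, 5, 5)))
```

A search only needs the intersection test on MBRs to decide whether a subtree can contain results.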

1.4.2 R-tree structure

In general, the R-tree is an index structure for n-dimensional spatial objects and is similar to a B-tree. Leaf nodes contain index entries of the form (MBR, object_ptr), where object_ptr refers to a data record in the database and MBR is the n-dimensional rectangle that bounds the spatial object it represents. Non-leaf nodes contain entries of the form (MBR, child_ptr), where child_ptr is the address of another node in the tree and MBR covers all rectangles in the entries of that lower node. An R-tree satisfies the following properties:

- Every node except the root contains between m and M entries.

- For each entry (MBR, object_ptr) in a leaf node, MBR is the smallest rectangle that contains the n-dimensional data object referenced by object_ptr.

- For each entry (MBR, child_ptr) in a non-leaf node, MBR is the smallest rectangle that contains the rectangles in the child node.

- The root node has at least 2 children unless it is a leaf.

- The tree is balanced.

1.4.3 Some basic algorithms of the R-tree method [30]

a) Searching in an R-tree data structure
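The general idea of R-tree window search can be sketched over the node layout of Section 1.4.2. This is an assumption-laden illustration rather than the listing from [30]: the `RNode` class, the tuple layout of rectangles, and the hand-built two-level tree are all hypothetical, and points are stored as degenerate rectangles.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class RNode:
    is_leaf: bool
    # Leaf entries: (MBR, object_ptr); non-leaf entries: (MBR, child_ptr).
    entries: List[Tuple[Rect, Any]] = field(default_factory=list)


def search(node: RNode, window: Rect) -> list:
    """Collect object_ptr values whose MBR intersects the query window."""
    hits = []
    for rect, ptr in node.entries:
        if not intersects(rect, window):
            continue                      # prune this entry or subtree
        if node.is_leaf:
            hits.append(ptr)
        else:
            hits.extend(search(ptr, window))
    return hits


# Hand-built two-level tree over three points.
leaf1 = RNode(True, [((2, 0, 2, 0), "exon-a"), ((3, 1, 3, 1), "exon-b")])
leaf2 = RNode(True, [((4, 3, 4, 3), "protein")])
root = RNode(False, [((2, 0, 3, 1), leaf1), ((4, 3, 4, 3), leaf2)])
print(search(root, (2, 0, 3, 2)))
```

Every pruned entry in a non-leaf node is a disk block that does not have to be read, so smaller query windows translate directly into fewer I/O operations.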

1.5 Remaining issues

The model shown in Figure 1.1 converts XML data into a multi-dimensional space so that spatial indexing and querying methods can be applied to increase processing speed and to reduce data size when indexing. Because bioinformatics XML is in practice very diverse, the R-tree is the more suitable choice and is selected as the basis of this thesis.

Figure 1.1: Overview model

Figure 1.2: Model of data conversion and indexing on hard disk

The R-tree method still has the following problems when applied to bioinformatics XML data:

1) The first is the overlap problem. For spatial index techniques, the larger the search space, the longer it takes to obtain the result node set. The weakness of R-tree-based methods is that queries require a fairly large scan window, which has a considerable impact on query performance.

2) The second is the problem of the connections between tags after conversion to space, where tags are expressed as points and related through XPath axes such as parent, preceding, sibling, descendant, child, and following. Figure 2.2 shows the data distribution on the coordinate system; the author observes that all data is skewed into a trapezoidal, diagonally aligned shape (like an airplane wing). Previous methods did not take this into account, so queries in this airplane-wing data zone have not improved significantly.

1.6 Conclusion

Chapter 1 presents fundamental concepts of bioinformatics and bioinformatics data. Bioinformatics data is becoming huge because the research community contributes and shares data regularly. Because the problems of bioinformatics data analysis are very diverse, the documents that store this information need a structure that is flexible, easy to change, and above all easy to share and contribute to. XML documents are currently an important standard for describing and storing huge bioinformatics data. However, XML documents hold text and semi-structured data, so extracting information from them differs from extracting it from regular data. Chapter 1 also reviews related studies on XML data extraction, together with the indexing methods and algorithms proposed in previous work, among which the R-tree appears effective for XML documents and XPath queries. On that basis, Chapter 1 analyzes and presents the research issues of the thesis.

CHAPTER 2 BIOX-TREE INDEXING METHOD

2.1 INTRODUCTION

The R-tree-based spatial indexing methods described above have two problems. The first is the overlap problem. For spatial index techniques, the larger the search space, the longer it takes to obtain the result node set, and the weakness of R-tree-based methods is that they create suboptimal window queries. Figure 2.1 illustrates an XML document in which small points represent the XML data in the plane. Assume that, from the context node v, we want to retrieve all of its descendant nodes with the window query {pre(v), ∞; 0, post(v)} [36]. The window that is actually needed is the white one, bounded by the pre-order value of the left-most descendant and the post-order value of the right-most descendant of v. The dark area, which the descendant-axis query window covers in addition, is wasted; it can be very large in many cases and has a considerable impact on query performance.

Figure 2.1: Scanning range defined by the pre-order and post-order values (gray zone) and the reduced range (white zone) for a descendant query, using the sample query
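The size of the wasted area can be illustrated by comparing the window of [36] with the bounding box of the nodes it actually returns. The sketch below is not a query strategy (it scans all nodes to find the descendants) and only measures the two windows in grid cells; the function names and the six-node sample numbering are hypothetical.

```python
def descendant_windows(nodes, v):
    """nodes: list of (pre, post) pairs; v: (pre, post) of the context node."""
    pre_v, post_v = v
    max_pre = max(pre for pre, _ in nodes)
    # Window of [36] on integer coordinates: pre in (pre(v), max], post in [0, post(v)).
    naive = ((pre_v + 1, max_pre), (0, post_v - 1))
    desc = [(p, q) for p, q in nodes if p > pre_v and q < post_v]
    if not desc:
        return naive, None
    # Bounding box of the descendants that are really there.
    tight = ((min(p for p, _ in desc), max(p for p, _ in desc)),
             (min(q for _, q in desc), max(q for _, q in desc)))
    return naive, tight


def cells(window):
    """Number of integer grid cells covered by a window."""
    (p_lo, p_hi), (q_lo, q_hi) = window
    return max(0, p_hi - p_lo + 1) * max(0, q_hi - q_lo + 1)


# Hypothetical tree: root(0,5) with children a(1,0), b(2,3), c(5,4); b has b1(3,1), b2(4,2).
nodes = [(0, 5), (1, 0), (2, 3), (3, 1), (4, 2), (5, 4)]
naive, tight = descendant_windows(nodes, (2, 3))
print(cells(naive), cells(tight))
```

Even in this tiny tree the window of [36] covers more than twice the area that holds the descendants; on documents with many nodes after the context node, the gap grows accordingly.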


The second is the problem of the connections between tags after conversion to space, where tags are expressed as points and related through XPath axes such as parent, preceding, sibling, descendant, child, and following. Figure 2.2 shows the data distribution on the coordinate system for rice DNA data with 1,000 nodes (Figure 2.2a) and for Swissprot data with about 20,000 nodes (Figure 2.2b). The author observes that all data is skewed into a trapezoidal, diagonally aligned shape (like an airplane wing). Tests on many different XML documents, from a few hundred to several hundred thousand nodes, gave the same result.

Figure 2.2: Example distribution of converted points for an XML document: (a) rice DNA data, (b) Swissprot data

All previous methods ignored this. They focused only on the parent/child and ancestor/descendant relationships and omitted the other axes, which are an important part of query processing, especially for streams of XPath queries with predicates. As a result, queries in the airplane-wing data zone have not improved significantly.

The author therefore investigates a new indexing method, improved from the R-tree, that lets XPath queries run more efficiently on a number of axes.

Based on the model selected in Chapter 1, the author proposes improvements to three components: the conversion, indexing, and query processing modules.

Figure 2.3: Proposed parts for improvement in the BioX-tree method

The results of this chapter are published in works 1, 2, 3 and 4 in the "List of author's works"

2.2 BioX-tree Indexing method

2.2.1 XML document conversion

Following the same general principle of XML document analysis and transformation, the author has built a separate program to ensure an accurate comparison with the R-tree-based method of previous studies. In [20], the conversion is implemented by two procedures, startEuity (t, a, att) and endEuity (t). In addition to the pre-order and post-order values, the author adds a new parameter to increase search efficiency: the level l of each node. The two procedures are modified accordingly and renamed startElement and endElement, as presented in Algorithm 2.1.

Algorithm ConvertXMLDocument(XMLdoc)

Input: the XML document to be converted

Output: a txt file containing the spatial values of each node, node(E) = {pre(E), post(E), par(E), att(E), tag(E)}

Algorithm 2.1: Two modified algorithms in XML document conversion
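A conversion of this kind can be sketched with Python's SAX interface, whose `startElement` and `endElement` callbacks happen to share the names used above. The sketch is an assumption-based illustration, not the author's program: it returns tuples (pre, post, par, level, tag) in memory instead of writing a txt file, uses -1 as the parent of the root and 0 as the root level, and leaves out the att field because its exact encoding is not given in this summary.

```python
import xml.sax


class ConvertHandler(xml.sax.ContentHandler):
    """Emit (pre, post, par, level, tag) for every element."""

    def __init__(self):
        super().__init__()
        self.pre_counter = 0
        self.post_counter = 0
        self.stack = []      # open elements as (pre, tag, par, level)
        self.entries = []

    def startElement(self, name, attrs):
        par = self.stack[-1][0] if self.stack else -1   # -1: no parent (assumption)
        level = len(self.stack)                         # root is level 0 (assumption)
        self.stack.append((self.pre_counter, name, par, level))
        self.pre_counter += 1

    def endElement(self, name):
        pre, tag, par, level = self.stack.pop()
        self.entries.append((pre, self.post_counter, par, level, tag))
        self.post_counter += 1


def convert(xml_text: str):
    """Parse the document once and return its entries sorted by pre-order value."""
    handler = ConvertHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return sorted(handler.entries)


# Hypothetical sample document.
for entry in convert("<genome><gene><exon/><exon/></gene><protein/></genome>"):
    print(entry)
```

Because SAX streams the document, the handler keeps only the stack of open elements in memory, which suits the very large documents targeted here.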

2.2.2 Index structure on BioX-tree

BioX-tree applies a different insert / split strategy to capture the sibling relationships of XML data more easily, without reducing the spatial discrimination ability of the index too much. As in the XPath and R-tree methods described in Chapter 1, each tag name in the XML document is represented after conversion as an entry [30] with 5 attributes, node (E) = {pre (E), post (E), par (E), att (E), tag (E)}. A node has the size of one block on the hard drive.

Non-leaf nodes have the form (pointer, MBR), where the pointer points to a child node and the MBR is the smallest rectangle enclosing all entries attached to it. In other words, non-leaf nodes hold metadata about the leaf nodes, and information about the leaf nodes can be looked up there.

Leaf nodes, which contain the converted elements, are responsible for maintaining the aligned trajectories of the actual XML data. To do that, the author uses double linking to keep the connections with the preceding and following siblings in the XML document, and a further pointer to stay connected with the parent of the XML children. In short, each leaf node uses 3 pointers to connect with its preceding and following siblings and their parent, so each node of this type has a tuple of pointers (previouspointer, nextpointer, parpointer).


The purpose is to maintain relationships that reflect the airplane-wing data distribution in space, to make the query windows smaller, and to force each leaf node to contain only siblings, so that sibling relationships can be found quickly in the BioX-tree.

Figure 2.4: Tree hierarchy under tags in rice DNA XML documents

Figure 2.5: Leaf nodes show a connection on the structure tree of BioX-tree

For example, Figure 2.4 depicts the tree structure of a rice DNA document that the author tests in the following sections; it is an XML data set published in the NCBI Gene bank. The nodes are numbered with pre-order and post-order values using the numbering scheme and algorithm described above. After the data is transformed (for simplicity, only the pre-order value is used in the description), the nodes are represented in the BioX-tree structure as shown in Figure 2.5: data nodes with the same parent are stored in the same leaf node. If a leaf node receives too many entries and overflows, it is split, and the resulting nodes are linked by pointers so that the siblings stay connected. Straight arrows represent pointers from a leaf node to its previous and next siblings; curved arrows represent connections to the parent.

In this example, the entries with pre-order values 21, 22, and 23 are sibling nodes in the XML document, so they are inserted into the same leaf node, and a pointer connects them to their parent node, which is the node containing entry 24. That is, 21, 22, 23 and 24 are siblings and have the same parent.
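The leaf-level layout described here can be sketched as follows. This is a simplified bulk-grouping illustration under stated assumptions, not the author's insertion algorithm: entries are reduced to (pre, post, par), non-leaf levels and MBRs are omitted, `LeafNode` and `build_sibling_leaves` are hypothetical names, and the pointer field names are taken from the tuple given in Section 2.2.2.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Entry = Tuple[int, int, int]   # (pre, post, par), a reduced entry for illustration


@dataclass(eq=False)           # identity comparison; the pointers form cycles
class LeafNode:
    entries: List[Entry] = field(default_factory=list)
    previouspointer: Optional["LeafNode"] = None
    nextpointer: Optional["LeafNode"] = None
    parpointer: Optional["LeafNode"] = None   # leaf that stores the parent entry


def build_sibling_leaves(entries: List[Entry], capacity: int) -> List[LeafNode]:
    """Group entries by parent; split an oversized group into linked leaves."""
    groups = defaultdict(list)
    for e in sorted(entries):
        groups[e[2]].append(e)

    leaves, home = [], {}          # home: pre value -> leaf that stores it
    for siblings in groups.values():
        prev = None
        for i in range(0, len(siblings), capacity):
            leaf = LeafNode(entries=siblings[i:i + capacity])
            if prev is not None:   # overflow: chain the split leaves
                prev.nextpointer, leaf.previouspointer = leaf, prev
            for e in leaf.entries:
                home[e[0]] = leaf
            leaves.append(leaf)
            prev = leaf
    for leaf in leaves:
        leaf.parpointer = home.get(leaf.entries[0][2])   # None for the root
    return leaves


# Hypothetical tree: root(0) with children 1, 2, 5; node 2 has children 3, 4.
sample = [(0, 5, -1), (1, 0, 0), (2, 3, 0), (3, 1, 2), (4, 2, 2), (5, 4, 0)]
leaves = build_sibling_leaves(sample, capacity=2)
print([[e[0] for e in leaf.entries] for leaf in leaves])
```

With a capacity of 2, the three children of the root overflow one leaf and end up in two leaves linked by previouspointer and nextpointer, so a sibling query can follow the chain instead of issuing a new window search.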

2.2.3 Algorithms

Because changing the tree structure affects the insertion, deletion, and querying of nodes, the author redesigns some algorithms accordingly. This section presents the modified algorithms; algorithms not shown are reused unchanged from the original method.


2.2.3.1 Insertion algorithm

The goal is to keep the sibling connections of the XML data, so the insertion algorithm is quite complicated. Plain pseudo-code explaining the insertion process, as well as the split strategy when a leaf node is full, is given in Algorithm 2.2.

if (N’ has space to add) then
    insert entry E into N’
else
    call CreateNewLeafNode(E) to create a new leaf node on the tree for entry E, which needs inserting here

Input: context node N, entry E whose siblings are to be found
Output: node N containing the siblings of entry E
Begin
    if N is not a leaf node
        scan the entries E’ in N whose MBR intersects the MBR of entry E
        call FindSiblingNode(N’, E), where N’ is the subnode of N indicated by E’

Find the sibling node N’ of entry E
if (N’ has room to add entry E) then
    add entry E to N’
else
