A SHALLOW APPROACH FOR QUERYING GRAPH DATABASE


VII-O-1

Dương Quang Hưng, Nguyễn Minh Nhựt, Nguyễn Trần Minh Thư, Bùi Đắc Thịnh

Information System Department, University of Science, Ho Chi Minh City

ABSTRACT

The rapid growth of information system applications, driven by diverse human demands, has made data storage an essential problem. NoSQL databases, the most common alternative to the traditional data models used to store structured data, are applied to improve performance in systems with scalable databases. Among them, graph databases are responsible for storing and querying data related to graph nodes and links, which can be regarded as large, scalable data. In this paper, we analyze the pros and cons of graph databases in comparison with traditional data models and build an experimental scenario to evaluate the time efficiency of query processing. The evaluation on real data crawled from an operating information system shows why choosing a graph database is a justifiable decision for scalable data.

Keywords: NoSQL; graph database; graph model; scalable data model

INTRODUCTION

Relational databases have been around for many decades and are the preferred database technology for most traditional data storage and retrieval applications [8]. They usually use SQL, a declarative query language, to exploit such databases. According to many analyses, relational databases are generally efficient when the data does not contain many relationships; heavily related data requires join operations between large tables, which cost a great deal of time. Although there have been different approaches such as XML or object databases, they have all been absorbed by most relational database management systems (RDBMSs) [1,2,9]. Recently, there have been many shifts in data stores, called the NoSQL movement, driven by the challenge of reading and writing big data with high performance, along with the development of the Internet and cloud computing [1,9]. NoSQL still has many definitions that present its core themes. In [9], the authors define NoSQL as a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility. The most important point in which NoSQL differs from SQL is that it is free of joins and schemas. NoSQL allows not only creating data without an entity model but also extracting data without joins, which are considered the most time-consuming operation.

Unlike relational databases, NoSQL uses a variety of data store types, from the simple key-value store, to the column-family store, an extension of the column in relational databases, to graph stores used to associate relationships, to document stores used for variable data [2,9]. Among them, the graph database is the most appropriate solution for problems with dense relationships. As a system of nodes and relationships, the graph store is used to face typical problems such as social networks, fraud detection, or relationship-heavy data, where graphs are truly one of the most useful structures for modeling objects and links [1,2,5,8,9]. In a graph store, two nodes are linked by one or more relationships, and both nodes and relationships have their own properties, which are stored in key-value fields [9].
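The node, relationship, and property model described above can be illustrated with a minimal Python sketch. It is only an illustration of the idea, not the storage format of Neo4j or any other product; the class names `Node` and `Relationship`, the helper `outgoing`, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A graph node with a label and its own key-value properties."""
    node_id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed link between two nodes with its own key-value properties."""
    start: int
    end: int
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: one user who likes one food court.
nodes: List[Node] = [
    Node(1, "USER", {"name": "Nam"}),
    Node(2, "FOODCOURT", {"name": "Court X"}),
]
relationships: List[Relationship] = [
    Relationship(1, 2, "LIKE", {"since": 2014}),
]


def outgoing(node_id: int, rel_type: str) -> List[int]:
    """Return the ids of nodes reached from node_id over relationships of rel_type."""
    return [r.end for r in relationships
            if r.start == node_id and r.rel_type == rel_type]
```

Following a relationship here is a direct lookup on the link itself, which is the property that makes relationship-heavy queries natural to express in this model.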

In this paper, we present a shallow approach to querying a graph data store on real data crawled from an operating information system. In an initial experiment, we evaluate the time efficiency of common and advanced queries on two database management systems, one representing relational databases and one representing NoSQL databases. In addition, we deploy an information system using a graph database to demonstrate the feasibility of our application and data.

The remainder of this paper is organized as follows. First, Section II presents related work on NoSQL and graph data stores in particular. Next, we describe the approach to querying data on a graph database. Then, Section IV presents experiments on the crawled real data. Finally, the conclusion is presented in Section V.

RELATED WORK

There have been many studies investigating alternative storage to relational databases, and NoSQL is, in a way, the blanket term for them. Under this term, many projects such as Cassandra, BigTable, CouchDB, Voldemort, and Dynamo have been presented and are increasingly widely used [1,8]. BigTable [12] is a large-scale, fast, distributed database system created and used by Google. Cassandra, developed by Facebook, is an open-source, distributed key-value data store [13], while Project Voldemort is LinkedIn's large-scale, persistent hash table, which works in a distributed environment and is designed mainly to handle errors [14].


More recently, some newer projects have appeared. Redis [11] is suitable for providing high-performance computing on small amounts of data but not for big data storage. MongoDB is a hybrid between relational and non-relational databases; it supports a range of complex data types with a powerful query language, covering most single-table query functions and effective indexing, which, as claimed in [10], can make its access 10 times faster than MySQL.

While [1] pointed out that the main features in which NoSQL differs from relational databases fall into four aspects (concurrent reading and updating, support for mass storage and access requirements, easy scalability, and low cost) [1,8,9], the author of [7] claimed that there are just two possible reasons to move from relational databases to NoSQL: performance and flexibility. These judgments are fairly accurate for business problems with massively complex relationships between objects, such as social networking, rules-based engines, and mashups. In these cases, a graph system is the most suitable for quickly analyzing complex network structures, even for mining patterns [8,9].

A graph store represents any complex network problem as a graph that contains nodes as vertices, relationships as edges, and their properties. A relationship can be thought of as the connection between objects in the real world [9]. The authors of [9] also pointed out that queries in graph data stores are similar to traversing nodes in an ordinary graph: what the shortest path between two nodes is, which nodes have nearest neighbors with given properties, and so on.
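The shortest-path question mentioned above is a classic traversal. A minimal sketch using breadth-first search on an unweighted graph follows; the adjacency list `GRAPH` and the function name `shortest_path` are hypothetical, and a graph database would run such a traversal internally rather than in application code.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical undirected graph stored as an adjacency list.
GRAPH: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def shortest_path(graph: Dict[str, List[str]], source: str,
                  target: str) -> Optional[List[str]]:
    """Breadth-first search: return one shortest path, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    parents: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            # Walk back through the parent links to rebuild the path.
            path: List[str] = []
            step: Optional[str] = current
            while step is not None:
                path.append(step)
                step = parents[step]
            return path[::-1]
        for neighbour in graph.get(current, []):
            if neighbour not in parents:
                parents[neighbour] = current
                queue.append(neighbour)
    return None
```

Because breadth-first search visits nodes in order of distance from the source, the first time the target is dequeued the reconstructed path has the minimum number of edges.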

Although graph data stores can address these problems, there are still few experiments comparing graph data stores with relational databases. In [1], the authors only gave some options to consider regarding the properties for which NoSQL is well adapted. The authors of [8] achieved results on specific aspects by designing experiments that compare MySQL [16], a representative relational database, with Neo4j [1, 3, 4], a representative NoSQL database. The experiments, based on a predefined set of queries, evaluated the processing speed of both data store management systems. However, the data consisted of random characters (8K or 32K) and was not real-world data.

Compared with previous work, our work makes the following contributions to the assessment of the NoSQL movement:

We evaluate time efficiency and compare a relational database management system with a graph data store system. The evaluation is performed on real-world data, crawled from an operating information system, using meaningful graph queries.

We build and deploy another information system using a graph data store and graph queries to illustrate the feasibility of using a graph data store in practice.

To our knowledge, this is one of the first works that exploits real-world data to compare the performance of a relational database and a NoSQL database, in particular MSSQL server [17] and the Neo4j graph database [1, 3, 4].

A SHALLOW APPROACH FOR QUERYING DATA

To compare time efficiency, we selected specific database management systems for both the relational and the graph model. Based on related work and our technical knowledge, we chose MSSQL server as the representative relational database and Neo4j as the representative graph (NoSQL) database.

The process for benchmarking the two systems' performance is as follows. First, we build a crawler to collect real-world data from Foody (http://www.foody.vn), a system with more than one million users. The data describes a social network in which food courts are the nuclei. Then, based on our knowledge of the data schema, we create two schemas, one for each database management system (MSSQL server and Neo4j). Next, the crawled data goes through ETL processing [15] before being loaded into the complete databases. The experiments are run on these databases with the same predefined, meaningful queries, which are suitable and essential for real applications.
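The extract, transform, and load step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the experiments: the CSV layout, the column names `user_id`, `foodcourt_id`, and `checked_in_at`, and all function names are hypothetical. The same cleaned records feed both targets, which mirrors the idea of loading identical data into a relational schema and a graph schema.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical extract of crawled check-in rows (column names are assumptions).
RAW_CSV = """user_id,foodcourt_id,checked_in_at
1,10,2014-05-01
2,10,2014-05-02
1,11,2014-05-03
1,10,2014-05-01
"""

Record = Tuple[int, int, str]


def extract(text: str) -> List[Dict[str, str]]:
    """Extract: parse the raw crawled text into rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[Record]:
    """Transform: cast ids to integers and drop exact duplicates, keeping order."""
    seen = set()
    clean: List[Record] = []
    for row in rows:
        record = (int(row["user_id"]), int(row["foodcourt_id"]), row["checked_in_at"])
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def load_relational(records: List[Record]) -> List[Dict[str, object]]:
    """Load (relational target): rows of a join table such as USER_CHECKIN_FOODCOURT."""
    return [{"user_id": u, "foodcourt_id": f, "checked_in_at": t}
            for u, f, t in records]


def load_graph(records: List[Record]) -> Dict[int, List[Tuple[int, str]]]:
    """Load (graph target): one check-in edge per record, grouped by user node."""
    edges: Dict[int, List[Tuple[int, str]]] = {}
    for u, f, t in records:
        edges.setdefault(u, []).append((f, t))
    return edges


records = transform(extract(RAW_CSV))
relational_rows = load_relational(records)
graph_edges = load_graph(records)
```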


Table 1 gives brief descriptions of some of the objects.

Table 1. Object description in data

Object name (vn-vi) | Object name (en-us) | Description
… | … | … friend
… | … | … venue
… | … | … venue
DIADIEM_MONAN | FOODCOURT_FOOD | Relationship between COURT and FOOD
… | … | Food information, corresponding to some courts
THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_FOODCOURT | Relationship between USER and COURT, related to the action "Check-in"
THANHVIEN_LIKE_DIADIEM | USER_LIKE_FOODCOURT | Relationship between USER and COURT, related to the action "Like"

EXPERIMENT

To evaluate the time efficiency of queries on the two database management systems, we predefined three queries corresponding to existing problems on densely connected network data. The queries are presented in Table 2. We measured the running time on a machine with a 2.1 GHz Core i3 CPU and 2 GB of RAM. Execution time is measured in milliseconds (ms).

Table 2. Experiment queries on food court data

1. Finding friends of friends at various depth levels

2. Browsing food courts that friends have checked in to, liked, commented on, or rated, with the given properties

3. Suggesting food courts that follow a pattern (a user who visited X then visits Y)
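Measuring execution time in milliseconds, as in these experiments, can be sketched with a small timing helper. The paper does not state how many times each query was run, so the `repeats` parameter is an assumption, and `placeholder_query` is a hypothetical stand-in for a real database call.

```python
import time
from statistics import mean
from typing import Callable, List


def time_query_ms(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run run_query `repeats` times and return each wall-clock duration in ms."""
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def placeholder_query() -> int:
    # Hypothetical stand-in for sending a query to MSSQL server or Neo4j.
    return sum(range(10_000))


timings = time_query_ms(placeholder_query, repeats=3)
print(f"mean: {mean(timings):.3f} ms over {len(timings)} runs")
```

Keeping the individual durations rather than only the mean makes it possible to see whether a system's execution time is stable across runs, which is the property the comparison below emphasizes.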

Data

The experiments use the full data set described above. Table 3 lists the number of records of each object. The same data is used in MSSQL server and Neo4j.

Table 3. Data records used for the experiment

Object name (vn-vi) | Object name (en-us) | Number of records
THANHVIEN_CHECKIN_DIADIEM | USER_CHECKIN_VENUE | 3886


Query

As presented above, the experiments evaluate the two systems on three queries. Each of the following subsections describes the purpose of a query and analyzes its execution time (ms).

i. Query 1: Finding friends of friends at various depth levels

This query finds the friends of a given user, along with their properties, up to a given depth level. For example, suppose the current user's name is Nam. The query finds all friends of Nam up to the given depth level; if the depth level equals 2, the query finds not only the friends of Nam but also all friends of friends of Nam. The experimental result on the two database systems is presented in Figure 1. The execution time of this query in Neo4j tends to stay stable as the depth level increases, while the execution time in MSSQL server increases rapidly when the depth level reaches 5. Figure 2 also shows that the execution time in Neo4j grows with the depth level, but only slightly in comparison with MSSQL server.
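The logic of this query can be sketched as a depth-limited breadth-first traversal. The sketch below only illustrates what the query computes; it is not the Cypher or SQL used in the experiments. Apart from Nam, the user names and the `FRIENDS` structure are hypothetical.

```python
from collections import deque
from typing import Dict, Set

# Hypothetical friendship graph; only "Nam" comes from the example in the text.
FRIENDS: Dict[str, Set[str]] = {
    "Nam": {"An", "Binh"},
    "An": {"Nam", "Chi"},
    "Binh": {"Nam"},
    "Chi": {"An", "Dung"},
    "Dung": {"Chi"},
}


def friends_within(graph: Dict[str, Set[str]], user: str,
                   max_depth: int) -> Dict[str, int]:
    """Return {friend: depth} for everyone reachable from user in 1..max_depth hops."""
    depth_of: Dict[str, int] = {user: 0}
    queue = deque([user])
    while queue:
        current = queue.popleft()
        if depth_of[current] == max_depth:
            continue  # do not expand beyond the requested depth level
        for friend in graph.get(current, ()):
            if friend not in depth_of:
                depth_of[friend] = depth_of[current] + 1
                queue.append(friend)
    del depth_of[user]  # the user is not their own friend
    return depth_of
```

In a relational schema, each additional depth level typically requires another self-join on the friendship table, whereas the traversal above only follows links from the nodes already reached.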

Figure 1. First query's experimental result on execution time (axis label: Depth-level; series: RDBMS, Graph Datastore)

Figure 2. First query's experiment on Neo4j with high depth levels (series: Graph Datastore)

ii. Query 2: Browsing food courts that friends have checked in to, liked, commented on, or rated, with the given properties

This query lists all food courts related to the current user's friends (check-in, like, comment or rate) […] of execution. Even setting aside the fact that Neo4j is about 20 times faster than MSSQL server in this query, Neo4j still shows stable execution time.

Figure 3. Second query's experimental result on execution time
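What the second query computes can be sketched in plain Python as follows. All data, the `district` property, and the function name `courts_of_friends` are hypothetical; the sketch shows only the selection logic (friends' interactions, filtered by food court properties), not the query text run in the experiments.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical friendship data for the current user.
FRIENDS: Dict[str, Set[str]] = {"Nam": {"An", "Binh"}}

# (user, action, food court) triples; the actions are the four named in the query.
INTERACTIONS: List[Tuple[str, str, str]] = [
    ("An", "check-in", "Court X"),
    ("An", "like", "Court Y"),
    ("Binh", "rate", "Court X"),
    ("Chi", "comment", "Court Z"),  # Chi is not a friend of Nam
]

# Hypothetical food court properties stored as key-value fields.
FOODCOURTS: Dict[str, Dict[str, str]] = {
    "Court X": {"district": "1"},
    "Court Y": {"district": "3"},
    "Court Z": {"district": "1"},
}


def courts_of_friends(user: str, **required: str) -> List[str]:
    """Food courts that any friend of user interacted with and that match
    every required property value."""
    friends = FRIENDS.get(user, set())
    result = set()
    for who, _action, court in INTERACTIONS:
        props = FOODCOURTS.get(court, {})
        if who in friends and all(props.get(k) == v for k, v in required.items()):
            result.add(court)
    return sorted(result)
```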

iii. Query 3: Suggesting food courts that follow a pattern (a user who visited X then visits Y)

In the real world, people often need suggestions before making a decision. We assume that a user A who has visited food court X tends to visit food court Y next, and so on. With a large amount of data, such patterns can be generated, and this query is used to suggest them to users. The properties of the "next" food court are listed as well. In this case, we explore how the execution time increases for each database system when more criteria (the actions check-in, like, comment) are included. The result is presented in Figure 4 and Figure 5. When the query includes more criteria, the execution time naturally increases on both database systems, but the stability of Neo4j is clearly evident.
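One simple way to derive "X then Y" patterns is to count how often each food court is visited directly after another and to suggest the most frequent successors. The sketch below assumes that "then" means the immediately following visit, which the text does not specify; the visit histories and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List

# Hypothetical visit histories per user, each in chronological order.
VISITS: Dict[str, List[str]] = {
    "u1": ["X", "Y", "Z"],
    "u2": ["X", "Y"],
    "u3": ["X", "Z"],
}


def build_transitions(histories: Dict[str, List[str]]) -> DefaultDict[str, Counter]:
    """Count, for every food court, how often each other court was visited right after it."""
    transitions: DefaultDict[str, Counter] = defaultdict(Counter)
    for history in histories.values():
        for current, following in zip(history, history[1:]):
            transitions[current][following] += 1
    return transitions


def suggest_next(transitions: Dict[str, Counter], court: str,
                 top_n: int = 1) -> List[str]:
    """Return up to top_n most frequent successors of the given food court."""
    counts = transitions.get(court, Counter())
    return [nxt for nxt, _count in counts.most_common(top_n)]


transitions = build_transitions(VISITS)
```

Adding more criteria, such as also counting likes or comments, would mean building one such transition table per action or merging them, which is one reason the cost grows with the number of criteria.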

Figure 4. Third query's experimental result on execution time according to the included criteria

[Plot; surviving axis labels: "Execution Times", "Number of criteria"]

Figure 5. Third query's experimental result on execution time on Neo4j

Application development

Based on the characteristics of the crawled data and several functional and non-functional requirements, we developed an information system application that uses ASP.NET MVC [18] and Neo4j community server [3, 4], aiming to show the feasibility of an approach to storing and querying large, scalable data.

The application is deployed as a website that serves the same purpose as Foody but focuses on advanced queries that exploit the capabilities of the graph data store.

CONCLUSION

In this paper, we presented a shallow approach to querying data on a graph database and compared it with a relational database. We also described the advantages and disadvantages of the graph database Neo4j in comparison with MSSQL server as a case study. A graph database is well suited to scalable data that can be represented as nodes and links between them. Experiments show that the graph database is considerably more effective than the relational database when queries are complex and require join operations between objects. As a drawback, for simple queries or on data with sparse relationships, the relational database still shows high performance compared with the graph database. Therefore, the graph database is most suitable for large-scale, dense data. One of the reasons is that a relational database enforces many data constraints, which are not considered important at run time in a graph data store.

However, our research still has some limitations, such as the specific SQL and NoSQL systems used: in this case, MSSQL server for SQL and Neo4j for NoSQL. To get an objective view, a comparison across a wider set of systems should be carried out on the crawled real data that we have already prepared. Moreover, the application we built should be deployed in practice to get feedback as the scalable data grows.

REFERENCES

[1] Han, Jing, et al. "Survey on NoSQL database." Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on. IEEE, 2011.

[2] Robinson, Ian, Jim Webber, and Emil Eifrem. Graph Databases. O'Reilly Media, Inc., 2013.

[3] Miller, Justin J. "Graph Database Applications and Concepts with Neo4j." (2013).

[4] Partner, Jonas, Aleksa Vukotic, and Nicki Watt. Neo4j in Action. O'Reilly Media, 2013.

[5] Neubauer, Peter. "Graph databases, NOSQL and Neo4j." (2010).

[6] Holzschuher, Florian, and René Peinl. "Performance of graph query languages: comparison of cypher, gremlin and native access in neo4j." Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM, 2013.

[7] Stonebraker, Michael. "SQL databases v. NoSQL databases." Communications of the ACM 53.4 (2010): 10-11.

[8] Vicknair, Chad, et al. "A comparison of a graph database and a relational database: a data provenance perspective." Proceedings of the 48th Annual Southeast Regional Conference. ACM, 2010.

[9] McCreary, Dan, and Ann Kelly. "Making Sense of NoSQL." Greenwich, Conn.: Manning Publications.


[10] Banker, Kyle. MongoDB in Action. Manning Publications Co., 2011.

[11] Carlson, Josiah L. Redis in Action. Manning Publications Co., 2013.

[12] Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.

[13] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.

[14] Sumbaly, Roshan, et al. "Serving large-scale batch computed data with project voldemort." Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 2012.

[15] Karakasidis, Alexandros, Panos Vassiliadis, and Evaggelia Pitoura. "ETL queues for active data warehousing." Proceedings of the 2nd International Workshop on Information Quality in Information Systems. ACM, 2005.

[16] MySQL: the world's most popular open source database. MySQL AB, 1995.

[17] Mistry, Ross, and Stacia Misner. Introducing Microsoft® SQL Server® 2012. O'Reilly Media, Inc., 2012.

[18] Esposito, Dino. Programming Microsoft ASP.NET MVC. Pearson Education, 2011.
