
An Improvement Solution for Multiple-Attribute Information Searching Based on Structured P2P Networks

A thesis submitted in fulfillment of the requirements for the degree of

Master of Information Technology

December, 2009

Vietnam National University, Hanoi, Information and Library Center

Table of Contents

1.1 Overview and Motivation
1.2 Communication network models
1.3 P2P network models
1.3.1 P2P Network
1.3.2 P2P Network Models
1.3.2.1 Unstructured P2P Network
1.3.2.2 Hybrid P2P Network
1.3.2.3 Structured P2P Network
1.4 DHT-based Protocol
1.4.1 Distributed Hash Table - DHT
1.4.2 CHORD Protocol
1.4.2.1 Topology
1.4.2.2 Lookup and Insert
1.4.2.3 Join and Leave
1.4.2.4 Stabilization and Failure
1.5 Summary

2 Related Works
2.1 INS/Twine: Information distribution based on Attribute-Value trees
2.1.1 Solution
2.1.2 System architecture
2.1.3 System architecture
2.1.4 Summary

2.2 CDS: Information Distribution Based On Load Balancing Matrix
2.2.1 Solution
2.2.2 System architecture
2.2.2.1 Registering a content name
2.2.3 System architecture
2.2.3.1 Query resolution
2.2.3.2 Summary
2.2.4 Load Balancing Matrix (LBM)
2.2.4.1 The structure of LBM
2.2.5 System architecture
2.2.5.1 LBM management mechanism
2.2.6 Summary
2.3 Data Indexing
2.3.1 Solution
2.3.2 Insert a file
2.3.3 Lookups
2.3.4 Summary
2.4 SMAV: Searching Multiple-attribute Value
2.4.1 Solution
2.4.2 Distribution of information content
2.4.3 Information content name query
2.4.4 Summary

3 An Improvement Solution for Multiple-attribute Information Searching on Structured P2P Network
3.1 Idea
3.2 Three Levels Mapping Model
3.2.1 Overview
3.2.2 Three-Levels Sub-key Mapping Storing
3.2.3 Distribution of information content
3.2.4 Information query
3.2.5 Summary
3.3 The Dynamic Threshold Values
3.3.1 A formula of threshold values
3.3.2 Adjusted Distribution Algorithm
3.3.3 Updating Threshold Value Periodically
3.3.4 Adjusted Lookup Algorithm
3.4 Summary

4 Simulations and Evaluations
4.1 Qualitative Evaluations
4.2 Simulation Description
4.3 Evaluation Based On Simulations
4.3.1 Load balancing
4.3.2 Distribution of information content
4.3.3 Routing Performance

List of Figures

1.1 Client/Server network model
1.2 Peer-To-Peer network model with 5 peers
1.3 Locating resources in a Gnutella-like P2P environment
1.4 An example of Hybrid P2P Model
1.5 Data distribution process based on DHT
1.6 Chord's key space with 2^3 points
1.7 Chord's components
1.8 Lookup process of Chord's protocol with key 54
1.9 Joining phase of a node in CHORD protocol
2.1 Meta string and AV Tree
2.2 Architecture of INS/Twine System
2.3 Splitting a resource description into strands
2.4 The architecture of the CDS system
2.5 An example of distribution of AVs in nodes
2.6 The structure of the Load Balancing Matrix for {a_i, v_i}
2.7 An example of described data
2.8 Sample File Queries
2.9 Mappings between queries
2.10 Query mapping for three descriptors
2.11 An example of a mappings tree
2.12 An example of a path of queries
2.13 Key - sub-key mappings
3.1 The number of hop levels of a content name in pure SMAV
3.2 Mappings created from key k_i
3.3 The generation of distributed keys from a content name
3.4 Block diagram of the query process of improved SMAV
3.5 An example of the query process with 4 common keys
3.6 Combining common keys and an uncommon key
4.1 The distribution of AV pairs in content names
4.2 Number of information contents stored in each of 5000 nodes
4.3 The number of queries processed by each of 5000 nodes
4.4 Load balancing among nodes
4.5 Mappings stored in every node
4.6 Mappings created by CNs in DSMAV and SMAV
4.7 The number of keys stored in every node
4.8 Level-k sub-keys created by three solutions
4.9 Logical hop count required for each query
4.10 The maximum number of hop levels of three solutions
4.11 The number of successful queries

List of Tables

1.1 Comparison: Client/Server vs P2P
1.2 Definition of variables for node n using m-bit identifiers
2.1 Mapping table between distributed key and content names
2.2 Mapping table between distributed key and sub-keys
2.3 Mapping table between distributed key and uncommon keys
3.1 Mapping table between distributed key and content names
3.2 Mapping table between level-2 sub-keys and distributed keys
3.3 Mapping table between keys and contents

Abstract

Conventional information searching engines such as Google, Yahoo, and Wikipedia support only keyword-based searching on websites. They cannot search information in various kinds of resources such as personal devices (laptops, PDAs, cell phones) or shared files in a P2P network.

Besides, DHT-based P2P networks such as Chord, CAN, and Pastry can achieve exact query (i.e., query of an exact key) with the characteristics of scalability, efficiency, and fault tolerance. However, in the case of complex queries such as range queries or multiple-attribute queries, pure DHT is not efficient since lots of query messages must be sent.

In this thesis, we focus our attention on multiple-attribute query on DHT-based P2P networks. The big problem here is the load imbalance among nodes due to the appearance of common attribute/value pairs (AV pairs) in content names. The main idea of our method is to limit the number of content items assigned to an ID to the threshold value of each node, by creating sub-IDs from multiple AV pairs if those AV pairs appear in lots of content names. To reduce query cost, our system also keeps the mapping between an ID and its sub-IDs, if it exists, in the node responsible for the ID. Moreover, we store only the mappings that are created in the distribution process, to two nodes. Our method can achieve both efficiency and a good degree of load balancing even when the distribution of AV pairs is skewed. Our simulation results show the efficiency of our solution in respect of lookup time and the degree of load balancing.


Chapter 1

Introduction

With the unprecedented growth of information technology, today we can see that information appears everywhere. Information might be found in various kinds of resources such as personal devices (laptops, PDAs, cell phones), websites on the Internet, and shared files in P2P networks.

From this explosion of information, there are more and more information searching demands. Every day we need lots of information to communicate and work efficiently and easily. For instance, we search for weather forecast information before a trip or a picnic. We also search for the latest news of the day, references for a product to buy, information on market prices, etc. In many cases, if we seize desired information quickly and exactly, we might have more successful opportunities in communication and work. Therefore, information searching is a necessary demand in today's information age. The emergence of new applications and services will require an efficient information searching system which can realize complex queries on content names in a scalable manner (W. Adjie-Winoto & Lilley, 1999; Carzaniga & Wolf, 2001; Foster & Tuecke, 2002).

There are many large systems that allow searching information, such as the conventional search engines Google, Yahoo, Amazon, eBay, and Wikipedia. The Google engine allows users to search information based on keywords on the Internet. This engine can link to billions of websites to search information. Information on each website is described by keywords, which are then processed and stored in Google's servers.

Conventional search systems often use the Client/Server model, where servers provide searching services to clients. However, the Client/Server model has some disadvantages. Firstly, it has limitations in scalability: servers are built at high cost because they need a very big capacity for processing and storing. Secondly, each server may be a single point of failure; when the server goes down, operations will cease. Moreover, as the number of simultaneous client requests to a given server increases, the server can become overloaded. When a big number of clients join the network, traffic congestion on the network also becomes an issue.

Recently, the appearance of the Peer-to-Peer (P2P) network model has attracted the interest of many people. P2P networks, with their decentralized control, self-organization, and adaptation, have emerged as a significant social and technical phenomenon over the last years. Unlike the Client/Server model, P2P networks aim to aggregate large numbers of computers that join and leave the network frequently. In pure P2P systems, individual computers communicate directly with each other and share information and resources without using dedicated servers. For example, they provide infrastructure for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., Napster (Idit Keidar, 2006; Napster, 1999), FreeNet, Gnutella (Gnutella, 1999)) or that support collaborative environments (Groove).

In P2P networks, all clients provide resources, including bandwidth, storage space, and computing power. As more and more nodes join the system, the total capacity of the system increases. This is not true of the Client/Server network model with a fixed set of servers, in which adding more clients could mean slower data transfer for all users. The distributed nature of P2P networks also increases robustness in case of failures by replicating data over multiple peers, and by enabling peers to find the data without relying on a centralized index server. In the latter case, there is no single point of failure in the system.

Information searching on P2P networks has attracted attention in recent years. The advantages of the P2P network model allow us to construct information searching systems with scalability and fault tolerance, because the whole of the system's data is distributed to all nodes; each node is responsible for a portion of the data and takes part in the search process. The Gnutella network (Gnutella, 1999) supports sharing and searching files. It searches data by flooding messages to the whole network. Nevertheless, the Gnutella network requires high overhead, and a search may fail because a query may not be routed to the node responsible for the desired information. Hence, it searches information inefficiently. The eDonkey (Weikum, 2002) network is suited to sharing and searching big files such as video files, full music albums, and software, based on some nodes (called indexing servers) that allow users to locate files within the network. However, maintaining indexing servers causes a "central point of failure": if the indexing servers cease to operate, a query may not be sent to the nodes whose data are indexed in those servers. In general, scalability, efficiency, and fault tolerance in message routing and information query on P2P networks are essential problems that must be resolved.

The adoption of Distributed Hash Tables (DHTs) such as CAN (S. Ratnasamy & Karp, 2001), Chord (Stoica et al., 2001), Pastry (A. Rowstron, 2001), etc. offers a promising solution to routing problems in P2P networks. A DHT-based network constructs a virtual key space where each node is responsible for a portion of the key space, and keys are used as destinations to route messages. In the case of Chord, a message can be routed to the node responsible for a destination key in O(log N) hops, while each node only needs to maintain O(log N) links to other nodes, where N is the total number of nodes in the network. Some typical DHT-based networks are BitTorrent, the Storm botnet, the Kad network, and YaCy. With the above organization, DHT-based networks meet a problem: they only achieve exact queries. In a DHT-based network, an information content is represented by a key derived from its description. If users want to search for this information content, they need to send a complete query to the network. In fact, if users only remember a portion of the information content, they will not be able to find it. In the case of complex queries such as range queries or multiple-attribute queries, pure DHT is not efficient since a lot of query messages must be sent.
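As background for the exact-query limitation just described, the key-to-node mapping of a DHT can be sketched as follows. This is a minimal sketch: the 16-bit key length, the node IDs, and the name format are illustrative assumptions, and Chord itself uses 160-bit SHA-1 identifiers with a finger-table-based O(log N) lookup rather than the linear scan used here for brevity.

```python
import hashlib

def dht_key(name: str, m: int = 16) -> int:
    """Hash a complete content name into an m-bit DHT key."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % (2 ** m)

def successor(node_ids: list[int], key: int) -> int:
    """The node responsible for a key is the first node ID >= the key
    on the identifier circle (wrapping around at the end)."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key:
            return nid
    return ring[0]  # wrap around the circle

nodes = [3, 1200, 9000, 40000, 61000]           # hypothetical node IDs
k = dht_key("name=ThinkPad cpu=2GHz ram=4GB")   # exact name is required
print(successor(nodes, k))                      # node holding the exact key
```

Note that a query hashed from a partial name would yield an entirely different key, which is exactly why pure DHT supports only exact queries.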

In this thesis, we focus our attention on DHT-based P2P networks and the realization of a multiple-attribute searching system based on DHT. Simple DHT-based distribution of information contents can be used to realize exact queries (i.e., querying information contents which have the same content name as the name in the query) efficiently. However, it is not efficient for complex queries such as multiple-attribute queries, because the number of information contents that match a complex query is often large and therefore a lot of query messages must be sent to resolve a query. To realize multiple-attribute queries efficiently on a DHT-based P2P network, a framework built above the DHT is necessary.

Several papers (Gao, 2004; Y. Arakawa et al., 2005; M. Balazinska, 2002; Garces-Erice & Ross, 2004) have proposed solutions for the above problems. In these solutions, content names are represented by attribute and value (AV) pairs. In (Gao, 2004; Y. Arakawa et al., 2005), each AV pair of a content name is hashed to a key. In (M. Balazinska, 2002; Garces-Erice & Ross, 2004), a content name forms an attribute/value tree and each strand of the AV tree is matched to a key. The information content is then distributed to each node responsible for a key based on the DHT routing algorithm. An information query is realized by looking up the nodes that are responsible for the keys created from a query's AV pairs. However, the distribution of information content based on each AV pair or each strand of an AV tree causes load unbalancing, since there may be common AV pairs that appear in content names with high probability. There are several solutions to this problem, such as using multiple nodes responsible for a key (Gao, 2004) or limiting the number of content items corresponding to a key in a node (Y. Arakawa et al., 2005). However, these approaches do not achieve a good balance between load balancing and query efficiency, though there is a tradeoff between the two.
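The per-AV-pair distribution scheme, and the skew that common AV pairs produce, can be sketched as follows. The attribute names, the 32-bit key length, and the toy content set are invented for illustration and are not from the cited systems.

```python
import hashlib
from collections import Counter

def av_key(attr: str, value: str, m: int = 32) -> int:
    """One DHT key per attribute/value pair, as in per-AV-pair schemes."""
    h = hashlib.sha1(f"{attr}={value}".encode()).hexdigest()
    return int(h, 16) % (2 ** m)

content_names = [
    {"type": "laptop", "ram": "4GB"},
    {"type": "laptop", "ram": "2GB"},
    {"type": "laptop", "ram": "8GB"},   # "type=laptop" is a common AV pair
    {"type": "pda", "ram": "1GB"},
]

load = Counter()
for name in content_names:
    for attr, value in name.items():
        load[av_key(attr, value)] += 1  # register content under every AV key

# The node holding the key for the common pair stores 3 of the 4 contents,
# while the remaining keys hold one each: the load is skewed.
print(load[av_key("type", "laptop")])
```

This skew is precisely the load-unbalancing problem that the threshold-based approaches try to bound.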

In this thesis, we propose a system called DSMAV (using Dynamic threshold values for Searching Multiple Attribute/Value) to distribute and query information content based on AV pairs on a DHT-based network. The main idea of our system is to limit the number of information contents that are distributed by a distributed key to a reasonable threshold value. This is done by creating sub-keys from multiple AV pairs in a content name if these AV pairs already appear in a lot of content names. If the total number of information contents of a sub-key is not over a threshold value, it will be used as a distributed key to distribute its information content. Our system also keeps the mappings between sub-keys created from two common AV pairs and distributed keys created from more than two common AV pairs, to reduce query cost. Our system satisfies the three following requirements:

• Efficient distribution: A content name will be distributed to only a few nodes. The mappings and keys are distributed to every node in reasonable numbers. Mappings between a level-2 sub-key and a level-m sub-key (m > 2) are stored in the nodes that take responsibility for level-2 sub-keys.

• Efficient searching: If a query name contains an uncommon AV pair, it will be resolved by querying only the node which is responsible for the key created from the uncommon AV pair. If a query name contains only common AV pairs, query messages will be sent quickly to the nodes responsible for the primary key and the level-2 sub-key. The node responsible for the level-2 sub-key also uses the mappings between the level-2 sub-key and other sub-keys to route query messages to the other nodes, which store the queried contents.

• Load balancing: The number of information contents corresponding to a distributed key is not over a threshold value. Hence, even if the distribution of AV pairs in content names is skewed, the number of information contents stored in each node can be kept balanced naturally.

Our solution uses DHT-based protocols to route messages. The keys, which are created from one or more AV pairs, are routed by DHT-based protocols such as CHORD, while the information contents, which include a set of attribute-value pairs, are distributed to the nodes effectively. An information content that contains common AV pairs is distributed to a set of nodes by using keys that are hashed from these common AV pairs. Similarly, the query process also hashes one or more AV pairs of a query name into keys, which are sent to the nodes that take responsibility for those keys. The returned result is a list of information contents that match the AV pairs of the query name.
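A minimal sketch of this threshold rule follows, assuming a fixed threshold and a local dictionary in place of real DHT nodes; DSMAV's actual threshold is dynamic and its sub-keys extend beyond level 2, so this only illustrates the core idea of capping a key and widening to a sub-key.

```python
import hashlib

THRESHOLD = 2  # illustrative; DSMAV derives this value dynamically per node

def key_of(*pairs) -> str:
    """Distributed key hashed from one or more AV pairs (hypothetical encoding)."""
    joined = ";".join(sorted(f"{a}={v}" for a, v in pairs))
    return hashlib.sha1(joined.encode()).hexdigest()[:8]

store = {}     # distributed key -> list of content names
mappings = {}  # primary key -> set of sub-keys, kept to reduce query cost

def distribute(content: dict) -> None:
    pairs = list(content.items())
    primary = key_of(pairs[0])
    if len(store.get(primary, [])) < THRESHOLD:
        store.setdefault(primary, []).append(content)
        return
    # Primary key is saturated: build a level-2 sub-key from two AV pairs
    # and keep the primary -> sub-key mapping for later query routing.
    sub = key_of(pairs[0], pairs[1])
    mappings.setdefault(primary, set()).add(sub)
    store.setdefault(sub, []).append(content)

for c in [{"type": "laptop", "ram": "4GB"},
          {"type": "laptop", "ram": "2GB"},
          {"type": "laptop", "ram": "8GB"}]:
    distribute(c)

print(len(store[key_of(("type", "laptop"))]))  # 2: capped at the threshold
```

The third content overflows the primary key, so it is stored under a level-2 sub-key, and the node holding the primary key remembers that sub-key for query forwarding.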

Our system can achieve both efficiency and a good degree of load balance even when the distribution of AV pairs in content names is skewed. Our simulation shows the effectiveness of our proposed system.

Our thesis is organized in five chapters. In the first chapter, we introduce an overview of P2P networks and P2P network models. In Chapter Two, we describe conventional solutions for multiple-attribute information searching as our related works. Then, we propose a solution for multiple-attribute information searching based on dynamic threshold values on a structured P2P network in Chapter Three. Based on our proposed solution, in Chapter Four we implement simulations and evaluate the solution qualitatively and quantitatively. Last of all, we summarize the carried-out work and discuss trends of future development in Chapter Five.

1.2 Communication network models

With the rising requirement of searching for information everywhere, there are more and more people who might want to share, search, or download data from a network environment. Today, network environments allow users to create and share information resources together. Besides, the appearance of the Internet, which contains frequently updated information, supports users' information searching whenever and wherever they need it. Users can search the latest news, information on a book, or reviews of business products, or download resources from the Internet easily. To do this, some information searching systems allow users to search data from everywhere via ubiquitous computing models.

Ubiquitous computing is a post-desktop model of human-computer interaction in which information processing has been thoroughly integrated into everyday objects and activities. In the course of ordinary activities, someone "using" ubiquitous computing engages many computational devices and systems simultaneously, and may not necessarily even be aware of doing so. For this reason, the ubiquitous computing model is used in many information searching systems.

In these systems, the term "distributed information" describes the many information resources that can connect together over a communication network environment. They contain devices which store information, such as PDAs, mobile phones, and laptops. Their data may be pieces of information, such as the description of a laptop or a book, or data files such as PDF, MP3, and video files. To support users in searching the information flexibly, the systems use the characteristics of an object to represent information. In other words, each object may be described by multiple attributes based on its characteristics. For instance, a laptop can be described by characteristics such as the name of the manufacturer, CPU, RAM, and HDD. If an object is represented by more attributes, it can be found more easily. For this reason, an information searching system will store lots of data created from objects' representations. It hence needs to analyze and organize the data. Reasonably choosing a communication network model and implementing a good protocol will increase the performance and effectiveness of the system.
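Such a multiple-attribute description, and the partial matching it enables, might look like the following sketch; the attribute names and values are invented for illustration.

```python
# A laptop described as attribute/value (AV) pairs. The more pairs an
# object is described by, the more partial queries can find it.
laptop = {
    "manufacturer": "Lenovo",
    "cpu": "2.4GHz",
    "ram": "4GB",
    "hdd": "320GB",
}

def matches(content: dict, query: dict) -> bool:
    """A content matches when it contains every AV pair of the query."""
    return all(content.get(a) == v for a, v in query.items())

print(matches(laptop, {"ram": "4GB"}))                  # True
print(matches(laptop, {"ram": "4GB", "cpu": "3GHz"}))   # False
```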

There are two typical communication network models which a multiple-attribute information searching system may use, namely Client/Server and P2P networks. The Client/Server network model (Figure 1.1) is used widely and popularly everywhere. This model is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Often clients and servers operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs which share its resources with clients. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers, which await incoming requests.

With such an architecture, the Client/Server network model has some advantages and disadvantages.

Figure 1.1: Client/Server network model

In most cases, it allows the roles and responsibilities of a computing system to be distributed among several independent computers that are known to each other only through a network. This creates an additional advantage of this architecture: ease of maintenance. For example, it is possible to replace, repair, upgrade, or even relocate a server while its clients remain both unaware and unaffected by that change. All data are stored on the server, which generally has better security controls than most clients. Servers can control access and resources to guarantee that only those clients with the appropriate permissions may access and change data. Since data storage is centralized, updating data is administered easily. The model also functions with multiple different clients of different capabilities. However, the Client/Server model may not be reasonable for constructing distributed information searching systems, because of its disadvantages. Traffic congestion on the network has been an issue since the inception of the Client/Server network model. The server may become overloaded if it processes a lot of requests from a big number of simultaneous clients. It may be hard to achieve scalability with a single point of failure: in case the server is damaged, clients' requests cannot be fulfilled and the system will cease. Moreover, the activities of the server require administrators who are knowledgeable experts about the network and system. This increases the cost of management and operation.

In contrast with the Client/Server network model, in the P2P network model the aggregated bandwidth actually increases as nodes are added, since the P2P network's overall bandwidth can be roughly computed as the sum of the bandwidths of every node in that network. It is very suitable for a distributed network architecture composed of participants that contribute a portion of their resources (disk storage, network bandwidth, or processing cycles). Peers are both suppliers and consumers of resources, in contrast to the traditional Client/Server model where only servers supply and clients consume. The organization of a P2P network overcomes the problems of the Client/Server network. Table 1.1 shows a comparison of the Client/Server and P2P models. In the next section, we describe the P2P network architecture with its advantages and disadvantages.

Table 1.1: Comparison: Client/Server vs P2P

  Client/Server                                    P2P
  Global knowledge: servers have a global view     Local knowledge: nodes only know a small part of the network
  Centralization: communications and management    Decentralization: no global knowledge, only local interactions
    are centralized
  Single point of failure: a server failure        Robustness: several nodes may fail with little or no impact
    brings down the system
  Limited scalability: servers easily overloaded   High scalability: high aggregate capacity, load distribution
  Expensive: server storage and bandwidth          Low-cost: storage and bandwidth are contributed by peers

1.3 P2P network models

1.3.1 P2P Network

In contrast to client/server networks, where network information is stored on a centralized file server PC and made available to tens, hundreds, or thousands of client PCs, the information stored across P2P networks is uniquely decentralized. A P2P network model allows many PCs to pool their resources together. Individual resources like desktops, laptops, PDAs, and storage devices are transformed into shared, collective resources that are accessible from every PC.

Because P2P PCs have their own hard disk drives that are accessible by all computers, each PC acts as both a client (information requestor) and a server (information provider). In Figure 1.2, five P2P workstations are shown. All five computers can communicate directly with each other and share one another's resources.

A P2P system is a distributed collection of peer nodes. Each node may provide services to other peers and consume services from other peers. The data of each node may be a portion of the common data of the system; that is, the common data of the system is distributed over all available nodes. Each node stores a portion of the common data and the related services. Therefore, the load of storing and processing data is also divided into several parts, which correspond to the peer nodes. Each node stores and processes a small number of pieces of data, so if some nodes fail, the network system is almost not affected seriously.

As a completely decentralized model, it allows the development of applications with high availability, fault tolerance, and scalability, such as content sharing (file sharing, content delivery networks: Gnutella, eMule, Akamai), sharing of storage (distributed file systems), and sharing of CPU time (parallel computing, Grid).

Figure 1.2: Peer-To-Peer network model with 5 peers

There are three models of P2P network: Unstructured, Hybrid, and Structured P2P networks. Each model has individual advantages and disadvantages.

1.3.2 P2P Network Models

1.3.2.1 Unstructured P2P Network

An unstructured P2P network is an unregulated overlay where each participating node connects to others randomly and arbitrarily; nodes act as equals and merge the roles of clients and servers. An unstructured P2P network has no central server managing the network, nor a central router. The joining of a new node is based on a well-known node called a bootstrap node. The bootstrap node is flexible in the new node's neighbor selection and routing mechanisms. The topology of an unstructured P2P network has a profound impact on the efficiency of search. The typical unstructured P2P network system is Gnutella (Gnutella, 1999) (Figure 1.3).

Figure 1.3: Locating resources in a Gnutella-like P2P environment

For the purpose of file sharing, the Gnutella system is constructed based on the unstructured P2P model. The peer nodes of the system are organized randomly in an unstructured overlay. The data of the system consist of files such as MP3, image, and video files, each of which any peer node may share with other nodes. A new node that wants to join the system performs a "Join" operation. Firstly, it contacts a bootstrap node. The bootstrap node sends back a list of existing nodes, chosen randomly. Then, the new node stores and builds relationships to the listed existing nodes as its neighbors. From the neighbor nodes, the new node may reach other nodes in the communication network. Thence, the new node continues to get more addresses of other existing nodes from the nodes it can reach, and so on. Finally, the new node exists as a peer node of the system. Figure 1.3 shows an example of locating resources in a Gnutella-like P2P environment.

…a query may not be found (Figure 1.3). Popular data is usually found easily because it is stored by many nodes. The number of copies of popular data is enough to locate a searched content by flooding the network. Using flooding allows the search to be performed easily and reliably in a highly connected overlay.

However, flooding has some disadvantages. Firstly, it creates lots of duplicate query messages sent to a given node from its many neighbors. Secondly, it is difficult to determine the appropriate Time-to-Live (TTL) which controls the flooding process. A high TTL achieves high search reliability but requires high overhead. Characteristics of the overlay affect the flooding effectiveness versus the overhead. Otherwise, a limitation of scale-free topologies is the high load on a very few hub nodes. Peers are not willing to maintain high loads, as they may not want to store a large number of entries for the construction of the overlay topology.
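The TTL-controlled flooding described above can be sketched as breadth-first forwarding over a toy overlay; the topology and the file placement are invented for illustration, and real Gnutella forwards over live connections rather than a global adjacency list.

```python
from collections import deque

# Adjacency list of a small unstructured overlay (hypothetical topology)
overlay = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D"],
}
files = {"D": {"song.mp3"}, "E": {"clip.avi"}}

def flood(src: str, wanted: str, ttl: int) -> bool:
    """Forward the query breadth-first until the TTL runs out."""
    seen = {src}
    frontier = deque([(src, ttl)])
    while frontier:
        node, t = frontier.popleft()
        if wanted in files.get(node, ()):
            return True
        if t == 0:
            continue  # TTL exhausted: drop the query here
        for nb in overlay[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, t - 1))
    return False

print(flood("A", "clip.avi", ttl=2))  # False: E is 3 hops from A
print(flood("A", "clip.avi", ttl=3))  # True
```

The two calls show the TTL tradeoff directly: a TTL of 2 misses the rare file three hops away, while raising it to 3 finds the file at the cost of flooding more of the overlay.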

1.3.2.2 Hybrid P2P Network

In the unstructured P2P network model, queries might not always be resolved. Although popular data might be stored at several peers, if a peer node looks up data shared by only a few other peers, then it is highly likely that the search will not be successful, because there is no correlation between a peer and the data it manages. There is no guarantee that flooding will find a peer that has the desired data. Flooding also causes a high amount of signaling traffic in the whole network, and hence such networks have very poor search efficiency.

The structure of the hybrid P2P network model might tackle the problems of routing and lookup in unstructured P2P networks. In it, peer nodes are divided into two types: data nodes and hub nodes. A data node stores real data and the information of a hub node. A hub node is like a "server": it does not store real data, only indexes to files in the network. To join the system, each node has to contact a server and provide the real data which it wants to share with others. The server then creates indexes for these files and stores the indexes in its database.

Figure 1.4 shows an example of the hybrid P2P model. Servers are entrusted with routing tasks. Each server stores indexes to files and information about data nodes and other servers. Each data node stores data files and the information of a server. Data files are distributed among the participating data nodes. However, to look up data, each node needs to use routing information that is stored centrally in the servers.



Figure 1.4: An example of Hybrid P2P Model

Figure 1.4 also shows an example of looking up a file in the Hybrid P2P Model. The searching node sends a query message to its server. The server looks up the files' locations in the list of indexes in its database. It also sends query requests to other servers simultaneously. Those servers search for indexes corresponding to the desired files and return the files' locations to the searching node. Finally, the searching node contacts the data nodes which contain the desired files and downloads the real files directly. In this example, node A would connect to node B and then download data files from node B directly after getting location information from the servers.

Being a decentralized network model, the eDonkey network is a typical Hybrid P2P Model. It is best suited to sharing big files among users. It allows sharing video files, full music albums and software. There is no central hub for the network; data files are distributed among peer nodes. eDonkey servers act as communication hubs for the clients, allowing users to locate files within the network.

Like the Unstructured P2P Network model, the Hybrid P2P network is implemented easily because it does not need to implement distribution and routing algorithms. Furthermore, because servers undertake looking up the locations of data files, the Hybrid model does not use query flooding or random searching like the Unstructured P2P model, so it is more efficient. However, maintaining indexing servers causes a "central point of failure". If a server stops providing service, the data whose indexes are stored in that server's database would no longer be reachable in the system. Therefore, the whole system would cease if all servers stopped operating.


1.3.2.3 Structured P2P Network

The Structured P2P model is constructed to achieve efficient data searching, routing and distribution among peer nodes.

Unlike the Hybrid P2P network, where some hub nodes act as "servers", the Structured P2P Network is completely decentralized and self-organizing. Each node has the capacity to store data and to route to other nodes based on a common routing algorithm. To join the network, a new node is assigned an ID, which is hashed by a globally consistent hash function. The position of the node is determined by its ID. From this ID, the node is responsible for the portion of data corresponding to its partial ID space. To guarantee an equal distribution of data, the Structured P2P network uses a distributed hashing method.

The distributed hashing method is based on a Distributed Hash Table (DHT). From its ID, each node can determine its position and the range of data managed by it. It can likewise determine the position of any data it queries. As a result, the searching and routing progresses are performed efficiently.

The Structured P2P network is a scalable, efficient, completely decentralized, self-organizing and load-balanced model. Each node stores information of O(logN) neighbors. The routing algorithm allows it to look up any key within O(logN) hops. Because the routing algorithm is based on a globally consistent hash function, the set of keys is distributed equally over the key space. This allows the system to achieve load balancing among nodes. With the characteristics of a pure P2P network, each node is self-organizing and highly available.

Consequently, because the Structured P2P Network has many advantages, it is of great interest for distributed information searching systems. Some proposed systems, such as INS/Twine (M. Balazinska, 2002), CDS (Gao, 2004) and Data Indexing (Garces-Erice, K. Ross, 2004), use Structured P2P Network protocols. The typical protocols include CAN, PASTRY and CHORD. In the next section, we present the Distributed Hash Table and CHORD, a typical protocol implemented on the Structured P2P Network model.


1.4 D H T-based Protocol

1.4.1 Distributed Hash Table - DHT

A Distributed Hash Table (DHT) is used to construct decentralized distributed network systems. It provides a lookup service similar to a hash table. A DHT contains key-value pairs, from which each participating node can retrieve the value associated with a given key efficiently. Each node is responsible for storing and maintaining a portion of the key-value pairs, which are mappings between a key and a value. Each key is managed by a participating node. Similarly, each node manages a set of key-value pairs, including key-value pairs sent from other nodes. As more nodes join the system, the amount of data stored at each participating node is reduced. Hence, the load capacity and performance of each node can increase. This allows a DHT to scale to extremely large numbers of nodes.

Figure 1.5: Distribution data progress based on DHT

Figure 1.5 shows how data is distributed in a DHT-based P2P network. Each data item is hashed into a key by using a consistent hash function. The key is distributed to the peer node that is responsible for the key.
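The hash-then-place step of Figure 1.5 can be sketched as follows. The 2^6 ID space, the node IDs and the choice of SHA-1 are illustrative assumptions, not part of the thesis:

```python
import hashlib

# Minimal sketch of DHT-style placement: hash a content description to a
# numeric key, then assign the key to its successor node on the ring.

ID_SPACE = 2 ** 6

def chord_hash(text):
    """Map any string into the ring's ID space with a consistent hash."""
    digest = hashlib.sha1(text.encode()).hexdigest()
    return int(digest, 16) % ID_SPACE

def successor(key, nodes):
    """The node responsible for a key is the first node ID >= key,
    wrapping around the ring."""
    for node in sorted(nodes):
        if node >= key:
            return node
    return min(nodes)  # wrap around past the largest ID

nodes = [8, 21, 42, 56]
key = chord_hash("camera manual.pdf")
owner = successor(key, nodes)
assert owner in nodes
assert successor(10, nodes) == 21
assert successor(60, nodes) == 8   # wraps around to the smallest ID
```

Any peer that evaluates the same hash on the same description reaches the same owner, which is what makes fully decentralized lookup possible.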

A "Join" procedure is constructed to support a new node joining the DHT-based P2P network. The new node is assigned an ID by using a globally consistent hash function. Thence, the common routing progress of the procedure allows the new node to know its position in the overlay network. Next, the new node receives the list of its neighbors and the data corresponding to the partial key space managed by it. Its neighbors also receive information about the node and store it in their databases. Finally, the new node is recognized as a full-fledged member of the system. From then on, it takes part in the routing and data querying progress with the other nodes.



In DHT-based protocols, there are two basic functions: "Insert" and "Lookup". The Insert function allows a node to insert its data into the system. The Lookup function is known as the query function. When a node, called the distributing node, wants to insert data into the system, it creates a distributed key from the data's description. From the key, it determines the node responsible for the key based on the routing algorithm of the DHT-based protocol. Next, the distributing node distributes the key and the key's data to that node. Based on the same routing algorithm, the lookup progress is performed similarly. A query node creates a queried key from the description of a query name. The queried key is sent to the node responsible for it, following the common routing algorithm. The queried node looks up the data in its database and returns information about the data node, which is responsible for the real data. Finally, the query node downloads the data from the data node directly.

Otherwise, to keep the system stable, DHT-based P2P networks use some other procedures such as "Leave" and "Stabilization". The "Leave" procedure is used when an existing node wants to withdraw from the network. Every node uses the "Stabilization" procedure periodically to update the status of its neighbors. The algorithms of these procedures are implemented in the system depending on the protocol used.

Some typical DHT-based protocols used in DHT-based P2P networks are CAN (S. Ratnasamy & Karp, 2001), PASTRY (A. Rowstron, 2001), and CHORD (Stoica et al., 2001). Because in this thesis we use the CHORD protocol as the low layer of our system, in the next section we describe the main details of this protocol.

1.4.2 CHORD Protocol

It was introduced in 2001 by Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. The CHORD protocol is one of the typical DHT-based protocols. CHORD's topology is constructed as a cycle (K. Gummadi et al., 2003), in which each point corresponds to an ID. It allows inserting and looking up keys efficiently, with the maximum number of hops not over O(logN), where N is the total number of IDs of the CHORD cycle. Each node keeps its status based on procedures such as Join, Leave and Stabilization, which are provided by the CHORD protocol.


1.4.2.1 Topology

In the CHORD protocol, node IDs and keys are arranged in a cycle which contains at most 2^m IDs or keys, from 0 to 2^m - 1. Each ID is the result of a consistent hashing function. Using a hashing function is necessary for scalability and efficient distribution: both IDs and keys are uniformly distributed over the same identifier space. Moreover, it allows nodes to join or leave the network without disrupting the network.

A CHORD cycle contains 2^m points. Each point corresponds to a key and an ID. A key or an ID is the result of the globally consistent hashing function. When data is inserted into the network, it is hashed to a key based on its description by using the hashing function. If a new node joins the network, it is also hashed to an ID based on its information, such as IP address and port, by using the hashing function. The key or ID is assigned to the corresponding point of the CHORD cycle. The position of a node is determined by its ID. Figure 1.6 shows the key and ID space of the CHORD protocol with 2^3 points. In this Figure, the three IDs corresponding to 0, 1 and 3 are real nodes.


Figure 1.6: Chord's key space with 2^3 points
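The assignment of keys to the real nodes 0, 1 and 3 on this 2^3 ring can be checked with a short sketch; the helper name is our own:

```python
# Key assignment on the 2**3 ring of Figure 1.6, where only IDs 0, 1
# and 3 are real nodes: each key belongs to its successor node.

def successor(key, nodes, m=3):
    """First real node at or clockwise after `key` on the 2**m ring."""
    for step in range(2 ** m):
        candidate = (key + step) % (2 ** m)
        if candidate in nodes:
            return candidate

nodes = {0, 1, 3}
assignment = {k: successor(k, nodes) for k in range(8)}
# Keys 2 and 3 land on node 3; keys 4..7 wrap around to node 0.
assert assignment == {0: 0, 1: 1, 2: 3, 3: 3, 4: 0, 5: 0, 6: 0, 7: 0}
```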


Figure 1.7: Chord's components





Table 1.2: Definition of variables for node n, using m-bit identifiers

Notation            Definition
finger[k].start     (n + 2^(k-1)) mod 2^m, 1 <= k <= m
         .interval  [finger[k].start, finger[k+1].start)
         .node      first node >= n.finger[k].start
successor           the next node on the identifier circle; i.e., finger[1].node
predecessor         the previous node on the identifier circle

Each node manages its key space and neighbors based on three components: Predecessor, Successor, and FingerTable (Figure 1.7). These components allow the node to search for and route to any key in the whole key space efficiently and quickly. Figure 1.7 shows node P2's components, which include a Predecessor P1, a Successor P3 and a FingerTable. The key space of P2 consists of the keys from P1+1 to P2. The FingerTable contains m entries, where the k-th entry is calculated by the formula in Table 1.2. It allows the node to determine the successor of the k-th ID/key in the FingerTable. This supports the lookup and routing progress of an ID/key.

Table 1.2 shows the definitions of the variables for node n using m-bit identifiers. The successor of a node is the next node on the identifier circle. For example, the successor of node 2 is node 3 (Figure 1.6). The successor of a node not only supports the routing progress but also allows the node to leave the network without disrupting it. The successor of a key lets the network know which node is responsible for the key. The predecessor is the previous node on the identifier circle. In Figure 1.6, the predecessor of node 0 is node 3. The predecessor supports the stable operation of the network. If a node leaves the network unexpectedly, the next node detects this based on its predecessor. From then on, the next node is responsible for the partial key space of the departed node, based on the stabilization procedure of the Chord protocol.

1.4.2.2 Lookup and Insert

Lookup a key: The lookup progress is carried out frequently in the CHORD protocol. It is invoked whenever a node wants to query or insert a key. The progress is operated simply and efficiently. Based on the small number of neighbors found in the FingerTable of each node, any key may be found after a routing progress of not more than O(logN) steps.

Figure 1.8: Lookup progress of Chord's protocol with key 54

A query node performs a lookup for a key as follows. It checks whether the key belongs to the ID range between its predecessor and itself. If yes, it knows that it is responsible for the key, and it looks up the information of the key in its own database. Otherwise, it looks for the key among the other nodes by using its FingerTable.

Firstly, the query node looks up the entry of its FingerTable whose field "start" contains the maximum ID that is less than the key. From that entry's field "Successor", it finds that ID's successor, and it sends a query message to that Successor. Secondly, after receiving the query message, the Successor node performs the same lookup progress as the query node. The Successor returns the information of the key if it is responsible for the key. Otherwise, it continues to forward the query message to the next successor based on its own FingerTable. The progress is repeated until the system finds the node responsible for the key. Finally, the query node knows the node responsible for the key. If the query node needs to query the key, it downloads the information of the key from the responsible node directly. With this lookup progress, the number of forwarding steps necessary to find the node responsible for any key is not more than O(logN), where N is the number of nodes in the Chord network.



Figure 1.8 shows an example of this progress. Firstly, node P8 uses its FingerTable to search for the Successor of an entry whose field "Start" contains the maximum ID less than key 54. It finds the node P42, which is the Successor of ID 40. Secondly, P8 sends a query message to the node P42. Similarly, P42 forwards the query message to the node P51. Thence, P56 receives the query message from P51. Because P56 is responsible for the key 54, it looks up and sends the information of the key 54 to the node P8. Finally, the node P8 downloads the data corresponding to the key 54 from the node P56 directly. The routing and lookup progress is performed over four nodes; the number of hops is 4. In this example, each node stores a FingerTable with 6 entries, and the number of real nodes is 10.

Insert a key: The procedure "Insert" of the CHORD protocol allows any node, called the distributing node, to insert a key into the system. A key is the result of the consistent hashing function applied to the data's description. The description of the data depends on the application that uses it. The inserting progress is done in two steps. Firstly, the distributing node looks up the node responsible for the key. The lookup progress is performed as presented above.

Secondly, the distributing node sends the key and its information to the responsible node, which updates the information of the key in its database. The number of hops taking part in the insert progress is the number of hops of the lookup progress for the key, plus one hop for sending the key and its information to the node responsible for the key.
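The two steps can be sketched as follows; the ID space, the node set and the simplified `lookup` stand-in (which replaces the full finger-table routing of the previous subsection) are assumptions for illustration:

```python
import hashlib

# Insert = lookup + one delivery hop: hash the description to a key,
# find the responsible node, then hand the key-value pair over.

ID_SPACE = 2 ** 6
NODES = {1: {}, 8: {}, 14: {}, 21: {}}   # node id -> local key store

def make_key(description):
    return int(hashlib.sha1(description.encode()).hexdigest(), 16) % ID_SPACE

def lookup(key):
    """Stand-in routing: the responsible node is the key's successor."""
    ids = sorted(NODES)
    return next((n for n in ids if n >= key), ids[0])

def insert(description, data):
    key = make_key(description)   # hop count = lookup hops ...
    owner = lookup(key)           # ... plus one hop to deliver the data
    NODES[owner][key] = data
    return owner, key

owner, key = insert("printer driver v2", "payload")
assert NODES[owner][key] == "payload"
```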

1.4.2.3 Join and Leave

Join: When a node wants to join the network, it needs to know an existing node in the network. In some cases, the existing node may be a bootstrap node. The join phases of a node are done as follows. Firstly, the joining node creates an ID from its information, such as IP address and port number, by using the globally consistent hash function. Thence, it sends a joining request to the bootstrap node. The bootstrap node uses the lookup procedure to locate the successor of the ID. Secondly, the joining node contacts and notifies the successor about its joining. The successor then checks whether the ID belongs to its partial key space, between its predecessor and its own ID. If yes, it changes its predecessor pointer to the joining node. Thirdly, the joining node and the successor update their databases and FingerTables by calling the Stabilization procedure, which is presented later. Fourthly, the joining node updates its successor and predecessor correctly. Figure 1.9 shows the joining phase of a node in the CHORD protocol in 4 steps.

Figure 1.9: Joining phase of a node in CHORD protocol
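The joining phases might be sketched as below; the `Node` class and the direct pointer manipulation are our simplification of the protocol (FingerTable refresh is left to Stabilization):

```python
# A toy ring of two nodes (0 and 3); node 1 joins between them and
# takes over the keys in (0, 1] from its new successor.

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.successor = self
        self.predecessor = self
        self.keys = {}

def in_interval(x, a, b):
    return a < x <= b if a < b else (x > a or x <= b)

def find_successor(start, key):
    """Walk the ring until the key falls in (pred, node]."""
    node = start
    while not in_interval(key, node.predecessor.id, node.id):
        node = node.successor
    return node

def join(new, bootstrap):
    succ = find_successor(bootstrap, new.id)       # step 1: locate successor
    pred = succ.predecessor
    new.successor, new.predecessor = succ, pred    # step 2: splice into ring
    succ.predecessor = pred.successor = new
    for k in [k for k in succ.keys if in_interval(k, pred.id, new.id)]:
        new.keys[k] = succ.keys.pop(k)             # step 3: migrate keys
    # step 4: FingerTables would now be refreshed by Stabilization

n0, n3 = Node(0), Node(3)
n0.successor = n0.predecessor = n3
n3.successor = n3.predecessor = n0
n3.keys = {1: "a", 2: "b"}     # n3 owns the keys in (0, 3]

n1 = Node(1)
join(n1, n0)
assert n0.successor is n1 and n3.predecessor is n1
assert n1.keys == {1: "a"} and n3.keys == {2: "b"}
```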

Leave: The leaving progress of a node in the CHORD network is performed when the node wants to leave the network. The procedure "Leave" allows the node to leave without affecting the data and routing of the network. To do this, firstly, the node notifies its successor about its leaving. Thence, it moves its data, namely its keys and their data, to its successor. The successor receives the data and updates its database. Secondly, the successor moves its predecessor pointer to the node's predecessor, and the node's predecessor moves its successor pointer to the successor. This maintains the link between the node's predecessor and successor without loss. Finally, the node leaves the network physically.

1.4.2.4 Stabilization and Failure

To guarantee the correctness of lookups, a "Stabilization" procedure is used to keep nodes' successor pointers up to date. Each successor pointer is used to verify and correct the FingerTable's entries. This allows the lookup progress to perform fast and correctly. Moreover, the Stabilization scheme guarantees that nodes are added to a CHORD ring in a way that preserves the reachability of existing nodes, even in the face of concurrent joins and lost and reordered messages.

Every node runs Stabilization periodically. When node n runs Stabilization, it asks its successor for the successor's predecessor p. If no node has joined between n and n's successor, the successor returns a pointer which points back to node n. If the returned result points to another node whose ID lies between node n and n's successor, then node n knows a new node has joined, and it moves its successor pointer to the new node. Node n notifies the new node of its existence, and the new node moves its predecessor pointer to node n. Node n also notifies its old successor of the existence of the new node; the old successor moves its predecessor to the new node. Finally, the old successor moves the keys corresponding to the partial key space of the new node, together with the keys' data, to the new node. The maximum number of hops is the same as for the lookup progress, O(logN).

In practice, some nodes of the Chord physical network may fail suddenly. The failure may be caused by users or by links of the network. If a node fails, there are two problems: the node's stored data and FingerTable information will be lost, and its successor's and predecessor's pointers are not updated. This can lead to problems in routing queries, an essential action of a DHT-based protocol such as CHORD. To tackle these problems, Chord uses a fault-tolerant mechanism based on the successor list of each node.

In the Chord protocol, correctly maintaining every node's successor pointer is very important. Hence, each node periodically updates its successor pointer based on the Stabilization procedure. To do that, every node keeps a "successor list" of its nearest successors on the Chord ring, clockwise. The first entry of the list is the node's direct successor. This list is updated periodically, in a pre-defined period of time. At the update time, a node sends a request to its successor for the successor's own successor list. Once it receives the list, the node removes the last entry of the received list and inserts its successor at the top of the list to form its own updated successor list. If the successor does not respond within a pre-defined period of time, the node decides that its successor has failed; it replaces the successor with the first live entry in its successor list. From then on, it also corrects and updates its FingerTable's entries, removes the entry of the list corresponding to the failed successor, and sets its successor pointer to the new successor.
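The successor-list repair described above might be sketched as follows; liveness detection is simulated with a flag, and the class and node IDs are illustrative:

```python
# When the direct successor stops responding, the node promotes the
# first live entry of its successor list.

class Node:
    def __init__(self, node_id, alive=True):
        self.id = node_id
        self.alive = alive

def repair_successor(successor_list):
    """Drop failed entries from the front; the first live node becomes
    the new direct successor (None if the whole list is dead)."""
    while successor_list and not successor_list[0].alive:
        successor_list.pop(0)   # this entry has failed
    return successor_list[0] if successor_list else None

n14, n21, n32 = Node(14), Node(21), Node(32)
succ_list = [n14, n21, n32]     # nearest successors, clockwise

n14.alive = False               # the direct successor fails
new_succ = repair_successor(succ_list)
assert new_succ is n21
assert [n.id for n in succ_list] == [21, 32]
```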

1.5 Summary

In this section, we presented our motivation and an overview of P2P networks. We see that, in the current distributed information environment, constructing information searching systems is necessary. In these systems, information content may be represented by an object's attributes. From that, the systems allow searching information flexibly and easily. Besides, P2P models allow the system to achieve many advantages such as scalability, decentralization and self-organization. With DHT-


Some proposed solutions, which support multiple-attribute information searching with partial queries, may tackle the above problems. A few of them allow searching efficiently, while others support load balancing of the network. These solutions use DHT-based P2P protocols such as CAN, PASTRY and CHORD. In the next sections, we present some proposed solutions for multiple-attribute information searching on Structured P2P networks.


This section presents INS/Twine's solution of information distribution based on attribute-value trees. The subsection describes the components whose interaction allows implementing the proposed solution. Lastly, we present a summary evaluation of the INS/Twine system.

By using the efficient distributed hash table process of protocols such as PASTRY, CAN and CHORD, the INS/Twine system distributes all available resources to all users, independent of location, where a location contains an IP address, application protocol and port number. It transforms each resource description, which includes hierarchies of attribute-value pairs, into a set of numeric keys. Each unique subsequence of attributes and values, called a strand, is extracted to query resources. Twine then computes a hash value for each strand, which creates the numeric keys. Indeed, the goal of INS/Twine is to describe resources and queries in a canonical form: an attribute-value tree (AVTree). It is therefore possible to compare the description of a query to the original description, with zero or more truncated attribute-value pairs.

Figure 2.1 shows an example of a resource description in the INS/Twine system using an AVTree, which represents resources that can be annotated with meta-data descriptions.


Figure 2.1: Meta string and AVTree

In Figure 2.1, the resource r is described as a Meta string, which corresponds to an AVTree. Then, the resource description would match the queries:

q1: <res>camera <man>ACompany</man></res>

q2: <res>camera</res>

By extracting a resource description into subsequences of attributes and values, many queries can match a resource by comparing AVTrees. This supports partial queries, and therefore also approximate queries, instead of complete queries that specify the exact resource descriptions. Furthermore, this model allows more flexible queries by separating string values into several attribute-value pairs. For example, <model>CompaqPresarioCQ40</model> could be divided into <modelw>Compaq</modelw>, <modelw>Presario</modelw> and <modelw>CQ40</modelw>, allowing queries of the type <modelw>CQ40</modelw>.



Figure 2.2: Architecture of INS/Twine System

The INS/Twine system includes the Client application, the Intentional Name Resolver (INR), the Storage/Query Engine and the distributed hash-table process (Chord). The INR contains three layers, namely the Resolver, the StrandMapper and the KeyRouter. Clients and Engines communicate with the Resolver to advertise resources or submit queries. Communication between resolvers takes place in the network core. Figure 2.2 shows the communication of the three main layers with the remaining components. The features of the three layers are described below.

The Resolver is the top-most layer. It interfaces with a local AVTree storage and query engine. The local AVTree storage holds resource descriptions; the query engine implements query processing. Resolvers return sets of name-records corresponding to partial queries. The Resolver splits the advertised resource description into strands and passes each one to the StrandMapper layer.

2.1.3 System architecture

Figure 2.3 shows an example of extracting a resource description into strands. The number of strands is calculated by the formula s = 2 * a * l, where a is the number of attribute-value pairs and l is the level of the AVTree. The number of attributes and values decides the number of strands of the AVTree. Because each attribute and each value can be extracted to a strand, this approach allows easily handling partial queries. Therefore, each possible subsequence of attributes and values maps to a separate numeric key.

Figure 2.3: Splitting a resource description into strands

A common AV pair, which corresponds to a common numeric key, creates a common strand. This is a challenge for load balancing among nodes. INS/Twine tackles this problem by determining a threshold for nodes, which is the maximum number of resources for each key. For common keys whose number of resources exceeds the threshold, no new resource is accepted under that key.

The StrandMapper layer is responsible for associating a numeric key with each strand. The attributes and values of a strand are converted into a single string. Next, it hashes the string to a numeric key by using a hash function. Then, the key and the corresponding resource description are passed to the KeyRouter layer.
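A possible sketch of the StrandMapper's mapping; the canonical string format and the use of SHA-1 are our assumptions for illustration, not INS/Twine's actual choices:

```python
import hashlib

# Flatten a strand (an ordered subsequence of attribute-value pairs)
# into one canonical string, then hash it to a numeric key.

def strand_to_key(strand, bits=32):
    """`strand` is an ordered list of (attribute, value) pairs."""
    canonical = "/".join(f"{a}={v}" for a, v in strand)
    digest = hashlib.sha1(canonical.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)

full = [("res", "camera"), ("man", "ACompany")]
partial = [("res", "camera")]

# The same strand always yields the same key, so the partial query
# <res>camera</res> reaches the node that stores this strand.
assert strand_to_key(partial) == strand_to_key([("res", "camera")])
assert 0 <= strand_to_key(full) < 2 ** 32
```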

The KeyRouter layer is responsible for determining which other resolvers should store information about the resource, or should participate in solving the query, by using the DHT-based P2P network. The Chord protocol is used in the INS/Twine system to guarantee the scalability and efficiency of the system. The returned results of the KeyRouter layer are transferred or queried to/from the nodes of the system.

However, high bandwidth overhead, information loss and load unbalancing are problems of the system that need to be improved. Periodically refreshing resource information at a small interval keeps the information up-to-date, but it requires that nodes send messages to each other periodically, imposing a high bandwidth overhead. To support load balancing, the system also defines a threshold value. This threshold restricts information resources, which leads to information loss. A common strand, created from one or more common attribute-value pairs, is stored only at one node. This node receives more data and queries than the others, causing load unbalancing among nodes.

2.2 CDS: Information Distribution Based On Load Balancing Matrix

Similarly, the Content Discovery System (CDS) solution (Gao, 2004) also supports multiple-attribute information searching based on a DHT P2P network. Information content is hashed to a set of keys by using a hash function. The CDS system is designed around two main functions, namely register and query. The Load Balancing Matrix (LBM) is presented to guarantee load balancing among nodes. This section presents the solution, LBM management and a summary of CDS. The following is the detail of the CDS system.

2.2.1 Solution

Figure 2.4 shows the architecture of a node in a CDS-based system. The architecture includes four layers, namely Application, CDS, DHT-based overlay and TCP/IP. The CDS layer is designed to communicate with the Application layer, such as service discovery and file sharing, and with the DHT-based overlay, such as PASTRY, CAN or CHORD. The CDS layer communicates with these layers by using two basic functions, namely register(content name) and query(location query).

2.2.2 System architecture

Otherwise, the Application layer provides content names and queries to the CDS layer. The CDS layer analyzes content names or queries by translating them into keys. Then, it sends key registration or query messages to the DHT-based overlay layer. The DHT-based overlay determines the nodes responsible for the keys. The TCP/IP layer takes responsibility for transmitting data among nodes. CDS is an important layer in the system: it performs the functions and regulates the activities of the system. The two functions are registering and querying content names; the registration and query progressions are shown below.

2.2.2.1 Registering a content name

A content name is registered with only a small set of nodes in the system. The provider node must first determine the set of nodes that should receive this name. In the CDS system, a content name, provided by the Application layer, is represented as n attribute-value pairs, CN = {a1v1, a2v2, ..., anvn}. A key, which is hashed from an attribute-value pair by a system-wide hash function, is distributed to the node responsible for the key. The CDS system creates a set of keys corresponding to each content name, and then it distributes the keys to nodes in the DHT-based P2P network. The complete content name is sent to each of the n nodes, which results in n replications of the content name. With this scheme, the node which manages an AV pair {aivi} stores all the content names in the system that have aivi in them.
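The registration scheme might be sketched as follows; the ID space, node IDs and hash choice are illustrative assumptions:

```python
import hashlib

# A content name CN = [(a1, v1), ..., (an, vn)] is hashed pair by pair;
# the full name is replicated to each of the responsible nodes.

ID_SPACE = 2 ** 6
NODE_IDS = [5, 17, 29, 44, 58]
STORE = {n: [] for n in NODE_IDS}

def h(text):
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % ID_SPACE

def responsible(key):
    """Successor-style placement: first node ID >= key, wrapping."""
    return next((n for n in NODE_IDS if n >= key), NODE_IDS[0])

def register(content_name):
    """content_name: list of (attribute, value) pairs."""
    owners = set()
    for attr, val in content_name:
        owner = responsible(h(attr + val))   # one key per AV pair
        STORE[owner].append(content_name)    # full name replicated
        owners.add(owner)
    return owners

cn = [("type", "printer"), ("speed", "20ppm")]
owners = register(cn)
assert all(cn in STORE[o] for o in owners)
assert 1 <= len(owners) <= len(cn)   # up to n replicas of the name
```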

2.2.3 System architecture

Figure 2.5 shows an example of the distribution of AVs among nodes in a DHT-based P2P network. Each node manages only a unique AV pair. A content name is stored at a set of nodes, and all content names which contain the same AV pair are stored at one node. In this example, nodes N3 and N4 store CN1; nodes N5 and N6 store CN2; and nodes N1 and N2 store both CN1 and CN2. So, with a query containing {a1v1, a2v2}, the returned result of the CDS system is CN1 and CN2.


2.2.3.1 Query resolution

The query is sent to one of the nodes Ni. The node which receives the query message checks its name database and determines the set of content names that each contain the AVs of the query. The set of matching content names is returned to the Application layer.

2.2.3.2 Summary

With such a scheme of registering and querying, some nodes become overloaded, because these nodes store and process more content names and queries, namely those containing a common AV pair, than the others. This is the cause of load unbalancing in the CDS system. Hence, the CDS system proposes a load balancing mechanism using the Load Balancing Matrix.

2.2.4 Load Balancing Matrix (LBM)

The LBM is constructed to guarantee the load balancing of the system. Each LBM takes responsibility for the content names of a common AV pair. The node responsible for this AV pair is called the head node. The head node determines a set of other nodes to which content names are distributed, based on the LBM. The following describes the structure, construction condition, construction progression and management mechanism of the LBM.

2.2.4.1 The structure of LBM

The LBM is created to manage the content names of a key ki in the system. It includes Pi columns and Ri rows (dimension Pi x Ri). A cell of the LBM corresponds to a node, which takes responsibility for a part of the data of key ki. Each column corresponds to a subset of the content names which contain key ki, called a partition. Nodes in the same column are replicas of each other. This means that the nodes in column j store the same data, namely the content names corresponding to key ki.


2.2.5 System architecture


Figure 2.6: The structure of Load Balancing Matrix for {a1v1}

Figure 2.6 shows the structure of the LBM for {a1v1}. Each node in the LBM is assigned a column and row index (p, r), and the node IDs are determined by applying the hash function H to the AV pair and the column and row indices together:

N(p, r) = H(a1v1, p, r)

With the above formula, the ID of the head node is N(0, 0) = H(a1v1, 0, 0).
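The node-ID formula can be sketched directly; SHA-1 stands in for the system-wide hash H, and the 32-bit key size is our assumption:

```python
import hashlib

# N(p, r) = H(a_i v_i, p, r): one node ID per cell of the matrix,
# all derived from the same AV pair plus the cell's indices.

def lbm_node_id(av_pair, p, r, bits=32):
    text = f"{av_pair}|{p}|{r}"   # AV pair plus matrix indices
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % (2 ** bits)

head = lbm_node_id("a1v1", 0, 0)  # the head node is N(0, 0)
# a 2 x 2 matrix yields a node ID per cell for the same AV pair;
# with overwhelming probability all four IDs are distinct
matrix = {(p, r): lbm_node_id("a1v1", p, r)
          for p in range(2) for r in range(2)}
assert matrix[(0, 0)] == head
assert all(0 <= v < 2 ** 32 for v in matrix.values())
```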


The head node determines the partition number P and the row number R of the LBM by calculating the maximum content names and queries among the probed nodes. Then, a partition between 1 and P is chosen randomly, and the content name is sent to the nodes in that partition.

Query resolution: The query progression is divided into two cases. If the number of AV pairs is one, the query is sent directly to the matrix corresponding to that pair, and a subset of partitions is chosen to query matching results. In the other case, the query is performed in the following steps:

Sending probing messages to the nodes of LBM to get all the size of LBMs

Choosing a LBM with the fewest partitions

Sending the query to all partition of the LBM where cach node is cho«cn randomly

To reduce query cost CDS system performs catching the size of LBM when the node issues many queries

Expanding and shrinking of the LBM: If the existing partitions in the matrix receive a high registration load, the number of partitions may be expanded. The expansion region (ER) is the set of partitions most recently added to the matrix. The CDS system defines threshold values T_s and T_q to manage the storage and query load of partitions in the LBM. If the storage load of the current partitions exceeds T_s, the number of partitions is doubled. The new nodes are informed about joining the matrix, and newly registered content names are placed in one of the partitions of the ER. Expanding and shrinking by rows is processed in the same way as for partitions, based on the query load and the threshold value T_q.
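The doubling rule can be sketched as follows; whether the maximum or the average partition load triggers expansion is an assumption, as is the load bookkeeping.

```python
def maybe_expand(num_partitions, storage_loads, t_s):
    """If any current partition's storage load exceeds the threshold T_s,
    double the partition count; the newly added partitions form the
    expansion region (ER) that absorbs subsequent registrations."""
    if max(storage_loads) > t_s:
        new_count = num_partitions * 2
        er = list(range(num_partitions, new_count))  # indices of new partitions
        return new_count, er
    return num_partitions, []
```

For example, `maybe_expand(4, [10, 50, 20, 5], t_s=40)` doubles the matrix to 8 partitions with ER = {4, 5, 6, 7}; doubling rather than adding one partition at a time is what makes the LBM grow quickly, a cost the summary below criticizes.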



2.2.6 Summary

The basic CDS is a simple and understandable solution. Each content name is distributed to a set of nodes that are responsible for the AV pairs of that content name. This scheme achieves a degree of load balancing through the participation of nodes. Furthermore, determining the node responsible for each AV pair is faster than in the INS/Twine solution, which spends more time extracting strands from an AV tree.

However, some disadvantages remain that should be improved. The CDS solution causes redundant information, both data and queries: multiple nodes store the same set of content names, namely the nodes that belong to the same partition. Sometimes this storage is unnecessary, because many queries only need to process a subset of a partition. The size of the LBM increases rapidly if it is doubled at each expansion. CDS also has to process more queries to obtain matching results; some queries must visit all the nodes in the matrix before receiving results. Hence, processing and management cost and searching time may be problems for the system.

2.3 Data Indexing

Data indexing (Garces-Erice & Ross, 2004) is an underlying DHT-based P2P data storage system that supports multiple-attribute information searching by creating multiple indexes, organized hierarchically, which permit users to locate data even with scarce information, although at the price of a higher lookup cost. The data itself is stored on only one (or a few) of the nodes and discovered based on users' queries.

Information distribution and the searching process do not often go over a limited
