Resource Information Retrieval Using SENS: A Scalable and Expressive Naming System



Hoaison NGUYEN (1), Hiroyuki MORIKAWA (2) and Tomonori AOYAMA (3)

(1) College of Technology, Hanoi University
(2) School of Frontier Sciences, The University of Tokyo
(3) Research Institute for Digital Media and Content, Keio University

We present SENS, a scalable and expressive naming system that retrieves information about computing and content resources distributed widely on the Internet, using exact queries and multi-attribute range queries over resource names. Our system uses a descriptive naming scheme to name resources and a multi-dimensional resource ID space for message routing through the overlay network of name servers (NSs). The resource ID space is constructed on the overlay network using the CAN routing algorithm. We propose a novel mapping scheme between resource names and resource IDs that preserves the locality of resource IDs while still achieving a good degree of load balancing in the distribution of resource information. We also propose a multicast routing algorithm to deliver resource information and a broadcast routing algorithm to route query messages to the corresponding NSs at small cost. Our simulation results show that our system achieves good routing performance and load balancing.

At present, the general approach to providing autonomous information systems such as web services, ubiquitous computing systems [1] or Grid computing systems [2] with resource information is to publish the description of resources to a directory service. MDS-2 provides directory services for Grid computing systems. It uses LDAP [4], a hierarchical directory service, as a uniform interface for accessing and managing information about the status of Grid computing resources. For Web services, Universal Description, Discovery and Integration (UDDI) [3] is used to discover services. A service provider describes its service using WSDL and publishes the description to the UDDI directory service. Service consumers then ask the UDDI directory service for the services they need. However, conventional directory services support only exact queries and do not support richer query functions such as range queries. Since the evolution of the Internet now gives us access to a huge number of ubiquitous computing resources, we consider a scalable and expressive information retrieval service essential for autonomous information systems.

We design a scalable and expressive naming system called SENS to provide a resource information retrieval service based on resource names. Our system uses a descriptive naming scheme that names each resource by a tuple of attribute/value pairs. Resource information is stored at a large number of name servers (NSs). Our system retrieves resource information by exact queries (i.e., queries for information of a resource whose name is the same as the query name) and multi-attribute range queries (i.e., queries for information of resources whose names have attribute values within a query range). It routes query messages to the NSs that are responsible for the queried resource names.

Our challenge is to design a message routing protocol on the overlay network of NSs that achieves scalable and efficient distribution and querying of resource information. Our SENS system constructs a high-dimensional resource ID space (i.e., a DHT key space) on the overlay network of NSs and maps resource names to the resource ID space such that both the locality of resource names in the resource ID space and load balancing are maintained. Message routing for resource information queries is performed based on the locality of resource names. We propose a multicast routing algorithm to deliver resource information and a broadcast routing algorithm to route query messages to the corresponding NSs at small cost. Our simulation results show that our system achieves good routing performance and load balancing.

The rest of this paper is structured as follows. We present the background of our research in Section 2 and the design of our system in Section 3. Section 4 describes our simulation and its results. Section 5 discusses applications of our SENS system, and Section 6 presents conclusions and future work.

2.1 Related works

Several works such as INS [7] and ENS [8] have attempted to build an expressive naming system by routing messages based on resource names. However, this approach is not scalable, since the routing table becomes unacceptably large as the number of resources increases.

Recently, Distributed Hash Tables (DHTs) such as CAN [9] and Chord [10] have attracted a lot of attention because they offer a promising solution for scalable message routing on overlay networks. However, DHTs can realize range queries only with very large overhead, because their consistent hashing function maps the resource names in a range of attribute values to a large number of DHT keys.

To resolve this problem, a locality-preserving hash function is used to map each attribute/numerical-value pair of a resource name to a DHT key [12–14]. Resource information location and query resolution are performed based on the DHT keys mapped from the attribute values of one attribute in the query name. However, this approach does not scale well because the distribution of attribute/value pairs in resource names is often skewed. NSs responsible for popular attribute/value pairs will suffer from a heavy load of resource information registrations and queries.

2.2 CAN routing

We use CAN routing [9] as the message routing algorithm on the overlay network of NSs. The reason for using CAN is that, in addition to the advantages of a DHT routing algorithm, CAN routing can construct a d-dimensional resource ID space on the overlay network of NSs. CAN routing is performed as follows.

The resource ID space is partitioned into hyper-rectangles, called zones. Each NS is responsible for a zone. When a new NS joins the overlay network, it chooses an initial point Pi in the resource ID space and sends a request to the existing NS responsible for the zone within which the point Pi lies. The existing NS splits its zone in half, retaining one half and handing the other half to the new NS.

[Fig. 1 (figure omitted). Recovered labels: routing table, sending node.]

Each NS maintains a coordinate routing table that holds the addresses and zones of its neighbor NSs. Two NSs are neighbors if their zones overlap along d − 1 dimensions and abut along one dimension. Using its coordinate routing table, an NS routes a message towards the NS responsible for the destination ID by simple greedy forwarding to the neighbor NS whose zone is closest to the destination ID (Fig. 1).
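The neighbor rule and greedy forwarding described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the CAN implementation: zones are plain boxes given as (low, high) intervals per dimension, every NS can see the full zone list (a real NS only knows its neighbors), and the wrap-around of the CAN coordinate space is ignored. The function names are hypothetical.

```python
def is_neighbor(za, zb):
    """True if two zones abut along exactly one dimension and overlap along the rest."""
    abut = 0
    for (alo, ahi), (blo, bhi) in zip(za, zb):
        if ahi == blo or bhi == alo:
            abut += 1                      # the zones touch along this dimension
        elif not (alo < bhi and blo < ahi):
            return False                   # neither touching nor overlapping
    return abut == 1

def distance(zone, point):
    """Euclidean distance from a point to the nearest point of a zone."""
    return sum(max(lo - x, 0, x - hi) ** 2 for (lo, hi), x in zip(zone, point)) ** 0.5

def contains(zone, point):
    """True if the point lies inside the (half-open) zone."""
    return all(lo <= x < hi for (lo, hi), x in zip(zone, point))

def route(zones, start, dest):
    """Greedy forwarding: return the list of zone indices visited from start to dest."""
    path = [start]
    while not contains(zones[path[-1]], dest):
        if len(path) > len(zones):
            raise RuntimeError("greedy forwarding did not converge")
        here = zones[path[-1]]
        neighbors = [i for i, z in enumerate(zones) if is_neighbor(here, z)]
        path.append(min(neighbors, key=lambda i: distance(zones[i], dest)))
    return path

# Four equal zones tiling a 4 x 4 two-dimensional ID space.
zones = [((0, 2), (0, 2)), ((2, 4), (0, 2)), ((0, 2), (2, 4)), ((2, 4), (2, 4))]
print(route(zones, 0, (3, 3)))
```

In this toy layout the message needs two hops, because the start zone and the destination zone touch only at a corner and are therefore not neighbors.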

The CAN routing algorithm achieves good routing performance on the overlay network of NSs. As shown in [9], in a d-dimensional resource ID space with N NSs, where d is small, the size of a coordinate routing table is $O(d)$ and the path length between two NSs is $O(N^{1/d})$. In the case of a high-dimensional resource ID space with $d > \log_2 N$, if the space is divided along the dimension determined by a fixed cyclical ordering of the dimensions, the size of a coordinate routing table and the path length between two NSs are both $O(\log N)$ [15].

3.1 Overview

Our system achieves expressiveness in resource naming by using a descriptive naming scheme, which names a resource by a tuple of attribute/value pairs. For example, a computer is named as: (string OS = “linux”, string CPU-name = “Pentium 4”, int CPU-clock (MHz) = 1000, int memory (MB) = 1024, int harddisk-unusedspace (GB) = 20, int network-bandwidth (Mbps) = 1000). An attribute consists of a data type and a name. The data type (e.g., string, integer, Boolean) determines the type of value that the attribute can take. The name of an attribute expresses the semantics of the attribute/value pair. Each kind of resource has a set of attributes used for naming. A resource name may contain dozens of attribute/value pairs.

In the case of an exact query, an NS queries information of the resource whose resource name is the same as the queried name. In the case of a range query, an NS queries information of resources whose names satisfy the query ranges of attribute values. A query range is expressed using the inequality operators (>, <, ≤, ≥) and the conjunction operator (&). For example, our system can realize a range query for computing resources expressed as: (string OS = “linux”, string CPU-name = “Pentium 4”, int CPU-clock (MHz) ≥ 1000 & int CPU-clock (MHz) ≤ 1200, int memory (MB) ≥ 512, int harddisk-unusedspace (GB) ≥ 10, int network-bandwidth (Mbps) ≥ 100). If the value of an attribute in a range query is arbitrary, the wildcard (∗) can be used instead.


SENS distributes resource information to the overlay network of NSs based on resource IDs. SENS builds the resource ID space as a virtual d-dimensional Cartesian coordinate space (i.e., a d-dimensional resource ID space) using the CAN routing algorithm. To limit the number of NSs responsible for a range query, we propose a locality-preserving mapping scheme between a multi-attribute resource name space and a multi-dimensional resource ID space. A resource ID is considered as the set of d coordinates of a point in the d-dimensional resource ID space. A resource name is mapped to a resource ID by mapping each attribute value in the attribute/value pairs of the resource name to a coordinate value of the resource ID, in a dimension determined by the attribute. As a result, our mapping scheme allows all resource names that match a range query to be mapped within a limited segment of the resource ID space (i.e., a range query segment). Furthermore, a data item such as (string audio.input.format = “AVI”, string audio.output.format = “wav”, int network.bandwidth (Mbps) = 10) can be found with a query whose attributes differ from those of the data item in number and order, for example (string audio.input.format = “AVI”, string audio.output.format = “wav”, string video.input.format = *, string video.output.format = *, int network.bandwidth (Mbps) ≥ 5).

Resource information, consisting of a resource name and metadata, is stored at the NSs responsible for its resource IDs. An NS performs a query by mapping a queried resource name or a range of queried names to a queried resource ID or a range query segment in the ID space. It then sends a query message to the NSs that are responsible for the queried resource ID or the range query segment.

In the following subsections, we describe our design in detail, including the mapping scheme between resource names and resource IDs, the construction of the resource ID space, the distribution of resource information and query resolution.

3.2 Mapping resource names to resource IDs

[Fig. 2 (figure omitted). Mapping from a resource name to a resource ID, with a = Ha(attr) and v = Hv(val); coordinates that receive no value hold the default value 0.]

A resource name is mapped to a resource ID by assigning the hash value of each attribute value to a coordinate of the corresponding resource ID. The dimension index of the assigned coordinate is the hash value of the corresponding attribute. Here, a uniform hash function Ha hashes each attribute to an integer from 1 to d, and a hash function Hv hashes each attribute value to an integer from 1 to $2^m - 1$, where m is the maximum size of a coordinate value in bits. If no value is assigned to a coordinate, a default value (e.g., 0) is assigned instead.

[Fig. 3 (figure omitted). Mapping from a resource name with six attribute/value pairs to resource IDs. attri:vali denotes the i-th attribute/value pair in the resource name and ai = Ha(attri). Recovered legend: a2 = 2 and a4 = 6; attr1 and attr5 are hashed to the same value a1 = a5 = 3, and attr3 and attr6 are hashed to the same value a3 = a6 = 5.]

For example, the resource name shown in Fig. 2 is identified by a tuple of four attribute/value pairs: ((attr1: val1), (attr2: val2), (attr3: val3), (attr4: val4)). The resource name is mapped to a resource ID in a 6-dimensional resource ID space. Because the hash value of attr1 is a1 = Ha(attr1) = 3, the hash value of val1 is assigned to the 3rd coordinate value, and so on. Since no value is assigned to the 1st and 4th coordinate values, the default value 0 is assigned to them instead.
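The mapping in this example can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the paper does not specify Ha or Hv, so `attr_dim` and `value_hash` are hypothetical stand-ins built on SHA-1, and the dimension assignments for attr2 and attr3 in the usage lines are assumed values chosen so that coordinates 1 and 4 stay empty as in the text. Collisions between attributes are rejected here; they are handled separately in the paper.

```python
import hashlib

def _h(text):
    """Deterministic 64-bit integer hash of a string (stand-in for a uniform hash)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")

def attr_dim(attr, d):
    """Hypothetical Ha: attribute name -> dimension index in 1..d."""
    return _h("attr:" + attr) % d + 1

def value_hash(val, m):
    """Hypothetical Hv for string values: value -> integer in 1..2^m - 1."""
    return _h("val:" + str(val)) % (2**m - 1) + 1

def name_to_id(name, d, m, dim_of=attr_dim, default=0):
    """Map a list of (attribute, value) pairs to one d-dimensional resource ID."""
    coords = [default] * d
    used = set()
    for attr, val in name:
        k = dim_of(attr, d)
        if k in used:
            raise ValueError("attribute collision in dimension %d" % k)
        used.add(k)
        coords[k - 1] = value_hash(val, m)   # dimensions are numbered from 1
    return tuple(coords)

# Example from the text: a1 = 3; a2 = 2, a3 = 5, a4 = 6 are assumed here.
dims = {"attr1": 3, "attr2": 2, "attr3": 5, "attr4": 6}
name = [("attr1", "val1"), ("attr2", "val2"), ("attr3", "val3"), ("attr4", "val4")]
rid = name_to_id(name, d=6, m=10, dim_of=lambda a, d: dims[a])
print(rid)  # coordinates 1 and 4 hold the default value 0
```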

In the case of a numerical attribute value, a locality-preserving hash function is used as Hv. Here, a locality-preserving hash function is one for which $v_i > v_j$ implies $H_v(v_i) > H_v(v_j)$ [12]. An example of a locality-preserving hash function is

$$H_v(v) = \frac{(v - v_{min}) \cdot (2^m - 1)}{v_{max} - v_{min}},$$

where $v_{max}$ and $v_{min}$ are the maximum and minimum values that the attribute value may take. In the case of an attribute value of string type, a uniform hash function is used as Hv.

All resource names that match a range query are mapped into a segment of the resource ID space bounded by the hash values of the upper and lower limits of the queried value range in each dimension. If a resource name matches a range query, each of its attribute values lies between the upper and lower limits of the value range for the corresponding attribute. Since the attribute values associated with an attribute are mapped to coordinate values in the same dimension, and the mapping between an attribute value and a coordinate value is locality-preserving, each coordinate value of the resource ID lies between the hash values of the upper and lower limits of the queried value range in that dimension.
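As a worked illustration of this argument, the sketch below applies a linear locality-preserving hash to the bounds of a queried range and checks that a matching value lands between the hashed bounds. It is a sketch under assumptions: `lp_hash` and `query_segment` are hypothetical names, m = 10 bits and the 0 to 4000 MHz attribute range are invented for the example, and integer division makes the hash non-decreasing rather than strictly increasing.

```python
def lp_hash(v, vmin, vmax, m):
    """Order-preserving map of v in [vmin, vmax] onto the integers 0 .. 2^m - 1."""
    if not vmin <= v <= vmax:
        raise ValueError("value outside the attribute's range")
    return (v - vmin) * (2**m - 1) // (vmax - vmin)

def query_segment(query, bounds, m):
    """Map {attr: (low, high)} to {attr: (Hv(low), Hv(high))}, one interval per attribute."""
    segment = {}
    for attr, (low, high) in query.items():
        vmin, vmax = bounds[attr]
        segment[attr] = (lp_hash(low, vmin, vmax, m), lp_hash(high, vmin, vmax, m))
    return segment

bounds = {"CPU-clock": (0, 4000)}        # assumed value range of the attribute, in MHz
query = {"CPU-clock": (1000, 1200)}      # 1000 <= CPU-clock <= 1200
print(query_segment(query, bounds, m=10))        # {'CPU-clock': (255, 306)}
print(lp_hash(1100, 0, 4000, 10))                # 281, inside the hashed interval
```

Each attribute's interval would then be placed in the dimension selected by Ha for that attribute, which gives the per-dimension bounds of the range query segment.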

Our mapping scheme is not injective. Several resource names with different attribute/value pairs may be mapped to the same resource ID. However, the resource ID does not need to be unique, since resource information is identified by the resource name, not the resource ID. When looking up a queried resource ID, the NS checks the resource name before returning the lookup result.

If multiple attributes in a resource name are hashed to the same value (e.g., Ha(attri) = Ha(attrj)), the corresponding attribute values will be mapped to multiple coordinate values in the same dimension (i.e., a set of collided coordinate values). In this case, to preserve the locality property, the resource name is mapped to multiple resource IDs, each of which contains one value from each set of collided coordinate values. For example, suppose a resource name is identified by a tuple of six attribute/value pairs as shown in Fig. 3, its attributes attr1 and attr5 are hashed to the same value Ha(attr1) = Ha(attr5) = 3, and its attributes attr3 and attr6 are hashed to the same value Ha(attr3) = Ha(attr6) = 5. Since {v1, v5} and {v3, v6} are two sets of collided coordinate values, the resource name will be mapped to four resource IDs containing (v1, v3), (v1, v6), (v5, v3) and (v5, v6), respectively, i.e., one value from each set. Resource information will be replicated and delivered to the NSs that are responsible for these resource IDs.
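The expansion into multiple resource IDs is a Cartesian product over the per-dimension sets of coordinate values. The following sketch reproduces the six-attribute example; the function name is hypothetical, the hashed values are kept as symbolic strings, and the assignments a2 = 2 and a4 = 6 are taken from the figure legend and should be read as assumptions.

```python
from itertools import product

def name_to_ids(pairs, dim_of, d, default=0):
    """Expand (attribute, hashed value) pairs into every resource ID of the name.

    dim_of maps an attribute to its dimension in 1..d. Dimensions that receive
    several values (collisions) contribute one alternative per value.
    """
    per_dim = [[] for _ in range(d)]
    for attr, v in pairs:
        per_dim[dim_of[attr] - 1].append(v)
    choices = [vals if vals else [default] for vals in per_dim]
    return list(product(*choices))

dim_of = {"attr1": 3, "attr2": 2, "attr3": 5, "attr4": 6, "attr5": 3, "attr6": 5}
pairs = [("attr%d" % i, "v%d" % i) for i in range(1, 7)]
for rid in name_to_ids(pairs, dim_of, d=6):
    print(rid)
```

The number of resource IDs is the product of the sizes of the collided sets (here 2 × 2 = 4), which is why a large number of collisions quickly inflates the number of IDs per name.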

A large number of collided coordinate values forces a large number of resource IDs to be generated. The probability that a collision occurs depends on the number of attribute/value pairs in a resource name. To limit the number of resource IDs per resource name, the number of attribute/value pairs in a resource name should be limited to a reasonable value. If a resource has more attribute/value pairs than this limit, its set of attribute/value pairs should be divided into multiple sets, each of which corresponds to a separate resource name. How resource names are divided is outside the scope of this paper.

3.3 Load balancing

The number of resource IDs distributed to the zone that an NS is responsible for depends not only on the volume of the zone but also on the number of coordinates of the zone that contain the default value. This is because, when a resource name is mapped to a resource ID, the default value is assigned to every coordinate to which no hash value of an attribute value is mapped.

To keep the gap between the numbers of resource IDs stored at different NSs small, for each zone assigned to an NS, the number of coordinates containing the default value and the volume of the zone should be inversely proportional. We realize this requirement simply by randomly assigning the default value to a number of coordinates of each initial point Pi, which is assigned to an NS when it newly joins the overlay network. As a result, a zone whose coordinates contain a large number of default values is split with high probability, and therefore the volume of such a zone will be small.
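Choosing such an initial point can be sketched as below. This is an illustration only: the function name is hypothetical, m = 10 bits is assumed, and the values d = 20 and 12 default coordinates are the settings reported for the simulation later in the paper.

```python
import random

def initial_point(d, m, n_default, rng, default=0):
    """Pick a join point with exactly n_default coordinates set to the default value."""
    default_dims = set(rng.sample(range(d), n_default))
    return tuple(
        default if k in default_dims else rng.randint(1, 2**m - 1)
        for k in range(d)
    )

rng = random.Random(0)   # seeded only to make the example repeatable
p = initial_point(d=20, m=10, n_default=12, rng=rng)
print(p)
```

Because joining NSs land preferentially in regions with many default-valued coordinates, those regions are split more often, which is the effect the paragraph above relies on.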

3.4 Resource information distribution

A resource ID is mapped to a point P of the resource ID space, and information of the resource is delivered to the NS that owns the zone within which the point P lies.

If a resource name is mapped to multiple resource IDs, information of the resource is replicated at the NSs responsible for these resource IDs. If an NS is responsible for several resource IDs of the same resource, only one copy is stored at that NS.

[Fig. 4. Multicast routing protocol for sending information of a resource to NSs responsible for resource IDs (figure omitted). Each forwarded message is labeled with its destination ID, its multicast dimension and the sets of coordinate values of the resource IDs; each NS is labeled with the zone it is responsible for.]

To deliver information of a resource to the multiple NSs responsible for its resource IDs, we propose a multicast routing algorithm based on a spanning binomial tree (SBT) [16]. In our algorithm, only one registration message is sent to each NS responsible for resource IDs of a resource name. Thus, our multicast routing algorithm sends only the minimum number of messages needed to deliver information of a resource to the corresponding NSs. The algorithm is performed as follows.

The resource IDs corresponding to a resource name form a hypercube in the resource ID space. The SBT is constructed on the hypercube (Fig. 4a) and the message is routed from the root node of the SBT to the nodes in lower levels. Suppose that the resource IDs corresponding to a resource name are expressed as $(\{v^1_1, \dots, v^1_{n_1}\}, \dots, \{v^i_1, \dots, v^i_{n_i}\}, \dots, \{v^d_1, \dots, v^d_{n_d}\})$, where $\{v^i_1, \dots, v^i_{n_i}\}$ is the set of coordinate values in the $i$-th dimension ($i \in [1, d]$). The registration message containing information of the resource is first delivered to the NS responsible for the resource ID created from the lowest value of each set (i.e., $(v^1_1, \dots, v^i_1, \dots, v^d_1)$). This NS becomes the root node of the spanning binomial tree. The NS creates new destination resource IDs, which correspond to lower-level nodes of the SBT, and then forwards the message to these destination resource IDs. The NSs receiving the message create further destination resource IDs based on the SBT and relay the message recursively (Fig. 4b).
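One way to read this recursion is sketched below. It is an interpretation, not the authors' exact protocol: each message is assumed to carry the dimension it was forwarded along, and a receiver only creates new destinations by changing coordinates in higher dimensions, which reaches every resource ID exactly once. The sketch works on resource IDs only and does not merge IDs that fall into the same NS zone; the function names are hypothetical.

```python
from itertools import product

def sbt_children(node, dim, value_sets):
    """Destinations created by a node that was reached along dimension `dim`.

    Only dimensions after `dim` are varied, and only to non-lowest values,
    so no resource ID is generated twice.
    """
    for j in range(dim + 1, len(value_sets)):
        for v in value_sets[j][1:]:
            yield node[:j] + (v,) + node[j + 1:], j

def multicast(value_sets):
    """Return the resource IDs in the order they receive the message, root first."""
    value_sets = [sorted(set(s)) for s in value_sets]
    root = tuple(s[0] for s in value_sets)      # lowest value of each set
    delivered = [root]
    frontier = [(root, -1)]                     # the root may vary every dimension
    while frontier:
        node, dim = frontier.pop()
        for child, j in sbt_children(node, dim, value_sets):
            delivered.append(child)
            frontier.append((child, j))
    return delivered

# Coordinate value sets as they appear in the Fig. 4 residue.
sets = [[0, 1, 3], [0, 3], [0, 3, 4]]
ids = multicast(sets)
print(len(ids), "resource IDs reached with", len(ids) - 1, "forwarded messages")
```

With every ID reached once, the number of forwarded messages after the first delivery to the root is one less than the number of resource IDs.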

3.5 Query resolution

In the case of an exact query, the agent NS (i.e., the NS to which the querying host sends the query message) maps the queried resource name to its resource IDs and selects the nearest one as the destination resource ID. The query message, which includes the queried resource name, is sent to the destination resource ID using the CAN routing algorithm. The NS responsible for the resource ID looks up its database to find the queried information and sends the information back to the agent NS.

In the case of a range query, the agent NS maps the query range to a range query segment in the resource ID space. A query message is broadcast to all NSs whose zones overlap the segment. NSs receiving the query message check their databases to find the information of resources whose names match the query.

[Fig. 5. Broadcasting routing algorithm for sending a query message to the range query segment based on spanning binomial tree (figure omitted). a) The spanning polynomial tree created from a range query segment ([1,5],[1,3],[1,3]). b) Broadcast routing to NSs responsible for the range query segment ([1,5],[1,3],[1,3]). Legend: d: broadcast dimension; dr: broadcast direction.]

To reduce the cost of broadcasting, we propose a broadcasting algorithm based on an SBT that broadcasts a query message to all NSs in a range query segment with the minimum number of messages. The number of messages to be sent is approximately the number of NSs in the range query segment. The SBT is constructed on the NSs in the query segment (Fig. 5a). The query message is first sent to an NS in the range query segment by the CAN routing algorithm. This NS then forwards the query message to its neighbor NSs that correspond to lower-level nodes in the SBT. These NSs in turn recursively forward the query message to their neighbor NSs that correspond to lower-level nodes of the SBT (Fig. 5b).
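A simplified version of this broadcast is sketched below. It rests on assumptions that the paper does not make: zones are equal cells of a regular grid (real CAN zones differ in size), each message carries a broadcast dimension and direction, and a receiver continues in the same direction along that dimension and starts both directions along every later dimension. The cell size of 2 and the 3 × 2 × 2 grid match the zones visible in the Fig. 5 residue; all function names are hypothetical.

```python
import math

def cell_ranges(segment, cell_size, grid_shape):
    """Inclusive index range of the grid cells that overlap the segment, per dimension."""
    ranges = []
    for (lo, hi), size, n in zip(segment, cell_size, grid_shape):
        first = max(0, math.floor(lo / size))
        last = min(n - 1, math.ceil(hi / size) - 1)
        ranges.append((first, last))
    return ranges

def broadcast(entry, ranges):
    """Return the (sender, receiver) messages needed to cover every cell in ranges."""
    messages = []
    stack = [(entry, -1, 0)]          # (cell, arrival dimension, arrival direction)
    while stack:
        cell, dim, direction = stack.pop()
        steps = [(dim, direction)] if dim >= 0 else []
        for j in range(dim + 1, len(ranges)):
            steps += [(j, -1), (j, +1)]
        for j, step in steps:
            nxt = cell[:j] + (cell[j] + step,) + cell[j + 1:]
            if ranges[j][0] <= nxt[j] <= ranges[j][1]:
                messages.append((cell, nxt))
                stack.append((nxt, j, step))
    return messages

# Range query segment from Fig. 5 on a 3 x 2 x 2 grid of zones with side 2.
ranges = cell_ranges([(1, 5), (1, 3), (1, 3)], (2, 2, 2), (3, 2, 2))
msgs = broadcast((1, 0, 0), ranges)
print(ranges, len(msgs))
```

In this setting the twelve overlapping zones are covered with eleven forwarded messages, one per zone other than the entry zone, which matches the claim that the message count is about the number of NSs in the segment.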

We evaluate the performance of SENS by simulation from the following aspects.

– Routing performance: the logical hop count required to route a query message to an NS responsible for the query.

– System efficiency: the number of replicas of resource information corresponding to a resource name, and the number of NSs responsible for a query.

– Degree of load balancing: the number of resource information items stored at each NS.

We implemented a simple simulator to evaluate our system. We assume that the number of attribute/value pairs in a resource name varies from 10 to 20. We set the number of dimensions of the resource ID space to 20. The default value is randomly assigned to 12 coordinate values of the initial point assigned to each NS when it newly joins the overlay network.


The resource names are generated based on a Zipf dataset, which reflects the popularity of an attribute/value pair through a parameter called its rank. The probability that an attribute/value pair appears in a resource name is proportional to $1/r^{\alpha}$. Here, r is the rank of the attribute/value pair and α is a constant. We set α to 0.9 and r to a random number between 1 and the total number of attribute/value pairs. Our data set has 400 attributes, each of which can take on 1024 values. Exact queries and range queries are also generated based on the same Zipf dataset.
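The Zipf weighting can be illustrated with a short sketch. It only shows how ranks could be drawn with probability proportional to $1/r^{\alpha}$; how the authors assign ranks to concrete attribute/value pairs, or avoid repeating an attribute within one name, is not stated in the paper, so those steps are left out. The function names and the small rank count in the usage lines are assumptions.

```python
import random

def zipf_weights(n, alpha):
    """Normalized probabilities proportional to 1 / r**alpha for ranks r = 1..n."""
    raw = [1.0 / r**alpha for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_ranks(n, alpha, k, rng):
    """Draw k ranks (with replacement) from the Zipf distribution over 1..n."""
    return rng.choices(range(1, n + 1), weights=zipf_weights(n, alpha), k=k)

weights = zipf_weights(1000, alpha=0.9)     # 1000 ranks keeps the example small
ranks = sample_ranks(1000, 0.9, k=15, rng=random.Random(0))
print(round(weights[0] / weights[9], 3))    # rank 1 versus rank 10, equals 10**0.9
print(ranks)
```

With α = 0.9, the pair of rank 1 is about 7.9 times as likely to be drawn as the pair of rank 10, which produces the skew that the load-balancing evaluation is meant to stress.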

The following subsections present our simulation results.

4.1 Routing performance

[Fig. 6 (figure omitted). a) Path length and average logical hop count in the case of exact queries; x-axis: number of NSs (1000 to 100000); series: average routing hops for an exact query, average path length. b) Average logical hop count for a range query; x-axis: number of value ranges per query; y-axis: hop count; series: size of a value range 20%, 30%.]

We first study how the average path length (i.e., the average logical hop count required to route a message between two NSs) in the overlay network grows as the number of NSs increases. 100,000 messages are sent to random destination resource IDs from a randomly selected NS. The simulation result (Fig. 6a) shows that the average path length grows logarithmically with the number of NSs. In a 100,000-node system, the average path length is about 6.9 hops.

Fig. 6a also shows the logical hop count required to route an exact query message to the NS responsible for the queried resource information. The average logical hop count required to route an exact query message to the corresponding NS (5.0 hops in a 20,000-node system) is smaller than the average path length (5.9 hops in a 20,000-node system). This is because information of a resource may be replicated at a number of NSs, and the nearest resource ID is always selected as the destination resource ID to which query messages are delivered.

We study the logical hop count required for a range query in a 20,000-node SENS system by fixing the size of each value range at 20%, 30% or 40% of the maximum coordinate value and increasing the number of value ranges to be queried. As shown in Fig. 6b, the average logical hop count required to route a range query message to the responsible NSs increases by only 1.3 hops when the number of value ranges to be queried increases from 1 to 16. This indicates that our broadcast routing protocol achieves good routing performance.


4.2 System efficiency

[Fig. 7. Evaluation result of system efficiency (figure omitted). a) Average resource ID number and replication number corresponding to a resource name; x-axis: number of attribute/value pairs per resource name; series: average number of replications per name, average number of resource IDs per name. b) Number of NSs responsible for a range query; x-axis: number of value ranges per query.]

We study the average number of resource IDs and the average number of replicas per resource name in a 20,000-node SENS system. As shown in Fig. 7a, the number of resource IDs per resource name is large and grows exponentially with the number of attribute/value pairs in a resource name. However, because an NS may be responsible for several resource IDs of a resource, the number of replicas per resource name is relatively small. For a resource name with 20 attribute/value pairs, the number of resource IDs per resource name is 94.7 while the number of replicas per resource name is 7.8 on average.

The average number of NSs to be queried in a range query increases exponentially with the number of value ranges per query (Fig. 7b). This is because the volume of the query segment increases exponentially. However, we consider the number of NSs to be queried small enough to be viable. When the number of value ranges is 10 and each value range is 20% of the maximum value, the average number of queried NSs is 2.2, while when the number of value ranges is 12 and each value range is 30% of the maximum value, the average number of queried NSs is 7.1.

4.3 Load balancing

To evaluate the degree of load balancing in the SENS system, we measured the number of resource names stored at each NS after delivering 1,000,000 resource names to a 20,000-node SENS system. The number of attribute/value pairs in a resource name varies from 10 to 20.

Fig. 8a shows the probability with which each attribute/value pair appears in resource names. Fig. 8b shows the ratio of the number of resource names stored at each NS to the total number of names. Our simulation shows that even though several attribute/value pairs appear in resource names with high probability (about 13.96% of the total number of resource names), the maximum number of resource names stored at an NS does not exceed 0.23% of the total number of resource names (Fig. 8b).
