An Improvement Solution for Multiple Attribute Information Searching Based On Structured P2P Networks
Nguyen Thanh Dat
Faculty of Information Technology, Hanoi University of[.]
A thesis submitted in fulfillment of the requirements for the degree of
Master of Information Technology
December, 2009
Table of Contents
1.1 Overview and Motivation
1.2 Communication network models
1.3 P2P network models
1.3.1 P2P Network
1.3.2 P2P Network Models
1.3.2.1 Unstructured P2P Network
1.3.2.2 Hybrid P2P Network
1.3.2.3 Structured P2P Network
1.4 DHT-based Protocol
1.4.1 Distributed Hash Table - DHT
1.4.2 CHORD Protocol
1.4.2.1 Topology
1.4.2.2 Lookup and Insert
1.4.2.3 Join and Leave
1.4.2.4 Stabilization and Failure
1.5 Summary
2 Related Works
2.1 INS/Twine: Information distribution based on Attribute-Value trees
2.1.1 Solution
2.1.2 System architecture
2.1.3 System architecture
2.1.4 Summary
2.2 CDS: Information Distribution Based On Load Balancing Matrix
2.2.1 Solution
2.2.2 System architecture
2.2.2.1 Registering a content name
2.2.3 System architecture
2.2.3.1 Query resolution
2.2.3.2 Summary
2.2.4 Load Balancing Matrix (LBM)
2.2.4.1 The structure of LBM
2.2.5 System architecture
2.2.5.1 LBM management mechanism
2.2.6 Summary
2.3 Data Indexing
2.3.1 Solution
2.3.2 Insert a file
2.3.3 Lookups
2.3.4 Summary
2.4 SMAV: Searching - Multiple-attribute Value
2.4.1 Solution
2.4.2 Distribution of information content
2.4.3 Information content name query
2.4.4 Summary
3 An Improvement Solution for Multiple-attribute Information Searching on Structured P2P Network
3.1 Idea
3.2 Three Levels Mapping Model
3.2.1 Overview
3.2.2 Three-Levels Sub-key Mapping Storing
3.2.3 Distribution of information content
3.2.4 Information query
3.2.5 Summary
3.3 The Dynamic Threshold Values
3.3.1 A formula of threshold values
3.3.2 Adjusted Distribution Algorithm
3.3.3 Updating Threshold Value Periodically
3.3.4 Adjusted Lookup Algorithm
3.4 Summary
4 Simulations and Evaluations
4.1 Qualitative Evaluations
4.2 Simulation Description
4.3 Evaluation Based On Simulations
4.3.1 Load balancing
4.3.2 Distribution of information content
4.3.3 Routing Performance
List of Figures
1.1 Client/Server network model
1.2 Peer-To-Peer network model with 5 peers
1.3 Locating resources in a Gnutella-like P2P environment
1.4 An example of Hybrid P2P Model
1.5 Distribution data progress based on DHT
1.6 Chord's key space with 23 points
1.7 Chord's components
1.8 Lookup progress of Chord's protocol with key 54
1.9 Joining phase of a node in CHORD protocol
2.1 Meta string and AVTree
2.2 Architecture of INS/Twine System
2.3 Splitting a resource description into strands
2.4 The architecture of CDS system
2.5 An example of distribution of AVs in nodes
2.6 The structure of Load Balancing Matrix for {a_i, v_i}
2.7 An example of described data
2.8 Sample File Queries
2.9 Mappings between queries
2.10 Query mapping for three descriptors
2.11 An example of mappings tree
2.12 An example of a path of queries
2.13 Key - sub-key mappings
3.1 The number of hop levels of a content name in pure SMAV
3.2 Mappings are created from key k_i
3.3 The generation of distributed keys from a content name
3.4 Block diagram of query progress of improved SMAV
3.5 An example of query progress with 4 common keys
3.6 Combining of common keys and an uncommon key
4.1 The distribution of AV pairs in content names
4.2 Number of information contents stored in each of 5000 nodes
4.3 The number of queries processed by each of 5000 nodes
4.4 Load balancing among nodes
4.5 Mappings stored in every node
4.6 Mappings created by CNs in (a) DSMAV and (b) SMAV
4.7 The number of keys stored in every node
4.8 Level-k Sub-keys are created by three solutions
4.9 Logical hop count required for each query
4.10 The maximum number of hop level of three solutions
4.11 The number of successful queries
List of Tables
1.1 Comparison: Client/Server vs P2P
1.2 Definition of variables for node n using m-bit identifiers
2.1 Mapping table between distributed key and content names
2.2 Mapping table between distributed key and sub-keys
2.3 Mapping table between distributed key and uncommon keys
3.1 Mapping Table between distributed key and content names
3.2 Mapping table between level-2 sub-keys and distributed keys
3.3 Mapping Table between keys and contents
Abstract

Conventional information search engines such as Google, Yahoo, and Wikipedia support only keyword-based searching of websites. They cannot search information held in other kinds of resources, such as personal devices (laptops, PDAs, cell phones) or files shared in P2P networks.

DHT-based P2P networks such as Chord, CAN, and Pastry can resolve exact queries (i.e. queries for an exact key) with scalability, efficiency, and fault tolerance. However, for complex queries such as range queries or multiple-attribute queries, a pure DHT is not efficient, since many query messages must be sent.

In this thesis, we focus on multiple-attribute queries on DHT-based P2P networks. The main problem is the load imbalance among nodes caused by common attribute/value pairs (AV pairs) that appear in many content names. The main idea of our method is to limit the number of content items assigned to an ID to the threshold value of each node, by creating sub-IDs from multiple AV pairs when those AV pairs appear in many content names. To reduce query cost, our system also keeps the mapping between an ID and its sub-IDs, if any exist, in the node responsible for the ID. Moreover, the mappings created during distribution are stored on only 2 nodes. Our method achieves both efficiency and a good degree of load balancing even when the distribution of AV pairs is skewed. Our simulation results show the efficiency of our solution with respect to lookup time and the degree of load balancing.
1.1 Overview and Motivation

With the unprecedented growth of information technology, information now appears everywhere. It can be found in many kinds of resources, such as personal devices (laptops, PDAs, cell phones), websites on the Internet, and files shared in P2P networks.

With this explosion of information, the demand for information searching keeps growing. Every day we need a great deal of information to communicate and work efficiently. For instance, we search for the weather forecast before a trip or a picnic. We also search for the latest news of the day, references for a product to buy, market prices, etc. In many cases, obtaining the desired information quickly and exactly gives us more opportunities to succeed in communication and work. Information searching is therefore a necessity in today's information age. The emergence of new applications and services will require an efficient information searching system that can realize complex queries on content names in a scalable manner (W. Adjie-Winoto & Lilley, 1999; Carzaniga & Wolf, 2001; Foster & Tuecke, 2002).

Many large systems support information searching, such as the conventional search engines and sites Google, Yahoo, Amazon, eBay, and Wikipedia. The Google engine lets users search the Internet by keywords and can link to billions of websites. The information of each website is described by keywords, which are then processed and stored on Google's servers.

Conventional search systems often use the Client/Server model, where servers provide searching services to clients. However, the Client/Server model has some disadvantages. First, its scalability is limited: servers are expensive because they need very large processing and storage capacity. Second, each server may be a single point of failure: when a server goes down, operations cease. Moreover, as the number of simultaneous client requests to a given server increases, the server can become overloaded. When a large number of clients join the network, traffic congestion also becomes an issue.
Recently, the Peer-to-Peer (P2P) network model has attracted wide interest. P2P networks, with their decentralized control, self-organization, and adaptation, have emerged as a significant social and technical phenomenon in recent years. Unlike the Client/Server model, P2P networks aim to aggregate large numbers of computers that join and leave the network frequently. In pure P2P systems, individual computers communicate directly with each other and share information and resources without using dedicated servers. For example, they provide infrastructure for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., Napster (Idit Keidar, 2006; Napster, 1999), FreeNet, Gnutella (Gnutella, 1999)), or that support collaborative environments (Groove).

In P2P networks, all clients provide resources, including bandwidth, storage space, and computing power. As more nodes join the system, its total capacity increases. This is not true of the Client/Server network model with a fixed set of servers, in which adding more clients can mean slower data transfer for all users. The distributed nature of P2P networks also increases robustness in case of failures, by replicating data over multiple peers and by enabling peers to find the data without relying on a centralized index server. In the latter case, there is no single point of failure in the system.
Information searching on P2P networks has received attention in recent years. The advantages of the P2P network model allow us to construct information searching systems that are scalable and fault-tolerant, because the data of the whole system are distributed over all nodes; each node is responsible for a portion of the data and takes part in the search process. The Gnutella network (Gnutella, 1999) supports sharing and searching files. It searches for data by flooding messages to the whole network. Nevertheless, the Gnutella network incurs high overhead, and a search may fail because a query may not be routed to the node responsible for the desired information. Searching is therefore inefficient. The eDonkey network (Weikum, 2002) is suited to sharing and searching big files such as video files, full music albums, and software, based on some nodes (called indexing servers) that allow users to locate files within the network. However, maintaining indexing servers creates a "central point of failure". If the indexing servers cease to operate, a query may not be sent to the nodes whose data are indexed on these servers. In general, scalability, efficiency, and fault tolerance in message routing and information query on P2P networks are essential problems that must be resolved.
The adoption of Distributed Hash Tables (DHT) such as CAN (S. Ratnasamy [...]) is a promising solution for routing problems in P2P networks. A DHT-based network constructs a virtual key space where each node is responsible for a portion of the key space, and keys are used as destinations to route messages. In the case of Chord, a message can be routed to the node responsible for a destination key in O(log N) hops, while each node only needs to maintain O(log N) links to other nodes, where N is the total number of nodes in the network. Some typical DHT-based networks are BitTorrent, the Storm botnet, the Kad network, and YaCy. With the above organization, DHT-based networks face a problem: they support only exact queries. In a DHT-based network, an information content is represented by a key derived from its description. Users who want to search for this information content need to send the complete query to the network. In fact, if users remember only a portion of the description of an information content, they are not able to search for it. For complex queries such as range queries or multiple-attribute queries, a pure DHT is not efficient, since a lot of query messages must be sent.
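The notion of a node being "responsible for a portion of the key space" can be illustrated with a minimal sketch. It assumes a Chord-like ring in which a key belongs to the first node whose identifier is equal to or follows the key; the identifier width `M`, the helper names, and the node labels are hypothetical choices for illustration. The sketch uses a global view of all node IDs, so it shows only the key-to-node assignment, not the O(log N) hop-by-hop routing.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier width in bits (assumed value, for illustration only)

def key_id(name, m=M):
    """Hash a string to an m-bit identifier on the ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def responsible_node(key, node_ids):
    """Return the first node ID at or after `key`, wrapping around the ring."""
    ring = sorted(node_ids)
    index = bisect_left(ring, key)
    return ring[index % len(ring)]

# Hypothetical node names and content description.
nodes = [key_id(f"node-{i}") for i in range(8)]
owner = responsible_node(key_id("title=example"), nodes)
```

An exact query hashes the complete description to the same key and therefore reaches the same node; a partial description hashes to an unrelated key, which is why a pure DHT resolves only exact queries.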
In this thesis, we focus on DHT-based P2P networks and the realization of a multiple-attribute searching system based on DHT. Simple DHT-based distribution of information contents can realize exact queries (i.e. querying information contents that have the same content name as the name in the query) efficiently. However, it is not efficient for complex queries such as multiple-attribute queries, because the number of information contents that match a complex query is often large, and therefore a lot of query messages must be sent to resolve a query. To realize multiple-attribute queries efficiently on a DHT-based P2P network, a framework built above the DHT is necessary.
Several papers (Gao, 2004; Y. Arakawa et al., 2005; M. Balazinska, 2002; Garces-Erice & Ross, 2004) have proposed solutions for the above problems. In these solutions, content names are represented by attribute and value (AV) pairs. In (Gao, 2004; Y. Arakawa et al., 2005), each AV pair of a content name is hashed to a key. In (M. Balazinska, 2002; Garces-Erice & Ross, 2004), a content name forms an attribute/value tree and each strand of an AV tree is mapped to a key. The information content is then distributed to each node responsible for the key, based on the DHT routing algorithm. An information query is realized by looking up the nodes that are responsible for the keys created from the AV pairs of the query. However, distributing information content based on each AV pair or each strand of an AV tree causes load imbalance, since there may be common AV pairs that appear in content names with high probability. There are several solutions for this problem, such as using multiple nodes to be responsible for a key (Gao, 2004) or limiting the number of content items corresponding to a key in a node (Y. Arakawa et al., 2005). However, these approaches do not strike a good balance in the tradeoff between load balancing and query efficiency.
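The load imbalance caused by common AV pairs can be made concrete with a small sketch. It assumes the simplest scheme described above, in which every AV pair of a content name is hashed to its own key; the function names and the sample content names are hypothetical, and the full SHA-1 digest is used as the key only to keep the example free of collisions.

```python
import hashlib
from collections import Counter

def av_key(attribute, value):
    """Hash one attribute/value pair to a key (full SHA-1 digest as an integer)."""
    text = f"{attribute}={value}"
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")

def load_per_key(content_names):
    """Count how many content names are registered under each AV-pair key."""
    load = Counter()
    for name in content_names:
        for attribute, value in name.items():
            load[av_key(attribute, value)] += 1
    return load

# Hypothetical content names: "type=mp3" is a common AV pair.
names = [
    {"type": "mp3", "artist": "A"},
    {"type": "mp3", "artist": "B"},
    {"type": "mp3", "artist": "C"},
    {"type": "pdf", "author": "D"},
]
load = load_per_key(names)
```

In this toy data set the key for `type=mp3` collects three content names while every other key collects one, so the node responsible for that key carries most of the storage and query load.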
In this thesis, we propose a system called DSMAV (using Dynamic threshold values for Searching Multiple Attribute/Value) to distribute and query information content based on AV pairs on a DHT-based network. The main idea of our system is to limit the number of information contents distributed by a distributed key to a reasonable threshold value. This is done by creating sub-keys from multiple AV pairs in a content name if these AV pairs already appear in a lot of content names. If the total number of information contents of a sub-key does not exceed a threshold value, the sub-key is used as a distributed key to distribute its information content. To reduce query cost, our system also keeps the mappings between sub-keys created from two common AV pairs and distributed keys created from more than two common AV pairs. Our system satisfies the following three requirements:
• Efficient Distribution: A content name is distributed to only a few nodes. The mappings and keys are distributed to the nodes in reasonable numbers. Mappings between level-2 sub-keys and level-m sub-keys (m > 2) are stored in the nodes that take responsibility for the level-2 sub-keys.

• Efficient Searching: If a query name contains an uncommon AV pair, it is resolved by querying only the node responsible for the key created from the uncommon AV pair. If a query name contains only common AV pairs, query messages are sent quickly to the nodes responsible for the primary key and the level-2 sub-key. The node responsible for the level-2 sub-key also uses the mappings between the level-2 sub-key and other sub-keys to route query messages to the other nodes that store the queried contents.

• Load balancing: The number of information contents corresponding to a distributed key does not exceed a threshold value. Hence, even if the distribution of AV pairs in content names is skewed, the number of information contents stored in each node is kept balanced naturally.
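The threshold idea behind these requirements can be sketched in a few lines. This is a deliberately simplified illustration, not the DSMAV algorithm itself: it assumes a single in-memory store instead of DHT nodes, a fixed threshold instead of dynamic ones, and it omits the mapping tables between sub-keys. The names `make_key` and `distribute` are hypothetical.

```python
import hashlib
from itertools import combinations

def make_key(av_pairs):
    """Hash a combination of (attribute, value) pairs into one key."""
    text = "&".join(f"{a}={v}" for a, v in sorted(av_pairs))
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def distribute(store, content_name, threshold):
    """Register a content name under keys whose buckets are below the threshold.

    Single AV pairs are tried first; only if every single-pair key is full
    (all pairs are "common") are sub-keys built from 2, 3, ... pairs.
    Returns the level used and the AV-pair combinations that received it.
    """
    pairs = sorted(content_name)
    for size in range(1, len(pairs) + 1):
        placed = []
        for combo in combinations(pairs, size):
            bucket = store.setdefault(make_key(combo), [])
            if len(bucket) < threshold:
                bucket.append(content_name)
                placed.append(combo)
        if placed:
            return size, placed
    # Every candidate key is full: fall back to the full-name key,
    # which in this sketch may then exceed the threshold.
    store.setdefault(make_key(pairs), []).append(content_name)
    return len(pairs), [tuple(pairs)]

store = {}
a = (("artist", "A"), ("lang", "en"), ("type", "mp3"))
b = (("lang", "en"), ("type", "mp3"))
level_a, keys_a = distribute(store, a, threshold=1)
level_b, keys_b = distribute(store, b, threshold=1)
```

With a threshold of 1, the first content name fills all three single-pair keys, so the second one, whose AV pairs are now all common, is stored under a level-2 sub-key built from both of its pairs.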
Our solution uses DHT-based protocols to route messages. The keys, which are created from one or more AV pairs in the system, are routed by DHT-based protocols such as CHORD. The information contents, each of which includes a set of attribute-value pairs, are distributed to the nodes effectively. An information content that contains common AV pairs is distributed to a set of nodes by using keys hashed from these common AV pairs. Similarly, a query process hashes one or more AV pairs of a query name into keys, which are sent to the nodes that take responsibility for those keys. The returned result is a list of information contents that match the AV pairs of the query name.

Our system can achieve both efficiency and a good degree of load balance even when the distribution of AV pairs in content names is skewed. Our simulation shows the effectiveness of our proposed system.
Our thesis is organized in 5 chapters. In Chapter One, we introduce an overview of P2P networks and P2P network models. In Chapter Two, we describe conventional solutions for multiple-attribute information searching and our related works. Then, in Chapter Three, we propose a solution for multiple-attribute information searching based on dynamic threshold values on Structured P2P Networks. Based on the proposed solution, in Chapter Four we implement simulations and evaluate the solution qualitatively and quantitatively. Finally, in Chapter Five, we summarize the work carried out and discuss directions for future development.
1.2 Communication network models

With the rising demand for searching information everywhere, more and more people want to share, search, or download data from network environments. Today, network environments allow users to create and share information resources together. Besides, the Internet, which contains frequently updated information, lets users search for information whenever and wherever they need. Users can easily search for the latest news, information about a book, or reviews of business products, or download resources from the Internet. To do this, some information searching systems allow users to search data from everywhere by using ubiquitous computing models.
Ubiquitous computing is a post-desktop model of human-computer interaction in which information processing has been thoroughly integrated into everyday objects and activities. In the course of ordinary activities, someone "using" ubiquitous computing engages many computational devices and systems simultaneously, and may not necessarily even be aware that they are doing so. For this reason, the ubiquitous computing model is used in many information searching systems.

In these systems, the term "Distributed information" describes many information resources that can connect together over a communication network environment. They include devices that store information, such as PDAs, mobile phones, and laptops. Their data may be pieces of information, such as the information of a laptop or a book, or data files such as PDF, MP3, and video files. To let users search the information flexibly, the systems use the characteristics of an object to represent information. In other words, each object may be described by multiple attributes based on its characteristics. For instance, a laptop can be described by attributes such as manufacturer name, CPU, RAM, and HDD. The more attributes represent an object, the more easily it can be found. For this reason, an information searching system will store a lot of data created from the objects' representations. It hence needs to analyze and organize the data. Choosing a reasonable communication network model and implementing a good protocol would increase the performance and effectiveness of the system.
There are two typical communication network models that a multiple-attribute information searching system may use, namely Client/Server and P2P networks. The Client/Server network model (Figure 1.1) is widely used. This model is a distributed application architecture that partitions tasks or workloads between service providers (servers) and service requesters, called clients. Clients and servers often operate over a computer network on separate hardware. A server machine is a high-performance host that runs one or more server programs, which share its resources with clients. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers, which await incoming requests.
Figure 1.1: Client/Server network model

With such an architecture, the Client/Server network model has some advantages and disadvantages. In most cases, it allows the roles and responsibilities of a computing system to be distributed among several independent computers that are known to each other only through a network. This creates an additional advantage of this architecture: ease of maintenance. For example, it is possible to replace, repair, upgrade, or even relocate a server while its clients remain both unaware of and unaffected by that change. All data are stored on the servers, which generally have better security controls than most clients. Servers can control access and resources, to guarantee that only those clients with the appropriate permissions may access and change data. Since data storage is centralized, updates to the data are easy to administer. The model functions with multiple different clients of different capabilities. However, the Client/Server model may not be reasonable for constructing distributed information searching systems because of its disadvantages. Traffic congestion on the network has been an issue since the inception of the Client/Server network model. The server may become overloaded if it processes a lot of requests from a large number of simultaneous clients. It may be hard to achieve scalability with a single point of failure. If the server is damaged, clients' requests cannot be fulfilled and the system will cease. Moreover, server operation requires administrators who are knowledgeable experts in networks and systems. This increases the cost of management and operation.
In contrast to the Client/Server network model, in the P2P network model the aggregated bandwidth actually increases as nodes are added, since the P2P network's overall bandwidth can be roughly computed as the sum of the bandwidths of every node in that network. It is very suitable for a distributed network architecture composed of participants that contribute a portion of their resources (disk storage, network bandwidth, or processing cycles). Peers are both suppliers and consumers of resources, in contrast to the traditional Client-Server model where only servers supply and clients consume. The organization of a P2P network allows it to overcome the problems of the Client/Server network. Table 1.1 shows a comparison of the Client/Server and P2P models. In the next section, we describe the P2P network architecture with its advantages and disadvantages.

Table 1.1: Comparison: Client/Server vs P2P

Client/Server | P2P
Global knowledge: servers have a global [...] | Local knowledge: nodes only know a small [...]
Centralization: communications and management are centralized | Decentralization: no global knowledge, only local interactions
Single point of failure: a server failure brings down the system | Robustness: several nodes may fail with little or no impact
Limited scalability: servers easily overloaded | High scalability: high aggregate capacity, load distribution
Expensive: server storage and bandwidth [...] | Low-cost: storage bandwidth is contributed [...]
1.3 P2P network models

1.3.1 P2P Network

In contrast to client-server networks, where network information is stored on a centralized file server PC and made available to tens, hundreds, or thousands of client PCs, the information stored across P2P networks is uniquely decentralized. A P2P network model allows many PCs to pool their resources together. Individual resources like desktops, laptops, PDAs, and storage devices are transformed into shared, collective resources that are accessible from every PC.

Because P2P PCs have their own hard disk drives that are accessible by all computers, each PC acts as both a client (information requestor) and a server (information provider). In Figure 1.2, five P2P workstations are shown. All five computers can communicate directly with each other and share one another's resources.
Figure 1.2: Peer-To-Peer network model with 5 peers

A P2P system is a distributed collection of peer nodes. Each node may provide services to other peers and consume services from other peers. The data of each node may be a portion of the common data of the system; that is, the common data of the system are distributed over all available nodes. Each node stores a portion of the common data and the related services. Therefore, the load of storing and processing data is also divided into parts that correspond to the peer nodes. Each node stores and processes only a small number of data items, so if some nodes fail, the network system is hardly affected.

As a completely decentralized model, it allows the development of applications with high availability, fault tolerance, and scalability, such as content sharing (file sharing, content delivery networks, Gnutella, eMule, Akamai), storage sharing (distributed file systems), and CPU time sharing (parallel computing, Grid).

There are three models of P2P network: Unstructured, Hybrid, and Structured P2P networks. Each model has its own advantages and disadvantages.

1.3.2 P2P Network Models

1.3.2.1 Unstructured P2P Network

An unstructured P2P network is an unregulated overlay in which participating nodes connect to each other randomly and arbitrarily, act as equals, and merge the roles of client and server. An unstructured P2P network has no central server managing the network, nor a central router. A new node joins through a well-known node called a bootstrap node. The bootstrap node is flexible in the new node's neighbor selection and routing mechanisms. The unstructured organization has a profound impact on the efficiency of search. The typical unstructured P2P network system is Gnutella (Gnutella, 1999) (Figure 1.3).

Figure 1.3: Locating resources in a Gnutella-like P2P environment
Gnutella is a file-sharing system built on the unstructured P2P model. The peer nodes of the system are organized randomly in an unstructured overlay. The data of the system consist of files such as MP3, image, and video files, which each peer node may share with other nodes. A new node that wants to join the system performs a "Join" operation.

First, it contacts a bootstrap node. The bootstrap node sends back a list of existing nodes, chosen randomly. The new node then stores this list and treats the listed nodes as its neighbors. From the neighbor nodes, the new node can reach other nodes in the communication network. The new node continues to get more addresses of other existing nodes from the nodes it can reach, and so on. Finally, the new node becomes a peer node of the system. Figure 1.3 shows an example of locating resources in a Gnutella-like P2P environment.
[...] query may not be found (Figure 1.3). Popular data are usually found easily because they are stored by many nodes; there are enough copies of popular data to locate a searched content by flooding the network. Flooding allows the search to be performed easily and reliably in a highly connected overlay.

However, flooding has some disadvantages. First, it creates a lot of duplicate query messages, sent to a given node from its many neighbors. Second, it is difficult to determine the appropriate Time-to-Live (TTL), which controls the flooding progress. A high TTL achieves high search reliability but requires high overhead. The characteristics of the overlay affect the effectiveness of flooding versus its overhead. In addition, a limitation of scale-free topologies is the high load on a very small number of hub nodes. Peers are not willing to maintain high loads, as they may not want to store a large number of entries for the construction of the overlay topology.
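The tradeoff between TTL, reliability, and overhead can be illustrated with a minimal sketch of TTL-limited flooding. It is a simplification under stated assumptions rather than the Gnutella protocol: the overlay is an in-memory adjacency map, a node forwards a query to all of its neighbors (including the one it came from), and every forwarded copy is counted as one message. The function and variable names are hypothetical.

```python
from collections import deque

def flood_search(graph, holders, start, ttl):
    """Flood a query from `start`; return (nodes found holding the data, messages sent)."""
    visited = {start}
    found = set()
    messages = 0
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if node in holders:
            found.add(node)
        if remaining == 0:
            continue  # TTL exhausted: the query is not forwarded further
        for neighbor in graph[node]:
            messages += 1  # duplicates sent to already-visited nodes also cost a message
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return found, messages

# Hypothetical overlay: a chain A - B - C - D, where only D holds the file.
overlay = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
short = flood_search(overlay, {"D"}, "A", ttl=2)
long = flood_search(overlay, {"D"}, "A", ttl=3)
```

With a TTL of 2 the query stops at C and fails after 3 messages; with a TTL of 3 it reaches D at the cost of 5 messages, which mirrors the reliability versus overhead tradeoff described above.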
1.3.2.2 Hybrid P2P Network

In the unstructured P2P network model, queries might not always be resolved. Although popular data might be stored at several peers, if a peer node looks up data shared by only a few other peers, it is highly likely that the search will not be successful, because there is no correlation between a peer and the data it manages. There is no guarantee that flooding will find a peer that has the desired data. Flooding also causes a high amount of signaling traffic in the whole network, and hence such networks have very poor search efficiency.

The structure of the hybrid P2P network model can tackle the problems of routing and lookup in the unstructured P2P network. In this model, peer nodes are divided into two types, data nodes and hub nodes. A data node stores real data and the information of a hub node. A hub node acts as a server: it does not store real data, only indexes to files in the network. To join the system, each node has to contact a server and provide the real data that it wants to share with others. The server then creates indexes for these files and stores the indexes in its database.

Figure 1.4 shows an example of the Hybrid P2P Model. Servers are entrusted with routing tasks. Each server stores indexes to files and the information of data nodes and other servers. Each data node stores data files and the information of a server. Data files are distributed among the participating data nodes. However, to look up data, each node needs to use the routing information that is stored centrally in the servers.
Figure 1.4: An example of Hybrid P2P Model

Figure 1.4 also shows an example of looking up a file in the Hybrid P2P Model. The searching node sends a query message to its server. The server looks up the files' locations in the list of indexes in its database. It also sends query requests to other servers simultaneously. Those servers search for indexes corresponding to the desired files and return the files' locations to the searching node. Finally, the searching node contacts the data nodes that contain the desired files and downloads the real files directly. In this example, node A connects to node B and then downloads data files from node B directly after getting location information from the servers.

The eDonkey network, a decentralized network, is a typical Hybrid P2P model. It is best suited to sharing big files among users, such as video files, full music albums, and software. There is no central hub for the network. Data files are distributed among peer nodes. eDonkey servers act as communication hubs for the clients, allowing users to locate files within the network.
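The server-mediated lookup described for Figure 1.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the eDonkey protocol: index servers are plain in-memory objects, a server queries its peer servers sequentially rather than simultaneously, and the class, method, and file names are hypothetical.

```python
class IndexServer:
    """A hub node: stores only indexes (file name -> data nodes), never the files."""

    def __init__(self):
        self.index = {}   # file name -> set of data node names
        self.peers = []   # other index servers this server can ask

    def register(self, node, files):
        """Called when a data node joins and announces the files it shares."""
        for name in files:
            self.index.setdefault(name, set()).add(node)

    def local_lookup(self, name):
        return set(self.index.get(name, ()))

    def lookup(self, name):
        """Search the local index, then ask the peer servers for their indexes."""
        locations = self.local_lookup(name)
        for server in self.peers:
            locations |= server.local_lookup(name)
        return locations

# Node A asks its own server; node B is registered on another server.
server_1, server_2 = IndexServer(), IndexServer()
server_1.peers.append(server_2)
server_2.register("B", ["album.mp3"])
locations = server_1.lookup("album.mp3")
```

After the lookup returns the location of node B, the searching node would download the file from B directly. If the second server were unavailable, the file would still exist on B but could no longer be located, which is the weakness of keeping the indexes on servers.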
Like the Unstructured P2P Network model, the Hybrid P2P network is easy to implement because it does not need distribution and routing algorithms. Furthermore, because servers undertake the lookup of data file locations, the Hybrid model does not use query flooding or random searching as the Unstructured P2P model does, so it is more efficient. However, maintaining indexing servers creates a "central point of failure". If a server stops providing service, the data whose indexes are held in that server's database can no longer be found in the system. Therefore, the whole system would cease to operate if all servers stop.
The distributed hashing method is based on a Distributed Hash Table (DHT). Each node can determine its position and the range of data it manages from its ID. In the same way, it can determine the position of queried data. As a result, the searching and routing processes are performed efficiently.
The Structured P2P network is a scalable, efficient, completely decentralized, self-organizing and load-balanced model. Each node stores information about O(log N) neighbors. The routing algorithm allows it to look up any key within O(log N) hops. Because the routing algorithm is based on a globally consistent hash function, the set of keys is distributed evenly over the key space. This allows the system to achieve load balancing among nodes. With the characteristics of a pure P2P network, each node is self-organizing and highly available.
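To make the O(log N) bound concrete, the following sketch counts routing hops under simplifying assumptions that are not part of the text above: the identifier ring is fully populated with N = 2^m nodes, and each node knows the m neighbors at clockwise offsets 1, 2, 4, ..., 2^(m-1). The function name `hops` is hypothetical.

```python
def hops(source, target, m):
    """Count greedy routing hops on a fully populated ring of 2**m node IDs."""
    size = 2 ** m
    distance = (target - source) % size      # clockwise distance still to cover
    count = 0
    while distance > 0:
        # Jump to the farthest known neighbor (offset 2**i) that does not overshoot.
        step = 1 << (distance.bit_length() - 1)
        distance -= step
        count += 1
    return count


m = 10                                        # N = 1024 nodes, 10 neighbors each
worst = max(hops(0, target, m) for target in range(2 ** m))
print(worst)                                  # never exceeds m = log2(N)
```

Each hop removes the highest set bit of the remaining distance, so at most m hops are needed. Real protocols reach a similar bound on sparsely populated rings, but with more involved neighbor tables than this sketch uses.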
Consequently, because the Structured P2P Network has many advantages, it is attractive for distributed information searching systems. Several proposed systems, such as INS/Twine (M. Balazinska, 2002), CDS (Gao, 2004) and Data Indexing (Garcés-Erice & Ross, 2004), use Structured P2P Network protocols. Typical protocols include CAN, PASTRY and CHORD. In the next section, we present the Distributed Hash Table and CHORD, a typical protocol implemented on the Structured P2P Network model.
1.4 DHT-based Protocol
1.4.1 Distributed Hash Table - DHT
A Distributed Hash Table (DHT) is used to construct decentralized distributed network systems. It provides a lookup service similar to a hash table. A DHT contains key-value pairs, and each participating node can efficiently retrieve the value associated with a given key. Each node is responsible for storing and maintaining a portion of the key-value pairs, which are mappings between a key and a value. Each key is managed by a participating node. Similarly, each node manages a set of key-value pairs, including key-value pairs sent from other nodes. As more nodes join the system, the amount of data stored in each participating node decreases, so each node carries less load and its performance can increase. This allows a DHT to scale to extremely large numbers of nodes.
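The scaling argument can be checked with a small experiment. The sketch below is an illustration under assumptions not stated in the text: identifiers are produced with SHA-1, and a key is stored on the first node clockwise from it on the identifier ring. The names `ring_id` and `keys_per_node` are hypothetical.

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def ring_id(text):
    """Hash a string to a 160-bit ring identifier (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def keys_per_node(num_nodes, num_keys):
    """Return the number of keys held by each node, including idle nodes."""
    node_ids = sorted(ring_id(f"node-{i}") for i in range(num_nodes))
    load = Counter({node_id: 0 for node_id in node_ids})
    for k in range(num_keys):
        pos = bisect_left(node_ids, ring_id(f"key-{k}"))
        load[node_ids[pos % len(node_ids)]] += 1   # wrap past the last node
    return list(load.values())


for n in (10, 100):
    loads = keys_per_node(n, 10_000)
    # Columns: number of nodes, average keys per node, busiest node's keys.
    print(n, sum(loads) / len(loads), max(loads))
```

The average load is the number of keys divided by the number of nodes, so it shrinks as nodes are added. The maximum printed next to it shows how far an individual node may deviate from that average.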
Figure 1.5: Data distribution process based on DHT
Figure 1.5 shows how data is distributed in a DHT-based P2P network. Each data item is hashed into a key by using a consistent hash function. The key is then distributed to the peer node that is responsible for it.
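The distribution step of Figure 1.5 can be sketched as a toy in-memory model. Everything named here (`ToyDHT`, `ring_id`, the node names) is hypothetical. The use of SHA-1 and the rule that the responsible node is the first node clockwise from the key are assumptions made for illustration, since the text only says that some node is responsible for each key.

```python
import hashlib
from bisect import bisect_left


def ring_id(text):
    """Hash a string to a 160-bit identifier (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class ToyDHT:
    """Hypothetical in-memory model: each node is a dict stored by its ring ID."""

    def __init__(self, node_names):
        self.stores = {ring_id(name): {} for name in node_names}
        self.node_ids = sorted(self.stores)

    def responsible_node(self, key):
        # The first node clockwise from the key, wrapping around the ring.
        pos = bisect_left(self.node_ids, key)
        return self.node_ids[pos % len(self.node_ids)]

    def put(self, data_name, value):
        key = ring_id(data_name)                 # hash the data item into a key
        self.stores[self.responsible_node(key)][key] = value

    def get(self, data_name):
        key = ring_id(data_name)                 # the same hash locates the data
        return self.stores[self.responsible_node(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("report.pdf", "held by peer 10.0.0.7")
print(dht.get("report.pdf"))
```

Because `put` and `get` apply the same hash function, any node can compute where a key lives without asking a central server. A real network would reach the responsible node through routing, which this sketch replaces with a direct lookup in a sorted list.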
A "Join" procedure is constructed to support a new node joining a DHT-based P2P network. The new node is assigned an ID by using a globally consistent hash function. Then, a common routing process of the procedure lets the new node learn its position in the overlay network. Next, the new node receives the list of neighbors and the data corresponding to the part of the key space that it manages. Its neighbors also receive information about the node and store it in their databases. Finally, the new node is recognized as a full-fledged member of the system. From then on, it takes part in the routing and data querying processes with other nodes.
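The join steps can be sketched in a toy model that keeps the whole ring in one sorted list. The class `Overlay`, the SHA-1 based `ring_id` and the clockwise-successor ownership rule are assumptions made for illustration. Neighbor lists are implicit in the sorted list here, so the step in which neighbors record the new node is not modeled separately.

```python
import hashlib
from bisect import bisect_left, insort


def ring_id(text):
    """Hash a string to a 160-bit identifier (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class Overlay:
    """Hypothetical single-process model of a DHT overlay."""

    def __init__(self):
        self.node_ids = []    # sorted ring positions of all member nodes
        self.stores = {}      # node ID -> {key: value}

    def successor(self, ident):
        # First node clockwise from ident, wrapping around the ring.
        pos = bisect_left(self.node_ids, ident)
        return self.node_ids[pos % len(self.node_ids)]

    def put(self, name, value):
        key = ring_id(name)
        self.stores[self.successor(key)][key] = value

    def get(self, name):
        key = ring_id(name)
        return self.stores[self.successor(key)].get(key)

    def join(self, node_name):
        new_id = ring_id(node_name)               # 1. assign an ID by hashing
        if not self.node_ids:                     # the first node owns everything
            self.node_ids.append(new_id)
            self.stores[new_id] = {}
            return new_id
        old_owner = self.successor(new_id)        # 2. locate the ring position
        insort(self.node_ids, new_id)
        self.stores[new_id] = {}
        # 3. hand over the keys that now fall in the new node's range
        for key in list(self.stores[old_owner]):
            if self.successor(key) == new_id:
                self.stores[new_id][key] = self.stores[old_owner].pop(key)
        return new_id                             # 4. the node is a full member


net = Overlay()
net.join("node-a")
for i in range(50):
    net.put(f"file-{i}", i)
net.join("node-b")
net.join("node-c")
print(all(net.get(f"file-{i}") == i for i in range(50)))
```

Only the node that previously owned the new node's range gives up keys, so a join disturbs one neighbor rather than the whole network.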