PEERSTORE: A PEER-TO-PEER BACKUP SYSTEM


Name: Zhang Han

Degree: Master of Science

Dept: Computer Science

Thesis Title: PeerStore: A Peer-to-Peer Backup System

Keywords: Peer-to-Peer, Backup, Distributed Hash Table, metadata

Abstract

This thesis focuses on designing a Peer-to-Peer backup system suited to large, unstable networks. Peer-to-Peer backup, the implementation of a data backup service on top of a Peer-to-Peer network, has received research attention in recent years. This thesis concentrates on the use of Peer-to-Peer backup in the Internet, with a large number of anonymous users. The thesis offers both a general analysis of the requirements and issues of Peer-to-Peer backup, and a new design for such a system. Existing systems are introduced and their suitability for the Internet is evaluated, before we present our own novel Peer-to-Peer backup scheme, PeerStore. PeerStore offers better performance by separating the tasks of data placement and metadata management; its improvements are shown by experiments run in real networks.


SCHOOL OF COMPUTING NATIONAL UNIVERSITY OF SINGAPORE

2004


I would like to express my sincere gratitude to my supervisor, Prof. Tan Kian-Lee, for his guidance and patience. His advice, insights and comments have helped me tremendously throughout my master's year. Working under Prof. Tan has been a great experience, and he has greatly enriched my experience as a researcher.

I am particularly grateful to Martin A. Landers, my project partner from Munich, who worked with me on the PeerStore project for half a year. PeerStore and this thesis would have been impossible without his help and effort.

At the same time, I would like to thank my parents for their support and encouragement throughout my years of study. They have guided me both in study and in life; I hope that what I have done and what I will do make them proud of me.

Last but not least, I am deeply indebted to my girlfriend in China. She helped me through the entire year of my Master's study despite the long distance between us. I must thank her for giving me care when I needed it most.


This thesis studies various issues related to Peer-to-Peer backup, which is a new service based on a typical Peer-to-Peer network. We study various systems and techniques proposed in recent years in the Peer-to-Peer backup area and propose our own novel scheme: PeerStore. We implement an existing system as well as our own proposed system, both to be run in a real network.

Peer-to-Peer backup, the implementation of a data backup on top of a Peer-to-Peer network, offers interesting possibilities for both corporate users and private users. In recent research, the corporate scenario has been well studied and several schemes have been proposed that have proven to be well suited to it. However, for private users, especially anonymous users connected to large unstable networks such as the Internet, these schemes may not be applicable or may incur high maintenance cost. The main target of this thesis is to design a new system that serves the users of these large networks in doing backup, while at the same time imposing certain mechanisms to address security concerns.

A detailed analysis is carried out to explore different requirements and issues related to both the underlying Peer-to-Peer network and the top-level backup semantics. The backup part of the system requires a high degree of flexibility, while for the Peer-to-Peer part, invisibility should be the main concern. Some challenging tasks are to deal with limited system resources, to support fairness in order to avoid free-riding, and to take care of peer heterogeneity.

A number of recent research works on Peer-to-Peer backup have proposed various techniques and approaches. The thesis describes all of them and discusses the suitability of the three most important systems: pStore, the Cooperative Internet Backup Scheme and Pastiche, because they represent the three most typical systems in Peer-to-Peer backup.

Based on all these, PeerStore, our own system, is proposed in order to fulfill our original design goal: a Peer-to-Peer backup system suited to large unstable networks. We also implement both pStore and PeerStore in Java. The experiments, run on a real network consisting of 50 PCs, show great improvements in reducing maintenance cost and better support for fairness and heterogeneity.

To this end, we believe that our contribution has addressed some important Peer-to-Peer backup issues in large unstable networks. PeerStore has been shown to handle these issues better than the existing systems, and the discussion of its design also gives future research directions in Peer-to-Peer backup. The implemented PeerStore can be further extended to incorporate more functionality and address more concerns.


Contents

1.1 Peer-to-Peer Backup
1.2 Contributions
1.3 Thesis Organization

2 Issues in Peer-to-Peer Backup
2.1 Backup Requirements
2.2 Peer-to-Peer Requirements
2.3 Resource Constraints
2.3.1 Storage Space Constraints
2.3.2 Bandwidth Constraints
2.3.3 Dealing with Duplicated Data
2.4 Peer-to-Peer Issues
2.4.1 Dealing with Unreliable Peers
2.4.2 Dealing with Free-Riders
2.4.3 Dealing with Malicious Peers
2.4.4 Dealing with Peer Heterogeneity
2.4.5 Ensuring Availability in Unstable Network
2.5 Backup Organization
2.5.1 Data Storage and Retrieval
2.5.2 Metadata Management
2.5.3 Ensuring Confidentiality and Integrity

3 Related Work
3.1 Introduction to Existing Systems
3.1.1 pStore
3.1.2 A Cooperative Internet Backup Scheme
3.1.3 Pastiche
3.1.4 Samsara - Fairness for Pastiche
3.1.5 Other Systems
3.2 Analysis of Existing Systems
3.2.1 pStore
3.2.2 Cooperative Internet Backup Scheme
3.2.3 Pastiche with Samsara

4.1 Overview
4.2 Backup
4.3 Restore
4.4 Metadata Layer
4.5 Data Distribution
4.5.1 Finding Trade Partners
4.5.2 Imbalance in Trades
4.5.3 The Trading Process
4.6 Fairness
4.6.1 Safekeeping
4.6.2 Punishment Model
4.7 Short-term Availability vs Long-term Availability

5 Experimental Results
5.1 Simulation to prove Dominance of Data Migration
5.1.1 Simulation Setup
5.1.2 Results
5.2 Simulation to compare Performance of pStore and PeerStore
5.3 Evaluation

A Implementation of pStore and PeerStore
A.1 Development Platform
A.2 Overview

List of Figures

3.1 Overview of pStore
3.2 Block Creation Process in pStore
3.3 Block Sharing in pStore
3.4 Block Matching in pStore
3.5 Adding Version to File in pStore
3.6 Virtual Disk using Erasure Code
3.7 Buddy Searching Process in Pastiche
3.8 Robustness of Anchors during File Modification
3.9 Generation and Forwarding of Storage Claims in Samsara
3.10 Data Migration in Distributed Hash Table
4.1 PeerStore Overview
4.2 Overview of Backup Process in PeerStore
4.3 Information Generated from a File during Backup
4.4 Internal Structure of Block Metadata in PeerStore
4.5 Trading between peers in PeerStore
4.6 Messages in Creating a Trade
4.7 Challenging an Arbitrary Number of Blocks using Hash Chain
5.1 Cumulative Distribution Functions of the exponential distribution used for determining session and downtime durations for the simulation
5.2 Comparison of routing-maintenance and data-migration costs in pStore
5.3 Growth of data-migration cost with the amount of data to be backed up in DHT
5.4 Comparison of pStore and PeerStore Maintenance Traffic
5.5 Trade Ratio vs Number of Successful Backup Nodes
5.6 Trade Ratio vs Number of Successful Backup Nodes
A.1 Sequence Diagram of Backup in PeerStore

List of Tables

5.1 Comparison of Existing Peer-to-Peer Backup Systems with PeerStore

... in a collaborative fashion, which is expected to be cheap and efficient. Since the Internet connections of users all over the world are getting faster and faster, all these computers can form a large-scale storage network for backup.

1.1 Peer-to-Peer Backup

All the PCs, formally called "peers", must be able to interact with each other to form a network that shares and offers resources: a Peer-to-Peer network. Such networks, which turn thousands or even millions of small computers into servents (fulfilling both the client and the server roles at the same time), have become extremely popular since their first commercial appearance as "Napster". The Peer-to-Peer paradigm has attracted much interest from both the research and the software development communities. Peer-to-Peer applications have already changed, and will further change, the nature of interaction on the Internet, which mostly follows the client-server model nowadays.

Most early Peer-to-Peer systems focused mainly on file sharing. In recent years, many new systems have aimed at other services such as instant messaging, web caching, censorship, etc. The focus of this thesis is Peer-to-Peer backup, which enables users to back up their data on top of a Peer-to-Peer network in a collaborative fashion. An investigation of recent research on Peer-to-Peer backup shows that most of it has been devoted to long-term archival and publishing systems rather than real backup systems.

In this thesis, we first examine and investigate various issues in Peer-to-Peer backup, defining requirements, addressing important concerns and comparing existing approaches. Based on these requirements and studies, we propose our own Peer-to-Peer backup system, PeerStore, with an improved scheme intended to provide a better solution than the existing Peer-to-Peer backup systems. Security issues related to anonymous users connected to the Internet are also addressed at various points.

1.2 Contributions

In this thesis, I have made the following contributions:

• Investigate various Peer-to-Peer backup issues, including both issues related to the underlying Peer-to-Peer network and those concerning the upper-level backup semantics.

• Examine and compare existing approaches to Peer-to-Peer backup, analyzing their advantages and disadvantages.

• Propose a two-layer Peer-to-Peer backup system, PeerStore, which decouples data placement from metadata management so as to relax the strict constraints that the Distributed Hash Table imposes on data placement.

• Implement two Peer-to-Peer backup systems from scratch in Java. The first system implements the existing approach to Peer-to-Peer backup, and the second implements the newly proposed two-layer system. Both are meant for experiments in real networks, performing actual backup and restore operations so that their performance can be compared.

• Evaluate the proposed backup system by running experiments in real networks. The results show improved performance in reducing maintenance cost and in supporting fairness and peer heterogeneity.

The work on the newly proposed system, PeerStore, has led to the following publication:

M. Landers, H. Zhang, K. L. Tan. PeerStore: Better Performance by Relaxing in Peer-to-Peer Backup. Proceedings of the 4th IEEE International Conference on Peer-to-Peer Computing, Zurich, Switzerland, August 2004.

1.3 Thesis Organization

The thesis consists of 6 chapters. Chapter 2 discusses various issues related to Peer-to-Peer backup, including important requirements for Peer-to-Peer backup and other concerns in Peer-to-Peer backup systems. Chapter 3 gives a detailed explanation of the existing approaches to Peer-to-Peer backup, compares the pros and cons of these systems and tries to gain useful ideas from this analysis. Chapter 4 presents the proposed design and some general implementation issues of the new system, PeerStore. PeerStore offers improved performance in unstable large-scale networks by relaxing the strict constraints that the Distributed Hash Table puts on data placement. It accomplishes this by decoupling the tasks of data storage and metadata management. Chapter 5 presents the experimental results obtained by running PeerStore over real networks, to demonstrate the improvements of the newly proposed system. Chapter 6 concludes and points out further research directions in Peer-to-Peer backup.

Chapter 2

Issues in Peer-to-Peer Backup

This chapter discusses various requirements and issues in a Peer-to-Peer backup system. All issues discussed here are important to an ideal Peer-to-Peer backup system; however, some of them trade off against each other. Therefore, the target system should try to address all these issues and compromise between them.

2.1 Backup Requirements

To the user, backing up data means creating redundant copies of important files to guard against losing them.

Backup and Restore of Individual Files

The system should be file-based, which means the user can select an arbitrary number of files to back up. Any subset of the backed-up files can be restored whenever the user issues the command.

Versioning

It is possible that a good version of a file is overwritten by a corrupted new version. To tackle this problem, access to a reasonably small number of recent versions of a file in the backup store is required.

Backup of File Metadata

The metadata of a file includes information such as the name of the file, its location in the file system, access rights, modification time, etc. Metadata can play a vital role in organizing backup files and in restoration.

2.2 Peer-to-Peer Requirements

Peer-to-Peer Metadata Management

Metadata is important as it captures the information on how and where backup data is stored in the Peer-to-Peer network. Therefore, the metadata must also be replicated in the Peer-to-Peer network, just like the actual ...

2.3 Resource Constraints

Three resource constraint problems will be discussed here: storage space, bandwidth, and duplicate data. These are the major factors affecting the design of the Peer-to-Peer backup system.

2.3.1 Storage Space Constraints

Even though the capacities of modern hard disks have outgrown individual storage needs, disk space remains a limited resource, especially in scenarios such as backup, which needs several copies of every piece of data.

In a Peer-to-Peer backup system, peers have to contribute several times their own storage needs to make the system work. The major challenge is how to motivate users to contribute this amount of space and how to ensure a fair distribution among all the peers. Thus, the Peer-to-Peer backup system must maximize the available storage space and at the same time minimize the backup size. The system has to give users incentives, or force them, to contribute at least as much storage as they use. To minimize the backup size, the overhead of replication and metadata management needs to be kept low, and the system needs a duplicate check and removal mechanism that reduces the actual data stored in the system, so that the available space is used wisely. To increase the amount of available storage, the system needs to attract as many peers as possible, which means it needs to be flexible enough to accommodate different backup needs. This may introduce the problem of peer heterogeneity, which is also one of the major issues in designing the system.

2.3.2 Bandwidth Constraints

Bandwidth is another limited resource in a Peer-to-Peer backup system. Today's Internet connections vary in bandwidth by several orders of magnitude, ranging from kilobits per second to fast connections such as cable modem or digital subscriber line with several hundred megabits per second.

The direct implication is that the system needs to keep the communication overhead small so as to use the bandwidth effectively. There is a tradeoff between the frequency of network communication and network bandwidth usage: infrequent communication may lead to inaccurate information about data locations, while short update intervals between peers consume too much network bandwidth. An indirect implication is that, because of the bandwidth limitation of each user, the amount of data a peer can back up and store is a function of its bandwidth.
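To make the dependence on bandwidth concrete, the following sketch estimates how long a replicated backup takes to upload. The function name, the replication factor and the link speeds are illustrative assumptions and are not figures from this thesis.

```python
def upload_hours(data_bytes: int, replicas: int, uplink_bits_per_s: float) -> float:
    """Hours needed to push `replicas` copies of the data over one uplink."""
    if uplink_bits_per_s <= 0:
        raise ValueError("uplink speed must be positive")
    total_bits = data_bytes * 8 * replicas
    return total_bits / uplink_bits_per_s / 3600


# Hypothetical example: 1 GB of data stored as 3 replicas.
ONE_GB = 10**9
slow_link = upload_hours(ONE_GB, 3, 128_000)      # 128 kbit/s uplink
fast_link = upload_hours(ONE_GB, 3, 10_000_000)   # 10 Mbit/s uplink
```

Under these assumed numbers the slow link needs roughly two days while the fast link needs well under an hour, which is one way heterogeneity in bandwidth translates into heterogeneity in feasible backup size.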

In conclusion, bandwidth is another factor that needs to be taken into account when the system tries to deal with heterogeneity.

2.3.3 Dealing with Duplicated Data

Studies on file system usage [27, 34, 2] have shown that a user's daily work affects only a very small number of files and does not significantly alter the overall file system statistics. If duplicates among all the users were eliminated, 47% of disk space could be reclaimed. Duplicate removal can therefore have a large impact on the usage of system resources, possibly saving a large amount of storage space as well as network bandwidth. The observations from these studies introduce the concept of the "Britney Spears effect", referring to the popularity of songs and movies in file-sharing networks. In fact, the concentration on popular resources is the main motivation behind Peer-to-Peer systems and is what makes them work.

The implications of these studies for the design of the system depend on the degree of overlap. The benefit gained from duplicate removal will be small if the degree of overlap between users is small. However, if the overlap is high (for example, when many people back up the Windows installation disk), enforcing a duplicate check and removal mechanism will save a large amount of storage by limiting the number of stored copies of common files or even file portions. Of course, security issues arise, since the shared common data files will be readable by multiple peers.

All the system needs is a mechanism to check the entire network for the presence of a certain file or data block. If the result shows that the file to be backed up already has sufficient copies in the system, the backup request for this particular file is ignored. This also emphasizes the importance of the Distributed Hash Table (DHT) in the Peer-to-Peer system, as it provides an efficient way of querying the entire network for certain data blocks.
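The duplicate check described above can be sketched as follows. This is a minimal illustration that uses an in-memory dictionary as a stand-in for a real DHT; the class `ToyDHT`, the functions `block_id` and `backup_block`, and the default of three wanted copies are assumptions made for the example only.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a DHT: maps a block ID to the peers holding it."""

    def __init__(self):
        self._holders = {}

    def copies(self, bid: str) -> int:
        return len(self._holders.get(bid, set()))

    def store(self, bid: str, peer: str) -> None:
        self._holders.setdefault(bid, set()).add(peer)


def block_id(data: bytes) -> str:
    """Content-derived identifier, so identical data maps to the same ID."""
    return hashlib.sha256(data).hexdigest()


def backup_block(dht: ToyDHT, data: bytes, peers, wanted_copies: int = 3) -> bool:
    """Store the block unless the network already holds enough copies.

    Returns False when the request is ignored as a duplicate.
    """
    bid = block_id(data)
    if dht.copies(bid) >= wanted_copies:
        return False
    for peer in peers:
        if dht.copies(bid) >= wanted_copies:
            break
        dht.store(bid, peer)
    return True
```

A second request for the same data, from any peer, finds the copies already present and is ignored, which is the storage and bandwidth saving the paragraph refers to.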

2.4 Peer-to-Peer Issues

Because the backup system is based on a Peer-to-Peer network, issues related to Peer-to-Peer networks have to be taken into account when designing the system.

2.4.1 Dealing with Unreliable Peers

In this thesis, the focus is on two common categories of unreliable peers: unreachable peers and faulty peers. Unreachable peers are peers that become disconnected from the network for reasons such as an unplugged cable or a machine shutdown. They cause the network to lose resources. Faulty peers are those that deviate from the assumed protocols because of software or system failure, without malicious intent.

The main approach to dealing with unreliable peers, as proposed by most Peer-to-Peer systems, is redundancy. By storing multiple copies of each piece of data, data that is missing on one peer remains available within the system. For faulty peers, protocols with explicit versions can help to keep the software of all peers up to date.
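The effect of redundancy can be illustrated with a small calculation. The sketch below assumes that peers are online independently of each other with the same probability, which is a simplification introduced here and not a model taken from this thesis; the function names are likewise hypothetical.

```python
def availability(p_online: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is reachable."""
    if not 0.0 <= p_online <= 1.0 or replicas < 0:
        raise ValueError("invalid probability or replica count")
    return 1.0 - (1.0 - p_online) ** replicas


def replicas_needed(p_online: float, target: float) -> int:
    """Smallest replica count whose availability reaches `target`."""
    if not 0.0 < p_online <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_online must be in (0, 1] and target in (0, 1)")
    count = 1
    while availability(p_online, count) < target:
        count += 1
    return count
```

For instance, under this model peers that are online half of the time need seven replicas to reach 99% availability, which shows why unstable networks make replication expensive.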

2.4.2 Dealing with Free-Riders

Free-riding peers are those that want to maximize their profit from the system at the cost of others. In their investigation of free-riding in Gnutella [1], Adar and Huberman pointed out that the number of free-riders can easily outgrow the number of honest peers. In the Peer-to-Peer backup scenario, free-riding means storing one's own data in the system without contributing accordingly, or even without giving any resources at all.

Free-riding is a common topic in Peer-to-Peer research. Most solutions proposed in other research works can be divided into trading solutions and rating solutions. In trading solutions, peers either exchange resources directly, so that each covers the other's consumption of system resources, or use some kind of intermediate medium ("money") to complete the trade. In rating solutions, each peer is rated according to its behavior in the system and is rewarded or punished according to its rating. One subtle problem that needs to be taken care of is the Sybil attack [12], where peers that have been rated as "bad citizens" of the network rejoin it using a new identifier.

2.4.3 Dealing with Malicious Peers

Malicious peers are those whose sole intention is to disrupt the service and bring down the system. Even though this category makes up only a small fraction of the peers, they are often highly skilled and motivated, which makes them more harmful.

The security of the system depends on two factors: the individual security of each peer and the security of the interaction between peers. To ensure the former, peers need to be careful with any input from the network; to secure the latter, peers should not trust other peers easily and should challenge the relevant peers regularly.

2.4.4 Dealing with Peer Heterogeneity

For the sake of mathematical modelling, or simply for simplicity, peers are normally treated equally in a Peer-to-Peer system, especially in systems that use a Distributed Hash Table (DHT). In reality, however, peers differ in hard disk space, network connection speed, processor frequency, etc. They can thus have all kinds of backup requests, ranging from a few kilobytes to several gigabytes. Peer heterogeneity generally requires a high degree of freedom in the system. However, there is always a trade-off between autonomy and efficiency: if the network is too unstructured in order to provide a high degree of autonomy, searching in the network as well as broadcasting queries becomes very inefficient. Therefore, seeking a compromise solution is the main focus of the proposed system.

2.4.5 Ensuring Availability in Unstable Network

In the real world, peers are not up and running all the time. They can be disconnected from the network at various times because of system crashes, being blocked by a firewall, etc., causing the data stored on these peers to become unavailable. In order to capture this property of the peer population over the entire network, the author of [4] introduced the concepts of short-term availability and long-term availability. Normally, ensuring short-term availability of every piece of data incurs a high maintenance cost, while in a system with long-term availability, data is rarely lost but might become temporarily unavailable. Both short-term and long-term availability require a maintenance mechanism to ensure that data can eventually be retrieved. If files or data blocks have been lost, they must be replicated again to restore redundancy; if metadata is corrupted or outdated, it must be updated. In networks with a high churn rate (the rate at which the network population changes), ensuring short-term availability requires aggressive maintenance, frequently re-creating and moving data when peers join and leave the network. In a system with long-term availability, a peer must monitor the uptime intervals of its partners and challenge them regularly. In a Peer-to-Peer backup system, long-term availability can be sufficient, but the drawback of this approach is that restoring the backup data can take a long time, as some portions may be temporarily missing.

2.5 Backup Organization

Since this is a backup system, issues related to building backup semantics on top of a Peer-to-Peer network are discussed here: data storage and retrieval, metadata management, and confidentiality.

2.5.1 Data Storage and Retrieval

Most Peer-to-Peer backup systems favor dealing with data in small, equal-sized blocks. This approach avoids the problem of fragmentation and makes it possible to treat files of different sizes in the same manner. Storage locations can also be found easily, especially in a structured network where searching and broadcasting can be done efficiently.
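Splitting data into equal-sized blocks can be sketched in a few lines. The function name and the 4096-byte default below are illustrative assumptions; the thesis does not prescribe a block size at this point.

```python
def split_into_blocks(data: bytes, block_size: int = 4096) -> list:
    """Cut data into blocks of block_size bytes; only the last may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A 10,000-byte file yields two full blocks and one shorter final block.
blocks = split_into_blocks(b"a" * 10_000)
```

Because every file is reduced to the same kind of unit, the storage layer can place, replicate and look up blocks without knowing anything about the files they came from.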

2.5.2 Metadata Management

In a Peer-to-Peer backup system, the task of the metadata is to keep track of the location of backup data stored in the Peer-to-Peer network. We expect the maintenance of the metadata to become a distributed responsibility in a Peer-to-Peer system.

There are two major concerns with metadata management: the size of the metadata and its consistency. When the metadata becomes substantially large, handling it in a distributed fashion reduces the burden of handling it on a single PC. However, the nature of a distributed system introduces another problem: metadata consistency. All peers in the network need ...

Two concerns with metadata consistency need to be addressed here: the information stored in the metadata must be consistent with the actual location of the data, and the replicated copies of the same metadata must be identical on all storing peers. The key to ensuring consistency is to make it the responsibility of the updating host to update all the copies of the metadata in the network.

2.5.3 Ensuring Confidentiality and Integrity

Since all backup data in the Peer-to-Peer network is transferred to other hosts, the main problem is whether those hosts can read or even modify the data stored on them although they should not have permission to do so. Encryption is one solution to this problem, so that only the owner node can read and modify the data. One popular scheme that has been used by quite a number of recent Peer-to-Peer systems is convergent encryption, which creates identical ciphertext for identical data blocks while letting each owner maintain a distinct key.


Chapter 3

Related Work

This chapter is divided into two sections. The first section introduces four actual Peer-to-Peer backup systems in detail and gives a short overview of other related work; the second section evaluates these systems in terms of their suitability for large unstable networks such as the Internet.

3.1 Introduction to Existing Systems

3.1.1 pStore

Introduction

The pStore system [3] separates backup from the underlying Peer-to-Peer functionality. pStore completely surrenders all Peer-to-Peer tasks to Chord [32], a distributed hash table (DHT) implementation, and concentrates on implementing backup semantics on top of it. When creating a new backup, pStore splits each individual file of the backup set into a number of equal-sized blocks (except the last block, which may be smaller) and a metadata record, which are then stored inside the hash table (Figure 3.1). The mapping of data blocks to nodes inside the Peer-to-Peer overlay is managed by Chord.

Figure 3.1: Overview of pStore

Before each file block (FB) is inserted into the network, it is encrypted using convergent encryption [5]. In this scheme each block is encrypted with a symmetric cipher using a cryptographic hash value of the block, H1, as the key. The encrypted block (EB) is run through the hash function once more to obtain the hash value H2, which serves as the identifier (ID) of that block. After that, the tuple (H2, EB) is inserted into the distributed hash table (DHT). Both H1 and H2, together with general information on the block, are added to the metadata descriptor (called the file block list, FBL). Finally, the FBL is inserted into the hash table as well. Since it contains the keys for all file blocks, it is encrypted using the symmetric key of the user, which can be derived from a password. The block creation process is illustrated in Figure 3.2.
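The block creation steps above can be sketched as follows. This is an illustration under stated assumptions and not pStore's actual code: SHA-256 stands in for the unspecified hash function, and `toy_cipher` is a deliberately simple XOR keystream used only so that the example runs with the standard library. It is not a vetted cipher and must not be used to protect real data.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream built from SHA-256 in counter mode (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(file_block: bytes):
    """Return (H1, H2, EB) for one file block."""
    h1 = hashlib.sha256(file_block).digest()  # H1: hash of the plaintext, used as key
    eb = toy_cipher(h1, file_block)           # EB: encrypted block
    h2 = hashlib.sha256(eb).digest()          # H2: hash of EB, used as block ID
    return h1, h2, eb


def convergent_decrypt(h1: bytes, eb: bytes) -> bytes:
    """Recover the file block from EB using the key H1 kept in the FBL."""
    return toy_cipher(h1, eb)
```

Because the key is derived from the content alone, two peers that encrypt the same block obtain the same EB and the same H2, which is what makes block sharing across users possible.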

Data Sharing

Figure 3.2: Block Creation Process in pStore

By employing convergent encryption, pStore ensures that all peers that back up the same file produce the same set of encrypted blocks. This guarantees that all storage requests for identical blocks are routed to the same host in the DHT. If a host receives a duplicate request, asking it to store a block that is already present in its block store, it silently ignores the request and returns a "success" response. The scheme ensures that only a single copy of each block is stored in the system and that common blocks are shared. Each host keeps an owner tag list (OTL) with each stored block, recording all the peers that have sent a storage request for this particular block. Figure 3.3 illustrates block sharing in pStore.
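The handling of duplicate storage requests on a single host can be sketched as below. The class `BlockStore` and its method names are hypothetical and chosen for this example; only the behavior (store once, answer "success", remember every requester in the owner tag list) follows the description above.

```python
class BlockStore:
    """Block store of one host, keeping an owner tag list (OTL) per block."""

    def __init__(self):
        self._blocks = {}  # block ID -> encrypted block
        self._owners = {}  # block ID -> set of owner tags (the OTL)

    def store(self, block_id: str, encrypted_block: bytes, owner_tag: str) -> str:
        # A duplicate block is not stored again, but the request still succeeds.
        if block_id not in self._blocks:
            self._blocks[block_id] = encrypted_block
        self._owners.setdefault(block_id, set()).add(owner_tag)
        return "success"

    def owners(self, block_id: str) -> set:
        """Copy of the owner tag list for a block (empty if unknown)."""
        return set(self._owners.get(block_id, set()))

    def block_count(self) -> int:
        return len(self._blocks)
```

The owner tag list is what lets the host tell which peers still depend on a shared block, even though the block itself is held only once.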

To further decrease the storage requirements, pStore incorporates an incremental backup scheme. Exploiting the observation that files in consecutive backup snapshots often show only minor differences, pStore tries to match previously stored file blocks with the new version of a file using a modified rsync algorithm [33].


As shown in Figure 3.5, the new list contains references to matching old blocks as well as a small number of newly created blocks. Under this scheme, frequent backups of a file with small changes should demand only moderate resources.

Figure 3.5: Adding Version to File in pStore

By employing these techniques, significant savings on the global storage requirements can be achieved. The authors of pStore claim a reduction of between 3 and 60 percent in total storage requirements. However, the savings depend heavily on the data inserted into the backup system and scale directly with the degree of overlap between hosts. (A study conducted as part of the Farsite project [5] found up to 50% overlap between the computers of Microsoft employees.)

Replication

As discussed in Chapter 2, replication is needed to make data storage robust against host failure. pStore stores multiple copies of blocks under different keys inside the hash table; it accomplishes this by hashing the block identifier together with a salt value. pStore uses a collection of well-known salt values for different replicas. To create the identifier of the i-th replica of a block, the client simply appends the salt value for replica i to the block identifier, H2, and calculates the hash value of this concatenation. Using the obtained hash value as the key, the client then inserts the replica into the system. In this way, clients can create any number of replicas they want.
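The derivation of replica identifiers can be sketched as follows. The salt values in `SALTS` and the use of SHA-256 are assumptions made for this example; the source only states that the salts are well known and that the salt is appended to H2 before hashing.

```python
import hashlib

# Hypothetical well-known salt values, one per replica index.
SALTS = [b"replica-0", b"replica-1", b"replica-2"]


def replica_id(h2: bytes, i: int) -> bytes:
    """Identifier of the i-th replica: hash of H2 with the i-th salt appended."""
    return hashlib.sha256(h2 + SALTS[i]).digest()


# Any client that knows H2 and the salts can recompute every replica key.
h2_example = hashlib.sha256(b"some encrypted block").digest()
replica_keys = [replica_id(h2_example, i) for i in range(len(SALTS))]
```

Since the keys are unrelated hash values, the DHT maps the replicas to different regions of the identifier space, and therefore most likely to different hosts.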

Fairness

pStore offers no dedicated mechanism for fairness. Nothing prevents peers from overloading the system with too much data, and nothing prevents clients from free-riding, that is, storing data inside the system without providing storage themselves.

3.1.2 A Cooperative Internet Backup Scheme

Introduction

"A Cooperative Internet Backup Scheme" [18] addresses the Peer-to-Peer backup problem in an entirely different manner. This system focuses on providing a virtual replicated "disk", leaving the implementation of the actual backup to clients. The system places great emphasis on fairness, at the cost of a few other desirable properties.

The main idea of this scheme is that peers exchange disk space in a symmetric manner. When forming a trading partnership, both peers agree on terms such as the exchange quantity, the time frame of the exchange and a certain availability of each peer. Since it is rare for two peers to find exact matches in exchange quantities, a trade ratio might be used. Once both peers have agreed to the deal, they can begin storing data on each other's hard disks.

A peer can increase system efficiency by trading with many partners, which keeps the storage overhead low. On the other hand, bandwidth consumption and latency increase with each partner added, because more fragments must be retrieved to reconstruct data, leading to decreased system efficiency. Selecting the right number of fragments is a trade-off between space efficiency and network overhead.

Finding Partner

Cooperative Internet Backup suggests a central server acting as a matchmaker in order to facilitate finding trade partners among peers. A peer that is willing to trade (to back up data) contacts the server with a description of the partner it is looking for. The server keeps all this information in a database and returns a list of potential candidate partners to the requesting peer. The requesting peer then contacts these candidates to establish suitable trades.
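The matchmaking role of the central server can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration and is not taken from [18]: the `Offer` fields, the uptime filter, and the ordering of candidates by how closely their offered space matches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Offer:
    peer: str        # peer name or address
    space_mb: int    # disk space the peer offers to trade
    uptime: float    # fraction of time the peer promises to be online


class Matchmaker:
    """Central server that remembers offers and suggests candidate partners."""

    def __init__(self) -> None:
        self.offers = []  # stands in for the server's database

    def request(self, offer: Offer, min_uptime: float) -> List[Offer]:
        # Candidates: other peers that promise at least the wanted uptime.
        candidates = [o for o in self.offers
                      if o.peer != offer.peer and o.uptime >= min_uptime]
        # Closest match in offered space first; unequal sizes could still
        # trade by agreeing on a ratio.
        candidates.sort(key=lambda o: abs(o.space_mb - offer.space_mb))
        self.offers.append(offer)  # remember the requester for later peers
        return candidates


server = Matchmaker()
server.request(Offer("alice", 500, 0.90), min_uptime=0.8)  # nobody registered yet
server.request(Offer("bob", 400, 0.50), min_uptime=0.8)
found = server.request(Offer("carol", 450, 0.95), min_uptime=0.8)
```

The server only proposes candidates; the requesting peer still has to contact them and negotiate the actual trade.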

There is one problem with the approach taken in this system: a peer must remember the list of peers it has traded with, i.e., its partners. As far as can be told from the paper, no satisfactory solution to this problem is provided.

Fairness

Fairness is the focal point of the Cooperative Internet Backup Scheme. Building on symmetric trading, it provides a number of mechanisms for dealing with free-riding and malicious peers.

Replacing bad partners is the first step. If a partner loses data or does not keep its uptime promises, a peer will drop the partner and look for a new one. The misbehaving partner is kept on a blacklist so that it is not chosen again in the future. Of course, a peer does not drop a partner on its first failure, since the partner may only be temporarily down or may have crashed. To take this into account, a peer waits for a time interval before it decides that the partner is "bad".
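The grace period before a partner is declared bad can be sketched as follows. This is an illustrative sketch only: the class name `PartnerMonitor`, the explicit timestamps and the one-hour grace period are assumptions, not details from [18].

```python
class PartnerMonitor:
    """Drop and blacklist a partner only after it has failed for a whole grace period."""

    def __init__(self, grace_period: float) -> None:
        self.grace_period = grace_period  # seconds of failure that are tolerated
        self.first_failure = {}           # partner -> time of first failed check
        self.blacklist = set()

    def record_check(self, partner: str, ok: bool, now: float) -> bool:
        """Record one check result; return True if the partner was dropped."""
        if ok:
            # The partner recovered, so the failure was only temporary.
            self.first_failure.pop(partner, None)
            return False
        started = self.first_failure.setdefault(partner, now)
        if now - started >= self.grace_period:
            self.blacklist.add(partner)
            del self.first_failure[partner]
            return True
        return False


monitor = PartnerMonitor(grace_period=3600)
results = [
    monitor.record_check("p1", ok=False, now=0),     # first failure: wait
    monitor.record_check("p1", ok=False, now=1800),  # still within the grace period
    monitor.record_check("p1", ok=False, now=3600),  # grace period over: drop
]
```

A successful check resets the timer, so a partner with a short outage keeps its place.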

In order to locate bad partners, a peer needs to check its partners periodically to verify that they are still holding the required backup data and are still keeping the uptime they agreed to. The peer requests random blocks now and then to make sure that the partner is behaving well.
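A random spot check of this kind might look like the sketch below. It assumes that the checking peer keeps a hash of every block it handed to the partner, and it uses SHA-256; both are assumptions for illustration. The `Partner` class is a local stand-in for a remote peer.

```python
import hashlib
import random


def block_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Partner:
    """Stand-in for a remote partner; a real one would be reached over the network."""

    def __init__(self) -> None:
        self.store = {}

    def put(self, index: int, data: bytes) -> None:
        self.store[index] = data

    def get(self, index: int):
        return self.store.get(index)  # None if the block is no longer held


def spot_check(partner: Partner, expected: dict, samples: int,
               rng: random.Random) -> bool:
    """Request a few randomly chosen blocks and compare them with local hashes."""
    chosen = rng.sample(sorted(expected), min(samples, len(expected)))
    for index in chosen:
        data = partner.get(index)
        if data is None or block_hash(data) != expected[index]:
            return False
    return True


blocks = {i: f"block {i}".encode() for i in range(10)}
expected = {i: block_hash(data) for i, data in blocks.items()}

honest = Partner()
for i, data in blocks.items():
    honest.put(i, data)
lazy = Partner()  # accepted the trade but kept nothing

honest_ok = spot_check(honest, expected, samples=3, rng=random.Random(0))
lazy_ok = spot_check(lazy, expected, samples=3, rng=random.Random(0))
```

Because the partner cannot predict which blocks will be requested, it has to keep all of them to pass such checks reliably.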

3.1.3 Pastiche

Introduction

Pastiche [9] combines the features of the previous systems, though it has no direct connection with either of them. Pastiche resembles pStore in the way backup data is managed and overlap is exploited. Files to be backed up are split into blocks, which are encrypted using convergent encryption to support sharing of blocks between peers. Metadata is treated as a special block that is never split. However, in the way data blocks are distributed, Pastiche is more similar to the Cooperative Internet Backup Scheme. Pastiche uses DHT routing to discover backup "buddies", which store data for each other. There is a restriction on becoming buddies: peers need to show a very high overlap in their backup data, which enables them to create a complete backup for each other by exchanging only the small set of blocks not already present at the partner. In this way, symmetric trading is formed, with each partner storing data for the other.

Figure 3.7: Buddy Searching Process in Pastiche (steps 1 to 3: a host computes an abstract of its chunkstore and sends "My abstract is 4b56cd36g please report coverage" in a lighthouse sweep; the two overlays are organized by network proximity and by coverage rate; goal: high overlap)

Discovering Buddies

When searching for a partner, one important criterion is the amount of overlapping data in the backup sets: the higher the overlap between buddies, the less data they need to exchange. Discovering matching peers is done in a distributed manner, rather than with the help of a central server as in the Cooperative Internet Backup Scheme.

Each peer joins two Distributed Hash Tables (DHTs). The first DHT is organized by network proximity and the second DHT is organized by coverage rate, based on an estimate of the overlap between peers. The second DHT is used only when discovering partners in the first DHT fails.

The process of partner discovery works as follows (Figure 3.7):

1. A host first calculates an abstract of its file system. The abstract consists of checksums of a small, randomly chosen set of files and is used as a fingerprint of the host’s data.

2. With the abstract obtained, the host initiates a search in the network proximity DHT (i.e., the first DHT) by doing a lighthouse sweep.

3. If the lighthouse sweep fails, a search in the coverage DHT (i.e., the second DHT) is used as a fallback. Inside the coverage DHT, instead of using the lighthouse sweep, the request is forwarded towards hosts with higher coverage at each hop during a single search.

4. Finally, when a partner with high overlap has been found, the blocks in which the two hosts differ are exchanged between the established partners.
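The abstract and coverage computations behind these steps can be sketched as follows. This is a simplified illustration, not the Pastiche implementation: the DHT routing and the lighthouse sweep are left out, candidates are represented directly by the set of checksums they hold, and the names `make_abstract`, `coverage` and `pick_buddy` as well as the 0.8 threshold are assumptions.

```python
import hashlib
import random


def checksum(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def make_abstract(files: dict, size: int, rng: random.Random) -> set:
    """Checksums of a small, randomly chosen subset of the host's files."""
    names = rng.sample(sorted(files), min(size, len(files)))
    return {checksum(files[name]) for name in names}


def coverage(abstract: set, candidate_checksums: set) -> float:
    """Fraction of the abstract that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_checksums) / len(abstract)


def pick_buddy(abstract: set, candidates: dict, threshold: float):
    """Return the candidate with the highest coverage if it reaches the threshold."""
    best = max(candidates,
               key=lambda name: coverage(abstract, candidates[name]),
               default=None)
    if best is not None and coverage(abstract, candidates[best]) >= threshold:
        return best
    return None


host_files = {f"file{i}": f"content {i}".encode() for i in range(20)}
abstract = make_abstract(host_files, size=5, rng=random.Random(1))

# One candidate with the same files installed, one with unrelated data.
similar = {checksum(data) for data in host_files.values()}
different = {checksum(b"unrelated data")}
buddy = pick_buddy(abstract, {"similar": similar, "different": different},
                   threshold=0.8)
```

Since the abstract is only a small sample, the reported coverage is an estimate of the true overlap, which keeps the discovery messages small.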

Data Sharing and Replication

All peers need to split their data so as to produce similar sets of blocks for files
