Adaptive Demand-based Heuristics for Traffic Reduction in Distributed Information Systems
George Bilchev and Sverrir Olafsson
As Internet connectivity reaches the global community, information systems are becoming more and more distributed. Inevitably, this overnight exponential growth has also caused traffic overload at various places in the network. Until recently, it was believed that scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and processing power could be brought to where they were needed. The Internet's exponential growth, however, exposed this impression as a myth. Information access has not been and will not be evenly distributed. As has been observed, user requests create 'hot-spots' of network load, with the same data transmitted over the same network links again and again. These hot-spots are not static but move around, making it impossible to accurately predict the right network capacity to install. All this justifies the need to develop new infrastructure for data dissemination on an ever-increasing scale, and to design adaptive heuristics for traffic reduction.
In this chapter, we develop a distributed file system model and use it as an experimental simulation tool to design, implement and test network adaptation algorithms. Section 13.2 describes in detail the distributed file system model and explains the implemented simulation environment. Two adaptation algorithms are developed in section 13.3: one is based on the 'greedy' heuristic principle, and the other is a genetic algorithm tailored to handle the constraints of our problem. Experiments are shown in section 13.4, and section 13.5 gives conclusions and discusses possible future research directions.

Telecommunications Optimization: Heuristic and Adaptive Techniques, edited by D.W. Corne, M.J. Oates and G.D. Smith.
Copyright © 2000 John Wiley & Sons Ltd. ISBNs: 0-471-98855-3 (Hardback); 0-470-84163X (Electronic)
Figure 13.1 A schematic representation of the network and the distributed file system.
13.2 The Adaptation Problem of a Distributed File System
The World Wide Web is rapidly moving us towards a distributed, interconnected information environment, in which an object may be accessed from multiple locations that are geographically distributed worldwide. For example, a database of customer information can be accessed from the location where a salesman is working for the day. In another example, an electronic document may be co-authored and edited by several users.
In such distributed information environments, the replication of objects has crucial implications for system performance. The replication scheme affects the performance of the distributed system, since reading an object locally is faster and less costly than reading it from a remote server. In general, the optimal replication scheme of an object depends on the request pattern, i.e. the number of times users request the data. Presently, the replication scheme of a distributed database is established in a static fashion when the database is designed. The scheme remains fixed until the designer manually intervenes to change the number of replicas or their location. If the request pattern is fixed and known a priori, then this is a reasonable solution. In practice, however, the request patterns are often dynamic and difficult to predict. Therefore, we need an adaptive network that manages to optimize itself as the pattern changes. We proceed with the development of a mathematical model of a distributed information/file system.
A distributed file system consists of interconnected nodes, where each node i, i = 1, …, N, has a local disk with capacity d_i to store files (see Figures 13.1 and 13.2). There is a collection of M files, each of size s_j, j = 1, …, M. Copies of the files can reside on any one of the disks provided there is enough capacity. The communication cost c_{i,k} between nodes i and k (measured as transferred bytes per simulated second) is also given.
Figure 13.2 Users connect to each node of the distributed file system and generate requests.
In our model, each node runs a file manager which is responsible for downloading files from the network (Figure 13.3). To do that, each file manager i maintains an index vector l_{i,j} containing the location from which each file j is downloaded. User applications running on the nodes generate file requests, the frequency of which can be statistically monitored in a matrix {p_{i,j}}.
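To make the model concrete, its state can be sketched in Python. The variable names mirror the chapter's notation (N, M, d, s, c, p, l), while the sizes and values below are invented purely for illustration:

```python
import random

# Sketch of the model state; names follow the chapter's notation, values are
# illustrative only.
N, M = 3, 4                                   # number of nodes and files
d = [100, 80, 60]                             # disk capacity d_i per node
s = [10, 25, 5, 40]                           # file sizes s_j
# communication cost c_{i,k} in bytes per simulated second (fast local link,
# slower remote links -- an arbitrary choice for the sketch)
c = [[1.0 if i == k else 0.2 for k in range(N)] for i in range(N)]
random.seed(0)
p = [[random.random() for _ in range(M)] for _ in range(N)]  # request rates p_{i,j}
l = [[0] * M for _ in range(N)]               # index matrix l_{i,j}: all on node 0
```

Any algorithm in the remainder of the chapter can then be expressed as an operation on these matrices.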
To account for contention and to distribute the file load across the network, we model how busy the file manager at each node k is as follows:

b_k = \frac{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m}\, s_m\, \delta(l_{n,m}, k)}{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m}\, s_m}

where δ(l_{n,m}, k) equals 1 if l_{n,m} = k and 0 otherwise, i.e. b_k is the fraction of the total requested traffic that is served from node k.
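Under this reading of the formula, b_k can be evaluated directly from the request matrix, the file sizes and the index matrix. A minimal sketch, with names mirroring the notation above:

```python
def busy_fraction(k, p, s, l):
    """b_k: the share of the total requested traffic (request rate times
    file size) that is served from node k under the location index l."""
    N, M = len(p), len(s)
    total = sum(p[n][m] * s[m] for n in range(N) for m in range(M))
    served = sum(p[n][m] * s[m] for n in range(N) for m in range(M)
                 if l[n][m] == k)
    return served / total if total > 0 else 0.0
```

For example, with two nodes, two files of sizes 2 and 3, unit request rates, and node 0 serving three of the four (node, file) pairs, node 0 carries 70% of the traffic.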
Thus, the response time of the file manager at node k can be expressed as the waiting time in a buffer (Schwartz, 1987):

r_k = \begin{cases} \dfrac{\tau_k\, b_k}{1 - b_k}, & b_k < 1 \\ \infty, & \text{otherwise} \end{cases}

Figure 13.3 Each node runs a file manager (responsible for allocating files on the network) and a number of user applications which generate the requests.
where τ_k reflects the maximum response capacity of the individual servers. The overall performance at node i can be measured as the time during which applications wait for files to download (i.e. the response time):
o_i = \sum_{j=1}^{M} p_{i,j} \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}} \right)
The first term in the sum represents the time needed for the actual transfer of the data, and the second term reflects the waiting time for that transfer to begin. The goal is to minimize the average network response time:
O = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}} \right)
The minimization is over the index matrix {l_{i,j}}. There are two constraints: (1) the available disk capacity on each node must not be exceeded; and (2) each file must have at least one copy somewhere on the network.
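The objective can be evaluated directly from the model's matrices. The sketch below folds in the busy fraction b_k and the buffer delay r_k from the formulas above; the helper names are ours:

```python
def response_time(p, s, c, l, tau):
    """Average network response time O over the index matrix l, with the
    busy fraction b_k and the buffer delay r_k computed as in the text."""
    N, M = len(p), len(s)
    total = sum(p[n][m] * s[m] for n in range(N) for m in range(M))

    def busy(k):
        # fraction of total requested traffic served from node k
        return sum(p[n][m] * s[m] for n in range(N) for m in range(M)
                   if l[n][m] == k) / total

    def delay(k):
        # waiting time in the buffer of node k; infinite when saturated
        b = busy(k)
        return float('inf') if b >= 1 else tau[k] * b / (1 - b)

    return sum(p[i][j] * (s[j] / c[i][l[i][j]] + delay(l[i][j]))
               for i in range(N) for j in range(M)) / N
```

For instance, two nodes each serving one local file of size 10 over a unit-speed local link, with τ = 1, split the traffic evenly (b_k = 0.5, r_k = 1), giving O = 11.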
13.2.1 The Simulation Environment
In our distributed file system model, the users generate file requests and the network responds by allocating and downloading the necessary files. The file requests can be statistically monitored and future requests predicted from observed patterns. The simulation environment captures these ideas by modeling the user file requests as random walks:
p_{i,j}(t) = p_{i,j}(t-1) + \gamma

where γ is drawn from a uniform distribution U(−r, r). The parameter r determines the ‘randomness’ of the walk; if it is close to zero, then p_{i,j}(t) ≈ p_{i,j}(t−1).
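One step of the random walk can be sketched as follows. Clipping the rates at zero is our own addition, so that the request frequencies remain valid; the text does not say how negative excursions are handled:

```python
import random

def step_requests(p_prev, r, rng=random):
    """One random-walk step: p[i][j](t) = p[i][j](t-1) + gamma with
    gamma ~ U(-r, r).  Clipping at zero is an assumption of this sketch."""
    return [[max(0.0, v + rng.uniform(-r, r)) for v in row]
            for row in p_prev]
```

With r = 0 the request pattern is static; increasing r makes the hot-spots drift faster.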
During the simulated interval [t, t+1], the model has information about the file requests that occurred in the previous interval [t−1, t]. Thus the dynamics of the simulation can be formally defined as:
For t = 1, 2, 3, …
    generate new file requests: P(t) = P(t−1) + {γ_{i,j}}
    simulate network:

O(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j}(t) \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}}(t) \right)
An adaptive distributed file system would optimize its file distribution according to the user requests. Since the future user requests are not known and can only be predicted, the optimization algorithm has to use an expected value of the requests derived from previous observations:

\hat{P}(t) = \mathrm{Prediction}(P(t-1))
Thus, an adaptive distributed file system can be simulated as follows:

For t = 1, 2, 3, …
    predict file requests: \hat{P}(t) = Prediction(P(t−1))
    generate new file requests: P(t) = P(t−1) + {γ_{i,j}}
    simulate network:

O(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j}(t) \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}}(t) \right)
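The adaptive loop above can be sketched as a driver that accepts pluggable `optimize`, `measure` and `predict` callbacks. The default predictor simply reuses P(t−1); that is a placeholder of ours, since Prediction() is left unspecified at this point in the text:

```python
import random

def simulate(T, p0, r, optimize, measure, predict=lambda p: p):
    """Skeleton of the adaptive simulation loop: predict P(t), let the
    optimizer adapt the file distribution to the prediction, generate the
    actual requests as a random-walk step, then record O(t)."""
    p, history = p0, []
    for t in range(1, T + 1):
        p_hat = predict(p)                    # predicted requests for step t
        optimize(p_hat)                       # adapt {l_ij} to the prediction
        p = [[max(0.0, v + random.uniform(-r, r)) for v in row]
             for row in p]                    # actual new requests P(t)
        history.append(measure(p))            # observed network response O(t)
    return history
```

Plugging the greedy heuristic or the GA of section 13.3 in as `optimize` reproduces the two adaptive systems compared in the experiments.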
The next section describes the developed optimization algorithms in detail.
13.3 Optimization Algorithms
The ‘greedy’ principle consists of selfishly allocating resources (provided the constraints allow it) without regard to the performance of the other members of the network (Cormen et al., 1990). While greedy algorithms are optimal for certain problems (e.g. the minimal spanning tree problem), in practice they often produce only near-optimal solutions. Greedy algorithms, however, are very fast and are usually used as a heuristic method. The greedy approach seems very well suited to our problem, since the uncertainties in the file request prediction mean that we never actually optimize the real problem, but only our expectation of it. The implemented greedy algorithm works as follows:
For each file j, check every node i to see whether there is enough space to accommodate it; if so, calculate the response time of the network if file j were placed at node i:

\sum_{k=1}^{N} p_{k,j} \left( \frac{s_j}{c_{k,i}} + r_i \right)

After all nodes are checked, copy the file to the best node found.
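A sketch of this first greedy pass is given below. For brevity the queueing delay r_i is taken as zero, which is a simplification of ours; the chapter scores candidates with the full expression above:

```python
def greedy_place(s, d, c, p):
    """Greedy first pass: place a single copy of each file j on the node i
    minimising sum_k p[k][j] * s[j] / c[k][i] (queueing term omitted)."""
    N, M = len(d), len(s)
    free = list(d)                     # remaining disk space per node
    loc = [None] * M                   # chosen node for each file
    for j in range(M):
        best, best_cost = None, float('inf')
        for i in range(N):
            if free[i] >= s[j]:        # node i has room for file j
                cost = sum(p[k][j] * s[j] / c[k][i] for k in range(N))
                if cost < best_cost:
                    best, best_cost = i, cost
        if best is None:
            raise ValueError("no node can accommodate file %d" % j)
        free[best] -= s[j]
        loc[j] = best
    return loc
```

Each file gravitates towards the node whose incoming links best serve the nodes that request it most.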
The algorithm described above loads only one copy of each file into the distributed file system. If multiple copies are allowed, then copies of the files are added in the following way:

For each node i, get its most heavily used file, i.e. the file j maximizing p_{i,j} · s_j. Check whether there is enough space to accommodate it; if so, copy it. Continue until all files are checked.
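The replica-adding pass can be sketched as follows; representing the set of stored copies as (node, file) pairs derived from the first pass's placement is our encoding choice:

```python
def add_replicas(loc, s, d, p):
    """Greedy second pass: each node tries to cache local copies of its most
    heavily used files (largest p[i][j] * s[j]), capacity permitting.
    loc[j] is the primary copy chosen by the first pass; the returned set
    holds a (node, file) pair for every stored copy."""
    N, M = len(d), len(s)
    free = list(d)
    for j in range(M):
        free[loc[j]] -= s[j]                   # space taken by primary copies
    copies = {(loc[j], j) for j in range(M)}
    for i in range(N):
        # visit files in decreasing order of local usage p[i][j] * s[j]
        for j in sorted(range(M), key=lambda j: -p[i][j] * s[j]):
            if (i, j) not in copies and free[i] >= s[j]:
                copies.add((i, j))
                free[i] -= s[j]
    return copies
```

Nodes with spare capacity thus pull local replicas of the files they hit hardest, cutting repeated remote transfers.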
Genetic Algorithms (GAs) are very popular due to their simple idea and wide applicability (Holland, 1975; Goldberg, 1989). The simple GA is a population-based search in which the individuals (each representing a point in the search space) exchange information (i.e. reproduce) to move through the search space. The exchange of information is done through operators (such as mutation and crossover) and is based on the ‘survival of the fittest’ principle, i.e. better individuals have a greater chance to reproduce.
It is well established that, in order to produce good results, the basic GA must be tailored to the problem at hand by designing a problem-specific representation and operators. The overall flow of control in our implemented GA for the distributed file system model is similar to the steady-state genetic algorithm described in Chapter 1.
In order to describe the further implementation details of our GA we need to answer the following questions: How are the individuals represented? How is the population initialized? How is the selection process implemented? What operators are used?
Individual representation: each individual in the population represents a distribution state of the file system, captured by the index matrix {l_{i,j}}.
Initialization: it is important to create a random population of feasible individuals. The implemented initialization process randomly generates a node index for each object and tries to accommodate it on that node. In case of failure, the process is repeated for the same object.
Selection process: the individuals are first linearly ranked according to their fitness and then selected by a roulette-wheel process using their rank value.
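Linear ranking combined with roulette-wheel selection might look as follows, assuming fitness is a score where larger is better (for this minimization problem, e.g. the negated response time):

```python
import random

def rank_select(population, fitness, rng=random):
    """Pick one individual by linear ranking + roulette wheel: sort by
    fitness so the best individual receives the highest rank, then spin a
    wheel whose slot sizes are the ranks 1..n."""
    ranked = sorted(population, key=fitness)      # worst ... best
    n = len(ranked)
    pick = rng.uniform(0, n * (n + 1) / 2)        # sum of ranks 1..n
    acc = 0.0
    for rank, ind in enumerate(ranked, start=1):
        acc += rank
        if pick <= acc:
            return ind
    return ranked[-1]                             # numerical-edge fallback
```

Rank-based wheels keep selection pressure stable even when raw fitness values are tightly clustered.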
Operators: the main problem is the design of operators which preserve the feasibility of the solutions (Bilchev and Parmee, 1996). This is important for our problem since it intrinsically has two constraints: (i) the disk capacity of the nodes must not be exceeded; and (ii) each file must have at least one copy somewhere on the network. (If feasibility were not preserved by the operators, the fitness function would have to be modified by an appropriate penalty function in order to drive the population into the feasible region.) We have developed two main operators, both preserving feasibility, called safe-add and safe-delete. Safe-add works as follows:

For each node, randomly select and copy a file which is not already locally present and whose size is smaller than the available disk space. Check whether any of the nodes would respond faster by downloading files from the new locations and, if so, update the matrix {l_{i,j}}.
Safe-delete works as follows:

For each node, randomly select and delete a file, provided it is not the last copy. Update the matrix {l_{i,j}} to reflect the above changes.
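Safe-add might be sketched as below. Encoding the set of copies as the nodes referenced by the index matrix l is one possible reading of the chapter's data structures, and treating a larger c as a faster link follows the bytes-per-second definition given earlier:

```python
import random

def safe_add(l, s, d, c, rng=random):
    """Feasibility-preserving 'safe-add' sketch: each node tries to cache one
    randomly chosen file it does not already hold, capacity permitting; every
    node then switches to the new copy when its link to it is faster."""
    N, M = len(d), len(s)
    # nodes currently holding a copy of each file, derived from l
    holders = [{l[i][j] for i in range(N)} for j in range(M)]
    for node in range(N):
        used = sum(s[j] for j in range(M) if node in holders[j])
        candidates = [j for j in range(M)
                      if node not in holders[j] and used + s[j] <= d[node]]
        if candidates:
            j = rng.choice(candidates)
            holders[j].add(node)
            for i in range(N):            # repoint readers with a faster link
                if c[i][node] > c[i][l[i][j]]:
                    l[i][j] = node
    return l
```

Because files are only ever added, disk capacity is the sole constraint checked; the at-least-one-copy constraint cannot be violated by this operator.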
In our experiments we used a population size of 70 individuals evolved for 30 generations. During each generation, 50 safe-add and three safe-delete operators are applied. During selection, the best individual has a 5% greater chance of being selected than the second best, and so on.
13.4 Simulation Results
In our experiments we start with an offline simulation, during which the optimization algorithms are run while the file system is not in use (overnight, for example). In this scenario, we assume that both algorithms have enough time to finish their optimization before the file system is used again.

A typical simulation is shown in Figure 13.4. Tests are done using seven nodes and 100 files. All simulation graphs start from the same initial state of the distributed file system. The two optimization algorithms are then compared against a non-adaptive (static) file system, i.e. one in which no optimization is used. The experiments clearly show that the adaptive distributed file system produces better results than a static file system. The graphs also indicate the excellent performance of the GA optimizer, which consistently outperforms the greedy algorithm.
Figure 13.4 An offline simulation. Adaptive distributed file systems utilizing a genetic algorithm and a greedy algorithm, respectively, are compared against a static distributed file system. The experiments use seven nodes and 31 files.
To show the effect of delayed information, we run the greedy algorithm once using the usage pattern collected from the previous simulation step, P(t−1) (which is available in practice), and once using the actual P(t) (which is not known in practice). The difference in performance reveals how much better we could do if perfect information were available (Figure 13.5).
Figure 13.5 A greedy algorithm with perfect information is compared to a greedy algorithm with delayed information. In practice, we only have delayed information.
Figure 13.6 Online simulation. The circles indicate when the GA optimization takes place. The GA/greedy algorithm ratio is 60 (i.e. the GA is run once for every 60 runs of the greedy algorithm).
Figure 13.7 Online simulation. The GA/greedy algorithm ratio is 40. This is the critical ratio at which the average performance of the two algorithms is comparable.
Figure 13.8 Online simulation. The GA/greedy algorithm ratio is 30. The GA manages to maintain its performance advantage.