Adaptive Demand-based Heuristics for Traffic Reduction in Distributed Information Systems
George Bilchev and Sverrir Olafsson
As Internet connectivity reaches the global community, information systems are becoming more and more distributed. Inevitably, this overnight exponential growth has also caused traffic overload at various places in the network. Until recently, it was believed that scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and processing power could be brought to where they were needed. The Internet's exponential growth, however, exposed this impression as a myth. Information access has not been and will not be evenly distributed. As has been observed, user requests create 'hot-spots' of network load, with the same data transmitted over the same network links again and again. These hot-spots are not static but move around, making it impossible to accurately predict the right network capacity to install. All this justifies the need to develop new infrastructure for data dissemination on an ever-increasing scale, and to design adaptive heuristics for traffic reduction.
In this chapter, we develop a distributed file system model and use it as an experimental simulation tool to design, implement and test network adaptation algorithms. Section 13.2 describes in detail the distributed file system model and explains the implemented simulation environment. Two adaptation algorithms are developed in section 13.3: one is based on the 'greedy' heuristic principle, and the other is a genetic algorithm tailored to handle the constraints of our problem. Experiments are shown in section 13.4, and section 13.5 gives conclusions and discusses possible future research directions.

Telecommunications Optimization: Heuristic and Adaptive Techniques, edited by D.W. Corne, M.J. Oates and G.D. Smith.
Copyright © 2000 John Wiley & Sons Ltd. ISBNs: 0-471-98855-3 (Hardback); 0-470-84163X (Electronic)
Figure 13.1 A schematic representation of the network and the distributed file system.
13.2 The Adaptation Problem of a Distributed File System
The World Wide Web is rapidly moving us towards a distributed, interconnected information environment, in which an object may be accessed from multiple locations that are geographically distributed worldwide. For example, a database of customer information can be accessed from the location where a salesman is working for the day. In another example, an electronic document may be co-authored and edited by several users.
In such distributed information environments, the replication of objects has crucial implications for system performance. The replication scheme affects the performance of the distributed system, since reading an object locally is faster and less costly than reading it from a remote server. In general, the optimal replication scheme of an object depends on the request pattern, i.e. the number of times users request the data. Presently, the replication scheme of a distributed database is established in a static fashion when the database is designed. The scheme remains fixed until the designer manually intervenes to change the number of replicas or their location. If the request pattern is fixed and known a priori, then this is a reasonable solution. In practice, however, the request patterns are often dynamic and difficult to predict. Therefore, we need an adaptive network that manages to optimize itself as the pattern changes. We proceed with the development of a mathematical model of a distributed information/file system.
A distributed file system consists of interconnected nodes, where each node i, i = 1, …, N, has a local disk with capacity d_i to store files (see Figures 13.1 and 13.2). There is a collection of M files, each of size s_j, j = 1, …, M. Copies of the files can reside on any one of the disks provided there is enough capacity. The communication cost c_{i,k} between nodes i and k (measured as transferred bytes per simulated second) is also given.
Figure 13.2 Users connect to each node of the distributed file system and generate requests.
In our model, each node runs a file manager which is responsible for downloading files from the network (Figure 13.3). To do that, each file manager i maintains an index vector l_{i,j} containing the location from which each file j is downloaded. User applications running on the nodes generate file requests, the frequency of which can be statistically monitored in a matrix {p_{i,j}}.
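To make the model concrete, its state can be sketched in Python. The variable names mirror the chapter's notation (N, M, d, s, c, p, l), while the sizes and values below are invented purely for illustration:

```python
import random

# Sketch of the model state; names follow the chapter's notation, values are
# illustrative only.
N, M = 3, 4                                   # number of nodes and files
d = [100, 80, 60]                             # disk capacity d_i per node
s = [10, 25, 5, 40]                           # file sizes s_j
# communication cost c_{i,k} in bytes per simulated second (fast local link,
# slower remote links -- an arbitrary choice for the sketch)
c = [[1.0 if i == k else 0.2 for k in range(N)] for i in range(N)]
random.seed(0)
p = [[random.random() for _ in range(M)] for _ in range(N)]  # request rates p_{i,j}
l = [[0] * M for _ in range(N)]               # index matrix l_{i,j}: all on node 0
```

Any algorithm in the remainder of the chapter can then be expressed as an operation on these matrices.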
To account for contention and to distribute the file load across the network, we model how busy the file manager at each node k is as follows:

b_k = \frac{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m}\, s_m\, \delta(l_{n,m}, k)}{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m}\, s_m}

where δ(l_{n,m}, k) equals 1 if l_{n,m} = k and 0 otherwise, i.e. b_k is the fraction of the total requested traffic that is served from node k.
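Under this reading of the formula, b_k can be evaluated directly from the request matrix, the file sizes and the index matrix. A minimal sketch, with names mirroring the notation above:

```python
def busy_fraction(k, p, s, l):
    """b_k: the share of the total requested traffic (request rate times
    file size) that is served from node k under the location index l."""
    N, M = len(p), len(s)
    total = sum(p[n][m] * s[m] for n in range(N) for m in range(M))
    served = sum(p[n][m] * s[m] for n in range(N) for m in range(M)
                 if l[n][m] == k)
    return served / total if total > 0 else 0.0
```

For example, with two nodes, two files of sizes 2 and 3, unit request rates, and node 0 serving three of the four (node, file) pairs, node 0 carries 70% of the traffic.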
Thus, the response time of the file manager at node k can be expressed as the waiting time in a buffer (Schwartz, 1987):

r_k = \begin{cases} \dfrac{\tau_k\, b_k}{1 - b_k}, & b_k < 1 \\ \infty, & \text{otherwise} \end{cases}

Figure 13.3 Each node runs a file manager (responsible for allocating files on the network) and a number of user applications which generate the requests.
where τ_k reflects the maximum response capacity of the individual servers. The overall performance at node i can be measured as the time during which applications wait for files to download (i.e. the response time):
o_i = \sum_{j=1}^{M} p_{i,j} \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}} \right)
The first term in the sum represents the time needed for the actual transfer of the data, and the second term reflects the waiting time for that transfer to begin. The goal is to minimize the average network response time:
O = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}} \right)
The minimization is over the index matrix {l_{i,j}}. There are two constraints: (1) the available disk capacity on each node must not be exceeded; and (2) each file must have at least one copy somewhere on the network.
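The objective can be evaluated directly from the model's matrices. The sketch below folds in the busy fraction b_k and the buffer delay r_k from the formulas above; the helper names are ours:

```python
def response_time(p, s, c, l, tau):
    """Average network response time O over the index matrix l, with the
    busy fraction b_k and the buffer delay r_k computed as in the text."""
    N, M = len(p), len(s)
    total = sum(p[n][m] * s[m] for n in range(N) for m in range(M))

    def busy(k):
        # fraction of total requested traffic served from node k
        return sum(p[n][m] * s[m] for n in range(N) for m in range(M)
                   if l[n][m] == k) / total

    def delay(k):
        # waiting time in the buffer of node k; infinite when saturated
        b = busy(k)
        return float('inf') if b >= 1 else tau[k] * b / (1 - b)

    return sum(p[i][j] * (s[j] / c[i][l[i][j]] + delay(l[i][j]))
               for i in range(N) for j in range(M)) / N
```

For instance, two nodes each serving one local file of size 10 over a unit-speed local link, with τ = 1, split the traffic evenly (b_k = 0.5, r_k = 1), giving O = 11.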
13.2.1 The Simulation Environment
In our distributed file system model, the users generate file requests and the network responds by allocating and downloading the necessary files. The file requests can be statistically monitored and future requests predicted from observed patterns. The simulation environment captures these ideas by modeling the user file requests as random walks:
p_{i,j}(t) = p_{i,j}(t-1) + \gamma

where γ is drawn from a uniform distribution U(−r, r). The parameter r determines the ‘randomness’ of the walk; if it is close to zero, then p_{i,j}(t) ≈ p_{i,j}(t−1).
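One step of the random walk can be sketched as follows. Clipping the rates at zero is our own addition, so that the request frequencies remain valid; the text does not say how negative excursions are handled:

```python
import random

def step_requests(p_prev, r, rng=random):
    """One random-walk step: p[i][j](t) = p[i][j](t-1) + gamma with
    gamma ~ U(-r, r).  Clipping at zero is an assumption of this sketch."""
    return [[max(0.0, v + rng.uniform(-r, r)) for v in row]
            for row in p_prev]
```

With r = 0 the request pattern is static; increasing r makes the hot-spots drift faster.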
During the simulated interval [t, t+1], the model has information about the file requests that occurred in the previous interval [t−1, t]. Thus the dynamics of the simulation can be formally defined as:
For t = 1, 2, 3, …
    generate new file requests: P(t) = P(t−1) + {γ_{i,j}}
    simulate network:

O(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j}(t) \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}}(t) \right)
An adaptive distributed file system would optimize its file distribution according to the user requests. Since the future user requests are not known and can only be predicted, the optimization algorithm has to use an expected value of the requests derived from previous observations:

\hat{P}(t) = \mathrm{Prediction}(P(t-1))
Thus, an adaptive distributed file system can be simulated as follows:

For t = 1, 2, 3, …
    predict file requests: \hat{P}(t) = Prediction(P(t−1))
    generate new file requests: P(t) = P(t−1) + {γ_{i,j}}
    simulate network:

O(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j}(t) \left( \frac{s_j}{c_{i,\, l_{i,j}}} + r_{l_{i,j}}(t) \right)
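The adaptive loop above can be sketched as a driver that accepts pluggable `optimize`, `measure` and `predict` callbacks. The default predictor simply reuses P(t−1); that is a placeholder of ours, since Prediction() is left unspecified at this point in the text:

```python
import random

def simulate(T, p0, r, optimize, measure, predict=lambda p: p):
    """Skeleton of the adaptive simulation loop: predict P(t), let the
    optimizer adapt the file distribution to the prediction, generate the
    actual requests as a random-walk step, then record O(t)."""
    p, history = p0, []
    for t in range(1, T + 1):
        p_hat = predict(p)                    # predicted requests for step t
        optimize(p_hat)                       # adapt {l_ij} to the prediction
        p = [[max(0.0, v + random.uniform(-r, r)) for v in row]
             for row in p]                    # actual new requests P(t)
        history.append(measure(p))            # observed network response O(t)
    return history
```

Plugging the greedy heuristic or the GA of section 13.3 in as `optimize` reproduces the two adaptive systems compared in the experiments.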
The next section describes the developed optimization algorithms in detail.
13.3 Optimization Algorithms
The ‘greedy’ principle consists of selfishly allocating resources (provided the constraints allow it) without regard to the performance of the other members of the network (Cormen et al., 1990). While greedy algorithms are optimal for certain problems (e.g. the minimal spanning tree problem), in practice they often produce only near-optimal solutions. Greedy algorithms, however, are very fast and are usually used as a heuristic method. The greedy approach seems very well suited to our problem, since the uncertainties in the file request prediction mean that we never actually optimize the real problem, but only our expectation of it. The implemented greedy algorithm works as follows:
For each file j, check every node i to see whether there is enough space to accommodate it; if so, calculate the response time of the network if file j were placed at node i:

\sum_{k=1}^{N} p_{k,j} \left( \frac{s_j}{c_{k,i}} + r_i \right)

After all nodes are checked, copy the file to the best node found.
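A sketch of this first greedy pass is given below. For brevity the queueing delay r_i is taken as zero, which is a simplification of ours; the chapter scores candidates with the full expression above:

```python
def greedy_place(s, d, c, p):
    """Greedy first pass: place a single copy of each file j on the node i
    minimising sum_k p[k][j] * s[j] / c[k][i] (queueing term omitted)."""
    N, M = len(d), len(s)
    free = list(d)                     # remaining disk space per node
    loc = [None] * M                   # chosen node for each file
    for j in range(M):
        best, best_cost = None, float('inf')
        for i in range(N):
            if free[i] >= s[j]:        # node i has room for file j
                cost = sum(p[k][j] * s[j] / c[k][i] for k in range(N))
                if cost < best_cost:
                    best, best_cost = i, cost
        if best is None:
            raise ValueError("no node can accommodate file %d" % j)
        free[best] -= s[j]
        loc[j] = best
    return loc
```

Each file gravitates towards the node whose incoming links best serve the nodes that request it most.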
The algorithm described above loads only one copy of each file into the distributed file system. If multiple copies are allowed, then copies of the files are added in the following way:

For each node i, get its most heavily used file, i.e. the file j maximizing p_{i,j} · s_j. Check whether there is enough space to accommodate it; if so, copy it. Continue until all files are checked.
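The replica-adding pass can be sketched as follows; representing the set of stored copies as (node, file) pairs derived from the first pass's placement is our encoding choice:

```python
def add_replicas(loc, s, d, p):
    """Greedy second pass: each node tries to cache local copies of its most
    heavily used files (largest p[i][j] * s[j]), capacity permitting.
    loc[j] is the primary copy chosen by the first pass; the returned set
    holds a (node, file) pair for every stored copy."""
    N, M = len(d), len(s)
    free = list(d)
    for j in range(M):
        free[loc[j]] -= s[j]                   # space taken by primary copies
    copies = {(loc[j], j) for j in range(M)}
    for i in range(N):
        # visit files in decreasing order of local usage p[i][j] * s[j]
        for j in sorted(range(M), key=lambda j: -p[i][j] * s[j]):
            if (i, j) not in copies and free[i] >= s[j]:
                copies.add((i, j))
                free[i] -= s[j]
    return copies
```

Nodes with spare capacity thus pull local replicas of the files they hit hardest, cutting repeated remote transfers.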
Genetic Algorithms (GAs) are very popular due to their simple idea and wide applicability (Holland, 1975; Goldberg, 1989). The simple GA is a population-based search in which the individuals (each representing a point in the search space) exchange information (i.e. reproduce) to move through the search space. The exchange of information is done through operators (such as mutation and crossover) and is based on the ‘survival of the fittest’ principle, i.e. better individuals have a greater chance to reproduce.
It is well established that, in order to produce good results, the basic GA must be tailored to the problem at hand by designing a problem-specific representation and operators. The overall flow of control in our implemented GA for the distributed file system model is similar to the steady-state genetic algorithm described in Chapter 1.
In order to describe the further implementation details of our GA we need to answer the following questions: How are the individuals represented? How is the population initialized? How is the selection process implemented? What operators are used?
Individual representation: each individual in the population represents a distribution state of the file system, captured by the index matrix {l_{i,j}}.
Initialization: it is important to create a random population of feasible individuals. The implemented initialization process randomly generates a node index for each object and tries to accommodate it on that node. In case of failure, the process is repeated for the same object.
Selection process: the individuals are first linearly ranked according to their fitness and then selected by a roulette-wheel process using their rank value.
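Linear ranking combined with roulette-wheel selection might look as follows, assuming fitness is a score where larger is better (for this minimization problem, e.g. the negated response time):

```python
import random

def rank_select(population, fitness, rng=random):
    """Pick one individual by linear ranking + roulette wheel: sort by
    fitness so the best individual receives the highest rank, then spin a
    wheel whose slot sizes are the ranks 1..n."""
    ranked = sorted(population, key=fitness)      # worst ... best
    n = len(ranked)
    pick = rng.uniform(0, n * (n + 1) / 2)        # sum of ranks 1..n
    acc = 0.0
    for rank, ind in enumerate(ranked, start=1):
        acc += rank
        if pick <= acc:
            return ind
    return ranked[-1]                             # numerical-edge fallback
```

Rank-based wheels keep selection pressure stable even when raw fitness values are tightly clustered.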
Operators: the main problem is the design of operators which preserve the feasibility of the solutions (Bilchev and Parmee, 1996). This is important for our problem since it intrinsically has two constraints: (i) the disk capacity of the nodes must not be exceeded; and (ii) each file must have at least one copy somewhere on the network. (If feasibility were not preserved by the operators, the fitness function would have to be modified by an appropriate penalty function in order to drive the population into the feasible region.) We have developed two main operators, both preserving feasibility, called safe-add and safe-delete. Safe-add works as follows:

For each node, randomly select and copy a file which is not already locally present and whose size is smaller than the available disk space. Check whether any of the nodes would respond faster by downloading files from the new locations and, if so, update the matrix {l_{i,j}}.
Safe-delete works as follows:

For each node, randomly select and delete a file, provided it is not the last copy. Update the matrix {l_{i,j}} to reflect the above changes.
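Safe-add might be sketched as below. Encoding the set of copies as the nodes referenced by the index matrix l is one possible reading of the chapter's data structures, and treating a larger c as a faster link follows the bytes-per-second definition given earlier:

```python
import random

def safe_add(l, s, d, c, rng=random):
    """Feasibility-preserving 'safe-add' sketch: each node tries to cache one
    randomly chosen file it does not already hold, capacity permitting; every
    node then switches to the new copy when its link to it is faster."""
    N, M = len(d), len(s)
    # nodes currently holding a copy of each file, derived from l
    holders = [{l[i][j] for i in range(N)} for j in range(M)]
    for node in range(N):
        used = sum(s[j] for j in range(M) if node in holders[j])
        candidates = [j for j in range(M)
                      if node not in holders[j] and used + s[j] <= d[node]]
        if candidates:
            j = rng.choice(candidates)
            holders[j].add(node)
            for i in range(N):            # repoint readers with a faster link
                if c[i][node] > c[i][l[i][j]]:
                    l[i][j] = node
    return l
```

Because files are only ever added, disk capacity is the sole constraint checked; the at-least-one-copy constraint cannot be violated by this operator.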
In our experiments we used a population size of 70 individuals evolved for 30 generations. During each generation, 50 safe-add and three safe-delete operators are applied. During selection, the best individual has a 5% greater chance of being selected than the second best, and so on.
13.4 Simulation Results
In our experiments we start with an offline simulation, during which the optimization algorithms are run while the file system is not in use (overnight, for example). In this scenario, we assume that both algorithms have enough time to finish their optimization before the file system is used again.

A typical simulation is shown in Figure 13.4. Tests are done using seven nodes and 100 files. All simulation graphs start from the same initial state of the distributed file system. The two optimization algorithms are then compared against a non-adaptive (static) file system, i.e. one in which no optimization is used. The experiments clearly show that the adaptive distributed file system produces better results than a static file system. The graphs also indicate the excellent performance of the GA optimizer, which consistently outperforms the greedy algorithm.
Figure 13.4 An offline simulation. Adaptive distributed file systems utilizing a genetic algorithm and a greedy algorithm, respectively, are compared against a static distributed file system. The experiments use seven nodes and 31 files.
To show the effect of delayed information, we run the greedy algorithm once using the usage pattern collected from the previous simulation step, P(t−1) (which is available in practice), and once using the actual P(t) (which is not known in practice). The difference in performance reveals how much better we could do if perfect information were available (Figure 13.5).
Figure 13.5 A greedy algorithm with perfect information is compared to a greedy algorithm with delayed information. In practice, we only have delayed information.
Figure 13.6 Online simulation. The circles indicate when the GA optimization takes place. The GA/greedy algorithm ratio is 60 (i.e. the GA is run once for every 60 runs of the greedy algorithm).
Figure 13.7 Online simulation. The GA/greedy algorithm ratio is 40. This is the critical ratio at which the average performance of the two algorithms is comparable.
Figure 13.8 Online simulation. The GA/greedy algorithm ratio is 30. The GA manages to maintain its performance advantage.