
Autonomic computing and Grid

Pratap Pattnaik, Kattamuri Ekanadham, and Joefon Jann

Thomas J. Watson Research Center, Yorktown Heights, New York, United States

Grid Computing – Making the Global Infrastructure a Reality. Edited by F. Berman, A. Hey and G. Fox. © 2003 John Wiley & Sons, Ltd. ISBN: 0-470-85319-0

13.1 INTRODUCTION

The goal of autonomic computing is the reduction of complexity in the management of large computing systems. The evolution of computing systems faces a continuous growth in the number of degrees of freedom the system must manage in order to be efficient. Two major factors contribute to the increase in the number of degrees of freedom. Historically, computing elements, such as CPU, memory, disks, network and so on, have advanced nonuniformly. The disparity between the capabilities/speeds of various elements opens up a number of different strategies for a task depending upon the environment. In turn, this calls for a dynamic strategy to make judicious choices for achieving targeted efficiency. Secondly, the systems tend to have a global scope in terms of the demand for their services and the resources they employ for rendering the services. Changes in the demands/resources in one part of the system can have a significant effect on other parts of the system. Recent experiences with Web servers (related to popular events such as the Olympics) emphasize the variability and unpredictability of demands and the need to rapidly react to the changes. A system must perceive the changes in the environment and must be ready with a variety of choices, so that suitable strategies can be quickly selected for the new environment.


The autonomic computing approach is to orchestrate the management of the functionalities, efficiencies and the qualities of services of large computing systems through logically distributed, autonomous controlling elements, and to achieve a harmonious functioning of the global system within the confines of its stipulated behavior, while individual elements make locally autonomous decisions. In this approach, one moves from a resource/entitlement model to a goal-oriented model. In order to significantly reduce system management complexity, one must clearly delineate the boundaries of these controlling elements. The reduction in complexity is achieved mainly by making a significant amount of decisions locally in these elements. If the local decision process is associated with a smaller time constant, it is easy to revise it before large damage is done globally.

Since Grid computing, by its very nature, involves the controlled sharing of computing resources across distributed, autonomous systems, we believe that there are a number of synergistic elements between Grid computing and autonomic computing and that advances in the architecture of either one of these areas will help the other. In Grid computing also, local servers are responsible for enforcing local security objectives and for managing various queuing and scheduling disciplines. Thus, the concept of cooperation in a federation of several autonomic components to accomplish a global objective is a common theme for both autonomic computing and Grid computing. As the architecture of Grid computing continues to improve and rapidly evolve, as expounded in a number of excellent papers in this issue, we have taken the approach of describing the autonomic server architecture in this paper. We make some observations on the ways we perceive it to be a useful part of the Grid architecture evolution.

The choice of the term autonomic in autonomic computing is influenced by an analogy with biological systems [1, 2]. In this analogy, a component of a system is like an organism that survives in an environment. A vital aspect of such an organism is a symbiotic relationship with others in the environment – that is, it renders certain services to others in the environment and it receives certain services rendered by others in the environment. A more interesting aspect for our analogy is its adaptivity – that is, it makes constant efforts to change its behavior in order to fit into its environment. In the short term, the organism perseveres to perform its functions despite adverse circumstances, by readjusting itself within the degrees of freedom it has. In the long term, evolution of a new species takes place, where environmental changes force permanent changes to the functionality and behavior. While there may be many ways to perform a function, an organism uses its local knowledge to adopt a method that economizes its resources. Rapid response to external stimuli in order to adapt to the changing environment is the key aspect we are attempting to mimic in autonomic systems.

The autonomic computing paradigm imparts this same viewpoint to the components of a computing system. The environment is the collection of components in a large system. The services performed by a component are reflected in the advertised methods of the component that can be invoked by others. Likewise, a component receives the services of others by invoking their methods. The semantics of these methods constitute the behavior that the component attempts to preserve in the short term. In the long term, as technology progresses, new resources and new methods may be introduced. Like organisms, the components are not perfect. They do not always exhibit the advertised behavior exactly. There can be errors, impreciseness or even cold failures. An autonomic component watches for these variations in the behavior of other components that it interacts with and adjusts to the variations.

Reduction of complexity is not a new goal. During the evolution of computing systems, several concepts emerged that help manage the complexity. Two notable concepts are particularly relevant here: object-oriented programming and fault-tolerant computing. Object-oriented designs introduced the concept of abstraction, in which the interface specification of an object is separated from its implementation. Thus, implementation of an object can proceed independent of the implementation of dependent objects, since it uses only their interface specifications. The rest of the system is spared from knowing or dealing with the complexity of the internal details of the implementation of the object. Notions of hierarchical construction, inheritance and overloading render easy development of different functional behaviors, while at the same time enabling them to reuse the common parts. An autonomic system takes a similar approach, except that the alternative implementations are designed for improving the performance, rather than providing different behaviors. The environment is constantly monitored and suitable implementations are dynamically chosen for best performance.

Fault-tolerant systems are designed with additional support that can detect and correct any fault out of a predetermined set of faults. Usually, redundancy is employed to overcome faults. Autonomic systems generalize the notion of fault to encompass any behavior that deviates from the expected or the negotiated norm, including performance degradation or change-of-service costs based on resource changes. Autonomic systems do not expect that other components operate correctly according to stipulated behavior. The input–output responses of a component are constantly monitored and, when a component's behavior deviates from the expectation, the autonomic system readjusts itself either by switching to an alternative component or by altering its own input–output response suitably.

Section 13.2 describes the basic structure of a typical autonomic component, delineating its behavior, observation of environment, choices of implementation and an adaptive strategy. While many system implementations may have these aspects buried in some detail, it is necessary to identify them and delineate them, so that the autonomic nature of the design can be improved in a systematic manner. Section 13.3 illustrates two speculative methodologies to collect environmental information. Some examples from server design are given to illustrate them. Section 13.4 elaborates on the role of these aspects in a Grid computing environment.

13.2 AUTONOMIC SERVER COMPONENTS

The basic structure of any Autonomic Server Component, C, is depicted in Figure 13.1, in which all agents that interact with C are lumped into one entity, called the environment. This includes clients that submit input requests to C, other components whose services can be invoked by C and resource managers that control the resources for C. An autonomic component has four basic specifications:

AutonomicComp ::= ⟨BehaviorSpec, StateSpec, MethodSpec, StrategySpec⟩

BehaviorSpec ::= ⟨InputSet Σ, OutputSet Φ, ValidityRelation β ⊆ Σ × Φ⟩

StateSpec ::= ⟨InternalState ψ, EstimatedExternalState ξ̂⟩

MethodSpec ::= ⟨MethodSet Π, each π ∈ Π : Σ × ψ × ξ̂ → Φ × ψ × ξ̂⟩

StrategySpec ::= ⟨Efficiency η, Strategy α : Σ × ψ × ξ̂ → Π⟩

The functional behavior of C is captured by a relation, β ⊆ Σ × Φ, where Σ is the input alphabet, Φ is the output alphabet and β is a relation specifying valid input–output pairs. Thus, if C receives an input u ∈ Σ, it delivers an output v ∈ Φ satisfying the relation β(u, v). The output variability permitted by the relation β (as opposed to a function) is very common to most systems. As illustrated in Figure 13.1, a client is satisfied to get any one of the many possible outputs (v, v′, v′′, ...) for a given input u, as long as they satisfy some property specified by β. All implementations of the component preserve this functional behavior.
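
To make the four-part specification concrete, the following minimal sketch (our own rendering, not from the chapter; all class and attribute names are invented) expresses it in Python: the behavior is a validity predicate over input–output pairs, the state splits into an internal part ψ and an estimate ξ̂ of the environment, and each implementation π maps an input plus state to an output plus updated state.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Input = Any    # u ∈ Σ
Output = Any   # v ∈ Φ

@dataclass
class AutonomicComponent:
    # BehaviorSpec: β ⊆ Σ × Φ, given here as a validity predicate β(u, v)
    beta: Callable[[Input, Output], bool]
    # StateSpec: internal state ψ and estimated external state ξ̂
    psi: Dict[str, Any] = field(default_factory=dict)
    xi_hat: Dict[str, Any] = field(default_factory=dict)
    # MethodSpec: Π, each π : Σ × ψ × ξ̂ → Φ × ψ × ξ̂
    implementations: List[Callable] = field(default_factory=list)
    # StrategySpec: α : Σ × ψ × ξ̂ → Π (returned as an index into Π)
    alpha: Callable[[Input, Dict, Dict], int] = None

    def serve(self, u: Input) -> Output:
        """Pick an implementation with α, run it, and check β(u, v)."""
        pi = self.implementations[self.alpha(u, self.psi, self.xi_hat)]
        v, self.psi, self.xi_hat = pi(u, self.psi, self.xi_hat)
        assert self.beta(u, v), "output violates the advertised behavior"
        return v
```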

The state information maintained by a component comprises two parts: internal state ψ and external state ξ. The internal state, ψ, contains the data structures used by an implementation and any other variables used to keep track of input–output history and resource utilization. The external state ξ is an abstraction of the environment of C and includes information on the input arrival process, the current level of resources available for C and the performance levels of other components of the system whose services are invoked by C. The component C has no control over the variability in the ingredients of ξ, as they are governed by agents outside C. The input arrival process is clearly outside C. We assume an external global resource manager that may supply or withdraw resources from C dynamically. Finally, the component C has no control over how other components are performing and must expect arbitrary variations (including failure) in their health. Thus the state information, ξ, is dynamically changing and is distributed throughout the system.

[Figure 13.1 The Autonomic Server Component C within its environment (clients, resource managers and other services). C maintains an internal state ψ, an estimated state of the environment ξ̂ and a set of implementations π1, π2, π3 ∈ Π; for an input u ∈ Σ it returns some output v ∈ Φ satisfying β(u, v).]


C cannot have complete and accurate knowledge of ξ at any time. Hence, the best C can do is to keep an estimate, ξ̂, of ξ at any time and periodically update it as and when it receives correct information from the appropriate sources.

An implementation, π, is the usual input–output transformation based on state, π : Σ × ψ × ξ̂ → Φ × ψ × ξ̂, where an input–output pair u ∈ Σ and v ∈ Φ produced will satisfy the relation β(u, v). There must be many implementations, π ∈ Π, available for the autonomic component in order to adapt to the situation. A single implementation provides no degree of freedom. Each implementation may require different resources and data structures. For any given input, different implementations may produce different outputs (of different quality), although all of them must satisfy the relation β.

Finally, the intelligence of the autonomic component is in the algorithm α that chooses the best implementation for any given input and state. Clearly, switching from one implementation to another might be expensive, as it involves restructuring of resources and data. The component must establish a cost model that defines the efficiency, η, at which the component is operating at any time. The objective is to maximize η. In principle, the strategy, α, evaluates whether it is worthwhile to switch the current implementation for a given input and state, based on the costs involved and the benefit expected. Thus, the strategy is a function of the form α : Σ × ψ × ξ̂ → Π. As long as the current implementation is in place, the component continues to make local decisions based on its estimate of the external state. When actual observation of the external state indicates significant deviations (from the estimate), an evaluation is made to choose the right implementation, to optimize η. This leads to the following two aspects that can be studied separately.
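
As a rough illustration of this switching decision (our own sketch, not the chapter's algorithm; the cost model and the expected_throughput method are placeholders), a strategy α can compare the efficiency expected from the current implementation against the best alternative, minus the one-time cost of restructuring resources and data:

```python
def choose_implementation(current, candidates, xi_hat, switch_cost):
    """Hypothetical strategy α: keep the current implementation unless the
    expected gain from the best alternative outweighs the cost of switching."""

    def efficiency(impl):
        # Placeholder cost model η: expected service rate under the
        # estimated external state ξ̂ (arrival rate, resource level, ...).
        # expected_throughput is an assumed method on implementation objects.
        return impl.expected_throughput(xi_hat)

    best = max(candidates, key=efficiency)
    gain = efficiency(best) - efficiency(current)
    return best if gain > switch_cost else current
```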

Firstly, given that the component has up-to-date and accurate knowledge of the state of the environment, it must have an algorithm to determine the best implementation to adopt. This is highly dependent upon the system characteristics, the costs associated and the estimated benefits from different implementations. An interesting design criterion is to choose the time constants for change of implementation, so that the system enters a stable state quickly. Criteria and models for such designs are under investigation and here we give a few examples.

Secondly, a component may keep an estimate of the external state (which is distributed and dynamically changing) and must devise a means to correct its estimate periodically, so that the deviation from the actual state is kept within bounds. We examine this question in the next section.

13.3 APPROXIMATION WITH IMPERFECT KNOWLEDGE

A general problem faced by all autonomic components is the maintenance of an estimate, ξ̂, of a distributed and dynamically changing external state, ξ, as accurately as possible. We examine two possible ways of doing this: by self-observation and by collective observation.


13.3.1 Self-observation

Here a component operates completely autonomously and does not receive any explicit external state information from its environment. Instead, the component deduces information on its environment solely from its own interactions with the environment. This is indeed the way organisms operate in a biological environment. (No one explicitly tells an animal that there is a fire on the east side. It senses the temperatures as it tries to move around, organizes in its memory the gradients and, if lucky, moves west and escapes the fire.) Following the analogy, an autonomic component keeps a log of the input–output history with its clients, to track both the quality that it is rendering to its clients as well as the pattern of input arrivals. Similarly, it keeps the history of its interactions with each external service that it uses and tracks its quality. On the basis of these observations, it formulates the estimate, ξ̂, of the state of its environment, which is used in its local decisions to adapt suitable implementations. The estimate is constantly revised as new inputs arrive. This strategy results in a very independent component that can survive in any environment. However, the component cannot quickly react to the rapidly changing environment. It takes a few interactions before it can assess the change in its environment. Thus, it will have poor impulse response, but adapts very nicely to gradually changing circumstances. We illustrate this with the example of a memory allocator.
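
One simple way to realize self-observation, sketched below under our own assumptions (the smoothing factor and the particular statistics are illustrative, not from the chapter), is to keep exponentially weighted averages of what the component can measure directly: inter-arrival times of requests and the response times of each external service it calls. These averages form the estimate ξ̂ used by the local strategy.

```python
import time

class SelfObservation:
    """Builds an estimate ξ̂ of the environment purely from the component's
    own interactions (no explicit state updates from other components)."""

    def __init__(self, smoothing=0.2):
        self.smoothing = smoothing
        self.xi_hat = {"arrival_rate": 0.0, "service_latency": {}}
        self._last_arrival = None

    def record_arrival(self, now=None):
        """Update the estimated request arrival rate from inter-arrival gaps."""
        now = time.monotonic() if now is None else now
        if self._last_arrival is not None:
            gap = now - self._last_arrival
            rate = 1.0 / gap if gap > 0 else self.xi_hat["arrival_rate"]
            old = self.xi_hat["arrival_rate"]
            self.xi_hat["arrival_rate"] = (1 - self.smoothing) * old + self.smoothing * rate
        self._last_arrival = now

    def record_service_call(self, service, latency):
        """Track the observed quality (latency) of each external service used."""
        old = self.xi_hat["service_latency"].get(service, latency)
        self.xi_hat["service_latency"][service] = (
            (1 - self.smoothing) * old + self.smoothing * latency
        )
```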

13.3.1.1 Example 1: Memory allocator

This simple example illustrates how an autonomic server steers input requests with frequently observed characteristics to implementations that specialize in efficient handling of those requests. The allocator does not require any resources or external services. Hence, the only external state it needs to speculate upon, ξ̂, is the pattern of inputs – specifically, how frequently a particular size has been requested in the recent past.

The behavior, (Σ, Φ, β), of a memory allocator can be summarized as follows. The input set Σ has two kinds of inputs: alloc(n) and free(a); the output set Φ has three possible responses: null, error and an address. Alloc(n) is a request for a block of n bytes. The corresponding output is an address of a block or an error indicating inability to allocate. The relation β validates any block, as long as it has the requested number of free bytes in it. Free(a) returns a previously allocated block. The system checks that the block is indeed previously allocated and returns null or error accordingly.

The quality of service, η, must balance several considerations. A client expects quick response time and also that its request is never denied. A second criterion is locality of allocated blocks. If the addresses are spread out widely in the address space, the client is likely to incur more translation overheads and prefers all the blocks to be within a compact region of addresses. Finally, the system would like to minimize fragmentation and avoid keeping a large set of noncontiguous blocks that prevent it from satisfying requests for large blocks.

We illustrate a Π that has two implementations. The first is a linked-list allocator, which keeps a list of the addresses and sizes of the free blocks that it has. To serve a new allocation request, it searches the list to find a block that is larger than (or equal to) the requested size. It divides the block if necessary, deletes the allocated block from the list and returns its address as the output. When the block is returned, it searches the list again and tries to merge the block with any free adjacent portions in the free list. The second strategy is called slab allocation. It reserves a contiguous chunk of memory, called a slab, for each size known to be frequently used. When a slab exists for the requested size, it peels off a block from that slab and returns it. When a block (allocated from a slab) is returned to it, it links it back to the slab. When no slab exists for a request, it fails to allocate.

The internal state, ψ, contains the data structures that handle the linked list and the list of available slabs. The estimated environmental state, ξ̂, contains data structures to track the frequency at which blocks of each size are requested or released. The strategy, α, is to choose the slab allocator when a slab exists for the requested size. Otherwise, the linked-list allocator is used. When the frequency for a size (for which no slab exists) exceeds a threshold, a new slab is created for it, so that subsequent requests for that size are served faster. When a slab is unused for a long time, it is returned to the linked list. The cost of allocating from a slab is usually smaller than the cost of allocating from a linked list, which in turn is smaller than the cost of creating a new slab. The allocator sets the thresholds based on these relative costs. Thus, the allocator autonomically reorganizes its data structures based on the pattern of sizes in the inputs.
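
The sketch below captures the decision logic of this example in simplified Python (our own rendering; the threshold value, slab size and the free-list placeholder are stand-ins for real allocator data structures): requests for sizes that have recently been frequent are served from a dedicated slab, everything else falls back to the general free list, and a new slab is created once a size crosses a frequency threshold.

```python
from collections import Counter

class AutonomicAllocator:
    """Simplified model of the strategy α in Example 1 (not a real allocator)."""

    def __init__(self, create_threshold=32):
        self.freq = Counter()        # part of ξ̂: how often each size is requested
        self.slabs = {}              # size -> list of free slots (internal state ψ)
        self.create_threshold = create_threshold

    def alloc(self, n):
        self.freq[n] += 1
        if n in self.slabs and self.slabs[n]:
            return ("slab", n, self.slabs[n].pop())        # cheapest path
        if self.freq[n] >= self.create_threshold:
            self.slabs[n] = list(range(64))                # create a new slab (expensive, done once)
            return ("slab", n, self.slabs[n].pop())
        return ("free-list", n, self._linked_list_alloc(n))  # general fallback

    def free(self, block):
        kind, n, slot = block
        if kind == "slab":
            self.slabs[n].append(slot)                     # link the block back to its slab
        # a free-list block would be merged with adjacent free space here

    def _linked_list_alloc(self, n):
        return object()   # placeholder for a first-fit search over the free list
```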

13.3.2 Collective observation

In general, a system consists of a collection of components that are interconnected by the services they offer to each other. As noted before, part of the environmental state, ξ, that is relevant to a component, C, is affected by the states of other components. For instance, if D is a component that provides services for C, then C can make more intelligent decisions if it has up-to-date knowledge of the state of D. If C is periodically updated about the state of D, the performance can be better than what can be accomplished by self-observation. To elaborate on this, consider a system of n interacting components, C_i, i = 1, ..., n. Let S_ii(t) denote the portion of the state of C_i at time t that is relevant to other components in the system. For each i ≠ j, C_i keeps an estimate, S_ij(t), of the corresponding state, S_jj(t), of C_j. Thus, each component has an accurate value of its own state and an estimated value of the states of other components. Our objective is to come up with a communication strategy that minimizes the norm Σ_{i≠j} |S_ij(t) − S_jj(t)| for any time t. This problem is similar to the time synchronization problem and the best solution is for all components to broadcast their states to everyone after every time step. But since the broadcasts are expensive, it is desirable to come up with a solution that minimizes the communication unless the error exceeds certain chosen limits. For instance, let us assume that each component can estimate how its state is going to change in the near future. Let ∇_i(t) be the estimated derivative of S_ii(t) at time t – that is, the estimated value of S_ii(t + dt) is given by S_ii(t) + ∇_i(t)·dt. There can be two approaches to using this information.

13.3.2.1 Subscriber approach (push paradigm)

Suppose a component C_j is interested in the state of C_i. Then C_j will subscribe to C_i and obtain a tuple of the form ⟨t, S_ii(t), ∇_i(t)⟩, which is stored as part of its estimate of the external state, ξ̂. This means that at time t the state of C_i was S_ii(t) and it grows at the rate of ∇_i(t), so that C_j can estimate the state of C_i at a future time, t + δt, as S_ii(t) + ∇_i(t)·δt. Component C_i constantly monitors its own state, S_ii(t), and whenever the value |S_ii(t) + ∇_i(t)·δt − S_ii(t + δt)| exceeds a tolerance limit, it computes a new gradient, ∇_i(t + δt), and sends to all its subscribers the new tuple ⟨t + δt, S_ii(t + δt), ∇_i(t + δt)⟩. The subscribers replace the tuple in their ξ̂ with the new information. Thus, the bandwidth of updates is proportional to the rate at which states change. Also, depending upon the tolerance level, the system can have a rapid impulse response.
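
A compact rendering of this push protocol is sketched below (our own code; the gradient is a simple finite difference and the tolerance is a free parameter): the source republishes ⟨t, S, ∇⟩ only when its linear prediction drifts beyond the tolerance, and subscribers extrapolate between updates.

```python
class PushSource:
    """Component C_i publishing ⟨t, S_ii(t), ∇_i(t)⟩ to its subscribers (sketch)."""

    def __init__(self, tolerance):
        self.tolerance = tolerance
        self.subscribers = []
        self.t0, self.s0, self.grad = 0.0, 0.0, 0.0   # last published tuple

    def observe(self, t, s):
        """Called whenever C_i inspects its own state S_ii(t)."""
        predicted = self.s0 + self.grad * (t - self.t0)
        if abs(predicted - s) > self.tolerance:
            # recompute the gradient and push a fresh tuple to all subscribers
            self.grad = (s - self.s0) / (t - self.t0) if t > self.t0 else 0.0
            self.t0, self.s0 = t, s
            for sub in self.subscribers:
                sub.update(t, s, self.grad)

class Subscriber:
    """Component C_j keeping its estimate Ŝ_ij in ξ̂ and extrapolating."""

    def __init__(self):
        self.t0, self.s0, self.grad = 0.0, 0.0, 0.0

    def update(self, t, s, grad):
        self.t0, self.s0, self.grad = t, s, grad

    def estimate(self, t):
        return self.s0 + self.grad * (t - self.t0)
```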

13.3.2.2 Enquirer approach (pull paradigm)

This is a simple variation of the above approach, where an update is sent only upon explicit request from a subscriber. Each subscriber may set its own tolerance limit and monitor the variation. If the current tuple is ⟨t, S_ii(t), ∇_i(t)⟩, the subscriber requests a new update when the increment ∇_i(t)·δt exceeds its tolerance limit. This relieves the source component of the burden of keeping track of subscribers and periodically updating them. Since all information flow is by demand from a requester, the impulse response can be poor if the requester chooses a poor tolerance limit.

13.3.2.3 Example 2: Routing by pressure propagation

This example abstracts a common situation that occurs in Web services. It illustrates how components communicate their state to each other, so that each component can make decisions to improve the overall quality of service. The behavior, β, can be summarized as follows. The system is a collection of components, each of which receives transactions from outside. Each component is capable of processing any transaction, regardless of where it enters the system. Each component maintains an input queue of transactions and processes them sequentially. When a new transaction arrives at a component, it is entered into the input queue of a selected component. This selection is the autonomic aspect here and the objective is to minimize the response time for each transaction.

Each component is initialized with some constant structural information about the system, ⟨µ_i, τ_ij⟩, where µ_i is the constant time taken by component C_i to process any transaction and τ_ij is the time taken for C_i to send a transaction to C_j. Thus, if a transaction that entered C_i was transferred and served at C_j, then its total response time is given by τ_ij + (1 + Q_j)·µ_j, where Q_j is the length of the input queue at C_j when the transaction entered the queue there. In order to give the best response to the transaction, C_i chooses to forward it to the C_j which minimizes [τ_ij + (1 + Q_j)·µ_j] over all possible j. Since C_i has no precise knowledge of Q_j, it must resort to speculation, using the collective observation scheme.

As described in the collective observation scheme, each component, C_i, maintains for every other component C_j the tuple ⟨Q_j(t), ∇_j(t)⟩, from which the queue size of C_j at time t + δt can be estimated as Q_j(t) + ∇_j(t)·δt. When a request arrives at C_i at time t + δt, it computes the target j which minimizes [τ_ij + (1 + Q_j(t) + ∇_j(t)·δt)·µ_j] over all possible j. The request is sent to be queued at C_j. Each component, C_j, broadcasts a new tuple, ⟨Q_j(t + δt), ∇_j(t + δt)⟩, to all other components whenever the quantity |Q_j(t) + ∇_j(t)·δt − Q_j(t + δt)| exceeds a tolerance limit.

13.4 GRID COMPUTING

The primary objective of Grid computing [3] is to facilitate controlled sharing of resources and services that are made available in a heterogeneous and distributed system. Both heterogeneity and distributedness force the interactions between entities to be based on protocols that specify the exchanges of information in a manner that is independent of how a specific resource/service is implemented. Thus, a protocol is independent of details such as the libraries, language, operating system or hardware employed in the implementation. In particular, implementation of a protocol communication between two heterogeneous entities will involve some changes in the types and formats depending upon the two systems. Similarly, implementation of a protocol communication between two distributed entities will involve some marshaling and demarshaling of information and instantiation of local stubs to mimic the remote calls. The fabric layer of the Grid architecture defines some commonly used protocols for accessing resources/services in such a system. Since the interacting entities span multiple administrative domains, one needs to put in place protocols for authentication and security. These are provided by the connectivity layer of the Grid architecture. A Service is an abstraction that guarantees a specified behavior, if interactions adhere to the protocols defined for the service. Effort is under way for standardization of the means by which a behavior can be specified, so that clients of the services can plan their interactions accordingly, and the implementers of the services enforce the behavior. The resource layer of the Grid architecture defines certain basic protocols that are needed for acquiring and using the resources available. Since there can be a variety of ways in which resource sharing can be done, the next layer, called the collective layer, describes protocols for discovering available services, negotiating for desired services, and initiating, monitoring and accounting of services chosen by clients.

13.4.1 Synergy between the two approaches

The service abstraction of the Grid architecture maps to the notion of a component of autonomic computing described in Section 13.2. As we noted with components, the implementation of a high-level service for a virtual organization often involves several other resources/services, which are heterogeneous and distributed. The behavior of a service is the BehaviorSpec of a component in Section 13.2, and an implementation must ensure that it provides the advertised behavior under all conditions. Since a service depends upon other services and on the resources that are allocated for its implementation, prudence dictates that its design be autonomic. Hence, it must monitor the behavior of its dependent services, its own level of resources that may be controlled by other agents and the quality of service it is providing to its clients. In turn, this implies that a service implementation must have a strategy such as α of Section 13.2, which must adapt to the changing environment and optimize the performance by choosing appropriate resources. Thus, all the considerations we discussed under autonomic computing apply to this situation. In particular, there must be general provisions for the maintenance of accurate estimates of global states as discussed in Section 13.3, using either the self-observation or the collective observation method. A specialized protocol in the collective layer of the Grid architecture could possibly help this function.

Consider an example of a data-mining service offered on a Grid. There may be one or more implementations of the data-mining service and each of them requires database services on the appropriate data repositories. All the implementations of a service form a collective and they can coordinate to balance their loads, redirecting requests for services arriving at one component to components that have lesser loads. An autonomic data-mining service implementation may change its resources and its database services based on its performance and the perceived levels of service that it is receiving. Recursively, the database services will have to be autonomic to optimize the utilization of their services. Thus, the entire paradigm boils down to designing each service from an autonomic perspective, incorporating logic to monitor performance, discover resources and apply them as dictated by its objective function.

13.5 CONCLUDING REMARKS

As systems get increasingly complex, natural forces will automatically eliminate interactions with components whose complexity has to be understood by an interactor. The only components that survive are those that hide the complexity, provide a simple and stable interface and possess the intelligence to perceive the environmental changes and struggle to fit into the environment. While facets of this principle are present in various degrees in extant designs, explicit recognition of the need for being autonomic can make a big difference, and thrusts us toward designs that are robust, resilient and innovative. In the present era, where technological changes are so rapid, this principle assumes even greater importance, as adaptation to changes becomes paramount.

The first aspect of autonomic designs that we observe is the clear delineation of the interface of how a client perceives a server. Changes to the implementation of the service should not compromise this interface in any manner. The second aspect of an autonomic server is the need for monitoring the varying input characteristics of the clientele as well as the varying response characteristics of the servers on which this server is dependent. In the present-day environment, demands shift rapidly and cannot be anticipated most of the time. Similarly, components degrade and fail, and one must move away from deterministic behavior to fuzzy behaviors, where perturbations do occur and must be observed and acted upon. Finally, an autonomic server must be prepared to quickly adapt to the observed changes in inputs as well as dependent services. The perturbations are not only due to failures of components but also due to performance degradations caused by changing demands. Autonomic computing provides a unified approach to deal with both. A collective of services can collaborate to provide each other accurate information so that local decisions by each service contribute to global efficiency.

We observe commonalities between the objectives of the Grid and autonomic approaches. We believe that they must blend together and that Grid architecture must provide the necessary framework to facilitate the design of each service with an autonomic perspective. While
