This principle, expressed as Metcalfe’s Law, states that the potential value of a network is proportional to the number of active user nodes. More precisely, the usefulness of a network to new and existing users equals the square of the number of user nodes. Although networks with higher scale factors provide greater potential value per node, they are less economical to grow [13].

σ has implications on network risk. Large, scale-free networks imply distributed architectures, which, we see from prior discussion, have a greater likelihood of outage. But because there are fewer users served per node, the expected loss from a nodal outage is less. In a network of limited scale, where users are more concentrated at each node, greater damage can happen.
These points can be illustrated by estimating the minimum expected loss in a
network of N nodes, each with an outage potential p, equal to the probability of a failure. The probability of f failures occurring out of N possible nodes is statistically
characterized by the well-known Binomial distribution [14]:

P(f) = \frac{N!}{f!\,(N - f)!}\, p^{f} (1 - p)^{N - f}

If P(0) indicates the percentage of time that no failures occur, then 1 − P(0) is the percentage of time that one or more failures occur. If we next assume that σ is the average nodal loss per outage, measured in terms of percentage of users, then on a broad network basis, the minimum risk (or percent expected minimum loss) ρ for a network is given by:

ρ = σ [1 − P(0)]
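As a quick numerical sketch of these two expressions (the function name and example values below are illustrative assumptions, not figures from the text):

```python
from math import comb

def min_expected_loss(n_nodes, p_outage, sigma):
    """rho = sigma * (1 - P(0)): expected minimum loss, in percent of users,
    where P(0) is the binomial probability that no node fails."""
    p_zero = comb(n_nodes, 0) * (1 - p_outage) ** n_nodes  # (1 - p)^N
    return sigma * (1 - p_zero)

# Illustrative values: 50 nodes, 0.1% outage potential per node,
# average nodal loss of 2% of users per outage
print(round(min_expected_loss(50, 0.001, 2.0), 3))  # ~0.098% of users at risk
```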
Figure 2.14 graphically shows how ρ varies with network scale at different nodal outage potentials p. It shows that investing to reduce nodal outage potentials, regardless of scale, can ultimately still leave about one percent of the users at risk. Expanding the size of the network to reduce risk is more effective when the scale is limited or, in other words, users are more concentrated. Concentration will dominate risk up to a certain point, beyond which size has a greater influence on network risk. This analysis assumes uniform, random outage potential at each node. Of course, this assumption may not hold true in networks with a nonuniform user distribution and nodal outage potential. An intentional versus random outage at a concentrated node, such as one resulting from a directed security attack, can inflict great damage. Overall, scale-free networks with less concentrated nodal functionality are less vulnerable to directed, nonrandom outages than networks of limited scale.

Figure 2.13 Network scale factor (panels: relatively scale-free, exponential scale-free, relatively scale-limited; service and user nodes).
2.4.4 Complexity
Complexity in a network is characterized by variety. Use of many diverse network technologies, circuits, OSs, vendors, interfaces, devices, management systems, service providers, and suppliers is quite common and often done for valid reasons. Many networks are grown incrementally over time and result in à la carte designs. Such designs usually do not perform as well as a single integrated design. Even if a complex design has strong tolerance against errors, extra complexity can impede recovery activities and require more costly contingency.
The problems with complexity are attributed not only to system variety. They are also attributed to whether components are chosen correctly for their intended purpose and how well they are matched. A matched system is one in which components fit well with each other and do not encroach upon the limits of one another. Turnkey systems delivered by a single vendor or integrator are usually delivered with intentions of behaving as matched systems. On the other hand, unmatched systems are usually characterized by poorly integrated, piecemeal designs involving different vendor systems that are plugged together. The cost of maintaining the same levels of service increases exponentially as more complexity is used to deliver the service.
Consolidation of multiple components, interfaces, and functions into a single system or node can reduce the number of resources needed and eliminate potential points of failure and the interoperability issues associated with matching. But from previous discussion, we know that consolidation poses greater risk because the consolidated resource can result in a single point of failure with high damage potential, unless a redundant architecture is used.
This chapter reviewed some fundamental concepts of continuity that form the basis for much of the remaining discussion in this book. Many of these concepts can be applied to nearly all levels of networking. A basic comprehension of these principles provides the foundation for devising and applying continuity strategies that use many of the remedial techniques and technologies discussed further in this book.
Adverse events are defined as those that violate a well-defined envelope of operational performance criteria. Service disruptions in a network arise out of lack of preparedness rather than the adverse events themselves. Preparedness requires having the appropriate capabilities in place to address such events. It includes having an early detection mechanism to recognize them even before they happen; the ability to contain the effects of disruption to other systems; a failover process that can transfer service processing to other unaffected working components; a recovery mechanism to restore any failed components; and the means to resume normal operation following recovery.
Redundancy is a tactic that utilizes multiple resources so that if one resource is unable to provide service, another can. In networks, redundancy can add to capital and operating costs and should therefore be carefully designed into an operation. At a minimum, it should eliminate single points of failure—particularly those that support mission-critical services. There are various ways to implement redundancy. However, if not done correctly, it can be ineffective and provide a false sense of security.

Tolerance describes the ability to withstand disruptions and is usually expressed in terms of availability. A greater level of tolerance in a network or system implies lower transaction loss and higher cost. FT, FR, and HA are tolerance categories that are widely used to classify systems and services, with FT representing the highest level of tolerance. FR and HA solutions can be cost-effective alternatives to minimizing service disruption, but they may not guarantee transaction preservation during failover to the same degree as FT.
There are several key principles of network and system design to ensure continuity. Capacity should be put in the right places—indiscriminate placement of capacity can produce bottlenecks, which can lead to other service disruptions. Networks should be designed in compartments that each represent their own failure groups, so that a disruption in one compartment does not affect another. The network architecture should be balanced so that loss of a highly interconnected node does not disrupt the entire network.

Finally, the adage “the simpler, the better” prevails. Complexity should be discouraged at all levels. The variety of technologies, devices, systems, vendors, and services should be minimized. They should also be well matched. This means that each should be optimally qualified for its intended job and work well with other components.
[10] Nolle, T., “Balancing Risk,” Network Magazine, December 2001, p. 96.
[11] Sanborn, S., “Spreading Out the Safety Net,” Infoworld, April 1, 2002, pp. 38–41.
[12] Porter, D., “Nothing Is Unsinkable,” Enterprise Systems Journal, June 1998, pp. 20–26.
[13] Rybczynski, T., “Net-Value—The New Economics of Networking,” Computer Telephony Integration, April 1999, pp. 52–56.
[14] Bryant, E. C., Statistical Analysis, New York: McGraw-Hill Book Company, Inc., 1960, pp. 20–24.
CHAPTER 3
Continuity Metrics
Metrics are quantitative measures of system or network behavior. We use metrics to characterize system behavior so that decisions can be made regarding how to manage and operate them efficiently. Good metrics are those that are easily understood in terms of what they measure and how they convey system or network behavior. There is no single metric that can convey the adequacy of a mission-critical network’s operation. Using measures that describe the behavior of a single platform or portion of a network is insufficient. One must measure many aspects of a network to arrive at a clear picture of what is happening.
There is often no true mathematical way of combining metrics for a network. Unlike the stock market, use of a computed index to convey overall network status is often flawed. For one thing, many indices are the result of combining measures obtained from ordinal and cardinal scales, which is mathematically incorrect. Some measures are obtained through combination using empirically derived models. This can also be flawed because a metric is only valid within the ranges of data from which it was computed. The best way of combining measures is through human judgment. A network operator or manager must be trained to observe different metrics and use them to make decisions. Like a pilot, operators must interpret information from various gauges to decide the next maneuver.
Good, useful metrics provide a balance between data granularity and the effort required for computation. Many statistical approaches, such as experimental design, are aimed at providing the maximum amount of information with the least amount of sampling. The cost and the ability to obtain input data have improved over the years. Progress in computing and software has made it possible to conduct calculations using vast amounts of data in minimal time, impossible 20 or 30 years ago. The amount of time, number of samples, complexity, and cost all should be considered when designing metrics.
Metrics should be tailored to the item being measured. No single metric is applicable to everything in a network. Furthermore, a metric should be tied to a service objective. It should be used to express the extent to which an objective is being achieved. A metric should be tied to each objective in order to convey the degree to which it is satisfied.

Finally, computing a metric should be consistent when repeated over time; otherwise, comparing relative changes in the values would be meaningless. Repeated calculations should be based on the same type of data, the same data range, and the same sampling approach. More often than not, systems or network services are compared based on measures provided by a vendor or service provider. Comparing different vendors or providers using the measures they each supply is often difficult and sometimes fruitless, as each develops their metrics based on their own methodologies [1].
Recovery is all of the activities that must occur from the time of an outage to the time service is restored. These will vary among organizations and, depending on the context of use, within a mission-critical network environment. Activities involved to recover a component are somewhat different than those to recover an entire data center. But in either case, the general meaning remains the same. General recovery activities include declaration that an adverse event has occurred (or is about to occur); initialization of a failover process; system restoration or repair activities; and system restart, cutover, and resumption of service. Two key recovery metrics are described in the following sections.
3.1.1 Recovery Time Objective
The recovery time objective (RTO) is a target measure of the elapsed time interval between the occurrence of an adverse event and the restoration of service. RTO should be measured from the point when the disruption occurred until operation is resumed. In mission-critical environments, this means that operation is essentially in the same functional state as it was prior to the event. Some IT organizations may alter this definition by relaxing some of the operational state requirements after resumption and accepting partial operation as a resumed state. Likewise, some will define RTO based on the time of recognizing and declaring that an adverse event has occurred. This can be misleading because it does not take into account monitoring and detection time.
RTO is an objective, specified in hours and minutes—a target value, determined by an organization’s management, that represents an acceptable recovery time. What value an organization assigns to “acceptable” is influenced by a variety of factors, including the importance of the service and consequential revenue loss, the nature of the service, and the organization’s internal capabilities. In systems, it may even be specified in milliseconds. Some will also specify RTO in transactions or a comparable measure that conveys unit throughput of an entity. This approach is only valid if that entity’s throughput is constant over time.
RTOs can be applied to any network component—from an individual system to an entire data center. Organizations will define different RTOs for different aspects of their business. To define an RTO, an organization’s managers must determine how much service interruption their business can tolerate. They must determine how long a functional entity, such as a business process, can be unavailable. One may often see RTOs in the range of 24 to 48 hours for large systems, but these numbers do not reflect any industry standard. Virtual storefronts are unlikely to tolerate high RTOs without significant loss of revenue. Some vertical markets, such as banking, must adhere to financial industry requirements for disruption of transactions [2]. Cost ultimately drives the determination of an RTO. A high cost is required to achieve a low RTO for a particular process or operation. To achieve RTOs close to zero requires expensive automated recovery and redundancy [3]. As the target RTO
increases, the cost to achieve the RTO decreases. An RTO of long duration invites less expensive redundancy and more manual recovery operation. However, concurrent with this is business loss. As shown in Figure 3.1, loss is directly related to RTO—the longer the RTO, the greater the loss. During recovery, business loss can be realized in many ways, including lost productivity or transactions. This topic is discussed further in this chapter. At some point, there is an RTO whose costs can completely offset the losses during the recovery [4, 5].
It becomes evident that defining an RTO as a sole measure is meaningless without some idea of what level of service the recovery provides. Furthermore, different systems will have their own RTO curves. Critical systems will often have a much smaller RTO than less critical ones. They can also have comparable RTOs but with more stringent tolerance for loss. A tiered-assignment approach can be used. This involves defining levels of system criticality and then assigning an RTO value to each. So, for example, a three-level RTO target might look like this:

• Level 1—restore to same service level;
• Level 2—restore to 75% service level;
• Level 3—restore to 50% service level.
A time interval can be associated with each level, as well as a descriptor of the level of service provided. For example, a system assigned a level 2 RTO of 1 hour must complete recovery within that time frame and disrupt no more than 25% of service. A system can be assigned a level 1 RTO of 1 hour as well, but must restore to the same level of service. Level 1 may require failover procedures or recovery to a secondary system.
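A tiered assignment like this lends itself to a simple lookup. The sketch below is illustrative only—the tier values and the helper function are assumptions, not prescribed targets:

```python
# Hypothetical tier table: level -> (maximum recovery time in hours, service-level floor)
RTO_TIERS = {1: (1.0, 1.00), 2: (1.0, 0.75), 3: (4.0, 0.50)}

def meets_tier(level, recovery_hours, restored_service_fraction):
    """True if a recovery finished within the tier's RTO and restored at least
    the tier's required fraction of service."""
    max_hours, service_floor = RTO_TIERS[level]
    return recovery_hours <= max_hours and restored_service_fraction >= service_floor

print(meets_tier(2, 0.9, 0.80))  # True: under 1 hour and above the 75% floor
print(meets_tier(1, 0.9, 0.80))  # False: level 1 requires full service restoration
```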
Assuming that the service level is linearly proportional to time, RTOs across different levels can be equated on the same time scale. A time-equivalent RTO, RTOE, can thus be computed.

Figure 3.1 RTO versus loss and cost (recovery costs offset loss as RTO grows).
3.1.1.1 Recovery Time Components
The RTO interval must incorporate all of the activities to restore a network or component back to service. A flaw in any of the component activities could lead to significant violation of the RTO. To this end, each component activity can be assigned an RTO as well. The addition of each of these component RTOs may not necessarily equal the overall RTO because activities can be conducted in parallel. Some of these component RTOs can include time to detection and declaration of an adverse event, time to failover (sometimes referred to as a failover time objective), time to diagnose, and time to repair.
The last two items are typically a function of the network or system complexity and typically pose the greatest risk. In complex networks, one can expect that the likelihood of achieving an RTO for the time to diagnose and repair is small. Failover to a redundant system is usually the most appropriate countermeasure, as it can buy time for diagnostics and repair. A system or network operating in a failed state is somewhat like a twin-engine airplane flying on one engine. Its level of reliability is greatly reduced until diagnostics and repairs are made.
Figure 3.2 illustrates the continuum of areas of activity relative to a mission-critical network. Of course, these may vary but are applicable to most situations. The areas include the following:
• Network recovery. This is the time to restore voice or data communication following an adverse event. Network recovery will likely influence many other activities as well. For instance, recovery of backup data over a network could be affected until the network is restored.
• Data recovery. This is the time to retrieve backup data out of storage and deliver it to a recovery site, either physically or electronically. It also includes the time to load media (e.g., tape or disk) and install or reboot database applications. This is also referred to as the time to data (TTD) and is discussed further in the chapter on storage.
• Application recovery. This is the time to correct a malfunctioning application.
• Platform recovery. This is the time to restore a problematic platform to service operation.
• Service recovery. This represents recovery in the broadest sense. It represents the cumulative time to restore service from an end user’s perspective. It is, in essence, the result of an amalgamation of all of the preceding recovery times.
Figure 3.2 Recovery activity areas (network, data, application, and platform recovery) along a time scale.
All of these areas are discussed at greater length in the subsequent chapters of this book.
3.1.2 Recovery Point Objective
The recovery point objective (RPO) is used as a target metric for data recovery. It is also measured in terms of time, but it refers to the age or freshness of data required to restore operation following an adverse event. Data, in this context, might also include information regarding transactions not recorded or captured. Like RTO, the smaller the RPO, the higher the expected data recovery cost. Reloading a daily backup tape can satisfy a tolerance for no more than 24 hours’ worth of data. However, a tolerance for only one minute’s worth of data or transaction loss might require more costly data transfer methods, such as mirroring, which is discussed in the chapter on storage.
Some view the RPO as the elapsed time of data recovery in relation to the adverse event. This is actually the aforementioned TTD. RPO is the point in time to which the data must be recovered—sometimes referred to as the freshness window. It is the maximum tolerable elapsed time between the last safe backup and the point of recovery. An organization that can tolerate no data loss (i.e., RPO = 0) implies that data would have to be restored instantaneously following an adverse event and would have to employ a continuous backup system.
Figure 3.3 illustrates the relationship between TTD and RPO using a timeline. If we denote the time between the last data snapshot and an adverse event as a random variable ε, then it follows that TTD + ε must meet the RPO objective:

TTD + ε ≤ RPO
A target RPO should be chosen that does not exceed the snapshot interval (SI) and at best equals the SI. If data is not restored prior to the next scheduled snapshot, then the snapshot should be postponed, or risk further data corruption:

RPO ≤ SI
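Both conditions can be checked mechanically. A minimal sketch, assuming hour units and example values:

```python
def rpo_plan_ok(ttd_hours, epsilon_hours, rpo_hours, snapshot_interval_hours):
    """Check that TTD + epsilon <= RPO and that RPO <= SI (snapshot interval)."""
    meets_rpo = (ttd_hours + epsilon_hours) <= rpo_hours
    rpo_within_si = rpo_hours <= snapshot_interval_hours
    return meets_rpo and rpo_within_si

# Example: 2 hours to reload data, up to 1 hour between the last snapshot and the event,
# a 4-hour RPO, and snapshots taken every 24 hours
print(rpo_plan_ok(2, 1, 4, 24))  # True
```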
Figure 3.3 Relationship of RPO, RTO, and TTD (timeline showing the snapshot, the error, the restore operation, and the snapshot interval SI).
Trang 11The variable ε represents a margin of error in specifying an RPO, ing that TTD is usually a fixed quantity—which may not necessarily be true in allcases.
assum-There are some caveats to using an RPO RPO assumes that the integrity of thedata is preserved at recovery The value does not necessarily convey the quality andquantity of data that is to be restored [6] It is conceivable that an operation can pro-vide service with partial transaction data, until transactions are reconstructed at alater time If data is restored to a level such that full operation can take place, then inthe strictest sense the RPO has been satisfied
RPO also assumes a uniform transaction-recording rate over time, which may not necessarily be true. In other words, an RPO of one hour implicitly assumes that no more than one hour’s worth of transaction loss can be tolerated. In fact, if an adverse event took place during off hours, the likelihood is that hardly any transactions would be lost within an hour’s time. For this reason, different levels of RPO may need to be specified depending on the time of day.
3.1.3 RTO Versus RPO
RTO and RPO are not necessarily tied to each other, but they can be interrelated [7]. Figure 3.3 also illustrated the relationship between RTO and RPO. Specifying an RTO that is short in duration does not necessarily imply a short RPO. For example, although a system can be restored with working data within an RTO of an hour following an adverse event, it is acceptable for that data to be four hours old—the RPO. RTO specifies the maximum time duration to recover a network, system, or component. RPO defines how much working data, in terms of time, can be lost in the process. A system or network working under an RTO and RPO both equivalent to zero requires instantaneous recovery and essentially no data loss. Specification of both the RTO and RPO is driven mainly by economics.
Reliability is defined as the probability (or likelihood) that a network (or component) will perform satisfactorily during a specified period of time. It is measured by how long it takes for a network or system to fail (i.e., how long it continues to function until it ceases due to failure). Reliability and availability are often used interchangeably, but there is a subtle difference between them. Availability (discussed in Section 3.3) is the probability that a network is in service and available to users at any given instant in time.
The difference between reliability and availability is best explained through an analogy. A car, for example, may break down and require maintenance 5% of the time. It is therefore 95% reliable. However, suppose the same car is equally shared between two family members. To each, the car is available only 47.5% (50% × 95%) of the time, even though the car is very reliable. Even if the car was 100% reliable, the availability to each is still only 50%. To improve availability, the family members can purchase an additional car, so that each now has 100% auto availability.
3.2.1 Mean Time to Failure
Mean time to failure (MTTF) is a metric that is often used to characterize the operating life of a system. It is the amount of time from the placement of a system or component in service until it permanently fails. Passive components are not often considered in MTTF estimations. They can have lifetimes on the order of 20 years or so. Network cables are known to have lifetimes from three to 100 years or so, depending on where and how they are used. Active components, on the other hand, may likely have shorter MTTFs.
Ideally, accurately calculating MTTF requires the system to be monitored for its expected useful lifetime, which can be quite long. Whether estimation of the MTTF involves monitoring a system for several years or a hundred years, accurate MTTF computation is often impractical to obtain. Some standard prediction methods to estimate MTTF are found in Military Standards (MIL-217) and Naval Surface Warfare Center (NSWC) specifications, and telephony standards such as Telcordia Specifications (TR-332 Version 6) and French Telecommunications (RDF 2000). If a system is in operation until it is no longer of use, then one can say that the mission time of the device is assumed to be the same as the MTTF. In many military network installations, a system’s mission time may complete much sooner than the MTTF.

MTTF is viewed as an elapsed time. If a network element or system is not used all of the time, but at a periodic rate (e.g., every day during business hours), then the
percentage of time it is in operational use is referred to as the duty cycle. The duty cycle is defined as:

δ = OT / MTTF

where δ is the duty cycle and OT is the total operating time of the element. For a network circuit, for example, it is the fraction of time the circuit is transmitting. For a system or component such as a disk drive, it is the percentage of time the drive spends actively reading and writing. If, for example, the drive has an MTTF of 250,000 hours when it is in use 5% of the time (δ = 0.05), the same drive would have an MTTF of 125,000 hours if it were used twice as much (δ = 0.10). In other words, the more a system or device is in use, the shorter the life expectancy.
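Under the interpretation suggested by this example—a fixed budget of operating hours spread over calendar time—the elapsed MTTF scales inversely with the duty cycle. A minimal sketch, with the drive’s operating-hour life treated as an assumed constant:

```python
def elapsed_mttf(operating_life_hours, duty_cycle):
    """Elapsed (calendar) MTTF when a component with a fixed operating-hour life
    is used only a fraction delta of the time: MTTF = OT / delta."""
    return operating_life_hours / duty_cycle

# The drive above: roughly 12,500 hours of active read/write life
print(elapsed_mttf(12_500, 0.05))  # 250,000 elapsed hours
print(elapsed_mttf(12_500, 0.10))  # 125,000 elapsed hours when used twice as much
```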
3.2.2 Failure Rate
Systems will fail or be placed out of service for many reasons. A system upgrade, maintenance, or system fault may require placing a system out of service. A failure rate, F, can be defined to express failure frequency in terms of failures per unit time, say percentage of failures per 1,000 hours. System vendors often use statistical sampling methods to estimate average failure rates over large populations of components. These populations can be on the order of tens of thousands of components. Once components are embedded in a complex platform or network, their significance in the overall reliability becomes ambiguous. The more complex a system or network grows, the greater the likelihood of failure, even though the individual subsystems are highly reliable.

A failure may not necessarily mean that a system has stopped operating. A failure can also be associated with those circumstances in which a system is producing service at an unsatisfactory performance level and hence is of little or no use. The failure rate, F, of a system can be estimated as follows:

F = f / (system’s useful life)     (3.5)

where f is the number of failures experienced during a system’s useful life or mission time (i.e., the total time a system is performing service operations). Many failure estimation techniques assume an exponential distribution where the failure rate is constant with time.
esti-3.2.3 Mean Time to Recovery
Mean time to recovery (MTTR) is sometimes referred to as mean time to repair or restore. In either case, it means the same thing. It is the time required to restore operation in a component that has stopped operating or that is not operating at a satisfactory performance level. It includes the total time it takes to restore the component to full operation. It could include things like diagnosis, repair, replacement, reboot, and restart. MTTR is expressed in units of time. The time to diagnose can typically present the most uncertainty in estimating MTTR and can thus have a profound effect on MTTR and, ultimately, system availability.
MTTR can be estimated from observed data in several ways. The most common method is to simply obtain the sum total of all observed restoration times and divide by the number of reported outages or trouble tickets. MTTR can be used to estimate the restoration rate, µ (sometimes referred to as the recovery rate), of a system as follows:

µ = 1 / MTTR

where µ is used to convey the recoverability of a system. Systems that minimize MTTR or that have a high recoverability µ should be favored. MTTR is also a primary measure of availability. The availability of systems with numerous components will be bound by those having the longest MTTR.
3.2.4 Mean Time Between Failure
The mean time between failure (MTBF) is a metric that conveys the mean or average life of a system based on the frequency of system outages or failures. For this reason, it is different than MTTF, although the two are quite often used interchangeably. Also, MTBF is sometimes referred to as the mean time between system outages (MTBSO) [8], depending on the context of use. For our purposes, we will use MTBF because it is the more recognizable metric.

MTBF is a measure that system vendors often use to compare their product to another [9]. System vendors will quote an MTBF without any basis or justification. Many system vendors may quote an MTTF for a product, which may actually be the computed MTBF. Because of the complexity of today’s systems, computation of a true MTBF for a platform can be daunting. Another issue is that mission-critical network systems do not, by definition, function as isolated items. An MTBF usually conveys stand-alone operation. If the MTBF is reached, considerable operational risk is incurred.
Trang 14A system with a low MTBF will require more servicing and consequently tional staffing, monitoring, and spare components This typically implies highermaintenance costs but lower capital costs A high MTBF, on the other hand, indi-cates that a system will run longer between failures and is of higher quality Thismay imply a higher capital cost but lower maintenance cost Some systems will try
addi-to integrate high-quality components having the highest MTBFs, but their level ofintegration is such that the MTBF of the overall system is still low
MTBF is measured based on the number of failures during the service life of a system, or simply the inverse of the failure rate, F:

MTBF = 1 / F

For example, if a system has an MTTF of 100 years and experiences three failures in that time (f = 3), then the MTBF is approximately 33.3 years. Many use MTBF to convey the reliability of a system in terms of time. The higher the MTBF, the more reliable a system is.
The MTBF for a system can be estimated in various ways. If MTBF is estimated as an arithmetic mean of observed MTBF values across N systems, one could assume that MTBF represents the point in time at which approximately half of the systems have had a failure, assuming that F is uniform over time. In general, the percentage, p, of devices that could fail in a given year is then estimated as:

p = 1 / (2 × MTBF)     (3.8)

So, for example, in a large network with an estimated MTBF of 20 years, one would expect on average about 2.5% of the devices to fail in a given year. It is important to recognize that some components might fail before reaching the MTBF, while others might outperform it without problem. It is best to use MTBF with the most critical components of a system or network, particularly those that are potentially single points of failure.
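The 2.5% figure follows directly from (3.8). A one-line check (the 1,000-node count is an illustrative value):

```python
def annual_failure_fraction(mtbf_years):
    """p = 1 / (2 * MTBF), with MTBF expressed in years."""
    return 1.0 / (2.0 * mtbf_years)

print(annual_failure_fraction(20))         # 0.025 -> about 2.5% of devices per year
print(1000 * annual_failure_fraction(20))  # ~25 expected failures in a 1,000-node network
```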
To plan recovery and network-management operations and resources, it is often valuable to have a feel for how many simultaneous outages or failures can occur in a network consisting of N nodes. If there were only one node (N = 1) in a network, then p is the probability of that node failing in a day. However, as the number of nodes N in a network increases, so will the likelihood of having more failures in a given day. In general, the probability of f discrete events occurring out of N possible outcomes, each with a probability of occurrence p, is statistically characterized by the well-known Binomial distribution [10]:

P(f) = \frac{N!}{f!\,(N - f)!}\, p^{f} (1 - p)^{N - f}     (3.9)

where P(f) is the probability of f events occurring. If we assume that N is the maximum number of possible node failures in a given day and p is the probability of an individual node failing, then P(f) is the probability of f failures in a given time frame. If we substitute the expression for p obtained in (3.8) into (3.9), then the probability of having f failures in a network (or what percentage of time f failures are likely to occur) is [11]:

P(f) = \frac{N!\,(2\,\mathrm{MTBF} - 1)^{N - f}}{f!\,(N - f)!\,(2\,\mathrm{MTBF})^{N}}     (3.10)
This expression assumes that all nodes have an equal probability of failing, which obviously may not always be true. However, it can be used as an approximation for large networks. It also assumes that all failures are independent of each other, which is another simplifying assumption that may not necessarily be true. In fact, many times a failure will lead to other failures, creating a rolling failure. In the expression, it is assumed the total number of failure outcomes N is the same as if all nodes were to fail simultaneously.
This expression can be used to gain insight into large networks. If P(0) indicates the percentage of the time no failures will occur, then 1 − P(0) is the percentage of time that one or more failures will occur. Figure 3.4 shows how this probability varies with the number of network nodes for different values of nodal MTBF. An important concept is evident from the figure. The marginal gain of improving the nodal MTBF is more significant with the size of the network; however, the gains diminish as the improvements get better.
Variants of MTBF will be used in different contexts. For example, mean time to data loss (MTDL) or mean time to data availability (MTDA) have often been used, but convey similar meaning. Highly redundant systems sometimes use the mean time between service interruptions (MTBI).
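Returning to Figure 3.4, its curves can be reproduced from (3.9) and (3.10). A minimal sketch, assuming the annual per-node failure probability of (3.8) and arbitrary example network sizes:

```python
from math import comb

def prob_f_failures(n_nodes, f, mtbf_years):
    """Binomial P(f) with per-node annual failure probability p = 1 / (2 * MTBF)."""
    p = 1.0 / (2.0 * mtbf_years)
    return comb(n_nodes, f) * p**f * (1 - p) ** (n_nodes - f)

def prob_one_or_more(n_nodes, mtbf_years):
    """1 - P(0): fraction of time at least one nodal failure occurs."""
    return 1.0 - prob_f_failures(n_nodes, 0, mtbf_years)

for n in (10, 100, 1000):
    print(n, round(prob_one_or_more(n, mtbf_years=20), 3))
# 10 -> ~0.224, 100 -> ~0.92, 1000 -> ~1.0
```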
3.2.5 Reliability
Reliability is the probability that a system will work for some time period t without failure [12]. This is given by:

R(t) = e^{-t/\mathrm{MTBF}}

where R(t) is the reliability of a system. This function assumes that the probability that a system will fail by a time t follows an exponential distribution [13]. Although this assumption is commonly used in many system applications, there are a number of other well-known probability distributions that have been used to characterize system failures.
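For the exponential case, reliability over a mission time t follows directly from the MTBF. A minimal sketch (the one-year horizon and the 100,000-hour MTBF are example values):

```python
from math import exp

def reliability(t_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF) under the exponential failure assumption."""
    return exp(-t_hours / mtbf_hours)

print(round(reliability(8_760, 100_000), 3))  # ~0.916: chance of a failure-free year
```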
Figure 3.4 Probability of failure versus network size (curves for nodal MTBF values of 50,000 to 250,000 hours).
Reliability deals with frequency of failure, while availability is concerned with duration of service. A highly reliable system that may infrequently fail can still have reduced availability if it is out of service for long periods of time. Keeping the duration of outages as short as possible will improve the availability. Mathematically, availability is the probability of having access to a service and that the service operates reliably. For a mission-critical network, availability thus means making sure users have access to service and that service is reliable when accessed. If a component or portion of a network is unreliable, introducing redundancy can thus improve service availability [14]. (Availability is discussed in Section 3.3.)
avail-Figure 3.5 illustrates the reliability function over time for two MTBF values
Reliability at any point in time t is essentially the probability percentage that a
sys-tem with a particular MTBF will operate without a failure or the percentage of thesystems that will still be operational at that point in time
When different network components or platforms are connected in series, the overall system reliability is reduced because it is the product of component system reliabilities. Failure of any one component could bring down the entire system. In large networks or platforms with many systems and components, a high level of reliability may be difficult to achieve. Improving the reliability of a single component will marginally improve the overall system reliability. However, adding a redundant component will improve the overall reliability.
A reliability block diagram (RBD) is a tool used for a first-pass computation of reliability. Figure 3.6 illustrates two RBD examples. If a system is operational only if all components are operating, the relationship is conveyed as a serial relationship. If the system is operational if either component is operating, then a parallel relationship is made. Both arrangements can be generalized to greater numbers of N components, or to systems with components having a mix of parallel or serial relationships. The following formulas are used to convey those relationships:

R_{serial}(t) = R_A(t) · R_B(t) · … · R_N(t)

R_{parallel}(t) = 1 − [1 − R_A(t)] [1 − R_B(t)] … [1 − R_N(t)]

Figure 3.5 Reliability function.
RBDs can be used to construct a generalized abstract model of a system as an aid to understanding the reliability of a system. But they become impractical to model large complex systems with numerous detailed interactions.
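The serial and parallel RBD rules translate directly into code. A minimal sketch with arbitrary component reliabilities:

```python
def serial_reliability(component_reliabilities):
    """Serial RBD: the system works only if every component works."""
    r = 1.0
    for ri in component_reliabilities:
        r *= ri
    return r

def parallel_reliability(component_reliabilities):
    """Parallel RBD: the system works if at least one component works."""
    q = 1.0
    for ri in component_reliabilities:
        q *= (1.0 - ri)
    return 1.0 - q

print(serial_reliability([0.99, 0.995, 0.999]))  # ~0.984: series reduces reliability
print(parallel_reliability([0.99, 0.99]))        # ~0.9999: redundancy raises it
```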
Availability is the proportion of time that a system or network will provide service. It is the percentage of required mission time that a system actually provides service. Reliability is the likelihood that a system will continue to provide service without failure; availability is the likelihood that a system will provide service over the course of its lifetime. The availability, A, of a system or component can be calculated by the following [15]:

A = MTBF / (MTBF + MTTR)
The unavailability of a system or component is simply 1 − A. This ends up being numerically equivalent to amortizing the MTTR over the MTBF [16]. For example, if a critical system in a network has an MTBF of 10,000 hours and an MTTR of 2 hours, it is available 99.98% of the time and unavailable 0.02% of the time. This may seem like a highly available system, but one must consider the absolute service and outage times that are implied. Assuming this system must provide service all the time (i.e., 7 × 24 × 365), implying a total of 8,760¹ hours per year of service, then it is unavailable (or down) 1.75 hours per year. This could be significant for a mission-critical network. The relationship between availability and MTBF is shown in Figure 3.7.
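The worked example above can be reproduced from the availability formula. A minimal sketch, assuming the 8,760-hour service year used in the text:

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10_000, 2)
print(f"{a:.4%}")                                    # 99.9800% available
print(round((1 - a) * 8_760, 2), "hours of downtime per year")  # ~1.75
```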
Figure 3.6 Reliability block diagrams (a logical system of I/O module, CPU, and hard drive and its equivalent RBD; for the serial relationship, R(t) = R_A(t) · R_B(t) · R_C(t)).
1 Use 8,766 hours per year to account for leap years.
Redundancy can have a significant effect on availability. In the example mentioned earlier, if a redundant system were to be placed in parallel operation in the same network, then the percentage of time the network is available is now equivalent to the percentage of time either or both systems are operating. Figure 3.8 shows an example of the use of availability block diagrams (ABDs). ABDs can be used in the same fashion as RBDs. In fact, the mathematical relationship between systems working in parallel or series is the same as the RBD.
operat-The relationship of availability among systems can be generalized in the same
way as an RBD. For a network having N components or systems, the following formulas can be used:

A_{serial} = A_1 · A_2 · … · A_N

A_{parallel} = 1 − (1 − A_1)(1 − A_2) … (1 − A_N)

Figure 3.7 Availability versus MTBF.
to less than a day. Adding parallel redundancy to systems with low availability rates has greater impact than adding redundancy to systems that already have high availability. Additionally, improving the availability of individual parallel systems will only marginally improve overall network availability.
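As an illustration of this point, consider a series chain such as the web-site example (router, server, database); the availability figures below are assumed values:

```python
# Three elements in series: router (99.9%), server (99%), database (99.9%)
chain = 0.999 * 0.99 * 0.999                       # ~0.988 availability
# Duplicate the weakest element: two 99% servers in parallel -> 1 - (0.01)**2 = 0.9999
redundant = 0.999 * (1 - (1 - 0.99) ** 2) * 0.999  # ~0.998 availability
hours_per_year = 8_760
print(round(hours_per_year * (1 - chain)))      # ~105 hours of downtime per year
print(round(hours_per_year * (1 - redundant)))  # ~18 hours, i.e., less than a day
```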
Systems in mission-critical networks, be they computing or networking platforms, require rapid MTTRs. Availability is affected by system or network recovery procedures. Figure 3.10 shows the relationship between MTTR and availability. Decreasing the MTTR can have a profound effect on improving availability. Thus, any tactic that can be used to reduce the MTTR can help improve overall availability. Systems needing a high MTTR may thus require a backup system.
plat-Over the years, the IT industry has used several known levels of availability [17].These are listed in Table 3.1 If a system is carrier class, for example, it is considered99.999% available (colloquially known as five nines) This level is the standard forpublic switched telephone network (PSTN) systems [18] This means that therecould be one failure during a year that lasts just over five minutes or there can be fivefailures that each last one minute
Organizations will typically define several levels of availability, according to their severity and impact on operations. For example, the following levels could be defined, each with an associated availability or downtime [19]:

• Level 1: Users and service are interrupted, data is corrupted;
• Level 2: Users and service are interrupted, data remains intact;
• Level 3: Users interrupted, service remains intact;
• Level 4: No interruptions, but performance degradation;
• Level 5: No interruptions, failover is implemented.

[Figure: Web site network availability — router (A = 99.9%), server (A = 99%), and database (A = 99.9%) in series.]
One of the fundamental laws of availability is the law of diminishing returns. The higher the level of availability, the greater the incremental cost of achieving a small improvement. A general rule is that each additional nine after the second nine in an availability value will cost twice as much. As one approaches 100% availability, return on investment diminishes. The relationship between cost and availability is illustrated in Figure 3.11. In the end, to achieve absolute (100%) availability is cost prohibitive—there is only so much availability that can be built into an infrastructure. This view does not consider the potential savings resulting from improved availability.
In a network where systems operate in series or in parallel, the availability of the network is dependent on the ability to continue service even if a system or network link becomes inoperable. This requires a mix of redundancy, reliable systems, and good network management. We have seen that the overall impact of adding redundant systems in parallel is significant, while adding components in series actually reduces overall network availability. Because all systems connected in series must operate for overall service, there is greater dependency on the availability of each component.

Figure 3.10 Effect of MTTR on availability.

Table 3.1 Classes of Availability

Availability (%)   Annual Downtime   Description
98                 175.2 hours       Failures too frequent
99                 87.6 hours        Failures rare
99.5               43.8 hours        Considered high availability
99.9               8.8 hours         Three nines (often used for storage systems)
99.99              52.6 minutes      Considered fault resilient
99.999             5.3 minutes       Fault tolerant (also called carrier class for PSTN infrastructure)
99.99966           1.8 minutes       Six sigma (often used in manufacturing) [20]
99.9999            31.5 seconds      Six nines
100                0                 Continuous availability
The availability calculations presented here can be used to estimate the contribution of each system to the overall network availability. Although the resulting availability generally yields a high-side estimate, it can flag those portions of a network that can be problematic. More complex statistical models are required to obtain more precise estimates. Markov chain models, which enumerate all possible operating states of each component, can be used. Transitions between states (e.g., between an operational or failure state) are assumed to be probabilistic instead of deterministic, assuming some probability distribution. Availability is then measured by the fraction of time a system is operational.
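As a minimal illustration (a single component with exponential failure and repair times, and arbitrary rates), the steady-state availability of a two-state up/down Markov model reduces to the familiar MTBF/(MTBF + MTTR) form:

```python
lam = 1 / 10_000  # failure rate: one failure per 10,000 operating hours
mu = 1 / 2        # repair rate: repairs average 2 hours
steady_state_availability = mu / (lam + mu)
print(round(steady_state_availability, 6))  # ~0.9998, matching MTBF / (MTBF + MTTR)
```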
The problem with this approach, however, is that it is limited to small numbers of components and states. As both of these grow in number, the problem cannot be solved easily. In this case, other approaches, such as network simulation, can be used. This requires building a computerized model of a network, which often reflects the network’s logical topology. Calibration of the model is required in order to baseline the model’s accuracy.
Availability for a network can be estimated from field data in many ways. A general formula for estimating observed availability, A_o, from a large network is the following:

• Availability is a relative measure—it is how one entity perceives the operation of another. End users ultimately determine availability. It should be computed