This principle, expressed as Metcalfe’s Law, states that the potential value of a network is proportional to the number of active user nodes. More precisely, the usefulness of a network to new and existing users equals the square of the number of user nodes. Although networks with higher scale factors provide greater potential value per node, they are less economical to grow [13].

σ has implications on network risk. Large, scale-free networks imply distributed architectures, which, we see from prior discussion, have a greater likelihood of outage. But because there are fewer users served per node, the expected loss from a nodal outage is less. In a network of limited scale, where users are more concentrated at each node, greater damage can happen.
These points can be illustrated by estimating the minimum expected loss in a
network of N nodes, each with an outage potential p, equal to the probability of a failure. The probability of f failures occurring out of N possible nodes is statistically
characterized by the well-known Binomial distribution [14]:

P(f) = \frac{N!}{f!\,(N - f)!}\, p^{f} (1 - p)^{N - f}

If P(0) indicates the percentage of time that no failures occur, then 1 − P(0) is the percentage of time that one or more failures occur. If we next assume that σ is the average nodal loss per outage, measured in terms of percentage of users, then on a broad network basis, the minimum risk (or percent expected minimum loss) ρ for a network is given by:

ρ = σ [1 − P(0)]
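As a quick numerical sketch of these two expressions (the function name and example values below are illustrative assumptions, not figures from the text):

```python
from math import comb

def min_expected_loss(n_nodes, p_outage, sigma):
    """rho = sigma * (1 - P(0)): expected minimum loss, in percent of users,
    where P(0) is the binomial probability that no node fails."""
    p_zero = comb(n_nodes, 0) * (1 - p_outage) ** n_nodes  # (1 - p)^N
    return sigma * (1 - p_zero)

# Illustrative values: 50 nodes, 0.1% outage potential per node,
# average nodal loss of 2% of users per outage
print(round(min_expected_loss(50, 0.001, 2.0), 3))  # ~0.098% of users at risk
```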
Figure 2.14 graphically shows how ρ varies with network scale at different nodal outage potentials p. It shows that investing to reduce nodal outage potentials, regardless of scale, can ultimately still leave about one percent of the users at risk. Expanding the size of the network to reduce risk is more effective when the scale is limited or, in other words, users are more concentrated. Concentration will dominate risk up to a certain point, beyond which size has a greater influence on network risk. This analysis assumes uniform, random outage potential at each node. Of course, this assumption may not hold true in networks with a nonuniform user distribution and nodal outage potential. An intentional versus random outage at a concentrated node, such as one resulting from a directed security attack, can inflict great damage. Overall, scale-free networks with less concentrated nodal functionality are less vulnerable to directed, nonrandom outages than networks of limited scale.

Figure 2.13 Network scale factor (panels: relatively scale-free, exponential scale-free, relatively scale-limited; service and user nodes).
2.4.4 Complexity
Complexity in a network is characterized by variety. Use of many diverse network technologies, circuits, OSs, vendors, interfaces, devices, management systems, service providers, and suppliers is quite common and often done for valid reasons. Many networks are grown incrementally over time and result in à la carte designs. Such designs usually do not perform as well as a single integrated design. Even if a complex design has strong tolerance against errors, extra complexity can impede recovery activities and require more costly contingency.
The problems with complexity are attributed not only to system variety. They are also attributed to whether components are chosen correctly for their intended purpose and how well they are matched. A matched system is one in which components fit well with each other and do not encroach upon the limits of one another. Turnkey systems delivered by a single vendor or integrator are usually delivered with intentions of behaving as matched systems. On the other hand, unmatched systems are usually characterized by poorly integrated, piecemeal designs involving different vendor systems that are plugged together. The cost of maintaining the same levels of service increases exponentially as more complexity is used to deliver the service.
Consolidation of multiple components, interfaces, and functions into a single system or node can reduce the number of resources needed and eliminate potential points of failure and the interoperability issues associated with matching. But from previous discussion, we know that consolidation poses greater risk because the consolidated resource can result in a single point of failure with high damage potential, unless a redundant architecture is used.
This chapter reviewed some fundamental concepts of continuity that form the basis for much of the remaining discussion in this book. Many of these concepts can be applied to nearly all levels of networking. A basic comprehension of these principles provides the foundation for devising and applying continuity strategies that use many of the remedial techniques and technologies discussed further in this book.
Adverse events are defined as those that violate a well-defined envelope of operational performance criteria. Service disruptions in a network arise out of lack of preparedness rather than the adverse events themselves. Preparedness requires having the appropriate capabilities in place to address such events. It includes having an early detection mechanism to recognize them even before they happen; the ability to contain the effects of disruption to other systems; a failover process that can transfer service processing to other unaffected working components; a recovery mechanism to restore any failed components; and the means to resume normal operation following recovery.
Redundancy is a tactic that utilizes multiple resources so that if one resource is unable to provide service, another can. In networks, redundancy can add to capital and operating costs and should therefore be carefully designed into an operation. At a minimum, it should eliminate single points of failure—particularly those that support mission-critical services. There are various ways to implement redundancy. However, if not done correctly, it can be ineffective and provide a false sense of security.

Tolerance describes the ability to withstand disruptions and is usually expressed in terms of availability. A greater level of tolerance in a network or system implies lower transaction loss and higher cost. FT, FR, and HA are tolerance categories that are widely used to classify systems and services, with FT representing the highest level of tolerance. FR and HA solutions can be cost-effective alternatives to minimizing service disruption, but they may not guarantee transaction preservation during failover to the same degree as FT.
There are several key principles of network and system design to ensure continuity. Capacity should be put in the right places—indiscriminate placement of capacity can produce bottlenecks, which can lead to other service disruptions. Networks should be designed in compartments that each represent their own failure groups, so that a disruption in one compartment does not affect another. The network architecture should be balanced so that loss of a highly interconnected node does not disrupt the entire network.

Finally, the adage “the simpler, the better” prevails. Complexity should be discouraged at all levels. The variety of technologies, devices, systems, vendors, and services should be minimized. They should also be well matched. This means that each should be optimally qualified for its intended job and work well with other components.
[10] Nolle, T., “Balancing Risk,” Network Magazine, December 2001, p. 96.
[11] Sanborn, S., “Spreading Out the Safety Net,” Infoworld, April 1, 2002, pp. 38–41.
[12] Porter, D., “Nothing Is Unsinkable,” Enterprise Systems Journal, June 1998, pp. 20–26.
[13] Rybczynski, T., “Net-Value—The New Economics of Networking,” Computer Telephony Integration, April 1999, pp. 52–56.
[14] Bryant, E. C., Statistical Analysis, New York: McGraw-Hill Book Company, Inc., 1960, pp. 20–24.
CHAPTER 3
Continuity Metrics
Metrics are quantitative measures of system or network behavior. We use metrics to characterize system behavior so that decisions can be made regarding how to manage and operate them efficiently. Good metrics are those that are easily understood in terms of what they measure and how they convey system or network behavior. There is no single metric that can convey the adequacy of a mission-critical network’s operation. Using measures that describe the behavior of a single platform or portion of a network is insufficient. One must measure many aspects of a network to arrive at a clear picture of what is happening.
There is often no true mathematical way of combining metrics for a network. Unlike the stock market, use of a computed index to convey overall network status is often flawed. For one thing, many indices are the result of combining measures obtained from ordinal and cardinal scales, which is mathematically incorrect. Some measures are obtained through combination using empirically derived models. This can also be flawed because a metric is only valid within the ranges of data from which it was computed. The best way of combining measures is through human judgment. A network operator or manager must be trained to observe different metrics and use them to make decisions. Like a pilot, operators must interpret information from various gauges to decide the next maneuver.
Good, useful metrics provide a balance between data granularity and the effort required for computation. Many statistical approaches, such as experimental design, are aimed at providing the maximum amount of information with the least amount of sampling. The cost and the ability to obtain input data have improved over the years. Progress in computing and software has made it possible to conduct calculations using vast amounts of data in minimal time, impossible 20 or 30 years ago. The amount of time, number of samples, complexity, and cost all should be considered when designing metrics.
Metrics should be tailored to the item being measured. No single metric is applicable to everything in a network. Furthermore, a metric should be tied to a service objective. It should be used to express the extent to which an objective is being achieved. A metric should be tied to each objective in order to convey the degree to which it is satisfied.

Finally, computing a metric should be consistent when repeated over time; otherwise, comparing relative changes in the values would be meaningless. Repeated calculations should be based on the same type of data, the same data range, and the same sampling approach. More often than not, systems or network services are compared based on measures provided by a vendor or service provider. Comparing different vendors or providers using the measures they each supply is often difficult and sometimes fruitless, as each develops their metrics based on their own methodologies [1].
Recovery is all of the activities that must occur from the time of an outage to the time service is restored. These will vary among organizations and, depending on the context of use, within a mission-critical network environment. Activities involved to recover a component are somewhat different than those to recover an entire data center. But in either case, the general meaning remains the same. General recovery activities include declaration that an adverse event has occurred (or is about to occur); initialization of a failover process; system restoration or repair activities; and system restart, cutover, and resumption of service. Two key recovery metrics are described in the following sections.
3.1.1 Recovery Time Objective
The recovery time objective (RTO) is a target measure of the elapsed time interval between the occurrence of an adverse event and the restoration of service. RTO should be measured from the point when the disruption occurred until operation is resumed. In mission-critical environments, this means that operation is essentially in the same functional state as it was prior to the event. Some IT organizations may alter this definition by relaxing some of the operational state requirements after resumption and accepting partial operation as a resumed state. Likewise, some will define RTO based on the time of recognizing and declaring that an adverse event has occurred. This can be misleading because it does not take into account monitoring and detection time.
RTO is an objective, specified in hours and minutes—a target value, determined by an organization’s management, that represents an acceptable recovery time. What value an organization assigns to “acceptable” is influenced by a variety of factors, including the importance of the service and consequential revenue loss, the nature of the service, and the organization’s internal capabilities. In systems, it may even be specified in milliseconds. Some will also specify RTO in transactions or a comparable measure that conveys unit throughput of an entity. This approach is only valid if that entity’s throughput is constant over time.
RTOs can be applied to any network component—from an individual system to an entire data center. Organizations will define different RTOs for different aspects of their business. To define an RTO, an organization’s managers must determine how much service interruption their business can tolerate. They must determine how long a functional entity, such as a business process, can be unavailable. One may often see RTOs in the range of 24 to 48 hours for large systems, but these numbers do not reflect any industry standard. Virtual storefronts are unlikely to tolerate high RTOs without significant loss of revenue. Some vertical markets, such as banking, must adhere to financial industry requirements for disruption of transactions [2]. Cost ultimately drives the determination of an RTO. A high cost is required to achieve a low RTO for a particular process or operation. To achieve RTOs close to zero requires expensive automated recovery and redundancy [3]. As the target RTO
increases, the cost to achieve the RTO decreases. An RTO of long duration invites less expensive redundancy and more manual recovery operation. However, concurrent with this is business loss. As shown in Figure 3.1, loss is directly related to RTO—the longer the RTO, the greater the loss. During recovery, business loss can be realized in many ways, including lost productivity or transactions. This topic is discussed further in this chapter. At some point, there is an RTO whose costs can completely offset the losses during the recovery [4, 5].
It becomes evident that defining an RTO as a sole measure is meaningless without some idea of what level of service the recovery provides. Furthermore, different systems will have their own RTO curves. Critical systems will often have a much smaller RTO than less critical ones. They can also have comparable RTOs but with more stringent tolerance for loss. A tiered-assignment approach can be used. This involves defining levels of system criticality and then assigning an RTO value to each. So, for example, a three-level RTO target might look like this:

• Level 1—restore to same service level;
• Level 2—restore to 75% service level;
• Level 3—restore to 50% service level.
A time interval can be associated with each level, as well as a descriptor of the level of service provided. For example, a system assigned a level 2 RTO of 1 hour must complete recovery within that time frame and disrupt no more than 25% of service. A system can be assigned a level 1 RTO of 1 hour as well, but must restore to the same level of service. Level 1 may require failover procedures or recovery to a secondary system.
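A tiered assignment like this lends itself to a simple lookup. The sketch below is illustrative only—the tier values and the helper function are assumptions, not prescribed targets:

```python
# Hypothetical tier table: level -> (maximum recovery time in hours, service-level floor)
RTO_TIERS = {1: (1.0, 1.00), 2: (1.0, 0.75), 3: (4.0, 0.50)}

def meets_tier(level, recovery_hours, restored_service_fraction):
    """True if a recovery finished within the tier's RTO and restored at least
    the tier's required fraction of service."""
    max_hours, service_floor = RTO_TIERS[level]
    return recovery_hours <= max_hours and restored_service_fraction >= service_floor

print(meets_tier(2, 0.9, 0.80))  # True: under 1 hour and above the 75% floor
print(meets_tier(1, 0.9, 0.80))  # False: level 1 requires full service restoration
```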
Assuming that the service level is linearly proportional to time, RTOs across different levels can be equated on the same time scale. A time-equivalent RTO, RTOE, can thus be computed.

Figure 3.1 RTO versus loss and cost (recovery costs offset loss as RTO grows).
3.1.1.1 Recovery Time Components
The RTO interval must incorporate all of the activities to restore a network or component back to service. A flaw in any of the component activities could lead to significant violation of the RTO. To this end, each component activity can be assigned an RTO as well. The addition of each of these component RTOs may not necessarily equal the overall RTO because activities can be conducted in parallel. Some of these component RTOs can include time to detection and declaration of an adverse event, time to failover (sometimes referred to as a failover time objective), time to diagnose, and time to repair.
The last two items are typically a function of the network or system complexity and typically pose the greatest risk. In complex networks, one can expect that the likelihood of achieving an RTO for the time to diagnose and repair is small. Failover to a redundant system is usually the most appropriate countermeasure, as it can buy time for diagnostics and repair. A system or network operating in a failed state is somewhat like a twin-engine airplane flying on one engine. Its level of reliability is greatly reduced until diagnostics and repairs are made.
Figure 3.2 illustrates the continuum of areas of activity relative to a mission-critical network. Of course, these may vary but are applicable to most situations. The areas include the following:
• Network recovery. This is the time to restore voice or data communication following an adverse event. Network recovery will likely influence many other activities as well. For instance, recovery of backup data over a network could be affected until the network is restored.
• Data recovery. This is the time to retrieve backup data out of storage and deliver it to a recovery site, either physically or electronically. It also includes the time to load media (e.g., tape or disk) and install or reboot database applications. This is also referred to as the time to data (TTD) and is discussed further in the chapter on storage.
• Application recovery. This is the time to correct a malfunctioning application.
• Platform recovery. This is the time to restore a problematic platform to service operation.
• Service recovery. This represents recovery in the broadest sense. It represents the cumulative time to restore service from an end user’s perspective. It is, in essence, the result of an amalgamation of all of the preceding recovery times.
Figure 3.2 Recovery activity areas (network, data, application, and platform recovery) along a time scale.
All of these areas are discussed at greater length in the subsequent chapters of this book.
3.1.2 Recovery Point Objective
The recovery point objective (RPO) is used as a target metric for data recovery. It is also measured in terms of time, but it refers to the age or freshness of data required to restore operation following an adverse event. Data, in this context, might also include information regarding transactions not recorded or captured. Like RTO, the smaller the RPO, the higher the expected data recovery cost. Reloading a daily backup tape can satisfy a tolerance for no more than 24 hours’ worth of data. However, a tolerance for only one minute’s worth of data or transaction loss might require more costly data transfer methods, such as mirroring, which is discussed in the chapter on storage.
Some view the RPO as the elapsed time of data recovery in relation to the adverse event. This is actually the aforementioned TTD. RPO is the point in time to which the data must be recovered—sometimes referred to as the freshness window. It is the maximum tolerable elapsed time between the last safe backup and the point of recovery. An organization that can tolerate no data loss (i.e., RPO = 0) implies that data would have to be restored instantaneously following an adverse event and would have to employ a continuous backup system.
Figure 3.3 illustrates the relationship between TTD and RPO using a timeline. If we denote the time between the last data snapshot and an adverse event as a random variable ε, then it follows that TTD + ε must meet the RPO objective:

TTD + ε ≤ RPO
A target RPO should be chosen that does not exceed the snapshot interval (SI) and at best equals the SI. If data is not restored prior to the next scheduled snapshot, then the snapshot should be postponed, or risk further data corruption:

RPO ≤ SI
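Both conditions can be checked mechanically. A minimal sketch, assuming hour units and example values:

```python
def rpo_plan_ok(ttd_hours, epsilon_hours, rpo_hours, snapshot_interval_hours):
    """Check that TTD + epsilon <= RPO and that RPO <= SI (snapshot interval)."""
    meets_rpo = (ttd_hours + epsilon_hours) <= rpo_hours
    rpo_within_si = rpo_hours <= snapshot_interval_hours
    return meets_rpo and rpo_within_si

# Example: 2 hours to reload data, up to 1 hour between the last snapshot and the event,
# a 4-hour RPO, and snapshots taken every 24 hours
print(rpo_plan_ok(2, 1, 4, 24))  # True
```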
Figure 3.3 Relationship of RPO, RTO, and TTD (timeline showing the snapshot, the error, the restore operation, and the snapshot interval SI).
Trang 11The variable ε represents a margin of error in specifying an RPO, ing that TTD is usually a fixed quantity—which may not necessarily be true in allcases.
assum-There are some caveats to using an RPO RPO assumes that the integrity of thedata is preserved at recovery The value does not necessarily convey the quality andquantity of data that is to be restored [6] It is conceivable that an operation can pro-vide service with partial transaction data, until transactions are reconstructed at alater time If data is restored to a level such that full operation can take place, then inthe strictest sense the RPO has been satisfied
RPO also assumes a uniform transaction-recording rate over time, which may not necessarily be true. In other words, an RPO of one hour implicitly assumes that no more than one hour’s worth of transaction loss can be tolerated. In fact, if an adverse event took place during off hours, the likelihood is that hardly any transactions would be lost within an hour’s time. For this reason, different levels of RPO may need to be specified depending on the time of day.
3.1.3 RTO Versus RPO
RTO and RPO are not necessarily tied to each other, but they can be interrelated [7]. Figure 3.3 also illustrated the relationship between RTO and RPO. Specifying an RTO that is short in duration does not necessarily imply a short RPO. For example, although a system can be restored with working data within an RTO of an hour following an adverse event, it is acceptable for that data to be four hours old—the RPO. RTO specifies the maximum time duration to recover a network, system, or component. RPO defines how much working data, in terms of time, can be lost in the process. A system or network working under an RTO and RPO both equivalent to zero requires instantaneous recovery and essentially no data loss. Specification of both the RTO and RPO is driven mainly by economics.
Reliability is defined as the probability (or likelihood) that a network (or component) will perform satisfactorily during a specified period of time. It is measured by how long it takes for a network or system to fail (i.e., how long it continues to function until it ceases due to failure). Reliability and availability are often used interchangeably, but there is a subtle difference between them. Availability (discussed in Section 3.3) is the probability that a network is in service and available to users at any given instant in time.
The difference between reliability and availability is best explained through an analogy. A car, for example, may break down and require maintenance 5% of the time. It is therefore 95% reliable. However, suppose the same car is equally shared between two family members. To each, the car is available only 47.5% (50% × 95%) of the time, even though the car is very reliable. Even if the car was 100% reliable, the availability to each is still only 50%. To improve availability, the family members can purchase an additional car, so that each now has 100% auto availability.
3.2.1 Mean Time to Failure
Mean time to failure (MTTF) is a metric that is often used to characterize the operating life of a system. It is the amount of time from the placement of a system or component in service until it permanently fails. Passive components are not often considered in MTTF estimations. They can have lifetimes on the order of 20 years or so. Network cables are known to have lifetimes from three to 100 years or so, depending on where and how they are used. Active components, on the other hand, may likely have shorter MTTFs.
Ideally, accurately calculating MTTF requires the system to be monitored for its expected useful lifetime, which can be quite long. Whether estimation of the MTTF involves monitoring a system for several years or a hundred years, accurate MTTF computation is often impractical to obtain. Some standard prediction methods to estimate MTTF are found in Military Standards (MIL-217) and Naval Surface Warfare Center (NSWC) specifications, and telephony standards such as Telcordia Specifications (TR-332 Version 6) and French Telecommunications (RDF 2000). If a system is in operation until it is no longer of use, then one can say that the mission time of the device is assumed to be the same as the MTTF. In many military network installations, a system’s mission time may complete much sooner than the MTTF.

MTTF is viewed as an elapsed time. If a network element or system is not used all of the time, but at a periodic rate (e.g., every day during business hours), then the
percentage of time it is in operational use is referred to as the duty cycle. The duty cycle is defined as:

δ = OT / MTTF

where δ is the duty cycle and OT is the total operating time of the element. For a network circuit, for example, it is the fraction of time the circuit is transmitting. For a system or component such as a disk drive, it is the percentage of time the drive spends actively reading and writing. If, for example, the drive has an MTTF of 250,000 hours when it is in use 5% of the time (δ = 0.05), the same drive would have an MTTF of 125,000 hours if it were used twice as much (δ = 0.10). In other words, the more a system or device is in use, the shorter the life expectancy.
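Under the interpretation suggested by this example—a fixed budget of operating hours spread over calendar time—the elapsed MTTF scales inversely with the duty cycle. A minimal sketch, with the drive’s operating-hour life treated as an assumed constant:

```python
def elapsed_mttf(operating_life_hours, duty_cycle):
    """Elapsed (calendar) MTTF when a component with a fixed operating-hour life
    is used only a fraction delta of the time: MTTF = OT / delta."""
    return operating_life_hours / duty_cycle

# The drive above: roughly 12,500 hours of active read/write life
print(elapsed_mttf(12_500, 0.05))  # 250,000 elapsed hours
print(elapsed_mttf(12_500, 0.10))  # 125,000 elapsed hours when used twice as much
```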
3.2.2 Failure Rate
Systems will fail or be placed out of service for many reasons. A system upgrade, maintenance, or system fault may require placing a system out of service. A failure rate, F, can be defined to express failure frequency in terms of failures per unit time, say percentage of failures per 1,000 hours. System vendors often use statistical sampling methods to estimate average failure rates over large populations of components. These populations can be on the order of tens of thousands of components. Once components are embedded in a complex platform or network, their significance in the overall reliability becomes ambiguous. The more complex a system or network grows, the greater the likelihood of failure, even though the individual subsystems are highly reliable.

A failure may not necessarily mean that a system has stopped operating. A failure can also be associated with those circumstances in which a system is producing service at an unsatisfactory performance level and hence is of little or no use. The failure rate, F, of a system can be estimated as follows:

F = f / (system’s useful life)     (3.5)

where f is the number of failures experienced during a system’s useful life or mission time (i.e., the total time a system is performing service operations). Many failure estimation techniques assume an exponential distribution where the failure rate is constant with time.
esti-3.2.3 Mean Time to Recovery
Mean time to recovery (MTTR) is sometimes referred to as mean time to repair or restore. In either case, it means the same thing. It is the time required to restore operation in a component that has stopped operating or that is not operating at a satisfactory performance level. It includes the total time it takes to restore the component to full operation. It could include things like diagnosis, repair, replacement, reboot, and restart. MTTR is expressed in units of time. The time to diagnose can typically present the most uncertainty in estimating MTTR and can thus have a profound effect on MTTR and, ultimately, system availability.
MTTR can be estimated from observed data in several ways. The most common method is to simply obtain the sum total of all observed restoration times and divide by the number of reported outages or trouble tickets. MTTR can be used to estimate the restoration rate, µ (sometimes referred to as the recovery rate), of a system as follows:

µ = 1 / MTTR

where µ is used to convey the recoverability of a system. Systems that minimize MTTR or that have a high recoverability µ should be favored. MTTR is also a primary measure of availability. The availability of systems with numerous components will be bound by those having the longest MTTR.
3.2.4 Mean Time Between Failure
The mean time between failure (MTBF) is a metric that conveys the mean or average life of a system based on the frequency of system outages or failures. For this reason, it is different than MTTF, although the two are quite often used interchangeably. Also, MTBF is sometimes referred to as the mean time between system outages (MTBSO) [8], depending on the context of use. For our purposes, we will use MTBF because it is the more recognizable metric.

MTBF is a measure that system vendors often use to compare their product to another [9]. System vendors will quote an MTBF without any basis or justification. Many system vendors may quote an MTTF for a product, which may actually be the computed MTBF. Because of the complexity of today’s systems, computation of a true MTBF for a platform can be daunting. Another issue is that mission-critical network systems do not, by definition, function as isolated items. An MTBF usually conveys stand-alone operation. If the MTBF is reached, considerable operational risk is incurred.
Trang 14A system with a low MTBF will require more servicing and consequently tional staffing, monitoring, and spare components This typically implies highermaintenance costs but lower capital costs A high MTBF, on the other hand, indi-cates that a system will run longer between failures and is of higher quality Thismay imply a higher capital cost but lower maintenance cost Some systems will try
addi-to integrate high-quality components having the highest MTBFs, but their level ofintegration is such that the MTBF of the overall system is still low
MTBF is measured based on the number of failures during the service life of a system, or simply the inverse of the failure rate, F:

MTBF = 1 / F

For example, if a system has an MTTF of 100 years and experiences three failures in that time (f = 3), then the MTBF is approximately 33.3 years. Many use MTBF to convey the reliability of a system in terms of time. The higher the MTBF, the more reliable a system is.
The MTBF for a system can be estimated in various ways. If MTBF is estimated as an arithmetic mean of observed MTBF values across N systems, one could assume that MTBF represents the point in time at which approximately half of the systems have had a failure, assuming that F is uniform over time. In general, the percentage, p, of devices that could fail in a given year is then estimated as:

p = 1 / (2 × MTBF)     (3.8)

So, for example, in a large network with an estimated MTBF of 20 years, one would expect on average about 2.5% of the devices to fail in a given year. It is important to recognize that some components might fail before reaching the MTBF, while others might outperform it without problem. It is best to use MTBF with the most critical components of a system or network, particularly those that are potentially single points of failure.
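The 2.5% figure follows directly from (3.8). A one-line check (the 1,000-node count is an illustrative value):

```python
def annual_failure_fraction(mtbf_years):
    """p = 1 / (2 * MTBF), with MTBF expressed in years."""
    return 1.0 / (2.0 * mtbf_years)

print(annual_failure_fraction(20))         # 0.025 -> about 2.5% of devices per year
print(1000 * annual_failure_fraction(20))  # ~25 expected failures in a 1,000-node network
```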
To plan recovery and network-management operations and resources, it is often valuable to have a feel for how many simultaneous outages or failures can occur in a network consisting of N nodes. If there were only one node (N = 1) in a network, then p is the probability of that node failing in a day. However, as the number of nodes N in a network increases, so will the likelihood of having more failures in a given day. In general, the probability of f discrete events occurring out of N possible outcomes, each with a probability of occurrence p, is statistically characterized by the well-known Binomial distribution [10]:

P(f) = \frac{N!}{f!\,(N - f)!}\, p^{f} (1 - p)^{N - f}     (3.9)

where P(f) is the probability of f events occurring. If we assume that N is the maximum number of possible node failures in a given day and p is the probability of an individual node failing, then P(f) is the probability of f failures in a given time frame. If we substitute the expression for p obtained in (3.8) into (3.9), then the probability of having f failures in a network (or what percentage of time f failures are likely to occur) is [11]:

P(f) = \frac{N!\,(2\,\mathrm{MTBF} - 1)^{N - f}}{f!\,(N - f)!\,(2\,\mathrm{MTBF})^{N}}     (3.10)
This expression assumes that all nodes have an equal probability of failing, which obviously may not always be true. However, it can be used as an approximation for large networks. It also assumes that all failures are independent of each other, which is another simplifying assumption that may not necessarily be true. In fact, many times a failure will lead to other failures, creating a rolling failure. In the expression, it is assumed the total number of failure outcomes N is the same as if all nodes were to fail simultaneously.
This expression can be used to gain insight into large networks. If P(0) indicates the percentage of the time no failures will occur, then 1 − P(0) is the percentage of time that one or more failures will occur. Figure 3.4 shows how this probability varies with the number of network nodes for different values of nodal MTBF. An important concept is evident from the figure. The marginal gain of improving the nodal MTBF is more significant with the size of the network; however, the gains diminish as the improvements get better.
Variants of MTBF will be used in different contexts. For example, mean time to data loss (MTDL) or mean time to data availability (MTDA) have often been used, but convey similar meaning. Highly redundant systems sometimes use the mean time between service interruptions (MTBI).
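Returning to Figure 3.4, its curves can be reproduced from (3.9) and (3.10). A minimal sketch, assuming the annual per-node failure probability of (3.8) and arbitrary example network sizes:

```python
from math import comb

def prob_f_failures(n_nodes, f, mtbf_years):
    """Binomial P(f) with per-node annual failure probability p = 1 / (2 * MTBF)."""
    p = 1.0 / (2.0 * mtbf_years)
    return comb(n_nodes, f) * p**f * (1 - p) ** (n_nodes - f)

def prob_one_or_more(n_nodes, mtbf_years):
    """1 - P(0): fraction of time at least one nodal failure occurs."""
    return 1.0 - prob_f_failures(n_nodes, 0, mtbf_years)

for n in (10, 100, 1000):
    print(n, round(prob_one_or_more(n, mtbf_years=20), 3))
# 10 -> ~0.224, 100 -> ~0.92, 1000 -> ~1.0
```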
3.2.5 Reliability
Reliability is the probability that a system will work for some time period t without failure [12]. This is given by:

R(t) = e^{-t/\mathrm{MTBF}}

where R(t) is the reliability of a system. This function assumes that the probability that a system will fail by a time t follows an exponential distribution [13]. Although this assumption is commonly used in many system applications, there are a number of other well-known probability distributions that have been used to characterize system failures.
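For the exponential case, reliability over a mission time t follows directly from the MTBF. A minimal sketch (the one-year horizon and the 100,000-hour MTBF are example values):

```python
from math import exp

def reliability(t_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF) under the exponential failure assumption."""
    return exp(-t_hours / mtbf_hours)

print(round(reliability(8_760, 100_000), 3))  # ~0.916: chance of a failure-free year
```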
Figure 3.4 Probability of failure versus network size (curves for nodal MTBF values of 50,000 to 250,000 hours).
Reliability deals with frequency of failure, while availability is concerned with duration of service. A highly reliable system that may infrequently fail can still have reduced availability if it is out of service for long periods of time. Keeping the duration of outages as short as possible will improve the availability. Mathematically, availability is the probability of having access to a service and that the service operates reliably. For a mission-critical network, availability thus means making sure users have access to service and that service is reliable when accessed. If a component or portion of a network is unreliable, introducing redundancy can thus improve service availability [14]. (Availability is discussed in Section 3.3.)
avail-Figure 3.5 illustrates the reliability function over time for two MTBF values
Reliability at any point in time t is essentially the probability percentage that a
sys-tem with a particular MTBF will operate without a failure or the percentage of thesystems that will still be operational at that point in time
When different network components or platforms are connected in series, the overall system reliability is reduced because it is the product of component system reliabilities. Failure of any one component could bring down the entire system. In large networks or platforms with many systems and components, a high level of reliability may be difficult to achieve. Improving the reliability of a single component will marginally improve the overall system reliability. However, adding a redundant component will improve the overall reliability.
A reliability block diagram (RBD) is a tool used for a first-pass computation of reliability. Figure 3.6 illustrates two RBD examples. If a system is operational only if all components are operating, the relationship is conveyed as a serial relationship. If the system is operational if either component is operating, then a parallel relationship is made. Both arrangements can be generalized to greater numbers of N components, or to systems with components having a mix of parallel or serial relationships. The following formulas are used to convey those relationships:

R_{serial}(t) = R_A(t) · R_B(t) · … · R_N(t)

R_{parallel}(t) = 1 − [1 − R_A(t)] [1 − R_B(t)] … [1 − R_N(t)]

Figure 3.5 Reliability function.
RBDs can be used to construct a generalized abstract model of a system as an aid to understanding the reliability of a system. But they become impractical to model large complex systems with numerous detailed interactions.
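The serial and parallel RBD rules translate directly into code. A minimal sketch with arbitrary component reliabilities:

```python
def serial_reliability(component_reliabilities):
    """Serial RBD: the system works only if every component works."""
    r = 1.0
    for ri in component_reliabilities:
        r *= ri
    return r

def parallel_reliability(component_reliabilities):
    """Parallel RBD: the system works if at least one component works."""
    q = 1.0
    for ri in component_reliabilities:
        q *= (1.0 - ri)
    return 1.0 - q

print(serial_reliability([0.99, 0.995, 0.999]))  # ~0.984: series reduces reliability
print(parallel_reliability([0.99, 0.99]))        # ~0.9999: redundancy raises it
```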
Availability is the proportion of time that a system or network will provide service. It is the percentage of required mission time that a system actually provides service. Reliability is the likelihood that a system will continue to provide service without failure; availability is the likelihood that a system will provide service over the course of its lifetime. The availability, A, of a system or component can be calculated by the following [15]:

A = MTBF / (MTBF + MTTR)
The unavailability of a system or component is simply 1 − A. This ends up being numerically equivalent to amortizing the MTTR over the MTBF [16]. For example, if a critical system in a network has an MTBF of 10,000 hours and an MTTR of 2 hours, it is available 99.98% of the time and unavailable 0.02% of the time. This may seem like a highly available system, but one must consider the absolute service and outage times that are implied. Assuming this system must provide service all the time (i.e., 7 × 24 × 365), implying a total of 8,760¹ hours per year of service, then it is unavailable (or down) 1.75 hours per year. This could be significant for a mission-critical network. The relationship between availability and MTBF is shown in Figure 3.7.
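The worked example above can be reproduced from the availability formula. A minimal sketch, assuming the 8,760-hour service year used in the text:

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10_000, 2)
print(f"{a:.4%}")                                    # 99.9800% available
print(round((1 - a) * 8_760, 2), "hours of downtime per year")  # ~1.75
```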
Figure 3.6 Reliability block diagrams (a logical system of I/O module, CPU, and hard drive and its equivalent RBD; for the serial relationship, R(t) = R_A(t) · R_B(t) · R_C(t)).
1 Use 8,766 hours per year to account for leap years.
Redundancy can have a significant effect on availability. In the example mentioned earlier, if a redundant system were to be placed in parallel operation in the same network, then the percentage of time the network is available is now equivalent to the percentage of time either or both systems are operating. Figure 3.8 shows an example of the use of availability block diagrams (ABDs). ABDs can be used in the same fashion as RBDs. In fact, the mathematical relationship between systems working in parallel or series is the same as the RBD.
operat-The relationship of availability among systems can be generalized in the same
way as an RBD. For a network having N components or systems, the following formulas can be used:

A_{serial} = A_1 · A_2 · … · A_N

A_{parallel} = 1 − (1 − A_1)(1 − A_2) … (1 − A_N)

Figure 3.7 Availability versus MTBF.
to less than a day. Adding parallel redundancy to systems with low availability rates has greater impact than adding redundancy to systems that already have high availability. Additionally, improving the availability of individual parallel systems will only marginally improve overall network availability.
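As an illustration of this point, consider a series chain such as the web-site example (router, server, database); the availability figures below are assumed values:

```python
# Three elements in series: router (99.9%), server (99%), database (99.9%)
chain = 0.999 * 0.99 * 0.999                       # ~0.988 availability
# Duplicate the weakest element: two 99% servers in parallel -> 1 - (0.01)**2 = 0.9999
redundant = 0.999 * (1 - (1 - 0.99) ** 2) * 0.999  # ~0.998 availability
hours_per_year = 8_760
print(round(hours_per_year * (1 - chain)))      # ~105 hours of downtime per year
print(round(hours_per_year * (1 - redundant)))  # ~18 hours, i.e., less than a day
```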
Systems in mission-critical networks, be they computing or networking platforms, require rapid MTTRs. Availability is affected by system or network recovery procedures. Figure 3.10 shows the relationship between MTTR and availability. Decreasing the MTTR can have a profound effect on improving availability. Thus, any tactic that can be used to reduce the MTTR can help improve overall availability. Systems needing a high MTTR may thus require a backup system.
plat-Over the years, the IT industry has used several known levels of availability [17].These are listed in Table 3.1 If a system is carrier class, for example, it is considered99.999% available (colloquially known as five nines) This level is the standard forpublic switched telephone network (PSTN) systems [18] This means that therecould be one failure during a year that lasts just over five minutes or there can be fivefailures that each last one minute
Organizations will typically define several levels of availability, according to their severity and impact on operations. For example, the following levels could be defined, each with an associated availability or downtime [19]:

• Level 1: Users and service are interrupted, data is corrupted;
• Level 2: Users and service are interrupted, data remains intact;
• Level 3: Users interrupted, service remains intact;
• Level 4: No interruptions, but performance degradation;
• Level 5: No interruptions, failover is implemented.

[Figure: Web site network availability — router (A = 99.9%), server (A = 99%), and database (A = 99.9%) in series.]
One of the fundamental laws of availability is the law of diminishing returns. The higher the level of availability, the greater the incremental cost of achieving a small improvement. A general rule is that each additional nine after the second nine in an availability value will cost twice as much. As one approaches 100% availability, return on investment diminishes. The relationship between cost and availability is illustrated in Figure 3.11. In the end, to achieve absolute (100%) availability is cost prohibitive—there is only so much availability that can be built into an infrastructure. This view does not consider the potential savings resulting from improved availability.
In a network where systems operate in series or in parallel, the availability of the network is dependent on the ability to continue service even if a system or network link becomes inoperable. This requires a mix of redundancy, reliable systems, and good network management. We have seen that the overall impact of adding redundant systems in parallel is significant, while adding components in series actually reduces overall network availability. Because all systems connected in series must operate for overall service, there is greater dependency on the availability of each component.

Figure 3.10 Effect of MTTR on availability.

Table 3.1 Classes of Availability

Availability (%)   Annual Downtime   Description
98                 175.2 hours       Failures too frequent
99                 87.6 hours        Failures rare
99.5               43.8 hours        Considered high availability
99.9               8.8 hours         Three nines (often used for storage systems)
99.99              52.6 minutes      Considered fault resilient
99.999             5.3 minutes       Fault tolerant (also called carrier class for PSTN infrastructure)
99.99966           1.8 minutes       Six sigma (often used in manufacturing) [20]
99.9999            31.5 seconds      Six nines
100                0                 Continuous availability
The availability calculations presented here can be used to estimate the contribution of each system to the overall network availability. Although the resulting availability generally yields a high-side estimate, it can flag those portions of a network that can be problematic. More complex statistical models are required to obtain more precise estimates. Markov chain models, which enumerate all possible operating states of each component, can be used. Transitions between states (e.g., between an operational or failure state) are assumed to be probabilistic instead of deterministic, assuming some probability distribution. Availability is then measured by the fraction of time a system is operational.
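As a minimal illustration (a single component with exponential failure and repair times, and arbitrary rates), the steady-state availability of a two-state up/down Markov model reduces to the familiar MTBF/(MTBF + MTTR) form:

```python
lam = 1 / 10_000  # failure rate: one failure per 10,000 operating hours
mu = 1 / 2        # repair rate: repairs average 2 hours
steady_state_availability = mu / (lam + mu)
print(round(steady_state_availability, 6))  # ~0.9998, matching MTBF / (MTBF + MTTR)
```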
The problem with this approach, however, is that it is limited to small numbers of components and states. As both of these grow in number, the problem cannot be solved easily. In this case, other approaches, such as network simulation, can be used. This requires building a computerized model of a network, which often reflects the network’s logical topology. Calibration of the model is required in order to baseline the model’s accuracy.
Availability for a network can be estimated from field data in many ways. A general formula for estimating observed availability, A_o, from a large network is the following:

• Availability is a relative measure—it is how one entity perceives the operation of another. End users ultimately determine availability. It should be computed