
Reliability of Computer Systems and Networks, Part 1


DOCUMENT INFORMATION

Basic information

Title: Reliability of computer systems and networks: fault tolerance, analysis, and design
Author: Martin L. Shooman
Type: Book chapter
Year of publication: 2002
Format:
Number of pages: 29
File size: 175.71 KB


Content



INTRODUCTION

Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design

Martin L. Shooman. Copyright © 2002 John Wiley & Sons, Inc. ISBNs: 0-471-29342-3 (Hardback); 0-471-22460-X (Electronic)


The central theme of this book is the use of reliability and availability computations as a means of comparing fault-tolerant designs. This chapter defines fault-tolerant computer systems and illustrates the prime importance of such techniques in improving the reliability and availability of digital systems that are ubiquitous in the 21st century. The main impetus for complex, digital systems is the microelectronics revolution, which provides engineers and scientists with inexpensive and powerful microprocessors, memories, storage systems, and communication links. Many complex digital systems serve us in areas requiring high reliability, availability, and safety, such as control of air traffic, aircraft, nuclear reactors, and space systems. However, it is likely that planners of financial transaction systems, telephone and other communication systems, computer networks, the Internet, military systems, office and home computers, and even home appliances would argue that fault tolerance is necessary in their systems as well. The concluding section of this chapter explains how the chapters and appendices of this book interrelate.

1.1 WHAT IS FAULT-TOLERANT COMPUTING?

Literally, fault-tolerant computing means computing correctly despite the existence of errors in a system. Basically, any system containing redundant components or functions has some of the properties of fault tolerance. A desktop computer and a notebook computer loaded with the same software and with files stored on floppy disks or other media is an example of a redundant system. Since either computer can be used, the pair is tolerant of most hardware and some software failures.

The sophistication and power of modern digital systems gives rise to a host of possible sophisticated approaches to fault tolerance, some of which are as effective as they are complex. Some of these techniques have their origin in the analog system technology of the 1940s–1960s; however, digital technology generally allows the implementation of the techniques to be faster, better, and cheaper. Siewiorek [1992] cites four other reasons for an increasing need for fault tolerance: harsher environments, novice users, increasing repair costs, and larger systems. One might also point out that the ubiquitous computer system is at present so taken for granted that operators often have few clues on how to cope if the system should go down.

Many books cover the architecture of fault tolerance (the way a fault-tolerant system is organized). However, there is a need to cover the techniques required to analyze the reliability and availability of fault-tolerant systems. A proper comparison of fault-tolerant designs requires a trade-off among cost, weight, volume, reliability, and availability. The mathematical underpinnings of these analyses are probability theory, reliability theory, component failure rates, and component failure density functions.

The obvious technique for adding redundancy to a system is to provide a duplicate (backup) system that can assume processing if the operating (on-line) system fails. If the two systems operate continuously (sometimes called hot redundancy), then either system can fail first. However, if the backup system is powered down (sometimes called cold redundancy or standby redundancy), it cannot fail until the on-line system fails and it is powered up and takes over. A standby system is more reliable (i.e., it has a smaller probability of failure); however, it is more complex because it is harder to deal with synchronization and switching transients. Sometimes the standby element does have a small probability of failure even when it is not powered up. One can further enhance the reliability of a duplicate system by providing repair for the failed system. The average time to repair is much shorter than the average time to failure. Thus, the system will only go down in the rare case where the first system fails and the backup system, when placed in operation, experiences a short time to failure before an unusually long repair on the first system is completed.

Failure detection is often a difficult task; however, a simple scheme called a voting system is frequently used to simplify such detection. If three systems operate in parallel, the outputs can be compared by a voter, a digital comparator whose output agrees with the majority output. Such a system succeeds if all three systems or two of the three systems work properly. A voting system can be made even more reliable if repair is added for a failed system once a single failure occurs.
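As a rough illustration of the voting arithmetic (not a worked example from the text), a two-out-of-three voter succeeds when all three units work or when exactly two of the three work. A minimal sketch in Python, assuming three independent units with the same reliability r:

```python
# Minimal sketch (assumption: three independent units, each with reliability r).
# A 2-out-of-3 voter succeeds if all three units work or exactly two of the three work.

def two_out_of_three(r: float) -> float:
    """Reliability of a triple-modular-redundant (2-out-of-3) voting system."""
    return r**3 + 3 * r**2 * (1 - r)

if __name__ == "__main__":
    for r in (0.90, 0.95, 0.99):
        print(f"unit reliability {r:.2f} -> voter reliability {two_out_of_three(r):.4f}")
```

For r = 0.90 this gives 0.972, higher than a single unit, which is the point of the majority voter.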

Modern computer systems often evolve into networks because of the flexible way computer and data storage resources can be shared among many users. Most networks either are built or evolve into topologies with multiple paths between nodes; the Internet is the largest and most complex model we all use.


If a network link fails and breaks a path, the message can be routed via one or more alternate paths, maintaining a connection. Thus, the redundancy involves alternate paths in the network.

In both of the above cases, the redundancy penalty is the presence of extra systems with their concomitant cost, weight, and volume. When the transmission of signals is involved in a communications system, in a network, or between sections within a computer, another redundancy scheme is sometimes used. The technique is not to use duplicate equipment but increased transmission time to achieve redundancy. To guard against undetected, corrupting transmission noise, a signal can be transmitted two or three times. With two transmissions the bits can be compared, and a disagreement represents a detected error. If there are three transmissions, we can essentially vote with the majority, thus detecting and correcting an error. Such techniques are called error-detecting and error-correcting codes, but they decrease the transmission speed by a factor of two or three. More efficient schemes are available that add extra bits to each transmission for error detection or correction and also increase transmission reliability with a much smaller speed-reduction penalty.
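As a toy illustration of the triple-transmission idea (not from the original text), a per-bit majority vote over three received copies both detects and corrects a single corrupted copy, at the cost of three times the transmission time:

```python
# Minimal sketch (assumption: a message is sent three times as lists of bits over
# a noisy channel). A per-bit majority vote detects and corrects one corrupted copy.

def majority_vote(copy1, copy2, copy3):
    """Return the bitwise majority of three received copies of a message."""
    return [1 if (a + b + c) >= 2 else 0 for a, b, c in zip(copy1, copy2, copy3)]

if __name__ == "__main__":
    sent = [1, 0, 1, 1, 0, 0, 1, 0]
    rx1 = [1, 0, 1, 1, 0, 0, 1, 0]   # received cleanly
    rx2 = [1, 0, 0, 1, 0, 0, 1, 0]   # bit 2 flipped by noise
    rx3 = [1, 0, 1, 1, 0, 1, 1, 0]   # bit 5 flipped by noise
    print(majority_vote(rx1, rx2, rx3) == sent)  # True: both errors corrected
```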

The above schemes apply to digital hardware; however, many of the reliability problems in modern systems involve software errors. Modeling the number of software errors and the frequency with which they cause system failures requires approaches that differ from hardware reliability. Thus, software reliability theory must be developed to compute the probability that a software error might cause system failure. Software is made more reliable by testing to find and remove errors, thereby lowering the error probability. In some cases, one can develop two or more independent software programs that accomplish the same goal in different ways and can be used as redundant programs. The meaning of independent software, how it is achieved, and how partial software dependencies reduce the effects of redundancy are studied in Chapter 5, which discusses software reliability.

Fault-tolerant design involves more than just reliable hardware and software. System design is also involved, as evidenced by the following personal examples. Before a departing flight I wished to change the date of my return, but the reservation computer was down. The agent knew that my new return flight was seldom crowded, so she wrote down the relevant information and promised to enter the change when the computer system was restored. I was advised to confirm the change with the airline upon arrival, which I did. Was such a procedure part of the system requirements? If not, it certainly should have been.

Compare the above example with a recent experience in trying to purchase tickets by phone for a concert in Philadelphia 16 days in advance. On my Monday call I was told that the computer was down that day and that nothing could be done. On my Tuesday and Wednesday calls I was told that the computer was still down for an upgrade, and so it took a week for me to receive a call back with an offer of tickets. How difficult would it have been to print out from memory files seating plans that showed seats left for the next week so that tickets could be sold from the seating plans? Many problems can be avoided at little cost if careful plans are made in advance. The planners must always think "what do we do if ...?" rather than "it will never happen."

This discussion has focused on system reliability: the probability that the system never fails in some time interval. For many systems, it is acceptable for them to go down for short periods if it happens infrequently. In such cases, the system availability is computed for those involving repair. A system is said to be highly available if there is a low probability that a system will be down at any instant of time. Although reliability is the more stringent measure, both reliability and availability play important roles in the evaluation of systems.

1.2 THE RISE OF MICROELECTRONICS AND THE COMPUTER

1.2.1 A Technology Timeline

The rapid rise in the complexity of tasks, hardware, and software is why fault tolerance is now so important in many areas of design. The rise in complexity has been fueled by the tremendous advances in electrical and computer technology over the last 100–125 years. The low cost, small size, and low power consumption of microelectronics and especially digital electronics allow practical systems of tremendous sophistication but with concomitant hardware and software complexity. Similarly, the progress in storage systems and computer networks has led to the rapid growth of networks and systems.

A timeline of the progress in electronics is shown in Shooman [1990, Table K-1]. The starting point is the 1874 discovery that the contact between a metal wire and the mineral galena was a rectifier. Progress continued with the vacuum diode and triode in 1904 and 1905. Electronics developed for almost a half-century based on the vacuum tube and included AM radio, transatlantic radiotelephony, FM radio, television, and radar. The field began to change rapidly after the discovery of the point contact and field effect transistor in 1947 and 1949 and, ten years later in 1959, the integrated circuit.

The rise of the computer occurred over a time span similar to that of microelectronics, but the more significant events occurred in the latter half of the 20th century. One can begin with the invention of the punched card tabulating machine in 1889. The first analog computer, the mechanical differential analyzer, was completed in 1931 at MIT, and analog computation was enhanced by the invention of the operational amplifier in 1938. The first digital computers were electromechanical; included are the Bell Labs' relay computer (1937–40), the Z1, Z2, and Z3 computers in Germany (1938–41), and the Mark I completed at Harvard with IBM support (1937–44). The ENIAC, developed at the University of Pennsylvania between 1942 and 1945 with U.S. Army support, is generally recognized as the first electronic computer; it used vacuum tubes. Major theoretical developments were the general mathematical model of computation by Alan Turing in 1936 and the stored program concept of computing published by John von Neumann in 1946. The next hardware innovations were in the storage field: the magnetic-core memory in 1950 and the disk drive in 1956. Electronic integrated circuit memory came later in 1975. Software improved greatly with the development of high-level languages: FORTRAN (1954–58), ALGOL (1955–56), COBOL (1959–60), PASCAL (1971), the C language (1973), and the Ada language (1975–80). For computer advances related to cryptography, see problem 1.25.

The earliest major computer systems were the U.S. Air Force SAGE air defense system (1955), the American Airlines SABER reservations system (1957–64), the first time-sharing systems at Dartmouth using the BASIC language (1966) and the MULTICS system at MIT written in the PL/I language (1965–70), and the first computer network, the ARPA net, which began in 1969. The concept of RAID fault-tolerant memory storage systems was first published in 1988. The major developments in operating system software were the UNIX operating system (1969–70), the CP/M operating system for the 8086 microprocessor (1980), and the MS-DOS operating system (1981). The choice of MS-DOS to be the operating system for IBM's PC, and Bill Gates' fledgling company as the developer, led to the rapid development of Microsoft.

The first home computer design was the Mark-8 (Intel 8008 microprocessor), published in Radio-Electronics magazine in 1974, followed by the Altair personal computer kit in 1975. Many of the giants of the personal computing field began their careers as teenagers by building Altair kits and programming them. The company then called Micro Soft was founded in 1975 when Gates wrote a BASIC interpreter for the Altair computer. Early commercial personal computers such as the Apple II, the Commodore PET, and the Radio Shack TRS-80, all marketed in 1977, were soon eclipsed by the IBM PC in 1981. Early widely distributed PC software began to appear in 1978 with the Wordstar word processing system, the VisiCalc spreadsheet program in 1979, early versions of the Windows operating system in 1985, and the first version of the Office business software in 1989. For more details on the historical development of microelectronics and computers in the 20th century, see the following sources: Ditlea [1984], Randall [1975], Sammet [1969], and Shooman [1983]. Also see www.intel.com and www.microsoft.com.

This historical development leads us to the conclusion that today one can build a very powerful computer for a few hundred dollars with a handful of memory chips, a microprocessor, a power supply, and the appropriate input, output, and storage devices. The accelerating pace of development is breathtaking, and of course all the computer memory will be filled with software that is also increasing in size and complexity. The rapid development of the microprocessor, in many ways the heart of modern computer progress, is outlined in the next section.

1.2.2 Moore’s Law of Microprocessor Growth

The growth of microelectronics is generally identified with the growth of the microprocessor, which is frequently described as "Moore's Law" [Mann, 2000]. In 1965, Electronics magazine asked Gordon Moore, research director


[Table 1.1: Complexity of Microchips and Moore's Law; the table body is not reproduced in this excerpt.]

TABLE 1.2   Microchip Complexity and Moore's Law Complexity, Assuming a Doubling Period of Two Years

Year       Microchip                Complexity     Moore's Law Complexity
1985.25    80386                       280,000     (2^(2.5/2))  ×    113,507 ≈    269,967
1989.75    80486                     1,200,000     (2^(4.5/2))  ×    269,967 ≈  1,284,185
1993.25    Pentium (P5)              3,100,000     (2^(3.5/2))  ×  1,284,185 ≈  4,319,466
1995.25    Pentium Pro (P6)          5,500,000     (2^(2/2))    ×  4,319,466 ≈  8,638,933
1997.50    Pentium II (P6 + MMX)     7,500,000     (2^(2.25/2)) ×  8,638,933 ≈ 18,841,647
1998.50    Merced (P7)              14,000,000     (2^(3.25/2)) ×  8,638,933 ≈ 26,646,112
1999.75    Pentium III              28,000,000     (2^(1.25/2)) × 26,646,112 ≈ 41,093,922
2000.75    Pentium 4                42,000,000     (2^(1/2))    × 41,093,922 ≈ 58,115,582

Note: This table is based on Intel's data from its Microprocessor Report: http://www.physics.udel.edu/wwwusers.watson.scen103/intel.html.


Moore’s Law, with a doubling every two years Note that there are manyclosely spaced releases with different processor speeds; however, the tablerecords the first release of the architecture, generally at the initial speed.The Pentium P5 is generally called Pentium I, and the Pentium II is a P6with MMX technology In 1993, with the introduction of the Pentium, theIntel microprocessor complexities fell slightly behind Moore’s Law Somesay that Moore’s Law no longer holds because transistor spacing cannot bereduced rapidly with present technologies [Mann, 2000; Markov, 1999]; how-ever, Moore, now Chairman Emeritus of Intel Corporation, sees no funda-mental barriers to increased growth until 2012 and also sees that the physicallimitations on fabrication technology will not be reached until 2017 [Moore,2000]

The data in Table 1.2 is plotted in Fig 1.1 and shows a close fit to Moore’sLaw The three data points between 1997 and 2000 seem to be below the curve;however, the Pentium 4 data point is back on the Moore’s Law line Moore’sLaw fits the data so well in the first 15 years (Table 1.1) that Moore has occu-pied a position of authority and respect at Fairchild and, later, Intel Thus,there is some possibility that Moore’s Law is a self-fulfilling prophecy: that

is, the engineers at Intel plan their new projects to conform to Moore’s Law.The problems presented at the end of this chapter explore how Moore’s Law

is faring in the 21st century
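A small sketch (not from the book) of the doubling rule behind the last column of Table 1.2: each Moore's Law entry is a previous value scaled by 2 raised to the elapsed years divided by the assumed two-year doubling period.

```python
# Minimal sketch (assumption: a two-year doubling period, as in Table 1.2).
# Each Moore's Law entry is a previous entry times 2 ** (elapsed_years / 2).

def moores_law(previous_value: float, elapsed_years: float, doubling_period: float = 2.0) -> float:
    """Project complexity forward from a previous value under a doubling rule."""
    return previous_value * 2 ** (elapsed_years / doubling_period)

if __name__ == "__main__":
    # First row of Table 1.2: 2.5 years after the 113,507 baseline.
    print(round(moores_law(113_507, 2.5)))     # 269967, matching the table's 269,967
    # Last row: 1 year after the Pentium III projection.
    print(round(moores_law(41_093_922, 1.0)))  # 58115582, matching the table's 58,115,582
```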

An article by Professor Seth Lloyd of MIT in the September 2000 issue of Nature explores the fundamental limitations of Moore's Law for a laptop based on the following: Einstein's Special Theory of Relativity (E = mc^2), Heisenberg's Uncertainty Principle, maximum entropy, and the Schwarzschild Radius for a black hole. For a laptop with one kilogram of mass and one liter of volume, the maximum available power is 25 million megawatt hours (the energy produced by all the world's nuclear power plants in 72 hours); the ultimate speed is 5.4 × 10^50 hertz (about 10^43 times the speed of the Pentium 4); and the memory size would be 2.1 × 10^31 bits, which is 4 × 10^30 bytes (1.6 × 10^22 times that for a 256 megabyte memory) [Johnson, 2000]. Clearly, fabrication techniques will limit the complexity increases before these fundamental limitations.

1.2.3 Memory Growth

Memory size has also increased rapidly since 1965, when the PDP-8 minicomputer came with 4 kilobytes of core memory and when an 8 kilobyte system was considered large. In 1981, the IBM personal computer was limited to 640,000 bytes of memory by the operating system's nearsighted specifications, even though many "workaround" solutions were common. By the early 1990s, 4 or 8 megabyte memories for PCs were the rule, and in 2000, the standard PC memory size has grown to 64–128 megabytes. Disk memory has also increased rapidly: from small 32–128 kilobyte disks for the PDP 8e computer in 1970 to a 10 megabyte disk for the IBM XT personal computer in 1982. From 1991 to 1997, disk storage capacity increased by about 60% per year, yielding an eighteenfold increase in capacity [Fisher, 1997; Markoff, 1999]. In 2001, the standard desktop PC came with a 40 gigabyte hard drive.

[Figure 1.1: plot of the Table 1.2 microprocessor complexity data versus year, 1970–2005.]

If Moore's Law predicts a doubling of microprocessor complexity every two years, disk storage capacity has increased by 2.56 times each two years, faster than Moore's Law.


1.2.4 Digital Electronics in Unexpected Places

The examples of the need for fault tolerance discussed previously focused on military, space, and other large projects. There is no less a need for fault tolerance in the home now that electronics and most electrical devices are digital, which has greatly increased their complexity. In the 1940s and 1950s, the most complex devices in the home were the superheterodyne radio receiver with 5 vacuum tubes and early black-and-white television receivers with 35 vacuum tubes. Today, the microprocessor is ubiquitous, and, since a large percentage of modern households have a home computer, this is only the tip of the iceberg. In 1997, the sale of embedded microcomponents (simpler devices than those used in computers) totaled 4.6 billion, compared with about 100 million microprocessors used in computers. Thus computer microprocessors only represent 2% of the market [Hafner, 1999; Pollack, 1999].

The bewildering array of home products with microprocessors includes the following: clothes washers and dryers; toasters and microwave ovens; electronic organizers; digital televisions and digital audio recorders; home alarm systems and elderly medic alert systems; irrigation systems; pacemakers; video games; Web-surfing devices; copying machines; calculators; toothbrushes; musical greeting cards; pet identification tags; and toys. Of course this list does not even include the cellular phone, which may soon assume the functions of both a personal digital assistant and a portable Internet interface. It has been estimated that the typical American home in 1999 had 40–60 microprocessors, a number that could grow to 280 by 2004. In addition, a modern family sedan contains about 20 microprocessors, while a luxury car may have 40–60 microprocessors, which in some designs are connected via a local area network [Stepler, 1998; Hafner, 1999].

Not all these devices are that simple either. An electronic toothbrush has 3,000 lines of code. The Furby, a $30 electronic–robotic pet, has 2 main processors, 21,600 lines of code, an infrared transmitter and receiver for Furby-to-Furby communication, a sound sensor, a tilt sensor, and touch sensors on the front, back, and tongue. In short supply before Christmas 1998, Web site prices rose as high as $147.95 plus shipping! [USA Today, 1998] In 2000, the sensation was Billy Bass, a fish mounted on a wall plaque that wiggled, talked, and sang when you walked by, triggering an infrared sensor.

Hackers have even taken an interest in Furby and Billy Bass. They have modified the hardware and software controlling the interface so that one Furby controls others. They have modified Billy Bass to speak the hackers' dialog and sing their songs.

Late in 2000, Sony introduced a second-generation dog-like robot called Aibo (Japanese for "pal"). With 20 motors, a 32-bit RISC processor, 32 megabytes of memory, and an artificial intelligence program, Aibo acts like a frisky puppy. It has color-camera eyes and stereo-microphone ears, touch sensors, a sound-synthesis voice, and gyroscopes for balance. Four different "personality" modules make this $1,500 robot more than a toy [Pogue, 2001].


What is the need for fault tolerance in such devices? If a Furby fails, you discard it, but it would be disappointing if that were the only sensible choice for a microwave oven or a washing machine. It seems that many such devices are designed without thought of recovery or fault tolerance. Lawn irrigation timers, VCRs, microwave ovens, and digital phone answering machines are all upset by power outages, and only the best designs have effective battery backups. My digital answering machine was designed with an effective recovery mode. The battery backup works well, but it "locks up" and will not function about once a year. To recover, the battery and AC power are disconnected for about 5 minutes; when the power is restored, a 1.5-minute countdown begins, during which the device reinitializes. There are many stories in which failure of an ignition control computer stranded an auto in a remote location at night. Couldn't engineers develop a recovery mode to limp home, even if it did use a little more gas or emit fumes on the way home? Sufficient fault-tolerant technology exists; however, designers have to use it. Fortunately, the cellular phone allows one to call for help!

Although the preceding examples relate to electronic systems, there is no less a need for fault tolerance in mechanical, pneumatic, hydraulic, and other systems. In fact, almost all of us need a fault-tolerant emergency procedure to heat our homes in case of prolonged power outages.

1.3 RELIABILITY AND AVAILABILITY

1.3.1 Reliability Is Often an Afterthought

The attainment of high reliability and availability is very difficult in very complex systems. Thus, a system designer should formulate a number of different approaches to a problem and weigh the pluses and minuses of each design before recommending an approach. One should be careful to base conclusions on an analysis of facts, not on conjecture. Sometimes the best solution includes simplifying the design a bit by leaving out some marginal, complex features. It may be difficult to convince the authors of the requirements that sometimes "less is more," but this is sometimes the best approach. Design decisions often change as new technology is introduced. At one time any attempt to digitize the Library of Congress would have been judged infeasible because of the storage requirement. However, by using modern technology, this could be accomplished with two modern RAID disk storage systems such as the EMC Symmetrix systems, which store more than nine terabytes (9 × 10^12 bytes) [EMC Products-At-A-Glance, www.emc.com]. The computation is outlined in the problems at the end of this chapter.

Reliability and availability of the system should always be two factors that are included, along with cost, performance, time of development, risk of failure, and other factors. Sometimes it will be necessary to discard a few design objectives to achieve a good design. The system engineer should always keep in mind that the design objectives generally contain a list of key features and a list of desirable features. The design must satisfy the key features, but if one or two of the desirable features must be eliminated to achieve a superior design, the trade-off is generally a good one.

1.3.2 Concepts of Reliability

Formal definitions of reliability and availability appear in Appendices A and B; however, the basic ideas are easy to convey without a mathematical development, which will occur later. Both of these measures apply to how good the system is and how frequently it goes down. An easy way to introduce reliability is in terms of test data. If 50 systems operate for 1,000 hours on test and two fail, then we would say the probability of failure, P_f, for this system in 1,000 hours of operation is 2/50, or P_f(1,000) = 0.04. Clearly the probability of success, P_s, which is known as the reliability, R, is given by R(1,000) = P_s(1,000) = 1 − P_f(1,000) = 48/50 = 0.96. Thus, reliability is the probability of no failure within a given operating period. One can also deal with a failure rate, f_r, for the same system that, in the simplest case, would be f_r = 2 failures/(50 × 1,000) operating hours; that is, f_r = 4 × 10^−5 or, as it is sometimes stated, f_r = z = 40 failures per million operating hours, where z is often called the hazard function. The units used in the telecommunications industry are fits (failures in time), which are failures per billion operating hours. More detailed mathematical development relates the reliability, the failure rate, and time. For the simplest case where the failure rate z is a constant (one generally uses λ to represent a constant failure rate), the reliability function can be shown to be R(t) = e^(−λt). If we substitute the preceding values, we obtain

R(1,000) = e^(−4 × 10^−5 × 1,000) = 0.96

which agrees with the previous computation.
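A minimal numerical sketch of this calculation (not from the book), using the constant-failure-rate model R(t) = e^(−λt) and the test data quoted above:

```python
import math

# Minimal sketch (assumptions: the constant-failure-rate model R(t) = exp(-lambda * t)
# and the test data quoted in the text: 2 failures among 50 systems in 1,000 hours).
failures = 2
systems = 50
hours = 1_000

failure_rate = failures / (systems * hours)      # 4e-5 failures per operating hour
reliability = math.exp(-failure_rate * hours)    # R(1,000) under the exponential model

print(failure_rate)           # 4e-05, i.e., 40 failures per million operating hours
print(round(reliability, 2))  # 0.96, matching the 48/50 estimate from the raw data
```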

It is now easy to show that complexity causes serious reliability problems. The simplest system reliability model is to assume that in a system with n components, all the components must work. If the component reliability is R_c, then the system reliability, R_sys, is given by

R_sys(t) = [R_c(t)]^n = [e^(−λt)]^n = e^(−nλt)

Consider the case of the first supercomputer, the CDC 6600 [Thornton, 1970]. This computer had 400,000 transistors, for which the estimated failure rate was then 4 × 10^−9 failures per hour. Thus, even though the failure rate of each transistor was very small, the computer reliability for 1,000 hours would be

R(1,000) = e^(−400,000 × 4 × 10^−9 × 1,000) = 0.20


If we repeat the calculation for 100 hours, the reliability becomes 0.85. Remember that these calculations do not include the other components in the computer that can also fail. The conclusion is that the failure rate of devices with so many components must be very low to achieve reasonable reliabilities. Integrated circuits (ICs) improve reliability because each IC replaces hundreds of thousands or millions of transistors and also because the failure rate of an IC is low. See the problems at the end of this chapter for more examples.
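A short sketch of the series (all-components-must-work) model above, using the CDC 6600 figures quoted in the text (the code itself is an illustration, not the book's):

```python
import math

# Minimal sketch of the series model R_sys(t) = exp(-n * lambda * t), using the
# CDC 6600 figures quoted in the text (400,000 transistors, 4e-9 failures/hour each).

def series_reliability(n_components: int, failure_rate: float, hours: float) -> float:
    """Reliability of n identical components in series under a constant failure rate."""
    return math.exp(-n_components * failure_rate * hours)

if __name__ == "__main__":
    print(round(series_reliability(400_000, 4e-9, 1_000), 2))  # 0.2 (the text's 0.20) for 1,000 hours
    print(round(series_reliability(400_000, 4e-9, 100), 2))    # 0.85 for 100 hours
```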

1.3.3 Elementary Fault-Tolerant Calculations

The simplest approach to fault tolerance is classical redundancy, that is, to have an additional element to use if the operating one fails. As a simple example, let us consider a home computer in which constant usage requires it to be always available. A desktop will be the primary computer; a laptop will be the backup. The first step in the computation is to determine the failure rate of a personal computer, which will be computed from the author's own experience. Table 1.3 lists the various computers that the author has used in the home. There has been a total of 2 failures and 29 years of usage. Since each year contains 8,766 hours, we can easily convert this into a failure rate. The question becomes whether to estimate the number of hours of usage per year or simply to consider each year as a year of average use. We choose the latter for simplicity. Thus the failure rate becomes 2/29 = 0.069 failures per year, and the reliability of a single PC for one year becomes R(1) = e^(−0.069) = 0.933. This means there is about a 6.7% probability of failure each year based on this data.

[Table 1.3: the author's home computers, used above to estimate the PC failure rate. The entries that survive in this excerpt list an 8088 machine with a 10 MB disk; an Intel 386 machine with a 65 MB disk (repackaged in 1990, with new and reused components); and a 386 machine with an 80 MB disk, used 1990–92.]

If we have two computers, both must fail for us to be without a computer. Assuming the failures of the two computers are independent, as is generally the case, then the system failure is the product of the failure probabilities for computer 1 (the primary) and computer 2 (the backup). Using the preceding failure data, the probability of one failure within a year should be 0.067; of two failures, 0.067 × 0.067 = 0.00449. Thus, the probability of having at least one computer for use is 0.9955, and the probability of having no computer at some time during the year is reduced from 6.7% to 0.45%, a decrease by a factor of 15. The probability of having no computer will really be much less since the failed computer will be rapidly repaired.

[Figure 1.2: (a) a tree network connecting the four cities Boston, New York, Philadelphia, and Pittsburgh; (b) a Hamiltonian network connecting the four cities.]
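A minimal sketch of the redundant-pair arithmetic above (an illustration, not the book's code; it assumes independent failures and the 2-failures-in-29-years estimate):

```python
import math

# Minimal sketch (assumptions: independent failures and the failure rate estimated
# in the text from 2 failures in 29 machine-years of home-computer use).
failure_rate_per_year = 2 / 29                  # ~0.069 failures per year
r_single = math.exp(-failure_rate_per_year)     # ~0.933: one-year reliability of one PC
p_fail_single = 1 - r_single                    # ~0.067
p_fail_pair = p_fail_single ** 2                # both the desktop and the laptop fail

print(round(r_single, 3))                  # 0.933
print(round(p_fail_pair, 4))               # 0.0044 (the text rounds 0.067 x 0.067 to 0.00449)
print(round(1 - p_fail_pair, 4))           # 0.9956: probability at least one PC is available
print(round(p_fail_single / p_fail_pair))  # 15: the reduction factor quoted in the text
```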

As another example of reliability computations, consider the primitive computer network as shown in Fig. 1.2(a). This is called a tree topology because all the nodes are connected and there are no loops. Assume that p is the reliability for some time period for each link between the nodes. The probability that Boston and New York are connected is the probability that one link is good, that is, p. The same probability holds for New York–Philadelphia and for Philadelphia–Pittsburgh, but the Boston–Philadelphia connection requires two links to work, the probability of which is p^2. More commonly we speak of the all-terminal reliability, which is the probability that all cities are connected (p^3 in this example) because all three links must be working. Thus if p = 0.9, the all-terminal reliability is 0.729.

The reliability of a network is raised if we add more links so that loops are created. The Hamiltonian network shown in Fig. 1.2(b) has one more link than the tree and has a higher reliability. In the Hamiltonian network, all nodes are connected if all four links are working, which has a probability of p^4. All nodes are still connected if there is a single link failure, which has a probability of three successes and one failure given by p^3(1 − p). However, there are 4 ways for one link to fail, so the probability of one link failing is 4p^3(1 − p). The reliability is the probability that there are zero failures plus the probability that there is one failure, which is given by [p^4 + 4p^3(1 − p)]. Assuming that p = 0.9 as before, the reliability becomes 0.9477, a considerable improvement over the tree network. Some of the basic principles for designing and analyzing the reliability of computer networks are discussed in this book.
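The two all-terminal expressions above are easy to check numerically; a minimal sketch (not from the book), assuming every link works independently with the same reliability p:

```python
# Minimal sketch of the two network examples in the text, assuming each link works
# independently with the same reliability p (p = 0.9 in the example).

def tree_all_terminal(p: float) -> float:
    """All-terminal reliability of the 3-link tree: every link must work."""
    return p ** 3

def hamiltonian_all_terminal(p: float) -> float:
    """All-terminal reliability of the 4-link ring: zero link failures or exactly one."""
    return p ** 4 + 4 * p ** 3 * (1 - p)

if __name__ == "__main__":
    p = 0.9
    print(round(tree_all_terminal(p), 4))         # 0.729
    print(round(hamiltonian_all_terminal(p), 4))  # 0.9477
```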

1.3.4 The Meaning of Availability

Reliability is the probability of no failures in an interval, whereas availability is the probability that an item is up at any point in time. Both reliability and availability are used extensively in this book as measures of performance and "yardsticks" for quantitatively comparing the effectiveness of various fault-tolerant methods. Availability is a good metric to measure the beneficial effects of repair on a system. Suppose that an air traffic control system fails on the average of once a year; we then would say that the mean time to failure (MTTF) was 8,766 hours (the number of hours in a year). If an airline's reservation system went down 5 times in a year, we would say that the MTTF was 1/5 of the air traffic control system, or 1,753 hours. One would say that, based on the MTTF, the air traffic control system was much better; however, suppose we consider repair and calculate typical availabilities. A simple formula for calculating the system availability (actually, the steady-state availability), based on the Uptime and Downtime of the system, is given as follows:
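The excerpt breaks off before the formula itself. The standard steady-state expression it refers to is A = Uptime / (Uptime + Downtime); a minimal sketch with illustrative repair times (the repair-time values below are assumptions, not figures from the text):

```python
# Minimal sketch of steady-state availability, A = uptime / (uptime + downtime).
# The repair times used below are illustrative assumptions, not values from the book.

def steady_state_availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of time the system is up over a long operating period."""
    return uptime_hours / (uptime_hours + downtime_hours)

if __name__ == "__main__":
    # Air traffic control system: one failure per year (MTTF = 8,766 h), assume 1 h to repair.
    print(round(steady_state_availability(8_766, 1), 5))    # 0.99989
    # Reservation system: MTTF = 1,753 h, assume only 0.1 h to repair after each failure.
    print(round(steady_state_availability(1_753, 0.1), 5))  # 0.99994
```

Even though the reservation system fails five times as often, a short repair time can give it a comparable or higher availability, which is the comparison the section is building toward.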
