Software Fault Tolerance Techniques and Implementation
Limits of Liability and Disclaimer of Warranty
Every reasonable attempt has been made to ensure the accuracy, completeness, and correctness of the information contained in this book at the time of writing. However, neither the author nor the publisher, Artech House, Inc., shall be responsible or liable in negligence or otherwise, in respect to any inaccuracy or omission herein. The author and the publisher make no representation that this information is suitable for every application to which a reader may attempt to apply the information. Many of the techniques and theories are still subject to academic debate. The author and Artech House make no warranty of any kind, expressed or implied, including warranties of fitness for a particular purpose, with regard to the information contained in this book, all of which is provided "as is." Without derogating from the generality of the foregoing, neither the author nor the publisher shall be liable for any direct, indirect, incidental, or consequential damages or loss caused by or arising from any information or advice, inaccuracy, or omission herein. This work is published with the understanding that the author and Artech House are supplying information, but are not attempting to render engineering judgment or other professional services.
Laura L. Pullum
Artech House, Boston, London
www.artechhouse.com
Pullum, Laura.
Software fault tolerance techniques and implementation / Laura Pullum.
p. cm. - (Artech House computing library)
Includes bibliographical references and index.
ISBN 1-58053-137-7 (alk. paper)
1. Fault-tolerant computing. 2. Computer software - Reliability.
I. Title. II. Series.
Software fault tolerance techniques and implementation. - (Artech House computing library)
1. Computer software - Development. 2. Software failures.
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
International Standard Book Number: 1-58053-137-7
Library of Congress Catalog Card Number: 2001035915
1.2 Organization and Intended Use
1.3 Means to Achieve Dependable Software
1.3.1 Fault Avoidance or Prevention
1.5.2 Information or Data Redundancy
2.3.2 Output Types and Related Data Re-expression
2.5 Architectural Structure for Diverse Software
2.6 Structure for Development of Diverse Software
3 Design Methods, Programming Techniques,
3.1.1 Similar Errors and a Lack of Diversity
3.3 Dependable System Development Model and
3.3.3 Design Paradigm for N-Version Programming
4.2.3 N-Version Programming Issues and Discussion
4.3.3 Distributed Recovery Block Issues and Discussion
4.4 N Self-Checking Programming
4.4.3 N Self-Checking Programming Issues and Discussion
4.5.3 Consensus Recovery Block Issues and Discussion
4.6.3 Acceptance Voting Issues and Discussion
4.7.3 Consensus Recovery Block, Recovery Block Technique, and N-Version Programming
4.7.4 Acceptance Voting, Consensus Recovery Block, Recovery Block Technique, and N-Version
5 Data Diverse Software Fault Tolerance Techniques
5.2.1 N-Copy Programming Operation
5.2.3 N-Copy Programming Issues and Discussion
5.3.2 Two-Pass Adjudicators and Multiple Correct Results
5.3.4 Two-Pass Adjudicator Issues and Discussion
6.1.1 N-Version Programming with Tie-Breaker and
6.1.2 N-Version Programming with Tie-Breaker and
6.3 Data-Driven Dependability Assurance Scheme
6.4.1 Self-Configuring Optimal Programming Operation
6.4.2 Self-Configuring Optimal Programming Example
6.4.3 Self-Configuring Optimal Programming Issues and
7 Adjudicating the Results
The scope, complexity, and pervasiveness of computer-based and controlled systems continue to increase dramatically. The consequences of these systems failing can range from the mildly annoying to catastrophic, with serious injury occurring or lives lost, human-made and natural systems destroyed, security breached, businesses failed, or opportunities lost. As software assumes more of the responsibility of providing functionality and control in systems, it becomes more complex and more significant to the overall system performance and dependability.
It would be ideal if the processes by which software is conceptualized, created, analyzed, and tested had advanced to the state that software is developed without errors. Given the current state-of-the-practice, fewer errors are introduced, but not all errors are prevented. So even if we have the best people and use the best practices and tools, we are still imperfect beings, and it would be very risky to assume that the software we develop is error-free. This book examines the means to protect against software design faults and tolerate the operational effects of these introduced imperfections.
Chapter 1 provides definitions of several basic terms and concepts and a proposed reading guide. Chapter 2 presents various means of structuring redundancy so that it can effect software fault tolerance. Chapter 3 presents programming practices used in several software fault tolerance techniques, along with common problems and issues faced by various approaches to software fault tolerance.
The essence of this book is the presentation of the software fault tolerance techniques themselves. Design diverse techniques are presented in Chapter 4, data diverse techniques in Chapter 5, and other techniques in Chapter 6. The decision mechanisms used with many of the techniques are presented in Chapter 7.
This book may be used as a reference for researchers or practitioners. It may also be used as a textbook or reference for a graduate-level software engineering course or for a course on software fault tolerance. A proposed navigational guide to reading the book is provided in Figure 1.1, Section 1.2.
Software fault tolerance is not a panacea for all our software problems. Since, at least for the near future, software fault tolerance will primarily be used in critical systems, it is even more important to emphasize that "fault tolerant" does not mean "safe," nor does it cover the other attributes comprising dependability, such as reliability, availability, confidentiality, integrity, and maintainability (as none of these covers fault tolerance). Each must be designed-in and their, at times conflicting, characteristics analyzed. Poor requirements analysis will yield poor software in most cases. Simply applying a software fault tolerance technique prior to testing or fielding a system is not sufficient. Software due diligence is required!
I am grateful to the staff at Artech House Publishers and to the reviewers for their encouragement and support during the process of writing and producing this book.
I would be happy to hear from readers who have updated research findings, implementation examples, new techniques, or other information that might enhance the usefulness of this book in any future updates. Such comments and suggestions can be sent to me via e-mail at pullum@mindspring.com.
Introduction
Computer-based systems have increased dramatically in scope, complexity, and pervasiveness, particularly in the past decade. Most industries are highly dependent on computers for their basic day-to-day functioning. Safe and reliable software operation is a significant requirement for many types of systems, for instance, in aircraft and air traffic control, medical devices, nuclear safety, petrochemical control, high-speed rail, electronic banking and commerce, automated manufacturing, military and nautical systems, for aeronautics and space missions, and for appliance-type applications such as automobiles, washing machines, temperature control, and telephony, to name a few. The cost and consequences of these systems failing can range from mildly annoying to catastrophic, with serious injury occurring or lives lost, systems (both human-made and natural) destroyed, security breached, businesses failed, or opportunities lost. As software assumes more of the responsibility for providing functionality in these systems, it becomes more complex and more significant to the overall system performance and dependability.
Ideally, the processes by which software is conceptualized, created, analyzed, and tested would have advanced to the point where software could be developed without errors. The current state-of-the-practice is such that fewer errors are introduced, but unfortunately not all errors are prevented. Even if the best people, practices, and tools are used, it would be very risky to assume the software developed is error-free. There may also be cases in which an error, found late in the system's life cycle and perhaps prohibitively expensive to repair, is knowingly allowed to remain in the system.
Examples of events, with a range of consequences, in which software is thought to be a contributing factor are briefly noted below. Additional examples of reported software-related accidents and incidents are related by Peter G. Neumann in Computer-Related Risks [1] (Chapter 2) and in the archives of the Internet Risks Forum he moderates, by Nancy G. Leveson in Safeware [2] (appendixes), and by Debra S. Herrmann in Software Safety and Reliability [3] (Chapter 1).
• The aerospace industry has unique challenges and takes exceptional care in software development. Despite the care taken, several software-related incidents have attracted widespread attention. A few examples include: problems in the backup tracking software delayed the launch of Atlantis (STS-36) for three days [4]; software on the space shuttle Endeavour (STS-49) effectively rounded near-zero values to zero, causing problems when attempting rendezvous with Intelsat 6 [5-7]; and an Apollo 11 software flaw made the moon's gravity repulsive rather than attractive [1, 8].
• In January 1990, the AT&T system suffered a nine-hour United States-wide blockade [9] when one switch experienced abnormal behavior and attempted recovery. Because of a flaw in recovery-recognition software (in all 114 switches) and a network design that permitted propagation of the effects, the problem spread to all switches.
• During the Persian Gulf War, clock drift in the Patriot system caused it to miss a Scud missile that hit an American barracks in Dhahran. The missile hit killed 29 people and injured 97 others. The clock drift was reportedly caused by the software's use of two different and unequal representations (24-bit and 48-bit) of the value 0.1 [10-11]. As with most complex systems, the source of the resulting problem was multifaceted, in this case with software one of several problem sources.
• Several Airbus A320 problems (e.g., [12-16]) have been initially blamed on the pilots and their skills in handling anomalous situations. However, serious questions have been raised about the role software may have played in these incidents.
• Six known accidental massive radiation overdoses by the Therac-25 radiation therapy system had software error as their precipitating event. A thorough account of the Therac-25 accidents is provided in [17].
• A software problem caused radiation safety doors in the Sellafield, United Kingdom, nuclear reprocessing plant to be opened accidentally [18].
• A recent report outlined the impact of major system outages on various businesses, noting that the cost for a brokerage is $6.5 million per hour, the cost per hour for a credit-card authorization system is $2.6 million, and for an automated teller machine, $14,500 in automatic teller machine fees [19].
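The Patriot item in the list above turns on a small arithmetic point: 0.1 has no finite binary expansion, so any fixed-width representation of it is slightly off, and two different widths are off by different amounts. The following is a minimal sketch of that effect, not a reconstruction of the Patriot software. It assumes the value is truncated to a fixed number of fractional bits (24 and 48 here, mirroring the widths quoted above); the helper names and the scenario of 100 hours of uptime at ten ticks per second are illustrative assumptions.

```python
import math
from fractions import Fraction

def truncate_to_bits(value: Fraction, frac_bits: int) -> Fraction:
    """Truncate value to frac_bits binary fractional digits (assumed fixed-point layout)."""
    scale = 2 ** frac_bits
    return Fraction(math.floor(value * scale), scale)

def accumulated_drift(frac_bits: int, ticks: int) -> Fraction:
    """Exact clock error, in seconds, after counting `ticks` tenths of a second."""
    tenth = Fraction(1, 10)
    return ticks * (tenth - truncate_to_bits(tenth, frac_bits))

# Hypothetical uptime: 100 hours, one tick every 0.1 s.
TICKS_IN_100_HOURS = 100 * 3600 * 10

drift_24 = accumulated_drift(24, TICKS_IN_100_HOURS)
drift_48 = accumulated_drift(48, TICKS_IN_100_HOURS)

if __name__ == "__main__":
    print(f"24-bit drift: {float(drift_24):.6f} s")
    print(f"48-bit drift: {float(drift_48):.12f} s")
    print(f"mismatch:     {float(drift_24 - drift_48):.6f} s")
```

Exact rational arithmetic is used so that the sketch measures only the representation error being illustrated, and not additional floating-point rounding in the demonstration itself.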
Increasing the dependability of software presents some unique challenges when compared to traditional hardware systems. Hardware faults are primarily physical faults, which can be characterized and predicted over time. Software does not physically wear out, burn out, or otherwise physically deteriorate with time (although it can be argued that the value of specific instances of data and software may degrade over time). Software has only logical faults, which are difficult to visualize, classify, detect, and correct. Software faults may be traced to incorrect requirements (where the software matches the requirements, but the behavior specified in the requirements is not appropriate) or to the implementation (software design and coding) not satisfying the requirements. Changes in operational usage or incorrect modifications may introduce new faults. To protect against these faults, we cannot simply add redundancy, as is typically done for hardware faults, because doing so will simply duplicate the problem. So, to provide protection against these faults, we turn to software fault tolerance.
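To see why plain replication does not help against a design fault, consider the following minimal sketch. Everything in it is hypothetical: `average` is an invented routine with a deliberately seeded fault (it divides by a hard-coded 2 instead of the number of inputs), and `majority` is a simple illustrative voter. Three identical copies agree unanimously on the wrong answer, so the voter has nothing to detect.

```python
from collections import Counter

def average(values):
    """Hypothetical routine with a seeded design fault: divides by 2, not len(values)."""
    return sum(values) / 2

def majority(results):
    """Return the result reported by more than half of the copies, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

def run_replicated(copies, inputs):
    """Run every copy on the same inputs and vote on the results."""
    return majority([copy(inputs) for copy in copies])

inputs = [3, 6, 9]
voted = run_replicated([average, average, average], inputs)  # all copies share the fault
correct = sum(inputs) / len(inputs)

if __name__ == "__main__":
    print("voted result:  ", voted)
    print("correct result:", correct)
```

With exactly two inputs the seeded fault stays dormant and the copies happen to return the right value, which is part of what makes logical faults hard to find by testing alone.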
1.1 A Few Definitions
To provide additional basis for the discussions in the remainder of this book, a few basic definitions are provided.
A fault is the identified or hypothesized cause of an error [20], sometimes called a "bug." It can be viewed as simply the consequence of a failure of some other system (including the developer) that has delivered or is now delivering a service to the given system [21]. An active fault is one that produces an error.
An error is part of the system state that is liable to lead to a failure [21]. It can be unrecognized as an error (i.e., latent) or detected. An error may propagate, that is, produce other errors. Faults are known to be present when errors are detected.
A failure occurs when the service delivered by the system deviates from the specified service, otherwise termed an incorrect result [21]. This implies that the expected service is described, typically by a specification or set of requirements.
So, with software fault tolerance, we want to prevent failures by tolerating faults whose occurrences are known when errors are detected. The cycle … failure → fault → error → failure → fault … indicates their general causal relationship. The causal order is maintained; however, the generality is exhibited when, for example, an error leads to a fault without an observed failure (if observation capabilities are not in place or are inadequate). Another example of the generality is when one or more errors occur before a failure due to those errors occurs. The classic definition [22, 23] of software fault tolerance is: using a variety of software methods, faults (whose origins are related to software) are detected and recovery is accomplished.
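The terms defined above can be tied to a toy program. In this hedged sketch, all names are hypothetical. The fault is a wrong return value in `clamp_faulty`, the error is the out-of-range value it leaves in the program state, and a failure would occur if that value were delivered as the result. `tolerant_clamp` detects the error with a simple range check and recovers by switching to an alternate routine, in the spirit of the classic definition. It is an illustration only, not one of the specific techniques described later in the book.

```python
def clamp_faulty(x, low, high):
    """Primary routine with a seeded fault: the upper bound is never enforced."""
    if x < low:
        return low
    if x > high:
        return x  # fault: should return high
    return x

def clamp_alternate(x, low, high):
    """Separately written alternate, used only for recovery."""
    return max(low, min(x, high))

def tolerant_clamp(x, low, high):
    """Detect an erroneous result with a range check; recover via the alternate."""
    result = clamp_faulty(x, low, high)
    if low <= result <= high:  # error detection
        return result, "primary"
    return clamp_alternate(x, low, high), "recovered"

if __name__ == "__main__":
    for sample in (5, -3, 42):
        print(sample, tolerant_clamp(sample, 0, 10))
```

For inputs at or below the upper bound the fault is never activated, so no error arises. Only an input above the bound makes the fault active, and the range check then keeps the error from becoming a failure.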
1.2 Organization and Intended Use
This book is organized as follows. The remainder of this chapter describes how software fault tolerance is an important means to achieve dependable software, the types of recovery used in fault tolerant software, and the types of redundancy used in software fault tolerance techniques. Redundancy alone is not sufficient for detecting and tolerating software faults. It requires some form of diversity to achieve software fault tolerance. Chapter 2 presents various means of structuring redundancy, for example, via forms of diversity, so that it can effect software fault tolerance. Some programming methods are used in several different software fault tolerance techniques or are simply important enough to discuss apart from the techniques in which they are used. These programming methods are discussed in Chapter 3, along with common problems and issues faced by various approaches to software fault tolerance.
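As a rough illustration of redundancy combined with diversity, the sketch below runs three differently written variants of an integer square root on the same input and takes a majority vote. It is a minimal sketch under stated assumptions, only loosely in the spirit of the design diverse techniques of Chapter 4: the variant names, the seeded fault in `isqrt_rounding`, and the `vote` helper are all invented for illustration.

```python
import math
from collections import Counter

def isqrt_search(n):
    """Variant 1: linear search for the largest r with r*r <= n."""
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def isqrt_newton(n):
    """Variant 2: integer Newton iteration."""
    if n < 2:
        return n
    x = n
    y = (x + n // x) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x

def isqrt_rounding(n):
    """Variant 3, with a seeded fault: rounds to nearest instead of truncating."""
    return round(math.sqrt(n))

def vote(results):
    """Majority voter: return the value more than half of the variants agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) / 2:
        return value
    raise RuntimeError("no majority among variant results")

def run_diverse(n):
    """Run all three variants on n and adjudicate by majority."""
    return vote([variant(n) for variant in (isqrt_search, isqrt_newton, isqrt_rounding)])

if __name__ == "__main__":
    print("voted isqrt(15):", run_diverse(15))
```

Because the variants were written differently, the seeded fault in one of them produces a result the other two do not share, and the voter can outvote it. This depends on the variants not failing in the same way on the same input, which is an assumption of the sketch rather than a guarantee.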
The essence of this book is the presentation of the software fault tolerance techniques themselves, including the way they operate, usage scenarios, examples, and issues. The techniques are categorized and discussed according to type of diversity: design diverse techniques in Chapter 4, data diverse techniques in Chapter 5, and the catch-all "other" techniques in Chapter 6. Just as we were able to extract some issues and programming methods common to several software fault tolerance techniques, we can also extract and discuss the decision mechanisms (DM) used with many of the techniques. This is done in Chapter 7.