Code Design for Dependable Systems: Theory and Practical Applications
Eiji Fujiwara
Tokyo Institute of Technology
A JOHN WILEY & SONS, INC., PUBLICATION
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at
1.3 Error Recovery Techniques for Dependable Systems
1.4 Code Design Process for Dependable Systems
3.1 Minimum-Weight & Equal-Weight-Row Codes
4 Codes for High-Speed Memories I: Bit Error Control Codes
4.1 Modified Hamming SEC-DED Codes
4.2 Modified Double-Bit Error Correcting BCH Codes
4.3 On-Chip ECCs
Exercises
References
5.1 Single-Byte Error Correcting (SbEC) Codes
5.2 Single-Byte Error Correcting and Double-Byte Error Detecting (SbEC-DbED) Codes
5.3 Single-Byte Error Correcting and Single p-Byte within a Block Error Detecting (SbEC-Spb/BED) Codes
Exercises
References
6 Codes for High-Speed Memories III: Bit / Byte Error
6.1 Single-Byte / Burst Error Detecting SEC-DED Codes
6.2 Single-Byte Error Correcting and Double-Bit Error Detecting
References
7 Codes for High-Speed Memories IV: Spotty Byte Error
7.1 Spotty Byte Errors
7.2 Single Spotty Byte Error Correcting (St/bEC) Codes
7.3 Single Spotty Byte Error Correcting and Single-Byte Error Detecting (St/bEC-SbED) Codes
7.4 Single Spotty Byte Error Correcting and Double Spotty Byte Error Detecting (St/bEC-Dt/bED) Codes
7.5 A General Class of Spotty Byte Error Control Codes
Exercises
References
8.1 Parallel Decoding Burst Error Control Codes
Error Detecting (SbEC-(Sb + S)ED) Codes
8.2 Parallel Decoding Cyclic Burst Error Correcting Codes
8.3 Transient Behavior of Parallel Encoder / Decoder Circuits of Error Control Codes
Exercises
References
9.1 Error Location of Faulty Packages and Faulty Chips
9.2 Block Error Locating (Sb/pbEL) Codes
9.3 Single-Bit Error Correcting and Single-Block Error Locating (SEC-Sb/pbEL) Codes
9.4 Single-Bit Error Correcting and Single-Byte Error Locating (SEC-Se/bEL) Codes
9.5 Burst Error Locating Codes
9.6 Code Conditions for Error Locating Codes
Exercises
References
10 Codes for Unequal Error Control / Protection (UEC / UEP)
10.1 Error Models for UEC Codes and UEP Codes
10.2 Fixed-Byte Error Control UEC Codes
10.3 Burst Error Control UEC / UEP Codes
10.4 Application of the UEC / UEP Codes
Exercises
References
11.1 Tape Memory Codes
11.2 Magnetic Disk Memory Codes
11.3 Optical Disk Memory Codes
13.1 M-Ary Asymmetric Errors in Data Entry Systems
13.2 M-Ary Asymmetric Symbol Error Correcting Codes
13.3 Nonsystematic M-Ary Asymmetric Error Correcting Codes with Deletion / Insertion / Adjacent-Symbol-Transposition Error Correction Capabilities
13.4 Codes for Two-Dimensional Matrix Symbols
Exercises
References
14.1 MDS Array Codes Tolerating Multiple-Disk Failures
14.2 Codes for Distributed Storage Systems
Exercises
References
Error control coding theory has been studied for over half a century, and it is still going stronger than ever. The most recent examples are the turbo codes and the low-density parity-check (LDPC) codes. Also, during these years, error control codes have been extensively applied to various digital systems, such as computer and communication systems, as an essential technique to improve system reliability. High-speed parallel decoding is essential in modern high-speed dependable systems and semiconductor memories. Error control codes suitable for high-speed parallel decoding are regularly expressed and studied in parity-check matrices. For highly reliable communication systems and disk memory systems, on the other hand, serial decoding based on linear feedback shift registers (LFSRs) is used. Error control codes for serial decoding are typically expressed and studied using generator polynomials. In this book, the former codes are called matrix codes and the latter polynomial codes. So far, traditional coding theory has been studied mainly using code generator polynomials. We emphasize that the linear codes expressed in polynomials can always be expressed using parity-check matrices, but the converse is not always possible. This book focuses specifically on the design theory for matrix codes and their practical applications, which has been seriously lacking in the traditional scope of coding theory investigations.
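As a small illustration of the polynomial-to-matrix direction, the following sketch rewrites a polynomial code as a matrix code. It assumes the cyclic (7,4) Hamming code with generator polynomial g(x) = x^3 + x + 1, which is chosen here only as a familiar example and is not taken from this preface. Column i of the parity-check matrix is x^i mod g(x), and a single-bit error is corrected by matching the syndrome against the columns. All function names and the bit ordering are hypothetical.

```python
# Illustrative assumption: cyclic (7,4) Hamming code, g(x) = x^3 + x + 1.
# Polynomials over GF(2) are stored as integers: bit i is the coefficient of x^i.

G_POLY = 0b1011  # g(x) = x^3 + x + 1
N = 7            # code length


def poly_mod(value: int, modulus: int) -> int:
    """Remainder of GF(2) polynomial division of value by modulus."""
    mod_degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= mod_degree:
        shift = value.bit_length() - 1 - mod_degree
        value ^= modulus << shift
    return value


def parity_check_columns() -> list[int]:
    """Column i of the parity-check matrix H is x^i mod g(x)."""
    return [poly_mod(1 << i, G_POLY) for i in range(N)]


def syndrome(word: list[int], columns: list[int]) -> int:
    """XOR of the columns of H selected by the 1 bits of the word."""
    s = 0
    for bit, col in zip(word, columns):
        if bit:
            s ^= col
    return s


def correct_single_error(word: list[int], columns: list[int]) -> list[int]:
    """Flip the bit whose column of H equals the nonzero syndrome."""
    s = syndrome(word, columns)
    fixed = list(word)
    if s != 0:
        fixed[columns.index(s)] ^= 1
    return fixed


H_COLUMNS = parity_check_columns()
CODEWORD = [1, 1, 0, 1, 0, 0, 0]  # coefficients of g(x) itself, x^0 first
```

Because every column is compared with the syndrome independently, the correction step maps naturally onto parallel hardware, which is the property the matrix form is meant to expose.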
In dependable computer systems, many types of error control codes have been applied to memory subsystems and processors in order to achieve efficient and reliable data processing and storage. Some systems could never have been realized without the application of cost-effective error control codes, mainly very large capacity, high-speed semiconductor memories, very high-density magnetic disk memories, and recent optical disk memories such as compact disc (CD) and digital versatile disc (DVD). More recently mobile digital systems have gained wide popularity, and these systems are sometimes operated under unfavorable environments where electromagnetic noise, α-particles, and cosmic rays abound. Modern high-speed, high-density VLSI processors and semiconductor memories are operated at low supply voltage levels and thus low logic signal swing; they therefore are vulnerable to external disturbances that can induce transient errors. Transient errors are a dominant concern in today’s digital systems. Error control coding is the most efficient and effective way to tolerate these errors, and is expected to become ever more important in future VLSI systems.
The challenge is to choose among many different applications of error control codes. Often a new application calls for a new type of code that can be developed most efficiently to fit a new requirement. Matrix codes are far more flexible compared with polynomial codes. Parity-check matrices can be manipulated easily. Some known examples are column vector exchange in a matrix, the odd-weight-column matrix, the low-density matrix, and the rotational matrix form. These manipulations of matrices have yielded many useful codes for important applications. Polynomial codes, on the other hand, cannot be manipulated in a similar way for code design fine-tuning. The main reason is that the matrix code is capable of expressing various types of code functions and thus allows for very high design flexibility. In practice, such flexibility has led to excellent code designs, satisfying the various reliability requirements of the dependable systems.
This book builds on the author’s previous book, Error Control Coding for Computer Systems (Prentice-Hall, 1989), and it likewise aims at introducing the latest developments and advances in the field. However, as was mentioned earlier, the book is unique in its concentration on the treatment of matrix codes. Unlike any existing coding theory books, this book will not burden the reader with unnecessary background on polynomial algebra. The book includes only the mathematical background essential for the understanding of matrix code construction and design. Such an arrangement frees up space for the description of some fine artistry of matrix code design strategies and techniques. Matrix code designs are presented with respect to practical applications, such as high-speed semiconductor memories, mass memories of disks and tapes, logic circuits and systems, data entry systems, and distributed storage systems. Also new classes of matrix codes, such as error locating codes, spotty byte error control codes, and unequal error control codes, are presented in their practical settings. The new parallel decoding algorithm of the burst error control codes is demonstrated and further extended to the generalized parallel decoding of the codes.
Chapter 1 provides background and a preview of material covered in the subsequent chapters. First, it defines faults, errors, and failures and explains the many types of faults and errors. This is the core knowledge needed to understand what constitutes a good code. To design an efficient and effective code for a given application, it is important first to know what types of errors matter, how much the system’s reliability can be improved by coding techniques, and what the constraints are on check-bit length, decoding speed, and so forth. The matrix code designing procedure is laid out in this chapter from this standpoint. The chapter concludes with a brief introduction to the competitors of the coding technique in dependable systems, namely conventional error recovery techniques and / or error masking techniques.
Chapter 2 provides the fundamental mathematical background and coding theory necessary to understand the later chapters. The chapter covers the matrix representations of well-known error control codes, such as simple parity-check codes, cyclic codes, Hamming codes, BCH codes, Reed-Solomon codes, and Fire codes. These codes are manipulated in the later chapters in examples of how matrix codes satisfy the system requirements for given applications.
Chapter 3 discusses the matrix code design techniques related to high-speed decoding, area efficient encoding / decoding hardware, modularized organization of encoding / decoding circuits, and so forth.
Chapters 4, 5, 6, and 7 cover topics on matrix code design for high-speed semiconductor memories. Depending on the application, the matrix code can be designed to handle bit or byte errors and in some cases a mixture of both bit and byte errors. The latter are typical errors found in large capacity semiconductor memory systems using high-density RAM chips. Chapter 4 discusses bit error control codes, such as the modified Hamming single-bit error correcting and double-bit error detecting (SEC-DED) codes, the modified double-bit error correcting BCH codes, and the memory on-chip codes. For the memory systems using byte-organized RAM chips, single-byte error correcting (SbEC) codes, and single-byte error correcting and double-byte error detecting (SbEC-DbED) codes, are presented in Chapter 5. The codes for the mixed type of bit errors and byte errors are presented in Chapter 6. Among them, a byte error detecting SEC-DED code, developed by the author and his colleague in the 1980s, has found practical application in recent workstations. Chapter 7 presents a relatively new class of byte error control codes: spotty byte error control codes. This class of codes has been specifically designed to fit the large capacity memory systems that use high-density RAM chips with wide input / output data of 8, 16, and 32 bits. Also a general class of these codes with minimum Hamming distance-d and with maximum distance separable (MDS) characteristics is presented in this chapter. The well-known Reed-Solomon codes are included in these generalized codes, which makes them practically and theoretically important. They will be quite useful for future applications.
Chapter 8 presents the generalized parallel decoding algorithm for error control codes. Initially developed for burst error control codes, this new decoding algorithm includes the conventional parallel decoding algorithm of the existing bit / byte error correcting codes. The generalized algorithm can also be used for multiple burst or byte error correcting codes. The chapter takes this new algorithm and demonstrates how the parallel decoding method can be implemented in combinational circuits. In addition the chapter addresses the important problem of glitches in parallel decoding circuits. Parallel decoding circuits depend heavily on large exclusive-OR tree circuits, which are well known to readily produce glitches. The glitches are the unwanted logic signal transitions that can generate, propagate, and accumulate in the logic circuits and then induce noise and instability on the power supply lines. The chapter explains why the glitches are generated, how they are propagated and accumulated in the circuits, and how to reduce these undesirable effects.
Chapter 9 presents a new class of codes, namely error locating codes. Error location is an error control function lying midway between error correction and error detection. An error locating code will indicate where the errors lie but not the precise erroneous digit positions. This type of code is useful for identifying the faulty block, faulty package, or faulty chip, and thus enables fault isolation and reconfiguration. The chapter includes practical codes for memory systems to use in locating faulty packages / cards. It also provides a practical code for locating faulty chips. Both codes have the capability to correct single-bit errors, even though the codes are mainly designed for identifying the faulty areas. In addition, burst error locating codes are introduced here. The chapter concludes with a precise analysis of error locating codes with an emphasis on the code conditions and the relation between error locating codes and error correcting / detecting codes.
Chapter 10 shows yet another new class of unequal error control (UEC) codes. In many applications certain positions in a word have higher error rates or require more protection. The UEC codes protect the area in a word having a higher error rate with stronger error control code functions, and the area having a lower error rate with weaker error control functions. In other words, this type of code has different code functions within a code word, depending on the area and the associated error rate. The chapter provides optimal codes with some UEC code functions. Similar codes exist in unequal error protection (UEP) codes. This type of code protects the valuable information part of a word against errors. For example, control information or address information in communication messages or computer words, or similarly pointer information in the database words, must be better protected from errors than the other parts. The chapter provides some UEP codes that protect against burst errors and also against single-bit errors. The chapter includes examples of UEC and UEP codes used in holographic memories and lossless compressed data.
Chapters 11, 12, 13, and 14 present the codes for some specific systems, namely mass memories such as magnetic tapes and disks, logic circuits and systems, data entry systems, and distributed storage systems. Chapter 11 covers the codes designed specifically for mass memories such as tape memories, magnetic disk memories, and recent optical disk memories. The various modified types of Reed-Solomon codes and adaptive parity codes are presented for the tape memories and the disk memories. Codes for recent CDs and DVDs are also introduced. Chapter 12 discusses error checking for logic systems using efficient error detecting codes. An important concept of self-checking is first introduced. The chapter then clarifies how the errors in the logic circuits and systems are detected, how the error detecting checker circuits are implemented, how the errors in the checker itself are detected, and how the self-testing checkers are implemented. In particular, a self-checking ALU is presented using parity-based codes, and self-checking design for processor systems is also demonstrated. Chapter 13 presents the codes for data entry systems. In these systems nonbinary symbols are routinely used, for example in character recognition systems and in recent two-dimensional symbols. The chapter characterizes the errors that occur in these nonbinary symbols as asymmetric errors and presents some asymmetric error control codes. These codes are basically nonlinear, and are designed by using elements in newly defined rings. Also nonsystematic nonbinary asymmetric error correcting codes are designed based on a multilevel coding method and a set-partitioning algorithm, and QR codes and two-dimensional unidirectional clustered error correcting codes are presented for two-dimensional matrix symbols. Chapter 14 provides the codes for distributed storage systems connected via networks. Codes for recent RAID systems that tolerate two disk failures are introduced, and then an efficient error recovery scheme from multiple disk failures in the distributed storage system is discussed and is implemented by using block design in combinatorial theory.
The introductory portion of the book, Chapters 1 and 2, and the parts of Chapters 3, 4, 5, 6, 8, 9, and 10, can be used as the text for a course at an advanced undergraduate level or for an introductory one-semester course at the graduate level. For graduate classes and advanced students who have the background in mathematics, logic circuits, and rudimentary knowledge of codes, the book can be used as a whole with selected topics from each of the chapters. Practicing engineers / designers will find useful discussions in Chapters 6 to 14, which demonstrate, in detail, the procedure of designing sophisticated codes in practical form. For the practicing engineer, Chapter 2 presents mathematics and coding theory, not in strict form but in introductory form, which is necessary in understanding the later chapters. Many examples, figures, exercises, and references are provided in each chapter of the book. Many attractive codes with practical code parameters and their evaluation data on decoding hardware and error detection capabilities are fully demonstrated. These can be used by practicing engineers as a practical guide and handy reference.
My sincere appreciation goes to many people. Professors Jack K. Wolf of the University of California San Diego, Hideki Imai of the University of Tokyo, T. R. N. Rao of the University of Louisiana Lafayette, and Bella Bose of Oregon State University encouraged me to continue my research on code design theory and to write this book. Emeritus professor Yoshihiro Tohma of Tokyo Institute of Technology, Professors Takashi Nanya of the University of Tokyo, Hideo Ito of Chiba University, and Jien-Chung Lo of the University of Rhode Island gave important suggestions and valuable discussions on research for dependable systems. Recently Professor Lo also provided valuable comments on the final book and an important discussion on glitches (i.e., logical noise) that are generated, propagated, and accumulated in large exclusive-OR tree circuits in the parallel decoder of the codes. The author’s NTT colleagues, Dr. Shigeo Kaneda, now professor at Doshisha University, and Dr. Kazumitsu Matsuzawa, now professor at Kanagawa University, collaborated to develop practical codes for computer memories. Dr. Masato Kitakami, now associate professor at Chiba University, Dr. Mitsuru Hamada, now associate professor at Tamagawa University, Dr. Shuxin Jiang, Dr. Saowapa Kiattichai, Dr. Hongyuang Chen, Dr. Kazuteru Namba, Dr. Ganesan Umanesan, Dr. Haruhiko Kaneko, Dr. Kazuyoshi Suzuki, Mr. Tsuyoshi Tanaka, Mr. Toshihiko Kashiyama, and Mr. Hiroyuki Ohde devoted themselves to designing the excellent codes in their master’s and / or doctorate course programs at the Tokyo Institute of Technology. Much of the motivation for making the codes practical was due to discussions with many researchers and engineers in Japanese industry.
Thanks also go to art designer Mr. Ippei Inoh, a friend of mine, who proposed and directed the marvelous idea of the front cover design. Ms. Tiki Ishizuka, a computer graphic designer, arranged the wonderful fine art of this cover. You can see “Hoh-Oh,” a legendary happy bird, in the center of the front cover, whose original pattern was introduced from China more than one thousand years ago to Japan, and since then has appeared as an art design in Japanese art and craft products. I sincerely hope the book will bring happiness and pleasure to the reader.
At this point in a preface, I usually thank my wife, Sachiko, and my daughter’s family, Sayaka, Makoto, and Asuka, for encouraging me in continuing this difficult project.
(Autumn in 2005 at the foot of Mt. Fuji)
1.1 Faults and Failures
1.1.1 Faults
1.1.2 Failures
1.2 Error Models
1.2.1 Hard Errors and Soft Errors
1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors
1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors
1.2.4 Unequal Error Probability Model and Unequal Error Protection Model
1.3 Error Recovery Techniques for Dependable Systems
1.3.1 Error Detection / Error Checking
1.3.2 Error Recovery / Error Masking
1.4 Code Design Process for Dependable Systems
1.4.1 Code Functions
1.4.2 Code Design Process
References
Introduction
Before designing a dependable system, we need to have enough knowledge of the system’s faults, errors, and failures, of the dependable techniques including coding techniques, and of the design process for practical codes. This chapter provides the background on code design for dependable systems.
First, we need to make clear the difference between three frequently encountered technical terms in designing dependable systems, namely faults, errors, and failures. These terms are fully defined in [LAPR92, AVIZ04]. Faults are primarily identified as the generic sources of abnormalities that alter the operation of circuits, devices, modules, systems, and / or software products. Failure can arise from any type of possible faults. Faults are often called defects when they occur in hardware and bugs when in software.
1.1.1 Faults
As causes of failure, faults are sometimes predictable but difficult to identify. Faults can occur during any stage in a system’s or product’s life cycle: during specification, design, production, or operation. Faults are characterized by their origin and their nature [LAPR92, GEFF02].
Origin of Faults. Timing is a factor because a fault introduced in any one of a system’s life phases (specification, design, production, or operation) can provoke a failure in the operation phase. During the specification phase, for example, an incomplete definition of services may lead to different interpretations by the client, the designer, and the user. Eventually, in the operation phase, the failure becomes evident when the services provided differ from the user’s expectations.
Code Design for Dependable Systems: Theory and Practical Applications, by Eiji Fujiwara. Copyright © 2006 John Wiley & Sons, Inc.
During the design and the production phases, for example, a designer’s lack of sufficient knowledge of architectural levels, structural levels, and the like, may result in a type of physical defect that induces, for example, short or open circuits.
During the operation phase, for example, an elevation of ambient temperature can cause electronic devices and products to malfunction.
Nature of Faults. Faults that occur during the specification and the design phases are called human-made faults. During the production and the operation phases, there may occur physical faults, hardware faults, or solid faults. Each type is due to some physical abnormality in the component arising from aging or defective materials. Faults are of two types in their duration:
1. Permanent. These faults arise, for example, from a power supply breakdown, defective open or short circuits, bridging or open lines, electro-migration, and so forth. The defects in the input / output of the logical circuits or lines are called stuck-at ‘1’ faults or stuck-at ‘0’ faults.
2. Temporal. These faults can be transient or intermittent. Transient faults occur randomly and externally because of external noise, namely environmental disturbances such as external electromagnetic waves and external particles such as α-particles and neutrons. Intermittent faults occur randomly but internally because of unstable or marginally stable hardware, varying hardware or software state as a function of load or activity, or signal coupling (i.e., crosstalk) between adjacent signal lines. Some intermittent faults may be due to glitches [LO05], which are unpredictable spike noise pulses occurring and propagated especially in large exclusive-OR (XOR) tree networks (see Chapter 8). Parallel decoding circuits of error control codes with large code lengths require large exclusive-OR tree networks, so glitches can become serious problems. This topic will be covered in more detail in Section 8.3.
Transient faults and intermittent faults are the major source of errors in modern-day digital systems. Some reports show that more than 60% of all failures in computer systems are caused by transient or intermittent faults. For example, in DRAM (Dynamic Random Access Memory) chips, transient errors result mainly from α-particles emitted by the decay of radioactive particles in the semiconductor materials [MAY79, NOOR80, SAIH82]. One identified source of α-particles is the lead solder balls used to attach the chip to the substrate. As they pass through the chip, α-particles create sufficient electron-hole pairs to add charge to the DRAM capacitor cells. These particles have a low energy level, and thus have a very low probability of causing more than one memory cell to flip when the memory cells are not packed in extreme density. In today’s ultra-high-density RAMs, not only DRAMs but also SRAMs (Static Random Access Memories), it has been recognized that multiple cosmic-ray-induced transient errors are a serious problem [OSAD03, 04].
Temporal errors have also been observed in microprocessor chips. The trend toward smaller geometries by ever-shrinking semiconductor designs results in lower operating signal voltages and higher speed operation, and therefore brings additional transient or intermittent errors into play [KARN04]. In today’s ubiquitous digital device or system environment, PDAs and personal computers equipped with these high-speed microprocessor chips and high-density RAM chips are even more vulnerable when operated under harsher circumstances, such as in airplanes at high altitude or near high-voltage electric power lines.
The important point is that the faults due to temporary environmental problems do not need repair because the hardware is physically undamaged.
Cosmic rays, however, can give rise to significant transient errors, called soft errors [KARN04, MAKI00, HAZU00, ZIEG98, MASS96, CALV94]. Figure 1.1 shows the cosmic ray and its influence at the earth surface level. In the cosmic environment heavy particles with very high energy from solar winds can penetrate the semiconductor chips in satellite digital systems and cause more than double-bit errors [MUEL99]. Sometimes they can cause physical faults such as latchup in CMOS circuits.
A detailed report of field testing for soft errors due to cosmic rays was presented in 1996 [ZIEG96a, 96b, 96c, OGOR96, SRIN96]. In the report cosmic rays are defined as particles in solar wind originating in the sun or as galactic particles that enter the solar system, striking atmospheric atoms and creating a shower of secondary particles. Most such particles produced by the shower either decay spontaneously or lose energy gradually, and eventually lose all energy in the cascade. Some of these particles may strike the earth. Therefore the cosmic rays at sea level consist mostly of neutrons, protons, pions, muons, electrons, and photons. About 95% of these particles are neutrons with no charge but with the high energy (more than 10 MeV) that causes significant soft errors or latchups in electronic circuits. So cosmic rays can create multiple errors. The neutron flux increases exponentially with altitude, and hence the failure rate of electronic circuits at airplane altitude is about one hundred times worse than at terrestrial level. Concrete shielding with several feet of thickness can significantly attenuate the flux of these high-energy particles.
Figure 1.2 shows how neutrons and other particles, including α-particles, generated by the collision of nuclei in the atmosphere, can strike silicon chips and produce sufficient electron-hole pairs in the chips to impair their functioning.
[Figure: cosmic rays from the cosmic zone collide with nuclei in the atmospheric zone, producing showers of protons, neutrons, pions, and mesons that reach the earth. Neutron flux (energy level > 10 MeV): 0.01 particles/(cm² s) at sea level; 1.0 particles/(cm² s) at an altitude of 10,000 m.]
Figure 1.1 Cosmic rays.
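The two flux values quoted in Figure 1.1 can be tied to the "about one hundred times" statement with a short calculation. The sketch below assumes a purely exponential dependence of neutron flux on altitude, fitted through those two points, and further assumes that the soft-error rate is proportional to the flux. Both are simplifications introduced here for illustration, and the function names are hypothetical.

```python
import math

# Values read from Figure 1.1 (neutrons with energy above 10 MeV).
FLUX_SEA_LEVEL = 0.01      # particles/(cm^2 s) at sea level
FLUX_10KM = 1.0            # particles/(cm^2 s) at 10,000 m
ALTITUDE_10KM = 10_000.0   # metres

# Assumed model: flux(h) = FLUX_SEA_LEVEL * exp(h / SCALE_HEIGHT),
# with SCALE_HEIGHT chosen so that the curve passes through both points.
SCALE_HEIGHT = ALTITUDE_10KM / math.log(FLUX_10KM / FLUX_SEA_LEVEL)


def neutron_flux(altitude_m: float) -> float:
    """Neutron flux in particles/(cm^2 s) under the assumed exponential model."""
    return FLUX_SEA_LEVEL * math.exp(altitude_m / SCALE_HEIGHT)


def relative_failure_rate(altitude_m: float) -> float:
    """Failure rate relative to sea level, assuming it scales with the flux."""
    return neutron_flux(altitude_m) / FLUX_SEA_LEVEL


if __name__ == "__main__":
    for h in (0.0, 5_000.0, 10_000.0):
        print(f"{h:>8.0f} m: flux ratio {relative_failure_rate(h):.1f}")
```

Under this model the ratio at 10,000 m is the ratio of the two figure values, which is where the factor of roughly one hundred comes from. Intermediate altitudes are only as trustworthy as the exponential assumption.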
1.1.2 Failures
A failure is defined as nonperformance that occurs when a delivered service no longer complies with its specifications [LAPR92], and a failure is also defined as nonperformance when the system or component is unable to perform its intended function for a specified time under specified environmental conditions [LEVE95].
Some types of failure are defined with respect to specific conditions. For example, a value failure means that the value of the delivered service does not comply with the specification and a timing failure represents a response in incorrect timing, either faster or slower than the specified time. A temporary failure means an erroneous behavior at a certain moment lasting only a short time. A crash failure, or catastrophic failure, is the one that stops the mission because the system is completely blocked.
An error is a manifestation of an unexpected fault within a system that is liable to lead to system failure. The transformation of a fault to an error is called fault activation. The mechanism that creates errors in the system and finally provokes a failure is called error propagation. Before provoking a failure, errors can be masked or corrected by some error control mechanisms such as error correcting codes, retries, or triple modular redundancy (TMR) and thus recovered without inducing a system failure.
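Error masking by TMR can be sketched in a few lines. The example assumes three replicated modules that each produce an 8-bit output and a bitwise majority voter; the word width, the values, and the function name are illustrative assumptions rather than a design from this book.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs: each output bit takes
    the value on which at least two of the three modules agree."""
    return (a & b) | (b & c) | (a & c)


# Hypothetical outputs of three replicated modules.
correct = 0b1011_0010
erroneous = correct ^ 0b0000_1000   # one module suffers a single-bit error

# The error in one module is masked by the other two.
masked = majority_vote(correct, erroneous, correct)

# If two modules show the same error, the voter passes the error on.
unmasked = majority_vote(erroneous, erroneous, correct)
```

The voter masks any error confined to a single module, but it offers no protection when two modules are wrong in the same bit position.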
A fault remains in passive mode until an error first appears at some structure of the system. This occurrence is called an initial activation and the error is called a primitive error. In this case latency is defined as the mean time between the fault’s occurrence and its initial activation as an error. Figure 1.3 presents the causal relationship between fault, error, and failure. Various types of errors can occur, and these different types are covered below.
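The passive period of a fault can be made concrete with a stuck-at fault on a single line. In the sketch below the fault is assumed to appear at a known time step, and it stays passive for as long as the fault-free signal happens to equal the stuck value. The signal trace and the function name are hypothetical, and the result is one sample of the delay, whereas latency as defined above is the mean of such delays.

```python
from typing import Optional, Sequence


def first_activation(
    signal: Sequence[int], fault_time: int, stuck_value: int
) -> Optional[int]:
    """Return the first time step at or after fault_time at which the
    fault-free signal differs from the stuck value, i.e. the step at which
    the fault first produces an error. Return None if it never does."""
    for t in range(fault_time, len(signal)):
        if signal[t] != stuck_value:
            return t
    return None


# Hypothetical fault-free values of one line over seven clock cycles.
trace = [1, 0, 0, 0, 1, 1, 0]

# A stuck-at-0 fault occurs at cycle 1 but goes unnoticed while the
# line should carry 0 anyway.
activation = first_activation(trace, fault_time=1, stuck_value=0)
delay = None if activation is None else activation - 1
```

In this trace the fault stays passive for three cycles, because the primitive error can only appear once the line is supposed to carry a 1.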
Figure 1.2 Electron holes in a silicon chip caused by particles. (Diagram labels: charged particle; Si chip; electron moved by collision; no collision.)
1.2.1 Hard Errors and Soft Errors
Hard errors are caused by permanent faults; they therefore affect the system functions for a long period of time. This type of error is typically provoked by faults that appear as opens or shorts anywhere on the chips, modules, cards, or boards. Hard errors are also called permanent errors.
Soft errors, on the other hand, are caused by temporary faults, especially those resulting from external causes. Soft errors have a limited duration, meaning they interrupt system functions for a very short time period. The most likely sources of soft errors are radioactive particles and external noise. Alpha particles and cosmic particles [ZIEG96a, ZIEG96b, ZIEG96c, OGOR96, SRIN96], mentioned previously, are the major contributors. Soft errors are therefore also called transient errors. Intermittent errors are provoked by intermittent faults.
1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors
Multiple errors that occur randomly in time and/or space are called random errors. An error can occur in every bit position of a word with almost equal probability. The random type of error is unpredictable and is typically caused by white noise or external particles.
Errors may cluster non-uniformly in a word, and these multiple errors may gather in particular and unpredictable positions in the word. Clustered errors include burst errors and byte errors. Burst errors occur typically in disk or tape memory. Byte errors are typically found in semiconductor memory. The difference is in the data-recording medium. In disk memory, the data are recorded on a continuous surface. In semiconductor memory, the data are stored in RAM chips, and a data fragment, called a byte, is read or stored in each chip. In disk or tape memory, defects or dust particles on the recording surface can cause burst errors to occur anywhere in the continuous recording medium.
Figure 1.3 (Diagram labels: fault; error; failure; user interface.)
Clustered errors may occur in two-dimensional matrix symbols as well as in tape or disk memory with a continuous two-dimensional recording medium. In semiconductor memory, on the other hand, byte errors may occur in a fragment of readout data, namely in a single byte, corresponding to the faulty chip. This is because each chip is physically separated and independent, and therefore the presence of a fault in a chip does not extend to the adjacent chips. Figure 1.4 illustrates the different cases of random errors, byte errors, and burst errors.
Another error model consists of mixed clustered and random errors in the operational phase. The clustered errors mentioned above are sometimes caused by physical faults due to aging problems. However, systems and devices are more prone to damage from transient faults than from physical faults. Transient faults are a source of random errors. Therefore, when a physical fault occurs during the operational phase, both types of error, clustered and random, must be taken into account. For example, in semiconductor memories with byte-organized RAM chips, the major types of errors are transient errors (i.e., random bit errors) caused by α-particles or external noise. After some time in operation, byte errors will occur due to the aging of RAM chips. Therefore both bit errors and byte errors, meaning both random errors and permanent errors, may occur separately or simultaneously. A similar situation holds for transmission systems, where both random bit errors and burst errors can occur. Chapter 6 deals with the codes that control the mixed type of single-byte errors and random bit errors.
Figure 1.4 Models of random errors, byte errors, and burst errors. (Diagram labels: random errors from external noise, particles, and permanent faults occurring randomly; received data / readout data; memory card (package); memory chip; faulty chip.)
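The difference between scattered bit errors and clustered errors can be made concrete with a small sketch. It assumes that words are given as bit strings and that each RAM chip supplies one b-bit byte of the word; the function names are illustrative and do not come from the text.

```python
def error_positions(sent, received):
    """Return the bit positions where two equal-length bit strings differ."""
    if len(sent) != len(received):
        raise ValueError("words must have equal length")
    return [i for i, (s, r) in enumerate(zip(sent, received)) if s != r]


def burst_length(positions):
    """Span from the first to the last erroneous bit (0 if there is no error)."""
    return positions[-1] - positions[0] + 1 if positions else 0


def bytes_in_error(positions, b):
    """Indices of the b-bit bytes (one per chip) that contain at least one error."""
    return sorted({p // b for p in positions})


# A 16-bit word stored in four 4-bit chips; the second chip is faulty.
sent = "1011001110001111"
received = "1011110010001111"
pos = error_positions(sent, received)
print(pos, burst_length(pos), bytes_in_error(pos, 4))

# Two scattered bit errors touch two different bytes and span most of the word.
print(burst_length([1, 14]), bytes_in_error([1, 14], 4))
```

In the first case all erroneous bits fall inside one byte, which is the pattern a single-byte error control code is designed for; in the second case the same number of bytes cannot cover the errors, so a random-error model is the better fit.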
1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors
In binary systems the probability of errors that force 0 to 1, called 0-errors, is, in general, equal to that of errors going from 1 to 0, called 1-errors. This class of errors is known as symmetric errors. When these errors occur with unequal probabilities, they are called asymmetric errors. In the binary asymmetric error model, only one type of error, either 0-errors or 1-errors, can occur, and the error type is known a priori. If both error types can occur but are not mixed within a word, then this class of errors is said to be unidirectional errors [BLAU93]. In binary systems these errors are caused by symmetric faults, asymmetric faults, or unidirectional faults.
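As a rough illustration of these definitions, the following sketch counts the 0-errors and 1-errors in a single received word and reports whether the observed pattern is unidirectional. It follows the naming used above (a 0-error forces 0 to 1); the function names and result strings are hypothetical.

```python
def count_directions(sent, received):
    """Count 0-errors (0 forced to 1) and 1-errors (1 forced to 0)."""
    zero_errors = sum(1 for s, r in zip(sent, received) if s == "0" and r == "1")
    one_errors = sum(1 for s, r in zip(sent, received) if s == "1" and r == "0")
    return zero_errors, one_errors


def classify(sent, received):
    """Describe the direction of the errors observed in one word."""
    zero_errors, one_errors = count_directions(sent, received)
    if zero_errors == 0 and one_errors == 0:
        return "no error"
    if zero_errors and one_errors:
        return "mixed directions"
    if zero_errors:
        return "unidirectional (0-errors only)"
    return "unidirectional (1-errors only)"


print(classify("1100", "1111"))  # both errors force 0 to 1
print(classify("1100", "0110"))  # one error in each direction
```

A single word cannot distinguish the asymmetric model from the unidirectional one; that distinction depends on whether the error direction is fixed and known a priori for the device.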
In nonbinary systems using numerals, 0, 1, 2, 3, ..., 9, or alphanumeric symbols, asymmetric errors are the type that occur. That is, the probability of an error that forces one nonbinary symbol A to another symbol B is sometimes different from that of symbol A forced to yet another symbol C. For example, in handwritten character recognition systems, the probability of a 7 being mistaken for a 9 is much higher than that of a 7 being mistaken for a 4, or $p(9 \mid 7) \gg p(4 \mid 7)$, where $p(B \mid A)$ means the probability of a symbol A being mistaken for another symbol B. This is because the numbers 7 and 9 are close in shape whereas 7 and 4 are not so similar. Likewise, in keyboard input systems the symbols located on adjacent keys can be more easily mistyped. Figure 1.5 shows examples of these error models. In the asymmetric error model, the error graphs are not perfect and sometimes not bi-directional. On the other hand, in the symmetric nonbinary error model, they are perfect and bi-directional.
If symbols are removed or added in a word, as is sometimes caused by human mistakes (i.e., human-made faults), this class of errors is called deletion errors or insertion errors, respectively.
(a) Example of an asymmetric error graph for handwritten character (numeral) recognition systems
(b) Asymmetric error graph for keyboard systems
1.2.4 Unequal Error Probability Model and the Unequal Error Protection Model
The probability of an error appearing in any position of a word is usually considered to be equal. However, there is an error model to consider where some positions of a word have higher error probability than other positions. This is sometimes caused by using devices with low reliability in the corresponding positions of a word, or by having error-sensitive areas in some positions of a word that are more vulnerable to external noise or have a low noise margin. In such cases the erroneous positions or areas with high error probabilities are known a priori. The type of error model that is relevant here is known as the unequal error probability model. The codes based on this error model are called unequal error control (UEC) codes. Chapter 10 will discuss the UEC codes and present their application to holographic memory, which has non-uniform error probability in the recording medium.
Some types of computer words or communication messages have a structure such that the information included in one part of the word is more important or more valuable than that in other parts. Control and address information in computer or communication messages, and pointer information in database words, are good examples. In general, errors in this part, such as errors in control information or in pointer information, will cause much more serious damage to the subsequent processes in the system. Another example is errors in decimal numbers. During processing of digital data of conventional decimal numbers or measurement data, errors in the higher order digits will have more devastating effects on the subsequent processes in digital systems than errors in the lower order digits. Therefore the higher order digits should be more strongly protected against errors than the lower order digits. This type of error model is known as an unequal error protection model. The codes based on this are called unequal error protection (UEP) codes and will also be discussed in Chapter 10.
1.3 Error Recovery Techniques for Dependable Systems
Error detection is an essential part of dependable system design. Ideally, error detection will block the propagation of an error during online operations, before it reaches the system interface and causes a system failure. The error is best detected immediately when it occurs so that its effect can be minimized.
Upon detection of an error by an error detection mechanism, some error recovery technique must mask the fault or remove it, and thus block the error’s propagation. Among such mechanisms, error correcting codes and triple modular redundancy (TMR) correct errors or mask faults directly, that is, without an additional error detection procedure.
Some important error detection techniques and error recovery techniques, comparable to the error control coding techniques, are briefly described below. For more information, the reader is referred to the following excellent texts and papers on dependable systems or design techniques for fault tolerance: [AVIZ78, SIEW82, RENN84, EZHI86, ABRA86, PRAD86, JOHN89, LEE90, AVRE00].
1.3.1 Error Detection / Error Checking
Prediction & Comparison. The basic error detecting or checking concept for online operations lies in prediction & comparison. That is, the output of the circuit/module is predicted from the input, and then the predicted output and the original circuit/module output are compared bit by bit. Errors are detected if the actual output does not perfectly match the predicted output.
Duplication is an important and widely used error detection technique in dependable digital systems. This is a special case of prediction & comparison, because the output is generated, or predicted, by a copy of the circuit/module and then compared with that of the original. This concept exists also in software duplication, where a copy of the same or equivalent software is prepared and executed, and then the outputs are compared.
Parity prediction is another important and widely applied technique. The output parity bit is predicted from the input, and then compared with the parity bit generated from the original output.
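Parity prediction can be sketched for a very simple circuit. The example assumes a bitwise XOR unit as the circuit under check, for which the output parity equals the XOR of the two input parities; the unit, the fault mask, and the function names are hypothetical choices made only for illustration.

```python
def parity(x):
    """Parity of a non-negative integer: 1 if the number of 1 bits is odd."""
    return bin(x).count("1") & 1


def xor_unit(a, b, fault_mask=0):
    """Hypothetical circuit under check: bitwise XOR of two operands.

    fault_mask flips the selected output bits to model a faulty output.
    """
    return (a ^ b) ^ fault_mask


def parity_check_passes(a, b, fault_mask=0):
    """Compare the parity predicted from the inputs with the actual output parity."""
    predicted = parity(a) ^ parity(b)            # prediction from the inputs only
    actual = parity(xor_unit(a, b, fault_mask))  # parity generated from the output
    return predicted == actual


print(parity_check_passes(0b1011, 0b0110))                     # fault-free
print(parity_check_passes(0b1011, 0b0110, fault_mask=0b0100))  # single-bit output error
print(parity_check_passes(0b1011, 0b0110, fault_mask=0b0110))  # double-bit output error
```

The third call shows the limit of a single parity bit: an even number of erroneous output bits leaves the parity unchanged and therefore escapes detection.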
Error Detecting Codes. Error detecting codes typically deal with simple parity-check codes, cyclic codes, checksum codes, and other basic linear codes, as will be explained in Section 2.3. Some further important and newly developed codes will be presented in later chapters.
The application of error detecting codes in online operations is also called checking or online testing. The error detection circuit is denoted as a checker. These applications will be examined in depth in Section 12.1, where the self-checking concept is presented.
Additional topics on how to detect errors caused by faults in the checker itself and how to design such checkers are covered in Section 12.2, where self-testing checkers are discussed. In summary, Chapter 12 covers error-checking concepts, self-testing checker design methodologies, and concrete checker design for logic circuits and for computer systems.
Watchdog Timer and Watchdog Processor. A watchdog timer is very useful for detecting faults in a system. The idea behind this scheme is that some part of the system should act to indicate fault-free status so that absence of this action is indicative of a fault. To this end, the timer must be repeatedly reset by the system. Failure of the system to perform the reset function results in the system being turned off to prevent a system failure from occurring.
A watchdog timer can be used to detect faults in both the hardware and the software of a system. In many applications software routines are expected to execute within a prespecified time frame. In digital control systems, for example, the routines execute repetitively at specified intervals. If a routine suddenly needs more than the expected time to execute, the fault may lie in the software, for example, in an infinite loop [JOHN89]. In this regard the watchdog timer is an important control flow checking tool.
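The reset discipline of a watchdog timer can be sketched with a simulated clock. The sketch assumes time advances in discrete ticks and that each routine resets the timer only when it completes; the class and function names are illustrative, not taken from any particular system.

```python
class WatchdogTimer:
    """Counts down in ticks; expires unless it is reset in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def reset(self):
        """Called by the monitored system to signal fault-free status."""
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self):
        """Advance simulated time by one tick."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True


def run_routines(durations, timeout_ticks):
    """Run routines of the given durations (in ticks).

    Returns the index of the first routine that overruns the watchdog,
    or None if every routine resets the timer in time.
    """
    watchdog = WatchdogTimer(timeout_ticks)
    for index, duration in enumerate(durations):
        for _ in range(duration):
            watchdog.tick()
            if watchdog.expired:
                return index  # absence of the reset indicates a fault
        watchdog.reset()
    return None


print(run_routines([2, 3, 2], timeout_ticks=4))  # all routines finish in time
print(run_routines([2, 9, 2], timeout_ticks=4))  # the second routine overruns
```

In a real system the expiry would trigger a shutdown or recovery action rather than a return value; the return value is used here only to keep the sketch testable.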
A watchdog processor is an extension of the concept of a watchdog timer. This is a special subprocessor that checks the online operations of the processor being checked. The watchdog processor runs watchdog programs that collect information from the processor being checked and generate signatures, such as address and data information and processor state information, during online operations. The new information is then compared with that already prepared in the watchdog program.
1.3.2 Error Recovery / Error Masking
Error recovery techniques are essential to improving system reliability. The important recovery techniques, as was mentioned before, include coding techniques and some modular redundancy techniques, such as TMR, that correct or mask the faults directly. Other effective methods based on error detection are also available to mask the faults after the detection of errors, for example, self-checked duplication and sift-out redundancy, as discussed below.
Error Correcting Codes. Many different error control codes have been studied and developed to correct and/or detect the types of errors mentioned in Section 1.2. Among the most practical matrix codes are those presented in this book.
Error correcting codes head the list of the most effective and efficient techniques used to mask faults, both temporary and permanent. The coding approach involves some redundancy, for example, additional check bits, additional hardware in the form of encoding/decoding logic circuits, and additional decoding time delay. Nevertheless, the coding performance is superior to that of competing techniques, especially in quickly masking temporary faults. For this reason error control codes are still being extensively applied to various digital systems to improve their reliability.
Retry. Just as space redundancy requires additional hardware resources, the retry method, called time redundancy, requires additional time to perform multiple identical operations of commands or programs immediately after errors are detected. This very simple technique requires almost no additional hardware but can very effectively recover system operations from temporary faults, meaning transient and intermittent faults. Therefore the retry method is widely applied to digital systems, including processors, main memories, disk memories, tape memories, and I/O devices.
Alternate data retry, abbreviated ADR [SHED78], is a kind of retry operation that is effective in masking permanent faults as well as temporary faults. Figure 1.6 presents the principle behind masking a single permanent fault by ADR. Note that this simple example shows an even-parity encoded bus circuit with four lines, including a parity line. Figure 1.6(a) shows that if a stuck-at ‘0’ permanent fault occurs in the first bus line, then the even-parity encoded data from circuit A, here 1001, is received at the input of circuit B as 0001, which is odd-parity data. Therefore a single error can be detected by examining the parity check of the data. Next, by the ADR method, in Figure 1.6(b), the bit-by-bit complement of the original data, which is 0110, is transmitted from circuit A to the input of circuit B. Even though the first line still has a stuck-at ‘0’ fault, the fault is masked because the data on this line are also ‘0’. Finally the received data are inverted, and the original correct data, 1001, are recovered. In this example, a permanent fault is masked at the second stage of ADR, and the correct data are recovered at the third stage of ADR. Also, in this example, if the fault in Figure 1.6(a) is a temporary fault, the error it caused can be completely masked and will have no effect because the temporary fault will have disappeared by the time of the second stage of ADR.
Figure 1.6 (a) Error detection by parity check: circuit A sends 1001, and circuit B receives 0001 because of the stuck-at ‘0’ fault. (b) Retry with complemented data: circuit A sends 0110, circuit B receives 0110 despite the stuck-at ‘0’ fault, and the inverted result is 1001.
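The three stages of the ADR example above can be traced in a short sketch. It assumes a four-line even-parity bus modeled as a list of bits and a single stuck-at fault on one line; the function names and status strings are illustrative only.

```python
def parity_ok(word):
    """Even-parity check: the number of 1s on the bus must be even."""
    return sum(word) % 2 == 0


def bus(word, stuck_line=None, stuck_value=0):
    """Transmit a word over the bus; one line may be stuck at a fixed value."""
    out = list(word)
    if stuck_line is not None:
        out[stuck_line] = stuck_value
    return out


def complement(word):
    """Bit-by-bit complement of a word."""
    return [1 - bit for bit in word]


def send_with_adr(word, stuck_line=None, stuck_value=0):
    """Stage 1: send and check parity. Stage 2: resend complemented. Stage 3: invert."""
    received = bus(word, stuck_line, stuck_value)
    if parity_ok(received):
        return received, "first try"
    retried = bus(complement(word), stuck_line, stuck_value)
    if not parity_ok(retried):
        return None, "unrecovered"
    return complement(retried), "recovered by ADR"


# The example from the text: 1001 sent over a bus whose first line is stuck at 0.
print(send_with_adr([1, 0, 0, 1], stuck_line=0, stuck_value=0))
```

Because the bus has an even number of lines, the complemented word also has even parity, so the same parity checker can be reused at the second stage.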
In general, if the logic circuit that performs the function $F(X)$ for the circuit input $X$ satisfies the relation
$F(\overline{X}) = \overline{F(X)}$,
where $\overline{X}$ means the complement of $X$, then the ADR with bit-by-bit complementary retry at the second stage can be performed successfully. The function $F$ that satisfies the relation above is called a self-complementary function, and the circuit that satisfies the relation is called a self-complementary circuit. The even-parity bus circuit described above is a self-complementary circuit. The adder, the multiplier, and the divider are also good examples of self-complementary circuits.
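The self-complementary relation can be checked exhaustively for small circuits. The sketch below assumes a one-bit full adder whose carry-in is complemented together with the operands, and uses an AND gate as a counterexample; both circuit models and the helper name are assumptions made for illustration.

```python
from itertools import product


def full_adder(a, b, carry_in):
    """One-bit full adder returning (sum, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return (total, carry_out)


def and_gate(a, b):
    """Two-input AND gate, returned as a one-element tuple."""
    return (a & b,)


def is_self_complementary(circuit, n_inputs):
    """Check that complementing every input complements every output."""
    for x in product((0, 1), repeat=n_inputs):
        x_bar = tuple(1 - bit for bit in x)
        expected = tuple(1 - bit for bit in circuit(*x))
        if circuit(*x_bar) != expected:
            return False
    return True


print(is_self_complementary(full_adder, 3))
print(is_self_complementary(and_gate, 2))
```

The full adder passes because both the three-input XOR and the majority function are complemented when all of their inputs are complemented, whereas the AND gate fails the relation.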
N-Modular Redundancy (NMR) and Reconfiguration. Triple modular redundancy (TMR) is the most typical form of N-modular redundancy. The TMR method triplicates the original module and performs a majority vote to determine the output of the system. If one of the modules becomes faulty, the other two fault-free modules mask the results of the faulty one when the majority vote is performed. This is shown in Figure 1.7(a). This voting concept is also applied to TMR software to protect against software faults in any one of three identical or equivalent software programs that perform the same function.
Figure 1.7 (a) Triple modular redundancy (TMR). (b) Triple modular redundancy with triplicated voters: inputs 1 to 3 feed modules 1 to 3, and three voters produce outputs 1 to 3.
The difficulty in TMR lies in its voter. That is, if the voter fails, the system completely fails. One approach is to apply TMR to the voter itself such that three voters are used and three independent voting results are provided, as shown in Figure 1.7(b). The three modules are functionally identical and receive identical inputs. The results generated by the three modules are voted on by the three voters to produce three results. Each result is correct unless more than one module or input is faulty.
N-modular redundancy (NMR) is a generalization of TMR and is a typical space redundancy technique. In most cases, N is selected as an odd number so that a majority voting principle can be applied. For example, the 5MR system consists of five identical modules and a voter. This system produces correct output in the presence of as many as two faulty modules, that is, it masks them.
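Majority voting for TMR and 5MR can be sketched as follows. The sketch assumes each module output is a comparable value and that a strict majority is required; the function name and the choice to raise an exception when no majority exists are illustrative.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: too many faulty modules")
    return value


# TMR: one faulty module is masked by the other two.
print(majority_vote([7, 7, 3]))

# 5MR: two faulty modules are masked by the remaining three.
print(majority_vote([5, 9, 5, 5, 1]))
```

With three of five modules faulty, no value reaches a strict majority and the voter can no longer guarantee a correct output, which matches the limit of two maskable faults stated above.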
The modular redundancy concept has been extended and modified by combining it with the concept of reconfiguration. The following forms show some such combinations.
Self-checked duplication is an extended form of duplication in which each module has its own self-checking mechanism in order to identify the faulty state of the module itself. In this system, two self-checked modules are operated and checked in parallel at all times. If one module is found to have errors by its own error detection mechanism, then the system output is switched to the error-free module, meaning it is reconfigured. This concept is a form of hot standby sparing in which the spare module operates synchronously with the online module and is prepared to take over at any time. When the online module fails, the standby spare module takes over immediately. In contrast to hot standby sparing, there exists cold standby sparing, where the spare is unpowered until needed to replace a faulty module.
N-modular redundancy with spares is also known as hybrid redundancy. It provides a basic core of N modules arranged in a voting system, and in addition spares are provided to replace faulty modules. For example, while the TMR with one spare masks one faulty module, the spare will replace the faulty module immediately upon the detection of the fault. After that spare is used, the system is still capable of masking another faulty module. Therefore two faulty modules can be masked in this system. The aforementioned 5MR requires five modules in order to mask two faulty modules, but the TMR with one spare approach requires only four modules. The system remains in the basic NMR configuration until the disagreement detector determines that a faulty module exists. One approach to fault detection is to compare the output of the voter with the individual outputs of the modules. A module that disagrees with the majority is regarded as faulty and removed from the NMR core. A spare module is then switched in to replace the faulty module. The reliability of the basic NMR system is maintained as long as the pool of spares is not exhausted. This is shown in Figure 1.8.
Self-purging redundancy is similar to the NMR with spares approach. The main difference is that all modules operate actively in this redundancy system, unlike the NMR with spares where some spare modules are not an active part of the system until a fault occurs. This is shown in Figure 1.9. Each switch in the self-purging redundancy separates the faulty module if the module output is not equal to the voter output. The reconfiguration is essentially accomplished by the system logically removing the faulty module via the switch and thus reducing N in the reconfigured NMR system.
Sift-out redundancy also requires N identical modules in the system, but with the outputs of every pair of modules compared to identify faulty modules. If there are $N = 5$ modules, ten comparisons are performed. This redundancy requires an N-way multiplexer instead of an N-input voter, as shown in Figure 1.10. The comparator in this redundancy circuit receives all outputs of the modules and produces comparison outputs of every two modules, that is, $N(N-1)/2$ outputs, and then the detection circuit determines the faulty modules. Finally the output of the N-way multiplexer is selected based on the faulty indication outputs of the detection circuit. This essentially masks the effects of any faulty modules.
This redundancy can tolerate up to $N - 2$ faulty modules. Its tolerance is therefore equal to that of the TMR system with $N - 3$ spares and also to that of the self-purging system having a voter with a threshold level of two.
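The comparator, detection circuit, and multiplexer of sift-out redundancy can be sketched in a few lines. The sketch assumes that faulty modules never produce the same wrong output as another module, so a module is treated as fault-free when it agrees with at least one other module; this rule and the function name are assumptions for illustration.

```python
from itertools import combinations


def sift_out(outputs):
    """Compare every pair of module outputs and select a fault-free one.

    Returns (selected_output, faulty_module_indices, number_of_comparisons).
    """
    n = len(outputs)
    agrees = [False] * n
    comparisons = 0
    for i, j in combinations(range(n), 2):  # N(N-1)/2 pairwise comparisons
        comparisons += 1
        if outputs[i] == outputs[j]:
            agrees[i] = agrees[j] = True
    faulty = [i for i in range(n) if not agrees[i]]
    good = [i for i in range(n) if agrees[i]]
    if not good:
        raise RuntimeError("fewer than two modules agree")
    # The multiplexer passes the output of any module not flagged as faulty.
    return outputs[good[0]], faulty, comparisons


# N = 5 modules, three of them faulty (N - 2 faults).
print(sift_out([4, 8, 4, 1, 6]))
```

With five modules the sketch performs ten comparisons, and the correct output survives as long as two fault-free modules remain.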
System Recovery by Software. Retry techniques require error detection by checkers, and immediately after the error detection the same operations are performed again. In contrast, checkpoint techniques allow some latency time after error detection because the process can be restored to an earlier point of execution. Checkpointing is mostly implemented in software and requires some hardware to store the backup data. The techniques result from a combination of checkpointing and rollback. In checkpointing, a complete copy of the system state should be saved at specific points, namely checkpoints, during process execution. The information to be stored is the set of system state, including data, programs, machine state, and so forth, which is necessary to restart and continue successful execution from the checkpoint. Rollback is a part of the actual recovery process and occurs after the repair, such as by reconfiguration, that removes faulty modules or equipment from the system, or after the error due to transient faults has died out. An important design criterion is how often checkpoints are to be set, that is, the determination of checkpoint intervals. If the checkpoints are too infrequent for the actual error rate experienced, too much computation time will be lost due to rollback. On the other hand, too frequent checkpoints result in an unnecessary increase in operation time and memory due to the overhead of saving system states when establishing checkpoints.
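The interplay between checkpoint interval and rollback cost can be illustrated with a toy process. The sketch assumes the process simply sums a list of values, that one transient fault corrupts the state at a given step, and that the fault is detected immediately; all names and the fault model are hypothetical.

```python
import copy


class CheckpointedProcess:
    """Holds a state dictionary and one saved checkpoint of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        """Save a complete copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the state saved at the last checkpoint."""
        self.state = copy.deepcopy(self._saved)


def run_with_checkpoints(values, interval, faulty_step=None):
    """Sum the values, checkpointing every `interval` steps.

    A transient fault at `faulty_step` corrupts the state once and forces
    a rollback. Returns (total, number_of_steps_executed).
    """
    process = CheckpointedProcess({"index": 0, "total": 0})
    executed = 0
    fault_pending = faulty_step is not None
    while process.state["index"] < len(values):
        i = process.state["index"]
        process.state["total"] += values[i]
        process.state["index"] = i + 1
        executed += 1
        if fault_pending and i == faulty_step:
            process.state["total"] = -1  # corrupted by the transient fault
            fault_pending = False
            process.rollback()           # recover from the last checkpoint
            continue
        if process.state["index"] % interval == 0:
            process.checkpoint()
    return process.state["total"], executed


values = [1, 2, 3, 4, 5, 6]
for interval in (1, 3, 6):
    print(interval, run_with_checkpoints(values, interval, faulty_step=4))
```

Shorter intervals lose fewer steps to rollback but save the state more often; the sketch counts only the repeated steps and does not model the time or memory cost of each checkpoint.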
Figure 1.10 Sift-out redundancy. (Diagram labels: system inputs; modules 1 to N; comparator with N(N-1)/2 outputs; detection circuit indicating faulty modules; N-input multiplexer; system output.)
1.4 Code Design Process for Dependable Systems
What types of dependable techniques are the most effective in the design of dependable systems? In some cases techniques other than coding, or a combination of coding techniques and other dependable techniques, will better meet the reliability requirement or the cost/performance requirement of a system.
Before designing the error control codes, we therefore have to pay attention to a number of preconditions or preparatory measures: Where should the code be applied? How can the code be applied effectively? How much should coding techniques improve the reliability of the system while still satisfying its performance requirements? What are the requirements for decoding speed, and how much decoder hardware is acceptable? What about the detection capability for errors falling outside the capability of the code? This section addresses all these important questions with respect to the code design process.
1.4.1 Code Functions
Error detection and error correction are the better-known code functions. An important code function that lies midway between these two functions is error location. The error locating code indicates which blocks or components of a word contain an error but does not indicate the precise erroneous digit position or the error value. This code function is efficient for retransmission of a word segment, especially in communication systems where whole words do not need retransmitting [WOLF63]. Also, in computer systems the error locating code provides information on where to find the faulty module, faulty package/card, or faulty device, which is very useful for system maintenance. If the system is equipped with spares, then the system can be recovered by removing the faulty blocks and switching to the spares.
Figure 1.11 shows the different functions of these three code types. Because the erroneous position and the error value can be determined by use of "error correction", all errors can be corrected. Of course, use of "error detection" alone does not allow any erroneous position or error value to be determined; it only indicates the presence of an error in a word. For "error location", as was mentioned before, the code indicates only the area of the word that includes an erroneous position. For example, note in Figure 1.11 that the code’s information is that errors exist in the second block of the word, and neither the definite error positions in the block nor the error values are determined. Error locating codes will be covered in Chapter 9. Many practical codes, in general, have a mixture of these code functions, for example, single error correction and double error detection.
Figure 1.11
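The idea of locating an erroneous block without locating the erroneous bit can be shown with a deliberately simple scheme. The sketch assumes one even-parity bit per block, which is far less efficient than the error locating codes of Chapter 9 and is used here only to illustrate the code function; the function names are hypothetical.

```python
def encode_blocks(blocks):
    """Append an even-parity bit to each block (each block is a list of bits)."""
    return [block + [sum(block) % 2] for block in blocks]


def locate_faulty_blocks(coded_blocks):
    """Return the indices of the blocks whose parity check fails."""
    return [i for i, block in enumerate(coded_blocks) if sum(block) % 2 != 0]


word = encode_blocks([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]])
print(locate_faulty_blocks(word))  # no error

word[1][2] ^= 1                    # a single bit error in the second block
print(locate_faulty_blocks(word))  # the block is located, the bit is not
```

The result identifies which block to retransmit or which module to replace, but it gives neither the bit position nor the error value, and an even number of errors within one block goes unnoticed by this simple scheme.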
1.4.2 Code Design Process
Before attempting the design of codes, we need to give the following items careful consideration:
1. Circumstances in which the systems or equipment with the coding techniques are to be applied, for example, the particular needs of medical appliances, nuclear appliances, or digital systems in aircraft or satellites.
2. Fabrication structure, that is, how the systems or the equipment are organized, for example, chip/card (package) organization, bit/byte organization, or binary/nonbinary.
3. Devices, such as memories, logic circuits, or FPGAs, that are used in the system to which the coding techniques are to be applied.
4. Combination of fault/error masking techniques with coding techniques.
The design process for the error control codes is presented next and is shown in Figure 1.12. Steps 1 through 3 pertain to the phase of setting code parameters, and steps 4 and 5 are for the phase of code designing.
Step 1. Determine error rates and error types:
Raw error rate of devices, modules, or systems, and the target error rate to attain
Whether symmetric error, asymmetric error, or unidirectional error
Whether equal error or unequal error
Whether random bit error, byte error, spotty byte error (see Chapter 7), or burst error
Whether bit or byte error, or rather, bit plus byte error (see Chapter 6)
Step 2. Determine code parameters and code constraints:
Information-bit length, and required check-bit length
Maximum random bit error length, or byte error length, spotty error length (see Chapter 7), or burst error length
Required decoding speed
Required decoder hardware complexity
Step 3. Determine code function:
Error detection, error correction, error location, or mixed type of these code functions
Step 4. Design code, and calculate code bounds:
Theoretical bound on code length or check-bit length
Mathematical knowledge required for code design, for example, algebra, combinatorial mathematics, number theory, graph theory, statistics, and probability theory
Step 5. Evaluate the code designed:
Check-bit length, and comparison to its bound
Decoding speed
Decoder hardware complexity
Error detection probability of multiple errors beyond the code capability
If the code does not satisfy the requirements, then go back to step 4.
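As one concrete instance of the bound calculation in step 4, the following sketch computes the minimum check-bit length for a binary single error correcting code from the Hamming (sphere-packing) bound, $2^r \ge k + r + 1$, where $k$ is the information-bit length and $r$ the check-bit length. The choice of this particular bound and the function name are illustrative assumptions; other code functions and error types require other bounds.

```python
def min_check_bits_sec(k):
    """Smallest r with 2**r >= k + r + 1 (binary single error correction)."""
    if k < 1:
        raise ValueError("information-bit length must be positive")
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


for k in (4, 8, 16, 32, 64):
    print(k, min_check_bits_sec(k))
```

In step 5 the check-bit length of a designed code can then be compared with this value; a code that needs more check bits than the bound requires leaves room for improvement in step 4.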
References
[AVIZ78] A. Avizienis, "Fault Tolerance, the Survival Attribute of Digital Systems," Proc. IEEE, 66 (October 1978): 1109–1125.
[AVIZ04] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Trans. Depend. Secure Comput., 1 (January–March 2004): 11–33.
[AVRE00] D. R. Avresky (ed.), Dependable Network Computing, Kluwer Academic Publishers (2000).
[BLAU93] M. Blaum, Codes for Detecting and Correcting Unidirectional Errors, IEEE Computer Society Press (1993).
[CALV94] P. Calvel, P. Lamothe, and C. Barillot, "Space Radiation Evaluation of 16 Mbit DRAMs for Mass Memory Applications," IEEE Trans. Nucl. Sci., 41 (December 1994): 2267–2271.
[EZHI86] P. D. Ezhilchevan and S. K. Shrivastava, "A Characterization of Faults in Systems," Proc. 5th Symp. on Reliability in Distributed Software and Database Systems (January 1986): 215–222.
[GEFF02] J.-C. Geffroy and G. Motet, Design of Dependable Computing Systems, Kluwer Academic Publishers (2002), chs. 1–5.
[HAZU00] P. Hazucha, C. Svensson, and S. A. Wender, "Cosmic-Ray Soft Error Rate
[LAPR92] J. C. Laprie (ed.), Dependability: Basic Concepts and Terminology, Springer-Verlag (1992).
[LEE90] P. A. Lee and T. Anderson, Fault Tolerance, Principles and Practice, Springer-Verlag (1990).
[LEVE95] N. Leveson, Safeware, Addison-Wesley (1995).
[LO05] J. C. Lo and E. Fujiwara, "Transient Behavior of the Encoding/Decoding Circuits of Error Control Codes," Proc. IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems (October 2005).
[MAKI00] A. Makihara, H. Shindou, N. Nemoto, S. Kuboyama, S. Matsuda, T. Oshima, T. Hirao, H. Itoh, S. Buchner, and A. B. Campbell, "Analysis of Single-Ion Multiple-Bit Upset in High-Density DRAMs," IEEE Trans. Nucl. Sci., 47 (December 2000): 2400–2404.
[MASS96] L. W. Massengill, "Cosmic and Terrestrial Single-Event Radiation Effects in Dynamic Random Access Memories," IEEE Trans. Nucl. Sci., 43 (April 1996): 576–593.
[MAY79] T. C. May, "Soft Errors in VLSI: Present and Future," IEEE Trans. Comp. Hybrids Manuf. Technol., CHMT-2 (December 1979): 377–387.
[MUEL99] M. Mueller, L. C. Alves, W. Fischer, M. L. Fair, and I. Modi, "RAS Strategy for IBM S/390 G5 and G6," IBM J. Res. Dev., 43 (September–November 1999): 875–888.
[NOOR80] D. J. W. Noorlag, L. M. Terman, and A. G. Konheim, "The Effect of Alpha-Particle-Induced Soft Errors on Memory Systems with Error Correction," IEEE J. Solid-State Circ., SC-15 (June 1980): 319–325.
[OGOR96] T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, "Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories," IBM J. Res. Dev., 40 (January 1996): 41–50.
[OSAD03] K. Osada, Y. Saitoh, E. Ibe, and K. Ishibashi, "16.7-fA/Cell Tunnel-Leakage-Suppressed 16-Mb SRAM for Handling Cosmic-Ray-Induced Multierrors," IEEE J. Solid-State Circ., 38 (November 2003): 1952–1957.
[OSAD04] K. Osada, K. Yamaguchi, Y. Saitoh, and T. Kawahara, "SRAM Immunity to Cosmic-Ray-Induced Multierrors Based on Analysis of an Induced Parasitic Bipolar Effect," IEEE J. Solid-State Circ., 19 (May 2004): 827–833.
[PRAD86] D. K. Pradhan, Fault-Tolerant Computing, Vol. 1 and 2, Prentice-Hall (1986).
[RENN84] D. A. Rennels, "Fault-Tolerant Computing: Concepts and Examples," IEEE Trans. Comput., C-33 (December 1984): 1116–1129.
[SAIH82] G. A. Sai-Halasz, M. R. Wordeman, and R. H. Denard, "Alpha-Particle-Induced Soft Error Rate in VLSI Circuits," IEEE J. Solid-State Circ., SC-17 (April 1982): 355–361.
[SELL68] F. F. Sellers, Jr., M. Y. Hsiao, and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill (1968).
[SHED78] J. J. Shedletsky, "Error Correction by Alternate-Data Retry," IEEE Trans. Comput., C-27 (February 1978): 106–112.
[SIEW82] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press (1982).
[SRIN96] G. R. Srinivasan, "Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview," IBM J. Res. Dev., 40 (January 1996): 77–89.
[WOLF63] J. K. Wolf and B. Elspas, "Error-Locating Codes: A New Concept in Error Control," IEEE Trans. Info. Theory, IT-9 (April 1963): 113–117.
[ZIEG96a] J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, B. Chin, etc., "IBM Experiments in Soft Fails in Computer Electronics (1978–1994)," IBM J. Res. Dev., 40 (January 1996): 3–18.
[ZIEG96b] J. F. Ziegler, "Terrestrial Cosmic Rays," IBM J. Res. Dev., 40 (January 1996): 19–39.
[ZIEG96c] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross, "Accelerated Testing for Cosmic Soft-Error Rate," IBM J. Res. Dev., 40 (January 1996): 51–72.
[ZIEG98] J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Mahlfeld, and C. J. Montrose, "Cosmic Ray Soft Error Rates of 16-Mb DRAM Memory Chips," IEEE J. Solid-State Circ., 33 (February 1998): 246–252.
2.1 Introduction to Algebra
2.1.1 Groups and Rings
2.1.2 Fields
2.1.3 Representation for Elements of Galois Fields
2.1.4 Properties of Galois Field $GF(2^m)$
2.2 Linear Codes
2.2.1 Vector Space and Subspace
2.2.2 Linear Codes as Vector Spaces
2.2.3 Matrix Algebra
2.2.4 Distance and Error Control Capability
2.2.5 Parity-Check Matrices for Linear Codes
2.3 Basic Matrix Codes
2.3.1 Simple Parity-Check Codes
2.3.2 Hamming Single Error Correcting (SEC) Codes
2.3.3 Hamming Single Error Correcting and Double Error Detecting (SEC-DED) Codes
2.3.4 Cyclic Codes
2.3.5 Binary BCH Codes
2.3.6 Reed-Solomon Codes as Nonbinary BCH Codes
2.3.7 Burst Error Correcting Fire Codes
Exercises
References
2 Mathematical Background and Matrix Codes
The research in error control codes has relied to a large extent on the powerful structures of modern algebra. A number of important and practical codes based on the structure of rings and Galois fields have been developed. This chapter provides the algebraic structures and the fundamental codes, expressed mostly by matrices, necessary to understand the subsequent chapters and to design codes that fit practical requirements. The level of the discussion is introductory. For a more rigorous treatment, the reader is advised to consult the following excellent texts on coding theory: [PETE72, MACW77, BRAH84, BERL84, PLES98, LIN04].
2.1 Introduction to Algebra
The most important ideas in coding theory are based on the arithmetic systems of modern algebra. These systems are not so familiar to most of us, so here we pause to develop a background of this mathematics before we proceed to study coding theory and to design practical codes.
2.1.1 Groups and Rings
A group is a mathematical abstraction of an algebraic structure. A ring is also an abstract set that is an Abelian group and has an additional structure.
Code Design for Dependable Systems: Theory and Practical Applications, by Eiji Fujiwara
Copyright © 2006 John Wiley & Sons, Inc.
$a * b = b * a$.
This is called the commutative axiom. Groups with this additional axiom are called commutative groups, or Abelian groups. In every group the identity element is unique. Also, the inverse of each group element is unique, which implies $(a^{-1})^{-1} = a$. The proofs of these are left to the reader to complete.
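One way to see both uniqueness facts is the short derivation below. It is a sketch that assumes the usual group axioms (associativity, an identity element $e$, and an inverse for every element); the names $e'$, $b$, and $c$ are introduced here only for illustration.

```latex
% Assumed axioms: associativity, an identity e, and an inverse a^{-1} for every a.
\begin{align*}
  % Identity: if e and e' are both identity elements, then
  e &= e * e' = e' \\
  % Inverse: if b and c are both inverses of a, then
  b &= b * e = b * (a * c) = (b * a) * c = e * c = c \\
  % Since a * a^{-1} = a^{-1} * a = e, the element a is an inverse of a^{-1};
  % by uniqueness of inverses,
  (a^{-1})^{-1} &= a
\end{align*}
```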
Homomorphism and Isomorphism
The number of elements in a group is called the order of the group. A homomorphism is a mapping $f$ between two groups $\{A, *\}$ and $\{A', *'\}$ that preserves the group structure, meaning that $f : A \to A'$ satisfies $f(a * b) = f(a) *' f(b)$ for all $a, b \in A$. If there exists a homomorphism from $A$ onto $A'$, then $A'$ is said to be homomorphic to $A$. In particular, if $f$ is a bijection (one-to-one and onto), so that the inverse mapping $f' : A' \to A$ is also a homomorphism, then $f$ is said to be an isomorphism, and $A$ and $A'$ are said to be isomorphic. The reader is advised to find an isomorphism between the set of integers $Z_4 = \{0, 1, 2, 3\}$ under addition modulo 4 and the multiplicative group $G = \{1, 2, 3, 4\}$ under multiplication modulo 5.
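The exercise above can be checked mechanically. The following is a minimal sketch, assuming that addition in $Z_4$ is taken modulo 4 and multiplication in $G$ is taken modulo 5; the helper names `add_mod4`, `mul_mod5`, and `is_isomorphism` are hypothetical and chosen only for this illustration. It tests the candidate mapping $f(k) = 2^k \bmod 5$ and then searches all bijections exhaustively.

```python
from itertools import permutations

Z4 = [0, 1, 2, 3]   # assumed: integers under addition modulo 4
G = [1, 2, 3, 4]    # assumed: integers under multiplication modulo 5


def add_mod4(a, b):
    """Group operation of Z4."""
    return (a + b) % 4


def mul_mod5(a, b):
    """Group operation of G."""
    return (a * b) % 5


def is_isomorphism(f):
    """Check that the dict f: Z4 -> G is a bijection with f(a + b) = f(a) * f(b)."""
    bijective = sorted(f.values()) == G
    preserves = all(
        f[add_mod4(a, b)] == mul_mod5(f[a], f[b]) for a in Z4 for b in Z4
    )
    return bijective and preserves


# Candidate mapping: send k to the k-th power of 2 modulo 5.
f = {k: pow(2, k, 5) for k in Z4}
print(f)                  # {0: 1, 1: 2, 2: 4, 3: 3}
print(is_isomorphism(f))  # True

# Exhaustive search over all bijections from Z4 to G.
isos = [
    dict(zip(Z4, p)) for p in permutations(G) if is_isomorphism(dict(zip(Z4, p)))
]
print(len(isos))          # 2
```

The search finds exactly two isomorphisms, one for each generator of $G$ (2 and 3), which is what one expects when both groups are cyclic of order 4.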
Subgroups
Definition 2.2 Let G be a group and let F be a subset of G. Then F is called a subgroup of G if F satisfies all properties of a group with the same operation.
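For a small finite group, the subgroup condition can be tested directly. The sketch below is illustrative only: it uses $Z_4$ under addition modulo 4 as the example group, and the helpers `identity_of` and `is_subgroup` are hypothetical names. It checks closure, the identity, and inverses within the subset; associativity is inherited from the parent group because the operation is the same.

```python
from itertools import combinations

Z4 = [0, 1, 2, 3]   # example group: integers under addition modulo 4


def add_mod4(a, b):
    """Group operation of Z4."""
    return (a + b) % 4


def identity_of(G, op):
    """Return the identity element of the finite group (G, op)."""
    return next(e for e in G if all(op(e, a) == a and op(a, e) == a for a in G))


def is_subgroup(F, G, op):
    """Return True if the subset F of the finite group (G, op) is a group under op."""
    F = set(F)
    if not F or not F <= set(G):
        return False
    e = identity_of(G, op)
    closed = all(op(a, b) in F for a in F for b in F)
    has_identity = e in F
    has_inverses = all(any(op(a, b) == e for b in F) for a in F)
    return closed and has_identity and has_inverses


print(is_subgroup({0, 2}, Z4, add_mod4))  # True
print(is_subgroup({0, 1}, Z4, add_mod4))  # False, since 1 + 1 = 2 is not in {0, 1}

# Enumerate every nonempty subset of Z4 that is a subgroup.
subgroups = [
    set(c)
    for r in range(1, len(Z4) + 1)
    for c in combinations(Z4, r)
    if is_subgroup(c, Z4, add_mod4)
]
print(subgroups)  # [{0}, {0, 2}, {0, 1, 2, 3}]
```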