
Code Design for Dependable Systems: Theory and Practical Applications

Eiji Fujiwara
Tokyo Institute of Technology

A JOHN WILEY & SONS, INC., PUBLICATION


Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada


1.3 Error Recovery Techniques for Dependable Systems
1.4 Code Design Process for Dependable Systems

3.1 Minimum-Weight & Equal-Weight-Row Codes

4 Codes for High-Speed Memories I: Bit Error Control Codes
4.1 Modified Hamming SEC-DED Codes
4.2 Modified Double-Bit Error Correcting BCH Codes
4.3 On-Chip ECCs
Exercises
References

5.1 Single-Byte Error Correcting (SbEC) Codes
5.2 Single-Byte Error Correcting and Double-Byte Error Detecting (SbEC-DbED) Codes
5.3 Single-Byte Error Correcting and Single p-Byte within a Block Error Detecting (SbEC-Spb/BED) Codes
Exercises
References

6 Codes for High-Speed Memories III: Bit / Byte Error
6.1 Single-Byte / Burst Error Detecting SEC-DED Codes
6.2 Single-Byte Error Correcting and Double-Bit Error Detecting
References

7 Codes for High-Speed Memories IV: Spotty Byte Error
7.1 Spotty Byte Errors
7.2 Single Spotty Byte Error Correcting (St/bEC) Codes
7.3 Single Spotty Byte Error Correcting and Single-Byte Error Detecting (St/bEC-SbED) Codes
7.4 Single Spotty Byte Error Correcting and Double Spotty Byte Error Detecting (St/bEC-Dt/bED) Codes
7.5 A General Class of Spotty Byte Error Control Codes
Exercises
References

8.1 Parallel Decoding Burst Error Control Codes
Error Detecting (SbEC-(Sb+S)ED) Codes
8.2 Parallel Decoding Cyclic Burst Error Correcting Codes
8.3 Transient Behavior of Parallel Encoder / Decoder Circuits of Error Control Codes
Exercises
References

9.1 Error Location of Faulty Packages and Faulty Chips
9.2 Block Error Locating (Sb/pbEL) Codes
9.3 Single-Bit Error Correcting and Single-Block Error Locating (SEC-Sb/pbEL) Codes
9.4 Single-Bit Error Correcting and Single-Byte Error Locating (SEC-Se/bEL) Codes
9.5 Burst Error Locating Codes
9.6 Code Conditions for Error Locating Codes
Exercises
References

10 Codes for Unequal Error Control / Protection (UEC / UEP)
10.1 Error Models for UEC Codes and UEP Codes
10.2 Fixed-Byte Error Control UEC Codes
10.3 Burst Error Control UEC / UEP Codes
10.4 Application of the UEC / UEP Codes
Exercises
References

11.1 Tape Memory Codes
11.2 Magnetic Disk Memory Codes
11.3 Optical Disk Memory Codes

13.1 M-Ary Asymmetric Errors in Data Entry Systems
13.2 M-Ary Asymmetric Symbol Error Correcting Codes
13.3 Nonsystematic M-Ary Asymmetric Error Correcting Codes with Deletion / Insertion / Adjacent-Symbol-Transposition Error Correction Capabilities
13.4 Codes for Two-Dimensional Matrix Symbols
Exercises
References

14.1 MDS Array Codes Tolerating Multiple-Disk Failures
14.2 Codes for Distributed Storage Systems
Exercises
References

Error control coding theory has been studied for over half a century, and it is still going stronger than ever. The most recent examples are the turbo codes and the low-density parity-check (LDPC) codes. During these years, error control codes have also been extensively applied to various digital systems, such as computer and communication systems, as an essential technique to improve system reliability. High-speed parallel decoding is essential in modern high-speed dependable systems and semiconductor memories. Error control codes suitable for high-speed parallel decoding are regularly expressed and studied in parity-check matrices. For highly reliable communication systems and disk memory systems, on the other hand, serial decoding based on linear feedback shift registers (LFSRs) is used. Error control codes for serial decoding are typically expressed and studied using generator polynomials. In this book, the former codes are called matrix codes and the latter polynomial codes. So far, traditional coding theory has been studied mainly using code generator polynomials. We emphasize that the linear codes expressed in polynomials can always be expressed using parity-check matrices, but the converse is not always possible. This book focuses specifically on the design theory for matrix codes and their practical applications, which has been seriously lacking in the traditional scope of coding theory investigations.
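The statement that a polynomial code can always be written as a matrix code can be illustrated with a small sketch. The example below assumes, purely for illustration, the cyclic (7,4) Hamming code with generator polynomial g(x) = x^3 + x + 1; the function names are hypothetical and not taken from the book. Column i of the parity-check matrix is taken as x^i mod g(x), so the syndrome of a word equals the remainder of its polynomial divided by g(x).

```python
def poly_mod(a: int, g: int) -> int:
    """Remainder of a(x) divided by g(x) over GF(2).

    Polynomials are stored as integers: bit i is the coefficient of x^i.
    """
    deg_g = g.bit_length() - 1
    while a.bit_length() - 1 >= deg_g:
        # Cancel the leading term of a(x) with a shifted copy of g(x).
        a ^= g << (a.bit_length() - 1 - deg_g)
    return a


def parity_check_matrix(g: int, n: int) -> list[list[int]]:
    """Build H whose column i holds the coefficients of x^i mod g(x)."""
    r = g.bit_length() - 1  # number of check bits = degree of g(x)
    cols = [poly_mod(1 << i, g) for i in range(n)]
    return [[(c >> row) & 1 for c in cols] for row in range(r)]


def syndrome(H: list[list[int]], word: list[int]) -> list[int]:
    """Compute H * word^T over GF(2); word[i] is the coefficient of x^i."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


# Illustrative assumption: g(x) = x^3 + x + 1, code length n = 7.
G_POLY = 0b1011
H = parity_check_matrix(G_POLY, 7)

# g(x) itself is a codeword: 1 + x + x^3.
codeword = [1, 1, 0, 1, 0, 0, 0]
received = codeword.copy()
received[4] ^= 1  # single-bit error at position 4

for row in H:
    print(row)
print("syndrome of codeword:", syndrome(H, codeword))
print("syndrome with error :", syndrome(H, received))
```

With a single-bit error, the syndrome equals the column of H at the error position, which is what allows a parallel decoder to locate the error by matching the syndrome against the columns.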

In dependable computer systems, many types of error control codes have been applied to memory subsystems and processors in order to achieve efficient and reliable data processing and storage. Some systems could never have been realized without the application of cost-effective error control codes, notably very large capacity, high-speed semiconductor memories, very high-density magnetic disk memories, and recent optical disk memories such as compact disc (CD) and digital versatile disc (DVD). More recently, mobile digital systems have gained wide popularity, and these systems are sometimes operated under unfavorable environments where electromagnetic noise, α-particles, and cosmic rays abound. Modern high-speed, high-density VLSI processors and semiconductor memories are operated at low supply voltage levels and thus low logic signal swing; they therefore are vulnerable to external disturbances that can induce transient errors. Transient errors are a dominant concern in today’s digital systems. Error control coding is the most efficient and effective way to tolerate these errors, and is expected to become ever more important in future VLSI systems.

The challenge is to choose among many different applications of error control codes. Often a new application calls for a new type of code that can be developed most efficiently to fit a new requirement. Matrix codes are far more flexible than polynomial codes. Parity-check matrices can be manipulated easily. Some known examples are column vector exchange in a matrix, the odd-weight-column matrix, the low-density matrix, and the rotational matrix form. These manipulations of matrices have yielded many useful codes for important applications. Polynomial codes, on the other hand, cannot be manipulated in a similar way for code design fine-tuning. The main reason is that the matrix code is capable of expressing various types of code functions and thus allows for very high design flexibility. In practice, such flexibility has led to excellent code designs, satisfying the various reliability requirements of the dependable systems.

This book builds on the author’s previous book, Error Control Coding for Computer Systems (Prentice-Hall, 1989), and it likewise aims at introducing the latest developments and advances in the field. However, as mentioned earlier, the book is also unique in its concentration on the treatment of matrix codes. Unlike any existing coding theory books, this book will not burden the reader with unnecessary background on polynomial algebra. The book includes only the mathematical background essential for the understanding of matrix code construction and design. Such an arrangement frees up space for the description of some fine artistry of matrix code design strategies and techniques. Matrix code designs are presented with respect to practical applications, such as high-speed semiconductor memories, mass memories of disks and tapes, logic circuits and systems, data entry systems, and distributed storage systems. Also new classes of matrix codes, such as error locating codes, spotty byte error control codes, and unequal error control codes, are presented in their practical settings. The new parallel decoding algorithm of the burst error control codes is demonstrated and further extended to the generalized parallel decoding of the codes.

Chapter 1 provides background and a preview of material covered in the subsequent chapters. First, it defines faults, errors, and failures and explains the many types of faults and errors. This is the core knowledge needed to understand what constitutes a good code. To design an efficient and effective code for a given application, it is important first to know what types of errors matter, how much the system’s reliability can be improved by coding techniques, and what are the constraints on check-bit length, decoding speed, and so forth. The matrix code designing procedure is laid out in this chapter from this standpoint. The chapter concludes with a brief introduction to the competitors of the coding technique in dependable systems, namely conventional error recovery techniques and / or error masking techniques.

Chapter 2 provides the fundamental mathematical background and coding theory necessary to understand the later chapters. The chapter covers the matrix representations of well-known error control codes, such as simple parity-check codes, cyclic codes, Hamming codes, BCH codes, Reed-Solomon codes, and Fire codes. These codes are manipulated in the later chapters in examples of how matrix codes satisfy the system requirements for given applications.

Chapter 3 discusses the matrix code design techniques related to high-speed decoding, area efficient encoding / decoding hardware, modularized organization of encoding / decoding circuits, and so forth.

Chapters 4, 5, 6, and 7 cover topics on matrix code design for high-speed semiconductor memories. Depending on the application, the matrix code can be designed to handle bit or byte errors and in some cases a mixture of both bit and byte errors. The latter are typical errors found in large capacity semiconductor memory systems using high-density RAM chips. Chapter 4 discusses bit error control codes, such as the modified Hamming single-bit error correcting and double-bit error detecting (SEC-DED) codes, the modified double-bit error correcting BCH codes, and the memory on-chip codes. For the memory systems using byte-organized RAM chips, single-byte error correcting (SbEC) codes, and single-byte error correcting and double-byte error detecting (SbEC-DbED) codes, are presented in Chapter 5. The codes for the mixed type of bit errors and byte errors are presented in Chapter 6. Among them, a byte error detecting SEC-DED code, developed by the author and his colleague in the 1980s, has found practical application in recent workstations. Chapter 7 presents a relatively new class of byte error control codes: spotty byte error control codes. This class of codes has been specifically designed to fit the large capacity memory systems that use high-density RAM chips with wide input / output data of 8, 16, and 32 bits. Also a general class of these codes with minimum Hamming distance d and with maximum distance separable (MDS) characteristics is presented in this chapter. The well-known Reed-Solomon codes are included in these generalized codes, which makes them practically and theoretically important. They will be quite useful for future applications.

Chapter 8 presents the generalized parallel decoding algorithm for error control codes. Initially developed for burst error control codes, this new decoding algorithm includes the conventional parallel decoding algorithm of the existing bit / byte error correcting codes. The generalized algorithm can also be used for multiple burst or byte error correcting codes. The chapter takes this new algorithm and demonstrates how the parallel decoding method can be implemented in combinational circuits. In addition, the chapter addresses the important problem of glitches in parallel decoding circuits. Parallel decoding circuits depend heavily on large exclusive-OR tree circuits, which are well known to readily produce glitches. Glitches are unwanted logic signal transitions that can be generated, propagated, and accumulated in the logic circuits and then induce noise and instability on the power supply lines. The chapter explains why the glitches are generated, how they are propagated and accumulated in the circuits, and how to reduce these undesirable effects.

Chapter 9 presents a new class of codes, namely error locating codes. Error location is an error control function lying midway between error correction and error detection. An error locating code will indicate where the errors lie but not the precise erroneous digit positions. This type of code is useful for identifying the faulty block, faulty package, or faulty chip, and thus enables fault isolation and reconfiguration. The chapter includes practical codes for memory systems to use in locating faulty packages / cards. It also provides a practical code for locating faulty chips. Both codes have the capability to correct single-bit errors, even though the codes are mainly designed for identifying the faulty areas. In addition, burst error locating codes are introduced here. The chapter concludes with a precise analysis of error locating codes with an emphasis on the code conditions and the relation between error locating codes and error correcting / detecting codes.

Chapter 10 shows yet another new class of codes: unequal error control (UEC) codes. In many applications certain positions in a word have higher error rates or require more protection. The UEC codes can protect the area in a word having a higher error rate with stronger error control code functions, and the area having a lower error rate with weaker error control functions. In other words, this type of code has different code functions within a code word, depending on the area and the associated error rate. The chapter provides optimal codes with some UEC code functions. Similar codes exist in unequal error protection (UEP) codes. This type of code protects the valuable information part of a word against errors. For example, control information or address information in communication messages or computer words, or similarly pointer information in database words, must be more protected from errors than the other parts. The chapter provides some UEP codes that protect against burst errors and also against single-bit errors. The chapter includes examples of UEC and UEP codes used in holographic memories and lossless compressed data.

Chapters 11, 12, 13, and 14 present the codes for some specific systems, namely mass memories such as magnetic tapes and disks, logic circuits and systems, data entry systems, and distributed storage systems. Chapter 11 covers the codes designed specifically for mass memories such as tape memories, magnetic disk memories, and recent optical disk memories. The various modified types of Reed-Solomon codes and adaptive parity codes are presented for the tape memories and the disk memories. Codes for recent CDs and DVDs are also introduced. Chapter 12 covers error checking for logic systems using efficient error detecting codes. The important concept of self-checking is first introduced. The chapter then clarifies how the errors in the logic circuits and systems are detected, how the error detecting checker circuits are implemented, how the errors in the checker itself are detected, and how the self-testing checkers are implemented. In particular, a self-checking ALU using parity-based codes is presented, and self-checking design for processor systems is demonstrated. Chapter 13 presents the codes for data entry systems. In these systems, nonbinary symbols are routinely used, as in character recognition systems and recent two-dimensional symbols. The chapter characterizes the errors that occur in these nonbinary symbols as asymmetric errors and presents some asymmetric error control codes. These codes are basically nonlinear, and are designed by using elements in newly defined rings. Also nonsystematic nonbinary asymmetric error correcting codes are designed based on a multilevel coding method and a set-partitioning algorithm, and QR codes and two-dimensional unidirectional clustered error correcting codes are presented for two-dimensional matrix symbols. Chapter 14 provides the codes for distributed storage systems connected via networks. Codes for recent RAID systems that tolerate two disk failures are introduced, and then an efficient error recovery scheme from multiple disk failures in the distributed storage system is discussed and is implemented by using block design in combinatorial theory.

The introductory portion of the book, Chapters 1 and 2, and the parts of Chapters 3, 4, 5, 6, 8, 9, and 10, can be used as the text for a course at an advanced undergraduate level or for an introductory one-semester course at the graduate level. For graduate classes and advanced students who have the background in mathematics, logic circuits, and rudimentary knowledge of codes, the book can be used as a whole with selected topics from each of the chapters. Practicing engineers / designers will find useful discussions in Chapters 6 to 14, which demonstrate, in detail, the procedure of designing sophisticated codes in practical form. For the practicing engineer, Chapter 2 presents mathematics and coding theory, not in strict form but in introductory form, which is necessary in understanding the later chapters. Many examples, figures, exercises, and references are provided in each chapter of the book. Many attractive codes with practical code parameters and their evaluation data on decoding hardware and error detection capabilities are fully demonstrated. These can be used by practicing engineers as a practical guide and handy reference.

My sincere appreciation goes to many people. Professors Jack K. Wolf of the University of California San Diego, Hideki Imai of the University of Tokyo, T. R. N. Rao of the University of Louisiana Lafayette, and Bella Bose of Oregon State University encouraged me to continue my research on code design theory and to write this book. Emeritus professor Yoshihiro Tohma of Tokyo Institute of Technology, Professors Takashi Nanya of the University of Tokyo, Hideo Ito of Chiba University, and Jien-Chung Lo of the University of Rhode Island gave important suggestions and valuable discussions on research for dependable systems. Recently Professor Lo also provided valuable comments on the final book and an important discussion on glitches (i.e., logical noise) that are generated, propagated, and accumulated in large exclusive-OR tree circuits in the parallel decoder of the codes. The author’s NTT colleagues, Dr. Shigeo Kaneda, now professor at Doshisha University, and Dr. Kazumitsu Matsuzawa, now professor at Kanagawa University, collaborated to develop practical codes for computer memories. Dr. Masato Kitakami, now associate professor at Chiba University, Dr. Mitsuru Hamada, now associate professor at Tamagawa University, Dr. Shuxin Jiang, Dr. Saowapa Kiattichai, Dr. Hongyuang Chen, Dr. Kazuteru Namba, Dr. Ganesan Umanesan, Dr. Haruhiko Kaneko, Dr. Kazuyoshi Suzuki, Mr. Tsuyoshi Tanaka, Mr. Toshihiko Kashiyama, and Mr. Hiroyuki Ohde devoted themselves to designing the excellent codes in their master’s and / or doctorate course programs at the Tokyo Institute of Technology. Much of the motivation for making the codes practical was due to discussions with many researchers and engineers in Japanese industry.

Thanks also go to art designer Mr. Ippei Inoh, a friend of mine, who proposed and directed the marvelous idea of the front cover design. Ms. Tiki Ishizuka, a computer graphic designer, arranged the wonderful fine art of this cover. You can see “Hoh-Oh,” a legendary happy bird, in the center of the front cover, whose original pattern was introduced from China to Japan more than one thousand years ago and has since appeared as an art design in Japanese art and craft products. I sincerely hope the book will bring happiness and pleasure to the reader.

At this point in a preface, I usually thank my wife, Sachiko, and my daughter’s family, Sayaka, Makoto, and Asuka, for encouraging me in continuing this difficult project.

(Autumn in 2005 on the foot of Mt. Fuji)

1.1 Faults and Failures
1.1.1 Faults
1.1.2 Failures
1.2 Error Models
1.2.1 Hard Errors and Soft Errors
1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors
1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors
1.2.4 Unequal Error Probability Model and Unequal Error Protection Model
1.3 Error Recovery Techniques for Dependable Systems
1.3.1 Error Detection / Error Checking
1.3.2 Error Recovery / Error Masking
1.4 Code Design Process for Dependable Systems
1.4.1 Code Functions
1.4.2 Code Design Process
References

Introduction

Before designing a dependable system, we need to have enough knowledge of the system’s faults, errors, and failures, of the dependable techniques including coding techniques, and of the design process for practical codes. This chapter provides the background on code design for dependable systems.

1.1 Faults and Failures

First, we need to make clear the difference between three frequently encountered technical terms in designing dependable systems, namely faults, errors, and failures. These terms are fully defined in [LAPR92, AVIZ04]. Faults are primarily identified as the generic sources of abnormalities that alter the operation of circuits, devices, modules, systems, and / or software products. Failure can arise from any type of possible fault. Faults are often called defects when they occur in hardware and bugs when in software.

1.1.1 Faults

As causes of failure, faults are sometimes predictable but difficult to identify. Faults can occur during any stage in a system’s or product’s life cycle: during specification, design, production, or operation. Faults are characterized by their origin and their nature [LAPR92, GEFF02].

Origin of Faults. Timing is a factor because a fault introduced in any phase of a system’s life (specification, design, production, or operation) can provoke a failure in the operation phase. During the specification phase, for example, an incomplete definition of services may lead to different interpretations by the client, the designer, and the user. Eventually, in the operation phase, the failure becomes evident when the services provided differ from the user’s expectations.

Code Design for Dependable Systems: Theory and Practical Applications, by Eiji Fujiwara. Copyright © 2006 John Wiley & Sons, Inc.

During the design and the production phases, for example, a designer’s lack of sufficient knowledge of architectural levels, structural levels, and the like, may result in a type of physical defect that induces, for example, short or open circuits.

During the operation phase, for example, an elevation of ambient temperature can cause electronic devices and products to malfunction.

Nature of Faults. During the specification and the design phases, faults that occur are called human-made faults. During the production and the operation phases, the faults that occur are called physical faults, hardware faults, or solid faults. Each type is due to some physical abnormality in the component arising from aging or defective materials. Faults are of two types in their duration:

1. Permanent. These faults arise, for example, from a power supply breakdown, defective open or short circuits, bridging or open lines, electro-migration, and so forth. The defects in the input / output of the logical circuits or lines are called stuck-at ‘1’ faults or stuck-at ‘0’ faults.

2. Temporal. These faults can be transient or intermittent. Transient faults occur randomly and externally because of external noise, namely environmental disturbances such as external electromagnetic waves and external particles such as α-particles and neutrons. Intermittent faults occur randomly but internally because of unstable or marginally stable hardware, varying hardware or software state as a function of load or activity, or signal coupling (i.e., crosstalk) between adjacent signal lines. Some intermittent faults may be due to glitches [LO05], which are unpredictable spike noise pulses occurring and propagated especially in large exclusive-OR (XOR) tree networks (see Chapter 8). Parallel decoding circuits of error control codes with large code lengths require large exclusive-OR tree networks, so glitches can become serious problems. This topic will be covered in more detail in Section 8.3.

Transient faults and intermittent faults are the major source of errors in modern-day digital systems. Some reports show that more than 60% of all failures in computer systems are caused by transient or intermittent faults. For example, in DRAM (Dynamic Random Access Memory) chips, transient errors result mainly from α-particles emitted by the decay of radioactive particles in the semiconductor materials [MAY79, NOOR80, SAIH82]. One identified source of α-particles is the lead solder balls used to attach the chip to the substrate. As they pass through the chip, α-particles create sufficient electron-hole pairs to add charge to the DRAM capacitor cells. These particles have low energy level, and thus have very low probability of causing more than one memory cell to flip when the memory cells are not packed in extreme density. In today’s ultra-high-density RAMs, not only DRAMs but also SRAMs (Static Random Access Memories), it has been recognized that multiple cosmic-ray-induced transient errors are a serious problem [OSAD03, 04].

Temporal errors have also been observed in microprocessor chips. The trend toward smaller geometries by ever-shrinking semiconductor designs results in lower operating signal voltages and higher speed operation, and therefore brings additional transient or intermittent errors into play [KARN04]. In today’s ubiquitous digital device or system environment, PDAs and personal computers equipped with these high-speed microprocessor chips and high-density RAM chips are even more prone to errors when operated under harsher conditions, such as in airplanes at high altitude or near high-voltage electric power lines. The important point is that the faults due to temporary environmental problems do not need repair because the hardware is physically undamaged.

Cosmic rays, however, can give rise to significant transient errors, called soft errors [KARN04, MAKI00, HAZU00, ZIEG98, MASS96, CALV94]. Figure 1.1 shows the cosmic ray and its influence at the earth surface level. In the cosmic environment heavy particles with very high energy from solar winds can penetrate the semiconductor chips in satellite digital systems and cause more than double-bit errors [MUEL99]. Sometimes they can cause physical faults such as latchup in CMOS circuits.

A detailed report of field testing for soft errors due to cosmic rays was presented in 1996 [ZIEG96a, 96b, 96c, OGOR96, SRIN96]. In the report cosmic rays are defined as particles in solar wind originating in the sun or as galactic particles that enter the solar system, striking atmospheric atoms and creating a shower of secondary particles. Most such particles produced by the shower either decay spontaneously or lose energy gradually, and eventually lose all energy in the cascade. Some of these particles may strike the earth. Therefore the cosmic rays at sea level consist mostly of neutrons, protons, pions, muons, electrons, and photons. About 95% of these particles are neutrons with no charge but with the high energy (more than 10 MeV) that causes significant soft errors or latchups in electronic circuits. So cosmic rays can create multiple errors. The neutron flux increases exponentially with altitude, and hence the fail rate of electronic circuits at airplane altitude is about one hundred times worse than at terrestrial level. Concrete shielding with several feet of thickness can significantly attenuate the flux of these high-energy particles. Figure 1.2 shows how neutrons and other particles, including α-particles, generated by the collision of nuclei in the atmosphere, can strike silicon chips and produce sufficient electron-hole pairs in the chips to impair their functioning.
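The factor of about one hundred can be checked against the two neutron flux values quoted in Figure 1.1 (0.01 particles/(cm² s) at sea level and 1.0 particles/(cm² s) at 10,000 m). The sketch below assumes, as a simplification not stated in the text, a pure exponential law between these two points and a fail rate proportional to flux; the names and the resulting scale height are illustrative only.

```python
import math

# Neutron flux (energy > 10 MeV) in particles/(cm^2 s), as quoted in Figure 1.1.
FLUX_SEA_LEVEL = 0.01
FLUX_10KM = 1.0
ALTITUDE_10KM_M = 10_000.0

# Assumption: flux(h) = FLUX_SEA_LEVEL * exp(h / SCALE_HEIGHT_M),
# fitted through the two quoted data points.
SCALE_HEIGHT_M = ALTITUDE_10KM_M / math.log(FLUX_10KM / FLUX_SEA_LEVEL)


def neutron_flux(altitude_m: float) -> float:
    """Interpolated neutron flux at the given altitude (exponential model)."""
    return FLUX_SEA_LEVEL * math.exp(altitude_m / SCALE_HEIGHT_M)


def relative_fail_rate(altitude_m: float) -> float:
    """Fail rate relative to sea level, assuming it is proportional to flux."""
    return neutron_flux(altitude_m) / FLUX_SEA_LEVEL


print(f"scale height (model): {SCALE_HEIGHT_M:.0f} m")
for h in (0, 2_000, 5_000, 10_000):
    print(f"{h:>6} m: {relative_fail_rate(h):6.1f} x sea-level fail rate")
```

Under this model the flux, and hence the assumed fail rate, grows by a factor of ten for every 5,000 m of altitude; real flux also depends on factors the model ignores.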

[Figure 1.1 Cosmic rays. The figure shows cosmic rays from the cosmic zone colliding with nuclei in the atmospheric zone and producing showers of protons, neutrons, pions, and mesons that reach the earth. Neutron flux (energy level > 10 MeV): 0.01 particles/(cm² s) at sea level; 1.0 particles/(cm² s) at 10,000 m high level.]

1.1.2 Failures

A failure is defined as nonperformance that occurs when a delivered service no longer complies with its specifications [LAPR92], and a failure is also defined as nonperformance when the system or component is unable to perform its intended function for a specified time under specified environmental conditions [LEVE95].

Some types of failure are defined with respect to specific conditions. For example, a value failure means that the value of the delivered service does not comply with the specification, and a timing failure represents a response in incorrect timing, either faster or slower than the specified time. A temporary failure means an erroneous behavior at a certain moment lasting only a short time. A crash failure, or catastrophic failure, is the one that stops the mission because the system is completely blocked.

1.2 Error Models

An error is a manifestation of an unexpected fault within a system that is liable to lead to system failure. The transformation of a fault to an error is called fault activation. The mechanism that creates errors in the system and finally provokes a failure is called error propagation. Before provoking a failure, errors can be masked or corrected by some error control mechanisms such as error correcting codes, retries, or triple modular redundancy (TMR) and thus recovered without inducing a system failure.
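Error masking by triple modular redundancy can be sketched in a few lines. The example assumes three replicated modules that each produce an 8-bit word and a bitwise majority voter; the function name and the sample values are hypothetical. A single faulty module is outvoted, whereas the same error in two modules is not masked and would propagate toward a failure.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over the outputs of three replicated modules."""
    return (a & b) | (b & c) | (a & c)


CORRECT = 0b10110010          # fault-free module output (illustrative value)
faulty = CORRECT ^ 0b00001000  # same output with bit 3 flipped

# One erroneous module: the error is masked by the voter.
masked = tmr_vote(CORRECT, faulty, CORRECT)

# Two modules with the same erroneous bit: the voter outputs the error.
unmasked = tmr_vote(faulty, faulty, CORRECT)

print(f"one faulty module : {masked:08b} (correct: {masked == CORRECT})")
print(f"two faulty modules: {unmasked:08b} (correct: {unmasked == CORRECT})")
```

Because the vote is taken bit by bit, errors in different bit positions of two different modules are also masked; only a coincident error in the same bit defeats the voter.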

A fault remains in passive mode until an error ﬁrst appears at some structure of thesystem This occurrence is called an initial activation and the error is called a primitiveerror In this case latency is deﬁned as the mean time between the fault’s occurrence and itsinitial activation as an error Figure 1:3 presents the causal relationship between fault,error, and failure Various types of errors can occur, and these different types are coveredbelow

Figure 1.2 Electron holes in a silicon chip caused by particles.


1.2.1 Hard Errors and Soft Errors

Hard errors are caused by permanent faults; they therefore affect the system functions for a long period of time. This type of error is typically provoked by faults that appear as opens or shorts anywhere on the chips, modules, cards, or boards. Hard errors are also called permanent errors.

Soft errors, on the other hand, are caused by temporal faults, especially those resulting from external causes. Soft errors have a limited duration, meaning they interrupt system functions for a very short time period. The most likely sources of soft errors are radioactive particles and external noise. Alpha particles and cosmic particles [ZIEG96a, ZIEG96b, ZIEG96c, OGOR96, SRIN96] are the major contributors, as mentioned previously. Soft errors are therefore also called transient errors. Intermittent errors are provoked by intermittent faults.

1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors

Multiple errors that occur randomly in time and / or space are called random errors. Errors can occur in every bit position of a word with almost equal probability. The random type of error is unpredictable and is typically caused by white noise or external particles.

Errors may cluster non-uniformly in a word, and these multiple errors may gather in particular and unpredictable positions in the word. Clustered errors include burst errors and byte errors. Burst errors occur typically in disk or tape memory. Byte errors are typically found in semiconductor memory. The difference is in the data-recording medium. In disk memory, the data are recorded on a continuous surface. In semiconductor memory, the data are stored in RAM chips, and a data fragment, called a byte, is read or stored in each chip. In disk or tape memory, defects or dust particles on the recording surface can cause burst errors to occur anywhere in the continuous recording medium.

Figure 1.3 (diagram labels: fault, error, failure, user interface)


Clustered errors may occur in the two-dimensional matrix symbols as well as in the tape or disk memory of a continuous two-dimensional recording medium. In semiconductor memory, on the other hand, byte errors may occur in a fragment of readout data, namely in a single byte, corresponding to the faulty chip. This is because each chip is physically separated and independent, and therefore the presence of a fault in a chip does not extend to the adjacent chips. Figure 1.4 illustrates the different cases of random errors, byte errors, and burst errors.

Another error model consists of mixed clustered and random errors in the operational phase. The clustered errors mentioned above are sometimes caused by physical faults due to aging problems. However, systems and devices are more prone to damage from transient faults than from physical faults. Transient faults are a source of random errors. Therefore, when a physical fault occurs during the operational phase, both types of error, clustered and random, must be taken into account. For example, in semiconductor memories with byte-organized RAM chips, the major types of errors are transient errors (i.e., random bit errors) caused by α-particles or external noise. After some time in operation, byte errors will occur due to the aging of RAM chips. Therefore both bit errors and byte errors, meaning both random errors and permanent errors, may occur separately or simultaneously. A similar situation holds for transmission systems, where both random bit errors and burst errors can occur. Chapter 6 deals with the codes that control the mixed type of single-byte errors and random bit errors.

Figure 1.4 Models of random errors, byte errors, and burst errors.

1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors

In binary systems the probability of errors that force 0 to 1, called 0-errors, is, in general, equal to that of errors going from 1 to 0, called 1-errors. This class of errors is known as symmetric errors. When these errors occur with unequal probabilities, they are called asymmetric errors. In the binary asymmetric error model, only one type of error, either 0-errors or 1-errors, can occur, and the error type is known a priori. If both error types occur but are not mixed within a single word, then this class of errors is said to be unidirectional errors [BLAU93]. In binary systems these errors are caused by symmetric faults, asymmetric faults, or unidirectional faults.

In nonbinary systems using numerals 0, 1, 2, 3, ..., 9, or alphanumeric symbols, asymmetric errors are the type that occur. That is, the probability of an error that forces one nonbinary symbol A to another symbol B is sometimes different from that of symbol A being forced to yet another symbol C. For example, in handwritten character recognition systems, the probability of a 7 being mistaken for a 9 is much higher than that of a 7 being mistaken for a 4, or $p(9 \mid 7) \gg p(4 \mid 7)$, where $p(B \mid A)$ means the probability of a symbol A being mistaken for another symbol B. This is because the numbers 7 and 9 are close in shape, whereas 7 and 4 are not so similar. Likewise, in keyboard input systems the symbols located on adjacent keys can be more easily mistyped. Figure 1.5 shows examples of these error models. In the asymmetric error model, the error graphs are not perfect and sometimes not bi-directional. On the other hand, in the symmetric nonbinary error model, they are perfect and bi-directional.

If symbols are removed or added in a word, as is sometimes caused by human mistakes (i.e., human-made faults), this class of errors is called deletion errors or insertion errors, respectively.

Figure 1.5 (a) Example of an asymmetric error graph for handwritten character (numerals) recognition systems. (b) Asymmetric error graph for keyboard systems.


1.2.4 Unequal Error Probability Model and the Unequal Error Protection Model

The probability of error appearing in any position of a word is usually considered to be equal. However, there is an error model to consider where some positions of a word have higher error probability than other positions. This is sometimes caused in the system by using devices with low reliability in the corresponding positions of a word, or by having error-sensitive areas in some positions of a word that are more vulnerable to external noise or have a low noise margin. In such cases the erroneous positions or areas with high error probabilities are known a priori. The type of error model that is relevant here is known as the unequal error probability model. The codes based on this error model are called unequal error control (UEC) codes. Chapter 10 will discuss the UEC codes and present their application to holographic memory, which has non-uniform error probability in the recording medium.

Some types of computer words or communication messages have a structure such that the information included in one part of the word is more important or more valuable than that in other parts. Control and address information in computer or communication messages, and pointer information in database words, are good examples. In general, errors in this part, such as errors in control information or in pointer information, will cause much more serious damage to the subsequent processes in the system. Another example is error in decimal numbers. During processing of digital data of conventional decimal numbers or measurement data, errors in the higher order digits will yield more devastating effects on the subsequent processes in digital systems than errors in the lower order digits. Therefore the higher order digits should be more strongly protected against errors than the lower order digits. This type of error model is known as an unequal error protection model. The codes based on this model are called unequal error protection (UEP) codes and will also be discussed in Chapter 10.

1.3 Error Recovery Techniques for Dependable Systems

Error detection is an essential part of a dependable system design. Ideally, error detection will block the propagation of an error during online operations, before it reaches the system interface and causes a system failure. The error is best detected immediately as it occurs so that its effect can be minimized.

Upon detection of an error by an error detection mechanism, some error recovery technique must mask the fault or remove it, and thus block the error's propagation. Among such mechanisms, error correcting codes and triple modular redundancy (TMR) correct errors or mask faults directly, that is, without an additional error detection procedure. Some important error detection techniques and error recovery techniques, comparable to the error control coding techniques, are briefly described below. For more information, the reader is referred to the following excellent texts and papers on dependable systems or design techniques for fault tolerance: [AVIZ78, SIEW82, RENN84, EZHI86, ABRA86, PRAD86, JOHN89, LEE90, AVRE00].

1.3.1 Error Detection / Error Checking

Prediction & Comparison. The basic error detecting or checking concept for online operations lies in prediction & comparison. That is, the output of the circuit / module is predicted from the input, and then the predicted output and the original circuit / module output are compared bit by bit. Errors are detected if the actual output does not perfectly match the predicted output.

Duplication is an important and popularly used error detection technique in dependable digital systems. This is a special case of prediction & comparison, because the output is generated, or predicted, by a copy of the circuit / module and then compared with that of the original. This concept exists also in software duplication, where a copy of the same or equivalent software is prepared and executed, and then the outputs are compared.

Parity prediction is another important and popularly applied technique. The output parity bit is predicted from the input, and then compared with the parity bit generated from the original output.

Error Detecting Codes. Error detecting codes typically include simple parity-check codes, cyclic codes, checksum codes, and other basic linear codes, as will be explained in Section 2.3. Some further important and newly developed codes will be presented in later chapters.

The application of error detecting codes in online operations is also called checking or online testing. The error detection circuit is called a checker. These applications will be examined in depth in Section 12.1, where the self-checking concept is presented. Additional topics on how to detect errors caused by faults in the checker itself and how to design such checkers are covered in Section 12.2, where self-testing checkers are discussed. In summary, Chapter 12 covers error-checking concepts, self-testing checker design methodologies, and concrete checker design for logic circuits and for computer systems.

Watchdog Timer and Watchdog Processor. A watchdog timer is very useful for detecting faults in a system. The idea behind this scheme is that some part of the system should act to indicate fault-free status, so that the absence of this action is indicative of a fault. To this end, the timer must be repeatedly reset by the system. Failure of the system to perform the reset function results in the system being turned off to prevent a system failure from occurring.

A watchdog timer can be used to detect faults in both the hardware and the software of a system. In many applications software routines are expected to execute within a prespecified time frame. In digital control systems, for example, the routines execute repetitively at specified intervals. If a routine suddenly needs more than the expected time to execute, the fault may be in the software, for example, an infinite loop [JOHN89]. In this regard the watchdog timer is an important control flow check tool.

A watchdog processor is an extension of the concept of a watchdog timer. This is a special subprocessor that checks the online operations of the processor being checked. The watchdog processor runs the watchdog programs that collect information from the processor being checked and generate signatures, such as address and data information, and processor state information, during online operations. The new information is then compared to that already prepared in the watchdog program.
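The watchdog timer idea can be sketched in software. The following is a minimal illustration, not a description of any particular hardware timer: time is modeled as explicit ticks, and the names WatchdogTimer, run_routines, and timeout_ticks are hypothetical. The timer expires when timeout_ticks ticks elapse without a reset, which is taken here as the indication of a fault such as an infinite loop.

```python
class WatchdogTimer:
    """Software model of a watchdog timer driven by explicit clock ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def reset(self):
        """Called by the monitored routine to indicate fault-free status."""
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self):
        """Advance time by one tick; latch 'expired' if no reset came in time."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True


def run_routines(durations, timeout_ticks):
    """Run routines that take the given numbers of ticks.

    Each routine resets the timer when it completes. Returns the index of
    the first routine during which the timer expired, or None if none did.
    """
    watchdog = WatchdogTimer(timeout_ticks)
    for index, duration in enumerate(durations):
        for _ in range(duration):
            watchdog.tick()
            if watchdog.expired:
                return index  # a real system would now shut down or recover
        watchdog.reset()
    return None
```

In this sketch, run_routines([2, 3, 2], 4) returns None because every routine finishes before the timeout, whereas run_routines([2, 10, 2], 4) returns 1, flagging the routine that overran its time frame.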

1.3.2 Error Recovery / Error Masking

Error recovery techniques are essential to improving system reliability. The important recovery techniques, as was mentioned before, include coding techniques and some modular redundancy techniques, such as TMR, that correct or mask the faults directly. Other effective methods that mask the faults after the detection of errors are also available, for example, self-checked duplication and sift-out redundancy, as discussed below.


Error Correcting Codes. Many different error control codes have been studied and developed to correct and / or detect the types of errors mentioned in Section 1.2. Among the most practical matrix codes are those presented in this book.

Error correcting codes head the list of the most effective and efficient techniques used to mask faults, both temporal and permanent. The coding approach involves some redundancy, for example, additional check bits, additional hardware in the form of encoding / decoding logic circuits, and additional decoding time delay. Nevertheless, the coding performance is superior to that of competing techniques, especially in the quick masking of temporal faults. For this reason error control codes are still being extensively applied to various digital systems to improve their reliability.

Retry. Whereas space redundancy requires additional hardware resources, the retry method, also called time redundancy, requires additional time to perform multiple identical operations of commands or programs immediately after errors are detected. This very simple technique requires almost no additional hardware but can very effectively recover system operations from temporary faults, meaning transient and intermittent faults. Therefore the retry method is popularly applied to digital systems, including processors, main memories, disk memories, tape memories, and I/O devices.

Alternate data retry, abbreviated ADR [SHED78], is a kind of retry operation that is effective in masking permanent faults as well as temporary faults. Figure 1.6 presents the principle behind masking a single permanent fault by ADR. Note that this simple example shows an even-parity encoded bus circuit with four lines, including a parity line. Figure 1.6(a) shows that if a stuck-at ‘0’ permanent fault occurs in the first bus line, then the even-parity encoded data from circuit A, here 1001, is received at the input of circuit B as 0001, which is odd-parity encoded data. Therefore a single error can be detected by examining the parity check of the data. Next, by the ADR method, in Figure 1.6(b), the bit-by-bit complement of the original data, which is 0110, is transmitted from circuit A to the input of circuit B. Even though the first line still has a stuck-at ‘0’ fault, the fault is masked because the data on this line are also a ‘0’. Finally the received data are inverted, and the original correct data, 1001, are recovered. In this example, a permanent fault is masked at the second stage of ADR, and finally the correct data are recovered at the third stage of ADR. Also, in this example, if the fault in Figure 1.6(a) is a temporary fault, the error it caused can be completely masked and will have no effect, because the temporary fault will disappear by the time of the second stage of ADR.

Figure 1.6 (a) Error detection by parity check. (b) Retried by complemented data.

In general, if the logic circuit that performs the function $F(X)$ for the circuit input $X$ satisfies the relation

$$F(\overline{X}) = \overline{F(X)},$$

where $\overline{X}$ means the complement of $X$, then the ADR with bit-by-bit complementary retry at the second stage can be performed successfully. The function $F$ that satisfies the relation above is called a self-complementary function, and the circuit that satisfies the relation is called a self-complementary circuit. The even-parity bus line circuit described above is a self-complementary circuit. The adder, the multiplier, and the divider are also good examples of self-complementary circuits.
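The three ADR stages for the four-line even-parity bus can be traced with a short sketch. This is only an illustration of the example above: the function names and the stuck_at dictionary (line index mapped to the stuck value) are assumptions made here, and for brevity the retried word is not parity-checked again.

```python
def parity_ok(bits):
    """Even-parity check: True when the word holds an even number of 1s."""
    return sum(bits) % 2 == 0


def complement(bits):
    """Bit-by-bit complement of a word."""
    return [1 - b for b in bits]


def transmit(bits, stuck_at):
    """Model the bus: a line listed in stuck_at always delivers its stuck value."""
    return [stuck_at.get(i, b) for i, b in enumerate(bits)]


def send_with_adr(bits, stuck_at):
    """Send a word from circuit A to circuit B; returns (data, retried)."""
    received = transmit(bits, stuck_at)          # stage 1: normal transfer
    if parity_ok(received):
        return received, False
    retry = transmit(complement(bits), stuck_at)  # stage 2: complemented retry
    return complement(retry), True                # stage 3: invert at receiver


# The example from the text: 1001 sent over a bus whose first line is stuck at 0.
data, retried = send_with_adr([1, 0, 0, 1], {0: 0})
print(data, retried)  # prints [1, 0, 0, 1] True
```

The complemented word 0110 carries a 0 on the faulty line, so the stuck-at ‘0’ fault does not disturb it, and inverting the received word restores 1001.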

N-Modular Redundancy (NMR) and Reconfiguration. Triple modular redundancy (TMR) is the most typical form of N-modular redundancy. The TMR method triplicates the original module and performs a majority vote to determine the output of the system. If one of the modules becomes faulty, the other two fault-free modules mask the results of the faulty one when the majority vote is performed. This is shown in Figure 1.7(a). This voting concept is also applied to TMR software to protect against software faults in any one of three identical or equivalent software programs that perform the same function.

Figure 1.7 (a) Triple modular redundancy (TMR). (b) Triple modular redundancy with triplicated voters.

The difficulty in TMR lies in its voter. That is, if the voter fails, the system completely fails. One approach is to apply TMR to the voter itself, such that three voters are used and three independent voting results are provided, as shown in Figure 1.7(b). The three modules are functionally identical and receive identical inputs. The results generated by the three modules are voted on by the three voters to produce three results. Each result is correct unless more than one module or input is faulty.

N-modular redundancy (NMR) is a generalization of TMR and is a typical space redundancy technique. In most cases, N is selected as an odd number so that a majority voting principle can be applied. For example, the 5MR system consists of five identical modules and a voter. This system produces correct output in the presence of, or masks, as many as two faulty modules.
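Majority voting over N module outputs can be sketched as follows. This is a minimal software illustration under the assumption that modules are plain functions with hashable outputs; the names majority_vote and nmr are hypothetical, and the single voter here is itself unprotected, as in Figure 1.7(a).

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value is produced by more than half of them.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def nmr(modules, x):
    """Apply the same input to every module and vote on the results."""
    return majority_vote([module(x) for module in modules])


def good(x):
    """A fault-free module (squaring is an arbitrary example function)."""
    return x * x


def faulty(x):
    """A faulty module that always delivers a wrong value."""
    return -1


print(nmr([good, faulty, good], 3))                # TMR masks one fault: 9
print(nmr([good, faulty, good, faulty, good], 3))  # 5MR masks two faults: 9
```

With three modules a single faulty output is outvoted, and with five modules two faulty outputs are outvoted, matching the masking capability stated above.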

The modular redundancy concept has been extended and modified by combining it with the concept of reconfiguration. The following forms show some such combinations.

Self-checked duplication is an extended form of duplication in which each module has its own self-checking mechanism in order to identify the faulty state of the module itself. In this system, two self-checked modules are operated and checked in parallel at all times. If one module is found to have errors by its own error detection mechanism, then the system output is switched to the error-free module, meaning it is reconfigured. This concept is a form of hot standby sparing in which the spare module operates synchronously with the online module and is prepared to take over at any time. When the online module fails, the standby spare module takes over immediately. In contrast to hot standby sparing, there exists cold standby sparing, where the spare is unpowered until needed to replace a faulty module.

N-modular redundancy with spares is also known as hybrid redundancy. It provides a basic core of N modules arranged in a voting system, and in addition spares are provided to replace faulty modules. For example, when the TMR with one spare masks one faulty module, the spare replaces the faulty module immediately upon the detection of the fault. After that spare is used, the system is still capable of masking another faulty module. Therefore two faulty modules can be masked in this system. The aforementioned 5MR requires five modules in order to mask two faulty modules, but the TMR with one spare approach requires only four modules. The system remains in the basic NMR configuration until the disagreement detector determines that a faulty module exists. One approach to fault detection is to compare the output of the voter with the individual outputs of the modules. A module that disagrees with the majority is regarded as faulty and removed from the NMR core. A spare module is then switched in to replace the faulty module. The reliability of the basic NMR system is maintained as long as the pool of spares is not exhausted. This is shown in Figure 1.8.

Self-purging redundancy is similar to the NMR with spares approach. The main difference is that all modules operate actively in this redundancy system, unlike NMR with spares, where some spare modules are not an active part of the system until a fault occurs. This is shown in Figure 1.9. Each switch in the self-purging redundancy separates the faulty module if the module output is not equal to the voter output. The reconfiguration is essentially accomplished by the system logically removing the faulty module via the switch and thus reducing the number N in the reconfigured NMR system.

Sift-out redundancy also requires N identical modules in the system, but with every pair of module outputs compared to identify faulty modules. If there exist $N = 5$ modules, ten comparisons are performed. This redundancy requires an N-way multiplexer instead of an N-input voter, as shown in Figure 1.10. The comparator in this redundancy circuit receives all outputs of the modules and produces comparison outputs for every pair of modules, that is, $N(N-1)/2$ outputs, and then the faulty modules are determined in the detection circuit. Finally the output of the N-way multiplexer is selected based on the faulty indication outputs of the detection circuit. This essentially masks the effects of any faulty modules.

This redundancy can tolerate up to $N-2$ faulty modules. Its tolerance is therefore equal to that of the TMR system with $N-3$ spares and also to that of the self-purging system having a voter with a threshold level of two.
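The comparator, detection circuit, and multiplexer of sift-out redundancy can be imitated in a few lines. The sketch below makes a simplifying assumption that is not stated in the text: a module is flagged as faulty when its output agrees with no other module, which presumes that faulty modules do not produce identical wrong values. The function name sift_out is hypothetical.

```python
from itertools import combinations


def sift_out(outputs):
    """Select a system output by pairwise comparison of module outputs.

    Returns (selected_output, faulty_indices, number_of_comparisons).
    """
    n = len(outputs)
    agrees = [False] * n
    comparisons = 0
    # Comparator: one comparison for every pair of modules, N(N-1)/2 in total.
    for i, j in combinations(range(n), 2):
        comparisons += 1
        if outputs[i] == outputs[j]:
            agrees[i] = agrees[j] = True
    # Detection circuit: modules that agree with nobody are flagged as faulty.
    faulty = [i for i in range(n) if not agrees[i]]
    healthy = [i for i in range(n) if agrees[i]]
    if not healthy:
        raise RuntimeError("no two modules agree; fault cannot be masked")
    # Multiplexer: pass on the output of the first module not flagged faulty.
    return outputs[healthy[0]], faulty, comparisons


# Five modules, three of which (N - 2) deliver distinct wrong values.
print(sift_out([7, 7, 3, 9, 1]))  # prints (7, [2, 3, 4], 10)
```

For five modules the sketch performs ten comparisons, and the correct value is still selected while two modules agree, which reflects the tolerance of up to $N-2$ faulty modules under the stated assumption.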


System Recovery by Software. Retry techniques require error detection by checkers, and immediately after the error detection the same operations are performed again. In contrast, checkpoint techniques allow some latency time after error detection because the process can be restored to an earlier point of execution. Checkpointing is mostly implemented in software and requires some hardware to store the backup data. The techniques result from a combination of checkpointing and rollback. In checkpointing, a complete copy of the system state should be saved at specific points, namely checkpoints, during process execution. The information to be stored is the set of system state, including data, programs, machine state, and so forth, that is necessary to restart and continue successful execution from the checkpoint. Rollback is a part of the actual recovery process and occurs after the repair, such as by reconfiguration, that removes faulty modules or equipment from the system, or after the error due to transient faults has died out. An important design criterion is how often checkpoints are to be set, that is, the choice of checkpoint interval. If the checkpoints are too infrequent for the actual error rate experienced, too much computation time will be lost due to rollback. On the other hand, too frequent checkpoints result in an unnecessary increase in operation time and memory due to the overhead of saving system states when establishing checkpoints.
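The interplay of checkpointing and rollback, and the cost of the checkpoint interval, can be illustrated with a small sketch. The class CheckpointedProcess and its fields are hypothetical names chosen for this example, and the system state is reduced to a Python dictionary copied with copy.deepcopy.

```python
import copy


class CheckpointedProcess:
    """Toy process that saves its state every `interval` steps."""

    def __init__(self, state, interval):
        self.state = state
        self.interval = interval
        self.steps_done = 0
        # The initial state serves as the first checkpoint.
        self._saved = (0, copy.deepcopy(state))

    def step(self, operation):
        """Execute one operation and establish a checkpoint when due."""
        operation(self.state)
        self.steps_done += 1
        if self.steps_done % self.interval == 0:
            self._saved = (self.steps_done, copy.deepcopy(self.state))

    def rollback(self):
        """Restore the last checkpoint; return the number of steps lost."""
        saved_step, saved_state = self._saved
        lost = self.steps_done - saved_step
        self.steps_done = saved_step
        self.state = copy.deepcopy(saved_state)
        return lost


def add_one(state):
    """Example operation acting on the process state."""
    state["total"] += 1


process = CheckpointedProcess({"total": 0}, interval=3)
for _ in range(5):
    process.step(add_one)
lost_steps = process.rollback()   # an error is assumed detected after step 5
print(lost_steps, process.state)  # prints 2 {'total': 3}
```

A larger interval means fewer state copies but more lost steps per rollback, which is the trade-off described above.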

Figure 1.10 Sift-out redundancy.

1.4 Code Design Process for Dependable Systems

What types of dependability techniques are the most effective in the design of dependable systems? In some cases, techniques other than coding, or a combination of coding techniques and other dependability techniques, will better meet the reliability requirement or the cost / performance requirement of a system.

Before designing the error control codes, we therefore have to pay attention to a number of preconditions or preparatory measures: Where should the code be applied? How can the code be applied effectively? How much must the coding technique improve the reliability of the system while satisfying its performance requirements? What are the requirements for decoding speed, and how much decoder hardware is acceptable? What about the detection capability for errors falling outside the capability of the code? This section addresses all these important questions with respect to the code design process.

1.4.1 Code Functions

Error detection and error correction are the best-known code functions. An important code function that lies midway between these two functions is error location. The error locating code indicates which blocks, or components, of a word contain errors but does not indicate the precise erroneous digit position or the error value. This is a code function that is efficient for retransmission of a word segment, especially in communication systems where whole words do not need retransmitting [WOLF63]. Also in computer systems the error locating code provides the information on where to find the faulty module, faulty package / card, or faulty device, which is very useful for system maintenance. If the system is equipped with spares, then the system can be recovered by removing the faulty blocks and switching to the spares.

Figure 1.11 shows the different functions of these three code types. Because the erroneous position and the error value can be determined by use of ‘‘error correction’’, all errors can be corrected. Of course, use of ‘‘error detection’’ alone does not allow any erroneous position or error value to be determined; it only indicates the presence of error in a word. For ‘‘error location’’, as was mentioned before, only the area of the word that includes an erroneous position is indicated by the code. For example, note in Figure 1.11 that the code's information is that errors exist in the second block of the word, and neither the definite error positions in the block nor the error values are determined. Error locating codes will be covered in Chapter 9. Many practical codes, in general, have a mixture of these code functions, for example, single error correction and double error detection.

Figure 1.11
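The difference between error detection and error location can be made concrete with a deliberately simple scheme: one even-parity bit per block. This is only an illustrative assumption and not one of the error locating codes of Chapter 9; the function names are hypothetical, and the scheme locates a block only when it holds an odd number of bit errors.

```python
def parity(bits):
    """Return 0 for an even number of 1s, 1 for an odd number."""
    return sum(bits) % 2


def encode(data, block_size):
    """Split data into blocks and append one even-parity bit to each block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [block + [parity(block)] for block in blocks]


def detect(blocks):
    """Error detection: reports only that the word contains some error."""
    return any(parity(block) != 0 for block in blocks)


def locate(blocks):
    """Error location: reports which blocks contain an error, not which bit."""
    return [i for i, block in enumerate(blocks) if parity(block) != 0]


word = encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1], block_size=4)
word[1][2] ^= 1        # inject a single bit error into the second block
print(detect(word))    # prints True
print(locate(word))    # prints [1], i.e., the second block
```

As in the example of Figure 1.11, the result names the second block as erroneous without giving the bit position or the error value; correcting the error would require a stronger code.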

1.4.2 Code Design Process

Before attempting the design of codes, we need to give the following items our careful consideration:

1. Circumstances in which the systems or equipment with the coding techniques are to be applied, for example, the particular needs of medical appliances, nuclear appliances, or digital systems in aircraft or satellites.

2. Fabrication structure, that is, how the systems or the equipment are organized, for example, chip / card (package) organization, bit / byte organization, or binary / nonbinary.

3. Devices, such as memories, logic circuits, or FPGAs, that are used in the system to which the coding techniques are to be applied.

4. Combination of fault / error masking techniques with coding techniques.

The design process for the error control codes is presented next, and is shown in Figure 1.12. Steps 1 through 3 pertain to the phase of setting code parameters, and steps 4 and 5 are for the phase of code designing.

Step 1 Determine error rates and error types:
  - Raw error rate of devices, modules, or systems, and the target error rate to attain
  - Whether symmetric error, asymmetric error, or unidirectional error
  - Whether equal error or unequal error
  - Whether random bit error, byte error, spotty byte error (see Chapter 7), or burst error
  - Whether bit or byte error, or rather, bit plus byte error (see Chapter 6)

Step 2 Determine code parameters and code constraints:
  - Information-bit length, and required check-bit length
  - Maximum random bit error length, or byte error length, spotty error length (see Chapter 7), or burst error length
  - Required decoding speed
  - Required decoder hardware complexity

Step 3 Determine code function:
  - Error detection, error correction, error location, or a mixed type of these code functions

Step 4 Design code, and calculate code bounds:
  - Theoretical bound on code length or check-bit length
  - Mathematical knowledge required for code design, for example, algebra, combinatorial mathematics, number theory, graph theory, statistics, and probability theory

Step 5 Evaluate the code designed:
  - Check-bit length, and comparison to its bound
  - Decoding speed
  - Decoder hardware complexity
  - Error detection probability of multiple errors beyond the code capability

If the code does not satisfy the requirements, then go back to step 4.


[AVIZ78] A. Avizienis, “Fault Tolerance, the Survival Attribute of Digital Systems,” Proc. IEEE, 66 (October 1978): 1109–1125.

[AVIZ04] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic Concepts and Taxonomy of Dependable and Secure Computing,” IEEE Trans. Depend. Secure Comput., 1 (January–March 2004): 11–33.

[AVRE00] D. R. Avresky (ed.), Dependable Network Computing, Kluwer Academic Publishers (2000).

[BLAU93] M. Blaum, Codes for Detecting and Correcting Unidirectional Errors, IEEE Computer Society Press (1993).

[CALV94] P. Calvel, P. Lamothe, and C. Barillot, “Space Radiation Evaluation of 16 Mbit DRAMs for Mass Memory Applications,” IEEE Trans. Nucl. Sci., 41 (December 1994): 2267–2271.

[EZHI86] P. D. Ezhilchevan and S. K. Shrivastava, “A Characterization of Faults in Systems,” Proc. 5th Symp. on Reliability in Distributed Software and Database Systems (January 1986): 215–222.

[GEFF02] J.-C. Geffroy and G. Motet, Design of Dependable Computing Systems, Kluwer Academic Publishers (2002), chs. 1–5.

[HAZU00] P. Hazucha, C. Svensson, and S. A. Wender, “Cosmic-Ray Soft Error Rate

[LAPR92] J. C. Laprie (ed.), Dependability: Basic Concepts and Terminology, Springer-Verlag (1992).

[LEE90] P. A. Lee and T. Anderson, Fault Tolerance, Principles and Practice, Springer-Verlag (1990).

[LEVE95] N. Leveson, Safeware, Addison-Wesley (1995).

[LO05] J. C. Lo and E. Fujiwara, “Transient Behavior of the Encoding/Decoding Circuits of Error Control Codes,” Proc. IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems (October 2005):

[MAKI00] A. Makihara, H. Shindou, N. Nemoto, S. Kuboyama, S. Matsuda, T. Oshima, T. Hirao, H. Itoh, S. Buchner, and A. B. Campbell, “Analysis of Single-Ion Multiple-Bit Upset in High-Density DRAMs,” IEEE Trans. Nucl. Sci., 47 (December 2000): 2400–2404.

[MASS96] L. W. Massengill, “Cosmic and Terrestrial Single-Event Radiation Effects in Dynamic Random Access Memories,” IEEE Trans. Nucl. Sci., 43 (April 1996): 576–593.

[MAY79] T. C. May, “Soft Errors in VLSI: Present and Future,” IEEE Trans. Comp. Hybrids Manuf. Technol., CHMT-2 (December 1979): 377–387.

[MUEL99] M. Mueller, L. C. Alves, W. Fischer, M. L. Fair, and I. Modi, “RAS Strategy for IBM S/390 G5 and G6,” IBM J. Res. Dev., 43 (September–November 1999): 875–888.

[NOOR80] D. J. W. Noorlag, L. M. Terman, and A. G. Konheim, “The Effect of Alpha-Particle-Induced Soft Errors on Memory Systems with Error Correction,” IEEE J. Solid-State Circ., SC-15 (June 1980): 319–325.

[OGOR96] T. J. O’Gorman, J. M. Ross, A. H. Taber, J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, and J. L. Walsh, “Field Testing for Cosmic Ray Soft Errors in Semiconductor Memories,” IBM J. Res. Dev., 40 (January 1996): 41–50.

[OSAD03] K. Osada, Y. Saitoh, E. Ibe, and K. Ishibashi, “16.7-fA/Cell Tunnel-Leakage-Suppressed 16-Mb SRAM for Handling Cosmic-Ray-Induced Multierrors,” IEEE J. Solid-State Circ., 38 (November 2003): 1952–1957.

[OSAD04] K. Osada, K. Yamaguchi, Y. Saitoh, and T. Kawahara, “SRAM Immunity to Cosmic-Ray-Induced Multierrors Based on Analysis of an Induced Parasitic Bipolar Effect,” IEEE J. Solid-State Circ., 19 (May 2004): 827–833.

[PRAD86] D. K. Pradhan, Fault-Tolerant Computing, Vol. 1 and 2, Prentice-Hall (1986).

[RENN84] D. A. Rennels, “Fault-Tolerant Computing: Concepts and Examples,” IEEE Trans. Comput., C-33 (December 1984): 1116–1129.

[SAIH82] G. A. Sai-Halasz, M. R. Wordeman, and R. H. Denard, “Alpha-Particle-Induced Soft Error Rate in VLSI Circuits,” IEEE J. Solid-State Circ., SC-17 (April 1982): 355–361.

[SELL68] F. F. Sellers, Jr., M. Y. Hsiao, and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill (1968).

[SHED78] J. J. Shedletsky, “Error Correction by Alternate-Data Retry,” IEEE Trans. Comput., C-27 (February 1978): 106–112.

[SIEW82] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press (1982).

[SRIN96] G. R. Srinivasan, “Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview,” IBM J. Res. Dev., 40 (January 1996): 77–89.

[WOLF63] J. K. Wolf and B. Elspas, “Error-Locating Codes: A New Concept in Error Control,” IEEE Trans. Info. Theory, IT-9 (April 1963): 113–117.

[ZIEG96a] J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, B. Chin, et al., “IBM Experiments in Soft Fails in Computer Electronics (1978–1994),” IBM J. Res. Dev., 40 (January 1996): 3–18.

[ZIEG96b] J. F. Ziegler, “Terrestrial Cosmic Rays,” IBM J. Res. Dev., 40 (January 1996): 19–39.

[ZIEG96c] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O’Gorman, and J. M. Ross, “Accelerated Testing for Cosmic Soft-Error Rate,” IBM J. Res. Dev., 40 (January 1996): 51–72.

[ZIEG98] J. F. Ziegler, M. E. Nelson, J. D. Shell, R. J. Peterson, C. J. Gelderloos, H. P. Mahlfeld, and C. J. Montrose, “Cosmic Ray Soft Error Rates of 16-Mb DRAM Memory Chips,” IEEE J. Solid-State Circ., 33 (February 1998): 246–252.

Trang 38

2.1 Introduction to Algebra
    2.1.1 Groups and Rings
    2.1.2 Fields
    2.1.3 Representation for Elements of Galois Fields
    2.1.4 Properties of Galois Field $GF(2^m)$
2.2 Linear Codes
    2.2.1 Vector Space and Subspace
    2.2.2 Linear Codes as Vector Spaces
    2.2.3 Matrix Algebra
    2.2.4 Distance and Error Control Capability
    2.2.5 Parity-Check Matrices for Linear Codes
2.3 Basic Matrix Codes
    2.3.1 Simple Parity-Check Codes
    2.3.2 Hamming Single Error Correcting (SEC) Codes
    2.3.3 Hamming Single Error Correcting and Double Error Detecting (SEC-DED) Codes
    2.3.4 Cyclic Codes
    2.3.5 Binary BCH Codes
    2.3.6 Reed-Solomon Codes as Nonbinary BCH Codes
    2.3.7 Burst Error Correcting Fire Codes
Exercises
References


Mathematical Background and Matrix Codes

The research in error control codes has relied to a large extent on the powerful structures of modern algebra. A number of important and practical codes based on the structure of rings and Galois fields have been developed. This chapter provides the algebraic structures and the fundamental codes, expressed mostly by matrices, necessary to understand the subsequent chapters and to design codes that fit practical requirements. The level of the discussion is introductory. For a more rigorous treatment, the reader is advised to consult the following excellent texts on coding theory: [PETE72, MACW77, BRAH84, BERL84, PLES98, LIN04].

2.1 Introduction to Algebra

The most important ideas in coding theory are based on the arithmetic systems of modern algebra. These systems are not familiar to most of us, so here we pause to develop a background in this mathematics before we proceed to study coding theory and to design practical codes.

2.1.1 Groups and Rings

A group is a mathematical abstraction of an algebraic structure. A ring is also an abstract set that is an Abelian group and has an additional structure.


$$a * b = b * a.$$

This is called the commutative axiom. Groups that satisfy this additional axiom are called commutative groups, or Abelian groups. In every group the identity element is unique. The inverse of each group element is also unique, and it follows that $(a^{-1})^{-1} = a$. The proofs of these are left to the reader to complete.
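For a finite set, the group properties and the commutative axiom can be checked by exhaustive enumeration. The following is a minimal Python sketch under the standard group axioms (closure, associativity, identity, inverses). The helper names `is_group` and `is_abelian`, and the two example structures (integers under addition modulo 4 and the permutations of three symbols under composition), are illustrative assumptions and are not taken from the text.

```python
from itertools import permutations, product

def is_group(elements, op):
    """Check closure, associativity, identity, and inverses by brute force."""
    elems = list(elements)
    elem_set = set(elems)
    # Closure: a * b must stay inside the set.
    if any(op(a, b) not in elem_set for a, b in product(elems, repeat=2)):
        return False
    # Associativity: (a * b) * c == a * (b * c).
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elems, repeat=3)):
        return False
    # Identity: exactly one two-sided identity element must exist.
    identities = [e for e in elems
                  if all(op(e, a) == a and op(a, e) == a for a in elems)]
    if len(identities) != 1:
        return False
    e = identities[0]
    # Inverses: every element needs a two-sided inverse.
    return all(any(op(a, b) == e and op(b, a) == e for b in elems)
               for a in elems)

def is_abelian(elements, op):
    """Check the commutative axiom a * b == b * a for all pairs."""
    elems = list(elements)
    return all(op(a, b) == op(b, a) for a, b in product(elems, repeat=2))

def add_mod4(a, b):
    return (a + b) % 4

def compose(p, q):
    """Composition of permutations stored as tuples: (p o q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(len(q)))

Z4 = [0, 1, 2, 3]
S3 = list(permutations(range(3)))

print(is_group(Z4, add_mod4), is_abelian(Z4, add_mod4))  # Abelian group
print(is_group(S3, compose), is_abelian(S3, compose))    # group, not Abelian
```

The second example shows that the commutative axiom is a genuine extra requirement: the permutations of three symbols satisfy every group axiom but do not commute.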

Homomorphism and Isomorphism. The number of elements in a group is said to be the order of the group. A homomorphism is a mapping $f$ that preserves the structure of the sets between two groups $\{A, *\}$ and $\{A', *'\}$, meaning $f : A \to A'$, which satisfies $f(a * b) = f(a) *' f(b)$ for $a, b \in A$. If there exists a homomorphism from $A$ onto $A'$, then $A'$ is said to be homomorphic to $A$. In particular, if $f$ is a bijection (one-to-one and onto), so that the inverse mapping $f' : A' \to A$ is also a homomorphism, then $f$ is said to be an isomorphism, and $A$ and $A'$ are said to be isomorphic. The reader is advised to find an isomorphism between the set of integers under addition modulo 4, $Z_4 = \{0, 1, 2, 3\}$, and the multiplicative group $G = \{1, 2, 3, 4\}$ under multiplication modulo 5.
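A candidate answer to this exercise can be checked mechanically. The sketch below assumes that $G = \{1, 2, 3, 4\}$ is taken under multiplication modulo 5 (the modulus under which this set forms a group) and that $Z_4$ is taken under addition modulo 4. The function names `is_homomorphism` and `find_isomorphisms` are hypothetical helpers introduced only for illustration. The search tries every bijection and keeps those satisfying $f(a * b) = f(a) *' f(b)$.

```python
from itertools import permutations

Z4 = [0, 1, 2, 3]   # integers under addition modulo 4
G5 = [1, 2, 3, 4]   # assumed: nonzero residues under multiplication modulo 5

def add_mod4(a, b):
    return (a + b) % 4

def mul_mod5(a, b):
    return (a * b) % 5

def is_homomorphism(f, domain, op, op_prime):
    """Check f(a * b) == f(a) *' f(b) for every pair in the domain."""
    return all(f[op(a, b)] == op_prime(f[a], f[b])
               for a in domain for b in domain)

def find_isomorphisms(A, op, A_prime, op_prime):
    """Enumerate all bijections A -> A' and keep the structure-preserving ones."""
    found = []
    for image in permutations(A_prime):
        f = dict(zip(A, image))  # a bijection, because image is a permutation
        if is_homomorphism(f, A, op, op_prime):
            found.append(f)
    return found

isomorphisms = find_isomorphisms(Z4, add_mod4, G5, mul_mod5)
for f in isomorphisms:
    print(f)
```

Each mapping that is printed sends the additive identity 0 to the multiplicative identity 1, as any homomorphism must. Exhaustive search is practical only for very small groups, since the number of bijections grows factorially with the order.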

Subgroups

Definition 2.2 Let $G$ be a group and let $F$ be a subset of $G$. Then $F$ is called a subgroup of $G$ if $F$ satisfies all properties of a group with the same operation.
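Definition 2.2 can be tested directly on small examples. The following is a minimal sketch that assumes $G$ is already known to be a finite group, so that associativity is inherited by any subset and only the identity, closure, and inverses need checking. The name `is_subgroup` and the choice of $Z_4$ under addition modulo 4 as the example are assumptions made for illustration.

```python
from itertools import combinations

def add_mod4(a, b):
    return (a + b) % 4

def is_subgroup(F, G, op):
    """Check whether subset F of the finite group (G, op) is a subgroup.

    Assumes (G, op) is a group; associativity then holds on F automatically.
    """
    G = list(G)
    F = set(F)
    if not F or not F <= set(G):
        return False
    # Locate the identity element of G.
    e = next(x for x in G
             if all(op(x, g) == g and op(g, x) == g for g in G))
    if e not in F:
        return False
    # Closure of F under the same operation.
    if any(op(a, b) not in F for a in F for b in F):
        return False
    # Every element of F must have its inverse inside F.
    return all(any(op(a, b) == e for b in F) for a in F)

Z4 = [0, 1, 2, 3]

# Enumerate every nonempty subset of Z4 and keep the subgroups.
subgroups = [list(c)
             for r in range(1, len(Z4) + 1)
             for c in combinations(Z4, r)
             if is_subgroup(c, Z4, add_mod4)]
print(subgroups)
```

The subset $\{0, 1\}$ is rejected because $1 + 1 = 2$ falls outside it, while $\{0, 2\}$ is accepted because it contains the identity, is closed under addition modulo 4, and contains the inverse of each of its elements.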
