Fault Tolerant Computer Architecture
Editor
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 150-page publications on topics pertaining to the science and art of designing, analyzing, selecting, and interconnecting hardware components to create computers that meet functional, performance, and cost goals.
Fault Tolerant Computer Architecture
Daniel Sorin
2009
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, James Laudon
2007
Transactional Memory
James R. Larus, Ravi Rajwar
2007
Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006
Synthesis Lectures on Computer Architecture
Copyright © 2009 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.
Fault Tolerant Computer Architecture
Daniel Sorin
www.morganclaypool.com
ISBN: 9781598299533 paperback
ISBN: 9781598299540 ebook
DOI: 10.2200/S00192ED1V01Y200904CAC005
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #5
Series Editor: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
ISSN 1935-3235 print
ISSN 1935-3243 electronic
Fault Tolerant Computer Architecture
Daniel J. Sorin
Duke University
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #5
ABSTRACT
For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore’s law into remarkable increases in performance. Recently, however, the bounty provided by Moore’s law has been accompanied by several challenges that have arisen as devices have become smaller, including a decrease in dependability due to physical faults. In this book, we focus on the dependability challenge and the fault tolerance solutions that architects are developing to overcome it. The two main purposes of this book are to explore the key ideas in fault-tolerant computer architecture and to present the current state of the art, covering approximately the past 10 years, in academia and industry.
KEYWORDS
fault tolerance (or fault tolerant), reliability, dependability, computer architecture, error detection, error recovery, fault diagnosis, self-repair, autonomous, dynamic verification
Dedication
“To Deborah, Jason, and Julie”
Acknowledgments
I would like to thank my family for their support while I was writing this lecture. I would also like to thank Mark Hill for inviting me to write this lecture and Mike Morgan for organizing the production of the lecture. Valuable feedback on early drafts of the lecture was provided by Babak Falsafi, Jude Rivers, and Mark Hill. I would also like to thank Lihao Xu for helping me with a question about error coding.
Contents
1 Introduction
1.1 Goals of this Book
1.2 Faults, Errors, and Failures
1.2.1 Masking
1.2.2 Duration of Faults and Errors
1.2.3 Underlying Physical Phenomena
1.3 Trends Leading to Increased Fault Rates
1.3.1 Smaller Devices and Hotter Chips
1.3.2 More Devices per Processor
1.3.3 More Complicated Designs
1.4 Error Models
1.4.1 Error Type
1.4.2 Error Duration
1.4.3 Number of Simultaneous Errors
1.5 Fault Tolerance Metrics
1.5.1 Availability
1.5.2 Reliability
1.5.3 Mean Time to Failure
1.5.4 Mean Time Between Failures
1.5.5 Failures in Time
1.5.6 Architectural Vulnerability Factor
1.6 The Rest of This Book
1.7 References
2 Error Detection
2.1 General Concepts
2.1.1 Physical Redundancy
2.1.2 Temporal Redundancy
2.1.3 Information Redundancy
2.1.4 The End-to-End Argument
2.2 Microprocessor Cores
2.2.1 Functional Units
2.2.2 Register Files
2.2.3 Tightly Lockstepped Redundant Cores
2.2.4 Redundant Multithreading Without Lockstepping
2.2.5 Dynamic Verification of Invariants
2.2.6 High-Level Anomaly Detection
2.2.7 Using Software to Detect Hardware Errors
2.2.8 Error Detection Tailored to Specific Fault Models
2.3 Caches and Memory
2.3.1 Error Code Implementation
2.3.2 Beyond EDCs
2.3.3 Detecting Errors in Content Addressable Memories
2.3.4 Detecting Errors in Addressing
2.4 Multiprocessor Memory Systems
2.4.1 Dynamic Verification of Cache Coherence
2.4.2 Dynamic Verification of Memory Consistency
2.4.3 Interconnection Networks
2.5 Conclusions
2.6 References
3 Error Recovery
3.1 General Concepts
3.1.1 Forward Error Recovery
3.1.2 Backward Error Recovery
3.1.3 Comparing the Performance of FER and BER
3.2 Microprocessor Cores
3.2.1 FER for Cores
3.2.2 BER for Cores
3.3 Single-Core Memory Systems
3.3.1 FER for Caches and Memory
3.3.2 BER for Caches and Memory
3.4 Issues Unique to Multiprocessors