Blueprints for High Availability
Second Edition
Evan Marcus
Hal Stern
Executive Publisher: Robert Ipsen
Development Editor: Scott Amerman
Editorial Manager: Kathryn A. Malm
Production Editor: Vincent Kunkemueller
Text Design & Composition: Wiley Composition Services
Copyright © 2003 by Wiley Publishing, Inc., Indianapolis, Indiana. All rights reserved. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8700. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4447, E-mail: permcoordinator@wiley.com.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Trademarks: Wiley, the Wiley Publishing logo and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Cataloging-in-Publication Data is available from the publisher.
ISBN: 0-471-43026-9
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
For Carol, Hannah, Madeline, and Jonathan
—Evan Marcus
For Toby, Elana, and Benjamin
—Hal Stern
Contents
For the Second Edition
Preface from the First Edition

Chapter 3: The Value of Availability
What Is High Availability?
Direct Costs of Downtime
Indirect Costs of Downtime
The Value of Availability
Example 1: Clustering Two Nodes
Example 2: Unknown Cost of Downtime
The Availability Continuum
The Availability Index
The Lifecycle of an Outage

Chapter 4: The Politics of Availability
Beginning the Persuasion Process
Delivering the Message
The Slide Presentation
After the Message Is Delivered

Chapter 5: 20 Key High Availability Design Principles
#18: Remove Single Points of Failure (SPOFs)
#16: Consolidate Your Servers
#14: Enforce Change Control
#13: Document Everything
#12: Employ Service Level Agreements
#9: Separate Your Environments
#8: Learn from History
#6: Choose Mature Software
#5: Choose Mature, Reliable Hardware
#4: Reuse Configurations
#3: Exploit External Resources
#2: One Problem, One Solution
#1: K.I.S.S. (Keep It Simple ...)

Chapter 6: Backups and Restores
The Basic Rules for Backups
Do Backups Really Offer High Availability?
What Should Get Backed Up?
Getting Backups Off-Site
Commercial or Homegrown?
Examples of Commercial Backup Software
Commercial Backup Software Features
Improving Backup Performance:
Solving for Performance
Use More Hardware
Third-Mirror Breakoff
Sophisticated Software Features
Copy-on-Write Snapshots
Fast and Flash Backup
Handling Backup Tapes and Data
General Backup Security
Disk Space Requirements for Restores

Chapter 7: Highly Available Data Management
Four Fundamental Truths
Likelihood of Failure of Disks
Ensuring Data Accessibility
Six Independent Layers of Data Storage and Management
Disk Hardware and Connectivity Terminology
Managing Disk and Volume Availability

Chapter 8: SAN, NAS, and Virtualization
Storage Area Networks (SANs)
Network Failure Taxonomy
Network Reliability Challenges
Network Failure Modes
Physical Device Failures
IP Address Configuration
Congestion-Induced Failures
Network Traffic Congestion
Design and Operations Guidelines
Building Redundant Networks
Redundant Network Connections
Redundant Network Attach
Multiple Network Attach
Configuring Multiple Networks
IP Routing Redundancy
Dynamic Route Recovery
Static Route Recovery with VRRP
Routing Recovery Guidelines
Choosing Your Network Recovery Model
Load Balancing and Network Redirection
Network Service Reliability
Network Service Dependencies
Hardening Core Services
Denial-of-Service Attacks

Chapter 10: Data Centers and the Local Environment
Advantages and Disadvantages to Data Center Racks
The China Syndrome Test
Balancing Security and Access
Off-Site Hosting Facilities

Chapter 11: People and Processes
System Management and Modifications
Maintenance Plans and Processes
Working with Your Vendors
The Vendor’s Role in System Recovery
Vendor Consulting Services
The Audience for Documentation
Documentation and Security
Reviewing Documentation
System Administrators

Chapter 12: Clients and Consumers
Hardening Enterprise Clients
Database Application Recovery

Chapter 13: Application Design
Application Recovery Overview
Application Failure Modes
Application Recovery Techniques
Kinder, Gentler Failures
Application Recovery from System Failures
Virtual Memory Exhaustion
Memory Corruption and Recovery
Boundary Condition Checks

Chapter 14: Data and Web Services
Network File System Services
Detecting RPC Failures
NFS Server Constraints
Inside an NFS Failover
Optimizing NFS Recovery
Database Servers
Managing Recovery Time
Web Services Standards

Chapter 15: Local Clustering and Failover
A Brief and Incomplete History of Clustering
Server Failures and Failover
Logical, Application-centric Thinking
Failover Requirements

Chapter 16: Failover Management and Issues
Failover Management Software (FMS)
Who Performs a Test, and Other Component Monitoring Issues
When Component Tests Fail
Time to Manual Failover
Homemade Failover Software or Commercial Software?
Commercial Failover Management Software
When Good Failovers Go Bad
Causes and Remedies of Split-Brain Syndrome
Undesirable Failovers
Verification and Testing
State Transition Diagrams

Chapter 17: Failover Configurations
Two-Node Failover Configurations
Active-Passive Failover
Active-Passive Issues and Considerations
How Can I Use the Standby Server?
Active-Active Failover
Active-Active or Active-Passive?
Service Group Failover
Larger Cluster Configurations

Chapter 19: Virtual Machines and Resource Management
Partitions and Domains: System-Level VMs
Containers and Jails: OS Level VMs

Chapter 20: The Disaster Recovery Plan
Should You Worry about DR?
Three Primary Goals of a DR Plan
Health and Protection of the Employees
The Survival of the Enterprise
The Continuity of the Enterprise
What Goes into a Good DR Plan
Preparing to Build the DR Plan
So What Should You Do?
Equipping the DR Site
Is Your Plan Any Good?
Qualities of a Good Exercise
Planning for an Exercise
Possible Exercise Limitations
Make It More Realistic
Ideas for an Exercise Scenario
Three Types of Exercises
The Effects of a Disaster on People
Typical Responses to Disasters
What Can the Enterprise Do to Help?

Chapter 21: A Resilient Enterprise*
The New York Board of Trade
No Way for a Major Exchange to Operate
Chaotic Trading Environment
Improvements to the DR Site
The New Trading Facility
Future Disaster Recovery Plans
The Outcry for Open Outcry
Modernizing the Open Outcry Process
The Effects on the People
Preface
For the Second Edition
The strong positive response to the first edition of Blueprints for High Availability was extremely gratifying. It was very encouraging to see that our message about high availability could find a receptive audience. We received a lot of great feedback about our writing style that mentioned how we were able to explain technical issues without getting too technical in our writing.
Although the comments that reached us were almost entirely positive, this book is our child, and we know where the flaws in the first edition were. In this second edition, we have filled some areas out that we felt were a little flat the first time around, and we have paid more attention to the arrangement of the chapters this time.
Without question, our “Tales from the Field” received the most praise from our readers. We heard from people who said that they sat down and just skimmed through the book looking for the Tales. That, too, is very gratifying. We had a lot of fun collecting them, and telling the stories in such a positive way. We have added a bunch of new ones in this edition. Skim away!
Our mutual thanks go out to the editorial team at John Wiley & Sons. Once again, the push to complete the book came from Carol Long, who would not let us get away with slipped deadlines, or anything else that we tried to pull. We had no choice but to deliver a book that we hope is as well received as the first edition. She would accept nothing less. Scott Amerman was a new addition to the team this time out. His kind words of encouragement balanced with his strong insistence that we hit our delivery dates were a potent combination.
From Evan Marcus
It’s been nearly four years since Hal and I completed our work on the first edition of Blueprints for High Availability, and in that time, a great many things have changed. The biggest personal change for me is that my family has had a new addition. At this writing, my son Jonathan is almost three years old. A more general change over the last 4 years is that computers have become much less expensive and much more pervasive. They have also become much easier to use. Jonathan often sits down in front of one of our computers, turns it on, logs in, puts in a CD-ROM, and begins to play games, all by himself. He can also click his way around Web sites like www.pbskids.org. I find it quite remarkable that a three-year-old who cannot quite dress himself is so comfortable in front of a computer.
The biggest societal change that has taken place in the last 4 years (and, in fact, in much longer than the last 4 years) occurred on September 11, 2001, with the terrorist attacks on New York and Washington, DC. I am a lifelong resident of the New York City suburbs, in northern New Jersey, where the loss of our friends, neighbors, and safety is keenly felt by everyone. But for the purposes of this book, I will confine the discussion to how computer technology and high availability were affected.
In the first edition, we devoted a single chapter to the subject of disaster recovery, and in it we barely addressed many of the most important issues. In this, the second edition, we have totally rewritten the chapter on disaster recovery (Chapter 20, “The Disaster Recovery Plan”), based in part on many of the lessons that we learned and heard about in the wake of September 11. We have also added a chapter (Chapter 21, “A Resilient Enterprise”) that tells the most remarkable story of the New York Board of Trade, and how they were able to recover their operations on September 11 and were ready to resume trading less than 12 hours after the attacks. When you read the New York Board of Trade’s story, you may notice that we did not discuss the technology that they used to make their recovery. That was a conscious decision that we made because we felt that it was not the technology that mattered most, but rather the efforts of the people that allowed the organization to not just survive, but to thrive.
Chapter 21 has actually appeared in almost exactly the same form in another book. In between editions of Blueprints, I was co-editor and contributor to an internal VERITAS book called The Resilient Enterprise, and I originally wrote this chapter for that book. I extend my gratitude to Richard Barker, Paul Massiglia, and each of the other authors of that book, who gave me their permission to reuse the chapter here.
But some people never truly learn the lessons. Immediately after September 11, a lot of noise was made about how corporations needed to make themselves more resilient, should another attack occur. There was a great deal of discussion about how these firms would do a better job of distributing their data to multiple locations, and making sure that there were no single points of failure. Because of the economy, which suffered greatly as a result of the attacks, no money was budgeted for protective measures right away, and as time wore on, other priorities came along and the money that should have gone to replicating data and sending backups off-site was spent other ways. Many of the organizations that needed to protect themselves have done little or nothing in the time since September 11, and that is a shame. If there is another attack, it will be a great deal more than a shame.
Of course, technology has changed in the last 4 years. We felt we needed to add a chapter about some new and popular technology related to the field of availability. Chapter 8 is an overview of SANs, NAS, and storage virtualization. We also added Chapter 22, which is a look at some emerging technologies.
Despite all of the changes in society, technology, and families, the basic principles of high availability that we discussed in the first edition have not changed. The mission statement that drove the first book still holds: “You cannot achieve high availability by simply installing clustering software and walking away.” The technologies that systems need to achieve high availability are not automatically included by system and operating system vendors. It’s still difficult, complex, and costly.
We have tried to take a more practical view of the costs and benefits of high availability in this edition, making our Availability Index model much more detailed and prominent. The technology chapters have been arranged in an order that maps to their positions on the Index; earlier chapters discuss more basic and less expensive examples of availability technology like backups and disk mirroring, while later chapters discuss more complex and expensive technologies that can deliver the highest levels of availability, such as replication and disaster recovery.
As much as things have changed since the first edition, one note that we included in that Preface deserves repeating here: Some readers may begrudge the lack of simple, universal answers in this book. There are two reasons for this. One is that the issues that arise at each site, and for each computer system, are different. It is unreasonable to expect that what works for a 10,000-employee global financial institution will also work for a 10-person law office. We offer the choices and allow the reader to determine which one will work best in his or her environment. The other reason is that after 15 years of working on, with, and occasionally for computers, I have learned that the most correct answer to most computing problems is a rather unfortunate, “It depends.”
Writing a book such as this one is a huge task, and it is impossible to do it alone. I have been very fortunate to have had the help and support of a huge cast of terrific people. Once again, my eternal love and gratitude go to my wonderful wife Carol, who puts up with all of my ridiculous interests and hobbies (like writing books), our beautiful daughters Hannah and Madeline, and our delightful son Jonathan. Without them and their love and support, this book would simply not have been possible. Thanks, too, for your love and support to my parents, Roberta and David Marcus, and my in-laws, Gladys and Herb Laden, who still haven’t given me that recipe.