Contents
Overview
Troubleshooting Cluster Service
Review
Module 7: Server Cluster Maintenance and Troubleshooting
... to represent any real individual, company, product, or event, unless otherwise noted. Complying with all applicable copyright laws is the responsibility of the user. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Microsoft Corporation. If, however, your only means of access is electronic, permission to print one copy is hereby granted.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2000 Microsoft Corporation. All rights reserved.
Microsoft, Active Directory, BackOffice, JScript, PowerPoint, Visual Basic, Visual Studio, Win32, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other countries.
Other product and company names mentioned herein may be the trademarks of their respective owners.
Program Manager: Don Thompson
Product Manager: Greg Bulette
Instructional Designers: April Andrien, Priscilla Johnston, Diana Jahrling
Subject Matter Experts: Jack Creasey, Jeff Johnson
Technical Contributor: James Cochran
Classroom Automation: Lorrin Smith-Bates
Graphic Designer: Andrea Heuston (Artitudes Layout & Design)
Editing Manager: Lynette Skinner
Editor: Elizabeth Reese
Copy Editor: Bill Jones (S&T Consulting)
Production Manager: Miracle Davis
Build Manager: Julie Challenger
Print Production: Irene Barnett (S&T Consulting)
CD Production: Eric Wagoner
Test Manager: Eric R Myers
Test Lead: Robertson Lee (Volt Technical)
Creative Director: David Mahlmann
Media Consultation: Scott Serna
Illustration: Andrea Heuston (Artitudes Layout & Design)
Localization Manager: Rick Terek
Operations Coordinator: John Williams
Manufacturing Support: Laura King; Kathy Hershey
Lead Product Manager, Release Management: Bo Galford
Lead Technology Manager: Sid Benavente
Lead Product Manager, Content Development: Ken Rosen
Group Manager, Courseware Infrastructure: David Bramble
Group Product Manager, Content Development: Julie Truax
Director, Training & Certification Courseware Development: Dean Murray
General Manager: Robert Stewart
Instructor Notes
This module is intended to prepare the students to successfully back up and restore a server cluster. Students need to know how to use the tools that are available for troubleshooting server cluster problems. The module covers common Cluster service problems and possible resolutions.
After completing this module, you will be able to:
• Perform the steps to successfully back up a server cluster.
• Perform the steps to successfully restore a server cluster.
• Evict a node from a server cluster.
• Identify the tools that are necessary to troubleshoot a cluster failure.
• Interpret the entries in the cluster log.
• Identify and troubleshoot common server cluster failures: network communications, small computer system interface (SCSI) configuration problems, and group, resource, and quorum failures.
Materials and Preparation
This section provides the materials and preparation tasks that you need to teach this module.
Required Materials
To teach this module, you need the Microsoft® PowerPoint® file 2087A_02.ppt.
Preparation Tasks
To prepare for this module, you should:
• Read the materials for this module and anticipate questions students may ask.
• Read Q224075, Q257892, Q248998, Q172951, Q266274, Q234767, Q193890, Q245762, and “Interpreting MSCS Cluster Log,” on the Student compact disk.
• Be familiar with the Resource Kit utilities.
• Practice the labs.
• Study the review questions and prepare alternative answers for discussion.
Presentation: 45 Minutes
Lab: 15 Minutes
Module Strategy
Use the following strategy to present this module:
Because backing up the cluster is a key maintenance task, the first section begins with information on how to back up the cluster configuration files. The following pages cover the complete procedure for restoring an entire cluster in case of catastrophic failure. You can also use each of the topics as a separate procedure for performing a specific task.
The troubleshooting section lists the tools that are available for troubleshooting Cluster service and gives common problems and suggested resolutions.
Cluster Maintenance
Cluster service is self-tuning and requires no maintenance other than daily backups.
• Backup: Backing up the system state backs up the cluster configuration files; however, you also need to back up each node’s data and operating system and the cluster disks.
• Restoring the First Node: The overall procedure for restoring a cluster is outlined on this page. The first step, restoring the operating system on the first node, is also covered. The remaining steps are covered in detail on the following pages.
• Restoring Cluster Disks: Cluster service uses the disk signature file to identify the cluster disk. To replace this disk, you must write the disk signature file of the old disk onto the new disk.
• Restoring the Second Node: Restoring the remaining nodes of the cluster is similar to restoring the first node, except that after the node is restored, you need to test the failover capabilities of the cluster before putting the cluster back into the production environment.
• Evicting a Node: Evicting a node is a manual process that you perform through Cluster Administrator. As always, it is important to have a good backup of the server prior to the eviction process.
Troubleshooting Cluster Service
The key point of this section is to give the students the tools and techniques that are useful in reducing the time it takes to find a root cause for common Cluster service problems.
• Troubleshooting Tools: The tools that are used to help troubleshoot a problem with Cluster service are the same tools that are used to help troubleshoot a server running Microsoft Windows® 2000.
• Examining the Cluster Log: Cluster service logs every configuration change and problem to the cluster log. It is important for the students to become familiar with the syntax of the log.
• Troubleshooting Network Communications: Students need to know that there are different troubleshooting paths to follow depending on whether the network problem is a node-to-node or a client-to-node problem.
• SCSI Configuration Problems: SCSI is less reliable than Fibre Channel. There can be problems with the SCSI controller, SCSI termination, and SCSI cabling.
• Group and Resource Failures: Remind students to keep dependency trees vertical so that if a resource fails, it is easier to find which resource is causing the failure of the group.
• Quorum Log Corruption: If Cluster service cannot write information to the quorum log, it will not start. You can attempt to reset the quorum log, or you can delete the quorum log and let Cluster service create a new log.
Instructor Setup for a Lab
Lab Strategy
This lab is designed to prepare the students to use Backup and Clusrest.exe to perform the proper backup and restore procedures. Students will uninstall Cluster service in preparation for the Network Load Balancing (NLB) portion of the course. NLB and Cluster service cannot run on the same computer.
Lab A: Cluster Maintenance
To conduct this lab:
• Read through the lab carefully, paying close attention to the instructions and details.
• Students will need the Clusrest utility from c:\moc\2087\labfiles\mscs.
• Students work in teams of two, grouped together by their shared bus.
• Help the students determine whether they are Node A or Node B. In these exercises, each node performs a specific task in the backup and restoration procedures. Both nodes will uninstall Cluster service.
Overview
***************************** ILLEGAL FOR NON - TRAINER USE ******************************
Server cluster maintenance and troubleshooting are considered two separate disciplines. Maintenance is continuous, whereas troubleshooting has a beginning, when the problem is discovered, and an end, when the problem is resolved. The two disciplines are complementary, however. When every troubleshooting procedure that you follow fails, you will need to rebuild the cluster from a backup tape that was generated during a maintenance procedure.
After completing this module, you will be able to:
• Perform the steps to successfully back up a server cluster.
• Perform the steps to successfully restore a server cluster.
• Evict a node from a server cluster.
• Identify the tools that are necessary to troubleshoot a cluster failure.
• Interpret the entries in the cluster log.
• Identify and troubleshoot common server cluster failures: network communications, small computer system interface (SCSI) configuration problems, and group, resource, and quorum failures.
In this module, we will cover cluster maintenance in the form of backing up and restoring a cluster, and troubleshooting Cluster service.
Cluster Maintenance
Cluster service uses the self-tuning features of Microsoft® Windows® 2000 and requires very little maintenance. The only day-to-day maintenance operation that you need to perform is to back up the cluster.
Under special circumstances, a node in the cluster may need to be replaced, for example, when your organization decides to perform a hardware upgrade. In this situation, you need to evict a node from the cluster and add the upgraded node to the cluster.
Topic Objective
To introduce the fundamental tasks for maintaining a server cluster.
Backup
Backing up the cluster is no different from backing up Microsoft Windows 2000 Advanced Server. It is recommended that you perform regular backups by using the Windows 2000 Backup program (NTBackup), or other compatible backup programs. Additional backup agents are still necessary to back up applications running on the cluster, such as Microsoft SQL Server™ and Microsoft Exchange.
A cluster-aware backup program will be able to perform the same backup operations as NTBackup, especially with regard to backing up the System State and the cluster configuration database.
Backing Up the System State
The configuration information for the cluster is located in the registry on each node (HKEY_LOCAL_MACHINE\Cluster). The Backup tool that is included with Windows 2000 backs up the cluster database when you back up each node’s system state.
NTBackup backs up the system state on each node. The system state includes:
• The quorum log
• The local registry
• The Cluster registry hive
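Because the system state must be backed up on each node, it can help to generate the same NTBackup command line for every node. The following Python sketch only builds the command; it does not run it. The node names, the backup folder, and the file-naming scheme are hypothetical, and the `backup systemstate /J /F` form should be checked against the NTBackup command-line help before use.

```python
import ntpath
import subprocess


def system_state_backup_command(node, backup_dir):
    """Build an NTBackup command line that backs up one node's system state."""
    # Hypothetical naming scheme: one job name and one .bkf file per node.
    backup_file = ntpath.join(backup_dir, f"{node}-systemstate.bkf")
    return [
        "ntbackup", "backup", "systemstate",
        "/J", f"{node} system state",  # job name shown in the backup log
        "/F", backup_file,             # backup-to-file target
    ]


if __name__ == "__main__":
    # NODEA and NODEB are placeholder node names.
    for node in ("NODEA", "NODEB"):
        command = system_state_backup_command(node, r"D:\Backups")
        # list2cmdline quotes arguments that contain spaces for display.
        print(subprocess.list2cmdline(command))
```

Each command would be run locally on the node that it names, because the system state of a node is backed up on that node.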
Topic Objective
To describe how to back up the system state, node, and cluster disks.
Lead-in
A backup of the cluster includes the system state, the node, and the cluster disk.
Backing Up the Local Disk
Follow standard computer backup procedures to back up the operating system and the data on the local drives. You must also back up key cluster files on the local disks.
On each node, back up the cluster database files:
Backing Up the Cluster Disks
It is critical to back up the cluster files on the quorum disk and the data on the cluster disks, because Cluster service writes information to files in the \mscs directory on the quorum disk, and cluster-aware applications are likely to place data on the cluster disks. Because either node of the cluster could own the cluster disk resource at any time, it is possible for each node to back up the data on the drive. However, having each node back up data would require you to install backup hardware and software on each cluster node, which is not the best solution.
One possibility is to identify a nonclustered server running Windows 2000 Server and schedule it to back up data remotely through a network connection to the cluster disk’s administrative share or to a hidden share that you create. For example, you might create FBackup$, GBackup$, HBackup$, and WBackup$ file share resources on the virtual server for the root of drives F, G, H, and W. F, G, and H would be cluster disks with data, and W would be the drive letter for the quorum disk. Hidden shares do not appear in a browse list, and you can configure them to allow access only to members of the Backup Operators group.
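The share-naming convention in the example above can be expressed as a small Python sketch that the backup server might use to list the paths it needs to reach. The virtual server name VSERVER1 is a hypothetical placeholder; only the FBackup$ style of share name comes from the example.

```python
def hidden_backup_shares(virtual_server, drive_letters):
    """Map each cluster drive letter to the UNC path of its hidden backup share."""
    shares = {}
    for letter in drive_letters:
        letter = letter.upper()
        # The trailing "$" keeps the share out of browse lists.
        shares[letter] = r"\\{}\{}Backup$".format(virtual_server, letter)
    return shares


if __name__ == "__main__":
    # F, G, and H hold data; W is the quorum disk in the example.
    for letter, path in hidden_backup_shares("VSERVER1", "FGHW").items():
        print(letter, path)
```

Connecting through the virtual server name rather than a node name means that the backup job does not need to know which node currently owns the disk.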
Restoring the First Node
Steps for Restoring a Server Cluster:
The following sections describe the procedure for restoring a server cluster in the event that both nodes and the cluster disk fail. It is possible that any one of the components in the cluster could fail independently. In the case of a failed component, you follow the same procedure for restoring that specific component.
Performing a complete restore of a server cluster is a straightforward process:
1. Restore a node of the cluster.
2. Restore the cluster disks on the restored first node.
3. Restore the remaining node of the cluster.
4. Perform node testing.
Topic Objective
To list the steps for restoring a server cluster and describe how to restore the first node.
Lead-in
In the event of a complete cluster failure, you first restore a node.
Delivery Tip
This page lists the four steps that are involved in restoring a complete cluster and covers the first step, Restoring a Node. Details about the other three steps follow on the next pages.
Restoring a Node of the Cluster
To restore a node in a server cluster, you follow the same procedure that you would use in restoring a Windows 2000 operating system.
1. Install a fresh copy of Windows 2000 Advanced Server on the node to be restored.
2. Log on as Administrator and restore the system and boot partition, system state, and associated volumes from the backup. Make sure that you select the option to restore the system state to the original location in the backup program.
3. Restart the node.
4. Perform the steps for restoring the cluster disk. These steps follow in the next section.
Note
The difference between the time of the backup and the time of the restoration to the new computer may affect the computer account on the domain controller. You may have to join a workgroup and then rejoin the domain.
Restoring Cluster Disks
After you have restored a node in the cluster, you must restore the cluster disks. Restoring the cluster disks involves restoring the disk signature file that the cluster uses to identify the disk. You may also need to restore a cluster disk if you are running out of disk space or if a disk failure is impending.
It can be costly to make mistakes while replacing a cluster disk; the consequence can be the irrecoverable loss of all of the data on that disk. If the disk is the quorum disk, the server cluster's configuration data is at risk.
Before restoring the cluster disks, stop Cluster service on all of the nodes of the cluster. Keeping Cluster service stopped ensures that it does not start and place a lock on the disks.
Restoring Disk Signature Files
Because Cluster service relies on disk signatures to identify and mount volumes, if a disk is replaced, or if the bus is re-enumerated, Cluster service will not find the disk signatures that it is expecting and will not function. You can run Dumpcfg.exe to extract the disk signature from the registry and write it to the new disk. Cluster service will then recognize the new disk and successfully start the resource.
Dumpcfg.exe is a Resource Kit utility that restores an old disk signature file to a new disk.
If the disk that you are replacing is the quorum disk, use Cluster Administrator to move the quorum to a different disk, and then proceed with the replacement of the disk. After the disk is brought back online, you can move the quorum back to the new disk.
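Because writing the wrong signature to a disk is hard to undo, it is worth validating the values before you run Dumpcfg.exe. The following Python sketch checks that a signature is eight hexadecimal digits and then builds the command line without running it. The `-s <signature> <disk number>` form is an assumption based on the Resource Kit tool's usual usage and should be confirmed with the tool's own help; the signature and disk number shown are made-up values.

```python
import re


def dumpcfg_set_signature_command(signature, disk_number):
    """Validate a disk signature and build a Dumpcfg.exe command line for it."""
    sig = signature.strip().upper()
    if sig.startswith("0X"):
        sig = sig[2:]  # accept the 0x prefix that some tools print
    # A disk signature is a 32-bit value, written as eight hex digits.
    if not re.fullmatch(r"[0-9A-F]{8}", sig):
        raise ValueError(f"not an 8-digit hexadecimal signature: {signature!r}")
    if disk_number < 0:
        raise ValueError("disk number must be zero or greater")
    # Assumed syntax: dumpcfg -s <signature> <disk number>
    return ["dumpcfg", "-s", sig, str(disk_number)]


if __name__ == "__main__":
    # Hypothetical values: the old disk's signature and the new disk's number.
    print(" ".join(dumpcfg_set_signature_command("0x12ab34cd", 1)))
```

Rejecting malformed input before the command is built keeps a typing error from reaching the disk.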
Topic Objective
To describe how to restore the cluster disk by restoring signature files, data, and cluster configuration files.
Lead-in
Restoring a cluster disk involves restoring the disk ...
... Cluster,” found on the Student compact disk.
Restoring the Data on the Cluster Disk
Restoring the data on the cluster disk is the same as restoring a local disk. Before restoring the data, make sure that you have associated each cluster disk with the same drive letter that it had before the disaster or failure. When restoring, make sure that you restore the data to the original location, and verify the integrity of the data after you have completed the restore.
Restoring the Cluster Configuration Files
The cluster configuration files include the cluster database and the quorum log. The cluster database is the database of configuration data (cluster objects and their settings) that is pertinent to the cluster. This database is the product of the cluster registry key checkpoint and the changes that are recorded in the quorum log. Every node of the cluster maintains a local copy of this database in the cluster hive of the node’s local registry.
After you have restored the disk signature file and data, you can start the server cluster. If the cluster files were not restored, or were corrupted, the following procedure can restore the cluster database from the registry of the restored node. Identify the node on which you will restore the database (in the case of a disaster restore, this will be the first node that you have restored). Restore the cluster database on the selected node by restoring the system state. Restoring the system state creates a temporary folder called Cluster_backup under the %Systemroot%\Cluster folder.
You use NTBackup to restore the cluster configuration files, which places them on the node. You then restore the cluster database to the node’s registry by using the Clusrest.exe tool. Clusrest.exe restores both the quorum log (Quorum.log) file and the cluster database (Clusdb).
Note
The Clusrest.exe tool is available in the Windows 2000 Resource Kit. This tool is a free download from www.microsoft.com.
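Clusrest.exe depends on the Cluster_backup folder that the system state restore creates, so a quick check for that folder can confirm that the first step has completed. The following Python sketch builds the expected path and tests whether it exists. The C:\WINNT value is only an example system root, and the `exists` parameter is there so that the check can be tried without a real node.

```python
import ntpath
import os


def cluster_backup_folder(system_root):
    """Return the folder that a system state restore is expected to create."""
    return ntpath.join(system_root, "Cluster", "Cluster_backup")


def ready_for_clusrest(system_root, exists=os.path.isdir):
    """Report whether the Cluster_backup folder is present on this node."""
    return exists(cluster_backup_folder(system_root))


if __name__ == "__main__":
    # On a node, the system root comes from the environment; C:\WINNT is a fallback example.
    root = os.environ.get("SystemRoot", r"C:\WINNT")
    if ready_for_clusrest(root):
        print("Cluster_backup found; Clusrest.exe can be run.")
    else:
        print("Cluster_backup not found; restore the system state first.")
```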
Restoring the Second Node
After you complete the process of restoring a node of a cluster, and Cluster service has started successfully on the newly restored node, you can start the restore process on the other node of the cluster.
Restoring the Remaining Node(s) of the Cluster
The restoration of the second node of a cluster is the same procedure as restoring the first node of a cluster, except that you will not have to restore the cluster disks.
Performing Node Testing
Testing the failover and failback policy is recommended before putting the cluster back into production.
1. Verify that the disk and cluster resources are available on the correct node.
2. Fail over each group and resource to verify that they can successfully start on the other node of the cluster.
3. Test the failback policy of each resource by allowing the resource to fail back to a preferred owner after the node has come back online.
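Step 2 of the node testing can be scripted so that no group is missed. The following Python sketch builds one Cluster.exe command to move each group to the other node and one to move it back; it does not run the commands and does not test the failback policy in step 3, which depends on a node going offline and coming back. The group and node names are hypothetical, and the `/moveto` switch is an assumption to confirm with the Cluster.exe help.

```python
def failover_test_commands(groups, home_node, other_node):
    """Build Cluster.exe commands that move each group away and back again."""
    commands = []
    for group in groups:
        # Move the group to the other node to prove that it can start there.
        commands.append(["cluster", "group", group, f"/moveto:{other_node}"])
        # Move it back so that the cluster ends in its original layout.
        commands.append(["cluster", "group", group, f"/moveto:{home_node}"])
    return commands


if __name__ == "__main__":
    # Placeholder group and node names.
    for command in failover_test_commands(["Cluster Group", "SQL Group"], "NODEA", "NODEB"):
        print(command)
```

Checking the state of each group after every move, for example in Cluster Administrator, shows whether all of its resources came online.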
Topic Objective
To describe how to restore the second or remaining nodes of a cluster and test the failover and failback policies.
Lead-in
The last step in restoring the cluster is to restore the second node and then test the components of the cluster.
Evicting a Node
Steps for Evicting a Node
If you need to change a node of a cluster, for example, to add a more powerful server, you need to logically remove the node before physically removing the node from the cluster. When you configure a new server with the shared bus, and the public and private networks, you can then run the Cluster Installation Wizard.
To remove a node from a cluster, in Cluster Administrator, right-click the node to access the menu with the Stop Cluster and Evict Node options.
To evict a node:
1. Back up both nodes.
2. Verify the backup.
3. Move all of the groups to the remaining node.
4. Stop Cluster service on the node that is to be removed.
5. Evict the node.
6. Unplug the server from the shared bus (if the shared bus is a SCSI bus, be careful about termination).
Note
If a new server is to join the cluster later, run the Cluster Installation Wizard and select Join a Cluster.
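The eviction steps that act on the cluster (steps 3 through 5) can also be written down as commands, which makes the order explicit. The following Python sketch returns the steps as labeled command lists without running them. The backup, verification, and cabling steps stay manual. The node and group names are hypothetical, and the Cluster.exe `/moveto` and `/evict` switches are assumptions to confirm with the Cluster.exe help; `net stop clussvc` stops the Cluster service by its service name.

```python
def eviction_plan(groups, node_to_evict, remaining_node):
    """Return (description, command) pairs for steps 3 to 5 of the eviction."""
    steps = []
    for group in groups:
        steps.append((
            f"Move {group} to {remaining_node}",
            ["cluster", "group", group, f"/moveto:{remaining_node}"],
        ))
    steps.append((
        f"Stop Cluster service (run on {node_to_evict})",
        ["net", "stop", "clussvc"],
    ))
    steps.append((
        f"Evict {node_to_evict} (run on {remaining_node})",
        ["cluster", "node", node_to_evict, "/evict"],
    ))
    return steps


if __name__ == "__main__":
    # Placeholder names: NODEB is replaced, NODEA stays in the cluster.
    for description, command in eviction_plan(["Cluster Group"], "NODEB", "NODEA"):
        print(description, "->", " ".join(command))
```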
Topic Objective
To describe how to evict a node from a cluster.
Lead-in
You must first evict a node from the cluster to add a new node to the cluster.
Troubleshooting Cluster Service
Troubleshooting a problem with Cluster service can be more complex than troubleshooting a single server because of the virtual servers and the need for intracluster communications. Virtual servers change ownership from one node to another, which may cause network connectivity problems. Applications running on the cluster are difficult to troubleshoot, because they are running on a virtual server instead of a physical server. You could also have a node-to-node communication problem because servers usually work independently of each other and not together. You might experience hardware problems with the shared bus and the cluster disk resources.
The most common failures are due to improper configurations within groups and resources. Cluster service will fail if the quorum log becomes corrupt. It is important to know how to repair the quorum log to restart the cluster.
You use the same tools to identify problems on the cluster as you would use to identify problems on a physical server. The best resource for troubleshooting is the cluster log because Cluster service records the activity of each node in the cluster log. This log can help you identify problems on the node or in the cluster.
This section provides an overview of the tools that are available for ...
... Cluster Service Startup Issues” on the Student compact disk.
Troubleshooting Tools
When troubleshooting Cluster service, you can use the same tools and methodologies that you would use when troubleshooting Windows 2000 Advanced Server.
Cluster service writes logging information to the system log of every node in the cluster. Cluster service also writes a more detailed log of cluster activity to the cluster log on each node. Use these two sources to gather information when you begin troubleshooting a problem. You will be able to determine whether the problem is related to the network, to services or applications, or to physical components in the cluster.
Use Event Viewer to filter the system log on the event source ClusSvc. You can view general events, such as “Microsoft Cluster service failed to join the cluster on this node” and “Microsoft Cluster service successfully created a cluster on this node.”
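If you export the system log from each node, the same filter that you apply in Event Viewer can be applied to the combined records. The following Python sketch shows the idea on in-memory data. The `Event` record and its fields are a hypothetical structure, not a Windows API, and the sample messages are the two quoted above plus one unrelated entry.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A simplified system log record (hypothetical structure)."""
    node: str
    source: str
    message: str


def cluster_service_events(events, source="ClusSvc"):
    """Keep only the events whose source matches, ignoring letter case."""
    return [event for event in events if event.source.lower() == source.lower()]


if __name__ == "__main__":
    sample = [
        Event("NODEA", "ClusSvc", "Microsoft Cluster service successfully created a cluster on this node"),
        Event("NODEA", "Disk", "An unrelated disk event"),
        Event("NODEB", "ClusSvc", "Microsoft Cluster service failed to join the cluster on this node"),
    ]
    for event in cluster_service_events(sample):
        print(event.node, event.message)
```

Keeping the node name on each record matters because the system log must be checked on every node individually.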
After you have determined the type of problem, you can use the following tools to search for the source of the problem. You must check each node individually when using any of these tools.
• Disk Manager. Check Disk Manager to find out the health of the cluster disk. You can check whether the operating system recognizes the disks and whether the cluster disks are basic or dynamic. You also need to verify that the drive letters of the cluster disks are the same on both nodes.
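The drive letter check is easy to get wrong by eye when a cluster has several disks. The following Python sketch compares the letters that you record from each node and reports any disk that differs. The dictionaries that map a disk signature to a drive letter are a hypothetical way of recording what Disk Manager shows, and the values are made up.

```python
def drive_letter_mismatches(node_a, node_b):
    """Compare two signature-to-drive-letter maps and return the differences."""
    mismatches = {}
    for signature in sorted(set(node_a) | set(node_b)):
        letter_a = node_a.get(signature)  # None if the node does not see the disk
        letter_b = node_b.get(signature)
        if letter_a != letter_b:
            mismatches[signature] = (letter_a, letter_b)
    return mismatches


if __name__ == "__main__":
    # Made-up values recorded from Disk Manager on each node.
    node_a = {"12AB34CD": "F", "56EF78AB": "G", "9A0B1C2D": "W"}
    node_b = {"12AB34CD": "F", "56EF78AB": "H"}
    for signature, letters in drive_letter_mismatches(node_a, node_b).items():
        print(signature, letters)
```

A disk that one node does not report at all appears with a missing letter on that side, which points to a recognition problem rather than a lettering problem.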
• Task Manager. You can verify that Cluster service is running in Microsoft Windows 2000 Task Manager. You can also use Task Manager as a performance monitor, although it does not provide the level of detail that a performance monitor provides. In Task Manager, you can check the CPU utilization percentage and the memory resources on the node.
Topic Objective
To describe the tools that are used for troubleshooting Cluster service problems.
Lead-in
The tools that you use for troubleshooting a cluster are the same tools that you use to troubleshoot a server.