APPENDIX A
Project Plan Sample
In this appendix, you look at creating a project plan for rolling out a high-availability
solution. You see about 150 separate tasks set within a project that can be customized
to your needs. The purpose of this appendix is to give you a tool to build your own project plan, if needed.
HIGH-AVAILABILITY PROJECT PLANNING
This appendix will be valuable to Project Managers, Team Leaders, Architecture
Designers, and Supervisors. Anyone can use the appendix as an aid to help build a
project plan for a high-availability solution.
Again, you can use this appendix as a guide and change it as you see fit. Many times, I use templates for projects.
Build the Project
In this section of the appendix, you see all the sections you need to plan before you
begin the roll out.
First, get a vision of the project. Project Managers will call this a Scope Document, but I’ll keep it simple enough for anyone to follow here. In Figure A-1, I started a project
plan on a Gantt chart. You don’t have to use Project 2000 to do this. The whole point is
to organize everything, so you don’t forget any steps and you have a way to track
what’s being done on the entire project.
Figure A-1. Viewing a Gantt chart
You must lay out the major tasks that need to be accomplished. In this appendix, we set up a project plan for a small company for a load-balanced solution with two nodes.
1. Group major tasks together. What are the major points at each transition of the plan? You need to start with a kick-off meeting. What about planning the design and getting a budget? Who will supervise the whole team? Who will work with all the teams within the group? You need to start thinking about people as resources. Where can you use them to get the project accomplished in a timely and accurate manner?
2. After you brainstorm the project, you need to commit it to paper (or electronically).
You can group subtasks under major tasks. If you do this correctly, you’ll have
a list like this:
• Project Vision (Main Task)
• Create the vision/scope document: this is used to start the documentation
of the NLB solution you want to roll out.
• Define and write the project vision statement and scope: you need to assign someone to do this (as a resource). This will most likely be the Project Manager, if you have one.
• Identify business drivers and constraints: what is driving this project? The customer needs a highly available solution and you need to provide it for them. However, they might be unable to afford what you propose.
• Identify critical dates: does this have to be done before December when everyone will be shopping online?
• Gain vision/scope document approval: you need stakeholders to sign off
on the document, so you can get funding and approval to move forward.
• Plan the meeting: this is your kickoff meeting where everyone meets and the project begins.
• Obtain vision/scope document approval and signoff: you need signatures
on the documentation you created. The kickoff meeting could be the place
to do it when everyone is assembled.
• Create the conceptual design: now that you’re funded, you can begin the design. This can be done in many ways, but you can refer to the Visio diagrams provided within the book.
• Planning (Main Task)
• Define project structure: you can do this by explaining what you’re presently creating—the structure of the project.
• Assign project team roles and responsibilities: this is an important task because you need to know what people will be available, what they’re going to do, and what their roles will be as the project progresses.
• Assess customer infrastructure: you can’t deploy a project without having
an idea of how your plan fits into the existing infrastructure. This is critical to getting the project solution to work.
• Acquire reference materials and software tools: of course, you need to make sure you have documents, books, tools, and anything else you need
to get the job done.
• Assess and mitigate risks: what are the risks? Once you determine them, either make plans to back out of problems that occur (DRP) or get rid of risk altogether, if possible.
• Implement the testing resources: you need to make sure you have enough
to pilot the solution or set up a test lab.
• Create a communications plan: communications are essential to success. If you’re out of the loop, you might find it hard both to get tasks completed and to get them completed on time.
• Identify current network infrastructure: critical to the success of a load-balanced (or any other) solution. You must know the network layout and its data flows.
• Physical network topology: WAN and LAN charts are needed to help the planning of the high-availability solution.
• Protocol address management: you need to know the Layer 2 and Layer 3 (MAC and IP) addresses for the network if you’re to populate it with a load-balanced solution.
• Remote access: will there be remote access to the NLB cluster? If so, then you need to plan it.
• Network operations/performance management: covered in detail in Chapter 8. You must know who will monitor and maintain the solution once it’s in place.
• Training: are your people ready to implement and maintain this solution?
If not, then you must train them.
• Identify current user environment: do you know who you have on the floor and how the new NLB cluster will affect them? What about web access or business partners?
• Assess infrastructure readiness: is your infrastructure ready to put this new NLB cluster in place? Will you have enough ports in the switch?
• Specify functionality to be delivered: you need to document what this solution will provide.
• Build the master project plan: a master project plan contains smaller grouped plans. In other words, you can make this one the master project plan, and then you can add the high-availability implementation into this one once you’re ready to do it.
• Build the master project plan: now that you have the master plan, you need to build and document it.
• Update the master project plan: now that it’s ready to go or in the works, you need to keep it updated and manage it.
• Developing (Main Task)
• Create the logical design: now you need to develop the plan and the solution. This section is highly flexible and can be made to meet any needs your project has.
• Server installation and configuration: this can be broken down further but, for this example, let’s keep it simple to the two nodes we’ll implement.
• Install NLB node (select the first node): plan development.
• Install NLB node (select the second node): plan development.
• Install NLB drivers: plan development.
• Configure the NLB drivers to design specifications: plan development.
• Validate and approve logical design: now that you know what your install
is going to be composed of, you need to make sure everyone else agrees with a peer review.
• Validate logical design: check, validate, and then sign off on the logical design.
• Implement the design into a pilot: this is where you can build the pilot based on the design you created.
• Conduct the pilot: make sure you build a good pilot and you demonstrate
it properly.
• Complete the pilot and controlled introduction, and then document the results.
• Move from controlled introduction to enterprisewide deployment.
• Deployment (Main Task)
• Deploy the system: now you’re ready to go! This is where you do the actual deployment. Again, this is something you can break down further, but for this plan, you can use the second half of Chapter 3 to fill in the various subtasks involved with NLB clustering.
• Monitor user satisfaction: test the solution and see if it works Is it better?
Simulate failures and see how long you take to get it back together.
3. Now, populate Microsoft Project with this, if you have it. If not, you can make
a simple spreadsheet to keep track of what’s listed.
4. Last, assign resources (this also includes people) to each task. This should complete a simple project plan for you.
Again, modify this as you see necessary. Understand, this is a template to help you build your own project plans as needed.
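Step 3 suggests a simple spreadsheet when Microsoft Project isn’t available. As a minimal sketch of that idea, the Python below holds a few of the main tasks and subtasks from the list above in a dictionary and writes them out as CSV rows that any spreadsheet program can open. The column names, the `plan_to_csv` helper, and the sample resource assignment are illustrative assumptions, not part of the original template.

```python
import csv
import io

# A handful of main tasks and subtasks copied from the list above.
# The full template has roughly 150 tasks; extend this as needed.
PLAN = {
    "Project Vision": [
        "Create the vision/scope document",
        "Identify business drivers and constraints",
        "Identify critical dates",
    ],
    "Planning": [
        "Define project structure",
        "Assign project team roles and responsibilities",
        "Assess and mitigate risks",
    ],
    "Developing": [
        "Create the logical design",
        "Install NLB node (select the first node)",
        "Install NLB node (select the second node)",
    ],
    "Deployment": [
        "Deploy the system",
        "Monitor user satisfaction",
    ],
}


def plan_to_csv(plan, resources=None):
    """Return the plan as CSV text, one row per subtask.

    `resources` is an optional (hypothetical) mapping of subtask -> owner;
    subtasks without an owner get an empty Resource cell.
    """
    resources = resources or {}
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Main Task", "Subtask", "Resource", "Done"])
    for main_task, subtasks in plan.items():
        for subtask in subtasks:
            writer.writerow([main_task, subtask, resources.get(subtask, ""), "no"])
    return buffer.getvalue()


if __name__ == "__main__":
    # Step 4: assign a resource to a task (the owner name is an example).
    owners = {"Create the vision/scope document": "Project Manager"}
    print(plan_to_csv(PLAN, owners))
```

Keeping the main task in its own column makes it easy to sort or filter by phase once the file is opened in a spreadsheet.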
APPENDIX B
Advanced Troubleshooting:
Event IDs
In this appendix, you look at Microsoft Cluster Server (MSCS) event messages. The
intent of this appendix is to make it quick and easy for you to look up possible problems you might experience with your Windows-based high-availability solution.
In this section, you look at Event IDs that appear in logs while working with
high-availability solutions, such as clustering and load balancing. This appendix was created
to consolidate the errors you’re most likely to see in one section of the book for easy reference.
If you need to research some less-common events, you can search http://www.microsoft
or other causes.
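If you prefer to keep this kind of reference in a script, the lookup can be sketched as a small Python dictionary. The entries below are summarized from a few of the ClusSvc events listed in this appendix; the `EVENTS` layout, the "Information"/"Problem" labels as a field, and the `describe` helper are assumptions made for illustration.

```python
# Event ID -> (kind, short description), summarized from this appendix.
# Only a few ClusSvc events are included; add the rest as you need them.
EVENTS = {
    1007: ("Information", "A new node was added to the cluster."),
    1011: ("Information", "A node has been evicted from the cluster."),
    1021: ("Problem", "Insufficient disk space remains on the quorum device."),
    1023: ("Problem", "The quorum resource wasn't found."),
    1062: ("Information", "Microsoft Cluster Server successfully joined the cluster."),
    1063: ("Information", "Microsoft Cluster Server was successfully stopped."),
    1069: ("Problem", "A cluster resource failed."),
}


def describe(event_id):
    """Return a one-line summary for an event ID, or a fallback if unlisted."""
    kind, text = EVENTS.get(event_id, ("Unknown", "Not listed in this sketch."))
    return f"Event ID {event_id} [{kind}]: {text}"


if __name__ == "__main__":
    for event_id in (1062, 1021, 9999):
        print(describe(event_id))
```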
Event ID 1002
• Source ClusSvc
• Description Microsoft Cluster Server handled an unexpected error at line 528
of source module X. The error code was 5007.
• Problem Messages similar to this might occur after installation of Microsoft Cluster Server. If the Cluster Service starts and successfully forms or joins the cluster, they can be ignored. Otherwise, these errors could indicate a corrupt quorum log file or other problem.
• Solution Ignore the error if the cluster appears to be working properly.
Otherwise, you might want to try creating a new quorum log file using the
-noquorumlogging or -fixquorum parameters, as documented in the Microsoft
Cluster Server Administrator’s Guide.
• Solution Check network adapters and connections between nodes. Check the system event log for errors. A network problem might be preventing reliable communication between cluster nodes.
Event ID 1007
• Source ClusSvc
• Description A new node, ComputerName, was added to the cluster.
• Information The Microsoft Cluster Server Setup program ran on an adjacent computer. The setup process completed and the node was admitted for cluster membership. No action required.
And, because a cluster already exists with the same cluster name, the node couldn’t form a new cluster with the same name.
• Solution Remove MSCS from the affected node and reinstall MSCS on that system, if desired.
Event ID 1010
• Source ClusSvc
• Description Microsoft Cluster Server is shutting down because the current node isn’t a member of any cluster. Microsoft Cluster Server must be reinstalled
to make this node a member of a cluster.
• Problem The Cluster Service attempted to run, but found it isn’t a member of
an existing cluster. This could be because of eviction by an administrator or an incomplete attempt to join a cluster. This error indicates a need to remove and reinstall the cluster software.
• Solution Remove MSCS from the affected node and reinstall MSCS on that server, if desired.
Event ID 1011
• Source ClusSvc
• Description Cluster Node ComputerName has been evicted from the cluster.
• Information A cluster administrator evicted the specified node from the cluster.
Event ID 1015
• Source ClusSvc
• Description No checkpoint record was found in the log file X:\Mscs\Quolog.log. The checkpoint file is invalid or was deleted.
• Problem The Cluster Service experienced difficulty reading data from the quorum log file. The log file could be corrupted.
• Solution If the Cluster Service fails to start because of this problem, try manually starting the Cluster Service with the -noquorumlogging parameter. If you need
to adjust the quorum disk designation, use the -fixquorum startup parameter when starting the Cluster Service. Both of these parameters are covered in the
• Solution You might need to use procedures to recover from a corrupt quorum log file. You might also need to run chkdsk on the volume to check for file system corruption.
Event ID 1019
• Source ClusSvc
• Description The log file X:\MSCS\Quolog.log was found to be corrupt. An attempt will be made to reset it or you should use the Cluster Administrator utility to adjust the maximum size.
• Problem The quorum log file for the cluster was found to be corrupt. The system will attempt to resolve the problem.
• Solution The system will attempt to resolve this problem. This error could also be an indication that the cluster property for maximum size should be increased through the Quorum tab. You can manually resolve this problem by using the -noquorumlogging parameter.
Event ID 1021
• Source ClusSvc
• Description Insufficient disk space remains on the quorum device. Please free up some space on the quorum device. If no space exists on the disk for the quorum log files, then changes to the cluster registry will be prevented.
• Problem Available disk space is low on the quorum disk and must be resolved.
• Solution Remove data or unnecessary files from the quorum disk, so sufficient free space exists for the cluster to operate. If necessary, designate another disk with adequate free space as the quorum device.
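To catch low quorum disk space before Event ID 1021 appears, you could check free space on a schedule. The sketch below uses Python’s standard `shutil.disk_usage` call; the path and the 50MB threshold are placeholder assumptions, because the appendix doesn’t state how much free space the quorum log needs.

```python
import shutil


def has_enough_free_space(path, min_free_bytes):
    """Return True if the volume holding `path` has at least `min_free_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    # Placeholder values: point QUORUM_PATH at your quorum disk and pick a
    # threshold that suits your cluster. Neither value comes from the book.
    QUORUM_PATH = "."
    MIN_FREE_BYTES = 50 * 1024 * 1024

    if has_enough_free_space(QUORUM_PATH, MIN_FREE_BYTES):
        print("Quorum disk free space looks sufficient.")
    else:
        print("Quorum disk is low on space; remove unnecessary files.")
```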
Event ID 1023
• Source ClusSvc
• Description The quorum resource wasn’t found. The Microsoft Cluster Server has terminated.
• Problem The device designated as the quorum resource couldn’t be found.
This could be because the device failed at the hardware level, that the disk resource corresponding to the quorum drive letter doesn’t match, or that it no longer exists.
• Solution Use the -fixquorum startup option for the Cluster Service. Investigate and resolve the problem with the quorum disk. If necessary, designate another disk
as the quorum device and restart the Cluster Service before starting other nodes.
Event ID 1024
• Source ClusSvc
• Description The registry checkpoint for cluster resource resourcename couldn’t be restored to registry key registrykeyname. The resource might not function correctly. Make sure no other processes have open handles to registry keys in this registry subkey.
• Problem The registry key checkpoint imposed by the Cluster Service failed because an application or process has an open handle to the registry key or subkey.
• Solution Close any applications that might have an open handle to the registry key, so it can be replicated as configured in the resource properties. If necessary, contact the application vendor about this problem.
Event ID 1034
• Source ClusSvc
• Description The disk associated with cluster disk resource name couldn’t
be found. The expected signature of the disk was signature. If the disk was removed from the cluster, the resource should be deleted. If the disk was replaced, the resource must be deleted and created again to bring the disk online. If the disk hasn’t been removed or replaced, it might be inaccessible
at this time because it’s reserved by another cluster node.
• Problem The Cluster Service attempted to mount a physical disk resource in the cluster. The cluster disk driver couldn’t locate a disk with this signature.
The disk could be offline or it might have failed. This error could also occur
if the drive has been replaced or reformatted. This error might also occur if another system continues to hold a reservation for the disk.
• Solution Determine why the disk is offline or nonoperational. Check cables, termination, and power for the device. If the drive has failed, replace the drive and restore the resource to the same group as the old drive. Remove the old resource. Restore data from a backup and adjust resource dependencies within the group to point to the new disk resource.
Event ID 1035
• Source ClusSvc
• Description Cluster disk resource %1 couldn’t be mounted.
• Problem The Cluster Service attempted to mount a disk resource in the cluster and couldn’t complete the operation. This could be because of a file-system problem, a hardware issue, or a drive-letter conflict.
• Solution Check for drive-letter conflicts, evidence of file-system issues in the system event log, and for hardware problems.
Event ID 1040
• Source ClusSvc
• Description Cluster generic service ServiceName couldn’t be found.
• Problem The Cluster Service attempted to bring the specified generic service resource online. The service couldn’t be located and couldn’t be managed by the Cluster Service.
• Solution Remove the generic service resource if this service is no longer installed. The parameters for the resource might be invalid. Check the generic service resource properties and confirm correct configuration.
Event ID 1042
• Source ClusSvc
• Description Cluster generic service resourcename failed.
• Problem The service associated with the mentioned generic service resource failed.
• Solution Check the generic service properties and service configuration for errors. Check system and application event logs for errors.
Event ID 1043
• Source ClusSvc
• Description The NetBIOS interface for IP Address resource has failed.
• Problem The network adapter for the specified IP address resource has experienced a failure. As a result, the IP address is either offline or the group has moved to a surviving node in the cluster.
• Solution Check the network adapter and the network connection for problems.
Resolve the network-related problem.
driver-to determine if a specific OEM version of the driver is a requirement. If you already have many IP address resources defined, make sure you haven’t reached the NetBIOS limit of 64 addresses. If you have IP address resources defined that don’t have a need for NetBIOS affiliation, use the IP Address private property to disable NetBIOS for the address. This option is available
in SP4 and helps to conserve NetBIOS address slots.
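The NetBIOS limit of 64 addresses mentioned above is easy to track with a short script. The following sketch counts how many IP address resources still have NetBIOS enabled and reports the remaining slots; the `(name, netbios_enabled)` tuple layout and the resource names are hypothetical, and you would need to gather the real values from your own cluster configuration.

```python
NETBIOS_LIMIT = 64  # limit stated in the text above


def netbios_slots_remaining(ip_resources):
    """Return how many NetBIOS address slots are still free.

    `ip_resources` is a list of (name, netbios_enabled) tuples; this layout
    is an assumption made for the example.
    """
    used = sum(1 for _name, enabled in ip_resources if enabled)
    return NETBIOS_LIMIT - used


if __name__ == "__main__":
    # Example inventory with made-up resource names.
    inventory = [
        ("Cluster IP Address", True),
        ("Web IP Address", True),
        ("Backup IP Address", False),  # NetBIOS disabled to conserve a slot
    ]
    print("NetBIOS slots remaining:", netbios_slots_remaining(inventory))
```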
Event ID 1045
• Source ClusSvc
• Description Cluster IP address IP address couldn’t create the required TCP/IP Interface.
• Problem The Cluster Service tried to bring an IP address online. The resource properties might specify an invalid network or malfunctioning adapter. This error could occur if you replace a network adapter with a different model and continue to use the old, or inappropriate, driver. As a result, the IP address resource can’t be bound to the specified network.
• Solution Resolve the network adapter problem or change the properties of the IP address resource to reflect the proper network for the resource.
Event ID 1056
• Source ClusSvc
• Description The cluster database on the local node is in an invalid state.
Please start another node before starting this node.
• Problem The cluster database on the local node might be in a default state from the installation process and the node hasn’t properly joined with an existing node.
• Solution Make sure another node of the same cluster is online before starting this node. On joining with another cluster node, the node will receive
an updated copy of the official cluster database, which should alleviate this error.
Event ID 1062
• Source ClusSvc
• Description Microsoft Cluster Server successfully joined the cluster.
• Information When the Cluster Service started, it detected an existing cluster
on the network and was able to join the cluster successfully. No action needed.
Event ID 1063
• Source ClusSvc
• Description Microsoft Cluster Server was successfully stopped.
• Information The administrator stopped the Cluster Service manually.
Event ID 1068
• Source ClusSvc
• Description The cluster file share resource resourcename failed to start. Error 5.
• Problem The file share can’t be brought online. The problem could be caused
by permissions to the directory or the disk in which the directory resides. This might also be related to permission problems within the domain.
• Solution Check to make sure the Cluster Service account has rights to the directory to be shared. Make sure a domain controller is accessible on the network.
Make sure dependencies for the share and for other resources in the group are set correctly. Error 5 translates to Access Denied.
Event ID 1069
• Source ClusSvc
• Description Cluster resource Disk X: failed.
• Problem The named resource failed and the Cluster Service logged the event.
In this example, a disk resource failed.
• Solution For disk resources, check the device for proper operation. Check cables, termination, and log files on both cluster nodes. For other resources, check resource properties for proper configuration and check to make sure dependencies are configured correctly. Check the diagnostic log (if it’s enabled) for status codes corresponding to the failure.
• Solution If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server.
Event ID 1071
• Source ClusSvc
• Description Cluster node two attempted to join, but was refused. Error 5052.
• Problem Another node attempted to join the cluster and this node refused the request.
• Solution If the node was previously evicted from the cluster, you must remove and reinstall MSCS on the affected server. Look in Cluster Administrator to see
if the other node is listed as a possible cluster member.
Event ID 1105
• Source ClusSvc
• Description Microsoft Cluster Server failed to initialize the RPC services.
The error code was %1.
• Problem The Cluster Service attempted to use required RPC services and couldn’t successfully perform the operation.
• Solution Use the net helpmsg errorcode command to find an explanation of the underlying error. Check the system event log for other RPC-related errors
or performance problems.
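If you look up error codes often, the net helpmsg step can be wrapped in a small script. The sketch below builds the command line for a given code and, on Windows only, runs it and returns whatever text the command prints. The helper names `helpmsg_command` and `explain` are assumptions for this example; no particular output is assumed.

```python
import subprocess
import sys


def helpmsg_command(error_code):
    """Build the argument list for `net helpmsg <errorcode>`."""
    code = int(error_code)  # rejects non-numeric input early
    if code < 0:
        raise ValueError("error code must be non-negative")
    return ["net", "helpmsg", str(code)]


def explain(error_code):
    """Run net helpmsg on Windows and return its output; return None elsewhere."""
    if not sys.platform.startswith("win"):
        return None
    result = subprocess.run(
        helpmsg_command(error_code),
        capture_output=True,
        text=True,
        check=False,  # unknown codes make the command fail; don't raise
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Error 5 appears in the Event ID 1068 description in this appendix.
    print(" ".join(helpmsg_command(5)))
    print(explain(5))
```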
Event ID 1107
• Source ClusSvc
• Description Cluster node node name failed to make a connection to the node.
The error code was 1715.
• Problem The Cluster Service attempted to connect to another cluster node over a specific network and couldn’t establish a connection. This error is a warning message.
• Solution Check to make sure the specified network is available and functioning correctly. If the node experiences this problem, it might try other available networks to establish the desired connection.
Event ID 5719
• Source Netlogon
• Description No Windows domain controller is available for the domain
“domain.” (This event is expected and can be ignored when booting with the