16 When looking up an OS software problem in a vendor’s technical support knowledgebase on the Web, you are asked to download a hot fix file. What should be your next step in troubleshooting the problem?
A Download and install the hot fix file.
B Call the vendor’s technical support and verify the hot fix.
C Boot the server with the hot fix boot disk.
D Look up the software error code in the OS manual.
17 A technician is using a POST diagnostic card to troubleshoot hardware errors during a server startup. However, the error codes displayed on the card do not match any error codes in the POST card’s manual. Where should the technician check next to decipher the error code?
A Examine the NOS error event logs.
B Check the readme file on the diagnostic boot disk.
C Call the OS vendor’s technical support line.
D Check the POST card vendor’s Web site.
18 A server manufacturer has installed their own hardware self-diagnostic utilities on your server. While checking the error logs of the utility, the software lists an error code for one of your motherboard components. Where can this error code be checked to find out what it means?
A The diagnostic utility’s manual
B The OS vendor’s Web site
C The motherboard vendor’s technical support phone line
D The NOS application logs
19 While off-site, you receive a call from a user that one of the servers has lost power. The server has a remote access card installed, but when you try to dial in by modem, you are unable to connect to the remote access card. What is the most likely cause of the problem?
A The card will not answer if no operating system is present.
B The remote access card was attached to the same power source as the server.
C The battery on the remote access card is dead.
D The remote access card is not plugged in to a UPS.
20 A hardware vendor’s technical support has sent you a Field Replacement Unit (FRU) for your server’s power supply. When you try to install it, you notice that it will not fit into the machine, and the connectors do not match up. What is the most likely cause of the problem?
A You need to consult the power supply’s manual to install it properly.
B The wrong FRU part was delivered.
C The vendor sent you a better version of the same power supply.
D The motherboard needs to be replaced to support the power supply.
Scenarios
1 A critical database server has suffered a hardware failure. What steps should you take to troubleshoot the problem, keeping in mind that the amount of downtime has to be kept to a minimum?
2 A manager has asked you to take a look at an older server that has not been working for a long time. Because the department’s budget has been cut, the manager cannot buy a new server to perform some simple reporting tasks, and would like to get the old server working. The server hangs on the POST routine, but there is no error message except for a single beep. What steps can you take to fix the server?
Answers to Chapter Questions
Chapter pre-test
1 Any software updates, patches, and upgrades will be automatically sent to you at no cost.
2 A POST card plugs into a computer’s expansion slot to give detailed POST information during the server’s boot-up routine.
3 The most current troubleshooting information for your system can be found at the vendor’s Web site.
4 This documentation will help you troubleshoot future problems that might have happened before.
5 FRU stands for Field Replacement Unit, a hardware vendor’s term for a hardware component that can be quickly replaced on-site with an identical part.
6 Wake-on-LAN technology enables a client PC to be turned on by signaling the computer’s network card.
7 You should know the serial number, model number, and vendor part number before ordering a replacement part.
8 You should call the vendor’s technical support when conventional troubleshooting methods have not fixed a problem, or if the issue is an emergency.
9 You need access to a modem to be able to dial into a server’s remote access card.
10 Documenting the solution will aid you in the future if the problem happens again.
Assessment questions
1 D. A Wake-on-LAN network card will allow a client PC to be remotely turned on by sending a signal to the network port. Answer A is incorrect because a modem will only let you dial in to a machine that is already turned on. Answer B is incorrect because Remote Access Services (RAS) only enables you to dial in to a machine. Answer C is incorrect because a remote access card cannot turn a server on. For more information, see the “Wake-on-LAN” section.
2 B. The POST card needs to be plugged into an available expansion slot, and then the server is rebooted. Error codes will appear on the card’s built-in display. Answer A is incorrect because there is no need for a diagnostic boot disk. Answer C is incorrect because a POST card plugs into an available expansion slot. Answer D is incorrect because the server must be turned on for the diagnostic card to read from the POST routine. For more information, see the “POST card” section.
3 A. The vendor’s Web site is the best place to get the most current error codes. Answer B is incorrect because the CD-ROM will only have information current up to the release of the product. Answer C is incorrect because you can find the information more quickly by scanning the vendor’s Web site. Answer D is incorrect because the readme file will only be as current as the release of the product. For more information, see the “Using vendor resources” section.
4 C. You must contact the hardware phone support line to get the server back on-line as soon as possible. Answer A is incorrect because waiting for an e-mail response might take a few days. Answer B is incorrect because using a diagnostic card will help find the problem, but you must contact the vendor anyway to get a replacement part. Answer D is incorrect because the problem is with the hardware, not the NOS. For more information, see the “Technical Support” section.
5 C. The vendor will be able to send an identical card if you give them the network card’s serial number. Answer A is incorrect because there might be many different sub-versions of a model of network card. Answer B is incorrect because the serial number of the server will not help identify the right network card. Answer D is incorrect because this is not specific enough information for the vendor to identify the right card. For more information, see the “Replacing Hardware” section.
6 B. Because the OS is locked up, there is nothing that can be done other than to reset the server and see if it boots up properly. Answer A is incorrect because there is no need to go on-site if the server can be remotely rebooted. Answer C is incorrect because a server should not be running Wake-on-LAN, which allows a device to be turned on remotely. Answer D is incorrect because the technician will not be able to connect to the server’s console if the OS is locked. For more information, see the “Troubleshooting Remotely” section.
7 D. The error codes for the flashing lights can be found in the tape drive’s manual. Answer A is incorrect because there is no error information printed on the bottom of a tape drive. Answer B is incorrect because a POST card will not be able to troubleshoot an external device. Answer C is incorrect because the error codes for the hardware device will not be found on the Web site for the OS. For more information, see the “Manual” section.
8 A. The previous technician’s documentation should be examined to see if the same problem has happened in the past and a solution was found. Answer B is incorrect because the server vendor will not be able to help with an OS software problem. Answer C is incorrect because the Web server manual will only list errors specific to the Web server software. Answer D is incorrect because it may take some time before receiving a reply from a vendor’s e-mail support. For more information, see the “Server documentation” section.
9 C. If the POST error code is looked up in the documentation, it will indicate which hardware is malfunctioning. Answer A is incorrect because the hardware error indicated by the internal POST routine is sufficient to track down the problem. Answer B is incorrect because the OS event logs will not help, since the error occurs before the operating system is loaded. Answer D is incorrect because the OS vendor will not be able to help you in troubleshooting a hardware problem. For more information, see the “Hardware diagnostic tools” section.
10 B. Because the problem is not urgent, sending an e-mail to the OS vendor’s technical support will usually get an answer in a day or two. Answer A is incorrect because this problem is not an emergency, and you may be charged a lot of money for the emergency response. Answer C is incorrect because the problem is with the OS software, not hardware. Answer D is incorrect because reinstalling the OS will not fix the configuration problem. For more information, see the “E-mail support” section.
11 B. By entering the error into the vendor’s knowledgebase, you should be able to obtain technical documents related to that problem. Answer A is incorrect because the readme will only contain error information current at the time of the original release of the OS. Answer C is incorrect because the answer should be found using the free Web knowledgebase. Answer D is incorrect because the motherboard manual will not aid you with OS software errors. For more information, see the “Web site” section.
12 C. Using the Wake-on-LAN support on the computers, they can be remotely set to turn on, receive the update, and then shut down again. Answer A is incorrect because the scheduling program will not be able to update computers that are turned off. Answer B is incorrect because there is no need for the technician to be on-site if the Wake-on-LAN technology is used. Answer D is incorrect because the server should not have a Wake-on-LAN card; the cards should be on the client computers. For more information, see the “Wake-on-LAN” section.
13 A. The software diagnostic disk will be able to scan the memory modules for errors. Answer B is incorrect because the performance monitoring tool will not be able to scan for hardware errors. Answer C is incorrect because the OS error logs will not show memory errors. Answer D is incorrect because pausing the POST routine during the error check will not aid in diagnosing the problem. For more information, see the “Software diagnostic tools” section.
14 C. The local application logs will contain any OS errors. Answer A is incorrect because the Web site for the hardware vendor will not contain any OS software related information. Answer B is incorrect because the technician will not have access to the manuals while off-site. Answer D is incorrect because restarting the server will not permanently fix the OS errors. For more information, see the “Troubleshooting Remotely” section.
15 A. By giving the support staff the case ticket number, you can continue the call from where you left off, without having to start over. Answer B is incorrect because you will waste valuable time if you have to start explaining the problem from the beginning. Answer C is incorrect because the OS vendor will not be able to help you with a hardware problem. Answer D is incorrect because you will still have to explain the problem from the beginning, wasting valuable time. For more information, see the “Phone support” section.
16 B. Before installing any patch or hot fix, you should verify that the problem resolved by the hot fix is the one you are actually experiencing. Answer A is incorrect because the problem might be something else entirely if you haven’t investigated further, and installing the hot fix may damage your system. Answer C is incorrect because the hot fix may be for a problem that is not the same as yours, since it wasn’t verified with the vendor. Answer D is incorrect because the most current error information will be on the Web site. For more information, see the “Technical support” section.
17 D. The most current error codes will be listed on the vendor’s Web site. Answer A is incorrect because the NOS error event logs will not help you with a hardware problem. Answer B is incorrect because there is no diagnostic boot disk with a POST card. Answer C is incorrect because the OS vendor’s technical support will not help you solve a hardware problem. For more information, see the “POST card” section.
18 A. The diagnostic utility’s manual will contain a listing of all of its error codes. Answer B is incorrect because the OS vendor’s Web site will not help you with a hardware problem. Answer C is incorrect because the motherboard vendor will not know the server vendor’s specific error codes. Answer D is incorrect because the NOS application logs will not list any hardware problems. For more information, see the “Manual” section.
19 C. The remote access card contains a battery that enables it to remain on during a power failure. Answer A is incorrect because the remote access card operates independently of the operating system. Answer B is incorrect because the remote access card uses its own battery for power. Answer D is incorrect because the remote access card uses its own battery for power. For more information, see the “Remote access cards” section.
20 B. The vendor has sent you the wrong part for your server. Answer A is incorrect because the power supply FRU should be identical to the one you already have. Answer C is incorrect because a FRU is identical to the part that is being replaced. Answer D is incorrect because the FRU should be identical to the original part, and be compatible with the host system. For more information, see the “Replacing Hardware” section.
Scenarios
1 You should reboot the server, so that you can check the POST routine or vendor hardware diagnostic utility for any errors. If you receive an error, check the server’s manual to decipher the error code and identify the failed hardware.
Once you have verified this information, or if you cannot troubleshoot any further, call the hardware vendor’s phone support immediately, to help get the problem fixed while minimizing downtime. Depending on the type of maintenance agreement or level of support you have with the vendor, they will either send you a FRU, which will replace your failed component, or they will send a technician to do the replacement.
Depending on your level of support, you will have a replacement part within a few hours or longer, but there is no faster way of getting the problem fixed.
2 If the single beep in the POST routine does not properly identify the problem, even after checking the server manual for POST error messages, you should try booting the server with a POST diagnostic card.
Install the card into an available expansion slot, and reboot the server. The POST card will give more detailed error codes that you can look up in the diagnostic utility’s manual.
Once you have identified the failed hardware component, you will need to call the server manufacturer to see if they can get you a replacement. Because the server is fairly old, this might not be possible, and other upgrades may be necessary. If you give the vendor’s support staff the serial number of the component, they will be able to track it down more easily.
Since the server is probably not under warranty anymore, you should give the server serial number to the vendor to verify this. If the server is no longer under warranty, you will have to get the manager to authorize payment for the replacement part.
Part VII: Disaster Recovery
As a systems administrator, you must have a disaster recovery plan documented and tested. How to create an effective disaster recovery plan is discussed in this Part. The uses of hot and cold sites are addressed, along with what you must know to implement a successful recovery if a disaster occurs that renders your current network inoperable.
Losing the use of your systems for even a few days because of a natural disaster or even vandalism can damage your company’s health. A business interruption can cause a company to lose market share, image, and credibility, can reduce customer satisfaction or brand value, and can strain relationships with suppliers and alliance partners. The chapters in this Part will teach you what you need to know to recover quickly if a disaster occurs.
In This Part
Chapter 17 Planning for Disaster Recovery
Chapter 18 Ensuring Fault Tolerance
Chapter 19 Backing Up and Restoring
Chapter 17 Planning for Disaster Recovery
EXAM OBJECTIVES
7.1 Plan for disaster recovery
• Plan for redundancy (e.g., hard drives, power supplies, fans, NICs, processors, UPS)
• Use the concepts of fault tolerance/fault recovery to create a disaster recovery plan
• Develop disaster recovery plan
• Confirm and use off site storage for backup
• Document and test disaster recovery plan regularly, and update as needed
7.2 Restoring
• Identify hot and cold sites
• Implement disaster recovery plan
CHAPTER PRE-TEST
1. List the common requirements of any disaster recovery plan.
2. Describe high availability as it relates to the server environment.
3. List some common types of natural disasters.
4. What are the human influences that can lead to a disaster?
5. In order to increase your systems’ uptime and limit the amount of downtime, you will want to ensure that your systems have a high rate of ________.
6. What are redundant NICs?
7. What is a hot site?
✦ Answers to these questions can be found at the end of the chapter ✦
Disaster recovery is perhaps the most important and most challenging aspect of being a systems administrator. After a disaster occurs, you may find yourself in a completely different location, without the tools that you are comfortable with. It can be very difficult to try to restore a company’s systems and data without a plan in place. The more accurate your documentation of the computer environment, the faster you can get the business back up and running. There are companies that specialize in providing “hot sites” for companies with mission-critical servers. A hot site is a separate physical location that your company can use in a time of crisis, which ensures that your company will have systems available in the event of a disaster. Being prepared is essential to protecting your company in the event of a systems disaster.
Disasters can be a result of natural disasters such as earthquakes, floods, or hurricanes. More commonly, disasters are a result of human factors such as viruses, system outages, and sabotage by employees and non-employees. Preparation will minimize downtime, and enable the company to resume operations quickly.
Most companies are not prepared for a disaster because of barriers such as a lack of funding, underestimating the importance of disaster planning, and a lack of support from management. Companies need to do an assessment of how a disaster would impact the business.
Forming a Disaster Recovery Plan
7.1 Plan for disaster recovery
• Develop disaster recovery plan
The idea behind the disaster recovery plan (DRP) is to prepare your company for a potential disaster. The finished product is a document outlining the steps to take in the event of a disaster. Most important, you must have a viable backup method, which is discussed later in this chapter and in Chapter 19, to implement a successful disaster recovery plan. A disaster recovery plan involves several phases before its completion. The following is a common approach to the disaster recovery planning process:
Before you begin work on any disaster recovery plan, you should ask yourself these questions. Write down your answers, and then work on the weak spots. If your company does not have a plan in place, then review these questions, and use them as a checklist.
1 Do you have a plan?
2 Has the plan been fully tested?
3 How did the test turn out?
4 Have you updated the plan since the test?
5 Has a cost analysis been performed to determine the cost of not having a plan
in place?
6 Do you keep the plan current?
7 Are backups performed on a regular schedule?
8 Is there a detailed list of what is being backed up, especially mission-critical
applications and data?
9 Do you have a service agreement for off-site tape storage?
10 Is the plan understood and supported throughout the various departments in
the company?
11 Are the appropriate people committed to the plan?
12 Do you have the proper security in place: server room locked, UPS systems,
and so forth?
13 Have you been given management support, and an appropriate budget to fulfill the plan?
14 Does the plan include backup and archive procedures?
After reviewing your disaster recovery plan, you should be able to identify the weaknesses in the current plan and correct them.
The disaster recovery team
The purpose of the disaster recovery team is to establish and direct plans of action to be followed during an interruption or cessation of computer services caused by a disaster or lesser emergency. The disaster recovery team maintains readiness for emergencies by means of the disaster recovery plan. The team is also responsible for managing the disaster recovery activities following a disaster and can be thought of as the disaster management team. Through the disaster, the team will provide for the safety of personnel, the protection of property, and the continuation of business.
The disaster recovery team consists of a team leader or emergency coordinator, an alternate emergency coordinator, action team leaders, and any other designated individuals. The responsibilities of individuals assigned to the team are in addition to their regular duties and are assigned on the basis of familiarity and competence in their respective areas or specialties. The team leader and the alternate emergency coordinator administer the plan itself. Action teams are used to facilitate the response to various types of emergency situations.
Risk analysis
A risk analysis will incorporate all of the components of your LAN that could be destroyed, whether it is lost connections, computers, or data. In order to find potential threats, you must know what to protect. This usually involves a detailed breakdown of business operations. You begin by analyzing exactly how your company produces its product or service. This seems tedious, but it is vital to analyzing the risk of what a potential catastrophe can do to your business. A risk analysis will bring undesirable outcomes to light, measure the impact on the operations of the company, and estimate any potential loss, whether it is in lost revenue or market segment. To aid you in determining the different pieces of computer equipment that are vital to the computer environment, a diagram of all the pieces of your network should be created so that you have an inventory of all the items that you might have to replace after a disaster. This includes all relevant software products as well.
Doing an inventory can be much simpler if you invest in software that can analyze your network and automatically inventory the different devices and the software on the servers and workstations. An excellent software package for performing a computerized inventory is Track-It by BlueOcean Software, Inc., which can be found at www.blueocean.com. If you miss something in your inventory, you could be looking at a failure when you try to restore your network after a disaster. Do not forget to note the less obvious things like modem cables or network cables; forgetting these pieces could result in unwelcome delays.
Remember that during a disaster, almost anything can go wrong, so you should plan for all possible scenarios. Natural disasters can happen almost anywhere. Flooding can happen from too much rainfall, or when snow melts rapidly, or even from sprinkler systems. These and other disasters are discussed in greater detail later in this chapter. In any event, you must plan for ways to access your network in case you are unable to get into your building for whatever reason. For example, a chemical spill may prevent you from getting to your building, even though the building itself may not be affected by the spill.
The likelihood of a disaster occurring may be greater than you think. However, to accurately assess this you have to take into account your geographic location and the typical weather scenarios that may contribute to a disaster. For example, you may live in a high earthquake zone, or an area that experiences annual flooding.
The following list contains the most common types of business outages and the frequency at which they occur:
✦ Power outages account for 28%
✦ Water damage accounts for 27%
✦ Hardware failure accounts for 15%
✦ Earthquakes account for 11%
✦ Fire accounts for 9%
✦ Hurricanes account for 4%
✦ Building damage accounts for 4%
✦ Corrupt data accounts for 2%
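As a rough illustration of how these figures might feed a risk analysis, the short Python sketch below ranks the outage causes listed above and keeps a running total. The percentages are the ones quoted in the list; the variable and function names are illustrative assumptions and are not part of any tool mentioned in this chapter.

```python
# Outage causes and their share of business outages, as quoted in the list above.
OUTAGE_SHARE = {
    "Power outages": 28,
    "Water damage": 27,
    "Hardware failure": 15,
    "Earthquakes": 11,
    "Fire": 9,
    "Hurricanes": 4,
    "Building damage": 4,
    "Corrupt data": 2,
}


def ranked_with_running_total(shares):
    """Return (cause, percent, cumulative percent) rows, largest share first."""
    rows = []
    running = 0
    for cause, percent in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        running += percent
        rows.append((cause, percent, running))
    return rows


if __name__ == "__main__":
    for cause, percent, running in ranked_with_running_total(OUTAGE_SHARE):
        print(f"{cause:<18}{percent:>4}%  (cumulative {running}%)")
```

Read this way, the first three causes (power, water, and hardware) account for 70% of the outages in the list, which suggests where planning effort might be spent first. Your own geographic location may shift these weights considerably.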
Business Impact Analysis
Vital to any plan is understanding which functions of the business are critical to the company’s successful operation. In order to understand this you should perform a Business Impact Analysis (BIA). This process will expose your weaknesses and strengths, thus allowing you to take corrective measures.
A BIA looks at the loss of revenue, customer service, and legal liabilities. This assessment lets you know exactly what it will cost if your company is non-operational for any length of time. The BIA defines critical, necessary, and non-essential functions for your business. It also identifies techniques that you can use to recover from the disaster. It identifies critical functions, and the priority in which they should be recovered. It also identifies which functions rely on other functions so that you can set a recovery timeline. This will enable you to determine what must be done immediately following a disaster, and what can wait.
A Business Impact Analysis answers the following questions:
1 What is the cost in lost dollars, market share, and customer loyalty you can expect if the company suffers a disaster? Can the company recover its receivables if the accounts receivable records are destroyed?
2 How should you prioritize your recovery options following a disaster (What
has to be recovered first, second, and so on)?
3 How soon do you need to get up and running?
4 How much can the company afford in lost revenue?
The disaster recovery team leader’s role during the Business Impact Analysis is:
1 Identify organization functions.
2 Identify key personnel for the business functions (Finance, Operations, Marketing, and so on).
3 Define the critical business functions.
4 Involve management to gain support and approval.
5 Coordinate the analysis process.
6 Identify which functions are dependent on other business functions.
7 Define recovery objectives and timelines (recovery times, losses, and critical business function priorities).
8 Identify information requirements.
9 Identify resources needed.
10 Develop the format of the report.
11 Prepare the plan and present it to key management for final approval.
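Steps 6 and 7 above (identifying dependencies between functions and defining recovery priorities) amount to ordering functions so that nothing is recovered before the things it relies on. The following is a minimal sketch of that idea using Python's standard `graphlib` module. The business functions and their dependencies are hypothetical examples, not taken from a real BIA.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical business functions mapped to the functions they rely on.
DEPENDS_ON = {
    "Order entry": {"Customer database"},
    "Invoicing": {"Order entry", "Customer database"},
    "Customer database": set(),
    "E-mail": set(),
}


def recovery_order(depends_on):
    """Return a recovery order in which every function follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for position, function in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(position, function)
```

A topological sort only guarantees that dependencies come first. The business priorities from the BIA would still have to decide the order among functions that do not depend on each other.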
Prioritizing applications
After a disaster, when you are starting to piece your network back together, you need to know which applications to restore first. As mentioned in the previous section, you should restore your mission-critical applications first. Your company may have several applications that fall into this category. Keep in mind that everyone will want their applications restored first, and they will try to convince you that their applications are the most important. If your company uses Enterprise Resource Planning (ERP) software, your task of choosing what should be restored first becomes much clearer. Typically, the individual parts of ERP programs make up one large program, meaning you would have to restore the entire ERP software in order for anything to run. This has the benefit of being easier to restore, and the only real drawback is that everyone waits until the entire program is restored.
The BIA should determine which particular program comes first. However, if your company did not invest in a BIA, you will need to sit with management and personnel to determine the most logical order in which to restore the applications.
All departments must accept the application prioritization process, and everyone must adhere to the prioritized list of applications. You must ensure that department heads have signed off on the list.
Don’t forget the details! This means that printers, fax machines, and e-mail will most likely be a priority for sending and receiving information.
Lay the foundation first
Many administrators set out to re-create their entire network all in one shot, instead of setting a priority as to who needs what first. Just because you had 80 workstations before the disaster does not mean that you should concentrate your efforts on rebuilding all of them. Rather, what you should do is set up the minimal number of workstations needed to get the company functioning. Perhaps one workstation per department or business section is all that is necessary.
An important thing to remember when prioritizing applications is that you do not need to restore the entire server. Only restore what is necessary to get the mission-critical applications running. This will allow you to get the system up and running faster, and will allow you to concentrate on what is really important: getting the applications running. Make sure that department heads inform their employees of this process, so that you do not have to waste time dealing with user requests of this nature.
Know where the data is
In order to restore the priority applications first, you will need to know where all the data is, and any dependencies there might be. There may be system files such as INIs and DLLs that need to be restored in order for the applications to work properly. Make sure you are completely familiar with your backup system and software, or this may become more challenging than it already is. Having the proper media rotation schedule will make this process easier.
Restoring data is the focus of Chapter 19.
Recovery requirements
Recovery requirements are a crucial part of the plan, and you must determine anacceptable and realistic amount of time to recover the systems and the network Asmentioned previously, you need to get the most important applications and sys-tems running first Some refer to this as the trickle approach, because you onlyrecover exactly what you need to operate, and then recover everything else whenyour company is at comfortable operation level
You must give yourself ample time, and not be unrealistic when planning your ery It will be extremely difficult to perform your recovery duties if you promise theimpossible People will undoubtedly complain, and the company may lose confidence
recov-in your abilities to recover the systems A good systems admrecov-inistrator will always planfor the worst, and think about potential roadblocks in the recovery Taking this intoaccount, you can add extra time to your recovery plan Some experts say that what-ever your initial recovery timeline is, double it If things go smoothly, great, things will
be accomplished ahead of schedule If things do not go as planned, and you accountedfor these scenarios, the system will still be recovered within the timeline
The recovery plan has to be tested to ensure that your goals are met, and areattainable by you and others in the company In the event that you are unavailable
to perform the recovery yourself, some will have to step in and perform the ery This is why it is essential that the recovery is fully tested, so that you have realstatistics to add to your recovery document
Management needs to work closely with IS staff to develop the recovery requirements. Do not forget that applications are different in nature, and so are their recovery times. You may have two servers and an AS/400 system in your company, and how you prioritize your critical applications will determine the recovery times of each system. For example, suppose the AS/400 houses all the data, so it will get restored first. Server One may be the application server that contains the ERP software that connects to the AS/400 and the workstations. Server Two might be a file server and e-mail server, so it is recovered last.
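The prioritization described above can be sketched in a few lines of Python. This is only an illustration: the `System` class, the numeric priority scale (1 means restore first), and the `recovery_order` helper are assumptions made for the example, and the three systems are the ones named in the text.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    role: str
    priority: int  # 1 = restore first (assumed convention)


# Priorities mirror the example in the text; the numeric scale is an assumption.
SYSTEMS = [
    System("Server Two", "file and e-mail server", 3),
    System("AS/400", "houses all the data", 1),
    System("Server One", "ERP application server", 2),
]


def recovery_order(systems):
    """Return the systems sorted so the most critical one is restored first."""
    return sorted(systems, key=lambda s: s.priority)


for step, system in enumerate(recovery_order(SYSTEMS), start=1):
    print(f"{step}. {system.name} ({system.role})")
```

Keeping the priority as data rather than as the order of a written list makes it easier to revise when management changes its mind about what is critical.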
While determining the recovery time, include the time it will take to get the tapes from off-site storage facilities. Also include the time it will take to cut purchase orders and receive new equipment.
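The timeline arithmetic can be made concrete with a short Python sketch. The step names and hour figures below are invented for illustration, and the default safety factor of 2 simply reflects the "double it" rule of thumb quoted earlier; treat it as a starting point, not a formula from any standard.

```python
# Hypothetical recovery steps and durations in hours (illustrative values only).
RECOVERY_STEPS = {
    "retrieve tapes from off-site storage": 3.0,
    "cut purchase orders": 2.0,
    "receive new equipment": 48.0,
    "install operating system and backup software": 6.0,
    "restore data from tape": 10.0,
}


def estimate_recovery_hours(steps, safety_factor=2.0):
    """Return (initial_estimate, padded_estimate) in hours.

    The padded estimate multiplies the initial total by safety_factor,
    following the "double it" rule of thumb.
    """
    initial = sum(steps.values())
    return initial, initial * safety_factor


initial, padded = estimate_recovery_hours(RECOVERY_STEPS)
print(f"Initial estimate: {initial:.0f} hours")
print(f"Timeline to promise: {padded:.0f} hours")
```

Listing the off-site retrieval and purchasing steps explicitly keeps them from being forgotten when the total is presented to management.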
The disaster recovery document
The disaster recovery document must be so precise and detailed that anyone can follow it. Achieving this will take a large effort on the part of several key people in your organization. The great thing about creating this document is that it provides a comforting reassurance that in the event of a disaster you have the necessary document in place to rebuild the company. It is also a great opportunity to learn the many aspects of your systems.
Have a hard copy of the plan kept both at your location and at an off-site location, because it is probably the most important documentation that you have.
Every disaster recovery document should contain the following information
to a disaster. Time is against you and you must start rebuilding your network as fast as possible, restoring the mission-critical applications in the priority that you have previously defined. All personnel must be given clear direction and responsibilities. You must also document the relationship between tasks, so you can identify any problems as they arise. Last but not least, you must have detailed operations and tasks showing precise installation and recovery operations. These must be very easy to read and follow so that nothing is overlooked or missed.
Make sure you know how to issue a purchase order, as this will save you a lot of time when ordering replacement equipment. Make sure that your inventory list also has the make, model, and serial number of all hardware equipment, along with the phone numbers of the vendors. It is probably best to have copies of the original invoices in your disaster recovery plan. This will make it much easier when trying to order a replacement.
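One way to keep the inventory list honest is to store each entry as a structured record and check it for blanks. The sketch below is a minimal illustration in Python; the `InventoryItem` class, its field names, and the `missing_fields` helper are hypothetical, and the sample values are placeholders rather than real hardware or phone numbers.

```python
from dataclasses import dataclass


@dataclass
class InventoryItem:
    make: str
    model: str
    serial_number: str
    vendor_phone: str
    invoice_on_file: bool = False


REQUIRED_FIELDS = ("make", "model", "serial_number", "vendor_phone")


def missing_fields(item):
    """Return the names of required fields that are blank for this item."""
    return [name for name in REQUIRED_FIELDS if not getattr(item, name).strip()]


# Placeholder values, not real hardware or phone numbers.
server = InventoryItem("ExampleCo", "Model X", "SN-0001", "")
print(missing_fields(server))  # ['vendor_phone']
```

Running a check like this whenever the plan is reviewed catches incomplete entries before a disaster does.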
Having up-to-date and accurate network diagrams is obviously very important. Not only will they help you to reconstruct your network, but they will also let you know how much network cable you will need after the disaster, should it be destroyed. This is also an advantage when hiring contractors to lay the cable, because they can look at your wiring diagram to determine what is needed, and will be able to get the job done much quicker than you could.
The disaster recovery team leader is responsible for keeping the plan up to date. This person should periodically review and evaluate the plan to ensure all contingency site procedures have been adequately considered and prepared. This means that this person will be responsible for ensuring that the document is up to date, and includes any new equipment, personnel changes, and so on. The plan should be reviewed semi-annually to ensure its accuracy.
Types of Disasters
Disasters come in many forms, but all can be disruptive to a business. They may only affect a particular person, or they may affect the entire company. Some disasters may even force the company to close its doors forever. Viruses, hackers, and hardware or software failures can impair the operations of your systems environment, and even destroy your data. Natural disasters such as tornadoes, earthquakes, flooding, and even hurricanes may completely destroy the company’s physical location.
Natural disasters
Natural disasters are usually the most devastating because they not only destroy the LAN environment, but they sometimes also destroy part or all of the building. The most common types of natural disasters are fire, water damage, flooding, earthquakes, tornadoes, and hurricanes. The only way you can truly protect yourself from data loss is to ensure you have a good backup plan, and that the backup media are verified and are stored at a reputable off-site agency. You may choose to use an off-site company in the same city or town as your company, or one that is 50 to 100 miles away. Either way, you must ensure that they provide a safe building to keep your backup media. There is more about choosing an off-site agency later in this chapter. The company can always rebuild or relocate to a new building. It can also purchase new office furniture, equipment, computer hardware, and software, but there is no way to replace the years of customer and financial data. Make sure you secure it properly, as this data can sometimes make or break a company.
Human error
Human error can consist of accidental file deletion, crashing a server while adding new hardware, or tripping in the computer room and knocking over one of the servers. These types of incidents are actually quite common, and the only thing that prevents them from being full-blown disasters is the integrity of your backups.
Vandalism and sabotage
Vandalism is particularly devastating because it is senseless, and almost impossible to predict. Typically, vandalism is in the form of a fire or theft. A fire can result in damage on several different levels. First, the fire itself could destroy the entire building, including all the computer hardware, or perhaps just a few of the servers in the computer room. Second, the sprinkler systems would more than likely damage any electrical equipment, including your users’ workstations. If the server room is equipped with regular water-based sprinkler systems instead of dry sprinkler systems or special fire abatement systems specifically designed for computer rooms, the damage from the water alone would be enough to cripple the computer equipment. Third, the fire department will be using water to extinguish the fire, which will also result in increased water damage.
See Chapter 14 for more information about fire and your server room.
Theft can involve a person stealing hardware, software, or both. If someone breaks into your computer room and steals your servers, you will have to implement part of your disaster recovery plan to purchase new systems and restore them to their previous working state from the last good backups. Do not underestimate the employees in your company either. Leaving the server room unlocked is sometimes all the incentive a likely thief needs. We recommend that you incorporate the use of combination locks, swipe cards, or keys. Key locks are probably the worst of the three because people tend to lose keys, and they can be duplicated. Combination locks are good only if you do not give the combination out to everyone who wants entry. Electronic swipe cards are probably the best, because you can track who went into the room, and at what time they went in. You can also assign temporary cards for people, such as air conditioning repair technicians, who need access to the room on a specific day.
See Chapter 13 for more information about security measures.
Sabotage usually is the result of a disgruntled employee seeking some sort of restitution from the company. These people could delete important company files, corrupt files by inputting false data, or worse. Having proper security and tape backups can help limit the damage.
Logic bombs are a typical device used by disgruntled employees. Logic bombs are just software programs that are triggered by a timing device and can cause severe damage to your computer systems. The trigger is usually a specific date, or perhaps a specific event that occurs on the system. For example, after a specific job runs each night, the logic bomb might be triggered to delete data on your systems.
Hackers
Hackers in the truest sense of the word are a special breed of people who devote all their energy to trying to unlock the proverbial door. Most true hackers follow some sort of ethical code that prevents them from doing any real damage to the systems they break into. In fact, some hackers will even inform you of their discoveries and let you know where you need to improve your system security. Regardless of their ethics, this practice is still considered a crime by most governments. However, there are those that are not so ethical, and when they have accomplished the task of breaking in, they do not stop. They will try to search for important documents and copy them onto their systems. If the information that they steal is valuable enough, it may even fetch a hefty price. Do not underestimate your competition’s ability to cross the line and hire hackers to get into your systems and steal confidential documents. Some companies will stop at nothing to gain the upper hand, even if it means breaking the law. There are also hackers that enjoy breaking in and wreaking havoc on your computer environment. They will tamper with, erase, and destroy information that is vital to the company, the operating system, or both. Either way, the damage will definitely result in many hours of work as you restore deleted files, or an entire system.
Most hackers use what is known as the brute force method. This method is the art of trying to break into a system by trying different passwords. They will keep trying until they eventually find one that works. The problem for the hacker is that this can be very time consuming; the problem for the administrator is that most user passwords are easy to figure out.
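To see why brute force is time consuming for long passwords and quick for short ones, it helps to count the candidates. The Python sketch below is a rough worked example: the guessing rate of one million attempts per second is an assumed figure chosen only to make the comparison visible, and the helper names are made up for this illustration.

```python
def keyspace(alphabet_size, length):
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


def worst_case_hours(alphabet_size, length, guesses_per_second):
    """Hours needed to try every password of the given length."""
    return keyspace(alphabet_size, length) / guesses_per_second / 3600


# Assumed guessing rate; real rates vary enormously with hardware and method.
RATE = 1_000_000

for label, alphabet, length in [
    ("6 lowercase letters", 26, 6),
    ("8 lowercase letters", 26, 8),
    ("8 mixed-case letters and digits", 62, 8),
]:
    print(f"{label}: {worst_case_hours(alphabet, length, RATE):,.1f} hours")
```

The count grows with every extra character and every extra character class, which is the practical argument for instructing users to choose longer, mixed passwords.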
Recently I downloaded some software that was sold for the purpose of recovering forgotten passwords. However, I doubt that is what most people are using it for, so I tried it on my system. Within minutes, this software had cracked a substantial number of the users’ passwords. About five hours later, it had cracked almost every password on the system except for the administrator’s, but eventually this one was cracked as well. Instruct your users on the use of proper passwords, and stay on top of current hacking technologies.
The other method that hackers use is referred to as social engineering. This means that the hacker will trick the users into revealing vital information to them. This can be done very easily in large companies, where a typical user will not know one systems specialist from another. For example, a hacker may call a user, pretending to be a systems specialist, and ask the user for his user ID and password. The user assumes that the request is authentic, and provides the information.
The only way you can prevent this form of hacking is to have the proper policies in place prior to the incident. All users should be informed that they are not to give their passwords out to anyone under any circumstances. They should be made aware of the fact that the IS department would not ask a user for his or her password. If a technician needs access to an account, the technician would temporarily change the user’s password, and would notify the user of the change. The user would be required to change his or her password at the next logon.
Hackers commonly use two types of programs that are usually thought of as viruses, but they are actually destructive programs designed to cause damage on your systems: Trojan horses and worms. Many hackers use these programs for various reasons, but it is usually just because they want to cause damage to your systems.
Worms are programs that infiltrate your systems and destroy your data. Unlike viruses, worms do not attach themselves to other programs; a worm is a standalone program, although many worms can copy themselves to other systems over a network. Because a worm does not infect other files, you do not have to worry that every program on a system might be infected. However, just because a worm is not a virus does not mean that it is not destructive. A worm can be designed to tap into your system and destroy all the files. Removing a worm from a system is typically easier than removing a virus, because you only have to seek out the worm program itself rather than every infected file.
Trojan horses are programs of a very destructive nature that are typically hidden in another piece of software. In fact, viruses have been discovered inside of Trojan horses. Trojan horses do not replicate themselves; they rely on users to install them. The idea behind a Trojan horse is that the hacker takes an attractive and tempting piece of software and places a malicious program inside of it. The unsuspecting user will download and install the software they wanted, and then the Trojan horse will be unleashed on the user’s system. There have been a variety of uses for Trojan horses to date, from simple programs that delete files on the user’s system, to ones that tap into mainframe computers and embezzle money.
Viruses
Computer viruses are programs that typically replicate themselves, attach to other programs, and perform hidden and often malicious actions. Replicating by attaching to other programs is what distinguishes viruses from other damaging programs. Viruses can be destructive to productivity as well as data. Some viruses only interrupt users with annoying messages, while others delete information from hard drives. No matter what the virus actually does, it wastes valuable time and money in resources as the IS person attempts to clean the infected files. For a virus programmer to consider a virus successful, it needs to be undetected for a sustained period of time. This way it can propagate from one file to the next and typically from one computer to the next.
Typical signs that a virus might be present are:
✦ Files missing or corrupt
✦ Disk space decreases suddenly
✦ System is slower
✦ Unusual messages keep displaying
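One of the signs above, a sudden decrease in disk space, is easy to watch for automatically. The following Python sketch flags large drops between consecutive free-space readings. The function name, the 20 percent threshold, and the sample readings are all assumptions for illustration; in practice the readings could come from a call such as `shutil.disk_usage`, and a flagged drop is only a prompt to investigate, not proof of a virus.

```python
def sudden_drops(free_space_samples, threshold=0.20):
    """Return indexes of samples where free space fell by more than
    `threshold` (as a fraction) compared with the previous sample."""
    flagged = []
    for i in range(1, len(free_space_samples)):
        previous, current = free_space_samples[i - 1], free_space_samples[i]
        if previous > 0 and (previous - current) / previous > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily free-space readings in megabytes.
readings = [9000, 8950, 8900, 6100, 6050]
print(sudden_drops(readings))
```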
Planning for Redundancy
✦ Plan for redundancy (e.g., hard drives, power supplies, fans, NICs, processors, UPS)
Redundancy is a key component in making your network stable and reliable. You need to plan your redundancy strategy the same as you plan any other part of disaster recovery. Making your systems reliable is a culmination of hardware, software, design, and planning. The problem for most administrators is that after the initial network is installed, everything after that is not very well planned or documented. This is usually attributed to time constraints and sometimes a lack of expertise. An hour of planning can save many hours of aggravation if something goes wrong.
You should ensure that employees are cross-trained on the various systems in your company. In the event that an employee is hurt during a disaster, having other people who are capable of filling in will help reduce the impact of having that person unavailable while trying to recover from the disaster. At a minimum, you should have copies of the manuals and procedures for your mission-critical applications stored at an off-site facility.
Hard drives
Disk drives have become increasingly reliable, but they still can fail, and you need to be prepared to face that challenge. Disk mirroring, duplexing, and RAID are the most common methods of disk fault tolerance. If you have not implemented RAID on your servers, you should. Redundant disk drives enable the system to keep operating if one drive fails. These systems work well, and are relatively inexpensive compared to the time it would take to get a new server up and running if the hard disk failed.
RAID is extensively discussed in Chapter 4 and Chapter 18.
External hard drives have advantages over internal drives in a couple of areas. First of all, an external drive system can be moved or replaced without taking apart the server. Secondly, they provide easier server redundancy. If a backup server is configured properly, you can very easily replace a defective server in minutes. You would just need to unplug the disk systems, and plug them in to the backup server. At most, you may lose a little bit of data, but this can be found on your media backups.
You should also keep idle spares to replace a hard drive at a moment’s notice. The idle spare should be exactly the same as the original drive, making it easier to set up if you need it.
Along with redundant drives, you may want to use redundant I/O controllers. This is usually referred to as duplexing. The extra controller eliminates the server’s disk controller as a single point of failure. Redundant controllers also increase the disk read performance of the system.
Power supplies
You should always use dual power supplies in servers or other high-end computers, especially if they are critical to the operation of your company. Dual power supplies balance the load by simultaneously supplying power throughout the system. If one of the power supplies fails, the other one will take the full load of the system until you can get a replacement. Before purchasing the server, make sure that one power supply can accommodate the full load of the system on its own, or this feature will be useless. Figure 17-1 shows dual power supplies in action. The system on the right has one power supply connected to the motherboard, and one connected to the hard drive. If either one of these power supplies fails, the system will not work. The system on the left has two power supplies connected to the entire system. Either one can fail, and the other will run the system until the faulty power supply can be replaced.
Figure 17-1: On the left, a system with dual power supplies; on the right, a system with two power supplies connected to separate components
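The rule that one power supply must be able to carry the full load on its own can be written as a simple check. The Python sketch below is illustrative only: the function name and the wattage figures are assumptions, and it models the left-hand layout in Figure 17-1, where every supply feeds the whole system.

```python
def survives_single_failure(supply_watts, load_watts):
    """True if the load is still covered after any one supply fails.

    Assumes every supply feeds the whole system (the left-hand layout
    in Figure 17-1), so the remaining supplies share the full load.
    """
    if len(supply_watts) < 2:
        return False
    # The worst case is losing the largest supply.
    remaining = sum(supply_watts) - max(supply_watts)
    return remaining >= load_watts


# Illustrative wattages only.
print(survives_single_failure([400, 400], load_watts=350))  # True
print(survives_single_failure([250, 250], load_watts=350))  # False
```

The second call shows the trap mentioned in the text: two supplies that together exceed the load still provide no redundancy if neither can carry it alone.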
Network Interface Cards
Some server operating systems can support redundant network interface cards (NICs). Having multiple NICs will provide fault tolerance and will load-balance the network traffic. To protect against a switch failure as well as a card failure, connect each NIC to a different switch. That way, if one of the switches fails, the server will not lose network access. You should also use high-speed network cards (typically 100 Mbps) to provide the fastest performance for your network. Your switches must also accommodate the same speed in order for this to be useful. You should also keep identical replacement network cards for each server in case one of the network cards in one of your servers should fail. Figure 17-2 shows a server with redundant network cards, each connected to a separate switch.
Figure 17-2: Server with two network cards, each on its own switch
Processors
Typically, server processors do not fail, but it can happen. Having redundant processors can save you a lot of time if the original one fails. A redundant CPU tracks the operations of the primary CPU, but without interfering. If the primary CPU fails, the secondary CPU should be able to take over the system based on the information tracked in its internal memory.
Symmetric multiprocessing servers use multiple CPUs to divide the work between the processors and provide a degree of fault tolerance. If one CPU fails, the system can run on the other processor. However, multiple CPUs in these situations are more for performance than fault tolerance. This form of multiprocessing relies on the OS to manage memory between tasks running on the processors. A failed CPU can often result in a system crash in this type of setup, because the process that crashed the first CPU will most likely crash the second one.
Asymmetric multiprocessing systems take a different approach to multiprocessing than symmetric systems. In this design, CPUs have specific tasks assigned to them. In most cases, one CPU will handle I/O operations and the other executes the programs. If either CPU fails in this situation, the system will crash.
Tape backup system
Ensure that you have a working tape backup system, and if possible keep an exact spare system, including cables, SCSI adapters, and device drivers, at an off-site location. This will definitely help speed things up should the tape device be damaged as well. If you end up purchasing a new tape backup system, make sure the spare one is updated as well, or it may be incompatible with the new tape media, which would make it absolutely useless. You should also verify that your backups are good by restoring data from backup tapes periodically.
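A periodic test restore is only useful if you compare what came back with what you expected. The Python sketch below shows one possible way to do that with checksums from the standard `hashlib` module. The function names and the directory layout are assumptions, and the comparison only makes sense for files that have not changed since the backup was taken, for example a reference copy kept for this purpose.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Compare every file under original_dir with its restored copy.

    Returns a list of relative paths that are missing or differ.
    """
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list means every reference file was restored intact; anything else names the files to investigate before you trust that tape.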
UPS
A UPS is one of the most important pieces of equipment in the server room. If there is a power failure, a UPS can safely keep your servers online for an extended period of time. This will also enable you to shut down the servers cleanly, and prevent any damage. The UPS will also protect your systems from brownouts or sags. Brownouts can affect computer systems, causing electrical components to fail after repeated occurrences. To ensure that your UPS can handle the power requirements of your servers, you should first determine the battery capacity that you need. The best thing to do is to make a list of all the equipment that the UPS will handle, and then determine the power consumption of each. You will then be able to choose the right UPS configuration.
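The sizing exercise described above is mostly addition and division, so it can be sketched in Python. Everything numeric here is an assumption for illustration: the equipment list, the power factor of 0.7 used to convert watts to volt-amperes, the 25 percent headroom, the 90 percent efficiency, and the 600 Wh battery. Use the figures from your own equipment and UPS documentation instead.

```python
# Hypothetical equipment list: name -> power draw in watts.
EQUIPMENT_WATTS = {
    "file server": 350,
    "application server": 400,
    "network switch": 60,
    "monitor": 90,
}


def required_va(equipment_watts, power_factor=0.7, headroom=1.25):
    """Estimate the UPS volt-ampere rating needed for the listed load.

    power_factor and headroom are assumed values; check the UPS and
    equipment documentation for the real figures.
    """
    total_watts = sum(equipment_watts.values())
    return total_watts / power_factor * headroom


def runtime_minutes(battery_watt_hours, load_watts, efficiency=0.9):
    """Rough runtime estimate; real batteries deliver less at high loads."""
    return battery_watt_hours * efficiency / load_watts * 60


total = sum(EQUIPMENT_WATTS.values())
print(f"Total load: {total} W")
print(f"Suggested UPS rating: {required_va(EQUIPMENT_WATTS):.0f} VA")
print(f"Estimated runtime on a 600 Wh battery: {runtime_minutes(600, total):.0f} min")
```

The runtime figure matters as much as the rating: it tells you whether you have enough time to shut the servers down cleanly.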
UPS systems are discussed in detail in Chapters 1 and 2.
Part of your duties will require you to test the UPS batteries several times per year. Doing this will ensure that the UPS is sustaining a full charge, and that you will not have any surprises should you have a power failure. Most UPS systems come with software that enables you to check system events, and to determine the current condition of the unit. The software also enables you to simulate a power failure so you can test the UPS devices. Be sure to perform these tests when you will not disrupt any of the users on the network.
Ensuring Service
✦ Identify hot and cold sites
If a disaster strikes and your building is destroyed, including all your computer hardware, you can resume business in a relatively short period of time with hot site or cold site services.
Hot site
A hot site is a location that offers backup computing resources. Many companies specialize in providing hot site services. A hot site should have hardware and a network environment that are compatible with your company’s computer environment. These sites typically provide an enhanced environment that is protected against most natural disasters, and they should have backup power available so that you are not dependent on the power provided by the local utility company. This will allow you to restore your environment, and set up a temporary network between the hot site and your home site to restore operations. This is the quickest way to restore operations, but it is also the most expensive. Ideally, the hot site is located less than 30 miles away from your company, so that employees can get there easily. It is very important to know exactly what you will need when you establish a hot site contract, because most service providers will not allow you to change your requirements when disaster strikes, at least not without a hefty fee.
Hot sites are the best choice for ensuring availability of your data, and typically offer:
✦ Use of recovery center
✦ Support staff on hand to assist you
✦ Duplicate systems
✦ Configured systems transported to your company’s site
✦ Shipping of replacement equipment for anything that has been lost
✦ Work group spaces equipped with furnishings
✦ Workstations and printers
✦ Call management center
✦ WAN links redirected to the hot site
Cold site
A cold site has all the appropriate power requirements, network requirements, and floor space to install the hardware and to enable you to recreate your computer environment, but does not provide the actual equipment. Many of the same companies that provide hot sites also provide cold sites. The facilities provided by these companies will be comparable to their hot site facilities, but they will not include the running systems and network equipment. It may be reasonable for your company to consider creating its own cold site if your company has floor space available in a different location from the home site. Cold sites are far less expensive than hot sites, but because you have to purchase or move equipment and software to recreate your environment, they will require a significantly longer outage before operations can be restored. Most cold site services will include:
✦ Off-site/off-line storage of backup media
✦ Replacement servers for mission critical operations at an added cost
✦ May provide help desk assistance
✦ Alternate location for storage of data
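The trade-off between hot and cold sites comes down to site cost versus the cost of a longer outage. The Python sketch below puts rough numbers on that comparison. All figures are invented for illustration, and the model (site fee plus outage hours multiplied by an hourly downtime cost) is a deliberate simplification that ignores equipment purchases and contract details.

```python
def total_outage_cost(site_fee, outage_hours, downtime_cost_per_hour):
    """Site cost plus the business cost of being down during recovery."""
    return site_fee + outage_hours * downtime_cost_per_hour


# All figures are invented for illustration.
DOWNTIME_COST_PER_HOUR = 5_000
hot = total_outage_cost(site_fee=120_000, outage_hours=8,
                        downtime_cost_per_hour=DOWNTIME_COST_PER_HOUR)
cold = total_outage_cost(site_fee=20_000, outage_hours=72,
                         downtime_cost_per_hour=DOWNTIME_COST_PER_HOUR)

print(f"Hot site:  {hot:,}")
print(f"Cold site: {cold:,}")
print("Cheaper overall:", "hot site" if hot < cold else "cold site")
```

With a low hourly downtime cost the cold site wins; as that cost rises, the faster recovery of a hot site starts to pay for itself. The business impact analysis is where the hourly figure should come from.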
Backup Plan
✦ Confirm and use off-site storage for backup
The first part of any backup plan is to ensure you have adequate backups and off-site storage facilities to store the backups. What is backed up, and how often, should be determined by management and the appropriate employees. Minimally, you should back up your servers daily and perform weekly or monthly backups for high-end workstations. The methods you employ for performing these backups will be determined based on the time and data constraints you have. These strategies are discussed in greater detail in Chapter 19.
One of the most significant factors in a backup plan is where to keep the media. Many smaller companies keep the daily backups at the network administrator’s house, hopefully in a fireproof safe. However, this is not realistic for a larger company. Ideally, tapes that cannot be rotated off-site should be locked away in a fireproof safe in a safe part of the building, such as the server room. The weekly and monthly tapes are typically stored at an off-site agency. The off-site agency must ensure that they can protect your vital data, and you should ensure that they have a disaster recovery plan of their own. If they do not, look elsewhere for a reputable company that is qualified to protect your data.
As the administrator, you must ensure that you know where the media is kept, and any hazards that may be encountered during its transit. Look for agencies that do not advertise their business on company vehicles. A marked vehicle would be an easy target for thieves who could steal the data and sell it to your competitors. You must also ensure that you completely document the backup plan and label all media accordingly.
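Labeling and storage decisions are easier to follow consistently when the rule is written down precisely. The Python sketch below shows one possible rule. The schedule itself is an assumption made for the example (the last Friday of a month is the monthly backup, other Fridays are weekly, all other days are daily), as are the function names and the label format; substitute whatever rotation your own backup plan defines.

```python
from datetime import date, timedelta


def classify_backup(day):
    """Return (backup_type, storage_location) for a given date.

    Assumed schedule: the last Friday of a month is the monthly backup,
    other Fridays are weekly backups, every other day is a daily backup.
    """
    if day.weekday() == 4:  # Friday
        next_friday = day + timedelta(days=7)
        kind = "monthly" if next_friday.month != day.month else "weekly"
        return kind, "off-site agency"
    return "daily", "on-site fireproof safe"


def media_label(server, day):
    """Build a label such as 'SERVER1-weekly-2001-06-08'."""
    kind, _ = classify_backup(day)
    return f"{server}-{kind}-{day.isoformat()}"


print(media_label("SERVER1", date(2001, 6, 8)))
print(classify_backup(date(2001, 6, 29)))
```

A label that carries the server name, backup type, and date lets whoever performs the recovery pick the right tape without opening the backup log.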
Testing the Disaster Recovery Plan
✦ Document and test disaster recovery plan regularly, and update as needed
✦ Implement disaster recovery plan
The final step in creating your disaster recovery plan is to test it. It is almost impossible to simulate a realistic test that will not disrupt the operation of your organization. However, you can perform tests that can give you a reasonable idea of what you can expect under certain circumstances. There are also a wide variety of companies that can help you simulate disaster recovery scenarios. Whatever test you choose, you must establish measurable objectives for the test to be effective. You should try to define exactly what you are trying to accomplish with the test, because if you cannot determine the result, you cannot determine if the test is successful. For example, a good objective would be to determine if you can install the system’s operating system, backup software, and hardware, and recover that data at a recovery site.
You must have well-documented, step-by-step procedures in order to accomplish this. If you do not, you will not be able to identify if the testing is going according to plan. You should also keep the tests relatively simple even though an actual disaster could be very complicated. If the test is too complicated, you will not be able to determine what part of the test was successful and what was not. It will also make it difficult to determine the causes of various problems you may encounter. If you are testing a complex procedure, make sure you break it down into various parts. You can test each part individually, and determine the success or failure of those parts, and how to correct them.
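Breaking a test into parts with measurable objectives lends itself to a simple record per part. The Python sketch below is one hypothetical way to capture that: the `DrillStep` class, the choice of elapsed minutes as the measure, and the sample figures are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrillStep:
    description: str
    target_minutes: float  # measurable objective set before the test
    actual_minutes: float  # time recorded while the test was running

    @property
    def met_objective(self):
        return self.actual_minutes <= self.target_minutes


def failed_steps(steps):
    """Return descriptions of the parts that missed their objective."""
    return [s.description for s in steps if not s.met_objective]


# Invented example figures for a test broken into individual parts.
drill = [
    DrillStep("install operating system", 90, 75),
    DrillStep("install backup software", 30, 45),
    DrillStep("restore data from tape", 240, 230),
]
print(failed_steps(drill))
```

Because each part has its own target, a missed objective points directly at the procedure that needs correcting rather than at the test as a whole.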
Record every single detail while you are performing the test. Never wait until after the test is done, because you will forget the important details. Invite various personnel to the tests to observe everything that goes on. They will be able to help you determine if your documentation is easy to follow, and if certain procedures need clarification. This will also help them to refine their own plans. The other added benefit is that more points of view will result in different perspectives on how to solve problems that you would not have noticed. It is a good idea to include people who are involved with the plan but not necessarily with writing the procedures. You will benefit from their viewpoints, and you may discover that things that are obvious to you are not so obvious to other team members.
Once the test is completed, ensure you review and update the disaster recovery document accordingly. It will be the team leader’s responsibility to make sure that the plan is updated after the testing, and periodically throughout the year. It is also the team leader’s responsibility to ensure that the plan is tested at least once per year. The effectiveness of the plan is impacted by changes in the environment that the plan was originally created to protect. Some major factors that will impact the plan are new equipment, changes to the software environment, staff and organizational changes, and new or changing applications. The plan should be reviewed and updated by the team leader twice per year.
Testing is an ongoing process, and it is an excellent way of determining if your recovery procedures work. It will also improve your procedures, and familiarize the staff with any new procedures. Testing should also include your emergency repair procedures for servers, and cable testing. Remember that testing is probably the most valuable training tool you will have for your company when planning for a disaster. The more often you practice, the more prepared you will be.
Implementing the plan will require the cooperation of the disaster recovery team, and the support of the company as a whole. Ultimately, the implementation of the plan will fall in the hands of the team leader, and this person will have to ensure that each of the various stages of the plan is met before continuing on to the next phase. In order to meet the organization’s goal of having its business resume operations quickly in the event of a crisis, the plan must be repeatedly tested, and the key team members must be adequately trained. If the testing becomes routine, adequate feedback will be given to help keep the plan current. This will also make the plan go more smoothly should it need to be implemented. Testing is vital to the plan implementation, and ultimately to the survival of the business.
Key Point Summary
This chapter focused on the elements that are key to the recovery of a system after a disaster. These elements included the disaster recovery plan, types of disasters, and the backup plan. Keep the following points in mind for the exam:
✦ Conduct business impact analysis
✦ Assess the risk of particular disasters based on company profile and location
✦ Select a hot site or a cold site recovery option to ensure service
✦ Adopt redundancy plans for drives, controllers, power supplies, and NICs to ensure high availability of systems
✦ Test your recovery plan at least once per year
✦ Verify your backups to ensure they are working properly
✦ Be aware of the different kinds of natural and man-made disasters
STUDY GUIDE
The Study Guide section provides you with the opportunity to test your knowledge about planning for disaster recovery. The Assessment Questions provide practice for the test, and the Scenarios provide practice with real situations. If you get any questions wrong, use the answers to determine the part of the chapter you should review before continuing.
Assessment Questions
1 Most companies are not ready for a disaster because of which of the following
barriers? Choose all that apply
A Lack of knowledge
B Little funding
C Lack of appreciation by support personnel
D Not enough support from management
2 The Disaster Recovery Plan consists of several phases. What are they?
A Risk analysis, Business Impact Analysis, prioritizing recovery
requirements, producing the document, testing the plan
B Risk analysis, Business Impact Analysis, prioritizing applications,
recovery requirements, producing the document, testing the plan
C Risky business, Cost Impact Assessment, application recovery, analyzing
the document, producing the test
D None of the above
3 A business impact analysis will perform which of the following functions for your
company? Choose all that apply
A Expose weaknesses and strengths.
B Recommend the proper architectural layout for your building.
C Define critical, necessary, and non-essential functions of your business.
D Completely protect you from disaster.
4 You should keep a hard copy of the DRP document in a minimum of two
locations. What are they?
A Keep both copies off-site.
B Keep both copies on-site; one in the computer room, and one in a safe
in your office
C Keep one copy on-site, and one copy at an off-site location.
D Keep both copies in a locked safe on-site.
5 During the creation of the DRP document, someone suggests you should have
the appropriate purchasing information Why is this a good idea?
A Having the appropriate purchasing information on-hand will reduce the
time it takes to issue a purchase order and get new equipment
B It will minimize the effects of a disaster.
C All employees should learn how to issue purchase orders so that they
can order the new equipment on their own
D It is only necessary for the people in the purchasing department to issue
purchase orders
6 Your boss asks you what steps you have taken to ensure you have a sound
tape backup system How should you respond?
A Verify your backups, and purchase a backup unit for the off-site location.
B Verify backups, keep an exact spare tape unit, backups, cables, SCSI
adapters and device drivers at an off-site location
C Keep an additional tape backup unit at your on-site location, so that you
can recover even faster
D Rotate the media to an off-site location.
7 Certain types of disasters are more devastating than others. What are they?
A Air disasters
B Sabotage
C Natural disasters
D Viruses