20.1.7.2 KVM and Console Service
Two data center elements that make management easier are KVM switches and serial console servers. Both can be instrumental in making maintenance windows easier to run by making it possible to remotely access the console of a machine.
A KVM switch permits multiple computers to all share the same keyboard, video display, and mouse. A KVM switch saves space in a data center (monitors and keyboards take up a lot of space) and makes access more convenient; indeed, more sophisticated console access systems can be accessed from anywhere in the network.
A serial console server connects devices with serial consoles (systems without video output, such as network routers, switches, and many UNIX servers) to one central device with many serial inputs. By connecting to the console server, a user can then connect to the serial console of the other devices. All the computer room equipment that is capable of supporting a serial console should have its serial console connected to some kind of console concentrator, such as a networked terminal server.
Much work during a maintenance window requires direct console access. Using console access devices permits people to work from their own desks rather than having to try to coordinate access for many people to the very limited number of monitors in the computer room or having to waste computer room space, power, and cooling with more monitors. It is also more convenient for the individual SAs to work in their own workspace with their preparatory notes and reference materials around them.
20.1.7.3 Radios
Because of the tightly scheduled maintenance window, the high number of dependencies, and the occasional unpredictability of system administration work, all the SAs have to let the flight director know when they are finished with a task, and before they start a new task, to make sure that the prerequisite tasks have all been completed. We recommend using handheld radios to communicate within the group. Rather than seeking out the flight director, an SA can simply call over the radio. Likewise, the flight director can contact the SAs to find out status, and team members and team leaders can find one another and coordinate over the radio. If SAs need extra help, they can also ask for it over the radio. There are multiple radio channels, and long conversations can move to another channel to keep the primary one free. The radios are also essential for systemwide testing at the end of the maintenance window (see Section 20.1.9).
Several options exist for selecting the radios, and what you choose depends on the coverage area that you need, the type of terrain in that area, availability, and your skill level. It is useful to have multiple channels, or frequencies, available on the handheld radios, so that long conversations can switch to another channel and leave the primary hailing channel open for others (see Table 20.3).
Line-of-sight radio communications are the most common and typically have a range of around 15 miles, depending on the surrounding terrain and buildings. Your retailer should be able to set you up with one or more frequencies and a set of radios that use those frequencies. Make sure that the retailer knows that you need the radios to work through buildings and the coverage that you need.
Table 20.3 Comparison of Radio Technologies
Line of sight
  Requirements: Frequency license
  Advantages: Simple
  Disadvantages: Limited range
Repeater
  Requirements: Frequency license; radio operator license
  Advantages: Better range; repeater on mountain enables communication over mountain
  Disadvantages: More complex to run; skill qualifications
Cellular
  Requirements: Service availability
  Advantages: Simple; wide range; unaffected by terrain; less to carry
  Disadvantages: Higher cost; available only in cellphone providers’ coverage area; company contracts may limit options; multiple channels may not be available
Repeaters can be used to extend the range of a radio signal and are particularly useful if a mountain between campus buildings would block line-of-sight communication. It can be useful to have a repeater and an antenna on top of one of the campus buildings in any case, for additional range, with at least the primary hailing channel using the repeater. This configuration usually requires that someone with a ham radio license set up and operate the equipment. Check your local laws.
Some cellphone companies offer push-to-talk features on cellphones so that phones work more like walkie-talkies. This option will work wherever the telephones operate. The provider should be able to provide maps of the coverage areas. The company should supply all SAs with a cellphone with this service. This has the advantage that the SAs have to carry only the phone, not a phone and radio. This can be a quick and convenient way to get a new group established with radios but may not be feasible if it requires everyone to change to the same cellphone provider.
If radios won’t work or work badly in your data center because of radio frequency (RF) shielding, put an internal phone extension with a long cord at the end of every row, as shown in Figure 6.14. That way, SAs in the data center can still communicate with other SAs while working in the data center. At worst, they can go outside the data center, contact someone on the radio, and arrange to talk to that person on a specific telephone inside the data center.
Setting up a conference call bridge for everyone to dial in to can have the benefits of radio communication with the added benefit that people can dial in globally to participate. Having a permanent bridge number assigned to the group makes it easier to memorize and can save critical minutes when needed for emergencies.
Communication During an Emergency
A major news web site was flooded by users during the attacks of September 11, 2001.
It took a long time to request and receive a conference call bridge and even longer for all the key players to receive dialing instructions.
20.1.8 Deadlines for Change Completion
A critical role of the flight director is tracking how the various tasks are progressing and deciding when a particular change should be aborted and the back-out plan for that change executed. For a general task with no other dependencies and for which those involved had no other remaining tasks, that time would be 11 PM on Saturday evening, minus the time required to implement the back-out plan, in the case of a weekend maintenance window. The flight director should also consider the performance level of the SA team. If the members are exhausted and frustrated, the flight director may decide to tell them to take a break or to start the back-out process early if they won’t be able to implement it as efficiently as they would when they were fresh.
If other tasks depend on that system or service being operational, it is particularly critical to predefine a cut-off point for task completion. For example, if a console server upgrade is going badly, it can run into the time regularly allotted for moving large data files. Once you have overrun one time boundary, the dependencies can cascade into a full catastrophe, which can be fixed only at the next scheduled downtime, perhaps another week away. Make note of what other tasks are regularly scheduled near or during your maintenance window, so you can plan when to start backing out of a problem.
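The cut-off arithmetic described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name backout_deadline, the example date, and the two-hour back-out time are hypothetical and not taken from any real plan. The idea is that the latest moment to start backing out is the earlier of the window end and the start of any dependent task, minus the time the back-out plan needs.

```python
from datetime import datetime, timedelta


def backout_deadline(window_end, backout_time, dependents_start=None):
    """Return the latest time at which the back-out must begin.

    window_end:       end of the maintenance window
    backout_time:     how long the back-out plan takes to execute
    dependents_start: optional start time of a task that needs this
                      system working again (hypothetical parameter)
    """
    hard_stop = window_end
    if dependents_start is not None:
        # A dependent task moves the hard stop earlier, never later.
        hard_stop = min(window_end, dependents_start)
    return hard_stop - backout_time


# Hypothetical weekend window ending Saturday at 11 PM.
window_end = datetime(2024, 6, 1, 23, 0)

# No dependencies: back out no later than two hours before the end.
simple_deadline = backout_deadline(window_end, timedelta(hours=2))

# A large data move scheduled for 8 PM depends on this system.
dependent_deadline = backout_deadline(
    window_end, timedelta(hours=2), datetime(2024, 6, 1, 20, 0)
)
```

A flight director could compute these deadlines for every change in the master plan ahead of time, so the decision to abort is a matter of checking the clock rather than negotiating under pressure.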
20.1.9 Comprehensive System Testing
The final stage of a maintenance window is comprehensive system testing. If the window has been short, you may need to test only the few components that you worked on. However, if you have spent your weekend-long maintenance window taking apart various complicated pieces of machinery and then putting them back together, all under a time constraint, you should plan on spending all day Sunday doing system testing.
Sunday system testing begins with shutting down all of the machines in the data center, so that you can then step through your ordered boot sequence. Assign an individual to each machine on the reboot list. The flight director announces the stages of the shutdown sequence over the radio, and each individual responds when the machine under their responsibility has completely shut down. When all the machines at the current stage have shut down, the flight director announces the next stage. When everything is down, the order is reversed, and the flight director steps everyone through the boot stages. If any problems occur with any machine at any stage, the entire sequence is halted until they are debugged and fixed. Each person assigned to a machine is responsible for ensuring that it shut down completely before responding and that all services have started correctly before calling it in as booted and operational.
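The staged shutdown and boot procedure above can be modeled roughly as follows. This is a minimal sketch, not a tool the text describes: the machine names and the confirm callback (standing in for each person's radio response) are hypothetical, and machines within a stage are checked one after another even though in practice they are handled in parallel.

```python
def shutdown_stages(boot_stages):
    """Derive the shutdown order by reversing the ordered boot stages."""
    return list(reversed(boot_stages))


def run_sequence(stages, confirm):
    """Step through the stages in order.

    confirm(machine) stands in for the radio response from the person
    assigned to that machine. The sequence halts at the first machine
    that is not confirmed, and later stages are never announced.
    Returns (completed_stage_numbers, failure), where failure is None
    or a (stage_number, machine) tuple.
    """
    completed = []
    for number, machines in enumerate(stages, start=1):
        for machine in machines:
            if not confirm(machine):
                return completed, (number, machine)
        completed.append(number)
    return completed, None


# Hypothetical boot order: core services first, then their dependents.
boot_stages = [["dns1", "auth1"], ["fileserver1"], ["app1", "app2"]]

down_order = shutdown_stages(boot_stages)
shutdown_result = run_sequence(down_order, lambda machine: True)

# Simulate a boot problem on fileserver1: stage 3 is never started.
boot_result = run_sequence(boot_stages, lambda machine: machine != "fileserver1")
```

Keeping a single ordered boot list and deriving the shutdown order from it means there is only one list to maintain, which matches the text's description of reversing the order once everything is down.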
Finally, when all the machines in the data center have been successfully booted in the correct order, the flight director splits the SA team into groups. Each group has a team leader and is assigned an area in one of the campus buildings. The teams are given instructions about which machines they are responsible for and which tests to perform on them. The instructions always include rebooting every desktop machine to make sure that it comes up cleanly. The tests could also include logging in, checking for a particular service, or trying to run a particular application, for example. Each person in the group has a stack of colored sticky tabs used for marking offices and cubicles that have been completed and verified as working. The SAs also have a stack of sticky tabs of a different color to mark cubicles that have a problem. When SAs run across a problem, they spend a short time trying to fix it before calling it in to the central core of people assigned to stay in the main building to help debug problems. As it finishes its area, a team is assigned to a new area or to help another team to complete an area, until the whole campus has been covered.
Meanwhile, the flight director and the senior SA troubleshooters keep track of problems on a whiteboard and decide who should tackle each problem, based on the likely cause and who is available. By the end of testing, all offices and cubicles should have tags, preferably all indicating success. If any offices or cubicles still have tags indicating a problem, a note should be left for that customer, explaining the problem; someone should be assigned to meet with that person to try to resolve it first thing in the morning.
This systematic approach helps to find problems before people come in to work the next day. If there is a bad network segment connection, a failed software depot push, or problems with a service, you’ll have a good chance to fix it before anyone else is inconvenienced. Be warned, however, that some machines may not have been working in the first place. The reboot teams should always make sure to note when a machine did not look operational before they rebooted it. They can still take time to try to fix it, but it is lower on the priority list and does not have to happen before the end of the maintenance window.
Ideally, the system testing and sitewide rebooting should be completed sometime on Sunday afternoon. This gives the SA team time to rest after a stressful weekend before coming into work the next day.
20.1.10 Post-maintenance Communication
Once the maintenance work and system testing have been completed, the flight director sends out a message to the company, informing everyone that service should now be fully restored. The message briefly outlines the main successes of the maintenance window and briefly lists any services that are known not to be functioning and when they will be fixed.
This message should be in a fixed format and written largely in advance, because at the end of a long weekend the flight director will be too tired to write a coherent or upbeat message. There is also little chance that anyone who proofreads the message at that point is going to be able to help, either.
From: IT
To: Everyone in the company
All servers in the Burlington office are up and running. Should you have any issues accessing servers, please open a helpweb ticket.
From: A Developer
To: IT
Devwin8 is down.
From: IT
To: Everyone in the company
Whoever has devwin8 under their desk, turn it on, please.
20.1.11 Re-enable Remote Access
The final act before leaving the building should be to reenable remote access and restore the voicemail on the helpdesk phone to normal. Make sure that this appears on the master plan and the individual plans of those responsible. It can be very easily forgotten after an exhausting weekend, but it is a very visible, inconvenient, and embarrassing thing to forget, especially because it can’t be fixed remotely if all remote access was turned off successfully.
20.1.12 Be Visible the Next Morning
It is very important for the entire SA group to be in early and to be visible to the company the morning after a maintenance window, no matter how hard they have worked during the outage. If everyone has company or group shirts, coordinate in advance of the maintenance window so that all the SAs wear those shirts on the day after the outage. Have the people who look after particular departments roam the corridors of those departments, keeping eyes and ears open for problems.
Have the flight director and some of the senior SAs from the central core-services group, if there is one, sit in the helpdesk area to monitor incoming calls and listen for problems that may be related to the maintenance window. These people should be able to detect and fix them sooner than the regular helpdesk staff, who won’t have such an extensive overview of what has happened.
A large visible presence when the company returns to work sends the message: “We care, and we are here to make sure that nothing we did disrupts your working hours.” It also means that any undetected problems can be handled quickly and efficiently, with all the relevant staff on-site and not having to be paged out of their beds. Both of these factors are important in the overall satisfaction of the company with the maintenance window. If the company is not satisfied with how the maintenance windows are handled, the windows will be discontinued, which will make preventive maintenance more difficult.
20.1.13 Postmortem
By about lunchtime of the day after the maintenance window, most of the remaining problems should have been found. At that point, if it is sufficiently quiet, the flight director and some of the senior SAs should sit down and talk about what went wrong, why, and what can be done differently. That should all be noted and discussed with the whole group later in the week. Over time, with the postmortem process, the maintenance windows will become smoother and easier. Common mistakes early on are taking on too much, not doing enough work ahead of time, and underestimating how long something will take.
20.2 The Icing
20.2.1 Mentoring a New Flight Director
It can be useful to mentor new flight directors for future maintenance windows. Therefore, flight directors must be selected far enough in advance so that the one for the next maintenance window can work with the current flight director.
The trainee flight director can produce the first draft of the master plan, using the change requests that were submitted, adding in any dependencies that are missing, and tagging those additions. The flight director then goes over the plan with the trainee, adds or subtracts dependencies, and reorganizes the tasks and personnel assignments as appropriate, explaining why. Alternatively, the flight director can create the first draft along with the trainee, explaining the process while doing so. The trainee flight director can also help out during the maintenance window, time permitting, by coordinating with the flight director to track status of certain projects and suggesting reallocation of resources where appropriate. The trainee can also help out before the downtime by discussing projects with some of the SAs if the flight director has questions about the project and by ensuring that the prerequisites listed in the change proposal are met in advance of the maintenance window.
20.2.2 Trending of Historical Data
It is useful to track how long particular tasks take and then analyze the data later and improve on the estimates in the task submission and planning process. For example, if you find that moving a certain amount of data between two machines took 8 hours and you have a large data move between two similar machines on similar networks another time, you can more accurately predict how long it will take. If a particular software package is always difficult to upgrade and takes far longer than anticipated, that will be tracked, anticipated, allowed for in the schedule, and watched closely during the maintenance interval.
Trending is particularly useful in passing along historical knowledge. When someone who used to perform a particular function has left the group, the person who takes over that function can look back at data from previous maintenance windows to see what sorts of tasks are typically performed in this area and how long they take. This data can give people new to the group and to planning a maintenance window a valuable head start so that they don’t waste a maintenance opportunity and fall behind.
For each change request, record actual time to completion for use when calculating time estimates next time around. Also record any other notes that will help improve the process next time.
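As a rough illustration of how recorded completion times could feed the next estimate, the following Python sketch scales a new data move by the median hours-per-gigabyte rate of past moves. Everything here is an assumption made for the example: the record fields (task, size_gb, hours), the sample figures other than the 8-hour move, and the premise that duration scales linearly with data size on similar machines and networks.

```python
from statistics import median


def estimate_hours(history, task, size_gb):
    """Estimate the duration of a task from past maintenance windows.

    Uses the median hours-per-GB rate of earlier runs of the same task,
    so one unusually slow or fast run does not dominate the estimate.
    Returns None when there is no history to draw on.
    """
    rates = [
        record["hours"] / record["size_gb"]
        for record in history
        if record["task"] == task and record["size_gb"] > 0
    ]
    if not rates:
        return None
    return median(rates) * size_gb


# Hypothetical records kept from earlier maintenance windows.
history = [
    {"task": "data move", "size_gb": 500, "hours": 8.0},
    {"task": "data move", "size_gb": 250, "hours": 4.5},
    {"task": "db upgrade", "size_gb": 100, "hours": 3.0},
]

# Estimate for a 1000 GB move between similar machines.
estimate = estimate_hours(history, "data move", 1000)
print(round(estimate, 1))
```

The median is one reasonable choice among several; a site that wants conservative estimates might instead use the slowest recorded rate.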
20.2.3 Providing Limited Availability
It is highly likely that at some point, you will be asked to keep service available for a particular group during a maintenance window. It may be something unforeseen, such as a newly discovered bug that engineering needs to work on all weekend, or it may be a new mode of operation for a division, such as customer support switching to 24/7 service and needing continuous access to its systems to meet its contracts. Internet services, remote access, global networks, and new-business pressure reduce the likelihood that a full and complete outage will be permitted.
Planning for this requirement could involve rearchitecting some services or introducing added layers of redundancy to the system. It may involve making groups more autonomous or distinct from one another. Making these changes to your network can be significant tasks by themselves, likely requiring their own maintenance window; it is best to be prepared for these requests before they arrive, or you may be left without time to prepare.
To approach this task, find out what the customers will need to be able to do during the maintenance window. Ask a lot of questions, and use your knowledge of the systems to translate these needs into a set of service-availability requirements. For example, customers will almost certainly need name service and authentication service. They may need to be able to print to specific printers and to exchange email within the company or with customers. They may require access to services across wide-area connections or across the Internet. They may need to use particular databases; find out what those machines depend on. Look at ways to make the database machines redundant so that they can also be properly maintained without loss of service. Make sure that the services they depend on are redundant. Identify what pieces of the network must be available for the services to work. Look at ways to reduce the number of networks that must be available by reducing the number of networks that the group uses and locating redundant name servers, authentication servers, and print servers on the group’s networks. Find out whether small outages are acceptable, such as a couple of 10-minute outages for reloading network equipment. If not, the company needs to invest in redundant network equipment.
Devise a detailed availability plan that describes exactly what services and components must be available to that group. Try to simplify it by consolidating the network topology and introducing redundant systems for those networks. Incorporate availability planning into the master plan by ensuring that redundant servers are not down simultaneously.
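One way to check a draft master plan against the rule that redundant servers must not be down simultaneously is sketched below in Python. The host names, the grouping of hosts into redundancy groups, and the use of whole hours since the start of the window are all hypothetical choices for the example. The check is deliberately conservative: it flags any two members of the same group whose outages overlap, even if a third member would remain up.

```python
from itertools import combinations


def simultaneous_outages(tasks, redundancy_groups):
    """Find pairs of redundant hosts scheduled to be down at once.

    tasks:             list of dicts with "host", "start", and "end"
                       (hours since the start of the window)
    redundancy_groups: mapping of service name to the set of hosts
                       that back each other up
    Returns a list of (service, host_a, host_b) conflicts.
    """
    conflicts = []
    for service, members in redundancy_groups.items():
        group_tasks = [task for task in tasks if task["host"] in members]
        for a, b in combinations(group_tasks, 2):
            different_hosts = a["host"] != b["host"]
            # Two intervals overlap when each starts before the other ends.
            overlap = a["start"] < b["end"] and b["start"] < a["end"]
            if different_hosts and overlap:
                conflicts.append((service, a["host"], b["host"]))
    return conflicts


# Hypothetical redundancy groups and draft schedule.
redundancy_groups = {
    "name service": {"ns1", "ns2"},
    "authentication": {"auth1", "auth2"},
}
tasks = [
    {"host": "ns1", "start": 1, "end": 3},
    {"host": "ns2", "start": 2, "end": 4},    # overlaps with ns1
    {"host": "auth1", "start": 1, "end": 3},
    {"host": "auth2", "start": 3, "end": 5},  # starts after auth1 ends
]

conflicts = simultaneous_outages(tasks, redundancy_groups)
```

Run against each draft of the master plan, a check like this would show the flight director which tasks need to be rescheduled before the window begins.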
These sites still need to perform maintenance on the systems in service. Although the availability guarantees that these sites make to their customers typically exclude maintenance windows, they will lose customers if they have large planned outages.
as once a week, and shorter, perhaps 4 to 6 hours in duration
• They need to let their customers know when maintenance windows are scheduled. For ISPs, this means sending an email to the customers. For an e-commerce site, this means having a banner on the site. In both cases, it should be sent only to those customers who may be affected and should contain a warning that small outages or degraded service may occur during the maintenance window and give the times of that window. There should be only a single message about the window.
2 High availability is anything above 99.9 percent. Typically, sites will be aiming for three nines (99.9 percent) (9 hours downtime per year), four nines (99.99 percent) (1 hour per year), or five nines (99.999 percent) (5 minutes per year). Six nines (99.9999 percent) (less than 1 minute a year) is more expensive than most sites can afford.
3 Recall that n + 1 redundancy is used for services such that any one component can fail without bringing the service down, n + 2 means any two components can fail, and so on.
• Planning and doing as much as possible beforehand is critical because the maintenance windows should be as short as possible.
• There must be a flight director who coordinates the scheduling and tracks the progress of the tasks. If the windows are weekly, this may be a quarter-time or half-time job.
• Each item should have a change proposal. The change proposal should list the redundant systems and include a test to verify that the redundant systems have kicked in and that service is still available.
• They need to tightly plan the maintenance window. Maintenance windows are typically smaller in scope and shorter in time. Items scheduled by different people for a given window should not have dependencies on each other. There must be a small master plan that shows who has what tasks and their completion times.
• The flight director must be very strict about the deadlines for change completion.
• Everything must be fully tested before it is declared complete.
• Remote KVM and console access benefit all sites.
• The SAs need to have a strong presence when the site approaches and enters its busy time. They need to be prepared to deal quickly with any problems that may arise as a result of the maintenance.
• A brief postmortem the next day to discuss any remaining problems or issues that arose is useful.
• It is not necessary to disable access. Services should remain available.
• It is not necessary to have a full shutdown/boot list, because a full shutdown/reboot does not happen. However, there should be a dependency list if there are any dependencies between machines.⁴
4 Usually, high-availability sites avoid dependencies between machines as much as possible.
• Because ISPs and e-commerce sites do not have on-site customers, being physically visible the morning after is irrelevant. However, being available and responsive is still important. Find ways to increase your visibility and ensure excellent responsiveness. Advertise what the change was, how to report problems, and so on. Maintain a blog, or put banner advertisements on your internal web sites advertising the newest features.
• A post-maintenance communication is usually not required, unless customers must be informed about remaining problems. Customers don’t want to be bombarded with email from their service providers.
• The most important difference is that the redundant architecture of the site must be taken into account during the maintenance window planning. The flight director needs to make sure that none of the scheduled work can take the service down. The SAs need to make sure that they know how long failover takes to happen. For example, how long does the routing system take to reach convergence when one of the routers goes down or comes back up? If redundancy is implemented within a single machine, the SA needs to know how to work on one part of the machine while keeping the system operating normally.
• Availability of the service as a whole must be closely monitored during the maintenance window. There should be a plan for how to deal with any failure that causes an outage as a result of temporary lack of redundancy.
20.3 Conclusion
The basics for successfully executing a planned maintenance window fall into three categories: preparation, execution, and post-maintenance customer care. The advance preparation for a maintenance window has the most effect on whether it will run smoothly. Planning and doing as much as possible in advance are key. The group needs to appoint an appropriate flight director for each maintenance window. Change proposals should be submitted to the flight director, who uses them to build a master plan and set completion deadlines for each task.
During the maintenance window, remote access should be disabled, and infrastructure, such as console servers and radios, should be in place. The plan needs to be executed with as few hiccups as possible. The timetable must be adhered to rigidly; it must finish with complete system testing.
Good customer care after the maintenance window is important to its success. Communication about the window and a visible presence the morning after are key.
Integrating a mentoring process, saving historical data and doing trend analysis for better estimates, providing continuity, and providing limited availability to groups that request it can be incorporated at a later date. Proper planning, good back-out plans, strict adherence to deadlines for change completion, and comprehensive testing should avert all but some minor disasters. Some tasks may not be completed, and those changes will need to be backed out. In our experience, a well-planned, properly executed maintenance window never leads to a complete disaster. A badly planned or poorly executed one could, however. These kinds of massive outages are difficult and risky. We hope that you will find the planning techniques in this chapter useful.
Exercises
1. Read the paper on how the AT&T/Lucent network was split (Limoncelli et al. 1997), and consider how having a weekend maintenance window would have changed the process. What parts of that project would have been performed in advance as preparatory work, what parts would have been easier, and what parts would have been more difficult? Evaluate the risks in your approach.
2. A case study in Section 20.1.1 describes the scheduling process for a particular software company. What are the dates and events that you would need to avoid for a maintenance window in your company? Try to derive a list of dates, approximately 3 months apart, that would work for your company.
3. Consider the SAs in your company. Who do you think would make good flight directors and why?
4. What tasks or projects can you think of at your site that would be appropriate for a maintenance window? Create and fill in a change-request form. What preparation could you do for this change in advance of the maintenance window?
5. Section 20.1.6 discusses disabling access to the site. What specific tasks would need to be performed at your site, and how would you re-enable that access?
6. Section 20.1.7.1 discusses the shutdown and reboot sequence. Build an appropriate list for your site. If you have permission, test it.
7. Section 20.2.3 discusses providing limited availability for some people to be able to continue working. What groups are likely to require 24/7 availability? What changes would you need to make to your network and services infrastructure to keep services available to each of those groups?
8. Research the flight operations methodologies used at NASA. Relate what you learned to the practice of system administration.
Chapter 21
Centralization and Decentralization
This chapter seeks to help an SA decide how much centralization is appropriate for a particular site or service and how to transition between more and less centralization.
Centralization means having one focus of control. One might have two DNS servers in every department of a company, but they all might be controlled by a single entity. Alternatively, decentralized systems distribute control to many parts. In our DNS example, each of those departments might maintain and control its own DNS server, being responsible for maintaining the skill set to stay on top of the technology as it changes, to architect the systems as it sees fit, and to monitor the service. Centralization refers to nontechnical control also. Companies can structure IT in a centralized or decentralized manner.
Centralization is an attempt to improve efficiency by taking advantage of potential economies of scale: improving the average; it may also improve reliability by minimizing opportunities for error. Decentralization is an attempt to improve speed and flexibility by reorganizing to increase local control and execution of a service: improving the best case. Neither is always better, and neither is always possible in the purest sense. When each is done well, it can also realize the benefits of the other: odd paradox, isn’t it?
Decentralization means breaking away from the prevailing hegemony, revolting against the frustrating bureaucratic ways of old. Traditionally, it means someone has become so frustrated with a centralized service that “do it yourself” has the potential of being better. In the modern environment decentralization is often a deliberate response to the faster pace of business and to customer expectations of increased autonomy.
Centralization means pulling groups together to create order and enforce process. It is cooperation for the greater good. It is a leveling process. It seeks to remove the frustrating waste of money on duplicate systems, extra work, and manual processes. New technology paradigms often bring opportunities for centralization. For example, although it may make sense for each department to have slightly different processes for handling paper forms, no one department could fund building a pervasive web-based forms system. Therefore, a disruptive technology, such as the web, creates an opportunity to replace many old systems with a single, more efficient, centralized system. Conversely, standards-based web technology can enable a high degree of local autonomy under the aegis of a centralized system, such as delegated administration.
21.1 The Basics
At large companies in particular, it seems as if every couple of years, management decides to centralize everything that is decentralized and vice versa. Smaller organizations encounter similar changes driven by mergers or opening of new campuses or field offices. In this section, we discuss guiding principles you should consider before making such broad changes. We then discuss some services that are good candidates for centralization and decentralization.
21.1.1 Guiding Principles
There are several guiding principles related to centralization and decentralization. They are similar to what anyone making large, structural changes should consider.
• Problem-Solving: Know what specific problem you are trying to solve. Clearly define what problem you are trying to fix. “Reliability is inconsistent because each division has different brands of hardware.” “Services break when network connections to sales offices are down.” Again, write down the specific problem or problems and communicate these to your team. Use this list as a reality check later in the project to make sure that you haven’t lost sight of the goal. If you are not solving a specific problem, or responding to a direct management request, stop right here. Why are you about to make these changes? Are you sure this is a real priority?
• Motivation: Understand your motivation for making the change. Maybe you are seeking to save money, increase speed, or become more flexible. Maybe your reasons are political: You are protecting your empire or your boss, making your group look good, or putting someone’s personal business philosophy into action. Maybe you are doing it simply to make your own life easier; that’s valid too. Write down your motivation and remind yourself of it from time to time to verify that you haven’t strayed.
• Experience Counts: Use your best judgment. Sometimes, you must use experience and a hunch rather than specific scientific measurements. For example, when centralizing email servers, our experience has produced these rules of thumb: Small companies (five departments with 100 people) tend to need one email server. Larger companies can survive with an email server per thousands of people, especially if there is one large headquarters and many smaller sales offices. When the company grows to the point of having more than one site, each site tends to require its own email server but is unlikely to require its own Internet gateway. Extremely large or geographically diverse companies start to require multiple Internet gateways at different locations.
• Involvement: Listen to the customers’ concerns. Consult with customers to understand their expectations: Retain the good aspects and fix the bad ones. Focus on the qualities that they mention, not the implementation. People might say that they like the fact that “Karen was always right there when we needed new desktop PCs installed.” That is an implementation. The new system might not include on-site personnel. What should be retained is that the new service has to be responsive, as responsive as having Karen standing right there. That may mean the use of overnight delivery services or preconfigured and “ready to eat” systems¹ stashed in the building, or whatever it takes to meet that expectation. Alternatively, you must do expectation setting if the new system is not going to deliver on old expectations. Maybe people will have to plan ahead and ask for workstations a day in advance.
• Be Realistic: Be circumspect about unrealistic promises. You should thoroughly investigate any claims that you will save money by decentralizing, add flexibility by centralizing, or have an entirely new system without pain: The opposite is usually the case. If a vendor promises that a new product will perform miracles but requires you to centralize (or decentralize) how something is currently organized, maybe the benefits come from the organizational change, not the product!
¹ Do not attempt to eat a computer. “Ready to eat” systems are hot spares that will be fully functional when powered up: absolutely no configuration files to modify and so on.
• Balance: Centralize as much as makes sense for today, with an eye
toward the future. You must find the balance between centralization and decentralization. There are time considerations: Building the perfect system will take forever. You must set realistic goals yet keep an eye to future needs. For example, in 6 months, the new system will be complete and then will be expected to process a million widgets per day. However, a different architecture will be required to process 2 million widgets per day, the rate that will be needed a year later, and will require considerably more development time. You must balance the advantage of having a new system in 6 months, along with the problem of needing to start building the next-generation system immediately, against the advantage of waiting longer for a system that will not need to be replaced so soon.
• Access: The more centralized something is, the more likely it is that some
customers will need a special feature or some kind of customization. An old business proverb is: “All of our customers are the same: They each have unique requirements.” One size never fits all. You can’t do a reasonable job of centralizing without being flexible; you’ll doom the project if you try. Instead, look for a small number of models. Some customers require autonomy. Some may require performing their own updates, which means creating a system of access control so that customers can modify their own segments without affecting others.
• No Pressure: It’s like rolling out any new service. Although more emotional impact may be involved than with other changes, both centralization and decentralization projects have issues similar to building a new service. That said, new services require careful coordination, planning, and understanding of customer needs to succeed.
• 110 Percent: You have only one chance to make a good first impression.
A new system is never trusted until proven a success, and the first experience with the new system will set the mood for all future customer interactions. Get it right the first time, even if it means spending more money up front or taking extra time for testing. Choose test customers carefully, making sure they trust you to fix any bugs found while testing, and won’t gossip about it at the coffee machine. Provide superior service the first month, and people will forgive later mistakes. Mess up right at the start, and rebuilding a reputation is nearly impossible.
• Veto Power: Listen to the customers, but remember that management has the control. The organizational structure can influence the level of centralization that is appropriate or possible. The largest impediment to centralization often is management decisions or politics. Lack of trust makes it difficult to centralize. If the SA team has not proved itself, management may be unwilling to support the large change. Management may not be willing to fund the changes, which usually indicates that the change is not important to them. For example, if the company hasn’t funded a central infrastructure group, SAs will end up decentralized. It may be better to have a central infrastructure group; lacking management support, however, the fallback is to have each group make the best subinfrastructure it can, ideally coordinating formally or informally to set standards, purchase in bulk, and so on. Either way, the end goal is to provide excellent service to your customers.
21.1.2 Candidates for Centralization
SAs continually find new opportunities to centralize processes and services. Centralization does not innately improve efficiency. It brings about the opportunity to introduce new economies of scale to a process. What improves efficiency is standardization, which is usually a by-product of centralization. The two go hand in hand.
The cost savings of centralization come from the presumption that there will be less overhead than the sum of the individual overheads of each decentralized item. Centralization can create a simpler, easier-to-manage architecture. One SA can manage a lot more machines if the processes for each are the same.
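The presumption about overhead can be checked with simple arithmetic before a project starts. The following Python sketch is only an illustration: the service names, dollar figures, SA hours, and hourly rate are all invented assumptions, not measurements from any real site. It sums the per-instance overheads of three departmental services and compares the total with a single centralized instance.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    name: str
    hardware: float      # hypothetical annual hardware and maintenance cost
    admin_hours: float   # hypothetical annual SA hours spent on this instance


def annual_cost(instances, hourly_rate):
    """Sum hardware cost plus SA time for every instance."""
    return sum(i.hardware + i.admin_hours * hourly_rate for i in instances)


# All figures below are invented for illustration.
HOURLY_RATE = 75.0
decentralized = [
    ServiceInstance("dept-a-web", 4000.0, 120.0),
    ServiceInstance("dept-b-web", 4500.0, 150.0),
    ServiceInstance("dept-c-web", 3800.0, 110.0),
]
centralized = [ServiceInstance("central-web", 9000.0, 160.0)]

decentralized_total = annual_cost(decentralized, HOURLY_RATE)
centralized_total = annual_cost(centralized, HOURLY_RATE)
savings = decentralized_total - centralized_total
print(f"decentralized: {decentralized_total:.2f}")
print(f"centralized:   {centralized_total:.2f}")
print(f"presumed savings: {savings:.2f}")
```

If the centralized figure is not clearly lower once SA time is included, the presumption does not hold for that service, and the motivation for the project deserves another look.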
To the previous owners of the service being centralized, centralization is about giving up control. Divisions that previously provided their own service now have to rely on a centralized group for service. SAs who previously did tasks themselves, their own way, now have to make requests of someone else who has his or her own way to do things. The SAs will want to know whether the new service provider can do things better.
Before taking control away from a previous SA or customer, ask yourself what the customer’s psychological response will be. Will there be attempts to sabotage the effort? How can you convince people that the new system will be better than the old system? How will damage control and rumor control be accomplished? What’s the best way to make a good first impression?
The best way to succeed in a centralization program is to pick the right services for centralization. Here are some good candidates.
• Distributed Systems: Management of distributed systems. Historically, each department of an organization configured and ran its own web servers. As the technology got more sophisticated, less customization of each web server was required. Eventually, there was no reason not to have each web server configured exactly the same way, and the need for rapid updates of new binaries was becoming a security issue. The motivation was to save money by not requiring each department to have a high level of web server expertise. The problem being fixed was the lack of similar configurations on each server. A system was designed to maintain a central configuration repository that would update each of the servers in a controlled and secure manner. The customers affected were the departmental SAs, who were eager to give up a task that they didn’t always understand. By centralizing web services, the organization could also afford to have one or more SAs become better trained in that particular service, to provide better in-house customer support.
• Consolidation: Consolidate services onto fewer hosts. In the past, for reliability’s sake, one service was put on each physical host. However, as technology progresses, it can be beneficial to have many services on one machine. The motivation is to decrease cost. The problem being fixed is that every host has overhead costs, such as power, cooling, administration, machine room space, and maintenance contracts. Usually, a single, more powerful machine costs less to operate than several smaller hosts. As services are consolidated, care must be taken to group customers with similar needs.
Since the late 1990s, storage consolidation has been a big buzzword.
By building one large storage-area network (SAN) that each server accesses, there is less “stranded storage” (partially full disks) on each server. Often, storage consolidation involves decommissioning older, slower, or soon-to-fail disks and moving the data onto the SAN, providing better performance and reliability.
Server virtualization, a more recent trend, involves using virtual hosts to save hardware and license costs. For example, financial institutions used to have expensive servers and multiple backup machines to run a calculation at a particular time of the day, such as making end-of-day transactions after the stock market closes. Instead, a virtual machine can be spun up shortly before the market closes; the machine runs its tasks, then spins down. Once it is done, the server is free to run other virtual machines that do other periodic tasks.
By using a global file system, such as a SAN, a virtualization cluster can be built. Since the virtual machine images (the data stored on disk that defines the state of a virtual machine) can be accessed from many hardware servers, advanced virtualization management software can migrate virtual machines between physical machines with almost unnoticeable switch-over time. Many times, sites realize that they need many machines, each performing a particular function, none of which requires enough CPU horsepower to justify the cost of dedicated hardware. Instead, the virtual machines can share a farm, or cluster, of physical machines, as needed. Since virtual machines can migrate between different hardware nodes, workload can be rebalanced. Virtual machines can be moved off an overloaded physical machine. Maintenance becomes easier too. If one physical machine is showing signs of hardware problems, virtual machines can be migrated off it onto a spare machine with no loss of service; the physical machine can then be repaired or upgraded.
• Administration: System administration. When redesigning your organization (see Chapter 30), your motivation may be to reduce cost, improve speed, or provide services consistently throughout the enterprise. The problem may be the extra cost of having technical management for each team or that the distributed model resulted in some divisions’ having poorer service than others. Centralizing the SA team can fix these problems.
To provide customization and the “warm fuzzies” of personal attention, subteams might focus on particular customer segments. An excellent example of this is a large hardware company’s team of “CAD ambassadors,” an SA group that specializes in cross-departmental support of CAD/CAM tools throughout the company. However, a common mistake is to take this to an extreme. We’ve seen at least one amazingly huge company that centralized to the point that “customer liaisons” were hired to maintain a relationship with the customer groups, and the customers hired liaisons to the centralized SA staff. Soon, these liaisons numbered more than 100. At that point, the savings in reduced overhead were surely diminished. A regular reminder of, and dedication to, the original motivation might have prevented that problem.
• Specialization: Expertise. In decentralized organizations, a few of the
groups are likely to have more expertise in particular areas than other groups do. This is fine if they maintain casual relationships and help one another. However, certain expertise can become critical to business, and therefore an informal arrangement becomes an unacceptable business risk. In that case, it may make sense to consolidate that expertise into one group. The motivation is to ensure that all divisions have access to a minimum level of expertise in one specific area or areas. The problem is that the lack of this expertise causes uneven service levels, for example, if one division had unreliable DNS but others didn’t or if one division had superior Internet email service, whereas others were still using UUCP-style addresses. (If you are too young to remember UUCP-style addresses, just count your blessings.) That would be intolerable!
Establishing a centralized group for one particular service can bring uniformity and improve the average across the entire company. Some examples of this include such highly specialized skills as maintaining an Internet gateway, a software depot, various security issues (VPN service, intrusion detection, security-hole scanning, and so on), DNS, and email service. A common pattern at larger firms is to create a “Core Services” or “Infrastructure” team to consolidate expertise in these areas and provide infrastructure across the organization.
• Left Hand, Right Hand: Infrastructure decisions. The creation of infrastructure and platform standards can be done centrally. This is a subcase of centralizing expertise. The motivation at one company was that infrastructure costs were high and interoperability between divisions was low. There were many specific problems to be solved. Every division had a team of people researching new technologies and making decisions independently. Each team’s research duplicated the effort of the others. Volume-purchasing contracts could not be signed, because each individual division was too small to qualify. Repair costs were high because so many different spare parts had to be purchased. When divisions did make compatible purchasing decisions, multiple spare parts were still being purchased because there was no coordination or cooperation. The solution was to reduce the duplication in effort by having one standards committee for infrastructure and platform standards. Previously, new technology was often adopted in pockets around the company because some divisions were less averse to risk; these became
as preferred pricing, when they deal with a centralized purchasing group that reflects the true volume of orders from that one source. Sometimes, money can be saved through centralization. Other times, it is better to use the savings to invest in better equipment.
• Commodity: If it has become a commodity, consider centralization. A good time to consider centralizing something is when the technology path it has taken has made it a commodity. Network printing, file service, email servers, and even workstation maintenance used to be unique, rare technologies. However, now these things are commodities and excellent candidates for centralization.
Case Study: Big, Honkin’ File Servers
Tom’s customers and even fellow SAs fought long and hard against the concept of large, centralized file servers. The customers complained about the loss of control and produced, in Tom’s opinion, ill-conceived pricing models that demonstrated that the old UNIX-based file servers were the better way to go. What they were really fighting was the notion that network file service was no longer very special; it had become a commodity and therefore an excellent candidate for centralization. Eventually, an apples-to-apples comparison was done. This included a total cost-of-ownership model that included the SA time and energy to maintain the old-style systems. The value of some unique features of the dedicated file servers, such as file system snapshot, was difficult to quantify. However, even when the cost model showed the systems to cost about the same per gigabyte of usable storage, the dedicated file servers had an advantage over the old systems: consistency and support. The old systems were a mishmash of various manufacturers for the host; for the RAID controllers; and for the disk drives, cables, network interfaces, and, in some cases, even the racks they sat in! Each of these items usually required a level of expertise and training to maintain efficiently, and no single vendor would support these Frankenstein monsters. Usually, when the SA who purchased a particular RAID device left the group, the expertise left with the person. Standardizing on a particular product resulted in a higher level of service because the savings were used to purchase top-of-the-line systems that had fewer problems than inexpensive competitors. Also, having a single phone number to call for support was a blessing.
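An apples-to-apples comparison of the kind described in this case study can be sketched in a few lines of Python. This is a minimal illustration, not the model Tom actually used: every figure, the three-year lifetime, and the function name cost_per_usable_gb are hypothetical. The point is only that SA time belongs in the total alongside purchase price and support contracts.

```python
def cost_per_usable_gb(purchase, annual_support, sa_hours_per_year,
                       hourly_rate, years, usable_gb):
    """Total cost of ownership over the system's life, per usable gigabyte."""
    yearly = annual_support + sa_hours_per_year * hourly_rate
    return (purchase + years * yearly) / usable_gb


# Hypothetical numbers chosen only to illustrate the comparison.
old_style = cost_per_usable_gb(
    purchase=20000.0, annual_support=1000.0, sa_hours_per_year=200.0,
    hourly_rate=75.0, years=3, usable_gb=2000.0)
dedicated = cost_per_usable_gb(
    purchase=50000.0, annual_support=4000.0, sa_hours_per_year=40.0,
    hourly_rate=75.0, years=3, usable_gb=2000.0)

print(f"old-style systems:      {old_style:.2f} per usable GB")
print(f"dedicated file servers: {dedicated:.2f} per usable GB")
```

With these invented inputs, the two options land close together per gigabyte, which mirrors the situation in the case study: once cost is roughly equal, qualities that the model cannot price, such as consistency and single-vendor support, decide the matter.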
Printing is another commodity service that has many opportunities for centralization, both in the design of the service itself and when purchasing supplies. Section 24.1.1 provides more examples.
21.1.3 Candidates for Decentralization
Decentralization does not automatically improve response times. When done correctly, it creates an opportunity to do so. Even when the new process is less efficient or is inefficient in different ways, people may be satisfied simply to be in control. We’ve found that people are more tolerant of a mediocre process if they feel they control it.
Decentralization often trades cost efficiency for something even more valuable. In these examples, we decentralize to democratize control, gain fault tolerance, acquire the ability to have a customized solution, or remove ourselves from clue-lacking central authorities. (“They’re idiots, but they’re our division’s idiots.”) One must seek to retain what was good about the old system while fixing what was bad.
Decentralization democratizes control. The new people gaining control may require training; this includes both the customers and the SAs. The goal may be autonomy, the ability to control one’s own destiny, or the ability to be functional when disconnected from the network. This latter feature is also referred to as compartmentalization, the ability to achieve different reliability levels for different segments of the community. Here are some good candidates for decentralization.
• Fault tolerance. The duplication of effort that happens with decentralization can remove single points of failure. A company with growing field offices required all employees to read email off servers located in the headquarters. There were numerous complaints that during network outages, people couldn’t read or even compose email, because composition required access to directory servers that were also at the headquarters. Divisions in other time zones were particularly upset that maintenance times at the headquarters were their prime working hours. The motivation was to increase reliability, in particular access during outages. The problem was that people couldn’t use email when WAN links were down. The solution was to install local LDAP caches and email servers in each of the major locations. (It was convenient and effective to also use this host for DNS, authentication, and other services.) Although mail would not be transmitted site to site during an outage, customers could access their email store, local email could be delivered, and messages that needed to be relayed to other sites would transparently queue until the WAN link recovered. This could have been a management disaster if each site was expected to have the expertise required to configure and maintain such systems or if different sites created different standards. Instead, it was a big success because management was centralized. Each site received preconfigured hardware and software that simply needed to be plugged in. Updates were done via a centralized system. Even backups could be performed remotely, if required.
• Customization. Sometimes, certain customer groups have a business
requirement to be on the bleeding edge of technology, whereas others require stability. A research group required early access to technology, usually before it was approved by corporate infrastructure standards committees. The motivation was largely political because the group maintained a certain status by being ahead of others within the company, as well as in the industry. There was also business motivation: The group’s projects were far-reaching and futuristic, and the group needed to “live in the future” if it was going to build systems that would work well in the networks of the future. The problem was that the group was being prevented from deviating from corporate standards. The solution was to establish a group-specific SA team. The team members participated in the committees that created the corporate standards and were able to provide valuable feedback because they had experience with technologies that the remainder of the committee was just considering. Providing this advice also maintained the group’s elite status. Their participation also guaranteed that they would be able to establish interoperability guidelines between their “rogue” systems and the corporate standards. This local SA team was able to meet the local requirements that were different from those of the rest of the company. They could provide special features and select a different balance of stability versus cutting-edge features.
• Meeting your customers’ needs. Sometimes, the centralized services group may be unable to meet the demands placed on it by some of the departments in the company. Before abandoning the centralized service, try to understand the reason for the failures of the central group to meet your customers’ needs. Try to work with the customers to find a solution that works for both SAs and customers, such as the one described earlier. Your ultimate responsibility is to meet your customers’ needs and to make them successful. If you cannot make the relationship with the central group work, your company may have to decentralize the necessary services so that you can meet your group’s needs. Make sure that you have management support to make this move; be aware of the pitfalls of decentralization, and try to avoid them. Remember why you moved to a centralized model, and periodically reevaluate whether it still makes sense.
Advocates of decentralization sometimes argue that centralized services are single points of failure. However, when centralization is done right, the savings can be reinvested into technology that increases fault tolerance. Often, the result of decentralization is many single points of failure spread all over the company; redundancy is reduced. For example, when individual groups build their own LANs, they might have the training only to set up a very basic, simple LAN infrastructure. The people doing the work have other responsibilities that keep them from being able to become experts in modern LAN technology. When LAN deployment is centralized, people who specialize in networking and do it full-time are in charge. They have the time to deploy redundancy protocols and proactive monitoring that will enable them to fix problems within a defined SLA. The savings from volume discounts often will pay for much of the redundancy. The increased reliability from a professional SLA-driven design and operation process benefits the company in many ways.
Another point in support of decentralization is that there are benefits to having diversity in your systems. For example, different OSs have different security problems. It can be beneficial to have only a fraction of your systems taken out by a virus. A major software company had a highly publicized DNS outage because all of its DNS servers were running the same OS and the same release of DNS software. If the company had used a variety of OSs and DNS software, one of them might not have been susceptible to the security hole that was being leveraged. If you are the centralized provider, accept that this may sometimes be necessary.
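The fault-tolerance example in this list, in which each site's mail host delivers local mail during a WAN outage and queues everything bound for other sites, follows a store-and-forward pattern. The Python sketch below is a toy model of that behavior under our own assumptions; the class SiteMailRelay, its fields, and the site and user names are hypothetical and do not correspond to any real mail server's interface.

```python
from collections import deque


class SiteMailRelay:
    """Toy model of one site's mail host (hypothetical, not a real MTA)."""

    def __init__(self, site):
        self.site = site
        self.wan_up = True
        self.mailboxes = {}       # local user -> list of message bodies
        self.outbound = deque()   # messages waiting for the WAN link
        self.relayed = []         # stand-in for mail handed to other sites

    def submit(self, recipient, recipient_site, body):
        """Accept a message: deliver it locally, relay it, or queue it."""
        if recipient_site == self.site:
            self.mailboxes.setdefault(recipient, []).append(body)
            return "delivered-local"
        self.outbound.append((recipient, recipient_site, body))
        return "relayed" if self.flush() else "queued"

    def flush(self):
        """Drain the queue while the WAN is up; return how many were sent."""
        sent = 0
        while self.wan_up and self.outbound:
            self.relayed.append(self.outbound.popleft())
            sent += 1
        return sent


relay = SiteMailRelay("tokyo")
relay.wan_up = False                                  # simulate a WAN outage
status_local = relay.submit("akira", "tokyo", "lunch?")
status_remote = relay.submit("pat", "hq", "Q3 report")
relay.wan_up = True                                   # the link recovers
drained = relay.flush()
print(status_local, status_remote, drained)
```

Local delivery never touches the WAN, so it keeps working during the outage, and the remote message waits in the queue until flush runs after the link returns. In the real deployment, the queuing was handled by the mail software itself, and the centralized part was the configuration pushed to every site.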
21.2 The Icing
Centralization and decentralization can be major overhauls. If you are asking people to accept the pain of converting to a new system, you should be proposing a system that is not only cheaper but also better for them.
There is an old adage that often appears on buttons and bumper stickers: “Cheap, fast, good. Pick two.” This pointed statement reveals a time-tested truism. In general, you must sacrifice one of those three items to achieve the other two. In fact, if someone tries to claim that they provide all three simultaneously, look under the tablecloth of their slick demo and check for hidden wires. This section describes some examples that achieved or promised to achieve all three. Some, like the purchasing example in Section 21.2.1, were a great overall success. Others had mixed results.
21.2.1 Consolidate Purchasing
In this example, centralization resulted in better products more quickly delivered for less money. An SA group was able to position itself to approve all computer-related purchasing for its division. In fact, the group was able to have the purchasing agent who handled such purchases moved into its group so they could work closely on contracts, maintenance agreements, and so on.
As a result, the group was able to monitor what was being purchased. Particular purchases, such as servers, would alert the SAs to contact customers to find out what special requirements the server would create: Did it need machine room space, special networking, or configuration? This solved a problem whereby customers would blindside the SAs with requests for major projects. Now the SAs could contact them and schedule these large projects.
As a side benefit, the group was able to do a better job of asset management. Because all purchasing was done through one system, there was one place where serial numbers of new equipment were captured. Previous attempts at tracking assets had failed because they depended on such data being collected by individuals who had other priorities.
Centralized purchasing’s biggest benefit was the fact that the SAs now had knowledge of what was being purchased. They noticed certain products being purchased often and arranged volume purchasing deals for them. Certain software packages were preordered in bulk. Imagine the customers’ surprise when they tried to purchase a particular package and instead received a note saying that their department would be billed for one-fiftieth of a 50-license package purchased earlier that year and were given a password they could use to download the software package and manuals. That certainly beat waiting for it to be delivered!
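The chargeback described here, billing a department one-fiftieth of a 50-license package, is easy to track in code. The sketch below is a minimal illustration under our own assumptions: the LicensePool class, the package price, and the department names are all hypothetical, and a real purchasing system would record far more detail.

```python
from decimal import Decimal, ROUND_HALF_UP


class LicensePool:
    """Tracks seats in a bulk license purchase and bills per seat."""

    def __init__(self, package_price, seats):
        self.seats_left = seats
        # Per-seat share of the package price, rounded to whole cents.
        self.seat_price = (Decimal(package_price) / seats).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP)
        self.charges = {}   # department -> total billed so far

    def allocate(self, department):
        """Hand out one seat and bill the department for its share."""
        if self.seats_left == 0:
            raise RuntimeError("pool exhausted; time to reorder")
        self.seats_left -= 1
        billed = self.charges.get(department, Decimal("0"))
        self.charges[department] = billed + self.seat_price
        return self.seat_price


# Hypothetical 50-license package; the price is invented for illustration.
pool = LicensePool("12500.00", 50)
first_charge = pool.allocate("engineering")
pool.allocate("engineering")
pool.allocate("sales")
print(first_charge, pool.seats_left, pool.charges["engineering"])
```

Decimal is used rather than floating point so that per-seat amounts stay exact to the cent. Watching seats_left also tells the purchasing group when to arrange the next bulk order.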
The most pervasive savings came from centralizing the PC purchasing process. Previously, customers had ordered their own PCs and spent days looking through catalogs, selecting each individual component to their particular needs. The result was that the PC repair center had to handle many types of motherboards, cards, and software drivers. Although a customer might take pride in saving $10 by selecting a nonstandard video card, he or she would not appreciate the cost of a technician at the PC repair department spending half a day to get it working. With the repair group unable to stock such a wide variety of spare parts, the customers were extremely unhappy with having to wait weeks for replacement parts to arrive.
The average time for a PC to be delivered under the old system had been 6 weeks. It would take a week to determine what was to be ordered and push it through the purchasing process. The vendor would spend a couple of weeks building the PC to the specifications and delivering it. Finally, another week would pass before the SAs had time to load the OS, with possibly an additional week if there were difficulties. A company cannot be fast paced if every PC requires more than a month to be delivered. To make matters worse, new employees had to wait weeks before they had a PC. This was a morale killer and reflected badly on the company. The temporary solution was that management would beg the SAs to cobble together a PC out of spare parts to be used until the person’s real PC was delivered. Thus, twice as much work was being done because two complete PC deliveries were required.
The centralized purchasing group was able to solve these problems. The group found that by standardizing the PC configuration, a volume discount could be used to reduce cost. In fact, the group was able to negotiate a good price even though it had negotiated four configurations: server, desktop, ultralight laptop, and ultrapowerful laptop. Fearing that people would still opt for a custom configuration, the group used some of the savings to ensure that the standard configuration would be more powerful, with better audio and video than any previously purchased custom PC. Even if the group only matched the old price, the savings to the PC repair department would be considerable. The ability to stock spare parts would reduce the productivity lost by customers waiting for repairs.
The purchasing group realized that it wouldn’t be able to push a standard on the customers, who could simply opt for a fully custom solution if the standard configuration wasn’t to their liking. Therefore, purchasing made sure to pull people to their standard by making it amazingly good. With the volume discounts, the price was so low and quality so high that most of the remaining customization options would result in a less powerful machine for more money. How could anyone not choose the standard? Using pull instead of push is using the carrot, not the stick, to get a mule to move forward.
One more benefit was achieved. Because the flow of new machines being purchased was relatively constant, the purchasing group was able to preorder batches of machines that would be preloaded with their OS by the SAs. New employees would have a top-notch PC installed on their desk the day before they arrived. They would be productive starting on the very first day.
Ordering time for PCs was reduced from 6 weeks to 6 minutes. When faced with the choice between ordering the exact PC they wanted and waiting 6 weeks or waiting 6 minutes and getting a PC that was often more powerful than they required, for less money, customers found it difficult to reject the offer.
Any company that is rapidly growing, purchasing a lot of computer-related items, or deploying PCs should consider these techniques. Other advice on rapid PC deployment can be found in Chapter 3. More information about how PC vendors price their product lines is in Section 4.1.3.
21.2.2 Outsourcing
Outsourcing is often a form of centralization. Outsourcing is a process by which an external company is paid to provide certain technical functions for a company. Some commonly outsourced tasks are running the corporate PC helpdesk, remote access, WAN and LAN services, and computer deployment operations. Some specific tasks, such as building the infrastructure to support a particular application, whether a web site, an e-commerce site, or an enterprise resource planning (ERP) system, are outsourced, though vendors probably refer to that process as “professional services” instead.
The process of outsourcing usually involves centralization to reduce redundant services and to standardize processes. Outsourcing can save money by eliminating the political battles that were preventing such efficiencies. When executives are unable to overcome politics through good management, outsourcing can be a beneficial distraction.
Advocates emphasize that outsourcing lets a company focus on its core competency rather than on all the technological infrastructure required to support that core. Some companies become bogged down in supporting their infrastructure, to the detriment of their business goals. In that situation, outsourcing can be an appealing solution.
The key in outsourcing is to know what you want and to make sure that it is specified in the contract. The outsourcing company isn’t required to do anything that isn’t in the contract. Although the salespeople may paint an exciting picture, once the contract is signed, you should expect only what is specified in ink. This can be a particular problem when the outsourced services had been provided previously in-house.
We’ve seen three related problems with signing an outsourcing contract. Together, they create an interesting paradox. First, outsourcing to gain new technical competence means that the people you are negotiating with have more technical competence than you do. This gives the outsourcing firm the power seat in the negotiations. Second, to accurately state your requirements in the contract, you must have a good understanding of your technical needs; however, if your executive management had a good handle on what was needed and was skilled at communicating this, you wouldn’t need outsourcing. Finally, companies sometimes don’t decide to outsource until their computing infrastructure has deteriorated to the point that outsourcing is being done as an emergency measure and are thus too rushed or desperate to keep the upper hand in negotiations. These companies don’t know what they want or how to ask for it and are in too much of a panicked rush to do adequate research. As you can imagine, this spells trouble. It is worth noting that none of these situations that specifically allow for technology knowledge refer to the buying company.
You should research the outsourcing process, discuss the process with peers at other companies, and talk with customer references. Make sure that the contract specifies the entire life cycle of services—design, installation, maintenance and support, decommissioning, data integrity, and disaster recovery—as well as SLAs, penalties for not meeting performance metrics, and a process for adding and removing services from the contract. Negotiating an outsourcing contract is extremely difficult, requiring skills far more sophisticated than our introduction to negotiating (Section 32.2.1).
Some outsourcing contracts are priced below cost in order for the vendor to be considered a preferred bidder on project work; it’s on this project work that outsourcing deals make money for the supplier. Contracts usually specify what is “in scope” of the contract and casually mention a standard rate for all other “out-of-scope” work. The standard rate is often very high, and the outsource organization hopes to identify as much out-of-scope work as possible. Clients don’t usually think to send such work out to bid to receive a better rate.
There are outsourcing consultants who will lead you through the negotiating process. Be sure to retain one who has no financial ties to the outsourcing firms that you are considering.
Don’t Hide Negotiations
When one Fortune 500 company outsourced its computing support infrastructure, the executive management feared a large backlash by both the computing professionals within the company and the office workers being supported Therefore, the deal was done quickly and without consulting the people who were providing the support As a result, the company was missing key elements, such as data backups, quality metrics, and a clear service-level specification The company had no way to renegotiate the contract without incurring severe penalties When it added backups after the fact, the out-of-scope charges in the contract were huge.
Don’t negotiate an outsource contract in secret; get buy-in from the affected customers.
When you outsource anything, your job becomes quality assurance. Some people think that after outsourcing, the contract will simply take care of itself. In reality, you must now learn how to monitor SLAs to make sure that you get all the value out of the contract. It is common for items such as documentation or service/network architecture diagrams to be specified in the contract but not delivered until explicitly requested.
Critically Examine Metrics
Executives at one company were very proud of their decision to outsource when, after a year of service, the metrics indicated that calls to the helpdesk were completed, on average, in 5 minutes. This sounded good, but why were employees still complaining about the service they received? Someone thought to ask how this statistic could be true when so many calls included sending a technician to the person’s desk. A moderate percentage of calls like that would destroy such an excellent average. It turned out that the desk-side support technicians had their own queue of requests, which had their own time-to-completion metrics. A call to the helpdesk was considered closed when a ticket was passed on to the desk-side technician’s queue, thus artificially improving the helpdesk’s metrics. Always ask for a detailed explanation of any metrics you receive from a vendor, so that you can clearly relate them to SLAs.
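The distortion in this anecdote is easy to reproduce with a few numbers. The following is a minimal sketch, not anything taken from the vendor's actual reporting system: the Ticket fields, the 80/20 split of calls, and the 240-minute desk-side wait are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Ticket:
    # Minutes until the helpdesk closed the call or handed it off (hypothetical field).
    helpdesk_minutes: float
    # Extra minutes spent in the desk-side queue; 0 if no visit was needed.
    deskside_minutes: float = 0.0


def reported_average(tickets):
    """Average as reported in the anecdote: the clock stops at the handoff."""
    return mean(t.helpdesk_minutes for t in tickets)


def end_to_end_average(tickets):
    """Average as the customer experiences it: the clock stops at resolution."""
    return mean(t.helpdesk_minutes + t.deskside_minutes for t in tickets)


# Assumed mix: 8 calls solved on the phone, 2 that needed a desk-side visit.
tickets = [Ticket(5)] * 8 + [Ticket(5, 240)] * 2

print("reported average:", reported_average(tickets))
print("end-to-end average:", end_to_end_average(tickets))
```

With these assumed numbers, the reported average stays at 5 minutes while the end-to-end average is 53 minutes. Asking a vendor which of the two definitions a metric uses is one concrete way to relate it back to the SLA.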
While you are trying to get the most out of your contract, the outsourcing company is trying to do the same. If the contract is for “up to $5 million over 5 years,” you can be assured that the account executive is not going to let you spend only $4.5 million. Most outsourcing companies hold weekly meetings to determine whether they are on schedule for using the entire value of the contract as quickly as possible; they penalize their sales team for coming in “under contract.” Does the contract charge $1,000 per server per month? “How can we convince them that a new service they’ve requested needs a dedicated host rather than loading it onto an existing machine?” will be asked at every turn. Here’s the best part: If they can get you to spend all $5 million of a 5-year contract in only 4.5 years, those last 6 months usually won’t be at the discounted rate you negotiated. How can anyone predict what their IT needs will be that far out? This is the most dangerous aspect of long-term outsourcing contracts.
Make sure that your contract specifies an exit strategy. When starting a long-term contract, the outsourcing company usually retains the right to hire your current computing staff. However, the contract never says that you get them back if you decide that outsourcing isn’t for you. Many contracts fail to guarantee that your former staff will remain on-site for the duration of the contract. The company may decide to use their skills at another site! Even switching to a different outsourcing company is difficult, because the old company certainly isn’t going to hand over its employees to the competition. Make sure that the contract specifies what will happen in these situations so that you do not get trapped. Switching back to in-house service is extremely difficult. Eliminate any noncompete clauses that would prevent you from hiring back people.
Our coverage of outsourcing is admittedly centric to our experiences as SAs. Many books give other points of view. Some are general books about outsourcing (Gay and Essinger 2000, Rothery and Robertson 1995); by contrast, Williams (1998) gives a CIO’s view of the process. Mylott (1995) discusses the outsourcing process with a focus on managing the transfer of MIS duties. Group Staff Outsource (1996) has a general overview of outsourcing. Kuong (2000) discusses the specific issue of provisioning outsourced web application service provider services. Jennings and Passaro (1999) is an interesting read if you want to go into the outsourcing business yourself. Finally, Chapman and Andrade (1997) discuss how to get out of an outsourcing contract and offer an excellent sample of outsourcing horror stories. We pick up the topic of outsourcing again in Section 30.1.8.
The first edition of this book was written during the outsourcing craze of the late 1990s. We had numerous warnings about the negative prospects of outsourcing, many of which came true. Now the craze is over, but off-shoring is the new craze. Everything old is new again.
21.3 Conclusion
Centralization and decentralization are complicated topics. Neither is always the right solution. Technical issues, such as server administration, as well as nontechnical issues, such as organizational structure, can be centralized or decentralized.
Both topics are about making changes. When making such pervasive changes, we recommend that you consider these guiding principles: know what specific problem you are solving; understand your motivation for making the change; centralize as much as makes sense for today; recognize that, as in rolling out any new service, the change requires careful planning; and, most important, listen to the customers.
It is useful to learn from other people’s experiences. The USENIX LISA conference has published many case studies (Epp and Baines 1992; Ondishko 1989; Schafer 1992b; and Schwartz, Cottrell, and Dart 1994). Harlander (1994) and Miller and Morris (1996) describe useful tools and the lessons learned from using them.
Centralizing purchasing can be an excellent way to control costs, and our example showed that it can be done not by preventing people from getting what they want, but by helping them make purchases in a more cost-effective manner.
We ended with a discussion of outsourcing. Outsourcing can be a major force for centralization and will be a large part of system administration for a very long time, even under different names.
❖ Centralization Rules of Thumb Every site is different, but we have found that, as an informal rule of thumb, centralization of the following services is preferred once a company grows large enough to have multiple divisions:
• Storage within a data center
• Web services with external access
• IP address allocation and DNS management
4. In Section 21.1.3, we describe decentralizing email servers to achieve better reliability. How would you construct a similar architecture for print servers?
5. Describe a small centralization project that would improve your current system.
6. Share your favorite outsourcing horror story.
Part IV
Providing Services
Chapter 22
Service Monitoring
Monitoring is an important component of providing a reliable, professional service. The two primary types of monitoring are real-time monitoring and historical monitoring. Each has a very different purpose. As discussed in Section 5.1.13, monitoring is a basic component of building a service and meeting its expected or required service levels.
“If you can’t measure it, you can’t manage it.” In the field of system administration, that useful business axiom becomes: “If you aren’t monitoring it, you aren’t managing it.”
Monitoring is essential for any well-run site but is a project that can keep increasing in scope. This chapter should help you anticipate and prepare for that. We look at what the basics of a monitoring system are and then discuss the numerous ways that you can improve your monitoring system.
For some sites, such as sites providing a service over the Internet, comprehensive monitoring is a business requirement. These sites need to monitor everything to make sure that they don’t lose revenue because of an outage that goes unnoticed. E-commerce sites will probably need to implement everything presented in this chapter.
22.1 The Basics
Systems monitoring can be used to detect and fix problems, identify the source of problems, predict and avoid future problems, and provide data on SAs’ achievements. The two primary ways to monitor systems are to (1) gather historical data related to availability and usage and (2) perform real-time monitoring to ensure that SAs are notified of failures.
Historical monitoring is used for recording long-term uptime, usage, and performance statistics. This has two components: collecting the data and viewing the data. The results of historical monitoring are conclusions: “The web service was up 99.99 percent of the time last year, up from the previous year’s 99.9 percent statistic.” Utilization data is used for capacity planning. For example, you might view a graph of bandwidth utilization gathered for the past year for an Internet connection. The graph might visually depict a growth rate indicating that the pipe will be full in 4 months. Cricket and Orca are commonly used historical monitoring tools.
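The "pipe will be full in 4 months" reading of a utilization graph can also be estimated numerically. The sketch below fits a straight line to monthly peak utilization by ordinary least squares and extrapolates to the link capacity. The function name, the sample figures, and the assumption of linear growth are ours for illustration; real traffic may grow in other ways, so treat the result as a rough planning aid.

```python
def months_until_full(monthly_peak_mbps, capacity_mbps):
    """Estimate months remaining before a link reaches capacity.

    Fits y = intercept + slope * month by least squares, then solves for the
    month at which y equals capacity. Returns None if usage is not growing.
    """
    n = len(monthly_peak_mbps)
    if n < 2:
        raise ValueError("need at least two months of data")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_peak_mbps) / n
    slope = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_peak_mbps)
    ) / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None  # flat or shrinking usage never fills the pipe
    intercept = y_mean - slope * x_mean
    full_at_month = (capacity_mbps - intercept) / slope
    # Months are counted from the most recent sample (index n - 1).
    return max(0.0, full_at_month - (n - 1))


# Hypothetical data: a year of monthly peaks growing 5 Mbps per month
# on a link assumed to carry 105 Mbps.
history = [30 + 5 * month for month in range(12)]
print(months_until_full(history, capacity_mbps=105))
```

With this made-up history, the last sample is 85 Mbps and growth is 5 Mbps per month, so the estimate is 4 months, matching the kind of conclusion the graph in the text would suggest.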
Real-time monitoring alerts the SA team of a failure as soon as it happens and has two components: a monitoring component that notices failures and an alerting component that alerts someone to the failure. There is no point in a system’s knowing that something has gone down unless it alerts someone to the problem. The goal is for the SA team to notice outages before customers do. This results in shorter outages and problems being fixed before customers notice, along with building the team’s reputation for maintaining high-quality service. Nagios and Big Brother are commonly used real-time monitoring systems.
Typically, the two types of monitoring are performed by different systems. The tasks involved in each type of monitoring are very different. After reading this chapter, you should have a good idea of how they differ and know what to look for in the software that you choose for each task.
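To make the two components of real-time monitoring concrete, here is a minimal sketch that keeps them separate: a check that notices a failure and an alert callback that tells someone about it. This is not how Nagios or Big Brother is implemented; the class and function names are hypothetical, and the TCP connection test is only one simple way to decide that a service is up.

```python
import socket
import time
from typing import Callable, Dict, Tuple


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Monitoring component: report whether a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class Monitor:
    """Tracks the state of each service and alerts only when the state changes."""

    def __init__(self, alert: Callable[[str], None]):
        self.alert = alert  # alerting component, e.g. a pager or email hook
        self.was_up: Dict[str, bool] = {}

    def observe(self, name: str, is_up: bool) -> None:
        previously_up = self.was_up.get(name, True)
        if previously_up and not is_up:
            self.alert(f"{name} is DOWN")
        elif not previously_up and is_up:
            self.alert(f"{name} has recovered")
        self.was_up[name] = is_up


def run(monitor: Monitor, services: Dict[str, Tuple[str, int]], interval: float) -> None:
    """Poll every service forever; the service map is supplied by the caller."""
    while True:
        for name, (host, port) in services.items():
            monitor.observe(name, tcp_check(host, port))
        time.sleep(interval)
```

A caller might construct `Monitor(print)` while testing and swap in a function that pages the on-call SA later. Alerting only on state changes is a design choice of this sketch: it avoids sending a new message on every poll while a service stays down.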
But first, a few words of warning. Monitoring uses network bandwidth, so make sure that it doesn’t use too much. Monitoring uses CPU and memory resources, so you don’t want your monitoring to make your service worse. Security is important for monitoring systems.
• Within a local area network, the bandwidth used by monitoring is not usually a significant percentage of the total. However, over low-bandwidth—usually long-distance—connections, monitoring can choke links, causing performance to suffer for other applications. Make sure that you know how much bandwidth your monitoring is using. A rule of thumb is that it should not exceed 1 percent of the available bandwidth. Try to optimize your monitoring system so that it is easy on low-bandwidth connections. Consider putting monitoring stations at the remote locations with a small amount of communication back to the main site or using primarily a trap-based system, whereby the devices notify the monitoring system when there is a failure, rather than a polling-based system, whereby the monitoring system checks status periodically.
• Under normal circumstances, a reasonable monitoring system will not consume enough CPU and memory to be noticed. However, you should test for failure modes. What happens if the monitoring server¹ is down or nonfunctional? What happens if there is a network outage? Also be careful of transitions from one monitoring system to another: Remember to turn off the old service when the new one is fully operational.
• Security is a factor in that monitoring systems may have access to machines or data that can be abused by an attacker. Or, it may be possible for an attacker to spoof a real-time monitoring system by sending messages that indicate a problem with a server or a service. Strong authentication between the server and the client is best. Older monitoring protocols, such as SNMPv1, have weak authentication.
22.1.1 Historical Monitoring
Polling systems at predefined intervals can be used to gather utilization or other statistical data from various components of the system and to check how well services that the system provides are working. The information gathered through such historical data collection is stored and typically used to produce graphs of the system’s performance over time or to detect or isolate a minor problem that occurred in the past. In an environment with written SLA policies, historical monitoring is the method used to monitor SLA conformance.
Historical data collection is often introduced at a site because the SAs wonder whether they need to upgrade a network, add more memory to a server, or get more CPU power. They might be wondering when they will need to order more disks for a group that consumes space rapidly or when they will need to add capacity to the backup system. To answer these questions, the SAs realize that they need to monitor the systems in question and gather utilization data over a period of time in order to see the trends and the peaks in usage. There are many other uses for historical data, such as usage-based billing, anomaly detection (see Section 11.1.3.7), and presenting data to the customer base or management (see Chapter 31).
Historical data can consume a lot of disk space. This can be mitigated by condensing or expiring data. Condensing data means replacing detailed data with averages. For example, one might collect bandwidth utilization data for a link every 5 minutes. However, retaining only hourly averages requires about 90 percent less storage. It is common to store the full detail for the past week but to reduce down to hourly averages for older data.
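Condensing is simple to sketch. The code below assumes samples are stored as (epoch seconds, value) pairs, which is our own simplification and not the storage format of any particular tool. It averages samples into hourly buckets and keeps full detail only for a recent window, one week by default.

```python
from collections import defaultdict

HOUR = 3600
WEEK = 7 * 24 * HOUR


def condense_hourly(samples):
    """Replace detailed (timestamp, value) samples with one average per hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % HOUR].append(value)  # key is the start of the hour
    return [(hour, sum(vals) / len(vals)) for hour, vals in sorted(buckets.items())]


def condense_older_than(samples, now, keep_detail_seconds=WEEK):
    """Keep full detail for recent samples; condense everything older.

    `samples` must be a list, because it is scanned twice.
    """
    cutoff = now - keep_detail_seconds
    old = [(ts, v) for ts, v in samples if ts < cutoff]
    recent = [(ts, v) for ts, v in samples if ts >= cutoff]
    return condense_hourly(old) + recent


# One day of hypothetical 5-minute samples: 288 readings become 24 averages.
one_day = [(i * 300, float(i % 12)) for i in range(288)]
hourly = condense_hourly(one_day)
print(len(one_day), "samples condensed to", len(hourly))
```

Twelve 5-minute samples collapse into one hourly average, a reduction of roughly 92 percent, which is consistent with the "about 90 percent" figure above. Note that averaging hides short peaks, so a site that cares about peak usage might store an hourly maximum alongside the average.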
1 The machine that all the servers report to.