Service Level and Performance Monitoring
Windows 2000 Server is being widely considered as an alternative to mainframe-type systems for high-end computing requirements. This will place tremendous burden and responsibility on Windows 2000 administrators to ensure maximum availability of systems. This chapter thus discusses service level and provides an introduction to Windows 2000 Server performance monitoring.
What Is Service Level?
If there is anything you have learned in this book, it is this:
Windows 2000 is a major-league operating system. In our opinion, it is the most powerful operating system in existence for the majority of needs of all enterprises. Only time and service packs will tell if Windows 2000 can go up against the big irons such as AS/400, Solaris, S/390, and the like.
Microsoft has aimed Windows 2000 Server squarely at all levels of business and industry and at all business sizes. You will no doubt feel the rush of diatribe in the industry: 99.9 this, 10,000 concurrent hits that, clustering and load balancing, and more. But every system, server or OS, has its meltdown point, weak links, single point of failure (SPOF), “tensile strength,” and so on. Knowing, or at least predicting, the meltdown “event horizon” is more important than availability claims. Trust us, poor management will turn any system or service into a service level nightmare.
Chapter 20
In This Chapter
Service Level Management
Windows 2000 Service Level Tools
Task Manager
The Performance Console
Performance Monitoring Guidelines
One of the first things you need to ignore in the press from the get-go are the crazy comparisons of Windows 2000 to $75 operating systems, and so on. If your business is worth your life to you and your staff, you need to invest in performance and monitoring tools, disaster recovery, Quality of Service tools, service level tools, and more. Take a survey of what these tools can cost you. Windows 2000 Server out of the box has more built in to it than anything else, as this chapter will illustrate.
On our calculators, Windows 2000 Server is the cheapest system going on performance-monitoring tools alone.
Windows 2000 is no doubt going to be adopted by many organizations; it will certainly replace Windows NT over the next few years and will probably become the leading server operating system on the Internet. With application service providing (ASP), thin-client, Quality of Service, e-commerce, distributed networking architecture (DNA), and the like becoming implementations everywhere as opposed to being new buzzwords, you, the server or network administrator, are going to find yourself dealing with a new animal in your server room. This animal is known as the service level agreement (SLA).
Before we discuss the SLA further, we should first define service level and, second, how Windows 2000 addresses it.
Service Level (SL) is simply the ability of IT management or MIS to maintain a consistent, maximum level of system uptime and availability. Many companies may understand SL as quality assurance and quality control (QA/QC). The following examples explain it better.
Service Level: Example 1
Management comes to MIS with a business plan for application services providing (ASP). If certain customers can lease applications online, over reliable Internet connections, for x rate per month, they will forgo expensive in-house IT budgets and outsource instead. An ASP can, therefore, make its highly advanced network operations center and a farm of servers available to these businesses. If enough customers lease applications, the ASP will make a profit.
The business plan flies if ASP servers and applications are available to customers all the time from at least 7 a.m. to 9 p.m. The business plan will only tolerate .09 percent downtime during the day. Any more and customers will lose respect for the business and would rather bring resources back in house. This means that IT or MIS must support the business plan by ensuring that systems are never offline for more than .09 percent of the business day. Response, as opposed to availability, is also a critical factor, and Quality of Service (QoS) addresses this in SL. This will be discussed shortly in this chapter.
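To make the tolerance concrete, the downtime budget can be worked out directly. The following is a minimal Python sketch, assuming the 7 a.m. to 9 p.m. window above (14 hours) and reading the tolerance as 0.09 percent; the function name is purely illustrative.

```python
def downtime_budget_seconds(window_hours, tolerated_percent):
    """Return the allowed downtime, in seconds, for one service window."""
    window_seconds = window_hours * 3600
    return window_seconds * tolerated_percent / 100.0


# Assumption: a 14-hour business day (7 a.m. to 9 p.m.) and 0.09 percent tolerance.
budget = downtime_budget_seconds(14, 0.09)
print(f"Allowed downtime per business day: {budget:.2f} seconds")
```

Under these assumptions the budget comes to roughly 45 seconds a day, which shows how little room such a plan leaves for unplanned restarts.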
Service Level: Example 2
Management asks MIS to take its order-placing system, typically fax-based and processed by representatives in the field, to the extranet. Current practice involves a representative going to a customer, taking an order for stock, and then faxing the order to the company’s fax system, where the orders are manually entered into the system. The new system proposes that customers be equipped with an inexpensive terminal or terminal software and place the orders directly against their accounts on a Web server.
MIS has to ensure that the Web servers and the backend systems, SQL Server 2000, Windows 2000 Server, the WAN, and so on, are available all the time. If customers find the systems offline, they will swamp the phones and fax machines, or simply place their orders with the competition. The system must also be reliable, informative, and responsive to the customers’ needs.
The Service Level Agreement
The first example may require a formal service level agreement. In other words, the SLA will be a written contract signed between the client and the provider. The customer demands that the ASP provide written, signed guarantees that the systems will be available 99.9 percent of the time. The customer demands such an SLA because it cannot afford to be in the middle of an order-processing application, or sales letter, and then have the ASP suddenly disappear.
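An availability percentage is easier to negotiate once it is translated into time. The Python sketch below converts the 99.9 percent guarantee into allowed downtime; the choice of a 365-day year and a 30-day month is an assumption made for illustration, and a real SLA would define its own measurement period.

```python
def allowed_downtime_hours(availability_percent, period_hours):
    """Hours of downtime permitted in a period at a given availability."""
    return period_hours * (100.0 - availability_percent) / 100.0


# Assumed measurement periods: a 365-day year and a 30-day month.
per_year = allowed_downtime_hours(99.9, 365 * 24)
per_month = allowed_downtime_hours(99.9, 30 * 24)

print(f"Per year:  {per_year:.2f} hours")
print(f"Per month: {per_month * 60:.1f} minutes")
```

At 99.9 percent, the provider has a little under nine hours a year, or about 43 minutes in a 30-day month, before the guarantee is broken.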
The customer may be able to tolerate a certain level of unavailability, but if SL drops below what’s tolerable, the customer needs a way to obtain redress from the ASP. This redress could be the ability to cancel the contract, or the ability to hold the ASP accountable with penalties, such as fines, discount on service costs, waiver of monthly fees, and so on. Whatever the terms of the SLA, if the ASP cannot meet it, then MIS gets the blame.
In the second example, there is unlikely to be a formal SLA between a customer and the supplier. Service level agreements will be in the form of memos between MIS and other areas of management. MIS will agree to provide a certain level of availability to the business model or plan. These SLAs are put in writing and usually favored by MIS, who will take the SLA to budget and request money for systems and software to meet the SLA.
However, the SLA can work to the disadvantage of MIS, too. If SL is not met, the MIS staff or CTO may get fired, demoted, or reassigned. The CEO may also decide to outsource or force MIS to bring in expensive consultants (which may help or hurt MIS).
In IT shops that now support SL for mission-critical applications, there are no margins for tolerating error. Engineers who cannot help MIS meet SL do not survive long. Education and experience are likely to be high on the list of employment requirements.
Service Level Management
Understanding Service Level Management (SLM) is an essential requirement for MIS in almost all companies today. This section examines critical SLM factors that have to be addressed.
Problem Detection
This factor requires IT to constantly monitor systems for advance warnings of system failure. You use whatever tools you can obtain to monitor systems and focus on all the possible points of failure. For example, you will need to monitor storage, networks, memory, processors, power, and so on.
Problem detection is a lot like earthquake detection. You spend all of your time listening to the earth, and the quake comes when you least expect it and where you least expect it. Then, 100 percent of your effort is spent on disaster recovery (DR). Your DR systems then need to kick in to recover. According to research from the likes of Forrester Research, close to 40 percent of IT management resources are spent on problem detection.
Performance Management
Performance Management accounts for about 20 percent of MIS resources. This factor is closely related to problem detection. You can hope that poor performance in areas such as networking, access times, transfer rates, restore or recover performance, and so on, will point to problems that can be fixed before they turn into disasters. However, most of the time a failure is caused by failures in another part of the system. For example, if you get a flood of continuous writes to a hard disk that does not let up until the hard disk crashes, is the hard disk at fault or should you be looking for better firewall software?
The right answer is a combination of both factors. The fault is caused by the poor quality of firewall software that gives passage to a denial-of-service attack. But in the event this happens again, we need hard disks that can stand the attack a lot longer.
Availability, for the most part, is a post-operative factor. In other words, availability management covers redundancy, mirrored or duplexed systems, fail-overs, and so on. Note that fail-over is emphasized because the term itself denotes taking over from a system that has failed.
Clustering of systems or load balancing, on the other hand, is as much disaster prevention as it is a performance-level maintenance practice. Using performance management, you would take systems to a performance point that is nearing threshold or maximum level, and then switch additional requests for service to other resources. A fail-over, on the other hand, is a machine or process that picks up the users and processes that were on a system that has just failed, and it is supposed to allow the workload to continue uninterrupted on the fail-over systems. A good example of fail-over is a mirrored disk, or a RAID-5 storage set: The failure of one disk does not interrupt the processing, which carries on oblivious to the failure on the remaining disks, giving management time to replace the defective components.
There are several other SL-related areas that IT spends time on and which impact SLM. These include change management and control, software distribution, and systems management. See Chapter 11 for an extensive discussion of Change Management.
The SLM model is a three-legged stool (see Figure 20-1) whose legs are availability, administration, and performance. The performance leg supports the model by assuring that systems are able to service the business and keep systems operating at threshold points considered safely below bottleneck and failure levels. If one of the legs fails or becomes weak, the stool may falter or collapse, which puts the business at risk.
When managing for availability, the enterprise will ensure it has the resources to recover from disasters as soon as possible. This usually means hiring gurus or experts to be available on-site to fix problems as quickly as possible. Often, management will pay a guru who does nothing for 95 percent of his or her time, which seems a waste. But if they can fix a problem in record time, they will have earned their keep several times over.
Figure 20-1: The SLM model is a three-legged stool.
Often, a guru will restore a system that, had it stayed offline a few days longer, would have cost the company much more than the salary of the guru. However, it goes without saying that the enterprise will save a lot of money and effort if it can obtain gurus who are also qualified to monitor for performance and problems, and who do not just excel at recovery. This should be worth 50 percent more salary to the guru.
Administration is the effort of technicians to keep systems backed up, keep power supplies on line, monitor servers for error messages, ensure server rooms remain at safe temperatures with adequate air circulation, and so on. The administrative leg manages the SL budget, hires and fires, maintains and reports on service level achievement, and reports to management or the CEO.
The performance leg is usually carried out by analysts who know what to look for in a system. These analysts get paid the big bucks to help management decide how to support business initiatives and how to exploit opportunity. They need to know everything there is to know about the technology and its capabilities. For example, they need to know which databases should be used, how RAID works and the level required, and so on. They are able to collect data, interpret data, and forecast needs.
[Figure 20-1 leg labels: Availability, Administration, Performance]
SLM and Windows 2000 Server
Key to meeting the objective of SLM is the acquisition of SL tools and technology. This is where Windows 2000 Server comes in. While clustering and load balancing are included in Advanced Server and Datacenter Server, the performance and system monitoring tools and disaster recovery tools are available in all versions of the OS.
These tools are essential to SL. Acquired independently of the operating systems, they can cost an arm and a leg, and they might not integrate at the same level.
These tools on Windows NT 4.0 were seriously lacking. On Windows 2000, however, they raise the bar for all operating systems. Many competitive products unfortunately just do not compete when it comes to SLM. The costs of third-party tools and integration for some operating systems are so prohibitive that they cannot be considered of any use to SLM whatsoever.
The Windows 2000 monitoring tools are complex, and continued ignorance of them will not be tolerated by management as more and more customers demand SL compliance and service level agreements. The monitoring and performance tools on Windows 2000 include Task Manager and the Performance Console, which contains System Monitor and Performance Logs and Alerts.
Windows 2000 System Monitoring Architecture
Windows 2000 monitors or analyzes storage, memory, networks, and processing. This does not sound like a big deal, but the data analysis is not done on these areas per se. In other words, you do not monitor memory itself, or disk usage itself, but rather how software components and functionality use these resources. In short, it is not sufficient to just report that 56MB of RAM were used between time x and time y. Your investigations need to find out what used the RAM at a certain time and why so much was used.
If a system continues to run out of memory, there is a strong possibility, for example, that an application is stealing the RAM somewhere. In other words, the application or process has a bug and is leaking memory. When we refer to memory leaks, this means that software that has used memory has not released it after it is done. Software developers are able to watch their applications on servers to be sure they release all the memory they use.
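One simple way to spot a suspected leak in collected data is to check whether a process's memory readings trend steadily upward. The Python sketch below fits a least-squares slope to equally spaced samples; the readings, the 0.5MB threshold, and the function names are hypothetical and chosen only to illustrate the idea.

```python
from statistics import mean


def growth_per_sample(samples_mb):
    """Least-squares slope of memory use across equally spaced samples."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples_mb)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples_mb))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def looks_like_leak(samples_mb, min_growth_mb=0.5):
    """Flag a process whose memory grows by at least min_growth_mb per sample."""
    return growth_per_sample(samples_mb) >= min_growth_mb


# Hypothetical readings (MB) taken at a fixed interval for two processes.
steady = [56, 58, 55, 57, 56, 57]
leaking = [56, 60, 63, 68, 71, 75]

print("steady process leaking? ", looks_like_leak(steady))
print("growing process leaking?", looks_like_leak(leaking))
```

A rising slope is only a hint, since a busy process may legitimately need more memory; it tells you which process to examine, not why it is growing.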
What if you are losing memory and you do not know which application is responsible? Not too long ago, Windows NT servers used on the Internet and in high-end mail applications (no fewer than 100,000 e-mails per hour) would simply run out of RAM. After extensive system monitoring, we were able to determine that the leak was in the latest release of the Winsock libraries responsible for Internet communications on NT. Another company in Europe found the leak about the same time. Microsoft later released a patch. It turned out that the Winsock functions responsible for releasing memory were not able to cope with the rapid demand on the sockets. They were simply being opened at a rate faster than the Winsock libraries could cope with.
The software components, services, and threads of functionality in Windows 2000 are so numerous that it is literally impossible to monitor tens of thousands of instances of storage, memory, network, or processor usage.
To achieve such detailed and varied analysis, Windows 2000 includes built-in software objects, associated with services and applications, which are able to collect data in these critical areas. So when you collect data, the focus of your data collection is on the software components, in various services of the operating system, that are associated with these areas. When you perform data collection, the system collects data from the targeted object managers in each monitoring area.
There are two methods of data collection supported in Windows 2000. The first one involves accessing registry pointers to functions in the performance counter DLLs in the operating system. The second supports collecting data through Windows Management Instrumentation (WMI). WMI is an object-oriented framework that allows you to instantiate (create instances of) performance objects that wrap the performance functionality in the operating system.
The OS installs a new technology for retrieving data through WMI. These are known as Managed Object Format (MOF) files. These MOFs correspond to or are associated with resources in a system. The objects that are the subject of performance monitoring are too numerous to list here, but they can be looked up in the Windows 2000 Performance Counters Reference, which is on the Windows 2000 Resource Kit CD (see Appendix B). However, they include the operating system’s base services, such as the services that report on the RAM, Paging File functionality, and Physical Disk usage, and the operating system’s advanced services, such as Active Directory, Active Server Pages, the FTP service, DNS, WINS, and so on.
To understand the scope and usage of the objects, it helps to first understand some performance data and analysis terms. There are three essential concepts to understanding performance monitoring. These are throughput, queues, and response time. From these terms, and once you fully understand them, you can broaden your scope of analysis and perform calculations to report transfer rate, access time, latency, tolerance, thresholds, bottlenecks, and so on.
What Are Rate and Throughput?
Throughput is the amount of work done in a unit of time. If your child is able to construct 100 pieces of Lego bricks per hour, you could say that his or her assemblage rate is 100 pieces per hour, assessed over a period of x hours, as long as the rate remains constant. However, if the rate of assemblage varies, through fatigue, hunger, thirst, and so on, we can calculate the throughput.
Throughput increases as the number of components increases, or the available time to complete a job is reduced. Throughput depends on resources, and time and space are examples of resources. The slowest point in the system sets the throughput for the system as a whole. Throughput is the true indicator of performance.
Memory is a resource, the space in which to carry out instructions. It makes little sense to rate a system by millions of instructions per second when sufficient memory is not available to hold the instruction information.
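The two throughput ideas in this section, averaging a varying rate and letting the slowest point set the pace, can be sketched in a few lines of Python. The hourly piece counts and the per-stage rates below are invented figures, and the function names are illustrative only.

```python
def average_throughput(units_done, hours_spent):
    """Total work completed divided by total time taken."""
    total_hours = sum(hours_spent)
    if total_hours <= 0:
        raise ValueError("total time must be positive")
    return sum(units_done) / total_hours


def system_throughput(stage_rates):
    """Return the slowest stage and its rate, which caps the whole system."""
    slowest = min(stage_rates, key=stage_rates.get)
    return slowest, stage_rates[slowest]


# Hypothetical: the child slows down over three one-hour sessions.
pieces = [100, 80, 60]
print("average pieces per hour:", average_throughput(pieces, [1, 1, 1]))

# Hypothetical requests per second that each resource can sustain.
stages = {"disk": 400, "network": 250, "cpu": 900}
print("limiting stage:", system_throughput(stages))
```

In this example the network is the limiting resource, so adding processor capacity would not raise the system's overall throughput.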
When a queue develops, we say that a bottleneck has occurred. Looking for bottlenecks in the system is key to monitoring for performance and troubleshooting or problem detection. If there are no bottlenecks, the system might be considered healthy, but a bottleneck might soon start to develop.
Queues can also form if requests for resources are not evenly spread over the unit of time. If your child assembles one piece per minute evenly every minute, he or she will get through 60 pieces in an hour. But if the child does nothing for 45 minutes and then suddenly gets inspired, a bottleneck will occur in the final 15 minutes because there are more pieces than the child can process in the remaining time. On computer systems, when queues and bottlenecks develop, systems become unresponsive. Additional requests for processor or disk resources are stalled. When requesting services are not satisfied, the system begins to break down. In this respect, we reference the response time of a system.
What Is Response Time?
Response time is the measure of how much time elapses between the firing of a computer event, such as a read request, and the system’s response to the request. Response time will increase as the load increases because the system is still responding to other events and does not have enough resources to handle new requests. A system that has insufficient memory and/or processing ability will process a huge database sort a lot slower than a better-endowed system with faster hard disks and CPUs. If response time is not satisfactory, you will either have to work with less data or increase the resources.
Response time is typically measured by dividing the queue length by the resource throughput. Response time, queues, and throughput are reported and calculated by the Windows 2000 reporting tools.
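A minimal Python sketch can tie queues and response time together. It replays the Lego analogy, assuming one piece arrives per minute and the child is idle for the first 45 minutes, and then applies the queue-length-over-throughput estimate described above. The function names and figures are illustrative assumptions, not part of any Windows 2000 tool.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue length at the end of each time slot."""
    backlog = 0
    history = []
    for arrived, can_serve in zip(arrivals, capacity):
        backlog += arrived
        backlog -= min(backlog, can_serve)  # serve as much as capacity allows
        history.append(backlog)
    return history


def response_time(queue_length, throughput):
    """Estimate response time as queue length divided by throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return queue_length / throughput


arrivals = [1] * 60  # assumed: one piece per minute for an hour
steady = simulate_backlog(arrivals, [1] * 60)
bursty = simulate_backlog(arrivals, [0] * 45 + [1] * 15)

print("steady peak queue:", max(steady))
print("bursty peak queue:", max(bursty))
print("wait behind bursty queue (minutes):", response_time(max(bursty), 1))
```

With evenly spread work the queue never forms, while the late start leaves a backlog that a new request must wait behind, which is the effect a user experiences as poor response time.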
How Performance Objects Work
Windows 2000 performance monitoring objects contain functionality known as performance counters. These so-called counters perform the actual analysis. For example, a hard disk object is able to calculate transfer rate, while a processor-associated object is able to calculate processor time.
To gain access to the data or to start the data collection, you first have to create the object and gain access to its functionality. This is done by calling a create function from a user interface or other process. As soon as the object is created, and its data collection functionality invoked, it begins the data-collection process and stores the data in various properties. Data can be streamed out to disk, in files, RAM, or to other components that assess the data and present it in some meaningful way.
Depending on the object, your analysis software can create at least one copy of the performance object and analyze the counter information it generates. You need to consult Microsoft documentation to “expose” the objects to determine if the object can be created more than once concurrently. If it can be created more than once, you will have to associate your application with the data the object collects by referencing the object’s instance counter. Windows 2000 allows you to instantiate an object for a local computer’s services, or you can create an object that collects data from a remote computer.
There are two methods of data collection and reporting made possible using performance objects. First, the objects can sample data. This means that data is collected periodically rather than when a particular event occurs. All forms of data collection place a burden on resources, which means that monitoring in itself can be a burden to systems. Sampled data has the advantage of being a period-driven load, but the disadvantage is that values may be inaccurate when a certain activity falls outside the sampling period or between events.
The other method of data collection is event tracing. Event tracing, new to Windows 2000, is able to collect data as certain events occur. Because there is no sampling window, you can correlate resource usage against events. For example, you can watch an application consume memory when it executes a certain function and monitor how and if it releases that memory when the function completes.
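The difference between the two collection methods can be illustrated with a toy timeline. In the Python sketch below, a short memory spike falls between sampling points and is missed, while a change-driven trace records it. The timeline values, the ten-second interval, and the function names are assumptions made for illustration and do not model the actual Windows 2000 mechanisms.

```python
def sample(timeline, interval):
    """Read the value only at fixed intervals (periodic sampling)."""
    return [timeline[t] for t in range(0, len(timeline), interval)]


def trace_events(timeline):
    """Record a (time, value) pair every time the value changes."""
    events = []
    previous = None
    for t, value in enumerate(timeline):
        if value != previous:
            events.append((t, value))
            previous = value
    return events


# Hypothetical memory use (MB) per second: a 3-second spike starting at t=12.
timeline = [50] * 12 + [95] * 3 + [50] * 15

sampled = sample(timeline, 10)
traced = trace_events(timeline)

print("sampled every 10s:", sampled)
print("traced changes:   ", traced)
```

The sampled series never sees the spike, whereas the trace captures both its start and its end, at the cost of doing work on every change.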
The disadvantage of event tracing is that it consumes more resources than sampling, so you would only want to perform event tracing for short periods where the objective of the trace is to troubleshoot, and not to just monitor.
Counters are able to report their data in one of two ways: instantaneous counting or average counting. An instantaneous counter displays the data as it happens; it is a snapshot. In other words, the counter does not compute the data it receives and just reports it. On the other hand, average counting computes the data for you. For example, it is able to compute bits per second, or pages per second, and so on. Other counters are able to report percentages, difference, and so on.
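As a rough illustration of the two reporting styles, the Python sketch below treats each reading as a (time, value) pair. An instantaneous counter simply returns the latest reading, while an average counter derives a per-second rate from the last two readings of a running total. The sample data and function names are hypothetical.

```python
def instantaneous_value(samples):
    """Report the most recent raw reading as a snapshot."""
    return samples[-1][1]


def average_per_second(samples):
    """Compute a rate from the last two readings of a running total."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# Hypothetical readings as (seconds, value) pairs.
queue_length = [(0, 2), (5, 7)]        # snapshot-style data
pages_total = [(0, 1200), (5, 1450)]   # running total of pages

print("current queue length:", instantaneous_value(queue_length))
print("pages per second:    ", average_per_second(pages_total))
```

The snapshot says nothing about what happened between readings, which is why rate-style counters are computed over an interval.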
System Monitoring Tools
Before you rush out and buy a software development environment to access the performance monitoring routines, you should know that Windows 2000 comes equipped with two primary, ready-to-go monitoring tools: the Performance Console and Task Manager. Task Manager provides an instant view of systems activity such as memory usage, processor activity, process activity, and resource consumption. Task Manager is very helpful for an immediate detection of system problems. The Performance Console is used to provide performance analysis and information that can be used for troubleshooting and bottleneck analysis. It can also be used to establish regular monitoring regimens such as ongoing server health analysis.
Performance Console comes with two tools built in: System Monitor and Performance Logs and Alerts, but more about them later. The first tool to cover, because of its immediacy as a troubleshooting and information tool, is Task Manager.
Task Manager
Task Manager provides quick information on applications and services currently running on your server. It provides information such as processor usage in percentage terms, memory usage, task priority, response, and some statistics about memory and processor performance.
Task Manager is very useful as a quick system sanity check, and it is usually invoked as a troubleshooting tool when a system indicates slow response times, lockups or errors, or messages pointing to lack of system resources, and so on.
Task Manager, illustrated in Figure 20-2, is started in several ways:
1. Right-click the taskbar (for example, the bottom-right area where the time is usually displayed) and select Task Manager from the context menu.
2. Press Ctrl+Shift+Esc.
3. Press Ctrl+Alt+Del. The Windows Security dialog box loads. Click Task Manager.
Figure 20-2: The Task Manager opened to the Performance tab
When Task Manager loads, you will notice that the dialog box has three tabs: Applications, Processes, and Performance. There are a number of useful tricks with the Task Manager:
✦ The columns can be sorted in ascending or descending order by clicking the column heads. The columns can also be resized.
✦ When the Task Manager is running, a CPU gauge icon displaying accurate information is placed into the system tray on the bottom-right of the screen. If you hover your mouse cursor over this area, you will obtain a pop-up display of current CPU usage.
✦ It is also possible to keep the Task Manager button off the taskbar if you use it a lot. This is done by selecting the Options menu and then checking the Hide When Minimized option. The CPU icon next to the system time remains, however.
✦ You can control the rate of Refresh or Update from the View ➪ Update Speed menu. You can also pause the update to preserve resources and click Refresh Now to update the display at any time.
The Processes tab is the most useful and provides a list of running processes on the system. It measures their performance in simple data. These include CPU percent used, the CPU time allocated to a resource, and memory usage.
There are a number of additional performance or process measures that can be added to or removed from the list on the Processes page. Select View ➪ Select Columns. This will show the Select Columns dialog box that will allow you to add or remove Process counters in the Processes list.
A description of each process counter is available in Windows 2000 Help.
It is also possible to terminate a process by selecting the process in the list and then clicking the End Process button. Some processes are protected, but you can terminate them using the kill or remote kill utilities that are included in the operating system (see Appendix B for more information on kill and rkill). You will need authority to kill processes, and before you do, you should fully understand the ramifications of terminating a process.
The Performance tab (shown in Figure 20-2) allows you to graph the percentage of processor time in kernel mode. To show this, select the View menu and check the Show Kernel Times option. Kernel time is the measure of time that applications spend using operating system services. The remaining time, known as user mode, is spent in threads that are spawned by applications.
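The kernel-versus-user split that the graph shows can also be expressed as simple arithmetic. The Python sketch below computes the kernel share from two time totals and, as an assumption for illustration, feeds it the current process's own figures from the standard library call os.times(); it does not read the system-wide values that Task Manager displays.

```python
import os


def kernel_share_percent(user_seconds, kernel_seconds):
    """Percentage of total processor time that was spent in kernel mode."""
    total = user_seconds + kernel_seconds
    if total == 0:
        return 0.0
    return 100.0 * kernel_seconds / total


# os.times() reports user and system (kernel) CPU time for this process only.
cpu = os.times()
share = kernel_share_percent(cpu.user, cpu.system)
print(f"kernel share for this process: {share:.1f} percent")
```

A persistently high kernel share can suggest that time is going to operating system services such as I/O or drivers rather than to the application's own code.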
If your server supports multiple processors, you can click CPU History on the View menu and graph each processor in a single graph pane or in separate graph panes.
The Applications tab lists running applications and allows you to terminate an application that has become unresponsive or that you determine is in trouble or is the cause of trouble on the server.