Service Level and Performance Monitoring

Windows 2000 Server is being widely considered as an alternative to mainframe-type systems for high-end computing requirements. This will place tremendous burden and responsibility on Windows 2000 administrators to ensure maximum availability of systems. This chapter thus discusses service level and provides an introduction to Windows 2000 Server performance monitoring.

What Is Service Level?

If there is anything you have learned in this book, it is this: Windows 2000 is a major-league operating system. In our opinion, it is the most powerful operating system in existence for the majority of needs of all enterprises. Only time and service packs will tell if Windows 2000 can go up against the big irons such as AS/400, Solaris, S/390, and the like.

Microsoft has aimed Windows 2000 Server squarely at all levels of business and industry and at all business sizes. You will no doubt feel the rush of diatribe in the industry: 99.9 this, 10,000 concurrent hits that, clustering and load balancing, and more. But every system, server or OS, has its meltdown point, weak links, single point of failure (SPOF), “tensile strength,” and so on. Knowing, or at least predicting, the meltdown “event horizon” is more important than availability claims. Trust us, poor management will turn any system or service into a service level nightmare.

Chapter 20

In This Chapter

✦ Service Level Management

✦ Windows 2000 Service Level Tools

✦ Task Manager

✦ The Performance Console

✦ Performance Monitoring Guidelines

One of the first things you need to ignore in the press from the get-go is the crazy comparisons of Windows 2000 to $75 operating systems, and so on. If your business is worth your life to you and your staff, you need to invest in performance and monitoring tools, disaster recovery, Quality of Service tools, service level tools, and more. Take a survey of what these tools can cost you. Windows 2000 Server out of the box has more built in to it than anything else, as this chapter will illustrate.

On our calculators, Windows 2000 Server is the cheapest system going on performance-monitoring tools alone.

Windows 2000 is no doubt going to be adopted by many organizations; it will certainly replace Windows NT over the next few years and will probably become the leading server operating system on the Internet. With application service providing (ASP), thin-client, Quality of Service, e-commerce, distributed networking architecture (DNA), and the like becoming implementations everywhere as opposed to being new buzzwords, you, the server or network administrator, are going to find yourself dealing with a new animal in your server room. This animal is known as the service level agreement (SLA).

Before we discuss the SLA further, we should first define service level and, second, how Windows 2000 addresses it.

Service Level (SL) is simply the ability of IT management or MIS to maintain a consistent, maximum level of system uptime and availability. Many companies may understand SL as quality assurance and quality control (QA/QC). The following examples explain it better.

Service Level: Example 1

Management comes to MIS with a business plan for application services providing (ASP). If certain customers can lease applications online, over reliable Internet connections, for x rate per month, they will forgo expensive in-house IT budgets and outsource instead. An ASP can, therefore, make its highly advanced network operations center and a farm of servers available to these businesses. If enough customers lease applications, the ASP will make a profit.

The business plan flies if ASP servers and applications are available to customers all the time from at least 7 a.m. to 9 p.m. The business plan will only tolerate 0.09 percent downtime during the day. Any more and customers will lose respect for the business and bring resources back in house. This means that IT or MIS must support the business plan by ensuring that systems are never offline for more than 0.09 percent of the business day. Response, as opposed to availability, is also a critical factor, and Quality of Service (QoS) addresses this in SL. This is discussed shortly in this chapter.
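
To get a feel for how tight that tolerance is, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes the business day is exactly the 14 hours from 7 a.m. to 9 p.m. and that the tolerance is 0.09 percent of that window.

```python
def downtime_budget_seconds(window_hours: float, tolerated_percent: float) -> float:
    """Return the downtime, in seconds, allowed inside a service window."""
    window_seconds = window_hours * 3600
    return window_seconds * tolerated_percent / 100.0


# 7 a.m. to 9 p.m. is a 14-hour business day (assumption stated above).
budget = downtime_budget_seconds(14, 0.09)
print(f"Allowed downtime per business day: {budget:.2f} seconds")
```

Under these assumptions the budget works out to well under a minute per day, which is why the plan leaves MIS almost no room for unplanned outages.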

Service Level: Example 2

Management asks MIS to take its order-placing system, typically fax-based and processed by representatives in the field, to the extranet. Current practice involves a representative going to a customer, taking an order for stock, and then faxing the order to the company’s fax system, where the orders are manually entered into the system. The new system proposes that customers be equipped with an inexpensive terminal or terminal software and place the orders directly against their accounts on a Web server.

MIS has to ensure that the Web servers and the backend systems, SQL Server 2000, Windows 2000 Server, the WAN, and so on, are available all the time. If customers find the systems offline, they will swamp the phones and fax machines, or simply place their orders with the competition. The system must also be reliable, informative, and responsive to the customers’ needs.

The Service Level Agreement

The first example may require a formal service level agreement. In other words, the SLA will be a written contract signed between the client and the provider. The customer demands that the ASP provide written, signed guarantees that the systems will be available 99.9 percent of the time. The customer demands such an SLA because it cannot afford to be in the middle of an order-processing application, or sales letter, and then have the ASP suddenly disappear.

The customer may be able to tolerate a certain level of unavailability, but if SL drops beyond what’s tolerable, the customer needs a way to obtain redress from the ASP. This redress could be the ability to cancel the contract, or the ability to hold the ASP accountable with penalties, such as fines, discount on service costs, waiver of monthly fees, and so on. Whatever the terms of the SLA, if the ASP cannot meet it, then MIS gets the blame.
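
As a rough sketch of how either party might check such a guarantee, the following Python compares measured availability against the 99.9 percent target. The outage durations and the 30-day reporting period are made-up illustration values, and the function name is hypothetical; a real SLA would define its own measurement window and what counts as an outage.

```python
def availability_percent(period_minutes: float, outage_minutes: list) -> float:
    """Percentage of the period during which the service was up."""
    down = sum(outage_minutes)
    return 100.0 * (period_minutes - down) / period_minutes


SLA_TARGET = 99.9                 # percent, from the agreement described above
period = 30 * 24 * 60             # a 30-day month in minutes (assumption)
outages = [12, 25, 9]             # hypothetical outage durations in minutes

measured = availability_percent(period, outages)
meets_sla = measured >= SLA_TARGET
print(f"Measured availability: {measured:.3f}% (SLA met: {meets_sla})")
```

With these sample numbers, 46 minutes of downtime in the month is already enough to miss the target, since 99.9 percent of a 30-day month allows only about 43 minutes.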

In the second example, there is unlikely to be a formal SLA between a customer and the supplier. Service level agreements will be in the form of memos between MIS and other areas of management. MIS will agree to provide a certain level of availability to the business model or plan. These SLAs are put in writing and usually favored by the MIS, who will take the SLA to budget and request money for systems and software to meet the SLA.

However, the SLA can work to the disadvantage of MIS, too. If SL is not met, the MIS staff or CTO may get fired, demoted, or reassigned. The CEO may also decide to outsource or force MIS to bring in expensive consultants (which may help or hurt MIS).

In IT shops that now support SL for mission-critical applications, there are no margins for tolerating error. Engineers who cannot help MIS meet SL do not survive long. Education and experience are likely to be high on the list of employment requirements.

Service Level Management

Understanding Service Level Management (SLM) is an essential requirement for MIS in almost all companies today. This section examines critical SLM factors that have to be addressed.

Problem Detection

This factor requires IT to be constantly monitoring systems for advance warnings of system failure. You use whatever tools you can obtain to monitor systems and focus on all the possible points of failure. For example, you will need to monitor storage, networks, memory, processors, power, and so on.

Problem detection is a lot like earthquake detection. You spend all of your time listening to the earth, and the quake comes when you least expect it and where you least expect it. Then, 100 percent of your effort is spent on disaster recovery (DR). Your DR systems then need to kick in to recover. According to research from the likes of Forrester Research, close to 40 percent of IT management resources are spent on problem detection.

Performance Management

Performance Management accounts for about 20 percent of MIS resources. This factor is closely related to problem detection. You can hope that poor performance in areas such as networking, access times, transfer rates, restore or recover performance, and so on, will point to problems that can be fixed before they turn into disasters. However, most of the time a failure is caused by failures in another part of the system. For example, if you get a flood of continuous writes to a hard disk that does not let up until the hard disk crashes, is the hard disk at fault or should you be looking for better firewall software?

The right answer is a combination of both factors. The fault is caused by the poor quality of firewall software that gives passage to a denial-of-service attack. But in the event this happens again, we need hard disks that can stand the attack a lot longer.

Availability Management

Availability, for the most part, is a post-operative factor. In other words, availability management covers redundancy, mirrored or duplexed systems, fail-overs, and so on. Note that fail-over is emphasized because the term itself denotes taking over from a system that has failed.

Clustering of systems or load balancing, on the other hand, is as much disaster prevention as it is a performance-level maintenance practice. Using performance management, you would take systems to a performance point that is nearing threshold or maximum level, and then switch additional requests for service to other resources. A fail-over, by contrast, is a machine or process that picks up the users and processes that were on a system that has just failed, and it is supposed to allow the workload to continue uninterrupted on the fail-over systems. A good example of fail-over is a mirrored disk, or a RAID-5 storage set: The failure of one disk does not interrupt the processing, which carries on oblivious to the failure on the remaining disks, giving management time to replace the defective components.

There are several other SL-related areas that IT spends time on and which impact SLM. These include change management and control, software distribution, and systems management. See Chapter 11 for an extensive discussion of Change Management.

... keeping. The performance leg supports the model by assuring that systems are able to service the business and keep systems operating at threshold points considered safely below bottleneck and failure levels. If one of the legs fails or becomes weak, the stool may falter or collapse, which puts the business at risk.

When managing for availability, the enterprise will ensure it has the resources to recover from disasters as soon as possible. This usually means hiring gurus or experts to be available on-site to fix problems as quickly as possible. Often, management will pay a guru who does nothing for 95 percent of his or her time, which seems a waste. But if they can fix a problem in record time, they will have earned their keep several times over.

Figure 20-1: The SLM model is a three-legged stool.

Often, a guru will restore a system that, had it stayed offline a few days longer, would have cost the company much more than the salary of the guru. However, it goes without saying that the enterprise will save a lot of money and effort if it can obtain gurus who are also qualified to monitor for performance and problems, and who do not just excel at recovery. This should be worth 50 percent more salary to the guru.

Administration is the effort of technicians to keep systems backed up, keep power supplies on line, monitor servers for error messages, ensure server rooms remain at safe temperatures and air circulation, and so on. The administrative leg manages the SL budget, hires and fires, maintains and reports on service level achievement, and reports to management or the CEO.

The performance leg is usually carried out by analysts who know what to look for in a system. These analysts get paid the big bucks to help management decide how to support business initiatives and how to exploit opportunity. They need to know everything there is about the technology and its capabilities. For example, they need to know which databases should be used, how RAID works and the level required, and so on. They are able to collect data, interpret data, and forecast needs.

Figure 20-1 legs: Availability, Administration, Performance


SLM and Windows 2000 Server

Key to meeting the objective of SLM is the acquisition of SL tools and technology. This is where Windows 2000 Server comes in. While clustering and load balancing are included in Advanced Server and Datacenter Server, the performance and system monitoring tools and disaster recovery tools are available to all versions of the OS.

These tools are essential to SL. Acquired independently of the operating systems, they can cost an arm and a leg, and they might not integrate at the same level.

These tools on Windows NT 4.0 were seriously lacking. On Windows 2000, however, they raise the bar for all operating systems. Many competitive products unfortunately just do not compete when it comes to SLM. The costs of third-party tools and integration for some operating systems are so prohibitive that they cannot be considered of any use to SLM whatsoever.

The Windows 2000 monitoring tools are complex, and continued ignorance of them will not be tolerated by management as more and more customers demand SL compliance and service level agreements. The monitoring and performance tools on Windows 2000 include the following:

Windows 2000 System Monitoring Architecture

Windows 2000 monitors or analyzes storage, memory, networks, and processing. This does not sound like a big deal, but the data analysis is not done on these areas per se. In other words, you do not monitor memory itself, or disk usage itself, but rather how software components and functionality use these resources. In short, it is not sufficient to just report that 56MB of RAM were used between time x and time y. Your investigations need to find out what used the RAM at a certain time and why so much was used.

If a system continues to run out of memory, there is a strong possibility, for example, that an application is stealing the RAM somewhere. In other words, the application or process has a bug and is leaking memory. A memory leak means that software that has used memory has not released it after it is done. Software developers are able to watch their applications on servers to be sure they release all the memory they use.
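
One simple way to spot a suspected leak in collected data is to look for a steady upward trend in a process's memory readings. The sketch below fits a least-squares slope to a series of evenly spaced samples; the sample values, the function names, and the 1 MB-per-sample threshold are all hypothetical choices for illustration, not values taken from the Windows 2000 tools.

```python
def growth_per_sample(samples: list) -> float:
    """Least-squares slope of the readings, in units per sampling interval."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples: list, min_slope: float) -> bool:
    """Flag a series whose memory use keeps climbing faster than min_slope."""
    return len(samples) >= 2 and growth_per_sample(samples) > min_slope


# Hypothetical memory readings in MB, taken at equal intervals.
steady = [56, 58, 55, 57, 56, 58]     # fluctuates but is released again
leaking = [56, 61, 66, 72, 77, 83]    # never comes back down

print(looks_like_leak(steady, min_slope=1.0))
print(looks_like_leak(leaking, min_slope=1.0))
```

A rising trend is only a hint. A cache that is still warming up looks the same over a short window, so the readings need to cover enough time for memory that will be released to actually be released.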

What if you are losing memory and you do not know which application is responsible? Not too long ago, Windows NT servers used on the Internet and in high-end mail applications (no fewer than 100,000 e-mails per hour) would simply run out of RAM. After extensive system monitoring, we were able to determine that the leak was in the latest release of the Winsock libraries responsible for Internet communications on NT. Another company in Europe found the leak about the same time. Microsoft later released a patch. It turned out that the Winsock functions responsible for releasing memory were not able to cope with the rapid demand on the sockets. They were simply being opened at a rate faster than the Winsock libraries could cope with.

The software components, services, and threads of functionality in Windows 2000 are so numerous that it is literally impossible to monitor tens of thousands of instances of storage, memory, network, or processor usage.

To achieve such detailed and varied analysis, Windows 2000 includes built-in software objects, associated with services and applications, which are able to collect data in these critical areas. So when you collect data, the focus of your data collection is on the software components, in various services of the operating system, that are associated with these areas. When you perform data collection, the system collects data from the targeted object managers in each monitoring area.

There are two methods of data collection supported in Windows 2000. The first one involves accessing registry pointers to functions in the performance counter DLLs in the operating system. The second supports collecting data through Windows Management Instrumentation (WMI). WMI is an object-oriented framework that allows you to instantiate (create instances of) performance objects that wrap the performance functionality in the operating system.

The OS installs a new technology for retrieving data through WMI: Managed Object Format (MOF) files. These MOFs correspond to or are associated with resources in a system. The objects that are the subject of performance monitoring are too numerous to list here, but they can be looked up in the Windows 2000 Performance Counters Reference, which is on the Windows 2000 Resource Kit CD (see Appendix B). However, they include the operating system’s base services, such as the services that report on the RAM, Paging File functionality, and Physical Disk usage, and the operating system’s advanced services, such as Active Directory, Active Server Pages, the FTP service, DNS, WINS, and so on.

To understand the scope and usage of the objects, it helps to first understand some performance data and analysis terms. There are three essential concepts to understanding performance monitoring. These are throughput, queues, and response time. Once you fully understand these terms, you can broaden your scope of analysis and perform calculations to report transfer rate, access time, latency, tolerance, thresholds, bottlenecks, and so on.

What Is Rate and Throughput?

Throughput is the amount of work done in a unit of time. If your child is able to construct 100 pieces of Lego bricks per hour, you could say that his or her assemblage rate is 100 pieces per hour, assessed over a period of x hours, as long as the rate remains constant. However, if the rate of assemblage varies, through fatigue, hunger, thirst, and so on, we can calculate the throughput.

Throughput increases as the number of components increases, or the available time to complete a job is reduced. Throughput depends on resources, and time and space are examples of resources. The slowest point in the system sets the throughput for the system as a whole. Throughput is the true indicator of performance.

Memory is a resource, the space in which to carry out instructions. It makes little sense to rate a system by millions of instructions per second when sufficient memory is not available to hold the instruction information.
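
Both ideas, throughput as work over time and the slowest point setting the pace, can be sketched briefly in Python. The hourly piece counts and the per-stage rates below are invented numbers for illustration, and the function names are hypothetical.

```python
def throughput(total_units: float, elapsed_hours: float) -> float:
    """Work completed per hour over the whole observation period."""
    return total_units / elapsed_hours


def system_throughput(stage_rates: dict) -> float:
    """The slowest stage sets the rate for the system as a whole."""
    return min(stage_rates.values())


# Hypothetical pieces assembled in each of four hours; the rate drops with fatigue.
pieces_each_hour = [100, 90, 60, 40]
overall = throughput(sum(pieces_each_hour), len(pieces_each_hour))
print(f"Throughput: {overall} pieces per hour")

# Hypothetical requests per second that each resource can sustain.
stages = {"processor": 500, "memory": 300, "disk": 120}
print(f"System throughput: {system_throughput(stages)} requests per second")
```

In the second example, adding processor capacity would not help; only raising the disk's rate changes the result.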

... queue develops, we say that a bottleneck has occurred. Looking for bottlenecks in the system is key to monitoring for performance and troubleshooting or problem detection. If there are no bottlenecks, the system might be considered healthy, but a bottleneck might soon start to develop.

Queues can also form if requests for resources are not evenly spread over the unit of time. If your child assembles one piece per minute evenly every minute, he or she will get through 60 pieces in an hour. But if the child does nothing for 45 minutes and then suddenly gets inspired, a bottleneck will occur in the final 15 minutes because there are more pieces than the child can process in the remaining time. On computer systems when queues and bottlenecks develop, systems become unresponsive. Additional requests for processor or disk resources are stalled. When requesting services are not satisfied, the system begins to break down. In this respect, we reference the response time of a system.

What Is Response Time?

Response time is the measure of how much time elapses between the firing of a computer event, such as a read request, and the system’s response to the request. Response time will increase as the load increases because the system is still responding to other events and does not have enough resources to handle new requests. A system that has insufficient memory and/or processing ability will process a huge database sort a lot slower than a better-endowed system with faster hard disks and CPUs. If response time is not satisfactory, you will either have to work with less data or increase the resources.

Response time is typically measured by dividing the queue length by the resource throughput. Response time, queues, and throughput are reported and calculated by the Windows 2000 reporting tools.
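
The relationship between uneven demand, queue length, and response time can be sketched as follows. The example recasts the 60-piece illustration as requests arriving at a resource that can service one request per minute; that capacity, the burst shape, and the function names are assumptions made for the sketch.

```python
def backlog_after_each_minute(arrivals: list, capacity_per_minute: int) -> list:
    """Track how many requests are still waiting at the end of every minute."""
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog = max(0, backlog + arrived - capacity_per_minute)
        history.append(backlog)
    return history


def response_time(queue_length: float, throughput: float) -> float:
    """Queue length divided by throughput, in the throughput's time unit."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return queue_length / throughput


even = [1] * 60                    # one request every minute for an hour
bursty = [0] * 45 + [4] * 15       # the same 60 requests, all in the last 15 minutes

even_queue = max(backlog_after_each_minute(even, 1))
bursty_queue = max(backlog_after_each_minute(bursty, 1))

print(f"Even load: worst queue {even_queue}")
print(f"Bursty load: worst queue {bursty_queue}, "
      f"wait about {response_time(bursty_queue, 1)} minutes")
```

The total work is identical in both cases; only the timing differs, and the timing alone is enough to build a queue that the resource cannot clear within the hour.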

How Performance Objects Work

Windows 2000 performance monitoring objects contain functionality known as performance counters. These so-called counters perform the actual analysis. For example, a hard disk object is able to calculate transfer rate, while a processor-associated object is able to calculate processor time.

To gain access to the data or to start the data collection, you first have to create the object and gain access to its functionality. This is done by calling a create function from a user interface or other process. As soon as the object is created, and its data collection functionality invoked, it begins the data-collection process and stores the data in various properties. Data can be streamed out to disk, in files, RAM, or to other components that assess the data and present it in some meaningful way.

Depending on the object, your analysis software can create at least one copy of the performance object and analyze the counter information it generates. You need to consult Microsoft documentation to “expose” the objects to determine if the object can be created more than once concurrently. If it can be created more than once, you will have to associate your application with the data the object collects by referencing the object’s instance counter. Windows 2000 allows you to instantiate an object for a local computer’s services, or you can create an object that collects data from a remote computer.

There are two methods of data collection and reporting made possible using performance objects. First, the objects can sample data. This means that data is collected periodically rather than when a particular event occurs. All forms of data collection place a burden on resources, which means that monitoring in itself can be a burden to systems. Sampled data has the advantage of being a period-driven load, but the disadvantage is that values may be inaccurate when a certain activity falls outside the sampling period or between events.
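
The weakness of sampling can be illustrated with a small sketch. It takes a made-up series of per-second memory readings that contains a brief spike and shows what a sampler polling every five seconds would record; the readings, the interval, and the function name are all hypothetical.

```python
def sample_every(readings: list, interval: int) -> list:
    """Keep only the readings a periodic sampler would actually see."""
    return readings[::interval]


# Hypothetical per-second memory use in MB; a short spike occurs at seconds 7 and 8.
per_second = [56, 56, 57, 56, 56, 57, 56, 140, 150, 57, 56, 56]

sampled = sample_every(per_second, 5)   # readings at seconds 0, 5, and 10

print(f"Peak seen by the sampler: {max(sampled)} MB")
print(f"Actual peak: {max(per_second)} MB")
```

Shortening the interval narrows the blind spot but raises the monitoring load, which is the trade-off the paragraph above describes.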

The other method of data collection is event tracing. Event tracing, new to Windows 2000, is able to collect data as certain events occur. Because there is no sampling window, you can correlate resource usage against events. For example, you can watch an application consume memory when it executes a certain function and monitor how and if it releases that memory when the function completes.

The disadvantage of event tracing is that it consumes more resources than sampling, so you would only want to perform event tracing for short periods where the objective of the trace is to troubleshoot, and not to just monitor.

Counters are able to report their data in one of two ways: instantaneous counting or average counting. An instantaneous counter displays the data as it happens; it is a snapshot. In other words, the counter does not compute the data it receives and just reports it. On the other hand, average counting computes the data for you. For example, it is able to compute bits per second, or pages per second, and so on. Other counters are able to report percentages, difference, and so on.
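
The difference between the two reporting styles can be sketched in Python. This is not how the Windows 2000 counters are implemented; it is a minimal model in which an instantaneous counter returns the latest reading as is, and an average counter derives a per-second rate from two readings of a running total. The `Sample` type and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float   # seconds since monitoring started
    value: float       # raw reading at that moment


def instantaneous(samples: list) -> float:
    """Report the most recent reading without any computation."""
    return samples[-1].value


def average_rate(first: Sample, last: Sample) -> float:
    """Compute units per second from two readings of a running total."""
    return (last.value - first.value) / (last.timestamp - first.timestamp)


# Hypothetical queue-length snapshots: the latest reading is reported unchanged.
queue_lengths = [Sample(0, 3), Sample(5, 7), Sample(10, 2)]
print(instantaneous(queue_lengths))

# Hypothetical running total of pages read, turned into pages per second.
print(average_rate(Sample(0, 1000), Sample(10, 1450)))
```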

System Monitoring Tools

Before you rush out and buy a software development environment to access the performance monitoring routines, you should know that Windows 2000 comes equipped with two primary, ready-to-go monitoring tools: the Performance Console and Task Manager. Task Manager provides an instant view of systems activity such as memory usage, processor activity, process activity, and resource consumption. Task Manager is very helpful for an immediate detection of system problems. The Performance Console is used to provide performance analysis and information that can be used for troubleshooting and bottleneck analysis. It can also be used to establish regular monitoring regimens such as ongoing server health analysis.

Performance Console comes with two tools built in: System Monitor and Performance Logs and Alerts, but more about them later. The first tool, due to its immediacy and as a troubleshooting and information tool, is the Task Manager.

Task Manager

Task Manager provides quick information on applications and services currently running on your server. It provides information such as processor usage in percentage terms, memory usage, task priority, response, and some statistics about memory and processor performance.

Task Manager is very useful as a quick system sanity check, and it is usually invoked as a troubleshooting tool when a system indicates slow response times, lockups or errors, or messages pointing to lack of system resources, and so on.

Task Manager, illustrated in Figure 20-2, is started in several ways:

1. Right-click the taskbar (the bottom-right area where the time is usually displayed) and select Task Manager from the Context menu.

2. Press Ctrl+Shift and hit the Esc key.

3. Press Ctrl+Alt and hit the Del key. The Windows Security dialog box loads. Click Task Manager.

Figure 20-2: The Task Manager opened to the Performance tab

When Task Manager loads, you will notice that the dialog box has three tabs: Applications, Processes, and Performance. There are a number of useful tricks with the Task Manager:

✦ The columns can be sorted in ascending or descending order by clicking the column heads. The columns can also be resized.

✦ When the Task Manager is running, a CPU gauge icon displaying accurate information is placed into the system tray on the bottom-right of the screen. If you hover your mouse cursor over this area, you will obtain a pop-up showing current CPU usage.

✦ It is also possible to keep the Task Manager button off the taskbar if you use it a lot. This is done by selecting the Options menu and then checking the Hide When Minimized option. The CPU icon next to the system time remains, however.

✦ You can control the rate of Refresh or Update from the View ➪ Update Speed menu. You can also pause the update to preserve resources and click Refresh Now to update the display at any time.

The Processes tab is the most useful and provides a list of running processes on the system. It measures their performance in simple data. These include CPU percent used, the CPU time allocated to a resource, and memory usage.

There are a number of additional performance or process measures that can be added to or removed from the list on the Processes page. Select View ➪ Select Columns. This will show the Select Columns dialog box that will allow you to add Process counters to, or remove them from, the Processes list.

A description of each process counter is available in Windows 2000 Help.

It is also possible to terminate a process by selecting the process in the list and then clicking the End Process button. Some processes are protected, but you can terminate them using the kill or remote kill utilities that are included in the operating system (see Appendix B for more information on kill and rkill). You will need authority to kill processes, and before you do, you should fully understand the ramifications of terminating a process.

The Performance tab (shown in Figure 20-2) allows you to graph the percentage of processor time in kernel mode. To show this, select the View menu and check the Show Kernel Times option. The Kernel Times is the measure of time that applications are using operating system services. The remaining time, known as User mode, is spent in threads that are spawned by applications.
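
The kernel versus user split can also be expressed as a simple calculation. The sketch below is not how Task Manager obtains its figures; it uses Python's standard `os.times()` call, which reports the user-mode and kernel-mode CPU seconds consumed by the current Python process only, and a hypothetical helper that turns the two numbers into a kernel-time percentage.

```python
import os


def kernel_share_percent(user_seconds: float, kernel_seconds: float) -> float:
    """Percentage of consumed CPU time that was spent in kernel mode."""
    total = user_seconds + kernel_seconds
    if total == 0:
        return 0.0
    return 100.0 * kernel_seconds / total


# CPU time used so far by this process (not the whole system).
cpu = os.times()
share = kernel_share_percent(cpu.user, cpu.system)
print(f"user {cpu.user:.2f}s, kernel {cpu.system:.2f}s, kernel share {share:.1f}%")
```

A process that spends most of its time in kernel mode is leaning heavily on operating system services such as disk or network I/O, which is the kind of pattern the Show Kernel Times graph makes visible system-wide.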

If your server supports multiple processors, you can click CPU History on the View menu and graph each processor in a single graph pane or in separate graph panes.

The Applications tab lists running applications and allows you to terminate an application that has become unresponsive or that you determine is in trouble or is the cause of trouble on the server.

Title	Service level and performance monitoring
Type	chapter
Year of publication	2000
