Process-Management Monitor


System process monitors can be a vital tool in determining the health of a running machine. Ensuring that the required processes are running and that the total number of each type of running process is appropriate is a good way to maintain system stability. The downside of these types of monitors is that they let you know only which processes are running and how many there are. They don’t give you an indication of the health of each individual process.

This script dives a little deeper into the condition of processes. By using the ps command with a customized format, we’ll be able to monitor the age, proportion of CPU usage, virtual-memory consumption, and amount of CPU time consumed by a particular process. If you are monitoring multiple instances of any given process, each instance will be held up to the standard being monitored.

One other feature of this process monitor is that it can be configured not only to warn you of impending peril from processes whose operational values are out of bounds, but also to take action in the form of killing the aberrant process when necessary. The monitor could be modified easily to perform other actions besides killing a process.

Using historical data, you can sometimes predict when a specific application will start to consume too many resources. It was one such application I was working with that prompted me to write this monitor. The monitor helped in characterizing exactly when the application ran out of control and in finding the cause of the behavior. Both were very helpful in fixing the problem.

The syntax for monitor configuration is fairly straightforward, with five colon-separated fields as shown in the following example. The fields are as follows: the process command, the indicator to track, a lower threshold, an upper threshold, and the kill option. You can configure multiple processes by including several space-separated records in the configuration string.

kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

The first field is the process command itself. This will be slightly different from, and hopefully simpler than, the traditional ps -ef output. The ps -ef default output (-e for all processes, -f for full-format output) includes the commands that are running, as well as any arguments they were passed. The ps -eo comm output is formatted to include only the commands that are running on a system without any path or argument information. With this switch combination (-eo) you can also format your output in many ways to show many other options, such as memory size, process age, process CPU time, and so on. (On some UNIX systems, you may need to define the UNIX95 variable within the script for the ps -eo command to function properly. The UNIX95 variable can be set to anything you’d like; it just needs to not be undefined.) When specifying the process for our script to monitor, you’ll want to use only the command name, as this is what the script will be looking for.

The second field contains the indicator you want to track. The options are cputime, which measures the number of minutes the CPU has allocated to the process; etime, which is the elapsed time in minutes since the process began running; pcpu, which represents the current percentage of the CPU capacity the process is consuming; and vsize, which shows the virtual-memory size in kilobytes for the process.

The third and fourth fields contain the desired lower and upper thresholds for the indicator you’re tracking.

The fifth and final field is the kill option. It is a value from 0 to 3:

0: Send a notification when either the low warning or high error threshold has been crossed, but don’t kill the process.
1: Send a warning notification when the low threshold has been crossed or an error notification when the high threshold has been crossed, and kill the process.
2: Send only a low-level warning notification when either the low or high threshold has been crossed, and kill the process.
3: Kill the process without any notification at all.

Note that for safety, if the kill option is not set or is set to anything but one of the values outlined here, processes will not be killed. Notice that there are two levels of notification. I have used alphanumeric paging for the high level (error status) and e-mail for the low level (warning status). You may want to implement the notification method as appropriate for your needs.
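The safety rule can be sketched as a pair of small helpers. The names should_kill and notify_level are hypothetical and do not appear in the script itself; they only restate the table of kill options in code, assuming the option arrives as a plain string.

```shell
# Hypothetical helper (not part of the original script): succeed only when
# the kill option explicitly asks for termination (1, 2, or 3).
should_kill ()
{
  case "$1" in
    1|2|3) return 0 ;;
    *)     return 1 ;;   # 0, empty, or anything unexpected: never kill
  esac
}

# Hypothetical helper: print the notification behaviour an option implies.
notify_level ()
{
  case "$1" in
    0|1) echo "warning-or-error" ;;   # warning below the high threshold, error above
    2)   echo "warning-only" ;;       # always the low-level notification
    3)   echo "none" ;;               # silent kill
    *)   echo "invalid" ;;
  esac
}
```

Keeping the "anything else means don't kill" decision in one place makes it harder for a typo in the configuration string to terminate a process by accident.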

The first section of the script sets up a few configuration variables, which alternatively could be stored in a separate configuration file and sourced each time the script runs through the loop. This would allow for live configuration changes to the script. The debug value is for testing and the sleeptime value represents the amount of time to delay between each run. The kill_plist variable is the main configuration value that lets the script know what processes and values it should be watching.

#!/bin/sh
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
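The alternative mentioned above, keeping the settings in a file that is re-read on every pass, might look roughly like the following. This is only a sketch: the file path, the PROCMON_CONF override variable, and the load_config function are assumptions made for illustration, not part of the original script.

```shell
# Hypothetical location; the original script keeps these values inline.
config_file=${PROCMON_CONF:-/usr/local/etc/procmon.conf}

# Built-in defaults, used when the file is missing or omits a setting.
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

# Re-read the configuration file if it is readable. The file is sourced,
# so it must contain plain shell assignments such as:
#   sleeptime=10
#   kill_plist="dhcpd:pcpu:15:30:1"
load_config ()
{
  if [ -r "$config_file" ]
  then
    . "$config_file"
  fi
}

# Inside the main loop this would be called at the top of every pass:
#   while :
#   do
#     load_config
#     ...
#     sleep $sleeptime
#   done
```

Because the file is executed as shell code, it should be writable only by the user who runs the monitor.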


The following function performs all notifications and process terminations in the script. It is called with seven sequentially numbered parameters. The positional variables are somewhat difficult to understand and their values could have been assigned to more meaningfully named variables before they were used, for ease of debugging later. To streamline the script a little, I didn’t do this.

notify ()
{
  case $2 in
    0)
      # Warn/error level and don't kill
      echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
    ;;
    1)
      # Warn/error level and kill
      echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
      test $debug -eq 0 && kill $4
    ;;
    2)
      # Warning level only, and kill
      echo "Warning: $3 process id $4 found with $5 $7. Should be less than $6."
      test $debug -eq 0 && kill $4
    ;;
    3)
      # Just kill, don't warn at all
      test $debug -eq 0 && kill $4
    ;;
    *)
      echo "Warning: killoption not set correctly, please validate configuration."
    ;;
  esac
}

Here, for ease of reference, I define all of the command-line arguments passed to this function:

$1: Text used to build the notification string; it distinguishes a warning from an error
$2: The kill option, which has a possible value of 0-3
$3: The process name that is being monitored
$4: The process ID of the process being monitored
$5: The current value of the indicator you are tracking
$6: The monitor’s lower threshold
$7: The text equivalent of the indicator you are tracking

This is also a good example of how a function can reduce the length and complexity of a script. The body of this function is code that would have to be repeated eight times throughout the script if it were not placed in a function. An older version of this script was written this way. Putting the code into a function reduced the script’s length by roughly 40 percent.
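For comparison, here is one way the function might look if the positional parameters were first copied into named variables, as suggested earlier. This is a sketch only: notify_named, maybe_kill, and the variable names are illustrative choices, and the behaviour is meant to follow the kill-option table rather than to replace the function above.

```shell
debug=1   # as in the script: any value other than 0 suppresses the kill

# Hypothetical helper: kill the given PID only when debugging is off.
maybe_kill ()
{
  if [ "$debug" -eq 0 ]
  then
    kill "$1"
  fi
}

notify_named ()
{
  level=$1        # "Warning" or "Error"
  killoption=$2   # kill option, 0-3
  process=$3      # command name being monitored
  pid=$4          # process ID
  current=$5      # current value of the indicator
  threshold=$6    # lower threshold
  units=$7        # text describing the indicator

  msg="$process process id $pid found with $current $units. Should be less than $threshold."

  case $killoption in
    0) echo "$level: $msg" ;;
    1) echo "$level: $msg"
       maybe_kill "$pid" ;;
    2) echo "Warning: $msg"
       maybe_kill "$pid" ;;
    3) maybe_kill "$pid" ;;
    *) echo "Warning: killoption not set correctly, please validate configuration." ;;
  esac
}
```

Plain sh has no local variables, so these names are global; they are chosen to match what the main loop already uses.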

The following code is the beginning of the main loop. The script is intended to be run at system startup; it will then run continuously through an infinite loop. After each iteration completes, the script will sleep for a predetermined time before the next iteration. The first part here is a nested loop that progresses through each record in the configuration string to parse its fields and set up the monitor.

while :
do
  for pline in $kill_plist
  do
    process=`echo $pline | cut -d: -f1`
    process="`echo $process | sed -e \"s/%20/ /g\"`"
    type=`echo $pline | cut -d: -f2`
    value=`echo $pline | awk -F: '{print $3}'`
    errval=`echo $pline | awk -F: '{print $4}'`
    killoption=`echo $pline | awk -F: '{print $5}'`

The process variable is assigned the first field in the configuration record (pline). It is possible that the process command name you’re monitoring will consist of more than one word, separated by spaces. Because the records in the configuration string are themselves separated by spaces, such spaces must be written in the record as %20, which is a commonly used substitute for the space character, as in URL encoding, for example. The sed command then converts each %20 back into a space.

The type variable is the second field in the configuration record. As mentioned, it specifies the performance indicator to watch: cputime (amount of CPU time consumed), etime (elapsed time or age of process), pcpu (current percentage of the CPU consumed), or vsize (virtual-memory size).

The value variable holds the lower warning threshold for the monitored value, taken from the third field.

The errval variable is assigned the value of the upper error threshold for the monitored value, taken from the fourth field.

The killoption variable is assigned the final field of the configuration record and specifies an action to perform when the process deviates from the normal range.

If the kill option was not specified initially, we set it to be the default kill option. This makes sure no processes are killed unless one of the options for doing so is explicitly used.


    if [ "$killoption" = "" ]
    then
      killoption=0
    fi
    test $debug -gt 0 && echo "Kill $process processes if $type is greater than $errval"
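The field handling above starts cut, awk, and sed several times per record. As an alternative sketch, the shell itself can split the record on colons. The function name parse_record is hypothetical; it fills in the same variable names the script uses and applies the same default kill option of 0.

```shell
# Hypothetical helper: split one configuration record into the variables
# process, type, value, errval, and killoption.
parse_record ()
{
  old_ifs=$IFS
  set -f            # no filename expansion while $1 is expanded unquoted
  IFS=:
  set -- $1         # the five fields become $1 .. $5
  IFS=$old_ifs
  set +f

  # %20 in the record stands for a space in the command name
  process=$(printf '%s\n' "$1" | sed -e 's/%20/ /g')
  type=$2
  value=$3
  errval=$4
  killoption=${5:-0}   # missing kill option: notify only, never kill
}
```

For example, after parse_record "my%20daemon:etime:60:120" the process variable holds "my daemon" and killoption holds 0.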

Next we pare down the full list of processes running on the system to the ones running the command being monitored. Then we start a loop that iterates through the remaining processes.

    for pid in `ps -eo pid,comm | egrep "${process}$|${process}:$" | grep -v grep | awk '{print $1}'`
    do

For each process ID, the script has to gather the pertinent information. The embedded ps command gathers only the specific information we want.

      test $debug -gt 0 && echo "$process pid $pid"
      pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm | \
        grep $pid | egrep "${process}$|${process}:$" | grep -v grep`
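One caveat with grep $pid is that it matches those digits anywhere on the line, so PID 123 would also select the line for PID 1234. A possible way around this, sketched here with a hypothetical helper called select_pid_line that reads ps output on standard input, is to compare whole columns with awk. The sketch assumes single-word command names in the last column.

```shell
# Hypothetical helper: print the ps line whose first column equals the given
# PID and whose last column is the given command name, with or without a
# trailing colon (the same two forms the egrep pattern allows for).
select_pid_line ()
{
  awk -v pid="$1" -v cmd="$2" \
    '$1 == pid && ($NF == cmd || $NF == (cmd ":")) { print }'
}

# Possible use inside the monitor:
#   pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm | select_pid_line "$pid" "$process"`
```

Unlike the egrep pattern, which is anchored only at the end of the line, this requires the command name to match exactly.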

The following case statement is the heart of the monitor. The script tests for the monitor type (cputime, etime, pcpu, or vsize); the cputime is the first monitor type listed. The code for each type is slightly different, but all are very similar. Here we obtain the process time from the ps output, as well as the number of fields that the proc_time variable contains.

      case $type in
        "cputime")
          proc_time=`echo $pid_string | awk '{print $2}'`
          fields=`echo $proc_time | awk -F: '{print NF}'`
          proc_time_min=`echo $proc_time | awk -F: '{print $(NF-1)}'`

Both of these are needed because the format of the time value varies depending on the amount of time it represents. The cputime and etime variables have values of the form days-hours:minutes:seconds or hours:minutes:seconds. A low value might look something like 00:28 (minutes:seconds) for 28 seconds. A high value could be 1-18:32:29 for 1 day, 18 hours, 32 minutes, and 29 seconds. Both of these types have to be processed and converted to minutes. (Seconds are dropped for simplicity.)

Of the four performance indicators, the logic for handling the cputime and etime values is the most complex because the format used to report them changes depending on the amount of time these values represent.

          if [ $fields -lt 3 ]
          then
            proc_time_hr=0
            proc_time_day=0
          else
            proc_time_hr=`echo $proc_time | awk -F: '{print $(NF-2)}'`
            fields=`echo $proc_time_hr | awk -F- '{print NF}'`
            if [ $fields -ne 1 ]
            then
              proc_time_day=`echo $proc_time_hr | awk -F- '{print $1}'`
              proc_time_hr=`echo $proc_time_hr | awk -F- '{print $2}'`
            else
              proc_time_day=0
            fi
          fi

Once all time values have been determined, we convert them to minutes for comparison with the monitor thresholds.

          curr_cpu_time=`echo "$proc_time_day*1440+$proc_time_hr*60+$proc_time_min" | bc`
          test $debug -gt 0 && echo "Current cpu time for $process pid $pid is $curr_cpu_time minutes"
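The same conversion can be packaged as a single function, which makes it easy to check against the two example values given in the text. This is a sketch, not the script's own code: the name time_to_minutes is hypothetical, and the input is assumed to have the days-hours:minutes:seconds, hours:minutes:seconds, or minutes:seconds form described above.

```shell
# Hypothetical helper: convert a ps time value to whole minutes.
time_to_minutes ()
{
  echo "$1" | awk -F: '{
    days = 0; hours = 0
    minutes = $(NF-1)                # seconds ($NF) are dropped
    if (NF >= 3) {
      hours = $(NF-2)
      n = split(hours, parts, "-")   # "1-18" means 1 day, 18 hours
      if (n == 2) { days = parts[1]; hours = parts[2] }
    }
    print days * 1440 + hours * 60 + minutes
  }'
}
```

With the examples from the text, time_to_minutes 00:28 prints 0 and time_to_minutes 1-18:32:29 prints 2552 (1440 + 1080 + 32).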

If the current cputime value is between the warning and error thresholds, we call the notify() function with the appropriate switches. It will handle output and process termination, as described earlier.

          if test $curr_cpu_time -gt $value -a \
            $curr_cpu_time -lt $errval
          then
            notify "Warning" $killoption "$process" $pid \
              $curr_cpu_time $value "minutes of CPU time"

If the current cputime is greater than or equal to the error threshold, we call the notify() function with a different set of options.

          elif test $curr_cpu_time -ge $errval
          then
            notify "Error" $killoption "$process" $pid \
              $curr_cpu_time $value "minutes of CPU time"

The final condition handles the case where there is no issue with the running process: the script just issues a message saying so.

          else
            test $debug -gt 0 && echo "process cpu time ok"
          fi
        ;;
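The three-way comparison just shown is repeated for each indicator, so it can be sketched once on its own. The helper name classify is hypothetical and not part of the script; it takes an integer value plus the lower and upper thresholds and prints ok, warning, or error using the same boundaries as the test expressions above.

```shell
# Hypothetical helper: classify an integer value against two thresholds.
classify ()
{
  current=$1; low=$2; high=$3
  if [ "$current" -ge "$high" ]
  then
    echo error        # at or above the upper threshold
  elif [ "$current" -gt "$low" ]
  then
    echo warning      # strictly between the two thresholds
  else
    echo ok           # at or below the lower threshold
  fi
}
```

Note the boundaries: with thresholds 15 and 30, a value of exactly 15 is still ok, while exactly 30 is already an error.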


The etime monitor is nearly the same as the cputime monitor. The primary difference is the field that is extracted from the ps output to get the current process age.

        "etime")
          proc_age=`echo $pid_string | awk '{print $3}'`
          fields=`echo $proc_age | awk -F: '{print NF}'`
          proc_age_min=`echo $proc_age | awk -F: '{print $(NF-1)}'`

Once again, you convert the age of the process to values that will then be used to calculate the age in minutes.

          if [ $fields -lt 3 ]
          then
            proc_age_hr=0
            proc_age_day=0
          else
            proc_age_hr=`echo $proc_age | awk -F: '{print $(NF-2)}'`
            fields=`echo $proc_age_hr | awk -F- '{print NF}'`
            if [ $fields -ne 1 ]
            then
              proc_age_day=`echo $proc_age_hr | awk -F- '{print $1}'`
              proc_age_hr=`echo $proc_age_hr | awk -F- '{print $2}'`
            else
              proc_age_day=0
            fi
          fi

Expressing the process age in minutes makes the threshold check very simple.

          curr_age=`echo "$proc_age_day*1440+$proc_age_hr*60+$proc_age_min" | bc`
          test $debug -gt 0 && echo "Current age of $process pid $pid is $curr_age minutes"

We now perform the comparison checks against the monitor thresholds as before. The first check determines if the current process age is between the low and high thresholds. The second sees if the current age is at or above the high threshold. In both these cases, call the notify() function for end-user output and process termination. The final possibility is that there is no issue, and in this case the script gives a message stating that the process is OK.

          if test $curr_age -gt $value -a $curr_age -lt $errval
          then
            notify "Warning" $killoption "$process" $pid \
              $curr_age $value "minutes of elapsed time"
          elif test $curr_age -ge $errval
          then
            notify "Error" $killoption "$process" $pid \
              $curr_age $value "minutes of elapsed time"
          else
            test $debug -gt 0 && echo "process age ok"
          fi
        ;;

The test for percentage CPU usage is quite simple. The value to be compared to the thresholds is obtained directly from the ps output. There is no need for further calculation as was needed in the code for the cputime and etime monitors; only the fractional part is dropped so that the value can be compared as an integer.

        "pcpu")
          curr_proc_cpu=`echo $pid_string | awk '{print $4}' | \
            awk -F. '{print $1}'`
          test $debug -gt 0 && echo "Current percent cpu of $process pid $pid is $curr_proc_cpu"
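The ps command reports pcpu with a decimal fraction (12.5, for example), while test compares only integers, so the fraction has to be dropped before the comparison. The shell can do that truncation itself with parameter expansion. The helper name int_part below is an illustrative assumption, not something the script defines.

```shell
# Hypothetical helper: print the integer part of a decimal number.
int_part ()
{
  whole=${1%%.*}        # drop everything from the first "." onward
  echo "${whole:-0}"    # ".5" would leave an empty string, so fall back to 0
}
```

Truncating rather than rounding means that 29.9 percent is compared as 29, which is worth keeping in mind when choosing thresholds.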

Once again, we compare the percentage CPU value with the configured low and high thresholds and call the notify() function to alert the user and perform any required process termination. If the CPU percentage has not crossed either of these values, the code outputs an “OK” message.

          if test $curr_proc_cpu -gt $value -a \
            $curr_proc_cpu -lt $errval
          then
            notify "Warning" $killoption "$process" $pid \
              $curr_proc_cpu $value "percent of the CPU"
          elif test $curr_proc_cpu -ge $errval
          then
            notify "Error" $killoption "$process" $pid \
              $curr_proc_cpu $value "percent of the CPU"
          else
            test $debug -gt 0 && echo "process cpu percent ok"
          fi
        ;;

The vsize monitor is as simple as the percent-CPU monitor. We obtain the current process’s memory footprint directly from the ps output.

        "vsize")
          curr_proc_size=`echo $pid_string | awk '{print $5}'`
          test $debug -gt 0 && echo "Current size of $process pid $pid is $curr_proc_size"

We have to check the current memory size against the monitor thresholds one last time. If it is within the warning or error range, we call the notify() function for output and termination. If not, the code outputs that the process size is OK.


          if test $curr_proc_size -gt $value -a \
            $curr_proc_size -lt $errval
          then
            notify "Warning" $killoption "$process" $pid \
              $curr_proc_size $value "blocks of virtual size"
          elif test $curr_proc_size -ge $errval
          then
            notify "Error" $killoption "$process" $pid \
              $curr_proc_size $value "blocks of virtual size"
          else
            test $debug -gt 0 && echo "process virtual size ok"
          fi
        ;;

Finally we close the monitor case statement and the two inner processing loops. The script then goes to sleep for the configured amount of time before starting over again. It will then continue its monitoring until the monitor itself dies or is killed or the system is shut down.

      esac
    done
  done
  sleep $sleeptime
done
