Process-Management Monitor
System process monitors can be a vital tool in determining the health of a running
machine. Ensuring that the required processes are running and that the total number
of each type of running process is appropriate is a good way to maintain system stability.
The downside of these types of monitors is that they let you know only which processes
are running and how many there are. They don’t give you an indication of the health of
each individual process.
This script dives a little deeper into the condition of processes. By using the ps
command with a customized format, we’ll be able to monitor the age, proportion of CPU
usage, virtual-memory consumption, and amount of CPU time consumed by a particular
process. If you are monitoring multiple instances of any given process, each instance will
be held up to the standard being monitored.
One other feature of this process monitor is that it can be configured not only to warn
you of impending peril from processes whose operational values are out of bounds, but
also to take action in the form of killing the aberrant process when necessary. The monitor
could be modified easily to perform other actions besides killing a process.
Using historical data, you can sometimes predict when a specific application will start
to consume too many resources. It was one such application I was working with that
prompted me to write this monitor. The monitor helped in characterizing exactly when
the application ran out of control and in finding the cause of the behavior. Both were very
helpful in fixing the problem.
The syntax for monitor configuration is fairly straightforward, with five
colon-separated fields as shown in the following example. The fields are as follows: the process
command, the indicator to track, a lower threshold, an upper threshold, and the kill
option. You can configure multiple processes by including several space-separated records
in the configuration string.
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
The first field is the process command itself. This will be slightly different, and
hopefully simpler, than the traditional ps -ef output. The ps -ef default output (-e for all
processes, -f for a full-format listing) includes the commands that are running, as well as
any arguments they were passed. The ps -eo comm output is formatted to include only
the commands that are running on a system without any path or argument information.
With this switch combination (-eo) you can also format your output in many ways to show many other options, such as memory size, process age, process CPU time, and
so on. (On some UNIX systems, you may need to define the UNIX95 variable within the script for the ps -eo command to function properly. The UNIX95 variable can be set to anything you’d like; it just needs to be defined.) When specifying the process for our script to monitor, you’ll want to use only the command name, as this is what the script will be looking for.
The second field contains the indicator you want to track. The options are cputime, which measures the number of minutes of CPU time the process has been allocated; etime, which
is the elapsed time in minutes since the process began running; pcpu, which represents the current percentage of the CPU capacity the process is consuming; and vsize, which shows the virtual-memory size in kilobytes for the process.
The third and fourth fields contain the desired lower and upper thresholds for the indicator you’re tracking.
The fifth and final field is the kill option. It is a value from 0 to 3:
0: Send a notification when either the low warning or high error threshold has been
crossed, but don’t kill the process.
1: Send a warning notification when the low threshold has been crossed or an error
notification when the high threshold has been crossed, and kill the process.
2: Send only a low-level warning notification when either the low or high threshold has
been crossed, and kill the process.
3: Kill the process without any notification at all.
Note that for safety, if the kill option is not set or is set to anything but one of the values outlined here, processes will not be killed. Notice that there are two levels of notification.
I have used alphanumeric paging for the high level (error status) and e-mail for the low level (warning status). You may want to implement the notification method as appropriate for your needs.
The first section of the script sets up a few configuration variables, which alternatively could be stored in a separate configuration file and sourced each time the script runs through the loop. This would allow for live configuration changes to the script. The debug value is for testing (any nonzero value prevents processes from actually being killed) and the sleeptime value represents the number of seconds to delay between each run. The kill_plist variable is the main configuration value that lets the script know what processes and values it should be watching.
#!/bin/sh
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
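
The idea of sourcing the settings from a separate file could look roughly like the following. This is only a sketch under stated assumptions: the config_file path and the load_config function are hypothetical names that do not appear in the original script, and the file is assumed to contain plain shell assignments such as sleeptime=10.

```shell
#!/bin/sh
# Built-in defaults, used when the file is missing or omits a value.
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

# Hypothetical location of the settings file (an assumption, not from the script).
config_file=${config_file:-./procmon.conf}

load_config ()
{
  # Re-read the settings only if the file exists and is readable.
  if [ -r "$config_file" ]
  then
    . "$config_file"
  fi
  return 0
}

load_config
echo "debug=$debug sleeptime=$sleeptime"
```

Calling load_config as the first statement inside the main loop would pick up edits to the file on the next pass without restarting the monitor.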
The following function performs all notifications and process terminations in the
script. It is called with seven sequentially numbered parameters. The positional variables
are somewhat difficult to understand and their values could have been assigned to more
meaningfully named variables before they were used, for ease of debugging later. To
streamline the script a little, I didn’t do this.
notify ()
{
  case $2 in
  0)
    # Warn/error level and don't kill
    echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
    ;;
  1)
    # Warn/error level and kill
    echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
    test "$debug" -eq 0 && kill "$4"
    ;;
  2)
    # Warning level only, and kill
    echo "Warning: $3 process id $4 found with $5 $7. Should be less than $6."
    test "$debug" -eq 0 && kill "$4"
    ;;
  3)
    # Just kill, don't warn at all
    test "$debug" -eq 0 && kill "$4"
    ;;
  *)
    # Unknown kill option: never kill, just complain
    echo "Warning: killoption not set correctly, please validate configuration."
    ;;
  esac
}
Here, for ease of reference, I define all of the command-line arguments passed to this
function:
$1: Text used to build the notification string; it distinguishes a warning
from an error
$2: The kill option, which has a possible value of 0-3
$3: The process name that is being monitored
$4: The process ID of the process being monitored
$5: The current value of the indicator you are tracking
$6: The monitor’s lower threshold
$7: The text equivalent of the indicator you are tracking
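
As a rough illustration of the more readable alternative mentioned earlier, the positional parameters can be copied into named variables first. This is a sketch, not the original function: notify_named and the n_ prefixed variable names are hypothetical, and the prefix is only there because plain sh has no local variables.

```shell
#!/bin/sh
debug=1    # nonzero: report only, never kill

notify_named ()
{
  n_level=$1      # "Warning" or "Error"
  n_killopt=$2    # kill option, 0-3
  n_name=$3       # process name being monitored
  n_pid=$4        # process ID
  n_current=$5    # current value of the indicator
  n_limit=$6      # lower threshold
  n_units=$7      # text describing the indicator

  n_msg="$n_name process id $n_pid found with $n_current $n_units. Should be less than $n_limit."

  case $n_killopt in
  0) echo "$n_level: $n_msg" ;;
  1) echo "$n_level: $n_msg"
     test "$debug" -eq 0 && kill "$n_pid" ;;
  2) echo "Warning: $n_msg"
     test "$debug" -eq 0 && kill "$n_pid" ;;
  3) test "$debug" -eq 0 && kill "$n_pid" ;;
  *) echo "Warning: killoption not set correctly, please validate configuration." ;;
  esac
  # Always report success so a skipped kill does not look like a failure.
  return 0
}

notify_named "Error" 2 sshd 1234 45 15 "percent of the CPU"
```

With debug left at 1 nothing is killed, so the call above only prints the warning line.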
This is also a good example of how a function can reduce the length and complexity
of a script. The body of this function is code that would have to be repeated eight times throughout the script if it were not placed in a function. An older version of this script was written this way. Putting the code into a function reduced the script’s length by roughly 40 percent.
The following code is the beginning of the main loop. The script is intended to be run
at system startup; it will then be run continuously through an infinite loop. After each iteration completes, the script will sleep for a predetermined time before the next iteration. The first part here is a nested loop that progresses through each record in the configuration string to parse its fields and set up the monitor.
while :
do
for pline in $kill_plist
do
process=`echo $pline | cut -d: -f1`
process="`echo $process | sed -e \"s/%20/ /g\"`"
type=`echo $pline | cut -d: -f2`
value=`echo $pline | awk -F: '{print $3}'`
errval=`echo $pline | awk -F: '{print $4}'`
killoption=`echo $pline | awk -F: '{print $5}'`
The process variable is assigned the first field in the configuration record (pline). It is possible that the process command name you’re monitoring will consist of more than one word, separated by spaces. Because the records in the configuration string are themselves separated by spaces, such a name must be written in the configuration with each space replaced by
%20, which is a commonly used substitute for the space character, as in URL encoding, for example. The sed command then converts each %20 back into a space.
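
A small worked example of the %20 convention may help. The record for "my daemon" is made up purely for illustration, and decode_name is a hypothetical helper wrapping the same sed substitution.

```shell
#!/bin/sh
# Hypothetical configuration: "my daemon" is an invented two-word command name.
kill_plist="my%20daemon:etime:60:120:0 sshd:pcpu:15:30:1"

decode_name ()
{
  # Turn every %20 back into a literal space.
  echo "$1" | sed -e 's/%20/ /g'
}

# The unquoted expansion splits the string on spaces, one record per pass.
for pline in $kill_plist
do
  process=$(echo "$pline" | cut -d: -f1)
  process=$(decode_name "$process")
  echo "monitoring: [$process]"
done
```

Had the name been written with a real space, the for loop would have split the record in two, which is why the encoding is needed.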
The type variable is the second field in the configuration record. As mentioned, it specifies the performance indicator to watch: cputime (amount of CPU time consumed), etime (elapsed time or age of process), pcpu (current percentage of the CPU consumed), or vsize (virtual-memory size).
The value variable holds the lower warning threshold for the monitored value, taken from the third field.
The errval variable is assigned the value of the upper error threshold for the monitored value, taken from the fourth field.
The killoption variable is assigned the final field of the configuration record and specifies an action to perform when the process deviates from the normal range.
If the kill option was not specified initially, we set it to be the default kill option. This makes sure no processes are killed unless one of the options for doing so is explicitly used.
if [ "$killoption" = "" ]
then
killoption=0
fi
test $debug -gt 0 && echo "Kill $process processes if $type is greater than $errval"
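
For comparison, the field extraction and the default kill option can also be handled by the shell itself, without cut or awk. The following is a sketch of that alternative, not the way the script does it; parse_record is a hypothetical name.

```shell
#!/bin/sh
parse_record ()
{
  # Split one record on colons into the five named fields.
  # IFS is changed only for the duration of the read command.
  IFS=: read -r process type value errval killoption <<EOF
$1
EOF
  # Fall back to the non-killing default when the fifth field is missing.
  killoption=${killoption:-0}
}

parse_record "dhcpd:pcpu:15:30"
echo "$process $type $value $errval $killoption"
```

The record above deliberately omits the fifth field, so the kill option falls back to 0.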
Next we pare down the full list of processes running on the system to the ones running
the command being monitored. Then we start a loop that iterates through the remaining
processes.
for pid in `ps -eo pid,comm | egrep "${process}$|${process}:$" | grep -v grep |
awk '{print $1}'`
do
For each process ID, the script has to gather the pertinent information. The embedded
ps command gathers only the specific information we want.
test $debug -gt 0 && echo "$process pid $pid"
pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm | \
grep $pid | egrep "${process}$|${process}:$" | grep -v grep`
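
One caveat with filtering on grep $pid is that the pattern matches anywhere in the line, so PID 123 would also select a line for PID 1234 running the same command. A stricter variant compares the first column exactly. The sketch below uses canned text in place of live ps output so the behavior is easy to see; select_pid_line and the sample lines are illustrative assumptions.

```shell
#!/bin/sh
select_pid_line ()
{
  # $1 is the PID to keep; ps-style output is read from standard input.
  # Only lines whose first column equals that PID are printed.
  awk -v want="$1" '$1 == want'
}

# Canned sample standing in for: ps -eo pid,cputime,etime,pcpu,vsize,comm
sample='  PID     TIME     ELAPSED %CPU    VSZ COMMAND
  123 00:00:01    01:02:03  0.0  12345 sshd
 1234 00:00:07       10:15  1.5  23456 sshd'

echo "$sample" | select_pid_line 123
```

In the monitor, the same awk filter could stand in for the grep $pid stage of the pipeline.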
The following case statement is the heart of the monitor. The script tests for the monitor
type (cputime, etime, pcpu, or vsize); the cputime is the first monitor type listed. The code for
each type is slightly different, but all are very similar. Here we obtain the process time from
the ps output, as well as the number of fields that the proc_time variable contains.
case $type in
"cputime")
proc_time=`echo $pid_string | awk '{print $2}'`
fields=`echo $proc_time | awk -F: '{print NF}'`
proc_time_min=`echo $proc_time | awk -F: '{print $(NF-1)}'`
Both of these are needed because the format of the time value varies depending on the
amount of time it represents. The cputime and etime variables have values of the form
days-hours:minutes:seconds, hours:minutes:seconds, or just minutes:seconds. A low value
might look something like 00:28 for 28 seconds. A high value could be 1-18:32:29 for 1 day,
18 hours, 32 minutes, and 29 seconds. Both of these types have to be processed and
converted to minutes. (Seconds are dropped for simplicity.)
Of the four performance indicators, the logic for handling the cputime and etime values
is the most complex because the format used to report them changes depending on the
amount of time these values represent.
if [ $fields -lt 3 ]
then
proc_time_hr=0
proc_time_day=0
else
proc_time_hr=`echo $proc_time | awk -F: '{print $(NF-2)}'`
fields=`echo $proc_time_hr | awk -F- '{print NF}'`
if [ $fields -ne 1 ]
then
proc_time_day=`echo $proc_time_hr | awk -F- '{print $1}'`
proc_time_hr=`echo $proc_time_hr | awk -F- '{print $2}'`
else
proc_time_day=0
fi
fi
Once all time values have been determined, we convert them to minutes for comparison with the monitor thresholds.
curr_cpu_time=\
`echo "$proc_time_day*1440+$proc_time_hr*60+$proc_time_min"\
| bc`
test $debug -gt 0 && echo "Current cpu time for \
$process pid $pid is $curr_cpu_time minutes"
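
To make the conversion concrete, the parsing and arithmetic above can be condensed into a single awk call and tried on the two sample values mentioned earlier. This is a sketch of the same technique, not a replacement taken from the script; to_minutes is a hypothetical name.

```shell
#!/bin/sh
to_minutes ()
{
  # Accepts mm:ss, hh:mm:ss or dd-hh:mm:ss and prints whole minutes.
  echo "$1" | awk -F: '{
    min = $(NF-1)              # minutes are always the second-to-last field
    hr = 0; day = 0
    if (NF >= 3) {
      hr = $(NF-2)             # either "18" or "1-18"
      if (split(hr, parts, "-") == 2) {
        day = parts[1]
        hr = parts[2]
      }
    }
    print day * 1440 + hr * 60 + min   # seconds are dropped
  }'
}

to_minutes 00:28         # prints 0
to_minutes 02:05:10      # prints 125
to_minutes 1-18:32:29    # prints 2552
```

For 1-18:32:29 the arithmetic is 1*1440 + 18*60 + 32 = 2552 minutes.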
If the current cputime value is between the warning and error thresholds, we call the notify() function with the appropriate switches. It will handle output and process termination, as described earlier.
if test $curr_cpu_time -gt $value -a \
$curr_cpu_time -lt $errval
then
notify "Warning" $killoption $process $pid \
$curr_cpu_time $value "minutes of CPU time"
If the current cputime is greater than or equal to the error threshold, we call the notify() function with a different set of options.
elif test $curr_cpu_time -ge $errval
then
notify "Error" $killoption $process $pid \
$curr_cpu_time $value "minutes of CPU time"
The final condition handles the case where there is no issue with the running process: the script just issues a message saying so.
else
test $debug -gt 0 && echo "process cpu time ok"
fi
;;
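
The three-way comparison used here recurs for every indicator, so its boundary behavior is worth seeing in isolation. The sketch below factors it into a helper; classify is a hypothetical function that does not exist in the original script, which repeats the test inline instead.

```shell
#!/bin/sh
classify ()
{
  # $1 = current value, $2 = lower (warning) threshold, $3 = upper (error) threshold
  if test "$1" -ge "$3"
  then
    echo error
  elif test "$1" -gt "$2"
  then
    echo warning
  else
    echo ok
  fi
}

classify 10 15 30    # prints ok
classify 15 15 30    # prints ok
classify 20 15 30    # prints warning
classify 30 15 30    # prints error
```

A value exactly equal to the lower threshold is still OK, while a value exactly equal to the upper threshold already counts as an error.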
The etime monitor is nearly the same as the cputime monitor. The primary difference is
the field that is extracted from the ps output to get the current process age.
"etime")
proc_age=`echo $pid_string | awk '{print $3}'`
fields=`echo $proc_age | awk -F: '{print NF}'`
proc_age_min=`echo $proc_age | awk -F: '{print $(NF-1)}'`
Once again, you convert the age of the process to values that will then be used to
calculate the age in minutes.
if [ $fields -lt 3 ]
then
proc_age_hr=0
proc_age_day=0
else
proc_age_hr=`echo $proc_age | awk -F: '{print $(NF-2)}'`
fields=`echo $proc_age_hr | awk -F- '{print NF}'`
if [ $fields -ne 1 ]
then
proc_age_day=`echo $proc_age_hr | awk -F- '{print $1}'`
proc_age_hr=`echo $proc_age_hr | awk -F- '{print $2}'`
else
proc_age_day=0
fi
fi
Expressing the process age in minutes makes the threshold check very simple.
curr_age=\
`echo "$proc_age_day*1440+$proc_age_hr*60+$proc_age_min" \
| bc`
test $debug -gt 0 && echo "Current age of $process pid \
$pid is $curr_age minutes"
We now perform the comparison checks against the monitor thresholds as before. The
first check determines if the current process age is between the low and high thresholds.
The second sees if the current age is at or above the high threshold. In both these cases, call the
notify() function for end-user output and process termination. The final possibility is that
there is no issue, and in this case the script gives a message stating that the process is OK.
if test $curr_age -gt $value -a $curr_age -lt $errval
then
notify "Warning" $killoption $process $pid \
$curr_age $value "minutes of elapsed time"
elif test $curr_age -ge $errval
then
notify "Error" $killoption $process $pid \
$curr_age $value "minutes of elapsed time"
else
test $debug -gt 0 && echo "process age ok"
fi
;;
The test for percentage CPU usage is quite simple. The value to be compared to the thresholds is obtained directly from the ps output. There is no need for further calculation
as was needed in the code for the cputime and etime monitors; only the fractional part is stripped, because the test command compares integers.
"pcpu")
curr_proc_cpu=`echo $pid_string | awk '{print $4}' | \
awk -F. '{print $1}'`
test $debug -gt 0 && echo "Current percent cpu of \
$process pid $pid is $curr_proc_cpu"
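
The reason for dropping the fraction is that ps reports pcpu as a decimal such as 12.5, while test -gt and test -lt accept only integers. As an alternative sketch, the shell's own parameter expansion can do the same truncation; int_part is a hypothetical helper.

```shell
#!/bin/sh
int_part ()
{
  # Remove the first "." and everything after it: "12.5" becomes "12".
  echo "${1%%.*}"
}

int_part 12.5    # prints 12
int_part 0.0     # prints 0
int_part 99      # prints 99
```

This truncates rather than rounds, so a reading of 29.9 is compared as 29.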
Once again, we compare the percentage CPU value with the configured low and high thresholds and call the notify() function to alert the user and perform any required process termination. If the CPU percentage does not exceed the lower threshold, the code outputs
an “OK” message.
if test $curr_proc_cpu -gt $value -a \
$curr_proc_cpu -lt $errval
then
notify "Warning" $killoption $process $pid \
$curr_proc_cpu $value "percent of the CPU"
elif test $curr_proc_cpu -ge $errval
then
notify "Error" $killoption $process $pid \
$curr_proc_cpu $value "percent of the CPU"
else
test $debug -gt 0 && echo "process cpu percent ok"
fi
;;
The vsize monitor is as simple as the percent-CPU monitor. We obtain the current process’s memory footprint directly from the ps output.
"vsize")
curr_proc_size=`echo $pid_string | awk '{print $5}'`
test $debug -gt 0 && echo "Current size of $process pid \
$pid is $curr_proc_size"
We have to check the current memory size against the monitor thresholds one last time. If it falls in the warning or error range, we call the notify() function for output and termination. If not, the code outputs that the process size is OK.
if test $curr_proc_size -gt $value -a \
$curr_proc_size -lt $errval
then
notify "Warning" $killoption $process $pid \
$curr_proc_size $value "blocks of virtual size"
elif test $curr_proc_size -ge $errval
then
notify "Error" $killoption $process $pid \
$curr_proc_size $value "blocks of virtual size"
else
test $debug -gt 0 && echo "process virtual size ok"
fi
;;
Finally we close the monitor case statement and the two inner processing loops. The
script then goes to sleep for the configured amount of time before starting over again. It
will then continue its monitoring until the monitor itself dies or is killed or the system is
shut down.
esac
done
done
sleep $sleeptime
done