Figure 10-1. Nagios service detail screen for the system localhost
If you're following along with the book in an environment of your own, you'll notice a problem: there isn't a check_https command definition in the sample Nagios configuration, so we had to create one.
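A command definition along these lines does the job (this is a sketch rather than the book's exact definition; it assumes the standard $USER1$ plug-in path macro):

    define command{
            command_name    check_https
            command_line    $USER1$/check_http -I $HOSTADDRESS$ --ssl
            }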
This new command object definition calls the check_http plug-in with the appropriate arguments to test an HTTPS-enabled web site. Once this was copied to our Nagios server, the check cleared in Nagios.
Nagios is now in a fully functional state in our environment, but we don't find it very useful to monitor only a single machine. Next, we'll take steps to monitor the rest of the hosts at our site. The first step will be to deploy a local monitoring agent called NRPE to all our systems.
NRPE
NRPE is the Nagios Remote Plug-in Executor. It is used in place of agents and protocols such as SNMP for remotely monitoring hosts. It grants access to remote hosts to execute plug-ins such as those in the Nagios plug-ins distribution. NRPE has two components: a daemon called nrpe and a plug-in to the Nagios daemon called check_nrpe.
The NRPE documentation points out that there are other ways to accomplish remote plug-in execution, such as the Nagios check_by_ssh plug-in. Although running plug-ins over SSH on a remote host seems attractive for security reasons, it imposes more overhead on remote hosts than the NRPE program does. In addition, a site's security policy may expressly forbid that sort of remote SSH access. NRPE, by contrast, is lightweight, flexible, and fast.
Step 15: Building NRPE
The NRPE source distribution does not include an installation facility. Once it is built, the resulting binaries have to be copied into place by hand. We copied check_nrpe to the preexisting nagios-plugins directory for the debian.i686 architecture and copied the nrpe program itself into the single shared PROD/repl/root/usr/pkg/nrpe-2.12-bin directory. We repeated the build on Red Hat, except that we copied the plug-ins to the nagios-plugins directory for the redhat.i686 architecture and the nrpe binary to nrpe-2.12-bin/nrpe-redhat.i686.
We copied the sample configuration file from the NRPE source distribution (sample-config/nrpe.cfg) to the cfengine master at PROD/repl/root/usr/pkg/nrpe-conf/nrpe.cfg, then edited the nrpe.cfg file to use the /usr/pkg/nagios-plugins/libexec directory for all the paths and allow access from our etchlamp system, as shown:
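The edited portions of an nrpe.cfg along those lines look roughly like this sketch (the allowed_hosts address for etchlamp and the exact command definitions are assumptions; the directive names are standard NRPE settings):

    pid_file=/var/run/nrpe.pid
    server_port=5666
    nrpe_user=nagios
    nrpe_group=nagios
    # allow the Nagios server (etchlamp) to connect; the address is an assumption
    allowed_hosts=127.0.0.1,192.168.1.13
    # plug-in paths point at the shared nagios-plugins libexec directory
    command[check_load]=/usr/pkg/nagios-plugins/libexec/check_load -w 15,10,5 -c 30,25,20
    command[check_disk]=/usr/pkg/nagios-plugins/libexec/check_disk -w 20% -c 10% -p /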
At this point, we have the NRPE programs built and ready for distribution from the cfengine master, along with a configuration file. The last thing we need to prepare for NRPE is a start-up script.
Step 17: Creating an NRPE Start-up Script
We created a simple init script for NRPE at PROD/repl/root/etc/init.d/nrpe on the cfengine master with these contents:
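A minimal script in that spirit might look like the following (a sketch; the paths to the nrpe binary and configuration file are assumptions based on the directories used earlier):

    #!/bin/sh
    # Simple start/stop script for the NRPE daemon
    NRPE=/usr/pkg/nrpe/bin/nrpe
    CONF=/usr/pkg/nrpe-conf/nrpe.cfg

    case "$1" in
      start)
        $NRPE -c $CONF -d
        ;;
      stop)
        # kill by process name; see the note about the PID file below
        pkill -x nrpe
        ;;
      restart)
        $0 stop
        sleep 1
        $0 start
        ;;
      *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
        ;;
    esac
    exit 0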
This is a very simple init script, but it suffices because NRPE is a very simple daemon. We kill the daemon with the pkill command because, in writing this chapter, we found that occasionally the PID of the nrpe process wasn't properly stored in the nrpe.pid file. Daemons occasionally have bugs such as this, so we simply work around them with some extra measures, in this case killing the daemon with the pkill command.
Step 18: Copying NRPE Using cfengine
We now have everything we need to deploy NRPE at our site. To distribute NRPE with cfengine, we created a task to distribute the configuration file, init script, and binaries in a file named PROD/inputs/tasks/app/nagios/cf.nrpe_sync. Here's the file, which we will describe only briefly after showing the contents, because we're not introducing any new cfengine functionality in this task:
    processes:
        "nrpe" restart "/etc/init.d/nrpe start" inform=true umask=0
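That processes stanza starts NRPE if it isn't running. The copy and links portions of such a task would look roughly like the following sketch (the variable names, file modes, destination layout, and runlevel link names here are assumptions, not the book's exact entries):

    control:
        any::
            nrpe_ver = ( nrpe-2.12 )

    copy:
        debian_i686::
            $(master)/repl/root/usr/pkg/nrpe-2.12-bin/nrpe-debian.i686
                dest=/usr/pkg/$(nrpe_ver)/bin/nrpe
                mode=755 type=checksum server=$(fileserver)
        any::
            $(master)/repl/root/usr/pkg/nrpe-conf/nrpe.cfg
                dest=/usr/pkg/nrpe-conf/nrpe.cfg
                mode=644 type=checksum server=$(fileserver)
            $(master)/repl/root/etc/init.d/nrpe
                dest=/etc/init.d/nrpe
                mode=755 type=checksum server=$(fileserver)

    links:
        any::
            /usr/pkg/nrpe -> /usr/pkg/$(nrpe_ver)
        debian|redhat::
            /etc/rc2.d/S95nrpe -> /etc/init.d/nrpe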
When linking the /etc/init.d/nrpe start-up script into the runlevel-specific directories in the preceding links section, we avoid creating a link in /etc/rc3.d on Solaris hosts; Solaris runs the scripts from both rc2.d and rc3.d when booting to its default run level, so the script would execute twice. No damage would result, but we don't want to be sloppy. The directories rc4.d, rc5.d, and rc6.d don't exist on Solaris, so we won't attempt to create symlinks in them.
Note that we make it easy to move to a newer version of NRPE later on, using version numbers and a symlink at /usr/pkg/nrpe to point to the current version. The use of a variable means only the single entry in this task will need to change once a new NRPE version is built and placed in the appropriate directories on the cfengine master.
To activate this new task, we placed the following line in PROD/inputs/hostgroups/cf.any:

    tasks/app/nagios/cf.nrpe_sync
Step 19: Configuring the Red Hat Local Firewall to Allow NRPE
The next-to-last step we had to take was to allow NRPE connections through the Red Hat firewall. To do so, we added rules directly to the /etc/sysconfig/iptables file on the system rhlamp and restarted iptables with service iptables restart. Here are the complete contents of the iptables file, with the newly added line in bold:
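The added rule needs to accept TCP connections on NRPE's port, 5666. In the iptables-save format used by /etc/sysconfig/iptables, it looks something like this (a sketch; RH-Firewall-1-INPUT is the Red Hat default chain name and an assumption here, and you may want to restrict the source address to the monitoring host):

    -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT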
You can always use the Red Hat command system-config-securitylevel to make changes and then feed the resulting /etc/sysconfig/iptables changes back into the copy that we distribute with cfengine. This is just another example of how manual changes are often needed to determine how to automate something. It's always OK as long as we feed the resulting changes and steps back into cfengine for long-term enforcement.
We placed the updated iptables file on our cfengine master at PROD/repl/root/etc/sysconfig/iptables and placed a task with these contents at the location PROD/inputs/tasks/os/cf.iptables_sync:
    control:
        any::
            addinstallable = ( restartiptables )
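    # The copy and shellcommands stanzas below are a sketch, not the book's exact
    # task: the $(master_etc) and $(fileserver) variables, the file mode, and the
    # timeout are assumptions.
    copy:
        any::
            $(master_etc)/sysconfig/iptables
                dest=/etc/sysconfig/iptables
                mode=600
                type=checksum
                server=$(fileserver)
                define=restartiptables

    shellcommands:
        restartiptables::
            "/sbin/service iptables restart" timeout=60 inform=true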
It might seem strange to use the any class in the cf.redhat hostgroup file, but if you think about it, the task doesn't apply to all hosts on our network, only to the hosts that import this hostgroup file. That means that this any:: class will actually apply to only Red Hat systems.
Now, sit back and let NRPE go out to your network. If you encounter any issues while building NRPE, refer to the NRPE.pdf file included in the docs directory of the NRPE source distribution.
Monitoring Remote Systems
So far, we're simply using the example configuration included with Nagios to monitor only the system that is actually running Nagios. To make Nagios generally useful, we need to monitor remote systems.
Step 20: Configuring Nagios to Monitor All Hosts at Our Example Site
Templates are used in Nagios to avoid repeating the same values for every service and host object. These objects have many required entries, but Nagios allows the use of templates so that we don't have to repeat every required value in the objects that we define. Template definitions are very similar to the host or service definitions that they are meant for, but templates contain the line register 0 to keep Nagios from loading them as real objects. Any or all values can be overridden in the objects that inherit from a template.
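As an illustration, a service template looks like this (a sketch; the template name generic-service and the specific values shown are assumptions, not the book's exact template):

    define service{
            name                    generic-service
            active_checks_enabled   1
            check_period            24x7
            max_check_attempts      3
            normal_check_interval   5
            retry_check_interval    1
            contact_groups          admins
            notification_interval   60
            notification_period     24x7
            notification_options    w,u,c,r
            register                0
            }

Real service definitions then name this template with the use directive and override whatever they need.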
Note: Be aware that escalation settings override the contact_groups setting in service definitions. We have no escalation settings and won't configure any in this chapter, but keep them in mind for your own configurations.
Now that we have a template that suits our needs, we can inherit from it in our service definitions and specify only important values or those that we wish to override from the template's values.
In the directory PROD/repl/root/usr/pkg/nagios-conf/objects/servers, we have four files to define the objects to monitor on our network:

hosts.cfg
hostgroups.cfg
system_checks.cfg
web_checks.cfg
We define the hosts at our site in the file hosts.cfg:
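Each host gets an entry like this one (a sketch; the template name, alias, and address are assumptions):

    define host{
            use         generic-host
            host_name   etchlamp
            alias       etchlamp - Debian monitoring and web host
            address     192.168.1.13
            }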
Now that we have host definitions for all the hosts that we want to monitor at our site, we will set up groups in the file hostgroups.cfg:
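A hostgroup entry is short (again a sketch; the group and member names are assumptions):

    define hostgroup{
            hostgroup_name  debian-servers
            alias           Debian servers
            members         etchlamp,goldmaster
            }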
This way, we can simply add a new host to an existing hostgroup and immediately have the proper checks performed against it.
Next, we set up some system-level monitoring using NRPE, configured in the file system_checks.cfg:
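A service definition that checks load over NRPE might look like this (a sketch; the hostgroup name and the template are assumptions):

    define service{
            use                  generic-service
            hostgroup_name       debian-servers
            service_description  Load Check Over NRPE
            check_command        check_nrpe!check_load
            }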
This entry means that the check_nrpe command is passed the argument check_load for the Load Check Over NRPE service. Looking back at the command definition for check_nrpe, you can now see that what is run on the monitoring host is:
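That boils down to an invocation along these lines (a sketch; the plug-in path follows the libexec directory used earlier, and the target address is an assumption):

    /usr/pkg/nagios-plugins/libexec/check_nrpe -H 192.168.1.15 -c check_load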
We already set up the check_https check earlier to test the web server on localhost, so here we simply set it up for a remote host, and it works properly.
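In web_checks.cfg, such a service definition might look like this (a sketch; the host name, service description, and template are assumptions):

    define service{
            use                  generic-service
            host_name            rhlamp
            service_description  HTTPS
            check_command        check_https
            }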
Each time we update the Nagios configuration files, cfengine gets the files to the correct host (in this case, etchlamp) and restarts the Nagios daemon. If the etchlamp system fails due to hardware issues, we will simply need to reimage the host, and without any manual intervention cfengine will once again configure it as our central monitoring host.
At this point, we have the four components of Nagios deployed as planned. Nagios is able to run plug-ins that we define, either locally on systems via NRPE or across the network to test client/server applications. Over time, we will add checks and perhaps new plug-ins. Our monitoring infrastructure choice really shines in the easy addition of new plug-ins; it should be able to support us for quite a while without any core modifications.
What Nagios Alerts Really Mean
When an alert arrives from a monitoring system, what does it really mean? It might indicate a genuine failure, a problem with the monitoring program, or a network issue; a notification may even fire for a service that is reachable by all systems except the monitoring host. Don't jump to the conclusion that a notification means that a service or host has failed. You need to understand exactly what each service definition is checking and validate that the service is really failing with some checks of your own before undertaking any remediation steps.
Ganglia
Ganglia is a distributed monitoring system that uses graphs to display the data it collects. Nagios will let us know if an application or host is failing a check, but Ganglia is there to show how our systems behave over time. You can also feed site-specific metrics into Ganglia, though we don't demonstrate doing so in this book. If a host intermittently triggers a load alarm in Nagios with no clear cause immediately visible, looking at graphs of the system's load over time can be useful in helping you see when the load increase began. Armed with this information, we can check if the alarm correlates to a system change or application update. Ganglia is extremely useful in such situations. It scales incredibly well, and adding new custom metrics to the Ganglia graphs is extremely easy.
The core functionality of Ganglia is provided by two main daemons, along with a web front end:

gmond: This multithreaded daemon runs on each host you want to monitor. gmond keeps track of state on the system, relays the state changes on to other systems via TCP or multicast UDP, listens for and gathers the state of other gmond daemons in the local cluster, and answers requests for all the collected information. The gmond configuration will cause hosts to join a cluster group. A site might contain many different clusters, depending on how the administrator wants to group systems for display in the Ganglia web interface.
gmetad: This daemon is used to aggregate Ganglia data and can even be used to aggregate information from multiple Ganglia clusters. gmetad polls one or many gmond (or gmetad) data sources and serves the collected state over sockets to clients.

Web interface: The PHP web front end queries the gmetad daemon to receive the collected data and displays metrics clusterwide, or for a single host, over periods of time such as the last hour, day, week, or month. The web interface uses graphs built from gmetad's collected data to display historical information.
Ganglia's gmond daemon can communicate using TCP with explicit connections to other hosts that aggregate a cluster's state, or it can use multicast UDP to broadcast the state. We designate specific hosts to aggregate the cluster's state and then poll those hosts explicitly with gmetad. The gmond configuration file still has UDP port configuration settings, but they won't be used at our example site.
Building and Distributing the Ganglia Programs
Ganglia builds with the usual series of commands. Note that a C++ compiler will need to be present on the system, as well as development libraries for RRDtool and libpng12-0. Without the RRDtool libraries, the build will seem successful, but the gmetad program will fail to be built.
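The usual sequence is something along these lines (a sketch; the installation prefix is an assumption, and --with-gmetad is needed to build gmetad at all):

    ./configure --prefix=/usr/pkg/ganglia --with-gmetad
    make
    make install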
We generated a default configuration file using gmond's built-in option to emit one (redirecting the output to gmond.conf), edited it as appropriate for our site, and then placed the gmond.conf file on the cfengine master. The beautiful thing about this option is that it even emits comments describing each configuration section! Ganglia was clearly written by system administrators.
We configured goldmaster and etchlamp to be the cluster data aggregators via the udp_send_channel sections, and we configured gmetad to poll the cluster state from these two hosts. The tcp_accept_channel section allows our host running gmetad (etchlamp) to poll state over TCP from any host running gmond. The rest of the configuration file is unchanged.
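In gmond.conf terms, the relevant sections look roughly like this (a sketch; 8649 is the Ganglia default port, and the absence of access controls here is an assumption):

    udp_send_channel {
      host = goldmaster
      port = 8649
    }
    udp_send_channel {
      host = etchlamp
      port = 8649
    }
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }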
We copied the sample gmetad.conf file from the Ganglia source distribution (at the location gmetad/gmetad.conf), edited it for our site, and placed both files (gmond.conf and gmetad.conf) into the directory PROD/repl/root/usr/pkg/ganglia-conf on the cfengine master.
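The key gmetad.conf setting is the data_source line, which would look something like this (a sketch; the cluster name is an assumption):

    data_source "campin.net servers" goldmaster etchlamp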
We also added a ganglia user and group to the central [passwd|shadow|group] files with these entries:
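The passwd, group, and shadow entries, respectively, would look something like this (a sketch; the UID and GID of 106 are arbitrary assumptions and will differ at your site):

    ganglia:x:106:106:Ganglia Monitoring:/var/lib/ganglia:/bin/false
    ganglia:x:106:
    ganglia:*:13987:0:99999:7:::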
Next, add this line to PROD/inputs/hostgroups/cf.any so that all of our hosts get the Ganglia programs copied over:
Configuring the Ganglia Web Interface
Our central Ganglia machine will run the web interface for displaying graphs, as well as the gmetad program that collects the information from the gmond daemons on our network. Ganglia's web interface is written in PHP and distributed in the source package. Copy the PHP files from the Ganglia source package's web directory to this location on the cfengine master:
This task causes the gmetad daemon to be started on the ganglia_web host if it isn't already running (we define the ganglia_web class in the next section). Our configuration for the gmetad daemon is simple enough that the comments in the file itself serve as sufficient documentation to get most users going with a working configuration.
We added the required web packages to our FAI WEB package class and also installed them directly with apt-get on etchlamp in this case, so that we didn't have to reimage the host just to add two packages.
Next, we created a new hostgroup file for our new ganglia_web role on the cfengine master at the location PROD/inputs/hostgroups/cf.ganglia_web, with these contents:
Once cfengine on etchlamp copies the PHP content and Apache configuration files, we can visit https://ganglia.campin.net/ in our web browser and view graphs for all the hosts at our site, individually or as a whole. If you haven't previously used a similar host-graphing system, you'll be surprised at how often you refer to the graphs during troubleshooting or for capacity planning.
Now You Can Rest Easy
At this point, our monitoring is in place and will grow and scale along with our new infrastructure.
As your site requires more and more monitoring, you might benefit from the distributed features of Nagios. You could set up a test instance of distributed Nagios in order to determine if the additional load sharing and redundancy is a good fit for your site. Many sites simply purchase more powerful hardware for the monitoring host as their needs grow, but at some point this may no longer be feasible.
Ganglia will scale extremely well to large numbers of systems, and most of the follow-on configuration will be around breaking up hosts into separate groups and clusters. You can designate several hosts to aggregate the cluster's state and simply configure gmetad to poll the cluster state from a list of several hosts running gmond. This allows one or more gmond aggregators to fail without interrupting data collection, and you can add many more as the total number of systems at your site increases.
Infrastructure Enhancement
At this point, we have a fully functional infrastructure. We have automated all of the changes to the hosts at our site from the point at which the initial imaging hosts and cfengine server were set up.
We're running a rather large risk, however, because if we make errors in our cfengine configuration files, we won't have an easy way to revert the changes. We run an even greater risk if our cfengine server were to suffer hardware failure: we would have no way of restoring the cfengine masterfiles tree. The other hosts on our network will continue running cfengine, and they will apply the last copied policies and configuration files, but no updates will be possible until we restore our central host.
Subversion can help us out with both issues. Using version control, we can easily track the changes to all the files hosted in our cfengine masterfiles tree, and by making backups of the Subversion repository, we can restore our cfengine server in the event of system failure or even total site failure.
Cfengine Version Control with Subversion
With only a small network in place, we already have over 2,800 lines of configuration code in over 55 files under the PROD/inputs directory. We need to start tracking the different versions of those files as time goes on, as well as tracking any additional files that are added. The workplace of one of this book's authors has over 30,000 lines of cfengine configuration in 971 files. Without version control, it is difficult to maintain any semblance of control over your cfengine configuration files, as well as the files being copied by cfengine.
We covered basic Subversion usage in Chapter 8 and included instructions on how to set up a Subversion server with an Apache front end. We'll utilize that infrastructure to host version control for our cfengine master repository.
Importing the masterfiles Directory Tree
In order to import our cfengine masterfiles directory into Subversion, we need to create the repository on etchlamp, our Subversion host. Conveniently, we already created the