Level: Intermediate
Cameron Laird (claird@phaseit.net), Vice President, Phaseit, Inc.
Wojciech Kocjan (wojciech@kocjan.org), Software Engineer, IBM
17 Jul 2007
Learn more about Nagios and find out what new system monitoring possibilities exist
with this software. Nagios is open source monitoring software that scans hosts, services, and
networks for problems. The two main differences between Nagios and other similar
packages are that Nagios reduces all information to "working," "questionable,"
and "failure" statuses; and Nagios supports a particularly rich "ecosystem" of
plug-ins. These features result in effective installations where users are not
overwhelmed with details, but have just the information they need.
Monitoring and analyzing masses of information—Is the CPU
overloaded? Is the network interface saturated?—across several hosts is a daunting task. A good
solution might be only a few steps away, though! The open source Nagios project (see
Resources) solves complex monitoring and
notification requirements quite handily.
Crucial to understanding Nagios is that, rather than monitor and track "natural"
measurements such as CPU utilization, the tool reduces all information to
"working," "questionable," and "failure" states. This helps operators focus on the
most important problems, based on predefined and configurable
criteria.
Nagios builds in the capacity for downtime reporting, which might be useful in
tracking fulfillment of service level agreements (SLAs). As later articles will
illustrate, Nagios also offers downtime escalations as well as service and
host dependencies; this introduction
concentrates on the ease with which you can write small customizations for your
basic monitoring requirements.
Installation
Most Linux® distributions provide Nagios packages. In such cases, the
installation integrates smoothly with the Apache Web server. To activate or update
such a configuration, install the package with your distribution's package manager.
On Debian-based systems, for example:
apt-get install nagios-text
A binary for the AIX®
platform is available for free download from NagiosExchange (see Resources).
For other platforms, Nagios sources can be downloaded from
Nagios.org (see Resources). The development tools
necessary to generate Nagios "from scratch" are standard:
- Tools
  - gcc
  - make
  - autoconf
  - automake
- Executables
- Packages (libraries and headers)
Many Simple Network Management Protocol-related (SNMP-related) plug-ins also require
Perl and the Net::SNMP package.
After installing and setting up Nagios, you should be able to access it at the
default URL, http://your.host.name/nagios. Figure 1 shows which hosts and services are up
or down.
Figure 1. Tactical Monitoring Overview screen
Configuring Nagios
By default, all Nagios configuration files are in the
/etc/nagios directory. Apache-related configuration
files might be symlinked into the Apache configuration directory for convenience. The
configuration is split into multiple files, each covering a different part of the
setup.
The first components to set up are contacts and contact groups. Contacts are
people who receive notifications when a host or service is down. Nagios offers
pager and e-mail notifications by default. Extensions add notification through Jabber
and many other channels, which can be very convenient in certain circumstances.
Contacts are stored in the contacts.cfg file and are
defined as follows:
Listing 1. Configuration 1: Basic contact information
define contact{
    contact_name                    jdoe
    alias                           John Doe
    service_notification_commands   notify-by-email
    host_notification_commands      host-notify-by-email
    email                           john.doe@yourcompany.com
}
Contacts are grouped: instead of listing the individual people to notify when a
host or service changes status, you name a contact group, and Nagios notifies its
members. Sometimes it's even appropriate to define a person several times, once per
notification command or address, and then add each of those definitions to the
contact groups to which that person belongs (see Listing 2).
Listing 2. Configuration 2: Grouping contacts
define contactgroup{
    contactgroup_name   server-admins
    alias               Server Administrators
    members             jdoe,albundy
}
The next step is to configure the hosts that Nagios should monitor. Add every host
whose services are monitored or that is itself checked for being alive.
The configuration file for storing hosts is
hosts.cfg. Here's an example host definition:
Listing 3. Configuration 3: Adding a new host
define host{
    host_name                       ubuntu_1_2
    alias                           Ubuntu test server
    address                         192.168.1.2
    check_command                   check-host-alive
    max_check_attempts              20
    notifications_enabled           1
    event_handler_enabled           0
    flap_detection_enabled          0
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    notification_interval           60
    notification_period             24x7
    notification_options            d,u,r
}
The final step of Nagios configuration is to define services for the
configured hosts. This example uses the predefined "ping" Nagios plug-in, which sends
Internet Control Message Protocol (ICMP) echo requests to determine whether a host is responsive.
Listing 4. Configuration 4: Adding a new service
define service{
    use                     service-template
    host_name               ubuntu_1_2
    service_description     PING
    check_period            24x7
    contact_groups          server-admins
    notification_options    c,r
    check_command           check_ping!300.0,20%!1000.0,60%
}
With this configuration in place, restart your Nagios daemon, then, after pausing
for a few seconds to let Nagios initialize, confirm the visibility of the ping
service in the Web administrative interface.
How to write Nagios plug-ins
The most exciting aspect of Nagios is that writing your own plug-ins is simple and
requires learning only a few easy principles. To manage a plug-in, Nagios simply
spawns a child process each time it queries the status of a service, and it uses the
output and exit code from that command to determine status. Exit status codes are
interpreted as follows:
- OK (exit code 0): the service is working properly.
- WARNING (exit code 1): the service is in a warning state.
- CRITICAL (exit code 2): the service is in a critical state.
- UNKNOWN (exit code 3): the service is in an unknown state.
The last state usually means that the plug-in was unable to determine the
status of the service. This might be the result of an internal error, for
instance.
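The exit-code convention above can be tried out without a running Nagios server. The following is only a minimal sketch of the idea, not how Nagios itself is implemented: run_plugin is a hypothetical helper name, the stand-in plug-in is a one-line Python command invented for the demonstration, and treating unexpected exit codes as UNKNOWN is an assumption of this sketch.

```python
import subprocess
import sys

# Exit codes and their meanings, as listed above
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_plugin(argv):
    """Run a plug-in as a child process; return (state name, first output line)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    # Assumption of this sketch: any unexpected exit code is treated as UNKNOWN
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    return state, (lines[0] if lines else "")


if __name__ == "__main__":
    # A stand-in plug-in: prints a status line and exits with code 2
    fake_plugin = [sys.executable, "-c",
                   "import sys; print('DEMO CRITICAL: something broke'); sys.exit(2)"]
    print(run_plugin(fake_plugin))
```

A helper like this can be handy for testing a new plug-in from the command line before registering it with Nagios.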
Below is an example script in Python that checks the UNIX® load average. It
treats a level of 2.0 or more as a warning state and a level of 5.0 or more as critical. The
values are hardcoded and the load average from the last minute is always used.
Listing 5. Python plug-in: sample working plug-in
#!/usr/bin/env python
import os
import sys

# load averages over the last 1, 5, and 15 minutes
(d1, d2, d3) = os.getloadavg()

# compare the one-minute average with the hardcoded thresholds
if d1 >= 5.0:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d1)
    sys.exit(2)
elif d1 >= 2.0:
    print("GETLOADAVG WARNING: Load average is %.2f" % d1)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d1)
    sys.exit(0)
With this small working executable in place, the next step is to register the plug-in with
Nagios and create a service definition that checks the load average.
This is also straightforward: create a file called
/etc/nagios-plugins/config/mygetloadavg.cfg with
the contents of Listing 6, and add a service based on Listing 7 to the
services.cfg file. Remember that
localhost must be defined in the
hosts.cfg configuration file.
Listing 6. Sample plug-in: registering with Nagios
define command{
    command_name    check_mygetloadavg
    command_line    /path/to/check_getloadavg
}
Listing 7. Creating a service using sample plug-in
define service{
    use                     service-template
    host_name               localhost
    service_description     LoadAverage
    check_period            24x7
    contact_groups          server-admins
    notification_options    c,r
    check_command           check_mygetloadavg
}
Writing a complete plug-in
The previous example illustrates the limits of a "hardcoded" plug-in that admits
no run-time configuration. In practice, it's often best to create a configurable
plug-in. This way you can create and maintain one plug-in, register it as a single
plug-in with Nagios, and pass arguments to customize the warning and critical
levels to specific circumstances. The next example also includes a usage message;
this has proven particularly valuable for plug-ins used or maintained by several
different developers or administrators.
Another healthy practice is to catch all exceptions and fall back to reporting
UNKNOWN service status so that Nagios can manage notification of this fact
appropriately. Plug-ins that let exceptions "fall through" are likely to exit with
a value of 1; to Nagios, this suggests a WARNING state. Make sure your plug-ins
properly distinguish WARNING from UNKNOWN. Notice, for instance, that it's common
to disable notifications for at least some WARNINGs, when it would be a mistake to
do so for UNKNOWN results.
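One way to apply this advice is to route every check through a small wrapper. The sketch below is illustrative only: run_check and broken_check are hypothetical names, and the convention that a check function returns a (code, message) pair is an assumption made for this example.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(check):
    """Call check() and return its exit code, mapping any exception to UNKNOWN."""
    try:
        code, message = check()
    except Exception as exc:
        # Without this handler the interpreter would exit with 1 (WARNING)
        print("CHECK UNKNOWN: %s" % exc)
        return UNKNOWN
    print(message)
    return code


def broken_check():
    # Hypothetical check that fails with an internal error
    raise RuntimeError("cannot read sensor")


if __name__ == "__main__":
    # A real plug-in would pass this value to sys.exit()
    print(run_check(broken_check))
```

Keeping the exception handling in one place means each individual check only has to decide between OK, WARNING, and CRITICAL.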
Writing a plug-in—Python
The suggestions above (run-time parametrization, a usage message, and improved
exception handling) make the source code of the example plug-in several
times as long. You gain, though, safer handling of errors and the ability to
re-use the plug-in over a wider range of circumstances.
Listing 8. Python plug-in: complete plug-in for getting load average
#!/usr/bin/env python
import getopt
import os
import sys


def usage():
    # print the help text and exit with UNKNOWN (3)
    print("""Usage: check_getloadavg [-h|--help] [-m|--mode 1|2|3] \
[-w|--warning level] [-c|--critical level]

Mode: 1 - last minute ; 2 - last 5 minutes ; 3 - last 15 minutes
Warning level defaults to 2.0
Critical level defaults to 5.0""")
    sys.exit(3)


try:
    # long options are given as a list of names without leading dashes
    options, args = getopt.getopt(sys.argv[1:],
                                  "hm:w:c:",
                                  ["help", "mode=", "warning=", "critical="])
except getopt.GetoptError:
    usage()

argMode = "1"
argWarning = 2.0
argCritical = 5.0

for name, value in options:
    if name in ("-h", "--help"):
        usage()
    if name in ("-m", "--mode"):
        if value not in ("1", "2", "3"):
            usage()
        argMode = value
    if name in ("-w", "--warning"):
        try:
            argWarning = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()
    if name in ("-c", "--critical"):
        try:
            argCritical = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()

# report UNKNOWN, not WARNING, if the load average cannot be read
try:
    (d1, d2, d3) = os.getloadavg()
except Exception:
    print("GETLOADAVG UNKNOWN: Error while getting load average")
    sys.exit(3)

# select the 1-, 5-, or 15-minute average
if argMode == "1":
    d = d1
elif argMode == "2":
    d = d2
elif argMode == "3":
    d = d3

if d >= argCritical:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d)
    sys.exit(2)
elif d >= argWarning:
    print("GETLOADAVG WARNING: Load average is %.2f" % d)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d)
    sys.exit(0)
To use the new plug-in, create
/etc/nagios-plugins/config/mygetloadavg2.cfg with
the following contents:
Listing 9. Python plug-in: registering with Nagios
define command{
    command_name    check_mygetloadavg2
    command_line    /path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$
}
Also, add or change the service entry in the services.cfg file based on the
example below. Note that an exclamation mark (!) separates plug-in parameters.
As before, localhost must be defined in the
hosts.cfg configuration file.
Listing 10. Creating a service using a Python plug-in
define service{
    use                     service-template
    host_name               localhost
    service_description     LoadAverage2
    check_period            24x7
    contact_groups          server-admins
    notification_options    c,r
    check_command           check_mygetloadavg2!1!3.0!6.0
}
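To see how the check_command value in Listing 10 relates to the command_line template in Listing 9, the substitution can be imitated in a few lines of Python. This is a simplified sketch, not Nagios code: expand_check_command and the COMMANDS dictionary are hypothetical names, and real Nagios macro processing covers many more macros and escaping rules than the $ARGn$ replacement shown here.

```python
def expand_check_command(check_command, commands):
    """Expand 'name!arg1!arg2...' using the matching command_line template."""
    name, *args = check_command.split("!")
    line = commands[name]
    # The first value after the command name fills $ARG1$, the next $ARG2$, ...
    for position, value in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % position, value)
    return line


# command_name -> command_line, copied from Listing 9
COMMANDS = {
    "check_mygetloadavg2":
        "/path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$",
}

if __name__ == "__main__":
    print(expand_check_command("check_mygetloadavg2!1!3.0!6.0", COMMANDS))
```

With the values from Listing 10, the expanded command line is /path/to/check_getloadavg2 -m 1 -w 3.0 -c 6.0.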
Writing a plug-in—Tcl
The final example is a plug-in in Tcl that checks exchange rates from xmethods.net
using Simple Object Access Protocol (SOAP) and Web Services Description Language
(WSDL). SOAP supplies the plug-in with current values for exchange
rates, and the plug-in compares these with the configured ranges. If the value is within
the warning limits, it is assumed to be OK. If the value is above or below the warning
levels but does not exceed the critical limits, the state is set to WARNING.
Otherwise it is set to CRITICAL, unless a networking error occurs, in which case
the state is set to UNKNOWN.
The plug-in recognizes configurable parameters, so the same plug-in can check
different pairs of countries' currencies against different ranges.
Listing 11. Tcl plug-in: verifying current exchange rates
#!/usr/bin/env tclsh

# parse arguments
package require cmdline
set options {
    {country1.arg "" "Country 1"}
    {country2.arg "" "Country 2"}
    {lowerwarning.arg "" "Lower warning limit"}
    {upperwarning.arg "" "Upper warning limit"}
    {lowercritical.arg "" "Lower critical limit"}
    {uppercritical.arg "" "Upper critical limit"}
}
array set opt [cmdline::getoptions argv $options {: [options]}]

# if the user did not supply all arguments, show help message
foreach necessary [array names opt] {
    if {$opt($necessary) == ""} {
        set argv "-help"
        catch {cmdline::getoptions argv $options {: [options]}} usage
        puts stderr $usage
        exit 3
    }
}

# load TclWebServices package
package require WS::Client

if {[catch {
    # download WSDL
    WS::Client::GetAndParseWsdl \
        http://www.xmethods.net/sd/2001/CurrencyExchangeService.wsdl \
        {} currency
    # create stub commands
    WS::Client::CreateStubs currency
    # download the actual exchange rate for the configured countries
    set result [lindex \
        [currency::getRate $opt(country1) $opt(country2)] 1]
} error]} {
    # if downloading the rate failed for some reason, report it
    puts "EXCHANGERATE UNKNOWN: $error"
    exit 3
}

# outside the critical limits
if {($result < $opt(lowercritical)) || ($result > $opt(uppercritical))} {
    puts "EXCHANGERATE CRITICAL: rate is $result"
    exit 2
}

# outside the warning limits, but within the critical limits
if {($result < $opt(lowerwarning)) || ($result > $opt(upperwarning))} {
    puts "EXCHANGERATE WARNING: rate is $result"
    exit 1
}

puts "EXCHANGERATE OK: rate is $result"
exit 0
Now, you need to register this command so that Nagios knows how to invoke it. To
do that, create a file called
/etc/nagios-plugins/config/exchangerate.cfg with
contents similar to the previous configurations and the following command definition:
command_line /path/to/check_exchangerate -country1 $ARG1$ -country2 $ARG2$ -lowercritical $ARG3$ -lowerwarning $ARG4$ -upperwarning $ARG5$ -uppercritical $ARG6$
The check_exchangerate command name is assumed in the
example below.
Next, create a service that uses the newly created plug-in to monitor exchange
rates. Below is a service definition that associates the service with
the localhost server. Even though the check is not really
associated with any physical host, it needs to be bound to a host. If the check
involves calling SOAP methods on servers inside trusted networks, you can add the actual
server to be monitored and bind the service to that
server instead. The code in Listing 12 checks British Pounds to Japanese Yen and
verifies that the conversion rate is between 225 and 275; a rate below 200 or above 300 is
reported as critical.
Listing 12. Adding the Tcl plug-in as a new service
define service{
    use                     service-template
    host_name               localhost
    service_description     EXCHANGERATE
    check_period            24x7
    contact_groups          other-admins
    notification_options    c,r
    check_command           check_exchangerate!England!Japan!200!225!275!300
}
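The two-sided range test that the Tcl plug-in performs can be restated compactly in Python, which makes it easy to experiment with the limits from Listing 12 (200, 225, 275, and 300). This is a sketch for illustration only: classify_rate is a hypothetical name, and the sample rates are made up rather than fetched from any service.

```python
def classify_rate(rate, lower_critical, lower_warning,
                  upper_warning, upper_critical):
    """Return (exit code, status line) for a rate checked against two-sided limits."""
    # Critical limits are tested first because they are the wider band
    if rate < lower_critical or rate > upper_critical:
        return 2, "EXCHANGERATE CRITICAL: rate is %s" % rate
    if rate < lower_warning or rate > upper_warning:
        return 1, "EXCHANGERATE WARNING: rate is %s" % rate
    return 0, "EXCHANGERATE OK: rate is %s" % rate


if __name__ == "__main__":
    # Made-up sample rates checked against the limits from Listing 12
    for sample in (250.0, 210.0, 310.0):
        print(classify_rate(sample, 200, 225, 275, 300))
```

Testing the critical band before the warning band matters: a rate of 310 is also outside the warning limits, but it must be reported as CRITICAL.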
Conclusions
You can use Nagios to monitor all sorts of hardware and software. The opportunity
to write your own plug-ins makes it possible to monitor everything that your Nagios
server can communicate with. Because you can use any computing language that manages
command-line arguments and exit status, the possibilities are almost endless!
An advanced system administrator might extend the SOAP example with Tcl or any
other language to communicate with intranet Web services and write plug-ins to
verify correct behavior of the services.
It is also possible to write plug-ins in C, or to embed C in your favorite dynamic
language (using PyInline with Python,
Inline with Perl, or Critcl
with Tcl), to combine your operating system's C API with a plug-in written in a
high-level language.
Another Nagios feature worth your attention is the passive check. The Nagios
monitoring you've seen to this point launches short-lived status executables
and then collects their results. In passive checking, Nagios
does not spawn plug-ins to check status; instead, separate applications send status
updates to Nagios periodically or when the state of a service has changed. Such an
application might receive notifications from other sources, aggregate them, and
pass a computed summary to Nagios. Nagios can also assume a service is down if it
has not received notifications in some period of time. We'll document
implementation of a Nagios passive check in a future article.
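As a small preview, a passive service result is commonly handed to Nagios as a PROCESS_SERVICE_CHECK_RESULT line written to its external command file. The sketch below only shows the shape of such a line and is not a complete implementation: passive_result_line and submit are hypothetical names, the location of the command file is site-specific (it is set in the main Nagios configuration), and the host and service names are borrowed from the earlier listings.

```python
import time


def passive_result_line(host, service, return_code, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit(line, command_file):
    # command_file is an assumption: use the path configured for your installation
    with open(command_file, "w") as handle:
        handle.write(line)


if __name__ == "__main__":
    print(passive_result_line("localhost", "LoadAverage", 0,
                              "GETLOADAVG OK: Load average is 0.42",
                              timestamp=1184630400), end="")
```

The return code and output text follow the same conventions as the plug-ins above, so the status logic of an existing plug-in can usually be reused unchanged.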
What makes Nagios plug-ins so exciting is the ease with which they're written and
shared. Nagios plug-ins are useful for the situations network and system managers
encounter, and, in many cases, it's simple to re-use work someone else has already
done. Just as with well-run Wikis or the Web itself, it requires little to
contribute a helpful example, yet the collective value of all available Nagios
plug-ins is very large.
Resources
Learn
- Nagios: For more information on Nagios, be sure to visit the official Nagios website. It contains the latest versions of applications, RPM packages, and a standalone version for the Linux platform. Also, the propaganda page shows you what kind of companies use Nagios and why.
- Several books on Nagios have been written. This review of Nagios: System and Network Monitoring also provides general background information on the subject of Nagios.
- The Nagios Exchange is a central repository for scores of public Nagios plug-ins.
- A binary for the AIX platform is available for free download from NagiosExchange. For other platforms, Nagios sources can be downloaded from Nagios.org.
- Nagiosplug Developer guidelines: This page contains suggestions and good practices for writing your own Nagios plug-ins.
- Passive Host and Service Checks send notifications to Nagios directly from your applications.
- To learn more about Python, visit the Python homepage. This website contains all the download information as well as additional help on Python and using it. Also be sure to scan David Mertz's "Charming Python" column for developerWorks.
- For those who need to get up to speed on Tcl, the entire Tcl documentation for many releases is available online at www.tcl.tk/. Tcl/Tk itself can be freely downloaded from its Sourceforge project.
- Writing C code inside Python can be accomplished using the PyInline module, which can be freely downloaded from its Sourceforge project.
- For all Tcl fans wanting to use the native OS API in their plug-ins, Critcl brings C to Tcl. A Starkit that allows running, building, and using Critcl on your machine is available for free download as critcl.kit.
About the authors
Cameron Laird is a long-time developerWorks contributor and former columnist. He often writes about the open source projects that accelerate development of his employer's applications, focused on reliability and security. He first used AIX twenty years ago, when it was still an experimental product. He's been an enthusiastic consumer of and contributor to a variety of memory debugging tools through that time. You can contact him at claird@phaseit.net.
Wojciech Kocjan works as a Software Engineer for IBM. His commercial customers include Motorola and IBM. He also has several years of experience volunteering for a variety of open source projects. You can contact him at wojciech@kocjan.org.