
Leverage Nagios with plug-ins you write

Advantages and new possibilities in system monitoring

Cameron Laird (claird@phaseit.net), Vice President, Phaseit, Inc.
Cameron Laird is a long-time developerWorks contributor and former columnist. He often writes about the open source projects that accelerate development of his employer's applications, focused on reliability and security. He first used AIX twenty years ago, when it was still an experimental product. He's been an enthusiastic consumer of and contributor to a variety of memory debugging tools through that time. You can contact him at claird@phaseit.net.
Wojciech Kocjan (wojciech@kocjan.org), Software Engineer, IBM
Wojciech Kocjan works as a Software Engineer for IBM. His commercial customers include Motorola and IBM. He also has several years of experience volunteering for a variety of open source projects. You can contact him at wojciech@kocjan.org.

Summary:  Learn more about Nagios and find out what new system monitoring possibilities exist with this software. Nagios is open source monitoring software that scans hosts, services, and networks for problems. The two main differences between Nagios and other similar packages are that Nagios reduces all information to "working," "questionable," and "failure" statuses; and Nagios supports a particularly rich "ecosystem" of plug-ins. These features result in effective installations where users are not overwhelmed with details, but have just the information they need.

Date:  17 Jul 2007
Level:  Intermediate

Monitoring and analyzing masses of information—Is the CPU overloaded? Is the network interface saturated?—across several hosts is a daunting task. A good solution might be only a few steps away, though! The open source Nagios project (see Resources) solves complex monitoring and notification requirements quite handily.

Crucial to understanding Nagios is that, rather than monitor and track "natural" measurements such as CPU utilization, the tool reduces all information to "working," "questionable," and "failure" states. This helps operators focus on the most important problems, based on predefined and configurable criteria.

Nagios has built-in downtime reporting, which can be useful for tracking fulfillment of service level agreements (SLAs). As later articles will illustrate, Nagios also offers downtime escalations and service and host dependencies; this introduction concentrates on the ease with which you can write small customizations for your basic monitoring requirements.

Installation

Most Linux® distributions provide Nagios packages. In such cases, the installation integrates smoothly with the Apache Web server. To install or update such a package, run one of the following commands:

yum install nagios

or

apt-get install nagios-text

A binary for the AIX® platform is available for free download from NagiosExchange (see Resources).

For other platforms, Nagios sources can be downloaded from Nagios.org (see Resources). The development tools necessary to generate Nagios "from scratch" are standard:

  • Executables (tools)
    • gcc
    • make
    • autoconf
    • automake
  • Packages (libraries and headers)
    • libgd
    • openssl

Many Simple Network Management Protocol-related (SNMP-related) plug-ins also require Perl and the Net::SNMP package.

After installing and setting up Nagios, you should be able to access Nagios using a default http://your.host.name/nagios URL. Figure 1 shows which hosts and services are up or down.


Figure 1. Tactical Monitoring Overview screen

Configuring Nagios

By default, all Nagios configuration files are in the /etc/nagios directory. Apache-related configuration files might be symlinked into the Apache configuration directory for convenience. The configuration is split into multiple files, each for a different part of the configuration.

The first components to set up are contacts and contact groups. Contacts are the people who receive notifications when a host or service is down. Nagios offers pager and e-mail notifications by default. Extensions add notification through Jabber and many other channels, which can be very convenient in certain circumstances.

Contacts are stored in the contacts.cfg file and are defined as follows:


Listing 1. Configuration 1: Basic contact information
define contact{
        contact_name                    jdoe
        alias                           John Doe
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           john.doe@yourcompany.com
        }

Contacts are grouped: Instead of listing the individual people to notify when a host or service changes status, you name a contact group, and Nagios notifies its members. Sometimes it's even appropriate to define a person multiple times, each with different notification commands or addresses, and then add all of those definitions to the contact group to which that person belongs (see Listing 2).


Listing 2. Configuration 2: Grouping contacts
define contactgroup{
        contactgroup_name               server-admins
        alias                           Server Administrators
        members                         jdoe,albundy
        }

The next step is to configure the hosts that Nagios should monitor. Add every host whose services are monitored or whose availability is checked. The configuration file for storing hosts is hosts.cfg. Here's an example host definition:


Listing 3. Configuration 3: Adding a new host
define host{
        host_name                       ubuntu_1_2
        alias                           Ubuntu test server
        address                         192.168.1.2
        check_command                   check-host-alive
        max_check_attempts              20
        notifications_enabled           1
        event_handler_enabled           0
        flap_detection_enabled          0
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           60
        notification_period             24x7
        notification_options            d,u,r
        }
	

The final step of Nagios configuration is definition of services for the configured hosts. This example uses a predefined "ping" Nagios plug-in, which sends Internet Control Message Protocol (ICMP) echo requests to determine if a host is responsive or not.


Listing 4. Configuration 4: Adding a new service
define service{
        use                             service-template
        host_name                       ubuntu_1_2
        service_description             PING
        check_period                    24x7
        contact_groups                  server-admins
        notification_options            c,r
        check_command                   check_ping!300.0,20%!1000.0,60%
        }


With this configuration in place, restart your Nagios daemon, then, after pausing for a few seconds to let Nagios initialize, confirm the visibility of the ping service in the Web administrative interface.

How to write Nagios plug-ins

The most exciting aspect of Nagios is that writing your own plug-ins is simple and requires learning only a few easy principles. To manage a plug-in, Nagios simply spawns a child process each time it queries the status of a service, and it uses the output and exit code from that command to determine status. Exit status codes are interpreted as follows:

  • OK—exit code 0—indicates a service is working properly.
  • WARNING—exit code 1—indicates a service is in warning state.
  • CRITICAL—exit code 2—indicates a service is in critical state.
  • UNKNOWN—exit code 3—indicates a service is in unknown state.

The last state usually means that the plug-in was unable to determine the status of the service. This might be the condition of an internal error, for instance.
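As a rough illustration of this mechanism (not Nagios's actual implementation), the following Python 3 sketch spawns a command as a child process and maps its exit code to a state name. The names run_plugin and fake_plugin are invented for this example, and treating any exit code outside 0 through 3 as UNKNOWN is an assumption of the sketch.

```python
import subprocess
import sys

# Exit codes as listed above.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def run_plugin(argv):
    """Run a plug-in as a child process; return (state name, first output line)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Assumption for this sketch: unexpected exit codes count as UNKNOWN.
    return STATES.get(proc.returncode, "UNKNOWN"), first_line

# Stand-in plug-in (hypothetical): a one-line Python program reporting CRITICAL.
fake_plugin = [sys.executable, "-c",
               "import sys; print('DEMO CRITICAL: disk is full'); sys.exit(2)"]

state, text = run_plugin(fake_plugin)
print(state, "-", text)
```

Running a plug-in by hand in this way, before registering it, is a quick check that both its output line and its exit code are what you expect.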

Below is an example script in Python that checks the UNIX® load average. It treats a load average of 2.0 or higher as a warning state and 5.0 or higher as critical. The values are hardcoded, and the load average from the last minute is always used.


Listing 5. Python plug-in—sample working plug-in
#!/usr/bin/env python

import os
import sys

# 1-, 5-, and 15-minute load averages; only the first is used here
(d1, d2, d3) = os.getloadavg()

# Print one status line, then report the state through the exit code
if d1 >= 5.0:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d1)
    sys.exit(2)
elif d1 >= 2.0:
    print("GETLOADAVG WARNING: Load average is %.2f" % d1)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d1)
    sys.exit(0)
    

With this small working executable in place, the next step is to register the plug-in with Nagios and create a service definition that checks the load average.

This is also straightforward: Create a file called /etc/nagios-plugins/config/mygetloadavg.cfg with contents as below, and add a service based on the example below to the services.cfg file. Remember that localhost must be defined in the hosts.cfg configuration file.


Listing 6. Sample plug-in—registering with Nagios
define command{
        command_name    check_mygetloadavg
        command_line    /path/to/check_getloadavg
        }


Listing 7. Creating a service using sample plug-in
define service{
        use                             service-template
        host_name                       localhost
        service_description             LoadAverage
        check_period                    24x7
        contact_groups                  server-admins
        notification_options            c,r
        check_command                   check_mygetloadavg
        }

Writing a complete plug-in

The previous example illustrates the limits of a "hardcoded" plug-in that admits no run time configuration. In practice, it's often best to create a configurable plug-in. This way you can create and maintain one plug-in, register it as a single plug-in with Nagios, and pass arguments to customize the warning and critical levels to specific circumstances. The next example also includes a usage message; this has proven particularly valuable for plug-ins used or maintained by several different developers or administrators.

Another healthy practice is to catch all exceptions and fall back to reporting UNKNOWN service status so that Nagios can manage notification of this fact appropriately. Plug-ins that let exceptions "fall through" are likely to exit with a value of 1; to Nagios, this suggests a WARNING state. Make sure your plug-ins properly distinguish WARNING from UNKNOWN. Notice, for instance, that it's common to disable notifications for at least some WARNINGs, when it would be a mistake to do so for UNKNOWN results.
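One way to apply this advice is to route every check through a small wrapper that converts any unexpected exception into the UNKNOWN exit code. The Python 3 sketch below is a minimal illustration; run_check and broken_check are hypothetical names, and the message format is only an assumption.

```python
UNKNOWN = 3

def run_check(check):
    """Call check() and return (exit_code, message).

    Any exception raised by the check is reported as UNKNOWN instead of
    escaping and ending the process with exit code 1 (WARNING).
    """
    try:
        return check()
    except Exception as exc:
        return UNKNOWN, "CHECK UNKNOWN: %s" % exc

def broken_check():
    # Hypothetical check that fails with an internal error.
    raise RuntimeError("cannot read sensor")

code, message = run_check(broken_check)
print(message)
# A real plug-in would finish by calling sys.exit(code).
```

Keeping the wrapper separate from the check itself means the same safety net can be reused by every plug-in you write.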

Writing a plug-in—Python

The suggestions above (run time parametrization, a usage message, and improved exception handling) make the source code of the example plug-in several times as long. You gain, though, safer handling of errors and the ability to re-use the plug-in over a wider range of circumstances.


Listing 8. Python plug-in—complete plug-in for getting load average
#!/usr/bin/env python

import os
import sys
import getopt

def usage():
    # Print the usage message and exit with UNKNOWN status
    print("""Usage: check_getloadavg [-h|--help] [-m|--mode 1|2|3] \
[-w|--warning level] [-c|--critical level]

Mode: 1 - last minute ; 2 - last 5 minutes ; 3 - last 15 minutes
Warning level defaults to 2.0
Critical level defaults to 5.0""")
    sys.exit(3)

try:
    # Long options are given as a list of names without leading dashes
    options, args = getopt.getopt(sys.argv[1:],
        "hm:w:c:",
        ["help", "mode=", "warning=", "critical="],
        )
except getopt.GetoptError:
    usage()

# Defaults used when an option is not supplied
argMode = "1"
argWarning = 2.0
argCritical = 5.0

for name, value in options:
    if name in ("-h", "--help"):
        usage()
    if name in ("-m", "--mode"):
        if value not in ("1", "2", "3"):
            usage()
        argMode = value
    if name in ("-w", "--warning"):
        try:
            argWarning = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()
    if name in ("-c", "--critical"):
        try:
            argCritical = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()

# Any failure to read the load average is reported as UNKNOWN
try:
    (d1, d2, d3) = os.getloadavg()
except Exception:
    print("GETLOADAVG UNKNOWN: Error while getting load average")
    sys.exit(3)

# Select the 1-, 5-, or 15-minute average according to the mode
if argMode == "1":
    d = d1
elif argMode == "2":
    d = d2
elif argMode == "3":
    d = d3

if d >= argCritical:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d)
    sys.exit(2)
elif d >= argWarning:
    print("GETLOADAVG WARNING: Load average is %.2f" % d)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d)
    sys.exit(0)
    

To use the new plug-in, register it by creating /etc/nagios-plugins/config/mygetloadavg2.cfg with the following contents:


Listing 9. Python plug-in—registering with Nagios
define command{
        command_name    check_mygetloadavg2
        command_line    /path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$
        }

Also, add or change the service entry based on the example below in the services.cfg file. Note that an exclamation mark—!—separates plug-in parameters. As before, localhost must be defined in the hosts.cfg configuration file.


Listing 10. Creating a service using a python plug-in
define service{
        use                             service-template
        host_name                       localhost
        service_description             LoadAverage2
        check_period                    24x7
        contact_groups                  server-admins
        notification_options            c,r
        check_command                   check_mygetloadavg2!1!3.0!6.0
        }
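To make the relationship between Listings 9 and 10 concrete, the Python 3 sketch below mimics how the exclamation-mark-separated values in check_command fill the $ARG1$, $ARG2$, and $ARG3$ placeholders of the command definition. It is a simplified model for illustration only; expand_command is a hypothetical helper, and Nagios itself handles many more macros than this.

```python
def expand_command(check_command, commands):
    """Split 'name!arg1!arg2...' and substitute $ARGn$ in the command line."""
    name, *args = check_command.split("!")
    line = commands[name]
    for position, arg in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % position, arg)
    return line

# Command definitions keyed by command_name, as in Listing 9.
commands = {
    "check_mygetloadavg2":
        "/path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$",
}

print(expand_command("check_mygetloadavg2!1!3.0!6.0", commands))
```

With the values from Listing 10, the plug-in is therefore invoked in mode 1 with a warning level of 3.0 and a critical level of 6.0.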

Writing a plug-in—Tcl

The final example is a plug-in in Tcl that checks exchange rates from xmethods.net using Simple Object Access Protocol (SOAP) and Web Services Description Language (WSDL). The plug-in retrieves the current exchange rate through SOAP and compares it with the configured ranges. If the value is within the warning limits, the state is OK. If the value is above or below the warning limits but does not exceed the critical limits, the state is set to WARNING. Otherwise it is set to CRITICAL, unless a networking error occurs, in which case the state is set to UNKNOWN.

The plug-in accepts configurable parameters, so a single script can check exchange rates between different pairs of countries, each against its own ranges.


Listing 11. Tcl plug-in—verifying current exchange rates
#!/usr/bin/env tclsh

# parse arguments
package require cmdline
set options {
    {country1.arg "" "Country 1"}
    {country2.arg "" "Country 2"}
    {lowerwarning.arg "" "Lower warning limit"}
    {upperwarning.arg "" "Upper warning limit"}
    {lowercritical.arg "" "Lower critical limit"}
    {uppercritical.arg "" "Upper critical limit"}
}

array set opt [cmdline::getoptions argv $options {: [options]}]

# if the user did not supply all arguments, show help message
foreach necessary [array names opt] {
    if {$opt($necessary) == ""} {
        set argv "-help"
        catch {cmdline::getoptions argv $options {: [options]}} usage
        puts stderr $usage
        exit 3
    }
}

# load TclWebServices package
package require WS::Client

if {[catch {
    # download WSDL
    WS::Client::GetAndParseWsdl \
        http://www.xmethods.net/sd/2001/CurrencyExchangeService.wsdl \
        {} currency

    # create stub commands
    WS::Client::CreateStubs currency

    # download the actual exchange rate for the requested pair of countries
    set result [lindex \
        [currency::getRate $opt(country1) $opt(country2)] 1]
} error]} {
    # if downloading the rate failed for some reason, report it
    puts "EXCHANGERATE UNKNOWN: $error"
    exit 3
}
    
if {($result < $opt(lowercritical)) || ($result > $opt(uppercritical))} {
    puts "EXCHANGERATE CRITICAL: rate is $result"
    exit 2
}
if {($result < $opt(lowerwarning)) || ($result > $opt(upperwarning))} {
    puts "EXCHANGERATE WARNING: rate is $result"
    exit 1
}
puts "EXCHANGERATE OK: rate is $result"
exit 0
    

Now, you need to register this command so that Nagios knows how to invoke it. In order to do that, create a file called /etc/nagios-plugins/config/exchangerate.cfg with contents similar to previous configurations and the command definition:

command_line    /path/to/check_exchangerate -country1 $ARG1$ -country2 $ARG2$ -lowercritical $ARG3$ -lowerwarning $ARG4$ -upperwarning $ARG5$ -uppercritical $ARG6$


The check_exchangerate command name is assumed in the example below.

Next, create a service that uses the newly created plug-in to monitor exchange rates. Below is a service definition that associates the service with the localhost server. Even though the check is not really associated with any physical host, it needs to be bound to a host. If the check involves calling SOAP methods on servers inside trusted networks, you can add the actual server to be monitored and bind the service to that server instead. The code in Listing 12 checks the rate from British Pounds to Japanese Yen: the state is OK while the rate stays between 225 and 275, WARNING outside that range, and CRITICAL below 200 or above 300.


Listing 12. Adding the Tcl plug-in as a new service
define service{
        use                             service-template
        host_name                       localhost
        service_description             EXCHANGERATE
        check_period                    24x7
        contact_groups                  other-admins
        notification_options            c,r
        check_command                   check_exchangerate!England!Japan!200!225!275!300
        }
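The two nested ranges passed in Listing 12 can be summarized with a short Python 3 sketch of the same comparison logic that the Tcl plug-in applies. The function name classify and the sample rates are made up for illustration.

```python
def classify(rate, lower_critical, lower_warning, upper_warning, upper_critical):
    """Return the Nagios exit code for a value checked against nested ranges."""
    if rate < lower_critical or rate > upper_critical:
        return 2  # CRITICAL: outside the outer (critical) range
    if rate < lower_warning or rate > upper_warning:
        return 1  # WARNING: outside the inner (warning) range
    return 0      # OK: inside the warning range

# Limits taken from Listing 12: 200, 225, 275, 300. The rates are hypothetical.
for rate in (250.0, 210.0, 290.0, 310.0):
    print(rate, classify(rate, 200, 225, 275, 300))
```

Checking the critical range first matters: a value outside the critical limits is necessarily outside the warning limits too, so the order of the tests decides which state is reported.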

Conclusions

You can use Nagios to monitor all sorts of hardware and software. The opportunity to write your own plug-ins makes it possible to monitor everything that your Nagios server can communicate with. As you can use any computing language that manages command-line arguments and exit status, the possibilities are almost endless!

An advanced system administrator might extend the SOAP example with Tcl or any other language to communicate with intranet Web services and write plug-ins to verify correct behavior of the services.

It is also possible to use C plug-ins or embed C into your favorite dynamic language (using Pyinline with Python, Inline with Perl, or Critcl with Tcl) to combine your operating system's C API with your plug-in (written using high-level languages).

Another Nagios feature worth your attention is the passive check. The Nagios monitoring you've seen to this point manages short-lived status executables, launching them, and then receiving results. In passive checking, Nagios does not spawn plug-ins to check status, but separate applications send status updates to Nagios periodically or when a state of a service has changed. Such an application might receive notifications from other sources, aggregate them, and pass a computed summary to Nagios. Nagios can also assume a service is down if it has not received notifications in some period of time. We'll document implementation of a Nagios passive check in a future article.
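As a brief preview, and only as a sketch, a passive result is commonly submitted by writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The Python 3 code below formats such a line. The helper names are hypothetical, the location of the command file depends on your installation, and the sketch assumes external commands and passive checks are enabled in your configuration.

```python
import time

def passive_result_line(host, service, code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, code, output)

def submit(line, command_file):
    """Write one command line to the external command file.

    command_file is the path of the Nagios command pipe; it is
    site-specific, so it is left as a parameter here.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)

line = passive_result_line("localhost", "LoadAverage", 0,
                           "GETLOADAVG OK: Load average is 0.42")
print(line, end="")
```

The host name and service description in the line must match a service that Nagios already knows about, such as the LoadAverage service defined earlier.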

What makes Nagios plug-ins so exciting is the ease with which they're written and shared. Nagios plug-ins are useful for the situations network and system managers encounter, and, in many cases, it's simple to re-use work someone else has already done. Just as with well-run Wikis or the Web itself, it requires little to contribute a helpful example, yet the collective value of all available Nagios plug-ins is very large.


Resources

Learn

  • Nagios: For more information on Nagios, be sure to visit the official Nagios website. It contains the latest versions of applications, RPM packages, and standalone version for the Linux platform. Also, the propaganda page shows you what kind of companies use Nagios and why.

  • Several books on Nagios have been written. This review of Nagios: System and Network Monitoring also provides general background information on Nagios.

  • The Nagios Exchange is a central repository for scores of public Nagios plug-ins.

  • Binary for the AIX platform is available for free download from NagiosExchange. For other platforms, Nagios sources can be downloaded from Nagios.org.

  • Nagiosplug Developer guidelines: This page contains suggestions and good practices for writing your own Nagios plug-ins.

  • Passive Host and Service Checks send notifications to Nagios directly from your applications.

  • To learn more about Python, visit the Python homepage. The website contains download information as well as additional help on using Python. Also be sure to scan David Mertz's "Charming Python" column for developerWorks.

  • For those who need to get up to speed on Tcl, the entire Tcl documentation for many releases is available online at www.tcl.tk/. Tcl/Tk itself can be freely downloaded from its Sourceforge project.

  • Writing C code inside Python can be accomplished with the PyInline module, which can be freely downloaded from its Sourceforge project.

  • For all Tcl fans wanting to use the native OS API in their plug-ins, Critcl brings C to Tcl. A Starkit that lets you run, build, and use Critcl on your machine is available as a free download, critcl.kit.

  • Popular content: See what AIX and UNIX content your peers find interesting.

  • AIX and UNIX: The AIX and UNIX developerWorks zone provides a wealth of information relating to all aspects of AIX systems administration and expanding your UNIX skills.

  • New to AIX and UNIX?: Visit the New to AIX and UNIX page to learn more about AIX and UNIX.

  • AIX 5L Wiki: Discover a collaborative environment for technical information related to AIX.

  • Safari bookstore: Visit this e-reference library to find specific technical resources.

  • developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.

  • Podcasts: Tune in and catch up with IBM technical experts.

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.
