Basic Monitoring with Nagios
Nagios calls itself an "open source host, service and network monitoring program". In reality, though, it's more of a monitoring framework, in that it allows an administrator to quickly fold the one-liners they use to gather information right into the configuration. Add to this the numerous plugins available, and you can easily integrate Nagios with monitoring tools you already use, like RRDTool or MRTG.
First, though, you need to get your head around the way Nagios approaches configuration in general, so we'll start there with a relatively simple configuration. To get anything useful out of Nagios, there are four things, at a minimum, that need to be configured. They are hosts, hostgroups, contacts, and services. I'm going to assume that, as administrators, you're as capable of reading the README and INSTALL files that come with Nagios as I am, so I'm not covering installation, and I'm also making the assumption that, once installed, the configuration directory is /etc/nagios. In this directory, there should be sample configuration files to give you an idea of how things work. If not, no worries -- we'll create them.
The logic behind configuring Nagios is very (almost too) simple. You have hosts, on which presumably run services. Hosts providing the same services can be grouped together into hostgroups for easy summarization in the web front end. Likewise, your organization probably has contacts for the different services. If there's more than one contact for a particular service, you can put these contacts together under an alias or contactgroup. If a machine Nagios monitors goes down or loses a service it's been running, Nagios can be configured to notify the proper contact or group for that host or service.
We'll go in the order of the above paragraph in our configuration, so you can always refer back to it if your mind gets numb. Let's create two hosts. Here are two from my test configuration:
alias Cycle Server 1
alias VPN Server 1
This is a small host configuration file. The first entry will save you some typing, since "generic-host" is just a template. In the other two entries, I've put use generic-host, which automatically applies generic-host's settings to every host that uses it. Line 1 of the template (name) assigns the template a name. Line 2 (notifications_enabled) allows you to turn notifications on and off, which is great for keeping your inbox from exploding during testing with a large number of hosts. Line 3 (event_handler_enabled) enables event handling, which allows you to define a set of actions to take when Nagios detects a change in the state of a host or service it's monitoring. Line 4 (flap_detection_enabled) protects your inbox or pager in the event that a service or host is intermittently (and frequently) changing state due to a network anomaly. Line 5 (process_perf_data) aggregates the data collected from the various hosts and services to give you pretty reports as to the availability of your network environment. Lines 6 and 7 (retain_status_information and retain_nonstatus_information) cause Nagios to hold on to "last known values" across restarts of Nagios. Keep in mind that this includes the program's own settings! Read the user guide on ways to get around this. The last line (register 0) tells Nagios not to treat this entry as a normal host and register it as such. It's just a template!
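For reference, a template matching the line-by-line description above might look like the following. Treat it as a sketch: the directive names are assumed from the stock Nagios sample configuration (they mirror the generic-service template shown later), so check them against the sample files that ship with your version.

```
# Sketch of a host template; directive names assumed from the stock samples.
define host{
        name                            generic-host    ; Line 1: the template's name, referenced via 'use'
        notifications_enabled           1               ; Line 2: set to 0 to silence notifications while testing
        event_handler_enabled           1               ; Line 3: allow event handlers to run on state changes
        flap_detection_enabled          1               ; Line 4: damp notifications for rapidly flapping hosts
        process_perf_data               1               ; Line 5: process performance data
        retain_status_information       1               ; Line 6: keep status information across restarts
        retain_nonstatus_information    1               ; Line 7: keep non-status information across restarts
        register                        0               ; Last line: a template, not a real host
        }
```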
Our host definitions are fairly ho-hum. First, we use the generic-host template for both host definitions, so everything that is true for the template is automagically true for the host definitions that use it. The host_name line is supposed to be the actual hostname the machine in question goes by, while the alias line is what the titles in the Nagios web interface will say. The check_command line specifies which Nagios command to use to determine if the host is even up. You can figure out what the check-host-alive command does by looking in the command.cfg file for the corresponding entry. The notification settings are such that the machines are monitored 24x7, a host is considered "dead" if the check fails 10 times, and notifications are repeated at an interval of 120 (Nagios counts intervals in time units of interval_length seconds, which defaults to 60, so unless you've changed that setting this means minutes rather than seconds). The notification_options line tells Nagios, per host, what will cause a notification to be sent. In my case, I use "d"own, "u"nreachable, and "r"ecovered. So if a machine is down, I get a message when it goes down, and another when it recovers from the down state. Using "unreachable" as an option is a little obsessive if you're in a network where small occasional glitches can cause one machine or another to become unreachable temporarily.
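Put together, two host definitions along the lines described above could look like this sketch. The host name cycle1 and both addresses are placeholders invented for illustration (the addresses come from the reserved documentation range), and the directive names are assumed from the stock sample configuration; only the aliases, vpn1, and the settings discussed above come from this article.

```
define host{
        use                     generic-host            ; Inherit the template's settings
        host_name               cycle1                  ; Hypothetical name for the first host
        alias                   Cycle Server 1          ; Shown in the web interface
        address                 192.0.2.10              ; Placeholder address
        check_command           check-host-alive        ; Defined in your command file
        max_check_attempts      10                      ; "Dead" after 10 failed checks
        notification_interval   120                     ; Re-notify every 120 time units
        notification_period     24x7                    ; Assumes a 24x7 timeperiod is defined
        notification_options    d,u,r                   ; Down, unreachable, recovered
        }

define host{
        use                     generic-host
        host_name               vpn1
        alias                   VPN Server 1
        address                 192.0.2.11              ; Placeholder address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }
```

If unreachable notifications turn out to be too noisy on your network, trimming notification_options to d,r is the one-line change.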
OK, so we now have two hosts configured. At this point, all Nagios knows how to do is ping them to see if they're alive, though. Let's set up monitoring for the individual services on those machines that we care about. In this case, since I administer the machines, I know that they both run SSH daemons, and vpn1 also runs a name server and a print server. So we have three services to set up. The configuration file format is the same for pretty much everything, so this configuration file shouldn't be too scary by now:
First, again, we have a template entry where we can set flags that can then essentially be "included" by the other entries, just like we had for our hosts.
define service{
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Parallelize active service checks for performance
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        register                        0       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
        }
With the template defined, we can go about the business of configuring our actual services, and reference the template to get those settings on a per-service basis.
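As a sketch of what such service entries can look like, here are the three services mentioned earlier: SSH everywhere, plus DNS and printing on vpn1. Several things here are assumptions rather than the author's actual file: the check intervals and attempt counts, the contact group name solaris-admins, the use of check_tcp against port 515 for the print server, and the existence of check_ssh, check_dns, and check_tcp entries in your command file. The interval directive names are the older ones; newer releases may prefer different spellings, so consult the documentation for your version.

```
define service{
        use                     generic-service         ; Inherit the template's settings
        host_name               *                       ; Wildcard: every configured host
        service_description     SSH
        check_period            24x7
        max_check_attempts      3                       ; Assumed value
        normal_check_interval   5                       ; Assumed value
        retry_check_interval    1                       ; Assumed value
        contact_groups          solaris-admins          ; Hypothetical group name
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r                 ; Warning, unknown, critical, recovered
        check_command           check_ssh               ; Assumes check_ssh is defined
        }

define service{
        use                     generic-service
        host_name               vpn1
        service_description     DNS
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          solaris-admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_dns               ; Assumes check_dns is defined
        }

define service{
        use                     generic-service
        host_name               vpn1
        service_description     Print Server
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          solaris-admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_tcp!515           ; Assumption: lpd port, passed as $ARG1$
        }
```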
Looks somewhat familiar, no? Note that for the SSH service, I've used a wildcard for the host_name. This is because I want to monitor that service on every host configured. Most of the other flags are the same as in the other configuration files, which is good news. Note the check_command here. This command is defined in command.cfg, and that definition points to the actual script used to check the service. This is good for two reasons: first, it allows you to tweak the script if needed. Second, it means that you can drop in your own script to check whatever wacky service you might have running, and define your own command, which can then be applied to whatever hosts you want.
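To make the "define your own command" idea concrete, here is a minimal sketch. The command name check_widget, the script check_widget.sh, its -H and -p options, and the port are all hypothetical; $USER1$ is conventionally set to the plugin directory in the resource file, and $HOSTADDRESS$ and $ARG1$ are standard Nagios macros. Comments sit on their own lines because Nagios treats a semicolon as the start of a comment, which would truncate a command line.

```
# Hypothetical custom check: the script lives in the plugin directory
# and takes the host address plus a port supplied by the service entry.
define command{
        command_name    check_widget
        command_line    $USER1$/check_widget.sh -H $HOSTADDRESS$ -p $ARG1$
        }

# A service that uses it; the value after '!' becomes $ARG1$.
define service{
        use                     generic-service
        host_name               vpn1
        service_description     Widget
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          solaris-admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_widget!8080
        }
```

Whatever the script does internally, Nagios judges it by its exit status (0 for OK, 1 for warning, 2 for critical, 3 for unknown) and shows the first line it prints.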
So what if only a subset of my hosts are running SSH? I don't want to have to list every single host that runs SSH individually, and I don't want to run SSH everywhere -- what to do? Configure a hostgroup called "SSH Servers" that contains the hostnames of those hosts in the hosts.cfg file that run SSH. Here's an example of what a hostgroup looks like:
alias SSH Servers
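A complete hostgroup entry, and an SSH service that targets it, might look like the sketch below. The short names ssh-servers, cycle1, and solaris-admins are hypothetical, as are the check intervals. The contact_groups line inside the hostgroup follows the layout this article describes; later Nagios releases attach contact groups to hosts instead, so drop that line if your version rejects it.

```
define hostgroup{
        hostgroup_name  ssh-servers             ; Hypothetical short name
        alias           SSH Servers             ; Shown in the web interface
        contact_groups  solaris-admins          ; Must name an existing contact group
        members         cycle1,vpn1             ; host_name values from hosts.cfg
        }

define service{
        use                     generic-service
        hostgroup_name          ssh-servers     ; Replaces host_name: applies to every member
        service_description     SSH
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          solaris-admins
        notification_interval   120
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check_ssh
        }
```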
Now, when you define the SSH service, just use hostgroup_name in place of host_name, and all the right things will happen. This is also nice because, as you add the service to different machines, you can just add their hostnames to the right group, and off you go! One quick note, though: if you define a hostgroup that refers to a non-existent contact group, you'll get errors from Nagios, so let's create one of those!
If a service becomes unavailable, you probably want someone to know about it PDQ. After all, if all Nagios did was make shiny pictures in your browser, nobody would use it! Just as there are hosts and host groups, there are also contacts and contact groups. You define all of your contacts in the contacts.cfg file, and collect those contacts into groups in the contactgroups.cfg file. Here is a typical contact definition:
alias Brian K. Jones
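Filled out, a contact definition of this kind could look like the following sketch. The short name bkjones and the email address are placeholders, and the notify-by-email and host-notify-by-email commands are assumed to exist in your command file, as they do in the stock samples.

```
define contact{
        contact_name                    bkjones                 ; Hypothetical short name
        alias                           Brian K. Jones          ; Shown in the web interface
        service_notification_period     24x7                    ; When service alerts may be sent
        host_notification_period        24x7                    ; When host alerts may be sent
        service_notification_options    w,u,c,r                 ; Assumed: warning, unknown, critical, recovery
        host_notification_options       d,r                     ; Only down and recovery for hosts
        service_notification_commands   notify-by-email         ; Assumes this command is defined
        host_notification_commands      host-notify-by-email    ; Assumes this command is defined
        email                           bkjones@example.com     ; Placeholder address
        }
```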
Here, I've defined a contact (with an alias for easy viewing in a browser), the periods during which this contact will receive host and service outage notifications, what types of notifications I'll get (only "d"own and "r"ecovery messages for hosts, for example), the notification mechanism (email in this case), and finally, an email address. Setting up groups of contacts simply requires that the contacts exist in the contacts.cfg file, and they're put together exactly like hostgroups -- just give 'em a name, an alias, and a list of members, and you're all set. Here's a quick example:
alias Solaris Administrators
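In full, a contact group is only a few lines. In this sketch the group name solaris-admins and the member bkjones are hypothetical; each member has to match the contact_name of a contact defined in contacts.cfg.

```
define contactgroup{
        contactgroup_name       solaris-admins          ; Hypothetical short name
        alias                   Solaris Administrators  ; Shown in the web interface
        members                 bkjones                 ; Comma-separated contact_name values
        }
```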
It's a good idea to create all the groups you're likely to need going forward, even if they only have one member. This way, when you add a Solaris administrator (in this case), you only have to add them to the contacts file, and then to the group definition, instead of having to hard-code the contact name everywhere it belongs.
The goal of this article was not to put pretty stuff in your browser. It was to get the uninitiated over the initial hump of understanding how Nagios is generally configured. With this knowledge, it will be a little easier to set up a simple configuration, get some useful information back in the web interface, and then browse the documentation and sample config files to grow your monitoring solution.
Nagios benefits from fairly good documentation, and a fairly simple configuration. It also suffers from this same simple configuration. In a production environment with a web server farm, several DNS servers, mail servers, development servers, remote access hosts, file servers and the like, writing all of that configuration by hand can really be slow as molasses. There are a couple of web-based tools to help ease the configuration burden, but none label themselves as anything more than "beta". Overall, Nagios is a very powerful, very flexible monitoring solution, with many plugins available to do almost anything, and with a seemingly endless number of options for notification and service monitoring. The best part, though, is that the layout and design of Nagios make it amazingly easy to drop in your own ideas that may be specific to your environment's needs. I highly recommend getting to know Nagios.