Notifications and Events in Nagios 3.0: Part 1

There is a lot that Nagios can do to make your life easier. Imagine that you set up Nagios to send a text message to your mobile phone during the daytime. It can also send you a message over Jabber or MSN. Imagine that you also make Nagios stop notifying you when your workstation is not online.

Even though the examples above seem complicated, they are actually quite simple to implement. It's a matter of combining event handlers with custom variables, and a little ingenuity. A service that checks whether a user's workstation is present can have an event handler that automatically enables and disables host and/or service notifications for a contact or contact group.
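The workstation idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the host name jdoe-workstation, the generic-service template, the check_ping command with these thresholds, the _OWNERCONTACT custom variable, and the toggle-contact-notifications script are all hypothetical names, not something this article defines. The script itself is not shown; it would be expected to submit external commands such as ENABLE_CONTACT_SVC_NOTIFICATIONS or DISABLE_CONTACT_SVC_NOTIFICATIONS for the contact passed as its last argument.

```cfg
# Sketch only: host, template, script and custom variable names are assumptions.
define service {
    use                     generic-service
    host_name               jdoe-workstation               ; hypothetical host entry for the workstation
    service_description     PRESENCE
    check_command           check_ping!100.0,20%!500.0,60%
    notifications_enabled   0                              ; nobody needs an alert about this check itself
    event_handler           toggle-contact-notifications
    _OWNERCONTACT           jdoe                           ; exposed as $_SERVICEOWNERCONTACT$
}

# Hypothetical handler: receives the new state, its type, and the contact to toggle.
define command {
    command_name   toggle-contact-notifications
    command_line   $USER1$/eventhandlers/toggle-contact-notifications $SERVICESTATE$ $SERVICESTATETYPE$ $_SERVICEOWNERCONTACT$
}
```

Keeping the contact name in a custom variable means the same command definition can be reused for every workstation check.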

It's also possible to set up your monitoring to notify managers if the issue has not been fixed within a certain period of time. Based on the importance of a host or service, these can be different managers that are notified and different time periods after which the notification is sent. Nagios can also be used to notify emergency response teams, so that if a problem is not fixed in a short period of time, they will assist in recovering from the potential after effects of this problem.

There are cases when you want Nagios to perform one or more actions when a service starts or stops malfunctioning. For instance, you might have a web server check set up to retry five times before a failure becomes a hard state for Nagios. In such a case, you can also configure Nagios to try restarting the web server after the third soft failure. If the restart fails, the service will move to a hard state after the next two failures. If the restart succeeds, a hard state will not even be recorded and only a soft failure will be logged.
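A minimal sketch of such a setup is shown below. The host name webserver1, the generic-service template, the check_http command, and the restart-httpd script are assumptions made for illustration. The script is not shown; it would be expected to restart the web server only when it is called with CRITICAL, SOFT, and 3 as its arguments, and to do nothing in all other cases.

```cfg
# Sketch only: host, template, command and script names are assumptions.
define service {
    use                   generic-service
    host_name             webserver1        ; hypothetical host
    service_description   HTTP
    check_command         check_http
    max_check_attempts    5                 ; a failure becomes HARD on the fifth attempt
    event_handler         restart-httpd
}

# Hypothetical handler: Nagios passes the state, state type and attempt number.
define command {
    command_name   restart-httpd
    command_line   $USER1$/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
```

Nagios runs the event handler on every state change and on every soft retry, so the decision about when to act belongs in the script, not in the configuration.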

Nagios is able to integrate with other applications, which can send commands to Nagios directly and can report the status of host or service checks. The Nagios web interface uses this command mechanism itself, but you can just as well use it inside your own applications or in event handlers for various objects.

Effective Notifications

This section covers notifications in depth and describes the details of how Nagios can tell other people about what is happening. We will discuss a simple approach, as well as a more complex approach on how notifications can make your life easier.

Probably, most people already know that a plain email notification about a problem may not always be the right thing to do. As people's inboxes get cluttered with emails, the usual approach is to create rules to move certain messages that they don't even look at to separate folders. There's a pretty good chance that if people start getting a lot of notifications that they won't need to react to, they'll simply ask their favorite mailing program to move these messages into a 'do not look in here unless you have plenty of time' folder. Moreover, in such cases, if there is an issue they should be handling, they will most probably not even see the notification email.

This section talks about the things that can be implemented in your company to make notifications more convenient for the IT staff. Limiting the amount of irrelevant information sent to various people tends to improve their response time, as they will have much less information to filter out.

At this point, it's worth mentioning that there's another easy solution. Again, most people do not use it even though it offers a very flexible setup in an easy way. The approach is to create multiple contacts for a single person. For example, you can set up different contacts for when you're at work and when you're offline, and define a profile that does not disturb you too much during the night.

The first issue that many Nagios administrators overlook is the ability to create more than one notification command. In this way, Nagios can try to notify you on both instant messaging (such as Jabber, Gtalk, MSN, or Yahoo) and email. It can also send you an SMS. A disadvantage is that at some point, you might end up receiving SMSes at 2 AM about an outage of a machine that may well be down for the next 3 days and is not critical.

For example, you can set up the following contacts to handle various times of the day in a different fashion:

jdoe-workhours would be a contact that will only receive notifications during working hours; notifications will be carried out using both the corporate IM system and an email
jdoe-daytime would be a contact that will only receive notifications between 7 AM and 10 PM, excluding working hours; notifications will be sent as a text or a pager message, and an email
jdoe-night would be a contact that will only receive notifications between 10 PM and 7 AM; notifications will only be sent out as an email

All entries would also contain contactgroups pointing to the same groups that the single jdoe contact entry used to contain. This way, the other objects such as hosts, services, or contact groups related to this user would not be affected. All entries would also reside in the same file; for example, contacts/jdoe.cfg.
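The three contacts described above might look roughly like this in contacts/jdoe.cfg. This is a sketch under stated assumptions: working hours are taken to be 09:00-17:00 from Monday to Friday, the admins contact group and the notify-*-by-email commands are assumed to come from the sample configuration, and the notify-*-by-im and notify-*-by-sms commands are hypothetical names that you would have to define yourself. A contact template is used here only to avoid repeating the shared directives.

```cfg
# Sketch only: time ranges, group name, addresses and the IM/SMS commands are assumptions.
define timeperiod {
    timeperiod_name   jdoe-tp-workhours
    alias             Working hours
    monday            09:00-17:00
    tuesday           09:00-17:00
    wednesday         09:00-17:00
    thursday          09:00-17:00
    friday            09:00-17:00
}

define timeperiod {
    timeperiod_name   jdoe-tp-daytime
    alias             07:00 to 22:00 outside working hours
    monday            07:00-09:00,17:00-22:00
    tuesday           07:00-09:00,17:00-22:00
    wednesday         07:00-09:00,17:00-22:00
    thursday          07:00-09:00,17:00-22:00
    friday            07:00-09:00,17:00-22:00
    saturday          07:00-22:00
    sunday            07:00-22:00
}

define timeperiod {
    timeperiod_name   jdoe-tp-night
    alias             22:00 to 07:00
    monday            00:00-07:00,22:00-24:00
    tuesday           00:00-07:00,22:00-24:00
    wednesday         00:00-07:00,22:00-24:00
    thursday          00:00-07:00,22:00-24:00
    friday            00:00-07:00,22:00-24:00
    saturday          00:00-07:00,22:00-24:00
    sunday            00:00-07:00,22:00-24:00
}

# Template holding everything the three contacts share; it is not registered as a contact.
define contact {
    name                           jdoe-base
    register                       0
    alias                          John Doe
    contactgroups                  admins
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    email                          jdoe@example.com
    pager                          0000000000      ; placeholder number
}

define contact {
    use                             jdoe-base
    contact_name                    jdoe-workhours
    host_notification_period        jdoe-tp-workhours
    service_notification_period     jdoe-tp-workhours
    host_notification_commands      notify-host-by-im,notify-host-by-email
    service_notification_commands   notify-service-by-im,notify-service-by-email
}

define contact {
    use                             jdoe-base
    contact_name                    jdoe-daytime
    host_notification_period        jdoe-tp-daytime
    service_notification_period     jdoe-tp-daytime
    host_notification_commands      notify-host-by-sms,notify-host-by-email
    service_notification_commands   notify-service-by-sms,notify-service-by-email
}

define contact {
    use                             jdoe-base
    contact_name                    jdoe-night
    host_notification_period        jdoe-tp-night
    service_notification_period     jdoe-tp-night
    host_notification_commands      notify-host-by-email
    service_notification_commands   notify-service-by-email
}
```

Note that a time range in a timeperiod definition cannot wrap around midnight, which is why the night period is written as two ranges per day.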

The main drawback of this approach is that logging on to the web interface would require using one of the users above or keeping the jdoe contact without any notifications, just to be able to log on to the interface.

The example above combined both the creation of multiple contacts and use of multiple notification commands to achieve a convenient way of getting notified about a problem. Using only multiple contacts also works fine. Another approach to the problem is to define different contacts for different ways of being notified—for example, jdoe-email, jdoe-sms, and jdoe-jabber. This way, you can define different contact methods for various time periods—instant messages during working hours, SMSes while on duty, and an email when not at work.

Another important issue is to make sure that as few people as possible are notified of the problem. Imagine there is a host without an explicit administrator assigned to it. A notification about a problem gets sent out to 20 different people. In such a case, either each of them will assume that someone else will resolve the problem, or people will run into a communication problem over discussing who will actually handle it.

Teams that cooperate tightly with each other usually solve these issues naturally—knowledgeable people start discussing a solution and a natural person to solve the issue comes out of the discussion. However, the teams that are distributed across various locations or that have poor communication skills will run into problems in such cases.

This is why it is a good idea to either nominate a coordinator who will assign tasks as they arise, or try to maintain a short list of people responsible for each machine. If you need to make sure that other people will investigate the problem if the original owner of the machine cannot do it immediately, then it is a good idea to use escalations for this purpose. These are described later in this article.

Previously, we mentioned that notifications only via email may not always be the best thing to do. For example, they don't work well for situations that require fast response times. There are various reasons behind this. Firstly, emails are slow—even though the email lands on your mail server in a few seconds, people usually only poll their emails every few minutes. Secondly, people tend to filter emails and skip those that they are not interested in.

Another good reason why emails should not always be used is that they stay on your email account until you actually fetch and read them. If you have been on a 2-week vacation and a problem has occurred, should you still be worried when you read it after you get back? Has the issue been resolved already?

If your team needs to react to problems promptly, using email as the basic notification method is definitely not the best choice. Let's consider what other possibilities exist to notify users of a problem effectively.

As already mentioned, a very good choice is to use instant messaging or SMS (Short Message Service) messages as the basic means of notification, and only use email as a last resort. Some companies might also use the client-server approach to notify the users of the problems, perhaps integrated with showing Nagios' status only for particular hosts and services. NagiosExchange has plenty of available solutions you can use for handling notifications effectively.

The first and the most powerful option is to use Jabber for notifications. There is an existing script for this that is available in the contributions repository on the Nagios project website. This is a small Perl script that sends messages over Jabber. You may need to install additional system packages to handle Jabber connectivity from Perl. On Ubuntu, this requires running the following command:

root@ubuntu1:~# apt-get install libnet-jabber-perl

If you are using CPAN to install Perl packages, then simply run the following command:

root@ubuntu1:~# cpan Net::Jabber

In order to use the notification plugin, you will need to customize the script—change the SERVER, PORT, USER, and PASSWORD parameters to an existing account. Our recommendation is to create a separate account to use only for Nagios notifications—you will need to set up authorization for each user that you want to send notifications to.

As you plan to monitor servers and potentially even outgoing Internet connectivity, it would not be wise to use public Jabber servers for reporting errors. Therefore, it would be a good idea to set up a private Jabber server, probably on the same host on which the Nagios monitoring system is running.

If you plan to have a more comfortable setup, you can also use Tkabber as a Jabber client, and write a plugin that reads the object cache and the current status from the Nagios host and shows an up-to-date report for the hosts that you own. Information on reading Nagios output can be found on my Tclmentor blog.

Another possibility is to send messages over the SMB/CIFS protocol. This way, you can send messages directly to the computers, assuming people are running the Microsoft Windows operating system. There is also the possibility of receiving messages using the Samba package on UNIX machines. Sending messages requires having the smbclient command installed. On Ubuntu, this requires running the following command:

root@ubuntu1:~# apt-get install smbclient

A simple command definition example that uses smbclient directly to send messages to the specified host name is as follows:

define command
{
command_name notify_host_via_smbclient
command_line printf "Host notification: $NOTIFICATIONTYPE$\n\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$" | smbclient -M $_CONTACTSMBHOSTNAME$
}

The preceding example uses the $_CONTACTSMBHOSTNAME$ macro definition. It maps to the _SMBHOSTNAME custom variable defined for a specified contact. In order for Windows XP and 2003 to show the messages from other users correctly, you will need to enable the Messenger service. This can be done by running the following command as the system administrator, or as a user with administrator privileges:

C> net start Messenger
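Returning to the $_CONTACTSMBHOSTNAME$ macro used in the command definition above, a contact that supplies the matching custom variable could look like the following sketch. The contact details and the workstation name JDOE-PC are placeholders, and the 24x7 time period and the notify-service-by-email command are assumed to exist, as they do in the sample configuration.

```cfg
# Sketch only: contact details and the workstation name are placeholders.
define contact {
    contact_name                    jdoe
    alias                           John Doe
    host_notification_period        24x7
    service_notification_period     24x7
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      notify_host_via_smbclient
    service_notification_commands   notify-service-by-email
    email                           jdoe@example.com
    _SMBHOSTNAME                    JDOE-PC      ; read back as $_CONTACTSMBHOSTNAME$
}
```

Because the target machine is stored per contact, the same notification command works for every contact that defines _SMBHOSTNAME.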

Another way to communicate problems to the users is to use text messages, also known as SMS. This is a very sensitive issue because if your system is not properly configured, it can send a message in the middle of the night about a noncritical issue that can be fixed within the next 5 working days.

There is a very useful package for sending SMS messages called SMSServerTools. It allows the configuration of email and web gateways, as well as sending text messages over dedicated GSM (Global System for Mobile Communication) terminals. The tool can queue text messages, which lets it cope with a larger number of messages and send each one by the appropriate means.

GSM terminals work in a manner similar to a typical mobile phone. They use a standard SIM card and have a normal GSM phone module that is used to send SMS messages. Terminals are usually connected via a serial port or USB connection. Your server can then send messages by sending commands to the terminal. GSM terminals use the same command convention as phone modems, although each model uses a different set of commands. For information on how you can send SMS messages over a particular terminal, please refer to its user manual.

Current mobile phones also offer cheap Internet connectivity, and smart devices offer the possibility to write custom applications in Java, .NET, and many other languages including Python and Tcl. Therefore, you can also make a client-server application that queries the server for the status of selected hosts and services. It can even be unified with a notification command that pushes the changes down to the application immediately.

These are only a few of the possibilities that you can use to communicate problems more effectively.

Other possibilities include a ready-to-use client-server application (visit http://www.nagiosexchange.org/Notifications.35.0.html?tx_netnagext_pi1[p_view]=182) that allows notifications to be sent directly to people's desktop machines. One interesting notification command allows you to choose other commands to use based on user availability on Jabber: it sends messages over Jabber if the user is available and uses SMSes or emails otherwise (visit http://www.nagiosexchange.org/Notifications.35.0.html?&tx_netnagext_pi1[p_view]=1036).

There are also tools to send messages to ICQ users and ones that use VoIP technology to provide you with predefined wave messages or output from a speech synthesis system.

Escalations

A common difficulty in resolving problems is that a host or a service may have blurred ownership. Often, there is no single person responsible for a host or service, which makes things harder. It is also typical to have a service with subtle dependencies on other things, which by themselves are small enough not to be monitored by Nagios. In such a case, it is good to include lower management in the escalations so that they are able to focus on problems that haven't been resolved in a timely manner.

Here is a good example: a database server might fail because a small Perl script that runs prior to the actual start and cleans things up has entered an infinite loop. The owner of this machine gets notified. But the question is, who should be fixing it? Should it be the script owner? Or perhaps, should it be the database administrator? In IT reality, this often ends up with people throwing the ball into each other's yards without solving anything.

Escalations are a great way to solve such complex problems. In the previous example, if the problem has not been resolved after two hours, the IT team coordinator or manager would be notified. Another hour later, he would get another email. At that point, he would schedule an urgent meeting with the developer who owns the script, and the database admin, to discuss how this could be solved.

Of course, in real-world scenarios, escalating to management alone would not solve all problems. However, often, situations need a coordinator that will take care of communicating issues between teams and trying to find a company-wide solution. Business-critical services also require much higher attention. In such cases, it is a real benefit for the company if it has an escalation ladder that can be followed for all major problems.

Nagios offers many ways to set up escalations, depending on your needs. Escalations do not need to be sent out just after a problem occurs—that would create confusion and prevent smaller problems from being solved. Usually, escalations are set up so that additional people are informed only if a problem has not been resolved after a certain amount of time.

From a configuration point of view, all escalations are defined as separate objects. There are two types of objects: hostescalation and serviceescalation. Escalations are configured so that they start and stop being active along with the normal host or service notifications. This way, if you change the notification_interval directive in the host or service definition, the times at which escalations start and stop will also change.

A sample escalation for the company's main router is as follows:

define hostescalation
{
host_name mainrouter
contact_groups it-management
first_notification 2
last_notification 0
notification_interval 60
escalation_options d,u,r
}

The following table describes all available directives for defining a host escalation. Items in bold are required when specifying an escalation.

Option	Description
host_name	Host names that the escalation applies to; separated by commas
hostgroup_name	Host group names; the escalation applies to all members of these groups; separated by commas
contacts	List of contacts that should receive notifications related to this escalation; separated by commas; at least one contact or contact group needs to be specified for each escalation
contact_groups	List of contact groups that should receive notifications related to this escalation; separated by commas; at least one contact or contact group needs to be specified for each escalation
first_notification	Number of the first notification for which this escalation is active; see description below
last_notification	Number of the last notification for which this escalation is active; setting this to 0 causes notifications to be sent until the host recovers from the problem; see description below
notification_interval	Number of minutes between notifications related to this escalation
escalation_period	Time period during which the escalation is valid; if not specified, defaults to 24 hours a day, 7 days a week
escalation_options	Host states for which notifications should be sent; separated by commas; one or more of the following: d (host DOWN state), u (host UNREACHABLE state), r (host recovery, UP state)
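To illustrate how first_notification and last_notification combine into an escalation ladder, here is a sketch with two levels for the same router. The admins and it-coordinators contact groups are hypothetical. While an escalation is active, notifications go to the contacts listed in that escalation, so the groups from the lower levels are repeated in the higher one to keep them informed.

```cfg
# Sketch only: the admins and it-coordinators groups are hypothetical.
# Escalations use contact_groups; contact definitions use contactgroups.

# Notifications 2 to 4 go to the administrators and the coordinators.
define hostescalation {
    host_name               mainrouter
    contact_groups          admins,it-coordinators
    first_notification      2
    last_notification       4
    notification_interval   60
    escalation_options      d,u,r
}

# From the fifth notification until the host recovers, management is included as well.
define hostescalation {
    host_name               mainrouter
    contact_groups          admins,it-coordinators,it-management
    first_notification      5
    last_notification       0
    notification_interval   120
    escalation_options      d,u,r
}
```

Assuming the host's own notification_interval is also 60 minutes, the second notification would go out roughly an hour after the first one, and the fifth roughly four hours after it.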

Service escalations are defined in a very similar way to host escalations. You can specify one or more hosts or host groups, as well as a single service description. Service escalation will be associated with this service on all hosts mentioned in the host_name and hostgroup_name attributes.

The following is an example of a service escalation for an OpenVPN check on the company's main router:

define serviceescalation
{
host_name mainrouter
service_description OpenVPN
contact_groups it-management
first_notification 2
last_notification 0
notification_interval 60
escalation_options w,c,r
}
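As a variation on the example above, the same escalation can be attached to the service on every member of a host group by using hostgroup_name in place of host_name. The vpn-routers host group is a hypothetical name used only for this sketch.

```cfg
# Sketch only: the vpn-routers host group is hypothetical.
define serviceescalation {
    hostgroup_name          vpn-routers
    service_description     OpenVPN
    contact_groups          it-management
    first_notification      2
    last_notification       0
    notification_interval   60
    escalation_options      w,c,r
}
```

This applies to the OpenVPN service on each host in the group, so every one of those hosts needs a service with exactly that description.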

The following table describes all available directives for defining a service escalation. Items in bold are required when specifying an escalation.