Notifications and Events in Nagios 3.0- part2

0
155
15 min read

External Commands

Nagios offers a very powerful mechanism for receiving events and commands from external applications—the external commands pipe. This is a pipe file created on a file system that Nagios uses to receive incoming messages. The name of the file is rw/nagios.cmd and it is located in the directory passed as the localstatedir option during compilation. Following the compilation and installation instructions and the given guidelines, the file name will be /var/nagios/rw/ nagios.cmd.

The communication does not use any authentication or authorization—the only requirement is to have write access to the pipe file. An external command file is usually writable by the owner and the group; the usual group used is nagioscmd. If you want a user to be able to send commands to the Nagios daemon, simply add that user to this group.

A small limitation of the command pipe is that there is no way to get any results back and so it is not possible to send any query commands to Nagios. Therefore, by just using the command pipe, you have no verification that the command you have just passed to Nagios has actually been processed, or will be processed soon. It is, however, possible to read the Nagios log file and check if it indicates that the command has been parsed correctly, if necessary.

An external command pipe is used by the web interface to control how Nagios works. The web interface does not use any other means to send commands or apply changes to Nagios. This gives a good understanding of what can be done with the external command pipe interface.

From the Nagios daemon perspective, there is no clear distinction as to who can perform what operations. Therefore, if you plan to use the external command pipe to allow users to submit commands remotely, you need to make sure that the authorization is in place as well so that it is not possible for unauthorized users to send potentially dangerous commands to Nagios.

The syntax for formatting commands is easy. Each command must be placed on a single line and end with a newline character. The syntax is as follows:

[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN

TIMESTAMP is written as UNIX time—that is the number of seconds since 1970-01-01 00:00:00. This can be created by using the date +%s system command. Most programming languages also offer the means to get the current UNIX time. Commands are written in upper case. This can be one of the commands that Nagios should execute, and the arguments depend on the actual command.

For example, to add a comment to a host stating that it has passed a security audit, one can use the following shell command:

echo "['date +%s'] ADD_HOST_COMMENT;somehost;1;Security Audit;
This host has passed security audit on 'date +%Y-%m-%d'"
>/var/nagios/rw/nagios.cmd

This will send an ADD_HOST_COMMENT command (visit http://www.nagios.org/developerinfo/externalcommands/commandinfo.php? command_id=1) to Nagios over the external command pipe. Nagios will then add a comment to the host, somehost, stating that the comment originated from Security Audit. The first argument specifies the host name to add the comment to; the second tells Nagios if this comment should be persistent. The next argument describes the author of the comment, and the last argument specifies the actual comment text.

Similarly, adding a comment to a service requires the use of the ADD_SVC_COMMENT command (visit http://www.nagios.org/developerinfo/externalcommands/ commandinfo.php? command_id=1) . The command’s syntax is very similar to the ADD_HOST_COMMENT command except that the command requires the specification of the host name and service name.

For example, to add a comment to a service stating that it has been restarted, you should use the following

echo "['date +%s'] ADD_SVC_COMMENT;router;OpenVPN;1;nagiosadmin;
Restarting the OpenVPN service" >/var/nagios/rw/nagios.cmd

The first argument specifies the host name to add the comment to; the second is the description of the service to which Nagios should add the comment. The next argument tells Nagios if this comment should be persistent. The fourth argument describes the author of the comment, and the last argument specifies actual comment text.

You can also delete a single comment or all comments using the DEL_HOST_ COMMENT (visit http://www.nagios.org/developerinfo/externalcommands/ commandinfo.php? command_id=3), DEL_ALL_HOST_COMMENTS (visit http://www. nagios.org/developerinfo/externalcommands/commandinfo.php? command_ id=13), and DEL_SVC_COMMENT (visit http://www.nagios.org/developerinfo/externalcommands/commandinfo.php? command_id=4) or DEL_ALL_SVC_COMMENTS commands (visit http://www.nagios.org/developerinfo/externalcommands/ commandinfo.php? command_id=14).

Other commands worth mentioning are related to scheduling checks on demand. Very often, it is necessary to request that a check be carried out as soon as possible; for example, when testing a solution.

This time, let’s create a script that schedules a check of a host, all services on that host, and a service on a different host, as follows:

#!/bin/sh
NOW='date +%s'
echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW"
>/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW"
>/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW"
>/var/nagios/rw/nagios.cmd
exit 0

The commands SCHEDULE_HOST_CHECK (visit http://www.nagios.org/ developerinfo/externalcommands/commandinfo.php? command_id=127) and SCHEDULE_HOST_SVC_CHECKS (http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=30) accept a host name and the time at which the check should be scheduled. The SCHEDULE_SVC_CHECK command (visit http://www.nagios.org/developerinfo/externalcommands/ commandinfo.php? command_id=29) requires the specification of a service description as well as the name of the host to schedule the check on.

Normal scheduled checks, such as the ones scheduled above, might not actually take place at the time that you scheduled them. Nagios also needs to take allowed time periods into account as well as checking whether checks were disabled for a particular object or globally for the entire Nagios.

There are cases when you’ll need to force Nagios to do a check—in such cases, you should use SCHEDULE_FORCED_HOST_CHECK (visit http://www.nagios. org/developerinfo/externalcommands/commandinfo.php? command_id=128), SCHEDULE_FORCED_HOST_SVC_CHECKS (visit http://www.nagios.org/developerinfo/externalcommands/commandinfo.php? command_id=130) and SCHEDULE_FORCED_SVC_CHECK (visit http://www.nagios.org/developerinfo/ externalcommands/commandinfo.php? command_id=129) commands. They work in exactly the same way as described above, but make Nagios skip the checking of time periods, and ensure that the checks are disabled for this particular object. This way, a check will always be performed, regardless of other Nagios parameters

Other commands worth using are related to custom variables, introduced in Nagios 3. When you define a custom variable for a host, service, or contact, you can change its value on the fly with the external command pipe.

As these variables can then be directly used by check or notification commands and event handlers, it is possible to make other applications or event handlers change these attributes directly without modifications to the configuration files.

A good example would be that the IT staff registers its presence via an application without any GUI. This application periodically sends information about the latest known IP address, and that information is then passed to Nagios assuming that the person is in the office. This would later be sent to a notification command to use that specific IP address while sending a message to the user.

Assuming that the user name is jdoe and the custom variable name is DESKTOPIP, the message that would be sent to the Nagios external command pipe would be as follows:

[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78

This would cause a later use of $_CONTACTDESKTOPIP$ to return a value of 12.34.56.78.

Nagios offers the CHANGE_CUSTOM_CONTACT_VAR (visit http://www.nagios.org/developerinfo/externalcommands/commandinfo.php? command_id=141), CHANGE_CUSTOM_HOST_VAR (visit http://www.nagios.org/developerinfo/ externalcommands/commandinfo.php?command_id=139), and CHANGE_CUSTOM_ SVC_VAR (visit http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=140) commands for modifying custom variables in contacts, hosts and, services accordingly.

The commands explained above are just a very small subset of the full capabilities of the Nagios external command pipe. For a complete list of commands, visit http:// www.nagios.org/developerinfo/externalcommands/commandlist.php, where the External Command List can be seen. External commands are usually sent from event handlers or from the Nagios web interface. You will find external commands most useful when writing event handlers for your system, or when writing an external application that interacts with Nagios.

Event Handlers

Event handlers are commands that are triggered whenever the state of a host or  service changes. They offer functionality similar to notifications. The main difference is that the event handlers are called for each type of change and even for each soft state change. This provides the ability to react to a problem before Nagios notifies it as a hard state and sends out notifications about it. Another difference is what the event handlers should do. Instead of notifying users that there is a problem, event handlers are meant to carry out actions automatically.

For example, if a service defined with max_check_attempts set to 4, the retry_ interval set to 1, and the check_interval is set to 5, then the following example illustrates when event handlers would be triggered, and with what values, for $SERVICESTATE$, $SERVICESTATETYPE$, and $SERVICEATTEMP$ macro definitions:

Learning Nagios 3.0

Event handlers are triggered for each state change—for example, in minutes, 10, 23, 28, and 29. When writing an event handler, it is necessary to check whether an event handler should perform an action at that particular time or not. See the following example for more details.

A typical example might be that your web server process tends to crash once a month. Because this is rare enough, it is very difficult to debug and resolve it. Therefore, the best way to proceed is to restart the server automatically until a solution to the problem is found.

If your configuration has max_check_attempts set to 4, as in the example above, then a good place to try to restart the web server is after the third soft failure check—in the previous example, this would be minute 12.

Assuming that the restart has been successful, the diagram shown above would look like this:

Learning Nagios 3.0

Please note that no hard critical state has occurred since the event handler resolved the problem. If a restart cannot resolve the issue, Nagios will only try it once, as the attempt is done only in the third soft check.

Event handlers are defined as commands, similar to check commands. The main difference is that the event handlers only use macro definitions to pass information to the actual event handling script. This implies that the $ARGn$ macro definitions cannot be used and arguments cannot be passed in the host or service definition by using the ! separator.

In the previous example, we would define the following command:

define command
{
command_name restart-apache2
command_line $USER1$/events/restart_apache2
$SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}

The command would need to be added to the service. For both hosts and services, this requires adding an event_handler directive that specifies the command to be run for each event that is fired. In addition, it is good to set event_handler_enabled set to 1 to make sure that event handlers are enabled for this object.

The following is an example of a service definition:

define service
{
host_name localhost
service_description Webserver
use apache
event_handler restart-apache2
event_handler_enabled 1
}

Finally, a short version of the script is as follows:

#!/bin/sh
# use variables for arguments
SERVICESTATE=$1
SERVICESTATETYPE=$2
SERVICEATTEMPT=$3
# we don't want to restart if current status is OK
if [ "$SERVICESTATE" != "OK" ] ; then
# proceed only if we're in soft transition state
if [ "$SERVICESTATETYPE" == "SOFT" ] ; then
# proceed only if this is 3rd attempt, restart
if [ "$SERVICESTATEATTEMPT" == "3" ] ; then
# restarts Apache as system administrator
sudo /etc/init.d/apache2 restart
fi
fi
fi
exit 0

As we’re using sudo here, obviously the script needs an entry in the sudoers file to allow the nagios user to run the command without a password prompt. An example entry for the sudoers file would be as follows:

nagios ALL=NOPASSWD: /etc/init.d/apache2

This will tell sudo that the command /etc/init.d/apache2 can be run by the user nagios and that asking for passwords before running the command will not be done.

According to our script, the restart is only done after the third check fails. Assuming that the restart went correctly, the next Nagios check will notify that Apache is running again. As this is considered a soft state, Nagios has not yet sent out any notifications about the problem

If the service would not restart correctly, the next check will cause Nagios to set this failure as a hard state. At this point, notifications will be sent out to the object owners.

You can also try performing a restart in the second check. If that did not help, then during the third attempt, the script can forcefully terminate all Apache2 processes using the killall or pkill command. After this has been done, it can try to start the service again. For example:

# proceed only ifthis is 3rd attempt, restart
if [ "$SERVICESTATEATTEMPT" == "2" ] ; then
# restart Apache as system administrator
sudo /etc/init.d/apache2 restart
fi
# proceed only ifthis is 3rd attempt, restart
if [ "$SERVICESTATEATTEMPT" == "3" ] ; then
# try to terminate apache2 process as system administrator
sudo pkill apache2
# starts Apache as system administrator
sudo /etc/init.d/apache2 start
fi

Another common scenario is to restart one service if another one has just recovered—for example, you might want to restart email servers that use a database for authentication if the database has just recovered from a failure state. The reason for doing this is that some applications may not handle disconnected database handles correctly—this can lead to the service working correctly from the Nagios perspective, but not allowing some of the users in due to internal problems.

If you have set this up for hosts or services, it is recommended that you keep flapping enabled for these services. It often happens that due to incorrectly planned scripts and the relations between them, some services might end up being stopped and started again.

In such cases, Nagios will detect these problems and stop running event handlers for these services, which will cause fewer malfunctions to occur. It is also recommended that you keep notifications set up so that people also get information on when flapping starts and stops.

Modifying Notifications

An interesting new feature in Nagios 3 is the ability to change various parameters related to notifications. These parameters are modified via an external command pipe, similar to a few of the commands shown in the previous section.

A good example would be when Nagios contact persons have their workstations connected to the local network only when they are actually at work (which is usually the case if they are using notebooks), and turn their computers off when they leave work. In such a case, a ping check for a person’s computer could trigger an event handler to toggle that person’s attributes.

Let’s assume that our user jdoe has two actual contacts—jdoe-email and jdoejabber, each for different types of notifications. We can set up a host corresponding to the jdoe workstation. We will also set it up to be monitored every five minutes and create an event handler. The handler will change the jdoe-jabber‘s host and service notification time period to none on a hard host down state. On a host up state change, the time period for jdoe-jabber will be set to 24×7. This way, the user will only get Jabber notifications if he or she is at work.

Nagios offers commands to change the time periods during which a user wants to receive notifications. The commands for this purpose are: CHANGE_CONTACT_HOST_ NOTIFICATION_TIMEPERIOD ( http://www.nagios.org/developerinfo/ externalcommands/commandinfo.php? command_id=153 ) and CHANGE_CONTACT_ SVC_NOTIFICATION_TIMEPERIOD ( http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=152 ). Both commands take the contact and the time period name as their arguments.

An event handler script that modifies the user’s contact time period based on state is as follows

#!/bin/sh
NOW='date +%s'
CONTACTNAME=$1-jabber
if [ "$2,$3" = "DOWN,HARD" ] ; then
TP=none
else
TP=24x7
fi
echo "[$NOW] CHANGE_CONTACT_HOST_NOTIFICATION_TIMEPERIOD;
$CONTACT;$TP"
>/var/nagios/rw/nagios.cmd
echo "[$NOW] CHANGE_CONTACT_SVC_NOTIFICATION_TIMEPERIOD;
$CONTACT;$TP"
>/var/nagios/rw/nagios.cmd
exit 0

The command should pass $CONTACTNAME$, $SERVICESTATE$, and $SERVICESTATETYPE$ as parameters to the script.

In case you need a notification about a problem sent again, you should use the SEND_CUSTOM_HOST_NOTIFICATION or SEND_CUSTOM_SVC_NOTIFICATION command. These commands take host or host and service names, additional options, author name, and comments that should be put in the notification. Options allow specifying if the notification should also include all escalation levels (a value of 1), if Nagios should skip time periods for specific users (a value of 2) as well as if Nagios should increment notifications counters (a value of 4). Options are stored bitwise so a value of 7 (1+2+4) would enable all of these options. The notification would be sent to all people including escalations; it will be forced, and the escalation counters will be increased. Option value 3 means it should be broadcast to all escalations as well, and the time periods should be skipped.

To send a custom notification about the main router to all users including escalations, you should send the following command to Nagios

[1206096000] SEND_CUSTOM_HOST_NOTIFICATION;router1;3;jdoe;RESPOND ASAP

LEAVE A REPLY

Please enter your comment!
Please enter your name here