(For more resources related to this topic, see here.)

This article focuses on how to set up templates, groupings, and the naming structure. However, creating a robust monitoring system involves much more.

In this article, we will learn the following:

Setting up and maintaining the configuration files that can grow along with your IT monitoring system
Configuring the dependencies for easier root cause analysis of IT problems
Creating the templates for easier management of similar hosts and services
Using the custom variables for easier customization of objects
What flapping is and how it works

Creating maintainable configurations

Enormous effort is required to deploy, configure, and maintain a system that monitors your company's IT infrastructure. The configuration for several hundred machines can take months. The effort required will also depend upon the scope of hosts and services that should be tracked—the more precise the checks need to be, the more the time needed to set these up.

If your company plans to monitor a wide range of hosts and services, you should consider setting up a machine dedicated to Nagios that will only take care of this single job. Even though a small Nagios installation consumes little resources, as it grows, Nagios will start using more resources. If you set it to run on the same machine as business-critical applications, it can lead to problems. Therefore, it is always best to set up a dedicated Nagios box, even if this is on a slower machine, right from the beginning.

Very often, a good approach is to start with monitoring only critical parts of your network, such as routers and main servers. You can also start off with only making sure that essential services are working—DHCP, DNS, file sharing, and databases are good examples of what is critical. Of course, if your company does not use file servers or if databases are not critical to the production environment, you can skip these. The next step would be to set up parenting and start adopting more hosts. At some point, you will also need to start planning how to group hosts and services. In the beginning, the configuration might simply be definitions of people, hosts, and services. After several iterations of setting up more hosts and services to be monitored, you should get to a point where all of the things that are critical to the company's business are monitored. This should be an indication that the setting up of the Nagios configuration is complete.

As the number of objects grows, you will need to group them. Contacts need to be defined as groups, because if your team consists of more than one to two people, they will likely rotate over time. So, it's better to maintain a group than change the people responsible for each host individually. Hosts and services should be grouped for many reasons. It makes viewing the status and infrastructure topology on the web interface much easier. Also, after you start defining escalations for your objects, it is much easier to manage these using groups.

You should take some time to plan how group hosts and services should be set up. How will you use the groupings? For escalations? For viewing single host groups via the web interface? Learn how you can take advantage of this functionality, and then plan how you will approach the setup of your groups.

If your network has common services, it is better to define them for particular groups and only once—such as the SSH server for all Linux servers and Telnet for all AIX (Advanced Interactive eXecutive) machines, which is an IBM operating system that is mainly used by IBM enterprise-level servers. It is possible to define a service only once, and tell Nagios to which hosts or host groups the service should be bound. By specifying that all Linux servers offer SSH, and all AIX servers offer telnet, it will automatically add such services to all of the machines in these groups. This is often more convenient than specifying services for each of the hosts separately.

In such cases, you should either set up a new host group or use an existing one to keep track of the hosts that offer a particular service. Combined with keeping a list of host groups inside each host definition, this makes things much easier to manage—disabling a particular host also takes care of the corresponding service definitions.

It is also worth mentioning that Nagios performs and schedules service checks in a much better way than it does host checks—the service checks are scheduled in a much better way. That is why it is recommended that you do not schedule host checks at all. You can set up a separate service for your hosts that will send a ping to them and report how many packets have returned and the approximate time taken for them to return.

Nagios can be set up to schedule host checks only if one of the hosts is failing (that is, it is not responding to the pings). A host will be periodically checked until it recovers. In this way, problems with hosts will still be detected, but host checks will only be scheduled on demand. This will cause Nagios to perform much better than it would if regular checks of all hosts on your network are made. To disable regular host checks, simply don't specify the check interval for the hosts that you want checked only on demand.

Configuring the file structure

A very important issue is how to store all our configuration files. We can put every object definition in a single file, but this will not make it easy to manage. It is recommended to store different types of objects in separate folders.

Assuming your Nagios configuration is in /etc/nagios, it is recommended that you create folders for all types of objects in the following manner:

/etc/nagios/commands /etc/nagios/timeperiods /etc/nagios/contacts /etc/nagios/hosts /etc/nagios/services

Of course, these files will need to be added to the nagios.cfg file. After having followed the instructions while installing Nagios 4, these directories should already be added to our main Nagios configuration file.

It would also be worthwhile to use a version control mechanism such as Git (visit http://www.git-scm.com/), Hg (Mercurial, visit http://mercurial.selenic.com/) or SVN (Subversion, visit http://subversion.tigris.org/) to store your Nagios configuration. While this will add overhead to the process of applying configuration changes, it will also prevent someone from overwriting a file accidentally. It will also keep track of who changed which parts of the configuration, so you will always know whom to blame if things break down.

You might consider writing a simple script that will perform an export from the source code repository into a temporary directory, verify that Nagios works fine by using the nagios -v command and only if that did not fail, we will copy the new configuration in place of the older one and restart Nagios. This will make deployment of configuration changes much easier, especially in cases where multiple people are managing it.

As for naming the files themselves—for time periods, contacts, and commands, it is recommended that you keep single definitions per file, as in contacts/nagiosadmin.cfg. This greatly reduces naming collisions and also makes it much easier to find particular object definitions.

Storing hosts and services might be done in a slightly different way—host definitions should go to the hosts subdirectory, and the file should be named the same as the hostname, for example, hosts/localhost.cfg. Services can be split into two different types and stored, depending on how they are defined and used.

Services that are associated with more than one host should be stored in the services subdirectory. A good example is the SSH service, which is present on the majority of systems. In this case, it should go to services/ssh.cfg, and use host groups to associate it with the hosts that actually offer connection over this protocol.

Services that are specific to a host should be handled differently. It's best to store them in the same file as the host definition. A good example might be checking the disk space on partitions that might be specific to a particular machine, such as checking the /oracle partition on a host that's dedicated to Oracle databases.

For handling groups, it is recommended to create files called groups.cfg, and define all groups in it, without any members. Then, while defining a contact, host, or group, you can define to which groups it belongs by using the contactgroups, hostgroups, or servicegroups directives accordingly. This way, if you disable a particular object by deleting or commenting out its definition, the definition of the group itself will still work.

If you plan on having a large number of both check command and notify command definitions, you may want to split this into two separate directories—checkcommands and notifycommands. You can also use a single commands subdirectory, prefix the file names, and store the files in a single directory, for example, commands/check_ssh.cfg and commands/notify_jabber.cfg.