Troubleshooting Lotus Notes/Domino 7 applications

Introduction

The major topics that we'll cover in this article are:

Testing your application (in other words, uncovering problems before your users do it for you).
Asking the right questions when users do discover problems.
Using logging to help troubleshoot your problems.

We'll also examine two important new Notes/Domino 7 features that can be critical for troubleshooting applications:

Domino Domain Monitoring (DDM)
Agent Profiler

troubleshooting-lotus-notesdomino-7-applications-img-0

For more troubleshooting issues visit: TroubleshootingWiki.org

Testing your Application

Testing an application before you roll it out to your users may sound like an obvious thing to do. However, during the life cycle of a project, testing is often not allocated adequate time or money.

Proper testing should include the following:

A meaningful amount of developer testing and bug fixing: This allows you to catch most errors, which saves time and frustration for your user community.

User representative testing: A user representative, who is knowledgeable about the application and how users use it, can often provide more robust testing than the developer. This also provides early feedback on features.

Pilot testing: In this phase, the product is assumed to be complete, and a pilot group uses it in production mode. This allows for limited stress testing as well as more thorough testing of the feature set.

In addition to feature testing, you should test the performance of the application. This is the most frequently skipped type of testing, because some consider it too complex and difficult. In fact, it can be difficult to test user load, but in general, it's not difficult to test data load. So, as part of any significant project, it is a good practice to programmatically create the projected number of documents that will exist within the application, one or two years after it has been fully deployed, and have a scheduled agent trigger the appropriate number of edits-per-hour during the early phases of feature testing.

Although this will not give a perfect picture of performance, it will certainly help ascertain whether and why the time to create a new document is unacceptable (for example, because the @Db formulas are taking too long, or because the scheduled agent that runs every 15 minutes takes too long due to slow document searches).

Asking the Right Questions

Suppose that you've rolled out your application and people are using it. Then the support desk starts getting calls about a certain problem. Maybe your boss is getting an earful at meetings about sluggish performance or is hearing gripes about error messages whenever users try to click a button to perform some action.

In this section, we will discuss a methodology to help you troubleshoot a problem when you don't necessarily have all the information at your disposal. We will include some specific questions that can be asked verbatim for virtually any application.

The first key to success in troubleshooting an application problem is to narrow down where and when it happens. Let's take these two very different problems suggested above (slow performance and error messages), and pose questions that might help unravel them:

Does the problem occur when you take a specific action? If so, what is that action? Your users might say, "It's slow whenever I open the application", or "I get an error when I click this particular button in this particular form".

Does the problem occur for everyone who does this, or just for certain people? If just certain people, what do they have in common? This is a great way to get your users to help you help them. Let them be a part of the solution, not just "messengers of doom". For example, you might ask questions such as, "Is it slow only for people in your building or your floor? Is it slow only for people accessing the application remotely? Is it slow only for people who have your particular access (for example, SalesRep)?"

Does this problem occur all the time, at random times, or only at certain times? It's helpful to check whether or not the time of day or the day of week/month is relevant. So typical questions might be similar to the following: "Do you get this error every time you click the button or just sometimes? If just sometimes, does it give you the error during the middle of the day, but not if you click it at 7 AM when you first arrive? Do you only get the error on Mondays or some other day of the week? Do you only see the error if the document is in a certain status or has certain data in it? If it just happens for a particular document, please send me a link to that document so that I can inspect it carefully to see if there is invalid or unexpected data."

Logging

Ideally, your questions have narrowed down the type of problem it could be. So at this point, the more technical troubleshooting can start. You will likely need to gather concrete information to confirm or refine what you're hearing from the users. For example, you could put a bit of debugging code into the button that they're clicking so that it gives more informative errors, or sends you an email (or creates a log document) whenever it's clicked or whenever an error occurs. Collecting the following pieces of information might be enough to diagnose the problem very quickly:

Time/date

User name

Document UNID (if the button is pushed in a document)

Error

Status or any other likely field that might affect your code

By looking for common denominators (such as the status of the documents in question, or access or roles of the users), you will likely be able to further narrow down the possibilities of why the problem is happening. This doesn't solve your problem of course, but it helps in advancing you a long way towards that goal.

A trickier problem to troubleshoot might be one we mentioned earlier: slow performance. Typically, after you've determined that there is some kind of performance delay, it's a good idea to first collect some server logging data. Set the following Notes.ini variables in the Server Configuration document in your Domino Directory, on the Notes.ini tab:

Log_Update=1
Log_AgentManager=1

These variables instruct the server to write output to the log.nsf database in the Miscellaneous Events view. Note that they may already be set in your environment. If not, they're fairly unobtrusive, and shouldn't trouble your administration group. Set them for a 24-hour period during a normal business week, and then examine the results to see if anything pops out as being suspicious.

For view indexing, you should look for lines like these in the Miscellaneous Events (Log_Update=1):

07/01/2006 09:29:57 AM Updating views in appsSalesPipeline.nsf
07/01/2006 09:30:17 AM Finished updating views in
                                         appsSalesPipeline.nsf
07/01/2006 09:30:17 AM Updating views in appsTracking.nsf
07/01/2006 09:30:17 AM Finished updating views in appsTracking.nsf
07/01/2006 09:30:17 AM Updating views in appsZooSchedule.nsf
07/01/2006 09:30:18 AM Finished updating views in
                                         appsZooSchedule.nsf

And lines like these for Agent execution (Log_AgentManager=1):

06/30/2006 09:43:49 PM AMgr: Start executing agent 'UpdateTickets'
                     in 'appsSalesPipeline.nsf ' by Executive '1'
06/30/2006 09:43:52 PM AMgr: Start executing agent 'ZooUpdate' in
                          'appsZooSchedule.nsf ' by Executive '2'
06/30/2006 09:44:44 PM AMgr: Start executing agent 'DirSynch' in
                             'appsTracking.nsf ' by Executive '1'

Let's examine these lines to see whether or not there is anything we can glean from them. Starting with the Log_Update=1 setting, we see that it gives us the start and stop times for every database that gets indexed. We also see that the database file paths appear alphabetically. This means that, if we search for the text string updating views and pull out all these lines covering (for instance) an hour during a busy part of the day, and copy/paste these lines into a text editor so that they're all together, then we should see complete database indexing from A to Z on your server repeating every so often.

In the log.nsf database, there may be many thousands of lines that have nothing to do with your investigation, so culling the important lines is imperative for you to be able to make any sense of what's going on in your environment.

You will likely see dozens or even hundreds of databases referenced. If you have hundreds of active databases on your server, then culling all these lines might be impractical, even programmatically. Instead, you might focus on the largest group of databases. You will notice that the same databases are referenced every so often. This is the Update Cycle, or view indexing cycle. It's important to get a sense of how long this cycle takes, so make sure you don't miss any references to your group of databases.

Imagine that SalesPipeline.nsf and Tracking.nsf were the two databases that you wanted to focus on. You might cull the lines out of the log that have updating views and which reference these two databases, and come up with something like the following:

07/01/2006 09:29:57 AM Updating views in appsSalesPipeline.nsf
07/01/2006 09:30:17 AM Finished updating views in
                                         appsSalesPipeline.nsf
07/01/2006 09:30:17 AM Updating views in appsTracking.nsf
07/01/2006 09:30:20 AM Finished updating views in appsTracking.nsf
07/01/2006 10:15:55 AM Updating views in appsSalesPipeline.nsf
07/01/2006 10:16:33 AM Finished updating views in
                                         appsSalesPipeline.nsf
07/01/2006 10:16:33 AM Updating views in appsTracking.nsf
07/01/2006 10:16:43 AM Finished updating views in appsTracking.nsf
07/01/2006 11:22:31 AM Updating views in appsSalesPipeline.nsf
07/01/2006 11:23:33 AM Finished updating views in
                                         appsSalesPipeline.nsf
07/01/2006 11:23:33 AM Updating views in appsTracking.nsf
07/01/2006 11:23:44 AM Finished updating views in appsTracking.nsf

This gives us some very important information: the Update task (view indexing) is taking approximately an hour to cycle through the databases on the server; that's too long. The Update task is supposed to run every 15 minutes, and ideally should only run for a few minutes each time it executes. If the cycle is an hour, then that means update is running full tilt for that hour, and as soon as it stops, it realizes that it's overdue and kicks off again.

It's possible that if you examine each line in the log, you'll find that certain databases are taking the bulk of the time, in which case it might be worth examining the design of those databases. But it might be that every database seems to take a long time, which might be more indicative of a general server slowdown.

In any case, we haven't solved the problem; but at least we know that the problem is probably server-wide. More complex applications, and newer applications, tend to reflect server‑performance problems more readily, but that doesn't necessarily mean they carry more responsibility for the problem. In a sense, they are the "canary in the coal mine".

If you suspect the problem is confined to one database (or a few), then you can increase the logging detail by setting Log_Update=2. This will give you the start time for every view in every database that the Update task indexes. If you see particular views taking a long time, then you can examine the design of those views.

If no database(s) stand out, then you might want to see if the constant indexing occurs around the clock or just during business hours. If it's around the clock, then this might point to some large quantities of data that are changing in your databases. For example, you may be programmatically synchronizing many gigabytes of data throughout the day, not realizing the cost this brings in terms of indexing.

If slow indexing only occurs during business hours, then perhaps the user/data load has not been planned out well for this server. As the community of users ramps up in the morning, the server starts falling behind and never catches up until evening. There are server statistics that can help you determine whether or not this is the case. (These server statistics go beyond the scope of this book, but you can begin your investigation by searching on the various Notes/Domino forums for "server AND performance AND statistics".)

As may be obvious at this point, troubleshooting can be quite time-consuming. The key is to make sure that you think through each step so that it either eliminates something important, or gives you a forward path. Otherwise, you can find yourself still gathering information weeks and months later, with users and management feeling very frustrated.

Before moving on from this section, let's take a quick look at agent logging. Agent Manager can run multiple agents in different databases, as determined by settings in your server document. Typically, production servers only allow two or three concurrent agents to run during business hours, and these are marked in the log as Executive '1', Executive '2', and so on.

If your server is often busy with agent execution, then you can track Executive '1' and see how many different agents it runs, and for how long. If there are big gaps between when one agent starts and when the next one does (for Executive '1'), this might raise suspicion that the first agent took that whole time to execute. To verify this, turn up the logging by setting the Notes.ini variable debug_amgr=*. (This will output a fair amount of information into your log, so it's best not to leave it on for too long, but normally one day is not a problem.) Doing this will give you a very important piece of information: the number of "ticks" it took for the agent to run. One second equals 100 ticks, so if the agent takes 246,379 ticks, this equals 2,463 seconds (about 41 minutes).

As a general rule, you want scheduled agents to run in seconds, not minutes; so any agent that is taking this long will require some examination. In the next section, we will talk about some other ways you can identify problematic agents.

Domino Domain Monitoring (DDM)

Every once in a while, a killer feature is introduced—a feature so good, so important, so helpful, that after using it, we just shake our heads and wonder how we ever managed without it for so long. Domino Domain Monitor (DDM) is just such a feature.

DDM is too large to be completely covered in this one section, so we will confine our overview to what it can do in terms of troubleshooting applications. For a more thorough explanation of DDM and all its features, see the book, Upgrading to Lotus Notes and Domino (www.packtpub.com/upgrading_lotus/book).

In the events4.nsf database, you will find a new group of documents you can create for tracking agent or application performance. On Domino 7 servers, a new database is created automatically with the filename ddm.nsf. This stores the DDM output you will examine.

For application troubleshooting, some of the most helpful areas to track using DDM are the following:

Full-text index needs to be built. If you have agents that are creating a full‑text index on the fly because the database has no full‑text index built, DDM can track that potential problem for you. Especially useful is the fact that DDM compiles the frequency per database, so (for instance) you can see if it happens once per month or once per hour. Creating full‑text indexes on the fly can result in a significant demand on server resources, so having this notification is very useful. We discuss an example of this later in this section.

Agent security warnings. You can manually examine the log to try to find errors about agents not being able to execute due to insufficient access. However, DDM will do this for you, making it much easier to find (and therefore fix) such problems.

Resource utilization. You can track memory, CPU, and time utilization of your agents as run by Agent Manager or by the HTTP task. This means that at any time you can open the ddm.nsf database and spot the worst offenders in these categories, over your entire server/domain. We will discuss an example of CPU usage later in this section.

The following illustration shows the new set of DDM views in the events4.nsf (Monitoring configuration) database:

The following screenshot displays the By Probe Server view after we've made a few document edits:

Notice that there are many probes included out-of-the-box (identified by the property "author = Lotus Notes Template Development") but set to disabled. In this view, there are three that have been enabled (ones with checkmarks) and were created by one of the authors of this book.

If you edit the probe document highlighted above, Default Application Code/Agents Evaluated By CPU Usage (Agent Manager), the document consists of three sections.

The first section is where you choose the type of probe (in this case Application Code) and the subtype (in this case Agents Evaluated By CPU Usage).

The second section allows you to choose the servers to run against, and whether you want this probe to run against agents/code executed by Agent Manager or by the HTTP task (as shown in the following screenshot).

This is an important distinction. For one thing, they are different tasks, and therefore one can hit a limit while the other still has room to "breathe". But perhaps more significantly, if you choose a subtype of Agents Evaluated By Memory Usage, then the algorithms used to evaluate whether or not an agent is using too much memory are very different. Agents run by the HTTP task will be judged much more harshly than those run by the Agent Manager task. This is because with the HTTP task, it is possible to run the same agent with up to hundreds of thousands of concurrent executions. But with Agent Manager, you are effectively limited to ten concurrent instances, and none within the same database.

The third section allows you to set your threshold for when DDM should report the activity:

You can select up to four levels of warning: Fatal, Failure, Warning (High), and Warning (Low). Note that you do not have the ability to change the severity labels (which appear as icons in the view). Unless you change the database design of ddm.nsf, the icons displayed in the view and documents are non-configurable.

Experiment with these settings until you find the approach that is most useful for your corporation. Typically, customers start by overwhelming themselves with information, and then fine-tuning the probes so that much less information is reported. In this example, only two statuses are enabled: one for six seconds, with a label of Warning (High), and one for 60 seconds, with a label of Failure.

Here is a screenshot of the DDM database:

Notice that there are two Application Code results, one with a status of Failure (because that agent ran for more than 60 seconds), and one with a status of Warning (High) (because that agent ran for more than six seconds but less than 60 seconds). These are the parameters set in the Probe document shown previously, which can easily be changed by editing that Probe document. If you want these labels to be different, you must enable different rows in the Probe document.

If you open one of these documents, there are three sections. The top section gives header information about this event, such as the server name, the database and agent name, and so on. The second section includes the following table, with a tab for the most recent infraction and a tab for previous infractions. This allows you to see how often the problem is occurring, and with what severity.

The third section provides some possible solutions, and (if applicable) automation. For example, in our example, you might want to "profile" your agent. (We will profile one of our agents in the final section of this article.)

DDM can capture full-text operations against a database that is not full‑text indexed. It tracks the number of times this happens, so you can decide whether to full‑text index the database, change the agent, or neither.

For a more complete list of the errors and problems that DDM can help resolve, check the Domino 7 online help or the product documentation (www.lotus.com).

Agent Profiler

If any of the troubleshooting tips or techniques we've discussed in this article causes you to look at an agent and think, "I wonder what makes this agent so slow", then the Agent Profiler should be the next tool to consider. Agent Profiler is another new feature introduced in Notes/Domino 7. It gives you a breakdown of many methods/properties in your LotusScript agent, telling you how often each one was executed and how long they took to execute.

In Notes/Domino 7, the second (security) tab of Agent properties now includes a checkbox labeled Profile this agent. You can select this option if you want an agent to be profiled.

The next time the agent runs, a profile document in the database is created and filled with the information from that execution. This document is then updated every time the agent runs. You can view these results from the Agent View by highlighting your agent and selecting Agent | View Profile Results.

The following is a profile for an agent that performed slow mail searches:

Although this doesn't completely measure (and certainly does not completely troubleshoot) your agents, it is an important step forward in troubleshooting code. Imagine the alternative: dozens of print statements, and then hours of collating results!

Summary

In closing, we hope that this article has opened your eyes to new possibilities in troubleshooting, both in terms of techniques and new Notes/Domino 7 features. Every environment has applications that users wish ran faster, but with a bit of care, you can troubleshoot your performance problems and find resolutions.

After you have your servers running Notes/Domino 7, you can use DDM and Agent Profiler (both exceptionally easy to use) to help nail down poorly performing code in your applications. These tools really open a window on what had previously been a room full of mysterious behavior. Full-text indexing on the fly, code that uses too much memory, and long running agents are all quickly identified by Domino Domain Monitoring (DDM). Try it!