Jonathan Katz: PostgreSQL Monitoring for App Developers: Alerts & Troubleshooting from Planet PostgreSQL

We've seen an example of how to set up PostgreSQL monitoring in Kubernetes. We've looked at two sets of statistics to keep track of in your PostgreSQL cluster: your vitals (CPU/memory/disk/network) and your DBA fundamentals.

While starting at these charts should help you to anticipate, diagnose, and respond to issues with your Postgres cluster, the odds are that you are not staring at your monitor 24 hours a day. This is where alerts come in: a properly set up alerting system will let you know if you are on the verge of a major issue so you can head it off at the pass (and alerts should also let you know that there is a major issue).

Dealing with operational production issues was a departure from my application developer roots, but I looked at it as an opportunity to learn a new set of troubleshooting skills. It also offered an opportunity to improve communication skills: I would often convey to the team and customers what transpired during a downtime or performance degradation situation (VSSE: be transparent!). Some of what I observed I used to help us to improve to application, while other parts helped me to better understand how PostgreSQL works.

But I digress: let's drill into alerts on your Postgres database.

Note that just because an alert or alarm is going off, it does not mean you need to immediately react: for example, a transient network degradation issue may cause a replica to lag further behind a primary for a bit too long but will clear up when the degradation passes. That said, you typically want to investigate the alert to understand what is causing it.

Additionally, it's important to understand what actions you want to take to solve the problem. For example, a common mistake during an "out-of-disk" error is to delete the PostgreSQL WAL logs with a rm command; doing so can lead to a very bad day (and is also an advertisement for ensuring you have backups).

As mentioned in the post on setting up PostgreSQL monitoring in Kubernetes, the Postgres Operator uses pgMonitor for metric collection and visualization via open source projects like Prometheus and Grafana. pgMonitor uses open source Alertmanager for configuring and sending alerts, and is what the PostgreSQL Operator uses.

Using the above, let's dive into some of the items that you should be alerting on, and I will describe how my experience as an app developer translated into troubleshooting strategies.