PostgreSQL 9.0 High Performance

Accelerate your PostgreSQL system

Learn the right techniques to obtain optimal PostgreSQL database performance, from initial design to routine maintenance

Discover the techniques used to scale successful database installations

Avoid the common pitfalls that can slow your system down

Filled with advice about what you should be doing; how to build experimental databases to explore performance topics, and then move what you've learned into a production database environment

Covers versions 8.1 through 9.0

iostat

The data vmstat gives is a total across all devices on the system. If you want totals per disk device instead, you need to use iostat for that.

On Linux, iostat defaults to slightly different behavior than vmstat. When it uses "block", it means a 512 byte chunk of data, not the 1024 bytes chunk vmstat uses. You can switch iostat to using kilobytes instead using iostat -k, or you can just divide all the figures by two in order to get them on the same scale. Here's an example of the same data shown both ways:

$ iostat
Device    tps      Blk_read/s   Blk_wrtn/s  Blk_read    Blk_wrtn
sda1        0.07          3.29            0.24           1579784     115560
$ iostat -k
Device    tps    kB_read/s    kB_wrtn/s   kB_read    kB_wrtn
sda1       0.07           1.64             0.12           789892        57780

Since not all UNIX versions will have the kilobyte option available, the examples here all use the default 512 byte blocks, and accordingly halve the block figures to interpret using kilobyte units.

You'll likely find that you need to average iostat data over a slightly longer period of time than vmstat data. A single second of vmstat data is a summary of all the disks on the system. A PostgreSQL database goes through several common phases:

Just after a checkpoint: heavy full-page writes to WAL, fewer writes to database disks because there are fewer dirty buffer evictions.

Between checkpoints: most are an even mix of WAL and database writes.

Checkpoint in progress: Small to moderate WAL writes; increasingly heavy database writes as checkpoint data is written and starts flowing to disk.

Checkpoint sync phase: Minimal WAL writes because fewer full page writes are likely happening; heavy writes to database disks as all data is flushed out of the OS cache.

If you are looking at the vmstat data, or if you don't have the pg_xlog WAL data broken out onto a separate disk, you can't see the balance of the data vs. WAL writes change; you just see a total. But if you're grabbing really short iostat snapshots, you're likely to see writes bounce between the WAL and database disks, with the exact pattern depending on where in the checkpoint cycle you're at. You need to combine a few seconds of data (5 seconds is used for these examples) in order to have both types of writes be usefully averaged out:

$ iostat 5
avg-cpu: %user   %nice   %system   %iowait   %steal   %idle
               42.69    0.00      18.07        6.69       0.30    32.25

Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 0.00 0.00 0.00 0 0
sda1 0.00 0.00 0.00 0 0
sdc 80.80 0.00 1286.40 0 6432
sdc1 80.80 0.00 1286.40 0 6432
sdd 77.80 0.00 1251.20 0 6256
sdd1 77.80 0.00 1251.20 0 6256
sde 69.40 0.00 1086.40 0 5432
sde1 69.40 0.00 1086.40 0 5432
sdf 2348.20 0.00 88262.40 0 441312
sdf1 2348.20 0.00 88262.40 0 441312
md0 311.40 0.00 2491.20 0 12456

Since all of the activity relates to the single partition on these disks, there's a lot of redundant data in here. You should also note that many of the statistics for the software RAID volume used here are not very interesting—you have to look at the underlying physical disk devices instead. If you're using hardware RAID, that particular problem will go away, but you won't have any easy way to get actual disk performance information out of that abstraction layer either; you'll just see the summary for the whole logical RAID device. The following examples eliminate all the redundant lines, and place the md0 array device between its individual components and the device the WAL is on (sdf1), for easier readability.

Examples of good performance

When busy but not overloaded, iostat data for this system looks like the following:

$ iostat 5
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
               18.54    0.00      9.45       23.49     0.15    48.38

Device tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 1068.80 0.00 15740.80 0 78704
sdd1 1041.80 0.00 15459.20 0 77296
sde1 1028.00 0.00 15377.60 0 76888
md0 5969.20 0.00 47753.60 0 238768
sdf1 989.00 0.00 40449.60 0 202248

The %iowait figure of 23% is high enough to know the disks are busy, but not completely saturated yet. This is showing 20 MB/s (40449.6 512-byte blocks per second) being written to the WAL and 24 MB/s to the entire database disk array, the latter of which is evenly split as almost 8 MB/s to each of the three drives.

Linux also features an extended iostat mode. This produces a large number of derived statistics from the underlying data. Since that's too wide to display here, the first example showing all of the data here has been transposed to swap the row for columns and vice-versa:

$ iostat –x 5
                sdc1      sdd1      sde1       md0          sdf1
rrqm/s         0           0            0             0              0
wrqm/s    411.8     404.6    396.2          0        3975.4
r/s               0            0            0            0              0
w/s          438.6     442       444.2     2461.4    1229.8
rsec/s          0            0            0             0              0
wsec/s   6956.8   6966.4   6915.2   19691.2   41643.2
avgrq-sz    15.86     15.76     15.57         8            33.86
avgqu-sz    67.36     67.09     62.93         0              0.65
await        158.18   158.85   148.39        0               0.55
svctm            1.2         1.2        1.19        0               0.51
%util           52.8        52.88    53.04       0             63.04

All of the values here with a "q" in them (most of what's listed on the following bulleted line) represent figures related to the read or write queues on these devices. Since the queue size doesn't correspond with any real-world figure you can benchmark the device against, it's hard to do anything with that data. The number of read and write requests is similarly useless in a database context. The following fields of iostat -x data are therefore not that useful here:

rrqm/s, wrqm/s, r/s, w/s, avgrq-sz, avgqu-sz

Solaris has a similar extended mode available using iostat -xc

This next example is similar to the iostat one given previously:

$ iostat –x 5
avg-cpu:    %user    %nice   %system   %iowait   %steal   %idle
                 21.51      0.00       11.08       23.75      0.10    43.56

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 0.00 6956.80 15.86 67.36 158.18 1.20 52.80
sdd1 0.00 6966.40 15.76 67.09 158.85 1.20 52.88
sde1 0.00 6915.20 15.57 62.93 148.39 1.19 53.04
md0 0.00 19691.20 8.00 0.00 0.00 0.00 0.00
sdf 0.00 41643.20 33.86 0.65 0.55 0.51 63.04

That's 21 MB/s written to the WAL and 20 MB/s to the database disks, about 7 MB/s to each one. However, recall that the total disk read or write throughput available depends heavily on how random the workload is, which is normally a hard thing to estimate. The %util figure, which is by far the most valuable of the derived figures shown here, gives you a rough idea of that by noting how congested the device is to achieve that throughput. In this next example, there's minimal database I/O and heavy WAL I/O, typical of the period just after a checkpoint:

$ iostat –x 5
avg-cpu:   %user   %nice   %system   %iowait   %steal   %idle
                49.35     0.00      22.00        3.80       0.25     24.60

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 0.00 2649.10 15.01 0.76 4.31 0.06 1.04
sdd1 0.00 2895.01 14.95 0.90 4.64 0.06 1.12
sde1 0.00 2728.94 15.06 0.82 4.51 0.06 1.04
md0 0.00 8273.05 8.00 0.00 0.00 0.00 0.00
sdf1 0.00 103760.48 38.11 0.23 0.09 0.09 23.47

This is happily getting >50 MB/s out of the WAL volume but it's still only busy 23.5% of the time. This suggests writes to it are being cached by the disk controller and written quite efficiently. One of the reasons to break out the WAL onto its own disk is because it makes it so easy to monitor this balance between WAL and database writes, and to determine if the WAL volume (which only gets sequential writes normally) is keeping up. Since there are techniques to accelerate the WAL writes at the expense of something else, such as switching to an unjournaled filesystem, the %util figure can help you determine when the WAL is the system bottleneck and therefore necessary to accelerate that way.

A final example of good performance involves the database disks. There are some operations in PostgreSQL that can bypass writing to the WAL. For example, if you start a transaction that creates a new table and does a COPY into it, as long as you don't have PITR archiving turned on that data is not put through the WAL before being written to disk. The idea is that if the server crashes, the whole transaction will be rolled back anyway, which includes deleting the table data; therefore, whether it's consistent or not at the block level doesn't matter.

Here is what the database disks are capable of when running such a COPY, which essentially turns into sequential write I/O directly to the database:

$ iostat –x 5
avg-cpu:   %user   %nice   %system   %iowait   %steal   %idle
                16.39     0.00        6.85       12.84       0.00    63.92

Device rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdc1 25.60 58710.40 249.09 27.22 115.43 1.19 28.08
sdd1 24.00 58716.80 249.11 27.76 117.71 1.20 28.24
sde1 1.60 58667.20 250.51 28.31 120.87 1.14 26.80
md0 51.20 176094.40 8.00 0.00 0.00 0.00 0.00
sdf1 0.00 0.00 0.00 0.00 0.00 0.00 0.00

This is over 29 MB/s being written to each database disk, for a total of 88 MB/s to the RAID 0 array, and even that isn't fully utilizing the disks, as shown by the %util at about 28%. Given that this is a four-core server and the COPY is the only process running, a %user of 16 means that about 64% of a single CPU is busy here. The CPU and disks are likely waiting for each other a bit in this situation, and you might have to improve both to significantly speed this up. This example is from a server with a battery-backed RAID controller; without one, it's much easier to run into one of the disk bottlenecks here before the CPU ones.

Final iostat hint: on some versions you can switch the output to use megabytes/second as its units, which is often the easiest to read. The following syntax for example, usually makes for a good summary on Linux systems:

$ iostat -x -m 5