
If you are running a PostgreSQL setup, there are basically two major methods to perform backups:

  • Logical dumps (extract an SQL script representing your data)
  • Transaction log shipping

The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true.

Many people rely on pg_dump to simply extract a textual representation of the data. Interestingly, pg_dump is also the oldest method of creating a backup and has been around since the very early days of the PostgreSQL project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.

This article is an excerpt taken from the book Mastering PostgreSQL 11 – Second Edition by Hans-Jürgen Schönig. In this book, you will learn how to get to grips with advanced PostgreSQL 11 features and SQL functions, master replication and failover techniques, configure database security, and more.

In this article, you will learn the process of partially dumping data, restoring backups, saving global data and much more.

Running pg_dump

The first thing we want to do is create a simple textual dump:

[hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql

This is the most simplistic backup you can imagine. Basically, pg_dump logs in to the local database instance, connects to the test database, and starts to extract all the data, which is then sent to stdout and redirected to the file. The beauty here is that standard output gives you all the flexibility of a Unix system: you can easily compress the data using a pipe or do whatever else you want with it.
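
As a quick illustration of that flexibility, the dump can be compressed on the fly by piping it through gzip; the output file name below is only an example:

[hs@linuxpc ~]$ pg_dump test | gzip > /tmp/dump.sql.gz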

In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag as follows:

[hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql

The following set of parameters can be found in all PostgreSQL client programs:

...
Connection options:
  -d, --dbname=DBNAME      database to dump
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump
...
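
As a quick, hedged sketch of how these options combine, a dump of a remote database might look as follows; the hostname here is purely a placeholder:

[hs@linuxpc ~]$ pg_dump -h db.example.com -p 5432 -U whatever_powerful_user test > /tmp/dump.sql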

You can just pass the information you want to pg_dump, and if you have sufficient permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent: no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won’t make it into the backup.

A dump simply reads everything; therefore, there is no separate permission required to dump something. As long as you can read the data, you can back it up.

Passing passwords and connection information

If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the password itself to pg_dump using a command-line option.

The reason for this is simply that the password might show up in the process table and be visible to other people. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it?

There are various means of doing this. Some of them are as follows:

  • Making use of environment variables
  • Making use of .pgpass
  • Using service files

In this section, we will learn about all three methods.
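
As a hedged sketch of what the three approaches look like in practice (the password, file contents, and service name below are purely placeholders):

# Option 1: environment variable, picked up by all libpq-based clients
export PGPASSWORD=some_secret_password
pg_dump -h localhost -U whatever_powerful_user test > /tmp/dump.sql

# Option 2: a ~/.pgpass file; it must not be readable by other users
# format: hostname:port:database:username:password
echo "localhost:5432:test:whatever_powerful_user:some_secret_password" >> ~/.pgpass
chmod 600 ~/.pgpass

# Option 3: a connection service file (~/.pg_service.conf) containing a named
# section such as [mybackup] with host, port, dbname, and user entries
pg_dump "service=mybackup" > /tmp/dump.sql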

Extracting subsets of data

Up until now, we have seen how to dump an entire database. However, this is not always what we want. In many cases, we just want to extract a subset of tables or schemas. Fortunately, pg_dump provides a number of switches for exactly that:

  • -a: Dumps only the data and skips the data structure
  • -s: Dumps only the data structure and skips the data
  • -n: Dumps only certain schemas
  • -N: Dumps everything but excludes certain schemas
  • -t: Dumps only certain tables
  • -T: Dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on)

Partial dumps can be very useful for speeding things up considerably.
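
Here are a few hedged examples of such partial dumps; the table and schema names are purely illustrative:

# dump only the data structure (no data)
pg_dump -s test > /tmp/structure.sql

# dump a single table
pg_dump -t t_test test > /tmp/t_test.sql

# dump everything in the public schema except logging tables
pg_dump -n public -T 'log_*' test > /tmp/partial.sql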

Handling various formats

So far, we have seen that pg_dump can be used to create text files. The problem here is that a text file can only be replayed completely. If we have saved an entire database, we can only replay the entire thing. In most cases, this is not what we want. Therefore, PostgreSQL has additional formats that offer more functionality.

At this point, four formats are supported:

-F, --format=c|d|t|p     output file format (custom, directory, tar, plain text (default))

We have already seen plain, which is just normal text. On top of that, we can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:

[hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
[hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc

In addition to the table of contents, the compressed dump has one more advantage. It is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this is highly dependent on the number of indexes, but for many database applications, this rough estimation will hold true.

Once the backup is created, we can inspect the backup file:

[hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
;
; Archive created at 2018-11-04 15:44:56 CET
;   dbname: test
;   TOC Entries: 18
;   Compression: -1
;   Dump Version: 1.12-0
;   Format: CUSTOM
;   Integer: 4 bytes
;   Offset: 8 bytes
;   Dumped from database version: 11.0
;   Dumped by pg_dump version: 11.0
;
; Selected TOC Entries:
;
3103; 1262 16384 DATABASE - test hs
3; 2615 2200 SCHEMA - public hs
3104; 0 0 COMMENT - SCHEMA public hs
1; 3079 13350 EXTENSION - plpgsql
3105; 0 0 COMMENT - EXTENSION plpgsql
187; 1259 16391 TABLE public t_test hs
...

Note that pg_restore --list will return the table of contents of the backup.

Using a custom format is a good idea as the backup will shrink in size. However, there’s more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:

[hs@linuxpc ~]$ mkdir /tmp/backup
[hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
[hs@linuxpc ~]$ cd /tmp/backup/
[hs@linuxpc backup]$ ls -lh
total 86M
-rw-rw-r--. 1 hs hs  85M Jan  4 15:54 3095.dat.gz
-rw-rw-r--. 1 hs hs  107 Jan  4 15:54 3096.dat.gz
-rw-rw-r--. 1 hs hs 740K Jan  4 15:54 3097.dat.gz
-rw-rw-r--. 1 hs hs   39 Jan  4 15:54 3098.dat.gz
-rw-rw-r--. 1 hs hs 4.3K Jan  4 15:54 toc.dat

One advantage of the directory format is that we can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how we can tell pg_dump to use four cores (jobs):

[hs@linuxpc backup]$ rm -rf *
[hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4

The more objects there are in our database, the greater the chance of a potential speedup.

Replaying backups

Having a backup is pointless unless you have tried to actually replay it. Fortunately, this is easy to do. If you have created a plain text backup, simply take the SQL file and execute it. The following example shows how that can be done:

psql your_db < your_file.sql

A plain text backup is simply a text file containing everything. We can always simply replay a text file.
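
If the plain text dump was compressed on the fly, as sketched earlier, it can be decompressed on the fly as well; the file name below matches that earlier example:

gunzip -c /tmp/dump.sql.gz | psql your_db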

If you have decided on a custom format or directory format, you can use pg_restore to replay the backup. Additionally, pg_restore allows you to do all kinds of fancy things such as replaying just part of a database and so on. In most cases, however, you will simply replay the entire database. In this example, we will create an empty database and just replay a custom format dump:

[hs@linuxpc backup]$ createdb new_db
[hs@linuxpc backup]$ pg_restore -d new_db -j 4 /tmp/dump.fc

Note that pg_restore will add data to an existing database. If your target database is not empty, pg_restore might report errors but will keep going.

Again, -j is used to spawn more than one process. In this example, four cores are used to replay the data; however, this only helps when more than one table is being replayed.

If you are using a directory format, you can simply pass the name of the directory instead of the file.
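
As a hedged sketch of both variants (the paths and the table name are simply the ones used in the earlier examples):

# replay a directory format dump
[hs@linuxpc backup]$ pg_restore -d new_db -j 4 /tmp/backup/

# replay only a single table from the custom format dump
[hs@linuxpc backup]$ pg_restore -t t_test -d new_db /tmp/dump.fc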

As far as performance is concerned, dumps are a good solution if you are working with small or medium amounts of data. There are two major downsides:

  • We will only get a snapshot, so any changes made after the dump was taken will be lost
  • Rebuilding a dump from scratch is comparatively slow compared to binary copies because all of the indexes have to be rebuilt

Handling global data

In the previous sections, we learned about pg_dump and pg_restore, which are two vital programs when it comes to creating backups. The thing is, pg_dump creates database dumps—it works on the database level. If we want to back up an entire instance, we have to make use of pg_dumpall or dump all of the databases separately. Before we dig into that, it makes sense to see how pg_dumpall works:

pg_dumpall > /tmp/all.sql

As you can see, pg_dumpall will connect to one database after the other and send everything to standard output, where you can process it with your usual Unix tools. Note that pg_dumpall can be used just like pg_dump. However, it has some downsides: it does not support the custom or directory format, and therefore does not offer multicore support. This means that we will be stuck with one thread.

However, there is more to pg_dumpall. Keep in mind that users live on the instance level. If you create a normal database dump, you will get all of the permissions, but you won’t get all of the CREATE USER statements. Those globals are not included in a normal dump—they will only be extracted by pg_dumpall.

If we only want the globals, we can run pg_dumpall using the -g option:

pg_dumpall -g > /tmp/globals.sql

In most cases, you might want to run pg_dumpall -g along with a custom or directory format dump of each database to back up an entire instance. A simple backup script might look like this:

#!/bin/sh

BACKUP_DIR=/tmp/

# dump the globals (roles and so on) first
pg_dumpall -g > $BACKUP_DIR/globals.sql

# then dump every database in the custom format
for x in $(psql -A -t -c "SELECT datname FROM pg_database
        WHERE datname NOT IN ('postgres', 'template0', 'template1')" postgres)
do
    pg_dump -Fc $x > $BACKUP_DIR/$x.fc
done

It will first dump the globals and then loop through the list of databases to extract them one by one in a custom format.
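
To bring such a backup back, one possible approach (a sketch, not from the excerpt) is to replay the globals first and then restore each custom format dump; pg_restore's -C flag recreates each database before restoring into it. The file names below follow the script above:

psql -f /tmp/globals.sql postgres
pg_restore -C -d postgres /tmp/test.fc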

To summarize, in this article, we learned about creating backups and dumps in general. To know more about streaming replication and binary backups, check out our book Mastering PostgreSQL 11 – Second Edition.
