10 min read
If you are running a PostgreSQL setup, there are basically two major methods to perform backups:
- Logical dumps (extract an SQL script representing your data)
- Transaction log shipping
The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true.
Many people rely on pg_dump to simply extract a textual representation of the data. Interestingly, pg_dump is also the oldest method of creating a backup and has been around since the very early days of the PostgreSQL project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.
This article is an excerpt taken from the book Mastering PostgreSQL 11 – Second Edition by Hans-Jürgen Schönig. In this book, you will learn the approach to get to grips with advanced PostgreSQL 11 features and SQL functions, master replication and failover techniques, configure database security and more.
In this article, you will learn the process of partially dumping data, restoring backups, saving global data and much more.
The first thing we want to do is create a simple textual dump:
[[email protected] ~]$ pg_dump test > /tmp/dump.sql
This is the most simplistic backup you can imagine. Basically, pg_dump logs in to the local database instance, connects to a database test, and starts to extract all the data, which will then be sent to stdout and redirected to the file. The beauty, here, is that the standard output gives you all the flexibility of a Unix system. You can easily compress the data using a pipe or do whatever you want to do with it.
In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag as follows:
[[email protected] ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql
The following set of parameters can be found in all PostgreSQL client programs:
... Connection options: -d, --dbname=DBNAME database to dump -h, --host=HOSTNAME database server host or socket directory -p, --port=PORT database server port number -U, --username=NAME connect as specified database user -w, --no-password never prompt for password -W, --password force password prompt (should happen automatically) --role=ROLENAME do SET ROLE before dump ...
You can just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, a repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transactions. In other words, a dump is always consistent—no foreign keys will be violated. The output is a snapshot of data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won’t make it to the backup anymore.
Passing passwords and connection information
If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option.
The reason for this is simply because the password might show up in the process table and be visible to other people. The question now is, if pg_hba.conf, which is on the server, enforces a password, how can the client program provide it?
There are various means of doing this. Some of them are as follows:
- Making use of environment variables
- Making use of .pgpass
- Using service files
In this section, we will learn about all three methods.
Extracting subsets of data
Up until now, we have seen how to dump an entire database. However, this is not what we might wish for. In many cases, we just want to extract a subset of tables or schemas. Fortunately, pg_dump can help us do that while also providing a number of switches:
- -a: It only dumps the data and does not dump the data structure
- -s: It dumps the data structure but skips the data
- -n: It only dumps a certain schema
- -N: It dumps everything but excludes certain schemas
- -t: It only dumps certain tables
- -T: It dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on)
Partial dumps can be very useful in order to speed things up considerably.
Handling various formats
So far, we have seen that pg_dump can be used to create text files. The problem here is that a text file can only be replayed completely. If we have saved an entire database, we can only replay the entire thing. In most cases, this is not what we want. Therefore, PostgreSQL has additional formats that offer more functionality.
At this point, four formats are supported:
-F, --format=c|d|t|p output file format (custom, directory, tar, plain text (default))
We have already seen plain, which is just normal text. On top of that, we can use a custom format. The idea behind a custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:
[[email protected] ~]$ pg_dump -Fc test > /tmp/dump.fc [[email protected] ~]$ pg_dump -Fc test -f /tmp/dump.fc
In addition to the table of contents, the compressed dump has one more advantage. It is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this is highly dependent on the number of indexes, but for many database applications, this rough estimation will hold true.
Once the backup is created, we can inspect the backup file:
[[email protected] ~]$ pg_restore --list /tmp/dump.fc ; ; Archive created at 2018-11-04 15:44:56 CET ; dbname: test ; TOC Entries: 18 ; Compression: -1 ; Dump Version: 1.12-0 ; Format: CUSTOM ; Integer: 4 bytes ; Offset: 8 bytes ; Dumped from database version: 11.0 ; Dumped by pg_dump version: 11.0 ; ; Selected TOC Entries: ; 3103; 1262 16384 DATABASE - test hs 3; 2615 2200 SCHEMA - public hs 3104; 0 0 COMMENT - SCHEMA public hs 1; 3079 13350 EXTENSION - plpgsql 3105; 0 0 COMMENT - EXTENSION plpgsql 187; 1259 16391 TABLE public t_test hs ...
Note that pg_restore --list will return the table of contents of the backup.
Using a custom format is a good idea as the backup will shrink in size. However, there’s more; the -Fd command will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:
[[email protected] ~]$ mkdir /tmp/backup [[email protected] ~]$ pg_dump -Fd test -f /tmp/backup/ [[email protected] ~]$ cd /tmp/backup/ [[email protected] backup]$ ls -lh total 86M -rw-rw-r--. 1 hs hs 85M Jan 4 15:54 3095.dat.gz -rw-rw-r--. 1 hs hs 107 Jan 4 15:54 3096.dat.gz -rw-rw-r--. 1 hs hs 740K Jan 4 15:54 3097.dat.gz -rw-rw-r--. 1 hs hs 39 Jan 4 15:54 3098.dat.gz -rw-rw-r--. 1 hs hs 4.3K Jan 4 15:54 toc.dat
One advantage of the directory format is that we can use more than one core to perform the backup. In the case of a plain or custom format, only one process will be used by pg_dump. The directory format changes that rule. The following example shows how we can tell pg_dump to use four cores (jobs):
[[email protected] backup]$ rm -rf * [[email protected] backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4
Having a backup is pointless unless you have tried to actually replay it. Fortunately, this is easy to do. If you have created a plain text backup, simply take the SQL file and execute it. The following example shows how that can be done:
A plain text backup is simply a text file containing everything. We can always simply replay a text file.
If you have decided on a custom format or directory format, you can use pg_restore to replay the backup. Additionally, pg_restore allows you to do all kinds of fancy things such as replaying just part of a database and so on. In most cases, however, you will simply replay the entire database. In this example, we will create an empty database and just replay a custom format dump:
[[email protected] backup]$ createdb new_db [[email protected] backup]$ pg_restore -d new_db -j 4 /tmp/dump.fc
Note that pg_restore will add data to an existing database. If your database is not empty, pg_restore might error out but continue.
Again, -j is used to throw up more than one process. In this example, four cores are used to replay the data; however, this only works when more than one table is being replayed.
As far as performance is concerned, dumps are a good solution if you are working with small or medium amounts of data. There are two major downsides:
- We will get a snapshot, so everything since the last snapshot will be lost
- Rebuilding a dump from scratch is comparatively slow compared to binary copies because all of the indexes have to be rebuilt
Handling global data
In the previous sections, we learned about pg_dump and pg_restore, which are two vital programs when it comes to creating backups. The thing is, pg_dump creates database dumps—it works on the database level. If we want to back up an entire instance, we have to make use of pg_dumpall or dump all of the databases separately. Before we dig into that, it makes sense to see how pg_dumpall works:
pg_dumpall > /tmp/all.sql
Let’s see, pg_dumpall will connect to one database after the other and send stuff to standard out, where you can process it with Unix. Note that pg_dumpall can be used just like pg_dump. However, it has some downsides. It does not support a custom or directory format, and therefore does not offer multicore support. This means that we will be stuck with one thread.
However, there is more to pg_dumpall. Keep in mind that users live on the instance level. If you create a normal database dump, you will get all of the permissions, but you won’t get all of the CREATE USER statements. Those globals are not included in a normal dump—they will only be extracted by pg_dumpall.
If we only want the globals, we can run pg_dumpall using the -g option:
pg_dumpall -g > /tmp/globals.sql
In most cases, you might want to run pg_dumpall -g, along with a custom or directory format dump to extract your instances. A simple backup script might look such as this:
#!/bin/sh BACKUP_DIR=/tmp/ pg_dumpall -g > $BACKUP_DIR/globals.sql for x in $(psql -c "SELECT datname FROM pg_database WHERE datname NOT IN ('postgres', 'template0', 'template1')" postgres -A -t) do pg_dump -Fc $x > $BACKUP_DIR/$x.fc done
It will first dump the globals and then loop through the list of databases to extract them one by one in a custom format.
To summarize, in this article, we learned about creating backups and dumps in general. To know more about streaming replication and binary backups, check out our book Mastering PostgreSQL 11 – Second Edition.