13 min read

In this article by Manpreet Kaur and Baji Shaik, authors of the book PostgreSQL Development Essentials, we will discuss indexes and constraints, types of indexes and constraints, their use in the real world, and the best practices on when we need to and how to create them. Your application may have different kinds of data, and you will want to maintain data integrity across the database and, at the same time, you need a performance gain as well. This article helps you understand how best you can choose indexes and constraints for your requirement and improve the performance. It also covers real-time examples that will help you understand better. Of course, not all types of indexes or constraints are the best fit for your requirement; however, you can choose the required ones based on how they work.

(For more resources related to this topic, see here.)

Introduction to indexes and constraints

An index is a pointer to the actual rows in its corresponding table. It is used to find and retrieve particular rows much faster than using the standard method. Indexes help you improve the performance of queries. However, indexes get updated on every Data Manipulation Language (DML)—that is, INSERT, UPDATE, and DELETE—query on the table, which is an overhead, so they should be used carefully. Constraints are basically rules restricting the values allowed in the columns and they define certain properties that data in a database must comply with. The purpose of constraints is to maintain the integrity of data in the database.

Primary-key indexes

As the name indicates, primary key indexes are the primary way to identify a record (tuple) in a table. Obviously, it cannot be null because a null (unknown) value cannot be used to identify a record. So, all RDBMSs prevent users from assigning a null value to the primary key. The primary key index is used to maintain uniqueness in a column. You can have only one primary key index on a table. It can be declared on multiple columns.

In the real world, for example, if you take the empid column of the emp table, you will be able to see a primary key index on that as no two employees can have the same empid value.

You can add a primary key index in two ways: One is while creating the table and the other, once the table has been created.

This is how you can add a primary key while creating a table:

CREATE TABLE emp(
empid integer PRIMARY KEY,
empname varchar,
sal numeric);

And this is how you can add a primary key after a table has been created:

CREATE TABLE emp(
empid integer,
empname varchar,
sal numeric);
ALTER TABLE emp ADD PRIMARY KEY(empid);

Irrespective of whether a primary key is created while creating a table or after the table is created, a unique index will be created implicitly. You can check unique index through the following command:

postgres=# select * from pg_indexes where tablename='emp';
-[ RECORD 1 ]-------------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_pkey
tablespace | 
indexdef   | CREATE UNIQUE INDEX emp_pkey ON emp USING btree (empid)

Since it maintains uniqueness in the column, what happens if you try to INSERT a duplicate row? And try to INSERT NULL values? Let’s check it out:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(100, 'ROBERT', '20000.00');
ERROR:  duplicate key value violates unique constraint "emp_pkey"
DETAIL:  Key (empid)=(100) already exists.
postgres=# INSERT INTO emp VALUES(null, 'ROBERT', '20000.00');
ERROR:  null value in column "empid" violates not-null constraint
DETAIL:  Failing row contains (null, ROBERT, 20000).

So, if empid is a duplicate value, database throws an error as duplicate key violation due to unique constraint and if empid is null, error is violates null constraint due to not-null constraint A primary key is simply a combination of a unique and a not-null constraint.

Unique indexes

Like a primary key index, a unique index is also used to maintain uniqueness; however, it allows NULL values. The syntax is as follows:

CREATE TABLE emp(
empid integer UNIQUE,
empname varchar,
sal numeric);

CREATE TABLE emp(
empid integer,
empname varchar,
sal numeric,
UNIQUE(empid));

This is what happens if you INSERT NULL values:

postgres=# INSERT INTO emp VALUES(100, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1
postgres=# INSERT INTO emp VALUES(null, 'SCOTT', '10000.00');
INSERT 0 1

As you see, it allows NULL values, and they are not even considered as duplicate values.

If a unique index is created on a column then there is no need of a standard index on the column. If you do so, it would just be a duplicate of the automatically created index. Currently, only B-Tree indexes can be declared unique.

When a primary key is defined, PostgreSQL automatically creates a unique index. You check it out using the following query:

postgres=# select * from pg_indexes where tablename ='emp';
-[ RECORD 1 ]-----------------------------------------------------
schemaname | public
tablename  | emp
indexname  | emp_empid_key
tablespace | 
indexdef   | CREATE UNIQUE INDEX emp_empid_key ON emp USING btree (empid)

Standard indexes

Indexes are primarily used to enhance database performance (though incorrect use can result in slower performance). An index can be created on multiple columns and multiple indexes can be created on one table. The syntax is as follows:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table_name [ USING method ]
  ( { column_name | ( expression ) } [ COLLATE collation ] [   opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
  [ WITH ( storage_parameter = value [, ... ] ) ]
  [ TABLESPACE tablespace_name ]
  [ WHERE predicate ]

PostgreSQL supports B-Tree, hash, Generalized Search Tree (GiST), SP-GiST, and Generalized Inverted Index (GIN), which we will cover later in this article. If you do not specify any index type while creating it creates a B-Tree index as default.

Full text indexes

Let’s talk about a document, for example, a magazine article or an e-mail message. Normally, we use a text field to store a document. Searching for content within the document based on a matching pattern is called full text search. It is based on the @@ matching operator. You can check out http://www.postgresql.org/docs/current/static/textsearch-intro.html#TEXTSEARCH-MATCHING for more details.

Indexes on such full text fields are nothing but full text indexes. You can create a GIN index to speed up full text searches. We will cover the GIN index later in this article. The syntax is as follows:

CREATE TABLE web_page(
title text,
heading text,
body text);

CREATE INDEX web_page_title_body ON web_page USING GIN(to_tsvector('english', body));

The preceding commands create a full text index on the body column of a web_page table.

Partial indexes

Partial indexes are one of the special features of PostgreSQL. Such indexes are not supported by many other RDBMSs. As the name suggests, if an index is created partially, which typically means the subset of a table, then it’s a partial index. This subset is defined by a predicate (WHERE clause). Its main purpose is to avoid common values. An index will essentially ignore a value that appears in a large fraction of a table’s rows and the search will revert o a full table scan rather than an index scan. Therefore, indexing repeated values just wastes space and incurs the expense of the index updating without getting any performance benefit back at read time. So, common values should be avoided.

Let’s take a common example, an emp table. Suppose you have a status column in an emp table that shows whether emp exists or not. In any case, you care about the current employees of an organization. In such cases, you can create a partial index in the status column by avoiding former employees.

Here is an example:

CREATE TABLE emp_table
(empid integer,
empname varchar, 
status varchar);

INSERT INTO emp_table VALUES(100, 'scott', 'exists');
INSERT INTO emp_table VALUES(100, 'clark', 'exists');
INSERT INTO emp_table VALUES(100, 'david', 'not exists');
INSERT INTO emp_table VALUES(100, 'hans', 'not exists');

To create a partial index that suits our example, we will use the following query

CREATE INDEX emp_table_status_idx ON emp_table(status) WHERE status NOT IN('not exists');

Now, let’s check the queries that can use the index and those that cannot.

A query that uses index is as follows:

postgres=# explain analyze select * from emp_table where status='exists';
                         QUERY PLAN                             
-------------------------------------------------------------------
Index Scan using emp_table_status_idx on emp_table(cost=0.13..6.16 rows=2 width=17) (actual time=0.073..0.075 rows=2 loops=1)
Index Cond: ((status)::text = 'exists'::text)
A query that will not use the index is as follows:
postgres=# explain analyze select * from emp_table where status='not exists';
                         QUERY PLAN                          
-----------------------------------------------------------------
Seq Scan on emp_table  (cost=10000000000.00..10000000001.05 rows=1 width=17) (actual time=0.013..0.014 rows=1 loops=1)
Filter: ((status)::text = 'not exists'::text)
Rows Removed by Filter: 3

Multicolumn indexes

PostgreSQL supports multicolumn indexes. If an index is defined simultaneously in more than one column then it is treated as a multicolumn index. The use case is pretty simple, for example, you have a website where you need a name and date of birth to fetch the required information, and then the query run against the database uses both fields as the predicate(WHERE clause). In such scenarios, you can create an index in both the columns.

Here is an example:

CREATE TABLE get_info(name varchar, dob date, email varchar);
INSERT INTO get_info VALUES('scott', '1-1-1971', '[email protected]');
INSERT INTO get_info VALUES('clark', '1-10-1975', '[email protected]');
INSERT INTO get_info VALUES('david', '11-11-1971', '[email protected]');
INSERT INTO get_info VALUES('hans', '12-12-1971', '[email protected]');

To create a multicolumn index, we will use the following command:

CREATE INDEX get_info_name_dob_idx ON get_info(name, dob);

A query that uses index is as follows:

postgres=# explain analyze  SELECT * FROM get_info WHERE name='scott' AND dob='1-1-1971';
                               QUERY PLAN                          
--------------------------------------------------------------------
Index Scan using get_info_name_dob_idx on get_info (cost=0.13..4.15 rows=1 width=68) (actual time=0.029..0.031 rows=1 loops=1)
Index Cond: (((name)::text = 'scott'::text) AND (dob = '1971-01-01'::date))
Planning time: 0.124 ms
Execution time: 0.096 ms

B-Tree indexes

Like most of the relational databases, PostgreSQL also supports B-Tree indexes. Most of the RDBMS systems use B-Tree as the default index type, unless something else is specified explicitly. Basically, this index keeps data stored in a tree (self-balancing) structure. It’s a default index in PostgreSQL and fits in the most common situations.

The B-Tree index can be used by an optimizer whenever the indexed column is used with a comparison operator, such as<, <=, =, >=, >, and LIKE or ~ operator; however, LIKE or ~ will only be used if the pattern is a constant and anchored to the beginning of the string, for example, my_col LIKE ‘mystring%’ or my_column ~ ‘^mystring’, but not my_column LIKE ‘%mystring’.

Here is an example:

CREATE TABLE emp(
empid integer,
empname varchar,
sal numeric);

INSERT INTO emp VALUES(100, 'scott', '10000.00');
INSERT INTO emp VALUES(100, 'clark', '20000.00');
INSERT INTO emp VALUES(100, 'david', '30000.00');
INSERT INTO emp VALUES(100, 'hans', '40000.00');

Create a B-Tree index on the empname column:

CREATE INDEX emp_empid_idx ON emp(empid);
CREATE INDEX emp_name_idx ON emp(empname);

Here are the queries that use index:

postgres=# explain analyze SELECT * FROM emp WHERE empid=100;
                        QUERY PLAN                           
---------------------------------------------------------------
Index Scan using emp_empid_idx on emp  (cost=0.13..4.15 rows=1 width=68) (actual time=1.015..1.304 rows=4 loops=1)
Index Cond: (empid = 100)
Planning time: 0.496 ms
Execution time: 2.096 ms

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE 'da%';                       QUERY PLAN                              
----------------------------------------------------------------
Index Scan using emp_name_idx on emp  (cost=0.13..4.15 rows=1 width=68) (actual time=0.956..0.959 rows=1 loops=1)
Index Cond: (((empname)::text >= 'david'::text) AND ((empname)::text < 'david'::text))
Filter: ((empname)::text ~~ 'david%'::text)
Planning time: 2.285 ms
Execution time: 0.998 ms

Here is a query that cannot use index as % is used at the beginning of the string:

postgres=# explain analyze SELECT * FROM emp WHERE empname LIKE '%david';
                         QUERY PLAN                         
---------------------------------------------------------------
Seq Scan on emp  (cost=10000000000.00..10000000001.05 rows=1 width=68) (actual time=0.014..0.015 rows=1 loops=1)
Filter: ((empname)::text ~~ '%david'::text)
Rows Removed by Filter: 3
Planning time: 0.099 ms
Execution time: 0.044 ms

Hash indexes

These indexes can only be used with equality comparisons. So, an optimizer will consider using this index whenever an indexed column is used with = operator.

Here is the syntax:

CREATE INDEX index_name ON table USING HASH (column);

Hash indexes are faster than B-Tree as they should be used if the column in question is never intended to be scanned comparatively with < or > operators.

The Hash indexes are not WAL-logged, so they might need to be rebuilt after a database crash, if there were any unwritten changes.

GIN and GiST indexes

GIN or GiST indexes are used for full text searches. GIN can only be created on the tsvector datatype columns and GIST on tsvector or tsquery database columns. The syntax is follows:

CREATE INDEX index_name ON table_name USING GIN (column_name);
CREATE INDEX index_name ON table_name USING GIST (column_name);

These indexes are useful when a column is queried for specific substrings on a regular basis; however, these are not mandatory.

What is the use case and when do we really need it? Let me give an example to explain. Suppose we have a requirement to implement simple search functionality for an application. Say, for example, we have to search through all users in the database. Also, let’s assume that we have more than 10 million users currently stored in the database.

This search implementation requirement shows that we should be able to search using partial matches and multiple columns, for example, first_name, last_name. More precisely, if we have customers like Mitchell Johnson and John Smith, an input query of John should return for both customers. We can solve this problem using the GIN or GIST indexes.

Here is an example:

CREATE TABLE customers (first_name text, last_name text);
Create GIN/GiST index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX customers_search_idx_gin ON customers USING gin (first_name gin_trgm_ops, last_name gin_trgm_ops);
CREATE INDEX customers_search_idx_gist ON customers USING gist (first_name gist_trgm_ops, last_name gist_trgm_ops);

So, what is the difference between these two indexes? This is what the PostgreSQL documentation says:

  • GiST is faster to update and build the index and is less accurate than GIN
  • GIN is slower to update and build the index but is more accurate

As per the documentation, the GiST index is lossy. It means that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed).

BRIN indexes

Block Range Index (BRIN) indexes are introduced in PostgreSQL 9.5. BRIN indexes are designed to handle very large tables in which certain columns have some natural correlation with their physical location within the table. The syntax is follows:

CREATE INDEX index_name ON table_name USING brin(col);

Here is a good example: https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.5#BRIN_Indexes

Summary

In this article, we looked at different types of indexes and constraints that PostgreSQL supports, how they are used in the real world with examples, and what happens if you violate a constraint. It not only helps you identify the right index for your data, but also improve the performance of the database. Every type of index or constraint has its own identification and need. Some indexes or constraints may not be portable to other RDBMSs, some may not be needed, or sometimes you might have chosen the wrong one for your need.

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here