
[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig. This book provides expert techniques on PostgreSQL 10 development and administration.[/box]

If you are looking up names or simple strings, you usually query the entire content of a field. Full-text search (FTS) is different: its purpose is to look for words or groups of words that occur somewhere inside a text. FTS is therefore more of a contains operation, as you are basically never looking for an exact match of the whole string.

In this article, we will show how to perform a full-text search operation in PostgreSQL.

In PostgreSQL, FTS can be done using GIN indexes. The idea is to dissect a text, extract valuable lexemes (= “preprocessed tokens of words”), and index those elements rather than the underlying text. To make your search even more successful, those words are preprocessed.

Here is an example:

test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars');

                          to_tsvector
---------------------------------------------------------------
 'car':2,6,14 'even':10 'mani':13 'mind':11 'want':4 'would':8
(1 row)

The example shows a simple sentence. The to_tsvector function takes the string, applies English rules, and performs a stemming process. Based on the configuration (english), PostgreSQL parses the string, throws away stop words, and stems each remaining word. For example, car and cars are both transformed to car. Note that this is not necessarily about finding the linguistically correct word stem: in the case of many, PostgreSQL simply transforms the string to mani by applying standard rules that happen to work nicely with the English language.
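If you want to look at the stemmer in isolation, the ts_lexize function lets you feed single words to a dictionary directly. Here is a small illustration, assuming the english_stem Snowball dictionary that the english configuration uses for ordinary words:

test=# SELECT ts_lexize('english_stem', 'cars');
 ts_lexize
-----------
 {car}
(1 row)

test=# SELECT ts_lexize('english_stem', 'many');
 ts_lexize
-----------
 {mani}
(1 row)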

Note that the output of the to_tsvector function is highly language dependent. If you tell PostgreSQL to treat the string as dutch, the result will be totally different:

test=# SELECT to_tsvector('dutch', 'A car, I want a car. I would not even mind having many cars');

                           to_tsvector
-----------------------------------------------------------------
 'a':1,5 'car':2,6,14 'even':10 'having':12 'i':3,7 'many':13
 'mind':11 'not':9 'would':8
(1 row)

To figure out which configurations are supported, consider running the following query:

SELECT cfgname FROM pg_ts_config;

Comparing strings

After taking a brief look at the stemming process, it is time to figure out how a stemmed text can be compared to a user query. The following code snippet checks for the word wanted:

test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars')
           @@ to_tsquery('english', 'wanted');

 ?column?
----------
 t
(1 row)

Note that wanted does not actually show up in the original text. Still, PostgreSQL will return true. The reason is that want and wanted are both transformed to the same lexeme, so the result is true. Practically, this makes a lot of sense. Imagine you are looking for a car on Google. If you find pages selling cars, this is totally fine. Finding common lexemes is, therefore, an intelligent idea.

Sometimes, people are not only looking for a single word, but want to find a set of words. With to_tsquery, this is possible, as shown in the next example:

test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars')
           @@ to_tsquery('english', 'wanted & bmw');

 ?column?
----------
 f
(1 row)

In this case, false is returned because bmw cannot be found in our input string. In the to_tsquery function, & means and, and | means or, so it is easy to build complex search strings.
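To illustrate the | operator, the following variation asks for wanted or bmw. One match is enough, so this time the result is true:

test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars')
           @@ to_tsquery('english', 'wanted | bmw');

 ?column?
----------
 t
(1 row)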

Defining GIN indexes

If you want to apply text search to a column or a group of columns, there are basically two choices:

  • Create a functional index using GIN
  • Add a column containing ready-to-use tsvectors and a trigger to keep them in sync

In this section, both options will be outlined. To show how things work, I have created some sample data:

test=# CREATE TABLE t_fts AS SELECT comment
          FROM pg_available_extensions;
SELECT 43

Indexing the column directly with a functional index is definitely a slower but more space-efficient way to get things done:

test=# CREATE INDEX idx_fts_func ON t_fts
          USING gin(to_tsvector('english', comment));
CREATE INDEX

Deploying an index on the function is easy, but it can lead to some overhead. Adding a materialized column needs more space, but will lead to a better runtime behavior:

test=# ALTER TABLE t_fts ADD COLUMN ts tsvector;
ALTER TABLE
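Keep in mind that the new column does not speed anything up by itself; it still needs a GIN index of its own (the index name here is just an example):

test=# CREATE INDEX idx_fts_ts ON t_fts USING gin(ts);
CREATE INDEX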

The only trouble is, how do you keep this column in sync? The answer is by using a trigger:

test=# CREATE TRIGGER tsvectorupdate
          BEFORE INSERT OR UPDATE ON t_fts
          FOR EACH ROW EXECUTE PROCEDURE
          tsvector_update_trigger(ts, 'pg_catalog.english', 'comment');
CREATE TRIGGER

Fortunately, PostgreSQL already provides a C function that can be used by a trigger to sync the tsvector column. Just pass the name of the tsvector column (ts), the desired configuration, as well as one or more text columns to the function, and you are already done. The trigger function will take care of all that is needed. Note that a trigger will always operate within the same transaction as the statement making the modification. Therefore, there is no risk of being inconsistent.
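To verify that the trigger really does its job, you can insert a row and inspect the materialized column. The sample text here is made up purely for illustration:

test=# INSERT INTO t_fts (comment) VALUES ('functions for many data types');
INSERT 0 1
test=# SELECT ts FROM t_fts WHERE comment = 'functions for many data types';
                    ts
------------------------------------------
 'data':4 'function':1 'mani':3 'type':5
(1 row)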

Debugging your search

Sometimes, it is not quite clear why a query matches a given search string. To debug your query, PostgreSQL offers the ts_debug function. From a user’s point of view, it can be used just like to_tsvector. It reveals a lot about the inner workings of the FTS infrastructure:

test=# \x
Expanded display is on.
test=# SELECT * FROM ts_debug('english', 'go to www.postgresql-support.de');

-[ RECORD 1 ]+----------------------------
alias        | asciiword
description  | Word, all ASCII
token        | go
dictionaries | {english_stem}
dictionary   | english_stem
lexemes      | {go}
-[ RECORD 2 ]+----------------------------
alias        | blank
description  | Space symbols
token        |
dictionaries | {}
dictionary   |
lexemes      |
-[ RECORD 3 ]+----------------------------
alias        | asciiword
description  | Word, all ASCII
token        | to
dictionaries | {english_stem}
dictionary   | english_stem
lexemes      | {}
-[ RECORD 4 ]+----------------------------
alias        | blank
description  | Space symbols
token        |
dictionaries | {}
dictionary   |
lexemes      |
-[ RECORD 5 ]+----------------------------
alias        | host
description  | Host
token        | www.postgresql-support.de
dictionaries | {simple}
dictionary   | simple
lexemes      | {www.postgresql-support.de}

ts_debug will list every token found and display information about the token. You will see which token the parser found, the dictionary used, as well as the type of object. In my example, blanks, words, and hosts have been found. You might also see numbers, email addresses, and a lot more. Depending on the type of string, PostgreSQL will handle things differently. For example, it makes absolutely no sense to stem hostnames and e-mail addresses.

Gathering word statistics

Full-text search can handle a lot of data. To give end users more insights into their texts, PostgreSQL offers the ts_stat function, which returns a list of words:

test=# SELECT * FROM ts_stat('SELECT to_tsvector(''english'', comment)
           FROM pg_available_extensions')
       ORDER BY 2 DESC LIMIT 3;

   word   | ndoc | nentry
----------+------+--------
 function |   10 |     10
 data     |   10 |     10
 type     |    7 |      7
(3 rows)

The word column contains the stemmed word, ndoc tells us the number of documents in which a certain word occurs, and nentry indicates how often a word was found altogether.
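The difference between ndoc and nentry is easy to see when a single document contains the same word more than once. Here is a contrived illustration reusing the sentence from before:

test=# SELECT * FROM ts_stat('SELECT to_tsvector(''english'', ''A car, I want a car. Many cars'')')
       WHERE word = 'car';

 word | ndoc | nentry
------+------+--------
 car  |    1 |      3
(1 row)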

Taking advantage of exclusion operators

So far, indexes have been used to speed things up and to ensure uniqueness. However, a couple of years ago, somebody came up with the idea of using indexes for even more. As you have seen in this chapter, GiST supports operations such as intersects, overlaps, contains, and a lot more. So, why not use those operations to manage data integrity?

Here is an example:

test=# CREATE EXTENSION btree_gist;
CREATE EXTENSION
test=# CREATE TABLE t_reservation (
          room     int,
          from_to  tsrange,
          EXCLUDE USING gist (room WITH =, from_to WITH &&)
);
CREATE TABLE

The EXCLUDE USING gist clause defines an additional constraint. If you are selling rooms, you might want to allow different rooms to be booked at the same time. However, you don't want to sell the same room twice during the same period. What the EXCLUDE clause says in my example is this: if a room is booked twice at the same time, an error should pop up. In other words, two values of from_to must not overlap (&&) if they are related to the same room (=).
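The overlap check can also be tried in isolation, because the && operator works on any range type. Here are two hand-picked ranges for illustration:

test=# SELECT tsrange('2017-01-01', '2017-03-03') && tsrange('2017-02-02', '2017-08-14');

 ?column?
----------
 t
(1 row)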

The following two rows will not violate constraints:

test=# INSERT INTO t_reservation
          VALUES (10, '["2017-01-01", "2017-03-03"]');
INSERT 0 1
test=# INSERT INTO t_reservation
          VALUES (13, '["2017-01-01", "2017-03-03"]');
INSERT 0 1

However, the next INSERT will cause a violation because the data overlaps:

test=# INSERT INTO t_reservation
          VALUES (13, '["2017-02-02", "2017-08-14"]');
ERROR:  conflicting key value violates exclusion constraint "t_reservation_room_from_to_excl"
DETAIL:  Key (room, from_to)=(13, ["2017-02-02 00:00:00","2017-08-14 00:00:00"]) conflicts with existing key (room, from_to)=(13, ["2017-01-01 00:00:00","2017-03-03 00:00:00"]).

The use of exclusion operators is very useful and can provide you with highly advanced means to handle integrity.

To summarize, we learned how to perform full-text search operations in PostgreSQL.

If you liked our article, check out the book Mastering PostgreSQL 10 to understand how to perform operations such as indexing, query optimization, concurrent transactions, table partitioning, server tuning, and more.
