(For more resources related to this topic, see here.)

Are untrusted languages inferior to trusted ones?

No, on the contrary, these languages are untrusted in the same way that a sharp knife is untrusted and should not be trusted to very small children, at least not without adult supervision. They have extra powers that ordinary SQL or even the trusted languages (such as PL/pgSQL) and trusted variants of the same language (PL/Perl versus PL/Perlu) don't have.

You can use the untrusted languages to directly read and write on the server's disks, and you can use it to open sockets and make Internet queries to the outside world. You can even send arbitrary signals to any process running on the database host. Generally, you can do anything the native language of the PL can do.

However, you probably should not trust arbitrary database users to have the right to define functions in these languages. Always think twice before giving all privileges on some untrusted language to a user or group by using the *u languages for important functions.

Can you use the untrusted languages for important functions? Absolutely. Sometimes, it may be the only way to accomplish some tasks from inside the server. Performing simple queries and computations should do nothing harmful to your database, and neither should connecting to the external world for sending e-mails, fetching web pages, or doing SOAP requests. They may cause delays and even queries that get stuck, but these can usually be dealt with by setting an upper limit as to how long a query can run by using an appropriate statement time-out value. Setting a reasonable statement time-out value by default is a good practice anyway.

So, if you don't deliberately do risky things, the probability of harming the database is no bigger than using a "trusted" (also known as "restricted") variant of the language. However, if you give the language to someone who starts changing bytes on the production database "to see what happens", you probably get what you asked for.

Will untrusted languages corrupt the database?

The power to corrupt the database is definitely there, since the functions run as the system user of the database server with full access to the filesystem. So, if you blindly start writing into the data files and deleting important logs, it is very likely that your database will be corrupted.

Additional types of denial-of-service attacks are also possible such as using up all memory or opening all IP ports; but there are ways to overload the database using plain SQL as well, so that part is not much different from the trusted database access with the ability to just run arbitrary queries.

So yes, you can corrupt the database, but please don't do it on a production server. If you do, you will be sorry.

Why untrusted?

PostgreSQL's ability to use an untrusted language is a powerful way to perform some nontraditional things from database functions. Creating these functions in a PL is an order of magnitude smaller task than writing an extension function in C. For example, a function to look up a hostname for an IP address is only a few lines in PL/Pythonu:

CREATE FUNCTION gethostbyname(hostname text) RETURNS inet AS $$ import socket return socket.gethostbyname(hostname) $$ LANGUAGE plpythonu SECURITY DEFINER;

You can test it immediately after creating the function by using psql:

hannu=# select gethostbyname('www.postgresql.org'); gethostbyname ---------------- 98.129.198.126 (1 row)

Creating the same function in the most untrusted language, C, involves writing tens of lines of boilerplate code, worrying about memory leaks, and all the other problems coming from writing code in a low-level language. I recommend prototyping in some PL language if possible, and in an untrusted language if the function needs something that the restricted languages do not offer.

Why PL/Python?

All of these tasks could be done equally well using PL/Perlu or PL/Tclu; I chose PL/Pythonu mainly because Python is the language I am most comfortable with. This also translates to having written some PL/Python code, which I plan to discuss and share with you in this article.

Quick introduction to PL/Python

PL/pgSQL is a language unique to PostgreSQL and was designed to add blocks of computation and SQL inside the database. While it has grown in its breath of functionality, it still lacks the completeness of syntax of a full programming language. PL/Python allows your database functions to be written in Python with all the depth and maturity of writing a Python code outside the database.

A minimal PL/Python function

Let's start from the very beginning (again):

CREATE FUNCTION hello(name text) RETURNS text AS $$ return 'hello %s !' % name $$ LANGUAGE plpythonu;

Here, we see that creating the function starts by defining it as any other PostgreSQL function with a RETURNS definition of a text field:

CREATE FUNCTION hello(name text) RETURNS text

The difference from what we have seen before is that the language part is specifying plpythonu (the language ID for PL/Pythonu language):

$$ LANGUAGE plpythonu;

Inside the function body it is very much a normal python function, returning a value obtained by the name passed as an argument formatted into a string 'hello %s !' using the standard Python formatting operator %:

return 'hello %s !' % name

Finally, let's test how this works:

hannu=# select hello('world'); hello --------------- hello world ! (1 row)

And yes, it returns exactly what is expected!

Data type conversions

The first and last things happening when a PL function is called by PostgreSQL are converting argument values between the PostgreSQL and PL types. The PostgreSQL types need to be converted to the PL types on entering the function, and then the return value needs to be converted back into the PostgreSQL type.

Except for PL/pgSQL, which uses PostgreSQL's own native types in computations, the PLs are based on existing languages with their own understanding of what types (integer, string, date, …) are, how they should behave, and how they are represented internally. They are mostly similar to PostgreSQL's understanding but quite often are not exactly the same. PL/Python converts data from PostgreSQL type to Python types as shown in the following table:

PostgreSQL	Python 2	Python 3	Comments
int2, int4	int	int
int8	long	int
real, double, numeric	float	float	This may lose precision for numeric values.
bytea	str	bytes	No encoding conversion is done, nor should any encoding be assumed.
text, char(), varchar(), and other text types	str	str	On Python 2, the string will be in server encoding. On Python 3, it is an unicode string.
All other types	str	str	PostgreSQL's type output function is used to convert to this string.

Inside the function, all computation is done using Python types and the return value is converted back to PostgreSQL using the following rules (this is a direct quote from official PL/Python documentation at http://www.postgresql.org/docs/current/static/plpython-data.html):

When the PostgreSQL return type is Boolean, the return value will be evaluated for truth according to the Python rules. That is, 0 and empty string are false, but notably f is true.

When the PostgreSQL return type is bytea, the return value will be converted to a string (Python 2) or bytes (Python 3) using the respective Python built-ins, with the result being converted bytea.

For all other PostgreSQL return types, the returned Python value is converted to a string using Python's built-in str, and the result is passed to the input function of the PostgreSQL data type.

Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to PostgreSQL. Strings that are not valid in the current server encoding will raise an error; but not all encoding mismatches can be detected, so garbage data can still result when this is not done correctly. Unicode strings are converted to the correct encoding automatically, so it can be safer and more convenient to use those. In Python 3, all strings are Unicode strings.

In other words, anything but 0, False, and an empty sequence, including empty string ' ' or dictionary becomes PostgreSQL false.

One notable exception to this is that the check for None is done before any other conversions and even for Booleans, None is always converted to NULL and not to the Boolean value false.

For the bytea type, the PostgreSQL byte array, the conversion from Python's string representation, is an exact copy with no encoding or other conversions applied.

Writing simple functions in PL/Python

Writing functions in PL/Python is not much different in principle from writing functions in PL/pgSQL. You still have the exact same syntax around the function body in $$, and the argument name, types, and returns all mean the same thing regardless of the exact PL/language used.

A simple function

So a simple add_one() function in PL/Python looks like this:

CREATE FUNCTION add_one(i int) RETURNS int AS $$ return i + 1; $$ LANGUAGE plpythonu;

It can't get much simpler than that, can it?

What you see here is that the PL/Python arguments are passed to the Python code after converting them to appropriate types, and the result is passed back and converted to the appropriate PostgreSQL type for the return value.

Functions returning a record

To return a record from a Python function, you can use:

A sequence or list of values in the same order as the fields in the return record

A dictionary with keys matching the fields in the return record

A class or type instance with attributes matching the fields in the return record

Here are samples of the three ways to return a record.

First, using an instance:

CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ class PGUser: def __init__(self,username,user_id,is_superuser): self.username = username self.user_id = user_id self.is_superuser = is_superuser u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] user = PGUser(u['usename'], u['usesysid'], u['usesuper']) return user $$ LANGUAGE plpythonu;

Then, a little simpler one using a dictionary:

CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] return {'username':u['usename'], 'user_id':u['usesysid'], 'is_ superuser':u['usesuper']} $$ LANGUAGE plpythonu;

Finally, using a tuple:

CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] return (u['usename'], u['usesysid'], u['usesuper']) $$ LANGUAGE plpythonu;

Notice [0] at the end of u = plpy.execute(...)[0] in all the examples. It is there to extract the first row of the result, as even for one-row results plpy.execute still returns a list of results.

Danger of SQL injection!

As we have neither executed a prepare() method and executed a execute() method with arguments after it, nor have we used the plpy.quote_literal() method (both techniques are discussed later) to safely quote the username before merging it into the query, we are open to a security flaw known as SQL injection. So, make sure that you only let trusted users call this function or supply the username argument.

Calling the function defined via any of these three CREATE commands will look exactly the same:

hannu=# select * from userinfo('postgres'); username | user_id | is_superuser ----------+---------+-------------- postgres | 10 | t (1 row)

It usually does not make sense to declare a class inside a function just to return a record value. This possibility is included mostly for cases where you already have a suitable class with a set of attributes matching the ones the function returns.

Table functions

When returning a set from a PL/Python functions, you have three options:

Return a list or any other sequence of return type

Return an iterator or generator

yield the return values from a loop

Here, we have three ways to generate all even numbers up to the argument value using these different styles.

First, returning a list of integers:

CREATE FUNCTION even_numbers_from_list(up_to int) RETURNS SETOF int AS $$ return range(0,up_to,2) $$ LANGUAGE plpythonu;

The list here is returned by a built-in Python function called range, which returns a result of all even numbers below the argument. This gets returned as a table of integers, one integer per row from the PostgreSQL function. If the RETURNS clause of the function definition would say int[] instead of SETOF int, the same function would return a single number of even integers as a PostgreSQL array.

The next function returns a similar result using a generator and returning both the even number and the odd one following it. Also, notice the different PostgreSQL syntax RETURNS TABLE(...) used this time for defining the return set:

CREATE FUNCTION even_numbers_from_generator(up_to int) RETURNS TABLE (even int, odd int) AS $$ return ((i,i+1) for i in xrange(0,up_to,2)) $$ LANGUAGE plpythonu;

The generator is constructed using a generator expression (x for x in <seq>). Finally, the function is defined using a generator using and explicit yield syntax, and yet another PostgreSQL syntax is used for returning SETOF RECORD with the record structure defined this time by OUT parameters:

CREATE FUNCTION even_numbers_with_yield(up_to int, OUT even int, OUT odd int) RETURNS SETOF RECORD AS $$ for i in xrange(0,up_to,2): yield i, i+1 $$ LANGUAGE plpythonu;

The important part here is that you can use any of the preceding ways to define a PL/Python set returning function, and they all work the same. Also, you are free to return a mixture of different types for each row of the set:

CREATE FUNCTION birthdates(OUT name text, OUT birthdate date) RETURNS SETOF RECORD AS $$ return ( {'name': 'bob', 'birthdate': '1980-10-10'}, {'name': 'mary', 'birthdate': '1983-02-17'}, ['jill', '2010-01-15'], ) $$ LANGUAGE plpythonu;

This yields result as follows:

hannu=# select * from birthdates(); name | birthdate ------+------------ bob | 1980-10-10 mary | 1983-02-17 jill | 2010-01-15 (3 rows)

As you see, the data returning a part of PL/Pythonu is much more flexible than returning data from a function written in PL/pgSQL.

Running queries in the database

If you have ever accessed a database in Python, you know that most database adapters conform to a somewhat loose standard called Python Database API Specification v2.0 or DBAPI 2 for short.

The first thing you need to know about database access in PL/Python is that in-database queries do not follow this API.

Running simple queries

Instead of using the standard API, there are just three functions for doing all database access. There are two variants: plpy.execute() for running a query, and plpy.prepare() for turning query text into a query plan or a prepared query.

The simplest way to do a query is with:

res = plpy.execute(<query text>, [<row count>])

This takes a textual query and an optional row count, and returns a result object, which emulates a list of dictionaries, one dictionary per row.

As an example, if you want to access a field 'name' of the third row of the result, you use:

res[2]['name']

The index is 2 and not 3 because Python lists are indexed starting from 0, so the first row is res[0], the second row res[1], and so on.

Using prepared queries

In an ideal world this would be all that is needed, but plpy.execute(query, cnt) has two shortcomings:

It does not support parameters

The plan for the query is not saved, requiring the query text to be parsed and run through the optimizer at each invocation

We will show a way to properly construct a query string later, but for most uses simple case parameter passing is enough. So, the execute(query, [maxrows]) call becomes a set of two statements:

plan = plpy.prepare(<query text>, <list of argument types>) res = plpy.execute(plan, <list of values>, [<row count>])For example, to query if a user 'postgres' is a superuser, use the following: plan = plpy.prepare("select usesuper from pg_user where usename = $1", ["text"]) res = plpy.execute(plan, ["postgres"]) print res[0]["usesuper"]

The first statement prepares the query, which parses the query string into a query tree, optimizes this tree to produce the best query plan available, and returns the prepared_query object. The second row uses the prepared plan to query for a specific user's superuser status.

The prepared plan can be used multiple times, so that you could continue to see if user bob is superuser.

res = plpy.execute(plan, ["bob"]) print res[0]["usesuper"]

Caching prepared queries

Preparing the query can be quite an expensive step, especially for more complex queries where the optimizer has to choose from a rather large set of possible plans; so, it makes sense to re-use results of this step if possible.

The current implementation of PL/Python does not automatically cache query plans (prepared queries), but you can do it easily yourself.

try: plan = SD['is_super_qplan'] except: SD['is_super_qplan'] = plpy.prepare(".... plan = SD['is_super_qplan'] <the rest of the function>

The values in SD[] and GD[] only live inside a single database session, so it only makes sense to do the caching in case you have long-lived connections.