PL/Proxy

Function-based sharding for PostgreSQL

PL/Proxy Language Syntax

The language is similar to plpgsql - string quoting, comments, semicolon at the statements end.

It contains only 4 statements: CONNECT, CLUSTER, RUN and SELECT.

Each function needs to have either CONNECT or pair of CLUSTER + RUN statements to specify where to run the function.

The SELECT statement is optional, if it is missing, there will be default query generated based on proxy function signature.

The RUN statment is also optional, it defaults to RUN ON ANY which means the query will be run random partition.

CONNECT

CONNECT 'libpq connstr';

Specifies exact location where to connect and execute the query. If several functions have same connstr, they will use same connection.

CONNECT connect_func(...);
CONNECT argname;
CONNECT $argnr;

Connect string is taken from function result or directly from argument. If several functions have same connstr, they will use same connection. (New in 2.0.9)

NB: giving untrusted users ability to specify full connect string creates security hole. Eg it can used to read cleartext passwords from ~/.pgpass or pg_service.conf. If such function cannot be avoided, it’s access rights need to be restricted.

CLUSTER

CLUSTER 'cluster_name';

Specifies exact cluster name to be run on. The cluster name will be passed to plproxy.get_cluster_* functions.

CLUSTER cluster_func(..);

Cluster name can be dynamically decided upon proxy function arguments. cluster_func should return text value of final cluster name.

RUN ON …

RUN ON ALL;

Query will be run on all partitions in cluster in parallel.

RUN ON ANY;

Query will be run on random partition.

RUN ON <NR>;

Run on partition number <NR>.

RUN ON partition_func(..);

Run partition_func() which should return one or more hash values. (int4) Query will be run on tagged partitions. If more than one partition was tagged, query will be sent in parallel to them.

RUN ON argname;
RUN ON $1;

Take hash value directly from function argument. (New in 2.0.8)

SPLIT

SPLIT array_arg_1 [ , array_arg_2 ... ] ;
SPLIT ALL ;

Split the input arrays based on RUN ON statement into per-partition arrays. This is done by evaluating RUN ON condition for each array element and building per-partition parameter arrays for each matching partition. During execution each tagged partition then gets its own subset of the array to process. (New in 2.1)

The semantics of RUN ON statement is slightly changed with SPLIT arrays:

RUN ON partition_func(..);

The array is split between the partitions matching partition_func(). Any SPLIT parameters passed to the function are actually replaced with the individual array elements.

RUN ON argname;
RUN ON $1;

An array of partition numbers (or hashes) can be passed as argname. The function shall be run on the partitions specified in the array.

RUN ON ANY;

Each element is assigned to random partition.

RUN ON ALL;
RUN ON <NR>;

Unaffected, except for the added overhead of array copying.

Example:

CREATE FUNCTION set_profiles(i_users text[], i_profiles text[])
RETURNS SETOF text AS $$
    CLUSTER 'userdb';
    SPLIT i_users, i_profiles;
    RUN ON hashtext(i_users);
$$ LANGUAGE plproxy;

Given query:

SELECT * FROM set_profiles(ARRAY['foo', 'bar'], ARRAY['a', 'b']);

The hash function is called 2 times:

SELECT * FROM hashtext('foo');
SELECT * FROM hashtext('bar');

And target partitions get queries:

SELECT * FROM set_profiles(ARRAY['foo'], ARRAY['a']);
SELECT * FROM set_profiles(ARRAY['bar'], ARRAY['b']);

TARGET

Specify function name on remote side to be called. By default PL/Proxy uses current function name. (New in 2.2)

Following function:

CREATE FUNCTION some_function(username text, num int4)
RETURNS SETOF text AS $$
    CLUSTER 'userdb';
    RUN ON hashtext(username);
    TARGET other_function;
$$ LANGUAGE plproxy;

will run following query on remote side:

SELECT * FROM other_function(username, num);

SELECT

SELECT .... ;

By default, PL/Proxy generates query based on its own signature. But this can be overrided by giving explicit SELECT statement to run.

Everything after SELECT until semicolon is taken as SQL to be passed on. Only argument substitution is done on the contents, otherwise the text is unparsed. To avoid a table column to be parsed as function argument, table aliases should be used.

Query result should have same number of columns as function result and same names too.

Argument substitution

Proxy function arguments can be referenced using name or $n syntax. Everything that is not argument reference is just passed on.

Dynamic records

PL/Proxy supports function returning plain RECORD type. Such functions need the result type specified at call site. Main use-case is to run random queries on partitions. (New in 2.0.6)

Very simple example:

CREATE OR REPLACE FUNCTION dynamic_query(q text)
RETURNS SETOF RECORD AS $$
    CLUSTER 'mycluster';
    RUN ON ALL;
$$ LANGUAGE plproxy;

Corresponding function in partitions:

CREATE OR REPLACE FUNCTION dynamic_query(sql text)
RETURNS SETOF RECORD AS $$
DECLARE
    rec RECORD;
BEGIN
    FOR rec IN EXECUTE sql
    LOOP
    RETURN NEXT rec;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Sample request:

SELECT * FROM dynamic_query('SELECT id, username FROM sometable')
                        AS (id integer, username text);

The types given in AS clause must match actual types from query.