postgresqlstored-proceduresplpgsqlsql-functionpostgresql-9.0

PostgreSQL function to perform a series of operations with the current record and then return the result


I'm not familiar with PostgreSQL functions and stored procedures, but I've done my best to learn during the past few days. I tried hard before asking here.

Basically, I have a situation in which I cannot use simple SQL and in which a function would be most helpful. (That is because I am sending the query via AJAX to a Postgres-based webservice that returns JSONP, and, because the query is built in JavaScript based on a non-predetermined number of variables, it can grow past the 2000 or so URL character limit allowed in MSIE.)

Say, I have a table called clients:

+-------------+-------------------+-------------+---------------+
|   CLIENT    | MONTHLY_PURCHASES | SALES_VALUE | RETURNS_VALUE |
+-------------+-------------------+-------------+---------------+
| Mercury Ltd | 3                 | 400000      | 30000         |
| Saturn Plc  | 11                | 150000      | 30000         |
| Uranus Ltd  | 4                 | 80000       | 1000          |
+-------------+-------------------+-------------+---------------+

The query is supposed to return clients ranked by various criteria contained in the columns. The number of columns may grow in the future.

For instance, if I want to get the top 10 best clients, ranked from 100 (best) to 0 (worst), the SQL query would be:

WITH var AS (
    --we need the min and max values for each criteria, to calculate the rank later
    SELECT 
      MIN(monthly_purchases) AS min_pur,
      MAX(monthly_purchases) AS max_pur,
      MIN(sales_value) AS min_sales,
      MAX(sales_value) AS max_sales,
      MIN(returns_value) AS min_returns,
      MAX(returns_value) AS max_returns
    FROM clients
),
--standardise values to a 0 to 100 range, so we can compare apples with oranges, and assign weights to each criteria (from 0 to 1)
weights AS (        
    SELECT client,
      --the higher the number of purchases the better. Weight: 0.2 out of 1.
      0.2 * (clients.monthly_purchases - var.min_pur) / (var.max_pur - var.min_pur) * 100 AS rnk_pur,
      --the higher the value of sales, the better. Weight: 0.4 out of 1.
      0.4 * (clients.sales_value - var.min_sales) / (var.max_sales - var.min_sales) * 100 AS rnk_sales,
      --the lower the value of returns the better. Weight: 0.4 out of 1.
      0.4 * (1 - (clients.returns_value - var.min_returns) / (var.max_returns - var.min_returns)) * 100 AS rnk_returns
    FROM clients, var
)
SELECT weights.client, weights.rnk_pur + weights.rnk_sales + weights.rnk_returns as overall_rank FROM weights ORDER BY overall_rank DESC LIMIT 10

All good, but in reality the number of columns is larger (about 40), and the user could use at once anything between 1 and 15 for ranking.

As such, the SQL route is not feasible. I tried creating a function which would at least do the standardisation of values:

--Firstly, a function to find the highest value in an array
DROP FUNCTION IF EXISTS array_max(float[]);

CREATE OR REPLACE FUNCTION array_max(float[])
RETURNS float
AS $$
  select max(x) from unnest($1)x order by 1;
$$
LANGUAGE 'sql';

--Secondly, a function to find the lowest value in an array
DROP FUNCTION IF EXISTS array_min(float[]);

CREATE OR REPLACE FUNCTION array_min(float[])
RETURNS float
AS $$
  select min(x) from unnest($1)x order by 1;
$$
LANGUAGE 'sql';

--Finally, our function
DROP FUNCTION IF EXISTS standardise(float[], float);

CREATE OR REPLACE FUNCTION standardise(myarray float[], val float)
RETURNS float AS
$$

DECLARE
  minimum float;
  maximum float;
  calc_result float;
BEGIN
  minimum = array_min(myarray);
  maximum = array_max(myarray);

  calc_result = (val - minimum) / (maximum - minimum) * 100;

  RETURN calc_result;
END;
$$
LANGUAGE 'plpgsql' IMMUTABLE;

Not surprisingly, the function is pretty slow. If used like this:

SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients)), clients.sales_value) AS rnk_sales
FROM clients

... it's acceptable. Anything that involves ordering slows it down to a crawl. I.e:

SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients)), clients.sales_value) AS rnk_sales
FROM clients ORDER BY rnk_sales LIMIT 10

Is there any way in which I can improve the speed of above function. Or, perhaps a totally different approach? Any help would be much appreciated. Thanks!

UPDATE:

I ran EXPLAIN ANALYZE with the last query. For this I chose only a sample from the whole table, as it was taking way too long. I cancelled the query after waiting for 10 minutes. This is on a table with 1000 clients:

EXPLAIN ANALYZE SELECT 0.5 * standardise((SELECT array(SELECT sales_value FROM clients_sample)), clients_sample.sales_value) AS rnk_sales
FROM clients_sample ORDER BY rnk_sales LIMIT 10

Result:

Limit  (cost=78.82..78.83 rows=10 width=8) (actual time=357.806..357.822 rows=10 loops=1)
  InitPlan 2 (returns $1)
    ->  Result  (cost=12.00..12.00 rows=1 width=0) (actual time=1.267..1.268 rows=1 loops=1)
          InitPlan 1 (returns $0)
            ->  Seq Scan on clients_sample clients_sample_1  (cost=0.00..12.00 rows=1000 width=8) (actual time=0.002..0.666 rows=1000 loops=1)
  ->  Sort  (cost=66.82..67.32 rows=1000 width=8) (actual time=357.805..357.809 rows=10 loops=1)
        Sort Key: ((0.5::double precision * standardise($1, clients_sample.sales_value)))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on clients_sample  (cost=0.00..62.50 rows=1000 width=8) (actual time=1.870..356.742 rows=1000 loops=1)
Total runtime: 357.850 ms

Solution

  • Sanitized helper functions

    CREATE OR REPLACE FUNCTION array_max(float[])
      RETURNS float AS
    'SELECT max(x) from unnest($1) x'
    LANGUAGE sql;
    

    ORDER BY 1 would be useless, since max(x) returns a single row anyway.
    Same for array_min(float[]);

    However, do not use these functions. It is cheaper to get min() and max() in a single call.

    Main function:

    Use a simple SQL function instead:

    CREATE OR REPLACE FUNCTION standardise(_arr float[], _val float)
      RETURNS float AS
    $func$
    SELECT ((_val - min_x) * 100) / (max_x - min_x)
    FROM (
       SELECT min(x) AS min_x, max(x) AS max_x
       FROM   unnest($1) x
       ) sub
    $func$
    LANGUAGE sql IMMUTABLE;