Tags: sql, performance, postgresql, random

Best way to select random rows in PostgreSQL


I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But others recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want the query to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?


Solution

  • Fast ways

    Given your specifications (plus additional info in the comments):

    The query below does not need a sequential scan of the big table, only an index scan.

    First, get estimates for the main query:

    SELECT count(*) AS ct              -- optional
         , min(id)  AS min_id
         , max(id)  AS max_id
         , max(id) - min(id) AS id_span
    FROM   big;
    

    The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:

    SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
    FROM   pg_class
    WHERE  oid = 'big'::regclass;  -- your table name
    

    The formula multiplies the average rows per page (reltuples / relpages, from the last ANALYZE) by the current number of pages (pg_relation_size(oid) / 8192, assuming the default block size of 8 kB), so the estimate stays reasonably accurate even after the table has grown or shrunk.
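
    If ANALYZE (or autovacuum) runs often enough on the table, the raw reltuples value alone is an even simpler shortcut, at the price of a slightly staler figure:

    SELECT reltuples::bigint AS ct
    FROM   pg_class
    WHERE  oid = 'big'::regclass;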

    As long as ct isn't much smaller than id_span, the query will outperform other approaches.

    WITH params AS (
       SELECT 1       AS min_id           -- minimum id <= current min id
            , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
        )
    SELECT *
    FROM  (
       SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
       FROM   params p
            , generate_series(1, 1100) g  -- 1000 + buffer
       GROUP  BY 1                        -- trim duplicates
    ) r
    JOIN   big USING (id)
    LIMIT  1000;                          -- trim surplus
    

    Short version

    You can simplify this query. The CTE in the query above is just for educational purposes:

    SELECT *
    FROM  (
       SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
       FROM   generate_series(1, 1100) g
       ) r
    JOIN   big USING (id)
    LIMIT  1000;
    

    Refine with rCTE

    This helps especially if you are not so sure about gaps and estimates.

    WITH RECURSIVE random_pick AS (
       SELECT *
       FROM  (
          SELECT 1 + trunc(random() * 5100000)::int AS id
          FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
          LIMIT  1030                      -- hint for query planner
          ) r
       JOIN   big b USING (id)             -- eliminate misses
    
       UNION                               -- eliminate dupes
       SELECT b.*
       FROM  (
          SELECT 1 + trunc(random() * 5100000)::int AS id
          FROM   random_pick r             -- just to make it recursive
          LIMIT  999                       -- less than 1000, hint for query planner
          ) r
       JOIN   big b USING (id)             -- eliminate misses
       )
    TABLE  random_pick
    LIMIT  1000;  -- actual limit
    

    We can work with a smaller surplus in the base query. If there are too many gaps and we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a buffer so large that it defeats the purpose of optimizing performance.

    Duplicates are eliminated by the UNION in the rCTE.

    The outer LIMIT makes the CTE stop as soon as we have enough rows.

    This query is carefully drafted to use the available index, generate truly random rows, and not stop until the limit is fulfilled (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
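
    To pick a sensible surplus, you can measure the actual gap fraction once. A minimal sketch, assuming id is NOT NULL (this scans the whole table, so treat it as a one-off check, not something to run per sample):

    SELECT max(id) - min(id) + 1 - count(*) AS gaps
         , round(100.0 * (1 - count(*) / (max(id) - min(id) + 1.0)), 2) AS gap_pct
    FROM   big;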

    Wrap into function

    For repeated use with the same table with varying parameters:

    CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
      RETURNS SETOF big
      LANGUAGE plpgsql VOLATILE ROWS 1000 AS
    $func$
    DECLARE
       _surplus  int := _limit * _gaps;
       _estimate int := (           -- get current estimate from system
          SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
          FROM   pg_class
          WHERE  oid = 'big'::regclass);
    BEGIN
       RETURN QUERY
       WITH RECURSIVE random_pick AS (
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * _estimate)::int
             FROM   generate_series(1, _surplus) g
             LIMIT  _surplus           -- hint for query planner
             ) r (id)
          JOIN   big USING (id)        -- eliminate misses
    
          UNION                        -- eliminate dupes
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * _estimate)::int
             FROM   random_pick        -- just to make it recursive
             LIMIT  _limit             -- hint for query planner
             ) r (id)
          JOIN   big USING (id)        -- eliminate misses
       )
       TABLE  random_pick
       LIMIT  _limit;
    END
    $func$;
    

    Call:

    SELECT * FROM f_random_sample();
    SELECT * FROM f_random_sample(500, 1.05);
    

    Generic function

    We can make this generic to work for any table with a unique integer column (typically the PK): pass the table as a polymorphic type, optionally the name of the PK column, and use EXECUTE:

    CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                             , _id text = 'id'
                                             , _limit int = 1000
                                             , _gaps real = 1.03)
      RETURNS SETOF anyelement
      LANGUAGE plpgsql VOLATILE ROWS 1000 AS
    $func$
    DECLARE
       -- safe syntax with schema & quotes where needed
       _tbl text := pg_typeof(_tbl_type)::text;
       _estimate int := (SELECT (reltuples / relpages
                              * (pg_relation_size(oid) / 8192))::bigint
                         FROM   pg_class  -- get current estimate from system
                         WHERE  oid = _tbl::regclass);
    BEGIN
       RETURN QUERY EXECUTE format(
       $$
       WITH RECURSIVE random_pick AS (
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * $1)::int
             FROM   generate_series(1, $2) g
             LIMIT  $2                 -- hint for query planner
             ) r(%2$I)
          JOIN   %1$s USING (%2$I)     -- eliminate misses
    
          UNION                        -- eliminate dupes
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * $1)::int
             FROM   random_pick        -- just to make it recursive
             LIMIT  $3                 -- hint for query planner
             ) r(%2$I)
          JOIN   %1$s USING (%2$I)     -- eliminate misses
       )
       TABLE  random_pick
       LIMIT  $3;
       $$
     , _tbl, _id
       )
       USING _estimate              -- $1
           , (_limit * _gaps)::int  -- $2 ("surplus")
           , _limit                 -- $3
       ;
    END
    $func$;
    

    Call with defaults (important!):

    SELECT * FROM f_random_sample(null::big);  --!
    

    Or more specifically:

    SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);
    

    About the same performance as the static version.

    This is safe against SQL injection.

    Possible alternative

    If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
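
    A minimal sketch, assuming the function defined above (the view name is illustrative):

    CREATE MATERIALIZED VIEW mv_random_sample AS
    SELECT * FROM f_random_sample();              -- or paste the static query

    REFRESH MATERIALIZED VIEW mv_random_sample;   -- re-roll the sample when needed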

    Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

    Where n is a percentage. The manual:

    The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

    The emphasis is on "percentage": the argument is a fraction of the table, not a row count. It's very fast, but the result is not exactly random. The manual again:

    The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

    The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

    SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
    
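
    The sample changes with every call. If repeated calls must see the same sample, the standard REPEATABLE clause fixes the seed (the seed value here is arbitrary):

    SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0) REPEATABLE (42);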

    Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

    SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
    
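
    tsm_system_rows is one of the contrib modules shipped with standard Postgres distributions; enable it once per database:

    CREATE EXTENSION tsm_system_rows;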

    See Evan's answer for details.

    But that's still not exactly random.