Best way to select random rows PostgreSQL

499

I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But some other recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want the query to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

Joijoice answered 29/12, 2011 at 23:30 Comment(7)
Hi Jack, thanks for your response. The execution time is slower with ORDER BY, but I would like to know what the difference is, if any...Joijoice
Uhhh...you're welcome. So, have you tried benchmarking the different approaches?Farther
There are also much faster ways. It all depends on your requirements and what you have to work with. Do you need exactly 1000 rows? Does the table have a numeric id? With no / few / many gaps? How important is speed? How many requests per time unit? Does every request need a different set or can they be the same for a defined time slice?Aventurine
Hi Erwin, I need 1M rows or in that order. I am doing some data mining research, so 1M rows, is the set for training... The table has a numeric id, and there are gaps but they are little. Me and another two guys are the only users of the database.Joijoice
The first option "(random() < 0.01)" is mathematically incorrect, as you could get no rows in response if no random number happens to be below 0.01; that could occur in any case (albeit less likely), no matter how big the table is or how high the threshold. The second option is always right.Heffner
If you know you want exactly 1000 rows, my answer uses the new updated tsm_system_rows though it's subject to clustering which may reduce randomness. @JoijoiceMclaurin
If you want to select just one row, see this question: https://mcmap.net/q/75320/-quick-random-row-selection-in-postgres/247696Medardas
346

Fast ways

Given your specifications (plus additional info in the comments):

  • You have a numeric ID column (integer numbers) with only a few (or moderately few) gaps.
  • Obviously no or few write operations.
  • Your ID column has to be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

SELECT count(*) AS ct              -- optional
     , min(id)  AS min_id
     , max(id)  AS max_id
     , max(id) - min(id) AS id_span
FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:

SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM   pg_class
WHERE  oid = 'big'::regclass;  -- your table name

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

WITH params AS (
   SELECT 1       AS min_id           -- minimum id <= current min id
        , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
    )
SELECT *
FROM  (
   SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
   FROM   params p
        , generate_series(1, 1100) g  -- 1000 + buffer
   GROUP  BY 1                        -- trim duplicates
) r
JOIN   big USING (id)
LIMIT  1000;                          -- trim surplus
  • Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.

  • Each id can be picked multiple times by chance (though very unlikely with a big ID space), so group the generated numbers (or use DISTINCT).

  • Join the ids to the big table. This should be very fast with the index in place.

  • Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
   FROM   generate_series(1, 1100) g
   ) r
JOIN   big USING (id)
LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
      LIMIT  1030                      -- hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss

   UNION                               -- eliminate dupe
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick r             -- plus 3 percent - adapt to your needs
      LIMIT  999                       -- less than 1000, hint for query planner
      ) r
   JOIN   big b USING (id)             -- eliminate miss
   )
TABLE  random_pick
LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a buffer large enough to defeat the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.

Wrap into function

For repeated use with the same table with varying parameters:

CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
  RETURNS SETOF big
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   _surplus  int := _limit * _gaps;
   _estimate int := (           -- get current estimate from system
      SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
      FROM   pg_class
      WHERE  oid = 'big'::regclass);
BEGIN
   RETURN QUERY
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   generate_series(1, _surplus) g
         LIMIT  _surplus           -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * _estimate)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  _limit             -- hint for query planner
         ) r (id)
      JOIN   big USING (id)        -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  _limit;
END
$func$;

Call:

SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);

Generic function

We can make this generic to work for any table with a unique integer column (typically the PK): Pass the table as polymorphic type and (optionally) the name of the PK column and use EXECUTE:

CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
                                         , _id text = 'id'
                                         , _limit int = 1000
                                         , _gaps real = 1.03)
  RETURNS SETOF anyelement
  LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
   -- safe syntax with schema & quotes where needed
   _tbl text := pg_typeof(_tbl_type)::text;
   _estimate int := (SELECT (reltuples / relpages
                          * (pg_relation_size(oid) / 8192))::bigint
                     FROM   pg_class  -- get current estimate from system
                     WHERE  oid = _tbl::regclass);
BEGIN
   RETURN QUERY EXECUTE format(
   $$
   WITH RECURSIVE random_pick AS (
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   generate_series(1, $2) g
         LIMIT  $2                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses

      UNION                        -- eliminate dupes
      SELECT *
      FROM  (
         SELECT 1 + trunc(random() * $1)::int
         FROM   random_pick        -- just to make it recursive
         LIMIT  $3                 -- hint for query planner
         ) r(%2$I)
      JOIN   %1$s USING (%2$I)     -- eliminate misses
   )
   TABLE  random_pick
   LIMIT  $3;
   $$
 , _tbl, _id
   )
   USING _estimate              -- $1
       , (_limit * _gaps)::int  -- $2 ("surplus")
       , _limit                 -- $3
   ;
END
$func$;

Call with defaults (important!):

SELECT * FROM f_random_sample(null::big);  --!

Or more specifically:

SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);

About the same performance as the static version.

This is safe against SQL injection.

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
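
A minimal sketch of that idea, reusing f_random_sample() from above (the view name is illustrative):

CREATE MATERIALIZED VIEW random_pick_sample AS
SELECT * FROM f_random_sample();

-- Later, whenever a fresh selection is wanted (e.g. from a nightly job):
REFRESH MATERIALIZED VIEW random_pick_sample;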

Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)

Where n is a percentage. The manual:

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.

Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:

The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);

See Evan's answer for details.

But that's still not exactly random.

Aventurine answered 30/12, 2011 at 1:2 Comment(7)
Where is the t table defined? Should it be r instead of t?Maun
@LucM: It is defined here: JOIN bigtbl t, which is short for JOIN bigtbl AS t. t is a table alias for bigtbl. Its purpose is to shorten the syntax but it would not be needed in this particular case. I simplified the query in my answer and added a simple version.Aventurine
What's the purpose of the range of values from generate_series(1,1100)?Thy
@Awesome-o: The goal is to retrieve 1000 rows, I start with an extra 10 % to compensate for a few gaps or (unlikely but possible) duplicate random numbers ... the explanation is in my answer.Aventurine
Erwin, I posted a variation of your "Possible alternative": https://mcmap.net/q/74055/-best-way-to-select-random-rows-postgresql. Would be interested in your thoughts.Alternative
I have just answered a question here (random single record) during which I performed considerable benchmarking and testing of the tsm_system_rows and tsm_system_time extensions. As far as I can see, they are virtually useless for anything but absolutely minimal selection of random rows. I would be grateful if you could take a quick look and comment on the validity or otherwise of my analysis.Livvy
@ErwinBrandstetter you are a pro! Such a detailed answer. Well done and thanks a lot!Sticky
143

postgresql order by random(), select rows in random order:

These are all slow because they do a tablescan to guarantee that every row gets an exactly equal chance of being chosen:

select your_columns from your_table ORDER BY random()

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

select your_columns from your_table ORDER BY random() limit 1

If you know how many rows are in the table (call it N):

Offsetting by a floored random number is constant time. However, I am NOT convinced that OFFSET produces a truly random sample. It simulates one by getting 'the next bunch' and table-scanning that, so you can step through, which isn't quite the same as the above.

SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
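
If you don't want to hard-code N, here is a sketch that plugs in the planner's row-count estimate instead (assuming the table has been analyzed so that reltuples is populated; mytable and myid as above):

SELECT myid
FROM   mytable
OFFSET floor(random() * (SELECT reltuples::bigint
                         FROM   pg_class
                         WHERE  oid = 'mytable'::regclass))::bigint
LIMIT  1;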

Roll your own constant-time "select N random rows", with a periodic table scan to be absolutely sure of randomness:

If your table is huge, then the above table-scans are a show stopper, taking up to 5 minutes to finish.

To go faster, you can schedule a behind-the-scenes nightly table-scan reindexing, which guarantees a perfectly random selection in O(1) constant time, except during the nightly reindexing table-scan, where you must wait for maintenance to finish before you may receive another random row.

--Create a demo table with lots of random nonuniform data, big_data 
--is your huge table you want to get random rows from in constant time. 
drop table if exists big_data;  
CREATE TABLE big_data (id serial unique, some_data text );  
CREATE INDEX ON big_data (id);  
--Fill it with ten million rows which simulate your beautiful data:  
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
 
--This delete statement puts holes in your index
--making it NONuniformly distributed  
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8); 
 
 
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper; 
CREATE TABLE big_data_mapper (id serial, big_data_id int); 
CREATE INDEX ON big_data_mapper (id); 
CREATE INDEX ON big_data_mapper (big_data_id); 
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
 
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row, 
--you try again until you succeed.  In the event the big_data_mapper 
--is broken, it tries 25 times then gives up and returns -1. 
CREATE or replace FUNCTION get_random_big_data_id()  
RETURNS int language plpgsql AS $$ 
declare  
    response int; 
BEGIN
    --Loop is required because big_data_mapper could be old
    --Keep rolling the dice until you find one that hits.
    for counter in 1..25 loop
        SELECT big_data_id 
        FROM big_data_mapper OFFSET floor(random() * ( 
            select max(id) biggest_value from big_data_mapper 
            )
        ) LIMIT 1 into response;
        if response is not null then
            return response;
        end if;
    end loop;
    return -1;
END;  
$$; 
 
--get a random big_data id in constant time: 
select get_random_big_data_id(); 
 
--Get 1 random row from big_data table in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 1 
); 
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
└─────────┴──────────────────────────────────┘ 

--Get 3 random rows from big_data in constant time: 
select * from big_data where id in ( 
    select get_random_big_data_id() from big_data limit 3 
);
┌─────────┬──────────────────────────────────┐ 
│   id    │            some_data             │ 
├─────────┼──────────────────────────────────┤ 
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │ 
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │ 
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │ 
└─────────┴──────────────────────────────────┘ 

--Test what happens when big_data_mapper stops receiving 
--nightly reindexing.
delete from big_data_mapper where 1=1; 
select get_random_big_data_id();   --It tries 25 times, and returns -1
                                   --which means wait N minutes and try again.

Adapted from: https://www.gab.lc/articles/bigdata_postgresql_order_by_random

Alternatively, if all of the above is too much work:

A simpler, good-enough solution for constant-time random row selection is to add a new column to your big table, big_data.mapper_int, not null and with a unique index. Every night, reset the column with unique integers between 1 and max(n). To get a random row, choose a random integer between 0 and max(id) and return the row whose mapper_int equals it. If there's no row with that value (because the row has changed since the re-index), choose another random number. If a row is added to big_data, populate its mapper_int with max(id) + 1.
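
A sketch of that idea, reusing the big_data demo table from above (the column name mapper_int and the renumbering query are illustrative):

--One-time setup: add and index the mapper column
--(NOT NULL can be added once it is populated).
ALTER TABLE big_data ADD COLUMN mapper_int int;
CREATE UNIQUE INDEX ON big_data (mapper_int);

--Nightly: renumber all rows 1..N in one pass.
UPDATE big_data b
SET    mapper_int = s.rn
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM big_data) s
WHERE  b.id = s.id;

--Pick a random row; the scalar subquery is evaluated once, so the
--unique index on mapper_int can be used. Retry in the application
--if no row comes back (the mapper may be stale).
SELECT *
FROM   big_data
WHERE  mapper_int = (SELECT 1 + floor(random() * max(mapper_int))::int
                     FROM   big_data);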

Alternatively TableSample to the rescue:

If you have PostgreSQL version >= 9.5, then tablesample can do a constant-time random sample without a heavy tablescan. https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

--Select 1 percent of rows from yourtable, 
--display the first 100 rows, order by column a_column
select * from yourtable TABLESAMPLE SYSTEM (1)
order by a_column 
limit 100;

TableSample is doing some stuff behind the scenes that takes some time and I don't like it, but it is faster than order by random(). Good, fast, cheap: choose any two on this job.

Constant time random, keep random picks on hot standby:

You've got a big table with a billion rows, and you want a perfectly random row in constant time. Put this in a background task and run it once every 12 hours:

drop table if exists random_nextpick_bigtable; 
CREATE TABLE IF NOT EXISTS random_nextpick_bigtable as (
   select your_columns from your_bigtable ORDER BY random()
);

It'll take 5 minutes to build the picks during the scheduled task, but after it finishes, a perfectly random row is available in constant time with:

select * from random_nextpick_bigtable limit 1;
delete from random_nextpick_bigtable where id = your_used_id;

In the (non-zero) chance that the id was deleted between the scheduled task and now, delete it and choose the next one. Rows added between scheduled tasks won't be in the random sample.
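
A sketch that pops one pre-picked row in a single statement instead of a separate SELECT plus DELETE (assuming, as above, that your_columns includes the id; if a concurrent session grabs the same row first, nothing is returned and you simply retry):

DELETE FROM random_nextpick_bigtable
WHERE  id = (SELECT id FROM random_nextpick_bigtable LIMIT 1)
RETURNING *;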

Abstain answered 22/1, 2013 at 1:49 Comment(2)
select your_columns from your_table ORDER BY random() limit 1 takes ~2 minutes to execute on 45 million rowsMedusa
is there a way to speed this up?Uncourtly
131

You can examine and compare the execution plan of both by using

EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;

A quick test on a large table1 shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.01 only scans the complete table once.

For large tables this might not be what you want, as even one complete table scan might take too long.

A third proposal would be

select * from table where random() < 0.01 limit 1000;

This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course, this skews the randomness a bit (only rows near the beginning of the scan can ever be returned), but perhaps this is good enough in your case.
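
If the ordering artifact matters, one option (a sketch, writing your_table for the table; it does not remove the bias toward rows found early in the scan) is to oversample with the cheap filter and then shuffle only the small intermediate result:

SELECT *
FROM  (SELECT * FROM your_table WHERE random() < 0.02 LIMIT 2000) sub
ORDER  BY random()
LIMIT  1000;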

Edit: Besides these considerations, you might check out the questions already asked on this topic. Searching for [postgresql] random returns quite a few hits.

And there is an article by depesz outlining several more approaches.


1 "large" as in "the complete table will not fit into the memory".

Earldom answered 29/12, 2011 at 23:51 Comment(3)
Good point about writing the temporary file for doing the ordering. That's a big hit indeed. I guess we could do random() < 0.02 and then shuffle that list, then limit 1000! The sort will be less expensive on a few thousand rows (lol).Aggiornamento
The "select * from table where random() < 0.05 limit 500;" is one of the easier methods for postgresql. We made use of this in one of our projects where we needed to select 5% of the results and no more then 500 rows at a time for processing.Abba
Why in the world would you ever consider an O(n) full scan for retrieving a sample on a 500m row table? It is ridiculously slow on large tables and entirely unnecessary.Myosin
81

Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:

SELECT * FROM mytable TABLESAMPLE SYSTEM (5);

This example will give you 5% of elements from mytable.
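
If the block-level clustering of SYSTEM is a concern and a full scan is acceptable, the BERNOULLI method samples individual rows instead of whole blocks; a sketch with the same 5% fraction:

SELECT * FROM mytable TABLESAMPLE BERNOULLI (5);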

See more explanation on the documentation: http://www.postgresql.org/docs/current/static/sql-select.html

Confiture answered 25/1, 2016 at 10:24 Comment(4)
An important note from the docs: "The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned. The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects."Bagehot
Is there way to specify a number of rows instead of a percentage?Medardas
You can use TABLESAMPLE SYSTEM_ROWS(400) to get a sample of 400 random rows. You need to enable the built-in tsm_system_rows extension to use this statement.Estivation
Works well with 4 billion rows. Other methods take a significant amount of time.Unhand
42

The one with the ORDER BY is going to be the slower one.

select * from table where random() < 0.01; goes record by record, and decides to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.

select * from table order by random() limit 1000; is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).

The downside to the random() < 0.01 one is that you'll get a variable number of output records.
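
For a rough sense of scale (an expectation, not a guarantee): with the question's 500 million rows and a 0.01 threshold, you'd get about 500,000,000 * 0.01 = 5,000,000 rows, with some run-to-run variation around that number.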


Note: there is a better way to shuffle a set of data than sorting by random(): the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite a challenge, though.

Aggiornamento answered 29/12, 2011 at 23:46 Comment(3)
There's no reason you can't add a Limit 1 to the end of your first example though. Only problem is there's a potential that you'd get no records back, so you'd have to consider that in your code.Muniments
The trouble with the Fisher-Yates is that you need to have the whole dataset in memory in order to select from it. Not feasible for very large datasets :(Uncourtly
I would expect postgres to be smart enough to make a select ... order by ... limit k-statement in O(k * N) (iterating the list and inserting into a result heap or list) but I don't actually know if it does.Readiness
25
select * from table order by random() limit 1000;

If you know how many rows you want, check out tsm_system_rows.

The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.

This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.

First install the extension

CREATE EXTENSION tsm_system_rows;

Then your query,

SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
Mclaurin answered 27/12, 2016 at 1:3 Comment(5)
I added a link to your added answer, it's a notable improvement over the built-in SYSTEM method.Aventurine
I have just answered a question here (random single record) during which I performed considerable benchmarking and testing of the tsm_system_rows and tsm_system_time extensions. As far as I can see, they are virtually useless for anything but absolutely minimal selection of random rows. I would be grateful if you could take a quick look and comment on the validity or otherwise of my analysis.Livvy
@evan-carroll To mitigate the clustering effect you could do something like: SELECT * FROM "users" TABLESAMPLE SYSTEM_ROWS(n*10) ORDER BY RANDOM() LIMIT n; where n is the number of rows that you want. This way it will use tsm_system_rows to grab 10 times what you need, order only those rows randomly, and then grab n from those. It's still much faster than SELECT * FROM "users" ORDER BY RANDOM() LIMIT n; because it doesn't order every row.Austine
@ErwinBrandstetter What do you think of my solution to mitigate the clustering effect of using tsm_system_rows extension?Austine
@Austine Generally, if you need to mitigate the effects of clustering, it's probably better not to rely on it at all, but sure, in some use cases that may work fine. In that case, you know it won't work if the corresponding n*10 rows fit in one block (8 kB).Mclaurin
18

Here is a solution that works for me. I guess it's very simple to understand and execute.

SELECT 
  field_1, 
  field_2, 
  random() as ordering
FROM 
  big_table
WHERE 
  some_conditions
ORDER BY
  ordering 
LIMIT 1000;
Antiseptic answered 1/6, 2017 at 11:34 Comment(1)
I think this solution works like ORDER BY random(); it works, but might not be efficient when working with a large table.Swordplay
10

If you want just one row, you can use a calculated offset derived from count.

select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
Design answered 12/9, 2015 at 9:16 Comment(0)
7

I think the best and simplest way in PostgreSQL is:

SELECT * FROM tableName ORDER BY random() LIMIT 1
Lamplighter answered 12/9, 2022 at 10:8 Comment(0)
4

One lesson from my experience:

offset floor(random() * N) limit 1 is not faster than order by random() limit 1.

I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.

Alkali answered 19/6, 2019 at 18:59 Comment(5)
Could you explain why?Booker
Because when you request an offset, it will still compute the rows up to your offset. for instance if you ask for offset 100 limit 20, it will process your query for the first 120 rows, then discard the first 100 and return the last 20. So you save nothing by using offset.Variance
@Variance Where did you read this? Can you provide any links to official sources about it?Congregational
@Congregational I don't have a link to a document, but it's logically obvious. In order to pull the 100th row of a query you need to know what the first 99 are. The only exception might be if your table is indexed in a static order and you do a query with no variation, just a straight-up select with limit/offset. Maybe in that situation postgres can do optimizations to intelligently skip over certain rows, but that's obviously not going to happen in a situation with a random order.Variance
Compared on a table of 5 mln rows. Order by random: 800-1300 ms. Offset by random: 140-360 ms.Blockus
2

A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.

Say, for example, that you don't want duplicates in the randomized values that are returned. An example use case is to generate short codes which can only be used once.

The primary table containing your (non-randomized) set of values must have some expression that determines which rows are "used" and which aren't — here I'll keep it simple by just creating a boolean column with the name used.

Assume this is the input table (additional columns may be added as they do not affect the solution):

id_values:

 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
 ...

Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:

CREATE MATERIALIZED VIEW id_values_randomized AS
  SELECT id
  FROM id_values
  ORDER BY random();

Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.

In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:

UPDATE id_values
SET used = TRUE
WHERE id_values.id IN 
  (SELECT i.id
    FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
    WHERE (NOT i.used)
    LIMIT 1)
RETURNING id;

Change LIMIT as necessary -- if you need multiple random values at a time, change LIMIT to n where n is the number of values needed.

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new id's added to the id_values table need to be made available.
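
For completeness, the refresh itself is a single statement that can go into a scheduled job (a sketch; REFRESH ... CONCURRENTLY would additionally require a unique index on the materialized view):

REFRESH MATERIALIZED VIEW id_values_randomized;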

Alternative answered 13/5, 2014 at 14:33 Comment(1)
Very interesting. Would that work if I need not only to select but also to update, using select ... for update with a pg_try_advisory_xact_lock? (i.e. I need many concurrent reads AND writes)Gallium
0

Add a column called r with type serial. Index r.

Assume we have 200,000 rows. We are going to generate a random number n, where 0 < n <= 200,000.

Select rows with r > n, sort them ASC and select the smallest one.
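
A sketch of the setup described above (names as in this answer; adding a serial column this way is intended to fill existing rows with sequence values):

ALTER TABLE YOUR_TABLE ADD COLUMN r serial;
CREATE INDEX ON YOUR_TABLE (r);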

Code:

select * from YOUR_TABLE 
where r > (
    select (
        select reltuples::bigint AS estimate
        from   pg_class
        where  oid = 'public.YOUR_TABLE'::regclass) * random()
    )
order by r asc limit(1);

The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count; it comes from https://mcmap.net/q/28650/-fast-way-to-discover-the-row-count-of-a-table-in-postgresql .

At the application level, you need to execute the statement again if n is greater than the number of rows, or if you need to select multiple rows.

Aframe answered 11/12, 2015 at 13:21 Comment(2)
I like this because it's short and elegant :) And I even found a way to improve it: EXPLAIN ANALYZE tells me that like this, a PKEY index will not be used because random() returns a double, whereas the PKEY needs a BIGINT.Delineator
select * from YOUR_TABLE where r > ( select ( select reltuples::bigint AS estimate from pg_class where oid = 'public.YOUR_TABLE'::regclass) * random() )::BIGINT order by r asc limit(1);Delineator
0

I know I'm a little late to the party, but I just found this awesome tool called pg_sample:

pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.

I tried this with a 350M-row database and it was really fast; I don't know about the randomness.

./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db
Crosslink answered 8/3, 2018 at 12:43 Comment(0)
