quick selection of a random row from a large table in mysql
Asked Answered
G

24

52

What is a fast way to select a random row from a large mysql table?

I'm working in php, but I'm interested in any solution even if it's in another language.

Giraud answered 17/10, 2008 at 7:38 Comment(1)
possible duplicate of MySQL select 10 random rows from 600K rows fastScript
M
54

Grab all the id's, pick a random one from it, and retrieve the full row.

If you know the id's are sequential without holes, you can just grab the max and calculate a random id.

If there are holes here and there but mostly sequential values, and you don't care about a slightly skewed randomness, grab the max value, calculate an id, and select the first row with an id equal to or above the one you calculated. The reason for the skewing is that id's following such holes will have a higher chance of being picked than ones that follow another id.

If you order by random, you're going to have a terrible table-scan on your hands, and the word quick doesn't apply to such a solution.

Don't do that, nor should you order by a GUID, it has the same problem.

Monarch answered 17/10, 2008 at 7:42 Comment(1)
As of now, ORDER BY RAND() only takes a few seconds for a table that has a million records.Baiel
U
40

I knew there had to be a way to do it in a single query in a fast way. And here it is:

A fast way without involvement of external code, kudos to

http://jan.kneschke.de/projects/mysql/order-by-rand/

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
Urdar answered 17/10, 2008 at 8:19 Comment(3)
Note the tradeoff here in that, to be assured of getting a result on the first try, any keys which are preceded by gaps will be more likely to be selected. e.g., Given two records with keys 1 and 10, the record with 10 as its key will be selected 90% of the time.Amatory
Yes, you can get a better distribution if the keys are without gaps and avoiding the WHERE and ORDER BY clauses. Check the article, it's all pretty well explained there. I didn't want to steal all of it, thus didn't put the other queries, pros and cons of each.Urdar
This query somehow does not return data at some time when you specify some extra parameter like WHERE r1.id >= r2.id AND r1.some_field=1 while some_field contains data=1. Any idea about how to solve this?Cavill
S
31

MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)

This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).

Swahili answered 18/10, 2008 at 4:39 Comment(9)
This is a beautiful idea. Is there an article or other resource detailing this?Anjaanjali
its nice idea but for N desired results may not work i guess.Because you might get less results or order might be tha same.Duvall
It's a nice idea. But on the query we still have to sort by the random column, right? Suppose the random column is random_number, then the query is like: "SELECT * FROM mytable WHERE random_number>$rand ORDER BY random_number LIMIT 1". Is it much faster than ORDER BY RAND()?Roxyroy
You'd need to place a degree of limitation on the maximum your random number with regard to the current number of entries. Then progress this limit with a degree of correlation to the number of rows in the table as it grows. Example is when there are not many entries. Say you have 3. Without a limit on the random number you can have say 2 very small number and one large one. The smallest of the 3 will almost never be called up when the gap between the min, itself, and the middle number is so small. What if min=0, max=100 with 3 entries & rand #'s assigned was 49, 50, 51?Anywhere
I don't understand this. How is this different from just randomizing a number between 1 and max(id) and pick the entry with that ID? Why do you need an extra column?Conscription
This is a really great idea, love it!Manufactory
Create an index on the column containing the random number. Then SELECT id FROM table WHERE randNumb >= RAND() ORDER BY randNumb LIMIT 5 will be index-accelerated. If you're using MyISAM you may need a compound index on (randNumb, id).Carper
er. this is extremely biased. consider the initial set with three pages: one random ended up at 0.1, another at 0.3, and the last at 0.9. this means that one has a 0.1 chance; one has a 0.2 chance; one has a 0.6 chance; and the last 0.1 chance is distributed however it handles the top-line loss. best case it's 6x distorted; worst it's 7x. "but it'll even out over time!" a look at the index shows that after almost two decades as one of the largest sites on earth, it hasn't. it actually gets worse over time as near-matches shave other pages out of existence. terrible ideaSooth
This will give you rows according to some non-uniform distribution which itself has been sampled uniformly from the set of all possible distributions. That is, if you fetch a single article, from your perspective, it will be completely random (even if you add some WHERE condition, which is not possible with the rand()*max(id) trick). If you fetch articles multiple times, their cross-correlations will be nothing like the cross-correlation from two unformly random picks (e.g. the chance of getting the same article twice will be much larger). Depends on your use case whether that is a problem.Lekishalela
S
14

Here's a solution that runs fairly quickly, and it gets a better random distribution without depending on id values being contiguous or starting at 1.

SET @r := (SELECT ROUND(RAND() * (SELECT COUNT(*) FROM mytable)));
SET @sql := CONCAT('SELECT * FROM mytable LIMIT ', @r, ', 1');
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
Solanaceous answered 17/10, 2008 at 18:16 Comment(3)
How do you get the row returned by this SQL query using PHP? Setting $query equal to the above and then doing the usual mysql_query($query) is not returning any results. Thanks.Feminacy
That's 1.5 table scans -- 1 for the COUNT(*) (assuming InnoDB), something less than a full scan for the OFFSET @r. But it is excellent at being random and not depending on the properties of an id.Turk
@RickJames, Right. Another solution would be to enumerate the rows with a new column that is filled with serial integers. Then one can get the greatest with MAX() instead of COUNT(), and then choose it by index without coping with gaps. Though that solution requires renumbering as rows come and go.Solanaceous
O
4

Maybe you could do something like:

SELECT * FROM table 
  WHERE id=
    (FLOOR(RAND() * 
           (SELECT COUNT(*) FROM table)
          )
    );

This is assuming your ID numbers are all sequential with no gaps.

Oedema answered 26/9, 2008 at 22:15 Comment(3)
Actually you may want CEIL instead of FLOOR, depends if your ID's start at 0 or 1Oedema
That assumes that the expression is cached and not recalculated for every row.Deafmute
There are gaps in the primary key, as some rows get deleted.Strobotron
B
3

Add a column containing a calculated random value to each row, and use that in the ordering clause, limiting to one result upon selection. This works out faster than having the table scan that ORDER BY RANDOM() causes.

Update: You still need to calculate some random value prior to issuing the SELECT statement upon retrieval, of course, e.g.

SELECT * FROM `foo` WHERE `foo_rand` >= {some random value} LIMIT 1
Blacken answered 26/9, 2008 at 22:17 Comment(4)
I thought about that. Add a new indexed column and on row creation, assign a random int to it. But the problem with that is I'm storing unnecessary data and you would still have to do something else to actually get a random row out of it, since the random column data is static.Strobotron
How come this is -2, yet Cesar B's one is +17? They seem pretty much the same to me.Ulric
Should it be "SELECT * FROM foo WHERE foo_rand >= {some random value} ORDER BY foo_rand LIMIT 1"?Roxyroy
What if your {some random value} is greater than the highest pre-generated random number in the table though. You'll return an empty recordset.Helios
T
1

For selecting multiple random rows from a given table (say 'words'), our team came up with this beauty:

SELECT * FROM
`words` AS r1 JOIN 
(SELECT  MAX(`WordID`) as wid_c FROM `words`) as tmp1
WHERE r1.WordID >= (SELECT (RAND() * tmp1.wid_c) AS id) LIMIT n
Toile answered 23/4, 2009 at 9:10 Comment(0)
A
1

if you don't delete row in this table, the most efficient way is:

(if you know the mininum id just skip it)

SELECT MIN(id) AS minId, MAX(id) AS maxId FROM table WHERE 1

$randId=mt_rand((int)$row['minId'], (int)$row['maxId']);

SELECT id,name,... FROM table WHERE id=$randId LIMIT 1
Arrear answered 31/5, 2010 at 20:53 Comment(0)
C
1

In order to find random rows from a table, don’t use ORDER BY RAND() because it forces MySQL to do a full file sort and only then to retrieve the limit rows number required. In order to avoid this full file sort, use the RAND() function only at the where clause. It will stop as soon as it reaches to the required number of rows. See http://www.rndblog.com/how-to-select-random-rows-in-mysql/

Charr answered 25/1, 2011 at 13:50 Comment(0)
D
1

I see here a lot of solution. One or two seems ok but other solutions have some constraints. But the following solution will work for all situation

select a.* from random_data a, (select max(id)*rand() randid  from random_data) b
     where a.id >= b.randid limit 1;

Here, id, don't need to be sequential. It could be any primary key/unique/auto increment column. Please see the following Fastest way to select a random row from a big MySQL table

Thanks Zillur - www.techinfobest.com

Deina answered 22/2, 2014 at 9:22 Comment(0)
S
0

The classic "SELECT id FROM table ORDER BY RAND() LIMIT 1" is actually OK.

See the follow excerpt from the MySQL manual:

If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result.

Sewerage answered 27/9, 2008 at 13:12 Comment(2)
But it still has to assign a random number to each and every record, doesn't it? I ask because that explanation doesn't make much sense to me: how it is going to return first N sorted rows if the whole resultset is not sorted :SLafond
@igelkott, there's still performance issue, I guess it's not OKCopenhaver
U
0

An easy but slow way would be (good for smallish tables)

SELECT * from TABLE order by RAND() LIMIT 1
Urdar answered 17/10, 2008 at 7:39 Comment(4)
This will produce a random value for all the rows in the table, a sort, and then grabbing one row. This is not quick.Monarch
True. It's quick in development time though. (and in answer time :-) ). I'll leave it here for non big table users who might need itUrdar
"smallish" can be surprisingly small (I've run into problems with a 20k entry table on a virtual host), and tracking this kind of problem down can be a royal pain in the back. Do yourself a favour and use a proper algorithm from the start.Pussyfoot
This is going to cause a big performance drain for large tables. Check this similar question #1245055Mont
B
0

With a order yo will do a full scan table. Its best if you do a select count(*) and later get a random row=rownum between 0 and the last registry

Bravin answered 17/10, 2008 at 7:44 Comment(0)
B
0

In pseudo code:

sql "select id from table"
store result in list
n = random(size of list)
sql "select * from table where id=" + list[n]

This assumes that id is a unique (primary) key.

Berkeleian answered 17/10, 2008 at 7:53 Comment(2)
If the IDs don't change frequently you can even keep the list of IDs in memory to make things faster.Berkeleian
What if there are a billion rows? That means your list variable is huge.Solanaceous
N
0

I'm a bit new to SQL but how about generating a random number in PHP and using

SELECT * FROM the_table WHERE primary_key >= $randNr

this doesn't solve the problem with holes in the table.

But here's a twist on lassevks suggestion:

SELECT primary_key FROM the_table

Use mysql_num_rows() in PHP create a random number based on the above result:

SELECT * FROM the_table WHERE primary_key = rand_number

On a side note just how slow is SELECT * FROM the_table:
Creating a random number based on mysql_num_rows() and then moving the data pointer to that point mysql_data_seek(). Just how slow will this be on large tables with say a million rows?

Novgorod answered 19/12, 2008 at 22:27 Comment(0)
H
0

Take a look at this link by Jan Kneschke or this SO answer as they both discuss the same question. The SO answer goes over various options also and has some good suggestions depending on your needs. Jan goes over all the various options and the performance characteristics of each. He ends up with the following for the most optimized method by which to do this within a MySQL select:

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;

HTH,

-Dipin

Hubert answered 20/3, 2009 at 22:7 Comment(0)
D
0

There is another way to produce random rows using only a query and without order by rand(). It involves User Defined Variables. See how to produce random rows from a table

Depurative answered 30/11, 2009 at 21:1 Comment(0)
D
0

I ran into the problem where my IDs were not sequential. What I came up with this.

SELECT * FROM products WHERE RAND()<=(5/(SELECT COUNT(*) FROM products)) LIMIT 1

The rows returned are approximately 5, but I limit it to 1.

If you want to add another WHERE clause it becomes a bit more interesting. Say you want to search for products on discount.

SELECT * FROM products WHERE RAND()<=(100/(SELECT COUNT(*) FROM pt_products)) AND discount<.2 LIMIT 1

What you have to do is make sure you are returning enough result which is why I have it set to 100. Having a WHERE discount<.2 clause in the subquery was 10x slower, so it's better to return more results and limit.

Design answered 30/5, 2012 at 22:37 Comment(0)
V
0

Use the below query to get the random row

SELECT user_firstname ,
COUNT(DISTINCT usr_fk_id) cnt
FROM userdetails 
GROUP BY usr_fk_id 
ORDER BY cnt ASC  
LIMIT 1
Valladolid answered 24/2, 2015 at 6:17 Comment(0)
A
0

In my case my table has an id as primary key, auto-increment with no gaps, so I can use COUNT(*) or MAX(id) to get the number of rows.

I made this script to test the fastest operation:

logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();

The results are:

  • Count: 36.8418693542479 ms
  • Max: 0.241041183472 ms
  • Order: 0.216960906982 ms

Answer with the order method:

SELECT FLOOR(RAND() * (
    SELECT id FROM tbl ORDER BY id DESC LIMIT 1
)) n FROM tbl LIMIT 1

...
SELECT * FROM tbl WHERE id = $result;
Alina answered 15/5, 2015 at 14:27 Comment(0)
I
0

I have used this and the job was done the reference from here

SELECT * FROM myTable WHERE RAND()<(SELECT ((30/COUNT(*))*10) FROM myTable) ORDER BY RAND() LIMIT 30;
Insomnolence answered 12/3, 2016 at 16:36 Comment(0)
R
0

Create a Function to do this most likely the best answer and most fastest answer here!

Pros - Works even with Gaps and extremely fast.

<?

$sqlConnect = mysqli_connect('localhost','username','password','database');

function rando($data,$find,$max = '0'){
   global $sqlConnect; // Set as mysqli connection variable, fetches variable outside of function set as GLOBAL
   if($data == 's1'){
     $query = mysqli_query($sqlConnect, "SELECT * FROM `yourtable` ORDER BY `id` DESC LIMIT {$find},1");

     $fetched_data = mysqli_fetch_assoc($query);
      if(mysqli_num_rows($fetched_data>0){
       return $fetch_$data;
      }else{
       rando('','',$max); // Start Over the results returned nothing
      }
   }else{
     if($max != '0'){
        $irand = rand(0,$max); 
        rando('s1',$irand,$max); // Start rando with new random ID to fetch
     }else{

        $query = mysqli_query($sqlConnect, "SELECT `id` FROM `yourtable` ORDER BY `id` DESC LIMIT 0,1");
        $fetched_data = mysqli_fetch_assoc($query);
        $max = $fetched_data['id'];
        $irand = rand(1,$max);
        rando('s1',$irand,$max); // Runs rando against the random ID we have selected if data exist will return
     }
   }
 }

 $your_data = rando(); // Returns listing data for a random entry as a ASSOC ARRAY
?>

Please keep in mind this code as not been tested but is a working concept to return random entries even with gaps.. As long as the gaps are not huge enough to cause a load time issue.

Renoir answered 29/3, 2017 at 18:58 Comment(0)
P
-1

Quick and dirty method:

SET @COUNTER=SELECT COUNT(*) FROM your_table;

SELECT PrimaryKey
FROM your_table
LIMIT 1 OFFSET (RAND() * @COUNTER);

The complexity of the first query is O(1) for MyISAM tables.

The second query accompanies a table full scan. Complexity = O(n)

Dirty and quick method:

Keep a separate table for this purpose only. You should also insert the same rows to this table whenever inserting to the original table. Assumption: No DELETEs.

CREATE TABLE Aux(
  MyPK INT AUTO_INCREMENT,
  PrimaryKey INT
);

SET @MaxPK = (SELECT MAX(MyPK) FROM Aux);
SET @RandPK = CAST(RANDOM() * @MaxPK, INT)
SET @PrimaryKey = (SELECT PrimaryKey FROM Aux WHERE MyPK = @RandPK);

If DELETEs are allowed,

SET @delta = CAST(@RandPK/10, INT);

SET @PrimaryKey = (SELECT PrimaryKey
                   FROM Aux
                   WHERE MyPK BETWEEN @RandPK - @delta AND @RandPK + @delta
                   LIMIT 1);

The overall complexity is O(1).

Pileate answered 18/10, 2008 at 5:18 Comment(0)
C
-2

SELECT DISTINCT * FROM yourTable WHERE 4 = 4 LIMIT 1;

Chow answered 22/1, 2014 at 2:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.