MySQL使用rand获取随机记录的性能优化问题
来源:互联网 发布:陕西网络干部学院 编辑:程序博客网 时间:2024/06/05 17:25
对MySQL rand方法随机获取记录的性能优化问题,讲解很到位的一篇文章:
If you read the MySQL manual you might have seen the ORDER BY RAND() to randomize the the rows and using the LIMIT 1 to just take one of the rows.
SELECT name FROM random ORDER BY RAND() LIMIT 1;
This example works fine and is fast if you only when let’s say 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don’t forget: we only sort to throw nearly all the rows away.
I never liked it. And there are better ways to do it. Without a sorting. As long as we have a numeric primary key.
For the first examples we assume the be ID is starting at 1 and we have no holes between 1 and the maximum value of the ID.
move the work into the application
First idea: We can simplify the whole job if we calculate the ID beforehand in the application.
SELECT MAX(id) FROM random;## generate random id in applicationSELECT name FROM random WHERE id = <random-id>
As MAX(id) == COUNT(id) we just generate random number between 1 and MAX(id) and pass it into the database to retrieve the random row.
The first SELECT is a NO-OP and is optimized away. The second is a eq_ref
against a constant value and also very fast.
move the job into the database
But is it really necessary to do it in the application ? Can’t we do it in the database ?
# generating a random ID> SELECT RAND() * MAX(id) FROM random;+------------------+| RAND() * MAX(id) |+------------------+| 689.37582507297 |+------------------+# oops, this is a double, we need an int> SELECT CEIL(RAND() * MAX(id)) FROM random;+-------------------------+| CEIL(RAND() * MAX(id)) |+-------------------------+| 1000000 |+-------------------------+# better. But how is the performance:> EXPLAIN SELECT CEIL(RAND() * MAX(id)) FROM random;+----+-------------+-------+-------+------+-------------+| id | select_type | table | type | rows | Extra |+----+-------------+-------+-------+------+-------------+| 1 | SIMPLE | random | index | 1000000 | Using index |+----+-------------+-------+-------+------+-------------+## a index scan ? we lost our optimization for the MAX()> EXPLAIN SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));+----+-------------+-------+------+------+------------------------------+| id | select_type | table | type | rows | Extra |+----+-------------+-------+------+------+------------------------------+| 1 | PRIMARY | NULL | NULL | NULL | No tables used || 2 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |+----+-------------+-------+------+------+------------------------------+## a simple Sub-Query is bringing us the performance back.
Ok, now we know how to generate the random ID, but how to get the row ?
> EXPLAINSELECT name FROM random WHERE id = (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+| 1 | PRIMARY | random | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where || 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+> show warnings;+-------+------+------------------------------------------+| Level | Code | Message |+-------+------+------------------------------------------+| Note | 1249 | Select 2 was reduced during optimization |+-------+------+------------------------------------------+
NO, NO, NO. Don’t go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT is fetching. This leads to 0 to 4091 rows, depending on your luck.
We need a way to make sure that the random-id is only generated once:
SELECT name FROM random JOIN (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)) AS id ) AS r2 USING (id);+----+-------------+------------+--------+------+------------------------------+| id | select_type | table | type | rows | Extra |+----+-------------+------------+--------+------+------------------------------+| 1 | PRIMARY | <derived2> | system | 1 | || 1 | PRIMARY | random | const | 1 | || 2 | DERIVED | NULL | NULL | NULL | No tables used || 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |+----+-------------+------------+--------+------+------------------------------+
The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just on single row. Perfect.
No Sorting, No Application, Most parts of the query optimized away.
adding holes to the numbers
To generalize the last solution we add the possibility of holes, like when you DELETE
rows.
SELECT name FROM random AS r1 JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2 WHERE r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1;+----+-------------+------------+--------+------+------------------------------+| id | select_type | table | type | rows | Extra |+----+-------------+------------+--------+------+------------------------------+| 1 | PRIMARY | <derived2> | system | 1 | || 1 | PRIMARY | r1 | range | 689 | Using where || 2 | DERIVED | NULL | NULL | NULL | No tables used || 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |+----+-------------+------------+--------+------+------------------------------+
The JOIN now adds all the IDs which are greater or equal than our random value and selects only the direct neighboor if a direct match is not possible. BUT as soon as one row is found, we stop (theLIMIT 1
). And we read the rows according to the index (ORDER BY id ASC
). As we are using >=
instead of a =
we can get rid of a the CEIL
and get the same result with a little less work.
Equal Distribution
As soon as the distribution of the IDs is not equal anymore our selection of rows isn’t really random either.
> select * from holes;+----+----------------------------------+----------+| id | name | accesses |+----+----------------------------------+----------+| 1 | d12b2551c6cb7d7a64e40221569a8571 | 107 || 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 50 || 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 132 || 8 | 677a196206d93cdf18c3744905b94f73 | 230 || 16 | b7556d8ed40587a33dc5c449ae0345aa | 481 |+----+----------------------------------+----------+
The RAND
function is generating IDs like 9 to 15 which all lead to the id 16 to be selected as the next higher number.
There is no real solution for this problem, but your data is mostly constant you can add a mapping table which maps the row-number to the id:
> create table holes_map ( row_id int not NULL primary key, random_id int not null);> SET @id = 0;> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;> select * from holes_map;+--------+-----------+| row_id | random_id |+--------+-----------+| 1 | 1 || 2 | 2 || 3 | 4 || 4 | 8 || 5 | 16 |+--------+-----------+
The row_id
is now free of holes again and we can run our random query again:
SELECT name FROM holes JOIN (SELECT r1.random_id FROM holes_map AS r1 JOIN (SELECT (RAND() * (SELECT MAX(row_id) FROM holes_map)) AS row_id) AS r2 WHERE r1.row_id >= r2.row_id ORDER BY r1.row_id ASC LIMIT 1) as rows ON (id = random_id);
After 1000 fetches we see a equal distribution again:
> select * from holes;+----+----------------------------------+----------+| id | name | accesses |+----+----------------------------------+----------+| 1 | d12b2551c6cb7d7a64e40221569a8571 | 222 || 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 187 || 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 195 || 8 | 677a196206d93cdf18c3744905b94f73 | 207 || 16 | b7556d8ed40587a33dc5c449ae0345aa | 189 |+----+----------------------------------+----------+
Maintaining the holes
Let’s take the tables as before:
DROP TABLE IF EXISTS r2;CREATE TABLE r2 ( id SERIAL, name VARCHAR(32) NOT NULL UNIQUE);DROP TABLE IF EXISTS r2_equi_dist;CREATE TABLE r2_equi_dist ( id SERIAL, r2_id bigint unsigned NOT NULL UNIQUE);
When ever we change something in r2 we want to that r2_equi_dist is updated too.
DELIMITER $$DROP TRIGGER IF EXISTS tai_r2$$CREATE TRIGGER tai_r2 AFTER INSERT ON r2 FOR EACH ROWBEGIN DECLARE m BIGINT UNSIGNED DEFAULT 1; SELECT MAX(id) + 1 FROM r2_equi_dist INTO m; SELECT IFNULL(m, 1) INTO m; INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);END$$DELIMITER ;DELETE FROM r2;INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );SELECT * FROM r2;+----+----------------------------------+| id | name |+----+----------------------------------+| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 || 2 | a09a3959d68187ce48f4fe7e388926a9 || 3 | 4e1897cd6d326f8079108292376fa7d5 || 4 | 29a5e3ed838db497aa330878920ec01b |+----+----------------------------------+SELECT * FROM r2_equi_dist;+----+-------+| id | r2_id |+----+-------+| 1 | 1 || 2 | 2 || 3 | 3 || 4 | 4 |+----+-------+
INSERT is quite simple, on DELETE we have to update the equi-dist-id to maintain the hole-free setup:
DELIMITER $$DROP TRIGGER IF EXISTS tad_r2$$CREATE TRIGGER tad_r2 AFTER DELETE ON r2 FOR EACH ROWBEGIN DELETE FROM r2_equi_dist WHERE r2_id = OLD.id; UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;END$$DELIMITER ;DELETE FROM r2 WHERE id = 2;SELECT * FROM r2;+----+----------------------------------+| id | name |+----+----------------------------------+| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 || 3 | 4e1897cd6d326f8079108292376fa7d5 || 4 | 29a5e3ed838db497aa330878920ec01b |+----+----------------------------------+SELECT * FROM r2_equi_dist;+----+-------+| id | r2_id |+----+-------+| 1 | 1 || 2 | 3 || 3 | 4 |+----+-------+
UPDATE is straight-forward again. We only have to maintain the Foreign Key constraint:
DELIMITER $$DROP TRIGGER IF EXISTS tau_r2$$CREATE TRIGGER tau_r2 AFTER UPDATE ON r2 FOR EACH ROWBEGIN UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;END$$DELIMITER ;UPDATE r2 SET id = 25 WHERE id = 4;SELECT * FROM r2;+----+----------------------------------+| id | name |+----+----------------------------------+| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 || 3 | 4e1897cd6d326f8079108292376fa7d5 || 25 | 29a5e3ed838db497aa330878920ec01b |+----+----------------------------------+SELECT * FROM r2_equi_dist;+----+-------+| id | r2_id |+----+-------+| 1 | 1 || 2 | 3 || 3 | 25 |+----+-------+
Multiple Rows at once
If you want to get more than one row returned, you can:
- execute the Query several times
- write a stored procedure which is executing the query and stores the result in a temp-table
- make a UNION
a stored procedure
Stored procedures provide you with the structures you know from your favourite programming language:
- loops
- control structures
- procedures * …
For this task we only need a LOOP:
DELIMITER $$DROP PROCEDURE IF EXISTS get_rands$$CREATE PROCEDURE get_rands(IN cnt INT)BEGIN DROP TEMPORARY TABLE IF EXISTS rands; CREATE TEMPORARY TABLE rands ( rand_id INT );loop_me: LOOP IF cnt < 1 THEN LEAVE loop_me; END IF; INSERT INTO rands SELECT r1.id FROM random AS r1 JOIN (SELECT (RAND() * (SELECT MAX(id) FROM random)) AS id) AS r2 WHERE r1.id >= r2.id ORDER BY r1.id ASC LIMIT 1; SET cnt = cnt - 1; END LOOP loop_me;END$$DELIMITER ;CALL get_rands(4);SELECT * FROM rands;+---------+| rand_id |+---------+| 133716 || 702643 || 112066 || 452400 |+---------+
I leave to the reader to fix the issues:
- use dynamic SQL and pass in the name of the temporary table
- use a UNIQUE index on the table and catch the UNIQUE key violation to remove the possible duplicates in the result-set
Performance
Now let’s see what happends to our performance. We have 3 different queries for solving our problems.
- Q1. ORDER BY RAND()
- Q2. RAND() * MAX(ID)
- Q3. RAND() * MAX(ID) + ORDER BY ID
Q1 is expected to cost N * log2(N), Q2 and Q3 are nearly constant.
The get real values we filled the table with N rows ( one thousand to one million) and executed each query 1000 times.
100 1.000 10.000 100.000 1.000.000Q1 0:00.718s 0:02.092s 0:18.684s 2:59.081s 58:20.000sQ2 0:00.519s 0:00.607s 0:00.614s 0:00.628s 0:00.637sQ3 0:00.570s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
As you can see the plain ORDER BY RAND() is already behind the optimized query at only 100 rows in the table.
A more detailed analysis of those queries is at analyzing-complex-queries.
原文链接:
http://jan.kneschke.de/projects/mysql/order-by-rand/
- MySQL使用rand获取随机记录的性能优化问题
- mysql 随机获取记录 order by rand 优化
- mysql使用rand随机查询记录的高效率方法
- mysql使用rand随机查询记录的高效率方法
- MySql 100万级别数据中随机获取一条或多条记录之RAND()优化
- MySQL随机查询rand()优化
- MySQL随机查询rand()优化
- MySQL随机查询rand()优化
- mysql使用rand随机查询记录效率测试
- mysql使用rand随机查询记录效率测试
- mysql使用rand随机查询记录效率测试
- mysql 使用rand随机查询记录效率测试
- mysql使用rand随机查询记录效率测试 【转】
- php MySQL使用rand函数随机取记录
- (摘)mysql使用rand随机查询记录效率测试
- MySQL数据库中如何使用rand随机查询记录
- MySQL使用rand随机查询记录效率测试
- MySQL数据库中如何使用rand随机查询记录
- MAC 10.6及以上安装PHP不能显示phpinfo的问题及安装过程
- java-构造函数和静态初始化
- USB HID 下位机开发笔记
- hdu-4217Data Structure?
- 40多个关于人脸检测/识别的API、库和软件
- MySQL使用rand获取随机记录的性能优化问题
- ATM机基本操作代码
- 内存碎片导致应用不断被杀的问题
- POJ 2074 Line of Sight
- 详细解析Linux scp命令的应用
- Hbase Replication 介绍
- poj-3468-A Simple Problem with Integers
- 割点&割边模板
- 学生信息管理系统之深度理解