如何快速更新数据库中的百万条数据

来源:互联网 发布:淘宝扣12分有什么影响 编辑:程序博客网 时间:2024/05/20 07:17

环境:ubuntu 14.04,Mysql 5.6,python 2.7。

本文使用python的MySQLdb这个库,MySQL Python tutorial这个教程很赞。

MySQLDBdb处理INSERT、UPDATE、DELETE时需要显式得commit()才能使更新有效,这也意味着commit之前的更新语句是放在一个事务中的。

For databases that support transactions, the Python interface silently starts a transaction when the cursor is created. The commit() method commits the updates made using that cursor, and the rollback() method discards them. Each method starts a new transaction.

对比如下:

更新方式总时间(ms)rows平均(ms)157831100057.82896951089639294140×100000.2344795640×100000.11952840740×100000.0716327248×100000.008

方式1是单线程,where使用了索引,一个更新对应一个commit。
方式2是单线程,where没有使用索引,所有更新对应一个commit。
方式3是单线程,where使用了索引,所有更新对应一个commit。
方式4使用了2个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式5使用了4个进程来更新,where使用了索引,每个进程里的所有更新对应一个commit。
方式6是在一个update语句里面更新所有的数据。

建立表

mysql> CREATE TABLE `test`.`number` (  `id` INT NOT NULL COMMENT '',  `num` INT NULL COMMENT '',  PRIMARY KEY (`id`)  COMMENT '');mysql> show create table number \G*************************** 1. row ***************************       Table: numberCreate Table: CREATE TABLE `number` (  `id` int(11) NOT NULL,  `num` int(11) DEFAULT NULL,  PRIMARY KEY (`id`)) ENGINE=InnoDB DEFAULT CHARSET=latin11 row in set (0.00 sec)

使用mysqldb插入数据(100*000条)

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timehost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()conn = MySQLdb.connect(host, user, passwd, db, port)cur = conn.cursor()for x in xrange(100*10000): # 0 到 999999    print x    sql = "INSERT INTO number VALUES(%s,%s)" % (x, x)    cur.execute(sql)conn.commit()end_time = time.time()print '用时:', end_time - start_time

运行结果:

...999997999998999999用时:219.332565069

挺快。

看下数据:

mysql> select count(*) from number;+----------+| count(*) |+----------+|  1000000 |+----------+1 row in set (0.60 sec)mysql> select * from number limit 3;+----+------+| id | num  |+----+------+|  0 |    0 ||  1 |    1 ||  2 |    2 |+----+------+3 rows in set (0.00 sec)

更新方式1:单线程,使用索引,多个commmit,更新1000条

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timehost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()conn = MySQLdb.connect(host, user, passwd, db, port)cur = conn.cursor()for x in xrange(1000):    print x    sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)    cur.execute(sql)    conn.commit()cur.close()conn.close()end_time = time.time()print '用时:', end_time - start_time

输出:

...997998999用时:57.8318688869

1000条用时约60秒。

为什么慢?

先explain一下:

mysql> explain update number set num=3 where id=2 \G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: number         type: rangepossible_keys: PRIMARY          key: PRIMARY      key_len: 4          ref: const         rows: 1        Extra: Using where1 row in set (0.06 sec)

sql语句没什么问题。

再看一下磁盘IO:

letian $ iostat -d -k 1 100Linux 3.13.0-37-generic (myhost)  06/22/2015  _x86_64_  (4 CPU)............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              57.00         4.00       468.00          4        468Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              71.00        28.00       404.00         28        404Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              71.00        28.00       608.00         28        608Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              61.00         4.00       492.00          4        492............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              73.00         0.00       536.00          0        536Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              69.00         4.00       516.00          4        516............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              61.00         4.00       472.00          4        472Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              63.00         4.00       452.00          4        452............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              16.00         0.00      1804.00          0       1804Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               3.00         0.00        16.00          0         16Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               0.00         0.00         0.00          0          0Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               2.00         0.00        12.00          0         12

tps代表每秒IO传输的次数,与下面的更新方式4对比一下,可以看出当前的更新方式IO较多。(为什么呢?移步Why does TRANSACTION / COMMIT improve performance so much with PHP/MySQL (InnoDB)?)

更新方式2,单线程,不使用索引,一个commit,更新100条

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timehost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()conn = MySQLdb.connect(host, user, passwd, db, port)cur = conn.cursor()for x in xrange(100):    print x    sql = "UPDATE number SET num=%s where num=%s" % (x*2, x)    cur.execute(sql)conn.commit()cur.close()conn.close()end_time = time.time()print '用时:', end_time - start_time

输出:

...979899用时: 89.695207119

为什么这么慢:

mysql> explain update number set num=4 where num=2 \G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: number         type: indexpossible_keys: NULL          key: PRIMARY      key_len: 4          ref: NULL         rows: 979800        Extra: Using where1 row in set (0.05 sec)

可以看到,每次update几乎要扫全表(rows: 979800)。

更新方式3:单线程,使用索引,一个commit,更新40×10000条

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timehost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()conn = MySQLdb.connect(host, user, passwd, db, port)cur = conn.cursor()for x in xrange(40*10000): # 0 到 999999    print x    sql = "UPDATE number SET num=%s where id=%s" % (x+1, x)    cur.execute(sql)conn.commit()end_time = time.time()print '用时:', end_time - start_time

输出:

...399997399998399999用时:92.9413559437

查看磁盘IO:

letian $ iostat -d -k 1 100Linux 3.13.0-37-generic (myhost)  06/22/2015  _x86_64_  (4 CPU)Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              20.88       288.50       283.18    2458129    2412772Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              18.00       116.00       328.00        116        328Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              17.00         8.00       548.00          8        548Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              11.00        12.00       396.00         12        396Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       488.00          4        488............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       460.00          4        460Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       448.00          4        448Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       464.00          4        464Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       456.00          4        456Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               9.00         4.00       452.00          4        452Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               9.00         0.00       140.00          0        140Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              15.00         4.00      2448.00          4       2448Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              43.00       144.00      2544.00        144       2544............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               9.00         0.00       516.00          0        516Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       512.00          4        512Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              12.00         4.00       540.00          4        540Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              14.00         4.00      2288.00          4       2288Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              10.00         4.00       528.00          4        528Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               6.00         0.00       184.00          0        184............Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              14.00         0.00      3868.00          0       3868Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              15.00         0.00      2604.00          0       2604Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              15.00         0.00      1516.00          0       1516Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               3.00         0.00        16.00          0         16Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda              12.00         0.00        68.00          0         68Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               4.00         0.00        28.00          0         28Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               0.00         0.00         0.00          0          0Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               0.00         0.00         0.00          0          0Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtnsda               0.00         0.00         0.00          0          0

更新方式4:2个进程,一个进程里一个commit,共更新40*10000条

两个进程:

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timeimport multiprocessinghost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()def update(int_iter):    conn = MySQLdb.connect(host, user, passwd, db, port)    cur = conn.cursor()    for x in int_iter:        print x        sql = "UPDATE number SET num=%s where id=%s" % (x+5, x)        cur.execute(sql)    conn.commit()    cur.close()    conn.close()pool = multiprocessing.Pool(processes = 2)pool.apply_async(update, (xrange(20*10000), ))pool.apply_async(update, (xrange(20*10000, 40*10000), ))pool.close()pool.join()end_time = time.time()print '用时:', end_time - start_time

输出:

...199998199999用时: 47.9561839104

看下数据:

mysql> select * from number where id=1;+----+------+| id | num  |+----+------+|  1 |    6 |+----+------+1 row in set (0.00 sec)mysql> select * from number where id=399999;+--------+--------+| id     | num    |+--------+--------+| 399999 | 400004 |+--------+--------+1 row in set (0.00 sec)

更新方式5:4个进程,一个进程里一个commit,共更新40*10000条

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timeimport multiprocessinghost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()def update(int_iter):    conn = MySQLdb.connect(host, user, passwd, db, port)    cur = conn.cursor()    for x in int_iter:        print x        sql = "UPDATE number SET num=%s where id=%s" % (x+10, x)        cur.execute(sql)    conn.commit()    cur.close()    conn.close()pool = multiprocessing.Pool(processes = 4)pool.apply_async(update, (xrange(10*10000), ))pool.apply_async(update, (xrange(10*10000, 20*10000), ))pool.apply_async(update, (xrange(20*10000, 30*10000), ))pool.apply_async(update, (xrange(30*10000, 40*10000), ))pool.close()pool.join()end_time = time.time()print '用时:', end_time - start_time

运行结果:

299996299997299998299999用时: 28.4070010185

查看更新效果:

mysql> select * from number where id=1;+----+------+| id | num  |+----+------+|  1 |   11 |+----+------+1 row in set (0.00 sec)mysql> select * from number where id=399999;+--------+--------+| id     | num    |+--------+--------+| 399999 | 400009 |+--------+--------+1 row in set (0.00 sec)

更新方式6:一次更新多行,更新40*10000条

# coding: UTF-8__author__ = 'letian'import MySQLdbimport timehost='127.0.0.1'user='root'passwd='123456'db='test'port = 3306start_time = time.time()conn = MySQLdb.connect(host, user, passwd, db, port)cur = conn.cursor()sql = "UPDATE number SET num=id+100 where id<%s" % (40*10000, )cur.execute(sql)conn.commit()cur.close()conn.close()end_time = time.time()print '用时:', end_time - start_time

输出:

用时: 3.27281808853

查看更新效果:

mysql> select * from number where id=1;+----+------+| id | num  |+----+------+|  1 |  101 |+----+------+1 row in set (0.00 sec)mysql> select * from number where id=399999;+--------+--------+| id     | num    |+--------+--------+| 399999 | 400099 |+--------+--------+1 row in set (0.00 sec)

分析使用的sql语句:

mysql> explain UPDATE number SET num=id+100 where id<400000 \G*************************** 1. row ***************************           id: 1  select_type: SIMPLE        table: number         type: rangepossible_keys: PRIMARY          key: PRIMARY      key_len: 4          ref: const         rows: 489900        Extra: Using where1 row in set (0.00 sec)

这种方式非常快,灵活性不够。

如果遇到这样一个场景:将id为0、1、2的num置为2,其他全部置为4,可以先全部置为4,然后在更新id为0、1、2的记录,这样速度会很快。

0 0
原创粉丝点击