再论mysql线程出现Waiting for table flush被hang住问题

来源:互联网 发布:noteshelf软件功能 编辑:程序博客网 时间:2024/06/07 09:55


一.问题描述:

版本:mariaDB 10.0.2
我们在http://blog.csdn.net/zengxuewen2045/article/details/51412663一文中曾经提到过一条sql搞挂mysql实例的案例,当时分析没有找到kill操作的原因。
这次问题再次遇到,进一步分析问题产生原因,解决方法如此相似。
show processlist\G

跑最长时间线程处于killed状态,一直未释放

*************************** 6. row ***************************
      Id: 5449173
    User: lotsprd
    Host: 10.16.26.199:51821
      db: lots
 Command: Killed
    Time: 84026
   State: Sending data
    Info: SELECT DISTINCT
                        o.CUSTOMER_NAME,
                        o.WAREHOUSE_NAME,
                        o.CUSTOMER_ORDER_NO,
                        o.ANNTO_ORDER_NO
Progress: 0.000

很多正在跑sql的线程都处于Waiting for table flush状态
*************************** 10. row ***************************
      Id: 5557272
    User: lotsprd
    Host: 172.16.12.155:54883
      db: lots
 Command: Query
    Time: 70
   State: Waiting for table flush
    Info: SELECT
                        o.ANNTO_ORDER_NO,
                        ct.CARRIER_CODE,
                        ct.EXPRESS_NO,
                        ct.NODE_ACTION,
                        ct.ACCEPT_TI
Progress: 0.000
*************************** 11. row ***************************
      Id: 5557273
    User: lotsprd
    Host: 172.16.12.155:54884
      db: lots
 Command: Query
    Time: 9
   State: Waiting for table flush
    Info: SELECT annto_order_no,express_no,express_code,EXPRESS_NAME,source_express_code FROM (
                        SELECT

二.问题分析
1.检查binlog中是否有手工执行的flush table、alter table和analyze table命令
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep FLUSH
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep flush
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep analze
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ANALYZE
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep alter
[apps@MCBL051 log]$ /apps/svr/mariadb10/bin/mysqlbinlog --no-defaults --base64-output=decode-rows -v -v mysql-bin.000345|grep ALTER

查询都没有手工执行的flush table、alter table和analyze table命令。

2.检查备份,发现在1点时的备份脚本中有在备份前kill掉长时间查询的线程
crontab -l
0 1 * * *  /apps/sh/cron/new_backup_mysql.sh   --socket=/tmp/mysql3306.sock >> /apps/tmp/dump_3306.log 2>&1

mydumper --user=${db_user} --password=${db_pass} --long-query-guard=${v_long_query_guard} --socket=${socket} --database=${dbname} --outputdir=${this_back_p} --compress

--verbose=3 --logfile=${base_p}/${logfilename} ${kill_long_query_option}
 v_backup_type=0  #mydumper  备份
            #log_mysql_start_status
            backup_mydumper
            if [ $try_times -eq 1 ]
            then
                kill_long_query_option="--kill-long-queries"

3.查看mydumper备份的日志,确认有在01:01:22时kill了运行超过54999s的线程
cd /apps/mysql_backup
tail -300 new_backup_mysql3306.log.archive|more

2016-07-01 01:01:22 [WARNING] - Killed a query that was running for 54999s
2016-07-01 01:11:22 [ERROR] - Couldn't acquire global lock, snapshots will not be consistent: Lock wait timeout exceeded; try restar
ting transaction
2016-07-01 01:21:24 [ERROR] - Error dumping table (lots.t_carrier_tracking) data: Lock wait timeout exceeded; try restarting transac
tion
2016-07-01 01:21:24 [INFO] - Empty table lots.t_carrier_tracking
2016-07-01 01:21:42 [ERROR] - Error dumping table (lots.t_order) data: Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:21:42 [INFO] - Empty table lots.t_order

2016-07-01 01:24:57 [ERROR] - Error dumping table (lots.t_warehouse) data: Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:24:57 [INFO] - Empty table lots.t_warehouse
2016-07-01 01:31:24 [ERROR] - Error dumping schemas (lots.t_carrier_tracking): Lock wait timeout exceeded; try restarting transactio
n
2016-07-01 01:31:24 [INFO] - Thread 2 shutting down
2016-07-01 01:31:42 [ERROR] - Error dumping schemas (lots.t_order): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:42 [INFO] - Thread 4 shutting down
2016-07-01 01:31:57 [ERROR] - Error dumping schemas (lots.t_warehouse): Lock wait timeout exceeded; try restarting transaction
2016-07-01 01:31:57 [INFO] - Thread 1 shutting down

但从上面报错来看,线程被kill,但线程和锁没有真正释放,影响到的表为T_ORDER、 t_warehouse、t_carrier_tracking。
在flush tables with read lock成功获得锁之前,必须等待所有语句执行完成(包括SELECT)。所以如果有个慢查询在执行,或者一个打开的事务,或者其他进程拿着表锁,flush tables

with read lock就会被阻塞,直到所有的锁被释放。
这三个表:T_ORDER、 t_warehouse、t_carrier_tracking由于慢sql没有执行执行完成,导致加锁没有成功。所以下次再有新产生对三个表的select 或dml的sql执行时,线程就会出现waiting

for table flush。我们mysql的lock_wait_timeout时间为600 秒,上面备份加锁时,等待锁超过了10分钟后,就会超时报错。


此次问题产生原因是慢SQL和备份操作杀进程引起的,主要原因还是由于慢sql长时间执行,性能太差。

三.解决方法
  关闭mysql,mysql长时间未有关闭的话,使用kill -9来杀掉mysql进程。
  最后将引起问题的慢SQL发给开发人员优化,优化人员反馈已优化业务逻辑,观察问题暂未出现。

0 0