mysql5.7主从复制延时过大，stop slave夯住

来源：互联网发布：淘宝活动规则编辑：程序博客网时间：2024/06/06 09:34

一、说明

mysql5.7的GTID多线程主从模式的复制，复制延时过大：

【mysql主从告警】时间:20170725 14:25:02|【192.xx : xx】与【从库：192.xx : xx 】主从延时过大,超过457642 s!

异常分析：

1、从库基本不可用

2、slave备份数据不可靠，延迟太大；

3、若半同步复制情况，影响主库的性能

二、排查问题

2.1 查看日志

首先想到的都是先show processlist下，查看状态：

mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.xx
Master_User: myrep
Master_Port: xx20
Connect_Retry: 60
Master_Log_File: mysql-bin.000091
Read_Master_Log_Pos: 850335882
Relay_Log_File: relay-bin.000249
Relay_Log_Pos: 414
Relay_Master_Log_File: mysql-bin.000078
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB: information_schema,performance_schema,sys
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1048017045
Relay_Log_Space: 14058364808
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 1484
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 56001
Master_UUID: 2199b808-1dc9-11e7-97a0-1418773c7c83
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 2199b808-1dc9-11e7-97a0-1418773c7c83:203746-654994
Executed_Gtid_Set: 2199b808-1dc9-11e7-97a0-1418773c7c83:1-203745,
984721db-1dd1-11e7-b548-b8ca3af0976a:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)

红色标记的是表示连续多次查看Relay_Log_Pos和Exec_Master_Log_Pos不变；到主库查看binlog

# at 1048017045
#170720 7:17:39 server id 56001 end_log_pos 1048017110 CRC32 0x595de4a4 GTID last_committed=59399 sequence_number=59400
SET @@SESSION.GTID_NEXT= '2199b808-1dc9-11e7-97a0-1418773c7c83:203746'/*!*/;
# at 1048017110
#170720 7:17:39 server id 56001 end_log_pos 1048017187 CRC32 0x2bb724c0 Query thread_id=19064 exec_time=0 error_code=0
SET TIMESTAMP=1500506259/*!*/;
SET @@session.pseudo_thread_id=19064/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 1048017187
#170720 7:17:39 server id 56001 end_log_pos 1048017288 CRC32 0xfc36c6b9 Table_map: `dbname`.`app_xx_area_day` mapped to number 1071
# at 1048017288
#170720 7:17:39 server id 56001 end_log_pos 1048025438 CRC32 0xea7754b2 Delete_rows: table id 1071
# at 1048025438
#170720 7:17:39 server id 56001 end_log_pos 1048033606 CRC32 0xc6ca1995 Delete_rows: table id 1071
# at 1048033606
#170720 7:17:39 server id 56001 end_log_pos 1048041769 CRC32 0xb99f13f4 Delete_rows: table id 1071
# at 1048041769
#170720 7:17:39 server id 56001 end_log_pos 1048049923 CRC32 0x811e0cd7 Delete_rows: table id 1071
# at 1048049923
#170720 7:17:39 server id 56001 end_log_pos 1048058135 CRC32 0xbcbcae3a Delete_rows: table id 1071
# at 1048058135
#170720 7:17:39 server id 56001 end_log_pos 1048066299 CRC32 0xec29b40e Delete_rows: table id 1071
# at 1048066299
#170720 7:17:39 server id 56001 end_log_pos 1048074464 CRC32 0x9b0d3c60 Delete_rows: table id 1071
# at 1048074464
#170720 7:17:39 server id 56001 end_log_pos 1048082635 CRC32 0xf80b65a1 Delete_rows: table id 1071
# at 1048082635
#170720 7:17:39 server id 56001 end_log_pos 1048090830 CRC32 0x4518cd5c Delete_rows: table id 1071
# at 1048090830
#170720 7:17:39 server id 56001 end_log_pos 1048098993 CRC32 0x3330a34b Delete_rows: table id 1071
# at 1048098993
#170720 7:17:39 server id 56001 end_log_pos 1048107195 CRC32 0xbaa07468 Delete_rows: table id 1071
# at 1048107195
....

binlog看到事务过长，表Table_map: `dbname`.`app_xx_area_day` mapped to number 事务操作过长。

2.2 重新开启slave

由于之前在没验证之前，直接想通过stop slave后再start slave，发现stop slave后一直挂住：

通过2.1查看事物日志可用知道，在stop slave时候需要做一次事物回滚操作，导致stop slave夯住，无法停止slave操作；

2.3 同步挂住到表

由于我们知道是某个表导致同步异常的，但是不清楚为什么这个表会导致同步异常；我们查看这个表的情况，这里说明下我们同步记录的日志模式是ROW模式；

该表一直是In_use状态，查看表结构：

查看线上该表是1个一千个分区的表，而且该表有一个很明显的问题是没有主键，分区字段区分度很低；

我们总结下：

a.show slave status 显示position一直没有变化；

b.show open tables显示某个表一直是in_use为1；

c.show create table 查看表结构可以看到无主键，或者无任何索引，或者索引区分度很差。

三、解决方法

a. 找到表区分度比较高的几个字段, 可以使用这个方法判断：
select count(*) from xx;
select count(*) from (select distinct xx from xxx) t;
如果2个查询count(*)的结果差不多，说明可以对这些字段加索引
b. 备库stop slave;
可能会执行比较久，因为需要回滚事务。
c. 备库
set sql_log_bin=0;
alter table xx add key xx(xx);
老的版本slave应用binlog时只会选择第一个索引，需要把新加的索引放在最前面，可以先把老的索引删掉，建新的索引，再把老的索引建上。可以放到一个sql中执行。
d. 备库start slave
如果是innodb,可以通过show innodb status来查看 rows_inserted,updated,deleted,selected这几个指标来判断。
如果每秒修改的记录数比较多，说明复制正在以比较快的速度执行。

当然最终还是要解决该表分区太大的问题，从根源上去解决。

我是直接xtrabackup重做的，主从延时过大了。

阅读全文

0 0