记一次从库主机断电重启后同步故障处理

来源：互联网发布：指向二维数组的指针编辑：程序博客网时间：2024/06/05 03:01

一.基本信息
主库：10.24.65.30
从库：10.24.65.33
版本：mysql 5.7.14
mysql参数设置：
binlog_format:row
tx_isolation:READ-COMMITTED
master_info_repository:FILE
relay_log_info_repository:FILE
sync_master_info:10000
sync_relay_log_info:10000
relay_log_recovery :off

从库同步情况：
(product)root@localhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.24.65.30
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 338
               Relay_Log_File: relay-bin.000003
                Relay_Log_Pos: 551
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 338

二.模拟故障发生

1.主库模拟数据写入:

sysbench --test=/data/software/sysbench-0.5/sysbench/tests/db/oltp.lua --mysql-table-engine=innodb --oltp-table-size=1000000 --max-requests=10000 \
--num-threads=100 --oltp_tables_count=10 --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-user=admin --mysql-password=Dmjxjbgc6u \
--mysql-db=test --max-time=600 --mysql-socket=/tmp/mysql3306.sock prepare

2.从库操作：
将从库主机断电

3.从库主机断电恢复后，启动mysql和slave
同步发生了故障，如下：
(product)root@localhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 10.24.65.30
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000135
          Read_Master_Log_Pos: 29046651
               Relay_Log_File: relay-bin.000272
                Relay_Log_Pos: 367
        Relay_Master_Log_File: mysql-bin.000128
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                   Last_Errno: 1594
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 154

三.修复故障
1.修复binlog损坏
上面显示binlog或relay log损坏，检查并记录确认上次sql thread最后正常执行的relay_master_log_file和exec_master_log_pos:
stop slave;
change master to
master_log_file='mysql-bin.000128',
master_log_pos=154;
start slave;

(product)root@localhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 10.24.65.30
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000138
          Read_Master_Log_Pos: 123
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 320
        Relay_Master_Log_File: mysql-bin.000128
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table test.sbtest5; Duplicate entry '32257' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000128, end_log_pos 532773
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 154
              Relay_Log_Space: 1348664795

2.修复1062错误
出现上面报错，是因为master.info和relay log.info两文件的信息没有及时刷盘更新，导致从master中读取或执行了重复的log，产生主键冲突报1062错误，先尝试跳过错误：
stop slave;
set global sql_slave_skip_counter=1;
start slave;
(product)root@localhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.24.65.30
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000145
          Read_Master_Log_Pos: 10398871
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 1037690
        Relay_Master_Log_File: mysql-bin.000128
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table test.sbtest5; Duplicate entry '34945' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000128, end_log_pos 1570143
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1037524
              Relay_Log_Space: 2303127120
继续跳：
stop slave;
set global sql_slave_skip_counter=1;
start slave;

第二次跳过错误后，依然还报错。

直接设置stop slave;
set global slave_exec_mode='IDEMPOTENT'; # 默认值是STRICT（严格模式）
start slave;
设置完毕后，当出现1062错误（主键重复）或1023错误（记录没找到)时，等几分钟后，会自动跳过该错误，并且记录到错误日志里。
其实和slave_skip_errors这个参数的作用是一样的，只不过slave_skip_errors参数必须加到配置文件my.cnf里然后重启MySQL，而slave_exec_mode参数可以动态设置。
(product)root@localhost [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.24.65.30
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000145
          Read_Master_Log_Pos: 10398871
               Relay_Log_File: relay-bin.000039
                Relay_Log_Pos: 320
        Relay_Master_Log_File: mysql-bin.000145
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 10398871

上面显示已没有报错，且已同步正常。

没有错误后，需要参数还原: set global slave_exec_mode=default ;或者 set global slave_exec_mode='STRICT';
虽然没有错误了，还是需要用 pt-sync使用检查主从数据一致性。

四.pt-table-checksum检查主从是否一致
在test库中建表,并插入从库的信息.
CREATE TABLE `dsns` ( `id` int(11) NOT NULL AUTO_INCREMENT, `parent_id` int(11) DEFAULT NULL, `dsn` varchar(255) NOT NULL, PRIMARY KEY (`id`) );
写入从库信息
INSERT INTO dsns (parent_id,dsn) values(1,"h=10.24.65.33,u=pc_checksum,p=percona,P=3306");
检查命令：
cd /data/tool/percona-toolkit-2.2.10/bin
perl pt-table-checksum --no-check-binlog-format --nocheck-replication-filters --ignore-databases=mysql,\
performance_schema,information_schema,log,common_schema --replicate=test.checksums --host=127.0.0.1 \
--port 3306 -upc_checksum -ppercona --engines=InnoDB,MyISAM --recurse=1 --chunk-size-limit=0 --recursion-method=dsn=D=host,D=test,t=dsns
检查操作已经完成，查看输出结果，是否有不一致的记录。

-bash-4.1$ perl pt-table-checksum --no-check-binlog-format --nocheck-replication-filters --ignore-databases=mysql,\
> performance_schema,information_schema,log,common_schema --replicate=test.checksums --host=127.0.0.1 \
> --port 3306 -upc_checksum -ppercona --engines=InnoDB,MyISAM --recurse=1 --chunk-size-limit=0 --recursion-method=dsn=D=host,D=test,t=dsns
            TS ERRORS DIFFS     ROWS CHUNKS SKIPPED    TIME TABLE
08-31T15:33:06      0      0        6       1       0   0.275 sys.sys_config
08-31T15:33:06      0      0        1       1       0   0.017 test.dsns
08-31T15:33:10      0      0 1000000       8       0   3.425 test.sbtest1
08-31T15:33:13      0      0 1000000       6       0   3.323 test.sbtest10
08-31T15:33:17      0      0 1000000       6       0   3.736 test.sbtest2
08-31T15:33:20      0      0 1000000       6       0   3.358 test.sbtest3
08-31T15:33:23      0      0 1000000       6       0   3.306 test.sbtest4
08-31T15:33:27      0      0 1000000       6       0   3.132 test.sbtest5
08-31T15:33:30      0      0 1000000       6       0   3.112 test.sbtest6
08-31T15:33:33      0      0 1000000       6       0   3.254 test.sbtest7
08-31T15:33:36      0      0 1000000       6       0   3.094 test.sbtest8
08-31T15:33:39      0      0 1000000       6       0   2.998 test.sbtest9
08-31T15:33:39      0      0        2       1       0   0.014 test.t1
08-31T15:33:39      0      0        2       1       0   0.012 test.t10
08-31T15:33:39      0      0        0       1       0   0.012 test.t10_tmp
08-31T15:33:39      0      0        2       1       0   0.012 test.t2
08-31T15:33:39      0      0        3       1       0   0.263 test.t3
08-31T15:33:39      0      0        2       1       0   0.013 test.t4
08-31T15:33:39      0      0        2       1       0   0.014 test.t5
08-31T15:33:39      0      0        3       1       0   0.012 test.t6
08-31T15:33:39      0      0        2       1       0   0.014 test.t7
08-31T15:33:39      0      0        1       1       0   0.013 test.t8
08-31T15:33:39      0      0        7       1       0   0.014 test.zeng
08-31T15:33:39      0      0        7       1       0   0.013 test.zeng1

上面显示两个库的数据是一致的。

0 0