mysql galera 集群常见问题处理

来源：互联网发布：人工智能的利弊英文编辑：程序博客网时间：2024/05/24 05:10

一、mysql HA集群在断网过久或者所有节点都down了之后的恢复有以下的方法：
解决方案1：
1、等三台机器恢复网络通讯后，因为此时的mysql已经异常无法加入集群，因此需要先保证所有的mysql都是down的，再上台执行/usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://' & 这条命令，并进入mysql（只有一台机器能够成功执行，其他机器执行了过几秒钟都会异常退出这个进程，我们这里把能够成功执行的机器称为master）
2、此时三台只有一台能够成功进入mysql（即执行mysql这条命令），在非master上的两台上一台一台的执行systemctl start mysqld，必须等一台成功了，另一台才能执行。

3、在mysql中执行show status like "wsrep%";结果如下图：

我们需要保证图中的第一项为synced，以及第二项必须为三个mysql的ip

4、保证3的结果是想要的说明集群已经恢复了，此时需要将master机器上面的/usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://'这个进程kill掉，然后再执行systemctl start mysqld即可

二、mysql HA集群某个节点无故down了并且有一段时间处于down的情况通过以下方式恢复：

1、若日志里面出现以下日志

160119 14:11:05 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (eb9f50c6-bc95-11e5-a735-9f48e437dc03): 1 (Operation not permitted)

解决方法：删除/var/lib/mysql/grastate.dat 文件（若还存在无法同步的情况则删除galera.cache文件）

2、若那个down了的节点出现以下日志

（异常情况集群挂了）[ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions

解决方法：
1、/usr/libexec/mysqld start --innodb_force_recovery=6
1. (SRV_FORCE_IGNORE_CORRUPT):忽略检查到的corrupt页。
2. (SRV_FORCE_NO_BACKGROUND):阻止主线程的运行，如主线程需要执行full purge操作，会导致crash。
3. (SRV_FORCE_NO_TRX_UNDO):不执行事务回滚操作。
4. (SRV_FORCE_NO_IBUF_MERGE):不执行插入缓冲的合并操作。
5. (SRV_FORCE_NO_UNDO_LOG_SCAN):不查看重做日志，InnoDB存储引擎会将未提交的事务视为已提交。
6. (SRV_FORCE_NO_LOG_REDO):不执行前滚的操作。
如果配置后出现以下情况：
130507 14:14:01 InnoDB: Waiting for the background threads to start
130507 14:14:02 InnoDB: Waiting for the background threads to start
130507 14:14:03 InnoDB: Waiting for the background threads to start
130507 14:14:04 InnoDB: Waiting for the background threads to start
130507 14:14:05 InnoDB: Waiting for the background threads to start
130507 14:14:06 InnoDB: Waiting for the background threads to start
130507 14:14:07 InnoDB: Waiting for the background threads to start
130507 14:14:08 InnoDB: Waiting for the background threads to start
130507 14:14:09 InnoDB: Waiting for the background threads to start

需要在galera.cfg中添加这一下：
如果在设置 innodb_force_recovery >2 的同时innodb_purge_thread = 0
2、mysqld --tc-heuristic-recover=ROLLBACK
3、删除/var/lib/mysql/ib_logfile*
4、当某个mysql节点挂了，并且存在三个mysql所在host有不同的网段，当mysql想重新加入需要一个sst的过程，sst时会需要知道集群中某个节点的ip因此需要制定参数--wsrep-sst-receive-address否则可能出现同步的ip不在三台机器所共有的网段
解决参考：
http://blog.itpub.net/22664653/viewspace-1441389/

三、一个mysql节点若down了一段时间。重新启动的时候需要一些时间去同步数据，服务的启动超时时间不够，导致服务无法启动，解决方法如下：
The correct way to adjust systemd settings so they don't get overwritten is to create a directory and file as such:
/etc/systemd/system/mariadb.service.d/timeout.conf
[Service]

TimeoutStartSec=12min

或者直接修改/usr/lib/systemd/system/mariadb.service
[Service]

TimeoutStartSec=12min
这里的时间最少要大于90s，默认是90s之后执行 systemctl daemon-reload再重启服务即可
四、日志中出现类似如下错误：
160428 13:54:49 [ERROR] Slave SQL: Error 'Table 'manage_operations' already exists' on query. Default database: 'horizon'. Query: 'CREATE TABLE `manage_operations` (
`id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
`name` varchar(50) NOT NULL,
`type` varchar(20) NOT NULL,
`operation` varchar(20) NOT NULL,
`status` varchar(20) NOT NULL,
`time` date NOT NULL,
`operator` varchar(50) NOT NULL
) default charset=utf8', Error_code: 1050
160428 13:54:49 [Warning] WSREP: RBR event 1 Query apply warning: 1, 28585
160428 13:54:49 [Warning] WSREP: Ignoring error for TO isolated action: source: 752eecd1-0ce0-11e6-83fc-3e0502d0bdd2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24053 trx_id: -1 seqnos (l: 28668, g: 28585, s: 28584, d: 28584, ts: 80224119986850)
导致进程异常关闭，
此时可以通过执行mysqladmin flush-tables来刷新表项，这个问题的原因是三个节点之间的表同步存在问题，刷新一下表即可

五、日志出现以下错误：
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367194
160520 10:48:23 [Note] WSREP: cert failure, thd: 358780 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358784 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367188
160520 10:48:23 [Note] WSREP: cert failure, thd: 359683 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358808 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367191 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367196 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367194 is_AC: 0, retry: 0 - 1 SQL: commit

160520 10:48:23 [Note] WSREP: cert failure, thd: 367188 is_AC: 0, retry: 0 - 1 SQL: commit

8、日志出现以下错误：

160820 3:13:41 [ERROR] Error in accept: Too many open files
160820 3:19:42 [ERROR] Error in accept: Too many open files
160827 3:16:24 [ERROR] Error in accept: Too many open files
160831 17:20:52 [ERROR] Error in accept: Too many open files
160831 19:54:29 [ERROR] Error in accept: Too many open files
160831 20:21:53 [ERROR] Error in accept: Too many open files
160901 11:25:57 [ERROR] Error in accept: Too many open files

解决方法

vim /usr/lib/systemd/system/mariadb.service

[Service]
LimitNOFILE=10000

默认的mysql的open_file_limits是1024将该项增大，并且修改vim /etc/my.cnf.d/server.cnf该文件的open_files_limit值

systemctl daemon-reload

systemctl restart mysqld

查看mysql的open_file_limits值是否调整成功

cat /proc/$pid/limit

其中$pid为mysql进程的pid看看值是否调整成功，并看看日志是否还会出现上述错误

1 0