Ops Notes 31 (Summary of Building a pacemaker High-Availability Cluster)


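Overview:

pacemaker is the resource manager that was split out of the heartbeat project (it is a separate project as of heartbeat v3), so pacemaker does not provide heartbeat messaging itself; the cluster also needs corosync (the messaging layer) to be complete. pacemaker acts as the control center of the whole HA stack, and clients configure and manage the entire cluster through it. crmshell, which generates the configuration for us and syncs it across the nodes, is a very handy tool when building the cluster.

1. Install the cluster software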

    yum install pacemaker corosync -y

Install pacemaker and corosync directly with yum.

crmsh-1.2.6-0.rc2.2.1.x86_64.rpm

pssh-2.3.1-2.1.x86_64.rpm

Then install the two rpm packages above. crmsh depends on pssh, so install pssh first or install both together.

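2. Configure the cluster with crm

    [root@ha1 ~]# crm
    crm(live)#

Typing crm (cluster resource manager) drops you into the cluster resource manager shell.

    crm(live)#
    ?           cib         exit        node        ra          status
    bye         configure   help        options     resource    up
    cd          end         history     quit        site

Press Tab to list the available management items.

We want to configure the cluster now, so enter configure.

    ERROR: running cibadmin -Ql: Could not establish cib_rw connection: Connection refused (111)
    Signon to CIB failed: Transport endpoint is not connected
    Init failed, could not perform requested operations

The error above appears, most likely because the corosync service is not running. Even without the error, there is no point in starting the higher-level cluster manager while the messaging layer is down, so configure corosync first.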

    [root@ha1 ~]# rpm -ql corosync
    /etc/corosync
    /etc/corosync/corosync.conf.example

Use rpm to find where corosync keeps its configuration file.

Drop the .example suffix from the configuration file name and change its contents to the following:

    # Please read the corosync.conf.5 manual page
    compatibility: whitetank
    totem {
        version: 2
        secauth: off
        threads: 0
        interface {
            ringnumber: 0
            # network over which cluster messages are sent
            bindnetaddr: 192.168.5.0
            # multicast address
            mcastaddr: 226.94.1.1
            # multicast port
            mcastport: 5405
            # only send multicast packets with a TTL of 1, to prevent loops
            ttl: 1
        }
    }
    logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
            subsys: AMF
            debug: off
        }
    }
    amf {
        mode: disabled
    }
    service {
        # have corosync load pacemaker
        name: pacemaker
        # plugin version: with 1 the plugin does not start pacemaker,
        # with 0 pacemaker is started automatically
        ver: 0
    }
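bindnetaddr expects the network address of the interface rather than the node's own IP. As a minimal sketch, the helper below derives that value from an address and a prefix length; network_addr is a hypothetical name used only for illustration, and the node address 192.168.5.11 is an assumption, not something taken from this cluster.

```shell
# network_addr IP PREFIX
# Prints the IPv4 network address for IP with the given prefix length.
# Hypothetical helper for illustration; it is not part of corosync.
network_addr() {
    ip="$1"
    prefix="$2"
    # Split the dotted quad into its four octets.
    a=${ip%%.*}; rest=${ip#*.}
    b=${rest%%.*}; rest=${rest#*.}
    c=${rest%%.*}; d=${rest#*.}
    ipnum=$(( (a << 24) | (b << 16) | (c << 8) | d ))
    # Build the netmask from the prefix length and apply it.
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( ipnum & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

# A node assumed to sit at 192.168.5.11/24:
network_addr 192.168.5.11 24    # prints 192.168.5.0
```

The printed value is what would go into bindnetaddr for that interface.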

Next, start corosync. If it starts successfully and the log shows no errors, this step is done.
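Checking the log by eye is easy to forget, so here is a small sketch of a helper that does it. log_errors is a hypothetical name, and matching on the word "error" is an assumption about what counts as a problem; the log path is the one set in the logging section of corosync.conf.

```shell
# log_errors FILE
# Prints every line of FILE containing "error" (case-insensitive).
# Returns 1 if such lines exist, 0 if the log is clean, 2 if unreadable.
log_errors() {
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 2
    fi
    if grep -i 'error' "$1"; then
        return 1
    fi
    return 0
}

# Typical use on a node (assumes the EL6 init script; not run here):
#   service corosync start && log_errors /var/log/cluster/corosync.log
```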

crm should now work normally.

    crm(live)# configure
    crm(live)configure# show
    node ha1.mo.com
    node ha2.mo.com
    property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"

    [root@ha1 cluster]# crm configure show
    node ha1.mo.com
    node ha2.mo.com
    property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2"

Running the equivalent command from bash shows the same output, but without tab completion.

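Now let's add services to the cluster.

Start with the relatively simple IP service.

    crm(live)configure# primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.5.100 cidr_netmask=24 op monitor interval=30s

The command looks long, but almost all of it comes from tab completion. As long as you understand what you are doing, you can build it with hardly any memorization. Here ocf refers to cluster resource agent scripts, while LSB refers to standard Linux init scripts, that is, the scripts under /etc/init.d.

Changes to the configuration are not saved and written out as machine-readable XML right away; you have to commit them.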

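    crm(live)configure# commit
       error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
       error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
       error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
    Errors found during check: config not valid
    Do you still want to commit?

The commit produced the errors above. They are about STONITH: it is enabled by default, but no STONITH resource has been defined. We ignore this for now because we are only adding an IP service, and confirm the commit. Note that the configuration takes effect as soon as the commit is confirmed.

Use crm's built-in status view to check whether the service is healthy.

    crm(live)configure# cd
    crm(live)# resource
    crm(live)resource# show
     vip    (ocf::heartbeat:IPaddr2):    Stopped
    crm(live)resource# start vip
    crm(live)resource# show
     vip    (ocf::heartbeat:IPaddr2):    Stopped

cd returns to the top level, and entering resource shows the state of the resources. The resource is not started, which is odd, and starting it by hand fails as well. That points to a configuration problem, so check the log.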

    GINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
    Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
    Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
    Feb 27 07:14:09 ha1 pengine[6053]:    error: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity

The only errors are the STONITH ones, so try disabling STONITH.

    crm(live)configure# property stonith-enabled=false
    crm(live)resource# show
     vip    (ocf::heartbeat:IPaddr2):    Started

The service is now running, so always clear every ERROR. After the steps above you have probably noticed how convenient pacemaker is: a change made on one node is applied to all nodes, with no need to distribute the configuration yourself.

Now test whether health checking works by shutting down the network on ha1.

    [root@ha2 ~]# crm_mon
    Last updated: Mon Feb 27 07:30:23 2017
    Last change: Mon Feb 27 07:16:50 2017 via cibadmin on ha1.mo.com
    Stack: classic openais (with plugin)
    Current DC: ha2.mo.com - partition WITHOUT quorum
    Version: 1.1.10-14.el6-368c726
    2 Nodes configured, 2 expected votes
    1 Resources configured
    Online: [ ha2.mo.com ]
    OFFLINE: [ ha1.mo.com ]
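The two facts worth pulling out of this output are the quorum state on the Current DC line and the list of offline nodes. The sketch below extracts both from text in the format shown above; quorum_state and offline_nodes are hypothetical helper names, and the patterns assume exactly this output layout.

```shell
# quorum_state: reads crm_mon output on stdin and prints "quorate" or
# "inquorate" according to the "Current DC" line.
quorum_state() {
    awk '
        /^Current DC:/ && /partition with quorum/    { print "quorate" }
        /^Current DC:/ && /partition WITHOUT quorum/ { print "inquorate" }
    '
}

# offline_nodes: prints the node names found on the OFFLINE line.
offline_nodes() {
    sed -n 's/^OFFLINE: \[ \(.*\) ]$/\1/p'
}

# Typical use on a node (crm_mon -1 prints the status once and exits):
#   crm_mon -1 | quorum_state
#   crm_mon -1 | offline_nodes
```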

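STONITH is usually a hardware device. Our nodes are virtual machines, so we need a virtual fence device.

    [root@ha1 ~]# stonith_admin -I
     fence_pcmk
     fence_legacy
    2 devices found

This lists the installed fence agents; fence_xvm, the one we need, is missing. Let's search with yum.

    fence-virt.x86_64 : A pluggable fencing framework for virtual machines

This looks like a good match for our needs, so install it and look again.

    [root@ha1 ~]# stonith_admin -I
     fence_xvm
     fence_virt
     fence_pcmk
     fence_legacy
    4 devices found

Now fence_xvm is available.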

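    [root@ha1 ~]# stonith_admin -M -a fence_xvm

The command above prints the metadata of the fence_xvm agent, including the parameters it accepts.
Enter crm and add the fence configuration.

    crm(live)configure# primitive vmfence stonith:fence_xvm params pcmk_host_map="ha1.mo.com:ha1;ha2.mo.com:ha2" op monitor interval=20s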

pcmk_host_map above maps each cluster node's hostname to the domain name of the corresponding virtual machine as the hypervisor knows it.
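Since a typo in this map is an easy way to end up with fencing that never works, it can help to print the map one entry per line and compare it against the domain names on the physical host. A minimal sketch, where show_host_map is a hypothetical helper name:

```shell
# show_host_map MAP
# Splits a pcmk_host_map value ("node:domain;node:domain") into one
# "cluster node -> VM domain" line per entry.
show_host_map() {
    printf '%s\n' "$1" | tr ';' '\n' | while IFS=: read -r node domain; do
        if [ -n "$node" ]; then
            echo "$node -> $domain"
        fi
    done
}

show_host_map "ha1.mo.com:ha1;ha2.mo.com:ha2"
# prints:
#   ha1.mo.com -> ha1
#   ha2.mo.com -> ha2
```

The right-hand names should match what the hypervisor reports for the guests, for example in the output of virsh list --all on the physical host.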
Now check how the fence resource is doing.

    vmfence (stonith:fence_xvm):    Started ha2.mo.com

Now add an http service for testing.

    crm(live)configure# primitive apache lsb:httpd op monitor interval=30s

Check how it runs.
As with the RHCS suite we studied a few days ago, the IP and the http service have to start in a fixed order, so next we define that order.

    crm(live)configure# group website vip apache

This binds vip and apache into one group, with vip started first and the http service after it. Now look at the state of the services.

    crm(live)resource# show
     vmfence    (stonith:fence_xvm):    Started
     Resource Group: website
         vip    (ocf::heartbeat:IPaddr2):    Started
         apache    (lsb:httpd):    Started
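A group is shorthand for two things at once: its members run on the same node, and they start in the listed order. As a rough illustration only, the group above behaves much like the two explicit constraints below. The constraint ids are made up for this sketch, and these lines should not be added alongside the group itself.

```shell
# Keep apache on the node where vip runs (id apache-with-vip is hypothetical).
crm configure colocation apache-with-vip inf: apache vip
# Start vip before apache (id vip-before-apache is hypothetical).
crm configure order vip-before-apache inf: vip apache
```

The group form is shorter and scales better once more resources join the same service.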

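A basic service is now taking shape, so test whether fencing works. Stop the http service on ha1.

    Failed actions:
        apache_monitor_30000 on ha1.mo.com 'not running' (7): call=27, status=complete, last-rc-change='Mon Feb 27 22:32:36 2017', queued=0ms, exec=0ms

Watching the cluster from ha2, we can see that it noticed the http service had stopped on ha1, but it did not fence the node; it simply started http on ha1 again. This is expected: a failed monitor only triggers recovery of the resource, and fencing is reserved for nodes that have become unreachable.
Now take down the network interface on ha1.

    2 Nodes configured, 2 expected votes
    3 Resources configured
    Node ha1.mo.com: UNCLEAN (offline)
    Online: [ ha2.mo.com ]
     Resource Group: website
         vip        (ocf::heartbeat:IPaddr2):    Started ha1.mo.com
         apache     (lsb:httpd):    Started ha1.mo.com

Something odd happens: the service does not fail over and is still reported on ha1. The cause is pacemaker's quorum handling, which we have not configured yet. By default, a partition that has lost quorum stops managing resources, and in a two-node cluster losing one node means losing quorum. In real deployments this is a protective policy.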

    crm(live)configure# property no-quorum-policy=ignore

Enter this command and continue testing. The service is currently on ha2, so now shut down the network interface on ha2.

    Last change: Mon Feb 27 22:46:35 2017 via cibadmin on ha2.mo.com
    Stack: classic openais (with plugin)
    Current DC: ha1.mo.com - partition with quorum
    Version: 1.1.10-14.el6-368c726
    2 Nodes configured, 2 expected votes
    3 Resources configured
    Online: [ ha1.mo.com ha2.mo.com ]
    vmfence (stonith:fence_xvm):    Started ha1.mo.com
     Resource Group: website
         vip        (ocf::heartbeat:IPaddr2):    Started ha1.mo.com
         apache     (lsb:httpd):    Started ha1.mo.com

The service has moved to ha1, and ha2 was powered off by the fence.
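The quorum behaviour in this test comes down to simple arithmetic: a partition has quorum when it holds more than half of the expected votes. A small sketch of that rule follows, with votes_needed and has_quorum as hypothetical helper names; it models plain majority voting only.

```shell
# votes_needed TOTAL: votes required for a majority of TOTAL votes.
votes_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL ALIVE: succeeds if ALIVE votes form a majority of TOTAL.
has_quorum() {
    [ "$2" -ge "$(votes_needed "$1")" ]
}

votes_needed 2    # prints 2
if has_quorum 2 1; then echo "quorate"; else echo "no quorum"; fi    # prints "no quorum"
if has_quorum 3 2; then echo "quorate"; else echo "no quorum"; fi    # prints "quorate"
```

With expected-quorum-votes="2", a single surviving node can never reach a majority, which is why a two-node cluster needs no-quorum-policy=ignore to fail over at all. A three-node cluster keeps quorum after losing one node.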

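Now add the ldirectord service so that the cluster can drive LVS. The ldirectord configuration was covered in the previous post; here we set up a virtual IP of 172.25.3.100, with the load spread over two real servers at 172.25.3.3 and 172.25.3.4.

Now add ldirectord to the configuration.

    crm(live)configure# primitive lvs lsb:ldirectord op monitor interval=30s

Next we will add a storage service to this website. Before that, here is a look at the commands for taking a node offline and bringing it back online.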

    Last updated: Tue Feb 28 22:35:00 2017
    Last change: Tue Feb 28 22:34:04 2017 via cibadmin on ha1.mo.com
    Stack: classic openais (with plugin)
    Current DC: ha1.mo.com - partition with quorum
    Version: 1.1.10-14.el6-368c726
    2 Nodes configured, 2 expected votes
    3 Resources configured
    Node ha1.mo.com: standby
    Online: [ ha2.mo.com ]
    vmfence (stonith:fence_xvm):    Started ha2.mo.com
     Resource Group: website
         vip        (ocf::heartbeat:IPaddr2):    Started ha2.mo.com
         apache     (lsb:httpd):    Started ha2.mo.com

The service is now running on ha2. Put ha2 into standby as well and see what happens.

    Last updated: Tue Feb 28 22:37:21 2017
    Last change: Tue Feb 28 22:37:21 2017 via crm_attribute on ha2.mo.com
    Stack: classic openais (with plugin)
    Current DC: ha1.mo.com - partition with quorum
    Version: 1.1.10-14.el6-368c726
    2 Nodes configured, 2 expected votes
    3 Resources configured
    Node ha1.mo.com: standby
    Node ha2.mo.com: standby

Both nodes are now in standby, so bring ha1 back online.

    Node ha2.mo.com: standby
    Online: [ ha1.mo.com ]
    vmfence (stonith:fence_xvm):    Started ha1.mo.com
     Resource Group: website
         vip        (ocf::heartbeat:IPaddr2):    Started ha1.mo.com
         apache     (lsb:httpd):    Started ha1.mo.com

ha1 takes over.
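The status listings above only show the results. The transitions themselves can be driven with crmsh's node subcommands; the following is a sketch of one such sequence, using this cluster's node names, and is not a record of the exact commands that were typed.

```shell
# Put ha1 into standby; its resources move to the remaining node.
crm node standby ha1.mo.com
# Print the cluster status once instead of opening the live crm_mon view.
crm_mon -1
# Bring ha1 back as an active cluster member.
crm node online ha1.mo.com
```

A node in standby stays in the cluster and keeps voting, but it is not allowed to run resources, which makes this a gentle way to drain a node before maintenance.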

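If the configuration really has no errors but a service still will not start, try the command below. This happened to me when I started the cluster but forgot to start fence_virtd on the physical host, so vmfence on the virtual machines could not start. cleanup clears the recorded failures and refreshes the state of the resource.

    crm(live)resource# cleanup vmfence

    Cleaning up vmfence on ha1.mo.com
    Cleaning up vmfence on ha2.mo.com
    Waiting for 1 replies from the CRMd. OK

Now take a look at what the individual resource scripts require.

    start and stop Apache HTTP Server (lsb:httpd)

    The Apache HTTP Server is an efficient and extensible  \
            server implementing the current HTTP standards.

    Operations' defaults (advisory minimum):

        start         timeout=15
        stop          timeout=15
        status        timeout=15
        restart       timeout=15
        force-reload  timeout=15
        monitor       timeout=15 interval=15

That is the description of the apache script.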

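Next, add DRBD shared storage and a mysql service to the cluster.

First add two 4G disks to ha1 and ha2. Building the DRBD rpm packages from the source tarball is covered in a separate post.

    [root@ha1 x86_64]# ls
    drbd-8.4.2-2.el6.x86_64.rpm                  drbd-heartbeat-8.4.2-2.el6.x86_64.rpm                 drbd-pacemaker-8.4.2-2.el6.x86_64.rpm  drbd-xen-8.4.2-2.el6.x86_64.rpm
    drbd-bash-completion-8.4.2-2.el6.x86_64.rpm  drbd-km-2.6.32_431.el6.x86_64-8.4.2-2.el6.x86_64.rpm  drbd-udev-8.4.2-2.el6.x86_64.rpm
    drbd-debuginfo-8.4.2-2.el6.x86_64.rpm        drbd-km-debuginfo-8.4.2-2.el6.x86_64.rpm              drbd-utils-8.4.2-2.el6.x86_64.rpm

These are the rpm packages that come out of the build. After that, download mysql and put its files on the DRBD shared storage.

Create the DRBD metadata, start the service, and force one node to primary. Be careful here: in my experience the backing device for DRBD must not have been formatted beforehand, otherwise forcing primary never succeeds however you try. I have made this mistake twice. Mount the DRBD device on /var/lib/mysql, the mysql data directory, so that the mysql data lives on the DRBD device. Remember to stop mysql before switching the DRBD primary and secondary roles, and do not leave a mysql sock file behind on the DRBD storage.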
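The steps in the paragraph above can be sketched as the following command sequence. It assumes the DRBD resource is named mo and appears as /dev/drbd1 with an ext4 filesystem, which matches the names used in the pacemaker configuration later on, and that the EL6 init scripts drbd and mysqld are installed. Treat it as an outline, not as the exact commands that were run.

```shell
# On BOTH nodes: write the DRBD metadata for resource "mo" and start DRBD.
drbdadm create-md mo
service drbd start

# On ONE node only: force it to primary so the initial sync can begin,
# then create the filesystem and mount it on the mysql data directory.
drbdadm primary --force mo
mkfs.ext4 /dev/drbd1
mount /dev/drbd1 /var/lib/mysql

# Before handing control to pacemaker: stop mysql, unmount, stop DRBD.
service mysqld stop
umount /var/lib/mysql
service drbd stop
```

Stopping mysql before the unmount is what keeps a stale sock file from being left on the replicated device.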

Now stop the drbd service and let the pacemaker cluster take over.

First add the drbd resource.

    crm(live)configure# primitive drbddata ocf:linbit:drbd params drbd_resource=mo op monitor interval=120s

This time the agent comes from the linbit provider of the ocf class, and drbd_resource must be defined.

Set up the drbd master and slave roles.

    crm(live)configure# ms drbdclone drbddata meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

Set up the mount for the drbd device.

    crm(live)configure# primitive sqlfs ocf:heartbeat:Filesystem params device=/dev/drbd1 directory=/var/lib/mysql fstype=ext4

Colocate sqlfs with the drbd master so that the filesystem is only placed on the node where drbd is primary; this also prepares for defining the start order.

    crm(live)configure# colocation sqlfs-with-drbd inf: sqlfs drbdclone:Master

Make the filesystem start only after drbd has been promoted to master.

    crm(live)configure# order sqlfs-after-drbd inf: drbdclone:promote sqlfs:start

Now commit and check whether it takes effect. Warnings about timeouts can be ignored for the moment.

    crm(live)resource# show
     vmfence    (stonith:fence_xvm):    Started
     Resource Group: website
         vip    (ocf::heartbeat:IPaddr2):    Started
         apache    (lsb:httpd):    Started
         sqlfs    (ocf::heartbeat:Filesystem):    Started
     Master/Slave Set: drbdclone [drbddata]
         Masters: [ ha1.mo.com ]

The services are running normally.

Finally, add the mysql service to the configuration.

    crm(live)configure# primitive mysql lsb:mysqld op monitor interval=60s

    crm(live)configure# group mydb vip sqlfs mysql

Then delete the earlier website group and check whether the services are healthy.

    crm(live)resource# show
     vmfence    (stonith:fence_xvm):    Started
     Master/Slave Set: drbdclone [drbddata]
         Masters: [ ha2.mo.com ]
         Stopped: [ ha1.mo.com ]
     apache    (lsb:httpd):    Started
     Resource Group: mydb
         vip    (ocf::heartbeat:IPaddr2):    Started
         sqlfs    (ocf::heartbeat:Filesystem):    Started
         mysql    (lsb:mysqld):    Started

1 0