Setting Up and Verifying a Redis / Redis Sentinel Environment


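Goals:

1. Verify the redis-sentinel environment and its failover (switchover) behavior.

2. Prepare for later failover implementations based on the jedis sentinel patch or on sentinel/twemproxy/Twemproxy-sentinel-agent.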

References:

http://www.redisdoc.com/en/latest/topic/sentinel.html

http://blog.163.com/a12333a_li/blog/static/87594285201304103257837/

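What is verified:

1. Environment setup

2. Failover: master shutdown

3. Failover: slave shutdown

4. Multi-sentinel environment

5. Summarize the messages sentinel emits and write scripts that judge node state, in preparation for later maintenance and status monitoring.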

Environment:

192.168.0.11: redis master:6379/slave:6380/sentinel1:26379

192.168.0.12: redis slave:6379

OS: RHEL 6.4

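Procedure log:

Environment installation:

1. The software is already installed on both machines.

2. Configure the redis nodes listed above.

2.1 On machine 11, configure the master and the slave; after a set, verify that master/slave replication works.

2.2 On machine 12, configure the slave and check the data: it has been replicated from the master.

3. Install sentinel. It was already built earlier, but the install step did not copy it into the path. The executable is redis-sentinel under $redis-source-dir/src/; add that directory to PATH. The directory at the same level as src also contains a sentinel.conf template that can be used as a reference.

The configuration template from http://blog.163.com/a12333a_li/blog/static/87594285201304103257837/ is used: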


# Change the IP address; it can be the IP of any node in the cluster.

sentinel monitor mymaster 192.168.1.11 6379 1

# Checked once per second by default; here a node that stays unresponsive for 5000 ms is considered down.

sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 900000
sentinel can-failover mymaster yes
sentinel parallel-syncs mymaster 1
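
Before starting sentinel it can help to sanity-check the configuration text. The following is a minimal sketch, not part of the original setup: the function name parse_sentinel_conf is hypothetical, and it only assumes the "sentinel <directive> <master-name> <args...>" layout visible in the template above. It raises an error when a monitor line does not carry exactly ip, port and quorum, which catches a master name that has run together with its IP address.

```python
def parse_sentinel_conf(text):
    """Collect 'sentinel <directive> <master> <args...>' lines per master name.

    Hypothetical helper: it knows only the directive layout shown in the
    template above and ignores every other line.
    """
    masters = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0] != "sentinel" or len(parts) < 4:
            continue  # not a sentinel directive in the expected layout
        directive, name, args = parts[1], parts[2], parts[3:]
        entry = masters.setdefault(name, {})
        if directive == "monitor":
            # Expected layout: sentinel monitor <name> <ip> <port> <quorum>
            if len(args) != 3:
                raise ValueError("malformed monitor line: %r" % line)
            entry["ip"] = args[0]
            entry["port"] = int(args[1])
            entry["quorum"] = int(args[2])
        else:
            entry[directive] = args[0]
    return masters


SAMPLE_CONF = """
sentinel monitor mymaster 192.168.1.11 6379 1
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 900000
sentinel can-failover mymaster yes
sentinel parallel-syncs mymaster 1
"""

if __name__ == "__main__":
    print(parse_sentinel_conf(SAMPLE_CONF))
```

With a quorum of 1 and a single sentinel, that one sentinel decides on its own that the master is down, which is what the failover log later in this post shows (quorum 1/1).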

Start sentinel:

[root@soa1 sentinel-env]# redis-server /root/devzone/redis/sentinel-env/sentinel.conf --sentinel &
[1] 29795
[root@soa1 sentinel-env]# [29795] 26 Nov 16:57:53.056 * Max number of open files set to 10032
                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 2.6.16 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in sentinel mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 26379
 |    `-._   `._    /     _.-'    |     PID: 29795
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               






[31193] 26 Nov 17:12:28.234 * +slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:12:28.234 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

 

Log in to the sentinel console with redis-cli -p 26379 and query the status with info:

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=127.0.0.1:6379,slaves=2,sentinels=1

This shows that the nodes are working normally.
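
Since one goal of this exercise is a status-checking script, here is a minimal sketch of how the master0 line above could be turned into structured data. The function name parse_sentinel_master_line is hypothetical, and the field layout is taken only from the output shown in this post (Redis 2.6.16); other versions may print different fields.

```python
def parse_sentinel_master_line(line):
    """Parse 'master0:name=...,status=...,address=ip:port,slaves=N,sentinels=N'.

    Assumes the layout printed by the sentinel in this post; hypothetical helper.
    """
    _prefix, _, rest = line.strip().partition(":")
    fields = dict(item.split("=", 1) for item in rest.split(","))
    fields["slaves"] = int(fields["slaves"])
    fields["sentinels"] = int(fields["sentinels"])
    # Split the address on the last colon into host and port.
    host, _, port = fields["address"].rpartition(":")
    fields["host"] = host
    fields["port"] = int(port)
    return fields


SAMPLE_LINE = "master0:name=mymaster,status=ok,address=127.0.0.1:6379,slaves=2,sentinels=1"

if __name__ == "__main__":
    info = parse_sentinel_master_line(SAMPLE_LINE)
    print(info["name"], info["status"], info["host"], info["port"], info["slaves"])
```

A monitoring script could compare host and port against the last known master to notice a switch, and alert when status is anything other than ok.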

Running monitor on the slave's console shows a large volume of traffic, including sentinel's messages:

"PUBLISH" "__sentinel__:hello" "127.0.0.1:26379:5af67fe4818cd8c1e1dc679a44d655cbcf744820:1"

The master node receives the same messages.

----------- After stopping the master, query sentinel:

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1

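The master has switched to machine 12. How the slave is selected is described in the sentinel documentation.

The sentinel console shows the failover process: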

[31193] 26 Nov 17:18:24.228 # +sdown master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:24.228 # +odown master mymaster 127.0.0.1 6379 #quorum 1/1
[31193] 26 Nov 17:18:24.328 # +failover-triggered master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:24.328 # +failover-state-wait-start master mymaster 127.0.0.1 6379 #starting in 9168 milliseconds
[31193] 26 Nov 17:18:33.565 # +failover-state-select-slave master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.666 # +selected-slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.666 * +failover-state-send-slaveof-noone slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:33.766 * +failover-state-wait-promotion slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.269 # +promoted-slave slave 192.168.0.12:6379 192.168.0.12 6379 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.269 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:34.369 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.273 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.273 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.373 # +failover-end master mymaster 127.0.0.1 6379
[31193] 26 Nov 17:18:35.373 # +switch-master mymaster 127.0.0.1 6379 192.168.0.12 6379
[31193] 26 Nov 17:18:35.475 * +slave slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:19:05.455 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
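
The log above can also be read by a script. The following is a minimal sketch under the assumption that every line has the "[pid] day month time level event details" shape seen in this post; the names LOG_LINE, parse_log_line and find_switch_master are hypothetical. It extracts the +switch-master event, which carries both the old and the new master address.

```python
import re

# Layout assumed from the log lines in this post:
# [pid] day month time level(+#/*) event details
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+)\]\s+(?P<when>\d+ \w+ [\d:.]+)\s+(?P<level>[#*])\s+"
    r"(?P<event>[+-][\w-]+)\s*(?P<detail>.*)$"
)


def parse_log_line(line):
    """Return a dict of pid/when/level/event/detail, or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    return m.groupdict() if m else None


def find_switch_master(lines):
    """Return the first +switch-master event as old/new (ip, port) pairs, or None."""
    for line in lines:
        rec = parse_log_line(line)
        if rec and rec["event"] == "+switch-master":
            name, old_ip, old_port, new_ip, new_port = rec["detail"].split()
            return {
                "master": name,
                "old": (old_ip, int(old_port)),
                "new": (new_ip, int(new_port)),
            }
    return None


SAMPLE_LOG = [
    "[31193] 26 Nov 17:18:24.228 # +sdown master mymaster 127.0.0.1 6379",
    "[31193] 26 Nov 17:18:35.373 # +failover-end master mymaster 127.0.0.1 6379",
    "[31193] 26 Nov 17:18:35.373 # +switch-master mymaster 127.0.0.1 6379 192.168.0.12 6379",
]

if __name__ == "__main__":
    print(find_switch_master(SAMPLE_LOG))
```

In this run the event reports the old master as 127.0.0.1:6379 and the new one as 192.168.0.12:6379, matching the info output shown earlier.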


Query the sentinel on machine 11:

redis 127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "192.168.0.11:6380"
    3) "ip"
    4) "192.168.0.11"
    5) "port"
    6) "6380"
    7) "runid"
    8) "d75e30a7ab284003827744b322281f53025f689e"
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "781"
   15) "last-ping-reply"
   16) "781"
   17) "info-refresh"
   18) "7500"
   19) "master-link-down-time"
   20) "0"
   21) "master-link-status"
   22) "ok"
   23) "master-host"
   24) "192.168.0.12"
   25) "master-port"
   26) "6379"
   27) "slave-priority"
   28) "100"
2)  1) "name"
    2) "127.0.0.1:6379"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "s_down,slave,disconnected,demote"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "739873"
   15) "last-ping-reply"
   16) "739873"
   17) "s-down-time"
   18) "709791"
   19) "info-refresh"
   20) "1385458255246"
   21) "master-link-down-time"
   22) "0"
   23) "master-link-status"
   24) "err"
   25) "master-host"
   26) "?"
   27) "master-port"
   28) "0"
   29) "slave-priority"
   30) "100"


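The output still lists two slaves. One of them is 127.0.0.1:6379 on the sentinel's own host (the address under which the stopped master was monitored); its runid is empty and its master information is abnormal, which is very strange, and I don't know whether it is a bug. The slave on 11 is in a normal state.

On the redis console of machine 12, info shows that replication is normal: there is only one slave, and the one on 11 is displayed correctly. So it is the sentinel's view that looks wrong, probably a bug: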

# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6380,online
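
To compare the master's own view with what sentinel reports, the Replication section can be parsed as well. This is a minimal sketch: the function name parse_replication is hypothetical, and the "slaveN:ip,port,state" layout is the one printed by the Redis 2.6.16 instance in this post; other versions may format that line differently.

```python
def parse_replication(info_text):
    """Parse the '# Replication' section of INFO into a dict.

    Assumes 'slaveN:ip,port,state' lines as printed in this post; hypothetical helper.
    """
    result = {"slaves": []}
    for raw in info_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip section headers and blank lines
        key, _, value = line.partition(":")
        if key.startswith("slave") and key[5:].isdigit():
            ip, port, state = value.split(",")
            result["slaves"].append({"ip": ip, "port": int(port), "state": state})
        elif key == "connected_slaves":
            result[key] = int(value)
        else:
            result[key] = value
    return result


SAMPLE_INFO = """# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6380,online
"""

if __name__ == "__main__":
    print(parse_replication(SAMPLE_INFO))
```

A check script could then flag any case where the number of slaves sentinel reports differs from connected_slaves on the master.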

------------------ Shut down the slave on 11:

info on host 12 shows that the master has noticed the slave shutdown:

# Replication
role:master
connected_slaves:0

The sentinel console on host 11 shows:

[31193] 26 Nov 17:42:34.520 # +sdown slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379

Yet info via redis-cli on the sentinel still shows two slaves, which is very odd:

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1

Querying the sentinel on 11 further with sentinel slaves mymaster:

redis 127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "192.168.0.11:6380"
    3) "ip"
    4) "192.168.0.11"
    5) "port"
    6) "6380"
    7) "runid"
    8) "d75e30a7ab284003827744b322281f53025f689e"
    9) "flags"
   10) "s_down,slave,disconnected"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "232389"
   15) "last-ping-reply"
   16) "232389"
   17) "s-down-time"
   18) "202386"
   19) "info-refresh"
   20) "237202"
   21) "master-link-down-time"
   22) "0"
   23) "master-link-status"
   24) "ok"
   25) "master-host"
   26) "192.168.0.12"
   27) "master-port"
   28) "6379"
   29) "slave-priority"
   30) "100"
2)  1) "name"
    2) "127.0.0.1:6379"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "s_down,slave,disconnected,demote"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "1641534"
   15) "last-ping-reply"
   16) "1641534"
   17) "s-down-time"
   18) "1611452"
   19) "info-refresh"
   20) "1385459156907"
   21) "master-link-down-time"
   22) "0"
   23) "master-link-status"
   24) "err"
   25) "master-host"
   26) "?"
   27) "master-port"
   28) "0"
   29) "slave-priority"
   30) "100"

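The slave on host 11 now has the flags s_down,slave,disconnected, which shows that sentinel has recorded the state change of both slaves, so perhaps this is not a bug after all.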
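
The slaves count in info evidently includes entries that are flagged as down, so a status script has to look at the flags. The following is a minimal sketch that parses the redis-cli text output of sentinel slaves as printed in this post; the names parse_sentinel_slaves and down_slaves are hypothetical, and the sample is a shortened copy of the output above.

```python
import re

# An entry starts on a line like '1)  1) "name"' (outer index, then inner index 1).
ENTRY_START = re.compile(r'^\s*\d+\)\s+1\)\s')
QUOTED = re.compile(r'"([^"]*)"')


def parse_sentinel_slaves(cli_text):
    """Turn redis-cli 'sentinel slaves' text output into a list of dicts."""
    entries, current = [], None
    for line in cli_text.splitlines():
        if ENTRY_START.match(line):
            current = []
            entries.append(current)
        if current is None:
            continue  # prompt or other text before the first entry
        m = QUOTED.search(line)
        if m:
            current.append(m.group(1))  # first quoted string on the line
    # Tokens alternate between field name and value.
    return [dict(zip(tokens[0::2], tokens[1::2])) for tokens in entries]


def down_slaves(slaves):
    """Names of the slaves whose flags include s_down."""
    return [s["name"] for s in slaves if "s_down" in s.get("flags", "").split(",")]


SAMPLE_OUTPUT = '''redis 127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "192.168.0.11:6380"
    3) "flags"
    4) "s_down,slave,disconnected"
2)  1) "name"
    2) "127.0.0.1:6379"
    3) "runid"
    4) ""
    5) "flags"
    6) "s_down,slave,disconnected,demote"
'''

if __name__ == "__main__":
    print(down_slaves(parse_sentinel_slaves(SAMPLE_OUTPUT)))
```

Only the first quoted string on each line is read, so trailing annotations after a value do not disturb the parsing.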

------------- Start the former master on host 11:

Host 12 immediately detects the slave on 11:

# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6379,online


After 11:6379 starts, it has itself become a slave:

# Replication
role:slave
master_host:192.168.0.12
master_port:6379

The sentinel console shows the slave on 11 recovering:

[31193] 26 Nov 17:50:06.531 * +demote-old-slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:06.731 # -sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:11.448 * +slave slave 192.168.0.11:6379 192.168.0.11 6379 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 17:50:16.562 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 192.168.0.12 6379


Surprisingly, sentinel's management info now has one extra slave record!

redis 127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "192.168.0.11:6379" ----- the former master has become a slave
    3) "ip"
    4) "192.168.0.11"
    5) "port"
    6) "6379"
    7) "runid"
    8) "55603e9d67ca053417806a2a3024387418f17684"
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "216"
   15) "last-ping-reply"
   16) "216"
   17) "info-refresh"
   18) "6135"
   19) "master-link-down-time"
   20) "0"
   21) "master-link-status"
   22) "ok"
   23) "master-host"
   24) "192.168.0.12"
   25) "master-port"
   26) "6379"
   27) "slave-priority"
   28) "100"
2)  1) "name"
    2) "192.168.0.11:6380" --- the slave on 11 that was stopped earlier
    3) "ip"
    4) "192.168.0.11"
    5) "port"
    6) "6380"
    7) "runid"
    8) "d75e30a7ab284003827744b322281f53025f689e"
    9) "flags"
   10) "s_down,slave,disconnected"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "583437"
   15) "last-ping-reply"
   16) "583437"
   17) "s-down-time"
   18) "553434"
   19) "info-refresh"
   20) "588250"
   21) "master-link-down-time"
   22) "0"
   23) "master-link-status"
   24) "ok"
   25) "master-host"
   26) "192.168.0.12"
   27) "master-port"
   28) "6379"
   29) "slave-priority"
   30) "100"
3)  1) "name"
    2) "127.0.0.1:6379"  --- which host is this one?
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "55603e9d67ca053417806a2a3024387418f17684" --- the runid shows this is really the 6379 node on 11 that was just started; sentinel lists the same slave twice, what a mess!
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "-1"
   13) "last-ok-ping-reply"
   14) "16"
   15) "last-ping-reply"
   16) "16"
   17) "info-refresh"
   18) "1118"
   19) "master-link-down-time"
   20) "0"
   21) "master-link-status"
   22) "ok"
   23) "master-host"
   24) "192.168.0.12"
   25) "master-port"
   26) "6379"
   27) "slave-priority"
   28) "100"


Querying sentinel's info indeed shows three slaves:



# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=3,sentinels=1

So sentinel can manage the master/slave state, but is there a bug in how it keeps track of slaves? Could the problem be the 127.0.0.1 address used at startup?
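
Because the duplicate was spotted by comparing runid values, the same comparison can be scripted. This is a minimal sketch with a hypothetical function name, duplicate_runids; it assumes the slave entries are already available as dicts with name and runid fields, copied here from the output above.

```python
from collections import defaultdict


def duplicate_runids(slaves):
    """Map each runid that appears under more than one name to those names."""
    by_runid = defaultdict(list)
    for s in slaves:
        runid = s.get("runid", "")
        if runid:  # an empty runid cannot identify an instance
            by_runid[runid].append(s["name"])
    return {rid: names for rid, names in by_runid.items() if len(names) > 1}


# Name and runid of the three entries sentinel listed after 11:6379 came back.
SAMPLE_SLAVES = [
    {"name": "192.168.0.11:6379", "runid": "55603e9d67ca053417806a2a3024387418f17684"},
    {"name": "192.168.0.11:6380", "runid": "d75e30a7ab284003827744b322281f53025f689e"},
    {"name": "127.0.0.1:6379", "runid": "55603e9d67ca053417806a2a3024387418f17684"},
]

if __name__ == "__main__":
    print(duplicate_runids(SAMPLE_SLAVES))
```

Here the 127.0.0.1:6379 and 192.168.0.11:6379 entries share one runid, which is consistent with the suspicion that the loopback address is involved.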

---------------------- Restore the slave 6380 on 11:

info on 11:6380 shows that this node did not become a slave of 12; it became a master instead:

# Replication
role:master
connected_slaves:0

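Looking at the sentinel console, it seems that sentinel attempted a master/slave switch from 12:6379 to 11:6380, which would explain why 11:6380 became a master. Note, though, that the log lines themselves only record a reboot of the slave followed by -slave-restart-as-master, with sentinel removing it from the attached slaves: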

[31193] 26 Nov 18:04:36.244 * +reboot slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379
[31193] 26 Nov 18:04:36.244 # -slave-restart-as-master slave 192.168.0.11:6380 192.168.0.11 6380 @ mymaster 192.168.0.12 6379 #removing it from the attached slaves

But 12:6379 is unchanged and is still the master, with 11:6379 still attached as a slave, so no switchover was completed:

# Replication
role:master
connected_slaves:1
slave0:192.168.0.11,6379,online

Sentinel's info shows 12:6379 is still the master, but the slave count is now 2:

# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.0.12:6379,slaves=2,sentinels=1

Query the two slaves:

redis 127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "192.168.0.11:6379" --- correct
    3) "ip"
    4) "192.168.0.11"
    5) "port"
    6) "6379"
    7) "runid"
    8) "55603e9d67ca053417806a2a3024387418f17684"
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "0"
   13) "last-ok-ping-reply"
   14) "145"
   15) "last-ping-reply"
   16) "145"
   17) "info-refresh"
   18) "7361"
   19) "master-link-down-time"
   20) "0"
   21) "master-link-status"
   22) "ok"
   23) "master-host"
   24) "192.168.0.12"
   25) "master-port"
   26) "6379"
   27) "slave-priority"
   28) "100"
2)  1) "name"
    2) "127.0.0.1:6379" ---- and who is this?
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "55603e9d67ca053417806a2a3024387418f17684" -- the runid shows it is the same 11:6379 as above; sentinel got it wrong again. There is actually only one slave now: 11:6379
    9) "flags"
   10) "slave"
   11) "pending-commands"
   12) "-1"
   13) "last-ok-ping-reply"
   14) "145"
   15) "last-ping-reply"
   16) "145"
   17) "info-refresh"
   18) "2351"
   19) "master-link-down-time"
   20) "0"
   21) "master-link-status"
   22) "ok"
   23) "master-host"
   24) "192.168.0.12"
   25) "master-port"
   26) "6379"
   27) "slave-priority"
   28) "100"

Judging from these results, sentinel's state recovery does have problems.


I'll try the remaining issues tomorrow.