Resolving an IPMP Bug Encountered in the XX Province Mobile CMNET Phase III Expansion Project

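1 Fault symptoms

The CMNET Phase III expansion at the XX Mobile network management center involved installing 4 x T5120 servers and 1 x ST2530 array. After the operating system was installed and probe-based IPMP was configured, the first NIC failover test was run. When the e1000g0 cable was unplugged, the floating IP moved to e1000g1 automatically. When the e1000g0 cable was plugged back in, the floating IP could not fail back and e1000g0 was shown as failed. After a reboot, the system consistently detected the second NIC as failed, so IPMP could no longer fail over.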

 

Jul 29 16:31:42 HB01-DNS-SV02 in.mpathd[249]: NIC failure detected on e1000g1 of group sc_ipmp0
Jul 29 16:31:42 HB01-DNS-SV02 in.mpathd[240]: Successfully failed over from NIC e1000g1 to NIC e1000g0

root@HB01-DNS-SV02 # ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.100.6 netmask ffffff00 broadcast 192.168.100.255
        groupname sc_ipmp0
        ether 0:21:28:3a:fa:36
e1000g0:1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
        inet 192.168.100.5 netmask ffffff00 broadcast 192.168.100.255
e1000g1: flags=19040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED> mtu 1500 index 3
        inet 192.168.100.4 netmask ffffff00 broadcast 192.168.100.255
        groupname sc_ipmp0
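
The FAILED state can also be picked out of `ifconfig -a` output by a script. The following is a minimal Python sketch, assuming the output has been captured as text. The function name `failed_interfaces` is hypothetical, and the sample is an abridged copy of the listing above.

```python
import re

# Abridged sample of the `ifconfig -a` output from the affected host.
SAMPLE = """\
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.100.6 netmask ffffff00 broadcast 192.168.100.255
        groupname sc_ipmp0
e1000g0:1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
        inet 192.168.100.5 netmask ffffff00 broadcast 192.168.100.255
e1000g1: flags=19040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED> mtu 1500 index 3
        inet 192.168.100.4 netmask ffffff00 broadcast 192.168.100.255
        groupname sc_ipmp0
"""

# An interface header line looks like "name: flags=hex<FLAG,FLAG,...> ...".
HEADER = re.compile(r"^(\S+): flags=[0-9a-f]+<([^>]*)>")


def failed_interfaces(ifconfig_text):
    """Return the names of interfaces whose flag list contains FAILED."""
    failed = []
    for line in ifconfig_text.splitlines():
        match = HEADER.match(line)
        if match and "FAILED" in match.group(2).split(","):
            failed.append(match.group(1))
    return failed


print(failed_interfaces(SAMPLE))  # ['e1000g1']
```

In the sample, e1000g1 is the only interface carrying FAILED, and it is also the only physical interface missing the RUNNING flag.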

bash-3.00# more /etc/host*
::::::::::::::
/etc/hostname.e1000g0
::::::::::::::
HB01-DNS-SV02 netmask + broadcast + group sc_ipmp0 up \
addif HB01-DNS-SV02-e1000g0-test netmask + broadcast + deprecated -failover up
::::::::::::::
/etc/hostname.e1000g1
::::::::::::::
HB01-DNS-SV02-e1000g1-test netmask + broadcast + group sc_ipmp0 deprecated -failover up
::::::::::::::
/etc/hosts
::::::::::::::
#
# Internet host table
#
::1     localhost
127.0.0.1       localhost
192.168.100.6   HB01-DNS-SV02   HB01-DNS-SV02.  loghost
192.168.100.4   HB01-DNS-SV02-e1000g0-test
192.168.100.5   HB01-DNS-SV02-e1000g1-test

 

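2 Troubleshooting process

2.1 Ruling out hardware faults

2.1.1 Ruling out NIC faults

Checked that local-mac-address?=true.

IPMP was then configured on e1000g2 and e1000g3 instead. The fault persisted: the system again detected e1000g3 as failed. The other three T5120 servers showed the same problem, so a host NIC fault was ruled out.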

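2.1.2 Ruling out gateway faults

Two Cisco 2950 switches and a D-Link hub were borrowed from the customer. The host's two NICs were connected to the switch and the hub, a laptop was configured as the gateway, and the failover test was repeated. The problem persisted, which ruled out a gateway fault.

The analysis therefore pointed to either the IPMP configuration or a bug as the likely cause.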

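2.2 Solutions

2.2.1 Configuring link-based IPMP

The probe-based IPMP configured earlier works at the IP level, layer 3 (the network layer) of the OSI model, and needs test IP addresses plus a gateway that answers ping. Solaris 10 and later offer link-based IPMP, which works at layer 2 (the data link layer) of the OSI model and needs neither test IP addresses nor a pingable gateway. Link-based IPMP is recommended: it saves IP addresses and avoids a good deal of trouble.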
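
The practical difference shows up in the /etc/hostname.<interface> files: the probe-based setup carries a test address marked -failover, while the link-based setup does not. A rough Python sketch of that distinction follows, assuming the file content is available as a string. `ipmp_mode` is a hypothetical helper, and the rule it applies is a simplification based on the two configurations in this article, not a description of how in.mpathd itself decides.

```python
def ipmp_mode(hostname_file_text):
    """Classify an /etc/hostname.<interface> file by its ifconfig keywords."""
    # Join backslash-continued lines, then split into whitespace-separated words.
    words = hostname_file_text.replace("\\\n", " ").split()
    if "group" not in words:
        return "no IPMP group"
    # A test address is marked with -failover; its presence implies probing.
    return "probe-based" if "-failover" in words else "link-based"


# Sample contents modelled on the two configurations in this article.
probe_cfg = (
    "HB01-DNS-SV02 netmask + broadcast + group sc_ipmp0 up \\\n"
    "addif HB01-DNS-SV02-e1000g0-test netmask + broadcast + deprecated -failover up\n"
)
link_cfg = "HB01-DNS-SV02 netmask + broadcast + group ipmp_group0 up\n"

print(ipmp_mode(probe_cfg))  # probe-based
print(ipmp_mode(link_cfg))   # link-based
```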

 

root@HB01-DNS-SV02# more /etc/host*
::::::::::::::
/etc/hostname.e1000g0
::::::::::::::
HB01-DNS-SV02 netmask + broadcast + group ipmp_group0 up
::::::::::::::
/etc/hostname.e1000g1
::::::::::::::
HB01-DNS-SV02-e1000g1-test netmask + broadcast + group ipmp_group0 up
::::::::::::::
/etc/hosts
::::::::::::::
#
# Internet host table
#
::1     localhost
127.0.0.1       localhost
192.168.100.6   HB01-DNS-SV02   HB01-DNS-SV02.  loghost
#192.168.100.4   HB01-DNS-SV02-e1000g0-test
192.168.100.5   HB01-DNS-SV02-e1000g1-test

root@HB01-DNS-SV02 # ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.100.6 netmask ffffff00 broadcast 192.168.100.255
        groupname ipmp_group0
        ether 0:21:28:3a:fa:34
e1000g1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 192.168.100.5 netmask ffffff00 broadcast 192.168.100.255
        groupname ipmp_group0
        ether 0:21:28:3a:fa:35

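2.2.2 This is a system bug; upgrade to patch 142900-02 or later

Further research showed that this is a bug in Solaris 10 u8, bug ID 271519: "Solaris 10 Kernel Patches 141444-09 and 141445-09 May Cause Interface Failure in IP Multipathing (IPMP)".

Testing confirmed that applying the patches from EIS 2.2.4 2010.06 resolves the fault.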
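
Whether a host already carries the fixed patch revision can be checked from its patch list. Below is a minimal Python sketch, assuming the output of `showrev -p` has been saved as text and that each entry begins with "Patch: <id>-<revision>". The helper name `has_patch` and the sample lines are illustrative only and were not taken from the affected hosts.

```python
import re


def has_patch(showrev_text, patch_id, min_revision):
    """Return True if patch_id is listed at revision >= min_revision."""
    entries = re.findall(r"^Patch:\s+(\d{6})-(\d{2})", showrev_text, re.M)
    for base, rev in entries:
        if base == patch_id and int(rev) >= min_revision:
            return True
    return False


# Illustrative sample only; real output carries more detail per line.
sample = (
    "Patch: 141444-09 Obsoletes: Requires: Incompatibles: Packages: SUNWcsr\n"
    "Patch: 142900-02 Obsoletes: Requires: Incompatibles: Packages: SUNWckr\n"
)

print(has_patch(sample, "142900", 2))  # True
print(has_patch(sample, "142900", 3))  # False
```

This simple check only matches the patch ID itself. A later patch that obsoletes 142900 would not be recognized, so a negative result should be confirmed by hand.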

3 Summary

Link-based IPMP is recommended. NIC drivers that support link-based IPMP include the following:

Solaris OS:
hme
eri
ce
ge
bge
qfe
dmfe