bug1234513
Bug 1234513 - NFS locks can become out-of-sync between client & server when client process killed with signal
It took a full 5 days to take this bug from reproduction to a finished test case. Going from knowing almost nothing (my notion of file locks was fuzzy) to a case I was happy with felt like a real achievement. Summary below:
=Reproduced symptom:
The while loop on the client has exited and no test_lock process is running, i.e. no lock should exist anywhere, yet lslk on the server still shows a lockd entry.
=Reproduction steps:
1. On the client, run a while loop that repeatedly locks and unlocks a file; lslk lists the locks on the file system.
2. On the server, lslk shows the matching lockd lock.
3. On the client, interrupt the loop and kill all test_lock processes.
4. On the server, lslk still prints the lock information.
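The four steps can be condensed into a shell transcript. This is a dry-run sketch, not the actual test: the `run` helper, `$SERVER`, `$nfsmp` and `./test_lock` are assumed to come from the test harness, and with DRY_RUN=1 the commands are only echoed, so the flow can be read without an NFS mount.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction; with DRY_RUN=1 each command is printed
# instead of executed ($SERVER, $nfsmp and ./test_lock are harness assumptions).
DRY_RUN=1
steps=0
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"            # print the command instead of running it
        steps=$((steps + 1))
    else
        eval "$@"
    fi
}
run 'while true; do timeout 3 ./test_lock $nfsmp/stats; sleep 1; done &'  # step 1
run 'lslk'                    # step 1: the client-side lock is visible
run 'ssh $SERVER lslk'        # step 2: the server shows a matching lockd entry
run 'pkill -9 test_lock'      # step 3: kill every lock holder
run 'ssh $SERVER lslk'        # step 4: the lockd entry should now be gone (bug: it is not)
```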
=Root cause:
1. With an nfsvers=3 mount, the client program calls fcntl() to take a write lock, and lslk on the server then shows a lockd entry; this is normal.
2. The client loop repeatedly locks and releases the file; because each iteration runs under timeout, the program is killed by the SIGTERM that timeout sends when the time limit expires.
3. /etc/init.d/nfslock restart releases the stale locks (on the real machine the script is /etc/init.d/nfslock).
=Solutions:
1. The lslk and lslocks commands
a. When a terminal starts it picks up the commands under /bin and /sbin, but some commands are not installed by default, so you have to install them yourself (one command maps to one package):
rpm -qf `which lslk` / rpm -qf `which lslocks`   # check whether the command exists on the system
yum provides */lslk / yum provides */lslocks     # find the package that provides the command
b. Wrap lslk and lslocks:
# rhel7 uses lslocks provided by util-linux
if which lslocks >/dev/null 2>&1; then
    lsLocks=lslocks
# rhel6 uses lslk from package lslk
elif which lslk >/dev/null 2>&1; then
    lsLocks=lslk
else
    echo "{Warn} Need to use tool lslocks or lslk to execute this test."
    report_result $TEST FAIL
    exit 1
fi
If only one end of the test needs the wrapper, a one-line fallback is enough:
lsLocks=lslocks
which lslocks >/dev/null 2>&1 || lsLocks=lslk
2. Passwordless ssh login:
$ ssh-keygen -t rsa                            # generate a key pair
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@xxxx   # install the public key on xxxx (when the local machine logs in to the remote one, its private key is matched against the installed public key)
$ ssh root@xxx                                 # check that login now works without a password
3. To avoid the complexity of signal handling, use ssh to log in to the remote machine and run the checks there.
4. Using screen:
run 'screen -dm bash -c "while true; do timeout 3 ./test_lock $nfsmp/stats; sleep 1; done &> screen.log"'
run "ps aux | grep -v grep | grep SCREEN"
if [ $? -ne 0 ]; then
    run "cat screen.log"
fi
5. Learn to capture debug information: run "rhts-sync-block -s testing $HOSTNAME"
This pauses the test at that point; open another terminal and inspect the state by hand.
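The rhts-sync-set / rhts-sync-block pair is essentially a network barrier between the two test hosts. A minimal local stand-in, using a flag file instead of the Beaker harness (the names here are purely illustrative), shows the control flow:

```shell
#!/bin/sh
# Local sketch of the rhts-sync barrier: one side sets a state, the other
# blocks until it appears. A flag file replaces the real harness.
flag=$(mktemp -u)
( sleep 1; touch "$flag" ) &          # plays the role of: rhts-sync-set -s testing
waiter_result=timeout
for i in $(seq 1 50); do              # plays the role of: rhts-sync-block -s testing $HOSTNAME
    [ -e "$flag" ] && { waiter_result=passed; break; }
    sleep 0.2
done
wait                                  # reap the background setter
rm -f "$flag"
echo "barrier: $waiter_result"
```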
# rhel7 uses lslocks provided by util-linux
if which lslocks >/dev/null 2>&1; then
    lsLocks=lslocks
# rhel6 uses lslk from package lslk
elif which lslk >/dev/null 2>&1; then
    lsLocks=lslk
else
    echo "{Warn} Need to use tool lslocks or lslk to execute this test."
    report_result $TEST FAIL
    exit 1
fi

Server() {
    rlPhaseStartSetup do-$role-Setup-
        rlFileBackup /etc/exports
        run "mkdir -p $expdir"
        run 'echo "$expdir *(rw,no_root_squash)" > /etc/exports'
        run "service_nfs restart"
        run "exportfs -v"
    rlPhaseEnd

    rlPhaseStartTest do-$role-Test-
        run "rhts-sync-set -s servReady"
        run "rhts-sync-block -s testDone $CLIENT"
    rlPhaseEnd

    rlPhaseStartCleanup do-$role-Cleanup-
        rlFileRestore
        run "rm -rf $expdir"
        run "service nfs restart"
    rlPhaseEnd
}

Client() {
    rlPhaseStartSetup do-$role-Setup-
        run "mkdir -p $nfsmp"
        run "rhts-sync-block -s servReady $SERVER"
        run "ls_nfsvers $SERVER"
    rlPhaseEnd

    for V in $(ls_nfsvers $SERVER); do
        rlPhaseStartTest do-$role-Test-vers${V}
            run "mount -o vers=$V $SERVER:$expdir $nfsmp"
            run "ssh $SERVER service nfslock restart"
            run "ssh $SERVER $lsLocks" 0 "Server should not have lock"
            run 'screen -dm bash -c "while true; do timeout 3 ./test_lock $nfsmp/stats; sleep 1; done &> screen.log"'
            run "ps aux | grep -v grep | grep SCREEN"
            if [ $? -ne 0 ]; then
                run "cat screen.log"
            fi
            # when lslk finds test_lock, break the loop and print the lock
            run "while :; do $lsLocks | grep -q test_lock && break; done"
            run "$lsLocks"
            run "ssh $SERVER $lsLocks"
            run "sleep 100"
            run "pkill -9 screen"
            run "ps aux | grep -v grep | grep while" 1 "Should be killed"
            run "sleep 100"
            run "$lsLocks | grep test_lock" 1 "Client should have released lock"
            run "ssh $SERVER $lsLocks | grep lockd" 1 "Server should have released lock"
            run "rm $nfsmp/stats"
            run "sleep 30"
            run "umount $nfsmp"
        rlPhaseEnd
    done

    rlPhaseStartCleanup do-$role-Cleanup-
        run "rhts-sync-set -s testDone"
        run "service nfslock restart"
        run "rm -rf $nfsmp"
    rlPhaseEnd
}

rlJournalStart