MongoDB moveChunk Failure Analysis and Handling (SERVER-5351)

Version: 2.2.0
Related bug: SERVER-5351
Operation: move a data chunk from one shard to another (described below as: from shard shard3_2 to shard shard1_1)
Symptoms
1) The moveChunk command issued on mongos hung after being run; the log shows:
Tue Mar  4 20:34:12 [conn25194] CMD: movechunk: { moveChunk: "archive.archive", to: "shard1_1", find: { uid: -2130706432 } }
Tue Mar  4 20:34:12 [conn25194] moving chunk ns: archive.archive moving ( ns:archive.archive at: shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 lastmod: 1|2||000000000000000000000000 min: { uid: -2130706432 } max: { uid: MaxKey }) shard3_2:shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003 -> shard1_1:shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002
Tue Mar  4 20:34:20 [LockPinger] cluster 192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001 pinged successfully at Tue Mar  4 20:34:20 2014 by distributed lock pinger '192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001/server-a08:27227:1392462352:1804289383', sleeping for 30000ms
2) The shard3_2 primary's log shows entries like the following, repeating for about 10 hours:
Wed Mar  5 05:34:57 [conn4754510] moveChunk data transfer progress: { active: true, ns: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", min: { uid: -2130706432 }, max: { uid: MaxKey }, shardKeyPattern: { uid: 1 }, state: "catchup", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
It ends with a failure entry:
Wed Mar  5 06:40:13 [conn4754510] distributed lock 'archive.archive/server-a03:15001:1349773025:1657874366' unlocked.
Wed Mar  5 06:40:13 [conn4754510] about to log metadata event: { _id: "server-a03-2014-03-04T22:40:13-163", server: "server-a03", clientAddr: "192.168.130.18:21606", time: new Date(1393972813700), what: "moveChunk.from", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 6: 1, step2 of 6: 442, step3 of 6: 9, step4 of 6: 36000896, note: "aborted" } }
Wed Mar  5 06:40:13 [conn4754510] command admin.$cmd command: { moveChunk: "archive.archive", from: "shard3_2/192.168.130.13:15001,192.168.130.14:15002,10.1.2.13:15003", to: "shard1_1/10.1.2.12:13003,192.168.130.11:13001,192.168.130.12:13002", fromShard: "shard3_2", toShard: "shard1_1", min: { uid: -2130706432 }, max: { uid: MaxKey }, maxChunkSizeBytes: 134217728, shardId: "archive.archive-uid_-2130706432", configdb: "192.168.130.14:9001,192.168.130.15:9001,192.168.130.16:9001", secondaryThrottle: false, $auth: { local: { __system: 2 } } } ntoreturn:1 keyUpdates:0 locks(micros) r:66 w:28 reslen:398 36361617ms
3) The shard1_1 primary's log shows entries like the following, also repeating for about 10 hours:
Tue Mar  4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
It too ends with a failure entry:
Wed Mar  5 06:40:13 [migrateThread] about to log metadata event: { _id: "server-a01-2014-03-04T22:40:13-161", server: "server-a01", clientAddr: ":27017", time: new Date(1393972813571), what: "moveChunk.to", ns: "archive.archive", details: { min: { uid: -2130706432 }, max: { uid: MaxKey }, step1 of 5: 9, step2 of 5: 0, step3 of 5: 0, step4 of 5: 0, note: "aborted" } }
4) The shard1_1 secondaries show no related log entries at all, which suggests that chunk migration is transparent to the secondaries.
To summarize (below, the primary of shardx_y is written as shardx_y_pri and its secondaries as shardx_y_bjs):
The flow is as follows. The chunk is to move from shard3_2 to shard1_1. By MongoDB's design, shard1_1_pri pulls the data from shard3_2_pri, and shard1_1_pri only acknowledges completion after the data has been replicated to a certain number of its secondaries. In our migration, shard1_1's primary sat waiting for the oplog to be replicated to its secondaries (the "warning: migrate commit waiting for 3 slaves for" lines), while shard3_2 kept polling shard1_1 for the migration's status.
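For orientation, the recipient side of a 2.2 migration (the migrateThread in d_migrate.cpp) moves through a small state machine, and the state: "catchup" reported in the donor's progress log above is one of those states; the wait for replication has to finish before the recipient can advance to "steady". A rough sketch of those states, with the names reproduced from memory of the 2.2 source rather than quoted from it:

// Rough sketch only: recipient-side migration states as I recall them from
// 2.2's MigrateStatus; the names may not match the source exactly.
#include <cstdio>

enum MigrateState { READY, CLONE, CATCHUP, STEADY, COMMIT_START, DONE, FAIL, ABORT };

int main() {
    const char* names[] = { "ready", "clone", "catchup", "steady",
                            "commit_start", "done", "fail", "abort" };
    // The donor polls the recipient and logs the recipient's current state;
    // in this incident the recipient never got past "catchup" because it was
    // stuck waiting for its secondaries before it could move on to "steady".
    printf("state reported to the donor: %s\n", names[CATCHUP]);
    return 0;
}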

Questions:
1) In the log line
Tue Mar  4 22:14:47 [migrateThread] warning: migrate commit waiting for 3 slaves for 'archive.archive' { uid: -2130706432 } -> { uid: MaxKey } waiting for: 5315c844:6
the trailing token is a timestamp, specifically a MongoDB oplog timestamp: the primary is waiting for the oplog entry at that timestamp to be applied on the secondaries. But looking at the collections on the secondaries, ordinary operations (unrelated to the moveChunk) issued well after that point had long since been replicated there, so that oplog entry must already have been applied; this is not a replication-lag problem.
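As a sanity check, that token can be decoded by hand. A minimal sketch in plain C++ (not MongoDB code), assuming the 2.2-era convention that an OpTime prints as <epoch seconds in hex>:<increment>:

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main() {
    const char* optime = "5315c844:6";        // value from the warning above
    time_t secs = strtoul(optime, NULL, 16);  // strtoul stops at the ':'
    char buf[64];
    strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", gmtime(&secs));
    printf("%s\n", buf);                      // 2014-03-04 12:34:12 UTC
    return 0;
}

That is Tue Mar 4 20:34:12 in the cluster's local time (UTC+8), i.e. the moment the migration started, so the primary was waiting on an optime from the very beginning of the migration rather than on some recent write, which again points away from replication lag.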
2) On the Google group, a 10gen engineer (a programmer who, as it happens, graduated from BUPT) replied: this failure most likely trips a known bug in MongoDB 2.2.0, and they suggested we upgrade to 2.2.7. See https://jira.mongodb.org/browse/SERVER-5351
3) So I went back and searched the logs of an earlier moveChunk that had succeeded, and found:
Wed Aug 21 01:31:35 [migrateThread] warning: migrate commit waiting for 2 slaves for 'post.post' { uid: 1107296256 } -> { uid: MaxKey } waiting for: 5213a7f7:4
Odd: we clearly have 3 secondaries, so why does this line say 2?
"waiting for 2 slaves for 'post.post'" (normal)
"waiting for 3 slaves for 'archive.archive'" (abnormal)
4) Same cluster, same replica set, so why 2 in one case and 3 in the other? Time to trace some code:
// d_migrate.cpp
class MigrateStatus {
    void _go() {
        // ......
        // pause to wait for replication
        Timer t;
        while ( t.minutes() < 600 ) {   // poll the replication progress for up to 10 hours
            if ( flushPendingWrites( lastOpApplied ) )
                break;
            sleepsecs(1);
        }
    }

    bool flushPendingWrites( const ReplTime& lastOpApplied ) {
        if ( ! opReplicatedEnough( lastOpApplied ) ) {
            OpTime op( lastOpApplied );
            // this is where the "waiting for 3 slaves" warning gets logged
            OCCASIONALLY warning() << "migrate commit waiting for " << slaveCount
                                   << " slaves for '" << ns << "' " << min << " -> " << max
                                   << " waiting for: " << op << migrateLog;
            return false;
        }
        // ...
    }

    // The two functions below check the replication progress. Note that slaveCount
    // is passed by value, not by reference: it was computed by earlier code and is
    // not modified once it has been handed to opReplicatedEnough.
    bool opReplicatedEnough( const ReplTime& lastOpApplied ) {
        return mongo::opReplicatedEnough( lastOpApplied , slaveCount );
    }
    bool opReplicatedEnough( OpTime op , BSONElement w ) {
        return slaveTracking.opReplicatedEnough( op , w );
    }
};
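Incidentally, the 600-minute cap on this loop is visible in the failure logs above: the donor's abort entry records step4 of 6: 36000896, which read as milliseconds is about 600 minutes, and the whole moveChunk command took 36361617 ms, roughly 10.1 hours. That lines up with the migration being aborted only once this wait-for-replication loop finally expired.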

OK: in flushPendingWrites you can see that the printed slave count is the slaveCount variable. The key question is how slaveCount, the value passed into mongo::opReplicatedEnough, gets computed. Its computation turns out to be:
slaveCount = ( getSlaveCount() / 2 ) + 1;   (in version 2.2.7, slaveCount is no longer computed with this function)
In other words, the primary only waits for a majority of the secondaries to have replicated the write. Our replica set has 3 secondaries, so a normal log line should print 3/2+1 = 2. But when the failure happened it printed 3, i.e. 4/2+1 or 5/2+1: this mongod believed it had 4 or 5 secondaries, so something had clearly gone wrong inside the database.
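A trivial sketch of that inference (plain C++, not MongoDB code): with slaveCount = getSlaveCount() / 2 + 1, a printed value of 3 can only come from getSlaveCount() returning 4 or 5.

#include <cstdio>

int main() {
    for (int tracked = 1; tracked <= 6; ++tracked) {
        int slaveCount = tracked / 2 + 1;   // majority of the tracked secondaries
        printf("getSlaveCount()=%d -> waits for %d slaves\n", tracked, slaveCount);
    }
    // With our 3 real secondaries the expected value is 2; a logged value of 3
    // means the primary is tracking 4 or 5 entries.
    return 0;
}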

Handling:
So it looked like the only real fix was to upgrade, as the 10gen engineer suggested. But an upgrade could not happen quickly: our cluster runs over a hundred mongod processes, so we needed another way to cope with the bug first.
1) Take a closer look at how this slaveCount is actually used in the check:
mongo::opReplicatedEnough( lastOpApplied , slaveCount ); returns true if at least slaveCount secondaries have an opTime of at least lastOpApplied. Digging into it, we reach this code:
bool _replicatedToNum_slaves_locked(OpTime& op, int numSlaves ) {
    for ( map<Ident,OpTime>::iterator i=_slaves.begin(); i!=_slaves.end(); i++) {
        OpTime s = i->second;
        if ( s < op ) {
            continue;
        }
        if ( --numSlaves == 0 )
            return true;
    }
    return numSlaves <= 0;
}
So there it is: the _slaves member is what stores the secondaries' replication state. Its definition: map<Ident,OpTime> _slaves;
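To make the failure mode concrete, here is a self-contained toy simulation (hypothetical names and simplified stand-in types, not MongoDB code) of the same counting logic. It shows one way the check can wedge: once _slaves carries stale entries whose optimes never advance, getSlaveCount() is inflated, the majority target rises, and the check may stay false forever.

#include <cstdio>
#include <map>
#include <string>

typedef unsigned long long OpTimeSim;   // stand-in for mongo::OpTime
typedef std::string IdentSim;           // stand-in for the Ident key

// Same counting logic as _replicatedToNum_slaves_locked above.
static bool replicatedToNumSlaves(const std::map<IdentSim, OpTimeSim>& slaves,
                                  OpTimeSim op, int numSlaves) {
    for (std::map<IdentSim, OpTimeSim>::const_iterator i = slaves.begin();
         i != slaves.end(); ++i) {
        if (i->second < op)
            continue;                   // this entry has not reached op yet
        if (--numSlaves == 0)
            return true;
    }
    return numSlaves <= 0;
}

int main() {
    std::map<IdentSim, OpTimeSim> slaves;
    slaves["live1"]  = 100;             // secondaries that keep replicating
    slaves["live2"]  = 100;
    slaves["ghost1"] = 10;              // stale entries that never advance
    slaves["ghost2"] = 10;

    // 4 tracked entries -> slaveCount = 4/2 + 1 = 3, but only 2 entries can
    // ever reach the target optime, so this returns false no matter how long
    // the migrate thread keeps retrying.
    printf("%d\n", replicatedToNumSlaves(slaves, 100, 3));   // prints 0
    return 0;
}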
2) At this point we can be fairly confident that on the faulty mongod, _slaves contains 4 or 5 entries. So the ideal workaround would be some way to clear or rebuild _slaves. The places that modify it are:
// repl_block.cpp:111
void reset() {
    if ( _currentlyUpdatingCache )
        return;
    scoped_lock mylk(_mutex);
    _slaves.clear();        // clear
}

// repl_block.cpp:118
void update( const BSONObj& rid , const string& host , const string& ns , OpTime last ) {
    REPLDEBUG( host << " " << rid << " " << ns << " " << last );
    Ident ident(rid,host,ns);
    scoped_lock mylk(_mutex);
    _slaves[ident] = last;  // modify
    _dirty = true;
    //.....
}

These are the only two places that modify _slaves directly. Following the callers upward, _slaves gets reset when the database is closed and when the replica set configuration is changed. Because _slaves.clear() has quite a few call sites, and I wanted both to look for a simpler option and to confirm that a restart or a replica set reconfig really does reset _slaves, I went ahead and added log statements at several points in the source, rebuilt MongoDB, and tried it on a test cluster:
src/mongo/db/repl_block.cpp:115:  log(LL_ERROR) << "[yyy] clear _slave in " << " void reset()  _slave.clear()\n";
src/mongo/db/repl_block.cpp:126:  log(LL_ERROR) << "[yyy] ident:" << rid.toString(false, false) << " " << host << " " << ns << "  |   last:" << last;
src/mongo/db/repl/replset_commands.cpp:215:  log(LL_ERROR) << "[yyy] clear _slave in " << " CmdReplSetReconfig";
src/mongo/db/repl/rs_config.cpp:72:  log(LL_ERROR) << "[yyy] clear _slave in " << "saveConfigLocally" << rsLog;
src/mongo/db/oplog.cpp:50:  log(LL_ERROR) << "[yyy] clear _slave in " << " close database";
The tests confirmed the operations that reset _slaves: starting mongod, and changing the replica set configuration.
3) Putting the reasoning and the tests together: restarting a mongod makes it rebuild _slaves from the actual state of the set; otherwise, during normal operation, the only other trigger is a replica set reconfig. Since changing the configuration carries more risk, we decided to handle it by restarting the mongod.
4) Because the cluster takes heavy traffic, we played it safe and did this in the early hours: step the primary down - wait for the election and for sync - shut down the old primary - restart it - wait for it to catch up - switch the primary back - wait for the election and for sync - run moveChunk - success: { "millis" : 2977, "ok" : 1 }!

Follow-up: I have traced through a lot of MongoDB code over the past couple of days; more code analysis to come.















