Too many databases in a mongos sharded cluster causing service unavailability
Symptoms
Every so often mongos became unreachable, and the mongos log filled with socket errors:
2017-08-08T17:09:31.095+0800 I SHARDING [conn52677] couldn't find database [d4ee211b4024a2d9e6bd57d4a28679339d6b0692] in config db
2017-08-08T17:09:33.768+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:09:33 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:10:03.779+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:10:03 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:10:33.787+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:10:33 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:11:03.798+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:11:03 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:11:33.806+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:11:33 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:12:03.815+0800 I SHARDING [LockPinger] cluster 10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607 pinged successfully at Tue Aug 8 17:12:03 2017 by distributed lock pinger '10.10.219.96:27607,10.10.223.127:27607,10.10.236.43:27607/6a0a848974c9:27017:1502169957:1804289383', sleeping for 30000ms
2017-08-08T17:12:29.180+0800 I SHARDING [conn52677] put [d4ee211b4024a2d9e6bd57d4a28679339d6b0692] on: 044dd16e-7706-4a49-b4ff-73d86a99d6fd:044dd16e-7706-4a49-b4ff-73d86a99d6fd/10.10.131.20:27607,10.10.139.113:27607,10.10.140.40:27607
2017-08-08T17:12:29.181+0800 I SHARDING [conn52677] Exception thrown while processing query op for d4ee211b4024a2d9e6bd57d4a28679339d6b0692.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.59.129:30983]
2017-08-08T17:12:29.181+0800 I NETWORK [conn52677] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.10.59.129:30983]
2017-08-08T17:12:29.181+0800 I SHARDING [conn52678] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.237.57:16624]
2017-08-08T17:12:29.181+0800 I NETWORK [conn52678] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.10.237.57:16624]
2017-08-08T17:12:29.182+0800 I SHARDING [conn51881] Exception thrown while processing query op for 4773506c0c9ab0871c4593fd13250fa07d0d4e90.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.59.129:19704]
2017-08-08T17:12:29.182+0800 I NETWORK [conn51881] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.10.59.129:19704]
2017-08-08T17:12:29.182+0800 I SHARDING [conn47787] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.77.67:56779]
2017-08-08T17:12:29.182+0800 I SHARDING [conn46993] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.237.57:57994]
2017-08-08T17:12:29.182+0800 I SHARDING [conn46922] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.237.57:57898]
2017-08-08T17:12:29.182+0800 I SHARDING [conn44943] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.44.190:31490]
2017-08-08T17:12:29.182+0800 I SHARDING [conn46911] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.63.59:62378]
2017-08-08T17:12:29.183+0800 I SHARDING [conn46764] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.44.190:63754]
2017-08-08T17:12:29.183+0800 I SHARDING [conn46936] Exception thrown while processing query op for admin.$cmd :: caused by :: 9001 socket exception [SEND_ERROR] server [10.10.36.82:3261]
2017-08-08T17:12:29.183+0800 I NETWORK [conn47787] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.10.77.67:56779]
Troubleshooting Steps
1. Suspected the PHP driver. Connected to mongos from the shell at 1s intervals; when the fault recurred, the mongo shell indeed hung for a stretch (per the screenshot, the shell was stuck from 12:00 to 12:04), which ruled out the driver.
2. All monitoring metrics looked normal; CPU and I/O on the affected mongos were low.
3. Added 1s-interval connection-count monitoring and found that the connection count dropped sharply during each incident.
4. Added open-file-count and similar monitoring; nothing notable turned up.
5. Finally noticed that every incident coincided with log entries like the following, and guessed they were the cause:
2017-08-08T17:09:31.095+0800 I SHARDING [conn52677] couldn't find database [d4ee211b4024a2d9e6bd57d4a28679339d6b0692] in config db
2017-08-08T17:12:29.180+0800 I SHARDING [conn52677] put [d4ee211b4024a2d9e6bd57d4a28679339d6b0692] on: 044dd16e-7706-4a49-b4ff-73d86a99d6fd:044dd16e-7706-4a49-b4ff-73d86a99d6fd/10.10.131.20:27607,10.10.139.113:27607,10.10.140.40:27607
Detailed Analysis
The log entries in step 5 above correspond to creating a new database and then assigning it to a shard.
This mongos cluster holds roughly 20,000 databases across 5 shards, and databases are created frequently.
The official documentation has this to say:
The mongos selects the primary shard when creating a new database by picking the shard in the cluster that has the least amount of data. mongos uses the totalSize field returned by the listDatabases command as a part of the selection criteria.
In other words, when mongos creates a new database, it picks the shard that currently holds the least data.
Following the log hints into the mongo source code, database creation essentially boils down to calling the Shard class's pick method.
pick in turn calls getStatus() on every shard and iterates over the results to choose the shard with the smallest data size:
Shard Shard::pick(const Shard& current) {
    vector<Shard> all;
    staticShardInfo.getAllShards(all);
    if (all.size() == 0) {
        staticShardInfo.reload();
        staticShardInfo.getAllShards(all);
        if (all.size() == 0)
            return EMPTY;
    }
    // if current shard was provided, pick a different shard only if it is a better choice
    ShardStatus best = all[0].getStatus();
    if (current != EMPTY) {
        best = current.getStatus();
    }
    for (size_t i = 0; i < all.size(); i++) {
        ShardStatus t = all[i].getStatus();
        if (t < best)
            best = t;
    }
    LOG(1) << "best shard for new allocation is " << best << endl;
    return best.shard();
}
Next, getStatus(). It actually calls getShardDataSizeBytes to get each shard's data size; it also calls getShardMongoVersion (implemented via the serverStatus command) to ensure version consistency, which is not what we care about here.
ShardStatus Shard::getStatus() const {
    return ShardStatus(
        *this, getShardDataSizeBytes(getConnString()), getShardMongoVersion(getConnString()));
}
Looking at getShardDataSizeBytes, it simply runs listDatabases on the shard, which is essentially show dbs: the command returns every database along with its size.
long long Shard::getShardDataSizeBytes(const string& shardHost) {
    ScopedDbConnection conn(shardHost);
    BSONObj listDatabases;
    bool ok = conn->runCommand("admin", BSON("listDatabases" << 1), listDatabases);
    conn.done();
    uassert(28599,
            str::stream() << "call to listDatabases on " << shardHost
                          << " failed: " << listDatabases,
            ok);
    BSONElement totalSizeElem = listDatabases["totalSize"];
    uassert(28590, "totalSize field not found in listDatabases", totalSizeElem.isNumber());
    return listDatabases["totalSize"].numberLong();
}
Conclusion
There are simply too many databases under this mongos. Creating a new database takes a global lock and then runs show dbs (listDatabases) against every shard in turn to find the one with the least data; with this many databases that step is far too slow, rendering the mongos service unavailable. A quick test showed that show dbs on this mongos takes more than a minute, which matches the duration of each outage.