Elasticsearch Source Code Analysis: Shard Allocation (Part 10)

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 04:16

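Shards

What is a shard

Sharding splits an index's data into multiple smaller index blocks that can be distributed across different nodes of the same cluster. At search time, the result is the merge of the results from every shard of the index. This is similar to splitting databases and tables in a relational database.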

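Why shard

1. Sharding improves read and write performance and enables load balancing.
2. Replicas are easy to scale out, and backup and recovery are fast.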

How to shard

Sharding (also called partitioning) is a classic problem in distributed systems. Common sharding methods:

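Method | Description | Advantages | Disadvantages
Simple hash | Hash the key, then take the remainder (modulo) | Simple | Hard to scale out; prone to data skew
Range-based | Data is classified by the range it falls in; each partition can split dynamically | Easy to scale out | The metadata service is complex to maintain and easily becomes a bottleneck
Size-based | Data is distributed by block size | Independent of data content; no data skew; easy to scale out | Metadata is complex to maintain
Consistent hash | Data and nodes are hashed and matched along a ring; virtual nodes | Easy to scale out | -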

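ES uses simple hash. By default it hashes the _id field, and a custom routing value can be specified instead; the target shard is the hash value modulo the number of primary shards. Documents whose hash yields the same result are placed in the same shard.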
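The routing rule described above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, and String.hashCode() stands in for the real hash function (Elasticsearch itself uses a Murmur3 hash of the routing value).

```java
// Hypothetical sketch of simple hash routing: shard = hash(routing) mod number of primary shards.
final class SimpleHashRouting {

    // String.hashCode() is only a stand-in for the real hash function.
    static int shardFor(String routing, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(routing.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int shards = 5;
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + shardFor(id, shards));
        }
    }
}
```

Because the shard number depends on the modulo, changing the number of primary shards would send existing documents to different shards, which is the "hard to scale out" drawback of simple hash.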

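Shard allocation

What is shard allocation

Once an index has been split into shards, those shards are distributed across the nodes of the cluster. The process of assigning shards to nodes is shard allocation. Its guiding principles are the same as above: improve read and write performance, balance load, and keep backup and recovery fast.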

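How shards are allocated

When allocation happens

This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.

Creating or deleting an index
Adding or removing a node
A reroute operation
Changing the replica setting
The initial recovery process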

Each of these triggers ends in a call to AllocationService.reroute, which is where shard allocation actually starts.

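Allocation rules

The allocation rules in ES fall into the following categories:
I. Load-balancing rules, i.e. rules motivated by load balancing. Common ones: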

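SameShardAllocationDecider: this decider does not allow copies of the same shard (primary or replica) to be placed on the same node; it overrides canAllocate. It also covers the case of several ES instances on one physical machine (for example, ES running in several virtual machines hosted on the same physical machine), which is enabled with cluster.routing.allocation.same_shard.host: true (default false). The check is based on host name and host address.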
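ShardsLimitAllocationDecider: limits the number of shards on a single node. It can limit the number of shards of one index on a node and the total number of shards on a node, through index.routing.allocation.total_shards_per_node and cluster.routing.allocation.total_shards_per_node respectively. Both limits are checked, and an allocation is rejected if either one would be exceeded. They can be set in elasticsearch.yml or changed at runtime with the update settings API. The default is -1, meaning no limit. Note that lowering the value forces the cluster to reallocate shards, which puts extra load on the cluster while it rebalances.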
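AwarenessAllocationDecider: allocation awareness. It can take servers, racks and so on into account and spreads shard copies across them as far as possible. There are two kinds of settings. First kind, for example: set the grouping attribute with cluster.routing.allocation.awareness.attributes: rack_id, start one node with node.attr.rack_id: 1 and another node (on a different rack) with node.attr.rack_id: 2; shard copies are then spread across the different rack_id values as far as possible. Second kind (forced awareness), for example: cluster.routing.allocation.awareness.attributes: zone together with cluster.routing.allocation.awareness.force.zone.values: zone1,zone2. If the zone1 machines cannot hold all the shards and zone2 has not started, the remaining unassigned shards are not allocated (doing so would overload zone1) until zone2 has started.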
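The same-shard rule from the list above can be sketched as follows. This is a simplified model under stated assumptions, not Elasticsearch's real classes: SameShardSketch, Copy and the sameHostCheck flag are hypothetical names, and the host comparison is reduced to a single string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the same-shard rule; not Elasticsearch's real classes.
final class SameShardSketch {

    // One shard copy (primary or replica) that is already placed on a node.
    static final class Copy {
        final String shardId;
        final String nodeId;
        final String host;

        Copy(String shardId, String nodeId, String host) {
            this.shardId = shardId;
            this.nodeId = nodeId;
            this.host = host;
        }
    }

    // Returns true if another copy of shardId may be placed on the target node.
    static boolean canAllocate(String shardId, String targetNode, String targetHost,
                               List<Copy> placed, boolean sameHostCheck) {
        for (Copy c : placed) {
            if (!c.shardId.equals(shardId)) {
                continue; // copies of other shards are irrelevant to this rule
            }
            if (c.nodeId.equals(targetNode)) {
                return false; // the node already holds a copy of this shard
            }
            if (sameHostCheck && c.host.equals(targetHost)) {
                return false; // another node on the same host holds a copy
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Copy> placed = Arrays.asList(new Copy("idx[0]", "node-1", "host-a"));
        System.out.println(canAllocate("idx[0]", "node-1", "host-a", placed, false)); // false
        System.out.println(canAllocate("idx[0]", "node-2", "host-a", placed, false)); // true
        System.out.println(canAllocate("idx[0]", "node-2", "host-a", placed, true));  // false
    }
}
```

The third call shows the effect of the same-host option: node-2 is a different node, but it shares host-a with the existing copy, so the copy is rejected.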

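II. Concurrency rules

ConcurrentRebalanceAllocationDecider: controls the number of concurrent rebalance operations through cluster.routing.allocation.cluster_concurrent_rebalance. The setting can be changed at runtime, defaults to 2, and -1 means unlimited concurrency.
ThrottlingAllocationDecider: controls how many shards recover concurrently during recovery. Its settings can be changed dynamically: cluster.routing.allocation.node_initial_primaries_recoveries defaults to 4 and is the number of initial primary shard recoveries allowed on a single node; cluster.routing.allocation.node_concurrent_recoveries defaults to 2 and limits the number of concurrent recoveries on a single node.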
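The two concurrency limits above boil down to comparing a counter against a configured maximum. A minimal sketch, assuming hypothetical names (ConcurrencyLimitSketch, canRebalance, canRecover) and a reduced two-value decision:

```java
// Hypothetical sketch of the two concurrency limits; not Elasticsearch's real classes.
final class ConcurrencyLimitSketch {

    enum Decision { YES, THROTTLE }

    // Rebalance limit: -1 means unlimited, otherwise throttle once the limit is reached.
    static Decision canRebalance(int relocatingShards, int clusterConcurrentRebalance) {
        if (clusterConcurrentRebalance == -1) {
            return Decision.YES;
        }
        return relocatingShards >= clusterConcurrentRebalance ? Decision.THROTTLE : Decision.YES;
    }

    // Recovery limit: throttle when the node already runs as many recoveries as allowed.
    static Decision canRecover(int recoveriesOnNode, int nodeConcurrentRecoveries) {
        return recoveriesOnNode >= nodeConcurrentRecoveries ? Decision.THROTTLE : Decision.YES;
    }

    public static void main(String[] args) {
        System.out.println(canRebalance(1, 2));   // YES
        System.out.println(canRebalance(2, 2));   // THROTTLE
        System.out.println(canRebalance(50, -1)); // YES
        System.out.println(canRecover(2, 2));     // THROTTLE
    }
}
```

A throttled shard is not rejected for good; it simply waits until a later reroute finds a free slot.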

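III. Conditional restriction rules

FilterAllocationDecider: controls which nodes a shard may be allocated to through require, include and exclude settings (dynamically updatable). Settings: index.routing.allocation.require.{attribute}, index.routing.allocation.include.{attribute}, index.routing.allocation.exclude.{attribute}, cluster.routing.allocation.require.{attribute}, cluster.routing.allocation.include.{attribute}, cluster.routing.allocation.exclude.{attribute}. Here require means the node must match, include means it is allowed, and exclude means it is forbidden. Note that cluster settings take precedence over index settings: if the index settings would allow a shard on a node but the cluster settings do not, the shard is not allowed on that node. Filters are applied in the order require, include, exclude.
ReplicaAfterPrimaryActiveAllocationDecider: ensures that replica shards are allocated only after their primary shard has been allocated and is active.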
RebalanceOnlyWhenActiveAllocationDecider: allows a shard to be rebalanced only when all copies of that shard are active.
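ClusterRebalanceAllocationDecider: decides whether rebalancing may run based on the active state of the shards, using the setting cluster.routing.allocation.allow_rebalance (a dynamic cluster setting). Values: (1) indices_all_active, the default: rebalancing is allowed only when all shards in the cluster, primaries and replicas, are active. (2) indices_primaries_active: rebalancing is allowed as soon as all primary shards are active. (3) always: rebalancing is always allowed, even when primaries and replicas have not yet been allocated.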
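DiskThresholdDecider: decides whether to allocate based on disk usage thresholds. The feature is switched on and off with cluster.routing.allocation.disk.threshold_enabled. cluster.routing.allocation.disk.watermark.low accepts a percentage or an absolute value and controls when shards may still be allocated to a node: with a value of 0.7, new shards are allocated to a node only while its disk usage is below 70%. cluster.routing.allocation.disk.watermark.high accepts a percentage or an absolute value and controls when shards must be moved to other nodes: with a value of 0.85, once disk usage on a node exceeds 85%, Elasticsearch relocates shards from that node to other nodes. Early releases shipped with the feature disabled and defaults of 0.7 and 0.85; later releases, including 5.x, enable it by default with 85% and 90%. The settings can be placed in the yml file or changed dynamically through the API.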
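The low and high watermark logic can be sketched as a pair of comparisons on the used-disk fraction. This is a minimal illustration, assuming hypothetical names (DiskWatermarkSketch, Action, check) and using 0.7 and 0.85 only as example thresholds; the real decider also handles absolute byte values and shard sizes.

```java
// Hypothetical sketch of watermark checks on the used-disk fraction of one node.
final class DiskWatermarkSketch {

    enum Action { ALLOCATE_OK, NO_NEW_SHARDS, MOVE_SHARDS_AWAY }

    static Action check(long totalBytes, long freeBytes, double lowWatermark, double highWatermark) {
        // Watermarks are expressed as fractions of used disk space.
        double used = 1.0 - (double) freeBytes / totalBytes;
        if (used >= highWatermark) {
            return Action.MOVE_SHARDS_AWAY; // above the high watermark: relocate shards elsewhere
        }
        if (used >= lowWatermark) {
            return Action.NO_NEW_SHARDS;    // above the low watermark: stop allocating here
        }
        return Action.ALLOCATE_OK;
    }

    public static void main(String[] args) {
        System.out.println(check(100, 50, 0.7, 0.85)); // ALLOCATE_OK (50% used)
        System.out.println(check(100, 25, 0.7, 0.85)); // NO_NEW_SHARDS (75% used)
        System.out.println(check(100, 10, 0.7, 0.85)); // MOVE_SHARDS_AWAY (90% used)
    }
}
```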

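The Java classes for all three categories of rules extend the abstract class AllocationDecider (AllocationDeciders, plural, is the composite that holds them and asks each one in turn). A decider produces a decision about shard placement, represented by the Decision class, which has four predefined values: ALWAYS, YES, NO and THROTTLE. AllocationDecider defines canRebalance (whether the given shard routing may be rebalanced), canAllocate (whether the given shard routing may be allocated to the given node) and canRemain (whether the given shard routing may stay on the given node). By default every method returns ALWAYS.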
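The decider pattern described above can be sketched in a few lines. This is a simplified model, not the Elasticsearch classes: DeciderSketch, Decider, fixed and the String-based shard and node arguments are hypothetical, and the combination rule (NO wins, then THROTTLE, then YES) is a reduced version of how several decisions are merged.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the decider pattern; not Elasticsearch's real classes.
final class DeciderSketch {

    enum Decision { ALWAYS, YES, THROTTLE, NO }

    // Base class: every check answers ALWAYS unless a subclass overrides it.
    static abstract class Decider {
        Decision canAllocate(String shard, String node) { return Decision.ALWAYS; }
        Decision canRemain(String shard, String node) { return Decision.ALWAYS; }
        Decision canRebalance(String shard) { return Decision.ALWAYS; }
    }

    // Helper for the demo: a decider whose canAllocate always returns the given decision.
    static Decider fixed(final Decision decision) {
        return new Decider() {
            @Override
            Decision canAllocate(String shard, String node) { return decision; }
        };
    }

    // Asks every decider in turn: NO wins immediately, THROTTLE beats YES and ALWAYS.
    static Decision canAllocate(List<Decider> deciders, String shard, String node) {
        Decision result = Decision.ALWAYS;
        for (Decider d : deciders) {
            Decision one = d.canAllocate(shard, node);
            if (one == Decision.NO) {
                return Decision.NO;
            }
            if (one == Decision.THROTTLE) {
                result = Decision.THROTTLE;
            } else if (one == Decision.YES && result == Decision.ALWAYS) {
                result = Decision.YES;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Decider> deciders = Arrays.asList(fixed(Decision.YES), fixed(Decision.THROTTLE));
        System.out.println(canAllocate(deciders, "idx[0]", "node-1")); // THROTTLE
    }
}
```

Defaulting every method to ALWAYS lets each concrete decider override only the check it cares about.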

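Executing the allocation

The core logic uses the rules above together with shard weights (index level and cluster level) to decide where each shard belongs, then moves the data, initializes and starts the shard once the move has finished, and finally updates the cluster state to complete the allocation.

1. The entry point is AllocationService.reroute, which builds a RoutingAllocation from the cluster state. This object holds the current shard allocation state of the cluster, the deciders, node information and so on, and is the main working object for the rest of the allocation process.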

    protected ClusterState reroute(final ClusterState clusterState, String reason, boolean debug) {
        RoutingNodes routingNodes = getMutableRoutingNodes(clusterState);
        // shuffle the unassigned nodes, just so we won't have things like poison failed shards
        routingNodes.unassigned().shuffle();
        RoutingAllocation allocation = new RoutingAllocation(allocationDeciders, routingNodes, clusterState,
            clusterInfoService.getClusterInfo(), currentNanoTime(), false);
        allocation.debugDecision(debug);
        reroute(allocation);
        if (allocation.routingNodesChanged() == false) {
            return clusterState;
        }
        return buildResultAndLogHealthChange(clusterState, allocation, reason);
    }

2. The real reroute logic then runs. If there are unassigned shards, gatewayAllocator.allocateUnassigned is called. The gatewayAllocator's work is split between primaryShardAllocator and replicaShardAllocator:

    primaryShardAllocator.allocateUnassigned(allocation);
    replicaShardAllocator.processExistingRecoveries(allocation);
    replicaShardAllocator.allocateUnassigned(allocation);

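3. Shard allocation proper is performed by BalancedShardsAllocator.allocate(allocation). Based on a WeightFunction, this class reworks which shards each node in the cluster holds. The allocate method has three main steps: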

    final Balancer balancer = new Balancer(logger, allocation, weightFunction, threshold);
    balancer.allocateUnassigned();
    balancer.moveShards();
    balancer.balance();

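Step 1 is allocateUnassigned: using the WeightFunction and all AllocationDeciders, each unassigned shard is assigned to the node with the minimum weight that the deciders accept.
Step 2 is moveShards: after step 1, the shards that need to move are relocated. During the move the source shard is in the RELOCATING state, and the copy on the target node starts in the INITIALIZING state.
Step 3 is balance: rebalancing essentially moves shards from heavily loaded nodes to lightly loaded ones.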
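The idea of picking the minimum-weight node can be sketched as follows. This is an illustration under stated assumptions: the weight is taken to be a weighted sum of how far a node is above the average total shard count and above the average shard count for the index in question; the names WeightSketch, weight and pickNode are hypothetical, the factors 0.45 and 0.55 are example values, and the deciders are left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of weight-based node selection; deciders are omitted.
final class WeightSketch {

    // Lower weight means the node is a better target for a shard of the index.
    static double weight(int shardsOnNode, double avgShardsPerNode,
                         int indexShardsOnNode, double avgIndexShardsPerNode,
                         double shardFactor, double indexFactor) {
        double sum = shardFactor + indexFactor;
        return (shardFactor / sum) * (shardsOnNode - avgShardsPerNode)
             + (indexFactor / sum) * (indexShardsOnNode - avgIndexShardsPerNode);
    }

    // counts maps node name -> {total shards on node, shards of the index on node}.
    static String pickNode(Map<String, int[]> counts, double shardFactor, double indexFactor) {
        double totalAll = 0;
        double totalIndex = 0;
        for (int[] c : counts.values()) {
            totalAll += c[0];
            totalIndex += c[1];
        }
        double avgAll = totalAll / counts.size();
        double avgIndex = totalIndex / counts.size();

        String best = null;
        double bestWeight = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            double w = weight(e.getValue()[0], avgAll, e.getValue()[1], avgIndex,
                              shardFactor, indexFactor);
            if (w < bestWeight) {
                bestWeight = w;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        counts.put("node-a", new int[] {4, 2});
        counts.put("node-b", new int[] {2, 1});
        counts.put("node-c", new int[] {3, 0});
        System.out.println(pickNode(counts, 0.45, 0.55)); // prints node-c
    }
}
```

In this example node-b holds the fewest shards overall, but node-c holds none of the index's shards, so with the index factor weighted slightly higher node-c has the lowest weight.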

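There are still many parts of the allocation logic that I have not fully understood, so some details are not described in depth. I will add more once I understand them, and corrections are welcome.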
