Load Balancer in HBase 0.90


Stanislav Barton, whom I have been working with on the load balancer in HBase 0.90, asked for a document on how the load balancer works. This writeup touches on the internals of the load balancer and how it has evolved over time.

The Master code (including the load balancer) was rewritten for HBase 0.90.

When a region receives many writes and is split, the daughter regions are placed on the same region server as the parent region. Stan proposed changing this behavior, which I summarized in HBASE-3779.

 


HBASE-3586 had the load balancer move (mostly inactive) regions off an overloaded region server by randomly selecting the regions to offload. This guards against the potential problem of moving too many hot regions onto a region server that recently joined the cluster.

 
But this random selection isn't optimal. On Stan's cluster there are around 600 regions on each region server. When 30 new regions were created on the same region server, the random selector chose only 3 of the 30 new regions for reassignment; the rest of the selected regions were inactive (old) regions. This is expected behavior, because new and old regions were selected with equal probability.

Basically, we traded some optimality for the safety of not overloading a newly discovered region server.

So I continued the enhancement in HBASE-3609, where one of the goals is to remove randomness from the LoadBalancer so that we can deterministically produce near-optimal balancing actions.

If at least one region server joined the cluster just before the current balancing action, both new and old regions from overloaded region servers are moved onto the underloaded region servers. Otherwise, I find the new regions and place them on different underloaded servers. Previously, one underloaded server would be filled up before the next underloaded server was considered.

I also utilize a randomizer that shuffles the list of underloaded region servers. This way we avoid concentrating the offloaded regions on just a few region servers.
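
To make the selection order concrete, here is a minimal Java sketch (not the actual HBase code) of the idea: shed regions sit in a Guava MinMaxPriorityQueue ordered youngest-first, young and old regions are pulled alternately from the two ends of the queue when an empty server just joined, and targets are chosen round-robin from a shuffled list of underloaded servers. The Region/Server stand-in classes are invented for illustration.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.google.common.collect.MinMaxPriorityQueue;

public class BalancerSelectionSketch {
  // Hypothetical stand-ins for HRegionInfo / ServerName, used only in this sketch.
  static class Region { final long creationTime; Region(long t) { creationTime = t; } }
  static class Server { final String name; Server(String n) { name = n; } }

  // regionsToMove is ordered by region age: head = youngest (new) regions,
  // tail = oldest regions, mirroring the HBASE-3609 description above.
  static void assign(MinMaxPriorityQueue<Region> regionsToMove,
                     List<Server> underloadedServers,
                     boolean emptyServerJustJoined) {
    // Shuffle so offloaded regions are spread evenly across balancing calls.
    List<Server> targets = new ArrayList<Server>(underloadedServers);
    Collections.shuffle(targets);

    int i = 0;
    boolean fromHead = true;
    while (!regionsToMove.isEmpty() && !targets.isEmpty()) {
      Region r;
      if (emptyServerJustJoined) {
        // Alternate young/old regions so the new server does not get only young, hot regions.
        r = fromHead ? regionsToMove.pollFirst() : regionsToMove.pollLast();
        fromHead = !fromHead;
      } else {
        // Otherwise prefer the new (young) regions at the head of the queue.
        r = regionsToMove.pollFirst();
      }
      // Round-robin over underloaded servers instead of filling one server at a time.
      Server target = targets.get(i % targets.size());
      i++;
      System.out.println("move region created at " + r.creationTime + " to " + target.name);
    }
  }
}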

HBASE-3609 has been integrated into trunk as of Apr 18th, 2011.

HBASE-3704 would help users observe the distribution of regions. It is currently only in HBase trunk code.

Also related is HBASE-3373. Stan proposed making it more general: a new load balancer policy could balance the number of regions per region server per table, rather than the total number of regions across all tables.
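
A rough sketch of what such a per-table policy could look like: group the cluster state by table and run the balancing logic once per table instead of once over all regions. The groupByTable helper below is hypothetical and is not the actual HBASE-3373 patch; getTableNameAsString() is assumed to be available on HRegionInfo (older versions expose the table name through getTableDesc().getNameAsString()).

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;

public class PerTableGroupingSketch {
  // Split the global cluster state into one map per table, so the existing
  // balancing logic can be invoked once per table.
  static Map<String, Map<ServerName, List<HRegionInfo>>> groupByTable(
      Map<ServerName, List<HRegionInfo>> clusterState) {
    Map<String, Map<ServerName, List<HRegionInfo>>> byTable =
        new HashMap<String, Map<ServerName, List<HRegionInfo>>>();
    for (Map.Entry<ServerName, List<HRegionInfo>> e : clusterState.entrySet()) {
      for (HRegionInfo region : e.getValue()) {
        String table = region.getTableNameAsString();
        Map<ServerName, List<HRegionInfo>> tableState = byTable.get(table);
        if (tableState == null) {
          tableState = new HashMap<ServerName, List<HRegionInfo>>();
          byTable.put(table, tableState);
        }
        List<HRegionInfo> regions = tableState.get(e.getKey());
        if (regions == null) {
          regions = new ArrayList<HRegionInfo>();
          tableState.put(e.getKey(), regions);
        }
        regions.add(region);
      }
    }
    return byTable;
  }
}

The balancer would then run its balancing routine on each per-table map, so every table ends up evenly spread across the region servers.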

If you're interested in more detail, please take a look at the javadoc for LoadBalancer.balanceCluster(), reproduced below.

For HBase trunk, I implemented HBASE-3681 upon Jean-Daniel Cryans's request. For 0.90.2 and later, the default value of sloppiness is 0.
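
For reference, here is a minimal sketch of reading that threshold from configuration, assuming the hbase.regions.slop property name; a non-zero slop lets a server deviate from the average region count by that fraction before the balancer considers it over- or underloaded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class SlopCheckSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // "hbase.regions.slop" is the property name assumed here; the default of 0
    // mirrors the 0.90.2+ behavior mentioned above.
    float slop = conf.getFloat("hbase.regions.slop", 0f);
    System.out.println("Configured balancer sloppiness: " + slop);
  }
}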

I am planning the next generation of the load balancer, where a request histogram would play an important role in deciding which regions to move. Please take a look at HBASE-3679.

The HBaseWD project introduced multiple scanners for bucketed writes. I plan to accommodate this new feature through HBASE-3811, where additional attributes on the Scan object would allow the balancer to group the scanners generated by HBaseWD.
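
As a sketch of how such grouping might look on the client side, the snippet below attaches a made-up group attribute to one of the bucket scans; the attribute name "scan.group.id", the row keys, and the grouping convention are invented for illustration, not part of HBASE-3811 or HBaseWD.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanGroupingSketch {
  public static void main(String[] args) {
    // One of the several scans HBaseWD would generate for a bucketed key space.
    Scan scan = new Scan(Bytes.toBytes("00_rowkey-start"), Bytes.toBytes("00_rowkey-stop"));
    // Tag the scan with a group id so that anything observing the scans (e.g. the
    // balancer) can tell it belongs to the same logical query as its sibling
    // bucket scans.
    scan.setAttribute("scan.group.id", Bytes.toBytes("query-1234"));
  }
}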

HBASE-3943 is supposed to solve the problem where region reassignment disrupts a (potentially long-running) compaction.

HBASE-3945 tries to give regions more stability by not reassigning the same region(s) in consecutive balancing actions.

balanceCluster

public List<LoadBalancer.RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)
Generate a global load balancing plan according to the specified map of server information to the most loaded regions of each server. The load balancing invariant is that all servers are within 1 region of the average number of regions per server. If the average is an integer number, all servers will be balanced to the average. Otherwise, all servers will have either floor(average) or ceiling(average) regions.

HBASE-3609 modeled regionsToMove using Guava's MinMaxPriorityQueue so that we can fetch from both ends of the queue. At the beginning, we check whether an empty region server was just discovered by the Master. If so, we alternately choose new / old regions from the head / tail of regionsToMove, respectively. This alternation avoids clustering young regions on the newly discovered region server. Otherwise, we choose new regions from the head of regionsToMove.

Another improvement from HBASE-3609 is that we assign regions from regionsToMove to underloaded servers in round-robin fashion. Previously one underloaded server would be filled before we moved on to the next underloaded server, leading to clustering of young regions. Finally, we randomly shuffle the underloaded servers so that they receive offloaded regions relatively evenly across calls to balanceCluster().

The algorithm is currently implemented as follows:
  1. Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).
  2. Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions.

    Order the regions to move from most recent to least.

  3. Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN. If so, we end up with a number of regions still required to do so, neededRegions. It is also possible that we were able to fill each underloaded server but ended up with regions that were shed from overloaded servers and still have no assignment. If neither of these conditions holds (no regions needed to fill the underloaded servers, no regions left over from overloaded servers), we are done and return. Otherwise we handle these cases below.
  4. If neededRegions is non-zero (we still have underloaded servers), we iterate the most loaded servers again, shedding a single region from each (this brings them from having MAX regions to having MIN regions).
  5. We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers filling each to MIN.
  6. If we still have more regions that need assignment, again iterate the least loaded servers, this time giving each one (filling them to MAX) until we run out.
  7. All servers will now either host MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal number of regions possible are moved.
TODO: We can at-most reassign the number of regions away from a particular server to be how many they report as most loaded. Should we just keep all assignment in memory? Any objections? Does this mean we need HeapSize on HMaster? Or just careful monitor? (current thinking is we will hold all assignments in memory)

 

Parameters:
clusterState - Map of regionservers and their load/region information to a list of their most loaded regions
Returns:
a list of regions to be moved, including source and destination, or null if cluster is already balanced
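
To make the MIN/MAX shed-and-fill flow concrete, here is a simplified, hypothetical Java sketch of the core loop. It compresses steps 1-3, 5 and 6, skips the second shedding pass (step 4) and the youngest-first ordering, and replaces LoadBalancer.RegionPlan with a bare Move holder; it illustrates the invariant, not the actual LoadBalancer code.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;

public class BalanceClusterSketch {

  // Simplified stand-in for LoadBalancer.RegionPlan: move 'region' from 'source' to 'destination'.
  static class Move {
    final HRegionInfo region;
    final ServerName source;
    ServerName destination;           // filled in once the region is assigned
    Move(HRegionInfo region, ServerName source) {
      this.region = region;
      this.source = source;
    }
  }

  static List<Move> balance(Map<ServerName, List<HRegionInfo>> clusterState) {
    int totalRegions = 0;
    for (List<HRegionInfo> regions : clusterState.values()) {
      totalRegions += regions.size();
    }
    int servers = clusterState.size();
    int min = totalRegions / servers;                         // floor(average)
    int max = (totalRegions % servers == 0) ? min : min + 1;  // ceiling(average)

    // Step 2: shed regions from servers hosting more than MAX.
    Queue<Move> regionsToMove = new LinkedList<Move>();
    for (Map.Entry<ServerName, List<HRegionInfo>> e : clusterState.entrySet()) {
      List<HRegionInfo> regions = e.getValue();
      while (regions.size() > max) {
        // The real balancer orders shed regions youngest-first; here we just
        // take them from the end of the list.
        regionsToMove.add(new Move(regions.remove(regions.size() - 1), e.getKey()));
      }
    }

    // Steps 3/5: fill servers below MIN with the shed regions.
    List<Move> plan = new ArrayList<Move>();
    for (Map.Entry<ServerName, List<HRegionInfo>> e : clusterState.entrySet()) {
      while (e.getValue().size() < min && !regionsToMove.isEmpty()) {
        Move m = regionsToMove.poll();
        m.destination = e.getKey();
        e.getValue().add(m.region);
        plan.add(m);
      }
    }

    // Step 6: any leftover shed regions top servers up to MAX.
    for (Map.Entry<ServerName, List<HRegionInfo>> e : clusterState.entrySet()) {
      while (e.getValue().size() < max && !regionsToMove.isEmpty()) {
        Move m = regionsToMove.poll();
        m.destination = e.getKey();
        e.getValue().add(m.region);
        plan.add(m);
      }
    }

    return plan.isEmpty() ? null : plan;  // null mirrors "already balanced"
  }
}

Whenever enough regions are shed, every server ends up hosting either MIN or MAX regions, which is exactly the invariant stated above.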


Original post: http://zhihongyu.blogspot.com/2011/04/load-balancer-in-hbase-090.html
LoadBalancer javadoc: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/master/LoadBalancer.html
