The Definitive Guide to Hystrix: Hystrix Isolation Strategies

Source: Internet; posted by: 1024为熟知端口; editor: 程序博客网; time: 2024/06/03 13:33

Isolating services

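A machine's threads, memory, and other resources are finite, and once they are exhausted the system is not far from being dragged down. Calls to network resources are especially risky: the network is unreliable and so is the dependency behind it, and either can introduce latency. In a high-concurrency, high-traffic internet system, a single slow dependency can quickly tie up all of the system's threads and memory, leaving nothing for other services and possibly taking the whole system down.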

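Hystrix provides an isolation mechanism for access to resources and gives each dependency a bounded share of them. If one dependency becomes slow, only the resources assigned to it are tied up, and the resources other services rely on are unaffected. The system stays stable and keeps serving everything else.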

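Hystrix offers two isolation strategies: thread pool isolation and semaphore isolation. Both limit the number of concurrent accesses to a shared resource. With threads, every transition between the ready, running, blocked, and terminated states is scheduled by the operating system, which carries a noticeable performance cost. With a semaphore, the caller performs a tryAcquire before touching the shared resource and proceeds only if the tryAcquire succeeds.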

Thread pool isolation

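Client code (libraries, network calls, and so on) runs on separate threads. This isolates it from the calling thread (the Tomcat thread pool), so the caller can walk away from a dependency call that takes too long.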
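The idea of walking away from a slow call can be illustrated without Hystrix at all. The following is a minimal JDK-only sketch, not Hystrix code: the class name IsolatedDependency and its call method are hypothetical, and it assumes that a timeout or failure should simply produce a fallback value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Hystrix): one small pool per dependency.
final class IsolatedDependency {
    private final ExecutorService pool;

    IsolatedDependency(int threads) {
        // Daemon threads, so a stuck call never keeps the JVM alive.
        this.pool = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
    }

    // Runs the work on the dependency's own pool; the caller waits at most timeoutMs.
    <T> T call(Callable<T> work, long timeoutMs, T fallback) {
        Future<T> future = pool.submit(work);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller moves on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the dependency failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

A slow dependency can only occupy the threads of its own pool; the calling thread gets its answer, or the fallback, within the timeout.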

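Thread pool isolation is typically used to isolate different lines of business from one another so that they cannot affect each other. As with any Hystrix command, you extend HystrixCommand and override run() with the business logic.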

protected HelloCommandIsolateThreadPool(String name) {
    super(HystrixCommand.Setter
            // GroupKey: used to group commands on the dashboard
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("hello"))
            // CommandKey: identifies this command (metrics, circuit breaker, properties)
            .andCommandKey(HystrixCommandKey.Factory.asKey("hello" + name))
            // ThreadPoolKey: selects the thread pool and appears in its thread names;
            // when omitted it defaults to the group key
            .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("hello$Pool" + name))
            // thread pool settings
            .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                    .withCoreSize(15)
                    .withMaxQueueSize(10)
                    .withQueueSizeRejectionThreshold(2))
            // command settings
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    // enable the circuit breaker
                    .withCircuitBreakerEnabled(true)
                    // bulkhead isolation strategy
                    .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD)
                    // how long the circuit stays open before a trial request is allowed
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)));
}

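The key line is .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD). THREAD is already the default for HystrixCommand (HystrixObservableCommand defaults to SEMAPHORE), but setting it explicitly documents the intent; under SEMAPHORE isolation the thread pool settings above would have no effect. With the circuit breaker enabled and default settings, if at least 50% of requests fail within the 10-second rolling window (and the window has seen at least 20 requests), the circuit opens and subsequent requests fail immediately. This is the Fast Fail strategy. withCircuitBreakerSleepWindowInMilliseconds(5000) means that 5 seconds after the circuit opens, one trial request is let through, and the circuit closes again if that request succeeds.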
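The open, trial, and close cycle is easier to follow in code. The sketch below is a deliberately simplified breaker written against the JDK only; it is not Hystrix's implementation. MiniCircuitBreaker and all of its members are hypothetical names, it counts calls since the last reset instead of keeping a 10-second rolling window, and it takes its clock as a parameter so the behavior can be checked without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified breaker: counts since the last reset instead of
// Hystrix's rolling statistics window, to keep the sketch short.
final class MiniCircuitBreaker {
    private final int requestVolumeThreshold; // minimum calls before tripping
    private final int errorThresholdPercent;  // e.g. 50
    private final long sleepWindowMs;         // e.g. 5000
    private final LongSupplier clock;         // time source in milliseconds
    private int total;
    private int failed;
    private boolean open;
    private long openedOrLastTestedAt;

    MiniCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercent,
                       long sleepWindowMs, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercent = errorThresholdPercent;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - openedOrLastTestedAt >= sleepWindowMs) {
            openedOrLastTestedAt = now; // one trial request per sleep window
            return true;
        }
        return false; // fast fail
    }

    synchronized void recordSuccess() {
        if (open) {   // the trial request succeeded: close and start over
            open = false;
            total = 0;
            failed = 0;
        } else {
            total++;
        }
    }

    synchronized void recordFailure() {
        if (open) {
            return;   // the trial request failed: stay open
        }
        total++;
        failed++;
        if (total >= requestVolumeThreshold
                && failed * 100 >= errorThresholdPercent * total) {
            open = true;
            openedOrLastTestedAt = clock.getAsLong();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }
}
```

A caller would check allowRequest() before invoking the dependency and report the outcome with recordSuccess() or recordFailure().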

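Internally, thread pools are stored in a ConcurrentHashMap whose key is the threadPoolKey name and whose value is the thread pool. The ThreadPoolKey value is also used in the names of the pool's threads.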

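To avoid repeatedly creating threads while the system is running only to destroy them a little later, the Hystrix version discussed here keeps the maximum pool size equal to the core size, so coreSize is the only size parameter you need to set.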

Use thread pool isolation between different lines of business to reduce the chance of one affecting another. To select thread pool isolation:

.withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD)

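Internally, Hystrix decides the isolation strategy from properties.executionIsolationStrategy().get(). For example, getRunObservableDecoratedForMetricsAndErrorHandling first checks whether thread pool isolation is configured; if so it obtains the thread pool, otherwise it takes the semaphore path. With thread pool isolation you also need to configure the pool, for example: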

the pool name, via andThreadPoolKey,
coreSize (core pool size),
keepAliveTimeMinutes (thread keep-alive time),
maxQueueSize (maximum queue length),
queueSizeRejectionThreshold (queue size at which tasks are rejected), and so on.

.andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
        .withCoreSize(resourcesManager.getThreadPoolProperties(platformProtocol.getAppId()).getCoreSize())
        // note: the setter expects minutes while the getter is named getKeepAliveSeconds;
        // convert the value if the units really differ
        .withKeepAliveTimeMinutes(resourcesManager.getThreadPoolProperties(platformProtocol.getAppId()).getKeepAliveSeconds())
        .withMaxQueueSize(resourcesManager.getThreadPoolProperties(platformProtocol.getAppId()).getMaxQueueSize())
        .withQueueSizeRejectionThreshold(resourcesManager.getThreadPoolProperties(platformProtocol.getAppId()).getQueueSizeRejectionThreshold()))

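The threadPoolKey also appears in the names of the pool's threads, which always start with the prefix hystrix. In Hystrix the core and maximum pool sizes are the same, which avoids the cost of creating and destroying threads on demand. The default values of all thread pool parameters are defined in HystrixThreadPoolProperties. Two of them deserve a closer look: queueSizeRejectionThreshold and maxQueueSize.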

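queueSizeRejectionThreshold defaults to 5. It is the number of tasks allowed to wait in the queue before new tasks are rejected, and it has no effect when maxQueueSize is -1.
maxQueueSize defaults to -1 and sets the queue size; with -1 a SynchronousQueue is used, so nothing is queued. For a Fast Fail application, keep the default: once the pool is saturated, further tasks are rejected immediately instead of waiting.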

The corresponding code in the HystrixThreadPool class:

@Override
public boolean isQueueSpaceAvailable() {
    if (queueSize <= 0) {
        // we don't have a queue so we won't look for space but instead
        // let the thread-pool reject or not
        return true;
    } else {
        return threadPool.getQueue().size() < properties.queueSizeRejectionThreshold().get();
    }
}
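How the two queue parameters interact can be reproduced with a plain ThreadPoolExecutor. The following sketch rests on assumptions: BoundedPool is a hypothetical class, and the IntSupplier merely stands in for a dynamically readable property. The queue itself is sized once at construction, while the rejection threshold is re-read on every check.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hypothetical sketch (not Hystrix code) of a pool with a soft rejection threshold.
final class BoundedPool {
    private final int maxQueueSize;
    private final IntSupplier rejectionThreshold; // re-read on every check
    final ThreadPoolExecutor executor;

    BoundedPool(int coreSize, int maxQueueSize, IntSupplier rejectionThreshold) {
        this.maxQueueSize = maxQueueSize;
        this.rejectionThreshold = rejectionThreshold;
        // The queue is fixed at construction time and cannot be resized afterwards.
        BlockingQueue<Runnable> queue = maxQueueSize <= 0
                ? new SynchronousQueue<Runnable>()
                : new LinkedBlockingQueue<Runnable>(maxQueueSize);
        // Core size and maximum size are deliberately identical.
        this.executor = new ThreadPoolExecutor(
                coreSize, coreSize, 1, TimeUnit.MINUTES, queue);
    }

    boolean isQueueSpaceAvailable() {
        if (maxQueueSize <= 0) {
            // No queue: leave it to the executor to reject when saturated.
            return true;
        }
        return executor.getQueue().size() < rejectionThreshold.getAsInt();
    }
}
```

A caller would test isQueueSpaceAvailable() before submitting and go straight to the fallback when it returns false.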

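Once a thread pool has been created its parameters are not changed; the pool is stored in a static ConcurrentHashMap keyed by the threadPoolKey name. queueSizeRejectionThreshold, by contrast, is read from the properties each time a command checks for queue space, so it can take effect while the system is running. The thread pool parameters are kept in the HystrixThreadPool class file, while the factory method getThreadPool lives in HystrixConcurrencyStrategy. As getThreadPool shows, threads are named hystrix-threadPoolKey-threadNumber: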

@Override
public Thread newThread(Runnable r) {
    Thread thread = new Thread(r, "hystrix-" + threadPoolKey.name() + "-" + threadNumber.incrementAndGet());
    thread.setDaemon(true);
    return thread;
}
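The fragment above is an anonymous ThreadFactory taken out of context. A self-contained version might look like the following sketch, where NamedDaemonThreadFactory is a hypothetical class name and a plain String stands in for the HystrixThreadPoolKey.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone version of the naming scheme shown above.
final class NamedDaemonThreadFactory implements ThreadFactory {
    private final String poolKey;
    private final AtomicInteger threadNumber = new AtomicInteger(0);

    NamedDaemonThreadFactory(String poolKey) {
        this.poolKey = poolKey;
    }

    @Override
    public Thread newThread(Runnable r) {
        // Name pattern: hystrix-<poolKey>-<sequence number starting at 1>
        Thread thread = new Thread(r,
                "hystrix-" + poolKey + "-" + threadNumber.incrementAndGet());
        thread.setDaemon(true); // never keeps the JVM alive on its own
        return thread;
    }
}
```

Such a factory can be passed to any ThreadPoolExecutor constructor, which makes the pool easy to spot in thread dumps.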

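In the constructor of the HystrixThreadPool implementation class, the HystrixConcurrencyStrategy instance is obtained through HystrixPlugins, so a custom strategy can be registered through HystrixPlugins. How to use HystrixPlugins is covered in a later chapter.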

Creating the thread pool

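As mentioned earlier, most classes inside Hystrix are used as single instances, and the thread pool is no exception: there is one instance per key, and dependencies that share a threadPoolKey must share the same pool. The pools are therefore kept in a static map keyed by the threadPoolKey name, and because access must be thread-safe, Hystrix uses a ConcurrentHashMap. (Why not Hashtable? It synchronizes every operation on the whole table; further details are easy to find online.) The pool is created in getInstance, a method of the inner class Factory in the HystrixThreadPool class file.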

/* package */ final static ConcurrentHashMap<String, HystrixThreadPool> threadPools = new ConcurrentHashMap<String, HystrixThreadPool>();

// body of Factory.getInstance:
String key = threadPoolKey.name();
// this should find it for all but the first time
HystrixThreadPool previouslyCached = threadPools.get(key);
if (previouslyCached != null) {
    return previouslyCached;
}
// if we get here this is the first time so we need to initialize
synchronized (HystrixThreadPool.class) {
    if (!threadPools.containsKey(key)) {
        threadPools.put(key, new HystrixThreadPoolDefault(threadPoolKey, propertiesBuilder));
    }
}
return threadPools.get(key);
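The same one-pool-per-key guarantee can be expressed more compactly on Java 8 and later with ConcurrentHashMap.computeIfAbsent. This is an alternative sketch, not what the Hystrix source does; PoolRegistry is a hypothetical class, and a fixed-size ExecutorService stands in for HystrixThreadPool.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical registry: at most one pool per key, created lazily.
final class PoolRegistry {
    private static final ConcurrentHashMap<String, ExecutorService> POOLS =
            new ConcurrentHashMap<>();

    private PoolRegistry() {
    }

    static ExecutorService getInstance(String threadPoolKey, int coreSize) {
        // computeIfAbsent runs the factory at most once per key, even when
        // several threads ask for the same key at the same time.
        return POOLS.computeIfAbsent(threadPoolKey,
                key -> Executors.newFixedThreadPool(coreSize));
    }
}
```

As in the original code, the settings of the first caller win: a later call with the same key and a different coreSize gets the pool that already exists.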

Using the thread pool

HystrixCommand.execute() calls queue() internally, and queue() calls toObservable() on the parent class AbstractCommand. After handling request caching, toObservable() hands off to getRunObservableDecoratedForMetricsAndErrorHandling, which sets up a series of executionHook callbacks and then delegates to getExecutionObservableWithLifecycle. That method obtains the executor through getExecutionObservable(), an abstract method implemented in the subclasses HystrixCommand and HystrixObservableCommand. Here is the implementation in HystrixCommand:

final protected Observable<R> getExecutionObservable() {
    return Observable.create(new OnSubscribe<R>() {
        @Override
        public void call(Subscriber<? super R> s) {
            try {
                s.onNext(run());
                s.onCompleted();
            } catch (Throwable e) {
                s.onError(e);
            }
        }
    });
}

The call method is where the actual business logic, run(), is executed.
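Stripped of RxJava, hooks, caching, and metrics, the chain from execute() through queue() to run() can be approximated with a Future. The sketch below is an analogy built on the JDK, not the Hystrix implementation; MiniCommand is a hypothetical class, and falling back on any failure is an assumption made to keep it short.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical miniature of the command pattern: execute() -> queue() -> run().
abstract class MiniCommand<R> {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Business logic, supplied by the subclass.
    protected abstract R run() throws Exception;

    // Degraded result; subclasses override this to provide one.
    protected R getFallback() {
        throw new UnsupportedOperationException("No fallback available.");
    }

    // Asynchronous entry point: run() executes on a pool thread.
    Future<R> queue() {
        Callable<R> task = this::run;
        return POOL.submit(task);
    }

    // Synchronous entry point: queue the command, then wait for the result.
    R execute() {
        try {
            return queue().get();
        } catch (ExecutionException e) {
            return getFallback(); // run() threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return getFallback();
        }
    }
}
```

The synchronous call is just the asynchronous one followed by a blocking get(), which mirrors how execute() is layered on queue().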

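Advantages of thread isolation:

Threads fully isolate third-party code, and the request thread can return quickly.
When a failed dependency becomes healthy again, the thread pool clears up and is available again immediately, rather than going through a long recovery.
Calls can be made fully asynchronous, which makes asynchronous programming easier.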

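Disadvantages of thread isolation:

The main drawback of thread pools is the added computational overhead: each command execution involves queueing (SynchronousQueue is the default, which avoids queueing), scheduling, and context switching.
Code that relies on thread state, such as ThreadLocal, becomes more complex because that state has to be passed along and cleaned up by hand.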
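The ThreadLocal drawback can be made concrete. A value set on the request thread is not visible on a pool thread unless it is captured and restored explicitly. The sketch below shows one manual way to do that with the JDK only; ContextPropagation, REQUEST_ID, and withRequestId are hypothetical names used for illustration.

```java
import java.util.concurrent.Callable;

// Hypothetical helper that carries one ThreadLocal value across threads.
final class ContextPropagation {
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    private ContextPropagation() {
    }

    // Captures the caller's value now; installs it on the worker thread for
    // the duration of the task and restores the previous state afterwards.
    static <T> Callable<T> withRequestId(Callable<T> task) {
        final String captured = REQUEST_ID.get();
        return () -> {
            String previous = REQUEST_ID.get();
            REQUEST_ID.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    REQUEST_ID.remove(); // do not leak state into the pool
                } else {
                    REQUEST_ID.set(previous);
                }
            }
        };
    }
}
```

Every task submitted to an isolated pool would have to be wrapped this way, which is the extra complexity the list above refers to.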

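Note: Netflix considers the overhead of thread isolation small enough that it has no significant cost or performance impact. The Netflix internal API handles 10 billion HystrixCommand executions per day using thread isolation; each application has 40-odd thread pools with about 5 to 20 threads each.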

Semaphore isolation

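Threads are a scarce and valuable resource, and scheduling them requires switching from user space into the operating system kernel. Concurrent access to a dependency can instead be limited with a semaphore or a counter. When isolating different resources within the same line of business, semaphore isolation is recommended. Before using the resource, the caller tries to acquire a permit. Acquiring it does not involve the operating system, so compared with threads a semaphore is a lightweight form of isolation.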

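Hystrix's semaphore type, TryableSemaphore, is a slimmed-down alternative to the JDK semaphore that keeps the number of permits in use in an AtomicInteger. It only offers tryAcquire, which never blocks: each successful acquisition leaves the counter incremented (incrementAndGet), and each release decrements it (decrementAndGet). The JDK's java.util.concurrent.Semaphore is built on AQS and is internally more complex; its acquire() blocks (its no-argument tryAcquire() does not), and its permit count cannot be driven by a dynamically changing property. The source code gives the official reason for not using it: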

Semaphore that only supports tryAcquire and never blocks and that supports a dynamic permit count. Using AtomicInteger increment/decrement instead of java.util.concurrent.Semaphore since we don't need blocking and need a custom implementation to get the dynamic permit count and since AtomicInteger achieves the same behavior and performance without the more complex implementation of the actual Semaphore class using AbstractQueueSynchronizer.
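For comparison, the no-argument tryAcquire() of java.util.concurrent.Semaphore also returns immediately; only acquire() and the timed variants wait. The short check below uses nothing but the JDK, and JdkSemaphoreCheck is a hypothetical class name.

```java
import java.util.concurrent.Semaphore;

// Hypothetical demo class: shows that Semaphore.tryAcquire() does not block.
final class JdkSemaphoreCheck {
    private JdkSemaphoreCheck() {
    }

    // Tries to take two permits from a semaphore that holds only one.
    static boolean[] tryTwice() {
        Semaphore semaphore = new Semaphore(1);
        boolean first = semaphore.tryAcquire();  // takes the only permit
        boolean second = semaphore.tryAcquire(); // returns at once, no waiting
        return new boolean[] {first, second};
    }
}
```

What the JDK class does not offer is a permit limit that is looked up on every acquisition, which is the dynamic permit count the quoted comment mentions.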

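Development is similar to the thread pool case: extend HystrixCommand, implement the business logic in run(), and provide graceful degradation through getFallback(). Only the isolation strategy and its parameters differ slightly: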

protected HelloCommandIsolateSemaphore(String key, int semaphoreCount) {
    super(HystrixCommand.Setter
            // GroupKey: used to group commands on the dashboard
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("hello"))
            // CommandKey: groups semaphores; commands with the same CommandKey share one set of permits
            .andCommandKey(HystrixCommandKey.Factory.asKey("hello" + key))
            // isolation strategy: SEMAPHORE
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                    // enable the circuit breaker
                    .withCircuitBreakerEnabled(true)
                    // bulkhead isolation strategy
                    .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
                    // maximum number of permits each group of commands may hold
                    // (fixed at 50 here; the semaphoreCount parameter is not used)
                    .withExecutionIsolationSemaphoreMaxConcurrentRequests(50)
                    // how long the circuit stays open before a trial request is allowed
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)));
}

.withExecutionIsolationSemaphoreMaxConcurrentRequests(50) plays the same role as the core size of a thread pool: it sets how many requests may access the resource at the same time.

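So far we have seen how to write a service with thread pool isolation and one with semaphore isolation. Next, we look at how isolation is designed and implemented at the source level.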

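The TryableSemaphore interface defines the behavior of semaphore isolation, and its implementation uses an AtomicInteger to hand out permits. HystrixProperty<Integer> numberOfPermits holds the number of permits available, and AtomicInteger count holds the number currently handed out. numberOfPermits must be assigned when the object is constructed, so it is declared final.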

protected final HystrixProperty<Integer> numberOfPermits;
private final AtomicInteger count = new AtomicInteger(0);

Acquiring a permit

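tryAcquire() hands out permits. On each request it calls incrementAndGet() on count. If the result is greater than numberOfPermits, it calls decrementAndGet() to undo the increment and returns false, meaning the acquisition failed; otherwise the acquisition succeeded and it returns true. The code: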

@Override
public boolean tryAcquire() {
    int currentCount = count.incrementAndGet();
    if (currentCount > numberOfPermits.get()) {
        count.decrementAndGet();
        return false;
    } else {
        return true;
    }
}

Releasing a permit

When the logic has finished, the permit must be released so that other threads can acquire it. release() does this by calling decrementAndGet() on count.
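Acquire and release can be put together in a small runnable class. The following is a sketch modeled on the description above rather than a copy of the Hystrix source: CountingTryableSemaphore and callOrFallback are hypothetical names, and an IntSupplier stands in for HystrixProperty<Integer> so that the permit limit can change at runtime.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Hypothetical non-blocking semaphore with a dynamic permit limit.
final class CountingTryableSemaphore {
    private final IntSupplier numberOfPermits; // stand-in for HystrixProperty<Integer>
    private final AtomicInteger count = new AtomicInteger(0);

    CountingTryableSemaphore(IntSupplier numberOfPermits) {
        this.numberOfPermits = numberOfPermits;
    }

    boolean tryAcquire() {
        int currentCount = count.incrementAndGet();
        if (currentCount > numberOfPermits.getAsInt()) {
            count.decrementAndGet(); // undo the increment, no permit granted
            return false;
        }
        return true;
    }

    void release() {
        count.decrementAndGet();
    }

    int getNumberOfPermitsUsed() {
        return count.get();
    }

    // Typical usage: acquire, do the work, always release in finally.
    <T> T callOrFallback(Supplier<T> work, T fallback) {
        if (!tryAcquire()) {
            return fallback; // limit reached: degrade instead of waiting
        }
        try {
            return work.get();
        } finally {
            release();
        }
    }
}
```

Releasing in a finally block matters: a permit that is never returned permanently lowers the number of requests that can run concurrently.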

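When one line of business depends on several different resources, semaphore isolation is a good choice because it avoids the cost of thread scheduling.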

That covers acquiring and releasing semaphore permits.
