Kubernetes Eviction Manager Source Code Analysis


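Abstract: This article is a follow-up to "Analysis of the Kubernetes Eviction Manager Working Mechanism" and explains that mechanism by walking through the source code.

Introduction to the Kubernetes Eviction Manager and How It Works

For this part, please see my previous post: Analysis of the Kubernetes Eviction Manager Working Mechanism.

Kubernetes Eviction Manager Source Code Analysis

Where the Kubernetes Eviction Manager Is Started

When the kubelet instantiates a Kubelet object, it calls eviction.NewManager to create an evictionManager object.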

pkg/kubelet/kubelet.go:273

    func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
        ...
        thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
        if err != nil {
            return nil, err
        }
        evictionConfig := eviction.Config{
            PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
            MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
            Thresholds:               thresholds,
            KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
        }
        ...
        // setup eviction manager
        evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
        if err != nil {
            return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
        }
        klet.evictionManager = evictionManager
        klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
        ...
    }

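When the kubelet's Run method starts, it launches a goroutine that calls updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the runtime is up, it calls initializeRuntimeDependentModules to initialize the modules that depend on the runtime.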

pkg/kubelet/kubelet.go:1219

    func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
        go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    }

pkg/kubelet/kubelet.go:2040

    func (kl *Kubelet) updateRuntimeUp() {
        ...
        kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
        ...
    }

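Following the code into initializeRuntimeDependentModules, we can see that the runtime-dependent modules are cadvisor and evictionManager, and that initialization simply means calling each module's Start method.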

pkg/kubelet/kubelet.go:1206

    func (kl *Kubelet) initializeRuntimeDependentModules() {
        if err := kl.cadvisor.Start(); err != nil {
            // Fail kubelet and rely on the babysitter to retry starting kubelet.
            // TODO(random-liu): Add backoff logic in the babysitter
            glog.Fatalf("Failed to start cAdvisor %v", err)
        }
        // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
        if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
            kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
        }
    }
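The "call it every 5s, but initialize only once" behaviour above relies on sync.Once. The following is a minimal sketch of that pattern, not kubelet code: the modules type, its fields, and runTicks are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// modules is a hypothetical stand-in for the kubelet object; it is not the real type.
type modules struct {
	once    sync.Once
	started []string
}

// initRuntimeDependentModules starts the dependent modules in a fixed order:
// the stats collector first, then the eviction manager that relies on it.
func (m *modules) initRuntimeDependentModules() {
	m.started = append(m.started, "cadvisor")
	m.started = append(m.started, "evictionManager")
}

// updateRuntimeUp is meant to be called periodically. sync.Once guarantees
// that the initialization body runs only on the first call.
func (m *modules) updateRuntimeUp() {
	m.once.Do(m.initRuntimeDependentModules)
}

// runTicks simulates n periodic invocations and returns the start order.
func runTicks(n int) []string {
	m := &modules{}
	for i := 0; i < n; i++ {
		m.updateRuntimeUp()
	}
	return m.started
}

func main() {
	fmt.Println(runTicks(3))
}
```

However many times the periodic callback fires, each module is started exactly once and in the same order.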

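From this point on, we are inside the evictionManager itself.

Definition of the Kubernetes Eviction Manager

As shown above, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup. Before going further, we first need to look at how the Eviction Manager is defined.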

pkg/kubelet/eviction/eviction_manager.go:40

    // managerImpl implements Manager
    type managerImpl struct {
        //  used to track time
        clock clock.Clock
        // config is how the manager is configured
        config Config
        // the function to invoke to kill a pod
        killPodFunc KillPodFunc
        // the interface that knows how to do image gc
        imageGC ImageGC
        // protects access to internal state
        sync.RWMutex
        // node conditions are the set of conditions present
        nodeConditions []v1.NodeConditionType
        // captures when a node condition was last observed based on a threshold being met
        nodeConditionsLastObservedAt nodeConditionsObservedAt
        // nodeRef is a reference to the node
        nodeRef *v1.ObjectReference
        // used to record events about the node
        recorder record.EventRecorder
        // used to measure usage stats on system
        summaryProvider stats.SummaryProvider
        // records when a threshold was first observed
        thresholdsFirstObservedAt thresholdsObservedAt
        // records the set of thresholds that have been met (including graceperiod) but not yet resolved
        thresholdsMet []Threshold
        // resourceToRankFunc maps a resource to ranking function for that resource.
        resourceToRankFunc map[v1.ResourceName]rankFunc
        // resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
        resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
        // last observations from synchronize
        lastObservations signalObservations
        // notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
        notifiersInitialized bool
    }

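managerImpl is the concrete definition of the evictionManager. The fields to focus on are:

config - the evictionManager configuration, including:
- PressureTransitionPeriod (--eviction-pressure-transition-period)
- MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)
- Thresholds (--eviction-hard, --eviction-soft)
- KernelMemcgNotification (--experimental-kernel-memcg-notification)
killPodFunc - the function used to kill a pod during eviction. When the kubelet calls NewManager, it is set to the function returned by killPodNow (pkg/kubelet/pod_workers.go:285).
imageGC - when the node has the DiskPressure condition, imageGC deletes unused images to reclaim disk space.
summaryProvider - provides a summary of the latest stats for the node and all pods on it, i.e. NodeStats and []PodStats.
thresholdsFirstObservedAt - records when each threshold was first observed.
thresholdsMet - holds the thresholds that have been met but are not yet resolved. synchronize stores the result of thresholdsMetGracePeriod here, so these are the thresholds whose grace period has already elapsed.
resourceToRankFunc - maps each resource to the function used to rank pods when choosing eviction candidates.
resourceToNodeReclaimFuncs - maps each resource to the ordered list of functions called to reclaim that resource at the node level.
lastObservations - the eviction signal observations from the previous synchronize, used to make sure thresholds are only acted on when their stats are newer than the last ones seen.
notifiersInitialized - a bool indicating whether the threshold notifiers have been initialized. It is false when the manager is created. Whether kernel memcg notification is used to speed up eviction response depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.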

The kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager. The code of eviction.NewManager is simple: it just assigns fields.

pkg/kubelet/eviction/eviction_manager.go:79

    // NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
    func NewManager(
        summaryProvider stats.SummaryProvider,
        config Config,
        killPodFunc KillPodFunc,
        imageGC ImageGC,
        recorder record.EventRecorder,
        nodeRef *v1.ObjectReference,
        clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
        manager := &managerImpl{
            clock:           clock,
            killPodFunc:     killPodFunc,
            imageGC:         imageGC,
            config:          config,
            recorder:        recorder,
            summaryProvider: summaryProvider,
            nodeRef:         nodeRef,
            nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
            thresholdsFirstObservedAt:    thresholdsObservedAt{},
        }
        return manager, manager, nil
    }

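One point is important, though: NewManager returns not only the evictionManager object but also a lifecycle.PodAdmitHandler, which the kubelet stores as evictionAdmitHandler. As the final line "return manager, manager, nil" shows, both return values are the same *managerImpl instance, exposed through two different interfaces. The kubelet uses evictionAdmitHandler as an admission check before creating a Pod, and creation only continues if the check passes. The check is performed by the Admit(attrs *lifecycle.PodAdmitAttributes) method: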

pkg/kubelet/eviction/eviction_manager.go:102

    // Admit rejects a pod if its not safe to admit for node stability.
    func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
        m.RLock()
        defer m.RUnlock()
        if len(m.nodeConditions) == 0 {
            return lifecycle.PodAdmitResult{Admit: true}
        }

        // the node has memory pressure, admit if not best-effort
        if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
            notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
            if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
                return lifecycle.PodAdmitResult{Admit: true}
            }
        }

        // reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
        glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
        return lifecycle.PodAdmitResult{
            Admit:   false,
            Reason:  reason,
            Message: fmt.Sprintf(message, m.nodeConditions),
        }
    }

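This Pod admission logic is exactly the effect of the EvictionManager on Pod scheduling that was described in the Scheduler section of "Analysis of the Kubernetes Eviction Manager Working Mechanism":

The kubelet periodically reports Node Conditions to kube-apiserver, where they are stored in etcd. Once kube-scheduler observes a pressure Node Condition, it prevents further Pods from being bound to that node according to the following policy.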

Node Condition    Scheduler Behavior
MemoryPressure    No new BestEffort pods are scheduled to the node.
DiskPressure      No new pods are scheduled to the node.
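The decision order of the Admit method can be condensed into a small, self-contained sketch. This is an illustration only: condition, podInfo, and admit are hypothetical simplifications, and the QoS and critical-pod checks are reduced to two booleans rather than the real qos and kubepod helpers.

```go
package main

import "fmt"

// condition is a simplified stand-in for v1.NodeConditionType.
type condition string

const (
	memoryPressure condition = "MemoryPressure"
	diskPressure   condition = "DiskPressure"
)

// podInfo is a hypothetical, reduced view of a pod: only the two
// properties the admission check looks at.
type podInfo struct {
	bestEffort bool
	critical   bool
}

// hasCondition reports whether c is present in conds.
func hasCondition(conds []condition, c condition) bool {
	for _, have := range conds {
		if have == c {
			return true
		}
	}
	return false
}

// admit follows the same order as the quoted Admit method:
// no conditions means admit; under memory pressure, admit anything that is
// not BestEffort (or is critical); everything else is rejected.
func admit(conds []condition, p podInfo) bool {
	if len(conds) == 0 {
		return true
	}
	if hasCondition(conds, memoryPressure) {
		if !p.bestEffort || p.critical {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(admit([]condition{memoryPressure}, podInfo{bestEffort: true}))
	fmt.Println(admit([]condition{memoryPressure}, podInfo{bestEffort: false}))
	fmt.Println(admit([]condition{diskPressure}, podInfo{bestEffort: false}))
}
```

Read literally, the quoted code admits a non-BestEffort pod whenever MemoryPressure is present, even if DiskPressure is set at the same time. The sketch reproduces that order rather than the table.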

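The killPodNow code is analyzed later.

That covers what the evictionManager is and where it comes from. Next, let's look at how it is started.

Starting the Kubernetes Eviction Manager

As analyzed above, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)). Let's look at the Start method first: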

pkg/kubelet/eviction/eviction_manager.go:126

    // Start starts the control loop to observe and response to low compute resources.
    func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
        // start the eviction manager monitoring
        go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
        return nil
    }

It is simple: it starts a goroutine that runs m.synchronize, waits monitoringInterval (10s) after each run completes, and then runs m.synchronize again, indefinitely.
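The "run, then wait one interval after the run finishes, then run again" behaviour can be sketched with the standard library alone. The until function below is a simplified stand-in written for this article, not the implementation of wait.Until, and countRuns is a hypothetical helper that exists only to make the loop observable.

```go
package main

import (
	"fmt"
	"time"
)

// until calls f, waits for period after f returns, and repeats until stop is
// closed. The wait is measured from the end of each call, so a slow f delays
// the next run instead of overlapping with it.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Do not start another run if we have already been asked to stop.
		select {
		case <-stop:
			return
		default:
		}
		f()
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// countRuns drives until with a counting function and closes the stop
// channel once the function has run n times (n must be at least 1).
func countRuns(n int, periodMillis int) int {
	stop := make(chan struct{})
	count := 0
	until(func() {
		count++
		if count == n {
			close(stop)
		}
	}, time.Duration(periodMillis)*time.Millisecond, stop)
	return count
}

func main() {
	fmt.Println(countRuns(3, 10))
}
```

In the kubelet the stop channel is wait.NeverStop, so the real loop never exits.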

Next comes the evictionManager's core workflow:

pkg/kubelet/eviction/eviction_manager.go:181

    // synchronize is the main control loop that enforces eviction thresholds.
    func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
        // if we have nothing to do, just return
        thresholds := m.config.Thresholds
        if len(thresholds) == 0 {
            return
        }

        // build the ranking functions (if not yet known)
        if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
            // this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
            hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
            if err != nil {
                return
            }
            m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
            m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
        }

        // make observations and get a function to derive pod usage stats relative to those observations.
        observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
        if err != nil {
            glog.Errorf("eviction manager: unexpected err: %v", err)
            return
        }

        // attempt to create a threshold notifier to improve eviction response time
        if m.config.KernelMemcgNotification && !m.notifiersInitialized {
            glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
            m.notifiersInitialized = true
            // start soft memory notification
            err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
                glog.Infof("soft memory eviction threshold crossed at %s", desc)
                // TODO wait grace period for soft memory limit
                m.synchronize(diskInfoProvider, podFunc)
            })
            if err != nil {
                glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
            }
            // start hard memory notification
            err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
                glog.Infof("hard memory eviction threshold crossed at %s", desc)
                m.synchronize(diskInfoProvider, podFunc)
            })
            if err != nil {
                glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
            }
        }

        // determine the set of thresholds met independent of grace period
        thresholds = thresholdsMet(thresholds, observations, false)

        // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
        if len(m.thresholdsMet) > 0 {
            thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
            thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
        }

        // determine the set of thresholds whose stats have been updated since the last sync
        thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)

        // track when a threshold was first observed
        now := m.clock.Now()
        thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

        // the set of node conditions that are triggered by currently observed thresholds
        nodeConditions := nodeConditions(thresholds)

        // track when a node condition was last observed
        nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

        // node conditions report true if it has been observed within the transition period window
        nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)

        // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
        thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)

        // update internal state
        m.Lock()
        m.nodeConditions = nodeConditions
        m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
        m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
        m.thresholdsMet = thresholds
        m.lastObservations = observations
        m.Unlock()

        // determine the set of resources under starvation
        starvedResources := getStarvedResources(thresholds)
        if len(starvedResources) == 0 {
            glog.V(3).Infof("eviction manager: no resources are starved")
            return
        }

        // rank the resources to reclaim by eviction priority
        sort.Sort(byEvictionPriority(starvedResources))
        resourceToReclaim := starvedResources[0]
        glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)

        // determine if this is a soft or hard eviction associated with the resource
        softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)

        // record an event about the resources we are now attempting to reclaim via eviction
        m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

        // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
        if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
            glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
            return
        }

        glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)

        // rank the pods for eviction
        rank, ok := m.resourceToRankFunc[resourceToReclaim]
        if !ok {
            glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
            return
        }

        // the only candidates viable for eviction are those pods that had anything running.
        activePods := podFunc()
        if len(activePods) == 0 {
            glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
            return
        }

        // rank the running pods for eviction for the specified resource
        rank(activePods, statsFunc)

        glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))

        // we kill at most a single pod during each eviction interval
        for i := range activePods {
            pod := activePods[i]
            status := v1.PodStatus{
                Phase:   v1.PodFailed,
                Message: fmt.Sprintf(message, resourceToReclaim),
                Reason:  reason,
            }
            // record that we are evicting the pod
            m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
            gracePeriodOverride := int64(0)
            if softEviction {
                gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
            }
            // this is a blocking call and should only return when the pod and its containers are killed.
            err := m.killPodFunc(pod, status, &gracePeriodOverride)
            if err != nil {
                glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
                continue
            }
            // success, so we return until the next housekeeping interval
            glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
            return
        }
        glog.Infof("eviction manager: unable to evict any pods from the node")
    }

The code is well organized and thoroughly commented. The key steps are:

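buildResourceToRankFunc and buildResourceToNodeReclaimFuncs register, respectively, the per-resource ranking functions used when evicting Pods and the per-resource reclaim functions used to recover node-level resources.
makeSignalObservations obtains the eviction signal observations from cAdvisor data (via m.summaryProvider), along with the Pods' statsFunc, which is needed later when ranking Pods.
If the kubelet is configured with --experimental-kernel-memcg-notification set to true, startMemoryThresholdNotifier starts the soft and hard memory notifications. As soon as system usage reaches the soft or hard memory threshold, the kubelet is notified immediately and evictionManager.synchronize is triggered to reclaim resources, which makes eviction more responsive.
thresholdsMet takes the observations computed from cAdvisor data and the configured thresholds, and computes the thresholds that are met in this pass.
thresholdsMet is then called again with the same observations and the previously recorded m.thresholdsMet, this time enforcing min-reclaim, to find the thresholds that were recorded earlier but are not yet resolved. These are merged with the thresholds from the previous step.
The thresholds are filtered by comparing the signal timestamps in lastObservations with those in observations, keeping only the thresholds whose stats have been updated since the last sync.
thresholdsFirstObservedAt, nodeConditions, and nodeConditionsLastObservedAt are updated.
The thresholds are filtered down to those whose grace period has already elapsed between the first observed time and now.
The evictionManager's internal state is updated: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet (set to thresholds), and lastObservations (set to observations).
starvedResources is derived from thresholds and sorted. If memory is among the starved resources, it is ranked first.
The first resource in starvedResources is selected, and reclaimNodeLevelResources is called to reclaim that resource at the node level. If, after reclaiming, available satisfies thresholdValue + evictionMinimumReclaim, the flow ends and no user pods are evicted.
If reclaimNodeLevelResources does not free enough, user pods must be evicted. First, all active Pods are ranked using the function registered earlier by buildResourceToRankFunc.
In ranked order, the kill function returned by killPodNow is called on each pod. If killing a pod fails, that pod is skipped and the next pod in order is tried. As soon as one pod is killed successfully, synchronize returns, so at most one Pod is killed per pass.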
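The threshold bookkeeping in the steps above (met, met with min-reclaim, first-observed time, grace period) can be sketched as follows. This is a simplified illustration, not the real helpers: the threshold struct, the functions met, firstObserved, and pastGracePeriod, plain int64 quantities, and times expressed as seconds are all assumptions made to keep the example short.

```go
package main

import "fmt"

// threshold is a hypothetical, simplified eviction threshold.
type threshold struct {
	signal             string
	value              int64 // the threshold is met when available drops below this
	minReclaim         int64 // extra amount that must be free before it counts as resolved
	gracePeriodSeconds int64 // 0 for a hard threshold
}

// met reports whether the threshold is crossed for the observed available
// amount. With enforceMinReclaim, a previously met threshold stays met until
// available reaches value+minReclaim.
func met(t threshold, available int64, enforceMinReclaim bool) bool {
	limit := t.value
	if enforceMinReclaim {
		limit += t.minReclaim
	}
	return available < limit
}

// firstObserved keeps the earlier timestamp for thresholds that were already
// being tracked and records now for newly met ones. Thresholds that are no
// longer met drop out of the result.
func firstObserved(metNow []threshold, prev map[string]int64, now int64) map[string]int64 {
	out := make(map[string]int64, len(metNow))
	for _, t := range metNow {
		if ts, ok := prev[t.signal]; ok {
			out[t.signal] = ts
		} else {
			out[t.signal] = now
		}
	}
	return out
}

// pastGracePeriod returns the thresholds that have been met continuously for
// at least their grace period. Hard thresholds (grace period 0) pass at once.
func pastGracePeriod(metNow []threshold, first map[string]int64, now int64) []threshold {
	var out []threshold
	for _, t := range metNow {
		if ts, ok := first[t.signal]; ok && now-ts >= t.gracePeriodSeconds {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	soft := threshold{signal: "memory.available", value: 100, minReclaim: 50, gracePeriodSeconds: 30}
	first := firstObserved([]threshold{soft}, nil, 0)
	fmt.Println(len(pastGracePeriod([]threshold{soft}, first, 10)))
	fmt.Println(len(pastGracePeriod([]threshold{soft}, first, 30)))
	fmt.Println(met(soft, 120, false), met(soft, 120, true))
}
```

The last line of main shows why min-reclaim matters: with 120 available the threshold of 100 is no longer crossed, but it is still treated as unresolved until 150 is available.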

Two steps in this flow are the most important: reclaiming node resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).

pkg/kubelet/eviction/eviction_manager.go:340

    // reclaimNodeLevelResources attempts to reclaim node level resources.  returns true if thresholds were satisfied and no pod eviction is required.
    func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
        nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
        for _, nodeReclaimFunc := range nodeReclaimFuncs {
            // attempt to reclaim the pressured resource.
            reclaimed, err := nodeReclaimFunc()
            if err == nil {
                // update our local observations based on the amount reported to have been reclaimed.
                // note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
                signal := resourceToSignal[resourceToReclaim]
                value, ok := observations[signal]
                if !ok {
                    glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
                    continue
                }
                value.available.Add(*reclaimed)

                // evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
                if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
                    return true
                }
            } else {
                glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
            }
        }
        return false
    }

pkg/kubelet/pod_workers.go:283

    // killPodNow returns a KillPodFunc that can be used to kill a pod.
    // It is intended to be injected into other modules that need to kill a pod.
    func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
        return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
            // determine the grace period to use when killing the pod
            gracePeriod := int64(0)
            if gracePeriodOverride != nil {
                gracePeriod = *gracePeriodOverride
            } else if pod.Spec.TerminationGracePeriodSeconds != nil {
                gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
            }

            // we timeout and return an error if we don't get a callback within a reasonable time.
            // the default timeout is relative to the grace period (we settle on 2s to wait for kubelet->runtime traffic to complete in sigkill)
            timeout := int64(gracePeriod + (gracePeriod / 2))
            minTimeout := int64(2)
            if timeout < minTimeout {
                timeout = minTimeout
            }
            timeoutDuration := time.Duration(timeout) * time.Second

            // open a channel we block against until we get a result
            type response struct {
                err error
            }
            ch := make(chan response)
            podWorkers.UpdatePod(&UpdatePodOptions{
                Pod:        pod,
                UpdateType: kubetypes.SyncPodKill,
                OnCompleteFunc: func(err error) {
                    ch <- response{err: err}
                },
                KillPodOptions: &KillPodOptions{
                    PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
                        return status
                    },
                    PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
                },
            })

            // wait for either a response, or a timeout
            select {
            case r := <-ch:
                return r.err
            case <-time.After(timeoutDuration):
                recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
                return fmt.Errorf("timeout waiting to kill pod")
            }
        }
    }
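Two details of killPodNow are easy to miss when reading it inline: the timeout is 1.5 times the grace period with a floor of 2 seconds, and the caller blocks on a channel until either the callback fires or that timeout expires. A minimal sketch of just those two pieces, with killTimeout and waitForKill as hypothetical helper names that do not exist in the kubelet:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// killTimeout mirrors the arithmetic in killPodNow: the grace period plus
// half of it (integer division), but never less than 2 seconds.
func killTimeout(gracePeriodSeconds int64) time.Duration {
	timeout := gracePeriodSeconds + gracePeriodSeconds/2
	if timeout < 2 {
		timeout = 2
	}
	return time.Duration(timeout) * time.Second
}

// waitForKill blocks until done delivers the result of the kill or the
// timeout expires, whichever happens first.
func waitForKill(done <-chan error, timeout time.Duration) error {
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("timeout waiting to kill pod")
	}
}

func main() {
	done := make(chan error, 1)
	done <- nil // simulate the pod worker reporting a successful kill
	fmt.Println(killTimeout(30), waitForKill(done, killTimeout(30)))
}
```

For a hard eviction the grace period override is 0, so the caller waits at most the 2 second floor before treating the kill as failed and moving on to the next pod.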

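That completes the analysis of the evictionManager's main flow.

Summary

The kubelet creates the evictionManager in NewMainKubelet.
The kubelet starts the evictionManager while initializing the runtime-dependent modules during startup.
The two most important steps in the EvictionManager workflow are reclaiming node resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).
Each eviction pass successfully kills at most one pod. If killing a pod fails, the next pod in the ranked list is tried, until a kill succeeds or every Pod on the node has been tried.
Each synchronize call performs one eviction pass, and 10s after it completes the pass runs again.
If --experimental-kernel-memcg-notification is set to true, kernel memcg notification is used: as soon as system usage reaches the soft or hard memory threshold, the kubelet is notified immediately and evictionManager.synchronize is triggered to reclaim resources, which makes eviction more responsive.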
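The "at most one pod per pass" rule from the summary can be shown in isolation. The sketch below is an illustration only: pods are reduced to names, and killFunc, evictOne, and failing are hypothetical helpers standing in for the injected kill function and the loop at the end of synchronize.

```go
package main

import (
	"errors"
	"fmt"
)

// killFunc is a hypothetical stand-in for the injected function that kills a pod.
type killFunc func(pod string) error

// evictOne walks the ranked pods in order and stops at the first successful
// kill. It returns the evicted pod's name, or false if every kill failed.
func evictOne(ranked []string, kill killFunc) (string, bool) {
	for _, pod := range ranked {
		if err := kill(pod); err != nil {
			// Skip this pod and try the next one in ranked order.
			continue
		}
		return pod, true
	}
	return "", false
}

// failing builds a killFunc that returns an error for the named pods and
// succeeds for all others.
func failing(names ...string) killFunc {
	bad := make(map[string]bool, len(names))
	for _, n := range names {
		bad[n] = true
	}
	return func(pod string) error {
		if bad[pod] {
			return errors.New("kill failed: " + pod)
		}
		return nil
	}
}

func main() {
	pod, ok := evictOne([]string{"a", "b", "c"}, failing("a"))
	fmt.Println(pod, ok)
}
```

Even when several pods are candidates, only the first one that can actually be killed is evicted. Any remaining pressure is handled by the next synchronize pass.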

1 0