http://blog.csdn.net/WaltonWang/article/details/56329109
Abstract: As a follow-up to my analysis of how the Kubernetes Eviction Manager works, this article walks through its working mechanism by reading the source code.
Introduction and Working Principle of the Kubernetes Eviction Manager
For this part, please see my previous post: Kubernetes Eviction Manager Working Mechanism Analysis.
Kubernetes Eviction Manager Source Code Analysis
Where the Kubernetes Eviction Manager Is Started
When the kubelet instantiates its Kubelet object, it calls eviction.NewManager to create an evictionManager object.
```go
// pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {
    ...
    thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
    if err != nil {
        return nil, err
    }
    evictionConfig := eviction.Config{
        PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
        MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
        Thresholds:               thresholds,
        KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
    }
    ...
    // setup eviction manager
    evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)
    if err != nil {
        return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
    }
    klet.evictionManager = evictionManager
    klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
    ...
}
```
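To make the threshold parsing above concrete, here is a hedged sketch with hypothetical flag values (the four-string call shape mirrors the ParseThresholdConfig call in NewMainKubelet; only the sample values are invented):

```go
// Sample values are hypothetical; in the kubelet they arrive via the
// --eviction-hard / --eviction-soft / --eviction-soft-grace-period /
// --eviction-minimum-reclaim flags carried in kubeCfg.
thresholds, err := eviction.ParseThresholdConfig(
    "memory.available<100Mi", // hard: evict as soon as available memory < 100Mi
    "memory.available<300Mi", // soft: evict only after the grace period below
    "memory.available=30s",   // grace period for the soft threshold
    "memory.available=0Mi",   // minimum extra amount to reclaim per eviction
)
if err != nil {
    return nil, err
}
```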
When the kubelet starts working in its Run method, it launches a goroutine that executes updateRuntimeUp every 5s. In updateRuntimeUp, once the runtime is confirmed to be up, initializeRuntimeDependentModules is called to initialize the runtime-dependent modules.
```go
// pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ...
}

// pkg/kubelet/kubelet.go:2040
func (kl *Kubelet) updateRuntimeUp() {
    ...
    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)
    ...
}
```
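The pairing of a periodic loop with one-time initialization is a recurring kubelet pattern. Below is a minimal standalone sketch of the same idea, using time.Ticker in place of wait.Until and sync.Once in the role of kl.oneTimeInitializer; all names are illustrative, not the kubelet's:

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var once sync.Once // plays the role of kl.oneTimeInitializer
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        // ... pretend we checked here that the container runtime is up ...
        once.Do(func() {
            // runs exactly once, no matter how often the loop fires
            fmt.Println("initializing runtime-dependent modules")
        })
    }
}
```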
Tracing into initializeRuntimeDependentModules, we can see that the runtime-dependent modules include cadvisor and the evictionManager, and that initialization simply means calling their respective Start methods.
```go
// pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
    if err := kl.cadvisor.Start(); err != nil {
        // Fail kubelet and rely on the babysitter to retry starting kubelet.
        // TODO(random-liu): Add backoff logic in the babysitter
        glog.Fatalf("Failed to start cAdvisor %v", err)
    }
    // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
    if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
        kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
    }
}
```
From here on, then, we are into the analysis of the evictionManager itself.
Definition of the Kubernetes Eviction Manager
As the analysis above shows, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup. Before going there, though, we must first look at how the Eviction Manager is defined.
```go
// pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
    // used to track time
    clock clock.Clock
    // config is how the manager is configured
    config Config
    // the function to invoke to kill a pod
    killPodFunc KillPodFunc
    // the interface that knows how to do image gc
    imageGC ImageGC
    // protects access to internal state
    sync.RWMutex
    // node conditions are the set of conditions present
    nodeConditions []v1.NodeConditionType
    // captures when a node condition was last observed based on a threshold being met
    nodeConditionsLastObservedAt nodeConditionsObservedAt
    // nodeRef is a reference to the node
    nodeRef *v1.ObjectReference
    // used to record events about the node
    recorder record.EventRecorder
    // used to measure usage stats on system
    summaryProvider stats.SummaryProvider
    // records when a threshold was first observed
    thresholdsFirstObservedAt thresholdsObservedAt
    // records the set of thresholds that have been met (including graceperiod) but not yet resolved
    thresholdsMet []Threshold
    // resourceToRankFunc maps a resource to ranking function for that resource.
    resourceToRankFunc map[v1.ResourceName]rankFunc
    // resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
    resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
    // last observations from synchronize
    lastObservations signalObservations
    // notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
    notifiersInitialized bool
}
```
managerImpl is the concrete type behind the evictionManager. The key point to note: the kubelet calls eviction.NewManager in NewMainKubelet to create the evictionManager, and the code of eviction.NewManager is trivial: it just assigns fields.
```go
// pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
    summaryProvider stats.SummaryProvider,
    config Config,
    killPodFunc KillPodFunc,
    imageGC ImageGC,
    recorder record.EventRecorder,
    nodeRef *v1.ObjectReference,
    clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
    manager := &managerImpl{
        clock:           clock,
        killPodFunc:     killPodFunc,
        imageGC:         imageGC,
        config:          config,
        recorder:        recorder,
        summaryProvider: summaryProvider,
        nodeRef:         nodeRef,
        nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
        thresholdsFirstObservedAt:    thresholdsObservedAt{},
    }
    return manager, manager, nil
}
```
One point matters, though: NewManager returns not only the evictionManager but also a lifecycle.PodAdmitHandler instance, evictionAdmitHandler. As the return manager, manager, nil statement shows, the two are actually the same managerImpl instance exposed through two different interfaces, not two separate objects. evictionAdmitHandler is used by the kubelet for an admission check before creating a Pod; only when the check passes does Pod creation proceed. The check is performed by the Admit(attrs *lifecycle.PodAdmitAttributes) method, shown below:
```go
// pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    m.RLock()
    defer m.RUnlock()
    if len(m.nodeConditions) == 0 {
        return lifecycle.PodAdmitResult{Admit: true}
    }
    // the node has memory pressure, admit if not best-effort
    if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
        notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
        if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
            return lifecycle.PodAdmitResult{Admit: true}
        }
    }
    // reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
    glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
    return lifecycle.PodAdmitResult{
        Admit:   false,
        Reason:  reason,
        Message: fmt.Sprintf(message, m.nodeConditions),
    }
}
```
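The decisive test here is the Pod's QoS class. As an illustrative fragment (not from the kubelet source; import paths assumed to be "k8s.io/kubernetes/pkg/api/v1" and "k8s.io/kubernetes/pkg/kubelet/qos" for the version discussed), a Pod whose containers set no requests or limits is classified BestEffort, so under MemoryPressure the Admit method above would reject it unless it is a critical Pod:

```go
// A pod with no resource requests/limits anywhere => BestEffort QoS.
pod := &v1.Pod{
    Spec: v1.PodSpec{
        Containers: []v1.Container{{Name: "app", Image: "nginx"}},
    },
}
fmt.Println(qos.GetPodQOS(pod)) // prints: BestEffort
```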
This Pod Admit logic is exactly the influence of the EvictionManager on Pod scheduling described in the Scheduler section of Kubernetes Eviction Manager Working Mechanism Analysis:

The kubelet periodically reports Node Conditions to the kube-apiserver, where they are persisted in etcd. Once the kube-scheduler watches a Node Condition indicating pressure, it prevents further Pods from binding to that Node according to the following policy.

| Node Condition | Scheduler Behavior |
| --- | --- |
| MemoryPressure | No new BestEffort pods are scheduled to the node. |
| DiskPressure | No new pods are scheduled to the node. |

The code for killPodNow is analyzed later in this post.
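The scheduler-side logic implied by this table can be sketched as follows (a simplified illustration, not the scheduler's actual predicate code; the function name is made up):

```go
// toleratesNodePressure is a hypothetical, condensed version of the scheduler
// predicates behind the table above.
func toleratesNodePressure(pod *v1.Pod, node *v1.Node) bool {
    for _, cond := range node.Status.Conditions {
        if cond.Status != v1.ConditionTrue {
            continue
        }
        switch cond.Type {
        case v1.NodeMemoryPressure:
            // MemoryPressure only blocks new BestEffort pods
            if qos.GetPodQOS(pod) == qos.BestEffort {
                return false
            }
        case v1.NodeDiskPressure:
            // DiskPressure blocks all new pods
            return false
        }
    }
    return true
}
```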
With that, this section has answered what the evictionManager is and where it comes from. Next, let's look at how it is started.
Starting the Kubernetes Eviction Manager
As analyzed above, the kubelet starts the evictionManager while initializing the runtime-dependent modules during startup (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)), so let's look at the Start method first:
```go
// pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and response to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
    // start the eviction manager monitoring
    go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
    return nil
}
```
Very simple: it starts a goroutine that runs m.synchronize, waits monitoringInterval (10s) after each run, and then runs m.synchronize again, forever.
Next comes the heart of the evictionManager's workflow:
```go
// pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
    // if we have nothing to do, just return
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 {
        return
    }
    // build the ranking functions (if not yet known)
    if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
        // this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
        hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
        if err != nil {
            return
        }
        m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
        m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
    }
    // make observations and get a function to derive pod usage stats relative to those observations.
    observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
    if err != nil {
        glog.Errorf("eviction manager: unexpected err: %v", err)
        return
    }
    // attempt to create a threshold notifier to improve eviction response time
    if m.config.KernelMemcgNotification && !m.notifiersInitialized {
        glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
        m.notifiersInitialized = true
        // start soft memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
            glog.Infof("soft memory eviction threshold crossed at %s", desc)
            // TODO wait grace period for soft memory limit
            m.synchronize(diskInfoProvider, podFunc)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
        }
        // start hard memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
            glog.Infof("hard memory eviction threshold crossed at %s", desc)
            m.synchronize(diskInfoProvider, podFunc)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
        }
    }
    // determine the set of thresholds met independent of grace period
    thresholds = thresholdsMet(thresholds, observations, false)
    // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }
    // determine the set of thresholds whose stats have been updated since the last sync
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)
    // track when a threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)
    // the set of node conditions that are triggered by currently observed thresholds
    nodeConditions := nodeConditions(thresholds)
    // track when a node condition was last observed
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)
    // node conditions report true if it has been observed within the transition period window
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)
    // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)
    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    m.lastObservations = observations
    m.Unlock()
    // determine the set of resources under starvation
    starvedResources := getStarvedResources(thresholds)
    if len(starvedResources) == 0 {
        glog.V(3).Infof("eviction manager: no resources are starved")
        return
    }
    // rank the resources to reclaim by eviction priority
    sort.Sort(byEvictionPriority(starvedResources))
    resourceToReclaim := starvedResources[0]
    glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)
    // determine if this is a soft or hard eviction associated with the resource
    softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)
    // record an event about the resources we are now attempting to reclaim via eviction
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)
    // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
    if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
        glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
        return
    }
    glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)
    // rank the pods for eviction
    rank, ok := m.resourceToRankFunc[resourceToReclaim]
    if !ok {
        glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
        return
    }
    // the only candidates viable for eviction are those pods that had anything running.
    activePods := podFunc()
    if len(activePods) == 0 {
        glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return
    }
    // rank the running pods for eviction for the specified resource
    rank(activePods, statsFunc)
    glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))
    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        status := v1.PodStatus{
            Phase:   v1.PodFailed,
            Message: fmt.Sprintf(message, resourceToReclaim),
            Reason:  reason,
        }
        // record that we are evicting the pod
        m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
        gracePeriodOverride := int64(0)
        if softEviction {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        // this is a blocking call and should only return when the pod and its containers are killed.
        err := m.killPodFunc(pod, status, &gracePeriodOverride)
        if err != nil {
            glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
            continue
        }
        // success, so we return until the next housekeeping interval
        glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
        return
    }
    glog.Infof("eviction manager: unable to evict any pods from the node")
}
```
The code is well structured and thoroughly commented. The key flow is as follows:
- Register, via buildResourceToRankFunc and buildResourceToNodeReclaimFuncs, the per-resource ranking functions used when evicting Pods and the Reclaim functions used to recover node-level resources.
- Use makeSignalObservations to obtain the Eviction Signal Observations from cAdvisor, plus a StatsFunc for Pods (needed later when ranking Pods).
- If the kubelet is configured with --experimental-kernel-memcg-notification set to true, start the soft & hard memory notifications via startMemoryThresholdNotifier. The moment system usage reaches the soft or hard memory thresholds, the kubelet is notified immediately and evictionManager.synchronize is triggered to run the resource-reclaim flow, which improves the responsiveness of eviction.
- From the observations computed from the cAdvisor data and the configured thresholds, compute the thresholds met in this pass via thresholdsMet.
- Again from the observations and m.thresholdsMet, compute via thresholdsMet the thresholds that were previously recorded but are not yet resolved, then merge them with the thresholds from the previous step.
- Compare the signal timestamps in lastObservations against those in observations, keeping only the thresholds whose stats have been updated since the last sync.
- Update thresholdsFirstObservedAt and nodeConditions.
- Filter out the thresholds whose grace period has already elapsed between their observed time and now.
- Update the manager's internal state: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet, lastObservations.
- Derive starvedResources from the thresholds and sort them; if memory is among the starved resources, it is ranked first.
- Take the top-ranked starved resource and call reclaimNodeLevelResources to reclaim that resource at the node level. If afterwards the available amount satisfies thresholdValue + evictionMinimumReclaim, the flow ends without evicting any user Pods (see the worked example after this list).
- If reclaimNodeLevelResources is not enough to meet the target, proceed to evict user Pods: first rank all active Pods using the function registered earlier by buildResourceToRankFunc.
- In that order, call killPodNow to kill the selected Pod. If killing a Pod fails, skip it and try the next one in order. As soon as one Pod is killed successfully, return; in other words, at most one Pod is killed per pass.
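A worked example for the min-reclaim step (flag values are hypothetical): with eviction-hard set to memory.available<1Gi and eviction-minimum-reclaim set to memory.available=500Mi, the threshold is met once available memory drops below 1Gi, and it is only considered resolved again when available memory climbs back above 1Gi + 500Mi = 1.5Gi; until then it stays in m.thresholdsMet and keeps driving reclaim.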
Of the flow above, the two most critical steps are reclaiming node-level resources (reclaimNodeLevelResources) and evicting user Pods (killPodNow):
```go
// pkg/kubelet/eviction/eviction_manager.go:340
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
    nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
    for _, nodeReclaimFunc := range nodeReclaimFuncs {
        reclaimed, err := nodeReclaimFunc()
        if err == nil {
            // credit the reported reclaimed amount back to the local observations
            signal := resourceToSignal[resourceToReclaim]
            value, ok := observations[signal]
            if !ok {
                glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
                continue
            }
            value.available.Add(*reclaimed)
            // with adjusted observations, check whether the min-reclaim goals are now met
            if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
                return true
            }
        } else {
            glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
        }
    }
    return false
}
```

```go
// pkg/kubelet/pod_workers.go:283
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
    return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
        // determine the grace period to use when killing the pod
        gracePeriod := int64(0)
        if gracePeriodOverride != nil {
            gracePeriod = *gracePeriodOverride
        } else if pod.Spec.TerminationGracePeriodSeconds != nil {
            gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
        }
        // time out and return an error if no callback arrives within a reasonable window
        timeout := int64(gracePeriod + (gracePeriod / 2))
        minTimeout := int64(2)
        if timeout < minTimeout {
            timeout = minTimeout
        }
        timeoutDuration := time.Duration(timeout) * time.Second
        // open a channel we block against until we get a result
        type response struct {
            err error
        }
        ch := make(chan response)
        podWorkers.UpdatePod(&UpdatePodOptions{
            Pod:        pod,
            UpdateType: kubetypes.SyncPodKill,
            OnCompleteFunc: func(err error) {
                ch <- response{err: err}
            },
            KillPodOptions: &KillPodOptions{
                PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
                    return status
                },
                PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
            },
        })
        // wait for either a response, or a timeout
        select {
        case r := <-ch:
            return r.err
        case <-time.After(timeoutDuration):
            recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
            return fmt.Errorf("timeout waiting to kill pod")
        }
    }
}
```
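Note how killPodNow derives its wait timeout: timeout = gracePeriod + gracePeriod/2, floored at minTimeout. For example, a Pod with TerminationGracePeriodSeconds of 30 and no override waits up to 45s for the kill to complete; a hard eviction passes a gracePeriodOverride of 0, so the 2-second minimum applies.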
At this point, the main flow of the evictionManager has been covered.
Summary
- The kubelet creates the evictionManager in NewMainKubelet.
- The kubelet starts the evictionManager while initializing the runtime-dependent modules during startup.
- The two most critical steps in the EvictionManager workflow are reclaiming node-level resources (reclaimNodeLevelResources) and evicting user Pods (killPodNow).
- In each eviction pass, at most one Pod is successfully killed. If killing a Pod fails, the next Pod in the sorted list is tried, until a kill succeeds or every Pod on the node has been tried.
- Each synchronize run completes one eviction pass, and the loop repeats 10s later.
- If --experimental-kernel-memcg-notification is set to true, the kernel memcg notification API is used: the moment system usage reaches the soft or hard memory thresholds, the kubelet is notified immediately and evictionManager.synchronize is triggered to reclaim resources, which improves the responsiveness of eviction.