kubernetes集群问题排查
来源:互联网 发布:临床药物查询软件 编辑:程序博客网 时间:2024/05/22 06:10
本文个人博客地址为:http://www.huweihuang.com/article/kubernetes/kubernetes-troubleshooting/
1. 查看系统Event事件
kubectl describe pod <PodName> --namespace=<NAMESPACE>
该命令可以显示Pod创建时的配置定义、状态等信息和最近的Event事件,事件信息可用于排错。例如当Pod状态为Pending,可通过查看Event事件确认原因,一般原因有几种:
- 没有可用的Node可调度
- 开启了资源配额管理并且当前Pod的目标节点上恰好没有可用的资源
- 正在下载镜像(镜像拉取耗时太久)或镜像下载失败。
kubectl describe还可以查看其它k8s对象:NODE,RC,Service,Namespace,Secrets。
1.1. Pod
kubectl describe pod <PodName> --namespace=<NAMESPACE>
以下是容器的启动命令非阻塞式导致容器挂掉,被k8s频繁重启所产生的事件。
kubectl describe pod--namespace= Events: FirstSeen LastSeen Count From SubobjectPath Reason Message ───────── ──────── ───── ──── ───────────── ────── ─────── 7m 7m 1 {scheduler } Scheduled Successfully assigned yangsc-1-0-0-index0 to 10.8.216.19 7m 7m 1 {kubelet 10.8.216.19} containers{infra} Pulled Container image "registry.wae.haplat.net/kube-system/pause:0.8.0" already present on machine 7m 7m 1 {kubelet 10.8.216.19} containers{infra} Created Created with docker id 84f133c324d0 7m 7m 1 {kubelet 10.8.216.19} containers{infra} Started Started with docker id 84f133c324d0 7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 3f9f82abb145 7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 3f9f82abb145 7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id fb112e4002f4 7m 7m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id fb112e4002f4 6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 613b119d4474 6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 613b119d4474 6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 25cb68d1fd3d 6m 6m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 25cb68d1fd3d 5m 5m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 7d9ee8610b28 5m 5m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 7d9ee8610b28 3m 3m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 88b9e8d582dd 3m 3m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 88b9e8d582dd 7m 1m 7 {kubelet 10.8.216.19} containers{yangsc0} Pulling Pulling image "registry.ts.wae.haplat.net/test/tcp-hello:1.0.0" 1m 1m 1 {kubelet 10.8.216.19} containers{yangsc0} Started Started with docker id 089abff050e7 1m 1m 1 {kubelet 10.8.216.19} containers{yangsc0} Created Created with docker id 089abff050e7 7m 1m 7 {kubelet 10.8.216.19} containers{yangsc0} Pulled Successfully pulled image "registry.ts.wae.haplat.net/test/tcp-hello:1.0.0" 6m 7s 34 {kubelet 10.8.216.19} containers{yangsc0} Backoff Back-off restarting failed docker container
1.2. NODE
kubectl describe node 10.8.216.20
[root@FC-43745A-10 ~]# kubectl describe node 10.8.216.20Name:10.8.216.20Labels:kubernetes.io/hostname=10.8.216.20,namespace/bcs-cc=true,namespace/myview=trueCreationTimestamp:Mon, 17 Apr 2017 11:32:52 +0800Phase:Conditions: TypeStatusLastHeartbeatTimeLastTransitionTimeReasonMessage ────────────────────────────────────────────────────────── Ready True Fri, 18 Aug 2017 09:38:33 +0800 Tue, 02 May 2017 17:40:58 +0800 KubeletReady kubelet is posting ready status OutOfDisk False Fri, 18 Aug 2017 09:38:33 +0800 Mon, 17 Apr 2017 11:31:27 +0800 KubeletHasSufficientDisk kubelet has sufficient disk space availableAddresses:10.8.216.20,10.8.216.20Capacity: cpu:32 memory:67323039744 pods:40System Info: Machine ID:723bafc7f6764022972b3eae1ce6b198 System UUID:4C4C4544-0042-4210-8044-C3C04F595631 Boot ID:da01f2e3-987a-425a-9ca7-1caaec35d1e5 Kernel Version:3.10.0-327.28.3.el7.x86_64 OS Image:CentOS Linux 7 (Core) Container Runtime Version:docker://1.13.1 Kubelet Version:v1.1.1-wae2-13.1+79c90c68bfb72f-dirty Kube-Proxy Version:v1.1.1-wae2-13.1+79c90c68bfb72f-dirtyExternalID:10.8.216.20Non-terminated Pods:(6 in total) NamespaceNameCPU RequestsCPU LimitsMemory RequestsMemory Limits ─────────────────────────────────────────────────────────────── bcs-ccbcs-cc-api-0-0-1364-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%) bcs-ccbcs-cc-api-0-0-1444-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%) fwfw-demo2-0-0-1519-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%) myviewmyview-api-0-0-1362-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%) myviewmyview-api-0-0-1442-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%) qa-ts-dnats-dna-console3-0-0-1434-index01 (3%)1 (3%)4294967296 (6%)4294967296 (6%)Allocated resources: (Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md) CPU RequestsCPU LimitsMemory RequestsMemory Limits ────────────────────────────────────────────────── 6 (18%)6 (18%)25769803776 (38%)25769803776 (38%)No events.
1.3. RC
kubectl describe rc mytest-1-0-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe rc mytest-1-0-0 --namespace=testName:mytest-1-0-0Namespace:testImage(s):registry.ts.wae.haplat.net/test/mywebcalculator:1.0.1Selector:app=mytest,appVersion=1.0.0Labels:app=mytest,appVersion=1.0.0,env=ts,zone=innerReplicas:1 current / 1 desiredPods Status:1 Running / 0 Waiting / 0 Succeeded / 0 FailedNo volumes.Events: FirstSeenLastSeenCountFromSubobjectPathReasonMessage ──────────────────────────────────────────────────── 20h19h9{replication-controller }FailedCreateError creating: Pod "mytest-1-0-0-index0" is forbidden: limited to 10 pods 20h17h7{replication-controller }FailedCreateError creating: pods "mytest-1-0-0-index0" already exists 20h17h4{replication-controller }SuccessfulCreateCreated pod: mytest-1-0-0-index0
1.4. NAMESPACE
kubectl describe namespace test
[root@FC-43745A-10 ~]# kubectl describe namespace testName:testLabels:<none>Status:ActiveResource Quotas ResourceUsedHard --------- cpu520 memory134217728053687091200 persistentvolumeclaims010 pods410 replicationcontrollers820 resourcequotas11 secrets310 services820No resource limits.
1.5. Service
kubectl describe service wae-containers-1-1-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe service wae-containers-1-1-0 --namespace=testName:wae-containers-1-1-0Namespace:testLabels:app=wae-containers,appVersion=1.1.0,env=ts,zone=innerSelector:app=wae-containers,appVersion=1.1.0Type:ClusterIPIP:10.254.46.42Port:port-dna-tcp-3591335913/TCPEndpoints:10.0.92.17:35913Port:port-l7-tcp-80808080/TCPEndpoints:10.0.92.17:8080Session Affinity:NoneNo events.
2. 查看容器日志
1、查看指定pod的日志
kubectl logs <pod_name>
kubectl logs -f <pod_name> #类似tail -f的方式查看
2、查看上一个pod的日志
kubectl logs -p <pod_name>
3、查看指定pod中指定容器的日志
kubectl logs <pod_name> -c <container_name>
4、kubectl logs的更多使用请参考kubectl logs --help
[root@node5 ~]# kubectl logs --helpPrint the logs for a container in a pod. If the pod has only one container, the container name is optional.Usage: kubectl logs [-f] [-p] POD [-c CONTAINER] [flags]Aliases: logs, log Examples:# Return snapshot logs from pod nginx with only one container$ kubectl logs nginx# Return snapshot of previous terminated ruby container logs from pod web-1$ kubectl logs -p -c ruby web-1# Begin streaming the logs of the ruby container in pod web-1$ kubectl logs -f -c ruby web-1# Display only the most recent 20 lines of output in pod nginx$ kubectl logs --tail=20 nginx# Show all logs from pod nginx written in the last hour$ kubectl logs --since=1h nginx
3. 查看k8s服务日志
3.1. journalctl
在Linux系统上systemd系统来管理kubernetes服务,并且journal系统会接管服务程序的输出日志,可以通过systemctl status <xxx>或journalctl -u <xxx> -f来查看kubernetes服务的日志。
其中kubernetes组件包括:
3.2. 日志文件
也可以通过指定日志存放目录来保存和查看日志
- --logtostderr=false:不输出到stderr
- --log-dir=/var/log/kubernetes:日志的存放目录
- --alsologtostderr=false:设置为true表示日志输出到文件也输出到stderr
- --v=0:glog的日志级别
- --vmodule=gfs*=2,test*=4:glog基于模块的详细日志级别
4. 常见问题
4.1. Pod状态一直为Pending
kubectl describe <pod_name> --namespace=<NAMESPACE>
查看该POD的事件。
- 正在下载镜像但拉取不下来(镜像拉取耗时太久)[一般都是该原因]
- 没有可用的Node可调度
- 开启了资源配额管理并且当前Pod的目标节点上恰好没有可用的资源
解决方法:
- 查看该POD所在宿主机与镜像仓库之间的网络是否有问题,可以手动拉取镜像
- 删除POD实例,让POD调度到别的宿主机上
4.2. Pod创建后不断重启
kubectl get pods中Pod状态一会running,一会不是,且RESTARTS次数不断增加。
一般原因为容器启动命令不是阻塞式命令,导致容器运行后马上退出。
非阻塞式命令:
- 本身CMD指定的命令就是非阻塞式命令
- 将服务启动方式设置为后台运行
解决方法:
1、将命令改为阻塞式命令(前台运行),例如:zkServer.sh start-foreground
2、java运行程序的启动脚本将 nohup xxx &的nobup和&去掉,例如:
nohup $JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main &
改为:
$JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main
文章参考《Kubernetes权威指南》
- kubernetes集群问题排查
- kubernetes集群搭建以及遇到的问题
- kubernetes集群搭建以及遇到的问题
- Storm 集群监控报警-问题排查记录
- mc集群写入恍惚问题排查
- kubernetes集群
- Kubernetes集群搭建过程中遇到的问题
- kubernetes 1.6集群再遇rbac问题(helm安装spark)
- kubernetes搭建ETCD集群时遇到的一个问题
- HBase集群出现NotServingRegionException问题的排查及解决方法
- HBase集群出现NotServingRegionException问题的排查及解决方法
- HBase集群出现NotServingRegionException问题的排查及解决方法
- HBase集群出现NotServingRegionException问题的排查及解决方法
- HBase集群出现NotServingRegionException问题的排查及解决方法
- kubernetes集群部署
- kubernetes-ubuntu集群部署
- 搭建kubernetes集群
- kubeadm 搭建 kubernetes 集群
- JS 中对变量类型的判断
- 嵌套的盒子
- MapReduce之自定义Key和Value
- Java 8新特性 Stream API 编程
- linux-使用apt-get安装软件问题汇总
- kubernetes集群问题排查
- ajax判断用户名是否存在
- 相对,绝对定位
- 关于 error: LNK1123: failure during conversion to COFF: file invalid or corrupt 错误的解决方案
- js的 nextsibling 和previousSibling
- DataBase can’t be open after shutdown immediate
- matlab二分类实验(使用libsvm工具包+SVMcgForClass函数)
- 冒泡排序和内置类比较排序操作
- QuickCache缓存原理解析