Nagios Configuration Explained and Cluster Monitoring
Source: Internet · Editor: 程序博客网 · Time: 2024/04/30 15:22
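1 Introduction
This series describes, step by step, how to monitor the state of a big-data platform cluster. Following the previous article on installing and deploying Nagios, this article covers the main Nagios configuration files, how Nagios operates, and how to monitor a Zookeeper cluster, using that one example throughout.
2 Nagios File Layout
2.1 File layout on the monitoring server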
nagios/
├── bin
├── etc
│   └── objects
├── libexec
├── sbin
├── share
│   ├── contexthelp
│   ├── docs
│   │   └── images
│   ├── images
│   │   └── logos
│   ├── includes
│   │   └── rss
│   │       └── extlib
│   ├── js
│   ├── locale
│   │   ├── de
│   │   │   └── LC_MESSAGES
│   │   └── fr
│   │       └── LC_MESSAGES
│   ├── media
│   ├── ssi
│   └── stylesheets
└── var
    ├── archives
    ├── rw
    └── spool
        └── checkresults
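Only the important directories are described here:
- bin: executable programs
- etc: configuration files (important)
- libexec: Nagios plugins
- sbin: CGI programs used by the web interface
The configuration directory expands as follows: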
etc/
├── cgi.cfg
├── htpasswd.users
├── nagios.cfg
├── objects
│   ├── commands.cfg
│   ├── contacts.cfg
│   ├── localhost.cfg
│   ├── printer.cfg
│   ├── switch.cfg
│   ├── templates.cfg
│   ├── timeperiods.cfg
│   └── windows.cfg
└── resource.cfg
The plugin directory expands as follows:
libexec/
├── check_apt
├── check_breeze
├── check_by_ssh
├── check_clamd -> check_tcp
├── check_cluster
├── check_dhcp
├── check_dig
├── check_disk
├── check_disk_smb
├── check_dns
├── check_dummy
├── check_file_age
├── check_flexlm
├── check_ftp -> check_tcp
├── check_http
├── check_icmp
├── check_ide_smart
├── check_ifoperstatus
├── check_ifstatus
├── check_imap -> check_tcp
├── check_ircd
├── check_jabber -> check_tcp
├── check_load
├── check_log
├── check_mailq
├── check_mrtg
├── check_mrtgtraf
├── check_nagios
├── check_nntp -> check_tcp
├── check_nntps -> check_tcp
├── check_nrpe
├── check_nt
├── check_ntp
├── check_ntp_peer
├── check_ntp_time
├── check_nwstat
├── check_oracle
├── check_overcr
├── check_ping
├── check_pop -> check_tcp
├── check_procs
├── check_real
├── check_rpc
├── check_sensors
├── check_simap -> check_tcp
├── check_smtp
├── check_spop -> check_tcp
├── check_ssh
├── check_ssmtp -> check_tcp
├── check_swap
├── check_tcp
├── check_time
├── check_udp -> check_tcp
├── check_ups
├── check_users
├── check_wave
├── negate
├── urlize
├── utils.pm
└── utils.sh
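3 How Nagios Works
Nagios is thoroughly object-oriented in design: its configuration revolves around objects such as contacts, hosts, services, commands, and time periods.
3.1 Startup sequence
Running the init script in debug mode (sh -x) shows what happens when the nagios service starts: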
[root@hadoop-ehp0 etc]# sh -x /etc/init.d/nagios start
+ prefix=/usr/local/nagios
+ exec_prefix=/usr/local/nagios
+ NagiosBin=/usr/local/nagios/bin/nagios
+ NagiosCfgFile=/usr/local/nagios/etc/nagios.cfg
+ NagiosCfgtestFile=/usr/local/nagios/var/nagios.configtest
+ NagiosStatusFile=/usr/local/nagios/var/status.dat
+ NagiosRetentionFile=/usr/local/nagios/var/retention.dat
+ NagiosCommandFile=/usr/local/nagios/var/rw/nagios.cmd
+ NagiosVarDir=/usr/local/nagios/var
+ NagiosRunFile=/usr/local/nagios/var/nagios.lock
+ NagiosLockDir=/var/lock/subsys
+ NagiosLockFile=nagios
+ NagiosCGIDir=/usr/local/nagios/sbin
+ NagiosUser=nagios
+ NagiosGroup=nagios
+ checkconfig=true
+ '[' -f /etc/rc.d/init.d/functions ']'
+ . /etc/rc.d/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -f /etc/sysconfig/i18n -a -z '' -a -z '' ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
+++ PROMPT=yes
+++ AUTOSWAP=no
+++ ACTIVE_CONSOLES='/dev/tty[1-6]'
+++ SINGLE=/sbin/sushell
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~|\.bak|\.orig|\.rpmnew|\.rpmorig|\.rpmsave\)$/d'
+ test -f /etc/sysconfig/nagios
+ USE_RAMDISK=0
+ test 0 -ne 0
+ '[' '!' -f /usr/local/nagios/bin/nagios ']'
+ '[' '!' -f /usr/local/nagios/etc/nagios.cfg ']'
+ case "$1" in
+ echo -n 'Starting nagios:'
Starting nagios:+ test true = true
+ check_config
++ mktemp /tmp/.configtest.XXXXXXXX
+ TMPFILE=/tmp/.configtest.NbW8s1FN
+ /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg
++ sed 's/ //g'
++ awk -F: '{print $2}'
++ grep '^Total Warnings:' /tmp/.configtest.NbW8s1FN
+ WARN=0
++ awk -F: '{print $2}'
++ grep '^Total Errors:' /tmp/.configtest.NbW8s1FN
++ sed 's/ //g'
+ ERR=0
+ test 0 = 0
+ test 0 = 0
+ echo 'OK - Configuration check verified'
+ chmod 0644 /usr/local/nagios/var/nagios.configtest
+ chown nagios:nagios /usr/local/nagios/var/nagios.configtest
+ /bin/rm /tmp/.configtest.NbW8s1FN
+ return 0
+ test -f /usr/local/nagios/var/nagios.lock
+ touch /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ rm -f /usr/local/nagios/var/rw/nagios.cmd
+ touch /usr/local/nagios/var/nagios.lock
+ chown nagios:nagios /usr/local/nagios/var/nagios.lock /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
+ '[' -d /var/lock/subsys ']'
+ touch /var/lock/subsys/nagios
+ echo ' done.'
done.
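Reading the trace: the script first sets a number of variables, then sources /etc/rc.d/init.d/functions, which provides helper functions for the scripts under init.d. It then validates the configuration with /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg and finally starts the daemon with /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg. In other words, nagios.cfg is where all configuration begins.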
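The check_config step visible in the trace can be sketched as a small shell function. This is a minimal sketch, not the actual init script: the helper name count_total, the sample report, and the wording of the failure message are illustrative assumptions. Only the grep/awk/sed pipeline is taken from the trace.

```shell
#!/bin/sh
# Extract the number after a "Total Warnings:" or "Total Errors:" label
# from a saved pre-flight report, using the same grep/awk/sed pipeline
# that appears in the trace. count_total is a hypothetical helper name.
count_total() {
    # $1 = label (e.g. "Total Errors"), $2 = path to the saved report
    grep "^$1:" "$2" | awk -F: '{print $2}' | sed 's/ //g'
}

# Hypothetical report, standing in for the output of "nagios -vp nagios.cfg".
report=$(mktemp)
printf 'Total Warnings: 0\nTotal Errors:   0\n' > "$report"

warn=$(count_total "Total Warnings" "$report")
err=$(count_total "Total Errors" "$report")
rm -f "$report"

if [ "$warn" = "0" ] && [ "$err" = "0" ]; then
    echo "OK - Configuration check verified"
else
    # Illustrative message; the real script's wording is not shown in the trace.
    echo "pre-flight check reported warnings=$warn errors=$err"
fi
```

On a real installation the report would come from running the nagios binary with -vp, exactly as the trace shows.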
Next, let us look at what nagios.cfg itself does. The important part is:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
This part of the file simply loads the default object configuration files; each of them is examined in turn below.
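To see quickly which object files a main configuration file pulls in, the cfg_file= and cfg_dir= directives can be filtered out with grep. The sketch below runs against a small hypothetical excerpt so that it works without a Nagios installation; list_object_sources is an illustrative name, not a Nagios tool.

```shell
#!/bin/sh
# Print every object file or directory referenced through cfg_file= or
# cfg_dir= in a main config file. Commented-out directives are skipped
# because they do not start with the directive name.
list_object_sources() {
    # $1 = path to a nagios.cfg-style file
    grep -E '^(cfg_file|cfg_dir)=' "$1" | cut -d= -f2
}

# Hypothetical excerpt; on a real system, pass the path to nagios.cfg instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/printer.cfg
EOF

list_object_sources "$sample"
rm -f "$sample"
```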
3.2 Commands
An excerpt for analysis:
# 'notify-host-by-email' command definition
define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
        }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }
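The comments above are clear enough: these definitions provide commands for sending e-mail, checking whether a host is alive, and so on. They are certainly invoked somewhere; where exactly is covered later.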
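Before a command runs, Nagios replaces the $MACRO$ placeholders in command_line with real values. The sketch below imitates that substitution for two macros with sed. It is an illustration only: Nagios performs the expansion internally, expand_macros is a hypothetical name, and the value used for $USER1$ assumes the plugin directory /usr/local/nagios/libexec.

```shell
#!/bin/sh
# Replace the literal placeholders $USER1$ and $HOSTADDRESS$ in a
# command_line template. "[$]" matches a literal dollar sign in sed.
expand_macros() {
    # $1 = command_line template, $2 = value for $USER1$, $3 = value for $HOSTADDRESS$
    printf '%s\n' "$1" | sed -e "s|[$]USER1[$]|$2|g" -e "s|[$]HOSTADDRESS[$]|$3|g"
}

# The check-host-alive command line from commands.cfg, expanded for 127.0.0.1.
expand_macros '$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5' \
    /usr/local/nagios/libexec 127.0.0.1
```

The printed line is the command that would actually be executed for a host whose address is 127.0.0.1.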
3.3 Contacts
An excerpt for analysis:
define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        email                           nagios@localhost        ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }
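Nagios is known as a powerful alerting tool, and here we finally see the e-mail-related settings: the contact's email address is defined in this file.
3.4 Time periods
An excerpt for analysis: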
# This defines a timeperiod where all times are valid for checks,
# notifications, etc.  The classic "24x7" support nightmare. :-)
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }

# 'workhours' timeperiod definition
define timeperiod{
        timeperiod_name workhours
        alias           Normal Work Hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }
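Notifications should not simply fire around the clock regardless of need. Choose between 24x7 and working hours according to your own requirements.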
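To make the time-range notation concrete, here is a small sketch that tests whether a clock time falls inside one HH:MM-HH:MM range such as those in the workhours definition. The helper name in_timerange is hypothetical, the end of the range is treated as exclusive by assumption, and the way Nagios itself evaluates time periods (weekdays, multiple ranges, exceptions) is not modelled.

```shell
#!/bin/sh
in_timerange() {
    # $1 = clock time "HH:MM", $2 = range "HH:MM-HH:MM"
    # Returns 0 when start <= time < end, 1 otherwise.
    # awk converts "09" to 9, so leading zeros are handled safely.
    echo "$1 $2" | awk -F'[ :-]' '{
        t = $1 * 60 + $2
        s = $3 * 60 + $4
        e = $5 * 60 + $6
        exit !(t >= s && t < e)
    }'
}

for t in 08:30 10:15 17:00; do
    if in_timerange "$t" 09:00-17:00; then
        echo "$t is inside 09:00-17:00"
    else
        echo "$t is outside 09:00-17:00"
    fi
done
```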
3.5 Templates
An excerpt for analysis:
# Generic contact definition template - This is NOT a real contact, just a template!
define contact{
        name                            generic-contact         ; The name of this contact template
        service_notification_period     24x7                    ; service notifications can be sent anytime
        host_notification_period        24x7                    ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s             ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s               ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }

# Generic host definition template - This is NOT a real host, just a template!
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        notification_period             24x7            ; Send host notifications at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

# Generic service definition template - This is NOT a real service, just a template!
define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10              ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60              ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
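Contacts, hosts, services, time periods, and commands (invoked from services): since there are only a handful of object types, why not define templates to be inherited rather than repeat the same settings everywhere? That is exactly what the definitions above do. An object inherits from a template through the use directive, and register 0 marks a definition as a template rather than a real object.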
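The effect of inheritance can be pictured as a lookup that tries the object's own directives first and falls back to the template. The following is a simplified sketch under stated assumptions: resolve_directive is a hypothetical helper, each definition is reduced to plain "directive value" lines in its own file, and only one level of inheritance is handled, whereas Nagios resolves whole chains of templates.

```shell
#!/bin/sh
resolve_directive() {
    # $1 = directive name
    # $2 = file holding the object's own directives
    # $3 = file holding the template's directives
    # Prints the first value found, preferring the object over the template.
    for f in "$2" "$3"; do
        v=$(awk -v k="$1" '$1 == k { print $2; exit }' "$f")
        if [ -n "$v" ]; then
            echo "$v"
            return 0
        fi
    done
    return 1
}

tmpl=$(mktemp)
obj=$(mktemp)
# Two values copied from the generic-service template.
printf 'max_check_attempts 3\nnormal_check_interval 10\n' > "$tmpl"
# Hypothetical service that overrides one value and inherits the other.
printf 'use generic-service\nnormal_check_interval 5\n' > "$obj"

echo "normal_check_interval = $(resolve_directive normal_check_interval "$obj" "$tmpl")"
echo "max_check_attempts = $(resolve_directive max_check_attempts "$obj" "$tmpl")"
rm -f "$tmpl" "$obj"
```

The overridden directive resolves to the object's value (5), while the one the object does not mention falls back to the template's value (3).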
3.6 Hosts and services
Taking the bundled localhost.cfg as an example, here is an excerpt for analysis:
# Define a host for the local machine
define host{
        use                     linux-server            ; Name of host template to use
                                                        ; This host definition will inherit all variables that are defined
                                                        ; in (or inherited by) the linux-server host template definition.
        host_name               localhost
        alias                   localhost
        address                 127.0.0.1
        }

# Define a service to "ping" the local machine
define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }
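localhost.cfg defines the final, concrete entities; all the objects above merely lay the groundwork for it. Monitoring the state of a particular service on a particular host, and reacting to that state, is the actual goal. How does the monitoring server check a monitored host? Through the NRPE add-on. NRPE is itself a client/server design: the monitoring server is the client, and each monitored host is the server (it must run the NRPE daemon). The client periodically sends the check requests defined for each host's services to the servers. A server receives the request, runs the local Nagios plugin to check the service on that machine, and returns the result to the client. The client then acts on the result, by e-mail or phone for example, and the state can also be viewed in the web UI.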
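The result a plugin hands back follows the standard Nagios plugin convention: exit status 0 means OK, 1 WARNING, 2 CRITICAL, and 3 UNKNOWN, accompanied by one line of text on standard output. The sketch below maps an exit status to its state name. Both state_name and the stand-in plugin fake_plugin are hypothetical, used only so that the example runs without a Nagios installation.

```shell
#!/bin/sh
# Translate a plugin exit status into the corresponding state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Stand-in for a real plugin: prints one status line and exits with
# status 2 (CRITICAL). The message text is made up for the example.
fake_plugin() {
    echo "FAKE CRITICAL - simulated failure"
    return 2
}

rc=0
output=$(fake_plugin) || rc=$?
echo "state=$(state_name "$rc") output=$output"
```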
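4 Monitoring a Zookeeper Cluster
As an example, we monitor the QuorumPeerMain process on every node of a Zookeeper cluster, walking through the whole workflow once more to reinforce the concepts.
4.1 Custom commands
On the NRPE client side (the monitoring server), add these commands to commands.cfg: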
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

# check_nrpe forwards arguments to the remote command only when they follow -a
define command{
        command_name    check_nrpe_args
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
        }
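In a service definition, check_command joins the command name and its arguments with '!', and Nagios assigns the pieces to $ARG1$, $ARG2$, and so on. A small sketch of that split using tr, applied to the check_ping example from localhost.cfg (split_check_command is a hypothetical helper, and arguments that themselves contain '!' are not handled):

```shell
#!/bin/sh
split_check_command() {
    # Prints the command name on the first line, then one argument per line.
    printf '%s\n' "$1" | tr '!' '\n'
}

split_check_command 'check_ping!100.0,20%!500.0,60%' | {
    read -r name
    echo "command_name: $name"
    n=1
    while read -r arg; do
        echo "\$ARG$n\$: $arg"
        n=$((n + 1))
    done
}
```

Here 100.0,20% becomes $ARG1$ and 500.0,60% becomes $ARG2$ of the check_ping command definition.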
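4.2 Host group and services
Define the host group files: create a directory named hostservers under etc, and in it create group.cfg, hadoop-ehp1.cfg, hadoop-ehp2.cfg, and hadoop-ehp3.cfg:
hostservers/
├── group.cfg
├── hadoop-ehp1.cfg
├── hadoop-ehp2.cfg
└── hadoop-ehp3.cfg
In the main configuration file nagios.cfg, add the directory (and comment out the localhost file, otherwise the configuration check later reports errors):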
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
cfg_dir=/usr/local/nagios/etc/hostservers
The cfg_dir directive loads every .cfg file in that directory. group.cfg contains:
# Host group
define hostgroup{
        hostgroup_name  linux-servers ; The name of the hostgroup
        alias           Linux Servers ; Long name of the group
        members         hadoop-ehp1,hadoop-ehp2,hadoop-ehp3 ; Comma separated list of hosts that belong to this group
        }
Each hadoop-ehpX.cfg file contains the following (hadoop-ehp1.cfg is shown as the example):
# Host and services
define host{
        use                     linux-server
        host_name               hadoop-ehp1
        alias                   hadoop-ehp1
        address                 192.168.137.101
        }

define service{
        use                     generic-service
        host_name               hadoop-ehp1
        service_description     check_nrpe_users
        check_command           check_nrpe!check_users
        }

define service{
        use                     generic-service
        host_name               hadoop-ehp1
        service_description     QuorumPeerMain
        check_command           check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"
        }
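The name check_procs_args is sent to the NRPE server side as the command to run, so the server side must define it. Note that NRPE accepts command arguments such as $ARG1$ only when it was built with --enable-command-args and nrpe.cfg sets dont_blame_nrpe=1. Edit nrpe.cfg on each monitored host and add: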
command[check_procs_args]=/usr/local/nagios/libexec/check_procs $ARG1$
Then restart the NRPE daemon.
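The -c1:1 argument passed to check_procs uses the plugin range notation min:max: the check turns critical whenever the process count falls outside that range, so exactly one matching QuorumPeerMain process is expected per host. A minimal sketch of that comparison follows. check_range is a hypothetical helper that takes a count directly instead of inspecting the process table, and its output text is illustrative rather than the real check_procs output.

```shell
#!/bin/sh
check_range() {
    # $1 = observed process count, $2 = allowed range "min:max"
    min=${2%%:*}
    max=${2##*:}
    if [ "$1" -ge "$min" ] && [ "$1" -le "$max" ]; then
        echo "OK: $1 process(es), within $2"
        return 0
    fi
    # Exit status 2 is the plugin convention for CRITICAL.
    echo "CRITICAL: $1 process(es), outside $2"
    return 2
}

# 0 processes (Zookeeper down), 1 (healthy), 2 (duplicate instance).
for count in 0 1 2; do
    check_range "$count" 1:1 || true
done
```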
4.3 Contacts
define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        email                           361197893@qq.com        ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }
Edit /etc/mail.rc and add:
set from=13823254902@139.com smtp=smtp.139.com
set smtp-auth-user=13823254902@139.com smtp-auth-password=<your-password> smtp-auth=login
4.4 Verification
Check that the configuration files are valid:
[root@hadoop-ehp0 nagios]# bin/nagios -v etc/nagios.cfg
Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL
Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
        Checked 6 services.
        Checked 3 hosts.
        Checked 1 host groups.
        Checked 0 service groups.
        Checked 1 contacts.
        Checked 1 contact groups.
        Checked 26 commands.
        Checked 5 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.
Checking for circular paths...
        Checked 3 hosts
        Checked 0 service dependencies
        Checked 0 host dependencies
        Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
Verify that the command works against the NRPE server side:
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H hadoop-ehp1 -c check_procs_args 11
PROCS OK: 141 processes | procs=141;;;0;
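Restart nagios.
Open the web UI to check the status of the hosts and services.
Kill one of the QuorumPeerMain processes.
Check the mailbox: the notification message has arrived.
5 Summary
This article walked step by step through how Nagios monitors a cluster and sends e-mail notifications. As a monitoring tool for a big-data platform, however, it still falls short when it comes to detailed metrics, so a later article integrates it with Ganglia to monitor the platform at a finer granularity.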