Nagios Configuration Explained and Cluster Monitoring

Source: Internet | Editor: 程序博客网 | Time: 2024/04/30 15:22

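1 Preface

This series describes, step by step, how to monitor the state of a big data platform cluster. Following the previous article on installing and deploying Nagios, this article covers the main Nagios configuration files, how Nagios operates, and how to monitor a Zookeeper cluster, using that one example throughout.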

2 Nagios File Structure

2.1 File Structure on the Monitoring Host

nagios/
├── bin
├── etc
│   └── objects
├── libexec
├── sbin
├── share
│   ├── contexthelp
│   ├── docs
│   │   └── images
│   ├── images
│   │   └── logos
│   ├── includes
│   │   └── rss
│   │       └── extlib
│   ├── js
│   ├── locale
│   │   ├── de
│   │   │   └── LC_MESSAGES
│   │   └── fr
│   │       └── LC_MESSAGES
│   ├── media
│   ├── ssi
│   └── stylesheets
└── var
    ├── archives
    ├── rw
    └── spool
        └── checkresults

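We only describe the important ones:

bin: executables, including the nagios daemon binary
etc: configuration files (important)
libexec: Nagios plugins
sbin: CGI programs used by the web interface

Expanding the configuration directory gives: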

etc/
├── cgi.cfg
├── htpasswd.users
├── nagios.cfg
├── objects
│   ├── commands.cfg
│   ├── contacts.cfg
│   ├── localhost.cfg
│   ├── printer.cfg
│   ├── switch.cfg
│   ├── templates.cfg
│   ├── timeperiods.cfg
│   └── windows.cfg
└── resource.cfg

Expanding the Nagios plugin directory gives:

libexec/
├── check_apt
├── check_breeze
├── check_by_ssh
├── check_clamd -> check_tcp
├── check_cluster
├── check_dhcp
├── check_dig
├── check_disk
├── check_disk_smb
├── check_dns
├── check_dummy
├── check_file_age
├── check_flexlm
├── check_ftp -> check_tcp
├── check_http
├── check_icmp
├── check_ide_smart
├── check_ifoperstatus
├── check_ifstatus
├── check_imap -> check_tcp
├── check_ircd
├── check_jabber -> check_tcp
├── check_load
├── check_log
├── check_mailq
├── check_mrtg
├── check_mrtgtraf
├── check_nagios
├── check_nntp -> check_tcp
├── check_nntps -> check_tcp
├── check_nrpe
├── check_nt
├── check_ntp
├── check_ntp_peer
├── check_ntp_time
├── check_nwstat
├── check_oracle
├── check_overcr
├── check_ping
├── check_pop -> check_tcp
├── check_procs
├── check_real
├── check_rpc
├── check_sensors
├── check_simap -> check_tcp
├── check_smtp
├── check_spop -> check_tcp
├── check_ssh
├── check_ssmtp -> check_tcp
├── check_swap
├── check_tcp
├── check_time
├── check_udp -> check_tcp
├── check_ups
├── check_users
├── check_wave
├── negate
├── urlize
├── utils.pm
└── utils.sh

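3 How Nagios Operates

Nagios is thoroughly object-oriented in its configuration: it is built around objects such as contacts, hosts, services, commands, and time periods.

3.1 Analyzing the Nagios Startup Sequence

Run the init script in debug mode to trace how the nagios service starts: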

[root@hadoop-ehp0 etc]# sh -x /etc/init.d/nagios start
+ prefix=/usr/local/nagios
+ exec_prefix=/usr/local/nagios
+ NagiosBin=/usr/local/nagios/bin/nagios
+ NagiosCfgFile=/usr/local/nagios/etc/nagios.cfg
+ NagiosCfgtestFile=/usr/local/nagios/var/nagios.configtest
+ NagiosStatusFile=/usr/local/nagios/var/status.dat
+ NagiosRetentionFile=/usr/local/nagios/var/retention.dat
+ NagiosCommandFile=/usr/local/nagios/var/rw/nagios.cmd
+ NagiosVarDir=/usr/local/nagios/var
+ NagiosRunFile=/usr/local/nagios/var/nagios.lock
+ NagiosLockDir=/var/lock/subsys
+ NagiosLockFile=nagios
+ NagiosCGIDir=/usr/local/nagios/sbin
+ NagiosUser=nagios
+ NagiosGroup=nagios
+ checkconfig=true
+ '[' -f /etc/rc.d/init.d/functions ']'
+ . /etc/rc.d/init.d/functions
++ TEXTDOMAIN=initscripts
++ umask 022
++ PATH=/sbin:/usr/sbin:/bin:/usr/bin
++ export PATH
++ '[' -z '' ']'
++ COLUMNS=80
++ '[' -z '' ']'
+++ /sbin/consoletype
++ CONSOLETYPE=pty
++ '[' -f /etc/sysconfig/i18n -a -z '' -a -z '' ']'
++ . /etc/profile.d/lang.sh
++ unset LANGSH_SOURCED
++ '[' -z '' ']'
++ '[' -f /etc/sysconfig/init ']'
++ . /etc/sysconfig/init
+++ BOOTUP=color
+++ RES_COL=60
+++ MOVE_TO_COL='echo -en \033[60G'
+++ SETCOLOR_SUCCESS='echo -en \033[0;32m'
+++ SETCOLOR_FAILURE='echo -en \033[0;31m'
+++ SETCOLOR_WARNING='echo -en \033[0;33m'
+++ SETCOLOR_NORMAL='echo -en \033[0;39m'
+++ PROMPT=yes
+++ AUTOSWAP=no
+++ ACTIVE_CONSOLES='/dev/tty[1-6]'
+++ SINGLE=/sbin/sushell
++ '[' pty = serial ']'
++ __sed_discard_ignored_files='/\(~|\.bak|\.orig|\.rpmnew|\.rpmorig|\.rpmsave\)$/d'
+ test -f /etc/sysconfig/nagios
+ USE_RAMDISK=0
+ test 0 -ne 0
+ '[' '!' -f /usr/local/nagios/bin/nagios ']'
+ '[' '!' -f /usr/local/nagios/etc/nagios.cfg ']'
+ case "$1" in
+ echo -n 'Starting nagios:'
Starting nagios:+ test true = true
+ check_config
++ mktemp /tmp/.configtest.XXXXXXXX
+ TMPFILE=/tmp/.configtest.NbW8s1FN
+ /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg
++ sed 's/ //g'
++ awk -F: '{print $2}'
++ grep '^Total Warnings:' /tmp/.configtest.NbW8s1FN
+ WARN=0
++ awk -F: '{print $2}'
++ grep '^Total Errors:' /tmp/.configtest.NbW8s1FN
++ sed 's/ //g'
+ ERR=0
+ test 0 = 0
+ test 0 = 0
+ echo 'OK - Configuration check verified'
+ chmod 0644 /usr/local/nagios/var/nagios.configtest
+ chown nagios:nagios /usr/local/nagios/var/nagios.configtest
+ /bin/rm /tmp/.configtest.NbW8s1FN
+ return 0
+ test -f /usr/local/nagios/var/nagios.lock
+ touch /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ rm -f /usr/local/nagios/var/rw/nagios.cmd
+ touch /usr/local/nagios/var/nagios.lock
+ chown nagios:nagios /usr/local/nagios/var/nagios.lock /usr/local/nagios/var/nagios.log /usr/local/nagios/var/retention.dat
+ /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
+ '[' -d /var/lock/subsys ']'
+ touch /var/lock/subsys/nagios
+ echo ' done.'
done.

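Reading the trace: the script first sets a number of variables, then sources /etc/rc.d/init.d/functions, which provides basic helper functions for the scripts under init.d. It then validates the configuration with /usr/local/nagios/bin/nagios -vp /usr/local/nagios/etc/nagios.cfg and finally starts the daemon with /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg. So nagios.cfg is the entry point for all configuration.

Let us see what nagios.cfg does. The important part is: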

cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

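This file simply loads several default object configuration files through cfg_file directives. The following sections go through them one by one.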
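To make the role of nagios.cfg more concrete, here is a minimal Python sketch (not Nagios code) that scans main-config text and collects the cfg_file and cfg_dir entries while skipping comments. The function name list_object_sources is hypothetical, and the parsing is a deliberate simplification of what the daemon does.

```python
def list_object_sources(main_cfg_text):
    """Collect cfg_file / cfg_dir values from nagios.cfg-style text.

    Illustrative only: the real daemon understands many more directives.
    """
    sources = {"cfg_file": [], "cfg_dir": []}
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in sources:
            sources[key].append(value.strip())
    return sources


sample = """
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
"""
for kind, paths in list_object_sources(sample).items():
    for path in paths:
        print(kind, path)
```

Running it on the excerpt lists the three cfg_file paths; a commented-out entry would simply be ignored.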

3.2 Commands (commands)

An excerpt for analysis:

# 'notify-host-by-email' command definition
define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }

# 'notify-service-by-email' command definition
define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

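The comments above are clear enough: these definitions provide commands for sending notification email, checking whether a host is alive, and so on. They must be referenced from somewhere; where exactly is covered in the following sections.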
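The $NAME$ tokens in these command lines are macros that Nagios fills in before running the command. The following Python sketch only models that substitution; expand_macros is a hypothetical helper, and the value given for $USER1$ (normally set in resource.cfg) is an assumption based on the plugin directory of this installation.

```python
import re


def expand_macros(command_line, macros):
    """Replace $NAME$ tokens with values from `macros`.

    Unknown macros are left untouched so they remain visible.
    """
    def substitute(match):
        name = match.group(1)
        return str(macros.get(name, match.group(0)))

    return re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


template = "$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5"
values = {
    "USER1": "/usr/local/nagios/libexec",  # assumed value from resource.cfg
    "HOSTADDRESS": "127.0.0.1",
}
print(expand_macros(template, values))
```

The same mechanism fills $CONTACTEMAIL$, $HOSTSTATE$ and the other macros in the two notification commands.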

3.3 Contacts (contacts)

An excerpt for analysis:

define contact{
    contact_name    nagiosadmin         ; Short name of user
    use             generic-contact     ; Inherit default values from generic-contact template (defined above)
    alias           Nagios Admin        ; Full name of user
    email           nagios@localhost    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
    }

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

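Nagios is best known as an alerting tool, and here we finally see the settings related to sending email: the contact's email address is defined in this file.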
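All object files share the same "define type{ directive value ; comment }" layout. As a rough illustration, the Python sketch below parses such blocks and resolves a contact group to email addresses. parse_objects is a hypothetical helper, and it assumes no closing brace appears inside a comment or value, which the real parser does not require.

```python
import re


def parse_objects(text):
    """Parse `define <type>{ ... }` blocks into (type, directives) pairs."""
    objects = []
    for match in re.finditer(r"define\s+(\w+)\s*\{(.*?)\}", text, re.S):
        directives = {}
        for raw in match.group(2).splitlines():
            line = raw.split(";", 1)[0].strip()  # drop trailing ; comment
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((match.group(1), directives))
    return objects


sample = """
define contact{
    contact_name    nagiosadmin         ; Short name of user
    use             generic-contact
    alias           Nagios Admin        ; Full name of user
    email           nagios@localhost
    }
define contactgroup{
    contactgroup_name       admins
    alias                   Nagios Administrators
    members                 nagiosadmin
    }
"""
parsed = parse_objects(sample)
contacts = {d["contact_name"]: d for kind, d in parsed if kind == "contact"}
for kind, d in parsed:
    if kind == "contactgroup":
        for member in d["members"].split(","):
            print(d["contactgroup_name"], "->", contacts[member.strip()]["email"])
```

This is how a notification addressed to the admins group ends up at the email address of nagiosadmin.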

3.4 Time Periods (timeperiods)

An excerpt for analysis:

# This defines a timeperiod where all times are valid for checks,
# notifications, etc.  The classic "24x7" support nightmare. :-)
define timeperiod{
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }

# 'workhours' timeperiod definition
define timeperiod{
    timeperiod_name workhours
    alias       Normal Work Hours
    monday      09:00-17:00
    tuesday     09:00-17:00
    wednesday   09:00-17:00
    thursday    09:00-17:00
    friday      09:00-17:00
    }

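Alerts should not necessarily go out around the clock. Depending on our needs, we choose whether checks and notifications apply 24 hours a day or only during working hours, so that alerting matches how we actually work.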
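A time period is essentially a per-weekday list of HH:MM-HH:MM ranges. The Python sketch below shows one way to test whether a moment falls inside such a period. The names are hypothetical, and treating the end of a range as exclusive is an assumption of this sketch rather than documented Nagios behavior.

```python
from datetime import datetime

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]  # datetime.weekday() order

WORKHOURS = {day: "09:00-17:00" for day in DAYS[:5]}
ALWAYS = {day: "00:00-24:00" for day in DAYS}


def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(period, moment):
    """Return True if `moment` lies inside one of the day's ranges."""
    ranges = period.get(DAYS[moment.weekday()])
    if ranges is None:
        return False  # day not listed: no valid time on that day
    now = moment.hour * 60 + moment.minute
    for span in ranges.split(","):
        start, end = span.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


print(in_timeperiod(WORKHOURS, datetime(2024, 4, 30, 15, 22)))  # a Tuesday afternoon
print(in_timeperiod(WORKHOURS, datetime(2024, 4, 28, 12, 0)))   # a Sunday
```

With the workhours period a Sunday never matches, which is why a template that should alert at any time uses 24x7.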

3.5 Templates (templates)

An excerpt for analysis:

# Generic contact definition template - This is NOT a real contact, just a template!
define contact{
        name                            generic-contact         ; The name of this contact template
        service_notification_period     24x7                    ; service notifications can be sent anytime
        host_notification_period        24x7                    ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s             ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s               ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }

# Generic host definition template - This is NOT a real host, just a template!
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        notification_period             24x7            ; Send host notifications at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

# Generic service definition template - This is NOT a real service, just a template!
define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10              ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60              ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

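Contacts, hosts, services, time periods, and commands (referenced from services): since only a handful of patterns recur, it makes sense to define them once as templates and inherit from them instead of repeating the same directives. That is exactly what the file above does.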
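Inheritance through the use directive can be pictured as merging dictionaries: the template's directives come first, and the object's own directives override them. The Python sketch below is a simplified model with a single parent per object. The resolve helper is hypothetical, and leaving name and register out of the inherited values is how this sketch treats them, not a statement about every Nagios version.

```python
NOT_INHERITED = ("name", "register")  # treated as template-only in this sketch


def resolve(key, objects, seen=()):
    """Return the effective directives of objects[key] after inheritance."""
    if key in seen:
        raise ValueError("circular 'use' chain at " + key)
    obj = objects[key]
    merged = {}
    parent = obj.get("use")
    if parent is not None:
        inherited = resolve(parent, objects, seen + (key,))
        merged.update({k: v for k, v in inherited.items()
                       if k not in NOT_INHERITED})
    # Local directives win over inherited ones.
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


objects = {
    "generic-contact": {
        "name": "generic-contact",
        "service_notification_period": "24x7",
        "host_notification_commands": "notify-host-by-email",
        "register": "0",
    },
    "nagiosadmin": {
        "use": "generic-contact",
        "contact_name": "nagiosadmin",
        "email": "nagios@localhost",
    },
}
for directive, value in resolve("nagiosadmin", objects).items():
    print(directive, "=", value)
```

The nagiosadmin contact ends up with the notification settings of generic-contact plus its own name and email, without repeating any of the template's lines.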

3.6 Hosts and Services (host and service)

Taking the bundled localhost.cfg as an example, here is an excerpt:

# Define a host for the local machine
define host{
        use                     linux-server            ; Name of host template to use
                                                        ; This host definition will inherit all variables that are defined
                                                        ; in (or inherited by) the linux-server host template definition.
        host_name               localhost
        alias                   localhost
        address                 127.0.0.1
        }

# Define a service to "ping" the local machine
define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }

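localhost.cfg is where the concrete objects finally appear; everything before it is groundwork. Monitoring the state of a given service on a given host and reacting to that state is what we set out to do. How does the monitoring host check a monitored host? Through the NRPE add-on. NRPE uses a client/server architecture: the monitoring host is the client (C side), and each monitored host is the server (S side), which must run the NRPE daemon. The C side periodically sends the check requests defined for each host and service to the S side. On receiving a request, the S side runs the local Nagios plugin to check its own service and returns the result. The C side takes the result and reacts, by email or phone for example, and also exposes it in the web UI.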
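Whatever plugin the S side runs, the result comes back as one line of text plus an exit code, where 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. The Python sketch below imitates a process-count check with "min:max" thresholds. It is a hypothetical stand-in rather than the check_procs source, it only handles the explicit min:max form of a threshold, and its message format is invented for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def outside(value, spec):
    """True if value lies outside a 'min:max' range (either side may be empty)."""
    if ":" not in spec:
        raise ValueError("this sketch only handles the 'min:max' form")
    low, _, high = spec.partition(":")
    if low and value < float(low):
        return True
    if high and value > float(high):
        return True
    return False


def evaluate(count, warn=None, crit=None):
    """Map a process count to (exit_code, status_line)."""
    try:
        if crit is not None and outside(count, crit):
            code = CRITICAL
        elif warn is not None and outside(count, warn):
            code = WARNING
        else:
            code = OK
    except ValueError:
        code = UNKNOWN  # unusable threshold
    return code, "%s: %d matching process(es)" % (LABELS[code], count)


# Exactly one process is expected; zero running processes is critical.
code, message = evaluate(0, crit="1:1")
print(message, "(exit code %d)" % code)
```

A real plugin would pass the code to sys.exit so that the NRPE daemon can relay it to the monitoring host.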

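4 Monitoring a Zookeeper Cluster

We take monitoring the QuorumPeerMain process on every node of a Zookeeper cluster as the example, and walk through the whole flow again to reinforce our understanding.

4.1 Custom Commands

On the NRPE C side, add the following commands to commands.cfg: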

define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
    }

# check_nrpe forwards arguments to the NRPE daemon only when they follow -a
define command{
    command_name    check_nrpe_args
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
    }

4.2 Host Groups and Services

Define the host group files: create a directory named hostservers under etc, containing group.cfg, hadoop-ehp1.cfg, hadoop-ehp2.cfg, and hadoop-ehp3.cfg:

hostservers/
├── group.cfg
├── hadoop-ehp1.cfg
├── hadoop-ehp2.cfg
└── hadoop-ehp3.cfg

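Register the directory in the main configuration file nagios.cfg (and comment out the localhost file, otherwise the later configuration check reports errors):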

#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
cfg_dir=/usr/local/nagios/etc/hostservers

The cfg_dir directive loads every .cfg file in that directory. The content of group.cfg is:

# Host group
define hostgroup{
        hostgroup_name  linux-servers ; The name of the hostgroup
        alias           Linux Servers ; Long name of the group
        members         hadoop-ehp1,hadoop-ehp2,hadoop-ehp3     ; Comma separated list of hosts that belong to this group
        }

The hadoop-ehpX.cfg files look like this (hadoop-ehp1.cfg as the example):

# Host and services
define host{
       use                     linux-server
       host_name               hadoop-ehp1
       alias                   hadoop-ehp1
       address                 192.168.137.101
       }

define service{
       use                             generic-service
       host_name                       hadoop-ehp1
       service_description             check_nrpe_users
       check_command                   check_nrpe!check_users
       }

define service{
       use                             generic-service
       host_name                       hadoop-ehp1
       service_description             QuorumPeerMain
       check_command                   check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"
       }

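The command name check_procs_args is passed to the NRPE S side as a parameter, so the S side must define that command (edit nrpe.cfg). For the daemon to accept command arguments, NRPE must be built with --enable-command-args and nrpe.cfg must set dont_blame_nrpe=1. The command definition on the S side is: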

command[check_procs_args]=/usr/local/nagios/libexec/check_procs $ARG1$

Restart the NRPE daemon.
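The argument hand-off has two hops, which is easy to lose track of. The Python sketch below models it under simplifying assumptions: the check_command string is split on "!" into $ARG1$, $ARG2$ on the monitoring host, the quoted argument string travels as one unit, and the S side substitutes it into its own $ARG1$ before the line is split like a shell command. Both helper functions are hypothetical, and shlex only approximates real shell behavior.

```python
import shlex


def split_check_command(check_command):
    """Split 'name!arg1!arg2' into the command name and its $ARGn$ macros."""
    name, *args = check_command.split("!")
    macros = {"ARG%d" % i: arg for i, arg in enumerate(args, start=1)}
    return name, macros


def remote_argv(nrpe_command_line, forwarded_args):
    """Model the S side: fill $ARG1$.. and split the line like a shell would."""
    line = nrpe_command_line
    for i, arg in enumerate(forwarded_args, start=1):
        line = line.replace("$ARG%d$" % i, arg)
    return shlex.split(line)


name, macros = split_check_command(
    'check_nrpe_args!check_procs_args!"-c1:1 -Cjava -aserver.quorum.QuorumPeerMain"'
)
# The shell on the monitoring host strips the quotes and passes one argument.
forwarded = shlex.split(macros["ARG2"])
argv = remote_argv("/usr/local/nagios/libexec/check_procs $ARG1$", forwarded)
print(name, macros["ARG1"])
print(argv)
```

In this model check_procs finally receives -c1:1, -Cjava and -aserver.quorum.QuorumPeerMain as three separate options, that is, exactly one java process whose arguments contain the QuorumPeerMain class name is expected.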

4.3 Contacts

define contact{
        contact_name                    nagiosadmin         ; Short name of user
        use                             generic-contact     ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin        ; Full name of user
        email                           361197893@qq.com    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

Edit /etc/mail.rc and add the following, substituting your own SMTP account and password:

set from=13823254902@139.com smtp=smtp.139.com
set smtp-auth-user=13823254902@139.com  smtp-auth-password=wxl123456 smtp-auth=login

4.4 Verification

Check that the configuration files are valid:

[root@hadoop-ehp0 nagios]# bin/nagios -v etc/nagios.cfg
Nagios Core 4.0.8
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2014
License: GPL
Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...
Running pre-flight check on configuration data...
Checking objects...
    Checked 6 services.
    Checked 3 hosts.
    Checked 1 host groups.
    Checked 0 service groups.
    Checked 1 contacts.
    Checked 1 contact groups.
    Checked 26 commands.
    Checked 5 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 3 hosts
    Checked 0 service dependencies
    Checked 0 host dependencies
    Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...
Total Warnings: 0
Total Errors:   0
Things look okay - No serious problems were detected during the pre-flight check
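The init script traced in section 3.1 extracts the two totals from this output with grep, awk and sed before it starts the daemon. A rough Python equivalent is sketched below; preflight_totals is a hypothetical helper, and it assumes the output keeps the "Total Warnings:" and "Total Errors:" lines shown above.

```python
import re


def preflight_totals(output):
    """Extract the warning and error totals from `nagios -v` output."""
    totals = {}
    for label in ("Warnings", "Errors"):
        match = re.search(r"^Total %s:\s*(\d+)" % label, output, re.M)
        totals[label.lower()] = int(match.group(1)) if match else None
    return totals


sample = """Checking misc settings...
Total Warnings: 0
Total Errors:   0
Things look okay - No serious problems were detected during the pre-flight check
"""
totals = preflight_totals(sample)
print(totals)
print("safe to restart" if totals["errors"] == 0 else "fix the configuration first")
```

In a deployment script the text could come from running bin/nagios -v through subprocess.run with capture_output=True, and a restart would only proceed when the error total is zero.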

Verify that the command works on the NRPE S side:

[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H hadoop-ehp1 -c check_procs_args 11
PROCS OK: 141 processes | procs=141;;;0;
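The part of the plugin output after the "|" is performance data in the form label=value;warn;crit;min;max. The Python sketch below splits such a line into the status text and the metrics. parse_plugin_output is a hypothetical helper, and it assumes labels contain no spaces, which the plugin output format does not guarantee.

```python
FIELDS = ("value", "warn", "crit", "min", "max")


def parse_plugin_output(line):
    """Split plugin output into (status_text, {label: fields})."""
    text, _, perf = line.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        metrics[label.strip("'")] = dict(zip(FIELDS, rest.split(";")))
    return text.strip(), metrics


status, metrics = parse_plugin_output("PROCS OK: 141 processes | procs=141;;;0;")
print(status)
print(metrics["procs"]["value"], "processes, minimum", metrics["procs"]["min"])
```

Here the warn and crit fields are empty, meaning no thresholds were reported for this particular invocation.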

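Restart nagios, then open the web UI to check the status of the hosts and services.

Next, kill one of the QuorumPeerMain processes.

Check the mailbox: the alert notification has arrived.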

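5 Summary

This article walked step by step through monitoring a cluster with Nagios and sending email notifications. As a monitoring tool for a big data platform, however, Nagios still has many gaps when it comes to detailed metrics. A later article integrates it with Ganglia to monitor the platform at a finer granularity.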

1 0