Writing My Own Nagios Disk-Monitoring Script, check_disk


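Before I knew it, I had been interning for a month. My main job during the internship was setting up a Nagios + Centreon monitoring platform. Doing it hands-on went fairly quickly; the stack has plenty of bugs, but the setup went smoothly enough, and afterwards I started writing my own disk-monitoring script.
First, why write my own disk-monitoring script? Honestly, I am not entirely sure, since Nagios-plugins already ships a check_disk script. My mentor probably wanted to give me some practice, and also to end up with a script that fits our actual setup better.
The hardware: three servers forming a test cloud platform; two of them have RAID cards, two have SSDs, plus a number of HDDs. Yes, that is all, but for a beginner like me it was plenty to wrestle with.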


For hosts with a RAID card, MegaCli is a good choice. Download and install MegaCli yourself, then get started:

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL     --- show RAID (logical drive) info
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL       --- show RAID card info
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL           --- show physical disk info

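Play with these and look at what they print. The output is long, so just skim it. If the host has no RAID card, running MegaCli on it prints only Exit Code: 0x00
The one I mainly use is the third command:
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
Then I filter out the fields I need:
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Device Id|Error|Media Type'
Device Id: needed later when monitoring SSD life; it is just an ID
Error: the Error Count lines are what we watch. 0 means no errors; anything else is cause for concern
Media Type: the drive type. I mainly use it to find which Device Id belongs to the SSD in the host, because I know of no other way to map a Device Id to a disk or partition. Here is my output: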

[root@cloud-13 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL  | grep -E 'Device Id|Error|Media Type'
Device Id: 0
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 1
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 2
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device

From here, just write code that reads the number after each Error Count, and you have your monitoring.
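A minimal sketch of that parsing step, assuming the filtered output always lists a "Device Id" line before the counters that belong to it, as in the sample above. The names parse_pdlist and drives_with_errors are made up for this illustration and are not part of MegaCli.

```python
def parse_pdlist(text):
    """Group filtered MegaCli -PDList output into one dict per physical drive."""
    drives = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Device Id":
            # A new drive starts here; later lines belong to it.
            drives.append({"device_id": int(value), "errors": {}, "media_type": None})
        elif not drives:
            continue  # e.g. "Exit Code: 0x00" on a host without a RAID card
        elif key.endswith("Error Count"):
            drives[-1]["errors"][key] = int(value)
        elif key == "Media Type":
            drives[-1]["media_type"] = value
    return drives


def drives_with_errors(drives):
    """Return only the drives that have at least one non-zero error counter."""
    return [d for d in drives if any(d["errors"].values())]


# Two of the drives from the output above.
SAMPLE = """\
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device
"""

drives = parse_pdlist(SAMPLE)
print([d["device_id"] for d in drives if d["media_type"] == "Solid State Device"])  # [4]
print(drives_with_errors(drives))  # []
```

Keeping the counters grouped per drive makes it easy to report which Device Id is misbehaving, and the Media Type field gives the SSD's Device Id for the smartctl step.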
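I mentioned SSD life above, so let me cover it here. smartctl can report an SSD's remaining life (among many other things; SSD life is only one part of its output), but on a host with a RAID card it needs the Device Id obtained above.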

[root@cloud-13 ~]# smartctl -a -d megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/sdc1 [megaraid_disk_04] [SAT]: Device open changed type from 'megaraid' to 'sat'
Smartctl open device: /dev/sdc1 [megaraid_disk_04] [SAT] failed: SATA device detected,
MegaRAID SAT layer is reportedly buggy, use '-d sat+megaraid,N' to try anyhow

My host wants me to add sat, so I did as told:
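The two invocations differ only in the device-type string passed to -d. A small sketch of building the command for either case; smartctl_command is a hypothetical helper name, while the -d megaraid,N and -d sat+megaraid,N forms are the ones shown in the smartctl message above.

```python
def smartctl_command(disk, device_id=None, sat=False):
    """Build the smartctl argument list for a plain or MegaRAID-attached disk."""
    command = ["smartctl", "-a"]
    if device_id is not None:
        device_type = "megaraid,%d" % device_id
        if sat:
            # SATA drives behind the controller need the sat+ prefix,
            # as the smartctl error message suggests.
            device_type = "sat+" + device_type
        command += ["-d", device_type]
    command.append(disk)
    return command


print(" ".join(smartctl_command("/dev/sdc1", device_id=4, sat=True)))
# smartctl -a -d sat+megaraid,4 /dev/sdc1
```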

[root@cloud-13 ~]# smartctl -a -d sat+megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     OCZ INTREPID 3600
Serial Number:    A21N8061423000004
LU WWN Device Id: 5 e83a97 100006dc5
Firmware Version: 1.4.6.0
User Capacity:    800,166,076,416 bytes [800 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Tue Aug 25 15:20:02 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       3964
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       28
100 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       2547072
171 Unknown_Attribute       0x0000   090   000   000    Old_age   Offline      -       12030
174 Unknown_Attribute       0x0000   071   100   000    Old_age   Offline      -       20
184 End-to-End_Error        0x0000   009   100   000    Old_age   Offline      -       1282
187 Reported_Uncorrect      0x0000   100   100   000    Old_age   Offline      -       0
190 Airflow_Temperature_Cel 0x0000   048   054   000    Old_age   Offline      -       48
195 Hardware_ECC_Recovered  0x0000   000   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   000   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   000   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       3562
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       3443
202 Data_Address_Mark_Errs  0x0000   100   100   000    Old_age   Offline      -       2061332509
205 Thermal_Asperity_Rate   0x0000   100   100   000    Old_age   Offline      -       3000
206 Flying_Height           0x0000   000   100   000    Old_age   Offline      -       0
207 Spin_High_Current       0x0000   002   100   000    Old_age   Offline      -       64
208 Spin_Buzz               0x0000   000   100   000    Old_age   Offline      -       9
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
211 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
212 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
213 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
214 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
221 G-Sense_Error_Rate      0x0000   100   100   000    Old_age   Offline      -       0
222 Loaded_Hours            0x0000   100   100   000    Old_age   Offline      -       0
230 Head_Amplitude          0x0000   001   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   000   000    Old_age   Offline      -       100
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       5792
251 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       22849

SMART Error Log not supported
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Device does not support Selective Self Tests/Logging

Then just grab this line. The 100 means 100% of the life remains, i.e. no wear at all; it is a new drive, after all.
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
I mostly followed the two blog posts below, which explain this in detail:

http://blog.yufeng.info/archives/1096
http://www.woxihuan.com/117417/1336095005082619.shtml
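Pulling the wear value out of the smartctl attribute table can be sketched as follows. This assumes, as the post does, that the normalized VALUE column of Media_Wearout_Indicator is the percentage of life left; ssd_life_percent is a name invented for this example.

```python
def ssd_life_percent(smartctl_output, attribute="Media_Wearout_Indicator"):
    """Return the normalized VALUE of the wear attribute, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == attribute:
            return int(fields[3])
    return None


LINE = "233 Media_Wearout_Indicator 0x0000   100   000   000    Old_age   Offline      -       100"
print(ssd_life_percent(LINE))  # 100
```

Splitting on arbitrary whitespace and indexing the VALUE column avoids depending on the exact number of spaces smartctl prints between columns.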


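For hosts without a RAID card, smartctl works well for checking whether a disk has errors.
# smartctl -a /dev/sdX        shows all information; sdX is your own disk
Since I only need the error counter log, I can use this instead:
# smartctl -l error /dev/sdc  lists only the error counter log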

[root@cloud-11 ~]# smartctl -l error /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      20680        755.998           0
write:         0        0         0         0       8177       1356.647           0
verify:        0        0         0         0        760         61.354           0

Non-medium error count:        0

Look at the error columns; all zeros means no problem. Then just extract them in code.
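One way to extract those columns, as a sketch that assumes the eight-field row layout shown above. The function and key names are invented for the example; which columns count as "errors" (everything except the invocation count and gigabytes processed) is also a choice made here.

```python
ERROR_COLUMNS = ("ecc_fast", "ecc_delayed", "rereads_rewrites",
                 "total_corrected", "total_uncorrected")


def parse_error_counter_log(text):
    """Parse the read/write/verify rows of a smartctl error counter log."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] in ("read:", "write:", "verify:"):
            rows[fields[0].rstrip(":")] = {
                "ecc_fast": int(fields[1]),
                "ecc_delayed": int(fields[2]),
                "rereads_rewrites": int(fields[3]),
                "total_corrected": int(fields[4]),
                "algorithm_invocations": int(fields[5]),
                "gigabytes_processed": float(fields[6]),
                "total_uncorrected": int(fields[7]),
            }
    return rows


def worst_error(rows):
    """Largest value found in any error column, or None if no rows were parsed."""
    values = [row[col] for row in rows.values() for col in ERROR_COLUMNS]
    return max(values) if values else None


SAMPLE = """\
read:          0        0         0         0      20680        755.998           0
write:         0        0         0         0       8177       1356.647           0
verify:        0        0         0         0        760         61.354           0
"""

rows = parse_error_counter_log(SAMPLE)
print(sorted(rows))       # ['read', 'verify', 'write']
print(worst_error(rows))  # 0
```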
On this host without a RAID card, though, running smartctl against the SSD gives no error counter log:

[root@cloud-11 ~]# smartctl -a /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     OCZ INTREPID 3600
Serial Number:    A21N8061423000020
LU WWN Device Id: 5 e83a97 100006dd5
Firmware Version: 1.4.6.0
User Capacity:    800,166,076,416 bytes [800 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 (revision not indicated)
Local Time is:    Tue Aug 25 15:34:29 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  25) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       5116
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       12
100 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       4009824
171 Unknown_Attribute       0x0000   090   000   000    Old_age   Offline      -       12041
174 Unknown_Attribute       0x0000   066   100   000    Old_age   Offline      -       8
184 End-to-End_Error        0x0000   009   100   000    Old_age   Offline      -       1271
187 Reported_Uncorrect      0x0000   100   100   000    Old_age   Offline      -       0
190 Airflow_Temperature_Cel 0x0000   045   063   000    Old_age   Offline      -       45
195 Hardware_ECC_Recovered  0x0000   000   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   000   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   000   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       2732
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       2458
202 Data_Address_Mark_Errs  0x0000   100   100   000    Old_age   Offline      -       2371926836
205 Thermal_Asperity_Rate   0x0000   100   100   000    Old_age   Offline      -       3000
206 Flying_Height           0x0000   000   100   000    Old_age   Offline      -       0
207 Spin_High_Current       0x0000   003   100   000    Old_age   Offline      -       90
208 Spin_Buzz               0x0000   000   100   000    Old_age   Offline      -       14
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       9175
211 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
212 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
213 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
214 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
221 G-Sense_Error_Rate      0x0000   100   100   000    Old_age   Offline      -       0
222 Loaded_Hours            0x0000   100   100   000    Old_age   Offline      -       0
230 Head_Amplitude          0x0000   001   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   000   000    Old_age   Offline      -       100
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       7079
251 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       20961

SMART Error Log not supported
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%         0         -
# 2  Short offline       Aborted by host               90%         0         -

Device does not support Selective Self Tests/Logging

But it does report the SSD's life:
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
After a long search I still found no way to detect errors on this SSD without RAID; I can only monitor its life. If anyone knows a way, please tell me.

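That is more or less everything. The overall approach is:

Use the detection tools:
For disks not behind a RAID card, use smartctl -a /dev/sdX and watch whether the values in the error columns of the Error counter log increase;
For disks behind a RAID card, use MegaCli and watch the Error Count.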
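Whichever tool supplies the count, a Nagios plugin has to turn it into an exit code and a one-line message. A minimal sketch, using the standard Nagios plugin return codes; the threshold of 5 mirrors the script at the end of this post and is otherwise arbitrary, and status_for is a hypothetical name.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def status_for(error_count, critical_at=5):
    """Map an error count to a (Nagios exit code, message) pair."""
    if error_count is None:
        return UNKNOWN, "UNKNOWN|could not read an error count"
    if error_count == 0:
        return OK, "OK"
    if error_count >= critical_at:
        return CRITICAL, "CRITICAL|%d errors (>= %d)" % (error_count, critical_at)
    return WARNING, "WARNING|%d errors (< %d)" % (error_count, critical_at)


code, message = status_for(0)
print(code, message)  # 0 OK
```

A real plugin would print the message and pass the code to sys.exit(), since Nagios reads the state from the process exit status.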


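Finally, my look into ioerr_cnt. The OS was Red Hat 5.x (I do not remember the exact version); df -h shows the disk partitions.
Every disk has this file under its directory in /sys, holding a single value: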

# cat /sys/block/sdb/device/ioerr_cnt
0x1494

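Judging by the name, ioerr_cnt should be a count of I/O errors, so its value would be the number of I/O errors that have occurred. 0x1494 is not a small number. Does it mean the disk is failing?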
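For reference, 0x1494 is 5268 in decimal. A short sketch for collecting the counter from every disk, assuming the /sys/block/<disk>/device/ioerr_cnt layout shown above and a hexadecimal value in each file; read_ioerr_counts is a name made up for this example.

```python
import glob
import os


def read_ioerr_counts(sys_block="/sys/block"):
    """Return {disk name: ioerr_cnt as an int} for every disk exposing the file."""
    counts = {}
    pattern = os.path.join(sys_block, "*", "device", "ioerr_cnt")
    for path in glob.glob(pattern):
        disk = path.split(os.sep)[-3]  # .../<disk>/device/ioerr_cnt
        try:
            with open(path) as handle:
                # The file holds a hex string such as "0x1494".
                counts[disk] = int(handle.read().strip(), 16)
        except (OSError, ValueError):
            continue  # unreadable or unexpected content: skip this disk
    return counts


print(int("0x1494", 16))  # 5268
print(read_ioerr_counts())
```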
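My mentor then found a discussion of this question in the Red Hat community and showed it to me. If you are interested, look it up there yourself; I cannot conveniently share it here.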

[Troubleshooting] How do I determine which io are causing ioerr_cnt to increase?

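That article exists to offer a way of finding out which I/O caused the error. But even once you find it, does it say anything about the disk's health? If only some process hit an I/O error, and the cause lies in that process itself, then it has nothing to do with the disk.
I looked at ioerr_cnt on all 9 disks across my three hosts and found that only one disk had a value of 0, yet smartctl and MegaCli reported 0 errors for all of them.
In the end I decided to drop the ioerr_cnt check, since it cannot be tied entirely to disk health, and to treat MegaCli and smartctl as the standard.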


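Written down like this it feels like very little, but I spent nearly a week on the research plus several more days writing the code. It is all in Python; I had not touched Python for a long time, so I spent a while looking up how various functions are used. Still, I learned a lot. I used to be in awe of Nagios scripts (some of them show nothing but gibberish when opened), and now I find they are actually quite simple. The main thing is to pick the right tool; after that it is mostly string processing. Python is a good thing.
The final code is below. It is quite simple, nothing fancy: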

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Description:
#   This application is used to check the physical disks by using the
#   MegaCli and smartctl tools.
#
# Author: Jiang Chuan <806692341@qq.com>
#
from __future__ import print_function

import argparse
import sys

try:
    from subprocess import getstatusoutput  # Python 3
except ImportError:
    from commands import getstatusoutput  # Python 2

SMARTCTL = 'smartctl'
LIST_ERROR = '-l error'
DISK = '/dev/sdc'
LSPCI = 'lspci | grep -i raid'
MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'
PDLIST = '-PDList -aALL'
DEVICE = "| grep 'Device Id'"
ERROR = '| grep Error'
WEAROUT = '| grep Media_Wearout_Indicator'

# Nagios exit codes
STATUS_OK = 0
STATUS_WARNING = 1
STATUS_ERROR = 2
STATUS_UNKNOWN = 3


def max_error(row):
    """Return the largest error figure in one row of the error counter log."""
    words = row.split()
    # Columns 1-4: ECC fast, ECC delayed, rereads/rewrites, total corrected;
    # column 7: total uncorrected errors.
    return max(int(words[i]) for i in (1, 2, 3, 4, 7))


def check_smartctl(disk=DISK):
    (status, output) = getstatusoutput('%s %s %s' % (SMARTCTL, LIST_ERROR, disk))
    lines = output.split('\n')
    if status != 0:
        print('UNKNOWN|Something unexpected happened:' + lines[-1])
        return STATUS_UNKNOWN
    # Collect the read/write/verify rows, then judge once after the loop.
    rows = {}
    for item in lines:
        for name in ('read', 'write', 'verify'):
            if item.strip().startswith(name + ':'):
                rows[name] = item
    if len(rows) < 3:
        print('UNKNOWN|We can not get the error count,please check')
        return STATUS_UNKNOWN
    error_list = [max_error(rows[name]) for name in ('read', 'write', 'verify')]
    if max(error_list) >= 5:
        print('ERROR|There is too much error:' + str(error_list) + ' >= 5')
        return STATUS_ERROR
    elif max(error_list) == 0:
        print('OK')
        return STATUS_OK
    else:
        print('WARNING|There is some error need handle:' + str(error_list) + ' < 5')
        return STATUS_WARNING


def check_lsi():
    (status, output) = getstatusoutput(LSPCI)
    if status != 0:
        print('UNKNOWN|LSPCI encounter a problem')
        return STATUS_UNKNOWN
    if 'LSI' in output:
        return STATUS_OK
    print('ERROR|There is no LSI raid card')
    return STATUS_ERROR


def trailing_ints(output):
    """Return the integer at the end of every non-empty output line."""
    return [int(item.split()[-1]) for item in output.split('\n') if item.strip()]


def get_device_id():
    (status, output) = getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, DEVICE))
    # An empty list signals failure; the caller reports it.
    return trailing_ints(output) if status == 0 else []


def get_error_count():
    (status, output) = getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, ERROR))
    # An empty list signals failure; the caller reports it.
    return trailing_ints(output) if status == 0 else []


def check_MegaCli():
    status = check_lsi()
    if status != STATUS_OK:
        return status
    device_id = get_device_id()
    error_count = get_error_count()
    # Some judgement, maybe useless
    if len(device_id) < 1 or len(error_count) < 1:
        print('ERROR|There is some error because one of the device_id and error_count is 0')
        return STATUS_ERROR
    if len(device_id) * 2 != len(error_count):
        print('ERROR|There is some error because the num of error_count does not equal to double device_id')
        return STATUS_ERROR
    worst = max(error_count)
    if worst == 0:
        print('OK')
        return STATUS_OK
    # Each device contributes two counts (media, other), so halve the index.
    device = device_id[error_count.index(worst) // 2]
    if worst >= 5:
        print('ERROR|There is %d error in device %d' % (worst, device))
        return STATUS_ERROR
    print('WARNING|There is %d error in device %d' % (worst, device))
    return STATUS_WARNING


def report_ssd_life(command):
    (status, output) = getstatusoutput(command)
    if status != 0 or not output.strip():
        print('UNKNOWN|Something unexpected happened,now is doing check_ssd().')
        return STATUS_UNKNOWN
    # Attribute row: ID, name, flag, VALUE, ...; VALUE is read as percent left.
    life = int(output.split()[3])
    if life >= 50:
        print('OK|The life of the SSD is ' + str(life) + '% left')
        return STATUS_OK
    elif life >= 20:
        print('WARNING|The life of the SSD is ' + str(life) + '% < 50%')
        return STATUS_WARNING
    else:
        print('CRITICAL|The life of the SSD is ' + str(life) + '% < 20%')
        return STATUS_ERROR


def check_ssd(device_id, disk):
    return report_ssd_life('%s -a -d sat+megaraid,%s %s %s' % (SMARTCTL, device_id, disk, WEAROUT))


def check_ssd_no_id(disk):
    return report_ssd_life('%s -a %s %s' % (SMARTCTL, disk, WEAROUT))


def init_option():
    parser = argparse.ArgumentParser(description="DISK nagios plugin.")
    parser.add_argument('-r', '--raid', help='raid or not(y/n)')
    parser.add_argument('-s', '--ssd', help='ssd or not(y/n), need device_id(0,1,2) and disk(/dev/sdc)')
    parser.add_argument('-i', '--device', help='Device Id(0,1,2), which is needed in check_ssd')
    parser.add_argument('-d', '--disk', help='DISK(/dev/sdx),which is needed in check_ssd')
    return parser


def main():
    args = init_option().parse_args()
    ssd = args.ssd == 'y'
    if args.raid == 'y':
        if not ssd:
            return check_MegaCli()
        if not args.device or not args.disk:
            print('ERROR|Check ssd needs device id and disk')
            return STATUS_ERROR
        # The given id must be one of the ids MegaCli reports
        if int(args.device) in get_device_id():
            return check_ssd(args.device, args.disk)
        print('ERROR|You must specify a valid Device_Id, got ' + str(args.device))
        return STATUS_ERROR
    if not ssd:
        return check_smartctl(args.disk or DISK)
    # An SSD that is not behind MegaCli needs no device id
    if not args.disk:
        print('ERROR|Check the life of SSD with no ID must assign the DISK(/dev/sdx)')
        return STATUS_ERROR
    return check_ssd_no_id(args.disk)


if __name__ == '__main__':
    sys.exit(main())

# usage: check_disk_health_v2.py [-h] [-r RAID] [-s SSD] [-i DEVICE] [-d DISK]
# There is no auto-detection, so for every monitored host you must specify:
# Does it have RAID?
#   yes: check the SSD?
#       yes: check_ssd()
#       no:  check_MegaCli()
#   no: check the SSD?
#       yes: check_ssd_no_id()
#       no:  check_smartctl()
#
# Every parameter has to be given by hand, which is a bit of a hassle.