Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)


Applies to:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 11.2.0.2.0 [Release 10.1 to 11.2]
Information in this document applies to any platform.

Purpose

 

The Cluster Health Monitor FAQ is an evolving document that answers common questions about the Cluster Health Monitor.

Questions and Answers

What is the Cluster Health Monitor?

The Cluster Health Monitor collects OS statistics (system metrics) such as memory and swap space usage, processes, I/O usage, and network-related data. It collects this information in real time, usually once a second, and gathers the statistics through OS APIs to gain performance and reduce CPU overhead. The Cluster Health Monitor collects as much system metric data as is feasible within an acceptable level of resource consumption by the tool.
 

What is the purpose of the Cluster Health Monitor?

The Cluster Health Monitor was developed to provide system metrics and data for troubleshooting many different types of problems, such as node reboots and hangs, instance evictions and hangs, severe performance degradation, and any other problem for which OS metrics and data are needed.

By monitoring the data constantly, users can use the Cluster Health Monitor to detect potential problem areas such as CPU load, memory constraints, and spinning processes before the problem causes an unwanted outage.


Where can I get the Cluster Health Monitor?


The Cluster Health Monitor is an integrated part of 11.2.0.2 Oracle Grid Infrastructure for Linux (not on Linux Itanium) and Solaris (Sparc 64 and x86-64 only), so installing 11.2.0.2 Oracle Grid Infrastructure on those platforms automatically installs the Cluster Health Monitor. AIX has the Cluster Health Monitor starting with 11.2.0.3. The Cluster Health Monitor is also enabled for Windows (except Windows Itanium) in 11.2.0.3.
Prior to 11.2.0.2 on Linux, the Cluster Health Monitor can be downloaded from OTN.

http://www.oracle.com/technetwork/database/clustering/downloads/ipd-download-homepage-087212.html

The OTN version for Windows is not available.  Please upgrade to 11.2.0.3 if you need CHM for Windows.

What is the resource name for Cluster Health Monitor in 11.2.0.2?

ora.crf is the Cluster Health Monitor resource name that ohasd manages. Issue "crsctl stat res -t -init" to check the current status of the Cluster Health Monitor.
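For example, assuming $GRID_HOME points to the Grid Infrastructure home, the following filters the status output down to the CHM resource (the grep filter is only a convenience):

$GRID_HOME/bin/crsctl stat res -t -init | grep -A 2 ora.crf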

Can the Cluster Health Monitor be installed on a single node, non-RAC server?

The Cluster Health Monitor from OTN can be installed on a single node, non-RAC server. There is no need to install grid infrastructure or CRS to install and run the Cluster Health Monitor from OTN.
 

Where is oclumon?

If CHM is installed as part of an 11.2 installation on a supported platform, then oclumon is located in the GI_HOME/bin directory.

If the CHM is manually installed using the CHM file from OTN, then the location of oclumon is in:
Linux : /usr/lib/oracrf/bin
Windows : C:\Program Files\oracrf\bin

How do I collect the Cluster Health Monitor data?


"<GI_HOME>/bin/diagcollection.pl --collect --chmos" will produce output for all data collected in the repository. This can be a large amount of data and may take a long time, so the suggestion is to limit the query to the time interval of interest.

For example, issue “<GI_HOME>/bin/diagcollection.pl --collect --crshome $ORA_CRS_HOME --chmos --incidenttime <start time of interesting time period> --incidentduration 05:00”

The above outputs a report covering 5 hours from the time specified by incidenttime.
The incidenttime must be in MM/DD/YYYYHH:MN:SS format, where MM is the month, DD is the day, YYYY is the year, HH is the hour in 24-hour format, MN is the minute, and SS is the second. For example, if you want the incident time to start at 10:15 PM on June 01, 2011, the incident time is 06/01/201122:15:00. The incidenttime and incidentduration can be changed to capture more data.
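Putting the format and the duration flag together, a collection covering five hours starting at 10:15 PM on June 01, 2011 would look like the following (paths assume the Grid Infrastructure environment is set):

<GI_HOME>/bin/diagcollection.pl --collect --crshome $ORA_CRS_HOME --chmos --incidenttime 06/01/201122:15:00 --incidentduration 05:00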

Alternatively, if diagcollection.pl fails for any reason, issue 'oclumon dumpnodeview -allnodes -v -last "11:59:59" > your-filename'. This generates a report from the repository covering up to the last 12 hours; the -last value can be changed to get more or less data.
Another example of using oclumon is 'oclumon dumpnodeview -allnodes -v -s "2012-06-01 22:15:00" -e "2012-06-02 03:15:00" > /tmp/chm.log'. The difference in this command is that it specifies the start time (-s flag) and end time (-e flag).
 

Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?

This is due to bug 10048487, which affects 11.2.0.2: the bug in the script prevents diagcollection.pl from ever retrieving the master node.

The workaround for this is to issue
oclumon dumpnodeview -allnodes -v -last "amount of data needed"
For example, oclumon dumpnodeview -allnodes -v -last "01:00:00"
will provide the last one hour of data from all nodes.

How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?

Issue "<GI_HOME>/bin/diagcollection.pl -h" and "oclumon -h". You may need to drill down further to get information for different options.

What is IPD/OS?

IPD/OS is the old name for the Cluster Health Monitor. The names can be used interchangeably, although Oracle now calls the tool Cluster Health Monitor.

How is the Cluster Health Monitor different from OSWatcher?

OSWatcher collects OS statistics by running standard Unix commands such as vmstat, top, ps, iostat, netstat, mpstat, and meminfo. The private.net file can be configured in OSWatcher to issue the traceroute command over the private interconnect to test it. However, because the commands that OSWatcher runs are Unix commands, it uses more CPU and introduces more overhead on the servers compared to the Cluster Health Monitor. For example, each time OSWatcher issues vmstat and other commands, it spawns new processes. OSWatcher also runs at user priority, so it often cannot run when the CPU load is heavy.

Is the Cluster Health Monitor replacing OSWatcher?

The Cluster Health Monitor has many advantages over OSWatcher, the most significant being that the Cluster Health Monitor runs in real time, usually once a second, so it collects data even when OSWatcher cannot. However, there is some information, such as top, traceroute, and netstat output, that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor alongside OSWatcher is ideal. The two tools complement rather than replace each other.
On the other hand, if only one of the tools can be used, Oracle recommends using the Cluster Health Monitor.

How much overhead does the Cluster Health Monitor cause?

On today's servers, the Cluster Health Monitor uses less than 5% of one CPU core. The overhead of using the Cluster Health Monitor is minimal.

How much disk space is needed for the Cluster Health Monitor?

The Cluster Health Monitor takes up 1GB of space by default on all nodes in the cluster. Approximately 0.5 GB of data is collected per node per day. The size of the repository can be increased to collect and save up to 3 days of data, which increases disk usage accordingly.
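As a rough estimate from the figures above, the maximum three-day retention works out to about 0.5 GB/day × 3 days ≈ 1.5 GB per node.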

How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?


"oclumon manage -get repsize" will show the repository size, expressed as the retention time in seconds.

How can I increase the size of the Cluster Health Monitor repository ?


Issue "oclumon manage -repos resize <number of seconds, up to 259200>". Setting the value to 259200 collects and saves three days of data, but the repository can become very large. The suggestion is not to set the repository to more than one day (86400 seconds) unless needed.
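For example, to apply the suggested one-day maximum:

oclumon manage -repos resize 86400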
 

On what platforms can I run the Cluster Health Monitor?

11.2.0.1 and earlier: Linux only (download from OTN)
11.2.0.2: Solaris (Sparc 64 and x86-64 only), and Linux.
11.2.0.3: AIX, Solaris (Sparc 64 and x86-64 only), Linux, and Windows.

Cluster Health Monitor is NOT available for any Itanium platform such as Linux Itanium and Windows Itanium.

What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?

Remove the Cluster Health Monitor from OTN before upgrading the CRS or installing Grid Infrastructure.

Where is the Cluster Health Monitor from OTN installed on Linux?

$CRF_HOME is set to /usr/lib/oracrf on Linux by default if the Cluster Health Monitor is from OTN. This is the Cluster Health Monitor home location.

What logs and data should I gather before logging an SR for a Cluster Health Monitor error?

1) Provide 3-4 pstack outputs over a minute for osysmond.bin (a scripted sketch of steps 1-6 follows this list).
2) Output of strace -v for osysmond.bin for about 2 minutes.
3) strace -cp <osysmond.bin pid> for about 2 minutes.
4) oclumon dumpnodeview -v output for that node for 2 minutes.
5) Output of "uname -a".
6) Output of "ps -eLf | grep osysmond.bin".
7) The ologgerd and sysmond log files in the CRS_HOME/log/<host name> directory from all nodes.
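The following is a minimal sketch of steps 1-6, not a supported script: the pid discovery, output file names, and timings are assumptions, and it must be run as a privileged user on the affected node.

# Sketch only: adjust paths, timings, and privileges to your environment.
PID=$(pgrep -f osysmond.bin | head -1)

# 1) four pstack samples spread over about a minute
for i in 1 2 3 4; do pstack "$PID" >> /tmp/osysmond_pstack.out; sleep 15; done

# 2) strace -v output for about two minutes (attach only one tracer at a time)
strace -v -p "$PID" -o /tmp/osysmond_strace.out & SP=$!
sleep 120; kill $SP

# 3) strace -c summary for about two minutes
strace -c -p "$PID" -o /tmp/osysmond_strace_counts.out & SP=$!
sleep 120; kill $SP

# 4) roughly two minutes of node view data for this node
oclumon dumpnodeview -v -last "00:02:00" > /tmp/chm_nodeview.out

# 5) and 6) environment details
uname -a > /tmp/uname.out
ps -eLf | grep osysmond.bin > /tmp/osysmond_ps.out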

How do I increase the trace level of the Cluster Health Monitor?

Increase the log level for the daemons using:
oclumon debug log all allcomp:<trace level from 0 to 3>

The higher the trace level, the more detailed the tracing, so do not forget to reset the trace level back to 1 (the default trace level when CHM is first installed) by issuing "oclumon debug log all allcomp:1".
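For example, to raise tracing to the most detailed level while reproducing a problem and then restore the default:

oclumon debug log all allcomp:3
(reproduce the problem)
oclumon debug log all allcomp:1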

Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?

Procwatcher version 030810 and later can be used to monitor the IPD (CHM) processes. Just add the process names to the CLUSTERPROCS list. Procwatcher is now smarter about picking the path of the executable, so it can find the IPD daemons when it looks for them.

What are the processes and components for the Cluster Health Monitor ?


Cluster Logger Service (Ologgerd) – there is a master ologgerd that receives the data from the other nodes and saves it in the repository (a Berkeley DB database). It compresses the data before persisting it to save disk space. In an environment with multiple nodes, a replica ologgerd also runs on a node where the master ologgerd is not running. The master ologgerd syncs the data with the replica ologgerd by sending the data to it, and the replica ologgerd takes over if the master ologgerd dies. A new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster.

System Monitor Service (Sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects system statistics including CPU, memory usage, platform info, disk info, NIC info, process info, and filesystem info.
 
 

What is oclumon?


OCLUMON command-line tool - use the oclumon command line to query the CHM repository and display node-specific metrics for a specified time period.

You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period. These states are based on predefined thresholds for each resource metric and are denoted as red, orange, yellow, and green, indicating decreasing order of criticality.
 

What are the files such as *.bdb, _db.*, *.ldb, and log.* created by the tool in the BDB (Berkeley Database) location directory?

*.bdb & _db.* - These are files created for the Berkeley DB, which stores the collected data.

log.* - These are Berkeley DB log files, which preserve changes before they are applied to the database files. Checkpointing is set up, and the log files are reused.

*.ldb - This is the local logging file and MUST be present on all servers.

Note: in any case, do not remove any of these files.

Because it can take days or weeks to resolve a problem such as a node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed later when needed?

The Cluster Health Monitor is designed to store data for up to 3 days by increasing the size of the repository. If you want to store data for longer than that, one way is to zip the output from 'oclumon dumpnodeview' or 'diagcollection.pl' regularly (for example, every hour).

Another (suggested) way is to archive the whole BDB regularly (for example, every day).
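A minimal sketch of such a daily archive, assuming the BDB directory configured for ologgerd is known (both paths below are placeholders) and there is enough space for the copy:

BDB_DIR=/path/to/chm/bdb            # placeholder: the BDB location used by ologgerd on this node
DEST=/backup/chm/bdb_$(date +%Y%m%d)
mkdir -p "$DEST"
cp -r "$BDB_DIR"/. "$DEST"/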

CHM reads an archived BDB by starting it in debug mode, using
ologdbg -d <bdb location>
After it starts, issue oclumon dumpnodeview to get the data from the archived BDB.
For example, issue
oclumon dumpnodeview -n <node name> -s <start time> -e <end time> -v
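As a concrete illustration, with placeholder values for the archive path, node name, and time window:

ologdbg -d /backup/chm/bdb_20120601
oclumon dumpnodeview -n rac1 -s "2012-06-01 22:15:00" -e "2012-06-02 03:15:00" -v > /tmp/chm_replay.log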

Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?

Check the /usr/lib/oracrf/log/* directory for alert<nodename>.log, and the subdirectories for each daemon (SYSMOND, LOGGERD, OPROXYD) for the daemon logs.

How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?

The time in the repository is stored in UTC, and by default oclumon shows the time in UTC. As noted in the README, oclumon shows UTC if ORACRF_TZ is not set; setting ORACRF_TZ should fix the time zone issue.
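For example, before running oclumon (the value below is an assumption; the exact accepted format is described in the README):

export ORACRF_TZ='US/Pacific'
oclumon dumpnodeview -allnodes -v -last "01:00:00" > /tmp/chm_localtz.log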

Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?

You cannot install CHM from OTN if there is any conflicting install, so installing CHM from OTN on servers that have 11.2.0.2 Grid Infrastructure will not work. Disabling the CHM resource (ora.crf) on 11.2.0.2 still leaves the installation in place; hence, the OTN install will fail.

Where is the trace file for client like oclumon? How do I increase the trace level for oclumon?

The 'log' file for oclumon is in log/<hostname>/clients/oclumon.log.

Generally it is not generated because, at log level 0, there is no log data.
To see logs at a higher log level, do the following:
1. oclumon [enter interactive mode]
2. query> debug log all allcomp:3

After this, any command execution will produce finer logs in oclumon.log.

Can the directory path to the CHM repository be the same on all nodes if shared storage is used?

The CHM repository can be placed on shared storage under the same directory, although this is not recommended, mainly for performance reasons. In such a case, each node's repository is located under a subdirectory named after its hostname.

How much data (how long in time) does a node store locally when it cannot communicate with the master?

The local repository that a node uses to save CHM data while it cannot communicate with the master is small.

With a sampling interval of 1 second, this is ideally around 1 hour of data. In 11.2.0.3 the sampling interval moved to 5 seconds, so in that case 4-5 hours of data can be retained.

How often does CHM collect the system metric data? Can this be changed?

In pre-11.2.0.3, the CHM collection interval is usually once a second, but this can change depending on the amount of data being collected. In 11.2.0.3, the CHM collection interval changed to once every 5 seconds.

Currently, the collection interval cannot be changed.
 


What is the default CHM retention time? 

In the pre-11.2.0.2 CHM available from OTN, the default data retention time was 24 hours.

In 11.2.0.2, the retention time is determined by the repository size, and the default size has changed to 1GB. Depending on how large the cluster is, the default retention time differs. For example, it is usually 6.9 hours for a one-node cluster when the sampling interval is 1 second. Issue "oclumon manage -get repsize" to find out the retention time of your cluster; the output is in seconds.

With the sampling interval moving to 5 seconds in 11.2.0.3, the retention time becomes five times the retention time at the 1-second sampling interval.
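As a rough worked example, the 6.9 hours quoted above for a one-node cluster at a 1-second interval becomes approximately 5 × 6.9 ≈ 34.5 hours at the 5-second interval.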
 

How can you reduce the size of a BDB file that has become large for any reason?


You can manage the repository size in terms of space using the command below. This feature is available starting with 11.2.0.3.

oclumon manage -repos changesize <memsize>.
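For example (the unit of <memsize> is not stated here, so treat the value below as an illustration and check the oclumon help output for the exact unit):

oclumon manage -repos changesize 2048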

As a temporary workaround, you can kill ologgerd and delete the contents of the BDB directory. osysmond should respawn ologgerd, and a new BDB file will be created. The past data is lost when this is done.
 

Can you set up CHM to run locally on each node?

With the OTN version, one can do that by installing CHM on each node independently, although it is not recommended.

The Cluster Health Monitor that comes with the Grid Infrastructure install image must run with only one master ologgerd, so it cannot be set up to run locally on each node.

Is a GUI tool for CHM available for 11.2.0.2?

There is currently no GUI tool available that can be used with the 11.2.0.2 version of the Cluster Health Monitor. The OTN version of the GUI tool and the 11.2.0.2 version are incompatible.

Can CHM be used on a single node non-RAC server?

The CHM available on OTN can be used on a single-node, non-RAC server, but only the Linux and Windows versions of CHM are available from OTN. The CHM that comes with GI in 11.2 and higher must run with GI (RAC).

How to start and stop CHM that is installed as a part of GI in 11.2 and higher?


The ora.crf resource in 11.2 GI (and higher) is the resource for CHM, and it is managed by ohasd. Starting and stopping the ora.crf resource starts and stops CHM.

To stop CHM (or the ora.crf resource managed by ohasd):
$GRID_HOME/bin/crsctl stop res ora.crf -init

To start CHM (or the ora.crf resource managed by ohasd):
$GRID_HOME/bin/crsctl start res ora.crf -init
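After either command, the current state of CHM can be confirmed with the status check described earlier, for example:

$GRID_HOME/bin/crsctl stat res ora.crf -init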
 
Source: https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=122290510358569&id=1328466.1&_afrWindowMode=0&_adf.ctrl-state=1cqyk1bczf_71