Monitoring Hadoop 2.2.0 with Ganglia 3.6.0


After finishing the installation described in "Ganglia 3.6.0 Installation Steps (including the Python module)", Ganglia can already monitor machine-level information such as load, CPU, and disk usage. To go further and monitor a Hadoop cluster through Ganglia, however, you also need to configure Hadoop's metrics configuration files and make a few adjustments to the Ganglia configuration.


I. Hadoop configuration

There are two relevant configuration files under HADOOP_PATH/etc/hadoop/: hadoop-metrics.properties and hadoop-metrics2.properties.
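These files are read by each Hadoop daemon at startup, so any change has to be present on every node before the daemons are restarted. A minimal distribution sketch, assuming the Mas1/Mas2/Sla1/Sla2 hostnames used later in this article and the same HADOOP_PATH on every node:

# hypothetical host list -- adjust to your own cluster
for host in Mas1 Mas2 Sla1 Sla2; do
  scp "$HADOOP_PATH/etc/hadoop/hadoop-metrics2.properties" "$host:$HADOOP_PATH/etc/hadoop/"
done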

1. hadoop-metrics.properties

This is the configuration file for Hadoop's older (v1) metrics framework, originally aimed at integrating with Ganglia releases before 3.1 (the message format changed significantly between Ganglia 3.0 and 3.1 and is not compatible with earlier versions, which is why separate GangliaContext and GangliaContext31 classes exist).

Example:

# Configuration of the "dfs" context for null##dfs.class=org.apache.hadoop.metrics.spi.NullContext# Configuration of the "dfs" context for file#dfs.class=org.apache.hadoop.metrics.file.FileContext#dfs.period=10#dfs.fileName=/tmp/dfsmetrics.log# Configuration of the "dfs" context for ganglia# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 dfs.period=10 dfs.servers=Mas2:8649# Configuration of the "mapred" context for null##mapred.class=org.apache.hadoop.metrics.spi.NullContext# Configuration of the "mapred" context for file#mapred.class=org.apache.hadoop.metrics.file.FileContext#mapred.period=10#mapred.fileName=/tmp/mrmetrics.log# Configuration of the "mapred" context for ganglia# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 mapred.period=10 mapred.servers=Mas2:8649# Configuration of the "jvm" context for null#jvm.class=org.apache.hadoop.metrics.spi.NullContext# Configuration of the "jvm" context for file#jvm.class=org.apache.hadoop.metrics.file.FileContext#jvm.period=10#jvm.fileName=/tmp/jvmmetrics.log# Configuration of the "jvm" context for ganglia# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 jvm.period=10 jvm.servers=Mas2:8649# Configuration of the "rpc" context for null##rpc.class=org.apache.hadoop.metrics.spi.NullContext# Configuration of the "rpc" context for file#rpc.class=org.apache.hadoop.metrics.file.FileContext#rpc.period=10#rpc.fileName=/tmp/rpcmetrics.log# Configuration of the "rpc" context for ganglia# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 rpc.period=10 rpc.servers=Mas2:8649# Configuration of the "ugi" context for null##ugi.class=org.apache.hadoop.metrics.spi.NullContext# Configuration of the "ugi" context for file#ugi.class=org.apache.hadoop.metrics.file.FileContext#ugi.period=10#ugi.fileName=/tmp/ugimetrics.log# Configuration of the "ugi" context for ganglia# ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext ugi.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 ugi.period=10 ugi.servers=Mas2:8649

2. hadoop-metrics2.properties

This is the configuration file for Hadoop's metrics2 framework, used when integrating with Ganglia 3.1 and later (this is the file used in this article).

Example (enable the entries that correspond to the daemons actually running on the node: namenode, datanode, and so on):

#
#   Licensed to the Apache Software Foundation (ASF) under one or more
#   contributor license agreements.  See the NOTICE file distributed with
#   this work for additional information regarding copyright ownership.
#   The ASF licenses this file to You under the Apache License, Version 2.0
#   (the "License"); you may not use this file except in compliance with
#   the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#

# syntax: [prefix].[source|sink].[instance].[options]
# See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# default sampling period, in seconds
*.period=10

# The namenode-metrics.out will contain metrics from all context
#namenode.sink.file.filename=namenode-metrics.out
# Specifying a special sampling period for namenode:
#namenode.sink.*.period=8

#datanode.sink.file.filename=datanode-metrics.out

# the following example split metrics of different
# context to different sinks (in this case files)
#jobtracker.sink.file_jvm.context=jvm
#jobtracker.sink.file_jvm.filename=jobtracker-jvm-metrics.out
#jobtracker.sink.file_mapred.context=mapred
#jobtracker.sink.file_mapred.filename=jobtracker-mapred-metrics.out

#tasktracker.sink.file.filename=tasktracker-metrics.out
#maptask.sink.file.filename=maptask-metrics.out
#reducetask.sink.file.filename=reducetask-metrics.out

#
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

#namenode.sink.ganglia.servers=Mas2:8649
#resourcemanager.sink.ganglia.servers=namenode1:8649
datanode.sink.ganglia.servers=namenode1:8649
#nodemanager.sink.ganglia.servers=namenode1:8649
#maptask.sink.ganglia.servers=namenode1:8649
#reducetask.sink.ganglia.servers=namenode1:8649
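Note that the *.sink.ganglia.class line must be active for any of the ganglia sink settings below it to take effect. As a sketch of the per-daemon pattern, a host that runs the NameNode and ResourceManager would leave the datanode line commented and enable the matching lines instead; the server value must point at a gmond udp_recv_channel (port 8649 in this article). The hostnames below keep the article's naming and are assumptions about which host runs the receiving gmond (the sample above mixes Mas2 and namenode1):

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
# on the master host: report NameNode and ResourceManager metrics
namenode.sink.ganglia.servers=Mas2:8649
resourcemanager.sink.ganglia.servers=Mas2:8649
#datanode.sink.ganglia.servers=Mas2:8649
#nodemanager.sink.ganglia.servers=Mas2:8649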

II. Ganglia configuration

1. gmetad.conf

gmetad collects the monitoring data reported by every node.

Example (it is enough to enable the options that are uncommented in the sample file below):

# This is an example of a Ganglia Meta Daemon configuration file
#                http://ganglia.sourceforge.net/
#
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep daemon in the forground and
# show only error messages. Setting this value higher than 1 will make 
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file. 
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical 
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format: 
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
# 
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in 
# seconds. The source will be polled at this interval on average. 
# If the polling interval is omitted, 15sec is asssumed. 
#
# If you choose to set the polling interval to something other than the default,
# note that the web frontend determines a host as down if its TN value is less
# than 4 * TMAX (20sec by default).  Therefore, if you set the polling interval
# to something around or greater than 80sec, this will cause the frontend to
# incorrectly display hosts as down even though they are not.
#
# A list of machines which service the data source follows, in the 
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost  my.machine.edu:8649  1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655  1.3.4.8
#data_source "my cluster" localhost
data_source "my cluster" 10 Mas1:8650 Mas2:8650 Sla1:8650 Sla2:8650

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# Old Default RRA: Keep 1 hour of metrics at 15 second resolution. 1 day at 6 minute
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
# New Default RRA
# Keep 5856 data points at 15 second resolution assuming 15 second (default) polling. That's 1 day
# Two weeks of data points at 1 minute resolution (average)
#RRAs "RRA:AVERAGE:0.5:1:5856" "RRA:AVERAGE:0.5:4:20160" "RRA:AVERAGE:0.5:40:52704"
#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always assume
# we are the "authority" on data source feeds. This approach does not scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
# scalable off
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a GRID
# tag with this name.
# default: unspecified
# gridname "MyGrid"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.
# default: "http://hostname/ganglia/",
#   where hostname is the name of this machine, as defined by gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted. 
# default: There is no default value
# trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off
# all_trusted on
#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# Umask to apply to created rrd files and grid directory structure
# default: 0 (files are public)
# umask 022
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
xml_port 8651
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"
rrd_rootdir "/var/www/html/rrds"
#
#-------------------------------------------------------------------------------
# List of metric prefixes this gmetad will not summarize at cluster or grid level.
# default: There is no default value
# unsummarized_metrics diskstat CPU
#
#-------------------------------------------------------------------------------
# In earlier versions of gmetad, hostnames were handled in a case
# sensitive manner
# If your hostname directories have been renamed to lower case,
# set this option to 0 to disable backward compatibility.
# From version 3.2, backwards compatibility will be disabled by default.
# default: 1   (for gmetad < 3.2)
# default: 0   (for gmetad >= 3.2)
case_sensitive_hostnames 0

#-------------------------------------------------------------------------------
# It is now possible to export all the metrics collected by gmetad directly to
# graphite by setting the following attributes. 
#
# The hostname or IP address of the Graphite server
# default: unspecified
# carbon_server "my.graphite.box"
#
# The port and protocol on which Graphite is listening
# default: 2003
# carbon_port 2003
#
# default: tcp
# carbon_protocol udp
#
# **Deprecated in favor of graphite_path** A prefix to prepend to the 
# metric names exported by gmetad. Graphite uses dot-
# separated paths to organize and refer to metrics. 
# default: unspecified
# graphite_prefix "datacenter1.gmetad"
#
# A user-definable graphite path. Graphite uses dot-
# separated paths to organize and refer to metrics. 
# For reverse compatibility graphite_prefix will be prepended to this
# path, but this behavior should be considered deprecated.
# This path may include 3 variables that will be replaced accordingly:
# %s -> source (cluster name)
# %h -> host (host name)
# %m -> metric (metric name)
# default: graphite_prefix.%s.%h.%m
# graphite_path "datacenter1.gmetad.%s.%h.%m"

# Number of milliseconds gmetad will wait for a response from the graphite server 
# default: 500
# carbon_timeout 500
#
#-------------------------------------------------------------------------------
# Memcached configuration (if it has been compiled in)
# Format documentation at http://docs.libmemcached.org/libmemcached_configuration.html
# default: ""
# memcached_parameters "--SERVER=127.0.0.1"
#
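For reference, only a few directives in the file above differ from (or are uncommented relative to) the stock gmetad.conf; everything else keeps its default, commented-out value:

data_source "my cluster" 10 Mas1:8650 Mas2:8650 Sla1:8650 Sla2:8650
xml_port 8651
interactive_port 8652
rrd_rootdir "/var/www/html/rrds"
case_sensitive_hostnames 0

The data_source entry polls each node on port 8650, which must match the tcp_accept_channel port set in gmond.conf below, and gmetad writes its RRD databases under rrd_rootdir, so that directory must be writable by the gmetad user.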

2. gmond.conf

gmond configures how this machine's monitoring data is sent and which cluster the machine belongs to (gmond can use multicast; in this example the monitoring data is delivered to a designated address instead).

Example (the parts that need modification are roughly lines 29–72 of the stock file, i.e. the cluster, host, udp_send_channel, udp_recv_channel, and tcp_accept_channel sections):

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommeting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /*secs */
}

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "my cluster"
  owner = "nobody"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
#  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8649
#  bind = 239.2.11.71
#  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8650
  # If you want to gzip XML output
  gzip_output = no
}

/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}

/* Optional sFlow settings */
#sflow {
# udp_port = 6343
# accept_vm_metrics = yes
# accept_jvm_metrics = yes
# multiple_jvm_instances = no
# accept_http_metrics = yes
# multiple_http_instances = no
# accept_memcache_metrics = yes
# multiple_memcache_instances = no
#}

/* Each metrics module that is referenced by gmond must be specified and
   loaded. If the module has been statically linked with gmond, it does
   not require a load path. However all dynamically loadable modules must
   include a load path. */
modules {
  module {
    name = "core_metrics"
  }
  module {
    name = "cpu_module"
    path = "modcpu.so"
  }
  module {
    name = "disk_module"
    path = "moddisk.so"
  }
  module {
    name = "load_module"
    path = "modload.so"
  }
  module {
    name = "mem_module"
    path = "modmem.so"
  }
  module {
    name = "net_module"
    path = "modnet.so"
  }
  module {
    name = "proc_module"
    path = "modproc.so"
  }
  module {
    name = "sys_module"
    path = "modsys.so"
  }
}

/* The old internal 2.5.x metric array has been replaced by the following
   collection_group directives.  What follows is the default behavior for
   collecting and sending metrics that is as close to 2.5.x behavior as
   possible. */

/* This collection group will cause a heartbeat (or beacon) to be sent every
   20 seconds.  In the heartbeat is the GMOND_STARTED data which expresses
   the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}

/* This collection group will send general info about this host every
   1200 secs.
   This information doesn't change between reboots and is only collected
   once. */
collection_group {
  collect_once = yes
  time_threshold = 1200
  metric {
    name = "cpu_num"
    title = "CPU Count"
  }
  metric {
    name = "cpu_speed"
    title = "CPU Speed"
  }
  metric {
    name = "mem_total"
    title = "Memory Total"
  }
  /* Should this be here? Swap can be added/removed between reboots. */
  metric {
    name = "swap_total"
    title = "Swap Space Total"
  }
  metric {
    name = "boottime"
    title = "Last Boot Time"
  }
  metric {
    name = "machine_type"
    title = "Machine Type"
  }
  metric {
    name = "os_name"
    title = "Operating System"
  }
  metric {
    name = "os_release"
    title = "Operating System Release"
  }
  metric {
    name = "location"
    title = "Location"
  }
}

/* This collection group will send the status of gexecd for this host
   every 300 secs.*/
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
  collect_once = yes
  time_threshold = 300
  metric {
    name = "gexec"
    title = "Gexec Status"
  }
}

/* This collection group will collect the CPU status info every 20 secs.
   The time threshold is set to 90 seconds.  In honesty, this
   time_threshold could be set significantly higher to reduce
   unneccessary  network chatter. */
collection_group {
  collect_every = 20
  time_threshold = 90
  /* CPU status */
  metric {
    name = "cpu_user"
    value_threshold = "1.0"
    title = "CPU User"
  }
  metric {
    name = "cpu_system"
    value_threshold = "1.0"
    title = "CPU System"
  }
  metric {
    name = "cpu_idle"
    value_threshold = "5.0"
    title = "CPU Idle"
  }
  metric {
    name = "cpu_nice"
    value_threshold = "1.0"
    title = "CPU Nice"
  }
  metric {
    name = "cpu_aidle"
    value_threshold = "5.0"
    title = "CPU aidle"
  }
  metric {
    name = "cpu_wio"
    value_threshold = "1.0"
    title = "CPU wio"
  }
  metric {
    name = "cpu_steal"
    value_threshold = "1.0"
    title = "CPU steal"
  }
  /* The next two metrics are optional if you want more detail...
     ... since they are accounted for in cpu_system.
  metric {
    name = "cpu_intr"
    value_threshold = "1.0"
    title = "CPU intr"
  }
  metric {
    name = "cpu_sintr"
    value_threshold = "1.0"
    title = "CPU sintr"
  }
  */
}

collection_group {
  collect_every = 20
  time_threshold = 90
  /* Load Averages */
  metric {
    name = "load_one"
    value_threshold = "1.0"
    title = "One Minute Load Average"
  }
  metric {
    name = "load_five"
    value_threshold = "1.0"
    title = "Five Minute Load Average"
  }
  metric {
    name = "load_fifteen"
    value_threshold = "1.0"
    title = "Fifteen Minute Load Average"
  }
}

/* This group collects the number of running and total processes */
collection_group {
  collect_every = 80
  time_threshold = 950
  metric {
    name = "proc_run"
    value_threshold = "1.0"
    title = "Total Running Processes"
  }
  metric {
    name = "proc_total"
    value_threshold = "1.0"
    title = "Total Processes"
  }
}

/* This collection group grabs the volatile memory metrics every 40 secs and
   sends them at least every 180 secs.  This time_threshold can be increased
   significantly to reduce unneeded network traffic. */
collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "mem_free"
    value_threshold = "1024.0"
    title = "Free Memory"
  }
  metric {
    name = "mem_shared"
    value_threshold = "1024.0"
    title = "Shared Memory"
  }
  metric {
    name = "mem_buffers"
    value_threshold = "1024.0"
    title = "Memory Buffers"
  }
  metric {
    name = "mem_cached"
    value_threshold = "1024.0"
    title = "Cached Memory"
  }
  metric {
    name = "swap_free"
    value_threshold = "1024.0"
    title = "Free Swap Space"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 300
  metric {
    name = "bytes_out"
    value_threshold = 4096
    title = "Bytes Sent"
  }
  metric {
    name = "bytes_in"
    value_threshold = 4096
    title = "Bytes Received"
  }
  metric {
    name = "pkts_in"
    value_threshold = 256
    title = "Packets Received"
  }
  metric {
    name = "pkts_out"
    value_threshold = 256
    title = "Packets Sent"
  }
}

/* Different than 2.5.x default since the old config made no sense */
collection_group {
  collect_every = 1800
  time_threshold = 3600
  metric {
    name = "disk_total"
    value_threshold = 1.0
    title = "Total Disk Space"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "disk_free"
    value_threshold = 1.0
    title = "Disk Space Available"
  }
  metric {
    name = "part_max_used"
    value_threshold = 1.0
    title = "Maximum Disk Space Used"
  }
}

include ("/usr/local/ganglia/etc/conf.d/*.conf")
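Relative to the stock gmond.conf, the changes above come down to commenting out the multicast settings, enabling bind_hostname, and moving the tcp_accept_channel (XML) port to 8650 so that it matches the data_source entries in gmetad.conf. One caveat: with mcast_join commented out, the udp_send_channel above has no destination at all, so each gmond only answers gmetad's TCP polls with the metrics it collects locally (plus whatever it receives over UDP, such as the Hadoop metrics2 packets), which is sufficient for the per-node data_source list used in this article. If you instead want every gmond to forward its metrics to a single aggregating gmond over unicast UDP, you would add a host line; a sketch follows, where Mas2 as the aggregator is an assumption rather than something stated in the article:

udp_send_channel {
  bind_hostname = yes
#  mcast_join = 239.2.11.71
  host = Mas2   # assumed aggregator; point at the gmond that gmetad polls
  port = 8649
  ttl = 1
}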

III. Summary

With this, the Hadoop and Ganglia configuration is complete. Restart the Hadoop daemons and the Ganglia processes (gmond/gmetad), and the Hadoop metrics will appear in the Ganglia web frontend.
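A minimal restart-and-verify sketch; the /usr/local/ganglia prefix follows the include path used in gmond.conf above, while the exact daemon names, binary locations, and the use of nc are assumptions to adapt to your own environment:

# restart the Hadoop daemons on every node so they reload hadoop-metrics2.properties
# (datanode shown here; use namenode, resourcemanager, etc. as appropriate)
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

# restart ganglia: gmond on every node, gmetad on the collector host
pkill gmond;  /usr/local/ganglia/sbin/gmond  --conf=/usr/local/ganglia/etc/gmond.conf
pkill gmetad; /usr/local/ganglia/sbin/gmetad --conf=/usr/local/ganglia/etc/gmetad.conf

# quick checks: the gmond XML on port 8650 should now list dfs/jvm/rpc metrics,
# and new .rrd files should appear under rrd_rootdir
nc Mas2 8650 | grep -c METRIC
ls /var/www/html/rrds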
