Monitoring Hadoop and HBase Clusters with Ganglia

1. Install ganglia-webfrontend and ganglia-monitor on the master node

  sudo apt-get install ganglia-webfrontend ganglia-monitor

Install both packages on the master node; every other monitored node only needs ganglia-monitor.
Link the Ganglia web files into Apache's default document root:

  sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia
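On Ubuntu the web frontend is served by Apache, so the new link is picked up after the web server reloads. A minimal check, assuming the service is named apache2 as on stock Ubuntu:

  sudo service apache2 restart
  # then open http://master/ganglia in a browser; the page stays empty until gmond and gmetad are running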

2. Install ganglia-monitor on the other nodes

On every other monitored node, only ganglia-monitor is needed:

  sudo apt-get install ganglia-monitor

3. Ganglia configuration

gmond.conf
Every node needs /etc/ganglia/gmond.conf, and the file is identical on all of them:

  sudo vim /etc/ganglia/gmond.conf

The modified /etc/ganglia/gmond.conf:
  globals {
    daemonize = yes               # run as a daemon
    setuid = yes
    user = ganglia                # user that gmond runs as
    debug_level = 0
    max_udp_msg_len = 1472
    mute = no
    deaf = no
    host_dmax = 0 /*secs */
    cleanup_threshold = 300 /*secs */
    gexec = no
    send_metadata_interval = 10   # how often to resend metadata
  }

  /* If a cluster attribute is specified, then all gmond hosts are wrapped inside
   * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
   * NOT be wrapped inside of a <CLUSTER> tag. */
  cluster {
    name = "hadoop-cluster"       # cluster name; must match data_source in gmetad.conf
    owner = "ganglia"             # user that runs Ganglia
    latlong = "unspecified"
    url = "unspecified"
  }

  /* The host section describes attributes of the host, like the location */
  host {
    location = "unspecified"
  }

  /* Feel free to specify as many udp_send_channels as you like.  Gmond
     used to only support having a single channel */
  udp_send_channel {
    #mcast_join = 239.2.11.71    # multicast disabled; unicast is used instead
    host = master                # send to the machine running gmetad
    port = 8649                  # listening port
    ttl = 1
  }

  /* You can specify as many udp_recv_channels as you like as well. */
  udp_recv_channel {
    #mcast_join = 239.2.11.71    # multicast disabled
    port = 8649
    #bind = 239.2.11.71
  }

  /* You can specify as many tcp_accept_channels as you like to share
     an xml description of the state of the cluster */
  tcp_accept_channel {
    port = 8649
  }
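Since tcp_accept_channel publishes the node's state as XML on port 8649, a quick sanity check is to dump that XML and look for the cluster name. A minimal sketch, assuming netcat (nc) is installed:

  nc master 8649 | grep 'CLUSTER NAME'
  # expect something like: <CLUSTER NAME="hadoop-cluster" ...>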

gmetad.conf
On the master node, /etc/ganglia/gmetad.conf also needs to be configured; the name hadoop-cluster here must match the name in gmond.conf above.

  sudo vim /etc/ganglia/gmetad.conf

Change it to the following:

  data_source "hadoop-cluster" 10 master:8649 slave:8649
  setuid_username "nobody"
  rrd_rootdir "/var/lib/ganglia/rrds"
  gridname "hadoop-cluster"

Note: master:8649 and slave:8649 are the hosts and ports that gmetad polls, and the hadoop-cluster string in data_source must match name in gmond.conf.
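If the cluster has more than one slave, the extra gmond endpoints are simply appended to the same data_source line; as the comments in the full gmetad.conf below explain, the listed machines are redundant sources for the same cluster, and gmetad polls whichever one answers. A sketch with hypothetical hostnames slave1 and slave2:

  data_source "hadoop-cluster" 10 master:8649 slave1:8649 slave2:8649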


4. Hadoop configuration
On every Hadoop node, configure hadoop-metrics2.properties as follows:
  #   Licensed to the Apache Software Foundation (ASF) under one or more
  #   contributor license agreements.  See the NOTICE file distributed with
  #   this work for additional information regarding copyright ownership.
  #   The ASF licenses this file to You under the Apache License, Version 2.0
  #   (the "License"); you may not use this file except in compliance with
  #   the License.  You may obtain a copy of the License at
  #
  #       http://www.apache.org/licenses/LICENSE-2.0
  #
  #   Unless required by applicable law or agreed to in writing, software
  #   distributed under the License is distributed on an "AS IS" BASIS,
  #   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  #   See the License for the specific language governing permissions and
  #   limitations under the License.
  #

  # syntax: [prefix].[source|sink].[instance].[options]
  # See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

  # the original file-sink configuration is commented out below

  #*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
  # default sampling period, in seconds
  #*.period=10

  # The namenode-metrics.out will contain metrics from all context
  #namenode.sink.file.filename=namenode-metrics.out
  # Specifying a special sampling period for namenode:
  #namenode.sink.*.period=8

  #datanode.sink.file.filename=datanode-metrics.out

  # the following example split metrics of different
  # context to different sinks (in this case files)
  #jobtracker.sink.file_jvm.context=jvm
  #jobtracker.sink.file_jvm.filename=jobtracker-jvm-metrics.out
  #jobtracker.sink.file_mapred.context=mapred
  #jobtracker.sink.file_mapred.filename=jobtracker-mapred-metrics.out

  #tasktracker.sink.file.filename=tasktracker-metrics.out

  #maptask.sink.file.filename=maptask-metrics.out

  #reducetask.sink.file.filename=reducetask-metrics.out

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
  *.sink.ganglia.period=10

  *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
  *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

  namenode.sink.ganglia.servers=master:8649
  resourcemanager.sink.ganglia.servers=master:8649

  datanode.sink.ganglia.servers=master:8649
  nodemanager.sink.ganglia.servers=master:8649

  maptask.sink.ganglia.servers=master:8649
  reducetask.sink.ganglia.servers=master:8649
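Note that GangliaSink31 speaks the Ganglia 3.1 wire protocol, which matches the packages installed above. For a cluster still running Ganglia 3.0, Hadoop ships an older sink class that would be used instead; everything else stays the same:

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30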


5. HBase configuration
On every HBase node, configure hadoop-metrics2-hbase.properties as follows:
  # syntax: [prefix].[source|sink].[instance].[options]
  # See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

  #*.sink.file*.class=org.apache.hadoop.metrics2.sink.FileSink
  # default sampling period
  #*.period=10

  # Below are some examples of sinks that could be used
  # to monitor different hbase daemons.

  # hbase.sink.file-all.class=org.apache.hadoop.metrics2.sink.FileSink
  # hbase.sink.file-all.filename=all.metrics

  # hbase.sink.file0.class=org.apache.hadoop.metrics2.sink.FileSink
  # hbase.sink.file0.context=hmaster
  # hbase.sink.file0.filename=master.metrics

  # hbase.sink.file1.class=org.apache.hadoop.metrics2.sink.FileSink
  # hbase.sink.file1.context=thrift-one
  # hbase.sink.file1.filename=thrift-one.metrics

  # hbase.sink.file2.class=org.apache.hadoop.metrics2.sink.FileSink
  # hbase.sink.file2.context=thrift-two
  # hbase.sink.file2.filename=thrift-two.metrics

  # hbase.sink.file3.class=org.apache.hadoop.metrics2.sink.FileSink
  # hbase.sink.file3.context=rest
  # hbase.sink.file3.filename=rest.metrics

  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
  *.sink.ganglia.period=10

  hbase.sink.ganglia.period=10
  hbase.sink.ganglia.servers=master:8649
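After HBase is restarted with this file in place, its metrics should show up in the same gmond XML feed as the Hadoop ones. A rough check, assuming nc is installed and that the metric names carry the usual metrics2 "regionserver" prefix:

  nc master 8649 | grep -c regionserver
  # a non-zero count means region server metrics are arriving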


6. Start the Hadoop and HBase clusters

  start-dfs.sh
  start-yarn.sh
  start-hbase.sh
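Before checking Ganglia it is worth confirming that the daemons themselves came up; jps lists the running JVMs on each node:

  jps
  # on the master, expect NameNode, ResourceManager and HMaster (among others)
  # on a slave, expect DataNode, NodeManager and HRegionServer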


7. Start Ganglia

Restart Hadoop and HBase first so they pick up the metrics configuration. Then start the gmond service on every node; the master additionally needs the gmetad service.
Since Ganglia was installed via apt-get, both can be started with service:

  sudo service ganglia-monitor start    (on every node)

  sudo service gmetad start             (on the node with ganglia-webfrontend)
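To avoid logging into every machine by hand, the per-node start can be scripted over ssh. A minimal sketch, assuming passwordless ssh and that the monitored hosts are the master and slave names used in gmetad.conf:

  for h in master slave; do ssh "$h" sudo service ganglia-monitor restart; done
  sudo service gmetad restart    # on the master only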


8. Verification

Open http://master/ganglia in a browser. If "Hosts up" shows the number of machines in the cluster (9 in this setup), the installation succeeded.
If it does not work, a few debugging commands are helpful:
Run gmetad in the foreground with verbose debug output: gmetad -d 9
Dump the XML that gmond is serving: telnet master 8649 (gmetad's own aggregated XML is served on port 8651)
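Another worthwhile check is that gmetad is actually writing round-robin databases. With the rrd_rootdir configured above there should be one directory per cluster, containing one directory per host; the exact layout can vary slightly between Ganglia versions:

  ls /var/lib/ganglia/rrds/
  ls /var/lib/ganglia/rrds/hadoop-cluster/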


9. Screenshots

(Screenshots of the Ganglia web UI are omitted here.)

Complete gmetad.conf on the master node
  # This is an example of a Ganglia Meta Daemon configuration file
  #                http://ganglia.sourceforge.net/
  #
  #
  #-------------------------------------------------------------------------------
  # Setting the debug_level to 1 will keep daemon in the foreground and
  # show only error messages. Setting this value higher than 1 will make
  # gmetad output debugging information and stay in the foreground.
  # default: 0
  # debug_level 10
  #
  #-------------------------------------------------------------------------------
  # What to monitor. The most important section of this file.
  #
  # The data_source tag specifies either a cluster or a grid to
  # monitor. If we detect the source is a cluster, we will maintain a complete
  # set of RRD databases for it, which can be used to create historical
  # graphs of the metrics. If the source is a grid (it comes from another gmetad),
  # we will only maintain summary RRDs for it.
  #
  # Format:
  # data_source "my cluster" [polling interval] address1:port address2:port ...
  #
  # The keyword 'data_source' must immediately be followed by a unique
  # string which identifies the source, then an optional polling interval in
  # seconds. The source will be polled at this interval on average.
  # If the polling interval is omitted, 15sec is assumed.
  #
  # If you choose to set the polling interval to something other than the default,
  # note that the web frontend determines a host as down if its TN value is less
  # than 4 * TMAX (20sec by default).  Therefore, if you set the polling interval
  # to something around or greater than 80sec, this will cause the frontend to
  # incorrectly display hosts as down even though they are not.
  #
  # A list of machines which service the data source follows, in the
  # format ip:port, or name:port. If a port is not specified then 8649
  # (the default gmond port) is assumed.
  # default: There is no default value
  #
  # data_source "my cluster" 10 localhost  my.machine.edu:8649  1.2.3.5:8655
  # data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
  # data_source "another source" 1.3.4.7:8655  1.3.4.8

  data_source "hadoop-cluster" 10 master:8649 slave:8649
  setuid_username "nobody"
  rrd_rootdir "/var/lib/ganglia/rrds"
  gridname "hadoop-cluster"

  #
  # Round-Robin Archives
  # You can specify custom Round-Robin archives here (defaults are listed below)
  #
  # Old Default RRA: Keep 1 hour of metrics at 15 second resolution. 1 day at 6 minute
  # RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
  #      "RRA:AVERAGE:0.5:5760:374"
  # New Default RRA
  # Keep 5856 data points at 15 second resolution assuming 15 second (default) polling. That's 1 day
  # Two weeks of data points at 1 minute resolution (average)
  #RRAs "RRA:AVERAGE:0.5:1:5856" "RRA:AVERAGE:0.5:4:20160" "RRA:AVERAGE:0.5:40:52704"

  #
  #-------------------------------------------------------------------------------
  # Scalability mode. If on, we summarize over downstream grids, and respect
  # authority tags. If off, we take on 2.5.0-era behavior: we do not wrap our output
  # in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always assume
  # we are the "authority" on data source feeds. This approach does not scale to
  # large groups of clusters, but is provided for backwards compatibility.
  # default: on
  # scalable off
  #
  #-------------------------------------------------------------------------------
  # The name of this Grid. All the data sources above will be wrapped in a GRID
  # tag with this name.
  # default: unspecified
  # gridname "MyGrid"
  #
  #-------------------------------------------------------------------------------
  # The authority URL for this grid. Used by other gmetads to locate graphs
  # for our data sources. Generally points to a ganglia/
  # website on this machine.
  # default: "http://hostname/ganglia/",
  #   where hostname is the name of this machine, as defined by gethostname().
  # authority "http://mycluster.org/newprefix/"
  #
  #-------------------------------------------------------------------------------
  # List of machines this gmetad will share XML with. Localhost
  # is always trusted.
  # default: There is no default value
  # trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
  #
  #-------------------------------------------------------------------------------
  # If you want any host which connects to the gmetad XML to receive
  # data, then set this value to "on"
  # default: off
  # all_trusted on
  #
  #-------------------------------------------------------------------------------
  # If you don't want gmetad to setuid then set this to off
  # default: on
  # setuid off
  #
  #-------------------------------------------------------------------------------
  # User gmetad will setuid to (defaults to "nobody")
  # default: "nobody"
  # setuid_username "nobody"
  #
  #-------------------------------------------------------------------------------
  # Umask to apply to created rrd files and grid directory structure
  # default: 0 (files are public)
  # umask 022
  #
  #-------------------------------------------------------------------------------
  # The port gmetad will answer requests for XML
  # default: 8651
  # xml_port 8651
  #
  #-------------------------------------------------------------------------------
  # The port gmetad will answer queries for XML. This facility allows
  # simple subtree and summation views of the XML tree.
  # default: 8652
  # interactive_port 8652
  #
  #-------------------------------------------------------------------------------
  # The number of threads answering XML requests
  # default: 4
  # server_threads 10
  #
  #-------------------------------------------------------------------------------
  # Where gmetad stores its round-robin databases
  # default: "/var/lib/ganglia/rrds"
  # rrd_rootdir "/some/other/place"
  #
  #-------------------------------------------------------------------------------
  # In earlier versions of gmetad, hostnames were handled in a case
  # sensitive manner
  # If your hostname directories have been renamed to lower case,
  # set this option to 0 to disable backward compatibility.
  # From version 3.2, backwards compatibility will be disabled by default.
  # default: 1   (for gmetad < 3.2)
  # default: 0   (for gmetad >= 3.2)
  case_sensitive_hostnames 0

  #-------------------------------------------------------------------------------
  # It is now possible to export all the metrics collected by gmetad directly to
  # graphite by setting the following attributes.
  #
  # The hostname or IP address of the Graphite server
  # default: unspecified
  # carbon_server "my.graphite.box"
  #
  # The port on which Graphite is listening
  # default: 2003
  # carbon_port 2003
  #
  # A prefix to prepend to the metric names exported by gmetad. Graphite uses dot-
  # separated paths to organize and refer to metrics.
  # default: unspecified
  # graphite_prefix "datacenter1.gmetad"
  #
  # Number of milliseconds gmetad will wait for a response from the graphite server
  # default: 500
  # carbon_timeout 500
  #

Complete gmond.conf on the master node
  /* This configuration is as close to 2.5.x default behavior as possible
     The values closely match ./gmond/metric.h definitions in 2.5.x */
  globals {
    daemonize = yes
    setuid = yes
    user = ganglia
    debug_level = 0
    max_udp_msg_len = 1472
    mute = no
    deaf = no
    host_dmax = 0 /*secs */
    cleanup_threshold = 300 /*secs */
    gexec = no
    send_metadata_interval = 10
  }

  /* If a cluster attribute is specified, then all gmond hosts are wrapped inside
   * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
   * NOT be wrapped inside of a <CLUSTER> tag. */
  cluster {
    name = "hadoop-cluster"
    owner = "ganglia"
    latlong = "unspecified"
    url = "unspecified"
  }

  /* The host section describes attributes of the host, like the location */
  host {
    location = "unspecified"
  }

  /* Feel free to specify as many udp_send_channels as you like.  Gmond
     used to only support having a single channel */
  udp_send_channel {
    #mcast_join = 239.2.11.71
    host = master
    port = 8649
    ttl = 1
  }

  /* You can specify as many udp_recv_channels as you like as well. */
  udp_recv_channel {
    #mcast_join = 239.2.11.71
    port = 8649
    #bind = 239.2.11.71
  }

  /* You can specify as many tcp_accept_channels as you like to share
     an xml description of the state of the cluster */
  tcp_accept_channel {
    port = 8649
  }

  /* Each metrics module that is referenced by gmond must be specified and
     loaded. If the module has been statically linked with gmond, it does not
     require a load path. However all dynamically loadable modules must include
     a load path. */
  modules {
    module {
      name = "core_metrics"
    }
    module {
      name = "cpu_module"
      path = "/usr/lib/ganglia/modcpu.so"
    }
    module {
      name = "disk_module"
      path = "/usr/lib/ganglia/moddisk.so"
    }
    module {
      name = "load_module"
      path = "/usr/lib/ganglia/modload.so"
    }
    module {
      name = "mem_module"
      path = "/usr/lib/ganglia/modmem.so"
    }
    module {
      name = "net_module"
      path = "/usr/lib/ganglia/modnet.so"
    }
    module {
      name = "proc_module"
      path = "/usr/lib/ganglia/modproc.so"
    }
    module {
      name = "sys_module"
      path = "/usr/lib/ganglia/modsys.so"
    }
  }

  include ('/etc/ganglia/conf.d/*.conf')

  /* The old internal 2.5.x metric array has been replaced by the following
     collection_group directives.  What follows is the default behavior for
     collecting and sending metrics that is as close to 2.5.x behavior as
     possible. */

  /* This collection group will cause a heartbeat (or beacon) to be sent every
     20 seconds.  In the heartbeat is the GMOND_STARTED data which expresses
     the age of the running gmond. */
  collection_group {
    collect_once = yes
    time_threshold = 20
    metric {
      name = "heartbeat"
    }
  }

  /* This collection group will send general info about this host every 1200 secs.
     This information doesn't change between reboots and is only collected once. */
  collection_group {
    collect_once = yes
    time_threshold = 1200
    metric {
      name = "cpu_num"
      title = "CPU Count"
    }
    metric {
      name = "cpu_speed"
      title = "CPU Speed"
    }
    metric {
      name = "mem_total"
      title = "Memory Total"
    }
    /* Should this be here? Swap can be added/removed between reboots. */
    metric {
      name = "swap_total"
      title = "Swap Space Total"
    }
    metric {
      name = "boottime"
      title = "Last Boot Time"
    }
    metric {
      name = "machine_type"
      title = "Machine Type"
    }
    metric {
      name = "os_name"
      title = "Operating System"
    }
    metric {
      name = "os_release"
      title = "Operating System Release"
    }
    metric {
      name = "location"
      title = "Location"
    }
  }

  /* This collection group will send the status of gexecd for this host every 300 secs */
  /* Unlike 2.5.x the default behavior is to report gexecd OFF.  */
  collection_group {
    collect_once = yes
    time_threshold = 300
    metric {
      name = "gexec"
      title = "Gexec Status"
    }
  }

  /* This collection group will collect the CPU status info every 20 secs.
     The time threshold is set to 90 seconds.  In honesty, this time_threshold could be
     set significantly higher to reduce unnecessary network chatter. */
  collection_group {
    collect_every = 20
    time_threshold = 90
    /* CPU status */
    metric {
      name = "cpu_user"
      value_threshold = "1.0"
      title = "CPU User"
    }
    metric {
      name = "cpu_system"
      value_threshold = "1.0"
      title = "CPU System"
    }
    metric {
      name = "cpu_idle"
      value_threshold = "5.0"
      title = "CPU Idle"
    }
    metric {
      name = "cpu_nice"
      value_threshold = "1.0"
      title = "CPU Nice"
    }
    metric {
      name = "cpu_aidle"
      value_threshold = "5.0"
      title = "CPU aidle"
    }
    metric {
      name = "cpu_wio"
      value_threshold = "1.0"
      title = "CPU wio"
    }
    /* The next two metrics are optional if you want more detail...
       ... since they are accounted for in cpu_system.
    metric {
      name = "cpu_intr"
      value_threshold = "1.0"
      title = "CPU intr"
    }
    metric {
      name = "cpu_sintr"
      value_threshold = "1.0"
      title = "CPU sintr"
    }
    */
  }

  collection_group {
    collect_every = 20
    time_threshold = 90
    /* Load Averages */
    metric {
      name = "load_one"
      value_threshold = "1.0"
      title = "One Minute Load Average"
    }
    metric {
      name = "load_five"
      value_threshold = "1.0"
      title = "Five Minute Load Average"
    }
    metric {
      name = "load_fifteen"
      value_threshold = "1.0"
      title = "Fifteen Minute Load Average"
    }
  }

  /* This group collects the number of running and total processes */
  collection_group {
    collect_every = 80
    time_threshold = 950
    metric {
      name = "proc_run"
      value_threshold = "1.0"
      title = "Total Running Processes"
    }
    metric {
      name = "proc_total"
      value_threshold = "1.0"
      title = "Total Processes"
    }
  }

  /* This collection group grabs the volatile memory metrics every 40 secs and
     sends them at least every 180 secs.  This time_threshold can be increased
     significantly to reduce unneeded network traffic. */
  collection_group {
    collect_every = 40
    time_threshold = 180
    metric {
      name = "mem_free"
      value_threshold = "1024.0"
      title = "Free Memory"
    }
    metric {
      name = "mem_shared"
      value_threshold = "1024.0"
      title = "Shared Memory"
    }
    metric {
      name = "mem_buffers"
      value_threshold = "1024.0"
      title = "Memory Buffers"
    }
    metric {
      name = "mem_cached"
      value_threshold = "1024.0"
      title = "Cached Memory"
    }
    metric {
      name = "swap_free"
      value_threshold = "1024.0"
      title = "Free Swap Space"
    }
  }

  collection_group {
    collect_every = 40
    time_threshold = 300
    metric {
      name = "bytes_out"
      value_threshold = 4096
      title = "Bytes Sent"
    }
    metric {
      name = "bytes_in"
      value_threshold = 4096
      title = "Bytes Received"
    }
    metric {
      name = "pkts_in"
      value_threshold = 256
      title = "Packets Received"
    }
    metric {
      name = "pkts_out"
      value_threshold = 256
      title = "Packets Sent"
    }
  }

  /* Different than 2.5.x default since the old config made no sense */
  collection_group {
    collect_every = 1800
    time_threshold = 3600
    metric {
      name = "disk_total"
      value_threshold = 1.0
      title = "Total Disk Space"
    }
  }

  collection_group {
    collect_every = 40
    time_threshold = 180
    metric {
      name = "disk_free"
      value_threshold = 1.0
      title = "Disk Space Available"
    }
    metric {
      name = "part_max_used"
      value_threshold = 1.0
      title = "Maximum Disk Space Used"
    }
  }
The remaining attachments are duplicates: hadoop-metrics2-hbase.properties and hadoop-metrics2.properties on the master node are the files already shown in sections 5 and 4, and the slave-node gmond.conf, hadoop-metrics2-hbase.properties, and hadoop-metrics2.properties are identical to their master-node counterparts, so they are not repeated here.