Tuning the Cluster for MapReduce v2 (YARN)
This topic applies to YARN clusters only, and describes how to tune and optimize YARN for your cluster. It introduces the following terms:
- ResourceManager: A master daemon that authorizes submitted jobs to run, assigns an ApplicationMaster to them, and enforces resource limits.
- NodeManager: A worker daemon that launches ApplicationMaster and task containers.
- ApplicationMaster: A supervisory task that requests the resources needed for executor tasks. An ApplicationMaster runs on a different NodeManager for each application. The ApplicationMaster requests containers, which are sized by the resources a task requires to run.
- vcore: Virtual CPU core; a logical unit of processing power. In a basic case, it is equivalent to a physical CPU core or hyperthreaded virtual CPU core.
- Container: A resource bucket and process space for a task. A container’s resources consist of vcores and memory.
Identifying Hardware Resources and Service Demand
Begin by taking inventory of the services that compete for resources on each worker node. In addition to the HDFS DataNode and YARN NodeManager roles, a worker node may also run:
- Impalad
- HBase RegionServer
- Solr
Worker nodes also run system support services, possibly third-party monitoring or asset-management agents, and the Linux operating system itself.
Estimating and Configuring Resource Requirements
Reserve resources on each worker node for the services that run outside YARN. Typical reservations include:
- 10-20% of RAM for Linux and its daemon services
- At least 16 GB RAM for an Impalad process
- No more than 12-16 GB RAM for an HBase RegionServer process
In addition, you must allow resources for task buffers, such as the HDFS Sort I/O buffer. For vcore demand, consider the number of concurrent processes or tasks each service runs as an initial guide. For the operating system, start with a count of two.
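As a worked illustration of these reservations, the budget left over for YARN can be tallied with simple arithmetic. All machine sizes and per-service numbers below are assumptions for the sake of example, not Cloudera recommendations:

```shell
#!/bin/sh
# Hypothetical 128 GB / 32-vcore worker node running Impalad and an HBase RegionServer.
TOTAL_RAM_GB=128
OS_RAM_GB=$((TOTAL_RAM_GB * 15 / 100))   # ~15% reserved for Linux and daemon services
IMPALA_RAM_GB=16                         # at least 16 GB for Impalad
HBASE_RAM_GB=16                          # no more than 12-16 GB for a RegionServer
YARN_RAM_GB=$((TOTAL_RAM_GB - OS_RAM_GB - IMPALA_RAM_GB - HBASE_RAM_GB))

TOTAL_VCORES=32
RESERVED_VCORES=$((2 + 4 + 4))           # OS (start with 2) + Impalad + RegionServer (illustrative)
YARN_VCORES=$((TOTAL_VCORES - RESERVED_VCORES))

echo "left for YARN containers: ${YARN_RAM_GB} GB RAM, ${YARN_VCORES} vcores"
```

The same tally, with your own measured figures substituted in, gives the starting values for the NodeManager resource settings discussed next.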
You can now configure YARN to use the remaining resources for its supervisory processes and task containers. Start with the NodeManager settings that define what each node offers to containers: yarn.nodemanager.resource.memory-mb (RAM available to containers) and yarn.nodemanager.resource.cpu-vcores (vcores available to containers). For the vcore count, start with either:
- (total vcores) – (number of vcores reserved for non-YARN use), or
- 2 x (number of physical disks used for DataNode storage)
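Expressed in yarn-site.xml, the two NodeManager settings might look like the following. The property names are standard YARN; the values are illustrative only:

```xml
<!-- yarn-site.xml: resources this NodeManager offers to containers (example values) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>77824</value>   <!-- e.g. RAM left after OS/Impalad/HBase reservations -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>24</value>      <!-- e.g. from one of the two vcore formulas above -->
</property>
```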
Sizing the ResourceManager
If a NodeManager has 50 GB or more RAM available for containers, consider increasing the minimum allocation to 2 GB. The default memory increment is 512 MB. For minimum memory of 1 GB, a container that requires 1.2 GB receives 1.5 GB. You can set maximum memory allocation equal to yarn.nodemanager.resource.memory-mb.
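The rounding behavior described above can be checked with a few lines of arithmetic. This is a sketch of the scheduler's described behavior, not a call into YARN:

```shell
#!/bin/sh
# allocation = max(minimum, ceil(request / increment) * increment)
MIN_MB=1024        # minimum memory allocation (1 GB)
INC_MB=512         # default memory increment
REQUEST_MB=1200    # a container requesting 1.2 GB

# Integer ceiling division, then round up to the next increment.
ALLOC_MB=$(( ((REQUEST_MB + INC_MB - 1) / INC_MB) * INC_MB ))
[ "$ALLOC_MB" -lt "$MIN_MB" ] && ALLOC_MB=$MIN_MB

echo "allocated: ${ALLOC_MB} MB"   # the 1.2 GB request receives 1.5 GB
```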
The default minimum and increment value for vcores is 1. Because application tasks are not commonly multithreaded, you generally do not need to change this value. The maximum value is usually equal to yarn.nodemanager.resource.cpu-vcores. Reduce this value to limit the number of containers running concurrently on one node.
The example worker node leaves more than 50 GB of RAM available for containers, which comfortably accommodates a 2 GB minimum allocation.
Configuring YARN Settings
The mapreduce.[map|reduce].java.opts.max.heap settings specify the default heap sizes allotted to mapper and reducer tasks, respectively. The mapreduce.[map|reduce].memory.mb settings specify the memory allotted to their containers, and the assigned values should allow overhead beyond the task heap size: Cloudera recommends sizing the container at 1.2 times the corresponding java.opts.max.heap value, though the optimal ratio depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default and can be increased if your jobs contain many concurrent tasks. Use these guidelines to size the tasks on the example worker node.
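Under the 1.2 overhead factor, heap and container sizes relate as sketched below. The concrete numbers are illustrative, not recommendations:

```shell
#!/bin/sh
# container = heap * 1.2, so heap = container / 1.2 (integer arithmetic, so *10/12).
MAP_CONTAINER_MB=2048                           # mapreduce.map.memory.mb
MAP_HEAP_MB=$((MAP_CONTAINER_MB * 10 / 12))     # mapreduce.map.java.opts.max.heap
REDUCE_CONTAINER_MB=$((2 * MAP_CONTAINER_MB))   # reducer container: twice the mapper value

echo "map heap: ${MAP_HEAP_MB} MB, reduce container: ${REDUCE_CONTAINER_MB} MB"
```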
Defining Containers
With YARN worker resources configured, you can determine how many containers best support a MapReduce application, based on job type and system resources. For example, a CPU-bound workload such as a Monte Carlo simulation requires very little data but complex, iterative processing, so its ratio of concurrent containers per spindle is likely higher than for an ETL workload, which tends to be I/O-bound. For applications that use a lot of memory in the map or reduce phase, the number of containers that can be scheduled is limited by the RAM available on the node and the RAM each task requires. Other applications may be limited by the vcores not in use by other YARN applications, or by the rules of dynamic resource pools (if used).
To calculate the number of containers for mappers and reducers, work from the actual system constraints: the memory and vcores available for containers on each node, and the number of disk spindles scaled by a workload factor. The workload factor can be set to 2.0 for most workloads; consider a higher setting for CPU-bound workloads. These starting points do not account for other factors that can affect performance, including:
- Configured rack awareness
- Skewed or imbalanced data
- Network throughput
- Co-tenancy demand (other services or applications using the cluster)
- Dynamic resource pooling
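As a rough sketch under stated assumptions, the constraint-based starting point can be computed as the smallest of the memory, vcore, and spindle-based bounds. The formulas and every figure below are illustrative assumptions, not Cloudera's published values:

```shell
#!/bin/sh
# Assumed per-node figures -- substitute your own cluster values.
NODE_RAM_MB=61440        # yarn.nodemanager.resource.memory-mb
NODE_VCORES=24           # yarn.nodemanager.resource.cpu-vcores
SPINDLES=12              # physical disks used for DataNode storage
WORKLOAD_FACTOR=2        # 2.0 for most workloads; higher if CPU-bound
MAP_MEMORY_MB=2048       # mapreduce.map.memory.mb
MAP_VCORES=1             # vcores per map container

BY_RAM=$((NODE_RAM_MB / MAP_MEMORY_MB))      # memory-bound container count
BY_CPU=$((NODE_VCORES / MAP_VCORES))         # vcore-bound container count
BY_DISK=$((SPINDLES * WORKLOAD_FACTOR))      # spindle-bound container count

# The node supports the smallest of the three bounds.
MAPPERS=$BY_RAM
[ "$BY_CPU" -lt "$MAPPERS" ] && MAPPERS=$BY_CPU
[ "$BY_DISK" -lt "$MAPPERS" ] && MAPPERS=$BY_DISK

echo "max concurrent map containers per node: $MAPPERS"
```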
You may also have to maximize or minimize cluster utilization for your workload or to meet Service Level Agreements (SLAs). To find the best resource configuration for an application, try various container and gateway/client settings and record the results.
For example, the following TeraGen/TeraSort script supports throughput testing with a 10-GB data load and a loop of varying YARN container and gateway/client settings. You can observe which configuration yields the best results.
#!/bin/sh
HADOOP_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
for i in 2 4 8 16 32 64        # Number of mapper containers to test
do
  for j in 2 4 8 16 32 64      # Number of reducer containers to test
  do
    for k in 1024 2048         # Container memory for mappers/reducers to test
    do
      MAP_MB=`echo "($k*0.8)/1" | bc`   # JVM heap size for mappers
      RED_MB=`echo "($k*0.8)/1" | bc`   # JVM heap size for reducers
      hadoop jar $HADOOP_PATH/hadoop-examples.jar teragen \
        -Dmapreduce.job.maps=$i -Dmapreduce.map.memory.mb=$k \
        -Dmapreduce.map.java.opts.max.heap=$MAP_MB 100000000 \
        /results/tg-10GB-${i}-${j}-${k} \
        1>tera_${i}_${j}_${k}.out 2>tera_${i}_${j}_${k}.err
      hadoop jar $HADOOP_PATH/hadoop-examples.jar terasort \
        -Dmapreduce.job.maps=$i -Dmapreduce.job.reduces=$j -Dmapreduce.map.memory.mb=$k \
        -Dmapreduce.map.java.opts.max.heap=$MAP_MB -Dmapreduce.reduce.memory.mb=$k \
        -Dmapreduce.reduce.java.opts.max.heap=$RED_MB /results/ts-10GB-${i}-${j}-${k} \
        1>>tera_${i}_${j}_${k}.out 2>>tera_${i}_${j}_${k}.err
      hadoop fs -rmr -skipTrash /results/tg-10GB-${i}-${j}-${k}
      hadoop fs -rmr -skipTrash /results/ts-10GB-${i}-${j}-${k}
    done
  done
done
http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_yarn_tuning.html