10 MapReduce Tips
This piece is based on the talk "Practical MapReduce" that I gave at Hadoop User Group UK on April 14.

1. Use an appropriate MapReduce language

There are many languages and frameworks that sit on top of MapReduce, so it's worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses. While there are no hard and fast rules, in general, we recommend using pure Java for large, recurring jobs, Hive for SQL-style analysis and data warehousing, and Pig or Streaming for ad-hoc analysis.

2. Consider your input data "chunk" size

Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs? Answers to these questions determine how you store and process data using HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.

3. Use SequenceFile and MapFile containers

SequenceFiles are a very useful tool. A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key. However, both are Java-centric, so you can't read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro's DataFileWriter, although there is no MapReduce integration yet.)

4. Implement the Tool interface

If you are writing a Java driver, then consider implementing the Tool interface to get the following options for free:

- -D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)
- -files to put files into the distributed cache
- -archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
- -libjars to put JAR files on the task classpath

```java
public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), MyJob.class);
    // run job ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}
```

By taking this step you also make your driver more testable, since you can inject arbitrary configurations using Configured's setConf() method.

5. Chain your jobs

It's often natural to split a problem into multiple MapReduce jobs. The benefits are a better decomposition of the problem into smaller, more easily understood (and more easily tested) steps. It can also boost re-usability. Also, by using the Fair Scheduler, you can run a small job promptly, and not worry that it will be stuck in a long queue of (other people's) jobs.

Pig and Hive do this kind of thing all the time, and it can be instructive to understand what they are doing behind the scenes by using EXPLAIN, or even by reading their source code, to make you a better MapReduce programmer. Of course, you could always use Pig or Hive in the first place… ChainMapper and ChainReducer (in 0.20.0) are worth checking out too, as they allow you to use smaller units within one job, effectively allowing multiple mappers before and after the (single) reducer: M+RM*.

6. Favor multiple partitions

We're used to thinking that the output data is contained in one file. This is OK for small datasets, but if the output is large (more than a few tens of gigabytes, say) then it's normally better to have a partitioned file, so you take advantage of the cluster parallelism for the reducer tasks. Conceptually, you should think of your output/part-* files as a single "file": the fact it is broken up is an implementation detail. Often, the output forms the input to another MapReduce job, so it is naturally processed as a partitioned output by specifying output as the input path to the second job.

In some cases the partitioning can be exploited. CompositeInputFormat, for example, uses the partitioning to do joins efficiently on the map side. Another example: if your output is a MapFile, you can use MapFileOutputFormat's getReaders() method to do look-ups on the partitioned output.

For small outputs you can merge the partitions into a single file, either by setting the number of reducers to 1 (the default), or by using the handy -getmerge option on the filesystem shell:

% hadoop fs -getmerge hdfs-output-dir local-file

This concatenates the HDFS files hdfs-output-dir/part-* into a single local file.

7. Report progress

If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don't encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don't process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:

- Call setStatus() on Reporter to set a human-readable description of the task's progress
- Call incrCounter() on Reporter to increment a user counter
- Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)

8. Debug with status and counters

Using Reporter's setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred. Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file). You can do more advanced debugging with debug scripts.

9. Tune at the job level before the task level

Before you start profiling tasks there are a number of job-level checks to run through:

- Have you enabled intermediate compression (via JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output)?
- If you are using your own Writable objects, have you provided a fast RawComparator?

10. Let someone else do the cluster administration

Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.

Do you have a MapReduce tip to share? Please let us know in the comments.
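The testability payoff of the Tool/Configured pattern is easy to see in isolation. The sketch below uses hypothetical stand-in classes for Hadoop's Configuration, Tool, and Configured (the real ones live in org.apache.hadoop.conf and org.apache.hadoop.util, and the real Tool.run throws Exception): because the driver reads everything through getConf(), a test can inject whatever properties it likes via setConf().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for Hadoop's Configuration, Tool and Configured,
// just to show the configuration-injection pattern in plain Java.
class ToolExample {

    static class Configuration {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key, String defaultValue) { return props.getOrDefault(key, defaultValue); }
    }

    interface Tool {
        int run(String[] args);   // the real Tool.run throws Exception
    }

    static class Configured {
        private Configuration conf;
        void setConf(Configuration conf) { this.conf = conf; }
        Configuration getConf() { return conf; }
    }

    // A driver in the MyJob style: it reads everything from the injected Configuration.
    static class MyJob extends Configured implements Tool {
        public int run(String[] args) {
            // A real driver would build a JobConf from getConf(); here we just
            // report how many reducers the injected configuration asks for.
            return Integer.parseInt(getConf().get("mapred.reduce.tasks", "1"));
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapred.reduce.tasks", "7");  // what "-D mapred.reduce.tasks=7" would set
        MyJob job = new MyJob();
        job.setConf(conf);                     // a unit test can inject any configuration
        System.out.println(job.run(args));     // prints 7
    }
}
```

A unit test can construct the driver, call setConf() with a synthetic configuration, and assert on run()'s behavior without touching a cluster.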
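The M+RM* shape is easier to picture with a toy model. The sketch below is not Hadoop code at all: it models two chained "mappers" and a single "reducer" as plain Java functions over a word list, purely to illustrate the composition that ChainMapper/ChainReducer make possible within one job.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of the M+RM* shape: two chained "mappers" and a single "reducer"
// expressed as plain functions over a list of words (not real Hadoop code).
class ChainShape {

    static Map<String, Integer> run(List<String> words) {
        Function<String, String> mapper1 = String::toLowerCase;           // first M
        Function<String, String> mapper2 = w -> w.replaceAll("\\W", "");  // second M
        Map<String, Integer> counts = new HashMap<>();                    // the single R
        for (String word : words) {
            counts.merge(mapper2.apply(mapper1.apply(word)), 1, Integer::sum);
        }
        return counts;  // a trailing mapper (the M* part) could post-process each entry
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Foo", "foo!", "bar")).get("foo")); // prints 2
    }
}
```

The benefit of doing this inside one job, rather than as three jobs, is that the intermediate records never hit HDFS.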
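The getReaders() look-up trick works because you can compute which partition a key landed in. Here is a self-contained sketch of that idea: partitionFor mirrors the placement rule of Hadoop's default HashPartitioner (non-negative hash modulo the number of reducers), while the List of Maps is a stand-in for the part-* files a real job would open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a look-up against partitioned output. partitionFor mirrors the
// placement rule of Hadoop's default HashPartitioner; the List of Maps stands
// in for the part-* files that MapFileOutputFormat.getReaders() would open.
class PartitionLookup {

    // Same rule as HashPartitioner: non-negative hash modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Only the one partition that can contain the key needs to be consulted.
    static String lookup(List<Map<String, String>> partitions, String key) {
        return partitions.get(partitionFor(key, partitions.size())).get(key);
    }

    public static void main(String[] args) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < 3; i++) parts.add(new HashMap<>());
        // Writer side: each record goes to the partition its key hashes to.
        parts.get(partitionFor("user42", 3)).put("user42", "somewhere");
        // Reader side: a single partition is read back.
        System.out.println(lookup(parts, "user42")); // prints somewhere
    }
}
```

Note this only works when the reader uses the same partitioner (and partition count) that the writing job used.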
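For the simulation case, the pattern is simply a periodic call inside the hot loop. The sketch below uses a hypothetical stand-in for Hadoop's Reporter interface (only the two methods used here), with the real computation replaced by dummy arithmetic:

```java
// Sketch of the report-progress pattern: CPU-bound work that pings a reporter
// every `interval` iterations. Reporter here is a hypothetical stand-in for
// org.apache.hadoop.mapred.Reporter, exposing just the two methods used.
class ProgressExample {

    interface Reporter {
        void progress();
        void setStatus(String status);
    }

    // A long-running "simulation" that reports progress on a regular basis.
    static long simulate(long iterations, long interval, Reporter reporter) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += (i * i) % 97;  // stand-in for real CPU-intensive work
            if (i % interval == 0) {
                reporter.progress();  // tells the framework the task is alive
                reporter.setStatus("iteration " + i + " of " + iterations);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] pings = {0};
        simulate(1000, 100, new Reporter() {
            public void progress() { pings[0]++; }
            public void setStatus(String status) { }
        });
        System.out.println(pings[0]); // prints 10
    }
}
```

Choose the interval so that a ping happens far more often than the timeout: reporting every few seconds of wall-clock work costs essentially nothing.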
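Central aggregation is what makes counters handy, and the idea can be shown with a stand-in. In a real job you would call Reporter's incrCounter() and read the totals off the job's web UI; the hypothetical class below just models "many tasks increment, one place holds the total":

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in showing why counters beat stderr for debugging: increments from
// many tasks end up aggregated in one place. A real job would use Reporter's
// incrCounter() and read the totals from the job's web UI.
class CounterExample {

    private final Map<String, Long> counters = new HashMap<>();

    void incrCounter(String name, long amount) {
        counters.merge(name, amount, Long::sum);
    }

    long getCounter(String name) {
        return counters.getOrDefault(name, 0L);
    }

    public static void main(String[] args) {
        CounterExample counters = new CounterExample();
        // Three map tasks each hit a suspicious condition a few times:
        counters.incrCounter("BAD_RECORDS", 2);
        counters.incrCounter("BAD_RECORDS", 1);
        counters.incrCounter("BAD_RECORDS", 4);
        System.out.println(counters.getCounter("BAD_RECORDS")); // prints 7
    }
}
```

With stderr you would have to grep the logs of every task attempt to get the same number; the counter gives it to you in one place.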