Hadoop Example: RandomWriter
Source: Internet  Editor: 程序博客网  Time: 2024/06/08 09:55
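Reference: http://www.hadooper.cn/dct/page/65778
1. Overview
The RandomWriter example uses Map/Reduce to write random data to DFS. Each map takes a single file name as input and then writes random BytesWritable keys and values to a DFS sequence file. The maps emit nothing to a reduce phase (the job sets the number of reduce tasks to 0), so no reduce runs. The generated data is configurable through the following variables: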
test.randomwriter.maps_per_host
test.randomwrite.bytes_per_map
test.randomwrite.min_key
test.randomwrite.max_key
test.randomwrite.min_value
test.randomwrite.max_value
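To make the four min/max variables concrete, here is a minimal sketch of how the listing in section 2 turns a min/max pair into a record length. The class and method names are illustrative and not part of Hadoop; only the arithmetic is taken from the listing. Note that the upper bound is exclusive, and that setting min equal to max gives fixed-size records.

```java
import java.util.Random;

// Sketch only: RecordLengthSketch and drawLength are hypothetical names.
final class RecordLengthSketch {

    // Same arithmetic as the listing: min + nextInt(max - min),
    // or exactly min when the range is 0 (min == max).
    static int drawLength(Random random, int min, int max) {
        int range = max - min;
        return min + (range != 0 ? random.nextInt(range) : 0);
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Key defaults used in the listing: min_key 10, max_key 1000 -> length in [10, 999].
        System.out.println("key length:   " + drawLength(random, 10, 1000));
        // Fixed-size values, e.g. min_value = max_value = 90 -> always 90.
        System.out.println("value length: " + drawLength(random, 90, 90));
    }
}
```

With min_key = max_key = 10 and min_value = max_value = 90, every record is exactly 100 bytes, which is the terasort-style setup mentioned in the class comment of the listing.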
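test.randomwriter.maps_per_host is the number of map tasks run on each worker node. (The listing multiplies it by cluster.getTaskTrackers(), so "node" here means a TaskTracker rather than a DataNode.) With the default settings and a single worker node there are 10 maps, each producing 1 GB, so 10 GB is written to HDFS. My test environment has only 2 worker nodes, and I wanted each worker node to run just 1 map task.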
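The map count follows from simple arithmetic in run(). The sketch below reproduces that arithmetic in plain Java so it can be checked without a cluster. The class and method names are illustrative; the formulas are the ones in the listing in section 2.

```java
// Sketch only: MapCountSketch and its methods are hypothetical names.
final class MapCountSketch {

    static final long GIB = 1024L * 1024L * 1024L;

    // Default total when test.randomwrite.total_bytes is not set:
    // maps per host * bytes per map * number of task trackers.
    static long defaultTotalBytes(int mapsPerHost, long bytesPerMap, int taskTrackers) {
        return (long) mapsPerHost * bytesPerMap * taskTrackers;
    }

    // numMaps = totalBytes / bytesPerMap, bumped to 1 when the total is
    // positive but smaller than one map's share.
    static int numMaps(long totalBytes, long bytesPerMap) {
        if (bytesPerMap == 0) {
            throw new IllegalArgumentException("bytes per map must not be 0");
        }
        int n = (int) (totalBytes / bytesPerMap);
        return (n == 0 && totalBytes > 0) ? 1 : n;
    }

    public static void main(String[] args) {
        // Defaults described above: 10 maps per host, 1 GB per map, one worker node.
        System.out.println(numMaps(defaultTotalBytes(10, GIB, 1), GIB)); // prints 10
        // The setup used in this post: 1 map per host, two worker nodes.
        System.out.println(numMaps(defaultTotalBytes(1, GIB, 2), GIB));  // prints 2
    }
}
```

The second case matches the "Running 2 maps." line in the job output.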
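I originally assumed test.randomwrite.bytes_per_map was the size of the test file written by the job, with a default of 1 GB = 1*1024*1024*1024. But after I changed the value to 1*1024*1024, the output test files were still 1 GB each, which puzzled me. (PS, 2011-11-2: today I learned that this parameter is the amount of data each map task produces; setting it to 1*1024*1024 should mean that each map task produces 1 MB.) (PS, 2011-11-3: changing test.randomwrite.bytes_per_map does not change the amount of data each map task produces. It is still 1 GB no matter what value I set. Changing test.randomwriter.maps_per_host does take effect, however: the job ran correctly with it set to 1 and to 2. Open question: where does test.randomwrite.bytes_per_map have to be changed so that the amount of data produced by each map task really changes?)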
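The following is not a diagnosis of the behaviour described above, only one thing that may be worth ruling out. A getLong(name, defaultValue)-style lookup uses the default only when the key is absent, so if some configuration source already sets test.randomwrite.bytes_per_map, editing the default in the source code changes nothing. The sketch uses java.util.Properties as a stand-in for the Hadoop configuration object; the class name and the helper are hypothetical.

```java
import java.util.Properties;

// Sketch only: DefaultLookupSketch is a hypothetical name, and Properties
// stands in for the job configuration.
final class DefaultLookupSketch {

    // Returns the configured value if the key is present, else the default.
    static long getLong(Properties conf, String name, long defaultValue) {
        String raw = conf.getProperty(name);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        String key = "test.randomwrite.bytes_per_map";
        long codeDefault = 1L * 1024 * 1024; // the 1 MB default edited into the source

        // Key absent: the default from the code is used.
        Properties empty = new Properties();
        System.out.println(getLong(empty, key, codeDefault));  // prints 1048576

        // Key already set elsewhere: the default from the code is ignored.
        Properties preset = new Properties();
        preset.setProperty(key, String.valueOf(1024L * 1024 * 1024));
        System.out.println(getLong(preset, key, codeDefault)); // prints 1073741824
    }
}
```

The listing reads the key in two places, Map.configure() and run(), so both lookups are subject to the same rule.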
2. Code example
Here test.randomwrite.bytes_per_map=1*1024*1024 and test.randomwriter.maps_per_host=1.
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.Date;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This program uses map/reduce to just run a distributed job where there is
 * no interaction between the tasks and each task write a large unsorted
 * random binary sequence file of BytesWritable.
 * In order for this program to generate data for terasort with 10-byte keys
 * and 90-byte values, have the following config:
 * <xmp>
 * <?xml version="1.0"?>
 * <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 * <configuration>
 *   <property>
 *     <name>test.randomwrite.min_key</name>
 *     <value>10</value>
 *   </property>
 *   <property>
 *     <name>test.randomwrite.max_key</name>
 *     <value>10</value>
 *   </property>
 *   <property>
 *     <name>test.randomwrite.min_value</name>
 *     <value>90</value>
 *   </property>
 *   <property>
 *     <name>test.randomwrite.max_value</name>
 *     <value>90</value>
 *   </property>
 *   <property>
 *     <name>test.randomwrite.total_bytes</name>
 *     <value>1099511627776</value>
 *   </property>
 * </configuration></xmp>
 *
 * Equivalently, {@link RandomWriter} also supports all the above options
 * and ones supported by {@link GenericOptionsParser} via the command-line.
 */
public class RandomWriter extends Configured implements Tool {

  /**
   * User counters
   */
  static enum Counters { RECORDS_WRITTEN, BYTES_WRITTEN }

  /**
   * A custom input format that creates virtual inputs of a single string
   * for each map.
   */
  static class RandomInputFormat implements InputFormat<Text, Text> {

    /**
     * Generate the requested number of file splits, with the filename
     * set to the filename of the output file.
     */
    public InputSplit[] getSplits(JobConf job,
                                  int numSplits) throws IOException {
      InputSplit[] result = new InputSplit[numSplits];
      Path outDir = FileOutputFormat.getOutputPath(job);
      for(int i=0; i < result.length; ++i) {
        result[i] = new FileSplit(new Path(outDir, "dummy-split-" + i), 0, 1,
                                  (String[])null);
      }
      return result;
    }

    /**
     * Return a single record (filename, "") where the filename is taken from
     * the file split.
     */
    static class RandomRecordReader implements RecordReader<Text, Text> {
      Path name;
      public RandomRecordReader(Path p) {
        name = p;
      }
      public boolean next(Text key, Text value) {
        if (name != null) {
          key.set(name.getName());
          name = null;
          return true;
        }
        return false;
      }
      public Text createKey() {
        return new Text();
      }
      public Text createValue() {
        return new Text();
      }
      public long getPos() {
        return 0;
      }
      public void close() {}
      public float getProgress() {
        return 0.0f;
      }
    }

    public RecordReader<Text, Text> getRecordReader(InputSplit split,
                                                    JobConf job,
                                                    Reporter reporter) throws IOException {
      return new RandomRecordReader(((FileSplit) split).getPath());
    }
  }

  static class Map extends MapReduceBase
    implements Mapper<WritableComparable, Writable,
                      BytesWritable, BytesWritable> {

    private long numBytesToWrite;
    private int minKeySize;
    private int keySizeRange;
    private int minValueSize;
    private int valueSizeRange;
    private Random random = new Random();
    private BytesWritable randomKey = new BytesWritable();
    private BytesWritable randomValue = new BytesWritable();

    private void randomizeBytes(byte[] data, int offset, int length) {
      for(int i=offset + length - 1; i >= offset; --i) {
        data[i] = (byte) random.nextInt(256);
      }
    }

    /**
     * Given an output filename, write a bunch of random records to it.
     */
    public void map(WritableComparable key,
                    Writable value,
                    OutputCollector<BytesWritable, BytesWritable> output,
                    Reporter reporter) throws IOException {
      int itemCount = 0;
      while (numBytesToWrite > 0) {
        int keyLength = minKeySize +
          (keySizeRange != 0 ? random.nextInt(keySizeRange) : 0);
        randomKey.setSize(keyLength);
        randomizeBytes(randomKey.getBytes(), 0, randomKey.getLength());
        int valueLength = minValueSize +
          (valueSizeRange != 0 ? random.nextInt(valueSizeRange) : 0);
        randomValue.setSize(valueLength);
        randomizeBytes(randomValue.getBytes(), 0, randomValue.getLength());
        output.collect(randomKey, randomValue);
        numBytesToWrite -= keyLength + valueLength;
        reporter.incrCounter(Counters.BYTES_WRITTEN, keyLength + valueLength);
        reporter.incrCounter(Counters.RECORDS_WRITTEN, 1);
        if (++itemCount % 200 == 0) {
          reporter.setStatus("wrote record " + itemCount + ". " +
                             numBytesToWrite + " bytes left.");
        }
      }
      reporter.setStatus("done with " + itemCount + " records.");
    }

    /**
     * Save the values out of the configuaration that we need to write
     * the data.
     */
    @Override
    public void configure(JobConf job) {
      numBytesToWrite = job.getLong("test.randomwrite.bytes_per_map",
                                    1*1024*1024);
      minKeySize = job.getInt("test.randomwrite.min_key", 10);
      keySizeRange =
        job.getInt("test.randomwrite.max_key", 1000) - minKeySize;
      minValueSize = job.getInt("test.randomwrite.min_value", 0);
      valueSizeRange =
        job.getInt("test.randomwrite.max_value", 20000) - minValueSize;
    }
  }

  /**
   * This is the main routine for launching a distributed random write job.
   * It runs 10 maps/node and each node writes 1 gig of data to a DFS file.
   * The reduce doesn't do anything.
   *
   * @throws IOException
   */
  public int run(String[] args) throws Exception {
    if (args.length == 0) {
      System.out.println("Usage: writer <out-dir>");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    Path outDir = new Path(args[0]);
    JobConf job = new JobConf(getConf());

    job.setJarByClass(RandomWriter.class);
    job.setJobName("random-writer");
    FileOutputFormat.setOutputPath(job, outDir);

    job.setOutputKeyClass(BytesWritable.class);
    job.setOutputValueClass(BytesWritable.class);

    job.setInputFormat(RandomInputFormat.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(IdentityReducer.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);

    JobClient client = new JobClient(job);
    ClusterStatus cluster = client.getClusterStatus();
    int numMapsPerHost = job.getInt("test.randomwriter.maps_per_host", 1);
    long numBytesToWritePerMap = job.getLong("test.randomwrite.bytes_per_map",
                                             1*1024*1024);
    if (numBytesToWritePerMap == 0) {
      System.err.println("Cannot have test.randomwrite.bytes_per_map set to 0");
      return -2;
    }
    long totalBytesToWrite = job.getLong("test.randomwrite.total_bytes",
         numMapsPerHost*numBytesToWritePerMap*cluster.getTaskTrackers());
    int numMaps = (int) (totalBytesToWrite / numBytesToWritePerMap);
    if (numMaps == 0 && totalBytesToWrite > 0) {
      numMaps = 1;
      job.setLong("test.randomwrite.bytes_per_map", totalBytesToWrite);
    }

    job.setNumMapTasks(numMaps);
    System.out.println("Running " + numMaps + " maps.");

    // reducer NONE
    job.setNumReduceTasks(0);

    Date startTime = new Date();
    System.out.println("Job started: " + startTime);
    JobClient.runJob(job);
    Date endTime = new Date();
    System.out.println("Job ended: " + endTime);
    System.out.println("The job took " +
                       (endTime.getTime() - startTime.getTime()) /1000 +
                       " seconds.");

    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new RandomWriter(), args);
    System.exit(res);
  }

}

Output:
11/10/17 13:27:46 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
Running 2 maps.
Job started: Mon Oct 17 13:27:47 CST 2011
11/10/17 13:27:47 INFO mapred.JobClient: Running job: job_201110171322_0001
11/10/17 13:27:48 INFO mapred.JobClient:  map 0% reduce 0%
11/10/17 13:29:58 INFO mapred.JobClient:  map 50% reduce 0%
11/10/17 13:30:05 INFO mapred.JobClient:  map 100% reduce 0%
11/10/17 13:30:07 INFO mapred.JobClient: Job complete: job_201110171322_0001
11/10/17 13:30:07 INFO mapred.JobClient: Counters: 8
11/10/17 13:30:07 INFO mapred.JobClient:   Job Counters
11/10/17 13:30:07 INFO mapred.JobClient:     Launched map tasks=3
11/10/17 13:30:07 INFO mapred.JobClient:   org.apache.hadoop.examples.RandomWriter$Counters
11/10/17 13:30:07 INFO mapred.JobClient:     BYTES_WRITTEN=2147504078
11/10/17 13:30:07 INFO mapred.JobClient:     RECORDS_WRITTEN=204528
11/10/17 13:30:07 INFO mapred.JobClient:   FileSystemCounters
11/10/17 13:30:07 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2154580318
11/10/17 13:30:07 INFO mapred.JobClient:   Map-Reduce Framework
11/10/17 13:30:07 INFO mapred.JobClient:     Map input records=2
11/10/17 13:30:07 INFO mapred.JobClient:     Spilled Records=0
11/10/17 13:30:07 INFO mapred.JobClient:     Map input bytes=0
11/10/17 13:30:07 INFO mapred.JobClient:     Map output records=204528
Job ended: Mon Oct 17 13:30:07 CST 2011
The job took 140 seconds.

Two files were produced on HDFS under /home/hadoop/rand: part-00000 (1 GB, r3) and part-00001 (1 GB, r3).
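The counters in this log can be cross-checked with a little arithmetic. The sketch below assumes the run used the key and value size defaults from the listing (10/1000 and 0/20000), which the log itself does not confirm. The class and method names are illustrative.

```java
// Sketch only: CounterCheckSketch and its methods are hypothetical names.
final class CounterCheckSketch {

    static final long GIB = 1024L * 1024L * 1024L;

    // Bytes written beyond the nominal target of maps * 1 GiB.
    static long overshoot(long bytesWritten, int maps) {
        return bytesWritten - maps * GIB;
    }

    // Observed mean record size (key bytes + value bytes).
    static double averageRecordSize(long bytesWritten, long records) {
        return (double) bytesWritten / records;
    }

    // Mean of min + nextInt(max - min), i.e. uniform over [min, max - 1].
    static double expectedMeanLength(int min, int max) {
        return max == min ? min : (min + (max - 1)) / 2.0;
    }

    public static void main(String[] args) {
        long bytesWritten = 2147504078L; // BYTES_WRITTEN from the log
        long records = 204528L;          // RECORDS_WRITTEN from the log
        int maps = 2;                    // "Running 2 maps."

        System.out.println("overshoot over 2 GiB: " + overshoot(bytesWritten, maps)); // prints 20430
        System.out.println("observed mean record: " + averageRecordSize(bytesWritten, records));
        System.out.println("expected mean record: "
            + (expectedMeanLength(10, 1000) + expectedMeanLength(0, 20000))); // prints 10504.0
    }
}
```

The small overshoot is what the while loop in map() would produce: each map keeps writing until its remaining byte budget drops to zero or below, so the last record can push it past the target. The observed mean record size of roughly 10,500 bytes is close to the 10,504 bytes expected under the assumed defaults.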