MR Exercise: Deduplicating uids
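In this exercise the work is done in the map phase: the mapper emits each uid as the output key, and the framework groups identical keys before they reach the reducer. The reduce phase only writes each distinct key (uid) to the output file.
Data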
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/
20111230000005 57375476989eea12893c0c3811607bcf 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007 b97920521c78de70ac38e3713f524b50 本本联盟 1 1 http://www.bblianmeng.com/
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1 http://lib.scnu.edu.cn/
20111230000008 f2f5a21c764aebde1e8afcc2871e086f 在线代理 2 1 http://proxyie.cn/
20111230000009 96994a0480e7e1edcaef67b20d8816b7 伟大导演 1 1 http://movie.douban.com/review/1128960/
20111230000009 698956eb07815439fe5f46e9a4503997 youku 1 1 http://www.youku.com/
20111230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房产网 1 1 http://hf.house365.com/
20111230000010 f577230df7b6c532837cd16ab731f874 哈萨克网址大全 1 1 http://www.kz321.com/
20111230000010 285f88780dd0659f5fc8acc7cc4949f2 IQ数码 1 1 http://www.iqshuma.com/
There are 10 records. The second column is the uid, and the first two rows share the same uid.
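The mapper in this post reads the uid with split("\t") and index 1, which would throw on a line that has fewer than two fields. As a minimal sketch, assuming the input file is tab-separated as that split implies, a defensive version of the extraction could look like the following. The class UidExtractSketch and the method extractUid are hypothetical names used only for illustration; they are not part of the original job.

```java
import java.util.Optional;

class UidExtractSketch {

    // Hypothetical helper: returns the second tab-separated field,
    // or an empty Optional when the line has no usable uid column.
    static Optional<String> extractUid(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2 || fields[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(fields[1]);
    }

    public static void main(String[] args) {
        // First record of the sample data, written with explicit tabs (assumed separator)
        String record = "20111230000005\t57375476989eea12893c0c3811607bcf\t奇艺高清\t1\t1\thttp://www.qiyi.com/";
        // prints 57375476989eea12893c0c3811607bcf
        System.out.println(extractUid(record).orElse("<malformed>"));
        // prints <malformed>
        System.out.println(extractUid("no tabs here").orElse("<malformed>"));
    }
}
```

In a mapper, a line for which the helper returns an empty Optional could simply be skipped instead of failing the task.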
Code
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UuidMain {

    public static class UuidMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        // Reused across map() calls so no new Text is allocated per record
        private final Text val = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line into tab-separated fields
            String[] line = value.toString().split("\t");
            // The uid is the second field
            String uid = line[1];
            val.set(uid);
            System.out.println("Mapper......");
            System.out.println("-------------uid :" + uid);
            context.write(val, NullWritable.get());
        }
    }

    public static class UuidReduce extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("Reduce.......");
            System.out.println("key:" + key);
            // reduce() is called once per distinct uid, so writing the key once removes duplicates
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        if (null == args || args.length != 2) {
            System.err.println("Usage: UuidMain <input path> <output path>");
            System.exit(1);
        }
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(new Configuration(), "Uuid");
        // jar class
        job.setJarByClass(UuidMain.class);
        // mapper class
        job.setMapperClass(UuidMapper.class);
        // reducer class
        job.setReducerClass(UuidReduce.class);
        // output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // input and output paths
        FileInputFormat.addInputPath(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Console output
2017-07-22 16:38:40,404 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-22 16:38:41,701 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-07-22 16:38:41,712 INFO [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-07-22 16:38:42,154 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-07-22 16:38:42,159 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(259)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-07-22 16:38:42,440 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-07-22 16:38:42,583 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2017-07-22 16:38:43,161 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_local1393818339_0001
2017-07-22 16:38:43,273 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/staging/zkpk1393818339/.staging/job_local1393818339_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-07-22 16:38:43,284 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/staging/zkpk1393818339/.staging/job_local1393818339_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-07-22 16:38:43,924 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/local/localRunner/zkpk/job_local1393818339_0001/job_local1393818339_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2017-07-22 16:38:43,982 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/local/localRunner/zkpk/job_local1393818339_0001/job_local1393818339_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2017-07-22 16:38:44,035 INFO [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://localhost:8080/
2017-07-22 16:38:44,038 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_local1393818339_0001
2017-07-22 16:38:44,058 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-07-22 16:38:44,093 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-07-22 16:38:44,435 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-07-22 16:38:44,441 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:44,535 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : [ ]
2017-07-22 16:38:44,540 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(733)) - Processing split: hdfs://master:9000/user/wordcount/input/sogou.10.utf8:0+966
2017-07-22 16:38:44,578 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:createSortingCollector(388)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-07-22 16:38:44,842 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:setEquator(1182)) - (EQUATOR) 0 kvi 26214396(104857584)
2017-07-22 16:38:44,842 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(975)) - mapreduce.task.io.sort.mb: 100
2017-07-22 16:38:44,842 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(976)) - soft limit at 83886080
2017-07-22 16:38:44,842 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(977)) - bufstart = 0; bufvoid = 104857600
2017-07-22 16:38:44,842 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(978)) - kvstart = 26214396; length = 6553600
2017-07-22 16:38:45,042 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_local1393818339_0001 running in uber mode : false
2017-07-22 16:38:45,043 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%
2017-07-22 16:38:45,728 INFO [LocalJobRunner Map Task Executor #0] input.LineRecordReader (LineRecordReader.java:skipUtfByteOrderMark(156)) - Found UTF-8 BOM and skipped it
Mapper......
-------------uid :57375476989eea12893c0c3811607bcf
Mapper......
-------------uid :57375476989eea12893c0c3811607bcf
Mapper......
-------------uid :b97920521c78de70ac38e3713f524b50
Mapper......
-------------uid :6961d0c97fe93701fc9c0d861d096cd9
Mapper......
-------------uid :f2f5a21c764aebde1e8afcc2871e086f
Mapper......
-------------uid :96994a0480e7e1edcaef67b20d8816b7
Mapper......
-------------uid :698956eb07815439fe5f46e9a4503997
Mapper......
-------------uid :599cd26984f72ee68b2b6ebefccf6aed
Mapper......
-------------uid :f577230df7b6c532837cd16ab731f874
Mapper......
-------------uid :285f88780dd0659f5fc8acc7cc4949f2
2017-07-22 16:38:45,732 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 
2017-07-22 16:38:46,078 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1437)) - Starting flush of map output
2017-07-22 16:38:46,079 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1455)) - Spilling map output
2017-07-22 16:38:46,079 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1456)) - bufstart = 0; bufend = 330; bufvoid = 104857600
2017-07-22 16:38:46,079 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1458)) - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
2017-07-22 16:38:46,191 INFO [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:sortAndSpill(1641)) - Finished spill 0
2017-07-22 16:38:46,203 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local1393818339_0001_m_000000_0 is done. And is in the process of committing
2017-07-22 16:38:46,223 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-07-22 16:38:46,224 INFO [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1393818339_0001_m_000000_0' done.
2017-07-22 16:38:46,224 INFO [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:46,224 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-07-22 16:38:46,229 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2017-07-22 16:38:46,229 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local1393818339_0001_r_000000_0
2017-07-22 16:38:46,241 INFO [pool-6-thread-1] mapred.Task (Task.java:initialize(587)) - Using ResourceCalculatorProcessTree : [ ]
2017-07-22 16:38:46,250 INFO [pool-6-thread-1] mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@10080e18
2017-07-22 16:38:46,274 INFO [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:<init>(193)) - MergerManager: memoryLimit=304244320, maxSingleShuffleLimit=76061080, mergeThreshold=200801264, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2017-07-22 16:38:46,283 INFO [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local1393818339_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2017-07-22 16:38:46,591 INFO [localfetcher#1] reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(140)) - localfetcher#1 about to shuffle output of map attempt_local1393818339_0001_m_000000_0 decomp: 352 len: 356 to MEMORY
2017-07-22 16:38:46,614 INFO [localfetcher#1] reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 352 bytes from map-output for attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:46,637 INFO [localfetcher#1] reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(307)) - closeInMemoryFile -> map-output of size: 352, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->352
2017-07-22 16:38:46,656 INFO [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2017-07-22 16:38:46,657 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:46,658 INFO [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(667)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2017-07-22 16:38:46,665 INFO [pool-6-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-07-22 16:38:46,666 INFO [pool-6-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 317 bytes
2017-07-22 16:38:46,670 INFO [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(742)) - Merged 1 segments, 352 bytes to disk to satisfy reduce memory limit
2017-07-22 16:38:46,670 INFO [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(772)) - Merging 1 files, 356 bytes from disk
2017-07-22 16:38:46,678 INFO [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(787)) - Merging 0 segments, 0 bytes from memory into reduce
2017-07-22 16:38:46,678 INFO [pool-6-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-07-22 16:38:46,680 INFO [pool-6-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 317 bytes
2017-07-22 16:38:46,681 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:46,775 INFO [pool-6-thread-1] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
Reduce.......
key:285f88780dd0659f5fc8acc7cc4949f2
Reduce.......
key:57375476989eea12893c0c3811607bcf
Reduce.......
key:599cd26984f72ee68b2b6ebefccf6aed
Reduce.......
key:6961d0c97fe93701fc9c0d861d096cd9
Reduce.......
key:698956eb07815439fe5f46e9a4503997
Reduce.......
key:96994a0480e7e1edcaef67b20d8816b7
Reduce.......
key:b97920521c78de70ac38e3713f524b50
Reduce.......
key:f2f5a21c764aebde1e8afcc2871e086f
Reduce.......
key:f577230df7b6c532837cd16ab731f874
2017-07-22 16:38:47,078 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 100% reduce 0%
2017-07-22 16:38:47,319 INFO [pool-6-thread-1] mapred.Task (Task.java:done(1001)) - Task:attempt_local1393818339_0001_r_000000_0 is done. And is in the process of committing
2017-07-22 16:38:47,323 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:47,323 INFO [pool-6-thread-1] mapred.Task (Task.java:commit(1162)) - Task attempt_local1393818339_0001_r_000000_0 is allowed to commit now
2017-07-22 16:38:47,375 INFO [pool-6-thread-1] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local1393818339_0001_r_000000_0' to hdfs://master:9000/user/wordcount/output1/_temporary/0/task_local1393818339_0001_r_000000
2017-07-22 16:38:47,379 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2017-07-22 16:38:47,379 INFO [pool-6-thread-1] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1393818339_0001_r_000000_0' done.
2017-07-22 16:38:47,380 INFO [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local1393818339_0001_r_000000_0
2017-07-22 16:38:47,380 INFO [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2017-07-22 16:38:48,079 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 100% reduce 100%
2017-07-22 16:38:48,080 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) - Job job_local1393818339_0001 completed successfully
2017-07-22 16:38:48,201 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 38
    File System Counters
        FILE: Number of bytes read=1086
        FILE: Number of bytes written=458862
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1932
        HDFS: Number of bytes written=297
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Map output bytes=330
        Map output materialized bytes=356
        Input split bytes=118
        Combine input records=0
        Combine output records=0
        Reduce input groups=9
        Reduce shuffle bytes=356
        Reduce input records=10
        Reduce output records=9
        Spilled Records=20
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=394264576
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=966
    File Output Format Counters
        Bytes Written=297
The output file, part-r-00000
285f88780dd0659f5fc8acc7cc4949f2
57375476989eea12893c0c3811607bcf
599cd26984f72ee68b2b6ebefccf6aed
6961d0c97fe93701fc9c0d861d096cd9
698956eb07815439fe5f46e9a4503997
96994a0480e7e1edcaef67b20d8816b7
b97920521c78de70ac38e3713f524b50
f2f5a21c764aebde1e8afcc2871e086f
f577230df7b6c532837cd16ab731f874
As the output shows, no duplicate uids remain: 10 input records produced 9 distinct uids.
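To see why the job deduplicates without any explicit comparison, the flow can be mimicked in plain Java with no Hadoop dependency. This is only an in-memory sketch of the idea, not how Hadoop implements the shuffle: a TreeMap stands in for the sort-and-group step, and the class DedupFlowSketch and method distinctUids are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DedupFlowSketch {

    // Mimics map -> shuffle -> reduce in memory; this is not Hadoop code.
    static List<String> distinctUids(List<String> mapOutputKeys) {
        // "Shuffle": group identical keys; TreeMap keeps the keys sorted.
        Map<String, Integer> groups = new TreeMap<>();
        for (String uid : mapOutputKeys) {
            groups.merge(uid, 1, Integer::sum);
        }
        // "Reduce": emit each key once and ignore how many values it had.
        return new ArrayList<>(groups.keySet());
    }

    public static void main(String[] args) {
        // The ten uids in the order the mapper emitted them
        List<String> emitted = Arrays.asList(
                "57375476989eea12893c0c3811607bcf",
                "57375476989eea12893c0c3811607bcf",
                "b97920521c78de70ac38e3713f524b50",
                "6961d0c97fe93701fc9c0d861d096cd9",
                "f2f5a21c764aebde1e8afcc2871e086f",
                "96994a0480e7e1edcaef67b20d8816b7",
                "698956eb07815439fe5f46e9a4503997",
                "599cd26984f72ee68b2b6ebefccf6aed",
                "f577230df7b6c532837cd16ab731f874",
                "285f88780dd0659f5fc8acc7cc4949f2");
        // Prints nine uids in sorted order, one per line
        for (String uid : distinctUids(emitted)) {
            System.out.println(uid);
        }
    }
}
```

The nine lines it prints match part-r-00000, which is consistent with the counters in the log: Reduce input records=10 but Reduce input groups=9.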