Hadoop Basics: A Small WordCount Example


1. Create the project

File -> New -> Other -> Map/Reduce -> Map/Reduce Project -> Next -> enter the project name -> Finish

2. Set up the project directory structure

3. Write the Java source files

3.1 WCMapper.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Receive one line of input
        String line = value.toString();
        // Split the line into words
        String[] words = line.split(" ");
        // Loop over all the words
        for (String w : words) {
            // Record a count of 1 for each occurrence, wrapping the data
            // in Hadoop's writable types: new Text(w), new LongWritable(1)
            context.write(new Text(w), new LongWritable(1));
        }
    }
}
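One note on this map() body: it allocates a new Text and a new LongWritable for every word. A common MapReduce idiom is to reuse a single instance of each writable across calls; the variant below is only a sketch of that optimization, not part of the original tutorial:

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused across map() calls to avoid one allocation per record
    // (safe because context.write() serializes the values immediately).
    private final Text word = new Text();
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String w : value.toString().split(" ")) {
            word.set(w);
            context.write(word, one);
        }
    }
}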

3.2 WCReducer.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
        // Define a counter
        long counter = 0;
        // Loop over the values and sum them
        for (LongWritable l : values) {
            counter += l.get();
        }
        // Emit the word and its total count
        context.write(key, new LongWritable(counter));
    }
}
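Since adding counts is associative and commutative, this same class can also double as a combiner, pre-summing counts on the map side before the shuffle. The tutorial does not do this (the job counters later show Combine input records=0), but it would only take one extra line in the driver, sketched here as an optional addition:

// Optional: reuse WCReducer as a map-side combiner to shrink shuffle traffic.
job.setCombinerClass(WCReducer.class);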

3.3 WordCount.java

package hadoop.example.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Build a Job object
        Job job = Job.getInstance(new Configuration());
        // The class containing the main method (used to locate the JAR)
        job.setJarByClass(WordCount.class);

        // Configure the Mapper
        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path("/root/workplace/hdfs/wdcount/1.txt"));

        // Configure the Reducer
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/root/workplace/hdfs/wdcount/output"));

        // Submit the job and print progress details
        job.waitForCompletion(true);
    }
}
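The driver above hardcodes both the input and output paths, so the JAR only works for this one file. A common refinement, sketched below as an assumed variant rather than the tutorial's code, is to read the paths from the command line:

public class WordCount {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Paths come from the command line:
        //   hadoop jar wc.jar <input path> <output path>
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // true = print progress; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}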

4. Package the code into a JAR
Project -> Export -> JAR file -> Next -> under "JAR file" choose the directory where the JAR should be placed -> Next -> Next -> for "Main class" select the main class to configure in this JAR (here, WordCount) -> Finish. Setting the main class in the manifest is what lets the JAR be run later without naming the class on the command line.

5. Upload the input file 1.txt

The contents of 1.txt (five lines, 57 bytes):

hello tom 
hello jerry
hello kitty
hello world
hello tom

Upload the file:

[root@centos ~]# hadoop dfs -put /root/workplace/wdcount  /root/workplace/hdfs
Warning: $HADOOP_HOME is deprecated.

List the file:

[root@centos ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:06 /root/workplace/hdfs/wdcount/1.txt

View the file's contents:

[root@centos ~]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.
hello tom 
hello jerry
hello kitty
hello world
hello tom
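Tracing the expected data flow through the code above (this trace is reconstructed for illustration, not program output):

map output   : (hello,1) (tom,1) (hello,1) (jerry,1) ... 10 pairs in total
after shuffle: (hello,[1,1,1,1,1]) (jerry,[1]) (kitty,[1]) (tom,[1,1]) (world,[1])
reduce output: (hello,5) (jerry,1) (kitty,1) (tom,2) (world,1)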

6. Two ways to run the program
6.1 Running the JAR with hadoop jar (the way it is done in real work)

Because the main class was set in the JAR's manifest during export, no class name needs to follow the JAR path:

[root@centos wdcount]# hadoop jar /root/workplace/wdcount/wc.jar 
Warning: $HADOOP_HOME is deprecated.
16/08/20 09:20:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/20 09:20:35 INFO input.FileInputFormat: Total input paths to process : 1
16/08/20 09:20:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/08/20 09:20:35 WARN snappy.LoadSnappy: Snappy native library not loaded
16/08/20 09:20:35 INFO mapred.JobClient: Running job: job_201608192017_0008
16/08/20 09:20:36 INFO mapred.JobClient:  map 0% reduce 0%
16/08/20 09:20:43 INFO mapred.JobClient:  map 100% reduce 0%
16/08/20 09:20:52 INFO mapred.JobClient:  map 100% reduce 33%
16/08/20 09:20:54 INFO mapred.JobClient:  map 100% reduce 100%
16/08/20 09:20:56 INFO mapred.JobClient: Job complete: job_201608192017_0008
16/08/20 09:20:56 INFO mapred.JobClient: Counters: 29
16/08/20 09:20:56 INFO mapred.JobClient:   Job Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Launched reduce tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=8719
16/08/20 09:20:56 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
16/08/20 09:20:56 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
16/08/20 09:20:56 INFO mapred.JobClient:     Launched map tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     Data-local map tasks=1
16/08/20 09:20:56 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10405
16/08/20 09:20:56 INFO mapred.JobClient:   File Output Format Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Bytes Written=38
16/08/20 09:20:56 INFO mapred.JobClient:   FileSystemCounters
16/08/20 09:20:56 INFO mapred.JobClient:     FILE_BYTES_READ=162
16/08/20 09:20:56 INFO mapred.JobClient:     HDFS_BYTES_READ=177
16/08/20 09:20:56 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=110365
16/08/20 09:20:56 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=38
16/08/20 09:20:56 INFO mapred.JobClient:   File Input Format Counters 
16/08/20 09:20:56 INFO mapred.JobClient:     Bytes Read=57
16/08/20 09:20:56 INFO mapred.JobClient:   Map-Reduce Framework
16/08/20 09:20:56 INFO mapred.JobClient:     Map output materialized bytes=162
16/08/20 09:20:56 INFO mapred.JobClient:     Map input records=5
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce shuffle bytes=162
16/08/20 09:20:56 INFO mapred.JobClient:     Spilled Records=20
16/08/20 09:20:56 INFO mapred.JobClient:     Map output bytes=136
16/08/20 09:20:56 INFO mapred.JobClient:     Total committed heap usage (bytes)=158797824
16/08/20 09:20:56 INFO mapred.JobClient:     CPU time spent (ms)=2290
16/08/20 09:20:56 INFO mapred.JobClient:     Combine input records=0
16/08/20 09:20:56 INFO mapred.JobClient:     SPLIT_RAW_BYTES=120
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce input records=10
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce input groups=5
16/08/20 09:20:56 INFO mapred.JobClient:     Combine output records=0
16/08/20 09:20:56 INFO mapred.JobClient:     Physical memory (bytes) snapshot=263684096
16/08/20 09:20:56 INFO mapred.JobClient:     Reduce output records=5
16/08/20 09:20:56 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3726540800
16/08/20 09:20:56 INFO mapred.JobClient:     Map output records=10

The job has finished. Check the results:

[root@centos wdcount]# hadoop dfs -ls /root/workplace/hdfs/wdcount/output
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r--   1 root supergroup          0 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/_SUCCESS
drwxr-xr-x   - root supergroup          0 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/_logs
-rw-r--r--   1 root supergroup         38 2016-08-20 09:20 /root/workplace/hdfs/wdcount/output/part-r-00000
[root@centos wdcount]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/output/part-r-00000
Warning: $HADOOP_HOME is deprecated.
hello   5
jerry   1
kitty   1
tom 2
world   1
[root@centos wdcount]# 

Note how the counters line up with the input: Map input records=5 (five lines), Map output records=10 (ten words), Reduce input groups=5 (five distinct words), and Reduce output records=5, matching the five lines of part-r-00000.

6.2 Another way to run the program: launching it directly from the IDE


Console output:

16/08/20 10:32:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/20 10:32:12 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/08/20 10:32:12 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
16/08/20 10:32:12 INFO input.FileInputFormat: Total input paths to process : 1
16/08/20 10:32:13 WARN snappy.LoadSnappy: Snappy native library not loaded
16/08/20 10:32:13 INFO mapred.JobClient: Running job: job_local288352385_0001
16/08/20 10:32:13 INFO mapred.LocalJobRunner: Waiting for map tasks
16/08/20 10:32:13 INFO mapred.LocalJobRunner: Starting task: attempt_local288352385_0001_m_000000_0
16/08/20 10:32:13 INFO util.ProcessTree: setsid exited with exit code 0
16/08/20 10:32:13 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@68fa8cf9
16/08/20 10:32:13 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/root/workplace/hdfs/wdcount/1.txt:0+57
16/08/20 10:32:14 INFO mapred.MapTask: io.sort.mb = 100
16/08/20 10:32:14 INFO mapred.MapTask: data buffer = 79691776/99614720
16/08/20 10:32:14 INFO mapred.MapTask: record buffer = 262144/327680
16/08/20 10:32:14 INFO mapred.MapTask: Starting flush of map output
16/08/20 10:32:14 INFO mapred.MapTask: Finished spill 0
16/08/20 10:32:14 INFO mapred.Task: Task:attempt_local288352385_0001_m_000000_0 is done. And is in the process of commiting
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task 'attempt_local288352385_0001_m_000000_0' done.
16/08/20 10:32:14 INFO mapred.LocalJobRunner: Finishing task: attempt_local288352385_0001_m_000000_0
16/08/20 10:32:14 INFO mapred.LocalJobRunner: Map task executor complete.
16/08/20 10:32:14 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@384f27a1
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Merger: Merging 1 sorted segments
16/08/20 10:32:14 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 158 bytes
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task:attempt_local288352385_0001_r_000000_0 is done. And is in the process of commiting
16/08/20 10:32:14 INFO mapred.LocalJobRunner: 
16/08/20 10:32:14 INFO mapred.Task: Task attempt_local288352385_0001_r_000000_0 is allowed to commit now
16/08/20 10:32:14 INFO output.FileOutputCommitter: Saved output of task 'attempt_local288352385_0001_r_000000_0' to hdfs://localhost:9000/root/workplace/hdfs/wdcount/output
16/08/20 10:32:14 INFO mapred.LocalJobRunner: reduce > reduce
16/08/20 10:32:14 INFO mapred.Task: Task 'attempt_local288352385_0001_r_000000_0' done.
16/08/20 10:32:14 INFO mapred.JobClient:  map 100% reduce 100%
16/08/20 10:32:14 INFO mapred.JobClient: Job complete: job_local288352385_0001
16/08/20 10:32:14 INFO mapred.JobClient: Counters: 22
16/08/20 10:32:14 INFO mapred.JobClient:   File Output Format Counters 
16/08/20 10:32:14 INFO mapred.JobClient:     Bytes Written=38
16/08/20 10:32:14 INFO mapred.JobClient:   File Input Format Counters 
16/08/20 10:32:14 INFO mapred.JobClient:     Bytes Read=57
16/08/20 10:32:14 INFO mapred.JobClient:   FileSystemCounters
16/08/20 10:32:14 INFO mapred.JobClient:     FILE_BYTES_READ=510
16/08/20 10:32:14 INFO mapred.JobClient:     HDFS_BYTES_READ=114
16/08/20 10:32:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=136042
16/08/20 10:32:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=38
16/08/20 10:32:14 INFO mapred.JobClient:   Map-Reduce Framework
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce input groups=5
16/08/20 10:32:14 INFO mapred.JobClient:     Map output materialized bytes=162
16/08/20 10:32:14 INFO mapred.JobClient:     Combine output records=0
16/08/20 10:32:14 INFO mapred.JobClient:     Map input records=5
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
16/08/20 10:32:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce output records=5
16/08/20 10:32:14 INFO mapred.JobClient:     Spilled Records=20
16/08/20 10:32:14 INFO mapred.JobClient:     Map output bytes=136
16/08/20 10:32:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=258482176
16/08/20 10:32:14 INFO mapred.JobClient:     CPU time spent (ms)=0
16/08/20 10:32:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
16/08/20 10:32:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=120
16/08/20 10:32:14 INFO mapred.JobClient:     Map output records=10
16/08/20 10:32:14 INFO mapred.JobClient:     Combine input records=0
16/08/20 10:32:14 INFO mapred.JobClient:     Reduce input records=10
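What happened in this run: the program was launched as a plain Java application, so no job JAR was shipped (hence the "No job jar file set" warning) and LocalJobRunner executed the map and reduce tasks inside a single JVM, while the input and output still lived on HDFS at hdfs://localhost:9000. That address comes from the Hadoop configuration files on the classpath; if none are available, it could be set explicitly in the driver. A sketch, with the caveat that this line is an assumption and not in the original code (fs.default.name is the Hadoop 1.x key; 2.x and later use fs.defaultFS):

Configuration conf = new Configuration();
// Assumption: point the local run at the same HDFS instance seen in the logs.
conf.set("fs.default.name", "hdfs://localhost:9000");
Job job = Job.getInstance(conf);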

7. This completes the WordCount example

8. Some basic HDFS operations

Delete a directory:

[root@centos ~]# hadoop dfs -rmr /root/workplace/wdcount
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://localhost:9000/root/workplace/wdcount

List a file:

[root@centos ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:06 /root/workplace/hdfs/wdcount/1.txt

View a file's contents:

[root@centos ~]# hadoop dfs -cat  /root/workplace/hdfs/wdcount/1.txt
Warning: $HADOOP_HOME is deprecated.
hello tom 
hello jerry
hello kitty
hello world
hello tom

Delete everything inside a directory:

[root@centos ~]# hadoop dfs -rm /root/workplace/wdcount/*
Warning: $HADOOP_HOME is deprecated.
Deleted hdfs://localhost:9000/root/workplace/wdcount/1.txt
Deleted hdfs://localhost:9000/root/workplace/wdcount/1.txt~
Deleted hdfs://localhost:9000/root/workplace/wdcount/wc.jar

Upload a local file to the HDFS file system:

[root@centos ~]# hadoop dfs -put /root/workplace/wdcount/1.txt  /root/workplace/hdfs/wdcount
Warning: $HADOOP_HOME is deprecated.
[root@centos ~]# hadoop dfs -ls  /root/workplace/hdfs/wdcount
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r--   1 root supergroup         57 2016-08-20 09:02 /root/workplace/hdfs/wdcount
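The same operations are also available programmatically through the org.apache.hadoop.fs.FileSystem API. Below is a minimal sketch; the paths reuse this tutorial's layout, and the class name HdfsOps is made up for illustration:

package hadoop.example.wordcount;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Upload a local file (like: hadoop dfs -put).
        fs.copyFromLocalFile(new Path("/root/workplace/wdcount/1.txt"),
                             new Path("/root/workplace/hdfs/wdcount"));

        // List a directory (like: hadoop dfs -ls).
        for (FileStatus status : fs.listStatus(new Path("/root/workplace/hdfs/wdcount"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // Print a file's contents (like: hadoop dfs -cat).
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/root/workplace/hdfs/wdcount/1.txt"))));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();

        // Recursively delete a directory (like: hadoop dfs -rmr).
        fs.delete(new Path("/root/workplace/wdcount"), true);

        fs.close();
    }
}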