Getting Started with a Hadoop Cluster: Writing a WordCount Program

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 03:58

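1. Using HDFS

Check the cluster status: hdfs dfsadmin -report
View HDFS cluster information in the web console: http://hadoop1:50070/ (50070 is the Hadoop 2.x default port; Hadoop 3.x uses 9870)
List a directory in HDFS: hadoop fs -ls /
Create a directory in HDFS: hadoop fs -mkdir -p /aaa/bbb/ccc
Upload a file: hadoop fs -put <local file path> <HDFS path>
Download a file: hadoop fs -get <HDFS path>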

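2. Using MapReduce

MapReduce is Hadoop's programming framework for distributed computation. If you follow its programming conventions, a small amount of business logic is enough to build a program that processes massive data sets in parallel.
Demo: counting word occurrences
1) Mapper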

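import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused for every output record to avoid allocating new objects per token
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // Life cycle of map(): the framework calls it once for every line of input
    // key:   byte offset of the start of this line within the file
    // value: content of this line
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line with the JDK's StringTokenizer, which by default
        // splits on whitespace (space, tab, newline, carriage return, form feed)
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // Emit <word, 1>
            context.write(word, one);
        }
    }
}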
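To see exactly which tokens the mapper emits, the tokenizing step can be tried on its own with plain JDK classes. This is a minimal sketch; the class name `TokenizerDemo` and the sample line are made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Splits one line the same way the mapper does: on StringTokenizer's
    // default delimiters (space, tab, newline, carriage return, form feed).
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample line containing a tab and a double space
        System.out.println(tokens("Hello hadoop,\thello  HDFS"));
        // prints: [Hello, hadoop,, hello, HDFS]
    }
}
```

Punctuation stays attached to the word and case is preserved, so with this mapper "Hello" and "hello" are counted as two different words, and "hadoop," is not the same key as "hadoop".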

2) Reducer

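import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Life cycle: the framework calls reduce() once for each group of
    // key/value pairs that share the same key
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Counter for this word
        int count = 0;
        // Add up every value in this key's group
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}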
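The step between the mapper and the reducer, where the framework groups all values that share a key before calling reduce once per group, is easy to miss. The following is a rough single-process imitation of that flow in plain Java, with no Hadoop classes involved. The class `LocalWordCount` and its method names are hypothetical, and a `TreeMap` stands in for the framework's sort-and-group step, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    // Map + shuffle: emit (word, 1) for every token, then collect all
    // values for the same word into one list, ordered by key.
    static Map<String, List<Integer>> mapAndShuffle(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                groups.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        return groups;
    }

    // Reduce: invoked once per key with all of that key's values.
    static int reduce(String key, Iterable<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // Runs the whole imitation pipeline and returns word -> count.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : mapAndShuffle(lines).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: two lines, as if read from wordcount.txt
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(run(lines)); // prints: {hadoop=1, hello=2, world=1}
    }
}
```

Here the reduce call for "hello" receives the group [1, 1] and returns 2, which mirrors what WordCountReducer does with its Iterable of IntWritable values on the cluster.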

3) Define a main class that describes the job and submits it

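import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountRunner {
    // Describe the business logic (which class is the mapper, which is the
    // reducer, where the input data is, where the results go, ...) as a Job
    // object, then submit that job to the cluster to run
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job wcjob = Job.getInstance(conf);

        // Specify the jar that contains this job
        // wcjob.setJar("/home/hadoop/wordcount.jar");
        wcjob.setJarByClass(WordCountRunner.class);

        wcjob.setMapperClass(WordCountMapper.class);
        wcjob.setReducerClass(WordCountReducer.class);

        // Key and value types emitted by the Mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(IntWritable.class);

        // Key and value types emitted by the Reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(IntWritable.class);

        // Location of the input data (the original read "intput", which does
        // not match the /wordcount/input directory created in step 4)
        FileInputFormat.setInputPaths(wcjob, "hdfs://hadoop1:9000/wordcount/input/wordcount.txt");

        // Location for the results; this directory must not exist yet.
        // The original used host hdp-server01 here; hadoop1 is assumed so
        // that input and output point at the same NameNode.
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://hadoop1:9000/wordcount/output/"));

        // Submit the job to the YARN cluster and wait for it to finish
        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}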

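4) Package the project, prepare the input data and upload it to HDFS, then copy the jar to any location on a cluster machine.

Package the project with MyEclipse and specify the main class to run.
Prepare a text file named wordcount.txt and enter some content.
hadoop fs -mkdir -p /wordcount/input/
hadoop fs -put /home/wordcount.txt /wordcount/input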

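5) Launch the wordcount jar from the command line

hadoop jar wordcount.jar com.xt.hadoop.mapreduce.WordCountRunner /wordcount/input /wordcount/output

This assumes com.xt.hadoop.mapreduce is the package that contains WordCountRunner. The runner shown in step 3 hardcodes its input and output paths, so the two trailing path arguments are ignored unless main() is changed to read them from args.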

6) View the results

hadoop fs -cat /wordcount/output/part-r-00000
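With the default text output format, each line of part-r-00000 holds a word, a tab character, and its count. If the result needs further processing, such lines can be read back with a few lines of plain Java. This is only a sketch: the class `OutputParser` and the sample lines are hypothetical, and it assumes the job's output separator was left at its default tab.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputParser {
    // Parses "word<TAB>count" lines into a map, keeping the file's order.
    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("Missing tab separator: " + line);
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical lines, shaped like the output of hadoop fs -cat
        Map<String, Integer> counts = parse(Arrays.asList("hadoop\t1", "hello\t2", "world\t1"));
        System.out.println(counts.get("hello")); // prints: 2
    }
}
```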

Reference:
http://blog.csdn.net/lisonglisonglisong/article/details/47125319
