Hadoop Review and Practice (1)


I've read a lot of this material and forgotten most of it, so I'm writing it down to make it easier to recall.


    • Review of the basics
      • Installing Hadoop
      • HDFS
        • Concepts
        • Basic operations
    • Practice with IntelliJ
      • Word Count
      • FileSystem

Review of the Basics

Installing Hadoop

I installed this part with Homebrew, so there isn't much to say about the install itself; the real work is the Hadoop configuration. The settings worth noting are these:

core-site.xml

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

mapred-site.xml

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9010</value>
    </property>

yarn-site.xml
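
The original post leaves this file's contents out. As an assumption on my part (not from the original), a pseudo-distributed setup typically needs just one property here, enabling MapReduce's shuffle as a NodeManager auxiliary service:

    <configuration>
      <property>
        <!-- assumed typical value; the original post left yarn-site.xml blank -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>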

Hadoop divides nodes into a namenode and datanodes. Simply put, the namenode stores the directory structure (the filesystem namespace) and manages the datanodes. After installing Hadoop, the namenode has to be formatted:

    $ hadoop namenode -format

Next, the Hadoop services need to be started. There are two combinations:

• dfs + MapReduce => starts the namenode (plus its secondary), the datanodes, the jobtracker, and the tasktrackers. Port 50030 serves the jobtracker web UI and port 50070 the namenode UI.
• dfs + YARN => starts the YARN ResourceManager and the NodeManagers; port 8088 serves the ResourceManager UI.

    $ start-dfs.sh
    $ start-mapred.sh
    $ start-yarn.sh
    $ start-all.sh
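
To check which daemons actually came up, the JDK's jps tool lists the running JVM processes; on a working pseudo-distributed node it should show entries such as NameNode, DataNode, and (with YARN) ResourceManager and NodeManager, depending on which scripts were run:

    $ jps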

HDFS

Concepts:

| Concept | Notes |
| --- | --- |
| Data block | `$ hadoop fsck / -files -blocks` lists files with their blocks |
| namenode | keeps the namespace image + edit log; two fault-tolerance mechanisms (backing up the metadata + a secondary namenode) |
| datanode | stores blocks as needed and reports its block list to the namenode |
| HDFS Federation | the namespace is split into volumes, each managed by its own namenode |
| namenode recovery | import the namespace image -> replay the edit log -> accept block reports from datanodes |
| Filesystem | implementations of org.apache.hadoop.fs.FileSystem |

Basic operations

    $ hadoop fs -copyFromLocal xx.txt hdfs://localhost/test/xx.txt
    $ hadoop fs -copyFromLocal xx.txt /test/xx.txt   # scheme and host default to the URI set in core-site.xml
    $ hadoop fs -copyToLocal hdfs://localhost/test/xx.txt xx.txt
    $ hadoop fs -put xx.txt /test/xx.txt
    $ hadoop fs -mkdir /dir
    $ hadoop fs -ls .
    $ hadoop fs -ls file:///

Practice with IntelliJ

The workflow is simple: write a main function locally, implement map and reduce, build a jar, then run it against input in HDFS.

    $ hadoop jar rudi-hadoop_main.jar /input/test /output/result
    # rudi-hadoop_main.jar is the locally built jar

Worth pointing out: a map task can run on the node that holds its input data, but a reduce needs the entire map output, so the map-to-reduce transfer can eat a lot of cluster bandwidth if it is not optimized. A combiner can be grafted in between to save resources; see the sketch below.
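
For word count the reduce function is associative and commutative, so the reducer class can double as the combiner. A minimal sketch against the job set up in the Word Count listing below (one extra line in main):

    // Run the reduce logic on each mapper's local output before the shuffle,
    // shrinking the data sent across the network.
    job.setCombinerClass(WordCountReduce.class);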

The generic forms of map and reduce are <input key, input value, output key, output value>; for word count:

    Mapper<LongWritable, Text, Text, IntWritable>
    Reducer<Text, IntWritable, Text, IntWritable>

Word Count:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Emits (word, 1) for every token in the input line.
        public static class WordCountMap
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private final IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer token = new StringTokenizer(line);
                while (token.hasMoreTokens()) {
                    word.set(token.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Sums the counts emitted for each word.
        public static class WordCountReduce
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJarByClass(WordCount.class);
            job.setJobName("wordcount");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }

FileSystem:

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    public class URLCat {
        static {
            // Register Hadoop's handler so java.net.URL understands hdfs:// URLs.
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }

        public static void main(String[] args) throws Exception {
            InputStream in = null;
            try {
                in = new URL(args[0]).openStream();
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }
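
One caveat with URLCat: java.net.URL accepts a stream-handler factory at most once per JVM, and other code may have claimed it already. The more general route is the FileSystem API directly. A minimal sketch; the class name FileSystemCat is mine, not from the original post:

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical example, not from the original post: reads a file through
    // the FileSystem API instead of java.net.URL.
    public class FileSystemCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            // Picks the FileSystem implementation registered for the URI
            // scheme (hdfs://, file://, ...).
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }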

One thing to watch when building the jar with IntelliJ: do not put the manifest file in the directory the IDE suggests by default, or the generated jar will be missing it and will not run.
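
For reference, the manifest itself only has to name the entry point. A minimal MANIFEST.MF might look as follows; the Main-Class value is an assumption based on the WordCount class above, and the file must end with a newline:

    Manifest-Version: 1.0
    Main-Class: WordCount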