Hadoop Beginner Notes

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 06:21

Environment:
Ubuntu
JDK 8
hadoop-2.6.4

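I. Introduction to Hadoop

Hadoop consists of two parts: HDFS and MapReduce.
HDFS is a distributed file system modeled on Google's GFS. It is highly fault-tolerant and designed to be deployed on low-cost hardware. It also provides high-throughput access to application data, which suits applications that work with very large data sets.
MapReduce is a programming model for processing large collections of semi-structured data. A programming model is a way of approaching and structuring a particular class of problem (much like SQL in relation to Oracle).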

II. Installation

Deploying Hadoop requires the following software:

JDK (1.6 or later for hadoop-2.6.4; these notes use JDK 8), Hadoop, SSH

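1. Installing the JDK
After installing the JDK, set the following environment variables (for example in ~/.bashrc), adjusting JAVA_HOME to the actual JDK path: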
export JAVA_HOME=/usr/lib/jvm/java
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

2. Installing Hadoop
http://hadoop.apache.org/
Download the release package from the Hadoop website, unpack it, and edit four files in the etc/hadoop directory: yarn-site.xml, core-site.xml, hdfs-site.xml, and mapred-site.xml.

core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop-2.6.4/tmp</value>
  </property>
</configuration>
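Because a missing <property> or <configuration> tag is easy to overlook in these files, it can help to check that a file parses before starting Hadoop. The following is only a sketch using the JDK's built-in XML parser, not a Hadoop API; the class name SiteXmlCheck and the method readProperties are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Hadoop): sanity-checks a *-site.xml document.
final class SiteXmlCheck {

    /** Parses Hadoop-style configuration XML and returns name -> value in file order. */
    static Map<String, String> readProperties(String xml) throws Exception {
        // parse() throws if the document is not well-formed (e.g. a missing opening tag)
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList props = doc.getDocumentElement().getElementsByTagName("property");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            NodeList names = p.getElementsByTagName("name");
            NodeList values = p.getElementsByTagName("value");
            if (names.getLength() == 0 || values.getLength() == 0) {
                throw new IllegalArgumentException("property #" + (i + 1) + " lacks <name> or <value>");
            }
            result.put(names.item(0).getTextContent().trim(),
                       values.item(0).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration>"
                + "<property><name>fs.default.name</name><value>hdfs://localhost:9000/</value></property>"
                + "<property><name>hadoop.tmp.dir</name><value>/home/hadoop-2.6.4/tmp</value></property>"
                + "</configuration>";
        // Prints one "name = value" line per property
        readProperties(xml).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

To check a real file, its content could be read into a string first and passed to readProperties; an exception then points at the structural problem.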

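hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- the <value> line is missing from the source notes -->
  </property>
  <property>
    <!-- <name> is missing from the source notes; assumed to be the NameNode storage directory -->
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop-2.6.4/name</value>
  </property>
  <property>
    <!-- <name> is missing from the source notes; assumed to be the DataNode storage directory -->
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop-2.6.4/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>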

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

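3. Formatting the Hadoop file system
Once the configuration is in place, format HDFS with the following command:
bin/hadoop namenode -format
(Before formatting, make sure the hadoop.tmp.dir directory configured in core-site.xml, /home/hadoop-2.6.4/tmp, contains no files.)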
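The precondition above (an empty temporary directory) can also be checked programmatically before formatting. This is a small sketch using only java.nio.file; the class name TmpDirCheck and the method isMissingOrEmpty are invented for this example, and the default path is the one configured in these notes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of Hadoop): checks the directory used as hadoop.tmp.dir.
final class TmpDirCheck {

    /** Returns true if dir does not exist yet or is a directory with no entries. */
    static boolean isMissingOrEmpty(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return true;
        }
        if (!Files.isDirectory(dir)) {
            return false; // a regular file is in the way
        }
        // Files.list must be closed, hence try-with-resources
        try (Stream<Path> entries = Files.list(dir)) {
            return !entries.findAny().isPresent();
        }
    }

    public static void main(String[] args) throws IOException {
        // Path taken from core-site.xml in these notes; pass another path as the first argument
        Path dir = Paths.get(args.length > 0 ? args[0] : "/home/hadoop-2.6.4/tmp");
        System.out.println(dir + (isMissingOrEmpty(dir) ? " is safe to format" : " still contains files"));
    }
}
```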

4. Starting Hadoop:
sbin/start-all.sh

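5. Checking that Hadoop started successfully
Run the jps command to list the running Java processes. Startup succeeded if the following six entries appear (the five Hadoop daemons plus Jps itself):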

$ jps
4258 ResourceManager
3749 NameNode
5944 Jps
4088 SecondaryNameNode
4378 NodeManager
3869 DataNode
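Comparing the jps listing against the expected daemons by eye is error-prone, so the check can be scripted. The sketch below only parses text in the "pid name" format shown above; the class name JpsCheck and its method are hypothetical, and the expected daemon list is simply taken from the listing in these notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which Hadoop daemons are absent from jps output.
final class JpsCheck {

    // The five daemons that appear in the listing above (Jps itself is not a daemon)
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager");

    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<>();
        for (String line : jpsOutput.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                running.add(parts[1]); // parts[0] is the pid, parts[1] the process name
            }
        }
        List<String> missing = new ArrayList<>();
        for (String daemon : EXPECTED) {
            if (!running.contains(daemon)) {
                missing.add(daemon);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String sample = "4258 ResourceManager\n3749 NameNode\n5944 Jps\n"
                + "4088 SecondaryNameNode\n4378 NodeManager\n3869 DataNode\n";
        System.out.println("Missing daemons: " + missingDaemons(sample)); // prints "Missing daemons: []"
    }
}
```

If a daemon is reported missing, its log file under the logs directory of the Hadoop installation is usually the first place to look.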

6. Hadoop can be managed through the web interface at http://localhost:50070

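III. Basic Hadoop commands

List a directory: bin/hadoop dfs -ls /

Create a directory: bin/hadoop dfs -mkdir /input

In Hadoop 2.x the same operations are also available as bin/hdfs dfs -ls / and bin/hdfs dfs -mkdir /input, which avoids the deprecation warning printed by bin/hadoop dfs.

... (further commands can be found through Baidu, Google, or Bing)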

IV. Writing a word-count program with Hadoop

1. How Hadoop computes the result

For example, suppose the file asiainfo.txt contains:
hello asiainfo
asiainfo is big
hello big

Hadoop counts the occurrences of each word in the four steps below. Only the Map and Reduce steps have to be implemented by the programmer:

Map (interface)   Sort            Group              Reduce (interface)
<hello,1>         <asiainfo,1>    <asiainfo,[1,1]>   <asiainfo,2>
<asiainfo,1>      <asiainfo,1>    <big,[1,1]>        <big,2>
<asiainfo,1>      <big,1>         <hello,[1,1]>      <hello,2>
<is,1>            <big,1>         <is,[1]>           <is,1>
<big,1>           <hello,1>
<hello,1>         <hello,1>
<big,1>           <is,1>
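The four steps can be reproduced in plain Java without a cluster, which makes the data flow easier to follow. This is only an in-memory illustration of the idea, not how Hadoop is implemented; the class name WordCountSteps and its methods are invented for this sketch, and a TreeMap stands in for the sort and group steps.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of map -> sort -> group -> reduce.
final class WordCountSteps {

    // Step 1 (map): emit a <word,1> pair for every word of every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Steps 2 and 3 (sort, group): the TreeMap orders the keys and collects each key's values
    static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Step 4 (reduce): sum the values of each key
    static TreeMap<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello asiainfo", "asiainfo is big", "hello big");
        TreeMap<String, List<Integer>> grouped = sortAndGroup(map(lines));
        System.out.println(grouped);         // {asiainfo=[1, 1], big=[1, 1], hello=[1, 1], is=[1]}
        System.out.println(reduce(grouped)); // {asiainfo=2, big=2, hello=2, is=1}
    }
}
```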

Official diagram: [image not available]

2. Implementation

Following the four steps above, the program has to implement only the Map and Reduce steps itself.

package com.hadooptest;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Defines how the file content is split into words
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on whitespace so that every word of the line yields one <word,1> pair
            for (String s : value.toString().trim().split("\\s+")) {
                if (s.isEmpty()) {
                    continue; // blank line
                }
                word.set(s);
                context.write(word, ONE);
            }
        }
    }

    // Defines how the grouped values of each key are processed
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] arguments = new GenericOptionsParser(conf, args).getRemainingArgs();

        // The program takes two arguments: the input directory and the output directory.
        // The output directory must not be created in advance (delete it if it exists).
        if (arguments.length != 2) {
            System.err.println("invalid arguments");
            System.exit(2);
        }

        // Create the job to be scheduled
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);     // Mapper
        job.setReducerClass(WordCountReducer.class);   // Reducer
        job.setCombinerClass(WordCountReducer.class);  // Combiner, same class as the Reducer
        job.setOutputKeyClass(Text.class);             // type of the output key
        job.setOutputValueClass(IntWritable.class);    // type of the output value
        FileInputFormat.addInputPath(job, new Path(arguments[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(arguments[1]));  // output directory

        // Run the job and exit with 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
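The job above passes WordCountReducer to job.setCombinerClass as well. That works for word counting because addition is associative and commutative: summing each mapper's values locally and then summing those partial sums gives the same total as summing everything at once. The plain-Java sketch below illustrates only this property; the class name CombinerSketch and its methods are invented for the example and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why a summing reducer can double as a combiner.
final class CombinerSketch {

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Without a combiner: the reducer receives every single value for the key
    static int withoutCombiner(List<List<Integer>> perMapperValues) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> values : perMapperValues) {
            all.addAll(values);
        }
        return sum(all);
    }

    // With a combiner: each mapper's values are pre-summed, the reducer sums the partial sums
    static int withCombiner(List<List<Integer>> perMapperValues) {
        List<Integer> partialSums = new ArrayList<>();
        for (List<Integer> values : perMapperValues) {
            partialSums.add(sum(values));
        }
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // Values emitted for one key by two mappers (made-up data)
        List<List<Integer>> perMapper = Arrays.asList(
                Arrays.asList(1, 1, 1), Arrays.asList(1, 1));
        System.out.println("without combiner: " + withoutCombiner(perMapper)); // 5
        System.out.println("with combiner:    " + withCombiner(perMapper));    // 5
    }
}
```

A reducer that computes something like an average could not be reused this way, since an average of partial averages is generally not the overall average.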

Run the jar file:

$ bin/hadoop jar ./wordCount.jar /input /output

The results are stored in the /output directory:

$ bin/hadoop dfs -ls /output
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2016-07-03 16:52 /output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 14683 2016-07-03 16:52 /output/part-r-00000

View the output:

$ bin/hadoop dfs -cat /output/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
asiainfo 2
big 2
hello 2
is 1

0 0