Hadoop Streaming Programming Examples
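(1) For a given programming language, how should the Mapper and Reducer be written, and what conventions must they follow?
(2) How do you define a custom Hadoop Counter in Hadoop Streaming?
(3) How do you set a custom status message in Hadoop Streaming to report the job's progress to the user?
(4) How do you print debug logs in Hadoop Streaming, and where can those logs be viewed?
(5) How do you process binary files, not just text files, with Hadoop Streaming?
I have covered Hadoop Streaming in several earlier articles. If you are not yet familiar with it, see "Hadoop Streaming Programming" and "Advanced Hadoop Streaming Programming".
This article addresses the first four questions and gives WordCount examples written in C++ and Shell for reference.
1. C++ WordCount
(1) Mapper implementation (mapper.cpp)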
#include <iostream>
#include <string>
using namespace std;

int main() {
    string key;
    // Read whitespace-separated words from standard input
    while (cin >> key) {
        // Emit "word<TAB>1" as the map output record
        cout << key << "\t" << "1" << endl;
        // Increase the counter named counter_no in group counter_group by 1
        cerr << "reporter:counter:counter_group,counter_no,1\n";
        // Display status
        cerr << "reporter:status:processing......\n";
        // Print a debug log line; it ends up in the task's stderr file
        cerr << "This is log, will be printed in stderr file\n";
    }
    return 0;
}
(2) Reducer implementation (reducer.cpp)
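#include <iostream>
#include <string>
using namespace std;

int main() {
    // The reducer runs as a standalone process, so it needs a main function
    string cur_key, last_key, value;
    // Guard against empty input: with no first record there is nothing to emit
    if (!(cin >> cur_key >> value)) return 0;
    last_key = cur_key;
    int n = 1;
    // Read the map task output, one key/value pair at a time
    while (cin >> cur_key >> value) {
        if (last_key != cur_key) {
            // A new key begins: emit the count for the previous key
            cout << last_key << "\t" << n << endl;
            last_key = cur_key;
            n = 1;
        } else {
            // Same key: count one more value
            n++;
        }
    }
    // Emit the count for the last key
    cout << last_key << "\t" << n << endl;
    return 0;
}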
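(3) Compile and run
Compile the two programs above:
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
Test them locally:
echo "dong xicheng is here now, talk to dong xicheng now" | ./mapper | sort | ./reducer
Note: this test prints the following strings once per input word. Hadoop recognizes the first two as reporter commands, but in a local run they only clutter the terminal, so you may want to comment them out first:
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log, will be printed in stderr file
Once the test passes, submit the job to the cluster with the following script (run_cpp_mr.sh):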
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output
echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper,reducer \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper mapper \
    -reducer reducer
2. Shell WordCount
(1) Mapper implementation (mapper.sh)
#!/bin/bash
while read LINE; do
    for word in $LINE
    do
        # Emit "word<TAB>1"; tab is the default key/value separator
        printf '%s\t1\n' "$word"
        # In streaming, a counter is defined by
        # [reporter:counter:<group>,<counter>,<amount>]
        # Define a counter named counter_no in group counter_group
        # and increase it by 1.
        # Counters must be written to stderr.
        echo "reporter:counter:counter_group,counter_no,1" >&2
        # Status lines use the form [reporter:status:<message>]
        echo "reporter:status:processing......" >&2
        echo "This is log for testing, will be printed in stderr file" >&2
    done
done
(2) Reducer implementation (reducer.sh)
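#!/bin/bash
count=0
started=0
word=""
while read LINE; do
    # The unquoted $LINE collapses tabs to spaces, so the first field is the key
    newword=`echo $LINE | cut -d ' ' -f 1`
    if [ "$word" != "$newword" ]; then
        # A new key begins: emit the count for the previous key.
        # printf is used because echo does not expand \t in every shell.
        [ $started -ne 0 ] && printf '%s\t%s\n' "$word" "$count"
        word=$newword
        count=1
        started=1
    else
        count=$(( $count + 1 ))
    fi
done
# Emit the last key, unless the input was empty
[ $started -ne 0 ] && printf '%s\t%s\n' "$word" "$count"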
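(3) Test and run
Test the two scripts above locally:
echo "dong xicheng is here now, talk to dong xicheng now" | sh mapper.sh | sort | sh reducer.sh
Note: this test prints the following strings once per input word. Hadoop recognizes the first two as reporter commands, but in a local run they only clutter the terminal, so you may want to comment them out first:
reporter:counter:counter_group,counter_no,1
reporter:status:processing......
This is log for testing, will be printed in stderr file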
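Instead of commenting the reporter lines out, a local run can also send stderr somewhere else. The snippet below is a minimal sketch of that idea. It uses a stand-in function demo_map (a hypothetical name that imitates mapper.sh) so that it runs without the script files; run_quiet and run_logged are likewise illustrative names, not part of Hadoop.

```shell
#!/bin/bash
# demo_map is a hypothetical stand-in for mapper.sh: it writes
# "word<TAB>1" to stdout and a reporter line to stderr.
demo_map() {
    while read -r line; do
        for word in $line; do
            printf '%s\t1\n' "$word"
            echo "reporter:counter:counter_group,counter_no,1" >&2
        done
    done
}

# Discard stderr so that only key/value records reach sort.
run_quiet() {
    demo_map 2>/dev/null | sort
}

# Or keep stderr in a file ($1) for later inspection.
run_logged() {
    demo_map 2>"$1" | sort
}

echo "dong xicheng dong" | run_quiet
```

With the real scripts the same redirection would read: sh mapper.sh 2>/dev/null | sort | sh reducer.sh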
Once the test passes, submit the job to the cluster with the following script (run_shell_mr.sh):
#!/bin/bash
HADOOP_HOME=/opt/yarn-client
INPUT_PATH=/test/input
OUTPUT_PATH=/test/output
echo "Clearing output path: $OUTPUT_PATH"
$HADOOP_HOME/bin/hadoop fs -rmr $OUTPUT_PATH

${HADOOP_HOME}/bin/hadoop jar \
    ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.sh,reducer.sh \
    -input $INPUT_PATH \
    -output $OUTPUT_PATH \
    -mapper "sh mapper.sh" \
    -reducer "sh reducer.sh"
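3. Program notes
In Hadoop Streaming, standard input, standard output, and standard error each have their own role. Standard input receives the input data and standard output carries the results, while the meaning of standard error depends on its content:
(1) A line of the form reporter:counter:group,counter,amount increases the Hadoop counter named counter in group group by amount. Hadoop creates the counter the first time it reads it; after that it looks the counter up in the counter table and increases its value.
(2) A line of the form reporter:status:message displays message in the web UI or on the terminal. It can carry any status hint.
(3) Any other content on standard error is treated as a debug log, which Hadoop redirects to the stderr file. Note: each task has three log files: stdout, stderr, and syslog. All are text files. You can view them in the web UI, or log in to the node that ran the task and read them in the corresponding directory.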
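The three cases above can be wrapped in small helper functions so that a streaming script does not repeat the raw strings. This is only a sketch: the function names incr_counter, set_status, and log_debug are hypothetical, and only the line formats come from the rules described above.

```shell
#!/bin/bash
# Hypothetical helper names; only the line formats come from the text.

# Increase counter $2 in group $1 by $3 (defaults to 1).
incr_counter() {
    echo "reporter:counter:$1,$2,${3:-1}" >&2
}

# Report a status message for the running task.
set_status() {
    echo "reporter:status:$1" >&2
}

# Anything else written to stderr is an ordinary debug log line.
log_debug() {
    echo "$*" >&2
}

incr_counter counter_group counter_no
set_status "processing......"
log_debug "This line is an ordinary debug log"
```

Keeping the formats in one place makes it harder to write a malformed line such as reporter:counter:status,processing......, which has too few fields to be a counter update and is not a status line either.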
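One more point deserves attention: by default, the separator between key and value in Map Task output is \t. Hadoop splits key and value on \t in both the Map and Reduce phases and sorts by key, so keeping to this convention matters a great deal. You can, of course, specify a different separator with stream.map.output.field.separator.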
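To make the default rule concrete, the sketch below splits a record at its first tab: everything before it is the key and everything after it is the value, and a line without a tab is treated entirely as the key with an empty value. The function split_kv is a hypothetical illustration written for this article, not Hadoop's own code.

```shell
#!/bin/bash
# split_kv is a hypothetical illustration of the default rule:
# key = text before the first tab, value = text after it.
split_kv() {
    line=$1
    tab=$(printf '\t')
    case "$line" in
        *"$tab"*)
            # %% strips the longest suffix starting at a tab (keeps the key);
            # #  strips the shortest prefix ending at a tab (keeps the value).
            printf 'key=[%s] value=[%s]\n' "${line%%"$tab"*}" "${line#*"$tab"}"
            ;;
        *)
            # No tab: the whole line is the key and the value is empty.
            printf 'key=[%s] value=[]\n' "$line"
            ;;
    esac
}

split_kv "$(printf 'dong\t1')"
split_kv "$(printf 'a\tb\tc')"
split_kv "word 1"
```

The last call shows why a mapper that emits "word 1" with a space still runs but does not do what it appears to: the whole string becomes the key.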
Original article. If you repost it, please credit: reposted from 董的博客 (Dong's Blog)
Permalink: http://dongxicheng.org/mapreduce-nextgen/hadoop-streaming-examples/