Hadoop C++ Word Count


Reposted from: http://www.pluscn.net/?p=789
Hadoop provides two ways to run C++ programs: Hadoop Streaming and Hadoop Pipes.
Streaming approach:
1. First, write the map program (map.cpp)
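#include <string>
#include <iostream>
using namespace std;

int main()
{
    string line;
    // cin >> line reads one whitespace-delimited token at a time.
    // For Chinese text, read whole lines with fgets(char*, int n, stdin)
    // and run word segmentation on them instead.
    while (cin >> line)
    {
        // Emit "word<TAB>1" for every token.
        cout << line << "\t" << 1 << endl;
    }
    return 0;
}

>>g++ -o map map.cpp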
2. Write the reduce program (reduce.cpp)
#include <map>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string key;
    int value;
    map<string, int> word_count;
    map<string, int>::iterator it;
    // Each input record is "word<TAB>count"; add the count to the running
    // total for that word instead of assuming every count is 1.
    while (cin >> key >> value)
    {
        word_count[key] += value;
    }
    // Print the totals as "word<TAB>total", ordered by word.
    for (it = word_count.begin(); it != word_count.end(); ++it)
        cout << it->first << "\t" << it->second << endl;
    return 0;
}

>>g++ -o reduce reduce.cpp
3. Prepare the files to be counted and upload them to Hadoop
File1.txt: hello hadoop helloworld
File2.txt: this is a firsthadoop
>>hadoop fs -put File1.txt File2.txt ans
4. Run the job
>>hadoop jar /data/users/hadoop/hadoop/contrib/streaming/hadoop-streaming-0.20.9.jar  -file map -file reduce -input ans/* -output output1 -mapper /data/name/hadoop_streaming/map -reducer /data/name/hadoop_streaming/reduce
Pipes approach:
1. Write the program (wordcount.cpp)
#include <algorithm>
#include <limits>
#include <string>
#include <vector>
#include "stdint.h"
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
using namespace std;

// Mapper: splits each input line on spaces and emits (word, "1").
class WordCountMapper : public HadoopPipes::Mapper
{
public:
    WordCountMapper(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context)
    {
        string line = context.getInputValue();
        vector<string> word = HadoopUtils::splitString(line, " ");
        for (unsigned int i = 0; i < word.size(); i++)
        {
            context.emit(word[i], HadoopUtils::toString(1));
        }
    }
};

// Reducer: sums the counts received for one word and emits (word, total).
class WordCountReducer : public HadoopPipes::Reducer
{
public:
    WordCountReducer(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context)
    {
        int count = 0;
        while (context.nextValue())
        {
            count += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(count));
    }
};

int main(int argc, char **argv)
{
    return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}
2. Write the makefile
CC = g++
HADOOP_INSTALL = ../../data/users/hadoop/hadoop/
PLATFORM = Linux-amd64-64
CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

wordcount: wordcount.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes -lhadooputils -lpthread -g -O2 -o $@
3. Compile the program and upload it to HDFS
> make wordcount
> hadoop fs -put wordcount name/wordcount
4. Run the program
> hadoop pipes -conf ./job_config.xml -input /user/hadoop/name/input/* -output /user/hadoop/name/output -program /user/hadoop/name/wordcount
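This example is simple: it only counts word frequencies. A MapReduce program can be treated like an ordinary C++ program: put any initialization in the constructor, and put the actual processing in map and reduce.
As for the differences between Streaming and Pipes, they can be summarized as follows: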
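1. Streaming is an API that Hadoop provides so that MapReduce programs can be written in other programming languages. Hadoop itself is written in Java (its author is most comfortable with Java; Lucene and Nutch come from the same author). Hadoop Streaming is not complicated: it simply uses Unix standard input and standard output as the interface between Hadoop and programs written in other languages, so such a program only needs to read its input from standard input and write its output to standard output. On these streams, key and value are separated by a tab, and the Hadoop framework guarantees that the data arriving on the reducer's standard input is sorted by key.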
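2. Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which uses standard input and output (Streaming can of course also be used with C++), Hadoop Pipes uses a socket as the channel between the tasktracker and the map/reduce process, rather than standard I/O or JNI. Hadoop Pipes cannot run in standalone mode, so Hadoop must first be configured in pseudo-distributed mode: Pipes relies on Hadoop's distributed cache, and the distributed cache is only supported while HDFS is running. Unlike the Java interface, keys and values in Hadoop Pipes are STL strings, so developers have to convert data types by hand.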
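3. In essence, Hadoop Pipes and Hadoop Streaming do almost the same thing. Apart from the communication mechanism, the difference is that Pipes exposes Hadoop's counter feature through its API. Compared with native Java code: Java can use any data type that implements the Writable interface as key/value, whereas Pipes and Streaming must go through a string conversion (higher communication overhead and higher storage overhead). Perhaps for this reason, Pipes may eventually be removed from Hadoop. Of course, if the computation is expensive, native Java code may not run as efficiently as C++, and in that case one would probably write Streaming code. Pipes passes data as byte arrays, which fit neatly into std::string; the examples simply convert everything to text for input and output. This leaves it to the programmer to design a sensible input/output layout (how the key/value data is segmented).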
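Point 2 says that Pipes keys and values are STL strings, so numeric data has to be converted by hand. The wordcount example does this with HadoopUtils::toInt and HadoopUtils::toString. The sketch below shows the same round trip using only the C++11 standard library, so it compiles without the Hadoop headers; sum_counts is a hypothetical stand-in for the body of a reduce() call.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the body of a Pipes reduce(): values arrive as
// text, are converted to integers for the arithmetic, and the result is
// converted back to text before it would be emitted.
std::string sum_counts(const std::vector<std::string>& values) {
    int total = 0;
    for (const std::string& v : values) {
        // string -> int; std::stoi throws std::invalid_argument if v is not numeric.
        total += std::stoi(v);
    }
    // int -> string, ready to be passed on as a value.
    return std::to_string(total);
}
```

Every value is parsed and formatted once per record, which is the string-conversion overhead mentioned in point 3.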
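Point 3 observes that Pipes really passes byte arrays wrapped in std::string, and that the layout of those bytes is up to the programmer. As one illustration, a 32-bit count could be carried as four raw bytes instead of decimal text. This is a sketch only: pack_u32 and unpack_u32 are hypothetical helpers, the big-endian layout is an arbitrary choice, and whether the rest of the job (input and output formats) handles binary values correctly is not addressed here.

```cpp
#include <cstdint>
#include <string>

// Packs a 32-bit unsigned integer into exactly 4 bytes, most significant
// byte first. Hypothetical helper; the layout is an assumption of this sketch.
std::string pack_u32(std::uint32_t v) {
    std::string bytes(4, '\0');
    bytes[0] = static_cast<char>((v >> 24) & 0xFF);
    bytes[1] = static_cast<char>((v >> 16) & 0xFF);
    bytes[2] = static_cast<char>((v >> 8) & 0xFF);
    bytes[3] = static_cast<char>(v & 0xFF);
    return bytes;
}

// Inverse of pack_u32. Returns false (leaving out untouched) unless the
// input is exactly 4 bytes long.
bool unpack_u32(const std::string& bytes, std::uint32_t& out) {
    if (bytes.size() != 4) {
        return false;
    }
    std::uint32_t v = 0;
    for (unsigned char c : bytes) {
        v = (v << 8) | c;  // shift in one byte at a time
    }
    out = v;
    return true;
}
```

A fixed-width encoding like this avoids parsing text for every record, at the cost of output that is no longer human-readable.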

0 0