Renmin University Online Judge for Cloud Computing Programming: Solution Report, Problems 1004-1007

Source: Internet. Editor: 程序博客网. Time: 2024/04/28 17:25

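Writing up seven problems in one post made the article too long, so I split it into two parts for easier reading.

Picking up where the previous post left off, let's continue our MapReduce programming journey.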

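1004: Problem

Single Table Join

Description

The input file contains a child-parent table. Write a program that takes this file as input and outputs the grandchild-grandparent relationships contained in the child-parent table.

Input

A file containing a child-parent table.

Output

A file containing the grandchild-grandparent relationships derived from the child-parent table.

Sample input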

child parent
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Terry
Philip Alma
Mark Terry
Mark Alma

Sample output

grandchild  grandparent
Jone        Alice
Jone        Jesse
Tom         Alice
Tom         Jesse
Jone        Mary
Jone        Ben
Tom         Mary
Tom         Ben
Mark        Jesse
Mark        Alice
Philip      Jesse
Philip      Alice

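1004: Approach

A single-table join: this one is rather interesting. Of course, my own skill may be the limiting factor, so my solution is on the complicated side.

First, I define a custom data type, TextPair. I won't go into custom data types here; you can look them up online or read Hadoop: The Definitive Guide, which covers them.

Next: as the input shows, children and parents are listed in the same file, and what we want is the grandchild-grandparent relationship, so a parent can also appear in the child column. To tell the two roles apart correctly, we rely on the custom data type, which pairs each name with a tag: 0 for a child, 1 for a parent.

Here is the code first, with detailed comments inline: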

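import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMapre {

    public static class wordcountMapper extends Mapper<LongWritable, Text, TextPair, TextPair> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Pull the child and the parent out of the line just read.
            StringTokenizer itr = new StringTokenizer(value.toString());
            if (itr.countTokens() < 2) {
                return; // skip blank or malformed lines
            }
            String key1 = itr.nextToken();   // child
            String value1 = itr.nextToken(); // parent
            // Use the custom data type for both key and value.
            // Tag 0 marks a child, tag 1 marks a parent.
            // The row is emitted a second time with child and parent swapped,
            // which makes it easy for the reducer to pair children with grandparents.
            context.write(new TextPair(key1, 0), new TextPair(value1, 1));
            context.write(new TextPair(value1, 1), new TextPair(key1, 0));
        }
    }

    public static class wordcountReduce extends Reducer<TextPair, TextPair, Text, Text> {
        @Override
        public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException {
            // Two lists: the children and the parents of the person named by the key.
            List<String> child = new ArrayList<String>();
            List<String> parent = new ArrayList<String>();
            for (TextPair str : values) {
                // The 0/1 tag tells us directly whether a value is a child or a parent.
                // All values under one key relate to the same person, so the tag-0
                // values are that person's children and the tag-1 values are that
                // person's parents, i.e. the children's grandparents.
                if (str.getSecond().get() == 0) {
                    child.add(str.getFirst().toString());
                } else {
                    parent.add(str.getFirst().toString());
                }
            }
            // One child can have several grandparents, hence the nested loops,
            // with the child in the outer loop. Nothing is written if either list is empty.
            for (String c : child) {
                for (String p : parent) {
                    context.write(new Text(c), new Text(p));
                }
            }
        }
    }

    // The custom data type: a name plus a 0/1 tag.
    public static class TextPair implements WritableComparable<TextPair> {
        private Text first;
        private IntWritable second;

        public TextPair() {
            set(new Text(), new IntWritable());
        }

        public TextPair(String first, int second) {
            set(new Text(first), new IntWritable(second));
        }

        public TextPair(Text first, IntWritable second) {
            set(first, second);
        }

        public void set(Text first, IntWritable second) {
            this.first = first;
            this.second = second;
        }

        public Text getFirst() {
            return first;
        }

        public IntWritable getSecond() {
            return second;
        }

        @Override
        public String toString() {
            return first.toString();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }

        @Override
        public int compareTo(TextPair tp) {
            // Note: sort on first only; the 0/1 tag takes no part in the ordering,
            // so both roles of the same name land in one reduce group.
            return first.compareTo(tp.first);
        }

        // The default HashPartitioner routes keys by hashCode(), so hash on first
        // only, consistent with compareTo; otherwise the job is only correct
        // when it runs with a single reducer.
        @Override
        public int hashCode() {
            return first.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof TextPair && first.equals(((TextPair) o).first);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "SingleJoin");
        job.setJarByClass(MyMapre.class);
        job.setMapOutputKeyClass(TextPair.class);
        job.setMapOutputValueClass(TextPair.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(wordcountMapper.class);
        job.setReducerClass(wordcountReduce.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}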
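
To see the join logic without a cluster, here is a minimal in-memory sketch in plain Java. It is not the submitted program: the class name SingleTableJoinSketch and its helper slot are made up for illustration, and a TreeMap stands in for Hadoop's shuffle. Each name gets two slots, its children (slot 0) and its parents (slot 1), mirroring the 0/1 tags the TextPair values carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the name and structure are assumptions.
final class SingleTableJoinSketch {

    // Self-joins a child-parent table; returns "grandchild grandparent" lines.
    static List<String> grandparents(List<String[]> childParentRows) {
        // Per person: index 0 holds that person's children, index 1 the parents.
        Map<String, List<List<String>>> byPerson = new TreeMap<>();
        for (String[] row : childParentRows) {
            String child = row[0];
            String parent = row[1];
            // "Map" step: file the row under both names, tagged by role.
            slot(byPerson, parent, 0).add(child);
            slot(byPerson, child, 1).add(parent);
        }
        // "Reduce" step: for each person, pair every child with every parent.
        List<String> out = new ArrayList<>();
        for (List<List<String>> lists : byPerson.values()) {
            for (String grandchild : lists.get(0)) {
                for (String grandparent : lists.get(1)) {
                    out.add(grandchild + " " + grandparent);
                }
            }
        }
        return out;
    }

    // Returns the list for (person, tag), creating both slots on first use.
    private static List<String> slot(Map<String, List<List<String>>> m, String person, int tag) {
        List<List<String>> lists = m.computeIfAbsent(person, k -> {
            List<List<String>> fresh = new ArrayList<>();
            fresh.add(new ArrayList<>());
            fresh.add(new ArrayList<>());
            return fresh;
        });
        return lists.get(tag);
    }
}
```

Given the three rows Tom Lucy, Lucy Mary and Lucy Ben, only Lucy has both children and parents, so the sketch yields Tom Mary and Tom Ben.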

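1005: Problem

Multi-table Join

Description

There are two input files. The file named factory contains a table of factory names and their address IDs; the file named address contains a table of address names and their IDs. Write a program that outputs each factory name together with the name of its address.

Input

Two files: the first lists factory names with their address IDs, the second lists address names with their IDs.

Output

A file containing each factory name and its address name.

Sample input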

input:
factory:
factoryname addressID
Beijing Red Star 1
Shenzhen Thunder 3
Guangzhou Honda 2
Beijing Rising 1
Guangzhou Development Bank 2
Tencent 3
Bank of Beijing 1
address:
addressID addressname
1 Beijing
2 Guangzhou
3 Shenzhen
4 Xian

Sample output

output:
factoryname addressname
Bank of Beijing Beijing
Beijing Red Star Beijing
Beijing Rising Beijing
Guangzhou Development Bank Guangzhou
Guangzhou Honda Guangzhou
Shenzhen Thunder Shenzhen
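Tencent Shenzhen

1005: Approach

The idea is much the same as in 1004; if you can solve 1004, 1005 is no trouble.

We again use the custom data type TextPair from 1004: the records we read fall into two kinds (factory rows and address rows), and TextPair tags each record so the two can be told apart.

On to the code, with detailed comments inline: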

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMapre {

    public static class wordcountMapper extends Mapper<LongWritable, Text, Text, TextPair> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // This needs care: a factory name can contain spaces,
            // so the line has to be split with that in mind.
            String id = "";
            StringBuilder value1 = new StringBuilder(); // the name, possibly several tokens
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String str = itr.nextToken();
                if (!str.matches("[0-9]+")) {
                    // A token that is not a number is part of a name.
                    if (value1.length() > 0) {
                        value1.append(' ');
                    }
                    value1.append(str);
                } else {
                    // A numeric token is the ID.
                    id = str;
                    // If a name has already been collected, the ID came last, so this
                    // is a factory row and it is now fully parsed: factory is tag 1.
                    if (value1.length() > 0) {
                        context.write(new Text(id), new TextPair(value1.toString(), 1));
                        return;
                    }
                }
            }
            if (id.isEmpty()) {
                return; // header or blank line: no ID to join on
            }
            // No early return above means the ID came first: an address row, tag 0.
            context.write(new Text(id), new TextPair(value1.toString(), 0));
        }
    }

    public static class wordcountReduce extends Reducer<Text, TextPair, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException {
            // Again two lists, one per kind of record.
            List<String> factor = new ArrayList<String>();
            List<String> address = new ArrayList<String>();
            for (TextPair str : values) {
                if (str.getSecond().get() == 1) {
                    factor.add(str.getFirst().toString());  // 1 = factory
                } else {
                    address.add(str.getFirst().toString()); // 0 = address
                }
            }
            // One address can host several factories, so address is the outer loop.
            // Nothing is written if either list is empty.
            for (String a : address) {
                for (String f : factor) {
                    context.write(new Text(f), new Text(a));
                }
            }
        }
    }

    // The custom data type, same as in 1004.
    public static class TextPair implements WritableComparable<TextPair> {
        private Text first;
        private IntWritable second;

        public TextPair() {
            set(new Text(), new IntWritable());
        }

        public TextPair(String first, int second) {
            set(new Text(first), new IntWritable(second));
        }

        public TextPair(Text first, IntWritable second) {
            set(first, second);
        }

        public void set(Text first, IntWritable second) {
            this.first = first;
            this.second = second;
        }

        public Text getFirst() {
            return first;
        }

        public IntWritable getSecond() {
            return second;
        }

        @Override
        public String toString() {
            return first.toString();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }

        @Override
        public int compareTo(TextPair tp) {
            return first.compareTo(tp.first);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "MultiTableJoin");
        job.setJarByClass(MyMapre.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TextPair.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(wordcountMapper.class);
        job.setReducerClass(wordcountReduce.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
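
The fiddly part of this problem is telling the two kinds of rows apart when factory names contain spaces. As an alternative to walking the tokens one by one, here is a small regex-based sketch in plain Java. It is only an illustration: the class JoinLineParser and the "factory"/"address" labels are invented here, and it assumes an address row always starts with a numeric ID while a factory row always ends with one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the submitted solution.
final class JoinLineParser {
    // Address row: numeric ID first, then the name.
    private static final Pattern ADDRESS = Pattern.compile("^(\\d+)\\s+(.+)$");
    // Factory row: name (may contain spaces) first, then the numeric ID.
    private static final Pattern FACTORY = Pattern.compile("^(.+?)\\s+(\\d+)$");

    // Returns {kind, id, name}, or null for header and blank lines.
    static String[] parse(String line) {
        String s = line.trim();
        Matcher a = ADDRESS.matcher(s);
        if (a.matches()) {
            return new String[] {"address", a.group(1), a.group(2)};
        }
        Matcher f = FACTORY.matcher(s);
        if (f.matches()) {
            return new String[] {"factory", f.group(2), f.group(1)};
        }
        return null;
    }
}
```

Under those assumptions, "Guangzhou Development Bank 2" parses as a factory with ID 2, "1 Beijing" as an address with ID 1, and the header rows come back as null. A factory whose name itself starts with a standalone number would be misread as an address, so real data might need a stricter rule, for example keying off the input file name.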

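1006: Problem

Sum

Description

The input is a set of text files. Each file contains many lines, and each line is a string of digits representing one very large number. Note that the low-order digits come at the start of the string and the high-order digits at the end. Write a program that computes and outputs the sum of all the numbers in the input files.

Input

The input consists of many files; each file has many lines, and each line is a digit string representing one number.

Output

The output is one file. On its first line, the first number is the line label and the second number is the sum of all the numbers in the input files.

Sample input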

input:
file1:
1235546665312
112344569882
326434546462
21346546846
file2:
3654354655
3215456463
21235465463
321265465
65465463
32
file3:
31654
654564564
3541231564
351646846
3164646
3163

Sample output

output:
1 8685932816082

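Notes:
1 There is exactly one output file;
2 The first line of the output file consists of the line label "1" and the sum of all the numbers;
3 Every number is a positive integer or zero. The numbers can be more than 50 digits long, so ordinary numeric types cannot hold them;
4 The low-order digits are on the left of the digit string and the high-order digits on the right. For example, the first line of the first sample input file represents the number 2135666455321.

1006: Approach

1006 comes down to two problems: first, big-number addition; second, funneling all the data into one place.

The first is a standard technique, so I won't dwell on it. For the second: since we need a single grand total, all keys have to end up in one group. We could define a custom grouping (see Hadoop: The Definitive Guide); I'll write that up later if there's a need. Here I use a simple workaround. (If a simple approach works, use the simple approach.)

Let me explain alongside the code: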

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMapre {

    public static class wordcountMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String digits = value.toString().trim();
            if (digits.isEmpty()) {
                return; // skip blank lines
            }
            // Note the key: this is the "simple workaround". Every record gets the
            // same key, so the reduce phase sees all the data in a single group.
            context.write(ONE, new Text(digits));
        }
    }

    public static class wordcountReduce extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String tem = "0"; // the running total; a big number, so kept as a string
            for (Text str : values) {
                // Fetch the next big number and add it in with Sum().
                tem = Sum(tem, str.toString());
            }
            context.write(key, new Text(tem));
        }
    }

    // My big-number addition routine; feel free to write your own.
    // Both operands and the result store the lowest digit first.
    public static String Sum(String a, String b) {
        StringBuilder c = new StringBuilder();
        int jin = 0; // carry
        for (int i = 0; i < a.length() || i < b.length() || jin != 0; i++) {
            int temp = jin;
            if (i < a.length()) {
                temp += a.charAt(i) - '0';
            }
            if (i < b.length()) {
                temp += b.charAt(i) - '0';
            }
            c.append(temp % 10); // digit for this position
            jin = temp / 10;     // carry into the next, higher position
        }
        return c.toString();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Sum");
        job.setJarByClass(MyMapre.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(wordcountMapper.class);
        job.setReducerClass(wordcountReduce.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
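
A handy way to cross-check a hand-written big-number routine locally is java.math.BigInteger. The sketch below is a different technique from the digit-by-digit addition in the solution, not a copy of it: it reverses each low-digit-first string, lets BigInteger do the arithmetic, and reverses the total back. The class name BigIntegerSumSketch is made up, and I'm assuming every non-blank line consists of decimal digits only.

```java
import java.math.BigInteger;
import java.util.List;

// Hypothetical cross-check helper, not part of the submitted solution.
final class BigIntegerSumSketch {

    // Sums numbers written lowest digit first; the result uses the same digit order.
    static String sum(List<String> littleEndianNumbers) {
        BigInteger total = BigInteger.ZERO;
        for (String line : littleEndianNumbers) {
            String digits = line.trim();
            if (digits.isEmpty()) {
                continue; // ignore blank lines
            }
            // Reverse into the usual highest-digit-first form before parsing.
            String normal = new StringBuilder(digits).reverse().toString();
            total = total.add(new BigInteger(normal));
        }
        // Reverse the decimal text of the total back to lowest digit first.
        return new StringBuilder(total.toString()).reverse().toString();
    }
}
```

For instance, "99" and "1" stand for 99 and 1, so the sum is 100, which comes back as "001". One difference to keep in mind: BigInteger drops high-order zeros, so an input such as "100" (the number 1 with two padding zeros) comes back as "1".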

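1007: Problem

WordCount Plus

Description

The WordCount example reads text files and counts how often each word occurs. Now there is a WordCount 2.0: in this version you must handle input files containing characters such as "/.',"{}[]:;" and so on. When tokenizing, "declare," should become "declare", "Hello!" should become "Hello", and "can't" should stay "can't".

Input

A text file containing many words.

Output

A text file in which each line contains a word and the number of times it occurs across all input files. Words in the output file are sorted in dictionary order.

Sample input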

input1:
hello world, bye world.
input2:
hello hadoop, bye hadoop!

Sample output

bye 2
hadoop 2
hello 2
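world 2

1007: Approach

1007 is mostly about filtering characters, which we can do with a regular expression. Nothing hard here.

Again, let's talk it through alongside the code: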

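import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMapre {

    public static class wordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        // Regular expression matching every character except letters, digits and "'".
        // (\w also lets the underscore through.)
        private String pattern = "[^\\w']";

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().toLowerCase();
            // Replace the characters to be filtered with spaces.
            line = line.replaceAll(pattern, " ");
            // Split into words.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), one);
            }
        }
    }

    public static class wordcountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Nothing special here: same as plain WordCount.
            int sum = 0;
            for (IntWritable str : values) {
                sum += str.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Plus");
        job.setJarByClass(MyMapre.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(wordcountMapper.class);
        job.setReducerClass(wordcountReduce.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}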
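
The filtering step is easy to try out on its own before submitting a job. Below is a minimal plain-Java sketch of the same idea: lowercase the line, replace everything except word characters and the apostrophe with spaces, then split on whitespace. The class WordCountPlusSketch is made up for this illustration, and a TreeMap stands in for the sorted reducer output.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the MapReduce job.
final class WordCountPlusSketch {

    // Counts words across the given lines; the TreeMap keeps keys in dictionary order.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Keep letters, digits, underscore and apostrophe; blank out the rest.
            String cleaned = line.toLowerCase().replaceAll("[^\\w']", " ");
            StringTokenizer itr = new StringTokenizer(cleaned);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling count("hello world, bye world.", "hello hadoop, bye hadoop!") gives {bye=2, hadoop=2, hello=2, world=2}, matching the sample output, and "can't" survives as a single token.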

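Finally done. These are of course just my own approaches; if anyone has a better idea, please share it so we can all enjoy it. All of the programs above were accepted by the judge.

Naturally, I can't rule out oversights or mistakes in my programs (the test data may not be comprehensive). If you spot any, I'd be very grateful to hear about them.

One last note: I pulled the programs straight from the site's submission archive, so the formatting isn't pretty. My apologies.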

0 0