Hadoop Study Notes: Inverted Index

Source: Internet | Published by: 文具淘宝店铺 | Editor: 程序博客网 | Time: 2024/06/01 19:21

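An inverted index is a data structure commonly used in document search systems. It stores a mapping from each term to its locations in one or more documents. Typically, an inverted index consists of terms and, for each term, a list of the documents that contain it, as shown in Table 1.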

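Table 1: Inverted index

Word      Document list
Word 1    Document 1, Document 2, Document 3
Word 2    Document 2, Document 4, Document 5
Word 3    Document 3, Document 5, Document 6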

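Table 1 shows that Word 1 appears in {Document 1, Document 2, Document 3}, Word 2 appears in {Document 2, Document 4, Document 5}, and Word 3 appears in {Document 3, Document 5, Document 6}.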
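The structure in Table 1 can be sketched in memory as a map from each word to the set of documents that contain it. This is only an illustration under simple assumptions: the class name `SimpleInvertedIndex`, the method `build`, and the sample documents are hypothetical, and words are split on whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class SimpleInvertedIndex {
    // Builds word -> sorted set of document names from (document name -> text).
    static Map<String, SortedSet<String>> build(Map<String, String> docs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token; skip it
                }
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical sample documents, not taken from the article.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "apple banana");
        docs.put("doc2", "banana cherry");
        System.out.println(build(docs));
    }
}
```

A sorted map and sorted sets are used here only so that the printed result is stable; a hash-based map would work equally well for lookups.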

In practice, each document in the list also needs a weight that indicates how relevant the term is to that document, as shown in Table 2.

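Table 2: Inverted index with weights

Word      Document list
Word 1    Document 1 (weight), Document 2 (weight), Document 3 (weight)
Word 2    Document 2 (weight), Document 4 (weight), Document 5 (weight)
Word 3    Document 3 (weight), Document 5 (weight), Document 6 (weight)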

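The weight here is usually the term frequency, that is, the number of times the term occurs in the document. More sophisticated weights, such as TF-IDF, can also be used.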
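To make the two kinds of weight concrete, here is a minimal sketch. It assumes one common TF-IDF variant, tf * ln(N / df), where N is the number of documents and df is the number of documents containing the term; other variants exist, and the article does not say which one it has in mind. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfSketch {
    // Term frequency: how many times `term` occurs in the tokenized document.
    static int termFrequency(String term, List<String> docTokens) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Assumed variant: tf * ln(N / df). Returns 0 when no document contains the term.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        int df = 0;
        for (List<String> d : corpus) {
            if (d.contains(term)) {
                df++;
            }
        }
        if (df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) corpus.size() / df);
        return termFrequency(term, doc) * idf;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("hello", "world");
        List<String> doc2 = Arrays.asList("hello", "hadoop");
        List<List<String>> corpus = Arrays.asList(doc1, doc2);
        System.out.println(termFrequency("hello", doc1));
        System.out.println(tfIdf("hello", doc1, corpus)); // term in every document: idf is ln(1) = 0
        System.out.println(tfIdf("world", doc1, corpus)); // term in one of two documents: ln(2)
    }
}
```

With this variant, a term that occurs in every document gets weight zero, which is why TF-IDF tends to rank distinctive terms above common ones.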

This example uses term frequency as the weight and implements the inverted index with MapReduce.

For example:

Suppose there are two files, 1.txt and 2.txt, with the following contents:

1.txt

hello world

2.txt

hello hadoop

The corresponding inverted index is:

"hello": 1.txt,1;2.txt,1

"world": 1.txt,1

"hadoop": 2.txt,1

The MapReduce implementation proceeds as follows:

1. Map

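An inverted index needs three pieces of information: the term, the source document, and the term frequency.

So in the Map stage we need to produce: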

1.txt

hello world -------->> hello 1.txt 1

-------->> world 1.txt 1

2.txt

hello hadoop -------->> hello 2.txt 1

-------->> hadoop 2.txt 1

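The map result here has three values, but a <key,value> pair holds only two. To keep things simple, we do not define a custom Hadoop data type; instead, we merge two of the values into one.

The term and the file name are joined into the key, so <"hello:1.txt", "1"> is emitted as the <key,value> pair to the Combine stage.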
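The Map step described above can be tried without a cluster. The following is a plain-Java sketch, not Hadoop code: the class `MapStageSketch` and its `map` method are hypothetical stand-ins that return the emitted pairs as a list instead of writing them to an `OutputCollector`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

final class MapStageSketch {
    // Emits one ("word:fileName", "1") pair per whitespace-separated token.
    static List<Map.Entry<String, String>> map(String fileName, String line) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken() + ":" + fileName, "1"));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("1.txt", "hello world"));
        System.out.println(map("2.txt", "hello hadoop"));
    }
}
```

Putting the file name into the key is what lets the framework group all occurrences of one word in one file together before they are counted.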

2. Combine

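The Combine stage sums the values that share the same key, which gives the frequency of a term within a single document.

Besides summing, it also splits the key and merges the source document with the term frequency, as shown below.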

hello:1.txt,1 -------->> hello 1.txt:1

world:1.txt,1 -------->> world 1.txt:1

hello:2.txt,1 -------->> hello 2.txt:1

hadoop:2.txt,1 -------->> hadoop 2.txt:1

<hadoop, 2.txt:1> is then emitted as the <key,value> pair.
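The Combine step can likewise be sketched in plain Java. The names `CombineStageSketch` and `combine` are hypothetical, and the sketch splits the key at the last ':' on the assumption that file names do not contain a colon while tokens might.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class CombineStageSketch {
    // Sums the counts for one "word:fileName" key and re-keys the result by word.
    static Map.Entry<String, String> combine(String compositeKey, List<String> counts) {
        int sum = 0;
        for (String count : counts) {
            sum += Integer.parseInt(count);
        }
        // Assumption: the file name contains no ':', so the last ':' is the separator.
        int splitIndex = compositeKey.lastIndexOf(':');
        String word = compositeKey.substring(0, splitIndex);
        String fileName = compositeKey.substring(splitIndex + 1);
        return new SimpleEntry<>(word, fileName + ":" + sum);
    }

    public static void main(String[] args) {
        // "hello" occurring twice in 1.txt would arrive as two "1" values.
        System.out.println(combine("hello:1.txt", Arrays.asList("1", "1")));
    }
}
```

Note that this step changes the key. In a real job that only works if the combiner runs exactly once for each key, which Hadoop does not promise, so the approach is best treated as a teaching simplification.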

3. Reduce

In Reduce, we only need to concatenate the values that share the same key.

For example: hello 1.txt:1;2.txt:1
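Putting the three stages together, the whole flow can be simulated in memory for the two sample files. This is a sketch under stated assumptions rather than the Hadoop job itself: `InvertedIndexPipelineSketch` and `run` are hypothetical names, grouping is done with sorted maps instead of a shuffle, and the order of postings within a word is fixed here even though a real job does not guarantee it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class InvertedIndexPipelineSketch {
    // files: file name -> file contents. Returns word -> "file:count;file:count".
    static Map<String, String> run(Map<String, String> files) {
        // Map + Combine: count occurrences per "word:fileName" composite key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            StringTokenizer itr = new StringTokenizer(file.getValue());
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken() + ":" + file.getKey(), 1, Integer::sum);
            }
        }
        // Re-key: split the composite key and group "fileName:count" postings by word.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int splitIndex = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, splitIndex);
            String posting = e.getKey().substring(splitIndex + 1) + ":" + e.getValue();
            grouped.computeIfAbsent(word, w -> new ArrayList<>()).add(posting);
        }
        // Reduce: join each word's postings with ';'.
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            index.put(e.getKey(), String.join(";", e.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("1.txt", "hello world");
        files.put("2.txt", "hello hadoop");
        for (Map.Entry<String, String> e : run(files).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the two sample files this yields one posting for "hadoop" and "world" and two postings for "hello", matching the index shown earlier in the article.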

Code:


import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex2 extends Configured implements Tool {

    // Map: emit <"word:fileName", "1"> for every token in the input line.
    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private final Text keyInfo = new Text();
        private final Text valueInfo = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // The input split tells us which file the current line came from.
            FileSplit split = (FileSplit) reporter.getInputSplit();
            String fileName = split.getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                keyInfo.set(itr.nextToken() + ":" + fileName);
                valueInfo.set("1");
                output.collect(keyInfo, valueInfo);
            }
        }
    }

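    // Combine: sum the counts for one "word:fileName" key, then re-key by word
    // and emit <word, "fileName:count">.
    // Caveat: Hadoop may run a combiner zero, one, or several times, so this
    // step assumes it runs exactly once for each key.
    public static class Combine extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private final Text info = new Text();

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += Integer.parseInt(values.next().toString());
            }
            // Split at the last ':' so that a token containing ':' stays intact.
            String composite = key.toString();
            int splitIndex = composite.lastIndexOf(':');
            info.set(composite.substring(splitIndex + 1) + ":" + sum);
            key.set(composite.substring(0, splitIndex));
            output.collect(key, info);
        }
    }
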
    // Reduce: join all "fileName:count" values of one word with ';'.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private final Text result = new Text();

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            StringBuilder postings = new StringBuilder();
            while (values.hasNext()) {
                if (postings.length() > 0) {
                    postings.append(';'); // separator only between entries, no trailing ';'
                }
                postings.append(values.next().toString());
            }
            result.set(postings.toString());
            output.collect(key, result);
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("Usage: InvertedIndex2 INPUT1 INPUT2 OUTPUT");
            return 2;
        }
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, InvertedIndex2.class);
        // The first two arguments are input paths; the third is the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setJobName("InvertedIndex2");
        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Blocks until the job finishes; throws IOException if the job fails.
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new InvertedIndex2(), args);
        System.exit(res);
    }
}

Run results:

Reference:
《hadoop-开启通向云计算的捷径》 (刘鹏)
