Using SequenceFile in Hadoop

Source: Internet | Published by: 冒险岛夜光法师v矩阵 | Editor: 程序博客网 | Time: 2024/04/30 01:38

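1. Some applications need a specialized data structure to hold their data. For MapReduce-based processing, putting each blob of binary data into its own file does not scale well, so Hadoop provides a set of higher-level containers for this situation. SequenceFile is one of them.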

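2. Consider a log file in which each log record is one line of text. If you want to log binary types, plain text is not a suitable format. Hadoop's SequenceFile class fits this case well, because it provides a persistent data structure for binary key/value pairs. To use it as a log file format, you choose the key yourself, for example a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged. SequenceFile also works as a container for small files. HDFS and MapReduce are optimized for large files, so packing small files into a SequenceFile makes storing and processing them more efficient.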
3. The SequenceFile class has two main inner classes, SequenceFile.Writer and SequenceFile.Reader.
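SequenceFile.Writer
The static createWriter() method creates a SequenceFile and returns a SequenceFile.Writer instance. It has several overloaded versions, but all of them require you to specify where to write (either an FSDataOutputStream, or a FileSystem object together with a Path object), a Configuration object, and the key and value types. Optional arguments include the compression type and its codec, a Progressable callback that is notified of write progress, and a Metadata instance to be stored in the SequenceFile header. The keys and values stored in a SequenceFile do not have to be Writable types. Any type that can be serialized and deserialized by a Serialization implementation may be used. Once you have a SequenceFile.Writer, you append key/value pairs to the end of the file with the append() method.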
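The paragraph above says that keys and values are usually Writable types. As a rough illustration of what that contract involves, here is a minimal sketch that uses only the JDK. `MiniWritable` and `Reading` are hypothetical names invented for this example. The two method names mirror Hadoop's Writable interface, but nothing here depends on Hadoop, and the byte layout is only what `DataOutputStream` produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableSketch {

    /** Hypothetical stand-in for the Writable contract: write yourself, read yourself back. */
    public interface MiniWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical log value: the name of a measured quantity and its reading. */
    public static class Reading implements MiniWritable {
        public String name = "";
        public int amount;

        public Reading() { }

        public Reading(String name, int amount) {
            this.name = name;
            this.amount = amount;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(name);   // 2-byte length prefix followed by the encoded string
            out.writeInt(amount); // 4 bytes, big-endian
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            amount = in.readInt();
        }
    }

    /** Serializes a value into a byte array. */
    public static byte[] serialize(MiniWritable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Serializes a Reading and reads it back into a fresh, empty instance. */
    public static Reading roundTrip(Reading original) {
        Reading copy = new Reading();
        try {
            copy.readFields(new DataInputStream(new ByteArrayInputStream(serialize(original))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copy;
    }

    public static void main(String[] args) {
        Reading copy = roundTrip(new Reading("disk.errors", 7));
        System.out.println(copy.name + " = " + copy.amount);
    }
}
```

The point of the sketch is that the type itself decides its binary form, and an empty instance can be refilled from a stream. That refilling is what makes it possible to reuse one key object and one value object across many records.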
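SequenceFile.Reader
To create a SequenceFile.Reader, call its constructor SequenceFile.Reader(FileSystem fs, Path file, Configuration conf). Reading a sequence file from beginning to end means creating a SequenceFile.Reader instance and then calling next() repeatedly to iterate over the records. Which form of next() you call depends on the serialization framework in use. With Writable types, the next() method that takes a key and a value as arguments reads the next key/value pair in the stream into those variables:
public boolean next(Writable key, Writable val)
It returns true if a key/value pair was read and false at the end of the file. Sample code follows: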

import java.io.IOException;
import java.net.URI;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class sequence {
    public static final String Output_path = "/home/hadoop/test/A.txt";
    public static Random random = new Random();
    private static final String[] DATA = {
        "One,two,buckle my shoe",
        "Three,four,shut the door",
        "Five,six,pick up sticks",
        "Seven,eight,lay them straight",
        "Nine,ten,a big fat hen"
    };
    public static Configuration conf = new Configuration();

    // Writes one (Text, IntWritable) record per entry of DATA.
    public static void write(String pathStr) throws IOException {
        Path path = new Path(pathStr);
        FileSystem fs = FileSystem.get(URI.create(pathStr), conf);
        // This overload is deprecated in Hadoop 2+ in favor of the
        // Writer.Option-based createWriter, but it still works.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
        try {
            Text key = new Text();
            IntWritable value = new IntWritable();
            for (int i = 0; i < DATA.length; i++) {
                key.set(DATA[i]);
                value.set(random.nextInt(10));
                System.out.println(key);
                System.out.println(value);
                // Current position in the file, before this record is appended.
                System.out.println(writer.getLength());
                writer.append(key, value);
            }
        } finally {
            writer.close();
        }
    }

    // Reads every record back, reusing the same key and value objects.
    public static void read(String pathStr) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(pathStr), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(pathStr), conf);
        try {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key);
                System.out.println(value);
            }
        } finally {
            reader.close(); // the original code leaked the reader
        }
    }

    public static void main(String[] args) throws IOException {
        write(Output_path);
        read(Output_path);
    }
}
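To make the append-then-iterate pattern easier to see without a Hadoop installation, here is a simplified stand-in written against the JDK only. It stores the kind of log records described earlier (a long timestamp as key, an int quantity as value) in a flat binary layout. This is not the real SequenceFile on-disk format, which also carries a header and sync markers. The class name `KeyValueLogSketch` and all of its members are assumptions made for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class KeyValueLogSketch {

    /** Appends (timestamp, count) records; position() loosely mirrors Writer.getLength(). */
    public static class Writer {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);

        public void append(long timestamp, int count) {
            try {
                out.writeLong(timestamp); // 8-byte key
                out.writeInt(count);      // 4-byte value
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        /** Number of bytes written so far. */
        public long position() {
            return out.size();
        }

        public byte[] toByteArray() {
            return bytes.toByteArray();
        }
    }

    /** Reads records one at a time into the reusable key and value fields. */
    public static class Reader {
        private final DataInputStream in;
        public long key;
        public int value;

        public Reader(byte[] data) {
            in = new DataInputStream(new ByteArrayInputStream(data));
        }

        /** Returns true if a record was read, false once the data is exhausted. */
        public boolean next() {
            try {
                if (in.available() == 0) {
                    return false;
                }
                key = in.readLong();
                value = in.readInt();
                return true;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Drains a byte array with the same while(next()) loop used in the Hadoop example. */
    public static List<String> readAll(byte[] data) {
        List<String> records = new ArrayList<>();
        Reader reader = new Reader(data);
        while (reader.next()) {
            records.add(reader.key + "\t" + reader.value);
        }
        return records;
    }

    public static void main(String[] args) {
        Writer writer = new Writer();
        long base = 1_700_000_000_000L; // arbitrary example timestamp
        for (int i = 0; i < 3; i++) {
            System.out.println("offset before append: " + writer.position());
            writer.append(base + i, i * 10);
        }
        readAll(writer.toByteArray()).forEach(System.out::println);
    }
}
```

Each record here is exactly 12 bytes, so the offsets printed before each append grow in steps of 12. In a real SequenceFile the steps vary with the record size and the file header.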

To read and write SequenceFiles in MapReduce, use SequenceFileInputFormat and SequenceFileOutputFormat. Sample code follows:

1) Output format: SequenceFileOutputFormat

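import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SequenceFileOutputFormatDemo extends Configured implements Tool {

    // Identity mapper: passes each (offset, line) pair straight through.
    public static class SequenceFileOutputFormatDemoMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        int nRet = ToolRunner.run(new Configuration(),
                new SequenceFileOutputFormatDemo(), args);
        System.out.println(nRet);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "sequence file output demo ");
        job.setJarByClass(SequenceFileOutputFormatDemo.class);
        FileInputFormat.addInputPaths(job, args[0]);
        // HdfsUtil is a helper class that is not shown in this post;
        // presumably it removes an existing output directory.
        HdfsUtil.deleteDir(args[1]);
        job.setMapperClass(SequenceFileOutputFormatDemoMapper.class);
        // There is no reducer, so the map output is the final job output and the
        // output key/value classes must match what the mapper emits.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Set the number of reduce tasks to 0 for a map-only job.
        job.setNumReduceTasks(0);
        // Set the output format class.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // A SequenceFile can be written in several ways. Pick one of the
        // options below and leave the others commented out.
        // Option 1: no compression
        SequenceFileOutputFormat.setOutputCompressionType(job,
                CompressionType.NONE);

        // Option 2: record compression, with a codec (default, gzip, ...)
        // SequenceFileOutputFormat.setOutputCompressionType(job,
        //         CompressionType.RECORD);
        // SequenceFileOutputFormat.setOutputCompressorClass(job,
        //         DefaultCodec.class);

        // Option 3: block compression, with a codec (default, gzip, ...)
        // SequenceFileOutputFormat.setOutputCompressionType(job,
        //         CompressionType.BLOCK);
        // SequenceFileOutputFormat.setOutputCompressorClass(job,
        //         DefaultCodec.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
        int result = job.waitForCompletion(true) ? 0 : 1;
        return result;
    }
}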
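The difference between the RECORD and BLOCK options above comes down to how much data the codec sees at once. The following sketch, which uses only `java.util.zip.Deflater` from the JDK, compresses the same made-up records one by one and then as a single batch. It is an illustration of granularity, not of how SequenceFile lays out its compressed blocks. The record contents and the class name `CompressionGranularitySketch` are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class CompressionGranularitySketch {

    /** Deflates one byte array and returns the compressed length in bytes. */
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds n short, similar-looking records (made-up log lines). */
    public static List<byte[]> sampleRecords(int n) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String line = "GET /index.html status=200 request=" + i;
            records.add(line.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    /** Record-style: every record is compressed on its own. */
    public static int perRecordSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += deflatedSize(record);
        }
        return total;
    }

    /** Block-style: records are concatenated first, then compressed together. */
    public static int perBlockSize(List<byte[]> records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] record : records) {
            block.write(record, 0, record.length);
        }
        return deflatedSize(block.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(200);
        System.out.println("per-record total: " + perRecordSize(records) + " bytes");
        System.out.println("single block:     " + perBlockSize(records) + " bytes");
    }
}
```

Because the batch lets the compressor exploit repetition across records, the block total comes out smaller for data like this. That is the usual reason to prefer CompressionType.BLOCK when records are small and similar.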

2) Input format: SequenceFileInputFormat

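import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SequenceFileInputFormatDemo extends Configured implements Tool {

    // The input key/value types must match the types stored in the SequenceFile
    // (here LongWritable/Text, as written by the output demo above).
    public static class SequenceFileInputFormatDemoMapper extends
            Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Only prints each record; nothing is emitted.
            System.out.println("key: " + key.toString() + " ; value: "
                    + value.toString());
        }
    }

    public static void main(String[] args) throws Exception {
        int nRet = ToolRunner.run(new Configuration(),
                new SequenceFileInputFormatDemo(), args);
        System.out.println(nRet);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "sequence file input demo");
        job.setJarByClass(SequenceFileInputFormatDemo.class);
        FileInputFormat.addInputPaths(job, args[0]);
        // HdfsUtil is a helper class that is not shown in this post;
        // presumably it removes an existing output directory.
        HdfsUtil.deleteDir(args[1]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(SequenceFileInputFormatDemoMapper.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setMapOutputKeyClass(Text.class);
        // Was Text.class, which did not match the mapper's declared output value type.
        job.setMapOutputValueClass(NullWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        int result = job.waitForCompletion(true) ? 0 : 1;
        return result;
    }
}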

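Alternatively, the file can be read directly inside the mapper with SequenceFile.Reader, as in the code below. In that case the job no longer uses SequenceFileInputFormat or SequenceFileOutputFormat: input and output paths are set through the ordinary FileInputFormat and FileOutputFormat, and the job runs with the default input and output formats. Sample code follows: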

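import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class MapReduceReadFile {

    // Hypothetical configuration key used to hand the file name to the mapper.
    private static final String SEQ_FILE_KEY = "demo.seqfile.path";

    public static class ReadFileMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {

        private SequenceFile.Reader reader;
        private Configuration conf;

        // Open the reader inside the task. A static field assigned in main() lives
        // in the client JVM and would be null in map tasks running on a cluster.
        @Override
        protected void setup(Context context) throws IOException {
            conf = context.getConfiguration();
            Path path = new Path(conf.get(SEQ_FILE_KEY));
            reader = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The incoming key/value come from the default text input format and are
            // ignored; fresh objects are created for the SequenceFile records instead.
            key = (LongWritable) ReflectionUtils.newInstance(
                    reader.getKeyClass(), conf);
            value = (Text) ReflectionUtils.newInstance(
                    reader.getValueClass(), conf);
            // The first call drains the whole file; later calls find nothing left.
            // Every map task reads the entire file, so the output is only sensible
            // when the job runs a single map task.
            while (reader.next(key, value)) {
                System.out.printf("%s\t%s\n", key, value);
                context.write(key, value);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Must be set before the Job is created, because Job copies the configuration.
        conf.set(SEQ_FILE_KEY, "logfile2");
        Job job = new Job(conf, "read seq file");
        job.setJarByClass(MapReduceReadFile.class);
        job.setMapperClass(ReadFileMapper.class);
        job.setMapOutputValueClass(Text.class);
        Path path = new Path("logfile2");
        FileInputFormat.addInputPath(job, path);
        FileOutputFormat.setOutputPath(job, new Path("result"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}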

1 0