The Java Programmer's Road to Big Data (7): File-Based Data Structures

Source: Internet. Published by: 微老板软件. Editor: 程序博客网. Time: 2024/06/07 20:11

SequenceFile

Introduction

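In a log file, each record is one line of text. Plain text is a poor fit when the records need to hold binary data. Hadoop's SequenceFile class suits that case: it provides a persistent data structure for binary key-value pairs. SequenceFile also works well as a container for small files. HDFS and MapReduce are optimized for large files, so packing small files into a SequenceFile makes storing and processing them more efficient.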
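To make the "container for small files" idea concrete, here is a minimal sketch in plain Java. It is not the SequenceFile on-disk format and does not use the Hadoop API; the class name SmallFileContainer and its pack/unpack methods are hypothetical, chosen only for illustration. It shows the general shape of the technique: many small binary payloads appended to one stream as key-value records (file name as key, raw bytes as value).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration only: this is NOT the on-disk SequenceFile format.
final class SmallFileContainer {

    // Pack (file name -> binary content) pairs into one length-prefixed byte stream.
    static byte[] pack(Map<String, byte[]> smallFiles) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : smallFiles.entrySet()) {
                out.writeUTF(e.getKey());           // key: file name
                out.writeInt(e.getValue().length);  // length of the value
                out.write(e.getValue());            // value: raw bytes
            }
        }
        return bytes.toByteArray();
    }

    // Read the records back in the order they were appended.
    static Map<String, byte[]> unpack(byte[] container) throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                result.put(name, content);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.bin", new byte[] {0, 1, 2});
        files.put("b.bin", new byte[] {(byte) 0xFF, 0x00});
        Map<String, byte[]> restored = unpack(pack(files));
        System.out.println(restored.keySet());
    }
}
```

The real SequenceFile adds things this sketch leaves out, such as a header and the sync markers that the reader in the demo below reports through syncSeen().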

Reading and writing

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDemo {
    public static final String[] DATA = {
            "The only thing you really have in your life is time.",
            "And if you invest that time in yourself,",
            "to have great experiences that are going to enrich you,",
            "then you can’t possibly lose.",
            "– Steve Jobs, Entrepreneur"
    };

    // Writes 100 records with descending IntWritable keys (100..1) and Text values.
    private static void write(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // Print the file offset at which this record is about to be written.
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }

    // Reads every record back; the key and value classes come from the file header.
    private static void read(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // Mark records that were preceded by a sync point with "*".
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // start of the next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // Uncomment to create the file before reading it.
        // write(args[0]);
        read(args[0]);
    }
}

Notes

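Keys and values stored in a SequenceFile do not have to be Writable types. Any type that a Serialization implementation can serialize and deserialize may be used. With Writable types, the next() method that takes a key and a value as arguments reads the next key-value pair from the stream into those variables; otherwise, use these methods: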

public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException

MapFile

Introduction

A MapFile is a sorted SequenceFile with an added index that supports lookups by key. You can think of a MapFile as a persistent form of java.util.Map.
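The sketch below illustrates why sorted data plus an index makes key lookups cheap. It is plain Java, not the Hadoop MapFile API: the class SparseIndexLookup, its fields, and the interval parameter are hypothetical names, and in-memory arrays stand in for the data and index files. The index holds only every interval-th key; a lookup binary-searches that small index and then scans forward through the sorted data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: an in-memory stand-in for a MapFile's data and index.
final class SparseIndexLookup {
    final int[] keys;       // sorted keys, standing in for the sorted data file
    final String[] values;  // values[i] belongs to keys[i]
    final List<Integer> indexKeys = new ArrayList<>();      // every interval-th key
    final List<Integer> indexPositions = new ArrayList<>(); // its position in the data

    SparseIndexLookup(int[] sortedKeys, String[] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexKeys.add(sortedKeys[i]);
            indexPositions.add(i);
        }
    }

    // Binary-search the small index, then scan forward in the data from there.
    String get(int key) {
        int lo = 0, hi = indexKeys.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid) <= key) {
                start = indexPositions.get(mid); // best candidate so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // key is smaller than the first key in the data
        }
        for (int i = start; i < keys.length && keys[i] <= key; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        int[] keys = new int[100];
        String[] values = new String[100];
        for (int i = 0; i < 100; i++) {
            keys[i] = i + 1; // keys 1..100 in ascending order
            values[i] = "value-" + (i + 1);
        }
        SparseIndexLookup file = new SparseIndexLookup(keys, values, 16);
        System.out.println(file.get(42));
        System.out.println(file.get(500));
    }
}
```

Keeping only a fraction of the keys in the index is a trade-off: the index stays small enough to hold in memory, at the cost of a short sequential scan on each lookup.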

Reading and writing

Reading and writing a MapFile works much like reading and writing a SequenceFile; see "File-Based Data Structures: About MapFile".

Converting a SequenceFile to a MapFile

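Searching a MapFile amounts to searching a sorted, indexed SequenceFile. Calling the static fix() method on a MapFile rebuilds its index. fix() is normally used to rebuild a corrupted index, but because it builds a new index from scratch, it can also be used to turn a sorted SequenceFile into a MapFile.
Steps:
1. Sort the sequence file.
2. Rename the MapReduce output to the data file.
3. Create the index file.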
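The last step, building an index over data that is already sorted, can be sketched in plain Java as follows. This is not Hadoop's fix() implementation or its file layout; the class IndexRebuildSketch, the record encoding (an int key followed by a UTF string value), and the interval parameter are assumptions made for illustration. The sketch scans the data once, checks that keys are in ascending order, and records the key and byte offset of every interval-th record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: not Hadoop's actual data/index file layout.
final class IndexRebuildSketch {

    // Serialize sorted (int key, String value) records, standing in for the data file.
    static byte[] writeData(int[] sortedKeys, String[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < sortedKeys.length; i++) {
                out.writeInt(sortedKeys[i]);
                out.writeUTF(values[i]);
            }
        }
        return bytes.toByteArray();
    }

    // Scan the data once and emit {key, byteOffset} for every interval-th record.
    static List<long[]> rebuildIndex(byte[] data, int interval) throws IOException {
        List<long[]> index = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long record = 0;
            int previousKey = Integer.MIN_VALUE;
            while (in.available() > 0) {
                long offset = data.length - in.available(); // where this record starts
                int key = in.readInt();
                in.readUTF(); // skip over the value
                if (record > 0 && key < previousKey) {
                    throw new IOException("data is not sorted at record " + record);
                }
                if (record % interval == 0) {
                    index.add(new long[] {key, offset});
                }
                previousKey = key;
                record++;
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        int[] keys = {10, 20, 30, 40, 50};
        String[] values = {"ab", "cd", "ef", "gh", "ij"};
        byte[] data = writeData(keys, values);
        for (long[] entry : rebuildIndex(data, 2)) {
            System.out.println("key=" + entry[0] + " offset=" + entry[1]);
        }
    }
}
```

The sort check is the reason step 1 comes first: an index built over unsorted data would point lookups at the wrong region of the file.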
