A Custom InputFormat for Hadoop

I've been around Hadoop for over a year, but I had never actually written a Hadoop program myself. Recently, a project required converting some files into Hadoop MapFiles. The examples online almost all process plain text input directly; I found only two that define a custom input format, and both use the old API. I haven't seen one written against the new API; perhaps the experts don't bother writing such things up. But handling custom input is something everyone using Hadoop has to learn, because not everyone's input is a text file.

Reading input is the first step of any Hadoop job; if you can't read your own data, nothing downstream can happen. Text formats are easy to handle. For binary files, Hadoop does provide SequenceFileInputFormat: you can first convert your data into a SequenceFile and then process that, but the conversion costs an extra pass over the data and an extra copy on disk. So instead, with the Hadoop source code as a reference, I wrote my own ConverterInputFormat, posted here for anyone it may help.
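
To make the comparison concrete, the SequenceFile detour would look roughly like the sketch below. This is my own illustration, not code from this project: it assumes a Writable record class like the FetcherOutput used later in this post, and the output path and record source are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import com.randyho.FetcherOutput;

public class SeqFileConversion {

    // The extra pass this article avoids: copy every record into a
    // SequenceFile just so SequenceFileInputFormat can read it later.
    public static void convert(Iterable<FetcherOutput> records, Path out)
            throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, FetcherOutput.class);
        try {
            for (FetcherOutput record : records) {
                writer.append(new Text(record.getUrl()), record);
            }
        } finally {
            writer.close();
        }
    }
}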

The code is based on Hadoop 0.20. FetcherOutput is a class of mine that gets written to local disk with Java's DataOutputStream; you can swap in whatever format you need.
It also seems that ConvertertRecordReader must have a default (no-argument) constructor.
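
Since I haven't posted FetcherOutput itself, here is a rough sketch of the shape implied by the record reader below. The url and content fields are my guesses at what the real class holds; only getUrl() and the Writable methods are actually required.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A guessed-at minimal FetcherOutput: a URL plus raw content bytes.
public class FetcherOutput implements Writable {
    private Text url = new Text();
    private byte[] content = new byte[0];

    public String getUrl() {
        return url.toString();
    }

    public void write(DataOutput out) throws IOException {
        url.write(out);
        out.writeInt(content.length);
        out.write(content);
    }

    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        content = new byte[in.readInt()];
        in.readFully(content);
    }
}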

package com.randyho.hadoop.converter;

import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import com.randyho.FetcherOutput;

public class ConverterInputFormat extends FileInputFormat<Text, FetcherOutput> {

    // Do not split files: each file is read as a whole by one record reader.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<Text, FetcherOutput> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new ConvertertRecordReader();
    }

    class ConvertertRecordReader extends RecordReader<Text, FetcherOutput> {

        private DataInputStream dis;
        private Text key = null;
        private FetcherOutput value;
        private boolean more = true;
        private Configuration conf;

        public ConvertertRecordReader() {
            key = new Text();
            value = new FetcherOutput();
            more = true;
        }

        @Override
        public void close() throws IOException {
            if (dis != null) {
                dis.close();
            }
        }

        @Override
        public Text getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public FetcherOutput getCurrentValue() throws IOException,
                InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            // Progress is reported in the range [0, 1].
            return more ? 0.0f : 1.0f;
        }

        @Override
        public void initialize(InputSplit gensplit, TaskAttemptContext context)
                throws IOException, InterruptedException {
            FileSplit split = (FileSplit) gensplit;
            conf = context.getConfiguration();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);

            System.out.println("reading: " + file);

            // Open the file; FSDataInputStream extends DataInputStream.
            dis = fs.open(split.getPath());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // As long as bytes remain, deserialize the next record and
            // key it by its URL.
            if (dis.available() != 0) {
                value.readFields(dis);
                key.set(value.getUrl());
                return true;
            } else {
                more = false;
                return false;
            }
        }
    }
}
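
To use the format, point the job at it in the driver. Here is a minimal map-only sketch (class names and paths are placeholders): the default Mapper passes each (key, value) pair through unchanged, and I write SequenceFile output here because the new API on 0.20 ships SequenceFileOutputFormat; on releases where the new-API MapFileOutputFormat exists, you could swap it in to get a MapFile directly, provided the keys arrive sorted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import com.randyho.FetcherOutput;
import com.randyho.hadoop.converter.ConverterInputFormat;

public class ConvertDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "convert");
        job.setJarByClass(ConvertDriver.class);

        // Read the raw binary files with the custom format defined above.
        job.setInputFormatClass(ConverterInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Identity map: Mapper's default map() forwards (key, value) as-is.
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FetcherOutput.class);
        job.setNumReduceTasks(0); // map-only conversion

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}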

I'm still new at this and not all that familiar with Hadoop; if there is a better way, please let me know.
