How to read the <Text, CrawlDatum> key-value pairs in Nutch's CrawlDb

Source: Internet  Editor: 程序博客网  Time: 2024/05/24 04:40

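I wanted to inspect the <Text, CrawlDatum> key-value pairs stored in the CrawlDb, so today I made a small change to Nutch's CrawlDbReader class: it now submits a job whose added mapper class prints each key-value pair from the CrawlDb so it can be examined. The code is as follows: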

(1) The added mapper class:

  public static class CrawlDbShowMapper implements Mapper<Text, CrawlDatum, Text, LongWritable> {
    LongWritable COUNT_1 = new LongWritable(1);
    private boolean sort = false;

    public void configure(JobConf job) {
      sort = job.getBoolean("db.reader.stats.sort", false);
    }

    public void close() {}

    // Prints each CrawlDb entry; nothing is passed to the OutputCollector.
    public void map(Text key, CrawlDatum value, OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
      System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++");
      System.out.println("key=" + key.toString());
      // Only the fetch time of the CrawlDatum is printed here.
      System.out.println("CrawlDatum=" + value.getFetchTime());
      System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++");
    }
  }


    The code is simple: it just prints every key-value pair that map() receives.
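The mapper above prints `value.getFetchTime()` as a bare number. Assuming that number is milliseconds since the Unix epoch, a small helper can make the printed value easier to read. The class `FetchTimeFormat` and its `format` method are hypothetical names introduced for this sketch; they are not part of Nutch.

```java
import java.time.Instant;

/** Hypothetical helper, not part of Nutch: renders a fetch time for log output. */
public class FetchTimeFormat {

    /** Converts epoch milliseconds (the assumed unit of getFetchTime()) to ISO-8601 UTC text. */
    public static String format(long fetchTimeMillis) {
        return Instant.ofEpochMilli(fetchTimeMillis).toString();
    }

    public static void main(String[] args) {
        // Inside the mapper this would be called as FetchTimeFormat.format(value.getFetchTime()).
        System.out.println("fetchTime=" + format(System.currentTimeMillis()));
    }
}
```

With such a helper, the third println in map() could print a timestamp instead of a raw long, which makes it easier to compare entries by eye.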

(2) The function showCrawlDb(), which submits the job:

  public void showCrawlDb(String crawlDb, String output, Configuration config) throws IOException {
    // Each run gets its own timestamped output folder under "output".
    Path tmpFolder = new Path(output, "" + System.currentTimeMillis());

    JobConf job = new NutchJob(config);
    job.setJobName("show " + crawlDb);

    // Read the current CrawlDb as sequence-file input.
    FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(CrawlDbShowMapper.class);

    FileOutputFormat.setOutputPath(job, tmpFolder);
//     job.setOutputFormat(SequenceFileOutputFormat.class);
//     job.setOutputKeyClass(Text.class);
//     job.setOutputValueClass(LongWritable.class);

    JobClient.runJob(job);
  }

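    Explanation: the code is simple. It adds the CrawlDb path as the input path, sets CrawlDbShowMapper as the mapper class, and submits the job; when the job runs, the key-value pairs in the CrawlDb are printed. The function takes three parameters. crawlDb is the path to the CrawlDb. output is the path for the job's results, but because the mapper never collects anything through its OutputCollector, no records are written there and the parameter is effectively unused. The last parameter, a Configuration, is in practice the Nutch configuration, which can be obtained with Configuration conf = NutchConfiguration.create().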
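Although no records are written, showCrawlDb() still builds an output folder named after the current time. A plausible reason (an assumption about intent, not something the code states) is that Hadoop's file output formats refuse to start a job whose output directory already exists, so a fresh name per run avoids that failure. The sketch below mirrors the naming with `java.nio.file.Path` in place of Hadoop's `Path` so that it runs without Hadoop; `TmpFolderName` and `tmpFolder` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical stand-alone illustration; showCrawlDb() itself uses org.apache.hadoop.fs.Path. */
public class TmpFolderName {

    /** Mirrors new Path(output, "" + System.currentTimeMillis()), with the timestamp passed in. */
    public static Path tmpFolder(String output, long timestampMillis) {
        return Paths.get(output).resolve(Long.toString(timestampMillis));
    }

    public static void main(String[] args) {
        // Two runs started at different milliseconds get different folders under the same parent.
        System.out.println(tmpFolder("crawl/show-output", System.currentTimeMillis()));
    }
}
```

Passing the timestamp as a parameter, rather than reading the clock inside the method, keeps the naming rule easy to check in isolation.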


 

 
