Nutch Custom Development: Extracting the Main Body Text in parse

Source: Internet | Editor: 程序博客网 | Time: 2024/06/04 18:58

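For the basics of Nutch, see lemo's column.

Nutch supports custom development through plugins. To improve search accuracy, I wanted to index only the main body text of each page, which corresponds to the parse_text data. I am using Nutch 1.4 and run the crawl command under Cygwin: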

bin/nutch crawl urls -dir crawl -depth 3 -topN 30

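The crawl proceeds in four steps. inject: injects the URLs from the URL files under urls into the database. generate: reads URLs from the database to build the queue of URLs to fetch. fetch: fetches the pages in that queue. parse: parses the content of each page. The part I need to rewrite is therefore the page parsing done by parse. After parsing a page, parse stores the extracted text in the parse_text folder of the corresponding segment under crawl/segments. We can use the command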

bin/nutch readseg -dump crawl/segments/20120710142020 segdata

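to inspect what was actually crawled.

Nutch is built around extension points: implementing the parser extension point is enough to plug in your own parsing logic. HTML pages are parsed by the default parse-html plugin. For convenience (although it stops being convenient once Nutch is upgraded), we modify the parse-html plugin directly.

First, locate the getParse method with which parse-html implements the Parser interface. This is the method that actually parses the page content.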

public ParseResult getParse(Content content) {
  HTMLMetaTags metaTags = new HTMLMetaTags();

  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  String text = "";
  String title = "";
  Outlink[] outlinks = new Outlink[0];
  Metadata metadata = new Metadata();

  // parse the content
  DocumentFragment root;
  try {
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

    EncodingDetector detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, defaultCharEncoding);

    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
    root = parse(input);
  } catch (IOException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // get meta directives
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }

  // check meta directives
  if (!metaTags.getNoIndex()) {               // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) { LOG.trace("Getting text..."); }
    try {
      utils.getText(sb, root);                // this is where the text is actually extracted
      text = sb.toString();
    } catch (SAXException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    sb.setLength(0);
    if (LOG.isTraceEnabled()) { LOG.trace("Getting title..."); }
    utils.getTitle(sb, root);                 // extract title
    title = sb.toString().trim();
  }

  if (!metaTags.getNoFollow()) {              // okay to follow links
    ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks
    URL baseTag = utils.getBase(root);
    if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
    utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
    outlinks = l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
    }
  }

  ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
  if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setArgs(new String[] {metaTags.getRefreshHref().toString(),
      Integer.toString(metaTags.getRefreshTime())});
  }
  ParseData parseData = new ParseData(status, title, outlinks,
                                      content.getMetadata(), metadata);
  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
                                                new ParseImpl(text, parseData));

  // run filters on parse
  ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
                                                            metaTags, root);
  if (metaTags.getNoCache()) {                // not okay to cache
    for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
      entry.getValue().getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY,
                                                    cachingPolicy);
  }
  return filteredParse;
}

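The code shows exactly where the text is extracted, and that is the spot we need to change. Reading the source shows that Nutch extracts the text by walking the DOM tree. To get only the main body of the page, I chose the open-source tool boilerpipe for main-content extraction. The code inserted into the method above is: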

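// Try boilerpipe first to get only the main body text.
// Note: getMainbodyTextByBoilerpipe declares BoilerpipeProcessingException and
// FileWriter throws IOException, so the enclosing try block must handle both.
text = BoilerpipeUtils.getMainbodyTextByBoilerpipe(
    new InputSource(new ByteArrayInputStream(content.getContent())));
if (text.equals("")) {
  // boilerpipe returned nothing: fall back to the DOM-based extraction
  utils.getText(sb, root);
  text = sb.toString();
  if (LOG.isTraceEnabled()) {
    LOG.trace("Extract text using DOMContentUtils...");
  }
} else if (LOG.isTraceEnabled()) {
  LOG.trace("Extract text using Boilerpipe...");
}
// Debugging aid: append the URL and the extracted text to a local file.
FileWriter fw = new FileWriter("E://mainbodypage//URLText.txt", true);
fw.write("url::" + content.getUrl() + "\n");
fw.write("text::" + text + "\n");
fw.close();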
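The "try boilerpipe, fall back to the DOM text" logic can also be looked at in isolation. The following is only a minimal sketch in plain Java (8 or later): the class TextExtractionFallback and its parameters are hypothetical names, not part of Nutch or boilerpipe, and the two extractors are passed in as functions so that the example runs without either library. Unlike the snippet above, it also falls back when the primary extractor throws or returns only whitespace, which is an assumption about what one would usually want.

```java
import java.util.function.Function;

/**
 * Illustrative sketch only: this class and its parameter names are
 * hypothetical and do not exist in Nutch or boilerpipe.
 */
final class TextExtractionFallback {

    private TextExtractionFallback() {}

    /**
     * Runs the primary extractor first and uses the fallback extractor
     * when the primary one fails or yields no usable text.
     */
    static String extract(String html,
                          Function<String, String> primary,
                          Function<String, String> fallback) {
        String text;
        try {
            text = primary.apply(html);
        } catch (RuntimeException e) {
            // Treat a failing primary extractor like an empty result.
            text = null;
        }
        if (text == null || text.trim().isEmpty()) {
            return fallback.apply(html);
        }
        return text;
    }
}
```

In the plugin, the primary function would wrap the boilerpipe call and the fallback would wrap utils.getText, with their checked exceptions converted or handled inside the wrappers.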

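The modified getParse code also appends each page's URL and extracted text to a file at a fixed path, which makes testing easier. The static method it calls, BoilerpipeUtils.getMainbodyTextByBoilerpipe, is defined in the following class: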

package org.apache.nutch.parse.html;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

public class BoilerpipeUtils {

  // Returns the main body text found by boilerpipe's article extractor,
  // or an empty string if nothing was extracted.
  public static String getMainbodyTextByBoilerpipe(InputSource is)
      throws BoilerpipeProcessingException, SAXException {
    final TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
    final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    extractor.process(doc);
    if (doc.getContent() != null && !doc.getContent().equals(""))
      return doc.getContent();
    else
      return "";
  }
}
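The debug output that the modified getParse writes with FileWriter leaves the file open if one of the write calls throws. As a hedged alternative, the sketch below appends the same two "url::" and "text::" lines using try-with-resources so that the writer is always closed. It assumes Java 7 or later, which may be newer than the JDK used with Nutch 1.4, and the class name ParseDebugLog is hypothetical; the log path is passed in rather than hard-coded.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch only: ParseDebugLog is a hypothetical helper name. */
final class ParseDebugLog {

    private ParseDebugLog() {}

    /** Appends one "url::" line and one "text::" line to the given file. */
    static void append(Path logFile, String url, String text) throws IOException {
        // CREATE + APPEND: create the file on first use, then keep adding to it.
        try (BufferedWriter writer = Files.newBufferedWriter(
                logFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write("url::" + url);
            writer.newLine();
            writer.write("text::" + text);
            writer.newLine();
        } // the writer is closed here even if a write fails
    }
}
```

Writing UTF-8 explicitly avoids depending on the platform default encoding, which matters when the extracted text is not ASCII.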

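Because the code uses the open-source tool boilerpipe, the required jar files must be placed in the lib directory under the plugin folder, and the runtime section of the plugin's plugin.xml must look like this: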

<runtime>
   <library name="parse-html.jar">
      <export name="*"/>
   </library>
   <library name="tagsoup-1.2.1.jar"/>
   <library name="boilerpipe-1.2.0.jar">
   </library>
   <library name="nekohtml-1.9.13.jar">
   </library>
   <library name="xerces-2.9.1.jar">
   </library>
</runtime>

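That completes the plugin changes. After running Build Project in Eclipse and then the crawl command shown earlier, you get parse_text data that contains only the main body text. If you run the crawl command under Cygwin, a NoClassDefFound runtime exception is also reported because the required jar files cannot be found; copying the three jar files mentioned above into the runtime/local/lib directory fixes this.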
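When tracking down a NoClassDefFound problem like this one, it can help to ask a class loader directly whether it can see a class. The sketch below is a small diagnostic in plain Java; ClasspathCheck is a hypothetical helper name, and which class loader to pass is an assumption that depends on how the plugin is loaded.

```java
/** Illustrative sketch only: ClasspathCheck is a hypothetical helper name. */
final class ClasspathCheck {

    private ClasspathCheck() {}

    /** Returns true if the given class loader can locate the named class. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so it is covered here too.
            return false;
        }
    }
}
```

For example, calling ClasspathCheck.isLoadable("de.l3s.boilerpipe.extractors.CommonExtractors", BoilerpipeUtils.class.getClassLoader()) from inside the plugin and logging the result would show whether the boilerpipe jar is visible at that point.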

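That said, boilerpipe's extraction quality is not ideal and leaves room for improvement. Alternatively, you can write extraction logic tailored to specific sites to extract the text.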
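One possible shape for such site-specific extraction is a lookup from host name to extractor, with a generic extractor as the default. This is only a sketch under stated assumptions: SiteExtractorRegistry and its methods are hypothetical names, the extractors are plain functions from HTML to text, and matching is by exact host name (Java 8 or later).

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only: all names here are hypothetical. */
final class SiteExtractorRegistry {

    private final Map<String, Function<String, String>> byHost = new HashMap<>();
    private final Function<String, String> defaultExtractor;

    SiteExtractorRegistry(Function<String, String> defaultExtractor) {
        this.defaultExtractor = defaultExtractor;
    }

    /** Registers a custom extractor for one host (case-insensitive). */
    void register(String host, Function<String, String> extractor) {
        byHost.put(host.toLowerCase(Locale.ROOT), extractor);
    }

    /** Uses the extractor registered for the URL's host, else the default. */
    String extract(String url, String html) {
        // URI.create throws IllegalArgumentException for a malformed URL.
        String host = URI.create(url).getHost();
        Function<String, String> extractor = (host == null)
                ? defaultExtractor
                : byHost.getOrDefault(host.toLowerCase(Locale.ROOT), defaultExtractor);
        return extractor.apply(html);
    }
}
```

In this arrangement the default extractor could be the boilerpipe-based one, with hand-written rules registered only for the sites where its results are poor.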