Understanding webmagic's Overall Architecture, with Code Analysis: PageProcessor


Earlier I wrote a post about Spider in webmagic: http://write.blog.csdn.net/postlist

This post is about webmagic's PageProcessor.

Here is the PageProcessor source. PageProcessor is an interface:

public interface PageProcessor {

    /**
     * process the page, extract urls to fetch, extract the data and store
     *
     * @param page
     */
    public void process(Page page);

    /**
     * get the site settings
     *
     * @return site
     * @see Site
     */
    public Site getSite();
}

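Mr. Huang describes PageProcessor in his webmagic article as follows:

PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool, and on top of it developed Xsoup, a tool for evaluating XPath.

Of the four components, PageProcessor differs for every site and every page, so it is the part that users need to customize.

How should we understand this statement? And how, concretely, does webmagic parse a page through PageProcessor?

Here is a concrete crawler example: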

import java.text.SimpleDateFormat;
import java.util.Date;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.scheduler.PriorityScheduler;

public class MarketPriceDemo2 implements PageProcessor {

   private Site site = Site.me().setRetryTimes(3).setSleepTime(0);

   String date = getcurrdate();

    public final String URL_LIST = "http://www\\.87050\\.com/\\w+/\\w+/.*";

    public final String URL_POST = "/plushtml/pluprice/bjd\\d+"+ date +"\\.htm";

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(URL_LIST).match()) {
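            // toString() returns null when the XPath matches nothing, so the null check below can take effect.
            page.putField("goodsName", page.getHtml().xpath("//title/tidyText()").toString());
            if (page.getResultItems().get("goodsName") == null) {
                page.setSkip(true);
                System.out.println("No matching URL found");
            }
            page.putField("Data", page.getHtml().xpath("//table[@width='778' and @border='1']"));

            page.addTargetRequests(page.getHtml().links().regex(URL_POST).all());
            // addTargetRequests takes a List<String>.
            // Author's note: .all() should not keep being used here, because it collects links for both
            // depth and breadth crawling. I plan to build a separate list that does breadth crawling only.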
        }
    }
    // Returns today's date formatted as yyyyMMdd.
    public String getcurrdate() {
        return new SimpleDateFormat("yyyyMMdd").format(new Date());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new MarketPriceDemo2()).setScheduler(new PriorityScheduler())
                .addUrl("http://www.87050.com/asp/jghq/index_ssjg.asp?scid=101").addPipeline(new FilePipeline("D:\\webmagic1")).thread(5).run();
    }
}
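To make the role of URL_POST more concrete, here is a small sketch, using only java.util.regex, of which links a date-stamped pattern like the one above would keep. The class name UrlPostPatternDemo, the fixed date, and the sample links are all made up for illustration, and the sketch assumes substring-match semantics (Matcher.find()); webmagic's own regex(...) selector may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of webmagic.
final class UrlPostPatternDemo {

    // Builds the same expression as URL_POST in the example, for an explicit date string.
    static Pattern urlPostPattern(String yyyyMMdd) {
        return Pattern.compile("/plushtml/pluprice/bjd\\d+" + yyyyMMdd + "\\.htm");
    }

    // Keeps only the links that contain a match for the pattern.
    static List<String> filter(List<String> links, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (pattern.matcher(link).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A fixed date keeps the demo repeatable; the crawler above uses today's date.
        Pattern pattern = urlPostPattern("20240607");
        List<String> links = Arrays.asList(
                "http://www.87050.com/plushtml/pluprice/bjd10120240607.htm", // today's quote page
                "http://www.87050.com/plushtml/pluprice/bjd10120240606.htm", // yesterday's page
                "http://www.87050.com/asp/jghq/index_ssjg.asp?scid=101");    // the index page
        System.out.println(filter(links, pattern));
    }
}
```

Because the date is baked into the expression, pages from other days are never queued, which is one way to keep the crawl from wandering into old price listings.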

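Every crawler has to implement the abstract method public void process(Page page) of the PageProcessor interface. The implementation itself, however, mostly calls methods encapsulated in the Page class. So to study the PageProcessor component, the most important thing is to study the Page class and to understand how PageProcessor parses a web page through it.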
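The division of labor can be sketched with a toy model. MiniPage, MiniPageProcessor, TitleProcessor, and MiniCrawlStep below are simplified stand-ins invented for this illustration; they are not webmagic's real classes, and the naive indexOf-based title extraction only takes the place of a real selector. The point is just the shape of the interaction: the crawler hands a page object to process(...), and the processor communicates its results only by calling methods on that object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a page object: it carries the input and collects the output.
final class MiniPage {
    private final String url;
    private final String html;
    private final Map<String, Object> fields = new LinkedHashMap<>();
    private final List<String> targetRequests = new ArrayList<>();
    private boolean skip;

    MiniPage(String url, String html) {
        this.url = url;
        this.html = html;
    }

    String getUrl() { return url; }
    String getHtml() { return html; }
    void putField(String key, Object value) { fields.put(key, value); }
    void addTargetRequests(List<String> urls) { targetRequests.addAll(urls); }
    void setSkip(boolean skip) { this.skip = skip; }
    boolean isSkip() { return skip; }
    Map<String, Object> getFields() { return fields; }
    List<String> getTargetRequests() { return targetRequests; }
}

// Stand-in for the user-supplied part: one callback per downloaded page.
interface MiniPageProcessor {
    void process(MiniPage page);
}

// A site-specific processor: stores the title, or marks the page as skipped.
final class TitleProcessor implements MiniPageProcessor {
    private static final String OPEN = "<title>";
    private static final String CLOSE = "</title>";

    @Override
    public void process(MiniPage page) {
        String html = page.getHtml();
        int start = html.indexOf(OPEN);
        int end = html.indexOf(CLOSE);
        if (start < 0 || end < start) {
            page.setSkip(true); // nothing useful on this page
            return;
        }
        page.putField("goodsName", html.substring(start + OPEN.length(), end).trim());
        // Hypothetical follow-up link, only to show how new URLs are reported back.
        page.addTargetRequests(Collections.singletonList(page.getUrl() + "?page=2"));
    }
}

final class MiniCrawlStep {
    // What the crawler does for one downloaded page: build the page object,
    // hand it to the processor, then read back whatever the processor stored.
    static MiniPage handle(MiniPageProcessor processor, String url, String html) {
        MiniPage page = new MiniPage(url, html);
        processor.process(page);
        return page;
    }

    public static void main(String[] args) {
        MiniPage page = handle(new TitleProcessor(), "http://example.com/list",
                "<html><title> Apples </title></html>");
        System.out.println(page.getFields() + " " + page.getTargetRequests() + " skip=" + page.isSkip());
    }
}
```

In this sketch the processor never returns anything; everything it learns ends up inside the page object, which is why reading the Page class tells you most of what a PageProcessor can do.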

0 0