Scraping Yahoo! Answers data with web-harvest

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 05:08

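The previous (reposted) article gave a brief introduction to using web-harvest. This post uses scraping Yahoo! Answers as an example to point out what to watch for in practice. The best documentation, of course, is the user manual on the official website.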

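web-harvest is available in three versions; the source package is used here. The most important part of scraping data is writing the config file. The source package contains a Java class, Test.java, whose source is as follows: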

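import java.io.IOException;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

public class Test {

    public static void main(String[] args) throws IOException {

        // Load the scraping configuration from an XML file.
        ScraperConfiguration config = new ScraperConfiguration("e:/temp/yahooanswer/auto racing.xml"); //line a
        // The second argument is the scraper's working directory.
        Scraper scraper = new Scraper(config, "e:/temp/wikianswer"); //line b

        scraper.setDebug(true);

        // Run the configuration and report the elapsed time in milliseconds.
        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime));
    }

}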

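The .xml file on line a is the scraping configuration, and the path on line b is the directory where the scraped data is stored (the scraper's working directory). The configuration fetches the first 5 pages of resolved questions in the Yahoo! Answers category sports/auto racing, 20 per page, and writes them to a file.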

Now let's look at auto racing.xml. The file is as follows:

<?xml version="1.0" encoding="utf-8"?>

<config>

<include path="functions.xml"/>

<var-def name="home">http://answers.yahoo.com</var-def>

<!-- Define the variable QALinks; its value is the return value of the function download-multipage-list. -->
<var-def name="QALinks">
  <call name="download-multipage-list">
   <call-param name="pageUrl">http://answers.yahoo.com/dir/index;_ylt=AnRU11UwwAiICNV69Xv._0HzDH1G;_ylv=3?sid=396545601&amp;link=resolved#yan-questions</call-param>
   <call-param name="nextXPath">//li[@rel="next"]/@href</call-param>
   <call-param name="itemXPath">//ul[@class="questions"]//h3//a/@href</call-param>
   <call-param name="maxloops">5</call-param>
  </call>
</var-def>


<var-def name="questions">
  <loop item="item" index="i">
   <list><var name="QALinks"/></list>
   <body>
    <html-to-xml>
     <http url="${sys.fullUrl(home, item)}"/>
    </html-to-xml>

            </body>
        </loop>
        <![CDATA[ </questionanswers> ]]>
    </file>

</config>
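
The two XPath parameters passed to download-multipage-list can be tried in isolation with the JDK's built-in XPath engine. The sketch below is only an illustration: the class name XPathSketch, the helper select, and the SAMPLE_PAGE markup are hypothetical stand-ins shaped to match the two expressions. They are not the real Yahoo! Answers page and not part of web-harvest.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical, simplified stand-in for one cleaned-up listing page.
    static final String SAMPLE_PAGE =
            "<html><body>"
            + "<ul class=\"questions\">"
            + "<li><h3><a href=\"/question/index?qid=1\">Q1</a></h3></li>"
            + "<li><h3><a href=\"/question/index?qid=2\">Q2</a></h3></li>"
            + "</ul>"
            + "<ul><li rel=\"next\" href=\"/dir/index?page=2\">Next</li></ul>"
            + "</body></html>";

    // Evaluate an XPath expression against an XML string and return the
    // string value of every matched node (here: href attribute values).
    static List<String> select(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // itemXPath from the config: links to the individual questions.
        System.out.println(select("//ul[@class=\"questions\"]//h3//a/@href", SAMPLE_PAGE));
        // nextXPath from the config: link to the next listing page.
        System.out.println(select("//li[@rel=\"next\"]/@href", SAMPLE_PAGE));
    }
}
```

Both expressions return relative links, which is why the config passes each item through sys.fullUrl(home, item) before fetching it.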

Source of functions.xml:

<?xml version="1.0" encoding="UTF-8"?>

<config>
    
    <function name="download-multipage-list">
        <return>
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <!-- Whatever is wrapped in the empty element is executed but not returned. -->
                <empty>
                    <!-- Define the variable content: the page fetched from pageUrl. -->
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>
                    <script><![CDATA[
                        // Debugging print: writes the current pageUrl to the Java console.
                        print("pageUrl:"+pageUrl);
                   ]]></script>

                    <!-- Define the variable nextLinkUrl: the value that nextXPath selects from content. -->
                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <!-- Redefine pageUrl: the previous pageUrl combined with nextLinkUrl. -->
                    <var-def name="pageUrl">
                        <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                    </var-def>

                </empty>

                <!-- The value to return: whatever itemXPath selects from content. -->
                <xpath expression="${itemXPath}">

                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>
</config>

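functions.xml defines one function with four input parameters and one return value. pageUrl is the URL where scraping starts. nextXPath is the XPath expression that extracts the next page's URL from the current page, that is, the href behind the page's "next" link. The function contains a while loop, and maxloops is the maximum number of iterations allowed while the loop condition still holds. itemXPath is the XPath expression that, on each iteration, extracts the list to return from the fetched content; in this example it yields the href of each answer on the page. The function returns the list of everything that itemXPath matched across all pages.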
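
The control flow of download-multipage-list can be sketched in plain Java. This is a minimal illustration, not web-harvest code: the names PaginationSketch, Page, sampleSite and downloadMultipageList are hypothetical, pages are served from an in-memory map instead of HTTP, java.net.URI.resolve stands in for sys.fullUrl, and the sketch assumes that an empty next link ends the loop.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PaginationSketch {

    // Hypothetical stand-in for one fetched page: the links itemXPath would
    // select, and the link nextXPath would select ("" on the last page).
    static final class Page {
        final List<String> items;
        final String next;

        Page(List<String> items, String next) {
            this.items = items;
            this.next = next;
        }
    }

    // Three fake listing pages keyed by absolute URL (example.com is a placeholder).
    static Map<String, Page> sampleSite() {
        Map<String, Page> site = new HashMap<>();
        site.put("http://example.com/list?page=1", new Page(Arrays.asList("/q/1", "/q/2"), "/list?page=2"));
        site.put("http://example.com/list?page=2", new Page(Arrays.asList("/q/3", "/q/4"), "/list?page=3"));
        site.put("http://example.com/list?page=3", new Page(Arrays.asList("/q/5"), ""));
        return site;
    }

    // Follow "next" links starting at pageUrl, for at most maxLoops pages,
    // and collect the item links of every visited page.
    static List<String> downloadMultipageList(Map<String, Page> site, String pageUrl, int maxLoops) {
        List<String> result = new ArrayList<>();
        int loops = 0;
        while (!pageUrl.isEmpty() && loops < maxLoops) {
            Page content = site.get(pageUrl);      // plays the role of <http> + <html-to-xml>
            if (content == null) {
                break;                             // unknown page: stop instead of failing
            }
            result.addAll(content.items);          // what itemXPath contributes to the return value
            // Resolve the relative next link against the current page URL.
            // Assumption of this sketch: an empty next link ends the loop.
            pageUrl = content.next.isEmpty()
                    ? ""
                    : URI.create(pageUrl).resolve(content.next).toString();
            loops++;
        }
        return result;
    }

    public static void main(String[] args) {
        // With maxLoops = 2 only the first two pages are visited.
        System.out.println(downloadMultipageList(sampleSite(), "http://example.com/list?page=1", 2));
    }
}
```

As in the XML version, the loop has two exits: the page URL becoming empty, or the iteration count reaching maxloops, whichever happens first.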