用web-harvest爬取yahoo!answers数据

来源:互联网 发布:淘宝天天秒杀 编辑:程序博客网 时间:2024/05/16 05:08

         关于web-harvest的使用,上篇转载的文章已经有简单的说明,本文主要以爬取yahoo!answers的数据为例,说明在使用过程中需要注意的问题。当然,最好的使用文档就是官方网站的user manual。

         web-harvest有三个版本,这里用的是源码包。要完成数据的爬取,最重要的是配置config文件。源码包中有个Java类,Test.java,源代码如下:

public class Test {

    public static void main(String[] args) throws IOException {

        ScraperConfiguration config = new ScraperConfiguration("e:/temp/yahooanswer/auto racing.xml"); //line a
        Scraper scraper = new Scraper(config, "e:/temp/wikianswer"); //line b

        scraper.setDebug(true);

        long startTime = System.currentTimeMillis();
        scraper.execute();
        System.out.println("time elapsed: " + (System.currentTimeMillis() - startTime));
    }

}

         line a中的.xml文件即抓取配置数据,line b 为抓取后数据的存放路径。其功能是完成yahoo!answers分类中sports/auto racing的resolved问题中的前5页内容,每页20条,以如下格式写入文件中:

         

         下面主要来分析一下auto racing.xml,xml文件如下:         

<?xml version="1.0" encoding="utf-8"?>

<config charset="utf-8">

 <include path="functions.xml"/>
 
 <var-def name="home">http://answers.yahoo.com</var-def>
 
 <var-def name="QALinks">                  //定义变量QALinks,其值为函数download-multipage-list的返回值。
  <call name="download-multipage-list">
   <call-param name="pageUrl">http://answers.yahoo.com/dir/index;_ylt=AnRU11UwwAiICNV69Xv._0HzDH1G;_ylv=3?sid=396545601&amp;link=resolved#yan-questions"</call-param>
   <call-param name="nextXPath">//li[@rel="next"]/@href</call-param>
   <call-param name="itemXPath">//ul[@class="questions"]//h3//a/@href</call-param>
   <call-param name="maxloops">5</call-param>
  </call>
 </var-def>
 
 <!-- According the link, get all questions -->
 <var-def name="questions">
  <loop item="item" index="i">
   <list><var name="QALinks"/></list>
   <body>
    <html-to-xml>
     <http url="${sys.fullUrl(home, item)}"/>
    </html-to-xml>

    <script><![CDATA[
     print("item"+i+":"+item);
    ]]></script>
   </body>
  </loop>               
 </var-def>
 
 <!-- iterates over all collected products and extract desired data -->
    <file action="write" path="sports/auto racing.xml" charset="utf-8">
        <![CDATA[ <questionanswers> ]]>
        <loop item="item" index="i">
            <list><var name="questions"/></list>
            <body>
    <template>
     <![CDATA[<number>]]><var name="i"/><![CDATA[</number>]]>
    </template>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                   
                    <xq-expression><![CDATA[
                            declare variable $item as node() external;
                          
 
                            let $subject := data($item//h1[@class='subject'])
                                return
                                    <questionanswer>
          
                                        <subject>{normalize-space($subject)}</subject>
                                       { for $x at $count in data($item//div[@class="content"])
           return if($count eq 1)
               then <questioncontent>{$x}</questioncontent> 
               else <answer>{$x}</answer>   
              
            }                                        
                                    </questionanswer>
                    ]]></xq-expression>
                </xquery>

            </body>
        </loop>
        <![CDATA[ </questionanswers> ]]>
    </file>
 
</config>

functions.xml源代码:

<?xml version="1.0" encoding="UTF-8"?>

<config>
    <!--
        Download multi-page list of items.
       
        @param pageUrl       - URL of starting page
        @param itemXPath     - XPath expression to obtain single item in the list
        @param nextXPath     - XPath expression to URL for the next page
        @param maxloops      - maximum number of pages downloaded
       
        @return list of all downloaded items
     -->
    <function name="download-multipage-list">
        <return>
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <empty>                                                   //函数中<empty></empty>中的内容表示不用返回。
                    <var-def name="content">               //定义了变量content,其内容是pageUrl返回的网页内容
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>
                    <script><![CDATA[                             //   <script>中是调试用的print,将输入内容显示在Java的控制台。     
                        print("pageUrl:"+pageUrl);
                    ]]></script>

                    <var-def name="nextLinkUrl">        //定义了变量nextLinkUrl,其值是根据nextXPath从content中获取的数据
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <var-def name="pageUrl">         //重新定义pageUrl,其值为原来的pageUrl和nextLinkUrl的连接。
                        <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                    </var-def>

                </empty>
   
                <xpath expression="${itemXPath}">   //要返回的值,根据itemXPath从content中获取的数据

                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>
</config>

        functions.xml定义了一个函数,4个输入参数,1个输出。pageUrl表示起始的抓取url;nextXPath是从本页抓取的内容中获取下一页url的xpath表达式,也就是如何在本页中获取next所对应的href;function包含一个while循环,maxloops是在其他条件满足是最多循环次数;itemXPath是每次循环时从抓取的内容中获取返回的列表的xpath表达式,本例中是从每页获得answer对应的href。最后返回的是根据itemXPath获取的所有内容的列表。