Iteratively Parsing Web Page Content

Source: Internet  Published by: js数组中删除某个元素  Editor: 程序博客网  Time: 2024/06/01 09:27

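Recently I have been scraping the comment sections of Douban groups. I wanted to store each comment as one database record made up of the username, the comment text, and the reply URL. My first thought was lxml, but my XPath query returned everything lumped together. I then tried ElementTree, fiddled with it for a long while, and it only kept getting more complicated, so I gave up and went back to where it all began...

BeautifulSoup is one of the best-known modules for processing web pages, and it was the first one I ever used. Back then it felt too complicated, so I switched to the relatively simple lxml. Yesterday I reread the documentation, and features I used to find boring now seem very handy, such as next element, next sibling, and child nodes. This post mainly covers iteration and the next element.

A comment block on Douban typically looks like this: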

<li class="clearfix comment-item" id="766574544" data-cid="766574544" >
    <div class="user-face">
        <a href="http://www.douban.com/group/people/57082998/"><img class="pil" src="http://img3.douban.com/icon/u57082998-5.jpg" alt="小怪兽。"/></a>
    </div>
    <div class="reply-doc content" style="padding-left:0px;">
        <div class="bg-img-green">
          <h4>
              <a href="http://www.douban.com/group/people/57082998/" class="">***。</a> (任他们多漂亮，未及你矜贵。)
              <span class="pubtime">2014-09-25 10:55:11</span>
          </h4>
        </div>
        <div class="reply-quote">
            <span class="short">前任说他家压力很大，负担重，一定要成功，而我家条件一般，没法帮他啥，然后被分手，我希望他能</span>
            <span class="all">前任说他家压力很大，负担重，一定要成功，而我家条件一般，没法帮他啥，然后被分手，我希望他能如你他所愿，找到能帮到你他的另一半，我现在在大街上看到他，估计都想分分钟砍了他</span>
        <a href="#" class="toggle-reply">
            <span class="expaned">...</span>
        </a><span class="pubdate"><a href="http://www.douban.com/group/people/65152572/">****</a></span></div>
        <p class="">帮他。呵呵。小白脸当我节奏吗</p>
        <div class="operation_div" id="57082998">
            <div class="operation-more">
                <a rel="nofollow" href="javascript:void(0);" data-cid="766574544" class="lnk-delete-comment" title="真的要删除***。的发言?">删除</a>
            </div>
            <a rel="nofollow" href="javascript:void(0);" class="comment-vote lnk-fav">赞</a>
            <a href="http://www.douban.com/group/topic/63187199/?cid=766574544#last" class="lnk-reply">回应</a>
        </div>
    </div>
</li>

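Every other comment block has the same structure, so we can first grab each block as a whole and then extract the fields from each block separately. That makes it easy to keep the fields of one comment together. The first step is grabbing the blocks (this requires importing BeautifulSoup from bs4):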

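from bs4 import BeautifulSoup

response = self.session.get(each_url)
# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(response.text, 'html.parser')
regions = soup.find_all('li', class_="clearfix comment-item")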

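regions behaves like a list (it is a ResultSet), and each item in it is a Tag object that supports the same search methods as the soup itself. For me, this is where BeautifulSoup is far more pleasant than lxml.

For each region we can go on to pull out what we need, again relying on tags and attributes: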

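# Username: the first <a> inside the <h4>
user_name = region.find('h4').find('a').text
# Comment text: the first <p> in the block
comment = region.find('p').text
# Reply URL: the href of the link whose class is "lnk-reply"
reply_url = region.find('a', class_="lnk-reply").attrs['href']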

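Conveniently, these three fields are each quite representative. user_name is the content of the a tag under the h4 tag. BeautifulSoup lets you chain find calls directly, which is great.

comment is simple: it is the content of the p tag.

reply_url has to be selected by attribute. When the attribute name is a Python reserved word, such as class, you have to write class_ instead. The link itself lives in an attribute: .attrs exposes the tag's attributes as a dictionary, so you just look up the value by key.

Feels good, doesn't it? BeautifulSoup's child-node and parent-node navigation can do these things too; I just prefer my own approach.

I also tried out the next-object navigation. For multi-page content, we usually find the next-page link on each page and keep iterating until there is no next page. Douban's pagination usually looks like this: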
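To make the three lookups concrete, here is a minimal self-contained sketch that runs without any network access. The HTML is a trimmed, made-up sample that only imitates the structure of a Douban comment item, and parse_comments is a hypothetical helper name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates the structure of Douban comment items
SAMPLE_HTML = """
<ul>
  <li class="clearfix comment-item" id="1">
    <div class="reply-doc content">
      <h4><a href="http://example.com/people/1/">alice</a>
          <span class="pubtime">2014-09-25 10:55:11</span></h4>
      <p class="">first comment</p>
      <a href="http://example.com/topic/1/?cid=1#last" class="lnk-reply">reply</a>
    </div>
  </li>
  <li class="clearfix comment-item" id="2">
    <div class="reply-doc content">
      <h4><a href="http://example.com/people/2/">bob</a>
          <span class="pubtime">2014-09-25 11:02:40</span></h4>
      <p class="">second comment</p>
      <a href="http://example.com/topic/1/?cid=2#last" class="lnk-reply">reply</a>
    </div>
  </li>
</ul>
"""


def parse_comments(html):
    """Return one dict per comment item (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # class_ with a single class name matches any tag that carries that class
    for region in soup.find_all("li", class_="comment-item"):
        records.append({
            "user_name": region.find("h4").find("a").text,
            "comment": region.find("p").text,
            "reply_url": region.find("a", class_="lnk-reply").attrs["href"],
        })
    return records


records = parse_comments(SAMPLE_HTML)
for record in records:
    print(record)
```

Each dict corresponds to one row that could then be written to the database.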
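For comparison, here is a rough sketch of what the parent and child navigation might look like. It is not the approach used in this post; the HTML is again a made-up sample and all variable names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample with the same shape as one comment item
html = """
<li class="comment-item" id="1">
  <div class="reply-doc">
    <h4><a href="http://example.com/people/1/">alice</a> <span class="pubtime">2014-09-25</span></h4>
    <p>first comment</p>
    <a href="http://example.com/topic/1/?cid=1#last" class="lnk-reply">reply</a>
  </div>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

# Start from the reply link and climb up to the enclosing comment item
reply_link = soup.find("a", class_="lnk-reply")
item = reply_link.find_parent("li")

# .children yields direct children only, including whitespace text nodes,
# so keep just the Tag objects
h4_children = [child for child in item.find("h4").children if isinstance(child, Tag)]
user_name = h4_children[0].text

# The direct parent of the link is the div; list the tags it contains
child_tags = [child.name for child in reply_link.parent.children if isinstance(child, Tag)]

print(item["id"], user_name, child_tags)
```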

<span class="thispage" data-total-page="15">1</span>
<a href="http://www.douban.com/group/topic/63187199/?start=100" >2</a>
<a href="http://www.douban.com/group/topic/63187199/?start=200" >3</a>
<a href="http://www.douban.com/group/topic/63187199/?start=300" >4</a>
<a href="http://www.douban.com/group/topic/63187199/?start=400" >5</a>
<a href="http://www.douban.com/group/topic/63187199/?start=500" >6</a>
<a href="http://www.douban.com/group/topic/63187199/?start=600" >7</a>
<a href="http://www.douban.com/group/topic/63187199/?start=700" >8</a>
<a href="http://www.douban.com/group/topic/63187199/?start=800" >9</a>
<span class="break">...</span>
<a href="http://www.douban.com/group/topic/63187199/?start=1300" >14</a>
<a href="http://www.douban.com/group/topic/63187199/?start=1400" >15</a>
<span class="next">
    <link rel="next" href="http://www.douban.com/group/topic/63187199/?start=100"/>
    <a href="http://www.douban.com/group/topic/63187199/?start=100" >后页></a>
</span>

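What I care about most is the first few pages, so why not simply collect the page links in one go? Just find the "thispage" span and then keep taking the next element.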

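all_pages_urls = []
nextpage = soup.find('span', class_="thispage")
while True:
    try:
        # Step twice: there is a '\n' text node between two page elements
        nextpage = nextpage.next_sibling.next_sibling
        all_pages_urls.append(nextpage.attrs['href'])
    except (KeyError, AttributeError):
        # Stop at a sibling without an href (such as the "..." span) or when the siblings run out
        break
print("total %s pages" % len(all_pages_urls))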
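The same walk can be tried offline with a slight variation: iterating over .next_siblings and skipping text nodes by type instead of stepping twice. This is only a sketch; the pagination HTML is a shortened, made-up sample and links_after_current is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up pagination block in the same shape as Douban's
PAGINATION_HTML = """
<div class="paginator">
  <span class="thispage" data-total-page="5">1</span>
  <a href="http://example.com/topic/1/?start=100">2</a>
  <a href="http://example.com/topic/1/?start=200">3</a>
  <span class="break">...</span>
  <a href="http://example.com/topic/1/?start=400">5</a>
</div>
"""


def links_after_current(html):
    """Collect the page links that directly follow the current-page span."""
    soup = BeautifulSoup(html, "html.parser")
    current = soup.find("span", class_="thispage")
    urls = []
    if current is None:
        return urls
    for sibling in current.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # whitespace between elements shows up as text nodes
        if sibling.name != "a" or not sibling.has_attr("href"):
            break  # the first non-link sibling (the "..." span) ends the run
        urls.append(sibling["href"])
    return urls


urls = links_after_current(PAGINATION_HTML)
print(urls)
```

Like the loop above, this stops at the "..." span, so it only collects the run of links right after the current page.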

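Of course, Douban's next-page links follow a very regular pattern, so simply incrementing the start value would work too. Still, trying out the next-element feature was rather fun.

Finally, here is the result (the images in the four posts I wrote the day before yesterday are all broken; I will fix them later =_=):

Since the things I have been researching lately could easily be put to malicious use, I will not be as eager to post full source code from now on.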
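A small sketch of that incrementing approach follows. It assumes the step of 100 seen in the sample links and a total page count such as the one in the data-total-page attribute; remaining_page_urls is a hypothetical helper name, and whether the pattern holds for every topic is an assumption.

```python
def remaining_page_urls(topic_url, total_pages, page_size=100):
    """Build the URLs of pages 2..total_pages by incrementing start.

    Assumes page n lives at ?start=(n - 1) * page_size, as in the sample links.
    """
    return [
        "%s?start=%d" % (topic_url, (page - 1) * page_size)
        for page in range(2, total_pages + 1)
    ]


page_urls = remaining_page_urls("http://www.douban.com/group/topic/63187199/", 15)
print("total %s pages" % (len(page_urls) + 1))
```

Unlike the sibling walk, this also reaches the pages hidden behind the "..." span.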

————————————————

GitHub: https://github.com/gt11799

E-mail: gting405@163.com

0 0