Scraping a Static Douban Page with Python, Parsing It with lxml, and Displaying the Result (Experiment)


Page scraped: https://movie.douban.com/review/best/
Python source:

import requests
from lxml import etree
import pandas as pd

url = 'https://movie.douban.com/review/best/'
data = requests.get(url)  # fetch the whole page with requests
print(data.encoding)
print(data.status_code)

selector = etree.HTML(data.text)  # parse the fetched page with lxml.etree

# Lists that hold the parsed fields
title_links = []     # review titles
subject_titles = []  # movie names
ratings = []         # star ratings
times = []           # review timestamps

# "*" matches any node name. Tip: copy an element's XPath from the browser's
# developer tools, then compare the XPaths of sibling elements to find the
# locator they share.
comments = selector.xpath('//*[@id="content"]/div/div[1]/div[1]/div')
print(len(comments))  # comments is a list

for comment in comments:
    # html: <a href="https://movie.douban.com/review/8868602/" class="title-link">拍出了水平的哭戏</a>
    title_link = comment.xpath('.//header/h3/a/text()')[0]
    # html: <a class="subject-title" href="https://movie.douban.com/subject/25870236/">可爱的你</a>
    subject_title = comment.xpath('.//header/div/a[2]/text()')[0]
    # html: <span class="allstar40 main-title-rating" title="推荐"></span>
    rating = comment.xpath('.//header/div/span[1]/@title')[0]
    # html: <span property="v:dtreviewed" content="2017-10-16" class="main-meta">2017-10-16 10:52:51</span>
    # Named review_time so it does not shadow the standard time module.
    review_time = comment.xpath('.//header/div/span[3]/text()')[0]
    title_links.append(title_link)
    subject_titles.append(subject_title)
    ratings.append(rating)
    times.append(review_time)

comment_dict = {'title_links': title_links, 'subject_titles': subject_titles,
                'ratings': ratings, 'times': times}
comment_df = pd.DataFrame(comment_dict)
comment_df  # displays the table in a notebook; use print(comment_df) in a script
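One fragile spot in the script above is the `[0]` indexing: if a review has no rating or no timestamp, `xpath()` returns an empty list and the loop stops with an `IndexError`. Below is a minimal sketch of a more defensive extraction that runs offline on a small HTML snippet. The `<a>` and `<span>` tags are taken from the HTML comments in the script, while the surrounding `review-item` wrapper, the second review, and the helper names `first_or_none` and `parse_reviews` are assumptions made for illustration and are not Douban's actual markup. The sketch also selects by class name instead of by position, which may survive small layout changes better.

```python
from lxml import etree

# Hypothetical markup: the <a>/<span> tags in the first review are copied from
# the HTML comments in the script; the "review-item" wrapper divs and the
# second review are assumptions for illustration.
SAMPLE_HTML = """
<div id="content">
  <div class="review-item">
    <header>
      <h3><a href="https://movie.douban.com/review/8868602/" class="title-link">拍出了水平的哭戏</a></h3>
      <div>
        <a class="subject-title" href="https://movie.douban.com/subject/25870236/">可爱的你</a>
        <span class="allstar40 main-title-rating" title="推荐"></span>
        <span property="v:dtreviewed" content="2017-10-16" class="main-meta">2017-10-16 10:52:51</span>
      </div>
    </header>
  </div>
  <div class="review-item">
    <header>
      <h3><a href="#" class="title-link">A review without a rating</a></h3>
      <div>
        <a class="subject-title" href="#">Some film</a>
        <span class="main-meta">2017-10-17 08:00:00</span>
      </div>
    </header>
  </div>
</div>
"""


def first_or_none(node, xpath_expr):
    """Return the first XPath match as a stripped string, or None if nothing matched."""
    matches = node.xpath(xpath_expr)
    return matches[0].strip() if matches else None


def parse_reviews(html_text):
    """Parse review blocks into a list of dicts, tolerating missing fields."""
    root = etree.HTML(html_text)
    rows = []
    for item in root.xpath('//div[@class="review-item"]'):
        rows.append({
            'title_link': first_or_none(item, './/a[@class="title-link"]/text()'),
            'subject_title': first_or_none(item, './/a[@class="subject-title"]/text()'),
            'rating': first_or_none(item, './/span[contains(@class, "main-title-rating")]/@title'),
            'time': first_or_none(item, './/span[@class="main-meta"]/text()'),
        })
    return rows


reviews = parse_reviews(SAMPLE_HTML)
for row in reviews:
    print(row)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(reviews)`; missing fields then show up as empty cells instead of aborting the run.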

Scraping result:
(image)