Python Crawler Notes, Part 3: Parsing the Douban Top 250 Data
Source: Internet  Published by: 十八掌大数据视频  Editor: 程序博客网  Time: 2024/04/27 13:31
1. Accessing the site with a cookie

import requests

# The cookie string is copied from a logged-in browser session.
raw_cookie = 'bid=a3MhK2YEpZw; ll="108296"; ps=y; ue="t.t.panda@hotmail.com"; _pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1482650884%2C%22https%3A%2F%2Fwww.so.com%2Fs%3Fie%3Dutf-8%26shb%3D1%26src%3Dhome_so.com%26q%3Dpython%2B%25E8%25B1%2586%25E7%2593%25A3%25E6%25BA%2590%22%5D; _gat_UA-7019765-1=1; ap=1; __utmt=1; _ga=GA1.2.1329310863.1477654711; dbcl2="2625855:/V89oXS4WD4"; ck=EePo; push_noty_num=0; push_doumail_num=0; _pk_id.100001.8cb4=40c3cee75022c8e1.1477654710.8.1482652441.1482639716.; _pk_ses.100001.8cb4=*; __utma=30149280.1329310863.1477654711.1482643456.1482650885.10; __utmb=30149280.19.10.1482650885; __utmc=30149280; __utmz=30149280.1482511651.7.6.utmcsr=blog.csdn.net|utmccn=(referral)|utmcmd=referral|utmcct=/alanzjl/article/details/50681289; __utmv=30149280.262; _vwo_uuid_v2=64E0E442544CB2FE2D322C59F01F1115|026be912d24071903cb0ed891ae9af65'

# Send the raw string as the Cookie header. The cookies= parameter of
# requests.get expects one dict entry per cookie name, not the whole string.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Cookie': raw_cookie,
}
url = 'http://www.douban.com'
r = requests.get(url, headers=headers)

# Save the raw response body so the logged-in page can be inspected later.
with open('douban_2.txt', 'wb+') as f:
    f.write(r.content)
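If you prefer to pass the cookies through the `cookies=` parameter of `requests.get`, the raw browser string first has to be split into one entry per cookie name. The following is only a minimal sketch: `parse_cookie_header` is a hypothetical helper, not part of requests, and the sample string is made up.

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header string into a {name: value} dict.

    Hypothetical helper for illustration; it is not part of requests.
    """
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        # Skip empty chunks and anything that is not a name=value pair.
        if not pair or '=' not in pair:
            continue
        # Split on the first '=' only, because values may contain '='.
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up sample string, shaped like what a browser shows for a Cookie header.
sample = 'bid=abc123; ps=y; utmz=1.2.utmcsr=example.com'
cookies = parse_cookie_header(sample)
print(cookies)
```

The resulting dict could then be passed as `requests.get(url, cookies=cookies, headers=headers)`.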
2. Searching with XPath
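
import requests
from lxml import etree

s = requests.Session()
# If Douban rejects the default client, set a browser User-Agent on the
# session, like the headers used in section 1.

# Each page lists 25 films, so start = 0, 25, ..., 225 covers all 250.
for start in range(0, 250, 25):
    print(start)
    url = 'https://movie.douban.com/top250?start=' + str(start)
    r = s.get(url)
    r.encoding = 'utf-8'
    root = etree.HTML(r.text)
    # Use XPath tag selection to pick out every film entry on the page.
    items = root.xpath('//ol/li/div[@class="item"]')
    # print(len(items))
    for item in items:
        # Collect the title spans; the first one is the Chinese name.
        title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
        # title is a list. Encoding to gb2312 with 'ignore' and decoding again
        # drops any character GB2312 cannot represent, so the printed text
        # is not garbled.
        name = title[0].encode('gb2312', 'ignore').decode('gb2312')
        # rank = item.xpath('./div[@class="pic"]/em/text()')[0]
        rating = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
        print(name, rating)
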
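Result: the ratings of the top 250 films are scraped successfully.
P.S. You have to know the structure of the page before you can write the XPath expressions.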
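
To see how the XPath expressions depend on the page structure, it helps to try them on a small fragment offline. The sketch below is an illustration under several assumptions: the fragment is made up (placeholder titles and ratings) and only mimics the nesting the expressions above rely on, and it uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages. ElementTree supports only a subset of XPath, so `/text()` is replaced by reading the `.text` attribute. Real pages are rarely well-formed XML, so lxml's `etree.HTML` remains the tool for the live site.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that mimics the assumed nesting:
# ol > li > div.item > div.info > (a > span.title, div.bd > span.rating_num)
SAMPLE = """
<ol>
  <li>
    <div class="item">
      <div class="pic"><em>1</em></div>
      <div class="info">
        <div class="hd"><a href="#"><span class="title">Film A</span><span class="title"> / Alt A</span></a></div>
        <div class="bd"><div class="star"><span class="rating_num">9.0</span></div></div>
      </div>
    </div>
  </li>
  <li>
    <div class="item">
      <div class="pic"><em>2</em></div>
      <div class="info">
        <div class="hd"><a href="#"><span class="title">Film B</span></a></div>
        <div class="bd"><div class="star"><span class="rating_num">8.5</span></div></div>
      </div>
    </div>
  </li>
</ol>
"""


def extract(fragment):
    """Return a list of (title, rating) tuples from the fragment."""
    root = ET.fromstring(fragment)  # root is the <ol> element
    rows = []
    for item in root.findall("./li/div[@class='item']"):
        # find() returns the first match, i.e. the first title span.
        title = item.find("./div[@class='info']//a/span[@class='title']")
        rating = item.find(".//div[@class='bd']//span[@class='rating_num']")
        rows.append((title.text, rating.text))
    return rows


for name, rating in extract(SAMPLE):
    print(name, rating)
```

If the site changes any class name or nesting level, the matching expression returns nothing, which is why the structure has to be checked in the browser first.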