Scraping Douban's TOP250 Movies with Scrapy
Source: Internet · Editor: 程序博客网 · Date: 2024/04/30 03:19
1. Approach
1.1 Analyzing the relationship between pages
The red box in the screenshot above marks the URL of the first page.
URL of page 1: https://movie.douban.com/top250?start=0
URL of page 2: https://movie.douban.com/top250?start=25
…
URL of page 10: https://movie.douban.com/top250?start=225
The pattern is clear: each page shows 25 movies, and the start parameter increases by 25 from one page to the next.
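The ten page URLs can therefore be generated from a single formula; a quick sanity check in plain Python:

```python
# Each page shows 25 movies, so page i (0-based) starts at offset i * 25.
urls = ["https://movie.douban.com/top250?start=%d" % (i * 25) for i in range(10)]

print(urls[0])   # https://movie.douban.com/top250?start=0
print(urls[-1])  # https://movie.douban.com/top250?start=225
```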
1.2 Locating content on the page
Since we are using the Scrapy framework, elements can be located with XPath expressions.
The FireFinder add-on for Firefox is a convenient way to test XPath expressions and quickly pin down the elements you want to extract.
2. Creating the project and writing the spider
Create a project named douban:
scrapy startproject douban
Enter the douban directory and generate a spider named film:
scrapy genspider -t basic film movie.douban.com
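After these two commands, the generated project layout typically looks like this (middlewares.py appears in newer Scrapy versions; exact contents depend on your Scrapy release):

```
douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── film.py
```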
items.py is as follows:

import scrapy

class DoubanItem(scrapy.Item):
    rank = scrapy.Field()
    title = scrapy.Field()
    dr = scrapy.Field()
    act = scrapy.Field()
    ty = scrapy.Field()
    yr = scrapy.Field()
    con = scrapy.Field()
    des = scrapy.Field()
    score = scrapy.Field()
    link = scrapy.Field()
    peo = scrapy.Field()
film.py is as follows:

import scrapy
from douban.items import DoubanItem
from scrapy.http import Request

class MovieSpider(scrapy.Spider):
    """Scrape Douban's top 250 movies, collecting:
    rank (ranking), title, dr (director), act (actors), ty (genre),
    score, peo (number of raters), yr (release year), con (country),
    link (Douban URL)
    """
    name = "movie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        # Generate the ten list pages directly from the start=0,25,...,225 pattern
        for i in range(10):
            url = "https://movie.douban.com/top250?start=%s" % str(i * 25)
            yield Request(url=url, callback=self.film_detail)

    def film_detail(self, response):
        rank = response.xpath('//ol[@class="grid_view"]/li/div/div/em/text()').extract()
        namelst = response.xpath('//ol[@class="grid_view"]/li//div[@class="info"]//a//span[@class="title"][1]//text()').extract()
        score = response.xpath('//ol[@class="grid_view"]/li//div[@class="star"]//span[2]//text()').extract()
        peo = response.xpath('//ol[@class="grid_view"]/li//div[@class="star"]//span[4]//text()').extract()
        link = response.xpath('//ol[@class="grid_view"]/li/div/div/a/@href').extract()
        lsts = response.xpath('//ol[@class="grid_view"]/li//div[@class="bd"]//p[1]/text()').extract()
        lsts = [lst.strip() for lst in lsts]
        dr_act = lsts[::2]      # odd-position entries: the director/actors line
        yr_con_ty = lsts[1::2]  # even-position entries: the year/country/genre line
        dr_act = [d.split('\xa0\xa0\xa0') for d in dr_act]
        dr = [dr[0] for dr in dr_act]
        act = [act[1] for act in dr_act]
        yr_con_ty = [d.split('\xa0/\xa0') for d in yr_con_ty]
        yr = [yr[0] for yr in yr_con_ty]
        con = [con[1] for con in yr_con_ty]
        ty = [ty[2] for ty in yr_con_ty]
        for i in range(len(rank)):
            # Create a fresh item per movie rather than reusing one instance
            item = DoubanItem()
            item['rank'] = rank[i]
            item['title'] = namelst[i]
            item['score'] = score[i]
            item['link'] = link[i]
            item['dr'] = dr[i]
            item['act'] = act[i]
            item['yr'] = yr[i]
            item['con'] = con[i]
            item['ty'] = ty[i]
            item['peo'] = peo[i]
            yield item
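The trickiest part of film_detail is untangling lsts, which interleaves each movie's director/actors line with its year/country/genre line. A standalone sketch with hypothetical sample strings (the \xa0 characters are the non-breaking spaces Douban uses as separators):

```python
# Two entries per movie, interleaved, as returned by the p[1]/text() XPath
lsts = [
    '导演: 弗兰克·德拉邦特\xa0\xa0\xa0主演: 蒂姆·罗宾斯',  # movie 1: director/actors
    '1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情',                  # movie 1: year/country/genre
]
dr_act = lsts[::2]       # odd-position entries
yr_con_ty = lsts[1::2]   # even-position entries

dr, act = dr_act[0].split('\xa0\xa0\xa0')      # split on triple non-breaking space
yr, con, ty = yr_con_ty[0].split('\xa0/\xa0')  # split on " / " with NBSPs
print(dr)  # 导演: 弗兰克·德拉邦特
print(ty)  # 犯罪 剧情
```

Note that this splitting assumes every entry contains all the separators; if a credits line on the real page is missing one (a few entries are), the indexing raises an exception, so the spider quietly depends on Douban's markup staying uniform.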
pipelines.py is as follows:

import pymysql.cursors

class DoubanPipeline(object):
    """Write each item to a local CSV file and to a MySQL database."""

    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                                    charset='utf8', cursorclass=pymysql.cursors.DictCursor)
        cur = self.conn.cursor()
        # IF NOT EXISTS keeps repeated runs from failing on an existing schema
        cur.execute("create database if not exists douban")
        cur.execute("use douban")
        # `rank` is a reserved word in MySQL 8.0+, so it is backquoted
        cur.execute("create table if not exists film("
                    "id INT PRIMARY KEY AUTO_INCREMENT, `rank` INT, "
                    "title VARCHAR(200), score FLOAT, link VARCHAR(50), "
                    "dr VARCHAR(200), act VARCHAR(200), yr VARCHAR(200), "
                    "con VARCHAR(200), ty VARCHAR(200), peo VARCHAR(50))")

    def process_item(self, item, spider):
        rank = item['rank']
        title = item['title']
        score = item['score']
        peo = item['peo']
        link = item['link']
        dr = item['dr']
        act = item['act']
        yr = item['yr']
        con = item['con']
        ty = item['ty']
        with open('film.csv', 'a+', encoding='utf-8') as f:
            f.write(rank + ',')
            f.write(title + ',')
            f.write(score + ',')
            f.write(dr + ',')
            f.write(act + ',')
            f.write(yr + ',')
            f.write(con + ',')
            f.write(ty + ',')
            f.write(peo + ',')
            f.write(link + '\n')
        try:
            cur = self.conn.cursor()
            sql = ("insert into film(`rank`, title, score, peo, link, dr, act, yr, con, ty) "
                   "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
            cur.execute(sql, (rank, title, score, peo, link, dr, act, yr, con, ty))
            self.conn.commit()
        except Exception as err:
            print(err)
            print(rank, title)
        # Return the item even on DB failure so later pipelines still receive it
        return item
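One step the post does not show: the pipeline only runs if it is registered in the project's settings.py. A typical entry (the value 300 is an arbitrary priority in the 0–1000 range; lower runs first):

```python
# settings.py — enable the pipeline so Scrapy calls process_item for each item
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```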
3. Results
The local CSV file:
The MySQL database: