Python Web Crawling for Beginners, Part 1
Source: Internet · Editor: 程序博客网 · Date: 2024/05/21 15:40
I. Three libraries that ship with Python
Basic but powerful: urllib, urllib2, and cookielib (note: these are the Python 2 names).
Here are some simple scraping snippets.
```python
# Fetch a static page
import urllib, urllib2

url = "http://www.baidu.com/s"
data = {'wd': 'Katherine'}
data = urllib.urlencode(data)   # encode: dict -> query string
full_url = url + '?' + data     # send as a GET request
response = urllib2.urlopen(full_url)
print response.read()
```
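The snippet above is Python 2 only. For reference, here is a sketch of the same GET request in Python 3, where urllib2 was merged into urllib.request and urlencode moved to urllib.parse:

```python
# Python 3 version of the GET example above
from urllib import parse, request

url = "http://www.baidu.com/s"
data = {'wd': 'Katherine'}
query = parse.urlencode(data)   # dict -> "wd=Katherine"
full_url = url + '?' + query    # URL for the GET request
print(full_url)                 # -> http://www.baidu.com/s?wd=Katherine
# To actually fetch the page (needs network access):
# response = request.urlopen(full_url)
# print(response.read().decode('utf-8'))
```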
```python
# Log in (no CAPTCHA) with a POST request, using douban as the example
# The shape of `data` comes from the Form Data panel shown in the
# browser's developer tools for the page being posted to
import urllib, urllib2

url = "http://www.douban.com"
data = {
    'form_email': 'xxxx',
    'form_password': 'xxxx',
}
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)
print response.read()
```
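A hedged Python 3 sketch of the same POST login: the request body must be bytes, and a Request constructed with a data argument defaults to the POST method.

```python
# Python 3 version of the POST example above
from urllib import parse, request

url = "http://www.douban.com"
data = {'form_email': 'xxxx', 'form_password': 'xxxx'}
body = parse.urlencode(data).encode('utf-8')  # POST body must be bytes
req = request.Request(url=url, data=body)
print(req.get_method())  # -> POST
# response = request.urlopen(req)  # needs network access
```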
```python
# Use cookies to avoid logging in every time
import urllib, urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
opener.open("https://www.douban.com")
```
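In Python 3 the same cookie-carrying opener is built with http.cookiejar (the renamed cookielib) and urllib.request; a minimal sketch:

```python
# Python 3 version of the cookie example above
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(jar)
opener = urllib.request.build_opener(cookie_support)
# opener.open("https://www.douban.com")  # cookies are stored in `jar`
print(len(jar))  # -> 0 (no request has been made yet)
```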
II. The Scrapy framework
Scrapy (/ˈskreɪpi/ skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.
Scrapy project architecture is built around ‘spiders’, which are self-contained crawlers which are given a set of instructions. Following the spirit of other don’t repeat yourself frameworks, such as Django,[3] it makes it easier to build and scale large crawling projects by allowing developers to re-use their code. Scrapy also provides a web crawling shell which can be used by developers to test their assumptions on a site’s behavior.[4]
Some well-known companies and products using Scrapy are: Lyst,[5] CareerBuilder,[6] Parse.ly,[7] Sciences Po Medialab,[8] Data.gov.uk’s World Government Data site.[9]
A ready-made Scrapy example: scraping data from Tencent's recruitment pages.
Official documentation: http://doc.scrapy.org/en/0.20/
The example code comes from:
http://blog.csdn.net/HanTangSongMing/article/details/24454453
In this post I will walk through the example in detail and show how to download and modify the code (many readers of the original blog ran into runtime errors, etc.).
1. Downloading and modifying the code
If Scrapy is not installed yet, run this on the command line:
pip install scrapy
Then, on the command line, in the folder where you want the project to live, run the following to download the code:
git clone https://github.com/maxliaops/scrapy-itzhaopin.git
This creates a new folder named scrapy-itzhaopin (what the files inside are for, and how they get generated, is covered in detail below).
Find the file tencent_spider.py at the following path:
scrapy-itzhaopin/itzhaopin/itzhaopin/spiders/tencent_spider.py
Open it and replace its contents with the following snippet:
```python
import re
import json
from scrapy.selector import Selector
try:
    from scrapy.spiders import Spider
except ImportError:
    # older Scrapy versions kept the base class in scrapy.spider
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle

from itzhaopin.items import *
from itzhaopin.misc.log import *


class TencentSpider(CrawlSpider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php"
    ]
    rules = [
        Rule(sle(allow=("/position.php\?&start=\d{,4}#a")),
             follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        items = []
        sel = Selector(response)
        base_url = get_base_url(response)

        sites_even = sel.css('table.tablelist tr.even')
        for site in sites_even:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()[0]
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()[0]
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()[0]
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()[0]
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()[0]
            items.append(item)
            # print repr(item).decode("unicode-escape") + '\n'

        sites_odd = sel.css('table.tablelist tr.odd')
        for site in sites_odd:
            item = TencentItem()
            item['name'] = site.css('.l.square a').xpath('text()').extract()[0]
            relative_url = site.css('.l.square a').xpath('@href').extract()[0]
            item['detailLink'] = urljoin_rfc(base_url, relative_url)
            item['catalog'] = site.css('tr > td:nth-child(2)::text').extract()[0]
            item['workLocation'] = site.css('tr > td:nth-child(4)::text').extract()[0]
            item['recruitNumber'] = site.css('tr > td:nth-child(3)::text').extract()[0]
            item['publishTime'] = site.css('tr > td:nth-child(5)::text').extract()[0]
            items.append(item)
            # print repr(item).decode("unicode-escape") + '\n'

        info('parsed ' + str(response))
        return items

    def _process_request(self, request):
        info('process ' + str(request))
        return request
```
Then run, on the command line:
scrapy crawl tencent
This runs the framework's spider and stores the scraped data in the file tencent.json under the spiders folder.
2. Walking through the example
1 Goal: scrape the job postings from Tencent's recruitment site and save them as JSON
http://hr.tencent.com/position.php
2 Steps
1) create a project
Create a new project folder and inside it run:
scrapy startproject itzhaopin
This creates a new directory itzhaopin under the current directory, with the following layout:
```
├── itzhaopin
│   ├── itzhaopin
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg
```
scrapy.cfg: the project configuration file (you can leave it alone)
settings.py: the crawler settings file (you need to add the pipeline registration here; pipelines.py contains the corresponding comments and a link to the official docs)
items.py: defines the data structures to be extracted (we write this ourselves)
pipelines.py: pipeline definitions, used to further process the data extracted into items, e.g. to save it (we write this ourselves)
spiders/: the core spider code
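To make the pipeline's role concrete, here is a minimal sketch of the kind of code pipelines.py would contain; the class name and output path are illustrative assumptions, not the repository's exact code:

```python
# Illustrative Scrapy item pipeline: append each scraped item to a
# JSON-lines file. Class name and file path are assumptions.
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('tencent.json', 'w')

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy Items and plain dicts
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```

It would then be registered in settings.py via ITEM_PIPELINES (the dotted path here is assumed): `ITEM_PIPELINES = {'itzhaopin.pipelines.JsonWriterPipeline': 300}`.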
2) declare items