Overview of Web Crawlers

Source: Internet | Published by: qq飞车数据 | Editor: 程序博客网 | Time: 2024/06/03 13:46

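1. Overview

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.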

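2. Classification

By system architecture and implementation technique, web crawlers fall roughly into the following types:

1) General-purpose crawler

2) Focused crawler

3) Incremental crawler

4) Deep web crawler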
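The types above differ mainly in which pages they choose to fetch or reprocess. As a rough illustration only, the Python 3 sketch below shows two small decisions a crawler might make: a keyword filter in the spirit of a focused crawler, and a content-hash check in the spirit of an incremental crawler. The names `is_on_topic` and `ChangeDetector` are hypothetical and do not come from any library; real crawlers typically use much richer relevance and freshness signals.

```python
import hashlib


def is_on_topic(text, keywords):
    """Focused-crawler style filter: keep a page if it mentions any keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


class ChangeDetector:
    """Incremental-crawler style check: remember one content hash per URL."""

    def __init__(self):
        self._hashes = {}  # url -> SHA-256 hex digest of the last seen body

    def is_new_or_changed(self, url, content):
        """Return True if the URL is new or its body (bytes) differs from last time."""
        digest = hashlib.sha256(content).hexdigest()
        if self._hashes.get(url) == digest:
            return False
        self._hashes[url] = digest
        return True


detector = ChangeDetector()
print(detector.is_new_or_changed("http://example.com/", b"v1"))  # first visit
print(detector.is_new_or_changed("http://example.com/", b"v1"))  # unchanged body
print(is_on_topic("A short web crawler tutorial", ["crawler"]))
```

A general-purpose crawler would skip the keyword filter entirely and follow every link it finds.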

3. Basic Structure

1) URL manager

2) HTML downloader

3) HTML parser

4) Data store

5) Crawler scheduler
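One way the five components above could fit together is sketched below in Python 3. This is a minimal illustration under stated assumptions, not a reference design: the class names, the `crawl` function, and the in-memory `FAKE_WEB` dictionary (which stands in for real network access so the sketch runs offline) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.com/b": '<title>Page B</title><a href="/a">A</a>',
}


class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already seen."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.popleft()


class HtmlDownloader:
    """Returns the HTML for a URL; here it reads FAKE_WEB instead of the network."""

    def download(self, url):
        return FAKE_WEB.get(url)


class _LinkTitleParser(HTMLParser):
    """Collects href values of <a> tags and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], "", False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.links.append(dict(attrs)["href"])
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class HtmlParser:
    """Extracts the page title and absolute outgoing links."""

    def parse(self, url, html):
        parser = _LinkTitleParser()
        parser.feed(html)
        parser.close()
        return parser.title, [urljoin(url, href) for href in parser.links]


class DataStore:
    """Keeps extracted data; a dict here, a file or database in practice."""

    def __init__(self):
        self.records = {}

    def save(self, url, title):
        self.records[url] = title


def crawl(seed, max_pages=10):
    """Scheduler: loops manager -> downloader -> parser -> store until done."""
    urls, downloader, parser, store = UrlManager(), HtmlDownloader(), HtmlParser(), DataStore()
    urls.add(seed)
    while urls.has_next() and len(store.records) < max_pages:
        url = urls.next()
        html = downloader.download(url)
        if html is None:  # download failed or page unknown
            continue
        title, links = parser.parse(url, html)
        store.save(url, title)
        for link in links:
            urls.add(link)
    return store.records


print(crawl("http://example.com/"))
```

Swapping `HtmlDownloader.download` for a real HTTP call would turn this into a working, if naive, breadth-first crawler; the other components would not need to change.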

4. Implementing HTTP Requests in Python

1) Using urllib2/urllib

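GET:

# Python 2 only: urllib2 does not exist in Python 3,
# where its functionality lives in urllib.request and urllib.error.
import urllib2

# Open the URL and read the whole response body.
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print(html)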

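POST:

# Python 2 only (urllib.urlencode and urllib2).
import urllib
import urllib2

url = 'http://www.zhihu.com'
postdata = {'username': 'u',
            'password': 'p'}
# URL-encode the form fields; passing data to Request makes it a POST.
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()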
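The urllib2 examples above run only on Python 2. A rough Python 3 equivalent, using the standard library's `urllib.request` and `urllib.parse`, might look like the sketch below. The helper names `build_post_request` and `fetch` are hypothetical, chosen here for illustration. The request is built and inspected but not sent, so the sketch runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_post_request(url, fields):
    """Build (but do not send) a form-encoded POST request."""
    # In Python 3 the request body must be bytes, so encode the query string.
    data = urlencode(fields).encode("utf-8")
    # Supplying data makes urllib use POST instead of GET.
    return Request(url, data=data)


def fetch(req, timeout=10):
    """Send a Request (or a plain URL string) and return the raw body as bytes."""
    with urlopen(req, timeout=timeout) as response:
        return response.read()


req = build_post_request("http://www.zhihu.com", {"username": "u", "password": "p"})
print(req.get_method(), req.full_url, req.data)

# To actually send it (requires network access):
# html = fetch(req)                       # POST
# html = fetch("http://www.zhihu.com")    # GET
```

The `timeout` argument is an addition not present in the Python 2 examples; without it, a stalled server can block the call indefinitely.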

2) Using the third-party requests library

GET:

import requests

r = requests.get('http://www.zhihu.com')
# r.content is the raw response body as bytes.
print(r.content)

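POST:

import requests

# Form fields to send in the request body.
postdata = {'username': 'u', 'password': 'p'}
r = requests.post('http://zhihu.com', data=postdata)
print(r.content)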
