Python Web Scraping


Using urllib and urllib2 together

1. Sending form data via POST

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'

values = {'username': 'why222',
          'password': 'test',
          'passconf': 'test',
          'email': 'test@test.com',
          }

# Encode the form fields into application/x-www-form-urlencoded format
data = urllib.urlencode(values)

# Passing a data argument makes urllib2 send a POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)

the_page = response.read()
print the_page
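For reference, besides the body, the response object returned by urlopen also exposes the status code, headers, and final URL through its standard accessors:

print response.getcode()   # HTTP status code, e.g. 200
print response.info()      # response headers
print response.geturl()    # final URL after any redirects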



2. Connecting via GET: build the full URL first, with the encoded parameters appended as a query string, then fetch it with urllib2's urlopen.
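A minimal sketch of the GET version, reusing the URL and form fields from the POST example above:

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
values = {'username': 'why222', 'password': 'test'}

# For GET, the encoded parameters go into the URL itself
full_url = url + '?' + urllib.urlencode(values)
response = urllib2.urlopen(full_url)
print response.read()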

3. Faking headers. Note that some headers have default values, which must be overridden.

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0 FirePHP/0.7.4'
values = {'username': 'WHYaaa',
          'password': 'SDU',
          'passconf': 'SDU',
          'email': 'Python@qq.com'}

# Custom headers override urllib2's defaults (e.g. its Python-urllib User-Agent)
headers = {'User-Agent': user_agent,
           'Host': 'localhost'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)

response = urllib2.urlopen(req)
the_page = response.read()
print the_page

When the host cannot be resolved, this fails with a "getaddrinfo failed" error.
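That failure surfaces as a urllib2.URLError wrapping the underlying socket error, so it can be caught explicitly; a minimal sketch:

import urllib2

try:
    response = urllib2.urlopen('http://no-such-host.invalid/')
except urllib2.URLError as e:
    # e.reason holds the underlying cause, e.g. a socket.gaierror
    # with "getaddrinfo failed" on an unresolvable host
    print 'Request failed:', e.reason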


Simple sequential fetching of a list of pages:

import string, urllib2


def baidu_tieba(url, begin_page, end_page):
    # Fetch each page of the thread and save it to a zero-padded .html file
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'
        print 'Fetching page ' + str(i) + ' -> ' + sName + ' ......'
        f = open(sName, 'w')
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()


bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
iPostBegin = 1
iPostEnd = 10

baidu_tieba(bdurl, iPostBegin, iPostEnd)


opener example:


import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
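A related call worth knowing: urllib2.install_opener registers an opener as the global default, so subsequent plain urlopen calls pick up its headers too. A minimal sketch:

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Install the opener globally; plain urlopen now uses its headers as well
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/')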
References:

http://docs.python.org/2/library/urllib2.html

http://blog.csdn.net/column/details/why-bug.html



Use Scrapy as the crawler framework; adding WebKit support makes it possible to crawl dynamic (JavaScript-rendered) pages.
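A minimal Scrapy spider sketch; the spider name and the CSS selector are illustrative, not from the original post, and the start URL reuses the Tieba thread from the example above:

import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    start_urls = ['http://tieba.baidu.com/p/2296017831?pn=1']

    def parse(self, response):
        # Extract the page title as a trivial example item
        yield {'title': response.css('title::text').extract_first()}

Saved to a file, it can be run without a full project via: scrapy runspider tieba_spider.py -o out.json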



The Rhino JavaScript engine is another option for executing page scripts.
