Python Web Scraping


Using urllib and urllib2 together

1. Sending form data via POST

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'

values = {'username': 'why222',
          'password': 'test',
          'passconf': 'test',
          'email': 'test@test.com',
          }

# Encode the form fields into application/x-www-form-urlencoded format
data = urllib.urlencode(values)

# Passing a data argument makes urllib2 send a POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)

the_page = response.read()
print the_page
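For reference, besides the body, the response object returned by urlopen also exposes the status code, headers, and final URL through its standard accessors:

print response.getcode()   # HTTP status code, e.g. 200
print response.info()      # response headers
print response.geturl()    # final URL after any redirects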



2. Connecting via GET: build the full URL first, with the encoded parameters appended as a query string, then fetch it with urllib2's urlopen.
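A minimal sketch of the GET version, reusing the URL and form fields from the POST example above:

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'
values = {'username': 'why222', 'password': 'test'}

# For GET, the encoded parameters go into the URL itself
full_url = url + '?' + urllib.urlencode(values)
response = urllib2.urlopen(full_url)
print response.read()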

3. Faking headers. Note that some headers have default values, which must be overridden.

import urllib
import urllib2

url = 'http://localhost/CI-github/index.php/form'

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0 FirePHP/0.7.4'
values = {'username': 'WHYaaa',
          'password': 'SDU',
          'passconf': 'SDU',
          'email': 'Python@qq.com'}

# Custom headers override urllib2's defaults (e.g. its Python-urllib User-Agent)
headers = {'User-Agent': user_agent,
           'Host': 'localhost'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)

response = urllib2.urlopen(req)
the_page = response.read()
print the_page

When the host cannot be resolved, this fails with a "getaddrinfo failed" error.
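That failure surfaces as a urllib2.URLError wrapping the underlying socket error, so it can be caught explicitly; a minimal sketch:

import urllib2

try:
    response = urllib2.urlopen('http://no-such-host.invalid/')
except urllib2.URLError as e:
    # e.reason holds the underlying cause, e.g. a socket.gaierror
    # with "getaddrinfo failed" on an unresolvable host
    print 'Request failed:', e.reason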


Simple sequential fetching of a list of pages:

import string, urllib2


def baidu_tieba(url, begin_page, end_page):
    # Fetch each page of the thread and save it to a zero-padded .html file
    for i in range(begin_page, end_page + 1):
        sName = string.zfill(i, 5) + '.html'
        print 'Fetching page ' + str(i) + ' -> ' + sName + ' ......'
        f = open(sName, 'w')
        m = urllib2.urlopen(url + str(i)).read()
        f.write(m)
        f.close()


bdurl = 'http://tieba.baidu.com/p/2296017831?pn='
iPostBegin = 1
iPostEnd = 10

baidu_tieba(bdurl, iPostBegin, iPostEnd)


opener example:


import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')
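A related call worth knowing: urllib2.install_opener registers an opener as the global default, so subsequent plain urlopen calls pick up its headers too. A minimal sketch:

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Install the opener globally; plain urlopen now uses its headers as well
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/')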
References:

http://docs.python.org/2/library/urllib2.html

http://blog.csdn.net/column/details/why-bug.html



Use Scrapy as the crawler framework; adding WebKit support makes it possible to crawl dynamic (JavaScript-rendered) pages.
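A minimal Scrapy spider sketch; the spider name and the CSS selector are illustrative, not from the original post, and the start URL reuses the Tieba thread from the example above:

import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    start_urls = ['http://tieba.baidu.com/p/2296017831?pn=1']

    def parse(self, response):
        # Extract the page title as a trivial example item
        yield {'title': response.css('title::text').extract_first()}

Saved to a file, it can be run without a full project via: scrapy runspider tieba_spider.py -o out.json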



The Rhino JavaScript engine is another option for executing page scripts.
