Common Problems When Writing Python Web Scrapers

Source: Internet  Published by: 数据库查找重复数据  Editor: 程序博客网  Time: 2024/05/22 09:51
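1. Downloading a file from a URL
# Signature: urlretrieve(url, filename=None, reporthook=None, data=None)
from urllib.request import urlretrieve  # Python 3; in Python 2 the function was urllib.urlretrieve
urlretrieve(url, filename)
Example: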
#!/usr/bin/python
# encoding: utf-8
import os
import urllib.request

def schedule(a, b, c):
    '''
    Progress callback passed to urlretrieve as the reporthook.
    a: number of data blocks downloaded so far
    b: size of one data block, in bytes
    c: total size of the remote file, in bytes
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)

url = 'http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2'
# local = url.split('/')[-1]
local = os.path.join('/data/software', 'Python-2.7.5.tar.bz2')
urllib.request.urlretrieve(url, local, schedule)
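The callback above divides by the remote file size. urlretrieve can pass -1 as that third argument when the server does not report a size, which makes the percentage meaningless. The following is a minimal sketch of a more defensive callback; the names `percent_done` and `report` are hypothetical and do not come from the original post.

```python
def percent_done(block_count, block_size, total_size):
    """Return progress as a percentage in [0, 100], or None if the size is unknown."""
    if total_size <= 0:
        # The server did not report a usable file size
        return None
    return min(100.0, 100.0 * block_count * block_size / total_size)


def report(block_count, block_size, total_size):
    """Reporthook-compatible callback that prints the current progress."""
    per = percent_done(block_count, block_size, total_size)
    print('unknown size' if per is None else '%.2f%%' % per)
```

`report` takes the same three arguments as the callback above, so it could be passed to urlretrieve in its place.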
2. Uploading a file

import requests
# Replace "url" with the address of the upload endpoint
files = {'uploadFile': open('log.jpg', 'rb')}
r = requests.post("url", files=files)
print(r.text)

3. Saving the login cookies

import requests

params = {'username': 'anything', 'password': 'password'}
# Log in; the response carries the cookies set by the server
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
# r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", cookies=r.cookies)
# Send those cookies back by hand with the next request
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

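4. Using a session to keep the login state
Compared with passing cookies by hand, a Session object tracks the session state for you: it stores the cookies the server sets and keeps them up to date as they change from one request to the next.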

import requests

session = requests.Session()
params = {'username': 'anything', 'password': 'password'}
# Cookies set by the login response are stored on the session
r = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
# r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", cookies=r.cookies)
# No cookies argument needed: the session sends them automatically
r = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(r.text)
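To see the difference between remembering cookies and not remembering them without touching a real site, here is a rough sketch that uses only the standard library instead of requests. It starts a toy local server and fetches it with two urllib openers, one with a cookie jar (playing the role of the session) and one without. `DemoHandler`, `fetch`, `run_demo`, the paths and the cookie name are all made up for this illustration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: /welcome sets a cookie, any other path checks for it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == '/welcome':
            body = b'welcome'
            self.send_header('Set-Cookie', 'loggedin=1; Path=/')
        else:
            has_cookie = 'loggedin=1' in self.headers.get('Cookie', '')
            body = b'profile page' if has_cookie else b'please log in'
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    """Fetch url with the given opener and return the body as text."""
    with opener.open(url) as resp:
        return resp.read().decode()


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly
    try:
        # Opener with a cookie jar: remembers Set-Cookie between requests
        jar = http.cookiejar.CookieJar()
        with_jar = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar))
        fetch(with_jar, base + '/welcome')
        kept = fetch(with_jar, base + '/profile')

        # Plain opener: every request starts without cookies
        plain = urllib.request.build_opener(no_proxy)
        fetch(plain, base + '/welcome')
        lost = fetch(plain, base + '/profile')
    finally:
        server.shutdown()
        server.server_close()
    return kept, lost
```

Calling `run_demo()` returns the two profile responses, so the effect of the cookie jar can be compared directly.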

5. Using HTTP basic access authentication

The requests library has a dedicated module, requests.auth, for HTTP authentication.

import requests
from requests.auth import HTTPBasicAuth

# Credentials are sent with the request in the Authorization header
auth = HTTPBasicAuth('anything', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)
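Basic authentication itself is simple: the client sends an Authorization header whose value is the word Basic followed by the base64 encoding of "username:password". The sketch below builds that header with the standard library to show roughly what gets attached to the request. The helper name `basic_auth_header` is hypothetical, and UTF-8 is assumed for the credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP basic authentication."""
    # Assumption: credentials are encoded as UTF-8 before base64 encoding
    credentials = ('%s:%s' % (username, password)).encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')


header = basic_auth_header('anything', 'password')
print(header)
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password. Basic authentication is only reasonable over HTTPS.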

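6. Using Selenium as a selector

Some pages load or update their data through Ajax a few seconds after the page opens. If you scrape only the initial HTML, that data is missing, so you need a way to read the page as it looks a few seconds later.

Install PhantomJS, a headless browser (one that runs without a graphical window).

Reading the content after the Ajax update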

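from selenium import webdriver
import time

# Needs a Selenium release that still includes the PhantomJS driver (it was removed in Selenium 4)
driver = webdriver.PhantomJS(executable_path=r'D:\Program Files\phantomjs-2.1.1-windows\bin\phantomjs')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # give the Ajax call time to finish
print(driver.find_element_by_id('content').text)
driver.quit()  # closes the browser and stops the PhantomJS process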
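A fixed time.sleep(3) either wastes time or is too short on a slow connection. A common alternative is to poll until the content is actually there. Selenium ships its own helper for this (WebDriverWait in selenium.webdriver.support.ui); the sketch below shows the underlying idea in plain Python so it can run without a browser. The name `wait_until` and its default values are assumptions made for this example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(interval)
```

With the driver from the example above, the condition could be a small function that returns the text of the element once the Ajax call has filled it in, and a falsy value before that.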

Selecting elements with Selenium

①driver.find_element_by_css_selector("#content")

②driver.find_element_by_tag_name("div")

③driver.find_element_by_id('content')

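To get every element on the page that matches, rather than only the first, change element to elements in the method name (for example, find_elements_by_tag_name).

Parsing the page together with BeautifulSoup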

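from bs4 import BeautifulSoup

# Hand the HTML rendered by the browser over to BeautifulSoup
pageSource = driver.page_source

bsObj = BeautifulSoup(pageSource, "html.parser")

print(bsObj.find(id="content").get_text())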
        0        0           
python爬虫常见问题
[读书笔记]python爬虫-scrapy安装过程常见问题及解决方法
python常见问题
python 常见问题
Python---常见问题
python常见问题
python常见问题
Python常见问题
Python常见问题
python爬虫初学（一）——基本代码和常见问题
python爬虫-->爬虫基础
[爬虫] Python爬虫技巧
Python爬虫
python 爬虫
python 爬虫
python 爬虫
python爬虫
Python爬虫
git 创建项目和使用
JSP —— XML 与dom4j 基础使用
HDOJ 2023 求平均成绩
使用HttpURLConnection上传文件，进度条显示不正确
代码执行过程
python爬虫常见问题
Js复制剪切-兼容所有浏览器
Android 6.0以上 需要运行时申请的权限(二)
敌兵布阵（线段树）hdu  1166
NAND FLASH 驱动
性能优化之数据存储&DOM编程
走进数据结构之排序（四）---快速排序
设置textview字体不一样的显示效果
C#中System.DateTime.Now.ToString()用法