python+selenium+phantomjs pitfalls

Source: Internet  Published by: 聪明坏处 知乎  Editor: 程序博客网  Time: 2024/05/17 08:53

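When writing a crawler I ran into pages that timed out while loading. Selenium offers several ways to wait for a page or an element: WebDriverWait(), implicitly_wait() and a plain sleep. Below I compare their pros and cons: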

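WebDriverWait(): an explicit wait with a timeout for finding an element
For up to the configured timeout, WebDriverWait() checks at a fixed interval (0.5 seconds by default) whether the specified element is present on the current page. If the element still cannot be detected when the timeout expires, it raises an exception.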

Usage: WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located((By.ID, "kw")))

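until(method, message):

  • method: the callable to evaluate. During the wait it is called once every poll_frequency seconds, and the calls stop only when it returns a truthy value (anything other than False) or the timeout is exceeded.
  • message: if the wait times out, a TimeoutException is raised and message is passed to that exception.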

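Note: until_not is the exact opposite of until. until continues once an element appears or a condition becomes true; until_not continues once an element disappears or a condition stops being true. The parameters are the same.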

WebDriverWait() parameters:

    WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None)
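  • driver: the WebDriver instance
  • timeout: the maximum time to wait, in seconds
  • poll_frequency: how long to sleep between calls to the method passed to until or until_not (the step size); the default is 0.5 seconds
  • ignored_exceptions: a tuple of exceptions to ignore. If calling the method in until or until_not raises an exception in this tuple, the code is not interrupted and the wait continues; an exception outside the tuple interrupts the code and is raised. By default the tuple contains only NoSuchElementException.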
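The parameters above can be modeled with a small stand-alone class. This is only a sketch of the polling behaviour described here, written for Python 3, and not Selenium's actual source: `SimpleWait` and `ElementMissing` are hypothetical names, the built-in `TimeoutError` stands in for Selenium's TimeoutException, and the "driver" is a plain dict.

```python
import time


class ElementMissing(Exception):
    """Hypothetical stand-in for selenium's NoSuchElementException."""


class SimpleWait:
    """Simplified model of WebDriverWait; an illustration, not the real class."""

    def __init__(self, driver, timeout, poll_frequency=0.5,
                 ignored_exceptions=(ElementMissing,)):
        self.driver = driver
        self.timeout = timeout
        self.poll = poll_frequency
        self.ignored = tuple(ignored_exceptions)

    def until(self, method, message=""):
        """Call method(driver) until it returns a truthy value or time runs out."""
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                value = method(self.driver)
                if value:
                    return value
            except self.ignored:
                pass  # an ignored exception does not stop the wait
            if time.monotonic() >= deadline:
                raise TimeoutError(message)
            time.sleep(self.poll)

    def until_not(self, method, message=""):
        """Call method(driver) until it returns a falsy value or time runs out."""
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                value = method(self.driver)
                if not value:
                    return value
            except self.ignored:
                return True  # "element not found" counts as the condition being gone
            if time.monotonic() >= deadline:
                raise TimeoutError(message)
            time.sleep(self.poll)


# Usage: the fake page only "contains" the element from the third poll onwards.
page = {"calls": 0}


def find_kw(driver):
    driver["calls"] += 1
    if driver["calls"] < 3:
        raise ElementMissing("kw")
    return "element-kw"


found = SimpleWait(page, timeout=2, poll_frequency=0.01).until(find_kw)
print(found, page["calls"])
```

An exception that is not in `ignored_exceptions` (a `ValueError`, say) is not caught by the loop and propagates to the caller at once, which matches the behaviour described for the last parameter.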

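When you use presence_of_element_located() to check that an element exists or has finished loading, note that it takes a single tuple argument, not two separate arguments. The following call is wrong: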

    In [12]: WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(By.ID, "kw"))
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-12-84622169566b> in <module>()
    ----> 1 WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located(By.ID, "kw"))

    TypeError: __init__() takes exactly 2 arguments (3 given)

It needs exactly one argument, and that argument must be a tuple. The correct form is:

    In [13]: WebDriverWait(driver, 20, 0.5).until(EC.presence_of_element_located((By.ID, "kw")))
    Out[13]: <selenium.webdriver.remote.webelement.WebElement (session="b0f52d40-582d-11e7-9ffc-7d9ee5fd2752", element=":wdc:1498233922094")>
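The reason for the TypeError is easier to see on a toy condition class. The sketch below is an assumption-laden illustration in Python 3: `PresenceOfElement` is a hypothetical stand-in for EC.presence_of_element_located, and the "driver" is a plain dict keyed by locator tuples. The constructor accepts one locator argument, so passing the strategy and the value separately supplies one argument too many.

```python
class PresenceOfElement:
    """Hypothetical stand-in for EC.presence_of_element_located."""

    def __init__(self, locator):
        # locator is ONE argument: a (by, value) tuple
        self.by, self.value = locator

    def __call__(self, driver):
        # The fake driver is a dict mapping (by, value) to an element
        return driver.get((self.by, self.value), False)


fake_page = {("id", "kw"): "<input id='kw'>"}

# Wrong: two separate arguments
try:
    PresenceOfElement("id", "kw")
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Right: a single tuple
condition = PresenceOfElement(("id", "kw"))
result = condition(fake_page)
print(result)
```

The doubled parentheses in the correct call are therefore not a typo: the outer pair belongs to the call, the inner pair builds the tuple.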

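implicitly_wait(timeout): implicit wait

An implicit wait tells WebDriver to keep looking for up to the given time when an element is not found because it is not immediately available. The default wait is 0 seconds. Once set, the implicit wait stays in effect for the lifetime of that WebDriver instance.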
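A toy model can make the "lifetime of the instance" point concrete. The sketch below is Python 3 and purely illustrative: `FakeDriver` is a hypothetical class, `LookupError` stands in for NoSuchElementException, and the polling loop is an assumption about how such a wait could work, not a description of any real driver's internals.

```python
import time


class FakeDriver:
    """Hypothetical driver whose elements become available after a delay."""

    def __init__(self, appear_after):
        now = time.monotonic()
        self._ready_at = {name: now + delay for name, delay in appear_after.items()}
        self._implicit = 0.0  # default: no implicit wait

    def implicitly_wait(self, seconds):
        # Set once; every later find_element call on this instance uses it
        self._implicit = seconds

    def find_element(self, element_id):
        deadline = time.monotonic() + self._implicit
        while True:
            ready = self._ready_at.get(element_id)
            if ready is not None and time.monotonic() >= ready:
                return element_id
            if time.monotonic() >= deadline:
                raise LookupError("no such element: " + element_id)
            time.sleep(0.01)


driver = FakeDriver({"su": 0.05})

# With the default of 0 seconds the lookup fails straight away
try:
    driver.find_element("su")
    first = "found"
except LookupError:
    first = "missing"

# After setting the implicit wait, the same lookup succeeds
driver.implicitly_wait(0.3)
second = driver.find_element("su")
print(first, second)
```

Every later `find_element` call on the same `driver` object waits up to 0.3 seconds without any further configuration.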

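sleep: pausing the process

Sometimes we simply put the process to sleep for a few seconds and hope the page has finished loading by then.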

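Comparing the three

The test cases follow. They are Python 2 scripts that drive PhantomJS. To measure the waiting time, the first case looks for a tag that does not exist on the page.

WebDriverWait() test case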

    import datetime
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Start PhantomJS with a custom user agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    user_agent = "Mozilla/5.0WindowsNT6.1WOW64AppleWebKit/535.8KHTML,likeGeckoBeamrise/17.2.0.9Chrome/17.0.939.0Safari/535.8"
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    # driver.implicitly_wait(10)
    start_time = datetime.datetime.now()
    print 'start_time: ', start_time
    driver.get('https://www.baidu.com')
    t = datetime.datetime.now()  # page has been opened
    try:
        # Explicit wait: poll every 0.5 s, for up to 10 s, for a class that is not on the page
        element = WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located((By.CLASS_NAME, "gettell")))
        element.click()
    except Exception, e:
        print e
    end_time = datetime.datetime.now()
    print "Sds", (t - start_time).seconds          # whole seconds spent opening the page
    print "time", (end_time - start_time).seconds  # whole seconds for the entire run
    driver.quit()

Test result:

    E:\usr\Anaconda2\python.exe C:/Users/Administrator/Desktop/ershouche/wait.py
    start_time:  2017-06-24 03:29:41.140000
    Message:
    Screenshot: available via screen

    Sds 0
    time 11
    Process finished with exit code 0

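The whole run took about 11 seconds: opening the page took less than 1 second, and the wait then ran for its full 10-second timeout before raising, because the element never appeared. The next example looks for an element that does exist on the page: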

    import datetime
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Start PhantomJS with a custom user agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    user_agent = "Mozilla/5.0WindowsNT6.1WOW64AppleWebKit/535.8KHTML,likeGeckoBeamrise/17.2.0.9Chrome/17.0.939.0Safari/535.8"
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    # driver.implicitly_wait(10)
    start_time = datetime.datetime.now()
    print 'start_time: ', start_time
    driver.get('https://www.baidu.com')
    t = datetime.datetime.now()  # page has been opened
    try:
        # Explicit wait for an element that is on the page (id "su")
        element = WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located((By.ID, "su")))
        element.click()
    except Exception, e:
        print e
    end_time = datetime.datetime.now()
    print "Sds", (t - start_time).seconds          # whole seconds spent opening the page
    print "time", (end_time - start_time).seconds  # whole seconds for the entire run
    driver.quit()

Test result:

    E:\usr\Anaconda2\python.exe C:/Users/Administrator/Desktop/ershouche/wait.py
    start_time:  2017-06-24 03:42:04.557000
    Sds 0
    time 0
    Process finished with exit code 0

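This result shows that opening the page took less than 1 second and finding the element took less than 1 second, so the whole program finished in under a second. The conclusion so far:

  • WebDriverWait() moves on as soon as the element is found within the maximum time. If the element is not found, it keeps checking at the polling interval until the maximum time is exceeded, and then raises a timeout exception.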
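One detail of the measurements is worth spelling out: the scripts print `timedelta.seconds`, which is a whole number of seconds, so anything shorter than a second shows up as 0. A short Python 3 sketch of the difference follows; the 850 ms duration is an invented value for illustration, not a measurement from the tests.

```python
import datetime

start = datetime.datetime(2017, 6, 24, 3, 42, 4, 557000)
end = start + datetime.timedelta(milliseconds=850)  # invented duration

elapsed = end - start
whole = elapsed.seconds          # integer seconds component only
exact = elapsed.total_seconds()  # full duration as a float

print(whole)  # 0
print(exact)  # 0.85
```

Using `total_seconds()` in the test scripts would show how far below one second each step really was.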

implicitly_wait(timeout) test case

    import datetime
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities

    # Start PhantomJS with a custom user agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    user_agent = "Mozilla/5.0WindowsNT6.1WOW64AppleWebKit/535.8KHTML,likeGeckoBeamrise/17.2.0.9Chrome/17.0.939.0Safari/535.8"
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    start_time = datetime.datetime.now()
    print 'start_time: ', start_time
    driver.implicitly_wait(10)  # implicit wait of up to 10 s for every element lookup
    driver.get('https://www.baidu.com')
    t = datetime.datetime.now()  # page has been opened
    driver.save_screenshot("ssd.png")
    ts = datetime.datetime.now()  # screenshot has been saved
    try:
        driver.find_element_by_id("su").click()  # element exists on the page
    except Exception, e:
        print e
    end_time = datetime.datetime.now()
    print "Sds", (t - start_time).seconds          # whole seconds spent opening the page
    print "time", (end_time - start_time).seconds  # whole seconds for the entire run
    print "s", (ts - start_time).seconds           # whole seconds until the screenshot was saved
    driver.quit()

Test result:

    start_time:  2017-06-24 04:14:36.779000
    Sds 0
    time 0
    s 0

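Opening the page took less than 1 second, finding the element took less than 1 second, and the whole run finished in under a second. This shows that webdriver does not wait at all when the element can be found. Next, look up an element that does not exist on the page: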

    import datetime
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities

    # Start PhantomJS with a custom user agent
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    user_agent = "Mozilla/5.0WindowsNT6.1WOW64AppleWebKit/535.8KHTML,likeGeckoBeamrise/17.2.0.9Chrome/17.0.939.0Safari/535.8"
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    driver = webdriver.PhantomJS(desired_capabilities=dcap)
    start_time = datetime.datetime.now()
    print 'start_time: ', start_time
    driver.implicitly_wait(10)  # implicit wait of up to 10 s for every element lookup
    driver.get('https://www.baidu.com')
    t = datetime.datetime.now()  # page has been opened
    try:
        driver.find_element_by_id("su").click()  # first lookup: element exists
    except Exception, e:
        print e
    ts = datetime.datetime.now()  # first lookup has finished
    try:
        driver.find_element_by_id("sudf").click()  # second lookup: element does not exist
    except Exception, e:
        print e
    end_time = datetime.datetime.now()
    print "Sds", (t - start_time).seconds          # whole seconds spent opening the page
    print "time", (end_time - start_time).seconds  # whole seconds for the entire run
    print "s", (ts - start_time).seconds           # whole seconds until the first lookup finished
    driver.quit()

Test result:

    E:\usr\Anaconda2\python.exe C:/Users/Administrator/Desktop/ershouche/wait.py
    start_time:  2017-06-24 04:18:43.785000
    Message: {"errorMessage":"Unable to find element with id 'sudf'","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"85","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:57585","User-Agent":"Python http auth"},"httpVersion":"1.1","method":"POST","post":"{\"using\": \"id\", \"sessionId\": \"27957130-5851-11e7-a4d7-9740be247d0a\", \"value\": \"sudf\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/27957130-5851-11e7-a4d7-9740be247d0a/element"}}
    Screenshot: available via screen

    Sds 0
    time 10
    s 0
    Process finished with exit code 0

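Here opening the page took less than 1 second and the first lookup took less than 1 second. The second lookup found nothing, so webdriver waited the full 10 seconds before giving up.

The four examples show the following. WebDriverWait() checks repeatedly at the configured interval, continues as soon as the element is found, and raises a timeout exception if it is never found. implicitly_wait(timeout) looks once, and if the element is not there it keeps looking for up to timeout seconds; it continues as soon as the element is found and raises an exception if it is not. The same rule applies to every later element lookup in the program, so the setting works for the whole script and does not have to be written again. sleep, by contrast, is rigid: however long I tell it to sleep, that is how long it sleeps.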

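Putting the examples together: when the page as a whole has not finished loading, implicitly_wait(timeout) is a good fit; when one piece of JS loads slowly, use WebDriverWait(). I do not recommend using sleep to wait for a page or for JS to load.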
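The cost of a fixed sleep compared with polling can be simulated without a browser. The sketch below is Python 3 and rests on stated assumptions: `make_element`, `fixed_sleep` and `polling_wait` are hypothetical helpers, and the element is simulated as becoming ready 0.05 seconds after creation. It illustrates the argument above and is not a benchmark of Selenium itself.

```python
import time


def make_element(delay):
    """Return a check that becomes True once `delay` seconds have passed."""
    ready_at = time.monotonic() + delay
    return lambda: time.monotonic() >= ready_at


def fixed_sleep(is_ready, seconds):
    """Sleep for the full duration, then check once."""
    start = time.monotonic()
    time.sleep(seconds)
    return is_ready(), time.monotonic() - start


def polling_wait(is_ready, timeout, poll=0.01):
    """Check repeatedly and return as soon as the element is ready."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_ready():
            return True, time.monotonic() - start
        time.sleep(poll)
    return is_ready(), time.monotonic() - start


slept_ok, slept_for = fixed_sleep(make_element(0.05), 0.5)
polled_ok, polled_for = polling_wait(make_element(0.05), 0.5)

print("sleep: found=%s after %.2fs" % (slept_ok, slept_for))
print("poll:  found=%s after %.2fs" % (polled_ok, polled_for))
```

Both strategies find the element, but the fixed sleep always pays its full duration, while the polling wait returns shortly after the element becomes ready. A sleep that is too short would miss the element entirely, which is the other half of why it is the least flexible choice.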

If you have questions, join QQ group 526855734.
