Notes on my second Python data-scraping project

Source: Internet | Published by: 杭州淘宝学院 | Editor: 程序博客网 | Time: 2024/05/16 09:25

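On some web pages, right-clicking and choosing "View page source" shows none of the data you are looking for. Why? Because the page content is generated dynamically by JavaScript.

The data does appear, however, in the Elements panel of the browser's Inspect tool.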
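To make the difference concrete, here is a minimal sketch using only the standard library. The two HTML strings are made-up stand-ins: one for what "View page source" returns, one for what the Elements panel shows after the script has run. The class name youkexx-box is borrowed from the full example at the end of this article; the surrounding markup and the helper has_class are assumptions for illustration.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def has_class(html, wanted):
    """Return True if some element in `html` has class `wanted`."""
    finder = ClassFinder(wanted)
    finder.feed(html)
    return finder.found


# What the server sends: an empty container plus a script that fills it later.
static_source = """
<div id="app"></div>
<script>
  document.getElementById("app").innerHTML =
    '<div class="youkexx-box"><input value="http://www.example.com/a"></div>';
</script>
"""

# What the DOM looks like after the browser has executed the script.
rendered_dom = """
<div id="app"><div class="youkexx-box"><input value="http://www.example.com/a"></div></div>
"""

# Script contents are treated as text, so the parser never sees the element.
print(has_class(static_source, "youkexx-box"))
print(has_class(rendered_dom, "youkexx-box"))
```

A scraper that only downloads the raw HTML is in the position of the first call: the markup it needs exists only as text inside a script.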

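The solution: Python has a third-party library, Selenium, that can drive a real browser, so the page's JavaScript runs and the generated content becomes available.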

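Step 1: Install Selenium

In cmd, change to the Scripts directory of your Python installation (pip is run from the command prompt, not from inside the Python interpreter).

Run: pip install selenium

This installs the latest version of Selenium.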
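A quick way to confirm the install worked, sketched with the standard library only. The helper name is_installed is made up for this note; importlib.util.find_spec reports whether Python can locate a top-level module by name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("selenium"):
    print("selenium is installed")
else:
    print("selenium not found; run: pip install selenium")
```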

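Step 2: Install the browser driver (I use Chrome)

Find and download the driver that matches your browser version (for Chrome, this is chromedriver), unzip it, and place it in the browser's installation directory (open chrome://version to see the path).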
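What "matching version" usually means in practice: for recent Chrome releases, the major version number of chromedriver is expected to equal the browser's major version. The sketch below compares the two from their version strings. The function names and the sample strings are illustrative assumptions; in real use you would paste in the version shown on chrome://version and the output of chromedriver --version.

```python
import re


def major_version(text):
    """Extract the leading number of the first dotted version in `text`."""
    match = re.search(r"(\d+)\.\d+\.\d+", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when browser and driver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


# Sample strings, made up for illustration.
chrome = "Google Chrome 114.0.5735.199 (Official Build) (64-bit)"
driver = "ChromeDriver 114.0.5735.90"
print(versions_match(chrome, driver))
```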

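Step 3: The code

    from selenium import webdriver

    # Path to the driver; a raw string keeps the backslashes literal
    chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
    driver = webdriver.Chrome(chromedriver)  # launch the browser
    driver.get("url")  # open the page; replace "url" with the real address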

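driver.page_source returns the page source once the page has loaded. With JS-generated pages, however, webdriver cannot tell when the scripts have finished rendering, so a delay with time.sleep() is needed before reading it.

To use it, add import time at the top of the script.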
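A fixed time.sleep() is either longer than necessary or, on a slow page, too short. A common alternative is to poll until a condition holds or a timeout expires; Selenium offers this idea as WebDriverWait in selenium.webdriver.support.ui. The sketch below shows only the polling logic, using the standard library so that it runs without a browser. wait_until, page_source and loaded_source are hypothetical names for this illustration, and the fake page is an assumption.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for driver.page_source: the data shows up on the third read.
reads = {"count": 0}


def page_source():
    reads["count"] += 1
    if reads["count"] < 3:
        return "<div id='app'></div>"
    return "<div id='app'><input value='ready'></div>"


def loaded_source():
    """Return the source once it contains an <input>, else None."""
    source = page_source()
    return source if "<input" in source else None


html = wait_until(loaded_source, timeout=5.0, interval=0.01)
print(html)
```

The loop returns as soon as the data is present, so fast pages are not held up by a worst-case delay.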

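Step 4: Use PhantomJS instead of a browser

PhantomJS is a headless WebKit browser with a JavaScript API, suited to running on a server. It supports various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

Download PhantomJS, unzip it, and copy phantomjs.exe into Python's Scripts directory.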

Usage example:

    # Use the windowless PhantomJS
    driver = webdriver.PhantomJS()
    driver.get(url)
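Calling webdriver.PhantomJS() with no arguments relies on the phantomjs executable being found by name, which is why the file is copied into a directory that is on PATH (the Scripts directory typically is, when Python was installed with the option to add it). A small check, assuming only the standard library; locate is a hypothetical helper name:

```python
import shutil


def locate(program):
    """Return the full path of `program` if it is on PATH, else None."""
    return shutil.which(program)


path = locate("phantomjs")
if path is None:
    print("phantomjs is not on PATH; copy it into a PATH directory first")
else:
    print("phantomjs found at", path)
```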

The complete example code follows:

    # -*- coding: utf-8 -*-
    # Import the required libraries
    import time

    import MySQLdb
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Request headers (not used by the Selenium calls below)
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;",
        "Accept-Encoding": "gzip",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "http://www.example.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36",
    }

    conn = MySQLdb.connect(
        host='localhost',
        port=3306,
        user='root',
        passwd='123',
        db='caiji',
    )
    cur = conn.cursor()
    cur.execute("select * from xyyk")
    info = cur.fetchall()

    for i in info:
        url = i[2]
        numid = i[0]
        # print(url)
        # Use the Chrome browser
        # chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
        # driver = webdriver.Chrome(chromedriver)
        # Use the windowless PhantomJS
        driver = webdriver.PhantomJS()
        try:
            driver.get(url)
            time.sleep(25)  # give the page's JS time to finish rendering
            soup = BeautifulSoup(driver.page_source, "html.parser")
            value = None
            for j in soup.find('div', {'class': 'youkexx-box youkexx-box2'}).find_all('input'):
                print(j['value'])
                value = j['value']  # keep the last input's value
        except Exception as e:
            print('Write failed for id %s: %s' % (numid, e))
        else:
            if value is not None:
                # Parameterized query: the driver escapes the values
                cur.execute("update xyyk set newUrl = %s where id = %s", (value, numid))
        finally:
            driver.quit()  # always close the browser, even after an error
        time.sleep(5)

    conn.commit()
    cur.close()
    conn.close()
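The update step at the end of each loop iteration is safest written as a parameterized query, so that a quote inside a scraped value cannot break the SQL. Below is a minimal sketch that uses the standard library's sqlite3 in place of MySQL so it can run anywhere. Note the assumptions: the column layout of xyyk (id, name, url, newUrl) is guessed from the indexes used above, the sample row is made up, and sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: only id (index 0), url (index 2) and newUrl are
# known from the script; the name column is a placeholder.
cur.execute("create table xyyk (id integer primary key, name text, url text, newUrl text)")
cur.execute("insert into xyyk (id, name, url) values (?, ?, ?)",
            (1, "demo", "http://www.example.com/page"))

# A scraped value containing a quote, which would break string formatting.
value = "http://www.example.com/it's-here"
numid = 1

# The database driver binds the values, so no manual quoting is needed.
cur.execute("update xyyk set newUrl = ? where id = ?", (value, numid))
conn.commit()

cur.execute("select newUrl from xyyk where id = ?", (numid,))
stored = cur.fetchone()[0]
print(stored)
conn.close()
```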
