Scraping Douban movie information with selenium and lxml

Source: Internet | Editor: 程序博客网 | Time: 2024/05/22 11:46

Environment
Key code explained
Full code

Environment

Python 3.5
CentOS 7.2

Key code explained

Load the page with selenium:

driver = webdriver.PhantomJS()
driver.get("https://movie.douban.com/")

Interact with the page through selenium so that it loads completely, clicking the "more" element until that fails:

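end = True
while end:
    try:
        # Find the "more" element and click it to load further entries.
        end = driver.find_element_by_class_name("more")
        end.click()
    except Exception as e:
        # Any failure (element missing or not clickable) ends the loop.
        print("No 'more' element found.")
        end = False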

Get the page source that now contains the movie information:

movis = driver.page_source
driver.close()

Parse the page source with XPath:

html = etree.HTML(movis)
titles = html.xpath("//a[@class='item']")
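To see what this query returns without opening a browser, here is a small sketch that runs it on a hand-written fragment. The markup in `SAMPLE` is an assumption modelled on the paths used in this article (an `a` element with class `item` containing `div/img` and `p/strong`); the titles, URLs and ratings are made up, and the real Douban page may differ.

```python
from lxml import etree

# Hypothetical fragment shaped like the paths the article queries.
SAMPLE = """
<div>
  <a class="item" href="#">
    <div><img src="https://example.com/a.jpg"/></div>
    <p>
      Movie A
      <strong>8.5</strong>
    </p>
  </a>
  <a class="item" href="#">
    <div><img src="https://example.com/b.jpg"/></div>
    <p>
      Movie B
      <strong>7.9</strong>
    </p>
  </a>
  <a class="other" href="#">not a movie card</a>
</div>
"""

html = etree.HTML(SAMPLE)
# Select every <a> whose class attribute is exactly "item".
titles = html.xpath("//a[@class='item']")

print(len(titles))
for card in titles:
    # Relative paths (starting with "./") are evaluated against each card.
    print(card.xpath("./div/img/@src"), card.xpath("./p/strong/text()"))
```

Note that `@class='item'` is an exact string comparison, so an element with `class="item active"` would not be selected by this query.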

Extract the required fields:

i = 0
while i < len(titles):
    url_img = titles[i].xpath("./div/img/@src")
    title_moive = titles[i].xpath("./p/text()")
    rank_movie = titles[i].xpath("./p/strong/text()")
    # Remove all whitespace from the first text node of the title.
    title_moive = re.sub(r"\s+", "", title_moive[0])
    i = i + 1
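Two caveats with the extraction above: `title_moive[0]` raises `IndexError` when a card has no text node, and replacing `\s+` with an empty string also removes the spaces inside a multi-word title. The sketch below shows a slightly more defensive cleanup. `clean_title` is a hypothetical helper, not part of the original code; unlike the original it joins all text nodes and collapses whitespace to single spaces instead of deleting it. The sample titles are made up.

```python
import re


def clean_title(text_nodes):
    """Join XPath text() results and normalise whitespace.

    text_nodes: list of strings, e.g. the result of ./p/text().
    Returns "" for an empty list instead of raising IndexError.
    """
    joined = " ".join(text_nodes)
    # Collapse every whitespace run (newlines, tabs, spaces) to one space.
    return re.sub(r"\s+", " ", joined).strip()


print(clean_title(["\n      Movie A\n      ", "\n    "]))  # prints Movie A
print(clean_title(["\n  The  Long   Title \n"]))          # prints The Long Title
print(repr(clean_title([])))                               # prints ''
```

For titles without internal spaces, the result is the same as with the original `re.sub` call.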

Full code

from selenium import webdriver
from lxml import etree
import re

# Load the Douban movie page.
driver = webdriver.PhantomJS()
driver.get("https://movie.douban.com/")

# Keep clicking the "more" element until that fails.
end = True
while end:
    try:
        end = driver.find_element_by_class_name("more")
        end.click()
    except Exception as e:
        print("No 'more' element found.")
        end = False

# Grab the fully loaded page source, then close the browser.
movis = driver.page_source
driver.close()
print(type(movis))

# Parse the source and select every movie card.
html = etree.HTML(movis)
titles = html.xpath("//a[@class='item']")

i = 0
while i < len(titles):
    url_img = titles[i].xpath("./div/img/@src")
    title_moive = titles[i].xpath("./p/text()")
    rank_movie = titles[i].xpath("./p/strong/text()")
    # Remove all whitespace from the first text node of the title.
    title_moive = re.sub(r"\s+", "", title_moive[0])
    i = i + 1
    print(url_img, "===", title_moive, "===", rank_movie)
    print("****************************************************************************")

0 0