Scraping Douban movie information with selenium and lxml

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 11:46

  • Environment
  • Key code explained
  • Full code

Environment

Python 3.5
CentOS 7.2

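Key code explained

Load the page with selenium:

    from selenium import webdriver

    # webdriver.PhantomJS exists in Selenium 3.x; it was removed in Selenium 4.
    driver = webdriver.PhantomJS()
    driver.get("https://movie.douban.com/")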

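Use selenium to interact with the page until it is fully loaded, by clicking the "more" element until the click fails:

    while True:
        try:
            driver.find_element_by_class_name("more").click()
        except Exception:
            # Any failure (element missing or not clickable) ends the loop.
            print("No 'more' button left.")
            break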
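The loop above runs for as long as a "more" element can be clicked, so it has no upper bound. As a minimal sketch, assuming a cap on the number of clicks is acceptable, the same stop condition can be wrapped in a helper. The names `click_until_gone`, `click_more`, `max_clicks`, and `fake_click` are hypothetical and not part of the original code; the browser is replaced by a stand-in so the example runs without selenium.

```python
def click_until_gone(click_more, max_clicks=50):
    """Call click_more() until it raises or max_clicks is reached.

    click_more is a hypothetical callable that finds the "more" element
    and clicks it, raising an exception when that is no longer possible.
    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            click_more()
        except Exception:
            # Same stop condition as the original loop: any failure ends it.
            break
        clicks += 1
    return clicks


# Stand-in for the browser: the "button" disappears after three clicks.
remaining = [3]


def fake_click():
    if remaining[0] == 0:
        raise LookupError("no 'more' button")
    remaining[0] -= 1


clicked = click_until_gone(fake_click)
print(clicked)
```

With a real driver the call would presumably look like `click_until_gone(lambda: driver.find_element_by_class_name("more").click())`.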

Get the source of the page that holds the movie information, then close the browser:

    movis = driver.page_source
    driver.close()
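If an exception is raised before `driver.close()` is reached, the browser may be left open. One possible way to guard against that, sketched here with the standard library's `contextlib.closing`, is to read the source inside a `with` block. `FakeDriver` and `grab_source` are hypothetical names used only so the example runs without a browser; they only mimic the `page_source` attribute and `close()` method used above.

```python
from contextlib import closing


class FakeDriver:
    """Hypothetical stand-in exposing the two members used above."""

    page_source = "<html><body>demo</body></html>"
    closed = False

    def close(self):
        self.closed = True


def grab_source(driver):
    # closing() calls driver.close() on exit, even if an exception is raised.
    with closing(driver):
        return driver.page_source


driver = FakeDriver()
movis = grab_source(driver)
print(driver.closed)
```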

Parse the page source with xpath:

    html = etree.HTML(movis)
    titles = html.xpath("//a[@class='item']")

Extract the needed fields (poster URL, title, rating) from each item:

    for item in titles:
        url_img = item.xpath("./div/img/@src")
        title_moive = item.xpath("./p/text()")
        rank_movie = item.xpath("./p/strong/text()")
        # Strip all whitespace from the first text node of the title.
        title_moive = re.sub(r"\s+", "", title_moive[0])
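The extraction step indexes `title_moive[0]` directly, which raises `IndexError` when an item has no text node. The following is a minimal sketch of the same XPath expressions with a guarded lookup, run on a small inline sample. The `SAMPLE` markup, the example.com URLs, and the helper names `first_or_default` and `parse_items` are assumptions made for illustration; the real Douban page may be structured differently.

```python
import re

from lxml import etree

# Hypothetical markup shaped the way the XPath expressions above expect.
SAMPLE = """
<div>
  <a class="item" href="#">
    <div><img src="https://example.com/a.jpg"/></div>
    <p>
      Inception
      <strong>8.5</strong>
    </p>
  </a>
  <a class="item" href="#">
    <div><img src="https://example.com/b.jpg"/></div>
    <p>Up</p>
  </a>
</div>
"""


def first_or_default(values, default=""):
    """Return the first XPath result, or default when the list is empty."""
    return values[0] if values else default


def parse_items(page_source):
    """Return (image URL, title, rating) tuples for every a.item element."""
    html = etree.HTML(page_source)
    rows = []
    for item in html.xpath("//a[@class='item']"):
        img = first_or_default(item.xpath("./div/img/@src"))
        title = re.sub(r"\s+", "", first_or_default(item.xpath("./p/text()")))
        rating = first_or_default(item.xpath("./p/strong/text()"))
        rows.append((img, title, rating))
    return rows


rows = parse_items(SAMPLE)
for row in rows:
    print(row)
```

Note that `re.sub(r"\s+", "", ...)` removes every whitespace character, including spaces inside a title, which matches the original behavior.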

Full code

    from selenium import webdriver
    from lxml import etree
    import re

    # Load the page in PhantomJS.
    driver = webdriver.PhantomJS()
    driver.get("https://movie.douban.com/")

    # Keep clicking "more" until the click fails, i.e. everything is loaded.
    while True:
        try:
            driver.find_element_by_class_name("more").click()
        except Exception:
            print("No 'more' button left.")
            break

    # Grab the rendered page source, then close the browser.
    movis = driver.page_source
    driver.close()
    print(type(movis))

    # Parse the source and select every movie entry.
    html = etree.HTML(movis)
    titles = html.xpath("//a[@class='item']")

    for item in titles:
        url_img = item.xpath("./div/img/@src")
        title_moive = item.xpath("./p/text()")
        rank_movie = item.xpath("./p/strong/text()")
        # Strip all whitespace from the first text node of the title.
        title_moive = re.sub(r"\s+", "", title_moive[0])
        print(url_img, "===", title_moive, "===", rank_movie)
        print("****************************************************************************")