A Simple Crawler for Scraping Jokes from Qiushibaike (selenium + ChromeDriver)

Source: Internet. Editor: 程序博客网. Time: 2024/06/14 09:54

This is a simple introductory crawler that scrapes the jokes on the Qiushibaike home page (text only, no images).

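Preparation:

Install selenium and ChromeDriver.
Put chromedriver.exe in Chrome's installation directory.
Configure the environment variable: My Computer -> Properties -> Advanced system settings -> Environment Variables -> PATH -> New, then enter Chrome's installation location (mine, for example, is C:\Program Files (x86)\Google\Chrome\Application).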
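Before launching the browser, it can help to confirm that the PATH change took effect. The following is a minimal sketch using only the standard library; the helper name find_chromedriver and its extra_dirs parameter are hypothetical and chosen for illustration. On Windows, shutil.which also matches chromedriver.exe through the PATHEXT extensions.

```python
import os
import shutil


def find_chromedriver(extra_dirs=()):
    """Return the full path of the chromedriver executable, or None if not found.

    extra_dirs is a hypothetical convenience parameter: directories that are
    searched before the ones listed in the PATH environment variable.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which("chromedriver", path=search_path)


if __name__ == "__main__":
    location = find_chromedriver()
    if location is None:
        print("chromedriver was not found on PATH; check the environment variable.")
    else:
        print("chromedriver found at:", location)
```

If the script reports that chromedriver was not found, open a new terminal window first, since a terminal that was already open does not pick up the updated PATH.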

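Once everything is ready, we can start on the most important task: analyzing the target page we want to scrape.
First, open the Qiushibaike home page and the developer tools (F12). We can see that the left side of the page is all jokes and the right side is all advertisements.
The developer tools show that the joke area on the left has the id content-left. It also contains a lot of information we do not need, such as user avatars and names; we only want the jokes themselves. Inspecting further in the same way, we find that the div wrapping each joke's text has the class content. So what we actually need are the elements with class content inside the element with id content-left.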
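The selection rule described above (elements with class content, but only inside the element with id content-left) can be tried out without a browser. The sketch below uses Python's standard html.parser on a made-up HTML fragment; SAMPLE_HTML, ContentExtractor and extract_jokes are hypothetical names, and the real page's markup is likely more complex. In CSS-selector terms the same rule would be written as #content-left .content.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup that mimics the structure described above.
SAMPLE_HTML = """
<div id="content-left">
  <div class="article">
    <div class="author"><h2>some_user</h2></div>
    <div class="content"><span>First joke text.</span></div>
  </div>
  <div class="article">
    <div class="author"><h2>another_user</h2></div>
    <div class="content"><span>Second joke text.</span></div>
  </div>
</div>
<div id="content-right">
  <div class="content">An advertisement.</div>
</div>
"""


class ContentExtractor(HTMLParser):
    """Collect the text of class="content" divs nested inside id="content-left"."""

    def __init__(self):
        super().__init__()
        self.jokes = []
        self._div_stack = []  # one entry per open <div>: "left", "content" or None
        self._buffer = None   # collects text while inside a content div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if attrs.get("id") == "content-left":
            self._div_stack.append("left")
        elif "content" in classes and "left" in self._div_stack and self._buffer is None:
            # A content div that sits somewhere inside the content-left area.
            self._div_stack.append("content")
            self._buffer = []
        else:
            self._div_stack.append(None)

    def handle_endtag(self, tag):
        if tag != "div" or not self._div_stack:
            return
        if self._div_stack.pop() == "content":
            self.jokes.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_jokes(html):
    parser = ContentExtractor()
    parser.feed(html)
    return parser.jokes


if __name__ == "__main__":
    for number, joke in enumerate(extract_jokes(SAMPLE_HTML), 1):
        print(str(number) + "." + joke)
```

Note that the class="content" div in the right-hand column is skipped, which is why the lookup is scoped to content-left rather than searching the whole page for the class.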

The code is as follows:

#!/usr/bin/env python
# coding: utf-8

# Import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By


class Qiubai:
    def __init__(self):
        # Open the Chrome browser
        self.dr = webdriver.Chrome()
        # Visit the Qiushibaike home page
        self.dr.get('https://www.qiushibaike.com/')

    def print_content(self):
        # Get the element whose id is "content-left".
        # find_element(By.ID, ...) replaces find_element_by_id,
        # which was removed in Selenium 4.
        main_content = self.dr.find_element(By.ID, 'content-left')
        # Get the elements whose class is "content"
        contents = main_content.find_elements(By.CLASS_NAME, 'content')
        # Print each joke, numbered from 1
        for i, content in enumerate(contents, 1):
            print(str(i) + "." + content.text + '\n')
        self.quit()

    def quit(self):
        # Close the browser
        self.dr.quit()


Qiubai().print_content()

Reference: 陈斌 (Chen Bin), Python crawler course (Python爬虫课)
