A Simple Little Crawler Using BeautifulSoup

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 07:54

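I've recently been picking up a bit of introductory Python and worked through the Python beginner tutorial on runoob (菜鸟教程). I read online that BeautifulSoup makes it easy to try out a simple crawler, so I followed along and wrote one for Baidu Tieba. Call it a little crawler...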

Installing BeautifulSoup

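First download it from the official site, unpack it, and install it with Python (for a source download that usually means running python setup.py install in the unpacked directory).
Official site: https://www.crummy.com/software/BeautifulSoup/#Download
For the details, just search online; there are plenty of guides.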

Crawling Module

Tieba URLs are fairly easy to build by concatenation, which is why quite a few people use Tieba for practice.

def start(self):
    # The pn offset advances in steps of 50, one step per list page.
    for i in range(self.topic_limit // 50):
        self.spide_listpage(i * 50)

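Since I wanted to page through the topic list, and the page offset is appended to the URL in exactly this form (0, 50, 100, ...), a loop simply calls the list-page method once per offset.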
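To make the paging concrete, here is a minimal sketch (Python 3, unlike the urllib2-based code in this post) that only builds the list-page URLs and fetches nothing. The function name list_page_urls, the page_size parameter, and the example base URL are assumptions made for illustration; they are not part of the original crawler.

```python
def list_page_urls(base_url, topic_limit, page_size=50):
    """Build one list-page URL per block of page_size topics.

    base_url is assumed to already contain a query string, so the
    offset can be appended as "&pn=<offset>" the way the post does.
    """
    return [base_url + "&pn=" + str(i * page_size)
            for i in range(topic_limit // page_size)]


if __name__ == "__main__":
    # Hypothetical base URL, for illustration only.
    for url in list_page_urls("https://tieba.baidu.com/f?kw=python", 150):
        print(url)
```

Keeping the URL construction separate from the fetching makes it easy to check the offsets before sending any requests.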

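# Requires: import urllib2; from bs4 import BeautifulSoup
def spide_listpage(self, num):
    # Build the list-page URL for this offset and fetch it.
    url = self.baseUrl + "&pn=" + str(num)
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Topic links are the <a> tags carrying the 'j_th_tit ' class.
    topic_list = soup.findAll('a', attrs={'class': 'j_th_tit '})
    for topic in topic_list:
        if self.keyword in topic['title']:
            print topic['title'], (self.domain + topic['href']).strip()
            # Remember the first matching topic and stop looking.
            self.theUrl = (self.domain + topic['href']).strip()
            break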

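Here url is the concatenated address. The page it returns is read and parsed with BeautifulSoup, then every <a> tag whose class is j_th_tit is collected so that the matching title and hyperlink can be printed.
The idea is simply to look for the corresponding CSS class in the HTML, since items of the same kind share the same markup. Everyone probably knows this already, so I won't belabor it.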

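The loop then prints the first topic whose title contains keyword, together with its link, and saves that URL for the next step.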
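The same lookup can be tried offline on a hand-written HTML snippet, which is handy when the live page is unavailable or its markup has changed. This is only a sketch in Python 3 with bs4: SAMPLE_HTML, find_first_topic, and the domain value passed in are made up for illustration, and the real Tieba markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a list page; not real Tieba markup.
SAMPLE_HTML = """
<ul>
  <li><a class="j_th_tit" href="/p/1001" title="Python crawler question">Python crawler question</a></li>
  <li><a class="other" href="/p/1002" title="Python off-topic">Python off-topic</a></li>
  <li><a class="j_th_tit" href="/p/1003" title="Weekend photos">Weekend photos</a></li>
</ul>
"""


def find_first_topic(html, keyword, domain):
    """Return (title, absolute_url) of the first topic link whose title
    contains keyword, or None if no topic link matches."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches <a> tags that carry the j_th_tit CSS class.
    for link in soup.find_all("a", class_="j_th_tit"):
        if keyword in link.get("title", ""):
            return link["title"], (domain + link["href"]).strip()
    return None


if __name__ == "__main__":
    # The domain is an assumed value for illustration.
    print(find_first_topic(SAMPLE_HTML, "Python", "https://tieba.baidu.com"))
```

Returning the match instead of printing it inside the loop keeps the function easy to check and lets the caller decide what to do when nothing is found.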

File-Writing Module

Once the content is crawled, I might as well just write it to a txt file.

import urllib2
from bs4 import BeautifulSoup

class writeInFile:
    def __init__(self, url):
        self.url = url

    def getTheWeb(self):
        html = urllib2.urlopen(self.url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Each post body is a <div> with these two classes.
        context_list = soup.findAll('div', 'd_post_content j_d_post_content ')
        for context in context_list:
            # print context.text
            self.writeFile(context.text)

    def writeFile(self, text):
        # Append one post per line; text is unicode, so encode it explicitly.
        with open('spider.txt', 'a') as f:
            f.write(text.encode('utf-8'))
            f.write('\n')

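The URL found in the previous step is passed to this class, BeautifulSoup is used again to pull out the text of the posts in that thread, and Python's built-in write method appends it to the txt file.
It is basically a repeat of what the previous module does.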
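One detail worth making explicit when saving scraped text is the file encoding, since posts will usually contain non-ASCII characters. Below is a minimal Python 3 sketch of the append-one-line-per-post step, assuming UTF-8 output; append_posts is a hypothetical helper, not something from the original code.

```python
import os
import tempfile


def append_posts(path, posts):
    """Append each post's text to path as its own UTF-8 encoded line."""
    with open(path, "a", encoding="utf-8") as f:
        for text in posts:
            f.write(text + "\n")


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no stray files.
    demo_path = os.path.join(tempfile.mkdtemp(), "spider.txt")
    append_posts(demo_path, ["first post", "第二楼"])
    with open(demo_path, encoding="utf-8") as f:
        print(f.read())
```

Opening the file in append mode means repeated runs keep adding to the same file, which matches the behavior of the class above.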
Uh... the thread it grabbed seems a bit distasteful. I'll switch to a different keyword next time.
