Python BeautifulSoup: scraping all the advanced tutorials from the Python Chinese developer community (Python中文开发者社区, pythontab.com)
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 22:31
Without further ado, here is the code:
#coding=utf-8
# Python 2 script (it uses urllib2); requires the bs4 and lxml packages.
from bs4 import BeautifulSoup
import urllib2

# Index pages of the tutorial category: index.html, then 2.html through 18.html.
url = 'http://www.pythontab.com/html/pythonhexinbiancheng/index.html'
url_list = [url]
for i in range(2, 19):
    url_list.append('http://www.pythontab.com/html/pythonhexinbiancheng/%s.html' % i)

# Collect the title and link of every article listed on the index pages.
source_list = []
for j in url_list:
    request = urllib2.urlopen(j)
    html = request.read()
    suop = BeautifulSoup(html, 'lxml')
    titles = suop.select('#catlist > li > a')
    links = suop.select('#catlist > li > a')
    for title, link in zip(titles, links):
        data = {
            "title": title.get_text(),
            "link": link.get('href')
        }
        source_list.append(data)

# Download each article and save its paragraphs to "<title>.txt".
for l in source_list:
    request = urllib2.urlopen(l['link'])
    html = request.read()
    suop = BeautifulSoup(html, 'lxml')
    text_p = suop.select('#Article > div.content > p')
    text = []
    print(text_p)
    for t in text_p:
        text.append(t.get_text().encode('utf-8'))
    # Replace characters that are not allowed in file names.
    title_text = l['title']
    title_text = title_text.replace('*', '').replace('/', 'or').replace('"', ' ').replace('?', 'wenhao').replace(':', ' ')
    with open('%s.txt' % title_text, 'wb') as f:
        for a in text:
            f.write(a)
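The two pieces of the script that do not touch the network, building the list of index-page URLs and turning an article title into a usable file name, can be pulled out and checked on their own. The following is only a rough Python 3 sketch: the helper names `build_page_urls` and `safe_filename` and the `last_page` parameter are made up for illustration and do not appear in the original script; the URL pattern and the replacement rules are copied from it.

```python
# Hypothetical helpers for illustration; these names are not in the original script.
BASE = 'http://www.pythontab.com/html/pythonhexinbiancheng/'

def build_page_urls(last_page=18):
    """Return index.html followed by 2.html .. <last_page>.html."""
    urls = [BASE + 'index.html']
    for i in range(2, last_page + 1):
        urls.append('%s%s.html' % (BASE, i))
    return urls

# Same substitutions, in the same order, as the script's chained .replace() calls.
REPLACEMENTS = [('*', ''), ('/', 'or'), ('"', ' '), ('?', 'wenhao'), (':', ' ')]

def safe_filename(title):
    """Apply the substitutions to a title and append the .txt extension."""
    for old, new in REPLACEMENTS:
        title = title.replace(old, new)
    return title + '.txt'

if __name__ == '__main__':
    for page_url in build_page_urls()[:3]:
        print(page_url)
    print(safe_filename('a/b: what?'))
```

Keeping the replacement rules in one list makes it easier to add other characters that a file system may reject (for example `<`, `>`, `|` or `\` on Windows), which the original chain of calls does not cover.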