Jianshu (简书)
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 15:52
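from selenium import webdriver
from bs4 import BeautifulSoup
import requests, re, time

# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
driver = webdriver.PhantomJS()
headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}
# Search result pages 1..99
urls = ["http://www.jianshu.com/search?q=Python+selenium+PhantomJS&page=%d&type=notes" % x for x in range(1, 100)]
for url in urls:
    print(url)
    try:
        # The search page is rendered by JavaScript, so load it in the browser
        driver.get(url)
        print(driver.title)
        bs1 = BeautifulSoup(driver.page_source, 'lxml')
        site = bs1.find("ul", "unstyled list")
        for w in site.find_all("li"):
            time.sleep(4)  # pause between article requests
            l = w("a")  # every <a> tag inside this search result
            title = w.a.text.strip()
            # Strip a leading slash so the joined URL has no double slash
            link = 'http://www.jianshu.com/' + l[0].get('href').lstrip('/')
            print(link)
            name = l[1].text
            read = l[2].text
            comment = l[3].text
            info = title + link + name + read + comment
            print(info)
            # Remove characters that are not allowed in Windows file names
            title = re.sub(r'[\\/*?:"<>|]', '', title)
            # The article page itself can be fetched with plain requests
            content = requests.get(link, headers=headers).text
            bs2 = BeautifulSoup(content, 'lxml')
            content = bs2.find("div", "show-content").text
            # Start a new line after every Chinese full stop
            a = re.sub(r'。', r'。\n', content)
            t = 'f://简书//%s.text' % title  # the target folder must already exist
            with open(t, 'w', errors='replace') as f:
                f.write(info + '\n')
                print('Downloading')
                f.write(a)
    except Exception as e:
        # Report the failure instead of silently ignoring it, then move on to the next page
        print('Skipped', url, e)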
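
The string handling in the script (building the article link, cleaning the title for use as a file name, breaking the text after each full stop) can be pulled out into small functions and checked without a browser or network. The following is only a sketch: the names `BASE_URL`, `build_link`, `sanitize_filename` and `break_sentences` are hypothetical and not part of the original script, and `urljoin` is swapped in for the original string concatenation.

```python
import re
from urllib.parse import urljoin

# Assumed base URL, same host the script above uses.
BASE_URL = "http://www.jianshu.com/"

# Characters that Windows does not allow in file names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def build_link(href):
    """Resolve a relative href against BASE_URL without doubling slashes."""
    return urljoin(BASE_URL, href)


def sanitize_filename(title):
    """Remove characters that are illegal in Windows file names."""
    return _ILLEGAL.sub("", title).strip()


def break_sentences(text):
    """Insert a newline after every Chinese full stop, as the script does."""
    return text.replace("。", "。\n")


if __name__ == "__main__":
    print(build_link("/p/abc123"))
    print(sanitize_filename('Python: selenium / PhantomJS?'))
    print(break_sentences("第一句。第二句。"))
```

Using `urljoin` gives the same result whether the `href` starts with a slash or not, which is the case the plain concatenation handles poorly.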