Writing a Python Web Crawler from Scratch (3)

Source: Internet | Editor: 程序博客网 | Time: 2024/05/21 06:49

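Next, we also need the details of each news article: its title, body text, publication time, editor, and source.

The steps are as follows:

1. Open any news article to go to its page.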

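2. Get the news title. Inspect the title with the browser's developer tools:

1) Locate the title element (the title is inside the element with id artibodyTitle).

2) Code: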

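import requests
from bs4 import BeautifulSoup

# Download the article page and decode it as UTF-8
res = requests.get('http://news.sina.com.cn/o/2017-01-12/doc-ifxzqnva3333635.shtml')
res.encoding = 'utf-8'

# Parse the HTML and read the text of the title element
soup = BeautifulSoup(res.text, 'html.parser')
soup.select('#artibodyTitle')[0].text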

Output: the article's headline as a string.
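Indexing with [0] raises an IndexError when the element is missing, for example on a page with a different layout. As a minimal sketch of a more defensive lookup, the snippet below uses select_one, which returns None when nothing matches. SAMPLE_HTML and extract_title are hypothetical names for illustration; only the id artibodyTitle comes from the steps above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; only the id comes from the text.
SAMPLE_HTML = """
<html><body>
  <h1 id="artibodyTitle"> Sample headline </h1>
</body></html>
"""


def extract_title(html):
    """Return the headline text, or None if #artibodyTitle is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('#artibodyTitle')  # first match, or None
    return node.get_text(strip=True) if node is not None else None


print(extract_title(SAMPLE_HTML))        # Sample headline
print(extract_title('<p>no title</p>'))  # None
```

Returning None lets the caller decide whether to skip the page or log it, instead of stopping the whole crawl on one odd page.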

3. Get the time and the source:

1) Locate the time element (the time is inside #navtimeSource).

2) Code:

soup.select('#navtimeSource')[0]

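3) Output: the whole #navtimeSource element, including its nested tags.

4) Continue the analysis by getting the contents of the #navtimeSource tag: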

soup.select('#navtimeSource')[0].contents

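5) This gives a list of child nodes: [0] is the time text and [1] is the tag that holds the source.

6) Get the time, using strip() to remove the surrounding \t characters: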

soup.select('#navtimeSource')[0].contents[0].strip()

7) Output: the publication time as a plain string.

8) In the same list of child nodes, [1] holds the source:

soup.select('#navtimeSource')[0].contents[1].text.strip()

9) Output: the name of the news source.
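The time and source steps can be tried offline with a small sketch like the one below. The markup in SAMPLE_HTML is hypothetical, modelled on the description above (a text node followed by a nested tag), and the timestamp layout passed to strptime is an assumption about how the page writes its dates, so adjust it if the real page differs.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup: a text node (the timestamp, padded with tabs)
# followed by a nested tag (the source).
SAMPLE_HTML = (
    '<span id="navtimeSource">2017年01月12日09:47\t\t'
    '<span><a href="#">Sample Agency</a></span></span>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
node = soup.select('#navtimeSource')[0]

time_text = node.contents[0].strip()    # text node: strip the tabs
source = node.contents[1].text.strip()  # tag: text of its descendants

# Assumed timestamp layout; literal characters in the format must match exactly.
published = datetime.strptime(time_text, '%Y年%m月%d日%H:%M')

print(time_text)              # 2017年01月12日09:47
print(source)                 # Sample Agency
print(published.isoformat())  # 2017-01-12T09:47:00
```

Converting the string to a datetime makes it easier to sort or filter articles by time later on.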

4. Get the news body:

1) Locate the news body: inspection shows that it sits in the div with id artibody, with each paragraph in its own p tag.

2) Code:

article = []
for p in soup.select('#artibody p'):
    article.append(p.text.strip())  # append each paragraph's text to the list
'\n'.join(article)  # join the paragraphs with newline characters

3) Output: the article text, one paragraph per line.
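The same loop can be written more compactly with a list comprehension. The sketch below also drops paragraphs that are empty after stripping, which is an extra step not taken in the code above. SAMPLE_HTML and extract_body are hypothetical; only the #artibody id and the p layout come from the text.

```python
from bs4 import BeautifulSoup

# Hypothetical article body, including one whitespace-only paragraph.
SAMPLE_HTML = """
<div id="artibody">
  <p> First paragraph. </p>
  <p>   </p>
  <p>Second paragraph.</p>
</div>
"""


def extract_body(html):
    """Join the non-empty paragraphs of #artibody with newlines."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = [p.text.strip() for p in soup.select('#artibody p')]
    return '\n'.join(text for text in paragraphs if text)  # skip empty ones


print(extract_body(SAMPLE_HTML))
```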

5. Get the editor:

1) The editor is in the p tag with class article-editor.

2) Code:

soup.select('.article-editor')[0].text

3) Output: the editor line, still including the '责任编辑：' label.

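4) Remove the leading '责任编辑：' ("editor in charge") label to get just the name. lstrip() treats its argument as a set of characters rather than a prefix, so it could also strip the start of a name; removeprefix() (Python 3.9+) removes only the exact label:

soup.select('.article-editor')[0].text.strip().removeprefix('责任编辑：')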

5) Output: the editor's name only.
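One detail worth checking when removing the label: str.lstrip removes any leading characters that appear in its argument, not an exact prefix. The sketch below shows the difference with a hypothetical editor name that happens to start with one of the label's characters. strip_label is an illustrative helper, not part of any library, and it works on any Python 3 version.

```python
LABEL = '责任编辑：'


def strip_label(text, label=LABEL):
    """Remove the label only when it appears as an exact prefix."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text


sample = '责任编辑：任小明 '  # hypothetical editor line

print(sample.strip().lstrip(LABEL))  # 小明 (the leading 任 is lost)
print(strip_label(sample))           # 任小明
```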

OK! That is roughly all the data we need.
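As a rough sketch of how these fields might be gathered in one place, the function below collects them into a dict. parse_article and SAMPLE_HTML are hypothetical; the selectors are the ones used in the steps above, and the sample page only imitates the structure described there. For a real page, the html argument would be res.text from the requests call.

```python
from bs4 import BeautifulSoup

# Hypothetical page combining the elements located in the steps above.
SAMPLE_HTML = """
<h1 id="artibodyTitle">Sample headline</h1>
<span id="navtimeSource">2017年01月12日09:47\t\t<span><a href="#">Sample Agency</a></span></span>
<div id="artibody"><p>First paragraph.</p><p>Second paragraph.</p></div>
<p class="article-editor">责任编辑：Sample Editor </p>
"""


def parse_article(html):
    """Collect the fields from this walkthrough into one dict."""
    soup = BeautifulSoup(html, 'html.parser')
    time_source = soup.select('#navtimeSource')[0]

    # Remove the label only when it is an exact prefix of the editor line.
    editor = soup.select('.article-editor')[0].text.strip()
    label = '责任编辑：'
    if editor.startswith(label):
        editor = editor[len(label):].strip()

    return {
        'title': soup.select('#artibodyTitle')[0].text.strip(),
        'time': time_source.contents[0].strip(),
        'source': time_source.contents[1].text.strip(),
        'article': '\n'.join(p.text.strip() for p in soup.select('#artibody p')),
        'editor': editor,
    }


result = parse_article(SAMPLE_HTML)
for key, value in result.items():
    print(key, '=>', value)
```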

To be continued...

0 0