Learning Web Crawlers (Part 2)

Source: Internet | Editor: 程序博客网 | Time: 2024/04/29 07:39

1. Basic BeautifulSoup operations

The code from last time was:

from bs4 import BeautifulSoup

html_sample = ' \
<html> \
<body> \
<h1 id="title"> Hello World</h1> \
<a href="#" class="link">This is link1</a> \
<a href="# link2" class="link"> This is link2</a> \
</body> \
</html>'

soup = BeautifulSoup(html_sample, 'html.parser')
print(soup.text)

(1) Find all HTML elements with a specific tag

Use select to find the elements with the h1 tag:

soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('h1')
print(header)

print(header[0])
print(header[0].text)
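Since select returns a list even when only one element matches, indexing with [0] is needed above. As a small sketch (the compact html_sample below is a restatement of the earlier sample, and the h3 lookup is only there to show the no-match case), select_one can be used when only the first match is wanted:

```python
from bs4 import BeautifulSoup

# Compact restatement of the sample HTML used in this post
html_sample = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="# link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

# select() returns a list of matches, even for a single match
headers = soup.select('h1')
print(len(headers))
print(headers[0].text)

# select_one() returns the first match directly, or None if nothing matches
first_h1 = soup.select_one('h1')
missing = soup.select_one('h3')
print(first_h1.text)
print(missing)
```

Checking for None before reading .text avoids an error when the tag is absent from the page.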

Use select to find the elements with the a tag:

soup = BeautifulSoup(html_sample, 'html.parser')
alink = soup.select('a')
print(alink)
for link in alink:  # print each tag on its own line
    # print(link)
    print(link.text)  # extract the text inside the tag

(2) Get elements with specific CSS attributes

Use select to find all elements whose id is title (the id must be prefixed with #):

alink = soup.select('#title')
print(alink)

Use select to find all elements whose class is link (the class must be prefixed with .):

# pass the parser explicitly; omitting it makes bs4 guess one and emit a warning
soup = BeautifulSoup(html_sample, 'html.parser')
for link in soup.select('.link'):
    print(link)
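The tag, id and class forms can also be combined in one selector. The following is a minimal sketch on the same sample HTML (restated compactly here); the particular combinations are illustrative choices, not something the original steps require:

```python
from bs4 import BeautifulSoup

# Compact restatement of the sample HTML used in this post
html_sample = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="# link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

# tag + class: only <a> elements whose class is link
links = soup.select('a.link')

# tag + id: only an <h1> whose id is title
titles = soup.select('h1#title')

# a combination that matches nothing gives an empty result, not an error
no_match = soup.select('h1.link')

print([link.text for link in links])
print(titles[0].text)
print(len(no_match))
```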

(3) Get the links inside all a tags

Use select to find the href link of every a tag:

alinks = soup.select('a')
for link in alinks:
    print(link['href'])
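On real pages many href values are relative, and some a tags have no href at all. As a hedged sketch of one way to deal with both, the attribute selector a[href] keeps only tags that have the attribute, and urljoin from the standard library resolves each link against a base address. The base_url and the HTML fragment below are made up for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'http://example.com/news/'  # hypothetical page address
html = (
    '<a href="/a.html">A</a>'
    '<a href="b.html">B</a>'
    '<a href="http://other.example.org/c.html">C</a>'
    '<a name="anchor-only">no href</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# 'a[href]' skips <a> tags without an href, so link['href'] cannot raise KeyError
absolute_links = [urljoin(base_url, link['href']) for link in soup.select('a[href]')]

for url in absolute_links:
    print(url)
```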

Example:

a = '<a href="#" qoo=123 abc=456> I am a linker</a>'
soup2 = BeautifulSoup(a, 'html.parser')
print(soup2.select('a'))
print(soup2.select('a')[0]['abc'])  # prints the value of abc
print(soup2.select('a')[0]['qoo'])  # prints the value of qoo
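Indexing a tag with ['name'] raises KeyError when the attribute does not exist. A small sketch of the more forgiving alternatives, using the same kind of tag as the example above (the attribute name missing and the fallback 'n/a' are placeholders chosen for illustration):

```python
from bs4 import BeautifulSoup

a = '<a href="#" qoo=123 abc=456>I am a link</a>'
tag = BeautifulSoup(a, 'html.parser').select('a')[0]

# get() returns None (or a supplied default) instead of raising KeyError
qoo = tag.get('qoo')
absent = tag.get('missing')
fallback = tag.get('missing', 'n/a')

# attrs exposes all attributes of the tag as a dictionary
all_attrs = tag.attrs

print(qoo, absent, fallback)
print(all_attrs)
```

Attribute values come back as strings, so qoo here is the string '123' and not a number.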

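2. Locating the elements to scrape

Using the browser's inspect tool, you can see the label "news-item.first-news-item" displayed above the news content, which means that element carries the classes news-item and first-news-item.

The corresponding code is: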
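A label of that dotted form maps directly onto CSS class selectors. The sketch below uses a made-up HTML fragment (not the real page markup) to show that selecting by one class matches every element carrying it, while chaining both classes narrows the match:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure seen in the inspect tool
html = (
    '<div class="news-item first-news-item"><h2>First headline</h2></div>'
    '<div class="news-item"><h2>Second headline</h2></div>'
)
soup = BeautifulSoup(html, 'html.parser')

# one class: matches every element that has news-item among its classes
all_items = soup.select('.news-item')

# two chained classes: matches only elements that have both
first_only = soup.select('.news-item.first-news-item')

print(len(all_items), len(first_only))
print(first_only[0]['class'])  # the class attribute is returned as a list
```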

3. Getting the content for each HTML tag

4. The complete crawler

The code is as follows:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('.news-item'):
    # only handle items that contain a headline
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']
        print(time, h2, a)
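The loop above can be tried without a network connection by moving the parsing into a function and feeding it a string. This is only a sketch: parse_news is a hypothetical helper name, the HTML fragment is invented to imitate the page structure, and skipping items that lack a time or a link is a design choice added here, not part of the original code.

```python
from bs4 import BeautifulSoup


def parse_news(html):
    """Return a list of (time, headline, link) tuples from news-item blocks."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for news in soup.select('.news-item'):
        h2 = news.select_one('h2')
        time_tag = news.select_one('.time')
        link = news.select_one('a[href]')
        # skip incomplete items instead of failing on a missing element
        if h2 is None or time_tag is None or link is None:
            continue
        rows.append((time_tag.text.strip(), h2.text.strip(), link['href']))
    return rows


# Hypothetical fragment; the last item has no headline and is skipped
sample_html = (
    '<div class="news-item first-news-item">'
    '<h2><a href="http://example.com/1">Headline one</a></h2>'
    '<div class="time">09:00</div></div>'
    '<div class="news-item">'
    '<h2><a href="http://example.com/2">Headline two</a></h2>'
    '<div class="time">10:30</div></div>'
    '<div class="news-item"><div class="time">11:00</div></div>'
)

rows = parse_news(sample_html)
for row in rows:
    print(*row)
```

To use it on the live page, the text returned by requests.get could be passed in place of sample_html.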

Result:

0 0