Python Web Scraping Study Notes, Part 1

Source: Internet | Published by: 天拓网络 | Editor: 程序博客网 | Time: 2024/04/30 17:00

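1. Overview

I have been learning Python recently and got quite interested in web scraping, so I started this blog to record my progress.

I did not use the urllib2 module that many online tutorials rely on. I went straight to the requests module, which turned out to be very simple to use.

In this post I use a scraper to fetch news content from Sina (sina).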

2. Fetching the HTML with requests

Here is the code:

import requests

newsurl = "http://news.sina.com.cn/china"
res = requests.get(newsurl)
print(res.text)

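The output was garbled, so I checked which encoding was being used:

print(res.encoding)  # check the encoding used to decode the response

To decode the Chinese text correctly, the response encoding has to be set to "utf-8".
The final code for requesting the page is:

import requests

newurl = 'http://news.sina.com.cn/china/'
res = requests.get(newurl)
res.encoding = 'utf-8'
print(res.text)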
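A likely reason for the garbled text: when the response headers declare a text content type without a charset, requests falls back to ISO-8859-1, so UTF-8 bytes get decoded with the wrong codec. The sketch below reproduces that effect offline with plain strings. The sample text and variable names are my own and are not taken from the Sina page.

```python
# Offline demonstration: no network access is needed.
original = "新浪新闻"  # assumed sample text, standing in for the page content

# The bytes a server would send for a UTF-8 encoded page.
raw = original.encode("utf-8")

# Decoding with the wrong codec does not raise an error, because every byte
# is a valid ISO-8859-1 character; it silently produces garbled text instead.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting `res.encoding = 'utf-8'` before reading `res.text` plays the same role as choosing the right codec in the last step.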

3. Parsing the page with BeautifulSoup4

Suppose I have the following HTML snippet (the code below refers to it as the string html_sample):

<html>
    <body>
        <h1 id="title"> Hello World </h1>
        <a href="!" class="link"> This is link1 </a>
        <a href="! link2" class="link"> This is link2 </a>
    </body>
</html>

Next, import the BeautifulSoup4 library and parse the snippet:

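from bs4 import BeautifulSoup

soup = BeautifulSoup(html_sample, 'html.parser')  # use the built-in html.parser
print(soup.text)        # all the text in the document

# Find all elements with a given tag
header = soup.select("h1")
print(header)           # select() returns a Python list
print(header[0])        # first element, without the list brackets
print(header[0].text)   # just the text inside the element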

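Running this prints, in order: the text of the whole document, the list of matching h1 elements, the first h1 element on its own, and finally the text inside it.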
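For reference, here is a self-contained version of the steps above that can be run as a single script. It is only a sketch: html_sample is retyped from the snippet with whitespace trimmed, and the names headers, first and title_text are my own.

```python
from bs4 import BeautifulSoup

# The sample snippet from above, stored as a string.
html_sample = """
<html>
  <body>
    <h1 id="title">Hello World</h1>
    <a href="!" class="link">This is link1</a>
    <a href="! link2" class="link">This is link2</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_sample, "html.parser")

headers = soup.select("h1")        # list-like collection of matching tags
first = headers[0]                 # the first (and only) <h1> tag
title_text = first.text.strip()    # text inside the tag, whitespace trimmed

print(headers)
print(first)
print(title_text)  # Hello World
```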

4. Other similar operations

# Find all elements with a given tag
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select("h1")
print(header)

# Select by CSS attribute: find the element whose id is "title" (prefix the id with #)
alink = soup.select('#title')
print(alink)

# Find all elements whose class is "link" (prefix the class with .)
for link in soup.select('.link'):
    print(link)

# Find the href of every <a> tag
alinks = soup.select("a")
for link in alinks:
    print(link["href"])

# Pull different fields out of each news item by tag.
# Assumption: for this loop, soup is parsed from the Sina news list page fetched in section 2.
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']
        print(time, h2, a)

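A few things to note about the code above:
a) An id is prefixed with a hash sign (#); a class is prefixed with a period (.).
b) In the last loop, check whether the list returned by select('h2') is empty. Only news items that actually contain an h2 are parsed; all others are skipped.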
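Both notes can be tried out offline with a small sketch. The markup below is made up by me, loosely modeled on a news list, and is not copied from the real page. It uses select_one, which returns None when nothing matches, as an alternative to checking the length of the list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one complete news item, one empty placeholder item.
html = """
<div class="news-item">
  <h2><a href="http://example.com/a.html">First headline</a></h2>
  <div class="time">Feb 27 10:00</div>
</div>
<div class="news-item"></div>
<p id="title">Page title</p>
"""

soup = BeautifulSoup(html, "html.parser")

# '#' selects by id, '.' selects by class.
by_id = soup.select("#title")
by_class = soup.select(".news-item")

rows = []
for news in by_class:
    h2 = news.select_one("h2")   # None when the item has no <h2>
    if h2 is None:
        continue                 # skip items with nothing to parse
    time_text = news.select_one(".time").text.strip()
    link = news.select_one("a")["href"]
    rows.append((time_text, h2.text.strip(), link))

print(rows)
```

Without the guard, indexing [0] on the empty item would raise an IndexError.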

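5. Scraping the article content

# Fetch the article page
import requests
from bs4 import BeautifulSoup

url = "http://news.sina.com.cn/c/nd/2017-02-27/doc-ifyavvsh6939815.shtml"
res = requests.get(url)
res.encoding = "utf-8"
print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')

# Title
title = soup.select("#artibodyTitle")[0].text
print(title)

# Source and time
time_source = soup.select('.time-source')[0]
# contents splits the element's children into a list; strip() trims surrounding whitespace
timestamp = time_source.contents[0].strip()
print(timestamp)

# Article body: every paragraph except the last, joined with spaces
article = []
for p in soup.select('#artibody p')[:-1]:
    article.append(p.text.strip())
print(" ".join(article))

# The same thing written as a list comprehension
article = [p.text.strip() for p in soup.select("#artibody p")[:-1]]

# Editor's name: strip the leading label characters
editor = soup.select('.article-editor')[0].text.strip('责任编辑：')
print(editor)

# Comment count
print(soup.select("#commentCount1"))
# Next step: find out where the comment count comes from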
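The extraction steps above can also be wrapped in one function and tested without touching the network. This is a sketch under my own assumptions: the HTML is a made-up page that only reuses the ids and class names from the code above, and extract_article is a hypothetical helper name. Note that str.strip with an argument removes any of the given characters from both ends, not a literal prefix, so the sketch cuts the label off explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical page reusing the selectors from the code above.
html = """
<h1 id="artibodyTitle">Sample headline</h1>
<span class="time-source">2017-02-27 10:00 <span>Some Source</span></span>
<div id="artibody">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p class="article-editor">Editor: Someone</p>
</div>
"""

def extract_article(html_text):
    """Return title, timestamp, body and editor from an article page."""
    soup = BeautifulSoup(html_text, "html.parser")
    title = soup.select_one("#artibodyTitle").text.strip()

    # contents[0] is the text node that comes before the nested <span>.
    timestamp = soup.select_one(".time-source").contents[0].strip()

    # In this made-up page the last <p> is the editor line, so [:-1] drops it.
    paragraphs = [p.text.strip() for p in soup.select("#artibody p")[:-1]]

    # Remove the label as a prefix rather than as a character set.
    editor = soup.select_one(".article-editor").text.strip()
    label = "Editor: "  # assumed label for this sample
    if editor.startswith(label):
        editor = editor[len(label):]

    return {
        "title": title,
        "timestamp": timestamp,
        "body": " ".join(paragraphs),
        "editor": editor,
    }

result = extract_article(html)
print(result)
```

For a real page, the same function could be called with res.text after setting res.encoding.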

6. Summary

This is about as basic as a small scraper gets, and it is a good starting point for beginners.

Keep going and learn Python well!

0 0