Python Web Crawler: Simple Scraping

Source: Internet. Published by: labview编程详解书. Editor: 程序博客网. Time: 2024/06/08 06:29

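Scraping content from a web page takes roughly three steps:

1. Fetch the whole page (its source code).

2. Use a regular expression to extract the content of the tags you care about.

3. Write the extracted content to local disk.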

I. Fetch the whole page (source code)

    # coding: utf-8
    import urllib.request

    def getHtml(url):
        # Open the URL and read the whole response body (returned as bytes)
        html = urllib.request.urlopen(url).read()
        return html

    html = getHtml("http://www.weather.com.cn/weather/101190401.shtml")
    print(html)

urllib.request.urlopen() opens a URL and returns a response object.

read() reads the data returned for that URL, as bytes.
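Because read() returns bytes rather than str, the body usually has to be decoded before any text processing. The following is a minimal sketch of that step, not part of the original article: `fetch_text` is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and UTF-8 is only an assumed fallback for when the response headers declare no charset.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8", timeout=10):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, exactly as sent by the server
        # Prefer the charset declared in the Content-Type header, if any;
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" keeps going if a few bytes do not decode cleanly.
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text("http://www.weather.com.cn/weather/101190401.shtml")` would then return a str that can be passed straight to a regular expression.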

II. Use a regular expression to extract the content of specific tags

    # coding: utf-8
    import re
    import urllib.request

    def getHtml(url):
        html = urllib.request.urlopen(url).read()
        return html

    def getImg(html):
        # Raw string so that \. reaches the regex engine unchanged
        reg = r'src="(.+?\.png)"'
        imgre = re.compile(reg)
        # Decode bytes to str first; without this line findall() raises
        # TypeError: cannot use a string pattern on a bytes-like object
        html = html.decode('utf-8')
        imglist = imgre.findall(html)
        return imglist

    html = getHtml("http://www.weather.com.cn/weather/101190401.shtml")
    print(getImg(html))

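(Press F12 to open the browser's developer tools, where you can view the page source and check the format of the content you want to extract.)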

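The regular expression filters the HTML to obtain the image links.

re.compile() compiles a regular expression string into a regular expression object.

The object's findall() method returns every match of imgre (the compiled regular expression) in html. Because the pattern contains one capture group, each result is just the text inside the parentheses.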
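To make the behaviour of findall() with a capture group concrete, here is a small offline sketch. The HTML fragment is made up for illustration and is not taken from the weather page. It also shows one caveat of the pattern used above: `.` also matches `"`, so `.+?` can run past the end of one attribute into the next tag. The stricter variant using `[^"]` is a suggested alternative, not the article's original pattern.

```python
import re

# Hypothetical HTML fragment, invented for this example only.
SAMPLE_HTML = (
    '<img src="/images/sunny.png" alt="sunny">'
    '<img src="/images/logo.gif">'
    '<img src="http://example.com/rain.png">'
)

# The pattern from the article: "." may also match a double quote.
LOOSE_PATTERN = re.compile(r'src="(.+?\.png)"')
# Suggested stricter variant: the value may not contain a double quote.
STRICT_PATTERN = re.compile(r'src="([^"]+?\.png)"')

loose_links = LOOSE_PATTERN.findall(SAMPLE_HTML)
strict_links = STRICT_PATTERN.findall(SAMPLE_HTML)

# The loose pattern glues the .gif tag onto the following .png link.
print(loose_links)
# The strict pattern returns only the two real .png links:
# ['/images/sunny.png', 'http://example.com/rain.png']
print(strict_links)
```

For anything beyond a quick script, an HTML parser is usually more robust than regular expressions, but the regex approach matches what this article demonstrates.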

III. Save the extracted data locally

    # coding: utf-8
    import re
    import urllib.request

    def getHtml(url):
        html = urllib.request.urlopen(url).read()
        return html

    def getImg(html):
        reg = r'src="(.+?\.png)"'
        imgre = re.compile(reg)
        html = html.decode('utf-8')
        imglist = imgre.findall(html)
        # Save each image as 0.png, 1.png, ... in the current directory.
        # This assumes every matched link is an absolute URL.
        x = 0
        for imgurl in imglist:
            urllib.request.urlretrieve(imgurl, '%s.png' % x)
            x += 1
        return imglist

    html = getHtml("http://www.weather.com.cn/weather/101190401.shtml")
    print(getImg(html))

urllib.request.urlretrieve() downloads remote data directly to a local file.
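urlretrieve() needs a complete URL, but `src` attributes in real pages are often relative (`/i/a.png`) or protocol-relative (`//host/a.png`). The sketch below shows one possible way to handle that by resolving each link against the page URL with urllib.parse.urljoin before downloading. `download_images` and its parameters are hypothetical names introduced for illustration, and the `0.png`, `1.png`, ... naming simply follows the article's scheme.

```python
import os
import urllib.request
from urllib.parse import urljoin


def download_images(page_url, img_links, dest_dir="."):
    """Download each link into dest_dir as 0.png, 1.png, ... (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for index, link in enumerate(img_links):
        # Resolve relative or protocol-relative links against the page URL;
        # links that are already absolute are returned unchanged.
        absolute_url = urljoin(page_url, link)
        target = os.path.join(dest_dir, "%d.png" % index)
        urllib.request.urlretrieve(absolute_url, target)
        saved.append(target)
    return saved
```

A call such as `download_images(url, getImg(html), "images")` would then place the files in an `images` directory, where `url` is the page address that was fetched.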

0 0