Web Crawler MOOC, Week 3: Hands-On Practice

Source: Internet. Editor: 程序博客网. Date: 2024/06/05 07:22

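First, we need to understand what a regular expression is.

"Regular expression" is commonly abbreviated as RE.

Regular expressions are very useful, and their key strength is conciseness.

Example 1: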

'PY'

'PYY'

'PYYY'

'PYYYY'

....

'PYYYYYYY.....' (arbitrarily many Y's)

As a regular expression, all of these strings are written as:

PY+

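Example 2:

Strings that start with 'PY', followed by at most 10 more characters, none of which may be 'P' or 'Y'.

Matches: 'PYABC'

Does not match: 'PYKXYZ' (a 'Y' appears among the following characters)

As a regular expression: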

PY[^PY]{0,10}
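The two examples above can be checked with a small sketch. It assumes that "matches" means the whole string fits the pattern, so it uses re.fullmatch; the helper name matches_whole is made up for this illustration.

```python
import re

def matches_whole(pattern, text):
    """Return True only if the entire string fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Example 1: 'P' followed by one or more 'Y'
print(matches_whole(r'PY+', 'PYYYY'))               # True
print(matches_whole(r'PY+', 'P'))                   # False

# Example 2: 'PY', then at most 10 characters that are neither 'P' nor 'Y'
print(matches_whole(r'PY[^PY]{0,10}', 'PYABC'))     # True
print(matches_whole(r'PY[^PY]{0,10}', 'PYKXYZ'))    # False
```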

============================================

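In short, a regular expression is an expression that concisely describes a set of strings.

Regular expressions are used all the time in text processing:

1. Describing the characteristics of a type of text

2. Finding or replacing a whole set of strings at once

3. Matching all or part of a string (the main use)

To use a regular expression, we first compile it: a string that follows regular expression syntax is converted into a regular expression object.

Regular expression syntax:

Here are some examples: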

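How do we build a regular expression from the strings we want to match?

Example: matching an IP address

\d+\.\d+\.\d+\.\d+ is a first attempt, but it ignores the valid range of each field.

In fact, each field can only take a value from 0 to 255.

First consider:

0-99, written as [1-9]?\d

100-199, written as 1\d{2}

Then,

200-249, written as 2[0-4]\d

250-255, written as 25[0-5]

Join these with the alternation operator and parentheses, escaping the dot so that it matches a literal '.':

(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])

This is the regular expression for an IP address.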
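As a quick check of the range logic, here is a minimal sketch that builds the same pattern from a single field fragment and tests whole strings against it. The names OCTET, IPV4 and is_ipv4 are invented for this example, and non-capturing groups (?:...) are used only to keep the pattern tidy.

```python
import re

# One field in the range 0-255, built from the four cases discussed above
OCTET = r'(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'
# Three "field + literal dot" groups followed by a final field
IPV4 = re.compile(r'(?:' + OCTET + r'\.){3}' + OCTET)

def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

print(is_ipv4('192.168.0.1'))      # True
print(is_ipv4('255.255.255.255'))  # True
print(is_ipv4('256.1.1.1'))        # False
print(is_ipv4('1.2.3'))            # False
```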

==============================================================

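The re library is part of Python's standard library and is used for string matching.

import re

is all that is needed to use it.

Patterns can also be written as ordinary string literals, but that is more cumbersome because every backslash has to be doubled.

Recommendation: when a pattern contains escape characters, a raw string is the best choice.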
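To illustrate the difference, the sketch below writes one pattern both ways. The variable names and the sample text are made up for this example.

```python
import re

raw_pattern = r'\d+\.\d+'        # raw string: backslashes are kept exactly as typed
plain_pattern = '\\d+\\.\\d+'    # ordinary string: every backslash must be doubled

# Both literals produce the same pattern text
print(raw_pattern == plain_pattern)                      # True
print(re.search(raw_pattern, 'version 3.14').group(0))   # 3.14
```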

Test:

import re
match = re.search(r'[1-9]\d{5}', 'BIT 100081')  # search for the postal code 100081
if match:  # a match was found
    print(match.group(0))  # print it

The postal code is printed as expected.

====================================================================

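Next, the match function.

First try:

import re
match = re.match(r'[1-9]\d{5}', 'BIT 100081')  # try to match the postal code 100081
if match:  # a match was found
    print(match.group(0))  # nothing is printed: the string starts with 'BIT', and match only matches at the start

Change it to:

import re
match = re.match(r'[1-9]\d{5}', '100081')  # match the postal code 100081
if match:  # a match was found
    print(match.group(0))  # print it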

====================================================================

Now the findall function:

import re
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)  # prints both postal codes: ['100081', '100084']

====================================================================

Next, the split function:

import re
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))  # ['BIT', ' TSU', '']
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))  # ['BIT', ' TSU100084']

=====================================================================================

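The finditer function:

import re
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0))
# 100081
# 100084

group refers to the parentheses in the pattern: group(0) is the entire matched string, group(1) is whatever the first pair of parentheses matched, and group(2) is whatever the second pair matched.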
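A small sketch makes the group numbering concrete. The pattern below is not from the course; it simply adds two pairs of parentheses, one around the letters and one around the postal code.

```python
import re

# Group 1 captures the capital letters, group 2 captures the six-digit postal code
m = re.search(r'([A-Z]+)([1-9]\d{5})', 'BIT100081 TSU100084')

print(m.group(0))  # BIT100081  (the whole match)
print(m.group(1))  # BIT        (first pair of parentheses)
print(m.group(2))  # 100081     (second pair of parentheses)
print(m.groups())  # ('BIT', '100081')
```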

======================================================================================

The sub function:

import re
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084'))
# BIT:zipcode TSU:zipcode

Both postal codes have been replaced.

==================================================================================

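The use of the Match object is covered later.

The benefit of compiling a pattern with compile is that, once compiled, it is more convenient and faster to use when the same regular expression is needed many times.

A compiled pattern offers the same six operations as the re functions shown above: search, match, findall, split, finditer and sub.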
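As a brief sketch of this object-style usage, the postal-code pattern from the earlier examples is compiled once and then reused. The variable names zipcode and text are chosen for this illustration.

```python
import re

zipcode = re.compile(r'[1-9]\d{5}')   # compile once
text = 'BIT100081 TSU100084'

# The compiled object has the same operations; the pattern argument is no longer passed
print(zipcode.search(text).group(0))   # 100081
print(zipcode.findall(text))           # ['100081', '100084']
print(zipcode.split(text, maxsplit=1)) # ['BIT', ' TSU100084']
print(zipcode.sub(':zipcode', text))   # BIT:zipcode TSU:zipcode
```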

=============================================================

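Using the Match object

Strictly speaking, only the object produced by compile is a real regular expression; the pattern string is just its textual representation.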
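The details of this section did not survive in these notes, so the following is only a sketch of the attributes and methods that Python's Match object is known to provide, applied to the postal-code example used earlier.

```python
import re

m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')

# Attributes describing how the match was made
print(m.string)   # the text that was searched: BIT100081 TSU100084
print(m.re)       # the compiled pattern object that produced the match
print(m.pos)      # start of the search range: 0
print(m.endpos)   # end of the search range: 19

# Methods describing what was matched
print(m.group(0)) # matched text: 100081
print(m.start())  # index where the match starts: 3
print(m.end())    # index just past the match: 9
print(m.span())   # (start, end) as a tuple: (3, 9)
```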

============================================================================

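Greedy matching and minimal matching in the re library

import re
match = re.search(r'PY.*N', 'PYANBNCNDN')  # greedy match: returns the longest matching substring
print(match.group(0))  # PYANBNCNDN
match = re.search(r'PY.*?N', 'PYANBNCNDN')  # minimal match: returns the shortest matching substring
print(match.group(0))  # PYAN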


===========================================================================

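Example 1: a targeted price-comparison crawler for Taobao products

Goal: fetch Taobao search result pages and extract the product names and prices.

First, search for 书包 (schoolbag). The address shown by Taobao is: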

https://s.taobao.com/search?q=书包&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170

q is the keyword parameter, and "书包" is the value that varies.

Then turn the page. Page 2:

https://s.taobao.com/search?q=%E4%B9%A6%E5%8C%85&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170317&bcoffset=1&ntoffset=1&p4ppushleft=1%2C48&s=44

Page 3:

The parameter s is the index of the first product on a given page.
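The URL structure described above can be sketched as a small helper. The step of 44 items per page is an assumption inferred from the s=44 in the page 2 URL, and the function name build_search_url is made up for this example; only the q and s parameters are included.

```python
from urllib.parse import quote

ITEMS_PER_PAGE = 44  # assumption: inferred from s=44 on page 2

def build_search_url(keyword, page_index):
    """Build a search URL for a zero-based page index (hypothetical helper)."""
    # quote() percent-encodes the keyword the same way the browser does
    base = 'https://s.taobao.com/search?q=' + quote(keyword)
    return base + '&s=' + str(ITEMS_PER_PAGE * page_index)

print(build_search_url('书包', 0))
print(build_search_url('书包', 1))
```

Here page_index 0 corresponds to the first page (s=0) and page_index 1 to the second page (s=44).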

==============================================================================================

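Looking at the robots protocol file, we find:

User-Agent: *
Disallow: /

This means crawling is not allowed.

However, this exercise is for practice only and does not place any real load on Taobao.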
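The meaning of those two lines can be confirmed offline with the standard library's urllib.robotparser. This is only a sketch: the rules are fed in as a list of strings rather than downloaded, and the URL being checked is just an example.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, supplied as lines instead of being fetched
rules = ['User-Agent: *', 'Disallow: /']

parser = RobotFileParser()
parser.parse(rules)

# Ask whether any crawler ('*') may fetch an example search URL
allowed = parser.can_fetch('*', 'https://s.taobao.com/search?q=test')
print(allowed)  # False
```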

==============================================================================================

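Step 1: submit the product search request and fetch the pages in a loop.

Step 2: for each page, extract the product names and prices.

Step 3: print the results.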

# Name: Taobao targeted price-comparison crawler
# Task: fetch Taobao search result pages and extract product names and prices
# Key points: Taobao's search interface and pagination
# Approach: requests library, re library
import re
import requests


def getHTMLText(url):  # fetch a page
    try:
        r = requests.get(url, timeout=30)  # URL and timeout in seconds
        r.raise_for_status()  # raise an exception on an HTTP error status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def parsePage(ilt, html):  # extract names and prices
    # the patterns below come from inspecting the page source
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        # find view_price and the number after it; names are found the same way
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)  # *? is a minimal match
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])  # eval strips the quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except Exception:
        print("")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"  # output format: column widths 4, 8 and 16
    print(tplt.format("No.", "Price", "Product name"))
    count = 0  # counter
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))  # index, then price, then title


def main():
    goods = "书包"  # product to search for
    depth = 2  # number of result pages to crawl, here two
    start_url = "https://s.taobao.com/search?q=" + goods
    infoList = []  # list that collects the results
    for i in range(depth):  # fetch and parse each page separately
        try:
            url = start_url + '&s=' + str(44 * i)  # '&' separates query parameters
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue  # on error, move on to the next page
    printGoodsList(infoList)


main()
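As a possible variation on the parsing step, the sketch below uses capture groups so that the quoted values come straight out of findall, which avoids calling eval on text taken from a web page. This is an alternative, not the course's method. The sample_html string is a made-up fragment that only imitates the two field names; it is not real Taobao data.

```python
import re

# Made-up fragment imitating the "raw_title" / "view_price" fields (not real data)
sample_html = (
    '{"raw_title":"Canvas backpack","view_price":"59.00"},'
    '{"raw_title":"Leather school bag","view_price":"128.50"}'
)

def parse_page(html):
    """Return (price, title) pairs using capture groups instead of eval."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)   # minimal match up to the next quote
    return list(zip(prices, titles))

for price, title in parse_page(sample_html):
    print(price, title)
# 59.00 Canvas backpack
# 128.50 Leather school bag
```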

Output:

=========================================================================================

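Example 2: a targeted crawler for stock data

Goal: obtain the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges.

Choose Baidu Stocks (百度股票) and open the page for an individual stock.

Step 1: get the list of stocks from Eastmoney (东方财富网).

Step 2: using that list, fetch each stock's information from Baidu Stocks one by one.

Step 3: save the results to a file.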
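Step 1 depends on picking stock codes out of link targets. Here is a minimal sketch of that idea, run on invented href values instead of the live page; the sample_hrefs list and the extract_codes name are assumptions made for this example.

```python
import re

# Invented link targets; real ones would come from the <a> tags of the list page
sample_hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/',
]

def extract_codes(hrefs):
    """Collect codes of the form 'sh' or 'sz' followed by six digits."""
    codes = []
    for href in hrefs:
        found = re.findall(r's[hz]\d{6}', href)
        if found:                 # skip links that contain no stock code
            codes.append(found[0])
    return codes

print(extract_codes(sample_hrefs))  # ['sh600000', 'sz000001']
```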

# Name: targeted crawler for stock data
# Task: obtain the names and trading information of all Shanghai and Shenzhen stocks
# Output: saved to a file
# Approach: bs4 library, requests library, re library
# Sites:
#     Baidu Stocks  http://gupiao.baidu.com/stock/
#     Eastmoney     http://quote.eastmoney.com/stocklist.html
import re
import requests
from bs4 import BeautifulSoup
import traceback  # used to trace exceptions


def getHTMLText(url):  # same as before
    try:
        r = requests.get(url, timeout=30)  # URL and timeout in seconds
        r.raise_for_status()  # raise an exception on an HTTP error status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):  # get the list of stocks
    html = getHTMLText(stockURL)  # fetch the page
    soup = BeautifulSoup(html, 'html.parser')  # parse the page
    a = soup.find_all('a')  # stock codes sit in the <a> tags of the Eastmoney page
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])  # 's', then 'h' or 'z', then 6 digits
        except Exception:
            continue


def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":  # nothing was fetched
                continue
            infoDict = {}  # dictionary for this stock
            soup = BeautifulSoup(html, 'html.parser')  # parse the page
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})  # the block holding this stock's data

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]  # the name element
            infoDict.update({'股票名称': name.text.split()[0]})  # key means "stock name"; split() keeps only the name
            # collect the key/value pairs describing the stock
            keyList = stockInfo.find_all('dt')  # keys
            valueList = stockInfo.find_all('dd')  # values
            for i in range(len(keyList)):  # rebuild the pairs and store them in the dictionary
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val  # add a new entry to the dictionary
            with open(fpath, 'a', encoding='utf-8') as f:  # append the information to the file
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()  # print the error details
            continue


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'  # trailing slash: the stock code is appended directly
    output_file = 'D://BaiduStockInfo.txt'
    slist = []  # stock codes are collected here
    getStockList(slist, stock_list_url)  # get the list of stocks
    getStockInfo(slist, stock_info_url, output_file)  # get the information for each stock


main()

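Next, we optimize the program: the encoding is passed in directly instead of being detected for every page, and a progress indicator is added.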

# Name: targeted crawler for stock data
# Task: obtain the names and trading information of all Shanghai and Shenzhen stocks
# Output: saved to a file
# Approach: bs4 library, requests library, re library
# Sites:
#     Baidu Stocks  http://gupiao.baidu.com/stock/
#     Eastmoney     http://quote.eastmoney.com/stocklist.html
import re
import requests
from bs4 import BeautifulSoup
import traceback  # used to trace exceptions


def getHTMLText(url, code='utf-8'):  # same as before, plus an encoding parameter
    try:
        r = requests.get(url, timeout=30)  # URL and timeout in seconds
        r.raise_for_status()  # raise an exception on an HTTP error status
        # r.encoding = r.apparent_encoding  # this is the part being optimized
        r.encoding = code  # Baidu uses utf-8 by default
        return r.text
    except requests.RequestException:
        return ""


def getStockList(lst, stockURL):  # get the list of stocks
    html = getHTMLText(stockURL, 'GB2312')  # fetch the page; Eastmoney is encoded in GB2312
    soup = BeautifulSoup(html, 'html.parser')  # parse the page
    a = soup.find_all('a')  # stock codes sit in the <a> tags of the Eastmoney page
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])  # 's', then 'h' or 'z', then 6 digits
        except Exception:
            continue


def getStockInfo(lst, stockURL, fpath):
    count = 0  # used for the progress display below
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":  # nothing was fetched
                continue
            infoDict = {}  # dictionary for this stock
            soup = BeautifulSoup(html, 'html.parser')  # parse the page
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})  # the block holding this stock's data

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]  # the name element
            infoDict.update({'股票名称': name.text.split()[0]})  # key means "stock name"; split() keeps only the name
            # collect the key/value pairs describing the stock
            keyList = stockInfo.find_all('dt')  # keys
            valueList = stockInfo.find_all('dd')  # values
            for i in range(len(keyList)):  # rebuild the pairs and store them in the dictionary
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val  # add a new entry to the dictionary
            with open(fpath, 'a', encoding='utf-8') as f:  # append the information to the file
                f.write(str(infoDict) + '\n')
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")  # live progress display
            # \r returns to the start of the line without a line break, so the next print overwrites this one
            # in some shells \r may be disabled, so the overwrite effect does not appear
        except Exception:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue


def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'  # trailing slash: the stock code is appended directly
    output_file = 'D://BaiduStockInfo.txt'
    slist = []  # stock codes are collected here
    getStockList(slist, stock_list_url)  # get the list of stocks
    getStockInfo(slist, stock_info_url, output_file)  # get the information for each stock


main()
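The carriage-return trick used for the progress display can be tried on its own, without any network access. The sketch below is a stand-alone illustration; progress_line is a hypothetical helper and the loop of 5 steps stands in for the list of stocks.

```python
import time

def progress_line(done, total):
    """Format a progress message that starts with a carriage return."""
    return "\rProgress: {:.2f}%".format(done * 100 / total)

total = 5  # stand-in for len(lst)
for done in range(1, total + 1):
    time.sleep(0.01)                            # placeholder for real work
    print(progress_line(done, total), end="")   # overwrite the same line each time
print()  # finish with a newline once the loop is done
```

In a terminal this shows a single line that counts up to 100.00%; in environments that do not honor \r, each message may appear one after another instead.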

0 0

爬虫MOOC 第三周 实战

爬虫MOOC 第三周实战