Learning Python Web Scraping: Day 20

Source: Internet | Published by: 人民日报图文数据库 | Editor: 程序博客网 | Time: 2024/06/05 16:38

Today's exercise combines an API with web scraping to find out where most of Wikipedia's contributors are located.

Exercise 1: Collect the IP addresses of Wikipedia's anonymous contributors (test15.py)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import random
import datetime
import re


# Get the internal article links on a page
def getlinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    # Name the parser explicitly to avoid the "no parser specified" warning
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only /wiki/ links whose path contains no colon
    links = bsObj.find('div', {'id': 'bodyContent'}).findAll(
        'a', href=re.compile("^(/wiki/)((?!:).)*$"))
    return links


# Get the IP addresses of anonymous contributors
def getHistoryIps(pageUrl):
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title=" + pageUrl + '&action=history'
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Anonymous editors are listed as links with the class mw-anonuserlink
    ipAddresses = bsObj.findAll('a', {'class': 'mw-anonuserlink'})
    adresseslist = set()
    for ipAdress in ipAddresses:
        adresseslist.add(ipAdress.get_text())
    return adresseslist


# Seed with a float timestamp; Python 3.11+ rejects a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())
links = getlinks("/wiki/Python_(programming_language)")
while len(links) > 0:
    for link in links:
        print('______________')
        addresses = getHistoryIps(link.attrs['href'])
        for address in addresses:
            print(address)
    # Jump to a random linked article and repeat
    newlink = links[random.randint(0, len(links) - 1)]
    links = getlinks(newlink.attrs['href'])

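In the getHistoryIps function above there is this statement: adresseslist = set(). Note that set() is a built-in constructor, not a keyword, and the variable adresseslist it creates is a set. Sets deserve a short introduction here.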

A brief introduction to Python's set type

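So far I have used two Python data structures to store different kinds of data: lists and dictionaries. With two data types already available, why use a set as well? Python sets are unordered, which means you cannot retrieve an element by its position. The order in which you add items to a set and the order in which you get them back may well differ. In the sample code above, one benefit of using a set is that it does not store duplicate values: if you add a value that is already present, the set simply ignores it. That lets us quickly collect the distinct IP addresses on a revision history page without worrying about the same editor appearing several times.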
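To make the deduplication behaviour concrete, here is a minimal sketch. The addresses are made-up values from the reserved documentation ranges, not real Wikipedia data, and the variable names are chosen only for illustration.

```python
# Hypothetical edit-history entries: the first address appears twice
history_entries = ["203.0.113.7", "198.51.100.23", "203.0.113.7"]

unique_ips = set()
for ip in history_entries:
    unique_ips.add(ip)  # adding a value that is already present changes nothing

print(len(unique_ips))               # number of distinct addresses
print("203.0.113.7" in unique_ips)   # membership test

# A set has no positions, so indexing is not supported
try:
    unique_ips[0]
    indexing_works = True
except TypeError:
    indexing_works = False

# Convert to a sorted list when a stable order is needed
ordered = sorted(unique_ips)
print(ordered)
```

If the output order matters, for example when writing results to a file, sorting the set as in the last step is one simple option.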

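For code that may need to be extended later, there is a trade-off to weigh when choosing between a set and a list: iterating over a list is slightly faster than iterating over a set, but a set is much faster for lookups (checking whether an object is in the collection). A Python set is built on a hash table, roughly like a dictionary that stores only keys, so a membership test takes O(1) time on average, whereas a list has to be scanned element by element.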
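The difference in lookup cost can be observed without timing anything by counting equality checks. The sketch below uses a hypothetical helper class, CountingKey, that is not part of the exercises. The exact counts depend on CPython's implementation and on the hash values not colliding, so treat them as an illustration rather than a guarantee.

```python
class CountingKey:
    """Hypothetical helper that counts how often == is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.value == other.value


items = [CountingKey(i) for i in range(1000)]
as_set = set(items)
target = CountingKey(999)  # equal to the last element, but a different object

# List lookup: scans from the front, comparing each element with the target
CountingKey.comparisons = 0
found_in_list = target in items
list_comparisons = CountingKey.comparisons

# Set lookup: jumps to the slot given by the hash, then confirms with ==
CountingKey.comparisons = 0
found_in_set = target in as_set
set_comparisons = CountingKey.comparisons

print(found_in_list, list_comparisons)
print(found_in_set, set_comparisons)
```

On CPython this should report 1000 comparisons for the list and a single comparison for the set, which is why a set is a reasonable default when the code mostly asks "have I seen this value before?".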

Exercise 2: A small change: use an API to geolocate the IP addresses we found

from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup
import random
import datetime
import re
import json


# Get the internal article links on a page
def getlinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only /wiki/ links whose path contains no colon
    links = bsObj.find('div', {'id': 'bodyContent'}).findAll(
        'a', href=re.compile("^(/wiki/)((?!:).)*$"))
    return links


# Get the IP addresses of anonymous contributors
def getHistoryIps(pageUrl):
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title=" + pageUrl + '&action=history'
    html = urlopen(historyUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    ipAddresses = bsObj.findAll('a', {'class': 'mw-anonuserlink'})
    adresseslist = set()
    for ipAdress in ipAddresses:
        adresseslist.add(ipAdress.get_text())
    return adresseslist


# Look up the country code for an IP address through the geolocation API.
# The endpoint is kept as in the original exercise; it may no longer be reachable.
def getAddresses(historyIP):
    try:
        html = urlopen('http://freegeoip.net/json/' + historyIP)
        response = html.read().decode('utf-8')
    except URLError:  # also covers HTTPError, which is a subclass of URLError
        return None
    jsonObj = json.loads(response)
    return jsonObj.get('country_code')


# Seed with a float timestamp; Python 3.11+ rejects a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())
links = getlinks("/wiki/Python_(programming_language)")
while len(links) > 0:
    for link in links:
        print('______________')
        addresses = getHistoryIps(link.attrs['href'])
        for address in addresses:
            # getAddresses may return None, which cannot be concatenated to a string
            country = getAddresses(address)
            print('IP: ' + address + ' is in: ' + (country or 'unknown'))
    newlink = links[random.randint(0, len(links) - 1)]
    links = getlinks(newlink.attrs['href'])

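Compared with the first version, this adds a getAddresses function that calls the API to look up the physical location (here, the country code) for each IP address.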
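The JSON handling inside getAddresses can be tried offline, without calling any service. The sketch below is only an illustration: the response body is a made-up sample (a real API may return more or different fields), and the helper names country_code_from_json and describe are hypothetical, not part of the exercise code.

```python
import json

# Hypothetical response body, using an address from a reserved documentation range
sample_response = '{"ip": "203.0.113.7", "country_code": "US"}'


def country_code_from_json(response_text):
    """Return the country_code field, or None if it cannot be read."""
    try:
        json_obj = json.loads(response_text)
    except json.JSONDecodeError:
        return None  # the body was not valid JSON
    if not isinstance(json_obj, dict):
        return None  # valid JSON, but not an object with named fields
    return json_obj.get('country_code')


def describe(ip, country_code):
    # Fall back to 'unknown' so that a None result never breaks the concatenation
    return 'IP: ' + ip + ' is in: ' + (country_code or 'unknown')


print(describe("203.0.113.7", country_code_from_json(sample_response)))
print(describe("203.0.113.8", country_code_from_json("not json")))
```

Separating the parsing from the network call in this way makes it easy to check how the program behaves when the service returns something unexpected.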
