Python Crawler Notes (Deep Crawling of Wikipedia Pages)

#! /usr/bin/env python
# coding=utf-8
import urllib2
from bs4 import BeautifulSoup
import re
import datetime
import random

# Seed the generator from the system clock so each run takes a different walk
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urllib2.urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only internal article links: hrefs that start with /wiki/
    # and contain no colon (a colon marks special pages, files, etc.)
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random internal link on the current page and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
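The code above is Python 2 only (urllib2 does not exist in Python 3). As a minimal sketch of the same random walk on Python 3, assuming urllib.request.urlopen as the drop-in replacement for urllib2.urlopen:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random

# Calling seed() with no argument uses OS entropy or the current time,
# giving the same "new walk on every run" effect as seeding from the clock
random.seed()

def get_links(article_url):
    html = urlopen("http://en.wikipedia.org" + article_url)
    bs_obj = BeautifulSoup(html, "html.parser")
    # Same filter as before: internal article links only
    return bs_obj.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = get_links("/wiki/Kevin_Bacon")
while len(links) > 0:
    new_article = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(new_article)
    links = get_links(new_article)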

PSEUDORANDOM NUMBERS AND RANDOM SEEDS
In the previous example, I used Python’s random number generator to select an article at random on each page in
order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.
While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason,
random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and
hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work
with initially. The exact same seed will produce the exact same sequence of "random" numbers every time, so I've used the system clock as a seed to produce new sequences of random numbers, and thus new sequences of random articles, on every run. This makes the program a little more exciting to run.
For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While
it produces random numbers that are difficult to predict and uniformly distributed, it is slightly processor intensive.
Random numbers this good don’t come cheap!
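The point about seeds determining the sequence is easy to verify for yourself; a short sketch using Python's random module:

import random

# The same seed always yields the same "random" sequence
random.seed(42)
first = [random.randint(0, 100) for _ in range(5)]

random.seed(42)
second = [random.randint(0, 100) for _ in range(5)]

assert first == second  # identical sequences from identical seeds

# Reseeding from the clock (or omitting the seed) varies the run
random.seed()
print([random.randint(0, 100) for _ in range(5)])

This is why the crawler reseeds from the clock at startup: with a fixed seed it would visit the exact same chain of articles on every run.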
