Python Crawler Notes (Deep Crawling of Wikipedia Pages)
Source: Internet  Editor: 程序博客网  Time: 2024/06/08 07:35
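#!/usr/bin/env python
# coding=utf-8
import urllib2  # Python 2 only; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup
import re
import datetime
import random

# Seed the generator with the current time so each run takes a different path
random.seed(datetime.datetime.now())


def getLinks(articleUrl):
    # Fetch the article and return every internal article link in its body
    html = urllib2.urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep hrefs that start with /wiki/ and contain no colon (skips File:, Talk:, etc.)
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))


links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random link on the current page and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)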
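The link filter in getLinks rests on a single regular expression, so it can help to try it on a few sample paths before running the crawler. The sketch below is only an illustration and not part of the original script: the sample hrefs and the helper name is_article_link are made up for the example. It suggests that the pattern keeps paths that start with /wiki/ and contain no colon, which skips namespaced pages such as File: or Talk:.

```python
import re

# Same pattern as in getLinks: starts with /wiki/ and has no colon after it
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")


def is_article_link(href):
    """Hypothetical helper: True if href looks like an internal article link."""
    return ARTICLE_LINK.match(href) is not None


# Sample hrefs invented for illustration
samples = [
    "/wiki/Kevin_Bacon",
    "/wiki/File:Kevin_Bacon.jpg",
    "/wiki/Talk:Kevin_Bacon",
    "//en.wikipedia.org/wiki/Main_Page",
    "#cite_note-1",
]

for href in samples:
    print(href, is_article_link(href))
```

Only the first sample passes the filter. The others either contain a colon after /wiki/ or do not start with /wiki/ at all.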
PSEUDORANDOM NUMBERS AND RANDOM SEEDS
In the previous example, I used Python’s random number generator to select an article at random on each page in
order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.
While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason,
random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and
hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work
with initially. The exact same seed will produce the exact same sequence of “random” numbers every time, so for
this reason I’ve used the system clock as a starter for producing new sequences of random numbers, and, thus, new
sequences of random articles. This makes the program a little more exciting to run.
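The effect of the seed can be checked directly. The following is a minimal sketch, with the seed value 42 and the helper name first_picks chosen arbitrarily for illustration: seeding twice with the same value replays the same picks, while seeding from the clock normally gives a different sequence on each run.

```python
import random
import time


def first_picks(seed, count=5):
    """Hypothetical helper: seed the generator, then draw `count` indices in 0..99."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(count)]


# The same seed always replays the same "random" sequence
run_a = first_picks(42)
run_b = first_picks(42)
print(run_a == run_b)  # True

# Seeding from the clock changes the sequence between runs
print(first_picks(time.time()))
```

Note that in Python 3.11 and later, random.seed accepts only None, int, float, str, bytes, or bytearray, so a datetime object would have to be converted first, for example by using time.time() as above. Calling random.seed() with no argument also seeds from the system time or an operating system randomness source.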
For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While
it produces uniformly distributed numbers that are difficult to guess in casual use, it is fully deterministic and is
not suitable for security purposes. It is also slightly processor intensive.
Random numbers this good don’t come cheap!