Using a Python crawler to analyze the influence distribution of a subset of Wikipedia entries

This article covers: 1. a brief introduction to writing a web crawler in Python; 2. starting from one initial link, collecting many more and gathering the editors' (contributors') IP addresses; 3. geolocating the collected IPs and tallying the results.

1. Writing a web crawler in Python

Writing a web crawler in Python is arguably easier than in any other language. A crawler's work breaks down into roughly three steps:

1) fetch a page from the server
2) parse the page and extract the content you need
3) save the extracted content

Master these three steps and you can write Python crawlers. Each step can also be elaborated endlessly, and a crawler built on those elaborations is an advanced crawler. For example, when fetching a page from the server: do you use a proxy? Do you need a local cache? Cookies? When parsing and extracting: what do you parse with, regular expressions? BeautifulSoup? lxml? And so on. Here, a simple version of each step is enough to complete this project.

1. Fetch a page from the server

    from urllib.request import urlopen
    html = urlopen(url)

2. Parse the page and extract the content you need

    # find the links on a webpage
    from bs4 import BeautifulSoup
    bsObj = BeautifulSoup(html.read(), 'lxml')
    bsObj.findAll('a')

3. Save the extracted content (to a local file or to a database)

    # save the data in a local file
    pass

Tips: the goal here is not to teach crawler-writing in detail, only to sketch the idea; for specifics, consult any of the books on the topic. I recommend Web Scraping with Python.
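Putting the three steps together, here is a minimal end-to-end sketch; the page URL and the links.txt output file are placeholders of my own choosing, not part of the project itself:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # step 1: fetch a page from the server
    html = urlopen('https://zh.wikipedia.org/wiki/Python')

    # step 2: parse the page and pull out every link target
    bsObj = BeautifulSoup(html.read(), 'lxml')
    links = [a.get('href') for a in bsObj.findAll('a') if a.get('href')]

    # step 3: save the extracted content to a local file
    with open('links.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(links))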

2. Collecting contributor IPs

Note: on the Chinese-language Wikipedia, an entry's edit history page lists its historical editors, i.e. its contributors. Of course, collecting every entry on Chinese Wikipedia would be no small amount of work. Here I simply start from one link, collect 1000 internal links in total, and then look up the IPs.

To collect the data we need, we first have to determine where it lives. Observations:

1. A Chinese Wikipedia entry link looks like: https://zh.wikipedia.org/wiki/%E6%8A%AB%E9%A0%AD%E5%9B%9B%E6%A8%82%E9%9A%8A
2. The entry's revision-history link looks like: https://zh.wikipedia.org/w/index.php?title=%E6%8A%AB%E9%A0%AD%E5%9B%9B%E6%A8%82%E9%9A%8A&action=history

So to get the historical contributors' IPs, we need to convert each entry link into its history link, then scrape the contributor IPs from the history page; a small helper for the conversion is sketched below.
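The conversion itself is just string surgery on the URL. A minimal sketch (the helper name entryToHistoryUrl is mine, not from the original code):

    def entryToHistoryUrl(entryUrl):
        # turn .../wiki/<title> into .../w/index.php?title=<title>&action=history
        title = entryUrl.rsplit('/wiki/', 1)[1]
        return ('https://zh.wikipedia.org/w/index.php?title='
                + title + '&action=history')

    print(entryToHistoryUrl('https://zh.wikipedia.org/wiki/%E6%8A%AB%E9%A0%AD%E5%9B%9B%E6%A8%82%E9%9A%8A'))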

3. Resolving IP addresses to real-world locations

Here we resolve the IPs through a web service that supports both IPv4 and IPv6 and responds quickly: https://freegeoip.net/json/IP address. It returns JSON, and Python ships with a built-in json module, so the rest is easy. For example:

    def getCountryFromIP(address):
        url = 'https://freegeoip.net/json/' + address
        resp = urlopen(url).read().decode('utf-8')
        ...

    getCountryFromIP('114.255.40.41')

The response body is standard JSON:

    {"ip":"114.255.40.41","country_code":"CN","country_name":"China","region_code":"11","region_name":"Beijing","city":"Beijing","zip_code":"","time_zone":"Asia/Shanghai","latitude":39.9289,"longitude":116.3883,"metro_code":0}

It can be parsed directly with the json module:

    import json
    json.loads(resp).get('country_name')

Note: json.loads expects a str, so the response bytes must be decoded first.
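Filling in the body, a complete version of the lookup might look like this; it is a sketch that assumes the freegeoip.net endpoint still responds as shown above, and the name getLocationFromIP is mine:

    from urllib.request import urlopen
    import json

    def getLocationFromIP(address):
        # query the geolocation service and parse its JSON response
        url = 'https://freegeoip.net/json/' + address
        resp = urlopen(url).read().decode('utf-8')  # decode bytes: json.loads wants str
        data = json.loads(resp)
        return [address, data.get('country_name'), data.get('city')]

    print(getLocationFromIP('114.255.40.41'))  # ['114.255.40.41', 'China', 'Beijing']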

That covers the main content. This little project was done on a whim; my skills are limited and I didn't spend much time polishing it, so I'll leave it here, show some rough results, and paste the code along the way, as a record of this bit of spontaneity. If it helps anyone at all, that would be the highest honor.

A sample of the collected IP addresses and their resolved locations, IPv6 included; only part of the entries reachable from a single starting point was collected.

['59.66.81.218', 'China', 'Beijing\n']
['134.148.10.13', 'Australia', 'Newcastle\n']
['2001:da8:201:1235:add0:3ac0:f9db:c272', 'China', 'Beijing\n']
['211.72.118.98', 'Taiwan\n']
['46.140.124.122', 'Switzerland', 'Winterthur\n']
['123.193.83.141', 'Taiwan', 'Taipei\n']
['111.70.16.213', 'Taiwan', 'Taipei\n']
['111.70.16.213', 'Taiwan', 'Taipei\n']
['117.136.0.252', 'China', 'Beijing\n']
['118.165.99.108', 'Taiwan', '\n']
['14.111.20.141', 'China', 'Chongqing\n']
['114.46.203.245', 'Taiwan', 'Taichung\n']
['116.208.193.21', 'China', 'Wuhan\n']
['116.208.193.21', 'China', 'Wuhan\n']
['114.246.180.114', 'China', 'Beijing\n']
['113.232.229.174', 'China', 'Shenyang\n']
['221.200.238.87', 'China', 'Shenyang\n']
['67.198.207.26', 'United States', 'Orange\n']
['119.236.176.232', 'Hong Kong', 'Central District\n']
['116.235.110.168', 'China', 'Shanghai\n']
['2601:647:4e01:56d4:7d6b:977:b2bb:3b49', 'United States', 'Sunnyvale\n']
['182.155.188.153', 'Taiwan', 'Taichung\n']
['46.140.124.122', 'Switzerland', 'Winterthur\n']
['46.140.124.122', 'Switzerland', 'Winterthur\n']
['46.140.124.122', 'Switzerland', 'Winterthur\n']
['46.140.124.122', 'Switzerland', 'Winterthur\n']
['100.8.204.188', 'United States', 'Jersey City\n']
['149.159.1.41', 'United States', 'Bloomington\n']
['119.82.250.83', 'Cambodia', '\n']
['119.82.250.83', 'Cambodia\n']
['203.210.0.59', 'Hong Kong', 'Central District\n']
['220.133.254.206', 'Taiwan', 'Kaohsiung City\n']
['122.14.140.11', 'China', 'Shenzhen\n']
['119.82.250.83', 'Cambodia\n']
['119.82.250.83', 'Cambodia\n']
['119.82.250.83', 'Cambodia\n']
['129.100.76.127', 'Canada', 'London\n']
['194.83.163.206', 'United Kingdom', 'Buckingham\n']
['194.83.163.206', 'United Kingdom', 'Buckingham\n']
['123.202.99.188', 'Hong Kong', 'Central District\n']
['180.99.66.217', 'China', 'Xuzhou\n']
['180.99.66.217', 'China', 'Xuzhou\n']
['98.242.65.162', 'United States', 'Decatur\n']
['112.120.189.28', 'Hong Kong', 'Central District\n']
['112.120.189.28', 'Hong Kong', 'Central District\n']
['180.157.128.252', 'China', 'Shanghai\n']
['60.175.100.232', 'China', 'Hefei\n']
['59.46.82.10', 'China', 'Shenyang\n']
['209.107.204.13', 'United States', 'Costa Mesa\n']
['222.33.38.82', 'China', 'Beijing\n']
['216.129.107.106', 'United States', 'Sunnyvale\n']
['216.129.107.106', 'United States', 'Sunnyvale\n']
['216.129.107.106', 'United States', 'Sunnyvale\n']
['70.48.229.89', 'Canada\n']
['134.148.10.13', 'Australia', 'Newcastle\n']
['60.162.176.50', 'China', 'Taizhou\n']
['219.77.187.75', 'Hong Kong', 'Central District\n']
['130.226.87.174', 'Denmark', 'Odense\n']
['130.226.87.137', 'Denmark', 'Odense\n']
['180.222.6.7', 'Australia', 'Ballarat\n']
['180.222.6.7', 'Australia', 'Ballarat\n']
['117.139.205.157', 'China', 'Chengdu\n']
['111.251.180.87', 'Taiwan', '\n']
['61.92.66.139', 'Hong Kong', 'Central District\n']
['58.165.156.2', 'Australia', 'Arana Hills\n']
['58.165.156.2', 'Australia', 'Arana Hills\n']
['58.165.156.2', 'Australia', 'Arana Hills\n']
['58.165.156.2', 'Australia', 'Arana Hills\n']
['58.165.156.2', 'Australia', 'Arana Hills\n']
['2001:b011:800f:1f72:ada3:8354:a780:93ba', 'Taiwan', 'Taichung\n']
['223.141.40.117', 'Taiwan', 'Taipei\n']
['49.219.13.131', 'Taiwan', 'Taipei\n']
['46.166.138.164', 'Netherlands\n']
['2601:647:4e01:56d4:558:5287:83b7:ef1c', 'United States', 'Sunnyvale\n']
['182.239.105.109', 'Hong Kong', 'Kwun Hang\n']
['182.239.105.109', 'Hong Kong', 'Kwun Hang\n']
['194.135.232.11', 'Russia', 'Moscow\n']
['14.53.67.111', 'Republic of Korea', 'Paju\n']
['118.122.85.171', 'China', 'Chengdu\n']
['125.34.158.62', 'China', 'Beijing\n']
['175.25.240.4', 'China', 'Beijing\n']
['118.186.207.4', 'China', 'Beijing\n']
['114.250.131.234', 'China', 'Beijing\n']
['123.120.66.217', 'China', 'Beijing\n']
['123.120.61.45', 'China', 'Beijing\n']
['41.210.129.62', 'Uganda', '\n']
['114.42.130.142', 'Taiwan', '\n']
['223.244.16.100', 'China', 'Hefei\n']
['113.52.97.198', 'Macao', '\n']
['113.52.99.88', 'Macao', '\n']
['2001:b000:1c9:0:61:219:36:93', 'Taiwan', 'Taipei', '\n']
['122.254.5.178', 'Taiwan', 'Taipei\n']
['80.71.135.23', 'Denmark', 'Copenhagen\n']
['80.71.135.23', 'Denmark', 'Copenhagen\n']
['220.191.181.164', 'China', 'Hangzhou\n']
['113.52.99.156', 'Macao\n']
['223.73.94.30', 'China', 'Changsha\n']
['27.109.176.227', 'Macao', '\n']
['106.37.14.1', '', 'China', 'Beijing\n']
['2001:da8:e000:1100:d1cb:6c6a:e290:462e', 'China', 'Hangzhou\n']
['75.82.28.208', 'United States', 'Los Angeles\n']
['197.214.64.218', 'Equatorial Guinea\n']
['59.41.182.183', 'China', 'Guangzhou\n']
['153.101.102.11', 'China', 'Nanjing\n']
['202.111.51.204', 'China', 'Nanjing\n']
['125.77.120.42', 'China', 'Fuzhou\n']
['125.77.120.42', 'China', 'Fuzhou']

If you're interested, these IP addresses could later be further classified, aggregated, visualized on a map, and so on, but that's for another time.
Here is the code:

    '''
    to find which city has more contributions to Chinese Wikipedia
    to do this, I would:
        1. get a lot of links, starting from the BUPT entry
        2. from those links, get the IP addresses (IPv4 and IPv6)
        3. with the IP address list, find each location and count
        4. rank
    '''
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    import json
    import random

    startURL = r'https://zh.wikipedia.org/wiki/%E5%8C%97%E4%BA%AC%E9%82%AE%E7%94%B5%E5%A4%A7%E5%AD%A6'

    # get every internal /wiki/ link in the body of a page
    def getLinks(url):
        html = urlopen(url)
        bsObj = BeautifulSoup(html, 'lxml')
        return bsObj.find('div', {'id': 'bodyContent'}).findAll(
            'a', href=re.compile('^(/wiki/)((?!:).)*$'))

    # follow randomly chosen internal links until 1000 links are collected
    onethousandslinks = []
    def getOneThousandLinks(startUrl):
        global onethousandslinks
        links = getLinks(startUrl)
        onethousandslinks.extend(links)
        while len(onethousandslinks) < 1000:
            link = random.choice(onethousandslinks)
            getOneThousandLinks(r'https://zh.wikipedia.org' + link.attrs['href'])
        return onethousandslinks

    # for each collected page, scrape the anonymous editors' IPs from its
    # history page and geolocate each one
    def getHistoryIP(allLinks):
        for link in allLinks:
            pageUrl = link.attrs['href'].replace('/wiki/', '')
            historyUrl = (r'https://zh.wikipedia.org/w/index.php?title='
                          + pageUrl + '&action=history')
            html = urlopen(historyUrl)
            bsObjHistory = BeautifulSoup(html.read(), 'lxml')
            historyLinks = bsObjHistory.findAll('a', {'class': 'mw-userlink mw-anonuserlink'})
            for userLink in historyLinks:
                print(userLink.get_text())
                url = 'https://freegeoip.net/json/' + userLink.get_text()
                resp = urlopen(url).read().decode('utf-8')
                print(json.loads(resp).get('country_name'))
                print(json.loads(resp).get('city'))

    # get the 1000 links, de-duplicate, then look up the contributors
    allLinks = getOneThousandLinks(startURL)
    allLinks = set(allLinks)
    getHistoryIP(allLinks)
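Note that the listing above only prints each lookup; steps 3 and 4 of the plan in the docstring (count and rank) are never actually performed. Here is a minimal sketch of that tally, assuming the per-IP results have first been gathered into a list of [ip, country, city] records like the ones shown earlier (the name rankByCountry is mine):

    from collections import Counter

    # records: a list of [ip, country, city] entries from the lookups above
    def rankByCountry(records):
        counts = Counter(r[1] for r in records if r[1])
        # most_common() returns (country, count) pairs, highest count first
        return counts.most_common()

    records = [['59.66.81.218', 'China', 'Beijing'],
               ['134.148.10.13', 'Australia', 'Newcastle'],
               ['117.136.0.252', 'China', 'Beijing']]
    print(rankByCountry(records))  # [('China', 2), ('Australia', 1)]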

If you have any questions, feel free to leave a comment or send me a message~ And if you spot shortcomings in this post, you're welcome to come battle me for three hundred rounds, haha (^__^) ……
