Extracting the main domains you visited from Chrome's saved history HTML

Source: Internet; Published by: 北通手柄知乎; Editor: 程序博客网; Time: 2024/04/30 12:01

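Press Ctrl+H in Chrome to open the history page. If you save that page as an html file, you can use Python's bs4 library to pick out every node that has both an href and an id attribute, together with its href value.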
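For readers who cannot install bs4, the same filtering step can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the bs4 code used in this post. The class name HrefIdCollector, the helper collect_hrefs and the sample markup are all made up for illustration; the real saved history page may be structured differently.

```python
from html.parser import HTMLParser


class HrefIdCollector(HTMLParser):
    """Collect href values from tags that carry both href and id."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Keep the link only if it has a non-empty href and an id attribute.
        if attr_map.get("href") and "id" in attr_map:
            self.hrefs.append(attr_map["href"])


def collect_hrefs(html_text):
    """Return the href of every tag that has both href and id."""
    parser = HrefIdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical markup, only to exercise the filter.
sample = (
    '<a id="e1" href="https://www.baidu.com/s?wd=python">search</a>'
    '<a href="http://no-id.example/page">no id</a>'
    '<a id="e2" href="http://www.acfun.tv/v/ac1">video</a>'
)
print(collect_hrefs(sample))
```

The link without an id is skipped, which mirrors the "href and id" condition described above.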

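The main domain is everything before the third slash (in https://www.baidu.com/, for example, the part before the third slash is the main domain), so I record the index of the third slash and take the substring up to it.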
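The third-slash rule can be tried on its own with a small Python 3 sketch. The helper name main_domain is hypothetical, and returning None for a URL with fewer than three slashes is my own choice for the sketch.

```python
def main_domain(url):
    """Return the part of url before its third "/", or None if there is none."""
    pos = -1
    for _ in range(3):
        # Look for the next slash after the previous one.
        pos = url.find("/", pos + 1)
        if pos == -1:
            return None
    return url[:pos]


print(main_domain("https://www.baidu.com/"))
print(main_domain("http://tieba.baidu.com/f?kw=python"))
print(main_domain("https://www.baidu.com"))
```

A URL saved without a trailing slash, such as the last one, has no third slash and is therefore not counted by this rule.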

The counts are then stored in a dict, merged across files and sorted. My records for roughly the last half month look like this:
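The merge-and-sort step can be sketched with collections.Counter in Python 3. The function name merge_and_sort and the two small dicts are invented for the example; they are not my real data.

```python
from collections import Counter


def merge_and_sort(per_file_counts):
    """Add up several {domain: count} dicts and sort by count, descending."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # adds the counts key by key
    return total.most_common()


# Hypothetical per-file counts.
day1 = {"https://www.baidu.com": 3, "http://www.acfun.tv": 1}
day2 = {"https://www.baidu.com": 2, "http://tieba.baidu.com": 4}
print(merge_and_sort([day1, day2]))
```

most_common() returns a list of (domain, count) tuples, the same shape as the result shown in this post.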

[(u'https://www.baidu.com', 746), (u'http://www.acfun.tv', 213), (u'http://image.baidu.com', 178), (u'http://tieba.baidu.com', 74), (u'https://pan.baidu.com', 71), (u'http://pan.baidu.com', 63), (u'http://www.gamersky.com', 61), (u'https://detail.tmall.com', 58), (u'http://product.yesky.com', 39), (u'http://zhidao.baidu.com', 38), (u'http://fanyi.baidu.com', 36), (u'http://dl.3dmgame.com', 36)

As you can see, baidu and acfun are the two sites I visited the most.

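There are two functions in total.

produce_frequ_sove_list_by_filename(filename)    takes a file name and produces the list of domains sorted by visit count; it is called by the second function.

gener_result_by_html_index(index_start, index_end)   produces the final result from the file names of the saved history html pages; this is the main function. (My files are named 8.html, 9.html, ..., 18.html.) I call it directly with the arguments 8 and 18.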

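When using the second function, make sure the html files are in the same directory as the script. Otherwise, specify the path yourself.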
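If the files live somewhere else, one way to specify the path is to build the file names with pathlib. This is only a sketch in Python 3: the helper history_files and its directory parameter are hypothetical and not part of the code in this post.

```python
from pathlib import Path


def history_files(index_start, index_end, directory="."):
    """Return the paths index_start.html .. index_end.html inside directory."""
    base = Path(directory)
    # Both ends are included, so (8, 18) covers 8.html through 18.html.
    return [base / f"{i}.html" for i in range(index_start, index_end + 1)]


for path in history_files(8, 10, "saved_history"):
    print(path)
```

Each returned path can be passed to open() in place of a bare file name.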

Code:

import operator
from collections import Counter

from bs4 import BeautifulSoup


def produce_frequ_sove_list_by_filename(filename):
    # Count visits per main domain in one saved history page (assumed to be UTF-8).
    with open(filename, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Keep only nodes that carry both an href and an id attribute.
    long_list = soup.find_all(href=True, id=True)
    labellist = {}
    for child in long_list:
        href = child["href"]
        count = 0
        for index, ch in enumerate(href):
            if ch == "/":
                count += 1
            if count == 3:
                # Everything before the third slash is the main domain.
                main_sove = href[:index]
                # The first occurrence counts as 1.
                labellist[main_sove] = labellist.get(main_sove, 0) + 1
                break
    labellist = sorted(labellist.items(), key=operator.itemgetter(1), reverse=True)
    print(labellist)
    return labellist


def gener_result_by_html_index(index_start, index_end):
    # Merge the per-file counts for index_start.html .. index_end.html.
    labeldict = Counter()
    for i in range(index_start, index_end + 1):  # inclusive, so 18.html is read too
        filename = str(i) + ".html"
        print(" operating..." + filename)
        labeldict += Counter(dict(produce_frequ_sove_list_by_filename(filename)))
    result = sorted(labeldict.items(), key=operator.itemgetter(1), reverse=True)
    print(" ")
    print("final result with sorted desc: ")
    print(result)
    return result


if __name__ == "__main__":
    gener_result_by_html_index(8, 18)

[(u'https://www.baidu.com', 746), (u'http://www.acfun.tv', 213), (u'http://image.baidu.com', 178), (u'http://www.galgames.cc', 113), (u'http://tieba.baidu.com', 74), (u'https://pan.baidu.com', 71), (u'http://pan.baidu.com', 63), (u'http://www.gamersky.com', 61), (u'https://detail.tmall.com', 58), (u'http://product.yesky.com', 39), (u'http://zhidao.baidu.com', 38), (u'http://fanyi.baidu.com', 36), (u'http://dl.3dmgame.com', 36),

0 0