bit.ly data: analysis approach

Source: Internet. Posted under: ASP.NET beginner programming examples. Editor: 程序博客网. Time: 2024/04/30 20:47

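open(path).readline() reads a single line of a file; open(path).readlines() reads all of its lines into a list.
If every line of the file is a JSON object (this is a precondition), json.loads can load the data line by line, converting each JSON string into a Python dict.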

import json
records=[json.loads(line) for line in open(path)]  # the [] makes records a list; its elements are dicts
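As a minimal sketch of the line-by-line loading described above: the helper name `load_records` and the two sample lines are assumptions made for illustration, not part of the original data. The helper also closes the file with a `with` block, which the one-line version does not do.

```python
import json

def load_records(path):
    """Read a file holding one JSON object per line; return a list of dicts."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines so json.loads never sees an empty string.
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical sample lines, only to show the str -> dict conversion.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"a": "GoogleMaps/RochesterNY"}',
]
sample_records = [json.loads(line) for line in sample_lines]

print(type(sample_records[0]).__name__)  # the element type is dict
print(sample_records[0]["tz"])           # keys are accessed like any dict
print("tz" in sample_records[1])         # a record may lack the 'tz' key
```

The last line shows why the counting code has to check `'tz' in rec` before reading the field.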

Counting the time zones:
1. Counting time zones in pure Python:

time_zones=[rec['tz'] for rec in records if 'tz' in rec]  # list of time zone fields
from collections import Counter
counts=Counter(time_zones)  # count the time zone values
counts.most_common(10)  # the ten most frequent time zones, in descending order of count
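To make clear what Counter does in the step above, here is a rough equivalent written with a plain dictionary. The function names `get_counts` and `top_counts` and the sample list are assumptions chosen for this sketch.

```python
from collections import Counter, defaultdict

def get_counts(sequence):
    """Count occurrences of each item using a dict that defaults to 0."""
    counts = defaultdict(int)
    for item in sequence:
        counts[item] += 1
    return counts

def top_counts(count_dict, n=10):
    """Return the n (item, count) pairs with the largest counts, descending."""
    pairs = sorted(count_dict.items(), key=lambda kv: kv[1], reverse=True)
    return pairs[:n]

# Hypothetical time zone values; the empty string stands for an unknown zone.
sample_zones = ["America/New_York"] * 3 + [""] * 2 + ["Europe/London"]

manual_top = top_counts(get_counts(sample_zones), n=2)
counter_top = Counter(sample_zones).most_common(2)
print(manual_top)
print(manual_top == counter_top)
```

Counter is the shorter choice in practice; the manual version only shows the counting and sorting that it hides.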

2. Counting time zones with pandas:

import pandas as pd
import numpy as np
from pandas import DataFrame,Series
frame=DataFrame(records)  # represent the data as a table; every key in the dicts becomes a column
tz_counts=frame['tz'].value_counts()  # frame['tz'] is a Series; this count is too crude because the column contains missing and empty values, so clean it first
clean_tz=frame['tz'].fillna('Missing')
clean_tz[clean_tz=='']='Unknown'  # clean_tz=='' compares every value with '' and yields a boolean Series; assigning through that mask sets the matching entries to 'Unknown'

![figure](http://img.blog.csdn.net/20170806160709857?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcXFfMzk0NjY2MTY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)

tz_counts=clean_tz.value_counts()
tz_counts[:10].plot(kind='barh',rot=0)  # horizontal bar chart of the top ten time zones
import matplotlib.pyplot as plt
plt.show()

The a (agent) field holds the user-agent string, which identifies the browser and tells whether the user is on Windows or not, for example: "a": "Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/535.11 (KHTML, like Gecko) Chrome\/17.0.963.78 Safari\/535.11"
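Going back to the cleaning step in the pandas code above, the fillna call and the boolean-mask assignment can be tried on a small Series. The five values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 'tz' column: two real values, two empty strings, one missing value.
tz = pd.Series(["America/New_York", "", np.nan, "America/New_York", ""])

clean_tz = tz.fillna("Missing")   # NaN becomes the label 'Missing'
mask = clean_tz == ""             # element-wise comparison gives a boolean Series
clean_tz[mask] = "Unknown"        # assign only where the mask is True

tz_counts = clean_tz.value_counts()
print(mask.tolist())
print(tz_counts)
```

The comparison does not return keys: it returns one True or False per row, and the assignment uses those flags to pick the rows to overwrite.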

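(1) Look at the agents (browsers):
results=Series([x.split()[0] for x in frame.a.dropna()])  # list comprehension: take the first whitespace-separated token of each agent string
results.value_counts()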
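A small pure-Python sketch of what the split in step (1) extracts. The first agent string is the example quoted earlier (with the JSON `\/` escapes decoded); the second string and the None entry are assumed sample values.

```python
from collections import Counter

agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 "
    "(KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11",
    "GoogleMaps/RochesterNY",  # hypothetical second agent
    None,                      # a record without an agent field
]

# Drop missing entries, then keep the first whitespace-separated token.
tokens = [a.split()[0] for a in agents if a is not None]
token_counts = Counter(tokens)

# The same strings also tell Windows users apart from the rest.
is_windows = ["Windows" in a for a in agents if a is not None]

print(tokens)
print(is_windows)
```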
(2) Break the time zone counts down into Windows and Not Windows users:

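cframe=frame[frame.a.notnull()]  # keep only the rows of frame where a is not null (boolean indexing, see the figure above)
operating_system=np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')  # np.where(condition,x,y) yields x where condition is true and y otherwise, in the same order as a, which is why it can be used alongside 'tz' in groupby()
by_tz_os=cframe.groupby(['tz',operating_system])  # 'tz' and operating_system line up row by row, so the data can be grouped by both
agg_counts=by_tz_os.size().unstack().fillna(0)  # size() returns a Series with a hierarchical (tz, OS) index and unstack() pivots the OS level into columns; a time zone with only Windows users has no Not Windows value, so fill it with 0
indexer=agg_counts.sum(axis=1).argsort()  # indirect index array: sum(axis=1) adds up each row, and argsort() returns the positions that sort the totals in ascending order, e.g. x=np.array([1,4,3,-1,6,9]), x.argsort() gives [3,0,2,1,4,5]; the last ten entries of indexer therefore point to the most frequent 'tz' values in agg_counts
count_subset=agg_counts.take(indexer)[-10:]
count_subset.plot(kind='barh',stacked=True)  # stacked=True draws stacked bars
normed_subset=count_subset.div(count_subset.sum(axis=1),axis=0)  # normalize so that each row sums to 1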
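The whole chain in step (2) can be followed on a toy table. The six rows below, with zones "A", "B", "C" and simplified agent strings, are invented for this sketch; plotting is left out so that it runs without matplotlib.

```python
import numpy as np
import pandas as pd

# Hypothetical data: zone A has 3 rows, B has 2, C has 1.
toy = pd.DataFrame({
    "tz": ["A", "A", "A", "B", "B", "C"],
    "a": ["X (Windows NT)", "X (Windows NT)", "Y (Macintosh)",
          "Y (Macintosh)", "Z (Linux)", "X (Windows NT)"],
})

# One label per row, in the same order as the rows of toy.
os_labels = np.where(toy["a"].str.contains("Windows"), "Windows", "Not Windows")

# Group by the 'tz' column and the label array together, then pivot the labels into columns.
agg = toy.groupby(["tz", os_labels]).size().unstack().fillna(0)

totals = agg.sum(axis=1)      # row totals per time zone
indexer = totals.argsort()    # positions that sort the totals ascending
subset = agg.take(indexer)[-2:]                    # the two busiest zones, busiest last
normed = subset.div(subset.sum(axis=1), axis=0)    # each row now sums to 1

print(agg)
print(subset)
print(normed)
```

Zone B has no Windows rows, so its Windows cell is missing after unstack() and becomes 0 after fillna(0). Dividing with axis=0 matches the row totals to the rows by index label, which is what makes every row sum to 1.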
