Python for Data Analysis: working with the usagov_bitly_data example



import json

path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]

In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 u'al': u'en-US,en;q=0.8',
 u'c': u'US',
 u'cy': u'Danvers',
 u'g': u'A6qOVH',
 u'gr': u'MA',
 u'h': u'wfLQtf',
 u'hc': 1331822918,
 u'hh': u'1.usa.gov',
 u'l': u'orofrog',
 u'll': [42.576698, -70.954903],
 u'nk': 1,
 u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 u't': 1331923247,
 u'tz': u'America/New_York',
 u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

In [19]: records[0]['tz']
Out[19]: u'America/New_York'
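The same load-and-parse pattern can be tried without the dataset itself; here is a minimal sketch using two made-up records in the same line-delimited JSON format (the field values below are illustrative, not taken from the real file):

```python
import json

# Each line of the usagov_bitly file is one JSON document; these two
# records are invented stand-ins with the same structure.
lines = [
    '{"a": "Mozilla/5.0", "tz": "America/New_York"}',
    '{"a": "GoogleMaps/RochesterNY", "tz": "America/Denver"}',
]
records_demo = [json.loads(line) for line in lines]
```

The result is a list of dicts, so `records_demo[0]['tz']` works exactly like `records[0]['tz']` in the session above.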

Counting Time Zones with pandas

In [289]: from pandas import DataFrame, Series

In [290]: import pandas as pd

In [291]: frame = DataFrame(records)

In [292]: frame
Out[292]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_     120  non-null values
a              3440  non-null values
al             3094  non-null values
c              2919  non-null values
cy             2919  non-null values
g              3440  non-null values
gr             2919  non-null values
h              3440  non-null values
hc             3440  non-null values
hh             3440  non-null values
kw               93  non-null values
l              3440  non-null values
ll             2919  non-null values
nk             3440  non-null values
r              3440  non-null values
t              3440  non-null values
tz             3440  non-null values
u              3440  non-null values
dtypes: float64(4), object(14)

In [293]: frame['tz'][:10]
Out[293]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7
8
9
Name: tz

The Series object returned by frame['tz'] has a value_counts method that gives us what we're looking for:

In [294]: tz_counts = frame['tz'].value_counts()

In [295]: tz_counts[:10]
Out[295]:
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33

You can do a bit of munging to fill in substitute values for unknown and missing time zone data in the records. The fillna method can replace missing (NA) values, while unknown values (empty strings) can be replaced using boolean array indexing:

In [296]: clean_tz = frame['tz'].fillna('Missing')

In [297]: clean_tz[clean_tz == ''] = 'Unknown'

In [298]: tz_counts = clean_tz.value_counts()

In [299]: tz_counts[:10]
Out[299]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
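The fillna-then-mask pattern works on any Series; here is a small self-contained sketch with made-up values:

```python
import pandas as pd

# Toy time zone column with one NA and two empty strings (invented values).
tz = pd.Series(['America/New_York', '', None, 'Europe/London', ''])
clean_tz = tz.fillna('Missing')        # NA values become 'Missing'
clean_tz[clean_tz == ''] = 'Unknown'   # empty strings become 'Unknown'
counts = clean_tz.value_counts()
```

After the two replacement steps, no NA or empty-string categories remain, so value_counts gives a complete breakdown.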

Making a horizontal bar plot can be accomplished using the plot method on the counts objects:

In [301]: tz_counts[:10].plot(kind='barh', rot=0)

We’ll explore more tools for working with this kind of data. For example, the a field contains information about the browser, device, or application used to perform the URL shortening:

In [302]: frame['a'][1]
Out[302]: u'GoogleMaps/RochesterNY'

In [303]: frame['a'][50]
Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

In [304]: frame['a'][51]
Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1

In [305]: results = Series([x.split()[0] for x in frame.a.dropna()])

In [306]: results[:5]
Out[306]:
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0

In [307]: results.value_counts()[:8]
Out[307]:
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
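The first-token extraction can be sketched on its own with a few invented agent strings, including a missing value to show why dropna is needed first:

```python
import pandas as pd

# Made-up agent strings mimicking frame.a, including one missing entry.
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.11',
    'GoogleMaps/RochesterNY',
    None,
])
# dropna() discards the missing entry; split()[0] keeps the first
# whitespace-separated token of each remaining agent string.
first_tokens = pd.Series([x.split()[0] for x in agents.dropna()])
```

Without the dropna call, the comprehension would fail on the None entry, since None has no split method.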

Suppose you wanted to decompose the top time zones into Windows and non-Windows users. As a simplification, let's say that a user is on Windows if the string 'Windows' is in the agent string. Since some of the agents are missing, I'll exclude these from the data:

In [308]: cframe = frame[frame.a.notnull()]

In [309]: operating_system = np.where(cframe['a'].str.contains('Windows'),
   .....:                             'Windows', 'Not Windows')

In [310]: operating_system[:5]
Out[310]:
0        Windows
1    Not Windows
2        Windows
3    Not Windows
4        Windows
Name: a

In [311]: by_tz_os = cframe.groupby(['tz', operating_system])

The group counts, analogous to the value_counts function above, can be computed using size. This result is then reshaped into a table with unstack:

In [312]: agg_counts = by_tz_os.size().unstack().fillna(0)

In [313]: agg_counts[:10]
Out[313]:
a                               Not Windows  Windows
tz
                                        245      276
Africa/Cairo                              0        3
Africa/Casablanca                         0        1
Africa/Ceuta                              0        2
Africa/Johannesburg                       0        1
Africa/Lusaka                             0        1
America/Anchorage                         4        1
America/Argentina/Buenos_Aires            1        0
America/Argentina/Cordoba                 0        1
America/Argentina/Mendoza                 0        1
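The label-then-group-then-unstack pipeline can be reproduced end to end on a tiny invented frame (two fake time zones 'A' and 'B', made-up agent strings), assuming numpy and pandas are imported:

```python
import numpy as np
import pandas as pd

# Five made-up records: two time zones, a mix of Windows and other agents.
df = pd.DataFrame({
    'tz': ['A', 'A', 'B', 'B', 'B'],
    'a': ['Windows NT 6.1', 'Macintosh', 'Windows NT 5.1',
          'Windows NT 6.1', 'Linux'],
})
os_label = np.where(df['a'].str.contains('Windows'), 'Windows', 'Not Windows')
# groupby accepts a mix of column names and equal-length arrays; size()
# counts each (tz, os) pair, and unstack() pivots the os level into columns.
agg = df.groupby(['tz', os_label]).size().unstack().fillna(0)
```

The fillna(0) matters on real data because any (tz, os) combination with no rows becomes NaN after unstacking.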

Finally, let’s select the top overall time zones. To do so, I construct an indirect index array from the row counts in agg_counts:

# Use to sort in ascending order
In [314]: indexer = agg_counts.sum(1).argsort()

In [315]: indexer[:10]
Out[315]:
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55

I then use take to select the rows in that order, then slice off the last 10 rows:

In [316]: count_subset = agg_counts.take(indexer)[-10:]

In [317]: count_subset
Out[317]:
a                    Not Windows  Windows
tz
America/Sao_Paulo             13       20
Europe/Madrid                 16       19
Pacific/Honolulu               0       36
Asia/Tokyo                     2       35
Europe/London                 43       31
America/Denver               132       59
America/Los_Angeles          130      252
America/Chicago              115      285
                             245      276
America/New_York             339      912

In [319]: count_subset.plot(kind='barh', stacked=True)
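The argsort/take idiom is easy to verify on a miniature table; this sketch uses three rows with invented counts:

```python
import pandas as pd

# Made-up counts table with three time zone rows.
agg = pd.DataFrame(
    {'Not Windows': [1, 5, 2], 'Windows': [0, 9, 2]},
    index=['Africa/Cairo', 'America/New_York', 'Asia/Tokyo'],
)
# Row totals are 1, 14, 4; argsort gives the positions that would sort
# them ascending, take reorders the rows, and the tail slice keeps the
# largest ones.
indexer = agg.sum(axis=1).argsort()
top2 = agg.take(indexer)[-2:]
```

An alternative on current pandas is sorting directly, e.g. with sort_values on the row sums, but argsort plus take matches the book's indirect-index approach.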


The plot doesn't make it easy to see the relative percentage of Windows users in the smaller groups, but the rows can easily be normalized to sum to 1, then plotted again:

normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind='barh', stacked=True)
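The row normalization itself can be checked on a two-row toy table with made-up counts:

```python
import pandas as pd

# Invented counts; div with axis=0 divides each row by its own total,
# so every row of the result sums to 1.
counts = pd.DataFrame(
    {'Not Windows': [1, 3], 'Windows': [3, 1]},
    index=['Asia/Tokyo', 'Europe/London'],
)
normed = counts.div(counts.sum(axis=1), axis=0)
```

The axis=0 argument is what aligns the row-total Series with the rows of the frame; without it, div would try to align the totals against the column labels instead.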
