A Tour of Data Visualization (Part 5)


Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents


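About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents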



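Data visualization helps in understanding data, and it also plays an important role in the feature engineering stage of a machine learning project, so it is a tool well worth mastering. This series of posts offers a brief exploration of data visualization. This post uses Python's Altair library to visualize data.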



In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

sns.set()
sns.set_context('notebook', font_scale=1.5)
cp = sns.color_palette()  # seaborn palette, reused below for Altair color scales
In [3]:
from altair import *


Thing 1: Line Chart (with many lines)


In [55]:
ts = pd.read_csv('data/ts.csv')
ts = ts.assign(dt=pd.to_datetime(ts.dt))
ts.head()
Out[55]:
          dt kind     value
0 2000-01-01    A  1.442521
1 2000-01-02    A  1.981290
2 2000-01-03    A  1.586494
3 2000-01-04    A  1.378969
4 2000-01-05    A -0.277937
In [56]:
dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()
Out[56]:
kind               A         B         C         D
dt
2000-01-01  1.442521  1.808741  0.437415  0.096980
2000-01-02  1.981290  2.277020  0.706127 -1.523108
2000-01-03  1.586494  3.474392  1.358063 -3.100735
2000-01-04  1.378969  2.906132  0.262223 -2.660599
2000-01-05 -0.277937  3.489553  0.796743 -3.417402
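The charts in this post are all built from the long-format `ts` frame (one row per date and kind), while `dfp` is the wide form of the same data. As a rough illustration of how the two shapes relate, here is a minimal sketch on a tiny made-up frame. The data and the names `long_df`, `wide_df`, and `round_trip` are hypothetical and not taken from `data/ts.csv`.

```python
import pandas as pd

# Hypothetical stand-in for data/ts.csv: long format, one row per (dt, kind).
long_df = pd.DataFrame({
    "dt": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"]),
    "kind": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide: one column per kind, indexed by date (same call as in the post).
wide_df = long_df.pivot(index="dt", columns="kind", values="value")

# Wide -> long again, i.e. back to the shape passed to Chart(...) in this post.
round_trip = (
    wide_df.reset_index()
    .melt(id_vars="dt", var_name="kind", value_name="value")
    .sort_values(["dt", "kind"])
    .reset_index(drop=True)
)
```

If a dataset arrives in wide form, a `melt` along these lines is one way to get it into the long shape before encoding a column such as `kind` as color.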
In [6]:
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind'
)
c

In [57]:
# Same chart, but colored with the seaborn palette defined earlier
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color=Color('kind', scale=Scale(range=cp.as_hex()))
)
c


Thing 2: Scatter


In [7]:
df = pd.read_csv('data/iris.csv')
df.head()
Out[7]:
   petalLength  petalWidth  sepalLength  sepalWidth  species
0          1.4         0.2          5.1         3.5   setosa
1          1.4         0.2          4.9         3.0   setosa
2          1.3         0.2          4.7         3.2   setosa
3          1.5         0.2          4.6         3.1   setosa
4          1.4         0.2          5.0         3.6   setosa
In [8]:
c = Chart(df).mark_point(filled=True).encode(
    x='petalLength',
    y='petalWidth',
    color='species'
)
c


Thing 3: Trellising the Above


In [9]:
# One panel per kind, laid out as columns
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind',
    column='kind'
)
c.configure_cell(height=200, width=200)

In [10]:
c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species')
)
c.configure_cell(height=300, width=300)

In [11]:
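# Randomly assign each row to group 'A' or 'B' in roughly equal numbers.
# Integer division (//) keeps tmp_n an int, so the list repetition also works on Python 3.
tmp_n = df.shape[0] - df.shape[0] // 2
df['random_factor'] = np.random.permutation(
    ['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n)
)
df.head()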
Out[11]:
   petalLength  petalWidth  sepalLength  sepalWidth  species  random_factor
0          1.4         0.2          5.1         3.5   setosa              B
1          1.4         0.2          4.9         3.0   setosa              A
2          1.3         0.2          4.7         3.2   setosa              A
3          1.5         0.2          4.6         3.1   setosa              B
4          1.4         0.2          5.0         3.6   setosa              B
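Because `np.random.permutation` is unseeded in the cell above, the `random_factor` column changes on every run. A minimal sketch of a reproducible variant follows. The helper name `balanced_labels` and its `seed` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def balanced_labels(n_rows, labels=("A", "B"), seed=0):
    """Return a shuffled array of two labels, split as evenly as possible."""
    rng = np.random.default_rng(seed)
    # The first label gets the extra row when n_rows is odd.
    n_first = n_rows - n_rows // 2
    pool = [labels[0]] * n_first + [labels[1]] * (n_rows - n_first)
    return rng.permutation(pool)

# Hypothetical usage: 150 labels, e.g. one per row of a 150-row frame.
factor = balanced_labels(150)
```

With a helper like this, the assignment would read `df['random_factor'] = balanced_labels(df.shape[0])`, and the trellised chart would look the same from one run to the next.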
In [12]:
# Facet by species across columns and by random_factor across rows
c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species'),
    row='random_factor'
)
c.configure_cell(height=200, width=200)


Thing 4: Visualizing Distributions (Boxplot and Histogram)


In [49]:
# please note: this code is super speculative -- I'm
# assuming there's a better way to do this and I just
# don't know it
c = Chart(df).mark_point(opacity=.5).encode(
    x='species',
    y='petalWidth'
)
c25 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='q1(petalWidth)'
)
c50 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='median(petalWidth)'
)
c75 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='q3(petalWidth)'
)
LayeredChart(data=df, layers=[c, c25, c50, c75])
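The three tick layers mark the first quartile, median, and third quartile of `petalWidth` for each species. One way to sanity-check where those ticks should fall is to compute the same summaries in pandas. The sketch below uses a small made-up frame rather than `data/iris.csv`, and the column labels `q1`, `median`, and `q3` are chosen here for readability. The values may differ slightly from the chart's, depending on the quantile interpolation rule the plotting layer applies.

```python
import pandas as pd

# Hypothetical miniature of the two iris columns used in the chart above.
toy = pd.DataFrame({
    "species": ["setosa"] * 5 + ["virginica"] * 5,
    "petalWidth": [0.1, 0.2, 0.2, 0.3, 0.4, 1.8, 1.9, 2.0, 2.3, 2.5],
})

# Per-species quartiles: one row per species, one column per quantile.
quartiles = (
    toy.groupby("species")["petalWidth"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
```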

In [50]:
# Histogram of petalWidth (at most 30 bins), colored by species
c = Chart(df).mark_bar(opacity=.75).encode(
    x=X('petalWidth', bin=Bin(maxbins=30)),
    y='count(*)',
    color=Color('species', scale=Scale(range=cp.as_hex()))
)
c


Thing 5: Bar Chart


In [51]:
df = pd.read_csv('data/titanic.csv')
df.head()
Out[51]:
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   NaN  Southampton     no   True
In [52]:
# Mean fare for each (survived, pclass) combination
dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg
Out[52]:
                      fare
survived pclass
0        1       64.684008
         2       19.412328
         3       13.669364
1        1       95.608029
         2       22.055700
         3       13.694887
In [53]:
died = dfg.loc[0, :]
survived = dfg.loc[1, :]
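The grouped frame has a two-level index (`survived`, `pclass`), so selecting on the first level returns a smaller frame indexed by `pclass`. The following sketch shows that selection and a side-by-side reshaping on made-up numbers. The toy values and the name `wide` are hypothetical and do not come from `data/titanic.csv`.

```python
import pandas as pd

# Hypothetical miniature of the three titanic columns used in the groupby.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 0, 1],
    "pclass":   [1, 3, 1, 3, 3, 1],
    "fare":     [60.0, 8.0, 90.0, 12.0, 10.0, 100.0],
})

dfg = toy.groupby(["survived", "pclass"]).agg({"fare": "mean"})

# Rows whose first index level (survived) is 0, now indexed by pclass only.
died = dfg.loc[0]

# Mean fare with pclass as rows and survived as columns, for side-by-side comparison.
wide = dfg["fare"].unstack("survived")
```

The `wide` layout puts the two outcomes next to each other per class, which is the same comparison the faceted bar chart draws with `mean(fare)`.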
In [54]:
# Mean fare by survival status, one panel per passenger class
c = Chart(df).mark_bar().encode(
    x='survived:N',
    y='mean(fare)',
    color='survived:N',
    column='class'
)
c.configure(
    facet=FacetConfig(cell=CellConfig(strokeWidth=0, height=250))
)

