A Tour of Data Visualization (Part 5)


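About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the latest Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents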

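Data visualization helps in understanding data and plays an important role in the feature engineering stage of a machine learning project, so it is a skill well worth mastering. This series of posts explores data visualization at an introductory level. This post uses the Python library Altair to visualize data.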

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.5)
cp = sns.color_palette()

In [3]:

from altair import *

Thing 1: Line Chart (with many lines)

In [55]:

ts = pd.read_csv('data/ts.csv')
ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()

Out[55]:

           dt kind     value
0  2000-01-01    A  1.442521
1  2000-01-02    A  1.981290
2  2000-01-03    A  1.586494
3  2000-01-04    A  1.378969
4  2000-01-05    A -0.277937

In [56]:

dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()

Out[56]:

kind               A         B         C         D
dt
2000-01-01  1.442521  1.808741  0.437415  0.096980
2000-01-02  1.981290  2.277020  0.706127 -1.523108
2000-01-03  1.586494  3.474392  1.358063 -3.100735
2000-01-04  1.378969  2.906132  0.262223 -2.660599
2000-01-05 -0.277937  3.489553  0.796743 -3.417402
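The pivot above turns the long table (one row per date and series) into a wide one (one column per series). Altair encodings such as `color='kind'` work on the long form, so the wide `dfp` is mainly useful for inspection. Below is a minimal sketch of the round trip between the two shapes; the four-row `long_df` and the names `wide_df` and `round_trip` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, series) pair.
long_df = pd.DataFrame({
    "dt": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"]),
    "kind": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide: dates become the index, each series becomes a column.
wide_df = long_df.pivot(index="dt", columns="kind", values="value")

# Wide -> long again: melt restores one row per (date, series) pair.
round_trip = wide_df.reset_index().melt(
    id_vars="dt", var_name="kind", value_name="value")

print(wide_df)
print(round_trip)
```

`pivot` raises an error if any (index, column) pair occurs more than once, so it only suits data with exactly one value per date and series.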

In [6]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind'
)
c

In [57]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color=Color('kind', scale=Scale(range=cp.as_hex()))
)
c

Thing 2: Scatter

In [7]:

df = pd.read_csv('data/iris.csv')
df.head()

Out[7]:

   petalLength  petalWidth  sepalLength  sepalWidth species
0          1.4         0.2          5.1         3.5  setosa
1          1.4         0.2          4.9         3.0  setosa
2          1.3         0.2          4.7         3.2  setosa
3          1.5         0.2          4.6         3.1  setosa
4          1.4         0.2          5.0         3.6  setosa

In [8]:

c = Chart(df).mark_point(filled=True).encode(
    x='petalLength',
    y='petalWidth',
    color='species'
)
c

Thing 3: Trellising the Above

In [9]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind',
    column='kind'
)
c.configure_cell(height=200, width=200)

In [10]:

c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species')
)
c.configure_cell(height=300, width=300)

In [11]:

# Integer division (//) keeps tmp_n an int, so ['A'] * tmp_n also works in Python 3
tmp_n = df.shape[0] - df.shape[0] // 2
df['random_factor'] = np.random.permutation(
    ['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n))
df.head()

Out[11]:

   petalLength  petalWidth  sepalLength  sepalWidth species random_factor
0          1.4         0.2          5.1         3.5  setosa             B
1          1.4         0.2          4.9         3.0  setosa             A
2          1.3         0.2          4.7         3.2  setosa             A
3          1.5         0.2          4.6         3.1  setosa             B
4          1.4         0.2          5.0         3.6  setosa             B
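The cell above builds a two-level factor split as evenly as possible and shuffles it across the rows, which gives the trellis a second dimension to facet on. Here is a minimal standalone sketch of the same idea; the helper name `balanced_labels` and the fixed seed are assumptions added here for reproducibility, not something the original notebook uses.

```python
import numpy as np

def balanced_labels(n_rows, seed=0):
    """Return n_rows shuffled labels, split as evenly as possible into A and B."""
    n_a = n_rows - n_rows // 2          # 'A' gets the extra row when n_rows is odd
    n_b = n_rows - n_a
    rng = np.random.default_rng(seed)   # seeded generator so reruns match
    return rng.permutation(["A"] * n_a + ["B"] * n_b)

labels = balanced_labels(5)
print(labels)
print((labels == "A").sum(), (labels == "B").sum())
```

Permuting a fixed list fixes the group sizes exactly, unlike drawing each label independently with `rng.choice`, where the sizes would vary from run to run.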

In [12]:

c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species'),
    row='random_factor'
)
c.configure_cell(height=200, width=200)

Thing 4: Visualizing Distributions (Boxplot and Histogram)

In [49]:

# please note: this code is super speculative -- I'm
# assuming there's a better way to do this and I just
# don't know it
c = Chart(df).mark_point(opacity=.5).encode(
    x='species',
    y='petalWidth'
)
c25 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='q1(petalWidth)'
)
c50 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='median(petalWidth)'
)
c75 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='r').encode(
    x='species',
    y='q3(petalWidth)'
)
LayeredChart(data=df, layers=[c, c25, c50, c75])
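The three tick layers above mark the first quartile, median, and third quartile of `petalWidth` for each species. To sanity-check where those ticks should fall, the same statistics can be computed in pandas. This is a rough sketch on invented numbers (the `sample` frame is not the iris data), and pandas' default linear interpolation may give slightly different quartiles from the chart's `q1`/`q3` aggregates.

```python
import pandas as pd

# Invented measurements for two hypothetical groups.
sample = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "petalWidth": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per species, one column per quantile level.
quartiles = (
    sample.groupby("species")["petalWidth"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

print(quartiles)
```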

In [50]:

c = Chart(df).mark_bar(opacity=.75).encode(
    x=X('petalWidth', bin=Bin(maxbins=30)),
    y='count(*)',
    color=Color('species', scale=Scale(range=cp.as_hex()))
)
c
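The histogram above bins `petalWidth` and counts rows per bin and species, with every species sharing the same bin boundaries. The sketch below reproduces that counting step with NumPy. The values and the choice of 5 equal-width bins are assumptions for illustration; `Bin(maxbins=30)` lets the charting library choose its own boundaries, which will generally differ.

```python
import numpy as np

# Invented petal widths for two hypothetical species.
widths = {
    "setosa": np.array([0.2, 0.2, 0.4, 0.3]),
    "virginica": np.array([1.8, 2.5, 2.1, 1.9]),
}

# Compute one set of bin edges over all values so the groups are comparable.
all_values = np.concatenate(list(widths.values()))
edges = np.histogram_bin_edges(all_values, bins=5)

# Count each species separately against the shared edges.
counts = {name: np.histogram(vals, bins=edges)[0]
          for name, vals in widths.items()}

print(edges)
for name, c in counts.items():
    print(name, c)
```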

Thing 5: Bar Chart

In [51]:

df = pd.read_csv('data/titanic.csv')
df.head()

Out[51]:

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

In [52]:

dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg

Out[52]:

                      fare
survived pclass
0        1       64.684008
         2       19.412328
         3       13.669364
1        1       95.608029
         2       22.055700
         3       13.694887

In [53]:

died = dfg.loc[0, :]
survived = dfg.loc[1, :]
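Grouping by two columns produces a frame with a two-level index, and selecting on the outer level splits it into one small table per survival outcome. A minimal sketch of that pattern on invented fares follows; the `passengers` rows are not taken from the Titanic file, and `.loc[0]` is used here as a shorter spelling of the same outer-level selection.

```python
import pandas as pd

# Invented passenger records, for illustration only.
passengers = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "pclass":   [1, 3, 1, 3, 3],
    "fare":     [60.0, 10.0, 90.0, 12.0, 16.0],
})

# Mean fare per (survived, pclass) pair; the result has a two-level index.
mean_fare = passengers.groupby(["survived", "pclass"]).agg({"fare": "mean"})

# Selecting on the outer level leaves a frame indexed by pclass only.
died = mean_fare.loc[0]
survived = mean_fare.loc[1]

print(mean_fare)
print(died)
print(survived)
```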

In [54]:

c = Chart(df).mark_bar().encode(
    x='survived:N',
    y='mean(fare)',
    color='survived:N',
    column='class'
)
c.configure(
    facet=FacetConfig(cell=CellConfig(strokeWidth=0, height=250))
)

1 0