A First Look at Data Processing and Visualization in Python

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 00:06

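Introduction

Experts of all kinds are welcome to point out mistakes so that we can improve together!

For data processing in Python, this article uses the plotting library matplotlib and the pygal package, on Python 3.x. (Installing Python 3.x is not covered here; you can download it from http://python.org/downloads/.)

It also uses the standard-library modules csv and json and the third-party package xlrd. Installation is described at the point where each is first needed.

With these tools we process and visualize data from CSV, JSON, and Excel files.

The article records what the author learned while studying data processing in Python.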

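1 Installing the tools and a brief introduction

1.1 Installing matplotlib (OS X)

$ pip3 install --user matplotlib

(Installing pip or pip3 is not covered here; install whichever suits your setup.)

$ pip3 install matplotlib    (either command works; run it in a terminal)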

1.2 Installing pygal

On Linux and OS X, run:

pip3 install pygal==1.7

On Windows, run:

python -m pip install --user pygal==1.7

pygal produces SVG charts that suit display on digital devices and can be resized easily.
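After running the install commands, it can help to confirm that the packages are importable before going further. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two plotting packages this article relies on.
for name in ("matplotlib", "pygal"):
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```

If a package is reported as missing, rerun the matching pip command from this section.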

2 A simple code snippet

A quick way to get started with matplotlib and pygal.

2.1 Drawing a simple scatter plot with matplotlib

import matplotlib.pyplot as plt

x_values = list(range(1, 1000))
y_values = [x**2 for x in x_values]

# Color each point by its y value using the Blues colormap.
plt.scatter(x_values, y_values, c=y_values, cmap=plt.cm.Blues,
            edgecolor='none', s=40)
plt.show()

2.2 Drawing a histogram with pygal

import pygal
from die import Die

# Create a D6.
die = Die()

# Make some rolls, and store results in a list.
results = []
for roll_num in range(1000):
    result = die.roll()
    results.append(result)

# Analyze the results.
frequencies = []
for value in range(1, die.num_sides+1):
    frequency = results.count(value)
    frequencies.append(frequency)

# Visualize the results.
hist = pygal.Bar()
hist.title = "Results of rolling one D6 1000 times."
hist.x_labels = ['1', '2', '3', '4', '5', '6']
hist.x_title = "Result"
hist.y_title = "Frequency of Result"
hist.add('D6', frequencies)
hist.render_to_file('die_visual.svg')
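The script above imports Die from a module die.py that is not shown in this article. The following is a minimal sketch of what such a class might look like, inferred only from how the script uses it (a num_sides attribute and a roll() method); the actual die.py may differ. The roll-and-count step is repeated here without pygal so the sketch runs on its own.

```python
from random import randint


class Die:
    """A die with a configurable number of sides (hypothetical stand-in for die.py)."""

    def __init__(self, num_sides=6):
        self.num_sides = num_sides

    def roll(self):
        """Return a random integer between 1 and num_sides, inclusive."""
        return randint(1, self.num_sides)


# Roll a D6 1000 times and count how often each face appears.
die = Die()
results = [die.roll() for _ in range(1000)]
frequencies = [results.count(value) for value in range(1, die.num_sides + 1)]
print(frequencies)
```

The frequencies list has one entry per face, which is the shape hist.add() expects in the script above.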

Note: to view the SVG file, simply drag it onto a web browser's address bar. (From personal experience.)

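3 Reading data from a CSV file and visualizing it

This section uses Python's csv module. It is part of the standard library, so there is nothing to install; just import it.

Let us walk through a short program:

import csv  # import the csv module

filename = 'sit_ka_weather.csv'  # store the file name in filename
with open(filename) as f:  # open the file and bind the file object to f
    reader = csv.reader(f)  # create a CSV reader
    header_row = next(reader)  # read the first row (the header)
    print(header_row)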
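To work out which column index holds which field, it can be handy to print each header together with its position. The sketch below does this on a small in-memory sample, so it runs without the data file; the column names and the data row are invented for illustration and are not taken from sit_ka_weather.csv.

```python
import csv
import io

# Hypothetical sample standing in for the first lines of a weather CSV file.
sample = io.StringIO(
    "Date,Max TemperatureF,Mean TemperatureF,Min TemperatureF\n"
    "2014-01-01,46,42,37\n"
)

reader = csv.reader(sample)
header_row = next(reader)  # the first row holds the column headers

# Print every header with its index, e.g. to find the high and low columns.
for index, column_header in enumerate(header_row):
    print(index, column_header)

first_row = next(reader)  # every value comes back as a string
```

Note that csv.reader returns each value as a string, which is why numeric fields need int() before plotting.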

Now, using the file sit_ka_weather.csv as the dataset (available for download from the author's CSDN resources), the following code draws the chart:

import csv
from datetime import datetime  # import the datetime class
from matplotlib import pyplot as plt  # plotting with matplotlib

# Get dates, high, and low temperatures from file.
filename = 'sit_ka_weather.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    dates, highs, lows, wen_chas = [], [], [], []  # create empty lists
    for row in reader:  # iterate over the rows
        try:
            current_date = datetime.strptime(row[0], "%Y-%m-%d")
            high = int(row[1])
            low = int(row[3])
            wen_cha = high - low  # daily temperature difference
        except ValueError:
            # Print the raw date field; current_date may be unset or stale here.
            print(row[0], 'missing data')
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)
            wen_chas.append(wen_cha)

# Plot data.
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.plot(dates, highs, c='red', alpha=0.5)  # daily high
plt.plot(dates, lows, c='blue', alpha=0.5)  # daily low
plt.plot(dates, wen_chas, c='black', alpha=0.5)  # temperature difference
plt.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)

# Format plot.
title = "Daily high and low temperatures"  # chart settings
plt.title(title, fontsize=20)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)
plt.show()

Running the code above produces the result shown in the image daily.png.
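The try/except/else block in the plotting script is what keeps rows with missing readings out of the chart. The sketch below isolates that step so it can be tried without the data file or matplotlib. The rows are made up for illustration, and the helper name parse_row is not part of the original script.

```python
from datetime import datetime

# Hypothetical rows in the layout the script assumes:
# date, high, (unused), low.
rows = [
    ["2014-01-01", "46", "42", "37"],
    ["2014-01-02", "", "", ""],  # a day with missing readings
    ["2014-01-03", "51", "47", "44"],
]


def parse_row(row):
    """Return (date, high, low, difference), or None if a field cannot be parsed."""
    try:
        date = datetime.strptime(row[0], "%Y-%m-%d")
        high = int(row[1])  # int("") raises ValueError
        low = int(row[3])
    except ValueError:
        return None
    return date, high, low, high - low


# Keep only the rows that parsed cleanly.
parsed = [result for result in map(parse_row, rows) if result is not None]
print(len(parsed), "usable rows out of", len(rows))
```

Returning None for a bad row is one possible design; the original script prints a message and moves on, which has the same effect on the plotted data.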

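4 Extracting data from a JSON file and visualizing it

4.1 Data source: population_data.json.

4.2 A simple code snippet:

import json  # import the json module

filename = 'population_data.json'
with open(filename) as f:
    pop_data = json.load(f)  # load the data from the JSON file

Small snippets like this show the basic idea; for the details, consult the documentation.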
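To see what json.load returns, the sketch below parses a small JSON string with json.loads instead of a file. The key names ("Country Name", "Year", "Value") match those used by the code in section 4.4; the two records themselves, including the decimal formatting of the second value, are illustrative and not copied from population_data.json.

```python
import json

# Illustrative records; the real file is assumed to hold many more of the same shape.
raw = """[
  {"Country Name": "Canada", "Year": "2010", "Value": "34126000"},
  {"Country Name": "Mexico", "Year": "2010", "Value": "113423000.0"}
]"""

pop_data = json.loads(raw)  # a list of dictionaries

for pop_dict in pop_data:
    # Values are strings and may contain a decimal point,
    # so convert to float first and then to int.
    population = int(float(pop_dict["Value"]))
    print(pop_dict["Country Name"], population)
```

Calling int() directly on a string such as "113423000.0" would raise ValueError, which is why the conversion goes through float().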

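4.3 Making a simple world map (code below)

import pygal  # import pygal

wm = pygal.maps.world.World()  # the correct way to create the world map
wm.title = 'Populations of Countries in North America'
wm.add('North America', {'ca': 34126000, 'us': 309349000, 'mx': 113423000})
wm.render_to_file('na_populations.svg')  # write the SVG file

Note: in pygal 2.x the pygal.maps.world module is provided by the separate pygal_maps_world package (pip3 install pygal_maps_world), so the pygal==1.7 pin from section 1.2 is probably too old for this code.

Result: the map is written to na_populations.svg.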

4.4 Making a world population map

Code:

import json
import pygal
from pygal.style import LightColorizedStyle as LCS, RotateStyle as RS
from country_codes import get_country_code

# Load the data into a list.
filename = 'population_data.json'
with open(filename) as f:
    pop_data = json.load(f)

# Build a dictionary of population data.
cc_populations = {}
for pop_dict in pop_data:
    if pop_dict['Year'] == '2010':
        country_name = pop_dict['Country Name']
        population = int(float(pop_dict['Value']))
        code = get_country_code(country_name)
        if code:
            cc_populations[code] = population

# Group the countries into 3 population levels.
cc_pops_1, cc_pops_2, cc_pops_3 = {}, {}, {}
for cc, pop in cc_populations.items():
    if pop < 10000000:  # grouping
        cc_pops_1[cc] = pop
    elif pop < 1000000000:
        cc_pops_2[cc] = pop
    else:
        cc_pops_3[cc] = pop

# See how many countries are in each level.
print(len(cc_pops_1), len(cc_pops_2), len(cc_pops_3))

wm_style = RS('#336699', base_style=LCS)
wm = pygal.maps.world.World(style=wm_style)  # changed: the original code had an error here
wm.title = 'World Population in 2010, by Country'
wm.add('0-10m', cc_pops_1)
wm.add('10m-1bn', cc_pops_2)
wm.add('>1bn', cc_pops_3)
wm.render_to_file('world_population.svg')
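The grouping step in the middle of this script can be tried on its own. The sketch below wraps it in a function with the same two thresholds; the function name group_by_population is made up here, the three North American figures are the ones from section 4.3, and the 'xx' entry is a made-up small country added so that the first group is not empty.

```python
def group_by_population(cc_populations, small=10_000_000, large=1_000_000_000):
    """Split {country_code: population} into three dicts by population level."""
    groups = ({}, {}, {})
    for cc, pop in cc_populations.items():
        if pop < small:
            groups[0][cc] = pop
        elif pop < large:
            groups[1][cc] = pop
        else:
            groups[2][cc] = pop
    return groups


sample = {
    "ca": 34126000,
    "us": 309349000,
    "mx": 113423000,
    "xx": 5000000,  # hypothetical entry, not a real country code
}
low, mid, high = group_by_population(sample)
print(len(low), len(mid), len(high))
```

Each of the three returned dictionaries can be passed to wm.add() as a separate series, which is what gives the map its three color bands.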

The helper module country_codes.py is as follows:

from pygal.maps.world import COUNTRIES
from pygal_maps_world import i18n  # the original code also had an error here; now corrected


def get_country_code(country_name):
    """Return the Pygal 2-digit country code for the given country."""
    for code, name in COUNTRIES.items():
        if name == country_name:
            return code
    # If the country wasn't found, return None.
    return None
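The helper scans the whole COUNTRIES dictionary on every call. One possible alternative, sketched below, is to build a reverse name-to-code dictionary once and look names up directly. To stay runnable without pygal, the sketch uses a three-entry stand-in named COUNTRIES_SAMPLE, which is hypothetical; with pygal installed, the real COUNTRIES mapping could presumably be used in its place.

```python
# Hypothetical three-entry stand-in for pygal's COUNTRIES mapping (code -> name).
COUNTRIES_SAMPLE = {"ca": "Canada", "mx": "Mexico", "us": "United States"}

# Invert the mapping once so each lookup is a single dictionary access.
NAME_TO_CODE = {name: code for code, name in COUNTRIES_SAMPLE.items()}


def get_country_code(country_name, name_to_code=NAME_TO_CODE):
    """Return the country code for country_name, or None if it is unknown."""
    return name_to_code.get(country_name)


print(get_country_code("Canada"))
print(get_country_code("Atlantis"))
```

Either version returns None for a name that does not match exactly, so countries whose names are spelled differently in the data file are silently left off the map.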

Readers can try this for themselves; the author will post the result in his personal resources.

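5 Reading data from Excel and visualizing it

5.1 Installing the modules (OS X)

$ pip3 install xlrd    (run in a terminal; used to read data from Excel)

$ pip3 install numpy    (run in a terminal)

Note: xlrd 2.0 and later read only .xls files. The .xlsx file used below needs an older release, for example pip3 install xlrd==1.2.0.

5.2 Dataset: the data for Problem A of the 2017 Shenzhen Cup modeling competition. Code source: 实验楼 (Shiyanlou).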

5.3 Worked example

Code:

import numpy as np
import matplotlib.pyplot as plt
import xlrd
from xlrd import open_workbook

x_data = []
y_data = []
x_volte = []
temp = []

wb = open_workbook('SpeedVideoDataforModeling.xlsx')  # load the dataset
for s in wb.sheets():
    for row in range(s.nrows):
        # Collect every cell of the current row.
        values = []
        for col in range(s.ncols):
            values.append(s.cell(row, col).value)
        x_data.append(values[1])
        y_data.append(values[9])

plt.scatter(x_data, y_data, s=15)
plt.title(u"2017-sz-A", size=20)
plt.legend(loc=0)

# Move the axes so they cross at the origin and hide the top and right borders.
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data', 0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data', 0))

plt.xlabel(u"chuo shi huan chong feng zhi", size=20)
plt.ylabel(u"bo fang ping ju su du", size=20)
plt.show()
print('over!')
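The nested loop above copies every row, including any header row, into the plotted lists. If the sheet does start with a header or contains empty cells (an assumption; the layout of the real workbook is not shown here), it may be worth keeping only the rows where both cells are numbers. The sketch below shows that filtering on plain Python lists, so it runs without xlrd or the data file. The function name and the sample rows are invented for illustration.

```python
def extract_numeric_columns(rows, x_col, y_col):
    """Return two lists with the x and y values of rows where both cells are numeric."""
    xs, ys = [], []
    for row in rows:
        x, y = row[x_col], row[y_col]
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            xs.append(x)
            ys.append(y)
    return xs, ys


# Hypothetical rows, shaped like the per-row `values` lists built in the script above.
rows = [
    ["id", "buffer peak", "playback speed"],  # header row: skipped
    [1, 2.5, 300.0],
    [2, "", 280.0],  # empty cell: skipped
    [3, 4.0, 310.0],
]
x_data, y_data = extract_numeric_columns(rows, 1, 2)
print(x_data, y_data)
```

In the original script the equivalent call would use columns 1 and 9, matching values[1] and values[9].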

The result is left to the reader; the author will post it in his own resources area on CSDN.

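6 Takeaways

6.1 Write Python code according to the PEP 8 style guide. Indentation matters most: inconsistent indentation raises an error and the program will not run.

6.2 Taking part in communities helps a great deal when learning to program. Recommended communities: CSDN, PythonTab, 开源中国 (OSChina), 码云 (Gitee), and others.

6.3 Do not follow books blindly. For example, the book the author studied from, 《python编程从入门到实践》 (Python Crash Course), contains errors in its chapter 16 code, though it is still a good textbook.

6.4 Choose a direction in Python with a purpose. The author came to Python in order to take part in modeling competitions and focuses on data processing.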
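On point 6.1, it may help to see which kind of formatting problem actually stops a program. The sketch below uses the built-in compile() on two small source strings: one with a missing indent, which Python rejects, and one that merely ignores PEP 8 spacing advice, which Python accepts. Both strings are made up for the demonstration.

```python
# The body of the if statement is not indented, so this is invalid Python.
badly_indented = "if True:\nprint('hello')\n"

try:
    compile(badly_indented, "<example>", "exec")
    outcome = "compiled"
except IndentationError as error:
    outcome = "IndentationError: " + error.msg

print(outcome)

# Missing spaces around operators go against PEP 8 but are still valid Python.
style_only = "x=1+2\n"
compile(style_only, "<example>", "exec")
print("style-only issue compiled without error")
```

In other words, the interpreter enforces indentation, while most other PEP 8 recommendations affect readability rather than whether the code runs.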

References: 《python编程从入门到实践》 (Python Crash Course), by Eric Matthes (USA)

实验楼 (Shiyanlou): reading data from Excel with Python and drawing polished charts.
