Getting Started with Python Data Analysis
Source: Internet  Editor: 程序博客网  Time: 2024/05/23 19:59
I recently took a training course on Python data analysis and plan to dig deeper into it. I'm still at the beginner stage, so here is a small exercise as a warm-up.
Development tool: PyCharm 2016.2
Full exercise on GitHub: https://github.com/xinluqishi/pythonTrainingPro
Dataset: https://www.kaggle.com/osmi/mental-health-in-tech-survey
This is a survey dataset on the mental health of tech workers, in CSV format. Kaggle is a good site whose datasets can be downloaded and used for practicing data analysis in Python. A sample of the data:
"Timestamp","Age","Gender","Country","state","self_employed","family_history","treatment","work_interfere","no_employees","remote_work","tech_company","benefits","care_options","wellness_program","seek_help","anonymity","leave","mental_health_consequence","phys_health_consequence","coworkers","supervisor","mental_health_interview","phys_health_interview","mental_vs_physical","obs_consequence","comments"
2014-08-27 11:29:31,37,"Female","United States","IL",NA,"No","Yes","Often","6-25","No","Yes","Yes","Not sure","No","Yes","Yes","Somewhat easy","No","No","Some of them","Yes","No","Maybe","Yes","No",NA
2014-08-27 11:29:37,44,"M","United States","IN",NA,"No","No","Rarely","More than 1000","No","No","Don't know","No","Don't know","Don't know","Don't know","Don't know","Maybe","No","No","No","No","No","Don't know","No",NA
2014-08-27 11:29:44,32,"Male","Canada",NA,NA,"No","No","Rarely","6-25","No","Yes","No","No","No","No","Don't know","Somewhat difficult","No","No","Yes","Yes","Yes","Yes","No","No",NA
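Requirement: for each country, compute the average age of respondents who have mental health issues.
The requirement is simple and clear: use Python to clean, group and aggregate the data.
I implemented it in two steps:
1. Find the age values in the records that indicate a mental health issue, and collect the matching ages by country;
2. Divide the sum of those ages by the number of people with mental health issues.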
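The two steps above can be sketched on a handful of made-up records. The records and the names `records`, `ages_by_country` and `average_age` are hypothetical and used only for illustration; they are not taken from the survey.

```python
from collections import defaultdict

# Hypothetical (country, age, has_issue) records, for illustration only.
records = [
    ('Canada', 30, 'Yes'),
    ('Canada', 40, 'Yes'),
    ('Canada', 25, 'No'),
    ('France', 26, 'Yes'),
    ('Norway', 33, 'No'),
]

# Step 1: collect the ages of matching records, grouped by country.
ages_by_country = defaultdict(list)
for country, age, has_issue in records:
    if has_issue == 'Yes':
        ages_by_country[country].append(age)

# Step 2: divide the sum of the ages by the number of matching people.
average_age = {
    country: sum(ages) / len(ages)
    for country, ages in ages_by_country.items()
}

print(average_age)
```

Countries without any matching record (Norway in this toy data) simply do not appear in `average_age`, which avoids a division by zero.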
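I also noticed some dirty data: the record for Zimbabwe has an age of 999999, so I filtered it out. The final result is rounded to two decimal places, because averaging otherwise produces values with many decimal digits. These steps belong to the up-front data cleaning and the final tidying of the results, respectively.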
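A minimal sketch of this kind of cleaning, assuming a numeric range check rather than a check on the length of the string. The helper `parse_age` and its bounds of 1 to 120 are assumptions made for illustration, and the raw values are made up apart from 999999. A range check also rejects values such as "-29", which a check on string length alone would let through.

```python
def parse_age(raw, low=1, high=120):
    """Return the age as an int, or None if it is missing or implausible.

    The bounds are assumed for illustration; they are not from the article.
    """
    text = raw.strip()
    try:
        age = int(text)
    except ValueError:
        # Non-numeric values such as 'NA' are treated as missing.
        return None
    return age if low <= age <= high else None


# Made-up raw values, including the outlier mentioned in the text.
raw_ages = ['37', ' 44', '999999', 'NA', '-29']
clean_ages = [a for a in map(parse_age, raw_ages) if a is not None]

# Round the average to two decimal places, as in the final output.
average = round(sum(clean_ages) / len(clean_ages), 2)
print(clean_ages, average)
```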
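Since I'm just starting out, I used the simplest and most direct approach in Python, such as filtering the data in a loop. Later I'll write up a simpler version that uses existing Python data-processing packages.
Here is my code; the comments record my thinking: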
# -*- coding: utf-8 -*-
"""
    Author: kevin shi
    Version: 1.0
    Date: 2017/02/18
    Project: Mental Health in Tech Survey data analysis
"""
import csv

# Path to the dataset
data_path = './survey.csv'


def run_main():
    mental_health_set = {'Yes'}  # value that marks a mental health issue
    result_dict = {}  # final results, keyed by country

    with open(data_path, 'r', newline='') as csvfile:
        # Load the data
        rows = csv.reader(csvfile)
        for i, row in enumerate(rows):
            if i == 0:
                # Skip the header row
                continue

            if i % 50 == 0:
                print('Processing row {}...'.format(i))

            age_val = row[1]  # age
            country_val = row[3]  # country
            # Column 18 (mental_health_consequence in the header) is used
            # here as the "has a mental health issue" flag
            mental_health_val = row[18]

            # sum() over a list of ages would also work; a running total
            # is used here for simplicity

            # Remove any spaces
            age_val = age_val.replace(' ', '')
            mental_health_val = mental_health_val.replace(' ', '')

            # Check whether the country has been seen before
            if country_val not in result_dict:
                # If not, initialize it: [sum of matching ages,
                # number of matching records, average age]
                result_dict[country_val] = [0, 0, 0]

            # Has a mental health issue; also filter out implausible ages,
            # e.g. Zimbabwe with age 999999 on row 392
            if mental_health_val in mental_health_set and len(age_val) <= 3:
                result_dict[country_val][0] += int(age_val)
                result_dict[country_val][1] += 1
            else:
                # Noise or non-matching record: ignore
                pass

    # Write the results to a file
    with open('mental_country1.csv', 'w', newline='', encoding='utf-16') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',')

        # Header row
        csvwriter.writerow(['Country', 'Average age with mental health issues'])

        # Statistics
        for k, v in result_dict.items():
            if v[1] == 0:
                # No matching records for this country
                v[2] = 0
            else:
                # Round so the result does not have many decimal places
                v[2] = round(v[0] / v[1], 2)
            csvwriter.writerow([k, v[2]])


if __name__ == '__main__':
    run_main()
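The code above reads columns by hard-coded position (1, 3 and 18). As a quick check, the sketch below parses the header and first data row from the sample shown earlier and builds a name-to-position mapping. The names `sample` and `col_index` are illustrative only. According to that header, position 18 corresponds to `mental_health_consequence`.

```python
import csv
import io

# Header and first data row copied from the sample in the article.
sample = (
    '"Timestamp","Age","Gender","Country","state","self_employed",'
    '"family_history","treatment","work_interfere","no_employees",'
    '"remote_work","tech_company","benefits","care_options",'
    '"wellness_program","seek_help","anonymity","leave",'
    '"mental_health_consequence","phys_health_consequence","coworkers",'
    '"supervisor","mental_health_interview","phys_health_interview",'
    '"mental_vs_physical","obs_consequence","comments"\n'
    '2014-08-27 11:29:31,37,"Female","United States","IL",NA,"No","Yes",'
    '"Often","6-25","No","Yes","Yes","Not sure","No","Yes","Yes",'
    '"Somewhat easy","No","No","Some of them","Yes","No","Maybe","Yes",'
    '"No",NA\n'
)

rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

# Map each column name to its position so indexes need not be hard-coded.
col_index = {name: i for i, name in enumerate(header)}

print(col_index['Age'], col_index['Country'],
      col_index['mental_health_consequence'])
print(first_row[col_index['Age']], first_row[col_index['Country']])
```

Looking columns up by name in this way (or using `csv.DictReader`) keeps the code readable if the column order ever changes.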
Here is the output of the processing script (mental_country1.csv):
Country,Average age with mental health issues
Norway,0
Moldova,0
Finland,27.0
Australia,31.5
Romania,0
Hungary,27.0
Austria,0
Costa Rica,0
Canada,29.88
Brazil,0
India,24.0
Philippines,31.0
Slovenia,19.0
Belgium,30.0
Croatia,43.0
South Africa,61.0
Poland,0
Colombia,26.0
Ireland,35.27
Russia,28.0
Spain,30.0
Latvia,0
Uruguay,0
Netherlands,33.0
Israel,27.0
Czech Republic,0
Italy,37.0
Zimbabwe,0
Denmark,0
Greece,36.5
Singapore,39.0
France,26.0
Sweden,0
United States,33.38
Mexico,0
United Kingdom,31.57
Bulgaria,26.0
Georgia,20.0
Germany,32.0
Thailand,0
New Zealand,36.75
Nigeria,0
Switzerland,30.0
Bosnia and Herzegovina,0
China,0
Portugal,27.0
"Bahamas, The",8.0
Japan,49.0
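In these results, a value of 0 comes from the code's handling of countries with no matching records; it does not mean the average age is zero. One possible way to make the output easier to read is to drop those countries and sort the rest. The sketch below uses four rows copied from the results; `results` and `ranked` are illustrative names.

```python
# A few (country, average) pairs copied from the results above.
results = [
    ('Norway', 0),
    ('Canada', 29.88),
    ('United States', 33.38),
    ('Japan', 49.0),
]

# Keep only countries that had at least one matching record,
# then sort by average age, highest first.
ranked = sorted(
    ((country, avg) for country, avg in results if avg != 0),
    key=lambda item: item[1],
    reverse=True,
)

for country, avg in ranked:
    print('{}: {:.2f}'.format(country, avg))
```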
That's all. If you see anything that could be improved, I'd welcome your suggestions.