Getting Started with Python Data Analysis

Source: Internet  Editor: 程序博客网  Time: 2024/05/23 19:59

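I recently took a training course on Python data analysis and plan to study the topic in more depth. I am still at the beginner stage, so here is a small exercise to warm up.
Development tool: PyCharm 2016.2
GitHub repository for the complete exercise: https://github.com/xinluqishi/pythonTrainingPro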

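Data analyzed in this project: https://www.kaggle.com/osmi/mental-health-in-tech-survey. This is a survey dataset on the mental health of people working in tech, in CSV format. Kaggle is a good site whose datasets are well suited to practicing data analysis in Python, and anyone can download them. An excerpt: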

"Timestamp","Age","Gender","Country","state","self_employed","family_history","treatment","work_interfere","no_employees","remote_work","tech_company","benefits","care_options","wellness_program","seek_help","anonymity","leave","mental_health_consequence","phys_health_consequence","coworkers","supervisor","mental_health_interview","phys_health_interview","mental_vs_physical","obs_consequence","comments"2014-08-27 11:29:31,37,"Female","United States","IL",NA,"No","Yes","Often","6-25","No","Yes","Yes","Not sure","No","Yes","Yes","Somewhat easy","No","No","Some of them","Yes","No","Maybe","Yes","No",NA2014-08-27 11:29:37,44,"M","United States","IN",NA,"No","No","Rarely","More than 1000","No","No","Don't know","No","Don't know","Don't know","Don't know","Don't know","Maybe","No","No","No","No","No","Don't know","No",NA2014-08-27 11:29:44,32,"Male","Canada",NA,NA,"No","No","Rarely","6-25","No","Yes","No","No","No","No","Don't know","Somewhat difficult","No","No","Yes","Yes","Yes","Yes","No","No",NA

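Requirement: compute, for each country, the average age of respondents who have a mental health issue.
The requirement is simple and clear. The goal is to use Python to clean, group, and compute over the data.
I implemented it in two steps:
1. Find the age values in the records that report a mental health issue, and collect the matching ages per country;
2. Sum those ages and divide by the number of people with a mental health issue.
In addition, I noticed dirty data: the age recorded for Zimbabwe is 999999, so I filtered that record out. The final result is also rounded to two decimal places, because averaging produces values with many decimal digits. These steps belong to the up-front data cleaning and the final tidying of the results.
Since I am a beginner, I used the simplest and most direct implementation in Python, such as looping over the rows to filter the data. Later I will write up simpler solutions that use existing Python data processing packages.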
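
The two steps and the cleaning rule can be sketched on a handful of made-up records as follows. This is only an illustration, not the code from the exercise: the function name `average_age_by_country`, the `min_age`/`max_age` bounds, and the demo records are all assumptions, and the dirty-data check here uses a numeric range rather than the string-length check in my script.

```python
from collections import defaultdict


def average_age_by_country(records, min_age=0, max_age=120):
    """Average the ages of flagged respondents per country.

    records: iterable of (country, age_text, flag) tuples, all strings.
    min_age and max_age are assumed plausibility bounds for this sketch.
    """
    ages = defaultdict(list)
    for country, age_text, flag in records:
        if flag.strip() != "Yes":
            continue  # keep only records flagged "Yes"
        try:
            age = int(age_text.strip())
        except ValueError:
            continue  # skip non-numeric ages
        if min_age <= age <= max_age:
            # Step 1: collect the valid ages for each country
            ages[country].append(age)
    # Step 2: sum divided by count, rounded to two decimal places
    return {c: round(sum(v) / len(v), 2) for c, v in ages.items()}


# Made-up records, for illustration only
demo = [
    ("Canada", "32", "Yes"),
    ("Canada", "29", "Yes"),
    ("Canada", "41", "No"),
    ("Zimbabwe", "999999", "Yes"),
]
print(average_age_by_country(demo))
```

Note one difference in behavior: a country with no valid matching record is simply absent from the returned dict here, whereas my script writes 0 for it.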

Here is my code; the comments in it record my thought process:

# -*- coding: utf-8 -*-

"""
    Author:   kevin shi
    Version:  1.0
    Date:     2017/02/18
    Project:  Mental Health in Tech Survey data analysis
"""

import csv

# Path to the dataset
data_path = './survey.csv'


def run_main():
    mental_health_set = {'Yes'}  # values that count as having a mental health issue
    result_dict = {}  # final results, keyed by country

    with open(data_path, 'r', newline='') as csvfile:
        # Load the data
        rows = csv.reader(csvfile)
        for i, row in enumerate(rows):
            if i == 0:
                # Skip the header row
                continue

            if i % 50 == 0:
                print('Processing row {}...'.format(i))

            age_val = row[1]  # age
            country_val = row[3]  # country
            # Mental health flag; index 18 is the mental_health_consequence column in the header
            mental_health_val = row[18]

            # sum([1, 2, 3]) could total a collected list; a running sum is used here instead

            # Strip any spaces
            age_val = age_val.replace(' ', '')
            mental_health_val = mental_health_val.replace(' ', '')

            # Check whether the country has been seen before
            if country_val not in result_dict:
                # If not, initialize its entry
                # result_dict[country_val] = []  # would hold every matching age
                # [sum of matching ages, number of matching records, average age]
                result_dict[country_val] = [0, 0, 0]

            # Has a mental health issue; filter out implausible data,
            # e.g. Zimbabwe with age 999999 on row 392
            if mental_health_val in mental_health_set and (len(age_val) <= 3):
                # Alternative: collect every matching age in a list
                # result_dict[country_val].append(age_val)
                # Index 0 accumulates the age sum, index 1 counts the records
                result_dict[country_val][0] += int(age_val)
                result_dict[country_val][1] += 1
            else:
                # Noise data, ignore it
                pass

    # Write the results to a file
    with open('mental_country1.csv', 'w', newline='', encoding='utf-16') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',')

        # Write the header row (country, average age of those with a mental health issue)
        # csvwriter.writerow(['国家', '存在心理问题的年龄列表'])
        csvwriter.writerow(['国家', '存在心理问题的平均年龄'])

        # Write the statistics
        for k, v in list(result_dict.items()):
            # if len(result_dict[k]) == 0:
            #     csvwriter.writerow([k, 0])
            # else:
            #     csvwriter.writerow([k, v])
            # csvwriter.writerow([k, v])

            # Countries whose age sum is 0 have no matching records
            if int(v[0]) == 0:
                v[2] = 0
            else:
                v[2] = round(int(v[0]) / int(v[1]), 2)  # avoid long runs of decimal digits
            csvwriter.writerow([k, v[2]])


if __name__ == '__main__':
    run_main()

Here is the processed result:

国家,存在心理问题的平均年龄
Norway,0
Moldova,0
Finland,27.0
Australia,31.5
Romania,0
Hungary,27.0
Austria,0
Costa Rica,0
Canada,29.88
Brazil,0
India,24.0
Philippines,31.0
Slovenia,19.0
Belgium,30.0
Croatia,43.0
South Africa,61.0
Poland,0
Colombia,26.0
Ireland,35.27
Russia,28.0
Spain,30.0
Latvia,0
Uruguay,0
Netherlands,33.0
Israel,27.0
Czech Republic,0
Italy,37.0
Zimbabwe,0
Denmark,0
Greece,36.5
Singapore,39.0
France,26.0
Sweden,0
United States,33.38
Mexico,0
United Kingdom,31.57
Bulgaria,26.0
Georgia,20.0
Germany,32.0
Thailand,0
New Zealand,36.75
Nigeria,0
Switzerland,30.0
Bosnia and Herzegovina,0
China,0
Portugal,27.0
"Bahamas, The",8.0
Japan,49.0
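
The script writes this file with encoding='utf-16', so it has to be opened with the same encoding to be read back. The sketch below round-trips three of the rows above through a temporary file; the helper name `write_and_read_back` is made up for this illustration, and it uses a temporary file rather than mental_country1.csv so that it can run on its own.

```python
import csv
import os
import tempfile


def write_and_read_back(rows, encoding="utf-16"):
    """Write rows to a temporary CSV file, then read them back with the same encoding."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="", encoding=encoding) as f:
            csv.writer(f).writerows(rows)
        with open(path, "r", newline="", encoding=encoding) as f:
            return list(csv.reader(f))
    finally:
        os.remove(path)


# Three rows copied from the result listing; values are kept as strings
sample_rows = [
    ["国家", "存在心理问题的平均年龄"],
    ["Canada", "29.88"],
    ["Bahamas, The", "8.0"],
]
round_trip = write_and_read_back(sample_rows)
print(round_trip)
```

The `csv` writer quotes "Bahamas, The" automatically because the value contains a comma, which is why that row appears quoted in the listing.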

That's all. If anything here can be improved, I would appreciate your suggestions.
