Pandas: value_counts, index, and to_dict

Source: Internet. Editor: 程序博客网. Time: 2024/04/26 19:47

The data in this article describes college majors and employment outcomes. There are two CSV files: all-ages.csv and recent-grads.csv.

  • The main columns are:

Rank - The numerical rank of the major by post-graduation median earnings.
Major_code - The numerical code of the major.
Major - The description of the major.
Major_category - The category of the major.
Total - The total number of people who studied the major.
Men - The number of men who studied the major.
Women - The number of women who studied the major.
ShareWomen - The share of women (from 0 to 1) who studied the major.
Employed - The number of people who studied the major and were employed post-graduation.

  • recent-grads.csv
  • all-ages.csv has a layout similar to recent-grads.csv; only the values in some columns differ.

Summarizing Major Categories

For both datasets, compute the number of students enrolled in each major category (each major category contains several majors).

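  • Series.value_counts returns how many times each unique value occurs in the Series ("Returns object containing counts of unique values."). The result is itself a Series. Here each row is one major, so the counts are the number of majors per category, not the number of students.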
print(all_ages['Major_category'].value_counts())
'''
Engineering                            29
Education                              16
Humanities & Liberal Arts              15
Biology & Life Science                 14
Business                               13
Health                                 12
Computers & Mathematics                11
Physical Sciences                      10
Agriculture & Natural Resources        10
Psychology & Social Work                9
Social Science                          9
Arts                                    8
Industrial Arts & Consumer Services     7
Law & Public Policy                     5
Communications & Journalism             4
Interdisciplinary                       1
Name: Major_category, dtype: int64
'''
  • Then take the .index of that Series, which is an Index object holding the category names:
print(all_ages['Major_category'].value_counts().index)
'''
Index([u'Engineering', u'Education', u'Humanities & Liberal Arts',
       u'Biology & Life Science', u'Business', u'Health',
       u'Computers & Mathematics', u'Physical Sciences',
       u'Agriculture & Natural Resources', u'Psychology & Social Work',
       u'Social Science', u'Arts', u'Industrial Arts & Consumer Services',
       u'Law & Public Policy', u'Communications & Journalism',
       u'Interdisciplinary'],
      dtype='object')
'''
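To make the two calls above easier to try without the CSV files, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are hypothetical stand-ins for the Major_category column; only `value_counts()` and `.index` are taken from the text.

```python
import pandas as pd

# Hypothetical toy data standing in for the Major_category column.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering",
                       "Health", "Engineering", "Arts"],
})

# value_counts() returns a Series: unique label -> number of rows with that label,
# sorted from the most frequent label to the least frequent one.
counts = toy["Major_category"].value_counts()

# .index of that Series is an Index object holding the unique labels.
labels = counts.index

print(counts.to_dict())
print(list(labels))
```

Because the counts are row counts, they always add up to the number of rows in the frame, which is a quick sanity check on real data as well.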
  • The code below therefore computes the number of students enrolled in each major category. The per-major headcount is stored in the Total column.
all_ages_major_categories = dict()
recent_grads_major_categories = dict()

def calculate_major_cat_totals(df):
    # cats holds the distinct Major_category values
    cats = df['Major_category'].value_counts().index
    counts_dictionary = dict()
    for c in cats:
        major_df = df[df["Major_category"] == c]  # rows whose category is c
        total = major_df["Total"].sum(axis=0)     # sum of Total over those rows
        counts_dictionary[c] = total
    return counts_dictionary

all_ages_major_categories = calculate_major_cat_totals(all_ages)
recent_grads_major_categories = calculate_major_cat_totals(recent_grads)
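  • Based on what I learned earlier, I thought of a simpler approach that gives exactly the same result as the loop: aggregate with pivot_table, then use to_dict() to convert the resulting Series into a dict.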
# -*- coding: utf-8 -*-
import pandas as pd

all_ages = pd.read_csv("all-ages.csv")
recent_grads = pd.read_csv("recent-grads.csv")

# In current pandas, pivot_table returns a DataFrame, so select the Total
# column to get a Series before calling to_dict(). The string "sum" is used
# as the aggregation function.
all_ages_major_categories = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
recent_grads_major_categories = recent_grads.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
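To check that the loop and the one-line aggregation really agree, the sketch below runs both on a small made-up frame. The `toy` data and the helper name `totals_by_loop` are hypothetical. `groupby` is added here as a third route for comparison and is not something the original article uses.

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from all-ages.csv.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering", "Health"],
    "Total": [100, 40, 60, 25],
})

def totals_by_loop(df):
    """Explicit loop: filter the rows of each category and sum Total."""
    result = {}
    for cat in df["Major_category"].value_counts().index:
        result[cat] = df.loc[df["Major_category"] == cat, "Total"].sum()
    return result

# groupby performs the same split-and-sum in one expression and yields a Series.
totals_by_groupby = toy.groupby("Major_category")["Total"].sum().to_dict()

# pivot_table yields a DataFrame here, so the Total column is selected first.
pivot = toy.pivot_table(index="Major_category", values="Total", aggfunc="sum")
totals_by_pivot = pivot["Total"].to_dict()

# All three routes should produce the same category -> total mapping.
print(totals_by_loop(toy) == totals_by_groupby == totals_by_pivot)
```

Calling `to_dict()` directly on the DataFrame returned by `pivot_table` would give a nested dict keyed by the column name, which is why the sketch selects `"Total"` before converting.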

Low Wage Jobs Rates

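  • Next, we look at how many graduates could not find a well-paid job and ended up in low-wage work.
  • "Low_wage_jobs": the number of people working in low-wage jobs
  • "Total": the number of people in each major
  • The share of graduates in low-wage jobs (a fraction between 0 and 1) is therefore: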
low_wage_percent = 0.0
low_wage_percent = (recent_grads['Low_wage_jobs'].sum(axis=0)) / (recent_grads['Total'].sum(axis=0))
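The formula above is a ratio of two column sums, so large majors weigh more than small ones. The sketch below, on made-up numbers, contrasts it with the unweighted mean of per-major shares and shows what happens when a Total value is missing. The `toy` frame, including the missing value in row C, is a hypothetical assumption for illustration, not a statement about the real CSV.

```python
import pandas as pd

toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [1000, 100, None],      # hypothetical: major C has no Total
    "Low_wage_jobs": [50, 20, 5],
})

# Ratio of sums, as in the text. sum() skips NaN by default, so the
# numerator includes major C while the denominator does not.
overall = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Same ratio restricted to rows where both columns are present.
complete = toy.dropna(subset=["Total", "Low_wage_jobs"])
overall_complete = complete["Low_wage_jobs"].sum() / complete["Total"].sum()

# Unweighted mean of per-major shares: every major counts equally.
mean_of_shares = (complete["Low_wage_jobs"] / complete["Total"]).mean()

print(overall, overall_complete, mean_of_shares)
```

Which of the three numbers is appropriate depends on the question: the ratio of sums answers "what fraction of all graduates is in low-wage work", while the mean of shares answers "what does a typical major look like".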

Comparing Datasets

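We now have two datasets, all_ages (the overall historical data) and recent_grads (the most recent years). Both have 173 rows, so they can be compared major by major.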

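  • Comparing the unemployment rate of each major gives 43:128. Recent graduates have the lower unemployment rate in only 43 majors, while the all-ages data has the lower rate in 128 majors, so in most majors recent graduates face higher unemployment.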
# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index

recent_grads_lower_emp_count = 0
all_ages_lower_emp_count = 0
for m in majors:
    recent_grads_row = recent_grads[recent_grads['Major'] == m]
    all_ages_row = all_ages[all_ages['Major'] == m]

    recent_grads_unemp_rate = recent_grads_row['Unemployment_rate'].values[0]
    all_ages_unemp_rate = all_ages_row['Unemployment_rate'].values[0]

    if recent_grads_unemp_rate < all_ages_unemp_rate:
        recent_grads_lower_emp_count += 1
    elif all_ages_unemp_rate < recent_grads_unemp_rate:
        all_ages_lower_emp_count += 1

print(recent_grads_lower_emp_count)
print(all_ages_lower_emp_count)
'''
43
128
'''
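The per-major loop can also be written as a join followed by two vectorized comparisons. This is a sketch of an alternative, not the original author's method. The two small frames and the names `recent`, `older`, and `merged` are hypothetical, and the sketch assumes each major appears exactly once in each frame.

```python
import pandas as pd

# Hypothetical toy frames; note the rows are in a different order.
recent = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.05, 0.09, 0.04, 0.07],
})
older = pd.DataFrame({
    "Major": ["D", "C", "B", "A"],
    "Unemployment_rate": [0.07, 0.06, 0.05, 0.06],
})

# Align the two frames on Major; suffixes keep the two rate columns apart.
merged = recent.merge(older, on="Major", suffixes=("_recent", "_all"))

# Each comparison yields a boolean Series; summing it counts the True values.
recent_lower = int((merged["Unemployment_rate_recent"]
                    < merged["Unemployment_rate_all"]).sum())
all_lower = int((merged["Unemployment_rate_all"]
                 < merged["Unemployment_rate_recent"]).sum())

# Majors where neither side is strictly lower.
ties = len(merged) - recent_lower - all_lower

print(recent_lower, all_lower, ties)
```

Counting ties explicitly also explains the figures in the text: 43 + 128 = 171, so under the same one-row-per-major assumption the remaining 2 of the 173 majors fall into neither count.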