Pandas: value_counts, index, and to_dict


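The data used in this article describes college majors and employment outcomes. It comes in two CSV files, all-ages.csv and recent-grads.csv.

The main columns are: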

Rank - The numerical rank of the major by post-graduation median earnings.
Major_code - The numerical code of the major.
Major - The description of the major.
Major_category - The category of the major.
Total - The total number of people who studied the major.
Men - The number of men who studied the major.
Women - The number of women who studied the major.
ShareWomen - The share of women (from 0 to 1) who studied the major.
Employed - The number of people who studied the major and were employed post-graduation.

The columns above describe recent-grads.csv. all-ages.csv is similar; only the values in some columns differ.

Summarizing Major Categories

Compute, for both datasets, the number of students enrolled in each major category (each major category contains several majors).

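Series.value_counts counts how many times each unique value occurs in the Series ("Returns object containing counts of unique values."). The result is itself a Series, indexed by the unique values and sorted by count in descending order.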

print(all_ages['Major_category'].value_counts())
'''
Engineering                            29
Education                              16
Humanities & Liberal Arts              15
Biology & Life Science                 14
Business                               13
Health                                 12
Computers & Mathematics                11
Physical Sciences                      10
Agriculture & Natural Resources        10
Psychology & Social Work                9
Social Science                          9
Arts                                    8
Industrial Arts & Consumer Services     7
Law & Public Policy                     5
Communications & Journalism             4
Interdisciplinary                       1
Name: Major_category, dtype: int64
'''

Taking the .index of that result gives the category labels as an Index object.

print(all_ages['Major_category'].value_counts().index)
'''
Index([u'Engineering', u'Education', u'Humanities & Liberal Arts',
       u'Biology & Life Science', u'Business', u'Health',
       u'Computers & Mathematics', u'Physical Sciences',
       u'Agriculture & Natural Resources', u'Psychology & Social Work',
       u'Social Science', u'Arts', u'Industrial Arts & Consumer Services',
       u'Law & Public Policy', u'Communications & Journalism',
       u'Interdisciplinary'],
      dtype='object')
'''
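To see the two calls in isolation, here is a minimal sketch on a small hand-made Series. The data is hypothetical and not taken from all-ages.csv; only the column name Major_category is borrowed from the article.

```python
import pandas as pd

# Hypothetical toy data, not read from all-ages.csv.
categories = pd.Series(
    ["Engineering", "Arts", "Engineering", "Health", "Engineering", "Arts"],
    name="Major_category",
)

# value_counts maps each unique label to the number of rows carrying it,
# sorted from most to least frequent.
counts = categories.value_counts()

# .index holds the unique labels, in the same (descending-count) order.
labels = counts.index

print(counts["Engineering"])  # 3
print(list(labels))           # ['Engineering', 'Arts', 'Health']
```

Note that value_counts counts rows per label. It does not add up any other column, which is why the totals still have to be computed separately.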

The code below therefore computes the number of students enrolled in each major category. The headcount is stored in the Total column.

all_ages_major_categories = dict()
recent_grads_major_categories = dict()

def calculate_major_cat_totals(df):
    # cats holds the distinct Major_category values
    cats = df['Major_category'].value_counts().index
    counts_dictionary = dict()
    for c in cats:
        major_df = df[df["Major_category"] == c]  # rows whose category is c
        total = major_df["Total"].sum(axis=0)  # sum of the Total column
        counts_dictionary[c] = total
    return counts_dictionary

all_ages_major_categories = calculate_major_cat_totals(all_ages)
recent_grads_major_categories = calculate_major_cat_totals(recent_grads)

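Building on what I learned earlier, I thought of a simpler approach that gives the same result. pivot_table returns a DataFrame, so select its Total column to get a Series and then convert that Series to a dict with to_dict(). Calling to_dict() on the DataFrame directly would produce a nested dict keyed by column name instead.

# -*- coding: utf-8 -*-
import pandas as pd

all_ages = pd.read_csv("all-ages.csv")
recent_grads = pd.read_csv("recent-grads.csv")

# Sum Total per Major_category, then take the "Total" column so that
# to_dict() yields {category: total}.
all_ages_major_categories = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()
recent_grads_major_categories = recent_grads.pivot_table(index="Major_category", values="Total", aggfunc="sum")["Total"].to_dict()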
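The same per-category totals can also be obtained with groupby. The sketch below is an illustration only: it uses a hypothetical toy DataFrame rather than the CSV files, and the helper names totals_by_loop and totals_by_groupby are made up here. It checks that an explicit loop and the groupby one-liner agree.

```python
import pandas as pd

# Hypothetical toy frame standing in for all_ages / recent_grads.
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering", "Arts", "Health"],
    "Total": [100, 40, 250, 10, 75],
})

def totals_by_loop(df):
    """Sum Total per category with an explicit loop over the labels."""
    totals = {}
    for cat in df["Major_category"].unique():
        totals[cat] = df.loc[df["Major_category"] == cat, "Total"].sum()
    return totals

def totals_by_groupby(df):
    """Sum Total per category with groupby, then convert the Series to a dict."""
    return df.groupby("Major_category")["Total"].sum().to_dict()

loop_totals = totals_by_loop(toy)
groupby_totals = totals_by_groupby(toy)

print(loop_totals == groupby_totals)  # True
print(groupby_totals["Engineering"])  # 350
```

Selecting the single column before calling sum() keeps the result a Series, so to_dict() maps each category directly to its total.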

Low Wage Jobs Rates

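The next question is how many graduates fail to find a well-paid job after college and end up in low-wage work.
"Low_wage_jobs": the number of people working in low-wage jobs
"Total": the number of people in each major
The share of graduates working in low-wage jobs is therefore: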

low_wage_percent = 0.0
low_wage_percent = (recent_grads['Low_wage_jobs'].sum(axis=0))/(recent_grads['Total'].sum(axis=0))
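The ratio above divides one column sum by another, so large majors weigh more than small ones. A minimal sketch on hypothetical numbers (not taken from recent-grads.csv) shows how this overall share differs from the plain average of the per-major shares:

```python
import pandas as pd

# Hypothetical toy data: one large major and one small major.
toy = pd.DataFrame({
    "Major": ["A", "B"],
    "Total": [1000, 100],
    "Low_wage_jobs": [100, 50],
})

# Overall share: all low-wage graduates divided by all graduates.
overall_share = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Unweighted alternative: average of each major's own share.
mean_of_shares = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(round(overall_share, 4))   # 0.1364
print(round(mean_of_shares, 4))  # 0.3
```

The result is a fraction between 0 and 1 despite the variable name low_wage_percent; multiply by 100 if a percentage is wanted.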

Comparing Datasets

We now have two datasets, all_ages (the full historical data) and recent_grads (the most recent years). Both have 173 rows, so they can be compared major by major.

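Comparing the unemployment rate of each major gives 43:128. Recent graduates have the lower unemployment rate in only 43 majors, while the all-ages data has the lower rate in 128, so recent graduates are generally worse off in terms of employment.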

# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index

recent_grads_lower_emp_count = 0
all_ages_lower_emp_count = 0
for m in majors:
    recent_grads_row = recent_grads[recent_grads['Major'] == m]
    all_ages_row = all_ages[all_ages['Major'] == m]

    recent_grads_unemp_rate = recent_grads_row['Unemployment_rate'].values[0]
    all_ages_unemp_rate = all_ages_row['Unemployment_rate'].values[0]

    if recent_grads_unemp_rate < all_ages_unemp_rate:
        recent_grads_lower_emp_count += 1
    elif all_ages_unemp_rate < recent_grads_unemp_rate:
        all_ages_lower_emp_count += 1

print(recent_grads_lower_emp_count)
print(all_ages_lower_emp_count)
'''
43
128
'''
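The per-major loop can also be written as a join on the Major column followed by two vectorized comparisons. This is a sketch of that alternative on hypothetical toy frames; the names recent_toy, all_toy, and the _recent/_all suffixes are chosen here for illustration and do not come from the article.

```python
import pandas as pd

# Hypothetical toy frames standing in for recent_grads and all_ages.
recent_toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.05, 0.08, 0.10, 0.04],
})
all_toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.06, 0.05, 0.07, 0.04],
})

# Line up the two rates for each major in a single row.
merged = recent_toy.merge(all_toy, on="Major", suffixes=("_recent", "_all"))

# Count majors where each dataset has the strictly lower unemployment rate.
recent_lower = int((merged["Unemployment_rate_recent"] < merged["Unemployment_rate_all"]).sum())
all_lower = int((merged["Unemployment_rate_all"] < merged["Unemployment_rate_recent"]).sum())

print(recent_lower, all_lower)  # 1 2
```

As in the loop version, a major whose two rates are equal (major D here) is counted on neither side, so the two counts need not add up to the number of majors.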

0 0