Normalizing & Sorting DataFrame Columns

Source: Internet  Editor: 程序博客网  Time: 2024/05/29 18:59

Dataset

The goal of this exercise is to score foods so that high-protein, low-fat foods rate highest. The formula is:
Score=2×(Protein_(g))−0.75×(Lipid_Tot_(g))

  • Food nutrition table

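Data

food_info is a DataFrame object, and food_info.columns returns the DataFrame's column labels as an Index object; calling tolist() on it gives a plain Python list.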

# Read in the data
import pandas as pd
food_info = pd.read_csv("food_info.csv")
cols = food_info.columns.tolist()
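To see what the column-label object looks like without the real file, the sketch below reads a tiny in-memory CSV. The three-column layout and the values in `csv_text` are hypothetical stand-ins, not rows from food_info.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for food_info.csv (values are made up).
csv_text = (
    "NDB_No,Protein_(g),Lipid_Tot_(g)\n"
    "1001,0.85,81.11\n"
    "1002,24.9,33.14\n"
)
food = pd.read_csv(io.StringIO(csv_text))

# .columns is a pandas Index, not a list.
labels = food.columns
print(type(labels).__name__)

# tolist() converts the labels into an ordinary Python list.
cols = labels.tolist()
print(cols)
```

A plain list is convenient when you only want to scan or slice the column names, for example `cols[:2]`.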

Transforming A Column

  • pandas can apply arithmetic operations to numeric columns; each operation is applied to every value in the column
div_1000 = food_info["Iron_(mg)"] / 1000
add_100 = food_info["Iron_(mg)"] + 100
sub_100 = food_info["Iron_(mg)"] - 100
mult_2 = food_info["Iron_(mg)"] * 2
sodium_grams = food_info["Sodium_(mg)"] / 1000
sugar_milligrams = food_info["Sugar_Tot_(g)"] * 1000
  • Besides arithmetic with a constant to modify column values, you can also perform arithmetic between columns
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
grams_of_protein_per_gram_of_water = food_info["Protein_(g)"] / food_info["Water_(g)"]
milligrams_of_calcium_and_iron = food_info["Calcium_(mg)"] + food_info["Iron_(mg)"]
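Both kinds of arithmetic can be checked by hand on a small frame. This is a minimal sketch; the frame `toy` and its values are hypothetical and only reuse the column names from the text.

```python
import pandas as pd

# Hypothetical values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Iron_(mg)": [0.5, 2.0, 8.0],
    "Sodium_(mg)": [500.0, 1000.0, 250.0],
    "Calcium_(mg)": [24.0, 10.0, 100.0],
})

# Scalar arithmetic is applied to every element of the column.
sodium_grams = toy["Sodium_(mg)"] / 1000

# Column-to-column arithmetic pairs up values that share the same row label.
calcium_and_iron = toy["Calcium_(mg)"] + toy["Iron_(mg)"]

print(sodium_grams.tolist())
print(calcium_and_iron.tolist())
```

In both cases the result is a new Series with the same row labels as the original columns; `toy` itself is not modified.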

Nutritional Index

Now use the formula above to compute a score for each food: Score = 2 × (Protein_(g)) − 0.75 × (Lipid_Tot_(g))

weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = -0.75 * food_info["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat
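A worked example makes the formula concrete. The three foods below are hypothetical; the point is only that the raw score is expressed in gram-based units and can go negative for fatty foods.

```python
import pandas as pd

# Hypothetical nutrient values per food, not taken from food_info.csv.
toy = pd.DataFrame({
    "Protein_(g)": [20.0, 5.0, 0.0],
    "Lipid_Tot_(g)": [10.0, 40.0, 100.0],
})

# Score = 2 * protein - 0.75 * fat, computed row by row.
weighted_protein = toy["Protein_(g)"] * 2
weighted_fat = -0.75 * toy["Lipid_Tot_(g)"]
initial_rating = weighted_protein + weighted_fat

print(initial_rating.tolist())
```

For the first row this is 2 × 20 − 0.75 × 10 = 32.5, which matches what the code prints for that food.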

Normalizing Columns

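Each column measures a different attribute, in different units, and the value ranges differ widely. Using the raw values directly in some computations introduces bias. For example, "Vit_A_IU" has a large range (0 to 100000), so it would influence a calculation far more than "Fiber_TD_(g)" (range 0 to 79). The data therefore needs to be normalized.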

  • The normalization used here divides every value in a column by that column's maximum value
max_protein = food_info["Protein_(g)"].max()
normalized_protein = food_info["Protein_(g)"] / max_protein
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
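The same divide-by-maximum step can be applied to several columns at once, because dividing a DataFrame by the Series returned from `.max()` matches each column with its own maximum. The sketch below uses hypothetical values and assumes the columns are non-negative with a positive maximum, so the results fall between 0 and 1.

```python
import pandas as pd

# Hypothetical values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Protein_(g)": [10.0, 40.0, 80.0],
    "Lipid_Tot_(g)": [25.0, 50.0, 100.0],
})

cols = ["Protein_(g)", "Lipid_Tot_(g)"]

# toy[cols].max() is a Series of per-column maxima; the division
# aligns it with the column labels of toy[cols].
normalized = toy[cols] / toy[cols].max()

print(normalized["Protein_(g)"].tolist())
print(normalized["Lipid_Tot_(g)"].tolist())
```

After this step every column peaks at exactly 1.0, so no single column dominates only because of its units.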

Creating A New Column

  • So far each modified column (a Series) was assigned to a variable. It can also be added directly to the DataFrame, as follows (the DataFrame gains two columns, and the two original columns remain unchanged):
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat
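A quick way to confirm that assignment adds columns rather than overwriting them is to compare the frame's shape before and after. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Protein_(g)": [10.0, 40.0],
    "Lipid_Tot_(g)": [50.0, 25.0],
})
shape_before = toy.shape

# Assigning to a label that does not exist yet appends a new column.
toy["Normalized_Protein"] = toy["Protein_(g)"] / toy["Protein_(g)"].max()
toy["Normalized_Fat"] = toy["Lipid_Tot_(g)"] / toy["Lipid_Tot_(g)"].max()
shape_after = toy.shape

print(shape_before, shape_after)
print(toy.columns.tolist())
```

Assigning to a label that already exists would replace that column instead, which is why the normalized columns get their own names here.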

Normalized Nutritional Index

The formula is therefore now computed from the normalized data rather than the raw data:

food_info["Normalized_Protein"] = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
food_info["Normalized_Fat"] = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] + (-0.75 * food_info["Normalized_Fat"])
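Normalizing can change which food comes out on top, because the weights 2 and 0.75 now act on fractions of each column's maximum instead of on grams. The two foods below are hypothetical and were chosen so that the raw score and the normalized index disagree.

```python
import pandas as pd

# Hypothetical foods; labels and values are illustrative only.
toy = pd.DataFrame(
    {"Protein_(g)": [20.0, 10.0], "Lipid_Tot_(g)": [100.0, 0.0]},
    index=["food_a", "food_b"],
)

# Raw score in gram-based units.
raw_score = 2 * toy["Protein_(g)"] + (-0.75 * toy["Lipid_Tot_(g)"])

# Normalized index: each column is first divided by its own maximum.
norm_protein = toy["Protein_(g)"] / toy["Protein_(g)"].max()
norm_fat = toy["Lipid_Tot_(g)"] / toy["Lipid_Tot_(g)"].max()
norm_index = 2 * norm_protein + (-0.75 * norm_fat)

print(raw_score.to_dict())
print(norm_index.to_dict())
print(raw_score.idxmax(), norm_index.idxmax())
```

With these values the raw score favors food_b, while the normalized index favors food_a, since fat spans a much wider range of grams than protein in this toy frame.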

Sorting A DataFrame By A Column

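The rows of the original data are ordered by NDB_No, which uniquely identifies each food. DataFrame provides sort_values() for sorting by the values of a column (older pandas releases offered sort() for this, which has since been removed). It sorts in ascending order by default and returns a new DataFrame; with inplace=True it sorts the existing DataFrame and returns None instead.

food_info["Normalized_Protein"] = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
food_info["Normalized_Fat"] = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] + (-0.75 * food_info["Normalized_Fat"])
food_info.sort_values("Norm_Nutr_Index", inplace=True, ascending=False)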
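The difference between the returning and in-place forms of sorting is easy to miss, so here is a small sketch using `sort_values`, the method current pandas releases provide for sorting by a column. The NDB_No values and index scores are hypothetical.

```python
import pandas as pd

# Hypothetical identifiers and scores, not taken from food_info.csv.
toy = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Norm_Nutr_Index": [0.2, 1.5, -0.3],
})

# Default form: returns a new, sorted DataFrame and leaves toy untouched.
ranked = toy.sort_values("Norm_Nutr_Index", ascending=False)
print(ranked["NDB_No"].tolist())
print(ranked.index.tolist())  # original row labels travel with the rows

order_before_inplace = toy["NDB_No"].tolist()

# In-place form: toy itself is reordered and the call returns None.
result = toy.sort_values("Norm_Nutr_Index", ascending=False, inplace=True)
print(result)
print(toy["NDB_No"].tolist())
```

If you want the row labels renumbered from 0 after sorting, `ranked.reset_index(drop=True)` does that without keeping the old labels as a column.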