Series: series_custom, reindex, sort_index

Source: Internet | Published by: 一橙网络 | Editor: 程序博客网 | Date: 2024/05/17 07:14

pandas has three main data structures:

  • Series (collection of values)
  • DataFrame (collection of Series objects)
  • Panel (collection of DataFrame objects; deprecated and removed as of pandas 0.25)

This post focuses on Series.

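A Series uses NumPy arrays for fast computation. It is built on NumPy but extends it: an ndarray can only be indexed by integer position, whereas the index of a Series can also consist of strings. A Series can hold mixed-type data, and it can contain NaN (which represents a missing value).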
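A minimal sketch of the difference described above. The scores are taken from the sample output later in this post; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A plain ndarray is addressed by integer position only.
scores_array = np.array([74, 85, 80])
first_score = scores_array[0]

# A Series can use string labels as its index and may contain NaN.
scores = pd.Series(
    [74, 85, np.nan],
    index=['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)'],
)
cinderella_score = scores['Cinderella (2015)']  # lookup by label
missing_count = scores.isna().sum()             # number of missing values

print(first_score, cinderella_score, missing_count)
```

Note that the presence of NaN makes pandas store this Series as floating-point values.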

  • A Series can store data of the following types:

float - for representing float values
int - for representing integer values
bool - for representing Boolean values
datetime64[ns] - for representing date & time, without time-zone
datetime64[ns, tz] - for representing date & time, with time-zone
timedelta64[ns] - for representing differences in dates & times (seconds, minutes, etc.)
category - for representing categorical values
object - for representing String values (remember: object is the dtype used for strings)
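The dtype of a Series can be inspected through its dtype attribute. The sketch below builds one small Series per kind of data using made-up values. How strings are reported may depend on the pandas version (object in the versions this post targets, possibly a dedicated string dtype in newer releases), so treat the printed names as indicative.

```python
import pandas as pd

ints = pd.Series([74, 85, 80])
floats = pd.Series([3.7, 4.25, 4.0])
flags = pd.Series([True, False, True])
dates = pd.Series(pd.to_datetime(['2015-05-01', '2015-03-13']))
deltas = pd.Series(pd.to_timedelta(['1 days', '2 hours']))
labels = pd.Series(['good', 'bad', 'good'], dtype='category')
names = pd.Series(['Cinderella (2015)', 'Ant-Man (2015)'])

# Print the dtype pandas chose for each Series.
for s in (ints, floats, flags, dates, deltas, labels, names):
    print(s.dtype)
```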

Dataset

The dataset used in this post is fandango_score_comparison.csv, which collects the scores that critics and users on different websites gave to movies.

  • The columns are as follows:

FILM - film name
RottenTomatoes - Rotten Tomatoes critics average score
RottenTomatoes_User - Rotten Tomatoes user average score
RT_norm - Rotten Tomatoes critics average score (normalized to a 0 to 5 point system)
RT_user_norm - Rotten Tomatoes user average score (normalized to a 0 to 5 point system)
Metacritic - Metacritic critics average score
Metacritic_User - Metacritic user average score

Integer Index

  • Index a Series the same way as a NumPy array (use : to slice; use a column label to pull one column out of the DataFrame)
import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')
# Selecting a single column of a DataFrame returns a Series
series_film = fandango['FILM']
series_rt = fandango['RottenTomatoes']
print(series_film[0:5])
print(series_rt[0:5])
'''
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64
'''

Custom Index

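Custom index: the previous example produced one column of film names and one column of film scores. To look up the score of a film by its name, we would first have to find the integer index of that film and then use it to fetch the score, which is cumbersome. What we want instead is an object that can be indexed directly by film name to return the score.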

  • The Series(rt_scores, index=film_names) constructor creates a new Series object from the given values and index (series_custom is
# Import the Series object from pandas
from pandas import Series

film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)
print(series_custom[['Minions (2015)', 'Leviathan (2014)']])
'''
Minions (2015)      54
Leviathan (2014)    99
dtype: int64
'''
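For a version that runs without the CSV file, here is a small sketch that builds the same kind of label-indexed Series from the two films shown in the output above. It uses .loc for label-based lookup and .iloc for position-based lookup, which keeps the two styles explicit; this is one reasonable way to write the lookups, not the only one.

```python
import pandas as pd

film_names = ['Minions (2015)', 'Leviathan (2014)']
rt_scores = [54, 99]
series_custom = pd.Series(rt_scores, index=film_names)

by_label = series_custom.loc['Minions (2015)']   # lookup by film name
by_position = series_custom.iloc[1]              # lookup by integer position
# A list of labels returns a Series in the requested order.
several = series_custom.loc[['Leviathan (2014)', 'Minions (2015)']]

print(by_label, by_position)
print(several)
```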

Reindexing

  • Reorder series_custom by its sorted index; note that the index here is the film names.
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
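The same steps on a tiny, self-contained Series (scores taken from the outputs shown earlier). The sketch also shows what reindex does with a label that is not in the original index: the result contains NaN for it. The label 'Hypothetical Film (2015)' is made up for this illustration.

```python
import pandas as pd

series_custom = pd.Series(
    [54, 99, 85],
    index=['Minions (2015)', 'Leviathan (2014)', 'Cinderella (2015)'],
)

# Reorder the Series by an alphabetically sorted list of its labels.
sorted_index = sorted(series_custom.index.tolist())
sorted_by_index = series_custom.reindex(sorted_index)

# A label missing from the original index comes back as NaN.
with_missing = series_custom.reindex(['Cinderella (2015)', 'Hypothetical Film (2015)'])

print(sorted_by_index)
print(with_missing)
```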

Sorting

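pandas provides two methods for sorting a Series:
- sort_index(): sorts by the index of the Series and returns a new Series
- sort_values(): sorts by the values (ascending by default) and returns a new Series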

sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
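A short sketch of both methods on a three-film Series (scores from the outputs above). It also passes ascending=False to reverse the order, and checks that the original Series is left unchanged, since both methods return a new object by default.

```python
import pandas as pd

series_custom = pd.Series(
    [54, 99, 85],
    index=['Minions (2015)', 'Leviathan (2014)', 'Cinderella (2015)'],
)

by_label = series_custom.sort_index()                        # alphabetical by film name
by_score = series_custom.sort_values()                       # lowest score first
by_score_desc = series_custom.sort_values(ascending=False)   # highest score first

print(by_label)
print(by_score)
print(by_score_desc)
```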

Vectorized Operations

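Series objects support vectorized operations. pandas is built on NumPy, whose vectorized operations are heavily optimized (they are implemented in C, a low-level language). Doing the same computation with an explicit Python loop is much slower, so use vectorized operations wherever possible.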

series_normalized = (series_custom/100)*5
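To make the comparison concrete, the sketch below computes the same normalization twice on five scores from the sample output: once as a vectorized expression and once with an explicit loop. The loop version is shown only for contrast; both produce the same values.

```python
import pandas as pd

series_custom = pd.Series(
    [74, 85, 80, 18, 14],
    index=[
        'Avengers: Age of Ultron (2015)',
        'Cinderella (2015)',
        'Ant-Man (2015)',
        'Do You Believe? (2015)',
        'Hot Tub Time Machine 2 (2015)',
    ],
)

# Vectorized: the arithmetic is applied to every element at once.
series_normalized = (series_custom / 100) * 5

# Equivalent element-by-element loop, for comparison only.
looped = pd.Series({film: (score / 100) * 5 for film, score in series_custom.items()})

print(series_normalized)
```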

Comparing And Filtering

series_greater_than_50 = series_custom[series_custom > 50]
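The comparison series_custom > 50 is itself a vectorized operation: it returns a Series of booleans, and indexing with that boolean Series keeps only the rows where it is True. A small sketch with five scores from the sample output, including how two conditions can be combined with & (each condition needs its own parentheses):

```python
import pandas as pd

series_custom = pd.Series(
    [74, 85, 80, 18, 14],
    index=[
        'Avengers: Age of Ultron (2015)',
        'Cinderella (2015)',
        'Ant-Man (2015)',
        'Do You Believe? (2015)',
        'Hot Tub Time Machine 2 (2015)',
    ],
)

mask = series_custom > 50                      # boolean Series, one entry per film
series_greater_than_50 = series_custom[mask]   # keep rows where mask is True

# Combine conditions with & (and) or | (or).
between_50_and_80 = series_custom[(series_custom > 50) & (series_custom < 80)]

print(series_greater_than_50)
print(between_50_and_80)
```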

Alignment

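  • Alignment means that when two Series are combined with Python's standard arithmetic operators (+, -, *, /), pandas matches their values by index label rather than by position. The two Series do not need to have the same length: a label that appears in only one of them yields NaN in the result.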
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2
print(rt_mean)
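A small sketch of how arithmetic between two Series behaves when their indexes differ in order and in content. The critic scores come from the sample output above; the user scores are illustrative values chosen for this example, not read from the CSV.

```python
import pandas as pd

rt_critics = pd.Series(
    [74, 85, 80],
    index=['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)'],
)
# Illustrative user scores: different order, and Ant-Man is missing.
rt_users = pd.Series(
    [80, 86],
    index=['Cinderella (2015)', 'Avengers: Age of Ultron (2015)'],
)

# Values are paired by label, not by position.
rt_mean = (rt_critics + rt_users) / 2

print(rt_mean)
```

Films present in both Series get a mean score; Ant-Man, which has no user score here, ends up as NaN.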