

pandas has three main data structures:

  • Series (collection of values)
  • DataFrame (collection of Series objects)
  • Panel (collection of DataFrame objects; deprecated in pandas 0.20 and removed in later releases)


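A Series object uses NumPy arrays for fast computation. It is built on NumPy but extends it: an ndarray can only be indexed by integer position, whereas the index of a Series can also hold strings (labels). A Series can also hold data of mixed types, including NaN, which represents a missing value.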
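The difference from a plain ndarray can be sketched as follows. The labels and scores here are made up for illustration and are not taken from the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the labels and values are invented for this sketch.
scores = pd.Series([74.0, np.nan, 80.0], index=["film_a", "film_b", "film_c"])

# Label-based lookup, which a plain ndarray does not offer.
score_a = scores["film_a"]

# Position-based lookup is still available through .iloc.
first_score = scores.iloc[0]

# NaN marks the missing value; isna() reports where it occurs.
missing_mask = scores.isna()

# The data can still be handed back as a NumPy array.
raw_values = scores.to_numpy()
print(type(raw_values).__name__)
```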

  • A Series object can store the following data types:

float - for representing float values
int - for representing integer values
bool - for representing Boolean values
datetime64[ns] - for representing date & time, without time-zone
datetime64[ns, tz] - for representing date & time, with time-zone
timedelta64[ns] - for representing differences in dates & times (seconds, minutes, etc.)
category - for representing categorical values
object - for representing String values (remember: object is the dtype that stands for strings here)
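As a rough illustration of the list above, the following sketch builds one small Series per dtype and prints what pandas reports. All values are invented, and the exact dtype names printed (for example the integer width or the datetime resolution) may differ between platforms and pandas versions.

```python
import pandas as pd

# One tiny, made-up Series per dtype from the list above.
float_series = pd.Series([1.5, 2.5])
int_series = pd.Series([1, 2])
bool_series = pd.Series([True, False])
naive_dates = pd.Series(pd.to_datetime(["2015-01-01", "2015-06-01"]))
aware_dates = naive_dates.dt.tz_localize("UTC")   # attach a time zone
deltas = naive_dates - naive_dates.iloc[0]        # differences between dates
categories = pd.Series(["good", "bad", "good"], dtype="category")
text = pd.Series(["Ant-Man (2015)", "Cinderella (2015)"], dtype=object)

examples = {
    "float": float_series,
    "int": int_series,
    "bool": bool_series,
    "datetime (naive)": naive_dates,
    "datetime (tz-aware)": aware_dates,
    "timedelta": deltas,
    "category": categories,
    "object": text,
}
for name, series in examples.items():
    print(name, "->", series.dtype)
```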



  • The columns of the dataset are as follows:

FILM - film name
RottenTomatoes - Rotten Tomatoes critics average score
RottenTomatoes_User - Rotten Tomatoes user average score
RT_norm - Rotten Tomatoes critics average score (normalized to a 0 to 5 point system)
RT_user_norm - Rotten Tomatoes user average score (normalized to a 0 to 5 point system)
Metacritic - Metacritic critics average score
Metacritic_User - Metacritic user average score

Integer Index

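  • Index a Series as you would a NumPy array: slice with a colon (:), and select a column of the DataFrame by its label.
import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')
# Selecting a column by label returns a Series
series_film = fandango['FILM']
series_rt = fandango['RottenTomatoes']
# Slice the first five rows by integer position
print(series_film[0:5])
print(series_rt[0:5])
'''
0    Avengers: Age of Ultron (2015)
1                 Cinderella (2015)
2                    Ant-Man (2015)
3            Do You Believe? (2015)
4     Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0    74
1    85
2    80
3    18
4    14
Name: RottenTomatoes, dtype: int64
'''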

Custom Index


  • Series(rt_scores, index=film_names) creates a new Series object from the given values and index; the result is stored in series_custom.
# Import the Series object from pandas
from pandas import Series

film_names = series_film.values
rt_scores = series_rt.values
# Build a Series whose index is the film names
series_custom = Series(rt_scores, index=film_names)
# Passing a list of labels selects several entries at once
print(series_custom[['Minions (2015)', 'Leviathan (2014)']])
'''
Minions (2015)      54
Leviathan (2014)    99
dtype: int64
'''


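  • Sort series_custom by its index. Note that its index now holds the film names.
# Take the index as a list, sort it, then reorder the Series to match
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)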
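The reindex step can be tried on a small stand-in Series. The titles and scores below are invented for the sketch; the second call also shows what reindex does with a label that is not in the original index.

```python
import pandas as pd

# Made-up scores keyed by film title (not read from the real CSV).
series_demo = pd.Series([54, 99, 74], index=["Minions", "Leviathan", "Avengers"])

# Same three steps as above: list the labels, sort them, reindex.
sorted_labels = sorted(series_demo.index.tolist())
sorted_demo = series_demo.reindex(sorted_labels)
print(sorted_demo)

# A label missing from the original index is filled with NaN.
with_missing = series_demo.reindex(["Avengers", "Unknown Film"])
print(with_missing)
```

Because NaN is a floating-point value, the result that contains a missing label is stored as floats rather than integers.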


- sort_index(): sorts the Series by its index and returns a new Series
- sort_values(): sorts the Series by its values (ascending by default) and returns a new Series

sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
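A minimal sketch of the two methods on invented data, which also shows that the original Series is left unchanged and that ascending=False reverses the order:

```python
import pandas as pd

# Made-up scores keyed by film title.
series_demo = pd.Series([54, 99, 74], index=["Minions", "Leviathan", "Avengers"])

by_label = series_demo.sort_index()                       # alphabetical by title
by_score = series_demo.sort_values()                      # lowest score first
by_score_desc = series_demo.sort_values(ascending=False)  # highest score first

print(by_label)
print(by_score)
print(by_score_desc)
```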

Vectorized Operations

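Series objects support vectorized operations. pandas is built on NumPy, whose vectorized operations are heavily optimized (they are implemented in C, a low-level language), whereas computing the same result with an explicit Python loop is much slower. Prefer vectorized operations wherever possible.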

series_normalized = (series_custom/100)*5
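To make the comparison concrete, the sketch below computes the same normalization twice on invented data: once as a single vectorized expression and once with an explicit Python loop. Both give the same numbers; the vectorized form is shorter and, on large inputs, usually much faster.

```python
import numpy as np
import pandas as pd

# Made-up scores on a 0 to 100 scale.
series_demo = pd.Series([54, 99, 74], index=["Minions", "Leviathan", "Avengers"])

# Vectorized: the arithmetic is applied to every element at once.
vectorized = (series_demo / 100) * 5

# Equivalent explicit loop, shown only for comparison.
looped = pd.Series(
    [(value / 100) * 5 for value in series_demo],
    index=series_demo.index,
)

same = np.allclose(vectorized, looped)
print(same)
```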

Comparing And Filtering

series_greater_than_50 = series_custom[series_custom > 50]
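The comparison inside the brackets produces a Boolean Series, which is then used as a mask. A small sketch on invented data, including a combined condition (the parentheses and the & operator are required; the Python keyword and does not work element-wise):

```python
import pandas as pd

# Made-up scores keyed by film title.
series_demo = pd.Series([54, 99, 18], index=["Minions", "Leviathan", "Do You Believe?"])

# Step 1: the comparison yields one True/False per element.
mask = series_demo > 50

# Step 2: indexing with the mask keeps only the True rows.
above_50 = series_demo[mask]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses.
between_50_and_90 = series_demo[(series_demo > 50) & (series_demo < 90)]

print(above_50)
print(between_50_and_90)
```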


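  • Alignment means that pandas matches the values of two Series by index label, not by position or length, before applying Python's standard arithmetic operators (+, -, *, /). A label that appears in only one of the two Series produces NaN in the result. Here both Series share the same FILM index, so every film gets a mean score.
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
# Values are paired by film name before the arithmetic is applied
rt_mean = (rt_critics + rt_users) / 2
print(rt_mean)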
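What happens when the two indexes do not match can be sketched with invented data. The two Series below have different lengths and a different label order; the arithmetic still runs, and fill_value on the add method is one way to avoid NaN for unmatched labels.

```python
import pandas as pd

# Made-up scores; the users Series has one extra film and a different order.
critics = pd.Series([74, 85], index=["Avengers", "Cinderella"])
users = pd.Series([90, 80, 86], index=["Ant-Man", "Cinderella", "Avengers"])

# Values are paired by label, so order and length do not need to match.
mean_scores = (critics + users) / 2
print(mean_scores)

# "Ant-Man" has no critics score, so its mean is NaN.
unmatched_is_nan = pd.isna(mean_scores["Ant-Man"])

# add() with fill_value treats a label missing from one side as 0.
total_filled = critics.add(users, fill_value=0)
print(total_filled)
```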