Dataquest Data Scientist Path: Study Notes (2)

Source: Internet | Published by: 积分返利商城源码 | Editor: 程序博客网 | Time: 2024/06/05 21:05

Notes on the key points from the Data Scientist path on Dataquest.

Step 2: Data Analysis And Visualization

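ndarray: the core data structure in NumPy; the name stands for N-dimensional array.
Reading a CSV file into a NumPy array

import numpy

# Comma-delimited; every field is read as a Unicode string of up to 75 characters (U75); the header row is skipped
nfl = numpy.genfromtxt("nfl.csv", delimiter=",", dtype="U75", skip_header=1)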

Use numpy.array() to create an array, ndarray.shape to read its dimensions, and ndarray.dtype to read its element type. All elements of a NumPy array share a single type; basic types include bool, int, float, and string.

vector = numpy.array([5, 10, 15, 20])
matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])
print(vector.shape)
print(matrix.shape)
vector.dtype
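The single-type rule is easiest to see when the inputs are mixed. The following is a minimal sketch, not part of the course material; the variable names are made up for illustration.

```python
import numpy

# Mixing ints and floats: every element is upcast to float64.
mixed_numbers = numpy.array([1, 2, 3.5])

# Mixing numbers and strings: every element becomes a Unicode string.
mixed_text = numpy.array([1, 2, "three"])

print(mixed_numbers.dtype)      # float64
print(mixed_text.dtype.kind)    # U (Unicode string)
print(mixed_text[0] == "1")     # True: the integer 1 was stored as the string "1"
```

This is also why the CSV above is read with a single string dtype: one array cannot hold numeric columns and text columns at the same time.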

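NumPy uses nan to represent Not a Number, i.e. a value that could not be converted to a numeric type. The course also uses na for Not Available (a value that does not exist); NumPy itself has no separate na constant, so missing numeric values also show up as nan.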
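As a small illustration of where nan comes from, the sketch below reads a two-row CSV held in memory. The CSV text is hypothetical and only stands in for a real file.

```python
import io
import numpy

# Hypothetical two-row CSV: the last field of the second row is not numeric.
csv_text = "1.5,2.0\n3.0,abc\n"

# With the default float dtype, fields that cannot be converted become nan.
data = numpy.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(data)
print(numpy.isnan(data))          # True only where the conversion failed
print(data[1, 1] == data[1, 1])   # False: nan is never equal to itself
```

Because nan is not equal to itself, numpy.isnan() is the reliable way to find such values.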
Build a matrix from a list of lists and from a NumPy array, then read a single element from each.

>> list_of_lists = [[5, 10, 15],
                    [20, 25, 30]]
>> list_of_lists[1][2]
30
>> matrix = numpy.array([[5, 10, 15],
                         [20, 25, 30]])
>> matrix[1, 2]
30

Slicing array

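print(vector[1:3])     # returns a vector: the 2nd and 3rd elements (the end index 3 is excluded)
print(matrix[:, 1])    # returns a vector: the 2nd column
print(matrix[:, 1:3])  # returns a sub-matrix: the 2nd and 3rd columns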
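Since a slice excludes its end index, it helps to check the results on the small arrays used in these notes. A short sketch:

```python
import numpy

vector = numpy.array([5, 10, 15, 20])
matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])

print(vector[1:3])           # [10 15]
print(matrix[:, 1])          # [10 25 40]
print(matrix[:, 1:3].shape)  # (3, 2): all three rows, two columns
print(matrix[1:3, 0:2])      # rows 2-3 and columns 1-2: [[20 25] [35 40]]
```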

Array Comparisons

>> vector = numpy.array([5, 10, 15, 20])
>> vector == 10
array([False,  True, False, False])
>> (vector == 10) & (vector == 15)
array([False, False, False, False])
>> (vector == 10) | (vector == 15)
array([False,  True,  True, False])
>> matrix = numpy.array([[5, 10, 15],
                         [20, 25, 30],
                         [35, 40, 45]])
>> matrix == 25
array([[False, False, False],
       [False,  True, False],
       [False, False, False]])

Selecting Elements

matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])
second_column_25 = (matrix[:, 1] == 25)  # True for the rows whose 2nd column equals 25
print(matrix[second_column_25, :])       # prints the matching rows

Replacing Values

>> vector = numpy.array([5, 10, 15, 20])
>> equal_to_ten_or_five = (vector == 10) | (vector == 5)
>> vector[equal_to_ten_or_five] = 50
>> print(vector)
[50 50 15 20]
>> matrix = numpy.array([[5, 10, 15],
                         [20, 25, 30],
                         [35, 40, 45]])
>> second_column_25 = matrix[:, 1] == 25
>> matrix[second_column_25, 1] = 10
>> matrix
array([[ 5, 10, 15],
       [20, 10, 30],
       [35, 40, 45]])

Converting element types with astype()

vector = numpy.array(["1", "2", "3"])
vector = vector.astype(float)  # the strings become the floats 1.0, 2.0, 3.0

Computing With NumPy

>> vector = numpy.array([5, 10, 15, 20])
>> vector.sum()
50
>> matrix = numpy.array([[5, 10, 15],
                         [20, 25, 30],
                         [35, 40, 45]])
>> matrix.sum(axis=1)  # axis=1 computes across each row; axis=0 computes down each column
array([ 30,  75, 120])

Example

Year  WHO region  Country        Beverage Types  Display Value
1986  Western     Viet Nam       Wine            0
1986  Americas    Uruguay        Other           0.5
1985  Africa      Côte d’Ivoire  Wine            1.62
1986  Americas    Colombia       Beer            4.27
1987  Americas    Saint Kitts    Beer            1.98

totals = {}  # new dictionary: country -> total consumption
is_year = world_alcohol[:, 0] == "1985"
year = world_alcohol[is_year, :]  # keep only the 1985 rows
# countries is assumed to already hold the country names to total
for country in countries:
    is_country = year[:, 2] == country
    country_consumption = year[is_country, :]
    alcohol_column = country_consumption[:, 4]
    is_empty = alcohol_column == ''
    alcohol_column[is_empty] = "0"  # treat empty fields as 0
    alcohol_column = alcohol_column.astype(float)
    totals[country] = alcohol_column.sum()
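To try the per-country totals without the real file, the sketch below uses a tiny made-up array in the same five-column layout. The rows and the use of numpy.unique() to build the country list are assumptions for illustration, not the course's data or method.

```python
import numpy

# Hypothetical miniature of world_alcohol; every field is a string,
# as it would be after reading the CSV with a string dtype.
world_alcohol = numpy.array([
    ["1985", "Africa", "Algeria", "Wine", "0.5"],
    ["1985", "Africa", "Algeria", "Beer", ""],
    ["1985", "Americas", "Canada", "Beer", "4.0"],
    ["1986", "Americas", "Canada", "Wine", "1.5"],
])

# Keep only the 1985 rows.
rows_1985 = world_alcohol[world_alcohol[:, 0] == "1985", :]

totals = {}
for country in numpy.unique(rows_1985[:, 2]):
    # Boolean indexing returns a copy, so editing it leaves rows_1985 untouched.
    consumption = rows_1985[rows_1985[:, 2] == country, 4]
    consumption[consumption == ""] = "0"   # treat empty fields as 0
    totals[str(country)] = float(consumption.astype(float).sum())

print(totals)  # {'Algeria': 0.5, 'Canada': 4.0}
```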

Reading a CSV file with pandas

import pandas

crime_rates = pandas.read_csv("crime_rates.csv")  # read into a DataFrame, the general-purpose pandas data structure

Exploring the data

first_rows = food_info.head()      # the first 5 rows
column_names = food_info.columns   # the column names
dimensions = food_info.shape       # the size as (rows, columns)
num_rows = dimensions[0]           # number of rows
num_cols = dimensions[1]           # number of columns

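Selecting rows and columns
pandas uses the first row of the file as the column labels and the row numbers as the row labels.

food_info.loc[0]                  # the row labeled 0 (the 1st row)
food_info.loc[2:5]                # the rows labeled 2 through 5 (the 3rd to 6th rows; the end label is included)
food_info.loc[[2, 4, 6]]          # the 3rd, 5th, and 7th rows
food_info.iloc[0]                 # the row at position 0, i.e. the 1st row even after the DataFrame has been re-sorted
food_info.iloc[0:4]               # the rows at positions 0 to 3, i.e. the 1st to 4th rows even after re-sorting
food_info["NDB_No"]               # the single column labeled "NDB_No"
food_info[["Zinc", "Copper"]]     # the two columns labeled "Zinc" and "Copper", returned in the order given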
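The difference between label-based loc and position-based iloc only becomes visible once the rows are no longer in their original order. A minimal sketch, using a made-up four-row stand-in for food_info:

```python
import pandas

# Hypothetical stand-in for food_info.
food_info = pandas.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004],
    "Sodium_(mg)": [10, 40, 20, 30],
})

by_sodium = food_info.sort_values("Sodium_(mg)", ascending=False)

print(by_sodium.loc[0, "NDB_No"])    # 1001: label 0 is still the original first row
print(by_sodium.iloc[0]["NDB_No"])   # 1002: position 0 is now the row with the most sodium
print(len(food_info.loc[1:3]))       # 3: loc slices include the end label
print(len(food_info.iloc[1:3]))      # 2: iloc slices exclude the end position
```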

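endswith(): selecting the elements that end with a given string

col_names = food_info.columns.tolist()  # column names as a plain list
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)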

Arithmetic on a DataFrame

max_Iron = food_info["Iron_(mg)"].max()    # maximum value
mean_Iron = food_info["Iron_(mg)"].mean()  # mean value
iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams         # add a new column "Iron_(g)"

Sorting

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort in descending order and update food_info in place

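Python uses None to mean "no value"; pandas uses NaN ("not a number") for missing values. Both None and NaN are called null values and can be detected with pandas.isnull().

sex = titanic_survival["sex"]
sex_is_null = pandas.isnull(sex)  # returns True/False values; also works on a whole DataFrame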
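A quick check that pandas.isnull() treats both kinds of null value the same way. The Series here is a made-up stand-in for the "sex" column.

```python
import numpy
import pandas

# Hypothetical column containing both None and NaN.
sex = pandas.Series(["male", None, "female", numpy.nan])

sex_is_null = pandas.isnull(sex)

print(sex_is_null.tolist())         # [False, True, False, True]
print(int(sex_is_null.sum()))       # 2 null values
print(sex[~sex_is_null].tolist())   # ['male', 'female']
```

Negating the mask with ~ keeps only the rows that have a value.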

# Mean fare for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class

Pivot tables

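import numpy as np

# Same result as the loop above; aggfunc defaults to the mean, so it can be omitted
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
# Several columns can be aggregated at once by passing a list to values
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)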
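To confirm that the pivot table and the manual loop agree, the sketch below runs both on a small made-up DataFrame. The data is hypothetical, and the aggregation is passed as the string "mean", which pandas also accepts.

```python
import pandas

# Hypothetical stand-in for titanic_survival.
titanic_survival = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [100.0, 80.0, 20.0, 8.0, 12.0],
    "survived": [1, 0, 1, 0, 1],
})

# Manual version: filter each class, then average its fares.
fares_by_class = {}
for this_class in [1, 2, 3]:
    rows = titanic_survival[titanic_survival["pclass"] == this_class]
    fares_by_class[this_class] = float(rows["fare"].mean())

# Pivot-table version: one call, indexed by pclass.
pivot = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc="mean")

print(fares_by_class)            # {1: 90.0, 2: 20.0, 3: 10.0}
print(pivot["fare"].to_dict())   # the same mapping
```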

Drop Missing Values

# axis=0 or axis='index' drops every row that contains a null value;
# axis=1 or axis='columns' drops every column that contains a null value
drop_na_rows = titanic_survival.dropna(axis=0)
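Dropping on any null can remove far more than expected. The sketch below uses a made-up three-row DataFrame in which every row has a null somewhere, and shows the subset parameter of dropna() as one way to restrict the check to chosen columns.

```python
import numpy
import pandas

# Hypothetical data: every row is missing something.
passengers = pandas.DataFrame({
    "age": [22.0, numpy.nan, 30.0],
    "sex": ["male", "female", "female"],
    "cabin": [None, "C85", None],
})

no_null_rows = passengers.dropna(axis=0)           # every row has a null, so none survive
no_null_columns = passengers.dropna(axis=1)        # only the fully populated "sex" column survives
known_age = passengers.dropna(subset=["age"])      # only look at "age" when deciding

print(no_null_rows.shape)                 # (0, 3)
print(no_null_columns.columns.tolist())   # ['sex']
print(len(known_age))                     # 2
```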

iloc[] and loc[]

# iloc selects by integer position; loc selects by label
first_row_first_column = new_titanic_survival.iloc[0, 0]
all_rows_first_three_columns = new_titanic_survival.iloc[:, 0:3]
row_index_83_age = new_titanic_survival.loc[83, "age"]
row_index_766_pclass = new_titanic_survival.loc[766, "pclass"]

Resetting the row labels

titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # discard the old row labels and renumber from 0

DataFrame.apply(): define a function and run it on every column or row.

import pandas as pd

# Count the null values in each column
# (despite its name, not_null_count returns the number of null values)
def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(not_null_count)

# Classify each passenger by age
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)  # axis=1 runs the function on each row
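Both uses of apply() can be tried on a few made-up rows. The sketch below also compares the column-wise null count with isnull().sum(), a built-in alternative that gives the same numbers; the data and the name null_count are illustrative.

```python
import numpy
import pandas

# Hypothetical stand-in for titanic_survival.
titanic_survival = pandas.DataFrame({
    "age": [22.0, numpy.nan, 15.0, numpy.nan],
    "fare": [7.25, 71.3, numpy.nan, 8.05],
})

def null_count(column):
    # Keep only the null entries of the column and count them.
    return len(column[pandas.isnull(column)])

def generate_age_label(row):
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

via_apply = titanic_survival.apply(null_count)                   # default axis=0: once per column
via_sum = titanic_survival.isnull().sum()                        # built-in equivalent
age_labels = titanic_survival.apply(generate_age_label, axis=1)  # once per row

print(via_apply.to_dict())   # {'age': 2, 'fare': 1}
print(via_sum.to_dict())     # {'age': 2, 'fare': 1}
print(age_labels.tolist())   # ['adult', 'unknown', 'minor', 'unknown']
```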

Series: the core data structure that pandas uses to represent rows and columns.
A Series is similar to a NumPy vector, except that a Series can use non-integer labels.

from pandas import Series

series_custom = Series(rt_scores, index=film_names)
fiveten = series_custom[5:11]  # a Series can still be sliced by integer position
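A short sketch of label-based and position-based access on the same Series. The film names and scores are placeholders, not the course's dataset.

```python
from pandas import Series

# Hypothetical labels and scores.
film_names = ["Film A", "Film B", "Film C", "Film D"]
rt_scores = [74, 85, 80, 18]
series_custom = Series(rt_scores, index=film_names)

print(series_custom["Film B"])                        # 85: lookup by label
print(series_custom[["Film A", "Film D"]].tolist())   # [74, 18]: several labels
print(series_custom.iloc[1:3].tolist())               # [85, 80]: positions 1 and 2
print(series_custom["Film B":"Film C"].tolist())      # [85, 80]: label slices include both ends
```

Using iloc for positional access makes the intent explicit when the labels are not integers.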

Sorting a Series

sorted_by_index = series_custom.reindex(sorted_index)  # reorder series_custom to follow the order of sorted_index
sc2 = series_custom.sort_index()                       # sort by index
sc3 = series_custom.sort_values()                      # sort by values
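The three calls can be compared on a small Series. In this sketch sorted_index is assumed to be the alphabetically sorted list of labels, and the data is made up.

```python
from pandas import Series

# Hypothetical Series whose labels are out of alphabetical order.
series_custom = Series([80, 74, 85], index=["Film C", "Film A", "Film B"])

# Assumption: sorted_index is the list of labels in sorted order.
sorted_index = sorted(series_custom.index.tolist())

sorted_by_index = series_custom.reindex(sorted_index)  # follow an explicit label order
sc2 = series_custom.sort_index()                       # same result, computed by pandas
sc3 = series_custom.sort_values()                      # order by the scores instead

print(sorted_by_index.tolist())  # [74, 85, 80]
print(sc2.index.tolist())        # ['Film A', 'Film B', 'Film C']
print(sc3.index.tolist())        # ['Film A', 'Film C', 'Film B']
```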

Computing with a Series

import numpy as np

np.add(series_custom, series_custom)  # doubles every value
np.sin(series_custom)                 # sine of every value
np.max(series_custom)                 # the maximum value

Comparing And Filtering

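series_custom > 50  # returns a Series of True/False values
series_greater_than_50 = series_custom[series_custom > 50]
# Combine conditions with "|" (or) and "&" (and), wrapping each condition in parentheses;
# the Python keywords "or" and "and" do not work element-wise on a Series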
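The sketch below combines two conditions with & and |, and shows that the keyword "and" raises a ValueError because a whole Series has no single truth value. The data is made up.

```python
from pandas import Series

# Hypothetical scores.
series_custom = Series([30, 60, 90], index=["Film A", "Film B", "Film C"])

# Element-wise AND / OR: each condition needs its own parentheses.
between = series_custom[(series_custom > 50) & (series_custom < 75)]
outside = series_custom[(series_custom < 50) | (series_custom > 75)]

# The keyword form tries to reduce a whole Series to one True/False value.
try:
    series_custom[(series_custom > 50) and (series_custom < 75)]
    keyword_and_works = True
except ValueError:
    keyword_and_works = False

print(between.tolist())     # [60]
print(outside.tolist())     # [30, 90]
print(keyword_and_works)    # False
```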

Selecting rows
iloc[] accepts the following inputs:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.

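fandango[0:5]            # the first 5 rows
fandango[140:]           # the rows from position 140 onward
fandango.iloc[50]        # the row at position 50 (the 51st row)
fandango.iloc[[45, 90]]  # the rows at positions 45 and 90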
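Each of the four input kinds listed for iloc[] can be tried on a small DataFrame. The following sketch uses a made-up stand-in for fandango; the column name "stars" is illustrative.

```python
import pandas

# Hypothetical stand-in for the fandango DataFrame.
fandango = pandas.DataFrame({
    "FILM": ["Film A", "Film B", "Film C", "Film D"],
    "stars": [4.5, 3.0, 4.0, 2.5],
})

single_row = fandango.iloc[2]                             # an integer: one row, returned as a Series
picked_rows = fandango.iloc[[3, 0]]                       # a list: rows in the order given
row_slice = fandango.iloc[1:3]                            # a slice: positions 1 and 2
masked_rows = fandango.iloc[[True, False, True, False]]   # a boolean list: rows marked True

print(single_row["FILM"])             # Film C
print(picked_rows["FILM"].tolist())   # ['Film D', 'Film A']
print(row_slice["FILM"].tolist())     # ['Film B', 'Film C']
print(masked_rows["FILM"].tolist())   # ['Film A', 'Film C']
```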

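set_index()
Takes a column and uses it as the row labels.
Parameters: inplace = True modifies the DataFrame directly instead of returning a new one;
　　　　　　drop = False keeps the column that became the row labels as a regular column as well.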

fandango_films = fandango.set_index('FILM', drop = False)
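Once a column has become the row labels, rows can be looked up by name with loc. A minimal sketch on made-up data, also showing what the default drop=True does:

```python
import pandas

# Hypothetical stand-in for the fandango DataFrame.
fandango = pandas.DataFrame({
    "FILM": ["Film A", "Film B", "Film C"],
    "stars": [4.5, 3.0, 4.0],
})

fandango_films = fandango.set_index("FILM", drop=False)  # FILM is both the index and a column
films_dropped = fandango.set_index("FILM")               # default drop=True: FILM is only the index

print(fandango_films.loc["Film B", "stars"])        # 3.0
print("FILM" in fandango_films.columns)             # True
print("FILM" in films_dropped.columns)              # False
print(len(fandango_films.loc["Film A":"Film B"]))   # 2: label slices include both ends
```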

pandas.Series.value_counts()

data["Do you celebrate Thanksgiving?"].value_counts()
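value_counts() returns the number of occurrences of each distinct value, most frequent first. A small sketch with made-up survey answers:

```python
import pandas

# Hypothetical answers to a yes/no survey question.
answers = pandas.Series(["Yes", "Yes", "No", "Yes"])

counts = answers.value_counts()                 # absolute counts, most frequent first
shares = answers.value_counts(normalize=True)   # fractions instead of counts

print(counts["Yes"], counts["No"])   # 3 1
print(counts.index[0])               # Yes
print(shares["Yes"])                 # 0.75
```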

