Dataquest Data Scientist Path: Study Notes (2)

Source: the Internet. Editor: 程序博客网. Time: 2024/06/05 21:05

Notes summarizing the key points of the Data Scientist path on Dataquest.

Step 2: Data Analysis And Visualization

  • ndarray: the core data structure in NumPy; the name stands for N-dimensional array.

  • Reading a CSV file into a NumPy array

nfl = numpy.genfromtxt("nfl.csv", delimiter=",", dtype="U75", skip_header=1)  # comma-delimited; every value read as a string of up to 75 characters (U75); header row skipped
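The same call can be tried without a file on disk. The sketch below assumes a small made-up CSV held in memory (the column names and values are hypothetical) and passes it to numpy.genfromtxt through io.StringIO:

```python
import io

import numpy

# Hypothetical CSV content standing in for a real file such as nfl.csv.
csv_text = "year,team,score\n2009,Team A,21\n2010,Team B,14\n"

# genfromtxt accepts a file-like object as well as a filename.
games = numpy.genfromtxt(io.StringIO(csv_text), delimiter=",",
                         dtype="U75", skip_header=1)

print(games.shape)   # (2, 3): two data rows, three columns
print(games[0, 1])   # Team A
```

Because dtype is "U75", every cell is a string, including the numeric ones; they need astype() before any arithmetic.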
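  • Create an array with numpy.array(); ndarray.shape gives the array's dimensions and ndarray.dtype gives the element type. All elements of a NumPy array share a single data type; the basic types include bool, int, float and string.
vector = numpy.array([5, 10, 15, 20])
matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])
print(vector.shape)  # (4,)
print(matrix.shape)  # (3, 3)
vector.dtype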
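To illustrate the single-type rule, here is a minimal sketch (the variable names are illustrative only) of what NumPy does when the input values have mixed types:

```python
import numpy

# Ints mixed with a float: every element is upcast to float64.
mixed_numbers = numpy.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Numbers mixed with text: every element becomes a Unicode string.
mixed_text = numpy.array([1, "two", 3.0])
print(mixed_text.dtype.kind)  # U

# shape is (rows, columns) for a 2-dimensional array.
small_matrix = numpy.array([[5, 10, 15], [20, 25, 30]])
print(small_matrix.shape)     # (2, 3)
```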
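  • NumPy uses nan to mean Not a Number (a value that could not be converted to a numeric type); the course uses na to mean Not Available (a value that is missing).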
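A short sketch of how nan behaves in practice, using made-up values. Note that nan is not equal to itself, so numpy.isnan() is the way to find it:

```python
import numpy

values = numpy.array([1.0, numpy.nan, 3.0])

# nan never compares equal, not even to itself.
print(numpy.nan == numpy.nan)   # False

# numpy.isnan() returns a boolean mask marking the nan entries.
is_missing = numpy.isnan(values)
print(is_missing)               # [False  True False]

# Any sum that includes nan is nan, so filter first.
clean_total = values[~is_missing].sum()
print(clean_total)              # 4.0
```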

  • Build a matrix both as a list of lists and as a NumPy matrix, then read a single element from each

>>list_of_lists = [[5, 10, 15],
                   [20, 25, 30]]
>>list_of_lists[1][2]
30
>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30]])
>>matrix[1, 2]
30
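  • Slicing arrays
print(vector[1:3])  # returns a vector: the 2nd and 3rd elements (the end index is excluded)
print(matrix[:, 1])  # returns a vector: the 2nd column
print(matrix[:, 1:3])  # returns a sub-matrix: the 2nd and 3rd columns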
  • Array Comparisons
>>vector = numpy.array([5, 10, 15, 20])
>>vector == 10
array([False,  True, False, False])
>>(vector == 10) & (vector == 15)
array([False, False, False, False])
>>(vector == 10) | (vector == 15)
array([False,  True,  True, False])
>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30],
                        [35, 40, 45]])
>>matrix == 25
array([[False, False, False],
       [False,  True, False],
       [False, False, False]])
  • Selecting Elements
matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])
second_column_25 = (matrix[:, 1] == 25)  # boolean mask over the rows
print(matrix[second_column_25, :])  # rows whose 2nd column equals 25: [[20 25 30]]
  • Replacing Values
>>vector = numpy.array([5, 10, 15, 20])
>>equal_to_ten_or_five = (vector == 10) | (vector == 5)
>>vector[equal_to_ten_or_five] = 50
>>print(vector)
[50 50 15 20]
>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30],
                        [35, 40, 45]])
>>second_column_25 = matrix[:, 1] == 25
>>matrix[second_column_25, 1] = 10
>>matrix
array([[ 5, 10, 15],
       [20, 10, 30],
       [35, 40, 45]])
  • Converting the element type with astype()
vector = numpy.array(["1", "2", "3"])
vector = vector.astype(float)  # array([1., 2., 3.])
  • Computing With NumPy
>>vector = numpy.array([5, 10, 15, 20])
>>vector.sum()
50
>>matrix = numpy.array([[5, 10, 15],
                        [20, 25, 30],
                        [35, 40, 45]])
>>matrix.sum(axis=1)  # axis=1 computes one result per row; axis=0 computes one result per column
array([ 30,  75, 120])
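The axis argument is easy to mix up, so here is a small sketch that runs both directions on the same matrix as above (mean() is added only for illustration):

```python
import numpy

matrix = numpy.array([[5, 10, 15],
                      [20, 25, 30],
                      [35, 40, 45]])

row_sums = matrix.sum(axis=1)      # one value per row
column_sums = matrix.sum(axis=0)   # one value per column
column_means = matrix.mean(axis=0)

print(row_sums)      # [ 30  75 120]
print(column_sums)   # [60 75 90]
print(column_means)  # [20. 25. 30.]
```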

  • Sample rows of the world_alcohol data:
    Year   WHO region   Country         Beverage Types   Display Value
    1986   Western      Viet Nam        Wine             0
    1986   Americas     Uruguay         Other            0.5
    1985   Africa       Côte d'Ivoire   Wine             1.62
    1986   Americas     Colombia        Beer             4.27
    1987   Americas     Saint Kitts     Beer             1.98
totals = {}  # new dictionary: country -> total consumption
is_year = world_alcohol[:, 0] == "1985"
year = world_alcohol[is_year, :]  # keep only the 1985 rows
countries = numpy.unique(year[:, 2])  # assumption: the notes never define countries
for country in countries:
    is_country = year[:, 2] == country
    country_consumption = year[is_country, :]
    alcohol_column = country_consumption[:, 4]
    is_empty = alcohol_column == ''
    alcohol_column[is_empty] = "0"  # treat empty values as 0
    alcohol_column = alcohol_column.astype(float)
    totals[country] = alcohol_column.sum()
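The loop above depends on a loaded world_alcohol array. As a self-contained sketch of the same idea, the version below builds a tiny made-up array with the same column order (the country names and values are placeholders, not real data) and wraps the logic in a hypothetical helper, totals_for_year:

```python
import numpy

# Placeholder rows: Year, WHO region, Country, Beverage Types, Display Value.
world_alcohol = numpy.array([
    ["1985", "Africa", "Country A", "Wine", "1.5"],
    ["1985", "Africa", "Country A", "Beer", ""],
    ["1985", "Americas", "Country B", "Beer", "2.0"],
    ["1986", "Americas", "Country B", "Wine", "4.0"],
])


def totals_for_year(data, year):
    """Return {country: total consumption} for one year."""
    year_rows = data[data[:, 0] == year]
    totals = {}
    for country in numpy.unique(year_rows[:, 2]):
        consumption = year_rows[year_rows[:, 2] == country, 4]
        # Empty strings cannot be converted to float, so replace them first.
        consumption = numpy.where(consumption == "", "0", consumption)
        totals[str(country)] = float(consumption.astype(float).sum())
    return totals


totals_1985 = totals_for_year(world_alcohol, "1985")
print(totals_1985)  # {'Country A': 1.5, 'Country B': 2.0}
```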
  • Reading a CSV file with pandas
crime_rates = pandas.read_csv("crime_rates.csv")  # returns a DataFrame, pandas' general-purpose tabular data structure
  • Exploring the data
first_rows = food_info.head()  # first 5 rows
column_names = food_info.columns  # column names
dimensions = food_info.shape  # (number of rows, number of columns)
num_rows = dimensions[0]  # number of rows
num_cols = dimensions[1]  # number of columns
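  • Selecting rows and columns
    pandas uses the first row of the file as the column labels and the row number as the row label.
food_info.loc[0]  # the row labelled 0 (the 1st row)
food_info.loc[2:5]  # rows labelled 2 through 5 (rows 3 to 6; loc slices include the end label)
food_info.loc[[2, 4, 6]]  # rows 3, 5 and 7
food_info.iloc[0]  # the row at position 0: the first row even after re-sorting
food_info.iloc[0:4]  # rows at positions 0 to 3: the first 4 rows even after re-sorting (end position excluded)
food_info["NDB_No"]  # the single column labelled "NDB_No"
food_info[["Zinc", "Copper"]]  # the two columns "Zinc" and "Copper", returned in the order the labels were given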
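The difference between loc (labels) and iloc (positions) only becomes visible once the rows are reordered. A minimal sketch, assuming a small made-up table (the food names and sodium values are placeholders):

```python
import pandas

food = pandas.DataFrame({
    "name": ["butter", "cheese", "salt"],
    "sodium_mg": [11, 621, 38758],
})

# Sorting reorders the rows but keeps their original labels 0, 1, 2.
sorted_food = food.sort_values("sodium_mg", ascending=False)

by_label = sorted_food.loc[0, "name"]         # label 0 is still butter
by_position = sorted_food["name"].iloc[0]     # position 0 is now salt
print(by_label, by_position)                  # butter salt

# loc slices include the end label; iloc slices exclude the end position.
print(len(food.loc[0:1]))   # 2
print(len(food.iloc[0:1]))  # 1
```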
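  • endswith(): select the elements that end with a given string
col_names = food_info.columns.tolist()
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):  # keep the columns measured in grams
        gram_columns.append(c)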
  • Arithmetic on DataFrame columns
max_Iron = food_info["Iron_(mg)"].max()  # maximum
mean_Iron = food_info["Iron_(mg)"].mean()  # mean
iron_grams = food_info["Iron_(mg)"] / 1000  # element-wise: milligrams to grams
food_info["Iron_(g)"] = iron_grams  # add a new column "Iron_(g)"
  • Sorting
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)  # sort in descending order and update food_info in place
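  • Python uses None to mean "no value"; pandas uses NaN ("not a number") for missing values. Both None and NaN count as null values and can be detected with pandas.isnull().
sex = titanic_survival["sex"]
sex_is_null = pandas.isnull(sex)  # returns True/False for each element; also works on a whole DataFrame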
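A small sketch of what the null mask can be used for, assuming a made-up column of ages in which one value is None and one is NaN:

```python
import pandas

ages = pandas.Series([22.0, None, 38.0, float("nan")])

# None is converted to NaN in a float Series; isnull() flags both.
age_is_null = pandas.isnull(ages)
print(age_is_null.tolist())   # [False, True, False, True]

# True counts as 1, so summing the mask counts the nulls.
null_count = int(age_is_null.sum())

# Invert the mask with ~ to keep only the known values.
known_ages = ages[~age_is_null]

# Series.mean() skips null values by default.
print(null_count, ages.mean())  # 2 30.0
```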
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
  • Pivot tables
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)  # same result as the loop above; aggfunc defaults to the mean and can be omitted
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)  # several value columns can be aggregated at once
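To check that a pivot table really matches a manual per-class calculation, here is a sketch on a made-up five-row table (the fares are placeholders). It passes the aggregation as the string "mean" and compares against groupby(), which is not used in the notes but computes the same thing:

```python
import pandas

titanic = pandas.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 8.0, 12.0],
    "survived": [1, 0, 1, 0, 1],
})

# One row per passenger class, holding the mean fare.
class_fares = titanic.pivot_table(index="pclass", values="fare", aggfunc="mean")
print(class_fares["fare"].to_dict())   # {1: 90.0, 2: 20.0, 3: 10.0}

# The same numbers via groupby.
grouped_fares = titanic.groupby("pclass")["fare"].mean()

# Several value columns aggregated with a sum.
class_sums = titanic.pivot_table(index="pclass",
                                 values=["fare", "survived"], aggfunc="sum")
print(class_sums.loc[3, "survived"])   # 1
```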
  • Drop Missing Values
drop_na_rows = titanic_survival.dropna(axis=0)  # axis=0 (or axis='index') drops every row that contains a null; axis=1 (or axis='columns') drops every column that contains a null
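dropna() with no further arguments can discard a lot of data. The sketch below uses a made-up three-row table to compare the two axes, and also shows the subset parameter, which limits the null check to chosen columns:

```python
import pandas

passengers = pandas.DataFrame({
    "age": [22.0, None, 38.0],
    "cabin": [None, "C85", "C123"],
})

# Drop rows with a null in any column: only the last row survives.
complete_rows = passengers.dropna(axis=0)
print(complete_rows.index.tolist())    # [2]

# Drop rows only when "age" is null.
rows_with_age = passengers.dropna(subset=["age"])
print(rows_with_age.index.tolist())    # [0, 2]

# Drop columns with any null: here both columns contain one.
complete_columns = passengers.dropna(axis=1)
print(complete_columns.shape)          # (3, 0)
```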
  • iloc[] (select by position) and loc[] (select by label)
first_row_first_column = new_titanic_survival.iloc[0, 0]
all_rows_first_three_columns = new_titanic_survival.iloc[:, 0:3]
row_index_83_age = new_titanic_survival.loc[83, "age"]
row_index_766_pclass = new_titanic_survival.loc[766, "pclass"]
  • Resetting the row index
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # renumber the rows from 0 and discard the old index
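A brief sketch of why the reset is useful, assuming a made-up column from which one row has been dropped. It also shows what happens without drop=True:

```python
import pandas

ages = pandas.DataFrame({"age": [22.0, None, 38.0]}).dropna()
print(ages.index.tolist())         # [0, 2]: a gap where the row was dropped

# drop=True renumbers the rows and throws the old labels away.
renumbered = ages.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]

# Without drop=True the old labels are kept as a column named "index".
kept = ages.reset_index()
print(kept.columns.tolist())       # ['index', 'age']
```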
  • DataFrame.apply(): define a function and run it on every column (or every row)
# Count the null values in each column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)

# Classify each passenger by age
def generate_age_label(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)  # axis=1 runs the function on each row
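The two uses of apply() can be run end to end on a small made-up table. This is a sketch with hypothetical helper names (count_nulls, age_label), not the course's exact code:

```python
import pandas

passengers = pandas.DataFrame({
    "age": [10.0, None, 40.0],
    "fare": [7.25, 71.3, None],
})


def count_nulls(column):
    """Number of null values in one column."""
    return int(pandas.isnull(column).sum())


def age_label(row):
    """Label one row by its age."""
    age = row["age"]
    if pandas.isnull(age):
        return "unknown"
    return "minor" if age < 18 else "adult"


# Default axis=0: the function receives each column as a Series.
nulls_per_column = passengers.apply(count_nulls)
print(nulls_per_column.to_dict())   # {'age': 1, 'fare': 1}

# axis=1: the function receives each row as a Series.
labels = passengers.apply(age_label, axis=1)
print(labels.tolist())              # ['minor', 'unknown', 'adult']
```

For the null count specifically, passengers.isnull().sum() gives the same result without a custom function.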
  • Series: the core data structure that pandas uses to represent rows and columns.
    A Series is similar to a NumPy vector, except that a Series can use non-integer labels.
from pandas import Series
series_custom = Series(rt_scores, index=film_names)
fiveten = series_custom[5:11]  # a Series can still be sliced by integer position
  • Sorting a Series
sorted_by_index = series_custom.reindex(sorted_index)  # rearrange series_custom to follow the label order given in sorted_index
sc2 = series_custom.sort_index()  # sort by index
sc3 = series_custom.sort_values()  # sort by values
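A minimal sketch of the three sorting approaches on a made-up Series (the film names and scores are placeholders):

```python
from pandas import Series

scores = Series([90, 60, 75], index=["Film Z", "Film A", "Film M"])

# Sort by label.
by_index = scores.sort_index()
print(by_index.index.tolist())   # ['Film A', 'Film M', 'Film Z']

# Sort by value.
by_values = scores.sort_values()
print(by_values.tolist())        # [60, 75, 90]

# reindex() with a sorted list of labels gives the same order as sort_index().
reordered = scores.reindex(sorted(scores.index))
print(reordered.equals(by_index))  # True
```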
  • Series arithmetic
np.add(series_custom, series_custom)  # doubles every value
np.sin(series_custom)  # sine of every value
np.max(series_custom)  # maximum value
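  • Comparing And Filtering
series_custom > 50  # returns a True/False Series
series_greater_than_50 = series_custom[series_custom > 50]  # combine conditions with | and &, each wrapped in parentheses; the keywords or / and do not work element-wise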
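Combining conditions is where mistakes usually happen, so here is a short sketch on made-up scores. Each comparison needs its own parentheses because & and | bind more tightly than > and <:

```python
from pandas import Series

scores = Series([90, 60, 75, 30],
                index=["Film A", "Film B", "Film C", "Film D"])

# Both conditions must hold.
between_50_and_80 = scores[(scores > 50) & (scores < 80)]
print(between_50_and_80.tolist())   # [60, 75]

# Either condition may hold.
extremes = scores[(scores < 40) | (scores > 80)]
print(extremes.tolist())            # [90, 30]
```

Writing `scores > 50 and scores < 80` raises a ValueError, because `and` asks for a single truth value of the whole Series.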
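  • Selecting rows
    iloc[] accepts any of the following:
    An integer, e.g. 5.
    A list or array of integers, e.g. [4, 3, 0].
    A slice object with ints, e.g. 1:7.
    A boolean array.
fandango[0:5]  # the first 5 rows
fandango[140:]  # the rows from position 140 onward
fandango.iloc[50]  # the row at position 50
fandango.iloc[[45, 90]]  # the rows at positions 45 and 90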
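  • set_index()
    Uses one of the columns as the row index.
    Parameters: inplace=True modifies the DataFrame directly instead of returning a new one;
       drop=False keeps the column that becomes the index as a regular column as well.
fandango_films = fandango.set_index('FILM', drop=False)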
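Once a column is the index, rows can be looked up by its values with loc. A sketch on a made-up two-row table (the film titles and scores are placeholders):

```python
import pandas

films = pandas.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2015)"],
    "score": [4.5, 3.0],
})

# drop=False: FILM becomes the index and also stays as a column.
films_indexed = films.set_index("FILM", drop=False)
print(films_indexed.loc["Film B (2015)", "score"])  # 3.0
print("FILM" in films_indexed.columns)              # True

# Default drop=True: FILM is removed from the columns.
films_dropped = films.set_index("FILM")
print(films_dropped.columns.tolist())               # ['score']
```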
  • pandas.Series.value_counts()
data["Do you celebrate Thanksgiving?"].value_counts()  # how many times each distinct answer occurs
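A short sketch of value_counts() on made-up survey answers, including the dropna and normalize parameters, which the notes do not cover:

```python
import pandas

answers = pandas.Series(["Yes", "No", "Yes", "Yes", None])

# Counts per distinct value, most frequent first; nulls are left out.
counts = answers.value_counts()
print(counts.to_dict())          # {'Yes': 3, 'No': 1}

# dropna=False adds a row for the null values.
counts_with_nulls = answers.value_counts(dropna=False)
print(len(counts_with_nulls))    # 3

# normalize=True returns shares instead of counts.
shares = answers.value_counts(normalize=True)
print(shares["Yes"])             # 0.75
```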