Udacity Machine Learning: Outliers

Steps for handling outliers

  • Train
  • Remove the 10% of points with the largest errors
  • Train again, then repeat from the second step
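
The three steps above can be sketched as a small loop. This is only an illustration in plain Python, not the course's implementation: `fit_line`, `drop_worst`, and `iterative_fit` are hypothetical helper names, the data are made up, and a hand-written least-squares fit stands in for a library regressor.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_worst(xs, ys, slope, intercept, fraction=0.1):
    """Discard the `fraction` of points with the largest absolute residual."""
    ranked = sorted(zip(xs, ys),
                    key=lambda p: abs(p[1] - (slope * p[0] + intercept)))
    keep = len(ranked) - int(math.ceil(len(ranked) * fraction))
    kept = ranked[:keep]
    return [x for x, _ in kept], [y for _, y in kept]


def iterative_fit(xs, ys, rounds=1):
    """Train, remove the worst 10%, retrain; repeat `rounds` times."""
    slope, intercept = fit_line(xs, ys)
    for _ in range(rounds):
        xs, ys = drop_worst(xs, ys, slope, intercept)
        slope, intercept = fit_line(xs, ys)
    return slope, intercept, len(xs)


# Made-up data: y = 2x exactly, except one corrupted point at x = 5.
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[5] = 100.0

slope_before, _ = fit_line(xs, ys)
slope_after, intercept_after, n_left = iterative_fit(xs, ys, rounds=1)
print("slope before: %.3f, slope after: %.3f, points left: %d"
      % (slope_before, slope_after, n_left))
```

On this toy data a single round is enough, because the corrupted point has by far the largest residual. On real data, each extra round discards another 10% of the remaining points, so the number of rounds is a judgment call.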

Exercise: Regression slope and score with outliers

  • outlier_removal_regression.py
#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


### load up some practice data with outliers in it
ages = pickle.load( open("practice_outliers_ages.pkl", "r") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "r") )


### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
from sklearn.cross_validation import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like

# answer
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)
print reg.coef_
print reg.intercept_
print reg.score(ages_test, net_worths_test)

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()


### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"


### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        print reg.coef_
        print reg.intercept_
        print reg.score(ages_test, net_worths_test)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()

else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"
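
The value printed by `reg.score(...)` is the coefficient of determination, R². A minimal sketch of that quantity in plain Python follows; `r_squared` is a hypothetical helper written for illustration, and the numbers are made up.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / float(len(actual))
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(actual, actual)              # predictions match exactly
baseline = r_squared(actual, [2.5] * 4)          # always predict the mean
worse = r_squared(actual, [4.0, 3.0, 2.0, 1.0])  # predictions reversed
print("perfect=%s baseline=%s worse=%s" % (perfect, baseline, worse))
```

A perfect fit scores 1, always predicting the mean scores 0, and a fit worse than the mean goes negative. Because the residuals are squared, a few points far from the line can dominate the score.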

Exercise: Regression slope and score after cleaning

  • outlier_cleaner.py
#!/usr/bin/python

import numpy as np
import math


def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """

    ### your code goes here

    # answer
    # flatten the (n, 1) column arrays into 1D arrays
    ages = ages.reshape((1,len(ages)))[0]
    net_worths = net_worths.reshape((1,len(ages)))[0]
    predictions = predictions.reshape((1,len(ages)))[0]
    # pair each point with its absolute residual, then sort smallest error first
    cleaned_data = zip(ages,net_worths,abs(net_worths-predictions))
    cleaned_data = sorted(cleaned_data , key=lambda x: (x[2]))
    # negative end index: drop the last ceil(10%) entries, i.e. the largest errors
    cleaned_num = int(-1 * math.ceil(len(cleaned_data)* 0.1))
    cleaned_data = cleaned_data[:cleaned_num]
    # print cleaned_data
    # print len(cleaned_data)
    return cleaned_data
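
The cleaner above discards `ceil(0.1 * n)` points by slicing with a negative index. As a side note, the same count can be computed with integer arithmetic, which avoids depending on how the floating-point product rounds, and an explicit end index avoids the empty-slice case when nothing is to be dropped. `n_to_drop` and `keep_smallest_errors` are hypothetical names, and the rows are made up.

```python
def n_to_drop(n_points, percent=10):
    """Integer ceil(n_points * percent / 100), with no floating point involved."""
    return (n_points * percent + 99) // 100


def keep_smallest_errors(rows, percent=10):
    """rows are (age, net_worth, error) tuples; drop the largest-error share."""
    ranked = sorted(rows, key=lambda row: row[2])
    # Slice by an explicit end index: ranked[:-k] would return [] when k == 0.
    return ranked[:len(ranked) - n_to_drop(len(ranked), percent)]


rows = [(20 + i, 100.0 + i, float(i)) for i in range(10)]  # errors 0.0 .. 9.0
kept = keep_smallest_errors(rows)
print("kept %d rows, largest remaining error %s"
      % (len(kept), max(row[2] for row in kept)))
```

With 90 training points this drops 9 and keeps 81; with 95 it drops 10.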

Exercise: Identify the biggest Enron outlier

  • enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)


### your code below

# answer
# flatten salary and bonus into one 1D array and take the largest value
solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
max_value = sorted(solve,reverse=True)[0]
print max_value

import pprint
pp = pprint.PrettyPrinter(indent=4)

# print the key whose bonus equals that largest value
for item in data_dict:
    if data_dict[item]['bonus'] == max_value:
        print item # the answer is crazy

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

Exercise: Remove the Enron outlier

  • Remove it; it comes from a spreadsheet bug, not from a real data point
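
Because the dataset is a dictionary keyed by name, removing the bad row amounts to deleting one key. A minimal sketch on a made-up dictionary follows; `toy_dict` and all names and numbers in it are invented for illustration and are not Enron data.

```python
# Hypothetical stand-in for data_dict: two people plus a summary row.
toy_dict = {
    "PERSON A": {"salary": 100, "bonus": 50},
    "PERSON B": {"salary": 200, "bonus": 80},
    "TOTAL": {"salary": 300, "bonus": 130},
}

# pop(key, default) removes the key if present and returns its value;
# supplying a default means no KeyError if the key is already gone.
removed = toy_dict.pop("TOTAL", None)
removed_again = toy_dict.pop("TOTAL", None)

print("remaining keys: %s" % sorted(toy_dict))
```

Passing a default makes the removal safe to run more than once, which is convenient when the same script is re-executed during exploration.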

Exercise: Any more outliers?

  • There may be four more

  • enron_outliers.py

#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
# answer
data_dict.pop( 'TOTAL', 0 )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)


### your code below

# answer
# solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
# max_value = sorted(solve,reverse=True)[0]
# print max_value

# import pprint
# pp = pprint.PrettyPrinter(indent=4)

# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item # the answer is crazy

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

Exercise: Identify two more outliers

Answer

1. LAY KENNETH L
2. SKILLING JEFFREY K

  • enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
# answer
data_dict.pop( 'TOTAL', 0 )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)


### your code below

# answer
# solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
# max_value = sorted(solve,reverse=True)[0]
# print max_value

# import pprint
# pp = pprint.PrettyPrinter(indent=4)

# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item # the answer is crazy

# answer
for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
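
The `'NaN'` checks in the script matter because missing values in this dictionary are stored as the string `'NaN'` rather than as a number. Under Python 3, comparing that string with a number raises `TypeError`; under Python 2 the comparison is evaluated without an error, so a missing value could be flagged by mistake. A minimal sketch of the same filter on made-up records follows; `find_high_earners`, `toy_records`, and all names and values are hypothetical.

```python
def find_high_earners(records, min_salary=1e6, min_bonus=5e6):
    """Return names whose salary and bonus are both present and above the cutoffs."""
    names = []
    for name, fields in records.items():
        salary, bonus = fields["salary"], fields["bonus"]
        if salary == "NaN" or bonus == "NaN":
            continue  # missing value: skip instead of comparing str with float
        if salary > min_salary and bonus > min_bonus:
            names.append(name)
    return sorted(names)


toy_records = {
    "PERSON A": {"salary": 1200000, "bonus": 7000000},
    "PERSON B": {"salary": 900000, "bonus": 8000000},
    "PERSON C": {"salary": "NaN", "bonus": 6000000},
    "PERSON D": {"salary": 1500000, "bonus": "NaN"},
}
print(find_high_earners(toy_records))
```

Only the record with both values present and above both cutoffs is returned; the two records with a missing field are skipped rather than compared.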

Exercise: Should these outliers be removed?

  • Keep them: they are valid data points, not errors