Udacity Machine Learning: Outliers
Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 23:18
Steps for handling outliers
- Train
- Remove the 10% of points with the largest errors
- Train again, then repeat from the second step
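The three steps above can be sketched in a few lines. This is only a minimal illustration on made-up, noise-free data, written in plain Python rather than scikit-learn; `fit_line` and `drop_worst` are hypothetical helper names and are not part of the course code.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_worst(xs, ys, slope, intercept, fraction=0.1):
    """Drop the `fraction` of points with the largest absolute residuals."""
    ranked = sorted(zip(xs, ys),
                    key=lambda p: abs(p[1] - (slope * p[0] + intercept)))
    keep = len(ranked) - int(math.ceil(len(ranked) * fraction))
    kept = ranked[:keep]
    return [p[0] for p in kept], [p[1] for p in kept]


# Made-up data: y = 2x + 1 exactly, with two points overwritten as outliers.
xs = [float(x) for x in range(1, 21)]
ys = [2.0 * x + 1.0 for x in xs]
ys[3] = 50.0    # x = 4, far too high
ys[15] = 0.0    # x = 16, far too low

slope, intercept = fit_line(xs, ys)              # step 1: train
first_slope = slope
for _ in range(2):                               # steps 2 and 3, repeated
    xs, ys = drop_worst(xs, ys, slope, intercept)
    slope, intercept = fit_line(xs, ys)

print("slope before cleaning: %.3f" % first_slope)
print("slope after cleaning:  %.3f" % slope)
print("points kept: %d" % len(xs))
```

On this toy data the first pass already removes both outliers; the second pass only trims points that fit well, which is why the number of repetitions is worth choosing with care.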
Exercise: Regression slope and score with outliers
- outlier_removal_regression.py
#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner

### load up some practice data with outliers in it
ages = pickle.load( open("practice_outliers_ages.pkl", "r") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "r") )

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
from sklearn.cross_validation import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here! Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
# answer
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train,net_worths_train)
print reg.coef_
print reg.intercept_
print reg.score(ages_test,net_worths_test)

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()

### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print "your regression object doesn't exist, or isn't named reg"
    print "can't make predictions to use in identifying outliers"

### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        print reg.coef_
        print reg.intercept_
        print reg.score(ages_test,net_worths_test)
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        print "you don't seem to have regression imported/created,"
        print "   or else your regression object isn't named reg"
        print "   either way, only draw the scatter plot of the cleaned data"
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()
else:
    print "outlierCleaner() is returning an empty list, no refitting to be done"
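The value printed by `reg.score(...)` is the coefficient of determination, R². As a rough sketch of what that number measures, it can be computed by hand; the numbers below are made up and `r_squared` is a hypothetical helper, not something the script defines.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up numbers, purely for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
good_pred = [12.0, 18.0, 31.0, 39.0]   # close to the true values
poor_pred = [25.0, 25.0, 25.0, 25.0]   # always predicts the mean of y_true

print(r_squared(y_true, good_pred))
print(r_squared(y_true, poor_pred))
```

A model that only ever predicts the mean scores 0, and a model that does worse than that scores below 0, so a low or negative score on the test set is a hint that outliers may be pulling the fit away from the bulk of the data.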
Exercise: Regression slope and score after cleaning
- outlier_cleaner.py
#!/usr/bin/python

import numpy as np
import math

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where
        each tuple is of the form (age, net_worth, error).
    """

    ### your code goes here
    # answer
    # flatten the (n, 1) arrays into 1D arrays
    ages = ages.reshape((1,len(ages)))[0]
    net_worths = net_worths.reshape((1,len(ages)))[0]
    predictions = predictions.reshape((1,len(ages)))[0]
    # pair each point with its absolute error, then sort by error (smallest first)
    cleaned_data = zip(ages,net_worths,abs(net_worths-predictions))
    cleaned_data = sorted(cleaned_data , key=lambda x: (x[2]))
    # negative end index: cut the 10% of points with the largest errors
    cleaned_num = int(-1 * math.ceil(len(cleaned_data)* 0.1))
    cleaned_data = cleaned_data[:cleaned_num]
    # print cleaned_data
    # print len(cleaned_data)
    return cleaned_data
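The slicing at the end of `outlierCleaner` relies on a negative end index. A quick, self-contained check of that arithmetic follows; the list of integers is only a stand-in for the sorted errors, and the sizes are picked arbitrarily.

```python
import math

results = {}
for n in (90, 95, 5, 1):
    errors = list(range(n))                  # stand-in for the sorted residuals
    cut = int(-1 * math.ceil(n * 0.1))       # same expression as in outlierCleaner
    kept = errors[:cut]
    results[n] = len(kept)
    print("%d points -> %d kept" % (n, len(kept)))
```

Because the 10% is rounded up, an input with a single point ends up empty, so a caller working with very small data sets may want to guard against that case.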
Exercise: Identify the biggest Enron outlier
- enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

### your code below
# answer
# flatten salary and bonus into one 1D array and take the largest value
solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
max_value = sorted(solve,reverse=True)[0]
print max_value

import pprint
pp = pprint.PrettyPrinter(indent=4)
# print the key of the entry whose bonus equals that value
for item in data_dict:
    if data_dict[item]['bonus'] == max_value:
        print item  # the answer is crazy

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
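The script finds the largest value by flattening both columns and then matching it against `bonus`. An alternative sketch looks up the largest bonus directly in the dictionary. The records below are invented, `bonus_or_zero` is a hypothetical helper, and the code assumes missing values are stored as the string `'NaN'`.

```python
# Made-up records in the same shape as data_dict; names and numbers are invented.
people = {
    "PERSON A": {"salary": 200000, "bonus": 350000},
    "PERSON B": {"salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"salary": 1000000, "bonus": 5000000},
    "TOTAL": {"salary": 1200000, "bonus": 5350000},
}


def bonus_or_zero(name):
    """Treat a missing bonus ('NaN') as 0 so it never wins the comparison."""
    bonus = people[name]["bonus"]
    return 0 if bonus == "NaN" else bonus


top = max(people, key=bonus_or_zero)
print("%s %s" % (top, people[top]["bonus"]))
```

Comparing only the `bonus` field avoids depending on the bonus happening to be larger than every salary.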
Exercise: Remove the Enron outlier
- Remove it: it is a spreadsheet bug, the TOTAL entry.
Exercise: Any more outliers?
Possibly four more.
- enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
# answer
data_dict.pop( 'TOTAL', 0 )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

### your code below
# answer
# solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
# max_value = sorted(solve,reverse=True)[0]
# print max_value
# import pprint
# pp = pprint.PrettyPrinter(indent=4)
# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item  # the answer is crazy

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
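`data_dict.pop( 'TOTAL', 0 )` passes a default as the second argument, so the call does not raise `KeyError` when the key is absent, for example if the same dictionary has already been cleaned. A tiny sketch with a made-up dictionary standing in for `data_dict`:

```python
# Made-up stand-in for data_dict.
records = {"TOTAL": {"bonus": 1}, "SOMEONE": {"bonus": 2}}

removed = records.pop("TOTAL", 0)   # key present: returns its value and deletes it
missing = records.pop("TOTAL", 0)   # key already gone: returns the default instead

print(removed)
print(missing)
print(sorted(records))
```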
Exercise: Identify two more outliers
Answer
1. LAY KENNETH L
2. SKILLING JEFFREY K
- enron_outliers.py
#!/usr/bin/python

import pickle
import sys
import matplotlib.pyplot
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

### read in data dictionary, convert to numpy array
data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "r") )
# answer
data_dict.pop( 'TOTAL', 0 )
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)

### your code below
# answer
# solve = data.reshape( ( 1, len(data) * len(data[0]) ) )[0]
# max_value = sorted(solve,reverse=True)[0]
# print max_value
# import pprint
# pp = pprint.PrettyPrinter(indent=4)
# for item in data_dict:
#     if data_dict[item]['bonus'] == max_value:
#         print item  # the answer is crazy

# answer
for item in data_dict:
    if data_dict[item]['bonus'] != 'NaN' and data_dict[item]['salary'] != 'NaN':
        if data_dict[item]['bonus'] > 5e6 and data_dict[item]['salary'] > 1e6:
            print item

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
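The `'NaN'` guard in the loop matters: in Python 2 a string compares as greater than any number, so without the guard every `'NaN'` entry would pass both thresholds. The same filter can be sketched as a small function; `flagged` is a hypothetical name and the records are invented, with the thresholds taken from the script.

```python
def flagged(records, min_salary=1e6, min_bonus=5e6):
    """Names whose salary and bonus both exceed the thresholds, skipping 'NaN'."""
    names = []
    for name, row in records.items():
        salary, bonus = row["salary"], row["bonus"]
        if "NaN" in (salary, bonus):
            continue                      # missing value: cannot be compared
        if salary > min_salary and bonus > min_bonus:
            names.append(name)
    return sorted(names)


# Made-up records for illustration only.
records = {
    "EXEC ONE": {"salary": 1100000, "bonus": 7000000},
    "EXEC TWO": {"salary": 1050000, "bonus": 5600000},
    "STAFF": {"salary": 250000, "bonus": 300000},
    "UNKNOWN": {"salary": "NaN", "bonus": "NaN"},
}

print(flagged(records))
```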
Exercise: Should these outliers be removed?
- Keep them.