Comparing Linear Regression and Gaussian Process Regression in Python and R

The data used here is the freeway carbon-monoxide data set; details can be found at the link below:
Data download: http://www.statsci.org/data/general/cofreewy.html

In R, first set the working directory with setwd(). We then use R's stepwise regression to select the ordinary linear model with the smallest AIC and fit it by least squares:

w = read.table("COfreewy.txt", header = T)   # read the data set
a = lm(CO ~ ., w)                            # full linear model with all predictors
summary(a)
b = step(a, direction = 'backward')          # backward stepwise selection by AIC
summary(b)
shapiro.test(b$res)                          # Shapiro-Wilk test on the residuals
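For reference, for a Gaussian linear model the AIC that step() minimizes is, up to an additive constant,

$$\mathrm{AIC} = 2k + n \log\!\left(\frac{\mathrm{RSS}}{n}\right),$$

where k is the number of fitted coefficients, n the sample size and RSS the residual sum of squares; this is also the quantity the Python script below recomputes from R² via RSS = (1 − R²) · SST.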

The Shapiro-Wilk normality test on the residuals (the code runs shapiro.test, not a Spearman test) cannot reject the normality assumption (p = 0.05601).

A simplified Python version of the above process follows:

from __future__ import division
import copy
import itertools
from math import log

import numpy as np
import scipy.stats
import pandas as pd
from sklearn import linear_model
from sklearn.gaussian_process import GaussianProcess
from sklearn.cross_validation import LeaveOneOut

# Read the tab-separated data file into a pandas DataFrame.
co_file = open("cofreewy.txt", "r")
lines_list = co_file.readlines()
column_names = [name.strip() for name in lines_list[0].split("\t")]
data_array = None
for element in lines_list[1:]:
    row = [[name.strip() for name in element.split("\t")]]
    if data_array is None:
        data_array = np.array(row)
    else:
        data_array = np.append(data_array, row, axis=0)
co_file.close()

data_frame = pd.DataFrame(data_array, columns=column_names, dtype=np.float64)
y = data_frame.loc[:, ["CO"]].values.ravel()

def generate_AIC(RSquare, y, k):
    # AIC = 2k + n*log(RSS/n), with RSS recovered from R^2 and the total sum of squares.
    SST = np.linalg.norm(y - y.mean()) ** 2
    SSR = (1 - RSquare) * SST
    return 2 * k + y.shape[0] * log(SSR / y.shape[0])

def generate_AIC_combinations():
    # Fit an OLS model on every non-empty subset of the predictors and record its AIC.
    column_names.remove("CO")
    conclusion_dict = dict()
    for i in range(len(column_names)):
        for names in itertools.combinations(column_names, i + 1):
            array_data = data_frame.loc[:, names].values
            clf = linear_model.LinearRegression()
            clf.fit(array_data, y)
            RSquare = clf.score(array_data, y)
            AIC_value = generate_AIC(RSquare, y, len(names))
            conclusion_dict[names] = (AIC_value, clf)
    return conclusion_dict

# Pick the subset of predictors with the smallest AIC.
smallest_aic = np.inf
smallest_aic_names = None
smallest_aic_clf = None
for k, v in generate_AIC_combinations().items():
    if v[0] < smallest_aic:
        smallest_aic = v[0]
        smallest_aic_names = k
        smallest_aic_clf = v[1]

print("The smallest aic is :" + str(smallest_aic))
print("The corresponding paras are :" + str(smallest_aic_names))

# Shapiro-Wilk test on the residuals of the selected model.
y_hat = smallest_aic_clf.predict(data_frame.loc[:, smallest_aic_names].values)
res = y - y_hat
print("The shapiro Test value :" + str(scipy.stats.shapiro(res)))
print("The score of the smallest_aic model :" + str(
    smallest_aic_clf.score(data_frame.loc[:, smallest_aic_names].values, y)))

# Now fit a Gaussian process on the whole data and compare the models by leave-one-out.
# mix_lm is the model built by the book: Traffic, Wind, Wind^2 and cosine terms of Hour.
new_data_dict = copy.deepcopy(dict(data_frame))
new_data_dict["Wind_sq"] = (data_frame.loc[:, ["Wind"]].values ** 2).ravel()
new_data_dict["Hour_cos_0"] = (np.cos(data_frame.loc[:, ["Hour"]].values * 2 * np.pi / 24)).ravel()
new_data_dict["Hour_cos_1"] = (np.cos(data_frame.loc[:, ["Hour"]].values * 4 * np.pi / 24)).ravel()
new_data_frame = pd.DataFrame(new_data_dict)
new_names = ["Traffic", "Wind", "Wind_sq", "Hour_cos_0", "Hour_cos_1"]

gp_res_square_sum = 0
lm_res_square_sum = 0
mix_lm_res_square_sum = 0
loo = LeaveOneOut(len(y))
for train, test in loo:
    # Gaussian process on all predictors.
    gp = GaussianProcess()
    gp_clf = gp.fit(data_frame.loc[train, column_names].values, y[train])
    res = y[test] - gp_clf.predict(data_frame.loc[test, column_names].values)
    gp_res_square_sum += res[0] ** 2
    # Linear model with the AIC-selected predictors.
    lm = linear_model.LinearRegression()
    lm_clf = lm.fit(data_frame.loc[train, smallest_aic_names].values, y[train])
    res = y[test] - lm_clf.predict(data_frame.loc[test, smallest_aic_names].values)
    lm_res_square_sum += res[0] ** 2
    # Linear model with the hand-crafted (mixed) terms.
    mix_lm = linear_model.LinearRegression()
    mix_lm_clf = mix_lm.fit(new_data_frame.loc[train, new_names].values, y[train])
    res = y[test] - mix_lm_clf.predict(new_data_frame.loc[test, new_names].values)
    mix_lm_res_square_sum += res[0] ** 2

print("The residual of gaussian_process :" + str(gp_res_square_sum))
print("The residual of linear_model :" + str(lm_res_square_sum))
print("The residual of mix_linear_model :" + str(mix_lm_res_square_sum))


After rebuilding the same model, the script also experimentally fits a Gaussian process to all of the predictors as a kind of "probabilistic interpolation", in the hope of obtaining a better model, and compares it with the better AIC-selected model from 复杂数据统计方法 (Statistical Methods for Complex Data).

Since the Gaussian process is an "interpolator", comparing R² values on the original data would be meaningless, so leave-one-out cross-validation is used instead to compare the residual sums of squares of the models:

The residual of gaussian_process :0.890643120296
The residual of linear_model :6.60246279043
The residual of mix_linear_model :1.66428991979

The Gaussian process is clearly the best of the three. Gaussian-process regression was used here mainly because of the small sample size (the "probabilistic interpolation" is fairly time-consuming).
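A side note on the code: GaussianProcess and sklearn.cross_validation used above were removed from later versions of scikit-learn. A minimal sketch of the same leave-one-out evaluation of the Gaussian process with the current API, assuming the same data_frame, column_names (predictors only) and y built by the script above, might look like this:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import LeaveOneOut

# Reuses data_frame, column_names and y from the script above.
X = data_frame.loc[:, column_names].values
gp_sse = 0.0
for train, test in LeaveOneOut().split(X):
    gp = GaussianProcessRegressor()   # default constant * RBF kernel
    gp.fit(X[train], y[train])
    res = y[test] - gp.predict(X[test])
    gp_sse += res[0] ** 2
print("Leave-one-out SSE of the Gaussian process: " + str(gp_sse))

The exact numbers will differ from the output above, since the default kernel and its hyperparameter optimization are not identical to those of the old GaussianProcess class.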

From a statistical point of view, one major advantage of Gaussian-process regression over (mixed) linear regression is that its assumptions are much weaker: the Gaussian process only assumes that the responses are jointly Gaussian through the covariance kernel, whereas linear regression assumes that the residuals are normally distributed, and that assumption only holds once the "right" predictors have been extracted. The statistical viewpoint often gains a great deal of leverage by building real-world meaning into its assumptions, but this also limits its generality and fault tolerance to some extent. This is one place where machine learning can be "better than" classical statistics.
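To make the contrast concrete, the two sets of assumptions can be written in their generic textbook forms (not tied to this particular data set):

$$\text{linear model:}\quad y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\ \sigma^2 I)$$
$$\text{GP regression:}\quad f \sim \mathcal{GP}\bigl(0,\ k(x, x')\bigr), \qquad y_i = f(x_i) + \varepsilon_i$$

The linear model is only correctly specified once the mean structure Xβ contains the "right" predictors, while the Gaussian process lets the covariance kernel k absorb the unknown functional form.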











