[Python Data Mining Course] IX. A Simple Analysis of Oxide Data with the LinearRegression Model

This article covers three topics, all part of my Data Mining and Analysis course, and draws mainly on a student's submitted coursework:
1. Regression models and their fundamentals;
2. The UCI dataset repository;
3. A simple data analysis with a regression model.

Previous posts in this series:
[Python Data Mining Course] I. Installing Python and an Introduction to Web Crawling
[Python Data Mining Course] II. K-means Clustering Analysis and an Introduction to Anaconda
[Python Data Mining Course] III. K-means Clustering: Code, Assignments, and Optimization
[Python Data Mining Course] IV. Decision Tree (DTC) Analysis and the Iris Dataset
[Python Data Mining Course] V. Linear Regression Basics and a Diabetes Prediction Example
[Python Data Mining Course] VI. Fundamentals of NumPy, Pandas, and Matplotlib
[Python Data Mining Course] VII. PCA Dimensionality Reduction and Subplot Drawing
[Python Data Mining Course] VIII. Association Rule Mining and Apriori for Shopping Recommendations

      

I hope this article helps you, especially readers who have just started with data mining and big data; these fundamentals really matter. If there are errors or omissions, please bear with me~
Thanks to student ZJ for providing the dataset and the accompanying report; the students really have learned something.

For the underlying lecture material, I strongly recommend Lesson 5: V. Linear Regression Basics and a Diabetes Prediction Example.


I. Algorithm Introduction: Regression Models

1. A First Look at Regression

"Children's heights tend to be above the average of their parents' heights, but generally do not exceed the heights of the parents themselves."
                                                                        -- "Regression Towards Mediocrity in Hereditary Stature"

The term regression was first introduced by the English biostatistician Francis Galton and his student Karl Pearson while studying the inheritance of height between parents and children. Today, the "regression" of regression analysis no longer has anything to do with that trend effect; it simply denotes, in the tradition of Galton's work, a mathematical method for predicting a dependent variable from one or more independent variables.

In a regression model, the variable we want to study or predict is called the dependent variable (also the response or outcome variable), and the variables chosen to explain its variation are called independent variables (also explanatory or predictor variables).



Regression is one of the most powerful tools in statistics. Supervised machine learning algorithms divide into classification and regression algorithms, essentially according to whether the label to be predicted is discrete or continuous.
Classification algorithms predict discrete labels: KNN, decision trees, naive Bayes, AdaBoost, SVM, and logistic regression are all classifiers. Regression algorithms predict continuous values from numeric samples: given an input, regression returns a numeric prediction. This extends classification, since we can now predict continuous quantities rather than only discrete class labels.

The goal of regression is to build a regression equation for predicting a target value; solving a regression problem means finding that equation's regression coefficients. Prediction is then simply: multiply each input by its coefficient and sum the results.
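As a small worked example (the coefficient values here are made up purely for illustration): with intercept $\theta_0 = 1.0$ and coefficients $\theta_1 = 0.5$, $\theta_2 = -0.3$, the prediction for the input $(x_1, x_2) = (2, 4)$ is

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 1.0 + 0.5 \times 2 + (-0.3) \times 4 = 0.8$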


The simplest definition of regression: given a set of points D, fit a function to the points so that the error between the point set and the fitted function is minimized. If the fitted curve is a straight line, we call it linear regression; if it is a quadratic curve, quadratic regression.

2. Linear Regression

Assume the relationship between the predicted value and the sample features is linear. The task of regression analysis is then to estimate the function h from the observed values of the samples X and Y, seeking an approximate functional relationship between the variables. Define:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$

where n is the number of features and $x_j$ is the value of the j-th feature of a training sample, i.e., the j-th entry of the feature vector. For convenience, set $x_0 = 1$; multivariate linear regression can then be written as

$h_\theta(x) = \theta^T x$

where θ and x are both (n+1, 1) column vectors.

Note: "multivariate" and "degree" are two different concepts. "Multivariate" means the equation has several variables; "degree" refers to the highest power appearing in the equation. A multivariate linear equation assumes that the predicted value y satisfies a first-degree (linear) equation in all of the sample's features, as the contrast below shows.
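Two formulas make the contrast concrete (the coefficients are placeholders): the first is multivariate but first-degree (linear); the second has a single variable but is second-degree (quadratic):

$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$   (two variables, degree 1)
$y = \theta_0 + \theta_1 x + \theta_2 x^2$   (one variable, degree 2)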


3. Solving Linear Regression

Regression usually means linear regression, and solving a regression means solving a multivariate linear equation. Suppose we have samples with a continuous label (with values Y) and features X = {x1, x2, ..., xn}; regression means solving for the coefficients θ = θ0, θ1, ..., θn. So, given some X and the corresponding Y, how do we find θ?

In a regression equation, the best regression coefficients for the features are found by minimizing the sum of squared errors. The error here is the difference between a predicted y value and the true y value; simply summing the errors would let positive and negative errors cancel out, so we use the squared error (least squares). The squared error can be written as:

$\sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$


Mathematically, the problem becomes finding the set of θ values that minimizes the expression above; solution methods include gradient descent, the Normal Equation, and others.

Gradient descent has these characteristics: a step size a must be chosen in advance, many iterations are needed, and the feature values must be scaled to a common range. That makes it relatively involved. There is also a solution that needs no iteration, the Normal Equation: simple, convenient, and no feature scaling required. However, the Normal Equation requires computing the transpose and inverse of X, which is expensive, so it becomes slow when there are many features; it is only suitable when the number of features is below 100,000. With more than 100,000 features, use gradient descent.
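A minimal sketch of the two solvers side by side on synthetic data (the data, step size, and iteration count are all made up for illustration):

# -*- coding: utf-8 -*-
import numpy as np

# Synthetic data: y = 4 + 3x plus noise
rng = np.random.RandomState(0)
m = 100
x = 2 * rng.rand(m, 1)
y = 4 + 3 * x[:, 0] + rng.randn(m)
X = np.hstack([np.ones((m, 1)), x])      # prepend the x0 = 1 column

# Normal Equation: theta = (X^T X)^(-1) X^T y
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Batch gradient descent on the squared-error cost
alpha, n_iter = 0.1, 1000                # step size and iterations, chosen by hand
theta_gd = np.zeros(2)
for _ in range(n_iter):
    grad = X.T @ (X @ theta_gd - y) / m  # gradient of (1/2m) * squared error
    theta_gd -= alpha * grad

print(theta_ne)   # roughly [4, 3]
print(theta_gd)   # should closely match the Normal Equation solution

Both solvers recover nearly the same coefficients here; the trade-off only matters at scale, as described above.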

In addition, when X^T X is not invertible, ridge regression comes into play.
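Ridge regression adds an L2 penalty, which amounts to inverting $X^T X + \lambda I$ instead of $X^T X$; that matrix is invertible for any $\lambda > 0$. A minimal sketch with scikit-learn (the data and the alpha value are made up):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = 2 * rng.rand(100, 1)
y = 4 + 3 * x[:, 0] + rng.randn(100)

ridge = Ridge(alpha=1.0)   # alpha is the penalty strength lambda
ridge.fit(x, y)
print(ridge.intercept_, ridge.coef_)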


3.1 Gradient Descent

From the squared error, define the cost function of the linear regression model as:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

The cost function of linear regression is bowl-shaped as a function of the coefficients θ, with a single minimum. The solution procedure is the same as for logistic regression; the difference lies only in the model function hθ(x).
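For completeness, the update that gradient descent repeats for every coefficient until convergence is the standard rule (α is the step size):

$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$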

3.2 Ordinary Least Squares: the Normal Equation
The Normal Equation method is also known as ordinary least squares (OLS). Its characteristic: given the input matrix X, if the inverse of $X^T X$ exists, the coefficients can be computed directly. The theory is equally simple: since we are minimizing the sum of squared errors, set its derivative to zero and solve for the regression coefficients:

$\hat{\theta} = (X^T X)^{-1} X^T y$

Here X is an (m, n+1) matrix (m is the number of samples, n the number of features per sample) and y is an (m, 1) column vector. The formula contains the inverse of $X^T X$, so the equation applies only when that inverse exists.
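In NumPy the Normal Equation is a one-liner; np.linalg.pinv is the safer choice since it also covers the singular case (a sketch on a tiny made-up dataset):

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # three samples; the first column is x0 = 1
y = np.array([2.0, 2.9, 4.1])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # (X^T X)^(-1) X^T y
print(theta)   # [intercept, slope] = [0.9, 1.05]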

4. Measuring Regression Model Performance

A regression equation computed on a dataset is not necessarily the best one. We can measure the quality of a regression equation by the correlation between the predicted values yHat and the original values y; its square, R², lies between 0 and 1, and the higher the value, the better the model.
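A minimal sketch of that check with NumPy (y and yHat here are made-up values):

import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])   # true values (made up)
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # predictions (made up)

r = np.corrcoef(y, y_hat)[0, 1]   # Pearson correlation between y and yHat
print(r, r ** 2)                  # values near 1 indicate a good fit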
Linear regression assumes a linear relationship between the label and the feature values, but sometimes the relationship in the data is more complex and a linear model fits poorly. Then we need polynomial curve regression (multivariate, higher-degree fitting) or other regression models, such as regression trees.

Note:
Multivariate regression can suffer from multicollinearity, autocorrelation, and heteroscedasticity. Linear regression is very sensitive to outliers, which can severely distort the regression line and hence the predictions. Multicollinearity inflates the variance of the coefficient estimates, making them very sensitive to slight changes in the model; the result is unstable coefficient estimates.
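A small sketch of that instability on entirely synthetic data: two nearly identical features are fit twice under tiny perturbations of y, and the coefficients swing noticeably even though the fitted line barely changes:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x1 = rng.rand(50)
x2 = x1 + 1e-4 * rng.randn(50)      # x2 is almost a copy of x1
y = 3 * x1 + 0.01 * rng.randn(50)

for seed in (0, 1):
    noise = 1e-3 * np.random.RandomState(seed).randn(50)
    model = LinearRegression().fit(np.column_stack([x1, x2]), y + noise)
    print(model.coef_)              # the two fits give very different coefficients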


II. Dataset Introduction

In data analysis, the dataset is the most important asset. I recommend the UCI repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/
The glass dataset covers 6 types of glass; the features are the oxide contents that define them (sodium, iron, potassium, and so on). RI: refractive index, Na: sodium, Mg: magnesium, Al: aluminum, Si: silicon, K: potassium, Ca: calcium, Ba: barium, Fe: iron, Type of glass: the class attribute.
The dataset is the file glass.csv, as shown below:



The details are as follows:

id   ri       na     mg    al    si     k     ca    ba   fe    glass_type
1    1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.00  1
2    1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.00  1
3    1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.00  1
4    1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.00  1
5    1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.00  1
6    1.51596  12.79  3.61  1.62  72.97  0.64  8.07  0.0  0.26  1
...  (the remaining rows, 214 samples in total, are omitted here)

PS: We are now entering the fourth paradigm of science. The first paradigm was experiment (Copernicus), the second theory (Newton), the third computation (e.g., the four-color map theorem), and the fourth is data.


III. Regression Model Analysis

The code for the regression model analysis follows.
Note: 1) pandas, Matplotlib, and seaborn are used as three different ways to draw essentially the same plots.
      2) The results of the code are not analyzed in detail; only the methods are shown, to build students' code-reading skills.

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 05 18:10:07 2017
@author: eastmount & zj
"""
# Load the glass identification dataset
import pandas as pd
glass = pd.read_csv("glass.csv")

# Show the shape and the first 6 rows
print(glass.shape)
print(glass.head(6))

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale=1.5)
sns.lmplot(x='al', y='ri', data=glass, ci=None)

# Scatter plot with pandas
glass.plot(kind='scatter', x='al', y='ri')
plt.show()

# The equivalent scatter plot with matplotlib
plt.scatter(glass.al, glass.ri)
plt.xlabel('al')
plt.ylabel('ri')

# Fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)
plt.show()

# Predict for all values of x
glass['ri_pred'] = linreg.predict(X)
print("First six rows with predictions:")
print(glass.head(6))

# Show the predictions as a line
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('Predicted ri')
plt.show()

# Show the regression line and the scatter plot together
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, glass.ri_pred, color='red')
plt.xlabel('al')
plt.ylabel('ri')
plt.show()

# Predict manually from the fitted coefficients (al = 2)
print(linreg.intercept_ + linreg.coef_ * 2)

# The same prediction with the predict method (note the 2-D input)
print(linreg.predict([[2]]))

# Inspect the coefficient for al
print(list(zip(feature_cols, linreg.coef_)))

# Prediction for al = 3
pre = linreg.predict([[3]])
print(pre)

# Examine glass_type
sort = glass.glass_type.value_counts().sort_index()
print(sort)

# Types 1, 2, 3 are window glass;
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
print(glass.head())

plt.scatter(glass.al, glass.household)
plt.xlabel('al')
plt.ylabel('household')
plt.show()

# Fit the linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
linreg.fit(X, y)
glass['household_pred'] = linreg.predict(X)
plt.show()

# Scatter plot with the regression line
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred, color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()
The output is as follows:
First six rows with predictions:
   id       ri     na    mg    al     si     k    ca   ba    fe  glass_type  \
0   1  1.52101  13.64  4.49  1.10  71.78  0.06  8.75  0.0  0.00           1
1   2  1.51761  13.89  3.60  1.36  72.73  0.48  7.83  0.0  0.00           1
2   3  1.51618  13.53  3.55  1.54  72.99  0.39  7.78  0.0  0.00           1
3   4  1.51766  13.21  3.69  1.29  72.61  0.57  8.22  0.0  0.00           1
4   5  1.51742  13.27  3.62  1.24  73.08  0.55  8.07  0.0  0.00           1
5   6  1.51596  12.79  3.61  1.62  72.97  0.64  8.07  0.0  0.26           1

    ri_pred
0  1.519220
1  1.518576
2  1.518130
3  1.518749
4  1.518873
5  1.517932
Part of the output is shown in the figures below: first, the basic scatter plot of al against ri:



Then the predicted linear regression line overlaid on the scatter plot:




The code for fitting the logistic regression:

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 05 18:28:56 2017
@author: eastmount & zj
"""
# ------------- Logistic regression -----------------
# This script continues from the previous one, so first reload the
# data and rebuild the household label and linear-regression prediction
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression

glass = pd.read_csv("glass.csv")
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
glass['household_pred'] = LinearRegression().fit(X, y).predict(X)

# np.where in a nutshell: an element-wise if/else
nums = np.array([5, 15, 8])
print(np.where(nums > 10, 'big', 'small'))

# Convert household_pred into 1 or 0
glass['household_pred_class'] = np.where(glass.household_pred >= 0.5, 1, 0)
print(glass.head(6))

# Fit the logistic regression model and store the class predictions
logreg = LogisticRegression(C=1e9)
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)

# Plot the class predictions
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_class, color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()

# Plot the predicted probabilities
glass['household_pred_prob'] = logreg.predict_proba(X)[:, 1]
plt.scatter(glass.al, glass.household)
plt.plot(glass.al, glass.household_pred_prob, color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()

# Check the predictions for a few examples (note the 2-D inputs)
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))
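A note on the last three calls: predict_proba returns one row per sample with the class probabilities in column order, here [P(household=0), P(household=1)]; the probability curve above is drawn from the second column.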
The output is shown in the figures below:





Finally, I hope this article helps you, especially my students and readers getting into data mining and machine learning. A new semester has started, with all sorts of things going on; I will stay focused on teaching, research, and projects. Keep at it~
With all my love: this was written over an afternoon to the sound of a ukulele. Thank you, my goddess.
(By: Eastmount, 2017-03-05, 6:30 pm, http://blog.csdn.net/eastmount/ )

