Selecting the Best Features with the Chi-Squared and Pearson Scoring Functions

Source: Internet | Published: 看程序员直播 | Editor: 程序博客网 | Time: 2024/06/02 19:28

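Good data-mining results depend on choosing good features as the basis for prediction. This post shows how to find good individual features: each feature is scored with either the chi-squared function or the Pearson correlation function, and the highest-scoring features are kept. The dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/Adult. The code is as follows: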

import os
import pandas as pd
import numpy as np

data_folder = "Adult"
adult_filename = os.path.join(data_folder, "adult.data")
adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt", "Education",
                           "Education-Num", "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex", "Capital-gain",
                           "Capital-loss", "Hours-per-week", "Native-Country",
                           "Earnings-Raws"])

# Inspect summary statistics for weekly working hours.
# print(adult["Hours-per-week"].describe())
# List every distinct value of a column.
# print(adult["Work-Class"].unique())
# Define "long hours" as working more than 40 hours per week.
adult["longHours"] = adult["Hours-per-week"] > 40

# Choosing features is the hard part.
# First, drop columns whose variance is too low: a column that barely varies
# cannot help tell the classes apart. With the default threshold of 0.0,
# VarianceThreshold removes only constant columns.
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
# VarianceThreshold only handles numeric data.
X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values
adult_solve = vt.fit_transform(X)
# print(adult_solve)
# print(vt.variances_)

# Selecting the best features:
# 1. Take a subset of the features, for example X above.
# 2. Build the target class column. The leading space in " >50K" is
#    intentional: the raw file has a space after each comma.
adult["Earning_High"] = adult["Earnings-Raws"] == " >50K"
# 3. Score each feature with the chi-squared function.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# k=3 keeps the three highest-scoring columns.
transformer = SelectKBest(score_func=chi2, k=3)
best_feature_chi2 = transformer.fit_transform(X, adult["Earning_High"])
# The scores show which columns are more useful (this is univariate selection:
# each feature is scored on its own).
print("Chi-squared scores:", transformer.scores_)

# Next, score with the Pearson correlation instead. pearsonr takes two
# one-dimensional arrays, so wrap it to score X one column at a time.
from scipy.stats import pearsonr

def multivariate_pearsonr(X, Y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], Y)
        # SelectKBest keeps the highest scores, so use the absolute value;
        # otherwise a strong negative correlation would rank last.
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return np.array(scores), np.array(pvalues)

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
best_feature_pearson = transformer.fit_transform(X, adult["Earning_High"])
print("Pearson scores:", transformer.scores_)

# Different scoring functions can select different features. Use a decision
# tree to compare the two feature sets; which set is better also depends on
# the learning algorithm used.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier()
scores_chi2 = cross_val_score(clf, best_feature_chi2, adult["Earning_High"], scoring="accuracy")
print("Chi-squared mean accuracy:", np.mean(scores_chi2))
scores_pearsonr = cross_val_score(clf, best_feature_pearson, adult["Earning_High"], scoring="accuracy")
print("Pearson mean accuracy:", np.mean(scores_pearsonr))
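As a side note on the Pearson scoring step, the coefficient can also be computed by hand. The sketch below uses only the standard library and made-up toy data; the helper name pearson_r and the two toy columns are illustrative assumptions, not part of the Adult script. It is meant to show that a feature that falls as the target rises is just as informative as one that rises with it, which is why ranking by the absolute value of the coefficient is usually what one wants.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: target is 1 for "high earner", 0 otherwise.
target = [0, 0, 0, 1, 1, 1]
rises_with_target = [1, 2, 3, 4, 5, 6]
falls_with_target = [6, 5, 4, 3, 2, 1]

r_pos = pearson_r(rises_with_target, target)
r_neg = pearson_r(falls_with_target, target)

# The two coefficients have the same magnitude and opposite signs.
print(r_pos, r_neg)
# Ranking by raw value would put the second feature last;
# ranking by absolute value treats both as equally useful.
print(abs(r_pos), abs(r_neg))
```

If the raw coefficient were passed to a top-k selector, the second toy feature would be discarded even though it separates the classes exactly as well as the first.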

Running the full Adult script produces the following output:
[Image: screenshot of the program output]
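Analysis: the chi-squared and Pearson scores show that different scoring functions can select different best features; within a single scoring function, a higher score indicates a more useful feature. The last two lines of the output are the mean cross-validated accuracies of a decision tree trained on each feature set. In this run the chi-squared feature set achieves the somewhat higher mean accuracy.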
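To make "a higher score indicates a more useful feature" concrete, here is a small standard-library sketch of a chi-squared score for a single non-negative feature. It follows my understanding of the usual approach for this kind of scoring: compare the sum of the feature within each class against the sum that class would receive if the feature were independent of the class. The helper name chi2_score and the toy columns are assumptions for illustration and are not taken from the Adult data.

```python
def chi2_score(values, labels):
    """Chi-squared statistic between one non-negative feature and class labels.

    Assumes the feature has a positive total, so expected counts are non-zero.
    """
    n = len(values)
    total = sum(values)
    score = 0.0
    for cls in set(labels):
        # Observed: how much of the feature's total falls in this class.
        observed = sum(v for v, lab in zip(values, labels) if lab == cls)
        # Expected: the share this class would get under independence.
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

# Hypothetical toy data: True marks the "high earner" class.
labels = [False, False, False, True, True, True]
features = {
    "informative": [1, 2, 3, 10, 12, 14],  # much larger in the True class
    "weak": [4, 6, 5, 6, 5, 7],            # only slightly larger
    "constant": [5, 5, 5, 5, 5, 5],        # identical in both classes
}

scores = {name: chi2_score(col, labels) for name, col in features.items()}
# Keep the k highest-scoring features, here k=2.
top_two = sorted(scores, key=scores.get, reverse=True)[:2]

print(scores)
print(top_two)
```

Because the statistic is built from sums of raw feature values, columns with large magnitudes can receive very large scores, so chi-squared scores are best compared with each other rather than with Pearson coefficients.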