Text Classification in Python with the XGBoost Algorithm

Source: Internet. Published by: js混淆怎么读. Editor: 程序博客网. Time: 2024/05/17 09:34

Reposted from: http://blog.csdn.net/orlandowww/article/details/52967187

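Description

The training set consists of review texts, each labeled with one of three classes: pos, neu, or neg. In train.csv, the first column is the text (content) and the second column is the label.
Installing the Python xgboost package is covered in detail by many guides online.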

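Parameters

The authors of XGBoost divide the parameters into three groups: 1. General parameters, which control the overall behavior. 2. Booster parameters, which control the individual booster at each step. 3. Learning task parameters, which control how the training objective is optimized and evaluated.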
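As a rough sketch of how the three groups fit together: with the Python package, parameters from all three groups usually end up in one flat dictionary passed to `xgb.train`. The split into three dicts below is only for illustration, and the names `general_params`, `booster_params`, and `task_params` are hypothetical. The values are the ones used in the experiment later in this post, plus the default `booster` value written out explicitly.

```python
# Hypothetical grouping for illustration only; XGBoost takes one flat dict.
general_params = {"booster": "gbtree", "silent": 1}
booster_params = {"eta": 0.5, "max_depth": 6}
task_params = {"objective": "multi:softmax", "num_class": 3, "eval_metric": "merror"}

# Merge the three groups into the single dict that would be passed to xgb.train.
param = {**general_params, **booster_params, **task_params}
print(param)
```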

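1. General parameters:

booster [default gbtree]: gbtree uses tree-based models; gblinear uses linear models.
silent [default 0]: when set to 1, silent mode is enabled and no messages are printed. Recent XGBoost releases deprecate silent in favor of verbosity.
nthread [default: the maximum number of threads available]: controls multithreading; set it to the number of cores you want to use. To use all CPU cores, leave it unset and the algorithm detects the count automatically.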

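2. Booster parameters:

Only the tree booster is covered here. It usually performs far better than the linear booster, so the linear booster is rarely used.

eta [default 0.3]: analogous to the learning rate in GBM. Shrinking the weight of each step makes the model more robust. Commonly used values: 0.2, 0.3.
max_depth [default 6]: the maximum depth of a tree. The larger max_depth is, the more specific and local the patterns the model learns, which raises the risk of overfitting. A commonly used value is 6.
gamma [default 0]: the minimum loss reduction required to split a node. The larger the value, the more conservative the algorithm. A suitable value depends closely on the loss function.
subsample [default 1]: the fraction of rows randomly sampled for each tree. Lowering it makes the algorithm more conservative and helps avoid overfitting, but a value that is too small may cause underfitting. Commonly used values: 0.7-1.
colsample_bytree [default 1]: the fraction of columns randomly sampled for each tree (each column is one feature). Commonly used values: 0.7-1.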
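To make the role of gamma concrete, here is a minimal sketch of the split gain formula from the XGBoost paper. It is not the library's own code. The gradient and Hessian sums are made-up numbers, and `reg_lambda` (the L2 regularization term, which is not discussed in this post) is assumed to be 1. A split is worth making only when the gain minus gamma is positive.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty.

    g_* are sums of first-order gradients and h_* are sums of second-order
    gradients (Hessians) of the samples falling into each child node.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - gamma


# Made-up statistics for one candidate split.
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=0.0) > 0)  # split is accepted
print(split_gain(-4.0, 2.0, 3.0, 2.0, gamma=5.0) > 0)  # split is rejected
```

With these numbers the raw gain is about 4.07, so any gamma above that value blocks the split. This is the sense in which a larger gamma makes the algorithm more conservative.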

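3. Learning task parameters:

objective [default reg:linear]: defines the loss function to be minimized. (Recent XGBoost releases name the default reg:squarederror.) binary:logistic is logistic regression for binary classification and returns predicted probabilities. multi:softmax is a multiclass classifier using softmax and returns the predicted class; it also requires the extra parameter num_class (the number of classes). multi:softprob takes the same parameters as multi:softmax but returns the probability of each sample belonging to each class.
eval_metric [default depends on objective]: the metric computed on the validation data. The default is rmse for regression and error for classification. Other values: rmse, root mean squared error; mae, mean absolute error; logloss, negative log-likelihood; error, binary classification error rate (threshold 0.5); merror, multiclass classification error rate; mlogloss, multiclass logloss; auc, area under the curve.
seed [default 0]: the random number seed. Setting it makes randomized results reproducible.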
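The difference between multi:softprob and multi:softmax, and the meaning of merror, can be sketched in plain Python. This only illustrates the definitions and does not call XGBoost. The raw scores and true labels are made up, and the helper names `softmax` and `merror` are chosen for this example.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merror(true_labels, predicted_labels):
    """Multiclass error rate: fraction of samples with a wrong predicted class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


# Made-up raw scores for three samples over the classes 0=neg, 1=neu, 2=pos.
raw_scores = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5], [0.1, 1.0, 0.9]]

probs = [softmax(s) for s in raw_scores]    # per-class probabilities (softprob-style)
classes = [p.index(max(p)) for p in probs]  # most likely class (softmax-style)

print(classes)
print(merror([0, 2, 2], classes))
```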

Experiment

Code

# -*- coding: utf-8 -*-
import csv

import jieba
import numpy as np
import xgboost as xgb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

jieba.load_userdict('wordDict.txt')


# Read the training set
def readtrain():
    with open('Train.csv', 'r', encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile)
        column1 = [row for row in reader]
    content_train = [i[1] for i in column1[1:]]  # text content (index 1), header row skipped
    opinion_train = [i[2] for i in column1[1:]]  # label (index 2), header row skipped
    print('训练集有 %s 条句子' % len(content_train))  # "the training set has N sentences"
    train = [content_train, opinion_train]
    return train


# Convert a list of UTF-8 byte strings to unicode strings
def changeListCode(b):
    a = []
    for i in b:
        a.append(i.decode('utf8'))
    return a


# Segment each text into words and join the words with spaces
def segmentWord(cont):
    c = []
    for i in cont:
        a = list(jieba.cut(i))
        b = " ".join(a)
        c.append(b)
    return c


# Encode the classes as numbers: pos:2, neu:1, neg:0
def transLabel(labels):
    for i in range(len(labels)):
        if labels[i] == 'pos':
            labels[i] = 2
        elif labels[i] == 'neu':
            labels[i] = 1
        elif labels[i] == 'neg':
            labels[i] = 0
        else:
            print("label无效：", labels[i])  # "invalid label"
    return labels


train = readtrain()
content = segmentWord(train[0])
opinion = transLabel(train[1])  # classes must be numeric
opinion = np.array(opinion)     # labels must be a numpy array

# First 7000 samples for training, the rest for testing
train_content = content[:7000]
train_opinion = opinion[:7000]
test_content = content[7000:]
test_opinion = opinion[7000:]

# Fit the vocabulary and IDF weights on the training split only
vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))
weight = tfidf.toarray()
print(tfidf.shape)
test_tfidf = tfidftransformer.transform(vectorizer.transform(test_content))
test_weight = test_tfidf.toarray()
print(test_weight.shape)

dtrain = xgb.DMatrix(weight, label=train_opinion)
dtest = xgb.DMatrix(test_weight, label=test_opinion)  # label is optional here; it is supplied only to evaluate performance
param = {'max_depth': 6, 'eta': 0.5, 'eval_metric': 'merror', 'silent': 1,
         'objective': 'multi:softmax', 'num_class': 3}  # parameters
evallist = [(dtrain, 'train'), (dtest, 'test')]  # optional; used to monitor performance
num_round = 50  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round, evallist)
preds = bst.predict(dtest)

Output

Building prefix dict from the default dictionary ...
Loading model from cache c:\users\www\appdata\local\temp\jieba.cache
Loading model cost 0.366 seconds.
Prefix dict has been built succesfully.
训练集有 10981 条句子
(7000, 14758)
(3981L, 14758L)
[0] train-merror:0.337857   test-merror:0.409194
[1] train-merror:0.322000   test-merror:0.401658
[2] train-merror:0.312429   test-merror:0.401909
[3] train-merror:0.300857   test-merror:0.387340
[4] train-merror:0.293143   test-merror:0.389601
[5] train-merror:0.286286   test-merror:0.390857
[6] train-merror:0.279000   test-merror:0.388847
[7] train-merror:0.270571   test-merror:0.387340
[8] train-merror:0.263857   test-merror:0.379804
[9] train-merror:0.257286   test-merror:0.376036
[10]    train-merror:0.248000   test-merror:0.374278
[11]    train-merror:0.241857   test-merror:0.371012
[12]    train-merror:0.237000   test-merror:0.369254
[13]    train-merror:0.231571   test-merror:0.366491
[14]    train-merror:0.225857   test-merror:0.365737
[15]    train-merror:0.220286   test-merror:0.365988
[16]    train-merror:0.216286   test-merror:0.364732
[17]    train-merror:0.212286   test-merror:0.360462
[18]    train-merror:0.210143   test-merror:0.357699
[19]    train-merror:0.205143   test-merror:0.356694
[20]    train-merror:0.202286   test-merror:0.357699
[21]    train-merror:0.198571   test-merror:0.358201
[22]    train-merror:0.195429   test-merror:0.356443
[23]    train-merror:0.192143   test-merror:0.358955
[24]    train-merror:0.189286   test-merror:0.358955
[25]    train-merror:0.186571   test-merror:0.354936
[26]    train-merror:0.183429   test-merror:0.353680
[27]    train-merror:0.181714   test-merror:0.353429
[28]    train-merror:0.178286   test-merror:0.353680
[29]    train-merror:0.174143   test-merror:0.352675
[30]    train-merror:0.172286   test-merror:0.352675
[31]    train-merror:0.171286   test-merror:0.353680
[32]    train-merror:0.168857   test-merror:0.354434
[33]    train-merror:0.167429   test-merror:0.352675
[34]    train-merror:0.164286   test-merror:0.350917
[35]    train-merror:0.160714   test-merror:0.348907
[36]    train-merror:0.159000   test-merror:0.346898
[37]    train-merror:0.157571   test-merror:0.346395
[38]    train-merror:0.156286   test-merror:0.347400
[39]    train-merror:0.154571   test-merror:0.346647
[40]    train-merror:0.153714   test-merror:0.345642
[41]    train-merror:0.152857   test-merror:0.346647
[42]    train-merror:0.150000   test-merror:0.345391
[43]    train-merror:0.148143   test-merror:0.345893
[44]    train-merror:0.145857   test-merror:0.344135
[45]    train-merror:0.144000   test-merror:0.341874
[46]    train-merror:0.143000   test-merror:0.342879
[47]    train-merror:0.142714   test-merror:0.341874
[48]    train-merror:0.141714   test-merror:0.341372
[49]    train-merror:0.138286   test-merror:0.339362
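A log in this format can be parsed to find the round with the lowest test error and to measure the gap between training and test error. The snippet below is a small helper sketch that is not part of the original post. It assumes each round is printed as `[round] train-merror:x test-merror:y` and uses four lines copied from the log above as sample input.

```python
import re

# Four lines copied from the training log above.
log = """[0] train-merror:0.337857   test-merror:0.409194
[25]    train-merror:0.186571   test-merror:0.354936
[48]    train-merror:0.141714   test-merror:0.341372
[49]    train-merror:0.138286   test-merror:0.339362"""

pattern = re.compile(r"\[(\d+)\]\s+train-merror:([\d.]+)\s+test-merror:([\d.]+)")
rows = [(int(r), float(tr), float(te)) for r, tr, te in pattern.findall(log)]

# Pick the round with the smallest test error.
best_round, best_train, best_test = min(rows, key=lambda row: row[2])
print(best_round, best_test)
print(round(best_test - best_train, 6))  # gap between test and training error
```

In the sampled lines the test error is still lowest at the final round, while the training error is far below it. That pattern suggests the model fits the training data much more closely than the held-out data.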

Reposted from: http://blog.csdn.net/orlandowww/article/details/52966608

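Description

The training set consists of review texts, each labeled with one of three classes: pos, neu, or neg. In train.csv, the first column is the text (content) and the second column is the label. SVC can be trained and then used for prediction on its own, or training and prediction can be combined in a Pipeline.
SVC's penalty parameter C: the default is 1.0. A larger C penalizes misclassification more heavily and pushes the model toward classifying every training sample correctly, so accuracy on the training set is high but generalization is weaker. A smaller C penalizes misclassification less and tolerates some errors, which tends to generalize better.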
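For reference, the role of C can be read off the standard soft-margin SVM objective. This is the textbook formulation for the binary case and is not stated in the original post. Here $\xi_i$ are slack variables measuring how far sample $i$ violates the margin, and C weights their sum against the margin term:

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to}\quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, n
```

A large C makes margin violations expensive, so the optimizer fits the training samples more tightly. A small C lets more violations through in exchange for a wider margin.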
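Although TF-IDF weights are very widely used, they do not give the best performance on every text problem. On some problems, Boolean weights (1 if a word occurs in a document, 0 if it does not) perform better. They can be enabled by passing binary=True to CountVectorizer.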
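The difference between count weights and Boolean weights can be sketched in plain Python, so that it runs without scikit-learn. The two documents are made up and already split into space-separated tokens, and the helpers `count_vector` and `bool_vector` are hypothetical names for this example. They imitate what `CountVectorizer()` and `CountVectorizer(binary=True)` produce conceptually, not their exact tokenization rules.

```python
from collections import Counter

# Two made-up, already segmented documents (tokens separated by spaces).
docs = ["good good good car", "car average"]

# Shared vocabulary, sorted so every vector uses the same column order.
vocab = sorted({tok for d in docs for tok in d.split()})


def count_vector(doc):
    """Term-count weights: how many times each vocabulary word occurs."""
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]


def bool_vector(doc):
    """Boolean weights: 1 if the word occurs at all, otherwise 0."""
    present = set(doc.split())
    return [1 if t in present else 0 for t in vocab]


print(vocab)
print(count_vector(docs[0]), bool_vector(docs[0]))
```

With Boolean weights, the repeated word in the first document counts once rather than three times. This can help when repetition carries little extra signal.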

Experiment

Code

# -*- coding: utf-8 -*-
import csv

import jieba
import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

jieba.load_userdict('wordDict.txt')


# Read the training set
def readtrain():
    with open('Train.csv', 'r', encoding='utf-8', newline='') as csvfile:
        reader = csv.reader(csvfile)
        column1 = [row for row in reader]
    content_train = [i[1] for i in column1[1:]]  # text content (index 1), header row skipped
    opinion_train = [i[2] for i in column1[1:]]  # label (index 2), header row skipped
    print('训练集有 %s 条句子' % len(content_train))  # "the training set has N sentences"
    train = [content_train, opinion_train]
    return train


# Convert a list of UTF-8 byte strings to unicode strings
def changeListCode(b):
    a = []
    for i in b:
        a.append(i.decode('utf8'))
    return a


# Segment each text into words and join the words with spaces
def segmentWord(cont):
    c = []
    for i in cont:
        a = list(jieba.cut(i))
        b = " ".join(a)
        c.append(b)
    return c


# corpus = ["我 来到 北京 清华大学", "他 来到 了 网易 杭研 大厦", "小明 硕士 毕业 与 中国 科学院"]
train = readtrain()
content = segmentWord(train[0])
opinion = train[1]

# Split: first 7000 samples for training, the rest for testing
train_content = content[:7000]
test_content = content[7000:]
train_opinion = opinion[:7000]
test_opinion = opinion[7000:]

# Compute the weights
vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))  # term-count matrix first, then TF-IDF values
print(tfidf.shape)

# Standalone prediction (disabled)
'''
from sklearn.naive_bayes import MultinomialNB
word = vectorizer.get_feature_names_out()
weight = tfidf.toarray()
# Classifier
clf = MultinomialNB().fit(tfidf, train_opinion)
docs = ["在 标准 状态 下 途观 的 行李厢 容积 仅 为 400 L", "新 买 的 锋驭 怎么 没有 随 车 灭火器"]
new_tfidf = tfidftransformer.transform(vectorizer.transform(docs))
predicted = clf.predict(new_tfidf)
print(predicted)
'''

# Training and prediction in one pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SVC(C=0.99, kernel='linear'))])
text_clf = text_clf.fit(train_content, train_opinion)
predicted = text_clf.predict(test_content)
print('SVC', np.mean(predicted == np.array(test_opinion)))  # accuracy on the test split
print(set(predicted))
# print(metrics.confusion_matrix(test_opinion, predicted))  # confusion matrix

# Grid search over parameters (disabled)
'''
parameters = {'vect__max_df': (0.4, 0.5, 0.6, 0.7), 'vect__max_features': (None, 5000, 10000, 15000),
              'tfidf__use_idf': (True, False)}
grid_search = GridSearchCV(text_clf, parameters, n_jobs=1, verbose=1)
grid_search.fit(content, opinion)
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
'''

Output

Building prefix dict from the default dictionary ...
Loading model from cache c:\users\www\appdata\local\temp\jieba.cache
Loading model cost 0.383 seconds.
Prefix dict has been built succesfully.
训练集有 10981 条句子
(7000, 14688)
SVC 0.701582516956
set(['neg', 'neu', 'pos'])
