Reposted from: http://blog.csdn.net/orlandowww/article/details/52744355
The Averaged Perceptron Algorithm
The perceptron is a simple and effective binary classification algorithm. It learns a separating hyperplane, parameterized by a weight vector w, which is used for prediction: for a sample x, the perceptron computes the score y = ⟨w, x⟩, and the final predicted label is sign(y). The weights w are updated only when a prediction is wrong.
The averaged perceptron is trained exactly like the standard perceptron; the difference is that after each training sample xi it keeps track of the weights used so far, and when training ends it averages all of them. The averaged weights then serve as the final decision weights. This parameter averaging damps the oscillation that an overly large learning rate can cause during training.
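The averaging idea can be sketched with a minimal binary perceptron (a toy sketch, not the tagger implementation below; the feature vectors and labels are made up for illustration):

```python
def train_averaged_perceptron(samples, n_iter=10):
    """samples: list of (feature vector, label) pairs, label in {-1, +1}."""
    dim = len(samples[0][0])
    w = [0.0] * dim        # current weights
    w_sum = [0.0] * dim    # running sum of the weights after every step
    for _ in range(n_iter):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score >= 0 else -1) != y:   # update only on a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
            # accumulate the (possibly updated) weights at every step
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
    steps = n_iter * len(samples)
    return [s / steps for s in w_sum]   # averaged weights


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

Rather than storing every intermediate weight vector, the full code below keeps running totals and timestamps per (feature, tag) pair and computes the same average lazily, which is far cheaper when the feature space is large.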
Part-of-Speech Tagging
Part-of-speech tagging is a supervised learning task. We first read in the training corpus, train a tagging model with the averaged perceptron algorithm, and store it on disk. When predictions are needed, we load the tagging model from disk, read in the test corpus, and tag it automatically.
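The save/load round trip in that workflow can be sketched with pickle (the helper names here are illustrative; note that pickle files must be opened in binary mode):

```python
import os
import pickle
import tempfile


def save_model(weights, path):
    with open(path, 'wb') as f:   # pickle requires binary mode
        pickle.dump(weights, f)


def load_model(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


# round-trip demo with a tiny made-up weights dict
demo_weights = {'i suffix ood': {'adj': 1.0}}
demo_path = os.path.join(tempfile.mkdtemp(), 'tagger.pickle')
save_model(demo_weights, demo_path)
restored = load_model(demo_path)
```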
Process
1. To store the weights, we build a two-level dictionary mapping feature -> tag -> weight (the structure is shown in a figure in the original post).
2. Words are read from the corpus; a "." marks the end of a sentence. The preceding words are grouped into one sentence, stored as a 2-tuple, e.g. (['good', 'man'], ['adj', 'n']): the first list holds the words of the sentence, the second list their corresponding tags. All sentences in the corpus are stored in the list training_data (sentences), of the form [([ ],[ ]), ([ ],[ ]), ([ ],[ ]), ...].
3. During training, weights are updated by adding 1 to the weight of each (feature, correct tag) pair and subtracting 1 from the weight of each (feature, wrongly predicted tag) pair. The correct tag's weights are rewarded while the wrong tag's weights are penalized.
4. To train a more general model, the data is preprocessed before feature extraction:
- All English words are lowercased
- Four-digit numbers between 1800 and 2100 are replaced with !YEAR
- Other numbers are replaced with !DIGITS
- A dedicated module could also recognize dates, phone numbers, emails, and so on, but we leave that extension out for now
5. Features extracted for the i-th word:
- The word's first letter
- The word's suffix
- The tag of word i-1
- The suffix of word i-1
- The tag of word i-2
- The suffix of word i-2
- The word at position i+1 and its suffix, and so on (tags of later words are not yet available at tagging time, so the code uses the words themselves)
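A small sketch of the data structures and preprocessing rules from steps 1, 2 and 4 (the feature names and weight values below are made up for illustration, not taken from the real tagger):

```python
# Step 4's preprocessing rules, following the prose above
# (the full tagger also has a !HYPHEN rule not listed in the text).
def normalize(word):
    if word.isdigit() and len(word) == 4 and 1800 <= int(word) <= 2100:
        return '!YEAR'
    if word[0].isdigit():
        return '!DIGITS'
    return word.lower()


# Step 1: weights as a two-level dict, feature -> tag -> weight.
weights = {
    'i suffix ood': {'adj': 1.0, 'n': -0.5},
    'i-1 tag adj': {'n': 2.0},
}

# Step 2: each sentence is a (words, tags) tuple.
training_data = [(['good', 'man'], ['adj', 'n'])]
```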
Experiments
Code 1:
AP_algorithm.py
from collections import defaultdict
import pickle


class AveragedPerceptron(object):
    def __init__(self):
        self.weights = {}               # feature -> {tag: weight}
        self.classes = set()
        self._totals = defaultdict(int)     # accumulated weight per (feature, tag)
        self._tstamps = defaultdict(int)    # step at which each weight last changed
        self.i = 0                      # number of update steps seen

    def predict(self, features):
        scores = defaultdict(float)
        for feat, value in features.items():
            if feat not in self.weights or value == 0:
                continue
            weights = self.weights[feat]
            for label, weight in weights.items():
                scores[label] += value * weight
        # break score ties deterministically by label
        return max(self.classes, key=lambda label: (scores[label], label))

    def update(self, truth, guess, features):
        def upd_feat(c, f, w, v):
            param = (f, c)
            # lazily credit the old weight for every step it was in effect
            self._totals[param] += (self.i - self._tstamps[param]) * w
            self._tstamps[param] = self.i
            self.weights[f][c] = w + v

        self.i += 1
        if truth == guess:
            return None
        for f in features:
            weights = self.weights.setdefault(f, {})
            upd_feat(truth, f, weights.get(truth, 0.0), 1.0)
            upd_feat(guess, f, weights.get(guess, 0.0), -1.0)
        return None

    def average_weights(self):
        for feat, weights in self.weights.items():
            new_feat_weights = {}
            for clas, weight in weights.items():
                param = (feat, clas)
                total = self._totals[param]
                total += (self.i - self._tstamps[param]) * weight
                averaged = round(total / float(self.i), 3)
                if averaged:
                    new_feat_weights[clas] = averaged
            self.weights[feat] = new_feat_weights
        return None

    def save(self, path):
        with open(path, 'wb') as f:   # pickle requires binary mode
            pickle.dump(dict(self.weights), f)

    def load(self, path):
        with open(path, 'rb') as f:
            self.weights = pickle.load(f)
        return None
Code 2:
AP_PosTagging.py
import os
import random
from collections import defaultdict
import pickle
import logging

from AP_algorithm import AveragedPerceptron

PICKLE = "data/tagger-0.1.0.pickle"


class PerceptronTagger():
    START = ['-START-', '-START2-']
    END = ['-END-', '-END2-']
    AP_MODEL_LOC = os.path.join(os.path.dirname(__file__), PICKLE)

    def __init__(self, load=True):
        self.model = AveragedPerceptron()
        self.tagdict = {}
        self.classes = set()
        if load:
            self.load(self.AP_MODEL_LOC)

    def tag(self, corpus):
        s_split = lambda t: t.split('\n')
        w_split = lambda s: s.split()

        def split_sents(corpus):
            for s in s_split(corpus):
                yield w_split(s)

        prev, prev2 = self.START
        tokens = []
        for words in split_sents(corpus):
            context = self.START + [self._normalize(w) for w in words] + self.END
            for i, word in enumerate(words):
                tag = self.tagdict.get(word)
                if not tag:
                    features = self._get_features(i, word, context, prev, prev2)
                    tag = self.model.predict(features)
                tokens.append((word, tag))
                prev2 = prev
                prev = tag
        return tokens

    def train(self, sentences, save_loc=None, nr_iter=5):
        '''
        sentences: a list of (words, tags) tuples
        save_loc: where to store the model
        nr_iter: number of training iterations
        '''
        self._make_tagdict(sentences)
        self.model.classes = self.classes
        for iter_ in range(nr_iter):
            c = 0
            n = 0
            for words, tags in sentences:
                prev, prev2 = self.START
                context = self.START + [self._normalize(w) for w in words] \
                          + self.END
                for i, word in enumerate(words):
                    guess = self.tagdict.get(word)
                    if not guess:
                        feats = self._get_features(i, word, context, prev, prev2)
                        guess = self.model.predict(feats)
                        self.model.update(tags[i], guess, feats)
                    prev2 = prev
                    prev = guess
                    c += guess == tags[i]
                    n += 1
            random.shuffle(sentences)
            logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
        self.model.average_weights()
        if save_loc is not None:
            pickle.dump((self.model.weights, self.tagdict, self.classes),
                        open(save_loc, 'wb'), -1)
        return None

    def load(self, loc):
        try:
            w_td_c = pickle.load(open(loc, 'rb'))
        except IOError:
            msg = "Missing trontagger.pickle file."
            raise IOError(msg)
        self.model.weights, self.tagdict, self.classes = w_td_c
        self.model.classes = self.classes
        return None

    def _normalize(self, word):
        if '-' in word and word[0] != '-':
            return '!HYPHEN'
        elif word.isdigit() and len(word) == 4:
            return '!YEAR'
        elif word[0].isdigit():
            return '!DIGITS'
        else:
            return word.lower()

    def _get_features(self, i, word, context, prev, prev2):
        def add(name, *args):
            features[' '.join((name,) + tuple(args))] += 1

        i += len(self.START)   # offset into the padded context
        features = defaultdict(int)
        add('bias')
        add('i suffix', word[-3:])
        add('i pref1', word[0])
        add('i-1 tag', prev)
        add('i-2 tag', prev2)
        add('i tag+i-2 tag', prev, prev2)
        add('i word', context[i])
        add('i-1 tag+i word', prev, context[i])
        add('i-1 word', context[i - 1])
        add('i-1 suffix', context[i - 1][-3:])
        add('i-2 word', context[i - 2])
        add('i+1 word', context[i + 1])
        add('i+1 suffix', context[i + 1][-3:])
        add('i+2 word', context[i + 2])
        return features

    def _make_tagdict(self, sentences):
        counts = defaultdict(lambda: defaultdict(int))
        for words, tags in sentences:
            for word, tag in zip(words, tags):
                counts[word][tag] += 1
                self.classes.add(tag)
        freq_thresh = 20
        ambiguity_thresh = 0.97
        # words that are frequent and nearly unambiguous bypass the model
        for word, tag_freqs in counts.items():
            tag, mode = max(tag_freqs.items(), key=lambda item: item[1])
            n = sum(tag_freqs.values())
            if n >= freq_thresh and (float(mode) / n) >= ambiguity_thresh:
                self.tagdict[word] = tag


def _pc(n, d):
    return (float(n) / d) * 100


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    tagger = PerceptronTagger(False)
    try:
        tagger.load(PICKLE)
        print(tagger.tag('how are you ?'))
        logging.info('Start testing...')
        right = 0.0
        total = 0.0
        sentence = ([], [])
        for line in open('data/test.txt'):
            params = line.split()
            if len(params) != 2:
                continue
            sentence[0].append(params[0])
            sentence[1].append(params[1])
            if params[0] == '.':
                text = ''
                words = sentence[0]
                tags = sentence[1]
                for i, word in enumerate(words):
                    text += word
                    if i < len(words) - 1:
                        text += ' '
                outputs = tagger.tag(text)
                assert len(tags) == len(outputs)
                total += len(tags)
                for o, t in zip(outputs, tags):
                    if o[1].strip() == t:
                        right += 1
                sentence = ([], [])
        logging.info("Precision : %f", right / total)
    except IOError:
        logging.info('Reading corpus...')
        training_data = []
        sentence = ([], [])
        for line in open('data/train.txt'):
            params = line.split('\t')
            sentence[0].append(params[0])
            sentence[1].append(params[1])
            if params[0] == '.':
                training_data.append(sentence)
                sentence = ([], [])
        logging.info('training corpus size : %d', len(training_data))
        logging.info('Start training...')
        tagger.train(training_data, save_loc=PICKLE)
Results:
We tested on several corpora and compared accuracy with NLTK's part-of-speech tagger.
TAGGER          WSJ    ABC    WEB
NLTK            94.0   91.5   88.4
AP_PosTagging   96.8   94.8   91.8
Appendix:
The training and test corpora for POS tagging can be built yourself or downloaded from here. Put the data folder in the same directory as the code.
Note: the training corpus is only meant to demonstrate the POS-tagging workflow. A very small training set was chosen, so the accuracy of a model trained on it does not reflect realistic performance.