NLTK Study Notes (4)

Source: Internet    Editor: 程序博客网    Date: 2024/05/16 10:57

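Text Classification

1. Supervised classification

Let's start with a classic diagram. [Figure not preserved in this copy.]

(1) Gender identification

We use a feature extractor to process the names data, then split the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier. We then test the classifier on some names that did not appear in the training data (Neo and Trinity, from The Matrix):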

>>> import nltk
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
...
>>> from nltk.corpus import names
>>> import random
>>> # Note: this rebinds `names` from the corpus reader to a list of (name, label) pairs.
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> f = [(gender_features(n), g) for (n, g) in names]
>>> trainset, testset = f[500:], f[:500]
>>> c = nltk.NaiveBayesClassifier.train(trainset)
>>> c.classify(gender_features('Neo'))
'male'
>>> c.classify(gender_features('Trinity'))
'female'

>>> print(nltk.classify.accuracy(c, testset))
0.76
>>> c.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     34.4 : 1.0
             last_letter = u'k'             male : female =     29.9 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.5 : 1.0

>>> print(nltk.classify.accuracy(c, trainset))
0.763030628694
>>> # `names` is a list, so entries are read by index: names[0], not names(0).
>>> print(names[0])
(u'Kourtney', 'female')
>>> print(names[1])
(u'Mariellen', 'female')
>>> print(names[100])
(u'Effie', 'female')
>>> print(names[500])
(u'Kalindi', 'female')
>>> print(names[300])
(u'Loraine', 'female')
>>> print(names[30])
(u'Munroe', 'male')
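To make the idea behind the classifier more concrete, here is a minimal from-scratch sketch of naive Bayes over the same last_letter feature. It is written for Python 3 and uses only the standard library. It is not NLTK's implementation: the add-one smoothing, the helper names train_naive_bayes and classify, and the toy name list are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def gender_features(word):
    # Same single feature as in the session above: the final letter.
    return {'last_letter': word[-1]}

def train_naive_bayes(labeled_featuresets):
    """Count labels and, per (label, feature), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    seen_values = defaultdict(set)        # feature -> values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            seen_values[fname].add(fval)
    return label_counts, value_counts, seen_values

def classify(model, features):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)            # log prior
        for fname, fval in features.items():
            count = value_counts[(label, fname)][fval]
            n_values = len(seen_values[fname]) + 1   # +1 leaves room for unseen values
            score += math.log((count + 1) / (n_label + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, far smaller than the names corpus.
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Sophia', 'female'),
             ('Emily', 'female'), ('Mark', 'male'), ('Frank', 'male'),
             ('Jack', 'male'), ('Peter', 'male')]
toy_model = train_naive_bayes([(gender_features(n), g) for n, g in toy_names])
toy_guess_a = classify(toy_model, gender_features('Julia'))   # ends in 'a'
toy_guess_k = classify(toy_model, gender_features('Derek'))   # ends in 'k'
```

With this toy data, a final 'a' is seen only on female names and a final 'k' only on male names, which mirrors the two strongest features reported by show_most_informative_features above.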

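(2) Choosing the right features

Start with every feature your intuition suggests, then use trial and error together with error analysis to check which features are actually useful.

There is a limit to the number of features you should use with a given learning algorithm: if you provide too many, the algorithm will depend heavily on idiosyncrasies of your training data and will not generalize well to new examples. This problem is known as overfitting, and it is especially troublesome when working with small training sets. The book's overfitting example follows. (Note that the accuracy in the book's example is 0.748, the point being that more features lead to overfitting. In my run the accuracy actually went up slightly. Given the extra computational cost of the additional features, the gain may not be worth it, although I cannot be sure: the dataset is too small to verify this.)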

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
...

>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}
>>> featuresets2 = [(gender_features2(n), g) for (n, g) in names]
>>> train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
>>> classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
>>> print(nltk.classify.accuracy(classifier2, test_set2))
0.782
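Overfitting usually shows up as a gap between accuracy on the training data and accuracy on held-out data. The sketch below (Python 3, standard library only) illustrates that gap with a deliberately extreme case: a lookup-table classifier whose only feature is the whole name. It is not a naive Bayes classifier, and the helper names train_lookup and accuracy and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def whole_name(name):
    # Over-specific feature: every name is its own feature value.
    return name.lower()

def last_letter(name):
    # More general feature, as in gender_features above.
    return name[-1].lower()

def train_lookup(labeled, extract):
    """Memorize the majority label per feature value; fall back to the overall majority."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for name, label in labeled:
        by_value[extract(name)][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0]
             for value, counts in by_value.items()}

    def classify(name):
        return table.get(extract(name), default)
    return classify

def accuracy(classify, labeled):
    """Fraction of (name, label) pairs that the classifier gets right."""
    return sum(classify(name) == label for name, label in labeled) / len(labeled)

# Hypothetical toy data; the real experiment uses the names corpus.
toy_train = [('Anna', 'female'), ('Maria', 'female'), ('Sophia', 'female'),
             ('Mark', 'male'), ('Frank', 'male'), ('Jack', 'male'),
             ('Erik', 'male')]
toy_held_out = [('Julia', 'female'), ('Derek', 'male'),
                ('Olivia', 'female'), ('Patrick', 'male')]

memorizer = train_lookup(toy_train, whole_name)
generalizer = train_lookup(toy_train, last_letter)

memorizer_train_acc = accuracy(memorizer, toy_train)
memorizer_held_out_acc = accuracy(memorizer, toy_held_out)
generalizer_held_out_acc = accuracy(generalizer, toy_held_out)
```

The memorizing classifier is perfect on the names it has seen and falls back to the majority label on everything else, so a high training score alone says little. Comparing the two numbers, as the session above does with trainset and testset, is a quick check.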

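(3) Error analysis

Once an initial feature set has been chosen, a very productive way to refine it is error analysis. First, we select a development set, containing the corpus data used to create the model. This development set is then subdivided into a training set and a dev-test set: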

>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.77
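The slicing above can be wrapped in a small helper so that the three parts are always disjoint and cover the whole list. This is a sketch for Python 3; the function name three_way_split and its seed parameter are assumptions, not part of NLTK.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle a copy of items and cut it into (train, devtest, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # local RNG, leaves global state alone
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the labeled names list.
toy_items = list(range(20))
train_part, devtest_part, test_part = three_way_split(toy_items, n_test=4, n_devtest=6)
```

With the names data, the equivalent call would be three_way_split(names, n_test=500, n_devtest=1000), matching the slice boundaries used in the session.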

Using the dev-test set, we can then generate a list of the errors the classifier makes when predicting name genders:

>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
...
correct=female   guess=male     name=Aileen
correct=female   guess=male     name=Alexis
correct=female   guess=male     name=Allsun
correct=female   guess=male     name=Alyss
correct=female   guess=male     name=Amber
correct=female   guess=male     name=Anabel
correct=female   guess=male     name=Anett
correct=female   guess=male     name=Arden
correct=female   guess=male     name=Ariel
correct=female   guess=male     name=Barb
correct=female   guess=male     name=Blondell
correct=female   guess=male     name=Brear
correct=female   guess=male     name=Brett
correct=female   guess=male     name=Bridget
correct=female   guess=male     name=Brier
correct=female   guess=male     name=Brook
correct=female   guess=male     name=Carmon
correct=female   guess=male     name=Caro
correct=female   guess=male     name=Carolan
correct=female   guess=male     name=Carolyn
correct=female   guess=male     name=Carolynn
correct=female   guess=male     name=Cathrin
correct=female   guess=male     name=Cherlyn
correct=female   guess=male     name=Clio
correct=female   guess=male     name=Daniel
correct=female   guess=male     name=Deb
correct=female   guess=male     name=Demeter
correct=female   guess=male     name=Devon
correct=female   guess=male     name=Dido
correct=female   guess=male     name=Doralynn
correct=female   guess=male     name=Doreen
correct=female   guess=male     name=Dyann
correct=female   guess=male     name=Eilis
correct=female   guess=male     name=Emlynn
correct=female   guess=male     name=Eran
correct=female   guess=male     name=Ester
correct=female   guess=male     name=Ethel
correct=female   guess=male     name=Faun
correct=female   guess=male     name=Felicdad
correct=female   guess=male     name=Flor
correct=female   guess=male     name=Gabriel
correct=female   guess=male     name=Garland
correct=female   guess=male     name=Gates
correct=female   guess=male     name=Gill
correct=female   guess=male     name=Glyn
correct=female   guess=male     name=Glynnis
correct=female   guess=male     name=Gredel
correct=female   guess=male     name=Harriot
correct=female   guess=male     name=Hildegaard
correct=female   guess=male     name=Ingaberg
correct=female   guess=male     name=Isabel
correct=female   guess=male     name=Izabel
correct=female   guess=male     name=Jacquelin
correct=female   guess=male     name=Jannel
correct=female   guess=male     name=Jazmin
correct=female   guess=male     name=Jo-Ann
correct=female   guess=male     name=Jonell
correct=female   guess=male     name=Karyl
correct=female   guess=male     name=Katheryn
correct=female   guess=male     name=Katleen
correct=female   guess=male     name=Kellyann
correct=female   guess=male     name=Keriann
correct=female   guess=male     name=Kial
correct=female   guess=male     name=Koo
correct=female   guess=male     name=Kristal
correct=female   guess=male     name=Kylynn
correct=female   guess=male     name=Leanor
correct=female   guess=male     name=Lilas
correct=female   guess=male     name=Lilias
correct=female   guess=male     name=Lind
correct=female   guess=male     name=Linnell
correct=female   guess=male     name=Lorain
correct=female   guess=male     name=Mab
correct=female   guess=male     name=Mag
correct=female   guess=male     name=Magdalen
correct=female   guess=male     name=Mair
correct=female   guess=male     name=Marilyn
correct=female   guess=male     name=Marion
correct=female   guess=male     name=Maryann
correct=female   guess=male     name=Meaghan
correct=female   guess=male     name=Merilyn
correct=female   guess=male     name=Merl
correct=female   guess=male     name=Michal
correct=female   guess=male     name=Millisent
correct=female   guess=male     name=Moll
correct=female   guess=male     name=Nert
correct=female   guess=male     name=Nichol
correct=female   guess=male     name=Peg
correct=female   guess=male     name=Phil
correct=female   guess=male     name=Philis
correct=female   guess=male     name=Pier
correct=female   guess=male     name=Rahal
correct=female   guess=male     name=Raquel
correct=female   guess=male     name=Rayshell
correct=female   guess=male     name=Rhianon
correct=female   guess=male     name=Roselin
correct=female   guess=male     name=Shannon
correct=female   guess=male     name=Sharl
correct=female   guess=male     name=Shaun
correct=female   guess=male     name=Sheilakathryn
correct=female   guess=male     name=Sheril
correct=female   guess=male     name=Sherill
correct=female   guess=male     name=Sioux
correct=female   guess=male     name=Star
correct=female   guess=male     name=Stoddard
correct=female   guess=male     name=Theo
correct=female   guess=male     name=Wallis
correct=female   guess=male     name=Wileen
correct=female   guess=male     name=Yoko
correct=male     guess=female   name=Abbey
correct=male     guess=female   name=Aguste
correct=male     guess=female   name=Ali
correct=male     guess=female   name=Anatole
correct=male     guess=female   name=Andri
correct=male     guess=female   name=Arie
correct=male     guess=female   name=Ash
correct=male     guess=female   name=Ashby
correct=male     guess=female   name=Ashley
correct=male     guess=female   name=Avery
correct=male     guess=female   name=Baillie
correct=male     guess=female   name=Barde
correct=male     guess=female   name=Barney
correct=male     guess=female   name=Barnie
correct=male     guess=female   name=Benny
correct=male     guess=female   name=Bertie
correct=male     guess=female   name=Billy
correct=male     guess=female   name=Bjorne
correct=male     guess=female   name=Carlie
correct=male     guess=female   name=Chance
correct=male     guess=female   name=Chaunce
correct=male     guess=female   name=Christoph
correct=male     guess=female   name=Claire
correct=male     guess=female   name=Clare
correct=male     guess=female   name=Claude
correct=male     guess=female   name=Conway
correct=male     guess=female   name=Curtice
correct=male     guess=female   name=Davide
correct=male     guess=female   name=Davie
correct=male     guess=female   name=Dewey
correct=male     guess=female   name=Dickie
correct=male     guess=female   name=Dominique
correct=male     guess=female   name=Donnie
correct=male     guess=female   name=Dougie
correct=male     guess=female   name=Doyle
correct=male     guess=female   name=Dudley
correct=male     guess=female   name=Duffie
correct=male     guess=female   name=Dwayne
correct=male     guess=female   name=Emmy
correct=male     guess=female   name=Eugene
correct=male     guess=female   name=Ezra
correct=male     guess=female   name=Felipe
correct=male     guess=female   name=Garth
correct=male     guess=female   name=Gerome
correct=male     guess=female   name=Gerry
correct=male     guess=female   name=Graeme
correct=male     guess=female   name=Grove
correct=male     guess=female   name=Guillaume
correct=male     guess=female   name=Hadley
correct=male     guess=female   name=Harry
correct=male     guess=female   name=Hartley
correct=male     guess=female   name=Hercule
correct=male     guess=female   name=Jay
correct=male     guess=female   name=Jedediah
correct=male     guess=female   name=Jeramie
correct=male     guess=female   name=Jeremiah
correct=male     guess=female   name=Jody
correct=male     guess=female   name=Keefe
correct=male     guess=female   name=Kennedy
correct=male     guess=female   name=Lance
correct=male     guess=female   name=Lawrence
correct=male     guess=female   name=Locke
correct=male     guess=female   name=Lorrie
correct=male     guess=female   name=Luce
correct=male     guess=female   name=Marlowe
correct=male     guess=female   name=Matty
correct=male     guess=female   name=Maurise
correct=male     guess=female   name=Meredeth
correct=male     guess=female   name=Mitch
correct=male     guess=female   name=Mordecai
correct=male     guess=female   name=Morty
correct=male     guess=female   name=Noah
correct=male     guess=female   name=Noe
correct=male     guess=female   name=Paddie
correct=male     guess=female   name=Pearce
correct=male     guess=female   name=Pierce
correct=male     guess=female   name=Quincy
correct=male     guess=female   name=Radcliffe
correct=male     guess=female   name=Rafe
correct=male     guess=female   name=Ravi
correct=male     guess=female   name=Ray
correct=male     guess=female   name=Rene
correct=male     guess=female   name=Rodolphe
correct=male     guess=female   name=Rolfe
correct=male     guess=female   name=Rourke
correct=male     guess=female   name=Ruddie
correct=male     guess=female   name=Rusty
correct=male     guess=female   name=Sawyere
correct=male     guess=female   name=Sergei
correct=male     guess=female   name=Seth
correct=male     guess=female   name=Sheffie
correct=male     guess=female   name=Sherlocke
correct=male     guess=female   name=Shorty
correct=male     guess=female   name=Slade
correct=male     guess=female   name=Smith
correct=male     guess=female   name=Stearne
correct=male     guess=female   name=Steve
correct=male     guess=female   name=Stevy
correct=male     guess=female   name=Tanny
correct=male     guess=female   name=Temple
correct=male     guess=female   name=Terrence
correct=male     guess=female   name=Thorny
correct=male     guess=female   name=Trace
correct=male     guess=female   name=Troy
correct=male     guess=female   name=Tulley
correct=male     guess=female   name=Ty
correct=male     guess=female   name=Ulrich
correct=male     guess=female   name=Valentine
correct=male     guess=female   name=Vance
correct=male     guess=female   name=Vassily
correct=male     guess=female   name=Verge
correct=male     guess=female   name=Vinny
correct=male     guess=female   name=Vite
correct=male     guess=female   name=Wallace
correct=male     guess=female   name=Wayne
correct=male     guess=female   name=Willie
correct=male     guess=female   name=Yance
correct=male     guess=female   name=Yule
correct=male     guess=female   name=Zachary
correct=male     guess=female   name=Zary
correct=male     guess=female   name=Zollie
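A list this long is hard to scan by eye. One possible aid, sketched here for Python 3, is to tally the misclassified names by their true label and final letters. The helper name tally_suffixes is an assumption; the sample entries are copied from the error list above, in the same (tag, guess, name) order as the errors list.

```python
from collections import Counter

def tally_suffixes(errors, length=2):
    """Count (true label, suffix) pairs among misclassified names."""
    return Counter((tag, name[-length:].lower()) for tag, guess, name in errors)

# A handful of (tag, guess, name) entries taken from the error list above.
sample_errors = [
    ('female', 'male', 'Carolyn'),
    ('female', 'male', 'Cherlyn'),
    ('female', 'male', 'Katheryn'),
    ('female', 'male', 'Marilyn'),
    ('female', 'male', 'Merilyn'),
    ('male', 'female', 'Mitch'),
    ('male', 'female', 'Ulrich'),
    ('male', 'female', 'Smith'),
]

suffix_counts = tally_suffixes(sample_errors)
top_two = suffix_counts.most_common(2)
```

Running the same tally over the full errors list would show which suffixes are most often misclassified for each gender, which suggests candidate features.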

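Looking through the errors reveals patterns such as the following.

For example, names ending in yn appear to be predominantly female, even though names ending in n tend to be male; and names ending in ch are usually male, even though names ending in h tend to be female. We therefore adjust the feature extractor to include two-letter suffix features. (Note that the effect is positive but small: accuracy went from 0.76 to about 0.77, or 0.771 on the dev-test set, whereas in the book's example it rose by 2% to 0.78. Looking back, this result is still worse than what we got earlier by adding many more features. Perhaps the Bayes implementation was improved in NLTK 3.0, so that more features now give better results?)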

>>> def gender_features(word):
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:]}
...
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.771

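This error analysis procedure can be repeated, checking for patterns in the errors made by the newly improved classifier. Each time the procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies of the dev-test set.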
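One way to get a different dev-test/training split on each round, while leaving the final test set untouched, is sketched below for Python 3. The function name resplit_dev and the use of the round number as a shuffle seed are assumptions made for illustration.

```python
import random

def resplit_dev(dev_items, n_devtest, round_number):
    """Return a fresh (train, devtest) split of the development data for one round."""
    shuffled = list(dev_items)
    random.Random(round_number).shuffle(shuffled)   # a different seed each round
    return shuffled[n_devtest:], shuffled[:n_devtest]

# Toy stand-in for the labeled names: the first 5 items are the held-out test set.
all_items = list(range(30))
held_out_test = all_items[:5]
dev_items = all_items[5:]

# Three rounds of error analysis, each with its own train/dev-test split.
round_splits = [resplit_dev(dev_items, n_devtest=10, round_number=r)
                for r in range(3)]
```

Because only dev_items is reshuffled, no item from held_out_test can leak into a training or dev-test set in any round.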

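However, once we have used the dev-test set to help develop the model, we can no longer trust it to give an accurate picture of how well the model would perform on new data. It is therefore important to keep the test set separate, and unused, until model development is complete. At that point, we can use the test set to evaluate how well the model performs on new input values. (Unfortunately, when I computed the accuracy on the test set it was only 0.62, far below the initial 0.76. Note, however, that test_set was built earlier with the original one-feature gender_features and was not rebuilt after the extractor was redefined, so the classifier is being scored on features it was not trained on. That mismatch, rather than the new feature itself, likely explains the drop.)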

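>>> # test_set still holds features from the earlier one-feature gender_features;
>>> # rebuild it from test_names with the new extractor before trusting this score.
>>> print(nltk.classify.accuracy(classifier, test_set))
0.62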




