NLTK Study Notes (4)
Source: Internet. Editor: 程序博客网. Time: 2024/05/16 10:57
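Text classification
1. Supervised classification
First, a classic diagram: [figure missing in this copy]
(1) Gender identification
We use a feature extractor to process the names data, then split the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier. After that, we test it on some names that did not appear in the training data (Neo and Trinity from The Matrix):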
>>> import nltk
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
...
>>> from nltk.corpus import names
>>> import random
>>> # Note: this rebinds `names` from the corpus reader to a plain list of (name, gender) pairs.
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> f = [(gender_features(n), g) for (n, g) in names]
>>> trainset, testset = f[500:], f[:500]
>>> c = nltk.NaiveBayesClassifier.train(trainset)
>>> c.classify(gender_features('Neo'))
'male'
>>> c.classify(gender_features('Trinity'))
'female'
>>> print(nltk.classify.accuracy(c, testset))
0.76
>>> c.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     34.4 : 1.0
             last_letter = u'k'             male : female =     29.9 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.5 : 1.0
>>> print(nltk.classify.accuracy(c, trainset))
0.763030628694
>>> print(names(0))   # mistake: names is a list, so it is indexed with [], not called
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>> print(names[0])
(u'Kourtney', 'female')
>>> print(names[1])
(u'Mariellen', 'female')
>>> print(names[100])
(u'Effie', 'female')
>>> print(names[500])
(u'Kalindi', 'female')
>>> print(names[300])
(u'Loraine', 'female')
>>> print(names[30])
(u'Munroe', 'male')
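The ratios printed by show_most_informative_features compare how often a feature value turns up under one label versus the other. Here is a rough plain-Python sketch of that kind of calculation. The toy name list, the helper names and the add-0.5 smoothing are my own assumptions for illustration; NLTK's exact estimator may differ.

```python
from collections import Counter

# Hypothetical toy sample; the notes above use nltk.corpus.names instead.
labelled = [("Anna", "female"), ("Maria", "female"), ("Linda", "female"),
            ("Ruth", "female"), ("Mark", "male"), ("Frank", "male"),
            ("Luca", "male"), ("John", "male")]

def last_letter_counts(pairs):
    """Count last letters separately for each label."""
    counts = {}
    for name, label in pairs:
        counts.setdefault(label, Counter())[name[-1].lower()] += 1
    return counts

def likelihood_ratio(counts, letter, label_a, label_b, bins=26, gamma=0.5):
    """P(letter | label_a) / P(letter | label_b) with add-gamma smoothing."""
    def prob(label):
        c = counts[label]
        return (c[letter] + gamma) / (sum(c.values()) + gamma * bins)
    return prob(label_a) / prob(label_b)

counts = last_letter_counts(labelled)
ratio_a = likelihood_ratio(counts, "a", "female", "male")
ratio_k = likelihood_ratio(counts, "k", "male", "female")
print("last_letter = 'a'   female : male   = %.1f : 1.0" % ratio_a)
print("last_letter = 'k'   male : female   = %.1f : 1.0" % ratio_k)
```

A large ratio means the feature value is strong evidence for one label, which is why a final "a" or "k" tops the list in the real run.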
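(2) Choosing the right features
Start with every feature you can think of intuitively, then use trial and error together with error analysis to check which features are actually useful.
The number of features you should use with a given learning algorithm is limited: if you supply too many, the algorithm comes to depend heavily on the idiosyncrasies of your training data and does not generalize well to new examples. This problem is called overfitting, and it is especially troublesome when working with small training sets. The book's overfitting example follows. (Note that in the original text this example has an accuracy of 0.748, the point being that more features lead to overfitting. In my run the accuracy actually went up slightly. Even so, given the extra computational cost of the additional features, it may not be worth it; I cannot say for sure, because the data set is too small to verify this.)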
>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
...
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}
>>> featuresets2 = [(gender_features2(n), g) for (n, g) in names]
>>> train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
>>> classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
>>> print(nltk.classify.accuracy(classifier2, test_set2))
0.782
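One way to spot overfitting is to compare accuracy on the training data with accuracy on held-out data: a large gap suggests the model has latched onto quirks of the training set. The sketch below uses a deliberately extreme stand-in, a hypothetical MemorizingClassifier that treats the whole name as its only feature, plus a handful of made-up names. None of it comes from the book.

```python
def accuracy(classifier, labelled):
    """Fraction of (name, label) pairs the classifier labels correctly."""
    correct = sum(1 for name, label in labelled
                  if classifier.classify(name) == label)
    return correct / len(labelled)

class MemorizingClassifier:
    """Hypothetical worst case: remembers every training name verbatim."""

    def __init__(self, train, default="female"):
        self.table = dict(train)      # name -> label seen in training
        self.default = default        # answer for names never seen

    def classify(self, name):
        return self.table.get(name, self.default)

# Made-up data for illustration only.
train = [("John", "male"), ("Mark", "male"),
         ("Anna", "female"), ("Maria", "female")]
heldout = [("Peter", "male"), ("Luke", "male"),
           ("Emma", "female"), ("Paul", "male")]

clf = MemorizingClassifier(train)
train_acc = accuracy(clf, train)
heldout_acc = accuracy(clf, heldout)
print("train accuracy:    %.2f" % train_acc)
print("held-out accuracy: %.2f" % heldout_acc)
```

The memorizer is perfect on names it has seen and useless on new ones. A real feature set with too many features fails in the same direction, just less dramatically.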
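(3) Error analysis
Once an initial feature set has been chosen, a very productive way to refine it is error analysis. First, we select a development set containing the corpus data used to create the model. This development set is then subdivided into a training set and a dev-test set.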
>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n, g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.77
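The three slices above can be wrapped in a small helper so the sizes are explicit and the split is repeatable. This is only a sketch: three_way_split and its parameters are names I made up, and integers stand in for the (name, gender) pairs.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=None):
    """Shuffle a copy of items and cut it into train, dev-test and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Stand-in data: integers instead of (name, gender) pairs.
data = list(range(20))
train, devtest, test = three_way_split(data, n_test=5, n_devtest=5, seed=0)
print(len(train), len(devtest), len(test))
```

With the real data this would be called as three_way_split(names, 500, 1000), matching the slice boundaries used above.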
Then, using the dev-test set, we can generate a list of the errors the classifier makes when predicting the gender of a name.
>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
...
correct=female   guess=male     name=Aileen
correct=female   guess=male     name=Alexis
correct=female   guess=male     name=Allsun
correct=female   guess=male     name=Alyss
correct=female   guess=male     name=Amber
correct=female   guess=male     name=Anabel
correct=female   guess=male     name=Anett
correct=female   guess=male     name=Arden
correct=female   guess=male     name=Ariel
correct=female   guess=male     name=Barb
correct=female   guess=male     name=Blondell
correct=female   guess=male     name=Brear
correct=female   guess=male     name=Brett
correct=female   guess=male     name=Bridget
correct=female   guess=male     name=Brier
correct=female   guess=male     name=Brook
correct=female   guess=male     name=Carmon
correct=female   guess=male     name=Caro
correct=female   guess=male     name=Carolan
correct=female   guess=male     name=Carolyn
correct=female   guess=male     name=Carolynn
correct=female   guess=male     name=Cathrin
correct=female   guess=male     name=Cherlyn
correct=female   guess=male     name=Clio
correct=female   guess=male     name=Daniel
correct=female   guess=male     name=Deb
correct=female   guess=male     name=Demeter
correct=female   guess=male     name=Devon
correct=female   guess=male     name=Dido
correct=female   guess=male     name=Doralynn
correct=female   guess=male     name=Doreen
correct=female   guess=male     name=Dyann
correct=female   guess=male     name=Eilis
correct=female   guess=male     name=Emlynn
correct=female   guess=male     name=Eran
correct=female   guess=male     name=Ester
correct=female   guess=male     name=Ethel
correct=female   guess=male     name=Faun
correct=female   guess=male     name=Felicdad
correct=female   guess=male     name=Flor
correct=female   guess=male     name=Gabriel
correct=female   guess=male     name=Garland
correct=female   guess=male     name=Gates
correct=female   guess=male     name=Gill
correct=female   guess=male     name=Glyn
correct=female   guess=male     name=Glynnis
correct=female   guess=male     name=Gredel
correct=female   guess=male     name=Harriot
correct=female   guess=male     name=Hildegaard
correct=female   guess=male     name=Ingaberg
correct=female   guess=male     name=Isabel
correct=female   guess=male     name=Izabel
correct=female   guess=male     name=Jacquelin
correct=female   guess=male     name=Jannel
correct=female   guess=male     name=Jazmin
correct=female   guess=male     name=Jo-Ann
correct=female   guess=male     name=Jonell
correct=female   guess=male     name=Karyl
correct=female   guess=male     name=Katheryn
correct=female   guess=male     name=Katleen
correct=female   guess=male     name=Kellyann
correct=female   guess=male     name=Keriann
correct=female   guess=male     name=Kial
correct=female   guess=male     name=Koo
correct=female   guess=male     name=Kristal
correct=female   guess=male     name=Kylynn
correct=female   guess=male     name=Leanor
correct=female   guess=male     name=Lilas
correct=female   guess=male     name=Lilias
correct=female   guess=male     name=Lind
correct=female   guess=male     name=Linnell
correct=female   guess=male     name=Lorain
correct=female   guess=male     name=Mab
correct=female   guess=male     name=Mag
correct=female   guess=male     name=Magdalen
correct=female   guess=male     name=Mair
correct=female   guess=male     name=Marilyn
correct=female   guess=male     name=Marion
correct=female   guess=male     name=Maryann
correct=female   guess=male     name=Meaghan
correct=female   guess=male     name=Merilyn
correct=female   guess=male     name=Merl
correct=female   guess=male     name=Michal
correct=female   guess=male     name=Millisent
correct=female   guess=male     name=Moll
correct=female   guess=male     name=Nert
correct=female   guess=male     name=Nichol
correct=female   guess=male     name=Peg
correct=female   guess=male     name=Phil
correct=female   guess=male     name=Philis
correct=female   guess=male     name=Pier
correct=female   guess=male     name=Rahal
correct=female   guess=male     name=Raquel
correct=female   guess=male     name=Rayshell
correct=female   guess=male     name=Rhianon
correct=female   guess=male     name=Roselin
correct=female   guess=male     name=Shannon
correct=female   guess=male     name=Sharl
correct=female   guess=male     name=Shaun
correct=female   guess=male     name=Sheilakathryn
correct=female   guess=male     name=Sheril
correct=female   guess=male     name=Sherill
correct=female   guess=male     name=Sioux
correct=female   guess=male     name=Star
correct=female   guess=male     name=Stoddard
correct=female   guess=male     name=Theo
correct=female   guess=male     name=Wallis
correct=female   guess=male     name=Wileen
correct=female   guess=male     name=Yoko
correct=male     guess=female   name=Abbey
correct=male     guess=female   name=Aguste
correct=male     guess=female   name=Ali
correct=male     guess=female   name=Anatole
correct=male     guess=female   name=Andri
correct=male     guess=female   name=Arie
correct=male     guess=female   name=Ash
correct=male     guess=female   name=Ashby
correct=male     guess=female   name=Ashley
correct=male     guess=female   name=Avery
correct=male     guess=female   name=Baillie
correct=male     guess=female   name=Barde
correct=male     guess=female   name=Barney
correct=male     guess=female   name=Barnie
correct=male     guess=female   name=Benny
correct=male     guess=female   name=Bertie
correct=male     guess=female   name=Billy
correct=male     guess=female   name=Bjorne
correct=male     guess=female   name=Carlie
correct=male     guess=female   name=Chance
correct=male     guess=female   name=Chaunce
correct=male     guess=female   name=Christoph
correct=male     guess=female   name=Claire
correct=male     guess=female   name=Clare
correct=male     guess=female   name=Claude
correct=male     guess=female   name=Conway
correct=male     guess=female   name=Curtice
correct=male     guess=female   name=Davide
correct=male     guess=female   name=Davie
correct=male     guess=female   name=Dewey
correct=male     guess=female   name=Dickie
correct=male     guess=female   name=Dominique
correct=male     guess=female   name=Donnie
correct=male     guess=female   name=Dougie
correct=male     guess=female   name=Doyle
correct=male     guess=female   name=Dudley
correct=male     guess=female   name=Duffie
correct=male     guess=female   name=Dwayne
correct=male     guess=female   name=Emmy
correct=male     guess=female   name=Eugene
correct=male     guess=female   name=Ezra
correct=male     guess=female   name=Felipe
correct=male     guess=female   name=Garth
correct=male     guess=female   name=Gerome
correct=male     guess=female   name=Gerry
correct=male     guess=female   name=Graeme
correct=male     guess=female   name=Grove
correct=male     guess=female   name=Guillaume
correct=male     guess=female   name=Hadley
correct=male     guess=female   name=Harry
correct=male     guess=female   name=Hartley
correct=male     guess=female   name=Hercule
correct=male     guess=female   name=Jay
correct=male     guess=female   name=Jedediah
correct=male     guess=female   name=Jeramie
correct=male     guess=female   name=Jeremiah
correct=male     guess=female   name=Jody
correct=male     guess=female   name=Keefe
correct=male     guess=female   name=Kennedy
correct=male     guess=female   name=Lance
correct=male     guess=female   name=Lawrence
correct=male     guess=female   name=Locke
correct=male     guess=female   name=Lorrie
correct=male     guess=female   name=Luce
correct=male     guess=female   name=Marlowe
correct=male     guess=female   name=Matty
correct=male     guess=female   name=Maurise
correct=male     guess=female   name=Meredeth
correct=male     guess=female   name=Mitch
correct=male     guess=female   name=Mordecai
correct=male     guess=female   name=Morty
correct=male     guess=female   name=Noah
correct=male     guess=female   name=Noe
correct=male     guess=female   name=Paddie
correct=male     guess=female   name=Pearce
correct=male     guess=female   name=Pierce
correct=male     guess=female   name=Quincy
correct=male     guess=female   name=Radcliffe
correct=male     guess=female   name=Rafe
correct=male     guess=female   name=Ravi
correct=male     guess=female   name=Ray
correct=male     guess=female   name=Rene
correct=male     guess=female   name=Rodolphe
correct=male     guess=female   name=Rolfe
correct=male     guess=female   name=Rourke
correct=male     guess=female   name=Ruddie
correct=male     guess=female   name=Rusty
correct=male     guess=female   name=Sawyere
correct=male     guess=female   name=Sergei
correct=male     guess=female   name=Seth
correct=male     guess=female   name=Sheffie
correct=male     guess=female   name=Sherlocke
correct=male     guess=female   name=Shorty
correct=male     guess=female   name=Slade
correct=male     guess=female   name=Smith
correct=male     guess=female   name=Stearne
correct=male     guess=female   name=Steve
correct=male     guess=female   name=Stevy
correct=male     guess=female   name=Tanny
correct=male     guess=female   name=Temple
correct=male     guess=female   name=Terrence
correct=male     guess=female   name=Thorny
correct=male     guess=female   name=Trace
correct=male     guess=female   name=Troy
correct=male     guess=female   name=Tulley
correct=male     guess=female   name=Ty
correct=male     guess=female   name=Ulrich
correct=male     guess=female   name=Valentine
correct=male     guess=female   name=Vance
correct=male     guess=female   name=Vassily
correct=male     guess=female   name=Verge
correct=male     guess=female   name=Vinny
correct=male     guess=female   name=Vite
correct=male     guess=female   name=Wallace
correct=male     guess=female   name=Wayne
correct=male     guess=female   name=Willie
correct=male     guess=female   name=Yance
correct=male     guess=female   name=Yule
correct=male     guess=female   name=Zachary
correct=male     guess=female   name=Zary
correct=male     guess=female   name=Zollie
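Scanning a list this long by eye is tedious. Tallying the misclassified names by suffix makes candidate patterns easier to see. A small sketch, using a few rows copied from the listing above; suffix_tally is a helper name I made up.

```python
from collections import Counter

# A few (correct, guess, name) rows copied from the error listing.
errors = [("female", "male", "Carolyn"), ("female", "male", "Cherlyn"),
          ("female", "male", "Marilyn"), ("female", "male", "Merilyn"),
          ("female", "male", "Glyn"), ("female", "male", "Katheryn"),
          ("male", "female", "Mitch"), ("male", "female", "Ulrich"),
          ("male", "female", "Garth"), ("male", "female", "Seth")]

def suffix_tally(rows, n=2):
    """Count (correct label, last n letters) pairs among misclassified names."""
    return Counter((tag, name[-n:].lower()) for tag, guess, name in rows)

tally = suffix_tally(errors)
for (tag, suffix), count in tally.most_common():
    print("correct=%-8s suffix=%-3s errors=%d" % (tag, suffix, count))
```

On the full errors list, the suffixes with the highest counts are the ones worth turning into features.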
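Looking through these errors, the following patterns emerge.
For example, names ending in yn appear to be predominantly female, even though names ending in n tend to be male; names ending in ch are usually male, even though names ending in h tend to be female. We therefore adjust our feature extractor to include two-letter suffix features. (Note that the gain is small: I saw roughly 0.76 to 0.77, and the dev-test sessions in these notes show 0.77 before and 0.771 after, whereas the book's example gained 2 percentage points and reached 0.78. Looking back, this result is still worse than what we got by adding many features with gender_features2, which scored 0.782. Perhaps the Bayes implementation was improved in NLTK 3.0, so that more features now give better results?)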
>>> def gender_features(word):
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:]}
...
>>> train_set = [(gender_features(n), g) for (n, g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.771
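This error analysis procedure can be repeated, each time looking for patterns in the errors made by the newly improved classifier. Every time the procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies of the dev-test set.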
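In code, picking a different split for each round can be as simple as reshuffling the development data with a new seed while the test set stays untouched. A minimal sketch: resplit_dev and the round numbering are my own assumptions, and integers stand in for the (name, gender) pairs.

```python
import random

def resplit_dev(dev_pool, n_devtest, round_no):
    """Return a fresh (train, devtest) split of the development pool."""
    pool = list(dev_pool)
    random.Random(round_no).shuffle(pool)   # a different shuffle per round
    return pool[n_devtest:], pool[:n_devtest]

# Stand-in data: integers instead of (name, gender) pairs.
items = list(range(30))
test_items, dev_pool = items[:5], items[5:]   # the test set is carved off once

for round_no in (1, 2):
    train, devtest = resplit_dev(dev_pool, n_devtest=10, round_no=round_no)
    print(round_no, sorted(devtest))
```

Only dev_pool is reshuffled, so nothing from test_items can leak into training or dev-test during the error analysis rounds.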
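But once we have used the dev-test set to help develop the model, we can no longer trust it to give an accurate picture of how well the model will perform on new data. It is therefore important to keep the test set separate and unused until model development is complete. At that point, we can use the test set to evaluate how well the model performs on new input. (Unfortunately, the test-set accuracy I got was 0.62, far below the initial 0.76. The approach is sound, but this number says little about the new feature: test_set was built in the earlier session with the old last_letter extractor and was not rebuilt after gender_features was redefined, so the classifier most likely saw no feature it had been trained on. The test set should be rebuilt with the current extractor before drawing a conclusion.)
>>> print(nltk.classify.accuracy(classifier, test_set))   # test_set still holds the old last_letter features
0.62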
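The 0.62 above deserves a second look, because test_set was built before gender_features was redefined. If a classifier ignores feature names it never saw in training (NLTK's naive Bayes does this, as far as I can tell), a stale feature set leaves it nothing to score but the label prior, so every name gets the majority label. The sketch below shows the effect with a simplified stand-in: TinyNaiveBayes, its crude smoothing and the five toy names are all assumptions for illustration, not NLTK's code.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Simplified stand-in for a naive Bayes classifier (not NLTK's code)."""

    def __init__(self, train):
        self.label_counts = Counter(label for _, label in train)
        self.value_counts = defaultdict(Counter)   # (label, fname) -> value counts
        self.known = set()                         # feature names seen in training
        for features, label in train:
            for fname, fval in features.items():
                self.value_counts[(label, fname)][fval] += 1
                self.known.add(fname)

    def classify(self, features):
        # Assumption: feature names never seen in training are dropped.
        usable = {k: v for k, v in features.items() if k in self.known}
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            score = math.log(n / total)            # label prior
            for fname, fval in usable.items():
                seen = self.value_counts[(label, fname)][fval]
                score += math.log((seen + 0.5) / (n + 1.0))   # crude smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def old_features(word):
    return {'last_letter': word[-1]}

def new_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

# Made-up training names: three female, two male.
train_names = [("Anna", "female"), ("Maria", "female"), ("Carolyn", "female"),
               ("John", "male"), ("Mitch", "male")]
clf = TinyNaiveBayes([(new_features(n), g) for n, g in train_names])

stale_guess = clf.classify(old_features("Mitch"))   # only the prior is left
fresh_guess = clf.classify(new_features("Mitch"))   # suffix features are used
print(stale_guess, fresh_guess)
```

In the real session the fix would be to rerun test_set = [(gender_features(n), g) for (n, g) in test_names] after redefining gender_features, and only then measure the accuracy.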