Chapter 6: Text Classification

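import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}

Define a gender feature extractor that looks only at the last letter of a name.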
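To make the shape of the feature dictionary concrete, here is a small sketch that applies the extractor to a few names. The sample names are hypothetical and chosen only for illustration.

```python
def gender_features(word):
    # Same extractor as above: the only feature is the final letter.
    return {'last_letter': word[-1]}

# Hypothetical sample names, not taken from the corpus.
for name in ['Shrek', 'Anna', 'Li']:
    print(name, gender_features(name))
```

Each name is reduced to a one-entry dictionary, which is the input format the classifier is trained on.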


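import random

names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)

Build the names list from the male and female name files, then shuffle it. Without shuffling, the slices taken from the front of the list would contain names of a single gender. Note that this rebinds names, so the corpus reader has to be re-imported before this statement can be run again.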


featuresets = [(gender_features(n), g) for (n,g) in names]

Build the feature sets by applying the last-letter extractor to every name in names.

train_set, test_set = featuresets[500:], featuresets[:500]

Split the feature sets: the first 500 form the test set and the rest form the training set.

classifier = nltk.NaiveBayesClassifier.train(train_set)

Train a naive Bayes classifier on the training set.
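As a rough illustration of what a naive Bayes classifier does with these feature sets, the sketch below counts labels and feature values, then picks the label with the highest log-probability. It is a simplified stand-in, not NLTK's implementation: the function names, the toy data, and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count how often each label and each (label, feature, value) occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    values_seen = defaultdict(set)       # feature -> set of observed values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, fval in features.items():
            value_counts[(label, fname)][fval] += 1
            values_seen[fname].add(fval)
    return label_counts, value_counts, values_seen

def classify(model, features):
    """Return the label maximizing log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)
        for fname, fval in features.items():
            seen = value_counts[(label, fname)]
            # Add-one smoothing (an assumption of this sketch) so that
            # unseen values do not produce a zero probability.
            score += math.log((seen[fval] + 1) /
                              (count + len(values_seen[fname]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same (features, label) format.
toy_train = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'o'}, 'male'),
    ({'last_letter': 'k'}, 'male'),
]
model = train_naive_bayes(toy_train)
print(classify(model, {'last_letter': 'a'}))
print(classify(model, {'last_letter': 'k'}))
```

With a single feature, the decision reduces to comparing how often each last letter appears under each label, weighted by how common the label is.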

classifier.classify(gender_features('huangcongying'))

Try the classifier on a single name.

nltk.classify.accuracy(classifier, test_set)

Compute the classifier's accuracy on the test set.
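Accuracy here is simply the fraction of test items whose predicted label matches the gold label. The sketch below computes it by hand; the accuracy helper, the vowel_rule stand-in classifier, and the toy test data are hypothetical names introduced for illustration.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of feature sets whose predicted label equals the gold label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

def vowel_rule(features):
    # Hypothetical hand-written rule standing in for a trained classifier.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Hypothetical toy test set in the same (features, label) format.
toy_test = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),
    ({'last_letter': 'n'}, 'female'),
]
print(accuracy(vowel_rule, toy_test))  # 2 of 4 correct: 0.5
```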

classifier.show_most_informative_features(5)

Inspect the classifier by listing its 5 most informative features.


train_names = names[1500:]

devtest_names = names[500:1500]

test_names = names[:500]

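The training set is used to train the model, the dev-test set is used for error analysis, and the test set is reserved for the final evaluation of the system.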
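The three slices should be disjoint and together cover the whole data set. A minimal sketch of such a split, with a hypothetical helper name and toy data, and with shuffling done on a copy using a fixed seed so the result is repeatable:

```python
import random

def three_way_split(items, test_size, devtest_size, seed=0):
    """Shuffle a copy of items, then slice it into train, dev-test and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    devtest = shuffled[test_size:test_size + devtest_size]
    train = shuffled[test_size + devtest_size:]
    return train, devtest, test

# Hypothetical labeled items standing in for the (name, gender) pairs.
items = [('name%d' % i, 'male' if i % 2 else 'female') for i in range(20)]
train, devtest, test = three_way_split(items, test_size=5, devtest_size=5)
print(len(train), len(devtest), len(test))
```

Keeping the test slice untouched until the end is what makes the final accuracy figure a fair estimate.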

train_set = [(gender_features(n), g) for (n,g) in train_names]

devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]

test_set = [(gender_features(n), g) for (n,g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, devtest_set)

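Build the feature sets for the training, dev-test, and test sets, train on the training set, and compute the accuracy on the dev-test set.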
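Error analysis on the dev-test set usually means collecting the misclassified names and looking for patterns in them. The sketch below shows one way to do that; collect_errors, vowel_rule, and the toy names are hypothetical, and the rule-based classifier only stands in for a trained one.

```python
def gender_features(word):
    # Same last-letter extractor as used throughout this chapter.
    return {'last_letter': word[-1]}

def collect_errors(classify, labeled_names):
    """Return sorted (gold, guess, name) triples for every wrong prediction."""
    errors = []
    for name, gold in labeled_names:
        guess = classify(gender_features(name))
        if guess != gold:
            errors.append((gold, guess, name))
    return sorted(errors)

def vowel_rule(features):
    # Hypothetical hand-written rule standing in for a trained classifier.
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

# Hypothetical dev-test names, chosen only for illustration.
toy_devtest = [('Anna', 'female'), ('Mark', 'male'),
               ('Luke', 'male'), ('Karen', 'female')]
errors = collect_errors(vowel_rule, toy_devtest)
for gold, guess, name in errors:
    print('correct=%-8s guess=%-8s name=%s' % (gold, guess, name))
```

Patterns in the error list can suggest new features, which should then be checked against the dev-test set rather than the test set.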

