BOW使用指南
来源:互联网 发布:改图软件 编辑:程序博客网 时间:2024/06/03 08:24
先规范一下发间:bow的韵音同low而不是cow。
bow包含三个项目:rainbow用于文本分类;arrow用于文本检索;crossbow用于文本聚类。这三个程序是独立的。
Rainbow
使用rainbow前首先要建立原始文档的一个model----包含了原始文档的一些统计信息,使用rainbow命令时通过-d选项来指定model的路径。
rainbow -d ~/model --index ~/20_newsgroups/*
以上命令是为 20_newsgroups所有分类创建model,生成~/model文件。
--index目录可以分别写:rainbow -d ~/model --index ~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc
--index可简写为-i
rainbow不支持一个文档拥有多个类标签。
各个文档属于哪个类都已经包含在了model中。
rainbow -d model --print-doc-names 打印出model中包含的所有文件的文件名(包括完整路径)。
默认情况下rainbow在建立model把字母都转换成了小写,并去除了停用词。
当然用rainbow建立model时还有很多选项可以指定,比如--skip-html可以路过“<"和“>"之间的所有字符;--skip-headers (or -h)选项跳过新闻组或邮件的headers before beginning tokenization.
为原始文档建立好索引后就可以来进行分类了。
rainbow -d ~/model --test-set=0.4 --test=3
表示输出3次试验的结果,60%的文档作为训练集,剩下40%作为测试集。
输出类似于:
/home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005
指出了一个文档属于各个类的概率。
bow路径下还有一个perl脚本文件--rainbow-stats,它的输入是以上分类命令的输出,它的输出是平均精度、标准差和混淆矩阵。
rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats
进行2次trail,输出形如:
Trial 1
Correct: 1077 out of 1200 (89.75 percent accuracy)
- Confusion details, row is actual, column is predicted
classname 0 1 2 :total
0 talk.politics.guns 378 2 20 :400 94.50%
1 talk.politics.mideast 7 374 19 :400 93.50%
2 talk.politics.misc 57 18 325 :400 81.25%
Percent_Accuracy average 90.38 stderr 0.44
--test-set选项也可以为整数,如--test-set=30,将试图从集合中随机选取30个文档作为测试集,并尽可能保证我30个文档平均分布于各个类中。
--test-set=200pc表示每个分类中随机选取200个文档放入测试集。
你甚至可以具体指定哪个文件放在测试集中--test-set=file1 fil2 各个文件用空格分开。
同理可以指定train-set:rainbow -d ~/model --test-set=file1 --train-set=file2 --test=1
一个文件不能同时出现在--teat-set和--train-set中。
默认情况下不在--test-set中的文档都在--train-set中,我们也可以反过来指定:
rainbow -d ~/model --train-set=1pc --test-set=remaining --test=1
如果测试集不在model中,也可以另外指定:rainbow -d ~/model --test-files ~/more-talk.politics/*
特征项选择
--prune-vocab-by-infogain=N (-T)计算每个词的average mutual information,选取互信息最高的N个词作为特征词。默认情况下N=0,即选择所有的词作为特征词。
--prune-vocab-by-doc-count=N (-D)一个词的文档频率(即在多少个文档中出现)大于N时才把它作为特征词。
--prune-vocab-by-occur-count=N (-O)一个词出现不少于N次时选为特征词。
比如rainbow -d ~/model --prune-vocab-by-infogain=50 --test=1
你还可以查处哪50个词选为了特征词,通过:rainbow -d ~/model -I 50
选择采用的分类方法:rainbow -d ~/model --method=tfidf --test=1
rainbow支持的分类方法有:naivebayes, knn, tfidf, prind(probabilistic indexing),默认情况下采用的是naivebayes。
采用naivebayes分类时可以指定的选项有:
--smoothing-method=METHODSet the method for smoothing word probabilities to avoid zeros; METHOD may be one of: goodturing, laplace, mestimate, wittenbell. The default is laplace, which is a uniform Dirichlet prior with alpha=2.
--event-model=EVENTNAMESet what objects will be considered the `events' of the probabilistic model. EVENTNAME can be one of: word (i.e. multinomial, unigram), document (i.e. multi-variate Bernoulli, bit vector), or document-then-word (i.e. document-length-normalized multinomial). For more details on these methods, see A Comparison of Event Models for Naive Bayes Text Classification. The default is word.
--uniform-class-priorsWhen classifying and calculating mutual information, use equal prior probabilities on classes, instead of using the distribution determined from the training data.
机器诊断
你可以查看model中的各种信息。
rainbow -d ~/model -I 10 查看前10个特征词的互信息(或文档频率、出现的次数)
rainbow -d ~/model --train-set=~/docs1 -I 10 查看文档1中前10个特征词的互信息(或文档频率、出现的次数)
rainbow -d ~/model -T 10 --print-word-probabilities=talk.politics.mideast 查看类别talk.politics.mideast中
互信息最高的前10个特征词的probability。输出形如:
god 0.05026782
people 0.64977338
government 0.24062629
car 0.03502266
game 0.00412031
team 0.01030078
bike 0.00041203
dod 0.00041203
hockey 0.00123609
windows 0.00782859
概率之和为1.
rainbow -d ~/model --print-word-counts=team 查看单词team在各个类中出现的次数及所占比重。输出形如:
2 / 125039 ( 0.00002) alt.atheism
6 / 119511 ( 0.00005) comp.graphics
5 / 91147 ( 0.00005) comp.os.ms-windows.misc
1 / 71002 ( 0.00001) comp.sys.mac.hardware
12 / 131120 ( 0.00009) comp.windows.x
15 / 62130 ( 0.00024) misc.forsale
2 / 83942 ( 0.00002) rec.autos
10 / 78685 ( 0.00013) rec.motorcycles
543 / 88623 ( 0.00613) rec.sport.baseball
970 / 115109 ( 0.00843) rec.sport.hockey
9 / 136655 ( 0.00007) sci.crypt
1 / 81206 ( 0.00001) sci.electronics
8 / 125235 ( 0.00006) sci.med
71 / 128754 ( 0.00055) sci.space
2 / 141389 ( 0.00001) soc.religion.christian
13 / 135054 ( 0.00010) talk.politics.guns
24 / 208367 ( 0.00012) talk.politics.mideast
14 / 164266 ( 0.00009) talk.politics.misc
9 / 130013 ( 0.00007) talk.religion.misc
Note: the probability of the word team is not equal to the probability of team from the --print-word-probabilities command above, because we did not reduce vocabulary size to 10 in this example.
rainbow -d ~/model --train-set=3pc --print-doc-names=train
通过--print-doc-names=train你可以查看测试集中有哪些文档
通达--print-matrix还可以输出word-document矩阵。
Print entries for all words in the vocabulary, or just print the words that actually occur in the document.
aall
ssparse, (default)
Print word counts as integers or as binary presence/absence indicators.
bbinary
iinteger, (default)
How to indicate the word itself.
ninteger word index
wword string
ccombination of integer word index and word string, (default)
eempty, don't print anything to indicate the identity of the word
比如rainbow -d ~/model -T 100 --print-matrix=siw | head -n 10的输出是下面的形式:
~/20_newsgroups/alt.atheism/53366 alt.atheism god 2 jesus 1 nasa 2 people 2
~/20_newsgroups/alt.atheism/53367 alt.atheism jesus 2 jewish 1 christian 1
~/20_newsgroups/alt.atheism/51247 alt.atheism god 4 evidence 2
~/20_newsgroups/alt.atheism/51248 alt.atheism
~/20_newsgroups/alt.atheism/51249 alt.atheism nasa 1 country 2 files 1 law 3 system 1 government 1
~/20_newsgroups/alt.atheism/51250 alt.atheism god 3 people 2 evidence 1 law 1 system 1 public 5 rights 1 fact 1 religious 1
~/20_newsgroups/alt.atheism/51251 alt.atheism
~/20_newsgroups/alt.atheism/51252 alt.atheism people 4 evidence 2 system 2 religion 1
~/20_newsgroups/alt.atheism/51253 alt.atheism god 19 christian 1 evidence 1 faith 5 car 2 space 1 game 1
~/20_newsgroups/alt.atheism/51254 alt.atheism people 1 jewish 3 game 1 bible 7
而rainbow -d ~/model -T 10 --print-matrix=abe | head -n 10的输出是下面的形式:
~/20_newsgroups/alt.atheism/53366 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/53367 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51247 alt.atheism 1 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51248 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51249 alt.atheism 0 0 1 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51250 alt.atheism 1 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51251 alt.atheism 0 0 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51252 alt.atheism 0 1 0 0 0 0 0 0 0 0
~/20_newsgroups/alt.atheism/51253 alt.atheism 1 0 0 1 1 0 0 0 0 0
~/20_newsgroups/alt.atheism/51254 alt.atheism 0 1 0 0 1 0 0 0 0 0
通用选项
--verbosity=LEVEL(LEVEL的取值为,0,1,2,3,4,5)控制打印进度信息,数字越小打印的信息越少,默认为2。
例如rainbow -v 0 -d ~/model -I 10
凡事牵涉到随机选取的命令都可以指定种子:
rainbow -d ~/model -t 1 --test-set=0.3 --random-seed=2
rainbow -d ~/model --random-seed=1 --train-set=4pc --print-doc-names=train
Arrow
Arrow是一个单独的程序,它采用TFIDF来对文档进行检索。
为文档建立索引:arrow --index 20_newsgroups/talk.politics.*
建立索引后生成的文件放在哪儿了我还不知道。
检索:orisun@zcypc:~/master$ arrow --query
Loading data files...
Type your query text now. End with a Control-D.
america,HITCOUNT 176
20_newsgroups/talk.politics.mideast/76225 0.104700 america
20_newsgroups/talk.politics.guns/53295 0.094274 america
20_newsgroups/talk.politics.misc/178545 0.089230 america
20_newsgroups/talk.politics.misc/179112 0.088966 america
20_newsgroups/talk.politics.guns/55265 0.087667 america
20_newsgroups/talk.politics.misc/178465 0.086633 america
20_newsgroups/talk.politics.mideast/77184 0.084780 america
20_newsgroups/talk.politics.guns/54318 0.084553 america
20_newsgroups/talk.politics.misc/178790 0.081991 america
20_newsgroups/talk.politics.mideast/76286 0.081669 america
.
orisun@zcypc:~/master$ arrow --query
Loading data files...
Type your query text now. End with a Control-D.
china,HITCOUNT 30
20_newsgroups/talk.politics.mideast/76476 0.176953 china
20_newsgroups/talk.politics.mideast/76536 0.152673 china
20_newsgroups/talk.politics.guns/53341 0.151225 china
20_newsgroups/talk.politics.mideast/75405 0.151225 china
20_newsgroups/talk.politics.misc/178554 0.127059 china
20_newsgroups/talk.politics.misc/176864 0.116911 china
20_newsgroups/talk.politics.misc/178488 0.105641 china
20_newsgroups/talk.politics.misc/178326 0.104150 china
20_newsgroups/talk.politics.misc/176852 0.092056 china
20_newsgroups/talk.politics.mideast/76264 0.090448 china
.
orisun@zcypc:~/master$ arrow --query
Loading data files...
Type your query text now. End with a Control-D.
AMERICA
CHINA
,HITCOUNT 194
20_newsgroups/talk.politics.mideast/76476 0.150675 china
20_newsgroups/talk.politics.mideast/76536 0.130001 china
20_newsgroups/talk.politics.guns/53341 0.128768 china
20_newsgroups/talk.politics.mideast/75405 0.128768 china
20_newsgroups/talk.politics.misc/178554 0.108191 china
20_newsgroups/talk.politics.misc/176864 0.099550 china
20_newsgroups/talk.politics.guns/54857 0.091291 china america
20_newsgroups/talk.politics.misc/178488 0.089953 china
20_newsgroups/talk.politics.misc/178326 0.088683 china
20_newsgroups/talk.politics.misc/176852 0.078386 china
.
- BOW使用指南
- BOW
- BoW
- BoW
- BOW
- Bow
- 文本BoW
- bow实现
- Bow模型
- Bow词袋
- Take a bow~
- Take A Bow
- OpenCV应用----BOW篇
- OpenCV应用----BOW篇
- OpenCV应用----BOW篇
- BOW(bag of words)
- BoW用于图像分类
- SIFT,BOW,SPM
- 聚类算法之KMeans(Java实现)
- 聚类算法之DBScan(Java实现)
- 聚类算法之CHAMELEON(Java实现)
- LibSVM使用指南
- java读取pdf和MS Office文档
- BOW使用指南
- 归纳决策树ID3(Java实现)
- FP-Tree算法的实现
- 线程局部存储,Part 6:Windows Server 2003中隐式TLS支持方法设计中的问题
- 聚类算法之BIRCH(Java实现)
- 用遗传算法走出迷宫(Java版)
- 用神经网络指挥坦克扫雷(Java实现)
- 隐马尔可夫模型及的评估和解码问题
- 向前-向后算法(forward-backward algorithm)