JGibbLDA Output Files: Explanation and Example

Source: Internet  Editor: 程序博客网  Time: 2024/06/13 02:55


JGibbLDA writes the following output files:

The .others file stores the LDA model parameters, such as alpha and beta.

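The .phi file stores the topic-word distribution. Each element is p(word|topic); each row is a topic and each column is a word. The columns cover every word in the vocabulary, not only the configured number of top words.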

The .theta file stores the document-topic distribution. Each element is p(topic|document); each row is a document and each column is a topic probability.

The .tassign file records the topic assigned to each word in the training corpus; each line is one document of the corpus.

The .twords file stores the top words selected for each topic together with their weights.

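wordmap.txt lists every distinct word that appears in the corpus together with its id. Ids are assigned in order of first appearance, but the lines of wordmap.txt are not written in id order (in the example below they are not in alphabetical order either).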

The results are illustrated with an example below:

test_input.txt contains 4 documents: the first two are about sport (football) and the last two are about travel. The contents of test_input.txt are:

4
sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League
France team France Football Federation president national team training session Champions record European competition without recording a single victory
quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station
City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus

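
A minimal sketch of reading such an input file, assuming the layout is a document count on the first line followed by one whitespace-tokenized document per line. `parse_corpus` is a hypothetical helper for illustration, not part of JGibbLDA.

```python
def parse_corpus(text):
    """Parse an LDA input file held in a string.

    Assumed layout: line 1 is the number of documents, and each
    following non-empty line is one document with tokens separated
    by whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_docs = int(lines[0])
    docs = [ln.split() for ln in lines[1:]]
    if len(docs) != n_docs:
        raise ValueError(f"header says {n_docs} documents, found {len(docs)}")
    return docs


# Tiny made-up sample using a few words from the example corpus.
sample = "2\nsport Spanish football\nquit my job to travel\n"
docs = parse_corpus(sample)
print(len(docs), docs[1])
```

Checking the header against the actual number of lines catches a corpus whose line breaks were lost, which is an easy mistake when copying the file around.
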
The LDA model parameters are as follows:

alpha = 0.5
beta = 0.1
topicNum = 2
niters = 1000
savestep = 1000
twords = 10

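Two topics are configured, with 10 words shown for each topic. First look at wordmap.txt. test_input.txt contains 81 distinct words, so the first line of wordmap.txt is the total number of words (81), and word ids start from 0. The contents are: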

81
competition 4
Central 54
ice 74
earthy 78
without 28
building 62
passport 38
Federation 21
record 26
club 5
Spanish 1
plan 51
floral 76
League 17
goal 13
drinking 61
Fleet 66
keeper 10
destinations 53
foundations 64
is 40
Have 49
dry 59
City 56
spicy 77
European 27
my 34
privileged 44
Station 55
savoury 79
served 71
London 58
campaign 14
tonic 69
shots 11
job 35
tickets 6
be 70
season's 15
session 25
fruit 75
for 42
association 3
recording 29
best 12
training 24
gin 60
world 39
and 68
of 57
national 23
River 65
retired 47
older 63
France 18
win 8
winners 9
a 30
or 46
stories 48
flavour 67
cubed 73
victory 32
rich 45
football 2
team 19
Football 20
citrus 80
single 31
the 43
Champions 16
with 72
scored 7
luxury 41
quit 33
to 36
visa-free 52
travel 37
sport 0
president 22
long-term 50

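
A minimal sketch of loading wordmap.txt into a lookup table, assuming the file has the vocabulary size on its first line and one `word id` pair per line after that. `parse_wordmap` is a hypothetical helper, not a JGibbLDA function.

```python
def parse_wordmap(text):
    """Return a {word: id} dict built from the contents of wordmap.txt.

    Assumed layout: line 1 is the vocabulary size, and every following
    non-empty line is 'word id'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    size = int(lines[0])
    word2id = {}
    for ln in lines[1:]:
        word, idx = ln.rsplit(None, 1)  # split at the last run of whitespace
        word2id[word] = int(idx)
    if len(word2id) != size:
        raise ValueError(f"header says {size} words, found {len(word2id)}")
    return word2id


# Three entries taken from the example wordmap, with the header adjusted to 3.
sample = "3\ncompetition 4\nsport 0\nSpanish 1\n"
word2id = parse_wordmap(sample)
id2word = {i: w for w, i in word2id.items()}
print(id2word[0])
```

The inverted `id2word` table is what the other output files need, since .tassign, .phi and .twords all refer to words by id or by column position.
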
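In the .tassign file each line corresponds to one document. Each element has the form word_id:topic_id, meaning that this occurrence of the word with id word_id is assigned to topic topic_id. The contents are: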

0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:1
18:0 19:0 18:0 20:0 21:0 22:0 23:0 19:0 24:0 25:0 16:0 26:0 27:0 4:0 28:0 29:0 30:0 31:0 32:0
33:1 34:1 35:1 36:1 37:1 38:0 39:1 37:1 40:0 30:1 41:1 42:1 43:1 44:1 43:1 45:1 46:1 43:1 47:1 37:1 48:1 49:1 30:1 50:1 51:1 52:1 53:1 54:1 55:0
56:1 57:1 58:0 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 60:1 68:1 69:0 70:1 71:1 72:1 73:1 74:0 75:1 76:1 77:1 78:1 79:0 80:0

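
The word_id:topic_id pairs are easy to tally per topic. The sketch below is illustrative only; `parse_tassign_line` and `topic_counts` are hypothetical helper names, and the sample string is a short excerpt of consecutive pairs from the listing above.

```python
from collections import Counter


def parse_tassign_line(line):
    """Turn a string of 'word_id:topic_id' items into (word_id, topic_id) tuples."""
    pairs = []
    for item in line.split():
        word_id, topic_id = item.split(":")
        pairs.append((int(word_id), int(topic_id)))
    return pairs


def topic_counts(line):
    """Count how many tokens in the line are assigned to each topic."""
    return Counter(topic for _, topic in parse_tassign_line(line))


# Excerpt: Champions (16) in topic 0, League (17) in topic 1, and so on.
sample = "15:0 16:0 17:1 18:0"
print(topic_counts(sample))
```

These per-topic counts are the raw material for both .theta (counts per document) and .phi (counts per word).
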
The .twords file simply lists the highest-weighted words of each topic together with their weights:

Topic 0th:
    competition 0.04030710172744722
    Champions 0.04030710172744722
    France 0.04030710172744722
    team 0.04030710172744722
    sport 0.02111324376199616
    Spanish 0.02111324376199616
    football 0.02111324376199616
    association 0.02111324376199616
    club 0.02111324376199616
    tickets 0.02111324376199616
Topic 1th:
    travel 0.05525846702317291
    the 0.05525846702317291
    a 0.03743315508021391
    gin 0.03743315508021391
    League 0.0196078431372549
    quit 0.0196078431372549
    my 0.0196078431372549
    job 0.0196078431372549
    to 0.0196078431372549
    world 0.0196078431372549

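
A list like this can be derived from one row of topic-word probabilities by sorting. The sketch below is an assumption about how such a ranking can be produced, not JGibbLDA's own code; `top_words` is a hypothetical helper and the probabilities in the usage example are made up. In the listing above, words with equal weight appear in ascending word-id order, which a stable sort reproduces.

```python
def top_words(phi_row, id2word, n):
    """Return the n most probable (word, probability) pairs for one topic.

    phi_row[i] is p(word i | topic). sorted() is stable, so words with
    equal probability keep ascending word-id order.
    """
    ranked = sorted(range(len(phi_row)), key=lambda i: -phi_row[i])
    return [(id2word[i], phi_row[i]) for i in ranked[:n]]


# Toy vocabulary and made-up probabilities, for illustration only.
id2word = {0: "sport", 1: "Spanish", 2: "competition", 3: "League"}
phi_row = [0.2, 0.2, 0.5, 0.1]
print(top_words(phi_row, id2word, 2))
```
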
Next come the two most important output files, .phi and .theta.

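The .phi file is the topic-word matrix. This test has only 2 topics, so the matrix has 2 rows. The number of columns is not the number of topic words set in the parameters: that setting only controls how many words are displayed. The computation uses every word, so the columns cover all the words in wordmap.txt and the column dimension is 81. The .phi file is: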

0.021113 0.021113 0.021113 0.021113 0.040307 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.040307 0.001919 0.040307 0.040307 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.021113 0.021113
0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.019608 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.037433 0.001783 0.001783 0.019608 0.019608 0.019608 0.019608 0.055258 0.001783 0.019608 0.001783 0.019608 0.019608 0.055258 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.001783 0.019608 0.037433 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.019608 0.019608 0.001783 0.001783
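
The values in .phi are consistent with the usual smoothed estimate for Gibbs-sampled LDA, p(word|topic) = (n_kw + beta) / (n_k + V * beta), where n_kw is the number of times word w is assigned to topic k, n_k is the total number of tokens assigned to topic k, and V is the vocabulary size. The sketch below illustrates that assumption and is not code taken from JGibbLDA. `phi_estimate` is a hypothetical helper, and the token totals (44 for topic 0, 48 for topic 1) were counted from the .tassign pairs shown earlier.

```python
def phi_estimate(n_kw, n_k, vocab_size, beta):
    """Smoothed topic-word probability (assumed formula, see text above)."""
    return (n_kw + beta) / (n_k + vocab_size * beta)


V = 81         # vocabulary size, first line of wordmap.txt
BETA = 0.1     # beta from the model parameters
N_TOPIC0 = 44  # tokens assigned to topic 0 in .tassign
N_TOPIC1 = 48  # tokens assigned to topic 1 in .tassign

# "competition" (id 4) is assigned to topic 0 twice.
print(round(phi_estimate(2, N_TOPIC0, V, BETA), 6))  # 0.040307
# "League" (id 17) is never assigned to topic 0.
print(round(phi_estimate(0, N_TOPIC0, V, BETA), 6))  # 0.001919
# "travel" (id 37) is assigned to topic 1 three times.
print(round(phi_estimate(3, N_TOPIC1, V, BETA), 6))  # 0.055258
```

Because of the beta term, a word never assigned to a topic still gets a small non-zero probability, which is why no entry in .phi is exactly 0.
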

The .theta file is the document-topic matrix. This test has 4 documents and 2 topics, so the matrix has 4 rows and 2 columns:

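0.921053 0.078947
0.975 0.025
0.116667 0.883333
0.203704 0.796296

The .theta matrix shows directly that the first two documents have a large weight on topic 0 and a small weight on topic 1, while the last two documents are the opposite. The resulting grouping therefore puts the first two documents in one class and the last two documents in another.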
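
The grouping can be read off mechanically by taking the most probable topic in each row. The sketch below uses the .theta values from this example. `dominant_topic` and `theta_estimate` are hypothetical helpers, and the smoothed formula (n_dk + alpha) / (n_d + K * alpha) is the usual estimate for Gibbs-sampled LDA, assumed here rather than taken from the JGibbLDA source. The counts 17 and 1 for the first document are likewise an assumption that is consistent with the first row of .theta.

```python
def dominant_topic(theta_row):
    """Index of the topic with the highest probability in one .theta row."""
    return max(range(len(theta_row)), key=lambda k: theta_row[k])


def theta_estimate(counts, alpha):
    """Smoothed document-topic proportions from per-topic token counts."""
    total = sum(counts) + len(counts) * alpha
    return [(n + alpha) / total for n in counts]


theta = [
    [0.921053, 0.078947],
    [0.975, 0.025],
    [0.116667, 0.883333],
    [0.203704, 0.796296],
]
labels = [dominant_topic(row) for row in theta]
print(labels)  # [0, 0, 1, 1]

# Assumed counts for document 1: 17 tokens in topic 0, 1 token in topic 1.
first_row = [round(p, 6) for p in theta_estimate([17, 1], alpha=0.5)]
print(first_row)  # [0.921053, 0.078947]
```
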


---- end -----


