LDA (Part 2)
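The original code can be downloaded here. It was written in C by David Blei (blei@cs.princeton.edu), the author of the original paper; the paper itself is at http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf.
LDA is a hierarchical probabilistic model of documents. \alpha is a scalar, and \beta_{1:K} are K distributions over words (called topics).
Download lda-c.tgz, unpack it, and build it with make; this produces the lda executable.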
lda est [initial alpha] [k] [settings] [data] [random/seeded/*] [directory]
lda inf [settings] [model] [data] [name]
The first command fits the model; est stands for estimate.
The second command runs inference; inf stands for inference.
The main function is in lda-estimate.c.
As implemented here, a K topic LDA model assumes the following generative process of an N word document:
1. \theta | \alpha ~ Dirichlet(\alpha, ..., \alpha)
2. for each word n = {1, ..., N}:
a. z_n | \theta ~ Mult(\theta)
b. w_n | z_n, \beta ~ Mult(\beta_{z_n})
This code implements variational inference of \theta and z_{1:N} for a document, and estimation of the topics \beta_{1:K} and Dirichlet parameter \alpha.
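The generative process above can be sketched in a few lines of Python. This is only an illustration, not part of lda-c: the function name sample_document, the toy \beta matrix, and the seed are all made up for the example, and the Dirichlet draw is done by normalizing Gamma variates.

```python
import random

def sample_document(alpha, beta, n_words, rng):
    """Draw one document from the LDA generative process.

    alpha   : scalar parameter of a symmetric Dirichlet
    beta    : list of K rows, each a probability distribution over V word ids
    n_words : document length N
    rng     : a random.Random instance
    """
    k = len(beta)
    # Step 1: theta ~ Dirichlet(alpha, ..., alpha), via normalized Gamma draws.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    theta = [x / total for x in g]

    topics, words = [], []
    vocab_ids = range(len(beta[0]))
    for _ in range(n_words):
        # Step 2a: z_n ~ Mult(theta)
        z = rng.choices(range(k), weights=theta)[0]
        # Step 2b: w_n ~ Mult(beta[z_n])
        w = rng.choices(vocab_ids, weights=beta[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy example (made-up numbers): K = 2 topics over a V = 4 word vocabulary.
toy_beta = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = sample_document(1.0, toy_beta, 8, random.Random(0))
```

Inference in lda-c goes the other way: given only the words, it recovers approximate posteriors over \theta and z_{1:N}.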
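As the previous post explained, K is a chosen parameter giving the number of topics (the variable NTOPICS in the code). \alpha is the parameter of the Dirichlet distribution; in general it is a k-dimensional vector, but here it is defined as a scalar, a single value. The lda est command takes an initial \alpha, and the program estimates the final \alpha and \beta from the training data. \beta was explained in the previous post: it is a k*V matrix, where k is the number of topics and V is the number of words in the vocabulary. Fitting the model amounts to estimating \alpha and \beta.
The remaining arguments of lda est are explained below.
settings: the configuration file, in the following format: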
var max iter [integer e.g., 10 or -1]
var convergence [float e.g., 1e-8]
em max iter [integer e.g., 100]
em convergence [float e.g., 1e-5]
alpha [fit/estimate]
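var max iter: the maximum number of iterations for each document; -1 means no limit, so only the convergence criterion decides when to stop.
var convergence: the convergence criterion for the per-document estimation; iteration stops when (score_old - score) / abs(score_old) falls below this value (or the maximum number of iterations is reached).
em max iter: the maximum number of EM iterations.
em convergence: the convergence criterion for EM.
alpha: fit keeps \alpha unchanged across iterations; estimate means \alpha is estimated as well.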
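The stopping rule shared by the var and em settings can be sketched as a small loop. This is a minimal sketch, not the lda-c code: run_until_converged and make_toy_step are hypothetical names, and the sketch compares the magnitude of the relative change against the threshold so that it behaves the same whether the score rises or falls (an assumption on top of the formula above).

```python
def run_until_converged(step, max_iter, tol):
    """Call step() repeatedly; each call returns the current score.

    Stops when |(score_old - score) / score_old| < tol, or after max_iter
    iterations. max_iter = -1 means there is no iteration limit.
    """
    score = None
    score_old = None
    n_iter = 0
    while max_iter == -1 or n_iter < max_iter:
        score = step()
        n_iter += 1
        if score_old is not None and score_old != 0:
            rel_change = (score_old - score) / abs(score_old)
            if abs(rel_change) < tol:
                break
        score_old = score
    return score, n_iter

def make_toy_step():
    """Toy score sequence (made up): -100, -55, -32.5, ... approaching -10."""
    state = {"i": 0}
    def step():
        score = -10.0 - 90.0 * 0.5 ** state["i"]
        state["i"] += 1
        return score
    return step

final_score, n_used = run_until_converged(make_toy_step(), -1, 1e-3)
capped_score, n_capped = run_until_converged(make_toy_step(), 3, 1e-3)
```

With max_iter set to -1 the loop relies on the threshold alone, so a threshold that is never reached would make it run forever.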
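data: the data file (the data format is described later).
random/seeded/*: how the model, that is the \beta matrix, is initialized. random initializes it with random values; seeded builds it from a randomly drawn document, with smoothing; * loads it from an existing model. The code involves the notion of sufficient statistics, stored in the struct lda_suffstats, which holds a k*V two-dimensional array class_word. I do not know how it relates to log_prob_w (also a k*V two-dimensional array) in the lda_model struct; this remains an open question.
directory: the output directory.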
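On the open question about class_word and log_prob_w: in the usual variational EM for LDA, the E-step accumulates expected word counts per topic, and the M-step normalizes each topic's counts and takes logarithms. One plausible reading, not checked against the lda-c source here, is that class_word holds those counts and log_prob_w the resulting log-probabilities. The sketch below shows that M-step under this assumption; mle_log_prob_w and the floor value for zero counts are hypothetical.

```python
import math

def mle_log_prob_w(class_word, floor=-100.0):
    """Turn a K x V table of expected counts into K x V log-probabilities.

    Each row is normalized by its own total, so exp() of a row sums to 1
    (up to the entries that were floored).
    """
    log_prob_w = []
    for counts in class_word:
        total = sum(counts)
        row = []
        for c in counts:
            if c > 0:
                row.append(math.log(c) - math.log(total))
            else:
                row.append(floor)  # stand-in for log(0); assumed value
        log_prob_w.append(row)
    return log_prob_w

# Toy counts (made up): 2 topics, 2 words.
toy_log_prob_w = mle_log_prob_w([[1.0, 3.0], [2.0, 0.0]])
```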
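A small sample corpus is available here. Download it and unpack it into the directory that contains the lda program.
1. Extracting topics from the corpus
Run the following command, which asks for 100 topics: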
./lda est 1 100 settings.txt ../ap/ap.dat random log
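The run took about 2 h 11 min; an earlier run with 10 topics took under ten minutes.
To see the top 5 words of each topic, run python topics.py ./log/final.beta ../ap/vocab.txt 5
Here are the top 5 words of the first 5 topics: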
topic 000
hospital
doctors
heart
hospitals
surgery
topic 001
drug
panama
noriega
states
united
topic 002
presley
patients
ruby
years
record
topic 003
computer
program
security
drug
service
topic 004
government
elections
election
party
president
Each of these word lists is clearly related to one specific theme.
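The kind of lookup that produces such a listing can be sketched as follows. This is not the topics.py shipped with lda-c; it assumes that final.beta holds one line per topic with V whitespace-separated log-probabilities and that vocab.txt holds one word per line, in the same order. The function name top_words and the toy data are made up.

```python
def top_words(beta_lines, vocab, n_top):
    """Return, for each topic, the n_top words with the largest weight."""
    result = []
    for line in beta_lines:
        weights = [float(x) for x in line.split()]
        if not weights:
            continue  # skip blank lines
        # Word ids sorted from highest to lowest log-probability.
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in order[:n_top]])
    return result

# Toy stand-ins for the lines of final.beta and vocab.txt (made-up values).
toy_beta_lines = ["-0.2 -3.0 -1.9", "-4.0 -0.1 -2.5"]
toy_vocab = ["police", "school", "drug"]
toy_top = top_words(toy_beta_lines, toy_vocab, 2)
for k, words_k in enumerate(toy_top):
    print("topic %03d" % k)
    for w in words_k:
        print(w)
```

For real files, the two inputs would come from reading final.beta and vocab.txt line by line.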
2. Inferring the topics of a new document with lda
./lda inf inf-settings.txt log/final ../ap/test.data test/test
final is the prefix of the model files produced by the training run above (final.gamma, final.beta); ../ap/test.data is our test data.
Two files will be created: [name].gamma are the variational Dirichlet parameters for each document;
[name].likelihood is the bound on the likelihood for each document.
I used the first article of the corpus:
A 16-year-old student at a private Baptist school who allegedly killed one teacher and wounded another before firing into a filled classroom apparently ``just snapped,'' the school's pastor said. ``I don't know how it could have happened,'' said George Sweet, pastor of Atlantic Shores Baptist Church. ``This is a good, Christian school. We pride ourselves on discipline. Our kids are good kids.'' The Atlantic Shores Christian School sophomore was arrested and charged with first-degree murder, attempted murder, malicious assault and related felony charges for the Friday morning shooting. Police would not release the boy's name because he is a juvenile, but neighbors and relatives identified him as Nicholas Elliott. Police said the student was tackled by a teacher and other students when his semiautomatic pistol jammed as he fired on the classroom as the students cowered on the floor crying ``Jesus save us! God save us!'' Friends and family said the boy apparently was troubled by his grandmother's death and the divorce of his parents and had been tormented by classmates. Nicholas' grandfather, Clarence Elliott Sr., said Saturday that the boy's parents separated about four years ago and his maternal grandmother, Channey Williams, died last year after a long illness. The grandfather also said his grandson was fascinated with guns. ``The boy was always talking about guns,'' he said. ``He knew a lot about them. He knew all the names of them _ none of those little guns like a .32 or a .22 or nothing like that. He liked the big ones.'' The slain teacher was identified as Karen H. Farley, 40. The wounded teacher, 37-year-old Sam Marino, was in serious condition Saturday with gunshot wounds in the shoulder. Police said the boy also shot at a third teacher, Susan Allen, 31, as she fled from the room where Marino was shot. He then shot Marino again before running to a third classroom where a Bible class was meeting. 
The youngster shot the glass out of a locked door before opening fire, police spokesman Lewis Thurston said. When the youth's pistol jammed, he was tackled by teacher Maurice Matteson, 24, and other students, Thurston said. ``Once you see what went on in there, it's a miracle that we didn't have more people killed,'' Police Chief Charles R. Wall said. Police didn't have a motive, Detective Tom Zucaro said, but believe the boy's primary target was not a teacher but a classmate. Officers found what appeared to be three Molotov cocktails in the boy's locker and confiscated the gun and several spent shell casings. Fourteen rounds were fired before the gun jammed, Thurston said. The gun, which the boy carried to school in his knapsack, was purchased by an adult at the youngster's request, Thurston said, adding that authorities have interviewed the adult, whose name is being withheld pending an investigation by the federal Bureau of Alcohol, Tobacco and Firearms. The shootings occurred in a complex of four portable classrooms for junior and senior high school students outside the main building of the 4-year-old school. The school has 500 students in kindergarten through 12th grade. Police said they were trying to reconstruct the sequence of events and had not resolved who was shot first. The body of Ms. Farley was found about an hour after the shootings behind a classroom door.
Inspecting test-gamma.dat shows that the topics with the highest probability are 44 and 80:
topic 044
school
students
student
schools
education
board
teachers
university
college
high
topic 080
police
mrs
man
two
yearold
arrested
shot
night
found
city
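Picking the dominant topics out of a gamma file can be sketched like this. The sketch assumes each line of the file holds the K variational Dirichlet parameters of one document, separated by whitespace; dividing each parameter by the line total gives the posterior mean topic proportions. The function name top_topics and the toy line are made up.

```python
def top_topics(gamma_line, n_top):
    """Return the n_top (topic index, proportion) pairs for one document."""
    gamma = [float(x) for x in gamma_line.split()]
    total = sum(gamma)
    # Mean of a Dirichlet(gamma): each parameter divided by the sum.
    proportions = [g / total for g in gamma]
    order = sorted(range(len(gamma)), key=lambda k: gamma[k], reverse=True)
    return [(k, proportions[k]) for k in order[:n_top]]

# Toy line with K = 4 topics (made-up values); topics 1 and 3 dominate.
toy_gamma_line = "0.1 6.0 0.1 3.8"
best = top_topics(toy_gamma_line, 2)
```

Applied to the single line of test-gamma.dat with n_top set to 2, this is the kind of lookup that yields the two topic numbers reported above.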
References:
http://hi.baidu.com/lewutian/item/62da5818b716cc797a5f258d
http://www.cs.princeton.edu/~blei/lda-c/