Running LDA in Java: configuring JGibbLDA in MyEclipse


JGibbLDA is a Java implementation of LDA (Latent Dirichlet Allocation) that uses Gibbs sampling for fast parameter estimation and inference. This article describes the steps I used to get JGibbLDA running in MyEclipse.

1. Download the JGibbLDA package and unzip it (URL: http://jgibblda.sourceforge.net/#Griffiths04).

2. Put the unzipped folder into your MyEclipse workspace (if you are not sure where your workspace is, check it via File - Switch Workspace).

3. In MyEclipse, import the folder you placed in the workspace in step 2.

4. After the import succeeds, right-click the project name - Properties - Java Build Path - Libraries - Add JARs, and add the args4j-2.0.6.jar package.

5. Open the file LDACmdOption.java and modify part of the code:

import org.kohsuke.args4j.Option;

public class LDACmdOption {

    @Option(name="-est", usage="Specify whether we want to estimate model from scratch")
    public boolean est = false;

    @Option(name="-estc", usage="Specify whether we want to continue the last estimation")
    public boolean estc = false;

    @Option(name="-inf", usage="Specify whether we want to do inference")
    public boolean inf = true;

    @Option(name="-dir", usage="Specify directory")
    public String dir = "models/casestudy-en";

    @Option(name="-dfile", usage="Specify data file")
    public String dfile = "models/casestudy-en/newdocs.dat";

    @Option(name="-model", usage="Specify the model name")
    public String modelName = "model-01000";

    @Option(name="-alpha", usage="Specify alpha")
    public double alpha = 0.2;

    @Option(name="-beta", usage="Specify beta")
    public double beta = 0.1;

    @Option(name="-ntopics", usage="Specify the number of topics")
    public int K = 100;

    @Option(name="-niters", usage="Specify the number of iterations")
    public int niters = 1000;

    @Option(name="-savestep", usage="Specify the number of steps to save the model since the last save")
    public int savestep = 100;

    @Option(name="-twords", usage="Specify the number of most likely words to be printed for each topic")
    public int twords = 100;

    @Option(name="-withrawdata", usage="Specify whether we include raw data in the input")
    public boolean withrawdata = false;

    @Option(name="-wordmap", usage="Specify the wordmap file")
    public String wordMapFileName = "wordmap.txt";
}

6. Modify the project's Run Configurations: under Java Application select LDA, click (x)= Arguments, and enter: -est -alpha 0.2 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 100 -dir models\casestudy-en -dfile "newdocs.dat"

(Here "newdocs.dat" is the test/training corpus shipped with JGibbLDA; leave it unchanged. Later, our own training corpora must also be generated in the same format as newdocs.dat.)

7. Run. When output like that shown in the figure appears in the console, the run has succeeded!

Analyzing GibbsLDA++ output, following the LDA漫游指南 guide

Before using GibbsLDA++, preprocess the documents: the first line holds the number of documents, and lines 2 through n+1 each hold the word-segmentation result of one document.

GibbsLDA++ computes iteratively and eventually produces these six files:

model-final.others
model-final.phi
model-final.tassign
model-final.theta
model-final.twords
wordmap.txt

Parameter estimation from scratch

Command line:

$ lda -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dfile <string>

in which (parameters in [] are optional):

  • -est: Estimate the LDA model from scratch
  • -alpha <double>: The value of alpha, the hyper-parameter of LDA. The default value of alpha is 50 / K (K is the number of topics). See [Griffiths04] for a detailed discussion of choosing alpha and beta values.
  • -beta <double>: The value of beta, also a hyper-parameter of LDA. Its default value is 0.1.
  • -ntopics <int>: The number of topics. Its default value is 100. This depends on the input dataset. See [Griffiths04] and [Blei03] for a more careful discussion of selecting the number of topics.
  • -niters <int>: The number of Gibbs sampling iterations. The default value is 2000.
  • -savestep <int>: The step (counted by the number of Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
  • -twords <int>: The number of most likely words to print for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print the 20 most likely words per topic each time it saves the model, according to the savestep parameter above.
  • -dfile <string>: The input training data file. See section "Input data format" for a description of input data format.
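The hyper-parameters above map directly onto the collapsed Gibbs sampler at the heart of GibbsLDA++/JGibbLDA. The following is a minimal self-contained sketch of that sampler, not the library's actual code; the class name, method names, and the tiny corpus are my own, but the full-conditional formula is the standard one that -alpha, -beta, and -niters parameterize:

```java
import java.util.Random;

public class MiniLdaGibbs {

    /** Run collapsed Gibbs sampling on docs (word IDs in [0, V))
     *  and return the estimated theta (topic distribution) of document 0. */
    static double[] sampleTheta(int[][] docs, int K, int V,
                                double alpha, double beta, int niters, long seed) {
        int M = docs.length;
        int[][] z   = new int[M][];      // topic assignment of each word
        int[][] nmk = new int[M][K];     // per-document topic counts
        int[][] nkw = new int[K][V];     // per-topic word counts
        int[] nk    = new int[K];        // total words assigned to each topic
        Random rnd = new Random(seed);

        // Random initialization ("estimation from scratch", i.e., -est).
        for (int m = 0; m < M; m++) {
            z[m] = new int[docs[m].length];
            for (int n = 0; n < docs[m].length; n++) {
                int k = rnd.nextInt(K);
                z[m][n] = k; nmk[m][k]++; nkw[k][docs[m][n]]++; nk[k]++;
            }
        }

        double[] p = new double[K];
        for (int iter = 0; iter < niters; iter++) {       // what -niters controls
            for (int m = 0; m < M; m++) {
                for (int n = 0; n < docs[m].length; n++) {
                    int w = docs[m][n], k = z[m][n];
                    nmk[m][k]--; nkw[k][w]--; nk[k]--;    // exclude the current word
                    double total = 0;
                    for (int j = 0; j < K; j++) {         // full conditional p(z = j | rest)
                        p[j] = (nmk[m][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta);
                        total += p[j];
                    }
                    double u = rnd.nextDouble() * total;  // sample a new topic
                    k = 0;
                    double acc = p[0];
                    while (acc < u && k < K - 1) acc += p[++k];
                    z[m][n] = k; nmk[m][k]++; nkw[k][w]++; nk[k]++;
                }
            }
        }
        // theta_mk = (n_mk + alpha) / (N_m + K * alpha), as stored in <model>.theta
        double[] theta = new double[K];
        for (int j = 0; j < K; j++)
            theta[j] = (nmk[0][j] + alpha) / (docs[0].length + K * alpha);
        return theta;
    }

    public static void main(String[] args) {
        int[][] docs = { {0, 0, 1, 1}, {2, 2, 3, 3}, {0, 1, 2, 3} };
        double[] theta = sampleTheta(docs, 2, 4, 0.2, 0.1, 200, 0L);
        System.out.printf("theta[0] = [%.3f, %.3f]%n", theta[0], theta[1]);
    }
}
```

Note how a larger alpha smooths theta toward uniform across topics, while a larger beta smooths the word distributions; this is why the defaults scale alpha as 50 / K.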

Parameter estimation from a previously estimated model

Command line:

$ lda -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]

in which (parameters in [] are optional):

  • -estc: Continue to estimate the model from a previously estimated model.
  • -dir <string>: The directory containing the previously estimated model.
  • -model <string>: The name of the previously estimated model. See section "Outputs" to know how GibbsLDA++ saves outputs on hard disk.
  • -niters <int>: The number of Gibbs sampling iterations to continue estimating. The default value is 2000.
  • -savestep <int>: The step (counted by the number of Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
  • -twords <int>: The number of most likely words to print for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print the 20 most likely words per topic each time it saves the model, according to the savestep parameter above.

Inference for previously unseen (new) data

Command line:

$ lda -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>

in which (parameters in [] are optional):

  • -inf: Do inference for previously unseen (new) data using a previously estimated LDA model.
  • -dir <string>: The directory containing the previously estimated model.
  • -model <string>: The name of the previously estimated model. See section "Outputs" to know how GibbsLDA++ saves outputs on hard disk.
  • -niters <int>: The number of Gibbs sampling iterations for inference. The default value is 20.
  • -twords <int>: The number of most likely words to print for each topic of the new data. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print the 20 most likely words per topic after inference.
  • -dfile <string>: The file containing the new data. See section "Input data format" for a description of the input data format.

Input data format

Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format as follows:

[M]

[document1]

[document2]

...

[documentM]

in which the first line is the total number of documents [M]. Each line after that is one document. [document_i] is the i-th document of the dataset, consisting of a list of N_i words/terms.

[document_i] = [word_{i1}] [word_{i2}] ... [word_{iN_i}]

in which all [word_{ij}] (i=1..M, j=1..N_i) are text strings, separated by blank characters.
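The format above can be produced in a few lines of Java. This sketch is my own helper (the class and method names are not part of JGibbLDA); it serializes a tokenized corpus into the expected layout:

```java
import java.util.List;

public class LdaDataFormat {
    /** Serialize documents into the GibbsLDA++/JGibbLDA input format:
     *  first line = number of documents, then one whitespace-separated
     *  document per line. */
    static String format(List<List<String>> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append(docs.size()).append('\n');
        for (List<String> doc : docs)
            sb.append(String.join(" ", doc)).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two toy documents, already tokenized.
        System.out.print(format(List.of(
            List.of("topic", "model", "lda"),
            List.of("gibbs", "sampling", "inference"))));
    }
}
```

Writing the returned string to a file (e.g., newdocs.dat) yields a corpus that can be passed straight to -dfile.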

Note that the terms document and word here are abstract and should not only be understood as normal text documents. This is because LDA can be used to discover the underlying topic structures of any kind of discrete data. Therefore, GibbsLDA++ is not limited to text and natural language processing but can also be applied to other kinds of data like images and biological sequences. Also, keep in mind that for text/Web data collections, we should first preprocess the data (e.g., removing stop words and rare words, stemming, etc.) before estimating with GibbsLDA++.
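A minimal sketch of such preprocessing is below. The stop-word list and the rare-word threshold are illustrative assumptions, and real pipelines would also stem tokens; the class and method names are my own:

```java
import java.util.*;
import java.util.stream.Collectors;

public class Preprocess {
    // A tiny illustrative stop-word list; real lists have hundreds of entries.
    static final Set<String> STOP = Set.of("the", "a", "of", "and", "to");

    /** Lower-case tokens, remove stop words, and drop words occurring
     *  fewer than minCount times across the corpus. */
    static List<List<String>> clean(List<List<String>> docs, int minCount) {
        // First pass: corpus-wide frequencies of non-stop words.
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> d : docs)
            for (String w : d) {
                String t = w.toLowerCase();
                if (!STOP.contains(t)) freq.merge(t, 1, Integer::sum);
            }
        // Second pass: keep only frequent, non-stop tokens.
        List<List<String>> out = new ArrayList<>();
        for (List<String> d : docs)
            out.add(d.stream().map(String::toLowerCase)
                     .filter(t -> !STOP.contains(t) && freq.getOrDefault(t, 0) >= minCount)
                     .collect(Collectors.toList()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clean(List.of(
            List.of("The", "cat", "cat"),
            List.of("a", "cat", "dog")), 2));
    }
}
```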

Outputs

Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

in which:

<model_name>: is the name of an LDA model, corresponding to the time step at which it was saved to hard disk. For example, the model saved at Gibbs sampling iteration 400 will be named model-00400; similarly, the model saved at iteration 1200 is model-01200. The model from the last Gibbs sampling iteration is named model-final.

<model_name>.others: This file contains some parameters of LDA model, such as:

alpha=?

beta=?

ntopics=? # i.e., number of topics

ndocs=? # i.e., number of documents

nwords=? # i.e., the vocabulary size

liter=? # i.e., the Gibbs sampling iteration at which the model was saved

<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic; each column is a word in the vocabulary.

<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document and each column is a topic.
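A common post-processing step is to read each line of the .theta file and pick the dominant topic for that document. A sketch (my own helper, not part of GibbsLDA++):

```java
public class ThetaReader {
    /** Parse one <model_name>.theta line (one document, one column per
     *  topic) and return the index of the most probable topic. */
    static int topTopic(String thetaLine) {
        String[] cols = thetaLine.trim().split("\\s+");
        int best = 0;
        for (int k = 1; k < cols.length; k++)
            if (Double.parseDouble(cols[k]) > Double.parseDouble(cols[best]))
                best = k;
        return best;
    }

    public static void main(String[] args) {
        // A document whose mass is concentrated on topic 1.
        System.out.println("Top topic: " + topTopic("0.100000 0.700000 0.200000"));
    }
}
```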

<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document, consisting of a list of <word_{ij}>:<topic of word_{ij}> pairs.

<model_name>.twords: This file contains the twords most likely words of each topic, where twords is specified on the command line.

GibbsLDA++ also saves a file called wordmap.txt containing the mapping between words and their integer IDs, because GibbsLDA++ works internally with integer word IDs rather than text strings.
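Such a mapping can be reproduced in a few lines. This sketch (my own code) assigns IDs in first-seen order, which may differ from the exact ordering GibbsLDA++ uses, but illustrates the idea:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordMap {
    /** Assign each distinct word a stable integer ID, in first-seen order,
     *  like the word-to-ID table stored in wordmap.txt. */
    static Map<String, Integer> build(String[][] docs) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (String[] doc : docs)
            for (String w : doc)
                ids.putIfAbsent(w, ids.size());
        return ids;
    }

    public static void main(String[] args) {
        build(new String[][]{{"topic", "model", "topic"}, {"gibbs"}})
            .forEach((w, id) -> System.out.println(id + "\t" + w));
    }
}
```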

Outputs of Gibbs sampling inference for previously unseen data

The outputs of GibbsLDA++ inference are almost the same as those of the estimation process, except that the contents of the files are for the new data. The <model_name> is exactly the same as the filename of the input (new) data.

