Weka text clustering (3): converting text to ARFF

Source: Internet. Published by: rs232转网络. Editor: 程序博客网. Date: 2024/06/05 08:43

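To run cluster analysis in Weka, the text data must first be converted into ARFF, the format Weka understands. The Instances class is the dataset type Weka works with, and its toString method returns the data in ARFF format. For text clustering, the ARFF data looks like this: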

@relation patent

@attribute text string

@data
'Content of the first article'

'Content of the second article'

......
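
To make the layout above concrete, here is a minimal plain-Java sketch that writes the same single-attribute ARFF text by hand. It does not use Weka: the class name ArffSketch and the methods quote and toArff are made up for illustration, and the quoting rule (backslash-escaping single quotes, backslashes and line breaks) is a simplified assumption rather than Weka's exact behavior. In practice, Instances.toString produces this text for you.

```java
import java.util.Arrays;
import java.util.List;

public class ArffSketch {

    // Quote one document as an ARFF string value (simplified escaping, an assumption).
    public static String quote(String text) {
        String escaped = text
                .replace("\\", "\\\\")   // backslashes first, so later escapes are not doubled
                .replace("'", "\\'")
                .replace("\r", "\\r")
                .replace("\n", "\\n");
        return "'" + escaped + "'";
    }

    // Build ARFF text with a single string attribute named "text".
    public static String toArff(String relation, List<String> documents) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n\n");
        sb.append("@data\n");
        for (String doc : documents) {
            sb.append(quote(doc)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Content of the first article",
                "Content of the second article");
        System.out.print(toArff("patent", docs));
    }
}
```

Each document becomes one quoted line in the @data section, which is why a whole article has to be passed as a single string value.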

After some experimenting, I found three main ways to convert text into an Instances object.

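(1) Connect to a database. Weka's support for database connections is poor: you have to unpack the Weka jar, change the parameters inside it, and repack it before the connection works properly. Examples of the parameter changes are easy to find through Baidu; the link given here is to a tutorial on calling Weka from Java once the parameters have been changed. This approach is particularly cumbersome and not practical.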

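(2) Use TextDirectoryLoader. This class is a Loader that ships with Weka; it reads the text files under a folder and converts them into ARFF format. Calling it is very simple, but a few points need attention. The first is how the files are laid out: each document is stored in its own file, but text files must not sit directly in the top-level folder. Instead, create one subfolder per category and put the files there. For example, under the d:\text directory you would create subfolders such as "class1" and "class2" and place the text files inside them. Since the topic here is text clustering with Weka, there are no categories to separate, so all the texts can simply go into a single subfolder. Continuing the example, the code for using TextDirectoryLoader is as follows.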

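import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

// Both setDirectory and getDataSet can throw IOException
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File("d:\\text"));
Instances dataRaw = loader.getDataSet();

// Unset the class index so that a clusterer will accept the data
dataRaw.setClassIndex(-1);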

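This Instances object is the data in Weka's format that we need. Instances loaded through TextDirectoryLoader carry a class attribute (taken from the subfolder names), whereas the k-means clustering algorithm does not accept Instances with a class set, so the class index must be set to -1 before k-means can process the data.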
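
The folder layout that TextDirectoryLoader expects can be prepared with a few lines of plain Java. The sketch below is illustrative only: the class name CorpusLayout, the method writeCorpus, the subfolder name class1 and the file names doc0.txt, doc1.txt are all assumptions chosen for the example. The only requirement taken from the text above is "one subfolder, one file per document". A temporary directory stands in for d:\text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CorpusLayout {

    // Write each document to <root>/<subfolder>/doc<i>.txt and return the subfolder.
    public static Path writeCorpus(Path root, String subfolder, List<String> documents)
            throws IOException {
        Path dir = root.resolve(subfolder);
        Files.createDirectories(dir); // also creates root if it is missing
        for (int i = 0; i < documents.size(); i++) {
            Path file = dir.resolve("doc" + i + ".txt");
            Files.write(file, documents.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory plays the role of the top-level text folder
        Path root = Files.createTempDirectory("text");
        Path dir = writeCorpus(root, "class1",
                Arrays.asList("first article", "second article"));
        System.out.println("Documents written to " + dir);
    }
}
```

The root path, not the subfolder, is what would then be passed to loader.setDirectory.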

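(3) Construct the Instances directly. This approach comes from reading the TextDirectoryLoader source code: since it can read a folder and convert it to ARFF, it must contain a way of building an Instances object directly. Having looked at how it does this, the usage is as follows: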

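import java.util.List;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

// Written against the Weka 3.6 API. In Weka 3.7 and later, Instance is an
// interface, so use new DenseInstance(1.0, newInst) instead.
public Instances getStruct(List<String> list) {
    // One string attribute named "text"; the null value list makes it a string attribute
    FastVector atts = new FastVector();
    atts.addElement(new Attribute("text", (FastVector) null));
    Instances data = new Instances("patent", atts, 0);

    for (String str : list) {
        double[] newInst = new double[1];
        // Word segmentation of the text is omitted here to keep the code clear
        // addStringValue stores the string and returns its index in the attribute
        newInst[0] = (double) data.attribute(0).addStringValue(str);
        data.add(new Instance(1.0, newInst));
    }

    return data;
}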

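The code above takes the texts to be clustered in list, where each String holds one document. To keep the structure clear, the word segmentation step is left out; in practice, segmentation should happen at that point. Below is a version that segments the text first and then converts it to ARFF.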

public Instances getStruct(List<Agriculture> agList) {
    FastVector atts = new FastVector();
    atts.addElement(new Attribute("text", (FastVector) null));
    Instances data = new Instances("patent", atts, 0);

    FilterRecognition stopFilter;
    try {
        // Initialise the FilterRecognition used to filter out stop words;
        // it is a utility class from ansj
        stopFilter = InitStopWords();
    } catch (Exception e) {
        // Fail fast rather than continue with a null filter
        throw new IllegalStateException("Could not load the stop-word list", e);
    }

    for (Agriculture ag : agList) {
        System.out.println(ag.getContent());
        double[] newInst = new double[1]; // the class attribute is not counted

        // Segment the text, drop stop words, and join the remaining words with spaces
        String content = ToAnalysis.parse(ag.getContent())
                .recognition(stopFilter).toStringWithOutNature(" ");
        System.out.println("Segmentation result: " + content);
        newInst[0] = (double) data.attribute(0).addStringValue(content);
        data.add(new Instance(1.0, newInst));
    }
    return data;
}

// Build the stop-word filter

public FilterRecognition InitStopWords() throws Exception {
    ArrayList<String> stopList = new ArrayList<String>();
    String stopWordTable = "src/stopwords.txt";
    // Read the stop-word file (UTF-8, one stop word per line);
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader StopWordFileBr = new BufferedReader(
            new InputStreamReader(new FileInputStream(stopWordTable),
                    "UTF-8"))) {
        String stopWord;
        while ((stopWord = StopWordFileBr.readLine()) != null) {
            stopList.add(stopWord);
        }
    }

    FilterRecognition filterRecognition = new FilterRecognition();
    filterRecognition.insertStopWords(stopList);

    return filterRecognition;
}
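
For readers without ansj at hand, the effect of stop-word filtering on already segmented text can be sketched in plain Java. This is a stand-in and not how ansj implements it: the class StopWordSketch and its methods loadStopWords and removeStopWords are hypothetical names, and the sketch assumes the text has already been segmented into space-separated tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {

    // Collect stop words from lines (one word per line), skipping blank lines.
    public static Set<String> loadStopWords(List<String> lines) {
        Set<String> stopWords = new HashSet<String>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Drop stop words from space-separated tokens and re-join with single spaces.
    public static String removeStopWords(String segmented, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopWords = loadStopWords(Arrays.asList("the", "of", ""));
        System.out.println(removeStopWords("the content of the first article", stopWords));
    }
}
```

The filtered, space-joined string is the kind of value that would be passed to addStringValue in the method above.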

Those are the ways to convert text into ARFF format. Getting this far means you are through the door of using Weka, which is a big step toward success.

0 0