Java Lucene (7): Writing an Indexer, Part 2

Source: Internet  Editor: 程序博客网  Time: 2024/05/22 10:52


Listing 6-2 implements basic Chinese word segmentation and is provided for the reader's reference:

 

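import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// A minimal tokenizer that emits one token per character.
// MyselfToken is not defined in this listing. It is used only through its
// (String, int, int) constructor and its getTermText(), getStartOffset()
// and getEndOffset() accessors.
public class MyselfChineseTokenizer {

       protected Reader reader;

       public MyselfChineseTokenizer(Reader reader){
              this.reader = reader;
       }

       private int length;        // number of characters in the current token
       private int start;         // start offset of the current token
       private int offset = 0, bufferIndex = 0, dataLen = 0;
       private final static int MAX_WORD_LEN = 255;
       private final static int IO_BUFFER_SIZE = 1024;
       private final char[] tokenBuffer = new char[MAX_WORD_LEN];
       private final char[] sourceBuffer = new char[IO_BUFFER_SIZE];

       // Returns the next single-character token, or null at end of input.
       public final MyselfToken next() throws IOException{
              length = 0;
              start = offset;

              // Refill the source buffer once it has been fully consumed.
              if(bufferIndex >= dataLen){
                     dataLen = reader.read(sourceBuffer);
                     bufferIndex = 0;
              }
              // read() returns -1 at end of input: there are no more tokens.
              if(dataLen == -1){
                     return null;
              }

              final char ch = sourceBuffer[bufferIndex++];
              offset++;   // advance only after a character has really been consumed
              tokenBuffer[length++] = Character.toLowerCase(ch);
              return new MyselfToken(new String(tokenBuffer,0,length),start,start+length);
       }

       // Collects all remaining tokens into an array.
       public MyselfToken[] getTokenArray() throws IOException{
              List<MyselfToken> tokenList = new ArrayList<MyselfToken>();
              while (true) {
                     MyselfToken token = next();
                     if (token == null) break;
                     tokenList.add(token);
              }
              return tokenList.toArray(new MyselfToken[0]);
       }

       public static void main(String args[]) throws IOException{
              StringReader sr = new StringReader("权利制约是美国大国崛起的软实力");
              MyselfChineseTokenizer tokenizer = new MyselfChineseTokenizer(sr);
              MyselfToken[] token = tokenizer.getTokenArray();
              for(int i = 0; i < token.length; i++){
                     // Prints the term text followed by "start--end", e.g. 权0--1
                     System.out.print(token[i].getTermText());
                     System.out.print(token[i].getStartOffset());
                     System.out.print("--");
                     System.out.print(token[i].getEndOffset());
                     System.out.println("");
              }
       }
}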

The printed output is as follows:

权0--1
利1--2
制2--3
约3--4
是4--5
美5--6
国6--7
大7--8
国8--9
崛9--10
起10--11
的11--12
软12--13
实13--14
力14--15

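As you can see, the text has been split into tokens (one per character), and each token also carries its position information. The second step of our work is therefore done.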
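To make the position information concrete, here is a minimal sketch of how start and end offsets map back onto the original text. The class names OffsetDemo and SimpleToken are hypothetical: SimpleToken is only a stand-in for MyselfToken, whose source is not shown in this article, and it assumes the end offset is exclusive, as the output above suggests.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {

    // Hypothetical stand-in for MyselfToken (not the author's class).
    static final class SimpleToken {
        private final String termText;
        private final int startOffset; // inclusive
        private final int endOffset;   // exclusive (assumption)

        SimpleToken(String termText, int startOffset, int endOffset) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        String getTermText() { return termText; }
        int getStartOffset() { return startOffset; }
        int getEndOffset() { return endOffset; }
    }

    // One token per char, lower-cased, mirroring what the tokenizer above does.
    static List<SimpleToken> tokenize(String text) {
        List<SimpleToken> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            String term = String.valueOf(Character.toLowerCase(text.charAt(i)));
            tokens.add(new SimpleToken(term, i, i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "权利制约是美国大国崛起的软实力";
        for (SimpleToken t : tokenize(text)) {
            // The offsets select exactly the token's span in the source text.
            String slice = text.substring(t.getStartOffset(), t.getEndOffset());
            System.out.println(t.getTermText() + t.getStartOffset() + "--"
                    + t.getEndOffset() + " -> " + slice);
        }
    }
}
```

Because the offsets index into the original string, a search front end could use them later to highlight a matched term without re-tokenizing the text.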

Next, let us test the following text: "权利制约是USA大国崛起的软实力". The result is as follows:

权0--1
利1--2
制2--3
约3--4
是4--5
u5--6
s6--7
a7--8
大8--9
国9--10
崛10--11
起11--12
的12--13
软13--14
实14--15
力15--16

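USA is a single English word, yet our tokenizer split it apart as well, which is clearly not what we want. There is evidently room for improvement; in the next chapter, let us refine the tokenizer together!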
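As a rough illustration of one possible direction (not necessarily the approach the next chapter takes), the sketch below keeps a run of ASCII letters and digits together as one token while still emitting every other character on its own. The class name MixedTokenizerSketch and the helper isAsciiWordChar are hypothetical, and the tokens are returned as plain strings so that the example does not depend on MyselfToken.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class MixedTokenizerSketch {

    // Returns entries formatted as "term start--end" (end offset exclusive).
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int start = i;
            if (isAsciiWordChar(text.charAt(i))) {
                // Consume the whole run of ASCII letters/digits as one token.
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) {
                    i++;
                }
            } else {
                // Any other character (for example a Chinese character)
                // becomes a token by itself, as in the original tokenizer.
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            out.add(term + " " + start + "--" + i);
        }
        return out;
    }

    // Hypothetical helper: true for a-z, A-Z and 0-9 only.
    static boolean isAsciiWordChar(char ch) {
        return (ch >= 'a' && ch <= 'z')
                || (ch >= 'A' && ch <= 'Z')
                || (ch >= '0' && ch <= '9');
    }

    public static void main(String[] args) {
        for (String token : tokenize("权利制约是USA大国崛起的软实力")) {
            System.out.println(token);
        }
    }
}
```

With this grouping, "USA" comes out as the single token usa with offsets 5--8. Whitespace and punctuation are still emitted as single-character tokens here; a real tokenizer would probably skip them.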

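The discussion above covered how to write the most basic tokenizer. I hope it helps readers understand how an indexer works, how a tokenizer works, and how Lucene indexing works. Your comments are welcome, and feel free to contact me via QQ or MSN.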

                                       To be continued