Summary of Chinese Sorting Methods in Java



2011-04-07 13:10:51 | Category: Java

1      Problem statement

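Sorting Chinese characters differs from sorting English letters. There are two main orderings: by pinyin (starting from the first letter of the romanized reading) and by stroke count. Pinyin order is used in most cases. So how can Chinese characters be sorted by their pinyin? Fortunately, the character sets solve much of this problem for us.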

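The best-known character sets that include Chinese characters are GB2312 and GBK, the latter being an extension of the former. GB2312 was designed so that its commonly used (level-1) Chinese characters appear in pinyin order, so the character code of such a character tells us where it falls in pinyin order. Because GBK extends GB2312 and is fully compatible with it, only adding less common characters beyond the GB2312 range, the same approach yields pinyin order for those characters under GBK as well.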
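The idea that the character code reflects pinyin order can be checked with a small sketch. The class and method names here (GbkOrderDemo, gbkCode, sortByGbkCode) are illustrative and not part of the original article, and the sketch assumes the runtime provides the GBK charset, as standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GbkOrderDemo {
    // Assumption: the GBK charset is available in this runtime.
    private static final Charset GBK = Charset.forName("GBK");

    /** Returns the GBK code of a single character as an unsigned integer. */
    static int gbkCode(String ch) {
        int code = 0;
        for (byte b : ch.getBytes(GBK)) {
            code = (code << 8) | (b & 0xFF);
        }
        return code;
    }

    /** Sorts single-character strings by their GBK code. */
    static List<String> sortByGbkCode(List<String> chars) {
        List<String> copy = new ArrayList<>(chars);
        copy.sort((a, b) -> Integer.compare(gbkCode(a), gbkCode(b)));
        return copy;
    }

    public static void main(String[] args) {
        // Commonly used characters: their codes should follow pinyin order
        List<String> input = Arrays.asList("足", "排", "香", "板", "篮");
        for (String ch : sortByGbkCode(input)) {
            System.out.printf("%s -> 0x%04X%n", ch, gbkCode(ch));
        }
    }
}
```

Running main prints each character with its GBK code, which makes it easy to see whether a given character falls inside the pinyin-ordered range.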

2      Solution

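The Java runtime uses Unicode internally, and text in every character set is converted to Unicode. As a result, pinyin sorting of the Chinese characters in GB2312 and GBK can be implemented conveniently.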

Test code:

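import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Comparator;
import java.util.Locale;

public class NormalComparator implements Comparator<Object> {

    // Locale-sensitive collator for Chinese (China)
    RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.CHINA);

    public int compare(Object o1, Object o2) {
        // Compare the string forms of both objects with the Chinese collator
        return collator.compare(o1.toString(), o2.toString());
    }
}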

Code notes:

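The Collator class performs locale-sensitive String comparison. Use the static factory method getInstance to obtain a Collator for a given locale; for example, Collator.getInstance(Locale.CHINA) returns the Collator for Chinese as used in China.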

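RuleBasedCollator is a concrete subclass of Collator. It is the class to use when the details of a specific collation strategy have to be implemented or an existing strategy has to be modified. For efficiency, RuleBasedCollator has the following restrictions (other subclasses may be used for more complex languages):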

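- If a special collation rule controlled by a <modifier> is specified, it applies to the whole collator object.

- All characters not mentioned in the rules are placed at the end of the collation order.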

The collation table is composed of a list of collation rules, where each rule takes one of the following three forms:

<modifier>

<relation> <text-argument>

<reset> <text-argument>

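The rule elements are defined as follows:

- Text argument: a text argument is any sequence of characters, excluding special characters (that is, common whitespace characters [0009-000D, 0020] and rule syntax characters [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If these characters are needed, put them in single quotes (for example, ampersand => '&'). Note that unquoted whitespace characters are ignored; for example, b c is treated as bc.

- Modifier: there are currently two modifiers that turn on special collation rules.

    - '@' : Turns on backwards sorting of accents (secondary differences), as in French.

    - '!' : Turns on Thai/Lao vowel-consonant swapping. If this rule is in force when a Thai vowel in the range \U0E40-\U0E44 precedes a Thai consonant in the range \U0E01-\U0E2E, or a Lao vowel in the range \U0EC0-\U0EC4 precedes a Lao consonant in the range \U0E81-\U0EAE, the vowel is placed after the consonant for collation purposes.

- Relation: the relations are the following:

    - '<' : Greater, as a letter difference (primary)

    - ';' : Greater, as an accent difference (secondary)

    - ',' : Greater, as a case difference (tertiary)

    - '=' : Equal

- Reset: there is a single reset, used primarily for contractions and expansions, but it can also be used to add a modification at the end of a set of rules.

    - '&' : Indicates that the next rule follows the position where the reset text argument would be sorted.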
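To see how the relation characters interact with collator strength, here is a minimal sketch. The rule string and the class name RelationDemo are made up for illustration; only the standard java.text API is used.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class RelationDemo {
    /** Builds a collator from a tiny hand-written rule string (illustrative only). */
    static RuleBasedCollator build() {
        try {
            // '<' separates letters (primary); ',' separates case variants (tertiary)
            return new RuleBasedCollator("< a , A < b , B");
        } catch (ParseException e) {
            throw new IllegalStateException("Malformed collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = build();
        // Default strength is tertiary, so the case difference counts
        System.out.println(collator.compare("a", "A") < 0); // true
        System.out.println(collator.compare("A", "b") < 0); // true
        // At primary strength only letter differences count
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare("a", "A") == 0); // true
    }
}
```

Lowering the strength to Collator.PRIMARY is a convenient way to ignore case (and accent) differences without rewriting the rules.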

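With the RuleBasedCollator class we can modify existing collation rules, for example:

String rule = "<a<b<f<g";
String addRule = "&b<e";
String newRule = rule + addRule;
// The constructor throws ParseException if the rule string is malformed
RuleBasedCollator newCollator = new RuleBasedCollator(newRule);

The resulting order is: a, b, e, f, g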
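The effect of appending a reset rule can be verified by actually sorting with the combined rules. This is a small sketch; the class name RuleExtensionDemo and its helper method are illustrative.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

class RuleExtensionDemo {
    /** Sorts a copy of the input with the base rules plus the appended rule. */
    static String[] sortWithExtendedRules(String[] input) {
        String rule = "<a<b<f<g";
        String addRule = "&b<e"; // reset to b, then place e right after it
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rule + addRule);
            String[] copy = input.clone();
            Arrays.sort(copy, collator);
            return copy;
        } catch (ParseException e) {
            throw new IllegalStateException("Malformed collation rules", e);
        }
    }

    public static void main(String[] args) {
        String[] sorted = sortWithExtendedRules(new String[] {"g", "e", "a", "f", "b"});
        System.out.println(Arrays.toString(sorted)); // [a, b, e, f, g]
    }
}
```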

RuleBasedCollator collatorUs = (RuleBasedCollator) Collator.getInstance(Locale.US);
RuleBasedCollator collatorCh = (RuleBasedCollator) Collator.getInstance(Locale.CHINA);
RuleBasedCollator combined = new RuleBasedCollator(collatorUs.getRules() + collatorCh.getRules());

Here combined merges the English and Chinese collation rules.

In this way collation rules can be defined and modified conveniently, which is very helpful for the discussion below.

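Arrays.sort(array, comparator) sorts the array according to the rule defined by comparator. The interfaces and classes involved in sorting are Comparator, Comparable, and Collator. The test comparator class implements the compare method of the Comparator interface, which is the contract between Arrays.sort() and Comparator, and likewise between Collections.sort() and Comparator. Arrays.sort() sorts arrays, whereas Collections.sort() sorts a List such as ArrayList (the elements of a Set must first be copied into a List). If no Comparator is used, the elements themselves can implement the compareTo() method of Comparable, which then serves as the contract with Arrays.sort() and Collections.sort().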
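As a usage sketch of this contract, the same comparator can be handed to both Arrays.sort() and Collections.sort(). The class name SortContractDemo and its helpers are illustrative; the exact order printed for the Chinese words depends on the collation data shipped with the JDK in use, so no expected output is shown.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SortContractDemo {
    /** Sorts a copy of the array with the given comparator. */
    static String[] sortArray(String[] input, Comparator<Object> comparator) {
        String[] copy = input.clone();
        Arrays.sort(copy, comparator);
        return copy;
    }

    /** Sorts a copy of the list with the given comparator. */
    static List<String> sortList(List<String> input, Comparator<Object> comparator) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy, comparator);
        return copy;
    }

    public static void main(String[] args) {
        // Collator implements Comparator<Object>, so it can be passed directly
        Collator collator = Collator.getInstance(Locale.CHINA);
        String[] words = {"板球", "排球", "香港", "足球", "篮球"};
        System.out.println(Arrays.toString(sortArray(words, collator)));
        System.out.println(sortList(Arrays.asList(words), collator));
    }
}
```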

3      Extending the problem

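If the requirements on the sort result are not strict, the test code above is sufficient. But if we append the uncommon character "怡" to "板球", "排球", "香港", "足球", "篮球" and sort again with the same test code, the result is no longer correct.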

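The main reason is that only the commonly used characters of GB2312 are arranged in pinyin order. The less common characters, including those that GBK adds on top of GB2312, are not arranged by pinyin but by radical and stroke count. So when the input contains such characters, the sort result deviates from pinyin order.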

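There are two ways to solve this problem:

- Obtain the Hanyu Pinyin of each character to be sorted, then sort the way English strings are compared.

- Extend the comparison rules for common characters so that they can sort both common and uncommon characters.

The first approach is the most obvious one, and the open-source project pinyin4j can do the conversion for us; the second is based on the collation rules of RuleBasedCollator.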

3.1   Extending the comparison rules

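The getRules method of RuleBasedCollator returns the existing rules, but how can rules for uncommon characters be merged into them? Most systems today still support only the basic Chinese character block of Unicode, encoded from U+4E00 to U+9FA5 (the range the code below iterates over). So we can proceed as follows: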

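First, use a Java program to generate a text file (full_b.csv) containing the code and the character for every character from U+4E00 to U+9FA5. Then sort the contents of full_b.csv with Excel's sort-by-pinyin feature.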

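Next, delete the first column, keeping only the Chinese characters.

Finally, use a Java program to read full_b.csv, generate the new collation rules, and combine them with the existing rules to build a new rule-based comparator.

With this method we can extend the comparison rules and sort the vast majority of Chinese characters correctly.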

Code example:

 

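       // Create the file (needs java.io.PrintWriter; the constructor throws FileNotFoundException)
       PrintWriter out = new PrintWriter("c:\\full_b.csv");
       // Basic Chinese characters, one "code,character" line each
       for (char c = 0x4E00; c <= 0x9FA5; c++) {
           out.println((int) c + "," + c);
       }
       out.flush();
       out.close();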

Sort the file, then delete the first column so that only the Chinese characters remain.

       // Generate the rules (needs java.io.File, java.io.PrintWriter, java.util.Scanner)
       Scanner in = new Scanner(new File("c:\\full_b.csv"));
       PrintWriter out = new PrintWriter("c:\\full.csv");
       while (in.hasNextLine()) {
           String line = in.nextLine();
           // Join the characters with '<' so that each one sorts after the previous one
           if (in.hasNextLine())
               out.print(line + "<");
           else
               out.print(line);
       }
       out.flush();
       out.close();
       in.close();

Building the rules

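import java.io.InputStream;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Scanner;

/**
 * Builds the new comparison rules.
 * @author User
 */
public class Rules {

    // Rule file generated in the previous step, loaded from the classpath
    private static String fileName = "full.csv";
    private static RuleBasedCollator collatorIns;

    private static StringBuffer load() {
        InputStream in = Rules.class.getResourceAsStream(fileName);
        StringBuffer sb = new StringBuffer();
        Scanner sca = new Scanner(in);
        while (sca.hasNextLine()) {
            // Prefix each line with '<' so that it continues the rule chain
            sb.append("<" + sca.nextLine());
        }
        sca.close();
        return sb;
    }

    public static RuleBasedCollator getCollator() {
        if (collatorIns == null) {
            RuleBasedCollator collator = (RuleBasedCollator) Collator
                    .getInstance(Locale.CHINA);
            try {
                // Keep the first 2125 characters of the built-in rules (a hard-coded
                // offset from the original code) and append the generated rules
                collatorIns = new RuleBasedCollator(collator.getRules()
                        .substring(0, 2125)
                        + load());
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        return collatorIns;
    }
}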

The comparator

public class FullComparator implements Comparator<Object> {

    // Collator built from the extended rules
    RuleBasedCollator collator = Rules.getCollator();

    public int compare(Object o1, Object o2) {
        return collator.compare(o1.toString(), o2.toString());
    }
}

3.2   Using pinyin4j

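The pinyin4j project converts Chinese characters to Hanyu Pinyin. With its help we can sort both common and uncommon characters in pinyin order. For characters with multiple readings, we take only the first reading.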

Test code:

import java.util.Comparator;
import net.sourceforge.pinyin4j.PinyinHelper;

public class PinyinComparator implements Comparator<Object> {

    public int compare(Object o1, Object o2) {
        String key1 = o1.toString();
        String key2 = o2.toString();
        int i = 0;
        int j = 0;
        while (i < key1.length() && j < key2.length()) {
            // Read full code points so that surrogate pairs are detected correctly
            int codePoint1 = key1.codePointAt(i);
            int codePoint2 = key2.codePointAt(j);
            i += Character.charCount(codePoint1);
            j += Character.charCount(codePoint2);

            if (codePoint1 == codePoint2) {
                continue;
            }
            // The pinyin lookup takes a char, so supplementary characters fall back to code point order
            if (Character.isSupplementaryCodePoint(codePoint1)
                    || Character.isSupplementaryCodePoint(codePoint2)) {
                return codePoint1 - codePoint2;
            }

            String pinyin1 = pinyin((char) codePoint1);
            String pinyin2 = pinyin((char) codePoint2);

            if (pinyin1 != null && pinyin2 != null) { // both characters are Chinese
                if (!pinyin1.equals(pinyin2)) {
                    return pinyin1.compareTo(pinyin2);
                }
            } else {
                return codePoint1 - codePoint2;
            }
        }
        // All compared characters tie: the string with characters left over sorts last
        return (key1.length() - i) - (key2.length() - j);
    }

    private String pinyin(char c) {
        String[] pinyins = PinyinHelper.toHanyuPinyinStringArray(c);
        if (pinyins == null) {
            return null; // no conversion result: not a Chinese character
        }
        return pinyins[0]; // for characters with multiple readings, use the first one
    }
}
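The underlying idea, converting each string to a pinyin key and then comparing the keys like English text, can be sketched without the pinyin4j dependency. In the sketch below the lookup table is a hypothetical hand-written stand-in that covers only the demo characters, and the names PinyinKeyDemo, sortKey, and sortByPinyin are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PinyinKeyDemo {
    // Hypothetical stand-in for a real pinyin lookup; covers only the demo characters.
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('板', "ban");
        PINYIN.put('球', "qiu");
        PINYIN.put('排', "pai");
        PINYIN.put('香', "xiang");
        PINYIN.put('港', "gang");
        PINYIN.put('足', "zu");
        PINYIN.put('篮', "lan");
        PINYIN.put('怡', "yi");
    }

    /** Builds a sort key: pinyin syllables joined by spaces; unknown characters are kept as-is. */
    static String sortKey(String text) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String pinyin = PINYIN.get(c);
            if (i > 0) {
                key.append(' ');
            }
            key.append(pinyin != null ? pinyin : String.valueOf(c));
        }
        return key.toString();
    }

    /** Sorts a copy of the list by comparing the pinyin keys as plain strings. */
    static List<String> sortByPinyin(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Comparator.comparing(PinyinKeyDemo::sortKey));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("板球", "排球", "香港", "足球", "篮球", "怡");
        System.out.println(sortByPinyin(words)); // [板球, 篮球, 排球, 香港, 怡, 足球]
    }
}
```

Building the key once per string, rather than looking up pinyin inside every comparison, also keeps the comparator simple; with a real library the table lookup would be replaced by the library call.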

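Both approaches above handle the sorting of uncommon Chinese characters well. Each still has shortcomings, such as the handling of characters with multiple readings, and leaves room for improvement.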

3.3   Sorting by stroke count

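To sort by stroke count, we need the number of strokes of each Chinese character. The most obvious way is to build a mapping table from characters to stroke counts, using the same method as the rule extension in 3.1:

- Use a Java program to generate a text file (Chinese.csv) containing the code and the character for every character from U+4E00 to U+9FA5. Sort the contents of Chinese.csv with Excel's sort-by-stroke feature.

- Write a Java program that analyzes Chinese.csv, derives the stroke counts, and generates ChineseStroke.csv. Correct the stroke counts, then re-sort ChineseStroke.csv by the Unicode code of each character.

- Keep only the last column of ChineseStroke.csv to produce Stroke.csv.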

Code example:

import java.util.Comparator;

public class StrokeComparator implements Comparator<Object> {

    public int compare(Object o1, Object o2) {
        String key1 = o1.toString();
        String key2 = o2.toString();
        for (int i = 0; i < key1.length() && i < key2.length(); i++) {
            int codepoint1 = key1.codePointAt(i);
            int codepoint2 = key2.codePointAt(i);
            if (codepoint1 == codepoint2) {
                continue;
            }

            int stroke1 = Chinese.stroke(codepoint1);
            int stroke2 = Chinese.stroke(codepoint2);
            // A character without a stroke entry: fall back to code point order
            if (stroke1 < 0 || stroke2 < 0) {
                return codepoint1 - codepoint2;
            }
            // Different stroke counts decide the order
            if (stroke1 != stroke2) {
                return stroke1 - stroke2;
            }
            // Same stroke count: use the code point as a tie-breaker so that
            // two different characters never compare as equal
            return codepoint1 - codepoint2;
        }
        return key1.length() - key2.length();
    }
}

Generating the stroke counts

       // Input: "code,character" lines sorted by stroke count in Excel
       Scanner in = new Scanner(new File("c:\\Chinese.csv"));
       PrintWriter out = new PrintWriter("c:\\ChineseStroke.csv");
       String oldLine = "999999";
       int stroke = 0;
       while (in.hasNextLine()) {
           String line = in.nextLine();
           // Codes rise within one stroke group, so a drop marks the start of the next group
           if (line.compareTo(oldLine) < 0) {
               stroke++;
           }
           oldLine = line;
           out.println(line + "," + stroke);
       }
       out.flush();
       out.close();
       in.close();
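The lookup-table idea of this section can be tried without the CSV files. The following is a minimal sketch under stated assumptions: the stroke table is hand-written for six characters as a stand-in for Stroke.csv, the names StrokeTableDemo and sortByStroke are illustrative, and characters missing from the table are simply placed last (a simplification compared with the comparator above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StrokeTableDemo {
    // Hand-written stroke counts for a few characters; a stand-in for Stroke.csv.
    private static final Map<Integer, Integer> STROKES = new HashMap<>();
    static {
        STROKES.put((int) '一', 1);
        STROKES.put((int) '二', 2);
        STROKES.put((int) '人', 2);
        STROKES.put((int) '十', 2);
        STROKES.put((int) '三', 3);
        STROKES.put((int) '大', 3);
    }

    /** Returns the stroke count, or -1 when the character is not in the table. */
    static int stroke(int codePoint) {
        Integer count = STROKES.get(codePoint);
        return count == null ? -1 : count;
    }

    /** Compares two single characters by stroke count, then by code point. */
    static int compareChars(String a, String b) {
        int cp1 = a.codePointAt(0);
        int cp2 = b.codePointAt(0);
        // Unknown characters get the largest key so that they sort last
        int s1 = stroke(cp1) < 0 ? Integer.MAX_VALUE : stroke(cp1);
        int s2 = stroke(cp2) < 0 ? Integer.MAX_VALUE : stroke(cp2);
        if (s1 != s2) {
            return Integer.compare(s1, s2);
        }
        return Integer.compare(cp1, cp2);
    }

    /** Sorts a copy of the list of single-character strings by stroke count. */
    static List<String> sortByStroke(List<String> chars) {
        List<String> copy = new ArrayList<>(chars);
        copy.sort(StrokeTableDemo::compareChars);
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("三", "十", "一", "人", "大", "二");
        System.out.println(sortByStroke(input)); // [一, 二, 人, 十, 三, 大]
    }
}
```

Characters with the same stroke count come out in code point order, which keeps the result deterministic.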

4      Complete code

4.1      chinese.sort.comparators.NormalComparator

package chinese.sort.comparators;

import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Comparator;
import java.util.Locale;

public class NormalComparator implements Comparator<Object> {

    RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.CHINA);

    public int compare(Object o1, Object o2) {
        return collator.compare(o1.toString(), o2.toString());
    }
}

 

4.2    chinese.sort.comparators.FullComparator

package chinese.sort.comparators;

import java.text.RuleBasedCollator;
import java.util.Comparator;
import chinese.sort.tools.Rules;

public class FullComparator implements Comparator<Object> {

    RuleBasedCollator collator = Rules.getCollator();

    public int compare(Object o1, Object o2) {
        return collator.compare(o1.toString(), o2.toString());
    }
}

 

4.3    chinese.sort.comparators.PinyinComparator

package chinese.sort.comparators;

import java.util.Comparator;
import net.sourceforge.pinyin4j.PinyinHelper;

public class PinyinComparator implements Comparator<Object> {

    public int compare(Object o1, Object o2) {
        String key1 = o1.toString();
        String key2 = o2.toString();
        int i = 0;
        int j = 0;
        while (i < key1.length() && j < key2.length()) {
            // Read full code points so that surrogate pairs are detected correctly
            int codePoint1 = key1.codePointAt(i);
            int codePoint2 = key2.codePointAt(j);
            i += Character.charCount(codePoint1);
            j += Character.charCount(codePoint2);

            if (codePoint1 == codePoint2) {
                continue;
            }
            // The pinyin lookup takes a char, so supplementary characters fall back to code point order
            if (Character.isSupplementaryCodePoint(codePoint1)
                    || Character.isSupplementaryCodePoint(codePoint2)) {
                return codePoint1 - codePoint2;
            }

            String pinyin1 = pinyin((char) codePoint1);
            String pinyin2 = pinyin((char) codePoint2);

            if (pinyin1 != null && pinyin2 != null) { // both characters are Chinese
                if (!pinyin1.equals(pinyin2)) {
                    return pinyin1.compareTo(pinyin2);
                }
            } else {
                return codePoint1 - codePoint2;
            }
        }
        // All compared characters tie: the string with characters left over sorts last
        return (key1.length() - i) - (key2.length() - j);
    }

    private String pinyin(char c) {
        String[] pinyins = PinyinHelper.toHanyuPinyinStringArray(c);
        if (pinyins == null) {
            return null; // no conversion result: not a Chinese character
        }
        return pinyins[0]; // for characters with multiple readings, use the first one
    }
}

4.4    chinese.sort.comparators.StrokeComparator

package chinese.sort.comparators;

import java.util.Comparator;
import chinese.sort.tools.Chinese;

public class StrokeComparator implements Comparator<Object> {

    public int compare(Object o1, Object o2) {
        String key1 = o1.toString();
        String key2 = o2.toString();
        for (int i = 0; i < key1.length() && i < key2.length(); i++) {
            int codepoint1 = key1.codePointAt(i);
            int codepoint2 = key2.codePointAt(i);
            if (codepoint1 == codepoint2) {
                continue;
            }

            int stroke1 = Chinese.stroke(codepoint1);
            int stroke2 = Chinese.stroke(codepoint2);
            // A character without a stroke entry: fall back to code point order
            if (stroke1 < 0 || stroke2 < 0) {
                return codepoint1 - codepoint2;
            }
            // Different stroke counts decide the order
            if (stroke1 != stroke2) {
                return stroke1 - stroke2;
            }
            // Same stroke count: use the code point as a tie-breaker so that
            // two different characters never compare as equal
            return codepoint1 - codepoint2;
        }
        return key1.length() - key2.length();
    }
}

4.5    chinese.sort.tools.Chinese

package chinese.sort.tools;

import java.util.Properties;
import java.util.Scanner;

public class Chinese {

    // Maps a decimal code point (as a string) to its stroke count
    private static Properties strokesMap = new Properties();

    static {
        // Stroke.csv holds one stroke count per line, starting at U+4E00
        Scanner in = new Scanner(Chinese.class.getResourceAsStream("Stroke.csv"));
        String temp;
        int i = 19968; // 0x4E00, the first basic Chinese character
        while (in.hasNextLine()) {
            temp = in.nextLine();
            strokesMap.setProperty((i++) + "", temp);
        }
        in.close();
    }

    /** Returns the stroke count of the code point, or -1 if it is not in the table. */
    public static int stroke(int codepoint) {
        String result = strokesMap.getProperty(codepoint + "");
        if (result == null) result = "-1";
        return Integer.parseInt(result.trim());
    }

    public static int stroke(char character) {
        int codepoint = (character + "").codePointAt(0);
        return stroke(codepoint);
    }
}

4.6    chinese.sort.tools.Rules

package chinese.sort.tools;

import java.io.InputStream;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;
import java.util.Scanner;

/**
 * Builds the new comparison rules.
 * @author User
 */
public class Rules {

    private static String fileName = "full.csv";
    private static RuleBasedCollator collatorIns;

    private static StringBuffer load() {
        InputStream in = Rules.class.getResourceAsStream(fileName);
        StringBuffer sb = new StringBuffer();
        Scanner sca = new Scanner(in);
        while (sca.hasNextLine()) {
            sb.append("<" + sca.nextLine());
        }
        return sb;
    }

    public static RuleBasedCollator getCollator() {
        if (collatorIns == null) {
            RuleBasedCollator collator = (RuleBasedCollator) Collator
                    .getInstance(Locale.CHINA);
            try {
                collatorIns = new RuleBasedCollator(collator.getRules()
                        .substring(0, 2125)
                        + load());
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        return collatorIns;
    }
}

4.7   Source code for building the stroke mapping table

Create the file

       PrintWriter out = new PrintWriter("c:\\Chinese.csv");
       // Basic Chinese characters, one "code,character" line each
       for (char c = 0x4E00; c <= 0x9FA5; c++) {
           out.println((int) c + "," + c);
       }
       out.flush();
       out.close();

Sort the file, then generate the stroke counts

       // Generate the stroke counts
       Scanner in = new Scanner(new File("c:\\Chinese.csv"));
       PrintWriter out = new PrintWriter("c:\\ChineseStroke.csv");
       String oldLine = "999999";
       int stroke = 0;
       while (in.hasNextLine()) {
           String line = in.nextLine();
           // Codes rise within one stroke group, so a drop marks the start of the next group
           if (line.compareTo(oldLine) < 0) {
               stroke++;
           }
           oldLine = line;
           out.println(line + "," + stroke);
       }
       out.flush();
       out.close();
       in.close();

Keep only the last column.

4.8   Building the rule table for uncommon characters

Create the file

       PrintWriter out = new PrintWriter("c:\\full_b.csv");
       // Basic Chinese characters, one "code,character" line each
       for (char c = 0x4E00; c <= 0x9FA5; c++) {
           out.println((int) c + "," + c);
       }
       out.flush();
       out.close();

Sort the file and keep only the last column

       // Generate the rules
       Scanner in = new Scanner(new File("c:\\full_b.csv"));
       PrintWriter out = new PrintWriter("c:\\full.csv");
       while (in.hasNextLine()) {
           String line = in.nextLine();
           // Join the characters with '<' so that each one sorts after the previous one
           if (in.hasNextLine())
               out.print(line + "<");
           else
               out.print(line);
       }
       out.flush();
       out.close();
       in.close();


