pinyin4j Usage Guide


Part 1: README translation

pinyin4j README

Table of contents

I. Main features

II. Future work

III. How to install

IV. Getting started

V. Author

VI. Copyright

 

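I. Main features

1. Converts Chinese characters (both Simplified and Traditional) to several Chinese romanization systems

2. Supports Hanyu Pinyin, Tongyong Pinyin, Wade-Giles, MPS2 (Mandarin Phonetic Symbols II), Yale and Gwoyeu Romatzyh as target systems

3. Supports multiple pronunciations of a single character

4. Several output formats

4.1. Upper case or lower case

4.2. "ü" written as v, u: or Unicode ü

4.3. Output with tone numbers, Unicode output with tone marks, or output without tones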

 

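II. Future work

1. Provide the formatting functions for romanization systems other than Hanyu Pinyin

2. Provide conversion from Pinyin back to Chinese characters

3. Provide coverage tests to find Chinese characters that are not yet known to the library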

 

III. How to install

1. Add lib/pinyin4j-2.5.0.jar to your classpath

2. Import the required classes

import net.sourceforge.pinyin4j.PinyinHelper;

import net.sourceforge.pinyin4j.format.*; // only needed if you use the output formatting features

 

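IV. Getting started

1. Open a command-line window and enter

cd ${pinyin-install-dir}\lib

2. The lib directory contains a JAR file named pinyin4j-2.5.0.jar. Enter

java -jar pinyin4j-2.5.0.jar

to run the GUI demo application.

3. If you want to modify the source code and recompile it, you may need to download sparta.jar from http://sparta-xml.sourceforge.net/ and add it to your classpath.

(pinyin4j already bundles the sparta-xml library inside pinyin4j-2.5.0.jar, so if you only use pinyin4j and do not recompile it, you do not need sparta.jar.)

Note: Java source files that use pinyin4j should be saved in an encoding that supports Unicode, such as UTF-8.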

 

V. Author

Li Min (李敏) (xmlerlimin@gmail.com)

You can reach me at:

Blogger: http://lemann.blogspot.com/

My.Opera: http://my.opera.com/lemann/

 

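VI. Copyright

1. The dictionary file that maps Chinese Unicode characters to Hanyu Pinyin comes from the Internet, where it was named "uc-to-py.tbl". Its author is stolfi.

2. The comparison tables for Tongyong Pinyin, Wade-Giles, MPS2, Yale, Hanyu Pinyin and Gwoyeu Romatzyh come from http://www.pinyin.info/romanization/compare/tongyong.html

3. The algorithm for placing tone marks comes from http://en.wikipedia.org/wiki/Pinyin#Rules_for_placing_the_tone_mark

4. The XML/XPath parsing library comes from http://sparta-xml.sourceforge.net/

5. The pinyin4j library is distributed under the GNU GENERAL PUBLIC LICENSE (GPL). See COPYING.txt for more details about the GPL.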

 

Part 2: Usage of the key classes

1. Online Javadoc

http://pinyin4j.sourceforge.net/pinyin4j-doc/

 

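2. Packages

2.1. net.sourceforge.pinyin4j

ChineseToPinyinResource: manages all external resources needed by PinyinHelper (reads /pinyindb/unicode_to_hanyu_pinyin.txt).

GwoyeuRomatzyhResource: holds the resources for converting Hanyu Pinyin to Gwoyeu Romatzyh (reads /pinyindb/pinyin_gwoeu_mapping.xml).

GwoyeuRomatzyhTranslator: contains the main logic for translating Hanyu Pinyin into Gwoyeu Romatzyh.

PinyinFormatter: formats pinyin strings, for example by applying a given output format or by adding tone marks.

PinyinHelper: a helper class with several utility methods for converting Chinese characters (Simplified and Traditional) into the various Chinese romanizations. It offers one getter per romanization type, six in total.

PinyinRomanizationResource: holds the resources for translating between the different romanization systems (reads /pinyindb/pinyin_mapping.xml).

PinyinRomanizationTranslator: converts between romanization systems through convertRomanizationSystem(source pinyin string, source romanization type, target romanization type).

PinyinRomanizationType: describes the available romanization systems. pinyin4j can convert Chinese characters into six of them: Hanyu Pinyin (汉语拼音), Tongyong Pinyin (通用拼音), Wade-Giles (威妥玛拼音), MPS II (注音符号第二式), Yale (耶鲁拼法) and Gwoyeu Romatzyh (国语罗马字).

ResourceHelper: loads resource files from the classpath as a BufferedInputStream.

TextHelper: text-processing helpers that split a Hanyu Pinyin syllable into its letters and its tone number:

extractToneNumber(String hanyuPinyinWithToneNumber) returns the tone number, e.g. the input luan4 returns 4; extractPinyinString(String hanyuPinyinWithToneNumber) returns the pinyin that precedes the tone number, e.g. the input luan4 returns luan.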
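The two TextHelper behaviours described above can be imitated with plain string operations. The sketch below is not pinyin4j's source code: the class name ToneNumberSketch is made up for illustration, and it assumes the input is non-empty and ends with exactly one tone digit.

```java
// Hypothetical re-implementation for illustration only; not the library's code.
class ToneNumberSketch {

    /** Returns the trailing tone digit, e.g. "luan4" gives "4". */
    static String extractToneNumber(String pinyinWithToneNumber) {
        return pinyinWithToneNumber.substring(pinyinWithToneNumber.length() - 1);
    }

    /** Returns the syllable without its trailing tone digit, e.g. "luan4" gives "luan". */
    static String extractPinyinString(String pinyinWithToneNumber) {
        return pinyinWithToneNumber.substring(0, pinyinWithToneNumber.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(extractToneNumber("luan4"));   // prints 4
        System.out.println(extractPinyinString("luan4")); // prints luan
    }
}
```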

 

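2.2. net.sourceforge.pinyin4j.format

HanyuPinyinCaseType: defines the letter case of the Hanyu Pinyin output, i.e. whether the generated pinyin is shown in upper or lower case. The class offers the options below, illustrated with the character "民":

Options      Output
LOWERCASE    min2
UPPERCASE    MIN2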

 

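HanyuPinyinOutputFormat: defines how Hanyu Pinyin is output (the class you construct to describe a pinyin format).

It controls:

  • the output format of the character "ü";
  • the output format of the tones;
  • the letter case of the output string.

The default values are:

HanyuPinyinVCharType := WITH_U_AND_COLON
HanyuPinyinCaseType := LOWERCASE
HanyuPinyinToneType := WITH_TONE_NUMBER

Some combinations of output options are meaningless, for example WITH_TONE_MARK together with WITH_U_AND_COLON.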

The combinations of output options are listed below, using the character "吕" as the example.

LOWERCASE

Combination         WITH_U_AND_COLON   WITH_V            WITH_U_UNICODE
WITH_TONE_NUMBER    lu:3               lv3               lü3
WITHOUT_TONE        lu:                lv                lü
WITH_TONE_MARK      throw exception    throw exception   lǚ

UPPERCASE

Combination         WITH_U_AND_COLON   WITH_V            WITH_U_UNICODE
WITH_TONE_NUMBER    LU:3               LV3               LÜ3
WITHOUT_TONE        LU:                LV                LÜ
WITH_TONE_MARK      throw exception    throw exception   LǙ
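The rule behind the "throw exception" cells can be stated in one line: a tone mark needs a real "ü" to sit on, so WITH_TONE_MARK only works together with WITH_U_UNICODE. The sketch below illustrates that rule with local stand-in enums. The names FormatCombinationSketch, Tone, VChar and isValid are invented for this example and are not pinyin4j classes. In the library itself a rejected combination is reported through BadHanyuPinyinOutputFormatCombination.

```java
// Illustration only: local enums mirror the option names used in the tables above.
class FormatCombinationSketch {
    enum Tone { WITH_TONE_NUMBER, WITHOUT_TONE, WITH_TONE_MARK }
    enum VChar { WITH_U_AND_COLON, WITH_V, WITH_U_UNICODE }

    /** Returns true when the tables above define an output for this combination. */
    static boolean isValid(Tone tone, VChar vChar) {
        // Tone marks cannot be placed on "v" or "u:", only on a Unicode "ü".
        return tone != Tone.WITH_TONE_MARK || vChar == VChar.WITH_U_UNICODE;
    }

    public static void main(String[] args) {
        // Print the validity of all nine combinations.
        for (Tone tone : Tone.values()) {
            for (VChar vChar : VChar.values()) {
                String verdict = isValid(tone, vChar) ? "ok" : "throw exception";
                System.out.println(tone + " + " + vChar + " -> " + verdict);
            }
        }
    }
}
```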

 

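HanyuPinyinToneType: defines how tones are output in Hanyu Pinyin.

Chinese has four tones plus one "toneless" (neutral) tone. They are called Píng (平, flat), Shǎng (上, rise), Qù (去, high drop), Rù (入, drop) and Qīng (轻, toneless), and are usually written as 1, 2, 3, 4 and 5. The class offers the tone output options below, illustrated with the character "打":

Options             Output
WITH_TONE_NUMBER    da3
WITHOUT_TONE        da
WITH_TONE_MARK      dǎ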
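To make the relationship between WITH_TONE_NUMBER and WITH_TONE_MARK concrete, here is a small sketch that turns a numbered syllable into a marked one. It follows the commonly cited placement rules (an "a" or "e" takes the mark; in "ou" the "o" takes it; otherwise the last vowel takes it). It is not pinyin4j's implementation: the class name ToneMarkSketch is invented, and it assumes lowercase input that already uses a Unicode "ü" and ends with a tone digit.

```java
// Hypothetical sketch of tone-mark placement; not the library's code.
class ToneMarkSketch {
    private static final String VOWELS = "aeiouü";
    // MARKED[i] holds tones 1-4 for VOWELS.charAt(i).
    private static final String[] MARKED = {
        "āáǎà", "ēéěè", "īíǐì", "ōóǒò", "ūúǔù", "ǖǘǚǜ"
    };

    static String toToneMark(String pinyinWithToneNumber) {
        int last = pinyinWithToneNumber.length() - 1;
        int tone = pinyinWithToneNumber.charAt(last) - '0';
        String syllable = pinyinWithToneNumber.substring(0, last);
        if (tone < 1 || tone > 4) {
            return syllable; // tone 5 (neutral) carries no mark
        }
        // Rule 1: "a" or "e" takes the mark.
        int pos = syllable.indexOf('a');
        if (pos < 0) {
            pos = syllable.indexOf('e');
        }
        // Rule 2: in "ou" the "o" takes the mark.
        if (pos < 0) {
            pos = syllable.indexOf("ou");
        }
        // Rule 3: otherwise the last vowel takes the mark.
        for (int i = syllable.length() - 1; pos < 0 && i >= 0; i--) {
            if (VOWELS.indexOf(syllable.charAt(i)) >= 0) {
                pos = i;
            }
        }
        if (pos < 0) {
            return syllable; // no vowel found, nothing to mark
        }
        char marked = MARKED[VOWELS.indexOf(syllable.charAt(pos))].charAt(tone - 1);
        return syllable.substring(0, pos) + marked + syllable.substring(pos + 1);
    }

    public static void main(String[] args) {
        System.out.println(toToneMark("da3"));    // prints dǎ
        System.out.println(toToneMark("zhong1")); // prints zhōng
        System.out.println(toToneMark("lü3"));    // prints lǚ
    }
}
```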

 

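HanyuPinyinVCharType: defines the output format of the character "ü" (shown as u:, v or Unicode ü).

"ü" is specific to Hanyu Pinyin and cannot simply be written with a letter of the English alphabet. Hanyu Pinyin contains "ü", "üe", "üan" and "ün".

The class offers the following options for the output of "ü":

Options             Output
WITH_U_AND_COLON    u:
WITH_V              v
WITH_U_UNICODE      ü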
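The three options amount to a simple substitution. The sketch below assumes, for illustration, that a syllable arrives in the default "u:" spelling (as in lu:3) and rewrites it for the chosen option. VCharSketch, VChar and render are made-up names, and this is not how pinyin4j is necessarily implemented internally.

```java
// Illustration only: a local enum stands in for HanyuPinyinVCharType.
class VCharSketch {
    enum VChar { WITH_U_AND_COLON, WITH_V, WITH_U_UNICODE }

    /** Rewrites the "u:" of a syllable such as "lu:3" according to the option. */
    static String render(String syllable, VChar option) {
        switch (option) {
            case WITH_V:
                return syllable.replace("u:", "v");
            case WITH_U_UNICODE:
                return syllable.replace("u:", "ü");
            default:
                return syllable; // WITH_U_AND_COLON keeps "u:" unchanged
        }
    }

    public static void main(String[] args) {
        for (VChar option : VChar.values()) {
            System.out.println(option + " : " + render("lu:3", option));
        }
    }
}
```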

 

2.3. net.sourceforge.pinyin4j.format.exception

BadHanyuPinyinOutputFormatCombination: the exception that signals an invalid combination of pinyin output format options.

 

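Part 3: Usage examples

1. Walking through the demo

Double-click pinyin4j-2.5.0.jar to open the demo window:

[Screenshot: the demo window after converting the character "中"]

The screenshot shows the result of entering the character "中" and clicking Convert to Pinyin. Three drop-down lists follow the Format label. The first has three options and controls how tones are displayed (tested with "中" on the Formatted Hanyu Pinyin tab):

WITH_TONE_NUMBER (tone as a digit) :  zhong1  zhong4
WITHOUT_TONE (no tone) :              zhong   zhong
WITH_TONE_MARK (tone mark) :          zhōng   zhòng

The second drop-down list controls how "ü" is displayed (as u:, v or Unicode ü). With the tone shown as WITH_TONE_NUMBER, the character "吕" gives:

WITH_U_AND_COLON :  lu:3
WITH_V :            lv3
WITH_U_UNICODE :    lü3

The third drop-down list controls whether the pinyin is shown in upper or lower case. For the character "国":

LOWERCASE : guó
UPPERCASE : GUÓ

The converted character is shown in six forms because pinyin4j can convert Chinese characters into six romanization systems: Hanyu Pinyin (汉语拼音), Tongyong Pinyin (通用拼音), Wade-Giles (威妥玛拼音), MPS II (注音符号第二式), Yale (耶鲁拼法) and Gwoyeu Romatzyh (国语罗马字).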

 

2. Code examples

1. CharUtil.java

package com.ctgu.test; // same package as PinyinUtil, which calls this class

/**
 * General-purpose character utilities.
 */
public class CharUtil
{
    /**
     * Uses the character's Unicode block to decide whether it is a Chinese
     * character or a Chinese (CJK or full-width) punctuation mark.
     *
     * @param c the character to test
     * @return true if c is a Chinese character or punctuation mark
     */
    public static boolean isChineseCharacter(char c)
    {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || ub == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.GENERAL_PUNCTUATION;
    }

    /**
     * Checks whether a string contains at least one Chinese character or
     * Chinese punctuation mark.
     *
     * @param strName the string to test
     * @return true if any character passes isChineseCharacter
     */
    public static boolean isContainsChinese(String strName)
    {
        for (char c : strName.toCharArray())
        {
            if (isChineseCharacter(c))
            {
                return true;
            }
        }
        return false;
    }
}
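Because the check above also accepts the punctuation blocks, full-width commas and similar marks count as "Chinese" even though they have no pinyin. The sketch below prints the Unicode block of a few sample characters and shows one possible stricter test based on Character.UnicodeScript. The class UnicodeBlockSketch and the method isHanIdeograph are hypothetical names, not part of the article's utilities.

```java
// Illustration only: inspects Unicode blocks and offers a stricter Han-only test.
class UnicodeBlockSketch {

    /** True only for characters whose Unicode script is Han (no punctuation). */
    static boolean isHanIdeograph(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        // A Han character, a full-width comma, a CJK full stop and a Latin letter.
        char[] samples = { '中', ',', '。', 'a' };
        for (char c : samples) {
            System.out.println(c + " -> " + Character.UnicodeBlock.of(c)
                    + ", Han ideograph: " + isHanIdeograph(c));
        }
    }
}
```

Whether punctuation should count as Chinese depends on the caller. PinyinUtil below relies on the broader check so that Chinese punctuation is dropped from the output.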

 

2. PinyinUtil.java

package com.ctgu.test;

import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class PinyinUtil
{

 

    /**
     * Converts Chinese characters to the first letters of their Hanyu Pinyin.
     * Latin characters are kept unchanged and Chinese punctuation is dropped.
     * Polyphonic characters are supported, e.g. 长沙市长 gives
     * cssc,zssz,zssc,cssz.
     *
     * @param hanzi the Chinese text
     * @return the comma-separated first-letter combinations
     */
    public static String converterToFirstSpell(String hanzi)
    {
        StringBuilder pinyinName = new StringBuilder();
        char[] nameChar = hanzi.toCharArray();
        HanyuPinyinOutputFormat defaultFormat = new HanyuPinyinOutputFormat();
        defaultFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
        defaultFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
        for (int i = 0; i < nameChar.length; i++)
        {
            // simpler alternative: if (nameChar[i] > 128)
            if (CharUtil.isChineseCharacter(nameChar[i]))
            {
                try
                {
                    // all readings of the current character
                    String[] strs = PinyinHelper.toHanyuPinyinStringArray(
                            nameChar[i], defaultFormat);
                    if (strs != null)
                    {
                        for (int j = 0; j < strs.length; j++)
                        {
                            pinyinName.append(strs[j].charAt(0)); // first letter only
                            if (j != strs.length - 1)
                            {
                                pinyinName.append(",");
                            }
                        }
                    }
                }
                catch (BadHanyuPinyinOutputFormatCombination e)
                {
                    e.printStackTrace();
                }
            }
            else
            {
                pinyinName.append(nameChar[i]);
            }
            // a space separates characters; discountTheChinese splits on it
            pinyinName.append(" ");
        }
        return parseTheChineseByObject(discountTheChinese(pinyinName.toString()));
    }

 

    /**
     * Converts Chinese characters to full Hanyu Pinyin. Latin characters are
     * kept unchanged and Chinese punctuation is dropped. Polyphonic characters
     * are supported, e.g. 重当参 gives zhongdangcen,zhongdangcan,chongdangcen,
     * chongdangshen,zhongdangshen,chongdangcan.
     *
     * @param chines the Chinese text
     * @return the comma-separated pinyin combinations
     */
    public static String converterToSpell(String chines)
    {
        StringBuilder pinyinName = new StringBuilder();
        char[] nameChar = chines.toCharArray();
        HanyuPinyinOutputFormat defaultFormat = new HanyuPinyinOutputFormat();
        defaultFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
        defaultFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
        for (int i = 0; i < nameChar.length; i++)
        {
            // simpler alternative: if (nameChar[i] > 128), i.e. treat every
            // character outside the ASCII table as Chinese
            if (CharUtil.isChineseCharacter(nameChar[i]))
            {
                try
                {
                    // all readings of the current character
                    String[] strs = PinyinHelper.toHanyuPinyinStringArray(
                            nameChar[i], defaultFormat);
                    if (strs != null)
                    {
                        for (int j = 0; j < strs.length; j++)
                        {
                            pinyinName.append(strs[j]);
                            if (j != strs.length - 1)
                            {
                                pinyinName.append(",");
                            }
                        }
                    }
                }
                catch (BadHanyuPinyinOutputFormatCombination e)
                {
                    e.printStackTrace();
                }
            }
            else
            {
                pinyinName.append(nameChar[i]);
            }
            // a space separates characters; discountTheChinese splits on it
            pinyinName.append(" ");
        }
        return parseTheChineseByObject(discountTheChinese(pinyinName.toString()));
    }

 

    /**
     * Removes duplicate readings of polyphonic characters.
     *
     * @param theStr readings separated by "," within a character and by " "
     *               between characters
     * @return one map per character whose keys are its distinct readings
     */
    private static List<Map<String, Integer>> discountTheChinese(String theStr)
    {
        // list of readings with duplicates removed
        List<Map<String, Integer>> mapList = new ArrayList<Map<String, Integer>>();
        String[] firsts = theStr.split(" ");
        // read the pinyin of each character
        for (String str : firsts)
        {
            // distinct readings of this character, with occurrence counts
            Map<String, Integer> onlyOne = new Hashtable<String, Integer>();
            String[] china = str.split(",");
            for (String s : china)
            {
                Integer count = onlyOne.get(s);
                if (count == null)
                {
                    onlyOne.put(s, 1);
                }
                else
                {
                    onlyOne.put(s, count + 1);
                }
            }
            mapList.add(onlyOne);
        }
        return mapList;
    }

 

    /**
     * Combines the readings of all characters into every possible full string
     * (object-merging approach, recommended).
     *
     * @param list one map of distinct readings per character
     * @return all combinations, separated by ","
     */
    private static String parseTheChineseByObject(
            List<Map<String, Integer>> list)
    {
        Map<String, Integer> first = null; // combinations built so far
        // walk through the readings of each character
        for (int i = 0; i < list.size(); i++)
        {
            // combinations of the previous result with the current readings
            Map<String, Integer> temp = new Hashtable<String, Integer>();
            if (first != null)
            {
                // append every current reading to every previous combination
                for (String s : first.keySet())
                {
                    for (String s1 : list.get(i).keySet())
                    {
                        temp.put(s + s1, 1);
                    }
                }
            }
            else
            {
                // first iteration: start from the readings of the first character
                for (String s : list.get(i).keySet())
                {
                    temp.put(s, 1);
                }
            }
            // keep the combinations for the next iteration
            if (temp.size() > 0)
            {
                first = temp;
            }
        }
        // join all combinations with commas
        return first == null ? "" : String.join(",", first.keySet());
    }
}
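The combination step is a Cartesian product over the readings of each character. Since Hashtable does not guarantee iteration order, the order of the results produced by PinyinUtil can vary. The sketch below shows the same idea with LinkedHashSet, which removes duplicates and keeps first-seen order. It is an alternative written for illustration, not the article's code: ReadingCombinerSketch and combine are made-up names, and it assumes every character has at least one reading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustration only: order-preserving Cartesian product of per-character readings.
class ReadingCombinerSketch {

    /** Joins one reading per character in every possible way. */
    static List<String> combine(List<List<String>> readingsPerChar) {
        Set<String> results = new LinkedHashSet<>();
        results.add(""); // seed with one empty prefix
        for (List<String> readings : readingsPerChar) {
            Set<String> next = new LinkedHashSet<>();
            for (String prefix : results) {
                // The inner LinkedHashSet drops repeated readings such as "hao", "hao".
                for (String reading : new LinkedHashSet<>(readings)) {
                    next.add(prefix + reading);
                }
            }
            results = next;
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<List<String>> readings = Arrays.asList(
                Arrays.asList("zhong", "chong"),  // 重
                Arrays.asList("qing"),            // 庆
                Arrays.asList("shi"),             // 市
                Arrays.asList("chang", "zhang")); // 长
        // prints zhongqingshichang,zhongqingshizhang,chongqingshichang,chongqingshizhang
        System.out.println(String.join(",", combine(readings)));
    }
}
```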

 

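3. Main.java

package com.ctgu.test; // same package as PinyinUtil

public class Main
{
    public static void main(String[] args)
    {
        String msg = "你好啊,重庆市长!";

        // full pinyin, one result per combination of readings
        String spell = PinyinUtil.converterToSpell(msg);
        System.out.println(spell);

        // first letters only
        String firstSpell = PinyinUtil.converterToFirstSpell(msg);
        System.out.println(firstSpell);
    }
}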

 

The program prints:

nihaoazhongqingshichang,nihaoachongqingshizhang,nihaoachongqingshichang,nihaoazhongqingshizhang

nhazqsz,nhazqsc,nhacqsz,nhacqsc

As the output shows, the results cover every combination of readings of the polyphonic characters.

 

Part 4: References

pinyin4j getting-started tutorial: http://blog.csdn.net/hfhwfw/article/details/6030816

pinyin4j study notes: http://blog.csdn.net/foamflower/article/details/6209552

pinyin4j usage example (with polyphonic character support): http://www.open-open.com/lib/view/open1392087364364.html

Reliably detecting Chinese characters in Java: http://www.micmiu.com/lang/java/java-check-chinese/

ASCII table: http://asciima.com/

Generating a Unicode/GB2312 character code table in Java: http://www.cnblogs.com/lijialong/archive/2010/10/12/1849220.html