Java: Converting Chinese Characters to Pinyin (with Polyphonic-Character Support)

Source: Internet. Editor: 程序博客网. Time: 2024/04/19 23:42

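The previous article, "Java 汉字转拼音" (converting Chinese characters to pinyin in Java), showed how to do the conversion with Pinyin4j. For polyphonic characters (多音字), however, it simply emitted every combination of readings: 长沙, for example, came out as both changsha and zhangsha. In some situations we want exactly one pinyin for a polyphonic character, and for that we need a polyphone dictionary. The principle is simple: give each polyphonic character a default pinyin, and tell the program which words call for a different reading. For the character 长, for instance, we can make zhang the default and mark 长沙 as using chang.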
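The idea of "one default reading plus word-level overrides" can be sketched in a few lines of plain Java. This is only an illustration of the principle, not the library's code: the class PolyphoneSketch, its two maps, and the method resolve are hypothetical names, the table holds just the example from the paragraph above, and only the two-character word starting at the given position is checked.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "default reading + word overrides"; not part of py4j.
class PolyphoneSketch {

    // Default reading per polyphonic character (value taken from the example in the text).
    static final Map<String, String> DEFAULTS = new HashMap<>();

    // Words in which the character takes a different reading, keyed by the whole word.
    static final Map<String, String> OVERRIDES = new HashMap<>();

    static {
        DEFAULTS.put("长", "zhang");
        OVERRIDES.put("长沙", "chang");
    }

    // Returns the reading of the character at index i, or null if it is not in the table.
    static String resolve(String text, int i) {
        // First try the two-character word that starts at i.
        if (i + 2 <= text.length()) {
            String override = OVERRIDES.get(text.substring(i, i + 2));
            if (override != null) {
                return override;
            }
        }
        // Otherwise fall back to the default reading.
        return DEFAULTS.get(text.substring(i, i + 1));
    }

    public static void main(String[] args) {
        System.out.println(resolve("长沙", 0));
        System.out.println(resolve("长大", 0));
    }
}
```

The real lookup described later also inspects the characters to the left of the target and three-character words, which this sketch leaves out.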


Project structure


Polyphone dictionary

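This library supports custom extension dictionaries. The dictionary file is named py4j.dic, and its full path is resources/py4j/dictionary/py4j.dic. Each line has the form pinyin#word/word/...; an entry that consists of a single character marks that pinyin as the character's default reading. The file looks like this: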

a#阿
ao#拗口/违拗/拗断/执拗/拗口/拗口风/拗口令/拗曲/拗性/拗折/警拗
ai#艾
bang#膀/磅/蚌
ba#扒
bai#叔伯/百/柏杨/㧳/梵呗/呗佛/呗音/呗唱/呗偈/呗声/呗赞/赞呗
bao#剥皮/薄/暴/堡/曝
bei#呗
beng#蚌埠
bi#复辟/臂/秘鲁/泌阳
bing#屏息/屏弃/屏气/屏除/屏声
bian#扁/便/便宜坊
bo#薄荷/单薄/伯/泊/波/柏/萝卜/孛
bu#卜/柨
can#参
cang#藏/欌
cen#参差
ceng#曾/噌
cha#差/刹那/宝刹/一刹/查/碴/喳喳/喀喳
chai#公差/差役/专差/官差/听差/美差/办差/差事/差使/肥差/当差/钦差/苦差/出差
chan#颤/单于/禅
chang#长/厂
chao#朝/嘲/焯
che#工尺/车
chen#称职/匀称/称心/相称/对称
cheng#称/乘/澄/噌吰/橙 秤/盛满/盛器/盛饭
chu#畜
chui#椎心
chuai#揣
chuan#传
chi#匙/尺/吃
chong#重庆/重重/虫
chou#臭/帱
chuang#经幢
chuo#绰
ci#参差/鳞差/伺候/龟兹
cuan#攒聚/攒动/攒集/攒宫/攒所
cuo#撮儿/撮要/撮合
da#大/嗒
dao#叨/帱载/帱察
dai#大夫
dan#单/弹/掸/澹
dang#铛
de#的/得
di#堤/底/怎的/有的/目的/标的/打的/的确/有的放/的卢/矢之的/言中的/语中的/的士/地/提防/快的/美的
diao#蓝调/调调/音调/论调/格调/调令/低调/笔调/基调/强调/声调/滥调/老调/色调/单调/腔调/跑调/曲调/步调/语调/主调/情调
ding#丁
du#读/都/度
dou#全都/句读
duo#舵/测度/忖度/揣度/猜度
dun#粮囤/盾/顿/沌/敦
e#阿谀/阿胶/阿弥/恶/擜
er#儿
fan#番
feng#冯
fei#婔
fo#佛
fu#仿佛/果脯/罘/莩
fou#否
fiao#覅
ga#咖喱/伽马/嘎/戛纳
gai#盖
gao#告
gang#扛鼎
ge#革/蛤蚧/文蛤/蛤蜊/咯
gei#给
geng#脖颈
gong#女红/共
gu#谷/中鹄/鼓
gui#龟/柜/硅/倭傀/傀异/傀然/傀垒/傀怪/傀卓/傀奇/傀伟/傀民/傀俄/琦傀/奇傀
gua#呱
guan#纶巾/东莞
guang#广
ha#蛤/哈/虾蟆
hai#还/嗨/咳声/咳笑
hao#貉子/貉绒
hang#夯/总行/分行/支行/行业/排行/行情/央行/商行/外行/银行/中行/交行/招行/农行/工行/建行/商行/酒行/麻行/琴行/行业/同行/行列/行货/行会/行家/巷道/引吭/扼吭/批吭/搤吭/高吭/喉吭/咔吭/絶吭/吭嗌/吭咽/吭首
he#和/合/核/鶴/猲
heng#道行/涥
hu#鹄/水浒/嗀/唬
hua#滑/呚/椛
huan#归还/放还/奉还/圜
hui#会/浍河/媈/灳/哕/瑗珲
hong#红/虹
huo#软和/热和/暖和
hun#尡/珲
ji#病革/给养/自给/给水/薪给/给予/供给/稽/缉/藉/奇数/亟/诘屈/荠菜/愱
jia#雪茄/伽/家/价/贾/戛
jian#见/浅浅
jiang#降
jiao#嚼舌/嚼字/嚼蜡/角/剿/饺/脚/蕉/矫/睡觉/侥/校对/校验/校正/校准/审校/校场/校核/校勘/校订/校阅/校样
jie#解/慰藉/蕴藉/诘/媘/煯
jin#矜/劲/禁
jing#颈/景/强劲/劲风/劲旅/劲敌/劲射/苍劲/遒劲/劲草
jiong#炅
ju#咀/居/桔/句/婮
jun#均
juan#棚圈/圈养/猪圈/羊圈
jue#主角/角色/旦角/女角/丑角/角力/名角/配角/嚼/觉/䏐
jun#龟裂/俊
ka#咖/卡/喀
kai#楷
kang#扛
ke#咳/壳
keng#吭
kuai#会计/财会/浍
kui#傀
kuo#括
la#癞痢/腊/蜡
lai#癞疮/癞子/癞蛤/癞皮
lao#积潦/络子/落枕/落价/粩/姥
le#乐/勒/了
lei#勒紧
lo#然咯
lou#佝偻/泄露/露面/露脸/露骨/露底/露馅/露一手/露相/露马脚/露怯
long#里弄/弄堂/泷
li#跞/礼/櫔/栃
liao#了解/了结/明了/了得/末了/未了/了如/潦/撩
liang#靓/俩
lie#挘
lin#崊
ling#霗/令
liu#六/遛
lu#碌/陆/露
luo#络/落/漯/囖/洜/泺
lv#率/绿
lve#鋢/稤
lun#纶
ma#嫲/抹布/抹脸/抹桌子/摩挲
mai#埋
man#埋怨/蔓
mai#脉
mang#氓/芒
mao#冒
me#嚒
men#椚
meng#群氓/盟/癦
mei#没/旀
mo#淹没/没收/出没/沉没/没落/吞没/覆没/没入/埋没/鬼没/隐没/湮没/辱没/脉脉/模/摩/抹
mou#绸缪/牟
mi#秘/泌尿/分泌/谜/檷枸
mian#渑
ming#掵
miu#谬/谬论/纰缪
mu#大模/字模/模板/模样/模具/装模/模子/牟尼/子牟/夷牟/悬牟/相牟/头牟/宾牟/曹牟/岑牟/兜牟/卢牟/弥牟/牟食/牟槊/牟衫/牟光/牟牟/牟甲
na#哪/娜/那
nao#臑
nan#南
ne#哪吒/呢
nei#氞
neus#莻
nong#弄/燶
ni#毛呢/花呢/呢绒/线呢/呢料/呢子/呢喃/溺/檷
niao#尿/鸟/便溺
nian#粘膜/粘度/粘土/粘合剂/粘液/粘稠/粘合/粘着/粘结/粘性/粘附/不粘锅/粘糊/粘虫/粘聚/粘滞/焾/哖
niang#酿
nin#脌
ning#倿/拧
niu#拗/汼
nu#努
nuo#婀娜/袅娜/喏
nv#女
nve#疟/硸
o#喔/筽
ou#膒
pa#扒手/扒窃/扒外/扒分/扒糕/扒灰/扒犁/扒龙/扒搂/扒山虎/扒艇
pai#派/迫击/迫击炮
pao#刨/炮/萢
pan#番禺
pang#胖/膀/磅
pei#蓜
pi#辟/否极/臧否/龙陂/芘
pian#扁舟/便宜/魸
piao#朴姓/饿莩/饥莩/葭莩
pin#穦
ping#屏/苹/冯河
po#湖泊/血泊 /迫/朴刀/坡/陂
pu#一曝十寒/里堡/十里堡/脯/朴/曝晒/瀑/埔
qi#期/其/泣/祇
qiu#龟兹/湭
qi#稽首/缉鞋/栖/奇/漆/齐
qia#卡脖/卡子/关卡/卡壳/哨卡/边卡/发卡/峠
qiao#雀盲/雀子/地壳/甲壳/躯壳
qian#纤/乾/浅
qiang#强/㛨/㩖/䅚/䵁
qie#茄/趔趄/聺/籡
qin#亲/沁
qing#干亲/亲家
qiong#熍
qu#区/趣/爠
quan#圈/券
que#雀/炔
re#声喏/唱喏
rong#嬫
ruo#若/嵶
saeng#栍
sang#槡
sai#塞/嘥
sao#螦
se#堵塞/搪塞/茅塞/闭塞/鼻塞/梗塞/阻塞/淤塞/拥塞/哽塞/色
sha#莎/刹车/急刹/厦/杉木/杉篙
shai#色子
shao#勺/红苕
shan#姓单/单县/杉/敾/禅让/受禅/禅变/禅代/禅诰
shang#衣裳
she#拾级/折本/射/蛇
shen#沙参/野参/参王/人参/红参/丹参/山参/海参/鹿参/什么/身/沈/桑椹/食椹/烂椹/木椹
sheng#野乘/千乘/史乘/省/晟/盛/陹/渑水
shi#钥匙/什/识/似的/食/石/氏/拾/适/瑡
shiwa#瓧
shuai#表率/率性/率直/率真/粗率/率领/轻率/直率/草率/大率/坦率/衰
shuang#泷水/鏯
shu#属/数/术/熟
shui#游说
shuo#数见/说
si#伺/似/思
sou#蓃/摗
su#宿/鯂
sui#尿泡
ta#拓片/拓印/拓本/拓墨/拓写/拓手/拓工/碑拓/疲沓/拖沓/杂沓/沓/塔/鸿塔
tang#汤/镗
tao#陶
tan#反弹/弹性/弹簧/弹力/弹奏/弹跳/弹指/弹劾/弹唱/弹射/弹性体/吹弹/评弹/乱弹琴/弹压/弹指/弹簧/弹冠/弹雀/弹雀/弹丝/弹丸/澹台
te#脦
teng#虅
ti#提/体
tiao#调/苕
ting#町/听
tong#通
tu#迌
tuan#湪
tui#褪
tuo#拓/袥
tun#囤/屯
wei#尾/蔚/圩堤/圩垸/圩田/圩子/赶圩/歌圩
weng#攚
wu#无/可恶/交恶/好恶/厌恶/憎恶/嫌恶/痛恶/深恶/兀
wan#藤蔓/枝蔓/根蔓/蔓草/瓜蔓/蔓儿/莞/万/百万/皖
wang#亡
wai#崴
xia#虾/吓/夏/厦门/厦大/唬杀
xi#栖/系/蹊/洗/溪/戏/焁/铣/褶衣/褶裤
xiao#校/切削/削面/刀削/刮削
xian#纤细/光纤/纤巧/纤柔/纤小/纤维/纤瘦/纤纤/化纤/纤秀/棉纤/纤尘/铣铁/金铣
xiang#投降/巷
xie#解数/出血/采血/换血/血糊/尿血/淤血/放血/血晕/血淋/便血/吐血/咯血/叶韵/蝎/蝎子/邪/猲猲
xin#嬜/邤
xiu#铜臭/乳臭/成宿/星宿/璓
xin#馨/信/鸿信
xing#深省/省视/内省/不省人事/省悟/省察/行/荥
xiong#匂
xu#牧畜/畜产/畜牧/畜养/并畜/畜锐/吁/圩/浒
xuan#箮
xue#削/血/樰
xun#荨/寻
ya#琊
yao#钥/耀/曜/佋侥/侥觎/侥僺/侥利/侥傒/侥觊/侥会/侥滥/侥望/侥求/侥竞/侥薄/侥躐/侥取/侥奇/侥忝/侥速/侥冀/侥冒/疟子
yan#咽/殷红/朱殷/腌/烟/曕
ye#液/抽咽/哽咽/咽炎/呜咽/幽咽/悲咽/叶/葉/璍/潱/拽步/拽扶/拽扎
yi#自艾/遗/屹/嬄/噫
yin#殷/栶
ying#荥经/緓/灜
yo#杭育
yong#涌/硧
you#牗
yu#余/呼吁/吁请/吁求/育/熨帖/熨烫/於
yuan#员/茒/圜丘
yun#熨
yue#约/乐音/器乐/乐律/乐章/音乐/乐理/民乐/乐队/声乐/奏乐/弦乐/乐坛/管乐/配乐/乐曲/乐谱/锁钥/密钥/乐团/乐器/嬳/咽哕/唾哕/发哕/干哕/哕吐/哕饭/哕呕/哕息/哕厥/哕噫/哕逆/哕咽/哕骂/哕心/哕喈/口哕/呕哕
za#绑扎/结扎/包扎/捆扎/咱家
zan#攒/咱
zang#宝藏/藏历/藏文/藏语/藏青/藏族/藏医/藏药/藏蓝/西藏
zai#牛仔/龟仔/龙仔/鼻仔/羊仔/仔仔/麻仔/麵包仔/麦旺仔/鸿仔/煲仔/福仔/畠
zao#栆
ze#择
zeng#曾国藩/曾孙/曾祖父/曾祖/曾祖母/曾孙女/曾巩/囎/缯
zong#综/繌
zha#扎/柞狭/柞薪/柞子/柞鄂/柞叶/柞撒/槱柞/一柞/五柞宫/五柞/雠柞/芟柞/蜡祭/喳
zhai#宅/夈/择席/择菜
zhan#粘
zhang#列车长/行长/村长/镇长/乡长/区长/县长/市长/省长/会长/班长/排长/连长/营长/团长/旅长/师长/军长/委员长/局长/厅长/所长/部长/组长/生长/长大/长高/长个/
zhao#朝朝/明朝/朝晖/朝夕/朝思/今朝/朝气/朝三/朝秦/朝霞/鹰爪/龙爪/魔爪/爪牙/着急/着迷/着火/怎么着/正着/着凉/一着/犯不着/着数/这么着/犯得着/着慌/着忙/数得着/龙爪槐/嘲哳/嘲惹
zhe#折/着/褶
zhen#殝/椹
zhi#标识/吱/殖/枝/方祇/后祇/皇祇/黄祇/皇地祇/金祇/祇树/月氏
zhong#重/种
zhou#粥
zhu#属意/著/駯
zhua#爪子
zhuai#拽
zhuan#芈月传/外传/传记/自传/正传/小传/评传/传略/别传
zhui#椎/隹
zhuo#执著/着装/着落/着意/着力/附着/着笔/胶着/着实/衣着/着眼/着想/着重/穿着/执着/着墨/着实/沉着/着陆/着想/着色/焯见/焯烁/辉焯
zhuang#幢房/一幢/幢楼/庒
zi#仔/兹
zu#足
zuo#柞/穝
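To make the file format concrete, here is a small parsing sketch that uses only the JDK. It assumes one pinyin#word/word/... entry per line, as in py4j.dic; DictLineSketch and parse are hypothetical names, and a plain Map of lists stands in for the Guava multimap that the library itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper showing how dictionary lines map to pinyin -> words; not part of py4j.
class DictLineSketch {

    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (String line : lines) {
            // Split into the pinyin key and the slash-separated word list.
            String[] parts = line.split("#", 2);
            if (parts.length < 2) {
                continue; // no '#' on this line: nothing to record
            }
            String pinyin = parts[0].trim();
            for (String word : parts[1].split("/")) {
                String w = word.trim();
                if (!w.isEmpty()) {
                    // Repeated keys (the file lists "mai" and "qi" twice) accumulate in one list.
                    map.computeIfAbsent(pinyin, k -> new ArrayList<>()).add(w);
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("chang#长/厂", "zhang#列车长/行长")));
    }
}
```

Trimming each word also absorbs stray spaces such as the one after 血泊 in the po entry, and empty pieces from a trailing slash are skipped.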


Key code

Py4j.java

package com.bytebeats.py4j;

import com.bytebeats.py4j.exception.BadHanYuPinYinException;
import com.bytebeats.py4j.util.StringUtils;
import com.google.common.collect.ArrayListMultimap;
import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class Py4j {

    private ArrayListMultimap<String,String> duoYinZiMap;

    public Py4j(){
        Py4jDictionary.getDefault().init();
        duoYinZiMap = Py4jDictionary.getDefault().getDuoYinZiMap();
    }

    public String[] getPinyin(char ch) {
        try{
            HanyuPinyinOutputFormat outputFormat = new HanyuPinyinOutputFormat();
            outputFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
            outputFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
            outputFormat.setVCharType(HanyuPinyinVCharType.WITH_V);

            if(ch>=32 && ch<=125){
                // ASCII characters 32..125 are returned as-is; ASCII table: http://www.asciitable.com/
                return new String[]{String.valueOf(ch)};
            }
            return PinyinHelper.toHanyuPinyinStringArray(ch, outputFormat);
        } catch (BadHanyuPinyinOutputFormatCombination e) {
            throw new BadHanYuPinYinException(e);
        }
    }

    public String getPinyin(String chinese) {
        if(StringUtils.isEmpty(chinese)){
            return null;
        }

        // Collapse runs of punctuation into a single space
        chinese = chinese.replaceAll("[\\.,\\,!·\\!?\\?;\\;\\(\\)()\\[\\]\\:: ]+", " ").trim();

        StringBuilder py_sb = new StringBuilder(32);
        char[] chs = chinese.toCharArray();
        for(int i=0;i<chs.length;i++){
            String[] py_arr = getPinyin(chs[i]);
            if(py_arr==null || py_arr.length<1){
                throw new BadHanYuPinYinException("pinyin array is empty, char:"+chs[i]+",chinese:"+chinese);
            }
            if(py_arr.length==1){
                py_sb.append(convertInitialToUpperCase(py_arr[0]));
            }else if(py_arr.length==2 && py_arr[0].equals(py_arr[1])){
                py_sb.append(convertInitialToUpperCase(py_arr[0]));
            }else{
                // Polyphonic character: look for a dictionary word around position i
                String resultPy = null, defaultPy = null;
                for (String py : py_arr) {
                    // Take one extra character to the left, e.g. 银[行]
                    if(i>=1 && i+1<=chinese.length()){
                        String left = chinese.substring(i-1,i+1);
                        if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(left)){
                            resultPy = py;
                            break;
                        }
                    }

                    // Take one extra character to the right, e.g. [长]沙
                    if(i<=chinese.length()-2){
                        String right = chinese.substring(i,i+2);
                        if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(right)){
                            resultPy = py;
                            break;
                        }
                    }

                    // Take one extra character on each side, e.g. 龙[爪]槐
                    if(i>=1 && i+2<=chinese.length()){
                        String middle = chinese.substring(i-1,i+2);
                        if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(middle)){
                            resultPy = py;
                            break;
                        }
                    }

                    // Take two extra characters to the left, e.g. 芈月[传], 列车[长]
                    if(i>=2 && i+1<=chinese.length()){
                        String left3 = chinese.substring(i-2,i+1);
                        if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(left3)){
                            resultPy = py;
                            break;
                        }
                    }

                    // Take two extra characters to the right, e.g. [长]孙无忌
                    if(i<=chinese.length()-3){
                        String right3 = chinese.substring(i,i+3);
                        if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(right3)){
                            resultPy = py;
                            break;
                        }
                    }

                    // The character listed on its own under this pinyin: default reading
                    if(duoYinZiMap.containsKey(py) && duoYinZiMap.get(py).contains(String.valueOf(chs[i]))){
                        defaultPy = py;
                    }
                }

                // No word matched: use the default reading, else the first candidate
                if(StringUtils.isEmpty(resultPy)){
                    if(StringUtils.isNotEmpty(defaultPy)){
                        resultPy = defaultPy;
                    }else{
                        resultPy = py_arr[0];
                    }
                }
                py_sb.append(convertInitialToUpperCase(resultPy));
            }
        }

        return py_sb.toString();
    }

    private String convertInitialToUpperCase(String str) {
        if (str == null || str.length()==0) {
            return "";
        }
        return str.substring(0, 1).toUpperCase()+str.substring(1);
    }
}
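The five context lookups in getPinyin(String) are easier to see in isolation. The sketch below lists, for a character at index i, the substrings that are tried against the dictionary, in the same order as the code above. ContextWindowSketch and windows are hypothetical names introduced for illustration, and the method assumes 0 <= i < text.length().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that enumerates the context windows around index i; not part of py4j.
class ContextWindowSketch {

    static List<String> windows(String text, int i) {
        int n = text.length();
        List<String> out = new ArrayList<>();
        if (i >= 1) {
            out.add(text.substring(i - 1, i + 1)); // one character to the left
        }
        if (i + 2 <= n) {
            out.add(text.substring(i, i + 2));     // one character to the right
        }
        if (i >= 1 && i + 2 <= n) {
            out.add(text.substring(i - 1, i + 2)); // one character on each side
        }
        if (i >= 2) {
            out.add(text.substring(i - 2, i + 1)); // two characters to the left
        }
        if (i + 3 <= n) {
            out.add(text.substring(i, i + 3));     // two characters to the right
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("龙爪槐", 1));
        System.out.println(windows("列车长", 2));
    }
}
```

In the original method these windows are checked once per candidate pinyin, and the first candidate with a matching window wins. When several dictionary words overlap, the result therefore depends on the order in which Pinyin4j returns the readings.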


Py4jDictionary.java
package com.bytebeats.py4j;

import com.bytebeats.py4j.util.IoUtils;
import com.bytebeats.py4j.util.StringUtils;
import com.google.common.collect.ArrayListMultimap;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Enumeration;

/**
 * Loads the polyphone dictionary (pinyin -> words) from the classpath.
 *
 * @author Ricky Fung
 * @date 2017-02-16 20:16
 */
public class Py4jDictionary {

    private ArrayListMultimap<String,String> duoYinZiMap;

    private static final String PREFIX = "py4j/dictionary/";
    private static final String CONFIG_NAME = "py4j.dic";
    private static final String PINYIN_SEPARATOR = "#";
    private static final String WORD_SEPARATOR = "/";

    private volatile boolean inited;

    private Py4jDictionary(){
    }

    public void init(){
        if(inited){
            return;
        }
        System.out.println("******start load py4j config******");
        Enumeration<URL> configs = null;
        try{
            String fullName = PREFIX + CONFIG_NAME;
            ClassLoader cl = Thread.currentThread().getContextClassLoader();
            // Every matching file on the classpath is returned
            configs = cl.getResources(fullName);
        } catch (Exception e){
            e.printStackTrace();
        }

        this.duoYinZiMap = parse(configs);
        inited = true;
        System.out.println("******load py4j config over******");
        System.out.println("py4j map key size:"+duoYinZiMap.keySet().size());
    }

    private ArrayListMultimap<String,String> parse(Enumeration<URL> configs){
        ArrayListMultimap<String,String> duoYinZiMap = ArrayListMultimap.create(512, 16);
        if(configs!=null){
            while (configs.hasMoreElements()) {
                parseURL(configs.nextElement(), duoYinZiMap);
            }
        }
        return duoYinZiMap;
    }

    private void parseURL(URL url, ArrayListMultimap<String, String> duoYinZiMap){
        System.out.println("parse py4j dictionary:"+url.getPath());
        InputStream in = null;
        BufferedReader br = null;
        try {
            in = url.openStream();
            br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String line = null;
            while ((line = br.readLine()) != null) {
                String[] arr = line.split(PINYIN_SEPARATOR);
                // Skip blank lines and lines without a word list
                if (arr.length > 1 && StringUtils.isNotEmpty(arr[1])) {
                    String[] dyzs = arr[1].split(WORD_SEPARATOR);
                    for (String dyz : dyzs) {
                        if (StringUtils.isNotEmpty(dyz)) {
                            duoYinZiMap.put(arr[0], dyz.trim());
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(String.format("load py4j config:%s error", url), e);
        } finally {
            IoUtils.closeQuietly(br);
            IoUtils.closeQuietly(in);
        }
    }

    ArrayListMultimap<String,String> getDuoYinZiMap(){
        return duoYinZiMap;
    }

    public static Py4jDictionary getDefault(){
        return SingletonHolder.INSTANCE;
    }

    private static class SingletonHolder {
        private static final Py4jDictionary INSTANCE = new Py4jDictionary();
    }
}
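Because init() asks the class loader for all resources named py4j/dictionary/py4j.dic, every copy found on the classpath is parsed into the same map. That is presumably how an extension dictionary placed in your own project gets merged with the bundled one. A minimal sketch for checking which copies are visible at runtime follows; ResourceScanSketch and find are hypothetical names, and only standard JDK calls are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of py4j.
class ResourceScanSketch {

    // Returns every classpath location that provides the named resource.
    static List<URL> find(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(cl.getResources(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (URL url : find("py4j/dictionary/py4j.dic")) {
            System.out.println(url);
        }
    }
}
```

If the list is empty, the loader above ends up with an empty map and every polyphonic character falls back to the first reading Pinyin4j returns.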


Test cases

package com.bytebeats.py4j;

import org.junit.*;

import java.util.Arrays;

/**
 * Unit test for simple App.
 */
public class Py4jTest {

    private Py4j py4j;

    @Before
    public void init(){
        py4j = new Py4j();
    }

    @Test
    public void testChinesePy() {
        final String[] arr = {"肯德基", "重庆银行", "长沙银行", "便宜坊", "西藏", "藏宝图", "出差", "参加", "列车长"};
        for (String chinese : arr){
            String py = py4j.getPinyin(chinese);
            System.out.println(chinese+"\t"+py);
        }
    }

    @Test
    public void testCharPy(){
        char[] chs = {'长', '行', '藏', '度', '阿', '佛', '2', 'A', 'a'};
        for(char ch : chs){
            String[] arr_py = py4j.getPinyin(ch);
            System.out.println(ch+"\t"+Arrays.toString(arr_py));
        }
    }

    @After
    public void destroy(){
        py4j = null;
    }
}


Source code download

py4j:https://github.com/TiFG/py4j



