Forward and Reverse Maximum Matching Word Segmentation in Python
Source: Internet  Editor: 程序博客网  Time: 2024/05/22 14:04
Forward maximum matching
    # -*- coding: utf-8 -*-
    # Ported to Python 3: str is already Unicode, so the u() helper is not needed.
    CODEC = 'utf-8'


    def fwd_mm_seg(wordDict, maxLen, text):
        'forward max match segment'
        wordList = []
        segStr = text
        while segStr:
            # Start with the longest candidate allowed, capped by what is left.
            wordLen = min(maxLen, len(segStr))
            subStr = segStr[:wordLen]
            # Drop characters from the right until the candidate is a known word.
            # A single character is always accepted as a fallback.
            while wordLen > 1 and subStr not in wordDict:
                wordLen -= 1
                subStr = subStr[:wordLen]
            wordList.append(subStr)
            # Continue with the text after the matched word.
            segStr = segStr[wordLen:]
        return wordList


    def main():
        # words.dic holds one UTF-8 encoded word per line.
        with open('words.dic', encoding=CODEC) as fp_dict:
            wordDict = {line.strip() for line in fp_dict if line.strip()}
        segStr = '你好世界hello world'
        print(segStr)
        wordList = fwd_mm_seg(wordDict, 10, segStr)
        print("==".join(wordList))


    if __name__ == '__main__':
        main()

Reverse maximum matching
    # -*- coding: utf-8 -*-
    # Ported to Python 3: str is already Unicode, so the u() helper is not needed.
    CODEC = 'utf-8'


    def bwd_mm_seg(wordDict, maxLen, text):
        'backward max match segment'
        wordList = []
        segStr = text
        while segStr:
            # Start with the longest candidate allowed, taken from the end.
            wordLen = min(maxLen, len(segStr))
            subStr = segStr[-wordLen:]
            # Drop characters from the left until the candidate is a known word.
            # A single character is always accepted as a fallback.
            while wordLen > 1 and subStr not in wordDict:
                wordLen -= 1
                subStr = subStr[-wordLen:]
            wordList.append(subStr)
            # Continue with the text before the matched word.
            segStr = segStr[:-wordLen]
        # Words were collected from right to left, so restore reading order.
        wordList.reverse()
        return wordList


    def main():
        # words.dic holds one UTF-8 encoded word per line.
        with open('words.dic', encoding=CODEC) as fp_dict:
            wordDict = {line.strip() for line in fp_dict if line.strip()}
        segStr = '你好世界hello world'
        print(segStr)
        wordList = bwd_mm_seg(wordDict, 10, segStr)
        print("==".join(wordList))


    if __name__ == '__main__':
        main()
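The two strategies can disagree on ambiguous input. The following is a minimal, self-contained sketch that makes the difference visible without a `words.dic` file. The names `forward_mm`, `backward_mm`, `demo_words` and `sentence`, the tiny in-memory dictionary, and the window size of 3 are all assumptions made for illustration; they are not part of the original post.

```python
def forward_mm(words, max_len, text):
    """Greedy left-to-right matching; unknown characters become single tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest window first, then shrink it from the right.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in words:
                tokens.append(piece)
                pos += size
                break
    return tokens


def backward_mm(words, max_len, text):
    """Greedy right-to-left matching; unknown characters become single tokens."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest window first, then shrink it from the left.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in words:
                tokens.append(piece)
                end -= size
                break
    # Tokens were collected from the end, so put them back in reading order.
    tokens.reverse()
    return tokens


# Hypothetical toy dictionary, chosen so that the two directions disagree.
demo_words = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

print("/".join(forward_mm(demo_words, 3, sentence)))   # 研究生/命/的/起源
print("/".join(backward_mm(demo_words, 3, sentence)))  # 研究/生命/的/起源
```

With this toy dictionary, the forward pass commits to the longer word 研究生 and leaves 命 stranded, while the reverse pass recovers 研究 and 生命. Both functions always return tokens that concatenate back to the input, because a single character is accepted whenever no dictionary word matches.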