LingPipe: a text tokenization and entity recognition example
Source: Internet · Editor: 程序博客网 · Date: 2024/05/17
1) What is LingPipe?
In short, LingPipe is a Java toolkit for natural language processing (NLP).
LingPipe's main modules include:
Topic Classification
Named Entity Recognition (NER), i.e. recognizing person names, place names, organization names, and similar entities in text
Part-of-Speech Tagging
Sentence Detection
Query Spell Checking
Interesting Phrase Detection
Clustering
Character Language Modeling
MEDLINE Download, Parsing and Indexing
Database Text Mining
Chinese Word Segmentation
Sentiment Analysis
Language Identification
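To make the NER idea concrete before the full LingPipe example below, here is a toy sketch in plain Java (no LingPipe dependency; `findEntities` is my own illustrative helper, not a LingPipe API) of what exact dictionary-based NER does: scan the text for every dictionary phrase and report each match's span and tag. LingPipe's `ExactDictionaryChunker` does this far more efficiently, but the core idea is the same.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictNerSketch {
    // For each dictionary phrase, report every occurrence as TYPE:start-end
    // (end exclusive). A real chunker would also handle tokenization,
    // case folding, and overlapping-match resolution.
    public static String findEntities(String text, Map<String, String> dict) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : dict.entrySet()) {
            int start = text.indexOf(e.getKey());
            while (start >= 0) {
                int end = start + e.getKey().length();
                out.append(e.getValue()).append(":")
                   .append(start).append("-").append(end).append(" ");
                start = text.indexOf(e.getKey(), end);
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new LinkedHashMap<String, String>();
        dict.put("50 Cent", "PERSON");
        dict.put("XYZ120 DVD Player", "PRODUCT");
        System.out.println(
            findEntities("50 Cent XYZ120 DVD Player 50 Cent lawyer.", dict));
        // prints "PERSON:0-7 PERSON:26-33 PRODUCT:8-25"
    }
}
```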
Reference
LingPipe official tutorial: http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html
Stanford CoreNLP, a related NLP toolkit: http://nlp.stanford.edu/software/corenlp.shtml
2) I wrote a tokenization example for reference (it applies named entity recognition and sentence detection, and uses lingpipe-4.1.0.jar), e.g.
import java.util.ArrayList;
import java.util.List;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TextAnalyzer {

    static final double CHUNK_SCORE = 1.0;
    static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
    static final SentenceModel SENTENCE_MODEL = new IndoEuropeanSentenceModel();

    public static void main(String[] args) {
        testChunkSentences();
        testChunkDictionary();
    }

    // Sentence chunking: split the text into sentences
    private static void testChunkSentences() {
        String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
        List<String> result = new ArrayList<String>();
        List<String> tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(text.toCharArray(), 0, text.length());
        tokenizer.tokenize(tokenList, whiteList);
        String[] tokens = tokenList.toArray(new String[tokenList.size()]);
        String[] whites = whiteList.toArray(new String[whiteList.size()]);
        int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);
        int sentStartTok = 0;
        int sentEndTok = 0;
        for (int i = 0; i < sentenceBoundaries.length; ++i) {
            System.out.println("Sentence " + (i + 1)
                    + ", boundary token index: " + sentenceBoundaries[i]);
            StringBuilder sb = new StringBuilder();
            sentEndTok = sentenceBoundaries[i];
            // Rebuild the sentence by interleaving tokens with the
            // whitespace that follows each of them.
            for (int j = sentStartTok; j <= sentEndTok; j++) {
                sb.append(tokens[j]).append(whites[j + 1]);
            }
            sentStartTok = sentEndTok + 1;
            result.add(sb.toString());
        }
        System.out.println("Final result: " + result);
    }

    // NER: exact dictionary-based chunking
    private static void testChunkDictionary() {
        String[] texts = {
            "50 Cent XYZ120 DVD Player 50 Cent lawyer.",
            "person is john, he is a lawyer."
        };
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("50 Cent", "PERSON", CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("XYZ120 DVD Player", "DB_ID_1232", CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("cent", "MONETARY_UNIT", CHUNK_SCORE));
        dictionary.addEntry(new DictionaryEntry<String>("dvd player", "PRODUCT", CHUNK_SCORE));

        // Constructor flags: (dictionary, tokenizerFactory, returnAllMatches, caseSensitive).
        // When returnAllMatches is false, overlapping matches are resolved
        // instead of all being returned.
        ExactDictionaryChunker dictionaryChunkerTT = new ExactDictionaryChunker(dictionary,
                IndoEuropeanTokenizerFactory.INSTANCE, true, true);
        ExactDictionaryChunker dictionaryChunkerTF = new ExactDictionaryChunker(dictionary,
                IndoEuropeanTokenizerFactory.INSTANCE, true, false);
        ExactDictionaryChunker dictionaryChunkerFT = new ExactDictionaryChunker(dictionary,
                IndoEuropeanTokenizerFactory.INSTANCE, false, true);
        ExactDictionaryChunker dictionaryChunkerFF = new ExactDictionaryChunker(dictionary,
                IndoEuropeanTokenizerFactory.INSTANCE, false, false);

        System.out.println("\nDICTIONARY\n" + dictionary);
        for (String text : texts) {
            System.out.println("\n\nTEXT=" + text);
            chunk(dictionaryChunkerTT, text);
            chunk(dictionaryChunkerTF, text);
            chunk(dictionaryChunkerFT, text);
            chunk(dictionaryChunkerFF, text);
        }
    }

    static void chunk(ExactDictionaryChunker chunker, String text) {
        System.out.println("\nChunker."
                + " All matches=" + chunker.returnAllMatches()
                + " Case sensitive=" + chunker.caseSensitive());
        Chunking chunking = chunker.chunk(text);
        for (Chunk chunk : chunking.chunkSet()) {
            int start = chunk.start();
            int end = chunk.end();
            String type = chunk.type();
            double score = chunk.score();
            String phrase = text.substring(start, end);
            System.out.println("     phrase=|" + phrase + "|"
                    + " start=" + start
                    + " end=" + end
                    + " type=" + type
                    + " score=" + score);
        }
    }
}
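A note on the sentence-reassembly loop in `testChunkSentences`: LingPipe tokenizers that preserve whitespace produce the sequence whites[0] tokens[0] whites[1] tokens[1] ... whites[n], so the whitespace array always has one more element than the token array, and appending `tokens[j]` followed by `whites[j + 1]` rebuilds the original text. The following standalone sketch (plain Java, no LingPipe; `reassemble` is my own illustrative helper) shows this invariant:

```java
public class InterleaveSketch {
    // Rebuild the original text from interleaved tokens and whitespaces.
    // whites must have exactly one more element than tokens:
    //   whites[0] tokens[0] whites[1] ... tokens[n-1] whites[n]
    public static String reassemble(String[] tokens, String[] whites) {
        StringBuilder sb = new StringBuilder(whites[0]);
        for (int j = 0; j < tokens.length; j++) {
            sb.append(tokens[j]).append(whites[j + 1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Person", "is", "john", "."};
        String[] whites = {"", " ", " ", "", ""};
        System.out.println(reassemble(tokens, whites)); // prints "Person is john."
    }
}
```

This is why the demo indexes `whites[j + 1]` rather than `whites[j]`: the leading whitespace of each sentence is skipped and only the whitespace after each token is kept.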