Training Your Own Model with Stanford NER

Source: Internet  Editor: 程序博客网  Date: 2024/06/08 19:19

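Stanford NER

Stanford NER (Stanford Named Entity Recognizer) is an open-source named entity recognition library from Stanford University, implemented in Java. It identifies entities such as person, location, and organization names in text, using a CRF (conditional random field) classifier.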

Running named entity recognition with Stanford NER

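This walkthrough follows the official documentation.
1. Download stanford-ner-2015-12-09.zip.
2. Unzip stanford-ner-2015-12-09.zip into a directory, for example stanford-ner.
3. Change into that directory: cd stanford-ner
4. On Linux/macOS, run the following command to tag sample.txt. It uses the English model bundled with Stanford NER, which recognizes person, location, and organization names.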

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt

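5. The command prints the result below. Each token is followed by its label: O means the token is not part of any entity, while PERSON and ORGANIZATION mark person and organization names.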

The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O 
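The token/LABEL output above is easy to post-process. The following is a minimal Python sketch, not part of Stanford NER, that groups consecutive tokens with the same non-O label into entities. The function name extract_entities is made up for illustration, and the sketch assumes that two adjacent entities of the same type never occur back to back (this plain output format cannot tell them apart).

```python
from itertools import groupby


def extract_entities(tagged_text, outside_tag="O"):
    """Group consecutive tokens that share a non-O label into (text, label) pairs."""
    pairs = []
    for item in tagged_text.split():
        # Split on the LAST slash so a token that itself contains "/" stays intact.
        token, sep, label = item.rpartition("/")
        if not sep:
            continue  # skip anything that is not in token/LABEL form
        pairs.append((token, label))

    entities = []
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside_tag:
            entities.append((" ".join(token for token, _ in group), label))
    return entities


sample = ("led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O "
          "president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION")
print(extract_entities(sample))
# [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```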

Training your own model with Stanford NER

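This walkthrough follows the official documentation.
1. Prepare the training data. Each line has two tab-separated columns: the first is the token, the second is its label. Blank lines separate "documents"; a document can be a sentence or a paragraph and should not be too long, otherwise training uses a lot of memory. Taking Indonesian as the example, convert the following sentence into training data: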

Pengamat politik dari Universitas Gadjah Mada, Arie Sudjito, menilai, keinginan Ketua Umum Partai Golkar Aburizal Bakrie untuk maju kembali sebagai ketua umum merupakan pemaksaan kehendak.

Save the training data to jane-austen-emma-ch1.tsv:

Pengamat	O
politik	O
dari	O
Universitas	ORGANIZATION
Gadjah	ORGANIZATION
Mada	ORGANIZATION
Arie	PERSON
Sudjito	PERSON
,	O
menilai	O
keinginan	O
Ketua	O
Umum	O
Partai	ORGANIZATION
Golkar	ORGANIZATION
Aburizal	PERSON
Bakrie	PERSON
untuk	O
maju	O
kembali	O
sebagai	O
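Because the trainer takes whatever is in the second column as a label, it helps to check the file before training, for example to catch a row whose second column holds a word instead of a label. The sketch below is a hypothetical helper (check_training_lines is not part of Stanford NER) and assumes the label set is just O, PERSON, and ORGANIZATION, as in this example.

```python
def check_training_lines(lines, allowed_tags):
    """Return (line_number, message) pairs for rows that break the two-column format."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # a blank line is a document boundary, not an error
        columns = line.split("\t")
        if len(columns) != 2:
            problems.append(
                (number, "expected 2 tab-separated columns, found %d" % len(columns)))
        elif columns[1] not in allowed_tags:
            problems.append((number, "unknown tag %r" % columns[1]))
    return problems


# Assumed label set for this example only.
TAGS = {"O", "PERSON", "ORGANIZATION"}

rows = ["Pengamat\tO\n", "Ketua\tUmum\n", "\n", "Partai\tORGANIZATION\n", "Golkar ORGANIZATION\n"]
for number, message in check_training_lines(rows, TAGS):
    print("line %d: %s" % (number, message))
```

To check a real file, pass the open file object instead of the list, e.g. check_training_lines(open("jane-austen-emma-ch1.tsv", encoding="utf-8"), TAGS).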

2. Write the properties file and save it as austen.prop:

# location of the training file
trainFile = jane-austen-emma-ch1.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
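The properties file must have one key=value pair per line; if line breaks are lost when copying it, comments swallow the settings. As a quick sanity check, here is a rough Python sketch that reads a simplified subset of the properties syntax (only # comments and key=value lines; real Java properties files allow more) and confirms that trainFile, serializeTo, and map are present. The helper names parse_props and parse_column_map are invented for this example.

```python
def parse_props(text):
    """Parse 'key = value' lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the FIRST "=" only, so "map = word=0,answer=1" keeps its value whole.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def parse_column_map(value):
    """Turn 'word=0,answer=1' into {'word': 0, 'answer': 1}."""
    columns = {}
    for item in value.split(","):
        name, _, index = item.partition("=")
        columns[name.strip()] = int(index)
    return columns


example = """\
# location of the training file
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
"""

props = parse_props(example)
missing = [key for key in ("trainFile", "serializeTo", "map") if key not in props]
print("missing keys:", missing)
print("columns:", parse_column_map(props["map"]))
```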

3. Run the following command to train the model; it produces the NER model ner-model.ser.gz.

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

4. Create the test data and save it as test.txt.

Hal ini, kata Arie, berpotensi menimbukan perpecahan di kalangan kader Golkar di daerah.

5. Run the following command to test the model.

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz  -textFile test.txt

6. Result:

Hal/O ini/O ,/O kata/O Arie/PERSON ,/O berpotensi/O menimbukan/O perpecahan/O di/O kalangan/O kader/O Golkar/ORGANIZATION di/O daerah/O ./O
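The training and tagging commands above can also be driven from a script. The sketch below only assembles the same command lines shown in this article and runs them with Python's subprocess module; the function names and the assumption that the distribution was unpacked into a directory called stanford-ner are illustrative. It uses os.pathsep for the classpath, which yields ":" on Linux/macOS as in the commands above and ";" on Windows. It also assumes the tagged text arrives on standard output.

```python
import os
import subprocess

MAIN_CLASS = "edu.stanford.nlp.ie.crf.CRFClassifier"


def base_command(heap="600m"):
    """Build the shared 'java -mx600m -cp "*:lib/*" <main class>' prefix."""
    # Java expands the "*" wildcards itself, so no shell is needed.
    classpath = os.pathsep.join(["*", "lib/*"])
    return ["java", "-mx" + heap, "-cp", classpath, MAIN_CLASS]


def train_command(prop_file):
    """Command line for training from a properties file."""
    return base_command() + ["-prop", prop_file]


def tag_command(model_file, text_file):
    """Command line for tagging a text file with a trained model."""
    return base_command() + ["-loadClassifier", model_file, "-textFile", text_file]


def run(command, cwd="stanford-ner"):
    """Run a command inside the unpacked distribution and return what it printed."""
    result = subprocess.run(command, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    run(train_command("austen.prop"))
    print(run(tag_command("ner-model.ser.gz", "test.txt")))
```

Keeping command construction separate from execution makes the command lines easy to print and inspect before anything is run.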

Notes

This article only covers using Stanford NER from the command line. For calling Stanford NER from code, see the Stanford NER documentation.
