Training Your Own Model with Stanford NER


Stanford NER

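Stanford NER (Stanford Named Entity Recognizer) is an open-source named entity recognition library from Stanford University, implemented in Java. It identifies entities in text such as person names, locations, and organization names, and it does so with a CRF (conditional random field) classifier.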

Recognizing named entities with Stanford NER

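1. Download the distribution archive stanford-ner-2015-12-09.zip.
2. Unzip stanford-ner-2015-12-09.zip into a directory, for example stanford-ner.
3. Change into that directory: cd stanford-ner
4. On Linux or macOS, run the following command to test named entity recognition on the file sample.txt. It uses the English model bundled with Stanford NER, which recognizes person, location, and organization names.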

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
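The command above is written for Linux and macOS, where the Java classpath separator is `:`; on Windows the separator is `;`. As a minimal sketch, the Python helper below assembles the same invocation with the separator of the current platform. The function name `ner_command` is hypothetical, and only the flags already shown in the command are used.

```python
import os


def ner_command(model, text_file, pathsep=os.pathsep):
    """Build the argument list for the CRFClassifier invocation shown above.

    pathsep defaults to the current platform's classpath separator
    (':' on Linux/macOS, ';' on Windows).
    """
    classpath = pathsep.join(["*", "lib/*"])
    return [
        "java", "-mx600m", "-cp", classpath,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model,
        "-textFile", text_file,
    ]


cmd = ner_command(
    "classifiers/english.all.3class.distsim.crf.ser.gz",
    "sample.txt",
    pathsep=":",
)
print(" ".join(cmd))

# To actually run it from the stanford-ner directory (requires Java on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so the `*` wildcards reach Java unexpanded, which is what the quotes achieve in the shell command.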

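5. The command produces the output below, in which every token is followed by its label: O means the token is not part of a named entity, while PERSON and ORGANIZATION mark person names and organization names.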

The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O 
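The token/LABEL output is easy to post-process. The following Python sketch splits each item on its last slash and merges runs of adjacent tokens that share a non-O label into one entity. The function names are hypothetical, and the approach assumes this slash-separated format; because there are no begin/inside markers, two distinct entities of the same type that sit directly next to each other would be merged into one.

```python
from itertools import groupby


def parse_tagged(text):
    """Split whitespace-separated 'token/LABEL' items into (token, label) pairs."""
    pairs = []
    for item in text.split():
        token, _, label = item.rpartition("/")  # split on the last slash only
        pairs.append((token, label))
    return pairs


def extract_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens with the same non-O label into (text, label)."""
    entities = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


sample = (
    "led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O "
    "president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION"
)
print(extract_entities(parse_tagged(sample)))
# [('Timothy R. Geithner', 'PERSON'), ('New York Fed', 'ORGANIZATION')]
```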

Training your own model with Stanford NER

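1. Prepare the training data. Each line has two tab-separated columns: the first is the token and the second is its label. A blank line separates "documents"; a document can be a sentence or a paragraph, and it should not be too long, otherwise training consumes a lot of memory. Taking Indonesian as an example, convert the following sentence into training data: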

Pengamat politik dari Universitas Gadjah Mada, Arie Sudjito, menilai, keinginan Ketua Umum Partai Golkar Aburizal Bakrie untuk maju kembali sebagai ketua umum merupakan pemaksaan kehendak.


Pengamat	O
politik	O
dari	O
Universitas	ORGANIZATION
Gadjah	ORGANIZATION
Mada	ORGANIZATION
Arie	PERSON
Sudjito	PERSON
,	O
menilai	O
keinginan	O
Ketua	O
Umum	O
Partai	ORGANIZATION
Golkar	ORGANIZATION
Aburizal	PERSON
Bakrie	PERSON
untuk	O
maju	O
kembali	O
sebagai	O
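Writing this file by hand is error-prone, so it can help to generate it. The sketch below, in Python, renders already-labeled tokens in the layout described in step 1: one token per line, a tab between token and label, and a blank line between documents. The function name `to_training_tsv` and the split into two documents are illustrative assumptions; the labels are the ones used in the example sentence.

```python
def to_training_tsv(documents):
    """Render documents as two-column, tab-separated training text.

    documents is a list of documents, each a list of (token, label) pairs.
    Documents are separated by one blank line.
    """
    blocks = []
    for doc in documents:
        lines = []
        for token, label in doc:
            # A tab or newline inside a token would break the column layout.
            if "\t" in token or "\n" in token:
                raise ValueError(f"token contains whitespace: {token!r}")
            lines.append(f"{token}\t{label}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


docs = [
    [("Arie", "PERSON"), ("Sudjito", "PERSON"), (",", "O"), ("menilai", "O")],
    [("Partai", "ORGANIZATION"), ("Golkar", "ORGANIZATION")],
]
tsv = to_training_tsv(docs)
print(tsv)
```

The resulting string can be written to the file that trainFile points to in the properties file.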

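2. Create the properties file and save it as austen.prop. The trainFile property must point to the training file prepared in step 1.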

# location of the training file
trainFile = jane-austen-emma-ch1.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1

# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
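Before training, it can be worth checking that the properties file reads as intended, in particular trainFile, serializeTo, and the column mapping. The Python sketch below parses simple `key = value` lines and turns the `map` value into a dictionary. It is an illustrative helper, not the parser Stanford NER uses, and it handles only the subset of the Java properties syntax that appears in this file (`#` comments and `=` separators, with no line continuations).

```python
def parse_props(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so 'map = word=0,answer=1' stays intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def column_map(props):
    """Turn the 'map' value, e.g. 'word=0,answer=1', into {'word': 0, 'answer': 1}."""
    result = {}
    for entry in props["map"].split(","):
        name, _, index = entry.partition("=")
        result[name.strip()] = int(index)
    return result


example = """\
# location of the training file
trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
"""
props = parse_props(example)
print(props["trainFile"], column_map(props))
```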

3. Run the following command to train the model; it produces the NER model file ner-model.ser.gz.

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

4. Create the test data and save it as test.txt.

Hal ini, kata Arie, berpotensi menimbukan perpecahan di kalangan kader Golkar di daerah.

5. Run the following command to test the trained model.

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz  -textFile test.txt

6. The result:

Hal/O ini/O ,/O kata/O Arie/PERSON ,/O berpotensi/O menimbukan/O perpecahan/O di/O kalangan/O kader/O Golkar/ORGANIZATION di/O daerah/O ./O


This article only covers the command-line usage of Stanford NER. For using Stanford NER from code, see the Stanford NER documentation.
