Configuring and Using deepnlp on Ubuntu 16.04


This guide is based mainly on the README of the deepnlp project.
DeepNLP includes the following modules:

  • NLP Pipeline Modules:

    • Word segmentation/tokenization
    • Part-of-speech tagging (POS)
    • Named-entity recognition (NER)
    • textsum: automatic summarization with Seq2Seq-Attention models
    • textrank: extraction of the most important sentences
    • textcnn: document classification
    • Web API: free TensorFlow-powered web API
    • Planned: parsing, automatic summarization
  • Algorithms (closely following the state of the art)

    • Word segmentation: linear-chain CRF (conditional random field), based on the Python CRF++ module
    • POS: LSTM/BI-LSTM network, based on TensorFlow
    • NER: LSTM/BI-LSTM/LSTM-CRF network, based on TensorFlow
    • Textsum: Seq2Seq with attention mechanism
    • Textcnn: CNN
  • Pre-trained models
    • Chinese: segmentation, POS, NER (1998 China Daily corpus)
    • English: POS (Brown corpus)
    • For other languages, the training scripts can be used with a corpus of your choice.

Installation

The models require TensorFlow 1.0. Since the whole toolchain below runs under Python 2.7, install the Python 2.7 GPU wheel:

export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.1-cp27-none-linux_x86_64.whl
sudo pip install --upgrade $TF_BINARY_URL

The models do not work with Python 3.
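
A quick way to confirm that pip installed TensorFlow 1.0.x for the Python 2 interpreter:

python -c "import tensorflow as tf; print(tf.__version__)"
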
Install deepnlp itself with:

sudo pip install deepnlp

Usage

Downloading the pre-trained models

deepnlp installed via pip does not ship the model files, so they must be downloaded separately. Run the following in a Python shell:

import deepnlp

# Download all the modules
deepnlp.download()

# Download only a specific module
deepnlp.download('segment')
deepnlp.download('pos')
deepnlp.download('ner')
deepnlp.download('textsum')
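
The download saves the model files inside the installed deepnlp package. To see where they went (a quick check; the exact subfolder layout may vary between versions):

python -c "import os, deepnlp; print(os.path.dirname(deepnlp.__file__))"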

Word segmentation

Run the following Python program:

#coding=utf-8
from __future__ import unicode_literals

from deepnlp import segmenter

text = "我刚刚在浙江卫视看了电视剧老九门,觉得陈伟霆很帅"
segList = segmenter.seg(text)
text_seg = " ".join(segList)

print(text.encode('utf-8'))
print(text_seg.encode('utf-8'))

It fails with the following error:

Traceback (most recent call last):
  File "test_segment.py", line 4, in <module>
    from deepnlp import segmenter
  File "/usr/local/lib/python2.7/dist-packages/deepnlp/segmenter.py", line 6, in <module>
    import CRFPP
ImportError: No module named CRFPP

Segmentation depends on CRF++ (>= 0.54). Download CRF++ 0.58 from its website, extract it, and run:

./configure
make
sudo make install

Then enter the python folder inside the CRF++ source tree and run:

python setup.py build
su
python setup.py install
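
The binding can be smoke-tested with a bare import before rerunning the full program (this surfaces the same error as the full program if something is still wrong):

python -c "import CRFPP"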

After the installation completes, run the same segmentation program again.

This time a different error appears:

    import CRFPP
  File "/usr/lib/python2.7/dist-packages/bpython/curtsiesfrontend/repl.py", line 257, in load_module
    module = pkgutil.ImpLoader.load_module(self, name)
  File "/usr/lib/python2.7/pkgutil.py", line 246, in load_module
    mod = imp.load_module(fullname, self.file, self.filename, self.etc)
  File "/usr/local/lib/python2.7/dist-packages/CRFPP.py", line 26, in <module>
    _CRFPP = swig_import_helper()
  File "/usr/local/lib/python2.7/dist-packages/CRFPP.py", line 22, in swig_import_helper
    _mod = imp.load_module('_CRFPP', fp, pathname, description)
ImportError: libcrfpp.so.0: cannot open shared object file: No such file or directory

This happens because the dynamic linker cannot find the freshly installed library. Create a symlink to fix it:

sudo ln -s /usr/local/lib/libcrfpp.so.* /usr/lib/ 
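
Alternatively, instead of symlinking into /usr/lib, register /usr/local/lib with the dynamic linker (the .conf file name below is arbitrary):

echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/crfpp.conf
sudo ldconfig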

Part-of-speech tagging

Run the following programs; the first uses the Chinese model, the second the English one:

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

from deepnlp import segmenter
from deepnlp import pos_tagger
tagger = pos_tagger.load_model(lang = 'zh')

# Segmentation
text = "我爱吃北京烤鸭"         # unicode coding, py2 and py3 compatible
words = segmenter.seg(text)
print(" ".join(words).encode('utf-8'))

# POS Tagging
tagging = tagger.predict(words)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair.encode('utf-8'))

# Results
# 我/r
# 爱/v
# 吃/v
# 北京/ns
# 烤鸭/n
The English model works the same way, with whitespace tokenization instead of the Chinese segmenter:

#coding:utf-8
from __future__ import unicode_literals

import deepnlp
deepnlp.download('pos')                      # download the POS pretrained models from github if installed from pip

from deepnlp import pos_tagger
tagger = pos_tagger.load_model(lang = 'en')  # load the English model, lang code 'en'

# Tokenization
text = "I want to see a funny movie"
words = text.split(" ")
print(" ".join(words).encode('utf-8'))

# POS Tagging
tagging = tagger.predict(words)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair.encode('utf-8'))

# Results
# I/nn
# want/vb
# to/to
# see/vb
# a/at
# funny/jj
# movie/nn

Named-entity recognition

Run the following program:

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

import deepnlp
deepnlp.download('ner')  # download the NER pretrained models from github if installed from pip

from deepnlp import segmenter
from deepnlp import ner_tagger
tagger = ner_tagger.load_model(lang = 'zh')

# Segmentation
text = "我爱吃北京烤鸭"
words = segmenter.seg(text)
print(" ".join(words).encode('utf-8'))

# NER tagging
tagging = tagger.predict(words)
for (w, t) in tagging:
    pair = w + "/" + t
    print(pair.encode('utf-8'))

# Results
# 我/nt
# 爱/nt
# 吃/nt
# 北京/p
# 烤鸭/nt

Pipeline

Run the following program (a sample docs_test.txt input file is shown after the code):

#coding:utf-8
from __future__ import unicode_literals # compatible with python3 unicode

import sys, os
import codecs

import deepnlp
deepnlp.download('segment')   # download all the required pretrained models from github if installed from pip
deepnlp.download('pos')
deepnlp.download('ner')

from deepnlp import pipeline
p = pipeline.load_model('zh')

# concatenate tuples into one string "w1/t1 w2/t2 ..."
def _concat_tuples(tagging):
  TOKEN_BLANK = " "
  wl = [] # wordlist
  for (x, y) in tagging:
    wl.append(x + "/" + y) # unicode
  concat_str = TOKEN_BLANK.join(wl)
  return concat_str

# input file
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
docs = []
fileIn = codecs.open(os.path.join(BASE_DIR, 'docs_test.txt'), 'r', encoding='utf-8')
for line in fileIn:
    line = line.replace("\n", "").replace("\r", "")
    docs.append(line)
fileIn.close()

# output file
fileOut = codecs.open(os.path.join(BASE_DIR, 'pipeline_test_results.txt'), 'w', encoding='utf-8')

# analyze function
# @return: list of 3 elements [seg, pos, ner]
text = docs[0]
res = p.analyze(text)
words = p.segment(text)
pos_tagging = p.tag_pos(words)
ner_tagging = p.tag_ner(words)

# print pipeline.analyze() results
fileOut.writelines("pipeline.analyze results:" + "\n")
fileOut.writelines(res[0] + "\n")
fileOut.writelines(res[1] + "\n")
fileOut.writelines(res[2] + "\n")
print(res[0].encode('utf-8'))
print(res[1].encode('utf-8'))
print(res[2].encode('utf-8'))

# print per-module results
fileOut.writelines("modules results:" + "\n")
fileOut.writelines(" ".join(words) + "\n")
fileOut.writelines(_concat_tuples(pos_tagging) + "\n")
fileOut.writelines(_concat_tuples(ner_tagging) + "\n")
fileOut.close()
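
The script reads docs_test.txt from its own directory, one raw (unsegmented) document per line, and analyzes the first line. A minimal sample input file (the sentence is an arbitrary example):

echo "我爱吃北京烤鸭" > docs_test.txt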

Automatic summarization

See https://github.com/rockingdingo/deepnlp/tree/master/deepnlp/textsum or the README in the textsum folder.

Interactive prediction

cd ./ckpt
cat headline_large.ckpt-48000.* > headline_large.ckpt-48000.data-00000-of-00001.tar.gz
tar xzvf headline_large.ckpt-48000.data-00000-of-00001.tar.gz
sudo mkdir /mnt/python/pypi/deepnlp/deepnlp/textsum/ckpt
sudo cp * /mnt/python/pypi/deepnlp/deepnlp/textsum/ckpt
cd ..
python predict.py

Then interactively enter pre-segmented Chinese news body text, with words separated by spaces; the script returns an automatically generated news headline.

Prediction and ROUGE evaluation

python predict.py news/test/content-test.txt news/test/title-test.txt news/test/summary.txt
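
Judging by the file names, content-test.txt holds the segmented test articles, title-test.txt the reference titles, and summary.txt receives the generated summaries, which can then be scored against the references with a ROUGE tool.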