Python Natural Language Processing (2): The Text Preprocessing Pipeline

Source: the Internet. Editor: 程序博客网. Time: 2024/05/21 22:35


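The complete text preprocessing pipeline looks like this:

Raw text corpus -> tokenization (Tokenize) -> part-of-speech tagging (POS Tag) -> lemmatization/stemming (Lemma/Stemming) -> stopword removal -> processed text corpus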
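
To make the order of the stages concrete, here is a minimal sketch of the same flow using only the Python standard library. Everything in it is a stand-in chosen for illustration: `TOY_STOPWORDS`, `tokenize`, `toy_stem` and `preprocess` are hypothetical names, the suffix stripper is far cruder than a real stemmer, and the POS tagging stage is left out because it needs a trained tagger. The sections below show the NLTK and jieba calls that would take the place of each stage.

```python
import re

# Hypothetical toy stopword list; a real pipeline would use a full list.
TOY_STOPWORDS = {"the", "a", "is", "are", "on", "in"}


def tokenize(text):
    """Split lowercased text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Run the stages in order: tokenize -> stem -> remove stopwords."""
    tokens = tokenize(text)
    words = [t for t in tokens if t.isalpha()]  # drop punctuation tokens
    stems = [toy_stem(w) for w in words]
    return [s for s in stems if s not in TOY_STOPWORDS]


print(preprocess("The dogs are barking in the yard."))
# ['dog', 'bark', 'yard']
```

Each stage takes a list and returns a list, so any one of them can be swapped for a library implementation without touching the others.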

1. Tokenize

import nltk

sent="hello,Python"

tokens=nltk.word_tokenize(sent)

print(tokens)

['hello', ',', 'Python']

2. Chinese word segmentation

import jieba  # import the jieba word segmentation package

seg_list = jieba.cut("他去了一趟南京市长江大桥", cut_all=True)  # full mode

print("Full mode:", '/'.join(seg_list))

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 3.598 seconds.
Prefix dict has been built succesfully.
Full mode: 他/去/了/一趟/南京/南京市/京市/市长/长江/长江大桥/大桥

seg_list = jieba.cut("他去了一趟南京市长江大桥", cut_all=False)  # precise mode, which is the default

print("Precise mode:", '/'.join(seg_list))

Precise mode: 他/去/了/一趟/南京市/长江大桥

seg_list = jieba.cut_for_search("他去了一趟南京市长江大桥")  # search engine mode

print("Search engine mode:", '/'.join(seg_list))

Search engine mode: 他/去/了/一趟/南京/京市/南京市/长江/大桥/长江大桥
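
The full-mode result above contains overlapping words (南京, 南京市, 京市, 市长, ...) because that mode lists every dictionary word it can find rather than choosing a single segmentation. The sketch below illustrates the idea with a hand-picked toy vocabulary. It is a simplified illustration, not jieba's actual implementation: `full_mode_cut` and `toy_vocab` are hypothetical names, and jieba itself works from a large prefix dictionary.

```python
def full_mode_cut(sentence, vocab):
    """List every vocabulary word found in the sentence, overlaps included."""
    words = []
    covered_to = -1  # index of the last character covered by an emitted word
    for start in range(len(sentence)):
        # Inclusive end indices of multi-character vocabulary words starting here.
        ends = [end for end in range(start + 1, len(sentence))
                if sentence[start:end + 1] in vocab]
        if ends:
            for end in ends:
                words.append(sentence[start:end + 1])
                covered_to = max(covered_to, end)
        elif start > covered_to:
            # No word starts here and no earlier word covers this character,
            # so emit the character on its own.
            words.append(sentence[start])
            covered_to = start
    return words


# Toy vocabulary chosen by hand for this one sentence (an assumption).
toy_vocab = {"一趟", "南京", "南京市", "京市", "市长", "长江", "长江大桥", "大桥"}

print("/".join(full_mode_cut("他去了一趟南京市长江大桥", toy_vocab)))
# 他/去/了/一趟/南京/南京市/京市/市长/长江/长江大桥/大桥
```

Precise mode, by contrast, has to pick one path through these candidates, which is why it returns only 南京市 and 长江大桥 for the same span.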

3. Stemming with NLTK

from nltk.stem.porter import PorterStemmer  # Porter stemmer

porter_stem=PorterStemmer()

porter_stem.stem('studying')

'studi'

from nltk.stem.lancaster import LancasterStemmer  # Lancaster stemmer

lancaster_stem = LancasterStemmer()

lancaster_stem.stem('studying')

'study'

from nltk.stem import SnowballStemmer  # Snowball stemmer

snowball=SnowballStemmer('english')

snowball.stem('studying')

'studi'

4. Lemmatization with WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

wordnet_lem=WordNetLemmatizer()

wordnet_lem.lemmatize('dogs')

'dog'

5. POS tagging with NLTK

import nltk

text=nltk.word_tokenize('what does the fox say')

text

['what', 'does', 'the', 'fox', 'say']

nltk.pos_tag(text)
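
The pipeline at the top puts POS tagging before lemmatization. One likely reason is that `WordNetLemmatizer.lemmatize` accepts a `pos` argument and treats every word as a noun when it is omitted, so verb or adjective forms are usually only reduced when the right part of speech is passed in. The helper below is a minimal sketch of the commonly used mapping from the Penn Treebank tags that `nltk.pos_tag` returns by default to the one-letter WordNet codes; `penn_to_wordnet` is a hypothetical name, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun, lemmatize's default


print(penn_to_wordnet("VBZ"))  # v
print(penn_to_wordnet("NNS"))  # n
```

With the objects from the sections above, each `(word, tag)` pair from `nltk.pos_tag(text)` could then be lemmatized as `wordnet_lem.lemmatize(word, pos=penn_to_wordnet(tag))`.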

6. Removing stopwords with NLTK

from nltk.corpus import stopwords

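# First tokenize the text to get wordlist

# ...

# Then filter out the stopwords; build the set once rather than calling stopwords.words() for every word

english_stopwords = set(stopwords.words('english'))

filtered_words = [word for word in wordlist if word not in english_stopwords]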

0 0