Python Web Scraping Study, Day 28


Today's topic: natural language processing.

The text sample we will use for data summarization comes from the inaugural address of William Henry Harrison, the ninth president of the United States.

Exercise: with a small modification to the earlier n-gram model we can collect frequency data for 2-gram sequences, and then use Python's operator module to sort the resulting 2-gram frequency dictionary.

from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input1):
    # Collapse newlines and runs of spaces, and strip citation markers like "[12]"
    input1 = re.sub(r"\n+", " ", input1)
    input1 = re.sub(r"\[[0-9]*\]", "", input1)
    input1 = re.sub(r" +", " ", input1)
    # Drop any non-ASCII characters
    input1 = bytes(input1, "utf-8").decode("ascii", "ignore")
    cleanOutput = []
    for item in input1.split(" "):
        item = item.strip(string.punctuation)
        # Keep words of two or more letters, plus the one-letter words "a" and "i"
        # (my original had 'b' here, which looks like a typo for 'i')
        if len(item) > 1 or item.lower() == 'a' or item.lower() == 'i':
            cleanOutput.append(item)
    return cleanOutput

def ngrams(input1, n):
    input1 = cleanInput(input1)
    output = {}
    # Slide an n-word window across the text and count each n-gram
    for i in range(len(input1) - n + 1):
        ngramTemp = " ".join(input1[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), "utf-8")
ngramCounts = ngrams(content, 2)
# Sort the 2-gram dictionary by frequency, highest first
sortedNgrams = sorted(ngramCounts.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNgrams)
input()
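As an aside (my own variant, not from the book): the counting loop in ngrams can be written more compactly with collections.Counter, whose most_common() method also replaces the operator.itemgetter sort. A minimal sketch, reusing the cleanInput function from the program above:

from collections import Counter

def ngramsCounter(input1, n):
    words = cleanInput(input1)  # reuses cleanInput from the program above
    # Counter tallies every n-word window in a single pass
    return Counter(" ".join(words[i:i+n]) for i in range(len(words) - n + 1))

# most_common() returns (ngram, count) pairs sorted by count, descending
# print(ngramsCounter(content, 2).most_common(10))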

Exercise: add an isCommon-style function (named commonWords below) to pick out the 2-grams that contain high-frequency words.

from urllib.request import urlopen  # this import was missing in my first attempt
import re
import string
import operator

def commonWords(ngram):
    # The ~100 most common English words, copied from the book
    # (note "that", "as", and "more" each appear twice in this copy)
    commonWordList = ["the", "be", "and", "of", "a", "in", "to", "have", "it",
        "i", "that", "for", "you", "he", "with", "on", "do", "say", "this",
        "they", "is", "an", "at", "but", "we", "his", "from", "that", "not",
        "by", "she", "or", "as", "what", "go", "their", "can", "who", "get",
        "if", "would", "her", "all", "my", "make", "about", "know", "will",
        "as", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take",
        "out", "into", "just", "see", "him", "your", "come", "could", "now",
        "than", "like", "other", "how", "then", "its", "our", "two", "more",
        "these", "want", "way", "look", "first", "also", "new", "because",
        "day", "more", "use", "no", "man", "find", "here", "thing", "give",
        "many", "well"]
    # Return True if any word of the 2-gram is a common word
    for item in ngram.split(" "):
        if item in commonWordList:
            return True
    return False

def cleanInput(input1):
    input1 = re.sub(r"\n+", " ", input1)
    input1 = re.sub(r"\[[0-9]*\]", "", input1)
    input1 = re.sub(r" +", " ", input1)
    input1 = bytes(input1, "utf-8").decode("ascii", "ignore")
    cleanOutput = []
    for item in input1.split(" "):
        item = item.strip(string.punctuation)
        if len(item) > 1 or item.lower() == 'a' or item.lower() == 'i':
            cleanOutput.append(item)
    return cleanOutput

def ngrams(input1, n):
    input1 = cleanInput(input1)
    output = {}
    for i in range(len(input1) - n + 1):
        ngramTemp = " ".join(input1[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), "utf-8")
ngramCounts = ngrams(content, 2)
sortedNgrams = sorted(ngramCounts.items(), key=operator.itemgetter(1), reverse=True)
# print(sortedNgrams)
count = 0
for item in sortedNgrams:
    if commonWords(item[0]):
        print(item)
        count += 1  # "count++" is C syntax, not valid Python
print(count)
input()
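One more note of my own (not from the book): commonWords scans a 100-element list for every word, and because cleanInput preserves case, capitalized forms like "The" never match. Storing the words in a set gives O(1) membership tests, and lowercasing the candidate word would catch the capitalized forms too, though that would change the counts reported below. A sketch, assuming COMMON holds the same word list (abbreviated here; COMMON and commonWordsFast are my own names):

# Same word list as in commonWords above, stored as a set for O(1) lookups
COMMON = {"the", "be", "and", "of", "a", "in", "to", "have", "it", "i"}  # ...plus the rest

def commonWordsFast(ngram):
    # lower() also matches capitalized words; note this changes the counts
    # compared with the case-sensitive original
    return any(word.lower() in COMMON for word in ngram.split(" "))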

Before adding the common-word filter there were 5,894 distinct 2-grams; with the filter, 4,364 remain. In other words, the 1,530 2-grams that contain no common word at all, roughly a quarter of the total (1530 / 5894 ≈ 26%), were screened out.

For today's second exercise, the sample code in the book didn't fit the earlier program at all, so it took quite a while to get working. That's it for today, checking in~