Python Web Scraping Study: Day 28
Source: Internet · Editor: 程序博客网 · Posted: 2024/05/19 03:27
Today's topic: natural language processing.
The text sample we will use for data summarization comes from the inaugural address of William Henry Harrison, the ninth president of the United States.
Exercise 1: With a small modification to the earlier n-gram model, we can collect frequency data for 2-gram sequences, and then use Python's operator module to sort the 2-gram frequency dictionary.
```python
from urllib.request import urlopen
import re
import string
import operator

def cleanInput(input1):
    # Collapse newlines and repeated spaces, strip footnote markers like [1]
    input1 = re.sub(r"\n+", " ", input1)
    input1 = re.sub(r"\[[0-9]*\]", "", input1)
    input1 = re.sub(r" +", " ", input1)
    # Drop any non-ASCII characters
    input1 = bytes(input1, "utf-8").decode("ascii", "ignore")
    cleanOutput = []
    for item in input1.split(" "):
        item = item.strip(string.punctuation)
        # Keep words longer than one letter, plus the words "a" and "I"
        if len(item) > 1 or item.lower() == 'a' or item.lower() == 'i':
            cleanOutput.append(item)
    return cleanOutput

def ngrams(input1, n):
    # Build a dict mapping each n-gram (as a space-joined string) to its count
    input1 = cleanInput(input1)
    output = {}
    for i in range(len(input1) - n + 1):
        ngramTemp = " ".join(input1[i:i + n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
    urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
    'utf-8')
ngrams = ngrams(content, 2)
# Sort the (ngram, count) pairs by count, most frequent first
sortedNgrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNgrams)
input()
```
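As an aside, the hand-rolled frequency dict plus `sorted` can also be expressed with `collections.Counter` from the standard library, which counts and sorts in one step. A minimal sketch; the sample word list here is invented for illustration, standing in for the output of `cleanInput`:

```python
from collections import Counter

def ngram_counts(words, n):
    """Count n-grams in a list of words using Counter."""
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Toy word list for illustration only
words = ["the", "quick", "brown", "fox", "the", "quick", "dog"]
counts = ngram_counts(words, 2)
print(counts.most_common(1))  # [('the quick', 2)]
```

`Counter.most_common()` replaces the `sorted(..., key=operator.itemgetter(1), reverse=True)` idiom used above.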
Exercise 2: Add an isCommon-style function (named commonWords below) to pick out the 2-grams that contain high-frequency words.
```python
from urllib.request import urlopen
import re
import string
import operator

def commonWords(ngram):
    # The 100 most common English words; return True if the n-gram
    # contains at least one of them
    commonWord = ["the", "be", "and", "of", "a", "in", "to", "have", "it",
        "i", "that", "for", "you", "he", "with", "on", "do", "say", "this",
        "they", "is", "an", "at", "but", "we", "his", "from", "that", "not",
        "by", "she", "or", "as", "what", "go", "their", "can", "who", "get",
        "if", "would", "her", "all", "my", "make", "about", "know", "will",
        "as", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take",
        "out", "into", "just", "see", "him", "your", "come", "could", "now",
        "than", "like", "other", "how", "then", "its", "our", "two", "more",
        "these", "want", "way", "look", "first", "also", "new", "because",
        "day", "more", "use", "no", "man", "find", "here", "thing", "give",
        "many", "well"]
    for item in ngram.split(" "):
        if item in commonWord:
            return True
    return False

def cleanInput(input1):
    input1 = re.sub(r"\n+", " ", input1)
    input1 = re.sub(r"\[[0-9]*\]", "", input1)
    input1 = re.sub(r" +", " ", input1)
    input1 = bytes(input1, "utf-8").decode("ascii", "ignore")
    cleanOutput = []
    for item in input1.split(" "):
        item = item.strip(string.punctuation)
        if len(item) > 1 or item.lower() == 'a' or item.lower() == 'i':
            cleanOutput.append(item)
    return cleanOutput

def ngrams(input1, n):
    input1 = cleanInput(input1)
    output = {}
    for i in range(len(input1) - n + 1):
        ngramTemp = " ".join(input1[i:i + n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

content = str(
    urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(),
    'utf-8')
ngrams = ngrams(content, 2)
sortedNgrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
# print(sortedNgrams)
count = 0
for item in sortedNgrams:
    if commonWords(item[0]):
        print(item)
        count += 1
print(count)
input()
```
Before adding the common-word filter there were 5894 2-grams in total; with the filter, 4364 remain, so nearly a quarter of them (the uncommon ones) were filtered out.
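One more side note: testing membership in a Python list is O(n) per lookup, while a set is O(1), which adds up when checking thousands of 2-grams against 100 common words. A sketch of the same check using a set; the word list here is abbreviated for illustration:

```python
# Abbreviated common-word set for illustration; the real list has 100 entries
COMMON_WORDS = {"the", "be", "and", "of", "a", "in", "to", "have", "it", "i"}

def contains_common_word(ngram):
    """Return True if any word in the space-separated n-gram is common."""
    return any(word in COMMON_WORDS for word in ngram.split(" "))

print(contains_common_word("of the"))                # True
print(contains_common_word("executive department"))  # False
```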
Today's second exercise took quite a while, because the example code in the book doesn't mesh with the earlier program at all. That's it for today; checking in!