Learning Python Web Scraping: Day 29

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 23:46

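Today's main topic: Markov models
Google's PageRank algorithm is based on a Markov model: websites are the nodes, and inbound/outbound links are the edges between them. The "likelihood" of arriving at a given node indicates that site's relative popularity.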
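To make the "likelihood of arriving at a node" idea concrete, here is a minimal sketch on a made-up three-page web. The page names, the `links` graph, and the `stationary_distribution` helper are all assumptions for illustration. This is a simplified random walk, not Google's actual algorithm: it has no damping factor and assumes every page has at least one outbound link.

```python
# Hypothetical three-page web: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def stationary_distribution(links, steps=100):
    """Repeatedly push each page's probability along its outbound links."""
    pages = list(links)
    # Start with the surfer equally likely to be on any page.
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(steps):
        new_rank = {page: 0.0 for page in pages}
        for page, outgoing in links.items():
            # A page splits its probability evenly among its links.
            share = rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank


ranks = stationary_distribution(links)
for page, value in ranks.items():
    print(page, round(value, 3))
```

In this toy graph the values settle near 0.4 for A and C and 0.2 for B: C is linked from two pages, and A receives all of C's probability, so both end up visited more often than B.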

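Exercise: generate sentences with a Markov chain of a given length
Here a two-level dictionary models the Markov chain. Each time a new word appears, it gets an entry in the first-level dictionary, and the value of that entry is itself a dictionary. Then look at the word that follows it: if that following word is already a key in the preceding word's second-level dictionary, nothing needs to be created; if it is not found, add it as a key and initialize its count to 0. Either way, finish by adding 1 to the count for the following word. Two more functions are used to generate the random chain: wordListSum() and getRandomText(). They are easy to follow once you read them.

When I wrote this myself, I got the line "randomIndex-=value" wrong. At first I wrote "randomIndex-=1", which made the lookup return None. After working through the logic again I understood why. For example, suppose the second-level dictionary for word_c is {word_b : 5, word_d : 2}. randomIndex is drawn from 1 to 7, the sum of the counts. If the loop subtracts only 1 per entry, then any randomIndex > 2 never drops to 0 within the two iterations, so no word is ever returned. Each word occurred multiple times, so it is the word's count that has to be subtracted.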
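The mistake can be reproduced in isolation. The sketch below uses the {word_b : 5, word_d : 2} example and tries every possible index from 1 to 7 instead of drawing one at random. The helper names `pick_buggy` and `pick_weighted` are made up for this illustration.

```python
def pick_buggy(followers, random_index):
    """Subtracts 1 per entry and ignores the counts (the mistaken version)."""
    for word, count in followers.items():
        random_index -= 1
        if random_index <= 0:
            return word
    return None  # the loop ran out of entries before the index reached 0


def pick_weighted(followers, random_index):
    """Subtracts each entry's count, so a word owns `count` index values."""
    for word, count in followers.items():
        random_index -= count
        if random_index <= 0:
            return word
    return None


followers = {"word_b": 5, "word_d": 2}
total = sum(followers.values())  # 7 possible index values

buggy = [pick_buggy(followers, i) for i in range(1, total + 1)]
weighted = [pick_weighted(followers, i) for i in range(1, total + 1)]
print(buggy)     # only indexes 1 and 2 yield a word, the rest are None
print(weighted)  # word_b for indexes 1-5, word_d for indexes 6-7
```

With the count subtracted, word_b is chosen for 5 of the 7 index values and word_d for 2, which matches how often each one followed word_c.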

from urllib.request import urlopen
import random


def wordListSum(wordList):
    # Total number of times any word followed the current word
    total = 0
    for word, value in wordList.items():
        total += value
    return total


def getRandomText(wordList):
    # Pick a following word at random, weighted by how often it occurred
    total = wordListSum(wordList)
    randomIndex = random.randint(1, total)
    for word, value in wordList.items():
        randomIndex -= value  # subtract the count, not 1
        if randomIndex <= 0:
            return word


def wordListDict(text):
    # Remove newlines and quotes, and treat punctuation marks as separate words
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split(" ")
    words = [word for word in words if word != ""]
    wordDict = {}
    # Start at 1 so that words[i-1] does not wrap around to the last word
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict


html = urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt")
text = str(html.read(), "utf-8")
wordDict = wordListDict(text)

# Generate a chain of 200 words, starting from "I"
length = 200
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += " " + currentWord
    currentWord = getRandomText(wordDict[currentWord])
print(chain)
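For comparison, the same two-level structure can be sketched with the standard library's `defaultdict`, `Counter`, and `random.choices`, which handle the "create if missing" and weighted-pick steps. This is an alternative, not the version used above. The names `build_chain` and `next_word`, the sample sentence, and the fixed seed are assumptions chosen so the example runs offline.

```python
import random
from collections import Counter, defaultdict


def build_chain(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.split()
    chain = defaultdict(Counter)
    # Pair every word with its successor; missing entries are created on demand.
    for previous, current in zip(words, words[1:]):
        chain[previous][current] += 1
    return chain


def next_word(followers, rng=random):
    """Pick a follower with probability proportional to its count."""
    words = list(followers)
    weights = list(followers.values())
    return rng.choices(words, weights=weights, k=1)[0]


sample = "the cat sat on the mat and the cat ran"
chain = build_chain(sample)

rng = random.Random(0)  # seeded only to make the run repeatable
word = "the"
generated = [word]
for _ in range(5):
    if word not in chain:  # dead end: this word never had a follower
        break
    word = next_word(chain[word], rng)
    generated.append(word)
print(" ".join(generated))
```

The membership check before each step stops the walk when it reaches a word that only appears at the very end of the text, a case where a plain dictionary lookup would raise a KeyError.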

That's it for today. I've been working through a few algorithm problems every day lately, so time for this part is a little tight~
