Blog User Classification (Data Preprocessing)

Source: Internet | Published by: windows官方主题云 | Editor: 程序博客网 | Time: 2024/06/05 20:59

Blog User Classification

Data preprocessing: first install the feedparser package, which is used to parse each of the feeds.

<pre name="code" class="python"># Data preprocessing: given the list of feeds in feedlist.txt, write out how
# often the frequently used words occur in each blog.
import re

import feedparser


def getwordcounts(url):
    """Return the feed title and a dictionary of word counts for an RSS feed."""
    # Parse the feed; feedparser.parse() returns the parsed result
    d = feedparser.parse(url)
    wc = {}  # word count: {word: count}

    # Loop over all the entries (posts) in the feed
    for e in d.entries:
        # Edge case: some entries carry 'summary', others only 'description'
        if 'summary' in e:
            summary = e.summary
        else:
            summary = e.description

        # Extract a list of words from the entry title plus its summary
        words = getwords(e.title + ' ' + summary)
        for word in words:
            wc.setdefault(word, 0)  # create the key with a default of 0 if missing
            wc[word] += 1           # one more occurrence of this word

    return d.feed.title, wc  # d.feed.title is the title of the feed (the blog)


def getwords(html):
    # Remove all the HTML tags. r'' marks a raw string. <[^>]+> matches text
    # that starts with <, ends with >, and has one or more non-> characters in
    # between, so an empty <> is not matched.
    txt = re.compile(r'<[^>]+>').sub('', html)

    # Split words by all non-alpha characters
    words = re.compile(r'[^A-Za-z]+').split(txt)

    # Convert to lowercase
    return [word.lower() for word in words if word != '']


# Loop over every line of feedlist.txt, build the word counts for each blog,
# and count the number of blogs in which each word appears.
apcount = {}     # {word: number of blogs containing it}; "ap" = appearance
wordcounts = {}  # nested dictionary: {blog title: {word: count}}
with open('feedlist.txt') as f:
    feedlist = [line.strip() for line in f if line.strip()]

for feedurl in feedlist:
    try:
        # title is the feed title (a string); wc is the {word: count} dictionary
        title, wc = getwordcounts(feedurl)
        wordcounts[title] = wc
        for word, count in wc.items():  # items() yields (key, value) pairs
            apcount.setdefault(word, 0)
            # A blog is counted only when the word occurs in it more than once.
            # Open question from the author: shouldn't this be >= 1?
            if count > 1:
                apcount[word] += 1
    except Exception:
        print('Failed to parse feed %s' % feedurl)

# Words such as "the" appear almost everywhere, while words such as
# "flim-flam" appear in only a few blogs, so keep only the words whose share
# of blogs falls within a band, here 10% to 50%.
wordlist = []
for w, bc in apcount.items():
    frac = float(bc) / len(feedlist)  # blogs containing the word / all blogs
    if frac > 0.1 and frac < 0.5:
        wordlist.append(w)

# Mode 'w' truncates an existing file, or creates the file if it is missing
with open('blogdata1.txt', 'w', encoding='utf-8') as out:
    # Header row: 'Blog' followed by the selected words, separated by tabs
    out.write('Blog')
    for word in wordlist:
        out.write('\t%s' % word)
    out.write('\n')

    # One row per blog: the blog title, then its count for every selected word
    for blog, wc in wordcounts.items():
        print(blog)
        out.write(blog)
        for word in wordlist:
            if word in wc:
                out.write('\t%d' % wc[word])
            else:
                out.write('\t0')  # the blog never uses this word
        out.write('\n')
</pre>
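The script's comments ask whether `count>1` should be `count>=1`. The sketch below compares the two rules on made-up text, with no network access. The toy `blogs` dictionary and the helpers `blog_counts` and `in_band` are hypothetical names introduced only for this illustration.

```python
import re
from collections import Counter


def getwords(html):
    """Strip HTML tags, split on non-letters, and lowercase the words."""
    text = re.sub(r'<[^>]+>', '', html)
    return [w.lower() for w in re.split(r'[^A-Za-z]+', text) if w]


# Hypothetical toy "blogs" standing in for the text of parsed feeds.
blogs = {
    'Blog A': '<p>Python code and more Python code</p>',
    'Blog B': '<b>The</b> code review: the code, the tests',
    'Blog C': 'The garden, the flowers, the bees',
    'Blog D': 'Cooking with the oven',
}

# {blog name: {word: count}}, the same shape as wordcounts in the script.
wordcounts = {name: Counter(getwords(text)) for name, text in blogs.items()}


def blog_counts(min_count):
    """For each word, count the blogs where it occurs at least min_count times."""
    apcount = Counter()
    for wc in wordcounts.values():
        for word, count in wc.items():
            if count >= min_count:
                apcount[word] += 1
    return apcount


def in_band(apcount, low=0.1, high=0.5):
    """Keep the words whose share of blogs lies strictly between low and high."""
    n = len(blogs)
    return sorted(w for w, bc in apcount.items() if low < bc / n < high)


at_least_once = blog_counts(1)   # the `count >= 1` rule
at_least_twice = blog_counts(2)  # the script's `count > 1` rule

print(at_least_once['the'])   # 3: Blog B, Blog C and Blog D
print(at_least_twice['the'])  # 2: Blog D uses "the" only once
print(in_band(at_least_twice))
```

Under `count > 1` a blog contributes to a word's total only if it uses the word at least twice, so single mentions are ignored. Under `count >= 1` any mention counts. Which rule fits better depends on how strict the vocabulary filter should be.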

feedlist.txt

http://feeds.feedburner.com/37signals/beMH
http://feeds.feedburner.com/blogspot/bRuz
http://battellemedia.com/index.xml
http://blog.guykawasaki.com/index.rdf
http://blog.outer-court.com/rss.xml
http://feeds.searchenginewatch.com/sewblog
http://blog.topix.net/index.rdf
http://blogs.abcnews.com/theblotter/index.rdf
http://feeds.feedburner.com/ConsumingExperienceFull
http://flagrantdisregard.com/index.php/feed/
http://featured.gigaom.com/feed/
http://gizmodo.com/index.xml
http://gofugyourself.typepad.com/go_fug_yourself/index.rdf
http://googleblog.blogspot.com/rss.xml
http://feeds.feedburner.com/GoogleOperatingSystem
http://headrush.typepad.com/creating_passionate_users/index.rdf
http://feeds.feedburner.com/instapundit/main
http://jeremy.zawodny.com/blog/rss2.xml
http://joi.ito.com/index.rdf
http://feeds.feedburner.com/Mashable
http://michellemalkin.com/index.rdf
http://moblogsmoproblems.blogspot.com/rss.xml
http://newsbusters.org/node/feed
http://beta.blogger.com/feeds/27154654/posts/full?alt=rss
http://feeds.feedburner.com/paulstamatiou
http://powerlineblog.com/index.rdf
http://feeds.feedburner.com/Publishing20
http://radar.oreilly.com/index.rdf
http://scienceblogs.com/pharyngula/index.xml
http://scobleizer.wordpress.com/feed/
http://sethgodin.typepad.com/seths_blog/index.rdf
http://rss.slashdot.org/Slashdot/slashdot
http://thinkprogress.org/feed/
http://feeds.feedburner.com/andrewsullivan/rApM
http://wilwheaton.typepad.com/wwdnbackup/index.rdf
http://www.43folders.com/feed/
http://www.456bereastreet.com/feed.xml
http://www.autoblog.com/rss.xml
http://www.bloggersblog.com/rss.xml
http://www.bloglines.com/rss/about/news
http://www.blogmaverick.com/rss.xml
http://www.boingboing.net/index.rdf
http://www.buzzmachine.com/index.xml
http://www.captainsquartersblog.com/mt/index.rdf
http://www.coolhunting.com/index.rdf
http://feeds.copyblogger.com/Copyblogger
http://feeds.feedburner.com/crooksandliars/YaCP
http://feeds.dailykos.com/dailykos/index.xml
http://www.deadspin.com/index.xml
http://www.downloadsquad.com/rss.xml
http://www.engadget.com/rss.xml
http://www.gapingvoid.com/index.rdf
http://www.gawker.com/index.xml
http://www.gothamist.com/index.rdf
http://www.huffingtonpost.com/raw_feed_index.rdf
http://www.hyperorg.com/blogger/index.rdf
http://www.joelonsoftware.com/rss.xml
http://www.joystiq.com/rss.xml
http://www.kotaku.com/index.xml
http://feeds.kottke.org/main
http://www.lifehack.org/feed/
http://www.lifehacker.com/index.xml
http://littlegreenfootballs.com/weblog/lgf-rss.php
http://www.makezine.com/blog/index.xml
http://www.mattcutts.com/blog/feed/
http://xml.metafilter.com/rss.xml
http://www.mezzoblue.com/rss/index.xml
http://www.micropersuasion.com/index.rdf
http://www.neilgaiman.com/journal/feed/rss.xml
http://www.oilman.ca/feed/
http://www.perezhilton.com/index.xml
http://www.plasticbag.org/index.rdf
http://www.powazek.com/rss.xml
http://www.problogger.net/feed/
http://feeds.feedburner.com/QuickOnlineTips
http://www.readwriteweb.com/rss.xml
http://www.schneier.com/blog/index.rdf
http://scienceblogs.com/sample/combined.xml
http://www.seroundtable.com/index.rdf
http://www.shoemoney.com/feed/
http://www.sifry.com/alerts/index.rdf
http://www.simplebits.com/xml/rss.xml
http://feeds.feedburner.com/Spikedhumor
http://www.stevepavlina.com/blog/feed
http://www.talkingpointsmemo.com/index.xml
http://www.tbray.org/ongoing/ongoing.rss
http://feeds.feedburner.com/TechCrunch
http://www.techdirt.com/techdirt_rss.xml
http://www.techeblog.com/index.php/feed/
http://www.thesuperficial.com/index.xml
http://www.tmz.com/rss.xml
http://www.treehugger.com/index.rdf
WIRED
http://www.tuaw.com/rss.xml
http://www.valleywag.com/index.xml
http://www.we-make-money-not-art.com/index.rdf
http://www.wired.com/rss/index.xml
http://www.wonkette.com/index.xml

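Some of these RSS feeds can no longer be opened, but that does not stop the program from running: a feed that fails to parse is caught by the exception handler and skipped.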
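A minimal sketch of why dead feeds do not stop the run, assuming a hypothetical `parse_feed` stand-in that raises for some URLs instead of making real network requests. The `example.com` URLs and the `failed` list are illustrative additions, not part of the original script.

```python
def parse_feed(url):
    """Hypothetical stand-in for getwordcounts(); raises for 'dead' feeds."""
    if 'dead' in url:
        raise AttributeError('feed has no title')
    return 'Title of ' + url, {'example': 2}


# Illustrative URLs only; the real script reads them from feedlist.txt.
feedlist = [
    'http://example.com/alive.xml',
    'http://example.com/dead.xml',
    'http://example.org/alive.rdf',
]

wordcounts = {}
failed = []
for feedurl in feedlist:
    try:
        title, wc = parse_feed(feedurl)
    except Exception as exc:
        # Record the failure and move on to the next feed.
        failed.append(feedurl)
        print('Failed to parse feed %s (%s)' % (feedurl, exc))
        continue
    wordcounts[title] = wc

print('%d parsed, %d failed' % (len(wordcounts), len(failed)))
```

One side effect worth knowing: the script divides by `len(feedlist)`, which still includes the feeds that failed, so the 10% to 50% band is measured against all listed feeds and not only the ones that were parsed.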

Data preprocessing results
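Judging from the `out.write` calls in the script, blogdata1.txt is a tab-separated matrix: a header row that starts with `Blog` followed by the selected words, then one row per blog with its count for each word. The sketch below loads that layout back into Python. The sample content is made up, and `read_blogdata` is a hypothetical helper name.

```python
import io

# Made-up sample in the layout the script writes; a real run would read
# blogdata1.txt instead.
sample = 'Blog\tcode\tgarden\nBlog A\t2\t0\nBlog C\t0\t1\n'


def read_blogdata(lines):
    """Return (words, blognames, rows) from a tab-separated word-count matrix."""
    lines = [line.rstrip('\n') for line in lines if line.strip()]
    words = lines[0].split('\t')[1:]  # header row without the 'Blog' label
    blognames, rows = [], []
    for line in lines[1:]:
        fields = line.split('\t')
        blognames.append(fields[0])
        rows.append([int(value) for value in fields[1:]])
    return words, blognames, rows


words, blognames, rows = read_blogdata(io.StringIO(sample))
print(words)
print(blognames)
print(rows)
```

To read the real file, pass an open file handle instead, for example `with open('blogdata1.txt', encoding='utf-8') as f: read_blogdata(f)`.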

0 0