Python Basics: Web Scraping Exercise 3 - A Small Test Crawling Zhihu Users

Source: the Internet | Editor: 程序博客网 (Programmer Blog Network) | Posted: 2024/06/06 05:10

*----------------------------------------------------------------A Programming Rookie-------------------------------------------------------*

Task: starting from a Zhihu user's page, count that user's followers, then keep crawling through the followers, and finally sort the users by follower count.



```python
import re
import requests


class crawlUser:
    # constructor
    def __init__(self, userid, cookie):
        self.userId = userid
        self.fellowCount = 0
        self.fellowlist = []
        self.cookie = cookie

    def getpage(self):
        url = "http://www.zhihu.com/people/" + self.userId + "/followers"
        self.response = requests.get(url, cookies=self.cookie)

    def getfellowcount(self):
        reg = r'data-tip="p\$t\$(.+)"'
        pattern = re.compile(reg)
        # despite the name, this holds the raw matches, not a count
        # (was re.findall(pattern, self.page) -- a bug: getpage() stores self.response)
        self.fellowCount = re.findall(pattern, self.response.text)
        # each follower id appears twice in the page source, so keep
        # every other match to deduplicate
        count = 0
        for x in self.fellowCount:
            if (count % 2) == 0:
                self.fellowlist.append(x)
            count = count + 1
        return self.fellowlist


m_cookie = {"_za": "****************", "a2404_times": "129",
            "q_c1": "*********************",
            "_xsrf": "********************",
            "cap_id": "**************************",
            "__utmt": "1", "z_c0": "***********************1d5358",
            "unlock_ticket": "***********************6d",
            "__utma": "********************************",
            "__utmb": "15**************", "__utmc": "*****",
            "__utmz": "155**********************************/",
            "__utmv": "***************************************1"}

userlist = {}
tempUserList = ["****"]  # seed user id (masked)
flag = 0
iter = 0  # cap the crawl at 10 users to avoid getting blocked
while flag < len(tempUserList) and iter < 10:
    user = crawlUser(tempUserList[flag], m_cookie)
    user.getpage()
    fellowlist = user.getfellowcount()
    userlist[tempUserList[flag]] = len(fellowlist)
    for x in fellowlist:
        if x not in tempUserList:
            tempUserList.append(x)
    print("user %s fellowed %d" % (tempUserList[flag], userlist[tempUserList[flag]]))
    flag = flag + 1
    iter = iter + 1
print("crawl is done!")

sortRes = sorted(userlist.items(), key=lambda d: d[1])
"""
f = open("userlist.txt", "w+")
f.write(userlist)
f.close()
"""
print(sortRes)
print("sort is done!")
```
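The every-other-match filter in getfellowcount exists because each follower's data-tip attribute shows up twice in the page source, so keeping the even-indexed hits deduplicates them. Here is a standalone sketch of that extraction step on canned HTML (the markup below is a simplified assumption, not real Zhihu output; note it uses a non-greedy `(.+?)` because the sample packs everything onto one line, where the original greedy `(.+)` would overshoot):

```python
import re

# Hypothetical, simplified snippet of a followers page: each user id
# appears twice in the markup, mimicking the duplication the crawler sees.
html = (
    '<a data-tip="p$t$alice">...</a><span data-tip="p$t$alice"></span>'
    '<a data-tip="p$t$bob">...</a><span data-tip="p$t$bob"></span>'
)

# Non-greedy capture stops at the closing quote of each attribute.
matches = re.findall(r'data-tip="p\$t\$(.+?)"', html)  # ['alice', 'alice', 'bob', 'bob']

# Keep every other match to deduplicate, as in getfellowcount above.
followers = [x for i, x in enumerate(matches) if i % 2 == 0]
print(followers)  # ['alice', 'bob']
```

Parsing HTML with regexes is fragile; it works here only because the attribute format is fixed.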


【Process】

1. After the last two days' small exercises (downloading images and the like), today's task is a notch harder, but still nothing too difficult.

2. There is no way around the login check, so I used local cookies, copied directly from the browser. This approach is a bit crude; suggestions for something better are welcome.

3. I capped the crawl at 10 users, because frequent requests got me blocked by the server. I haven't studied this area enough yet; 王先森 suggested adding a sleep to the process.
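On point 2: instead of hand-copying each cookie into a dict literal, a slightly less tedious option is to copy the whole Cookie header from the browser's developer tools and split it into the dict that requests expects. A minimal sketch (the names and values below are made up, not real Zhihu cookies):

```python
# Hypothetical raw Cookie header string copied from the browser's
# developer tools (Network tab). Values here are invented.
raw = "_xsrf=abc123; q_c1=def456; z_c0=token789"

# Split the "name=value; name=value" pairs into a dict; split on the
# first "=" only, since cookie values may themselves contain "=".
cookie = dict(pair.split("=", 1) for pair in raw.split("; "))
print(cookie)  # {'_xsrf': 'abc123', 'q_c1': 'def456', 'z_c0': 'token789'}
```

The resulting dict can be passed straight to `requests.get(url, cookies=cookie)`.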
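On point 3: the sleep idea can be sketched as a small helper called before each request. The 1-3 second range and the random jitter are assumptions on my part, not values tuned for Zhihu; jitter just makes the request pattern look less mechanical than a fixed delay:

```python
import random
import time


def polite_pause(min_s=1.0, max_s=3.0):
    """Sleep for a random interval before the next request.

    Randomized delays are gentler on the server and less likely to
    trip rate limiting than a burst of back-to-back requests.
    Returns the pause actually taken, in seconds.
    """
    pause = random.uniform(min_s, max_s)
    time.sleep(pause)
    return pause


# In the crawl loop above, one would call polite_pause() just before
# each user.getpage() request.
```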


【Thoughts】

1. Practice really is the fastest way to learn.

2. requests is so much nicer to use than urllib.

3. PyCharm is only a little better than Notepad.

4. The road ahead is long and far.


*-------------------------------------------This blog records my learning journey; comments and guidance from experienced folks are welcome----------------------------------------*
