Downloading all the photos in a Renren album with Python, without logging in

Source: Internet | Editor: 程序博客网 | Time: 2024/04/29 15:37

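I started learning Python yesterday. My teacher suggested starting with web crawlers, since they give a sense of achievement and never get boring, so I learned the simplest kind of crawler and used it to download all of my own photos from Renren, haha.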

The code for the program is below.

Here myurl is the URL of the first photo in the Renren album. Note that I used the mobile version of Renren, whose address is m.renren.com.

To get myurl, open Renren at m.renren.com, then tap into the first photo of the album you want to crawl. The URL shown at that point is myurl.

    # -*- coding: UTF-8 -*-
    import re
    import requests
    import sys

    # Download the photos in a Renren album.
    # how to set up myurl, please read my blog
    myurl = 'http://3g.renren.com/album/meowmeow'
    for i in range(1, 73 + 1):   # there are 73 photos in this album
        html = requests.get(myurl)
        # The first group is the rest of the photo URL after "http://fmn"
        picurl = re.search(' <a href="http://fmn(.*?)">(.*?)</a></p><p class="time">', html.text, re.S).group(1)
        picurl = re.sub('&amp;', '&', picurl)   # undo HTML escaping in the URL
        print(picurl)
        picurl = 'http://fmn' + picurl
        print(picurl)
        picture = requests.get(picurl)
        fp = open('pic\\' + str(i) + '.jpg', 'wb')   # the pic folder must already exist
        fp.write(picture.content)
        fp.close()
        myurl = re.search('</p></div><div class="sec"><a href="(.*?)">', html.text, re.S).group(1)
        myurl = re.sub('&amp;', '&', myurl)   # get the next url
        print(i)
    print('======================finish======================')

The regular expressions may need adjusting: different pages sometimes use different markup.
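Since the patterns are the fragile part, it can help to pull them out into one small function that reports a miss instead of crashing. The following is only a sketch: `extract_links` and the `sample` fragment are made up for illustration (the sample is not real Renren markup), and the patterns are the two from the script above, written for Python 3.

```python
import re

# Patterns taken from the script above; they are tied to one page layout
# and will likely need adjusting for other pages.
PHOTO_RE = re.compile(r' <a href="(http://fmn.*?)">.*?</a></p><p class="time">', re.S)
NEXT_RE = re.compile(r'</p></div><div class="sec"><a href="(.*?)">', re.S)


def extract_links(page_html):
    """Return (photo_url, next_page_url); either is None when not found."""
    photo = PHOTO_RE.search(page_html)
    nxt = NEXT_RE.search(page_html)
    # URLs inside HTML attributes have & escaped as &amp;, so undo that.
    photo_url = photo.group(1).replace('&amp;', '&') if photo else None
    next_url = nxt.group(1).replace('&amp;', '&') if nxt else None
    return photo_url, next_url


# Hypothetical page fragment, invented only to exercise the two patterns.
sample = ('<p> <a href="http://fmn.example/a.jpg?x=1&amp;y=2">download</a></p>'
          '<p class="time">2014</p></div><div class="sec">'
          '<a href="http://3g.renren.com/next?id=2&amp;p=3">next</a>')

photo_url, next_url = extract_links(sample)
print(photo_url)
print(next_url)
```

When a pattern stops matching, the function returns None for that value, so the download loop can stop cleanly rather than fail on `.group(1)`.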

-----------------------Divider---the content below is unrelated to the content above-----------------------

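Quick notes:

. matches any character except \n

* matches the preceding character zero or more times

? matches the preceding character zero or one time

.* greedy match (as much as possible)

.*? non-greedy match (as little as possible)

\d+ matches one or more digits

() only what is inside the parentheses is returned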

re.findall

re.search

re.sub

re.S
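The notes above are easier to remember with a few concrete calls. This is a small sketch using only the standard `re` module; the sample strings are invented for illustration.

```python
import re

text = '<b>one</b><b>two</b>'

# .* is greedy: it runs to the LAST </b> it can reach.
greedy = re.search(r'<b>(.*)</b>', text).group(1)

# .*? is non-greedy: it stops at the FIRST </b>.
lazy = re.search(r'<b>(.*?)</b>', text).group(1)

# re.findall returns every match of the group, not just the first.
all_items = re.findall(r'<b>(.*?)</b>', text)

# \d+ matches runs of digits.
numbers = re.findall(r'\d+', 'photo 12 of 73')

# re.sub replaces every occurrence of the pattern.
unescaped = re.sub('&amp;', '&', 'a=1&amp;b=2')

# Without re.S the dot does not match \n, so a tag spanning lines is missed.
multi = '<p>line1\nline2</p>'
without_s = re.search(r'<p>(.*?)</p>', multi)
with_s = re.search(r'<p>(.*?)</p>', multi, re.S).group(1)

print(greedy, lazy, all_items, numbers, unescaped, without_s, repr(with_s))
```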

    import requests   # module that provides what is needed to fetch and save images

    # Read a file
    f = open('xx.txt', 'r')
    html = f.read()
    f.close()

    # Save an image
    pic = requests.get(url)
    fp = open('folder\\' + 'filename' + '.jpg', 'wb')
    fp.write(pic.content)
    fp.close()
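A slightly more defensive way to do the saving step might look like the sketch below. `save_image` is a made-up helper name, not part of requests. It takes the raw bytes (for example `pic.content`), creates the folder if needed, and builds the path with `os.path.join` so it is not tied to the Windows `\\` separator. It assumes Python 3.

```python
import os


def save_image(content, folder, filename):
    """Write raw bytes to folder/filename.jpg and return the path."""
    os.makedirs(folder, exist_ok=True)           # create the folder if missing
    path = os.path.join(folder, filename + '.jpg')
    with open(path, 'wb') as fp:                 # closed automatically, even on error
        fp.write(content)
    return path
```

With requests this would be called as `save_image(pic.content, 'folder', 'filename')`.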

    # Get a page's source
    import requests
    html = requests.get('http://meowmeow')
    # For pages with anti-crawler checks, send a User-Agent header
    header = {'User-Agent': 'shgadvvgd'}
    html = requests.get('http://meowmeow', headers=header)

    # For text that print cannot display, try:
    print(html.encode("gb18030"))

    # Submit data to a page
    mydata = {'name': 'kalari', 'password': 'naive'}
    html_post = requests.post(url, data=mydata)

    info = {}                  # dict
    info['name'] = 'kalari'    # set a value by key
    info['name']               # look up a value by key
    info2 = []                 # list
    info2.append('kalari')

    # Classes
    class meow(object):
        def __init__(self):
            print('meow')
        def hello(self):
            print('hello')
    if __name__ == '__main__':
        kalari = meow()
        kalari.hello()

    # XPath crawler
    from lxml import etree
    selector = etree.HTML(html)
    content = selector.xpath('')   # put the XPath expression between the quotes
    # Syntax
    '''
    //          start from the root node
    /           go one level down
    /text()     text content
    /@xxx       value of attribute xxx
    div[starts-with(@id,"test")]
    //body/ul[@id="123"]/li/text()
    Without text(), to get all nested text:
    info=meow.xpath('string(.)')
    content=info.replace('\n','').replace(' ','')
    '''
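To try the path syntax without installing lxml, the standard library's `xml.etree.ElementTree` supports a limited subset of XPath. This sketch uses that module instead of lxml, so it is not the same API as in the notes above: there is no `text()` step or `starts-with()`, and the text is read from the `.text` attribute or `itertext()` instead. The document string is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed document used only for this demo.
doc = ET.fromstring(
    '<html><body><ul id="123"><li>a</li><li>b <i>c</i></li></ul></body></html>')

# Roughly //body/ul[@id="123"]/li, then read each li's own text.
items = doc.findall(".//body/ul[@id='123']/li")
own_text = [li.text for li in items]

# Attribute value, the counterpart of /@id.
ul_id = doc.find('.//ul').get('id')

# All nested text of one element, similar in spirit to string(.).
full_text = ''.join(items[1].itertext())

print(own_text, ul_id, full_text)
```

Real web pages are rarely well-formed XML, which is one reason the notes use `etree.HTML` from lxml for actual crawling.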

    # Multithreading
    import requests
    from multiprocessing.dummy import Pool
    def getss(url):
        html = requests.get(url)
        return html
    if __name__ == '__main__':
        urls = ['http://meowmeow']   # placeholder: the list of pages to fetch
        pool = Pool(4)
        result = pool.map(getss, urls)
        pool.close()
        pool.join()
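The pool pattern can be tried offline by swapping the download for a function that needs no network. In this sketch `page_url` and `run_pool` are made-up names, and the URLs are placeholders in the style of the notes above. `multiprocessing.dummy.Pool` is the thread-based pool with the same interface as the process pool.

```python
from multiprocessing.dummy import Pool   # thread-backed Pool


def page_url(n):
    """Stand-in for a real download: just build a placeholder URL."""
    return 'http://meowmeow/page/%d' % n


def run_pool(numbers, workers=4):
    """Apply page_url to every number using a pool of worker threads."""
    pool = Pool(workers)
    try:
        # map blocks until all work is done and keeps the input order.
        return pool.map(page_url, numbers)
    finally:
        pool.close()
        pool.join()


results = run_pool([1, 2, 3])
print(results)
```

Threads suit this kind of job because a crawler spends most of its time waiting on the network rather than computing.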

    # Timing
    import time
    time1 = time.time()
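A timestamp on its own only becomes useful once a second one is subtracted from it. A minimal sketch, where `timed` is a hypothetical helper and `sum` stands in for the work being measured (for example a single-threaded versus a pooled crawl):

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start   # wall-clock seconds between the two readings
    return result, elapsed


total, seconds = timed(sum, range(1000))
print(total, seconds)
```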

0 0