Python Data Analysis

Source: Internet  Editor: 程序博客网  Time: 2024/05/23 12:54
Author: 挖数
Link: https://www.zhihu.com/question/20899988/answer/96904827
Source: Zhihu
Copyright belongs to the author. Please contact the author for permission before reposting.

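Below is my level-grinding journey of learning Python web scraping. It was full of hardship and full of fun. I haven't beaten the final boss yet, but the scenery along the way is the best reward, isn't it? I hope everyone can get what they're looking for!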

Warning: lots of images ahead!
<img src="https://pic4.zhimg.com/55e8bc9324234bc88b354821ce005bc3_b.png" data-rawwidth="288" data-rawheight="179" class="content_image" width="288">
<img src="https://pic3.zhimg.com/af1baba1052c2cd49cea5ea6986eb30a_b.png" data-rawwidth="242" data-rawheight="268" class="content_image" width="242">

<img src="https://pic2.zhimg.com/5ec82828ba71e96a7d86b7e88254ccd9_b.png" data-rawwidth="254" data-rawheight="230" class="content_image" width="254">

<img src="https://pic3.zhimg.com/c60bde3fec9e5f791b1a217613879b46_b.png" data-rawwidth="278" data-rawheight="320" class="content_image" width="278">

<img src="https://pic3.zhimg.com/974b3d7c1c50bac62c14afe58ff0ed26_b.png" data-rawwidth="309" data-rawheight="318" class="content_image" width="309">
<img src="https://pic2.zhimg.com/2c3e1e5f18d6e6cc8758337663c548f5_b.png" data-rawwidth="313" data-rawheight="264" class="content_image" width="313">

<img src="https://pic4.zhimg.com/b65ad1e407e0335107eca80e4a0bdac3_b.png" data-rawwidth="266" data-rawheight="240" class="content_image" width="266">

<img src="https://pic2.zhimg.com/70067cc590378e31676ed48192633d7d_b.png" data-rawwidth="269" data-rawheight="246" class="content_image" width="269">

<img src="https://pic4.zhimg.com/2cecf7ef8b19f24a2fb287403a51142b_b.png" data-rawwidth="299" data-rawheight="254" class="content_image" width="299">

<img src="https://pic3.zhimg.com/b2867a2ddb861a04a91fde5d34ed5982_b.png" data-rawwidth="212" data-rawheight="266" class="content_image" width="212">

<img src="https://pic3.zhimg.com/ae5a6594ab77bfdeaaa9e45b9420c93e_b.png" data-rawwidth="313" data-rawheight="266" class="content_image" width="313">

<img src="https://pic4.zhimg.com/5f65be4b49e5f84ab99efc92ab6ea61b_b.png" data-rawwidth="304" data-rawheight="232" class="content_image" width="304">
<img src="https://pic2.zhimg.com/506899fbbe618e05cbe1e2768665b17d_b.png" data-rawwidth="287" data-rawheight="234" class="content_image" width="287">

<img src="https://pic1.zhimg.com/009fcaa5d4a08f4eda54fb38b88e575c_b.png" data-rawwidth="325" data-rawheight="354" class="content_image" width="325">

<img src="https://pic3.zhimg.com/b93fbe0719c946b1a68a3f0b33937942_b.png" data-rawwidth="289" data-rawheight="243" class="content_image" width="289">

<img src="https://pic2.zhimg.com/ded59bb8038a10b3bfb4e65fd14db631_b.png" data-rawwidth="309" data-rawheight="189" class="content_image" width="309">

<img src="https://pic2.zhimg.com/8d8337c43a58a5386227e037891f9d61_b.png" data-rawwidth="266" data-rawheight="346" class="content_image" width="266">

<img src="https://pic2.zhimg.com/e5dbb6f838f6532b0d0a481c69a79ddd_b.png" data-rawwidth="338" data-rawheight="269" class="content_image" width="338">

<img src="https://pic4.zhimg.com/5e1b525feb212ff0b860481ecb67288b_b.png" data-rawwidth="255" data-rawheight="175" class="content_image" width="255">
Here is a snippet that scrapes Zhihu avatars:

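import random
import re
import urllib.request
from time import sleep

import requests


def main():
    # URL of the Zhihu topic page (only the link's page title survives here)
    url = '知乎 - 与世界分享你的知识、经验和见解'
    # I get the feeling there are lots of pretty girls under this topic
    headers = {}  # request headers omitted in the original
    i = 1
    # Zhihu uses offset to control how much is loaded; each response loads 20 items
    for x in range(20, 3600, 20):
        data = {'start': '0',
                'offset': str(x),
                '_xsrf': 'a128464ef225a69348cef94c38f4e428'}
        # Submit the form data with POST
        content = requests.post(url, headers=headers, data=data, timeout=10).text
        # Extract image URLs from the returned JSON with a regex;
        # dropping the _m suffix gives the large image
        imgs = re.findall(r'<img src=\\"(.*?)_m\.jpg', content)
        for img in imgs:
            try:
                # Strip the \ characters that get in the way
                img = img.replace('\\', '')
                pic = img + '.jpg'
                # Destination folder and file name
                path = 'd:\\bs4\\zhihu\\jpg\\' + str(i) + '.jpg'
                # Download the image
                urllib.request.urlretrieve(pic, path)
                print('Downloaded image #' + str(i))
                i += 1
                # Sleep so we don't crawl too fast and get the IP banned
                sleep(random.uniform(0.5, 1))
            except Exception:
                print('Missed 1 image')
        sleep(random.uniform(0.5, 1))


if __name__ == '__main__':
    main()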
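
The extraction step is the fiddly part, so here is a small offline sketch of just that piece. It only assumes what the comments above say: the response is JSON text in which the markup is escaped (quotes show up as \" and slashes as \/), and dropping the _m suffix gives the large image. The function name `extract_full_size_urls` and the sample payload are made up for illustration; they are not from the original script or from Zhihu's real response.

```python
import re

# Thumbnail URLs inside a JSON-escaped HTML fragment: the quote after
# src= appears as \" in the raw text, so the pattern matches a literal
# backslash followed by a double quote.
THUMB_RE = re.compile(r'<img src=\\"(.*?)_m\.jpg')


def extract_full_size_urls(raw_json_text):
    """Return full-size image URLs found in a JSON-escaped HTML string."""
    urls = []
    for stem in THUMB_RE.findall(raw_json_text):
        # Remove the escape backslashes, then put the extension back
        # without the _m (thumbnail) suffix.
        urls.append(stem.replace('\\', '') + '.jpg')
    return urls


# Hypothetical sample payload, shaped like the escaped markup described above.
sample = r'{"msg": ["<img src=\"https:\/\/example.com\/abc123_m.jpg\" class=\"avatar\">"]}'
print(extract_full_size_urls(sample))
```

Testing the regex on a saved response like this, before looping over thousands of offsets, makes it much easier to see why a page yields no images.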


Result:

<img src="https://pic2.zhimg.com/b1fc67ee3e290376fe882113ff7d44fd_b.png" data-rawwidth="710" data-rawheight="744" class="origin_image zh-lightbox-thumb" width="710" data-original="https://pic2.zhimg.com/b1fc67ee3e290376fe882113ff7d44fd_r.png">
Finally, please follow me. I'll take good care of your timeline \( ^▽^ )/