Web Crawler Study, Part 1

Source: Internet  Published by: 软件未响应关不掉  Editor: 程序博客网  Date: 2024/05/11 10:57

1. Fetch a page's source from its URL:

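import re
import urllib.request  # Python 3; in Python 2 the same call was urllib.urlopen

def getHtml(url):
    agent = ''  # unused placeholder
    # Open the URL and return the raw response body (bytes in Python 3).
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

try:
    html = getHtml(url='https://www.zhihu.com/question/20899988')
    # html = html.decode('utf-8')  # decode to text if needed
except Exception:
    print('getHtml fail')
else:
    # Print only when the fetch succeeded, so html is always defined here.
    print(html)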
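
The commented-out encoding line above hints that the fetched body still has to be turned into text. In Python 3, `read()` returns bytes, and the page's charset is usually declared in the `Content-Type` response header. The following is a minimal sketch of that decoding step, assuming Python 3; `decode_body` and its `default` parameter are hypothetical names introduced here, not part of the original notes.

```python
from email.message import Message

def decode_body(raw, content_type, default='utf-8'):
    """Decode raw bytes using the charset named in a Content-Type value.

    Falls back to `default` when no charset is declared.
    """
    charset = None
    if content_type:
        # Reuse the standard header parser instead of splitting by hand.
        msg = Message()
        msg['Content-Type'] = content_type
        charset = msg.get_content_charset()
    # errors='replace' keeps the crawl going when a page has stray bytes.
    return raw.decode(charset or default, errors='replace')

text = decode_body('中文'.encode('gbk'), 'text/html; charset=GBK')
print(text)
```

With a live response object such as `page` in `getHtml`, the header value can be read with `page.headers.get('Content-Type')` and passed in as `content_type`.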

2. Download images from the fetched page

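import re
import urllib.request

def getImg(html):
    # html must be text (str); decode fetched bytes first, e.g. html.decode('utf-8').
    reg = r'src="(.+?\.jpg)"'
    pat = re.compile(reg)
    imgList = re.findall(pat, html)
    # Save the images as 1.jpg, 2.jpg, ... in the current directory.
    # This assumes every matched src is an absolute URL.
    for x, imgurl in enumerate(imgList, 1):
        urllib.request.urlretrieve(imgurl, '%s.jpg' % x)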
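
`getImg` passes each matched `src` value straight to the downloader, which only works when the page uses absolute image URLs. The sketch below, assuming Python 3, tries the extraction step on a small inline HTML string and resolves relative paths against the page URL. `find_jpg_urls`, the sample markup, and the `example.com` addresses are made up for illustration. The pattern is also slightly stricter than the one in `getImg`: `[^"]` stops a match from running past the closing quote into a later tag.

```python
import re
from urllib.parse import urljoin

# Capture the value of src="..." only when it ends in .jpg.
IMG_PATTERN = re.compile(r'src="([^"]+?\.jpg)"')

def find_jpg_urls(html, base_url):
    """Return absolute URLs for every src="....jpg" attribute in html."""
    return [urljoin(base_url, src) for src in IMG_PATTERN.findall(html)]

# Hypothetical page fragment: one non-jpg image, one relative, one absolute.
sample = ('<img src="c.png">'
          '<img src="/pics/a.jpg">'
          '<img src="http://example.com/b.jpg">')

for url in find_jpg_urls(sample, 'http://example.com/page/1'):
    print(url)
```

The `.png` image is skipped, and `/pics/a.jpg` is resolved to `http://example.com/pics/a.jpg` before it would be handed to a downloader.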

3. Simulate a login before crawling

Background:

HTTP message headers: Understanding HTTP Message Headers
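
As a rough illustration of how message headers come into play when simulating a login, the sketch below (assuming Python 3 and only the standard library) builds a POST request that carries a `User-Agent` header and a login form, plus an opener that keeps the cookies a server would set after login. The login URL, the form field names `username` and `password`, and the user-agent string are all hypothetical; a real site defines its own, and many also require extra tokens. Nothing is sent over the network here.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values for illustration only.
LOGIN_URL = 'https://example.com/login'
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

def build_opener_with_cookies():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

def build_login_request(username, password):
    """Build (but do not send) a POST request carrying the login form."""
    form = urllib.parse.urlencode({'username': username, 'password': password})
    return urllib.request.Request(
        LOGIN_URL,
        data=form.encode('utf-8'),           # supplying data makes this a POST
        headers={'User-Agent': USER_AGENT},  # present the request as a browser
    )

opener, jar = build_opener_with_cookies()
request = build_login_request('alice', 'secret')
print(request.get_method(), request.full_url)
# opener.open(request) would submit the form; later opener.open(url) calls
# would then send back whatever session cookies the server stored in jar.
```

Reusing one opener for the login and for the later page fetches is what lets the session cookie carry over between requests.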

0 0

网络爬虫学习一
网络爬虫学习笔记(一)
网络爬虫学习（一）
网络爬虫学习笔记(一) 网络爬虫概述
学习python写网络爬虫（一）
学习Python之网络爬虫（一）
Python 网络爬虫学习（一）
Python网络爬虫学习scrapy(一)
Python网络爬虫学习笔记（一）
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
网络爬虫基本原理(一)
二分查找(边界问题)
ruby-数据类型
使用autotools自动生成makefile
Qt-----实现Tcp通信
c++不同继承方式的访问权限
网络爬虫学习一
epoll
Oracle RAC启动CRS-1028，CRS-0223错误
8.2 os.path--公共的路径名操作
spring基础－convert
@Repository、@Service、@Controller 和 @Component
codeforces_632A.Grandma Laura and Apples
IO
CentOS 配置WAMP环境