Sitemap Crawler
Source: Internet  Editor: 程序博客网  Date: 2024/05/16 17:23
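import re

# download_page is not shown in this post. The calls below pass a URL, a number
# (presumably a retry count) and, optionally, a proxy address.

def crawl_sitemap(url):
    html = ''
    # download the sitemap file
    sitemap = download_page(url, 2)
    # In Python 3 the download may come back as bytes; decode it so the str
    # pattern below does not raise
    # "TypeError: cannot use a string pattern on a bytes-like object"
    if isinstance(sitemap, bytes):
        sitemap = sitemap.decode('utf-8')
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # load each link
    for link in links:
        html = download_page(link, 2)

if __name__ == '__main__':
    # other URLs tried earlier:
    # url = "https://www.meetup.com/"
    # url = 'https://zhidao.baidu.com/question/2073804096754701028.html'
    url = 'http://example.webscraping.com/sitemap.xml'
    crawl_sitemap(url)
    # page_buf = download_page(url, 2 , '127.0.0.1:8087')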
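crawl_sitemap calls download_page, which this post does not include. The block below is only a sketch of what such a helper might look like, not the author's version. The parameter names (num_retries, proxy, timeout), the choice to retry only on HTTP 5xx errors, and the None return value on failure are all assumptions inferred from the calls and the log messages.

```python
import urllib.error
import urllib.request


def download_page(url, num_retries=2, proxy=None, timeout=10):
    """Hypothetical helper: return the page body as text, or None on failure."""
    print('downloading:', url)
    handlers = []
    if proxy:
        # Route both schemes through the given proxy, e.g. '127.0.0.1:8087'.
        handlers.append(urllib.request.ProxyHandler({'http': proxy, 'https': proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as response:
            # Decode to str so regular expressions with str patterns work.
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as e:
        print('download failed:', e)
        # Assumption: only server-side (5xx) errors are worth retrying.
        if num_retries > 0 and 500 <= e.code < 600:
            return download_page(url, num_retries - 1, proxy, timeout)
    except OSError as e:
        # URLError and socket timeouts are both subclasses of OSError.
        print('download failed:', e)
    return None
```

Because this sketch returns None on failure, a caller such as crawl_sitemap would need to check the result before passing it to re.findall.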
Output:
downloading: http://example.webscraping.com/sitemap.xml
downloading: http://example.webscraping.com/view/Afghanistan-1
downloading: http://example.webscraping.com/view/Aland-Islands-2
downloading: http://example.webscraping.com/view/Albania-3
download failed: timed out
downloading: http://example.webscraping.com/view/Algeria-4
downloading: http://example.webscraping.com/view/American-Samoa-5
downloading: http://example.webscraping.com/view/Andorra-6
download failed: timed out
downloading: http://example.webscraping.com/view/Angola-7
downloading: http://example.webscraping.com/view/Anguilla-8
downloading: http://example.webscraping.com/view/Antarctica-9
download failed: timed out
downloading: http://example.webscraping.com/view/Antigua-and-Barbuda-10
download failed: timed out
downloading: http://example.webscraping.com/view/Argentina-11
downloading: http://example.webscraping.com/view/Armenia-12
downloading: http://example.webscraping.com/view/Aruba-13
downloading: http://example.webscraping.com/view/Australia-14
downloading: http://example.webscraping.com/view/Austria-15
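Several requests in the log above time out, and those pages are simply lost. One possible way to deal with this, sketched below, is to remember the failed links and try them once more at the end. The function name crawl_links and its fetch parameter are hypothetical, and the sketch assumes the download function returns None when a request fails.

```python
def crawl_links(links, fetch):
    """Fetch every link once, then retry the failures a single time.

    `fetch` is any callable that returns the page text, or None on failure
    (an assumption about how download_page reports errors).
    Returns (pages, still_failed): a dict of link -> text and a list of links.
    """
    pages = {}
    failed = []
    for link in links:
        html = fetch(link)
        if html is None:
            failed.append(link)
        else:
            pages[link] = html

    # Second pass over the links that failed the first time.
    still_failed = []
    for link in failed:
        html = fetch(link)
        if html is None:
            still_failed.append(link)
        else:
            pages[link] = html
    return pages, still_failed
```

With a download helper at hand, it could be called as `pages, still_failed = crawl_links(links, lambda u: download_page(u, 2))`, where links is the list extracted from the sitemap.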