02. Building the Simplest Web Crawler with the Libraries Installed in Part 01

Source: Internet  Editor: 程序博客网  Time: 2024/06/10 22:49

1. Introduction

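This post uses a short Python program as an example. It crawls a single page of TripAdvisor China (猫途网, https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html) and extracts two pieces of information for every attraction listed on that page: the image URL and the title.

[Figure: screenshot of the target page]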

2. Procedure

1) In the PyCharm IDE, create a Python file named Trip_Advisor.py

2) Import the libraries at the top of the .py file

from bs4 import BeautifulSoup
import requests

3) Fetch the page data with the requests library (using the GET method)

url = 'https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html'
wb_data = requests.get(url)  # send an HTTP GET request
print(wb_data.text)          # response body decoded as text

# Partial output:
# <!DOCTYPE html>
# <html>
# <head>
# <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
# <link rel='stylesheet' type='text/css' href='https://cc.ddcdn.com/css2/long_lived_global_legacy-v22091239456c.css' data-rup='long_lived_global_legacy'/>
# <link rel="icon" id="favicon" href="https://cc.ddcdn.com/favicon.ico" type="image/x-icon"/>
# <link rel="preload" href="https://cc.ddcdn.com/css2/webfonts/TripAdvisor/TripAdvisor_Regular.woff2?v003.230" as="font" type="font/woff2" crossorigin>
# <link rel="mask-icon" sizes="any" href="https://cc.ddcdn.com/img2/icons/ta_square.svg" color="#00a680"/>
# ..........
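The bare requests.get(url) call above works for a quick experiment, but it waits indefinitely on a stalled connection and silently accepts error pages. The following is a minimal sketch of a slightly more defensive fetch. The helper name fetch_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original post.

```python
import requests

# Hypothetical header value and timeout, chosen only for illustration.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-crawler)"}


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # Any object with a requests-style get() method works here,
    # which makes the function easy to exercise without a network.
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# url = 'https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html'
# print(fetch_html(url)[:200])
```

Whether the site accepts this particular User-Agent is not something the original post states; treat the header as a placeholder to adjust.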

4) Parse the page data with BeautifulSoup and the lxml library

url = 'https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')  # parse the HTML with the lxml parser
print(soup)

# Partial output:
# <!DOCTYPE html>
# <html>
# <head>
# <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
# <link data-rup="long_lived_global_legacy" href="https://cc.ddcdn.com/css2/long_lived_global_legacy-v22091239456c.css" rel="stylesheet" type="text/css"/>
# <link href="https://cc.ddcdn.com/favicon.ico" id="favicon" rel="icon" type="image/x-icon"/>
# <link as="font" crossorigin="" href="https://cc.ddcdn.com/css2/webfonts/TripAdvisor/TripAdvisor_Regular.woff2?v003.230" rel="preload" type="font/woff2"/>
# <link color="#00a680" href="https://cc.ddcdn.com/img2/icons/ta_square.svg" rel="mask-icon" sizes="any"/>
# ......
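Printing the soup looks much like printing the raw text, so it may help to see what parsing adds: a tree whose elements and attributes can be addressed by name. The sketch below runs offline on a small made-up document (SAMPLE_HTML is hypothetical, loosely modelled on the output above). The make_soup helper is also an assumption: it falls back to Python's built-in html.parser when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample document, not the real TripAdvisor page.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title>Sample page</title>
</head>
<body><p>Hello</p></body>
</html>"""


def make_soup(html):
    """Parse with lxml when available, else the standard-library parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(html, "html.parser")


soup = make_soup(SAMPLE_HTML)
print(soup.title.get_text())         # text inside the <title> element
print(soup.find("meta")["content"])  # attribute lookup by name
print(soup.p.name)                   # tag name of the first <p> element
```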

5) Use BeautifulSoup's select() method to get the HTML elements that hold the image URLs and titles

url = 'https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
images = soup.select('div.centering_wrapper > img')  # <img> elements of the attraction photos
titles = soup.select('div.listing_title > a')        # <a> elements of the attraction titles
print(images, titles, sep='\n')

# Partial output (images and titles are both lists):
# [<img alt="鼓浪屿" class="photo_image" height="111" src="https://ccm.ddcdn.com/photo-f/01/e7/22/cd/img-2021.jpg" style="height: 150px; width: 200px;" width="200"/>,
#  <img alt="南普陀寺" class="photo_image" height="111" src="https://ccm.ddcdn.com/photo-f/01/ef/ce/73/dscn2022.jpg" style="height: 150px; width: 200px;" width="200"/>,
#  <img alt="厦门大学" class="photo_image" height="111" src="https://ccm.ddcdn.com/ext/photo-f/03/ef/a9/07/xiamen-university.jpg" style="height: 200px; width: 200px;" width="200"/>,
#  <img alt="中山路步行街" class="photo_image" height="111" src="https://ccm.ddcdn.com/photo-f/01/d1/0c/27/2.jpg" style="height: 150px; width: 200px;" width="200"/>,
#  <img alt="厦门大学" class="photo_image" height="111" src="https://ccm.ddcdn.com/ext/photo-f/03/ef/a9/07/xiamen-university.jpg" style="height: 200px; width: 200px;" width="200"/>,
#  <img alt="胡里山炮台" class=......]
# [<a href="/Attraction_Review-g297407-d1131761-Reviews-Gulangyu_Island-Xiamen_Fujian.html" onclick="ta.setEvtCookie('Attraction_List_Click', 'POI_click', 'name', 1, '/Attraction_Review')" target="_blank">鼓浪屿</a>,
#  <a href="/Attraction_Review-g297407-d502857-Reviews-Nanputuo_Temple-Xiamen_Fujian.html" onclick="ta.setEvtCookie('Attraction_List_Click', 'POI_click', 'name', 2, '/Attraction_Review')" target="_blank">南普陀寺</a>,
#  <a href="/Attraction_Review-g297407-d1372931-Reviews-Xiamen_University-Xiamen_Fujian.html" onclick="ta.setEvtCookie('Attraction_List_Click', 'POI_click', 'name', 3, '/Attraction_Review')" target="_blank">厦门大学</a>,
#  <a href="/Attraction_Review-g297407-d1930394-Reviews-Zhongshan_Road_Walking_Street-Xiamen_Fujian.html" onclick="ta.setEvtCookie('Attraction_List_Click', 'POI_click', 'name', 4, '/Attraction_Review')" target="_blank">中山路步行街</a>......]
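Note: the string passed to select() is a CSS selector that uniquely identifies the target information in the page's HTML. In this post it was obtained with the "Copy selector" command of the browser's developer tools.

[Figure: screenshot of the copy selector step]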
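To make the selector syntax concrete, here is a small offline sketch. The markup in SAMPLE_HTML is hypothetical and only imitates the class names used in this post. It uses the built-in html.parser so that it runs without lxml. It contrasts the child combinator `>` (direct children only) with the descendant combinator (a space), and shows select_one() for the case where only the first match is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="listing">
  <div class="centering_wrapper"><img alt="A" src="https://example.com/a.jpg"/></div>
  <div class="listing_title">
    <a href="/a.html">Attraction A</a>
    <span><a href="/x.html">nested link</a></span>
  </div>
</div>
<div class="listing">
  <div class="centering_wrapper"><img alt="B" src="https://example.com/b.jpg"/></div>
  <div class="listing_title"><a href="/b.html">Attraction B</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "parent > child" matches only direct children of the parent element.
images = soup.select("div.centering_wrapper > img")
titles = soup.select("div.listing_title > a")

# "ancestor descendant" also matches the link nested inside <span>.
all_links = soup.select("div.listing_title a")

# select_one() returns the first match, or None if nothing matches.
first_title = soup.select_one("div.listing_title > a")

print(len(images), len(titles), len(all_links))  # 2 2 3
print(first_title.get_text())                    # Attraction A
```

select() always returns a list, even for a single match, which is why the next step iterates over the results.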



6) Loop over the images and titles lists and extract each image URL and title (complete code below)

from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/Attractions-g297407-Activities-Xiamen_Fujian.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')
images = soup.select('div.centering_wrapper > img')
titles = soup.select('div.listing_title > a')
# print(images, titles, sep='\n')

# Pair each image with the title at the same position in the other list
for image, title in zip(images, titles):
    data = {
        "image": image.get('src'),   # value of the src attribute
        "title": title.get_text()    # text inside the <a> element
    }
    print(data)

# Partial output:
# {'image': 'https://ccm.ddcdn.com/photo-f/01/e7/22/cd/img-2021.jpg', 'title': '鼓浪屿'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/ef/ce/73/dscn2022.jpg', 'title': '南普陀寺'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-f/03/ef/a9/07/xiamen-university.jpg', 'title': '厦门大学'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/d1/0c/27/2.jpg', 'title': '中山路步行街'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-f/03/ef/a9/07/xiamen-university.jpg', 'title': '鼓浪屿环岛路'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/d1/0b/fc/caption.jpg', 'title': '厦门钢琴博物馆'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/c7/77/ce/caption.jpg', 'title': '厦门园林植物园'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-f/0b/bf/f2/f4/xiamen-tianjie-temple.jpg', 'title': '日光岩'}
# {'image': 'https://ccm.ddcdn.com/photo-s/02/47/d3/d9/caption.jpg', 'title': '厦门菽庄花园'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-s/01/ad/24/76/piano-museum.jpg', 'title': '厦门日月谷温泉渡假村'}
# {'image': 'https://ccm.ddcdn.com/photo-f/02/01/ce/7a/173.jpg', 'title': '厦门白鹭洲公园'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-f/02/b6/80/b0/yuyuan-garden.jpg', 'title': '海湾公园'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/d0/e7/58/2326109207538021970.jpg', 'title': '曾厝垵'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/f0/3c/22/33.jpg', 'title': '怀旧鼓浪屿博物馆'}
# {'image': 'https://ccm.ddcdn.com/ext/photo-s/02/47/ce/4f/main-entrance.jpg', 'title': '胡里山炮台'}
# {'image': 'https://ccm.ddcdn.com/photo-f/02/01/96/cb/caption.jpg', 'title': '环岛路木栈道'}
# {'image': 'https://ccm.ddcdn.com/photo-f/01/f9/79/2a/caption.jpg', 'title': '鼓浪屿国际刻字艺术馆'}
# {'image': 'https://ccm.ddcdn.com/photo-s/05/17/f9/c6/caption.jpg', 'title': '芙蓉隧道'}

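Note: two different methods are used here to pull information out of the HTML elements. The image URL is read with get('attribute name'), in this case get('src'), which returns the value of that attribute on the tag. The title is read with get_text(), which returns the text contained in the element.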
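The difference between the two accessors, and one pitfall of pairing the lists with zip(), can be seen in a short offline sketch. The markup is hypothetical: the second image deliberately has no src attribute and the first title carries extra whitespace. The built-in html.parser is used so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second <img> deliberately lacks a src attribute.
SAMPLE_HTML = """
<div class="centering_wrapper"><img alt="A" src="https://example.com/a.jpg"/></div>
<div class="listing_title"><a href="/a.html"> Attraction A </a></div>
<div class="centering_wrapper"><img alt="B"/></div>
<div class="listing_title"><a href="/b.html">Attraction B</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
images = soup.select("div.centering_wrapper > img")
titles = soup.select("div.listing_title > a")

records = []
for image, title in zip(images, titles):
    records.append({
        # get() reads an attribute and returns None when it is absent;
        # image["src"] would raise KeyError in that case instead.
        "image": image.get("src"),
        # get_text(strip=True) returns the element's text without
        # leading or trailing whitespace.
        "title": title.get_text(strip=True),
    })

for record in records:
    print(record)
# {'image': 'https://example.com/a.jpg', 'title': 'Attraction A'}
# {'image': None, 'title': 'Attraction B'}
```

Note that zip() pairs items purely by position and stops at the shorter list, so if one selector matches more elements than the other, images and titles can end up misaligned without any error being raised.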
