Python in Practice: Notes on Building a Web Crawler System in Four Weeks

Source: Internet | Published by: java如何定义字符数组 | Editor: 程序博客网 | Time: 2024/05/16 17:20

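Chapter 1, Week 0: Before You Start

Write often, stay patient, and keep at it.
A good tool for getting past network restrictions helps. The learning path is: understand, imitate, then practice.

Debug after every step so that errors are caught early.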

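Chapter 2, Week 1: Learning to Scrape Web Page Information

Local pages:

1. Parse the page with BeautifulSoup and use CSS selectors; a selector only needs to identify the element's position uniquely.

2. Find the right element with the browser's Inspect Element tool.

3. Process the tag text and extract the content from the elements.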

from bs4 import BeautifulSoup

path = './1_2_homework_required/index.html'  # relative path; works as long as the file exists locally

with open(path, 'r') as wb_data:  # open the local file with "with open"
    Soup = BeautifulSoup(wb_data, 'lxml')  # parse the page content
    # print(wb_data)
    titles = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')  # just copy each element's CSS selector path
    images = Soup.select('body > div > div > div.col-md-9 > div > div > div > img')
    reviews = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
    prices = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
    stars = Soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
    # :nth-of-type(2) is kept here so that selection starts from the parent node;
    # compare the selectors of a few star elements on the page and the pattern becomes clear
    # print(prices)

'''
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
# rewrite this into a form that BeautifulSoup accepts
'''

# print(titles, images, reviews, prices, stars, sep='\n--------\n')  # print each list; sep='\n--------\n' adds a divider between them

for title, image, review, price, star in zip(titles, images, reviews, prices, stars):  # put each item into a dict for easy lookup
    data = {
        'title': title.get_text(),  # get_text() returns the text content
        'image': image.get('src'),  # get() returns the image link stored in src
        'review': review.get_text(),
        'price': price.get_text(),
        'star': len(star.find_all("span", class_='glyphicon glyphicon-star'))
        # Each star appears once as <span class="glyphicon glyphicon-star"></span>, so counting those spans gives the number of stars.
        # find_all takes the tag name as its first argument and the CSS class as its second; see the BeautifulSoup docs:
        # http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all
        # find_all() returns a list, so len() gives the number of matches, which is the star count.
    }
    print(data)

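cates = Soup.select('ul > li > div.article-info > p.meta-info')  # select tags that contain several child tags; returns a list
'cate': list(cate.stripped_strings)  # Object.stripped_strings gathers the text of every child tag under the parent (aggregated text); a more powerful get_text()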

soup.select('img[width="170"]')  # attribute selector: narrows the match while still locating the element uniquely
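To make the two notes above concrete, here is a minimal sketch of stripped_strings and an attribute selector working together. The HTML fragment and its values are made up for illustration, and the sketch assumes the built-in html.parser instead of lxml so that bs4 is the only dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example.
html = """
<ul>
  <li>
    <img width="170" src="a.jpg">
    <div class="article-info">
      <p class="meta-info"><span>fun</span> <span>Wow</span></p>
    </div>
  </li>
  <li>
    <img width="90" src="b.jpg">
    <div class="article-info">
      <p class="meta-info"><span>Sci-Fi</span></p>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# One list of child-tag texts per matched <p>; whitespace-only strings are skipped.
cates = [list(p.stripped_strings)
         for p in soup.select('ul > li > div.article-info > p.meta-info')]

# Attribute selector: only the image whose width attribute equals "170".
wide_images = [img.get('src') for img in soup.select('img[width="170"]')]

print(cates)
print(wide_images)
```

Using double quotes inside the selector string avoids the quoting clash that single quotes inside a single-quoted Python string would cause.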

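External sites:
1. How the server and the local machine exchange data: the HTTP protocol. A Request uses one of eight methods, most commonly GET and POST; a Response comes back with a status code.

2. Selecting elements: start with a broad selector, then narrow it with attributes, for example a[target='_blank'].

3. Cookies can be used to fake login information; they are sent in the headers.

4. Fetching multiple pages: define a function, find the pattern in each page's URL, and build the URL list with a list comprehension. To cope with anti-scraping measures, add a delay with the time module. For sites that are hard to scrape, pretend to be a mobile browser, since the mobile page is usually simpler.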

urls = ['http://www.mm131.com/xinggan/3520_{}.html'.format(str(i)) for i in range(1, 23, 1)]
for single_url in urls:
    get_attractions(single_url)  # iterate over the list and scrape each page; add a delay here to avoid triggering anti-scraping measures

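5. Scraping pages that require login: use cookies. To scrape the mobile version of a page, send mobile headers:
headers = {
    'User-Agent': '#mobile device user agent from chrome',
    'Cookie': ''
}
6. Dynamically loaded content: look in the browser's Network panel to find the request URL.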
from bs4 import BeautifulSoup
import requests
import time

url = 'https://knewone.com/discover?page='

def get_page(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')
    if data is None:
        for img, title, link in zip(imgs, titles, links):
            data = {
                'img': img.get('src'),
                'title': title.get('title'),
                'link': link.get('href')
            }
            print(data)

def get_more_pages(start, end):
    for one in range(start, end):
        get_page(url + str(one))
        time.sleep(2)  # pause between pages

get_more_pages(1, 10)
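Points 3 to 5 above (cookies in headers, a URL list built with a list comprehension, and a delay between requests) can be combined in one small sketch. It only builds request objects and never touches the network. The header values are placeholders, the function names are hypothetical, and the standard library's urllib.request is swapped in for the requests library used in these notes so that the sketch has no dependencies.

```python
import time
import urllib.request

# Placeholder values: substitute a real mobile User-Agent string and your own cookie.
HEADERS = {
    'User-Agent': 'mobile-user-agent-placeholder',
    'Cookie': 'session-cookie-placeholder',
}

def build_page_urls(template, start, end):
    """Return one URL per page number in [start, end)."""
    return [template.format(page) for page in range(start, end)]

def build_request(url, headers=HEADERS):
    """Attach the headers to a request object without sending it."""
    return urllib.request.Request(url, headers=headers)

def crawl(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results

urls = build_page_urls('https://knewone.com/discover?page={}', 1, 4)
requests_to_send = crawl(urls, build_request, delay=0)
print([r.full_url for r in requests_to_send])
```

Passing the fetch function and the sleep function as parameters is a design choice that keeps the throttling logic testable; in a real crawler, fetch would download and parse the page.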
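Practice project: second-hand tablets on 58同城 (58.com)
soup.title.text  # get the page title
'cate': '个人' if who_sells == 0 else '商家'  # conditional expression: '个人' (individual) when who_sells is 0, otherwise '商家' (merchant)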
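The 'cate' line relies on a conditional expression, which is easy to confuse with a list comprehension. A minimal sketch of the difference follows; the function name, the sample flags, and the English labels are stand-ins chosen for this example.

```python
def seller_category(who_sells):
    """Map the numeric seller flag to a label with a conditional expression."""
    return 'individual' if who_sells == 0 else 'merchant'

# A list comprehension, by contrast, builds a new list from an iterable.
flags = [0, 1, 0]
labels = [seller_category(flag) for flag in flags]
print(labels)
```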

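Chapter 3, Week 2: Learning to Scrape Data at Scale

MongoDB works much like Excel: a database is like a workbook and a collection is like a sheet.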

import pymongo

# local host and port
client = pymongo.MongoClient('localhost', 27017)
# database name
walden = client['walden']
# create the collection (the "sheet") inside the database
sheet_tab = walden['sheet_tab']

path = 'walden.txt'
# open the file read-only
# with open(path, 'r') as f:
#     lines = f.readlines()
#     for index, line in enumerate(lines):
#         data = {
#             'index': index,
#             'line': line,
#             'words': len(line.split())
#         }
#         # print(data)
#         # insert the record into the collection; once the data is written, this block can be commented out
#         sheet_tab.insert_one(data)

# $lt/$lte/$gt/$gte/$ne correspond to </<=/>/>=/!= (l = less, g = greater, e = equal, n = not)
# analyse the data in the collection
for item in sheet_tab.find({'words': {'$lt': 5}}):
    print(item)
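To see what a query such as {'words': {'$lt': 5}} selects without running a MongoDB server, the comparison operators can be imitated in plain Python. This is only an illustration of the operator semantics, not how MongoDB works internally. The matches helper and the sample lines are invented for this sketch, and it treats a missing field as a non-match, which differs from real MongoDB behaviour for $ne.

```python
import operator

# MongoDB comparison operators and their Python counterparts.
OPERATORS = {
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$ne': operator.ne,
}

def matches(document, query):
    """Return True if the document satisfies every {field: {op: value}} condition."""
    for field, conditions in query.items():
        for op, expected in conditions.items():
            # Simplification: a missing field never matches.
            if field not in document:
                return False
            if not OPERATORS[op](document[field], expected):
                return False
    return True

# Made-up sample lines standing in for the contents of walden.txt.
lines = ['I went to the woods', 'to live deliberately', 'and see']
documents = [
    {'index': i, 'line': line, 'words': len(line.split())}
    for i, line in enumerate(lines)
]

short = [d for d in documents if matches(d, {'words': {'$lt': 5}})]
print(short)
```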

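Large-scale scraping: 1. Observe the features the pages share and write general-purpose functions. 2. Design the workflow: spider 1 fetches the list-page URLs and collects the item links; spider 2 visits each detail page and collects the details of every item.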

'area': list(map(lambda x: x.text, soup.select('ul.det-infor > li:nth-of-type(3) > a')))  # map with a lambda pulls the text out of fields that are awkward to parse
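The two-spider workflow described above can be sketched without any network access. Here the "site" is a pair of hypothetical in-memory dictionaries, and all names and values are invented for the example. In a real crawler, each stage would fetch pages and would likely store its results in a database collection.

```python
# Hypothetical in-memory "site": list pages map to item links, item links to details.
LIST_PAGES = {
    'list/1': ['item/a', 'item/b'],
    'list/2': ['item/b', 'item/c'],
}
ITEM_PAGES = {
    'item/a': {'title': 'Tablet A', 'price': 500},
    'item/b': {'title': 'Tablet B', 'price': 800},
    'item/c': {'title': 'Tablet C', 'price': 650},
}

def collect_item_links(list_urls):
    """Spider 1: walk the list pages and return each item link once, in order."""
    seen = []
    for list_url in list_urls:
        for link in LIST_PAGES.get(list_url, []):
            if link not in seen:
                seen.append(link)
    return seen

def collect_item_details(item_links):
    """Spider 2: visit each item link and return its detail record plus its URL."""
    return [dict(ITEM_PAGES[link], url=link) for link in item_links]

links = collect_item_links(['list/1', 'list/2'])
details = collect_item_details(links)
print(links)
print(details)
```

Separating the stages means the link collection can finish, or be resumed, independently of the slower detail scraping.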

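Chapter 4, Week 3: Data Statistics and Analysis

1. Ask the right question, explain the phenomenon correctly, and verify the hypothesis correctly.

2. Support the argument with data and break the data down into segments.

3. Interpret the data: data can speak, but data can also lie, so watch for sample bias.

Tidy and clean the data

Update the database

Visualize the data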
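As a rough illustration of the cleaning step that comes before any chart, the sketch below normalises a few records and counts posts per area. The records, field names, and cleaning rules are all assumptions made for this example, not the course's actual data.

```python
from collections import Counter

# Hypothetical raw records, as a scraper might have stored them.
raw = [
    {'area': ['Chaoyang', 'Wangjing'], 'price': '1200 yuan'},
    {'area': [], 'price': '800 yuan'},
    {'area': ['Haidian'], 'price': 'negotiable'},
    {'area': ['Chaoyang'], 'price': '300 yuan'},
]

def clean(record):
    """Normalise one record: keep the first area level, parse the price if numeric."""
    area = record['area'][0] if record['area'] else 'unknown'
    first_token = record['price'].split()[0]
    price = int(first_token) if first_token.isdigit() else None
    return {'area': area, 'price': price}

cleaned = [clean(r) for r in raw]

# Count how many posts fall in each area; this is the kind of series a bar chart would plot.
posts_per_area = Counter(r['area'] for r in cleaned)
print(posts_per_area.most_common())
```

Records with an empty area are labelled 'unknown' instead of being dropped, so the size of the gap in the sample stays visible.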

Chapter 5, Week 4: Building a Django Data Visualization Website

Template language: understand the context and the render function
