Web Crawlers

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 12:02


1. Components of a web page: HTML (structure), CSS (style), JavaScript (behavior)

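2. <div></div> marks a region of the page. It is used to divide the page into sections, and other tags are placed inside it to build that section's content.

<p></p> holds paragraph text.

<li></li> is a list item.

<img> inserts an image.

<h1></h1> is a heading.

<a href="URL"></a> creates a link to the given URL.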

Source code of a simple web page, with the CSS styles omitted:
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>The blah</title>
  <link rel="stylesheet" type="text/css" href="main.css">
</head>
<body>
  <div class="header">
    <img src="images/blah.png"> <!-- src: path to the image file -->
    <ul class="nav">
      <li><a href="#">Home</a></li>
      <li><a href="#">Site</a></li>
      <li><a href="#">Other</a></li>
    </ul>
  </div>
  <div class="main-content">
    <h2>Article</h2>
    <ul class="article">
      <li>
        <img src="images/0001.jpg" width="100" height="90">
        <h3><a href="#">The blah</a></h3>
        <p>This is a dangerously delicious cake.</p>
      </li>
      <li>
        <img src="images/0002.jpg" width="100" height="90">
        <h3><a href="#">The blah</a></h3>
        <p>It's always taco night somewhere!</p>
      </li>
      <li>
        <img src="images/0003.jpg" width="100" height="90">
        <h3><a href="#">The blah</a></h3>
        <p>Omelette you in on a little secret </p>
      </li>
      <li>
        <img src="images/0004.jpg" width="100" height="90">
        <h3><a href="#">The blah</a></h3>
        <p>It's a sandwich.  That's all we .</p>
      </li>
    </ul>
  </div>
  <div class="footer">
    <p>&copy; Mugglecoding</p>
  </div>
</body>
</html>

A web page is built from nested tags.
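To make the nesting concrete, here is a minimal sketch that uses Python's standard html.parser module to record each opening tag together with its depth. The class name, the helper function, and the sample fragment are made up for illustration; the fragment is only shaped like the header of the page above.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they do not open a new nesting level.
VOID_TAGS = {"img", "meta", "link", "br", "hr", "input"}


class NestingOutline(HTMLParser):
    """Collects each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def outline_of(html):
    """Return the (depth, tag) outline of an HTML string."""
    parser = NestingOutline()
    parser.feed(html)
    parser.close()
    return parser.outline


# Hypothetical fragment, shaped like the header of the page above.
sample = (
    '<div class="header"><img src="images/blah.png">'
    '<ul class="nav"><li><a href="#">Home</a></li></ul></div>'
)

# Print one tag per line, indented by how deeply it is nested.
for depth, tag in outline_of(sample):
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the way the tags sit inside one another in the source.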

Extracting information from a simple web page

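Step 1: Parse the page with BeautifulSoup.

Soup = BeautifulSoup(html, 'lxml')

Step 2: Describe where the content you want to scrape is located.

Soup.select()

Pass it a CSS selector.

Step 3: Extract the information you want from the matched tags. The selector should identify only the tags that hold that information.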
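The three steps can be sketched end to end on a small inline document. The HTML string below is a hypothetical fragment shaped like the article list shown earlier, and the sketch uses the standard-library 'html.parser' backend so that it runs without lxml installed; 'lxml' should work the same way here.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, shaped like the article list shown earlier.
html = """
<ul class="article">
  <li>
    <img src="images/0001.jpg" width="100" height="90">
    <h3><a href="#">The blah</a></h3>
    <p>This is a dangerously delicious cake.</p>
  </li>
  <li>
    <img src="images/0002.jpg" width="100" height="90">
    <h3><a href="#">The blah</a></h3>
    <p>It's always taco night somewhere!</p>
  </li>
</ul>
"""

# Step 1: parse the page.
soup = BeautifulSoup(html, "html.parser")

# Step 2: describe where the content lives with CSS selectors.
images = soup.select("ul.article > li > img")
descriptions = soup.select("ul.article > li > p")

# Step 3: pull the wanted values out of the matched tags.
image_paths = [img.get("src") for img in images]
texts = [p.get_text() for p in descriptions]

print(image_paths)
print(texts)
```

Attributes such as src are read with get(), while the visible text of a tag is read with get_text().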

Example source code:

from bs4 import BeautifulSoup

data = []
path = './web/new_index.html'  # local copy of the page to parse

# Parse the file, then select each kind of element with a CSS selector
with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
    titles = Soup.select('ul > li > div.article-info > h3 > a')
    pics = Soup.select('ul > li > img')
    descs = Soup.select('ul > li > div.article-info > p.description')
    rates = Soup.select('ul > li > div.rate > span')
    cates = Soup.select('ul > li > div.article-info > p.meta-info')

for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        'cate': list(cate.stripped_strings)
    }
    data.append(info)

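# Print the articles rated 3 or higher.
# Assumes the rating text is numeric (e.g. '4.5'); the length of the
# string says nothing about the rating itself.
for i in data:
    if float(i['rate']) >= 3:
        print(i['title'], i['cate'])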

Extracting data from real-world web pages

Use the requests library's get (retrieve data) and post (send data).
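The difference between the two can be seen without touching the network by building the requests and inspecting them before they are sent. This is only a sketch: example.com and the parameter names are placeholders, and in everyday use requests.get(url, params=...) and requests.post(url, data=...) build and send the request in a single call.

```python
import requests

# example.com and the parameter names below are placeholders, not a real API.
# GET: the parameters travel in the URL's query string.
get_request = requests.Request(
    "GET", "https://example.com/search", params={"q": "cake"}
).prepare()

# POST: the data travels in the request body.
post_request = requests.Request(
    "POST", "https://example.com/login", data={"user": "alice"}
).prepare()

print(get_request.method, get_request.url)
print(post_request.method, post_request.url, post_request.body)
```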

One way to extract images: select them by size with an attribute selector, e.g. 'img[width="60"]'.
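A minimal sketch of selecting images by size, assuming a page where thumbnails all share the same width attribute. The fragment and file names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one small thumbnail and one larger banner.
html = """
<div>
  <img src="images/thumb.jpg" width="60" height="60">
  <img src="images/banner.jpg" width="600" height="90">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [width="60"] matches only tags whose width attribute is exactly "60",
# so the banner with width="600" is left out.
thumbnails = soup.select('img[width="60"]')
thumbnail_paths = [img.get("src") for img in thumbnails]
print(thumbnail_paths)
```

Note the quoting: the selector string uses single quotes on the outside and double quotes around the attribute value, so the two do not clash.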

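Use requests to get past a username/password login by sending the headers of an already logged-in browser session:

headers = {
    'User-Agent': '',  # fill in the value copied from your browser
    'Cookie': ''       # fill in the cookie of the logged-in session
}

# Pass the headers as an argument when calling requests.get
wb_data = requests.get(url, headers=headers)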

Pages that require pagination

Find the number in the URL that changes from page to page, then generate the URLs: urls = ['http://xxxxxxxxx{}xxxxxxxxxx'.format(str(i)) for i in range(20, 200, 20)]

Use time.sleep(x) to limit the request rate.
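Pagination and rate limiting fit together naturally in one loop. The sketch below is one possible arrangement, not a fixed recipe: the function names, the URL template, and the stand-in fetch function are all hypothetical, and the fetch function is injected so the example runs without network access.

```python
import time


def build_page_urls(template, start, stop, step):
    """Fill the changing number into the URL template, once per page."""
    return [template.format(i) for i in range(start, stop, step)]


def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every URL, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # limit the request rate
        results.append(fetch(url))
    return results


# Hypothetical template; a real site will have its own URL pattern.
urls = build_page_urls("http://example.com/list?offset={}", 20, 200, 20)
print(len(urls), urls[0], urls[-1])

# A stand-in fetch function so the sketch runs offline.
# In real use this could be something like: lambda u: requests.get(u).text
pages = crawl(urls[:3], fetch=lambda u: "fetched " + u, delay_seconds=0.1)
print(pages)
```

Sleeping only between requests, rather than before the first one, avoids a pointless initial wait.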

If images are hard to extract, try scraping the mobile version of the page instead.

Scraping dynamically loaded data with a crawler

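Asynchronous loading: content loads gradually as you scroll down the page, with no pagination links. The data is usually still requested page by page in the background, so the script below requests those page URLs directly.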

from bs4 import BeautifulSoup
import requests
import time

url = 'https://knewone.com/discover?page='

def get_page(url, data=None):
    # Download one page and parse it
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')

    if data is None:
        # Print one record per item found on the page
        for img, title, link in zip(imgs, titles, links):
            data = {
                'img': img.get('src'),
                'title': title.get('title'),
                'link': link.get('href')
            }
            print(data)

def get_more_pages(start, end):
    # Request pages start..end-1, pausing 2 seconds between requests
    for one in range(start, end):
        get_page(url + str(one))
        time.sleep(2)

get_more_pages(1, 10)
