A Summary of Web Scraping in Python with urllib and BeautifulSoup

Source: Internet · Editor: 程序博客网 · Time: 2024/05/22 01:30

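I recently learned a little about web scraping, so here is a short summary. Corrections and feedback are welcome.

Python Web Scraping Summary

Checking that everything is installed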

python
- python
urllib
- from urllib.request import urlopen
BeautifulSoup4
- from bs4 import BeautifulSoup
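
The three checks above can also be run in one go by asking Python whether it can find each module. This is only a minimal sketch; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# urllib ships with Python; bs4 is provided by the beautifulsoup4 package.
for name in ("urllib", "bs4"):
    print(name, "found" if is_installed(name) else "missing")
```

If `bs4` shows up as missing, installing the `beautifulsoup4` package with pip should fix it.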

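Storing data in MySQL: installing pymysql

Install pymysql with pip
- pip install pymysql
Install from the source package
- python setup.py install

Note:
Download the source (the master branch from GitHub) -> unzip it -> open Cmd -> cd into the folder that contains setup.py -> run the install command above in that folder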

Simulating a real browser

Send a User-Agent header
- req = request.Request(url)
- req.add_header(key,value)
- resp = request.urlopen(req)
- print(resp.read().decode("utf-8"))
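
Put together, the steps above might look like the sketch below. The URL and the User-Agent string are placeholder values I picked for illustration, and `build_request` and `fetch` are hypothetical helper names. Only `fetch` touches the network, so the rest runs offline.

```python
from urllib import request

# Assumed values, for illustration only.
URL = "http://example.com"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent):
    """Create a Request that carries a browser-like User-Agent header."""
    req = request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch(req):
    """Send the request and return the decoded body (needs network access)."""
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


req = build_request(URL, USER_AGENT)
# Show the headers that will be sent with the request.
print(req.header_items())
```

Note that urllib stores the header name as `User-agent` internally, so that is the spelling to use with `req.get_header()`.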

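Using POST

Import parse from the urllib package
- from urllib import parse
Build the POST data with urlencode
- postData = parse.urlencode([
  (key1,val1),
  (key2,val2),
  (keyn,valn)
  ])
Send the POST request with postData
- resp = request.urlopen(req,data = postData.encode("utf-8"))
Get the response status code
- resp.status
Get the reason phrase that goes with the status code (for example OK)
- resp.reason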
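
Here is a small sketch of the POST steps above. The form fields and the URL are hypothetical, and `send` is a made-up helper name. It needs network access, so it is defined here but not called.

```python
from urllib import parse, request

# Hypothetical form fields and URL, for illustration only.
fields = [("key1", "val1"), ("key2", "val 2")]

# urlencode returns a str; urlopen expects bytes, hence the encode().
post_data = parse.urlencode(fields).encode("utf-8")

# Attaching data to a Request turns it into a POST.
req = request.Request("http://example.com/form", data=post_data)
print(post_data)         # b'key1=val1&key2=val+2'
print(req.get_method())  # POST


def send(req):
    """Send the request; return the status code, reason phrase and body."""
    with request.urlopen(req) as resp:
        return resp.status, resp.reason, resp.read().decode("utf-8")
```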

Storing data in MySQL

Import the package
- import pymysql.cursors
Get a database connection
- connection = pymysql.connect(host = "localhost",
  user = "root",
  password = "123456",
  db = "wikiurl",
  charset = "utf8mb4")
Commit
- connection.commit()
Close
- connection.close()
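
The connect / commit / close steps above could be wrapped up roughly as follows. The function names are hypothetical, the `urls` table with its `urlname` and `urlhref` columns is taken from the insert statement in the code section of this post and is assumed to exist already, and the sample row in `main` is made up.

```python
def open_connection():
    """Open a connection with the settings listed above.

    Needs the pymysql package and a running MySQL server.
    """
    import pymysql  # imported here so the rest of the sketch loads without it

    return pymysql.connect(host="localhost", user="root", password="123456",
                           db="wikiurl", charset="utf8mb4")


def save_url(connection, name, href):
    """Insert one row into `urls`, then commit."""
    sql = "insert into `urls`(`urlname`,`urlhref`) values (%s,%s)"
    with connection.cursor() as cursor:
        # Passing the values separately lets the driver escape them.
        cursor.execute(sql, (name, href))
    connection.commit()


def main():
    """Example usage; not called automatically because it needs a database."""
    connection = open_connection()
    try:
        save_url(connection, "Python", "https://en.wikipedia.org/wiki/Python")
    finally:
        # Close the connection even if the insert fails.
        connection.close()
```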

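Reading data from MySQL

Run the query and get the number of rows
- cursor.execute()
Fetch the next row
- cursor.fetchone()
Fetch a given number of rows
- cursor.fetchmany(size=None)
Fetch all remaining rows
- cursor.fetchall()
Close
- connection.close()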
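
To see how the fetch calls above fit together, here is a rough sketch. `read_urls` is a hypothetical helper; it expects an open pymysql connection, and the `urls` table and its columns are the ones used in the code section of this post.

```python
def read_urls(connection, batch_size=2):
    """Run a select and read the result with the three fetch styles.

    Returns (row_count, first_row, next_batch, remaining_rows).
    """
    sql = "select `urlname`,`urlhref` from `urls`"
    with connection.cursor() as cursor:
        # With pymysql, execute() returns the number of rows in the result.
        count = cursor.execute(sql)
        # fetchone() returns the next row, or None when nothing is left.
        first = cursor.fetchone()
        # fetchmany() returns up to `size` further rows.
        batch = cursor.fetchmany(size=batch_size)
        # fetchall() returns whatever has not been read yet.
        rest = cursor.fetchall()
    return count, first, batch, rest
```

Each fetch call continues from where the previous one stopped, so no row is returned twice.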

Reading common document types

Read a TXT document
- urlopen()
Read a PDF document
- pdfminer3k
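
For the TXT case, reading the document comes down to opening the URL and decoding the bytes. The sketch below is one way to do it; `read_text` is a made-up helper, and the demo uses a temporary local file through a `file://` URL so it runs without a network connection. An `http://` URL is used the same way.

```python
import pathlib
import tempfile
from urllib.request import urlopen


def read_text(url, encoding="utf-8"):
    """Open a URL that points to a plain-text document and decode it."""
    with urlopen(url) as resp:
        return resp.read().decode(encoding)


# Demo with a throwaway local file instead of a remote document.
with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "sample.txt"
    path.write_text("hello, crawler", encoding="utf-8")
    text = read_text(path.as_uri())

print(text)
```

If the document is not UTF-8, pass the right encoding, otherwise `decode` will raise an error or produce garbled text.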

Code

import re

import pymysql
from urllib.request import urlopen
from urllib.request import Request
from urllib import parse
from bs4 import BeautifulSoup

url = "http://baidu.com"
req = Request(url)
postData = parse.urlencode([
    ("StartStation", "#####"),
    ("EndStation", "#####"),
    ("####", "####"),
])
req.add_header("User-Agent", "Mozilla/5.0(Windows NT 10.0,WOW64) AppleWebKit/537.36(KHT)")
resp = urlopen(req, data=postData.encode("utf-8"))
# print(resp.read().decode("utf-8"))

# Parse the response with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Get every <a> tag whose href attribute starts with /wiki/
listUrls = soup.find_all("a", href=re.compile("^/wiki/"))

# Get a database connection
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="123456",
                             db="wikiurl",
                             charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Build the SQL statement
        sql = "insert into `urls`(`urlname`,`urlhref`) values (%s,%s)"
        # Print and store the name and URL of every entry
        for link in listUrls:
            # Skip URLs that end in .jpg/.JPG
            if not re.search(r"\.(jpg|JPG)$", link["href"]):
                # .string only returns a single string;
                # get_text() returns all the text under the tag
                href = "https://en.wikipedia.org" + link["href"]
                print(link.get_text(), "<--", href)
                # Execute the SQL statement
                cursor.execute(sql, (link.get_text(), href))
    # Commit
    connection.commit()
finally:
    connection.close()
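
The link filtering and the difference between `.string` and `get_text()` are easier to see on a small piece of HTML than on a live page. The snippet below is a sketch under that assumption: the HTML is invented for illustration, `extract_wiki_links` is a hypothetical helper, and it needs the beautifulsoup4 package.

```python
import re

from bs4 import BeautifulSoup

# Made-up HTML, for illustration only.
HTML = """
<div>
  <a href="/wiki/Python_(programming_language)">Python <b>language</b></a>
  <a href="/wiki/File:Logo.jpg">Logo</a>
  <a href="/about">About</a>
</div>
"""


def extract_wiki_links(html, base="https://en.wikipedia.org"):
    """Return (text, absolute URL) pairs for /wiki/ links that are not .jpg files."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=re.compile(r"^/wiki/")):
        # Skip image links such as /wiki/File:Logo.jpg
        if re.search(r"\.(jpg|JPG)$", a["href"]):
            continue
        links.append((a.get_text(), base + a["href"]))
    return links


links = extract_wiki_links(HTML)
print(links)

# The first <a> has two children (a text node and <b>), so .string is None,
# while get_text() joins all the text under the tag.
first = BeautifulSoup(HTML, "html.parser").find("a")
print(first.string, "|", first.get_text())
```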

Resources

Beautiful Soup 4.2.0 official documentation
