Python 3 crawler notes, plus a MySQL encoding fix
Source: Internet · Editor: 程序博客网 · Date: 2024/06/08 17:01
Straight to the code:
# -*- coding: utf-8 -*-
import re
import urllib.parse
from urllib.request import urlopen

import pymysql
import uuid


def unescape(text):
    """Expand numeric character references and a few named entities."""
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # numeric character reference, decimal or hex
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                return chr(int(text[2:-1]))
            except ValueError:
                print("invalid character reference:", text)
        else:
            # named entity
            if text[1:-1] == "amp":
                return "&"
            if text[1:-1] == "gt":
                return ">"
            if text[1:-1] == "lt":
                return "<"
            print("unknown entity:", text[1:-1])
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)


def getinfo(content):
    # The page indents each deeper category level with extra leading
    # spaces; remember the last code seen at each level so a row can
    # point at its parent category.
    one = two = three = parent = ''
    for line in content.split('\n'):
        bc = line.lstrip().split(' ')
        if len(bc) <= 1:
            continue
        if line[:8] == ' ' * 8:
            parent = three
        elif line[:6] == ' ' * 6:
            three = bc[0]
            parent = two
        elif line[:4] == ' ' * 4:
            two = bc[0]
            parent = one
        else:
            one = bc[0]
            parent = ''
        write(bc[0], bc[1], parent)


def resolve(url):
    url = 'http://www.tyut.edu.cn/xbsk/tougao/ztflh/' + url
    for data in urlopen(url):
        data = data.decode('gbk')
        clc = re.search('<pre class="style3">(.*?)</pre>', data,
                        re.I | re.M | re.S)
        if clc is not None:
            content = re.sub('<br>', '\n', clc.group(1))
            getinfo(content)


def write(no, name, parent):
    db = pymysql.connect('localhost', 'root', 'guddqs', 'eip')
    db.set_charset('utf8')
    cursor = db.cursor()
    # Force the whole connection to utf8 so the Chinese category names
    # survive the round trip into MySQL -- this is the encoding fix.
    cursor.execute('SET NAMES utf8;')
    cursor.execute('SET CHARACTER SET utf8;')
    cursor.execute('SET character_set_connection=utf8;')
    # Parameterized query: let pymysql do the quoting, which also closes
    # the SQL-injection hole of building the statement by concatenation.
    sql = ("INSERT INTO tlib_bookCategory"
           "(bookCategoryId,bookCategoryNo,bookCategoryName,parentId) "
           "VALUES (%s,%s,%s,%s)")
    try:
        cursor.execute(sql, (uuid.uuid1().urn[9:], no, name, parent))
        db.commit()
        print(no + ' ok')
    except Exception:
        db.rollback()
        print(no + ' failed')


for line in urlopen('http://www.tyut.edu.cn/xbsk/tougao/ztflh/'):
    line = line.decode('gb2312')
    back = re.search('<a href="(.*?)">', line, re.I | re.M | re.S)
    if back is not None:
        url = urllib.parse.quote(unescape(back.group(1)))
        resolve(url)
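Incidentally, the hand-rolled unescape above only knows amp/gt/lt plus numeric references; the standard library's html.unescape (available since Python 3.4) handles every named HTML entity, so the same step could be a one-liner. A minimal sketch (the sample string is just an illustration, not from the crawled site):

```python
import html

# html.unescape expands numeric references and all named entities,
# covering everything the custom unescape() handles and more.
link = html.unescape("ztflh/?a=1&amp;b=2&#65;")
print(link)  # ztflh/?a=1&b=2A
```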
Ah, and the code runs out of the box (just kidding): you still need to adapt the SQL part or create the table yourself, and
pip install PyMySQL
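For reference, here is a guessed-at table definition matching the INSERT above. The column names come from the code; the types and lengths are assumptions, not from the original post. Declaring the table itself as utf8 is the other half of the encoding fix:

```sql
-- Assumed schema inferred from the INSERT statement; adjust as needed.
CREATE TABLE tlib_bookCategory (
    bookCategoryId   VARCHAR(36)  NOT NULL PRIMARY KEY,  -- uuid1 text form
    bookCategoryNo   VARCHAR(32)  NOT NULL,              -- classification code
    bookCategoryName VARCHAR(255) NOT NULL,
    parentId         VARCHAR(36)  DEFAULT ''
) DEFAULT CHARSET=utf8;
```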