#Python Study Notes# Scraping Website Data with Python and Saving It to a Database

Source: Internet | Published by: linux mysql开机自启 | Editor: 程序博客网 | Time: 2024/06/05 08:21

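The previous post covered how to scrape website data with Python by extracting page elements and exporting the result to Excel. This post covers how to scrape data by fetching JSON and saving it to a MySQL database.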

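This article touches on three main topics:

1. Finding a site's API endpoints with a packet capture tool

2. Parsing JSON data with Python

3. Connecting to a database from Python and writing the data into it.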

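Packet capture is not the focus of this article. You can head over here, or search Baidu for "fiddler手机抓包" (Fiddler mobile packet capture), to learn more about it. Also, this Jianshu post lists the API endpoints of several sites, so you can get them there directly.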

OK, let's get straight to the point. First, here is how Python fetches JSON and parses it:

Fetching the JSON data:

import urllib.request

def getHtmlData(url):
    # Build the request with a mobile browser User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166  Safari/535.19'}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    data = response.read()
    # Decode the response bytes as UTF-8
    data = data.decode('utf-8')
    return data
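As a variation on the function above, the following sketch adds a timeout and takes the charset from the response headers rather than hard-coding UTF-8. The name `fetch_text`, the 10-second timeout, and the fallback to UTF-8 are illustrative assumptions and are not part of the original code.

```python
import urllib.request

# Same mobile User-Agent string the article uses.
USER_AGENT = ('Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
              'AppleWebKit/535.19 (KHTML, like Gecko) '
              'Chrome/18.0.1025.166  Safari/535.19')


def fetch_text(url, timeout=10):
    """Fetch a URL and return the body decoded as text (hypothetical helper)."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    # The with-block closes the response even if reading fails.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the headers; assume UTF-8 if none is given.
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset)
```

It can be called the same way as getHtmlData, for example `data = fetch_text(url)`.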

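Parsing the JSON:

Before parsing, let's look at what the JSON we received looks like (there is a lot of data, so some entries with the same structure have been left out):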

{
    "id": 1,
    "label": "头条",
    "prev": "https://api.dongqiudi.com/app/tabs/android/1.json?before=1658116800",
    "next": "https://api.dongqiudi.com/app/tabs/android/1.json?after=1500443152&page=2",
    "max": 1658116800,
    "min": 1500443152,
    "page": 1,
    "articles": [
        {
            "id": 375248,
            "title": "还记得他们吗？那些年，我们也有自己的留洋军团",
            "share_title": "还记得他们吗？那些年，我们也有自己的留洋军团",
            "description": "",
            "comments_total": 1026,
            "share": "https://www.dongqiudi.com/article/375248",
            "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/97/55/180x135/crop/-/pIYBAFlkjm-AMc7AAAL4n-oihZs769.jpg",
            "top": true,
            "top_color": "#4782c4",
            "url": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
            "url1": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
            "scheme": "dongqiudi:///news/375248",
            "is_video": false,
            "new_video_detail": null,
            "collection_type": null,
            "add_to_tab": "0",
            "show_comments": true,
            "published_at": "2022-07-18 12:00:00",
            "sort_timestamp": 1658116800,
            "channel": "article",
            "label": "深度",
            "label_color": "#4782c4"
        },
        {
            "id": 382644,
            "title": "连续三年英超主场负于水晶宫，今晚克洛普的扑克牌怎么打呢？",
            "share_title": "连续三年英超主场负于水晶宫，今晚克洛普的扑克牌怎么打呢？",
            "comments_total": 0,
            "share": "https://www.dongqiudi.com/article/382644",
            "thumb": "",
            "top": false,
            "top_color": "",
            "url": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
            "url1": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
            "scheme": null,
            "is_video": true,
            "new_video_detail": "1",
            "collection_type": null,
            "add_to_tab": null,
            "show_comments": true,
            "published_at": "2017-07-19 14:55:25",
            "sort_timestamp": 1500447325,
            "channel": "video"
        },
        {
            "id": 382599,
            "title": "梦想不会褪色！慈善机构圆孟买贫民区女孩儿的足球梦",
            "share_title": "梦想不会褪色！慈善机构圆孟买贫民区女孩儿的足球梦",
            "comments_total": 9,
            "share": "https://www.dongqiudi.com/article/382599",
            "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9C/D3/180x135/crop/-/o4YBAFlu8F2AcFtwAACX_DJbrwo612.jpg",
            "top": false,
            "top_color": "",
            "url": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
            "url1": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
            "scheme": null,
            "is_video": true,
            "new_video_detail": "1",
            "collection_type": null,
            "add_to_tab": null,
            "show_comments": true,
            "published_at": "2017-07-19 14:45:20",
            "sort_timestamp": 1500446720,
            "channel": "video"
        }
    ],
    "hotwords": "JJ同学",
    "ad": [],
    "quora": [
        {
            "id": 182,
            "type": "ask",
            "title": "足坛历史上有哪些有名的更衣室故事？",
            "ico": "",
            "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9B/BE/pIYBAFlt3uyACqEnAADhb9FVavU28.jpeg",
            "answer_total": 222,
            "scheme": "dongqiudi:///ask/182",
            "position": 7,
            "sort_timestamp": 1500533674,
            "published_at": "2017-07-20 14:54:34"
        }
    ]
}

Good. Now let's parse out the data in the articles array. Going through this step shows why Python is so popular:

First, import the package for parsing JSON:

import json

Then parse:

dataList = json.loads(data)['articles']

You read that right: this single step pulls out the articles JSON array.
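To make the step concrete, here is a minimal sketch run against a trimmed stand-in for the response. The `sample` string and its shortened titles are made up for illustration; only the key names come from the JSON shown earlier.

```python
import json

# Trimmed stand-in for the API response; only a few fields are kept.
sample = '''
{
    "label": "头条",
    "articles": [
        {"id": 375248, "title": "first", "comments_total": 1026},
        {"id": 382644, "title": "second", "comments_total": 0}
    ]
}
'''

parsed = json.loads(sample)      # JSON object -> Python dict
articles = parsed['articles']    # JSON array  -> Python list of dicts

ids = [article['id'] for article in articles]
print(type(articles).__name__, len(articles))
print(ids)
```

json.loads maps JSON objects to dicts and JSON arrays to lists, which is why plain indexing and looping are all that is needed afterwards.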

Next, take each object out of articles and put its fields into a Python list, to be written to the database later:

# Top-level "label" field of the response, stored with every row
first_label = json.loads(data)['label']

for newsObj in dataList:
    # print(newsObj.get('title'))
    # 20 column values, followed by comments_total once more for the update part of the insert
    newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
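The long list of newsObj.get(...) calls can also be driven by a tuple of field names, which keeps the column order in one place. This is only a sketch: `FIELDS` and `to_row` are hypothetical names introduced here, and the list still ends with the label and a second comments_total, as in the loop above.

```python
# Column order for the insert; the names mirror the JSON keys used above.
FIELDS = ('id', 'title', 'share_title', 'description', 'comments_total',
          'share', 'thumb', 'top', 'top_color', 'url', 'url1', 'scheme',
          'is_video', 'new_video_detail', 'collection_type', 'add_to_tab',
          'show_comments', 'published_at', 'channel')


def to_row(article, label):
    """Build the 21-value parameter list for one article (hypothetical helper)."""
    row = [article.get(field) for field in FIELDS]  # missing keys become None
    row.append(str(label))                          # top-level "label" field
    row.append(article.get('comments_total'))       # repeated, as in the original list
    return row


# Usage with a made-up article that lacks most fields.
example = to_row({'id': 1, 'title': 't', 'comments_total': 5}, '头条')
print(len(example))
```

Using dict.get rather than indexing matters here because, as the sample response shows, some articles lack keys such as description.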

That completes the JSON parsing. Next comes the database connection:

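import pymysql

# Execute an SQL statement
def executeSql(sql, values):
    # etAddress, etPort, etName, etPassWd and etDBName are defined elsewhere in the program (not shown)
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           passwd=str(etPassWd.get()), db=str(etDBName.get()))
    cursor = conn.cursor()
    conn.set_charset('utf8')
    effect_row = cursor.execute(sql, values)
    # Commit, otherwise inserted or modified data is not saved
    conn.commit()
    # Close the cursor
    cursor.close()
    # Close the connection
    conn.close()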

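Does this look familiar? Connecting to a database from Python is indeed much like doing so from Java and similar languages: open a connection using the MySQL host address, port, database user name, and password, get the result of the operation back through a cursor, and finally close both the cursor and the connection. (Connecting to MySQL from Python requires the pymysql package; install it with pip, then import it.) SQL statements are also written much as they are in Java. The whole process looks like this: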

import json
import pymysql

# Insert news articles
def insertNews(data):
    # Nothing to insert for an empty response such as "" or "{}"
    if len(data) <= 2:
        return
    parsed = json.loads(data)
    dataList = parsed['articles']
    first_label = parsed['label']
    for newsObj in dataList:
        # print(newsObj.get('title'))
        newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                    newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                    newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                    newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                    newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                    newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
        # 20 placeholders for the columns, plus one for the update on a duplicate key
        sql = "insert into news(id,title,share_title,description,comments_total," \
              "share,thumb,top,top_color,url,url1,scheme,is_video,new_video_detail," \
              "collection_type,add_to_tab,show_comments,published_at,channel,label) " \
              "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) " \
              "ON DUPLICATE KEY UPDATE comments_total = %s"
        executeSql(sql=sql, values=newsObjs)

# Execute an SQL statement
def executeSql(sql, values):
    # etAddress, etPort, etName, etPassWd and etDBName are defined elsewhere in the program (not shown)
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           passwd=str(etPassWd.get()), db=str(etDBName.get()))
    cursor = conn.cursor()
    conn.set_charset('utf8')
    effect_row = cursor.execute(sql, values)
    # Commit, otherwise inserted or modified data is not saved
    conn.commit()
    # Close the cursor
    cursor.close()
    # Close the connection
    conn.close()
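The connect, execute, commit, close flow can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module, which follows the same DB-API pattern as pymysql; it is not the article's MySQL code. The three-column table, the `execute_sql` helper, and the sample values are assumptions made for illustration, and the ON CONFLICT upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def execute_sql(conn, sql, values):
    """Run one parameterized statement and commit it (hypothetical helper)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql, values)
        conn.commit()          # without commit the change is not saved
        return cursor.rowcount
    finally:
        cursor.close()         # close the cursor even if execute fails


# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, comments_total INTEGER)')

# SQLite spells the upsert differently from MySQL's ON DUPLICATE KEY UPDATE
# and uses ? placeholders where pymysql uses %s.
upsert = ('INSERT INTO news (id, title, comments_total) VALUES (?, ?, ?) '
          'ON CONFLICT(id) DO UPDATE SET comments_total = excluded.comments_total')

execute_sql(conn, upsert, (375248, 'first', 1026))
execute_sql(conn, upsert, (375248, 'first', 1100))   # same id: only the count changes

rows = conn.execute('SELECT id, title, comments_total FROM news').fetchall()
print(rows)
conn.close()
```

Inserting the same id twice leaves a single row with the newer comment count, which is the effect the ON DUPLICATE KEY UPDATE clause has in the MySQL statement above.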

Finally, in main:

data = getHtmlData(url)
insertNews(data=data)

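Calling these two functions is all it takes, and the data ends up in the database.

Of course, you can also build a GUI for it and play around.

If anyone needs it, I will upload the demo as well, as a starting point for others.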
