Python-Based Retrieval System (2): The Crawler

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 09:12

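This post crawls the titles (or full text) from the news center of the University of Shanghai for Science and Technology (http://www.usst.edu.cn/s/1/t/517/p/2/i/411/list.htm) and stores them in News.txt. Basic use of regular expressions (the re module) and string handling is enough to do this.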

Import the requests module and call requests.get() on the list page URL. The response body is the page's HTML, which contains all the information we need.
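The fetch step can be sketched as below. This is a minimal sketch, not the author's exact code: the helper names build_list_url and fetch_list_page, the 10-second timeout, and the encoding fallback are assumptions added for illustration. The URL pattern is the one used in the article.

```python
import requests

# URL pattern taken from the article; {page} is the list page number.
BASE_URL = "http://www.usst.edu.cn/s/1/t/517/p/2/i/{page}/list.htm"


def build_list_url(page):
    """Return the URL of one list page (hypothetical helper)."""
    return BASE_URL.format(page=page)


def fetch_list_page(page, timeout=10):
    """Download one list page and return its HTML as text (hypothetical helper)."""
    response = requests.get(build_list_url(page), timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    # Assumption: the declared charset may be missing or wrong, so use the
    # encoding that requests detects from the response body instead.
    response.encoding = response.apparent_encoding
    return response.text
```

Calling fetch_list_page(1) would return the HTML of the first list page as a string, provided the site is reachable and still uses this URL scheme.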

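Inspecting the returned HTML shows that the news titles we need sit inside <font color=''> tags, and that the special bold titles are additionally wrapped in <b> tags, so they need a little extra handling. In the end, all recent news titles are written to News.txt.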
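The tag handling described above can be tried offline on a small HTML fragment. The fragment below is hypothetical and only imitates the markup the article describes, so the real page may differ in attributes and whitespace. The helper name extract_titles is also an assumption.

```python
import re

# Hypothetical fragment imitating the list page's markup.
SAMPLE_HTML = (
    "<a href='/a.htm'><font color=''>Campus news one</font></a>"
    "<a href='/b.htm'><font color=''><b>Bold headline</b></font></a>"
)

# Non-greedy match so each title stops at the first closing </font>.
TITLE_PATTERN = re.compile(r"<font color=''>(.*?)</font>")


def extract_titles(html):
    """Return the titles inside <font color=''> tags, with <b> markup removed."""
    cleaned = html.replace("<b>", "").replace("</b>", "")
    return TITLE_PATTERN.findall(cleaned)


print(extract_titles(SAMPLE_HTML))  # ['Campus news one', 'Bold headline']
```

Stripping the <b> tags before matching keeps the pattern simple: bold and plain titles then have the same shape, so one expression covers both.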

Code implementation:

import re

import requests


def Usst_News_Spider(page=1):
    url = "http://www.usst.edu.cn/s/1/t/517/p/2/i/" + str(page) + "/list.htm"
    full_text = requests.get(url)
    key_content = full_text.text
    # Handle the special strings: strip the <b> tags around bold titles
    content_left_treated = key_content.replace('<b>', '')
    content_right_treated = content_left_treated.replace('</b>', '')
    # Match the titles with a regular expression
    title = re.findall("<font color=''>(.*?)</font>", content_right_treated)
    print(title)
    print(key_content)  # debug output: the raw HTML of the page
    # Write one title per line to the file opened below
    for i in title:
        f.write(i)
        f.write("\n")


f = open("News.txt", "w", encoding='utf-8')
# Crawl list pages 1 through 379
for i in range(1, 380):
    Usst_News_Spider(i)
f.close()
