One-click download of the day's PDF files from ArXiv

Source: Internet | Published by: 宿舍知乎 | Editor: 程序博客网 | Time: 2024/06/05 04:36


(I work in astronomy, so astrophysics is used as the example here.)

ArXiv URL: https://arxiv.org/list/astro-ph/new

The complete code is as follows:

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup
import os
import re

html = urlopen("https://arxiv.org/list/astro-ph/new")
bsObj = BeautifulSoup(html, "lxml")


# Decide whether to download a paper: the pattern must match the title or the abstract.
def decide(title, abstract, regular):
    s1 = re.search(regular, str(title.get_text()))
    s2 = re.search(regular, str(abstract.get_text()))
    return s1 is not None or s2 is not None


# Fall back to the default pattern when nothing is entered.
regular = input("input regular expression:")
if regular == '':
    regular = "GRB|FRB|GW"

# Save the PDFs under path:
dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month

# Create the directory if it does not exist yet:
if not os.path.exists(path):
    os.makedirs(path)

titleList = bsObj.find_all("div", {"class": "list-title mathjax"})
for title in titleList:
    abstract = title.parent.find("p", {"class": "mathjax"})
    if abstract is not None:
        if decide(title, abstract, regular):      # download only matching papers
            download = title.parent.parent.previous_sibling.previous_sibling.find("a", {"title": "Download PDF"}).attrs['href']
            fileUrl = 'https://arxiv.org' + download
            savePath = path + download[5:] + '.pdf'   # download[5:] assumes the href starts with "/pdf/"
            if os.path.isfile(savePath):
                os.remove(savePath)             # overwrite the existing file
            urlretrieve(fileUrl, savePath)
            print('%s is done!' % title.get_text())
print('Finished')

Appendix:

For background on the HTML of the ArXiv listing page, see my blog post: http://blog.csdn.net/yuliumt/article/details/78763065

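Getting the year and month of the day's listing and naming the folders after them:

The date line on the ArXiv site has the following format: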

New submissions for Tue, 12 Dec 17

Extract the year and month:

dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month

Papers are saved to: F:\ArXiv\year\month

The files are sorted automatically by year and month under the F:\ArXiv folder; the location can be changed at path.
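
The date handling can also be tried on its own, without fetching the page. The following is only a minimal sketch: the function name build_save_dir and its root parameter are made up for illustration, and it assumes the date line still ends in "day month two-digit-year" as in the sample above. Unlike the split-based version, it raises an error when the line does not look as expected.

```python
import re


def build_save_dir(dateline_text, root="F:/ArXiv/"):
    """Return root/year/month/ for a line such as 'New submissions for Tue, 12 Dec 17'.

    Hypothetical helper: assumes the line ends with '<day> <Mon> <yy>'.
    """
    match = re.search(r"(\d{1,2})\s+([A-Za-z]{3})\s+(\d{2})\s*$", dateline_text)
    if match is None:
        # The page layout differs from what this sketch expects.
        raise ValueError("unexpected date line: %r" % dateline_text)
    _day, month, short_year = match.groups()
    return root + "20" + short_year + "/" + month + "/"


print(build_save_dir("New submissions for Tue, 12 Dec 17"))  # F:/ArXiv/2017/Dec/
```

The returned string can be passed to os.makedirs in the same way as path in the full script.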

Entering the regular expression:

input regular expression:

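Pressing Enter without typing anything uses the default, which searches for content related to GRB, FRB and GW; the default can be changed at regular. You can enter any regular expression that suits your needs.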
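
To see what a given pattern would select before running the downloader, the matching step can be tested on plain strings. This is a rough sketch only: the function name matches and the sample titles are invented for illustration, and it works on strings rather than on the BeautifulSoup tags used in the full script.

```python
import re

DEFAULT_PATTERN = "GRB|FRB|GW"  # same default as in the script


def matches(pattern, title_text, abstract_text):
    """Return True if the pattern is found in the title or in the abstract."""
    pattern = pattern or DEFAULT_PATTERN  # empty input falls back to the default
    return any(re.search(pattern, text) is not None
               for text in (title_text, abstract_text))


# Invented sample titles and abstracts, for illustration only.
print(matches("", "A new FRB localisation", "We report a repeating source."))
print(matches("", "Galaxy clustering", "We study halos."))
print(matches(r"[Mm]agnetar", "Spin-down limits", "A young magnetar is considered."))
```

Note that re.search is case-sensitive and matches substrings, so a short pattern such as GW also hits any word that merely contains those letters; a pattern like \bGW\d* is one way to narrow it.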

When Finished is printed, the download is complete.

If the HTML of the ArXiv page changes substantially, this code may no longer work!
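
One place where a page change would pass unnoticed is the slice download[5:], which silently assumes every PDF link starts with "/pdf/". A possible safeguard, sketched under that same assumption, is to check the prefix and stop with an error otherwise. The function name pdf_targets and the paper ID in the example are hypothetical.

```python
def pdf_targets(href, save_dir, base_url="https://arxiv.org"):
    """Return (download URL, local file path) for a link such as '/pdf/<id>'.

    Hypothetical helper: raises ValueError if the link lacks the '/pdf/' prefix.
    """
    prefix = "/pdf/"
    if not href.startswith(prefix):
        raise ValueError("unexpected PDF link: %r" % href)
    paper_id = href[len(prefix):]
    return base_url + href, save_dir + paper_id + ".pdf"


# "1712.00001" is a made-up ID used only to show the shape of the result.
url, save_path = pdf_targets("/pdf/1712.00001", "F:/ArXiv/2017/Dec/")
print(url)        # https://arxiv.org/pdf/1712.00001
print(save_path)  # F:/ArXiv/2017/Dec/1712.00001.pdf
```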
