One-click download of the day's PDF files from ArXiv

(I work in astronomy, so astrophysics is used as the example.)

ArXiv URL: https://arxiv.org/list/astro-ph/new

The complete code is as follows:

import os
import re
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

html = urlopen("https://arxiv.org/list/astro-ph/new")
bsObj = BeautifulSoup(html, "lxml")


# Decide whether to download a paper: True if the pattern matches the title or the abstract.
def decide(title, abstract, regular):
    s1 = re.search(regular, title.get_text())
    s2 = re.search(regular, abstract.get_text())
    return s1 is not None or s2 is not None


# Fall back to the default pattern when nothing is entered.
regular = input("input regular expression:")
if regular == '':
    regular = "GRB|FRB|GW"

# Save the PDFs under path, built from the date line on the page.
dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month

# Create the directory if it does not exist yet.
os.makedirs(path, exist_ok=True)

titleList = bsObj.find_all("div", {"class": "list-title mathjax"})
for title in titleList:
    abstract = title.parent.find("p", {"class": "mathjax"})
    if abstract is not None:
        if decide(title, abstract, regular):  # should this paper be saved locally?
            download = title.parent.parent.previous_sibling.previous_sibling.find("a", {"title": "Download PDF"}).attrs['href']
            fileUrl = 'https://arxiv.org' + download
            savePath = path + download[5:] + '.pdf'
            if os.path.isfile(savePath):
                os.remove(savePath)  # overwrite the existing file
            urlretrieve(fileUrl, savePath)
            print('%s is done!' % title.get_text())
print('Finished')
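
The step that turns a PDF link into a download URL and a local file name can be looked at on its own. The sketch below is only an illustration: the helper name pdf_target and the arXiv identifier 1712.04321 are made up for the example, and it assumes the link has the form /pdf/<identifier>, which is what the slice download[5:] in the script relies on.

```python
def pdf_target(href, path):
    """Return (fileUrl, savePath) for a relative PDF link such as '/pdf/1712.04321'.

    pdf_target is a hypothetical helper, not part of the original script.
    """
    prefix = '/pdf/'
    if not href.startswith(prefix):
        # Fail loudly instead of silently slicing the wrong characters.
        raise ValueError('unexpected PDF link: %r' % href)
    arxiv_id = href[len(prefix):]  # same result as href[5:] in the script
    return 'https://arxiv.org' + href, path + arxiv_id + '.pdf'


# The identifier below is invented for illustration.
print(pdf_target('/pdf/1712.04321', 'F:/ArXiv/2017/Dec/'))
```

Checking the prefix before slicing means that a change in the link format shows up as an explicit error rather than as an oddly named file.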

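Appendix:

For the HTML structure of the ArXiv page, see my blog post: http://blog.csdn.net/yuliumt/article/details/78763065

Getting the year and month of the current date, and naming the folder after them:

The date on the ArXiv site has the following format: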

New submissions for Tue, 12 Dec 17

Get the year and month:

dateline = bsObj.find("h3")
year = '20' + dateline.get_text().split(' ')[-1] + '/'
month = dateline.get_text().split(' ')[-2] + '/'
path = 'F:/ArXiv/' + year + month

Papers are saved under: F:\ArXiv\year\month

They are sorted automatically by year and month inside the F:\ArXiv folder; to use a different location, change path.
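
The date-to-folder step can be tried without fetching the page. This is a minimal sketch: month_dir and its root parameter are hypothetical names introduced here, and it uses split() with no argument (rather than split(' ') as in the script) so that stray whitespace around the date line does not end up in the folder name.

```python
def month_dir(dateline_text, root='F:/ArXiv/'):
    """Build the folder path from a date line like 'New submissions for Tue, 12 Dec 17'.

    month_dir is a hypothetical helper; the script does the same work inline.
    """
    parts = dateline_text.split()   # whitespace-tolerant, unlike split(' ')
    year = '20' + parts[-1]         # two-digit year, e.g. '17' -> '2017'
    month = parts[-2]               # month abbreviation, e.g. 'Dec'
    return root + year + '/' + month + '/'


print(month_dir('New submissions for Tue, 12 Dec 17'))  # F:/ArXiv/2017/Dec/
```

Passing a different root, for example month_dir(text, root='D:/papers/'), corresponds to editing path in the script.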

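Entering a regular expression:

input regular expression:
Pressing Enter without typing anything selects the default, which searches for content related to GRB, FRB and GW; the default can be changed at regular. Enter whatever regular expression suits your needs.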
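
To see how a pattern behaves before running a download, the matching rule can be tested on plain strings. The sketch below mirrors decide from the script but takes text instead of page elements; the function name wanted and the sample titles and abstracts are invented for illustration. Note that re.search is case-sensitive and matches anywhere in the text, so the default pattern only finds the upper-case abbreviations.

```python
import re

DEFAULT_PATTERN = "GRB|FRB|GW"  # the script's default


def wanted(title_text, abstract_text, regular=DEFAULT_PATTERN):
    """True if the pattern is found in the title or in the abstract."""
    return (re.search(regular, title_text) is not None
            or re.search(regular, abstract_text) is not None)


# Made-up sample entries, not real papers.
print(wanted("Title: A repeating FRB in a dwarf galaxy", "We report a detection."))  # True
print(wanted("Title: Dust in nearby galaxies", "We map dust emission."))             # False
# The inline flag (?i) makes a pattern case-insensitive.
print(wanted("Title: Dust in nearby galaxies", "We map dust emission.", "(?i)dust")) # True
```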

When Finished is printed, the download is complete.

If the HTML of the ArXiv page changes substantially, this code may no longer work!

