Using wget to download a website for offline browsing

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 17:57

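This site has good content: it is an introduction to elisp. But the pages load slowly, which is annoying. Being able to browse it offline would be much nicer.
So let's download it: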
wget -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5)" -c -r -np -k -L -p -A c,h http://www.delorie.com/gnu/docs/emacs-lisp-intro/emacs-lisp-intro.html#SEC_Top
--2016-08-03 13:37:48-- http://www.delorie.com/gnu/docs/emacs-lisp-intro/emacs-lisp-intro_1.html
Reusing existing connection to www.delorie.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘emacs-lisp-intro_1.html’

    [ <=>                                                                                            ] 6,028       20.3KB/s   in 0.3s

2016-08-03 13:37:49 (20.3 KB/s) - ‘emacs-lisp-intro_1.html’ saved [6028]

Removing emacs-lisp-intro_1.html since it should be rejected.

--2016-08-03 13:37:49-- http://www.delorie.com/gnu/docs/emacs-lisp-intro/emacs-lisp-intro_8.html
Reusing existing connection to www.delorie.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘emacs-lisp-intro_8.html’

    [ <=>                                                                                             ] 6,654       --.-K/s   in 0.009s

2016-08-03 13:37:49 (740 KB/s) - ‘emacs-lisp-intro_8.html’ saved [6654]

Removing emacs-lisp-intro_8.html since it should be rejected.

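The log shows pages being downloaded one after another, yet the download directory stays empty. Sometimes a file name shows up for a moment and then disappears again because the file has been deleted.
That was puzzling.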

After a closer look I corrected the command. The following one downloads the site:

wget -U "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5)" -c -r -np -k -L -p http://www.delorie.com/gnu/docs/emacs-lisp-intro/emacs-lisp-intro.html#SEC_Top

Explanation of the options:
-c, --continue           resume getting a partially-downloaded file.
-r, --recursive          specify recursive download.
-np, --no-parent         don't ascend to the parent directory.
-k, --convert-links      make links in downloaded HTML or CSS point to local files.
-L, --relative           follow relative links only (absolute links are not followed).
-p, --page-requisites    get all images, etc. needed to display HTML page.
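
The working command can also be written with the long option names listed above, which makes it easier to read later. This is only a sketch: the function name mirror_site and the WGET override variable are made up for illustration, and the User-Agent string is the one used in this post.

```shell
# User-Agent string copied from the command in this post.
UA='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5)'

# mirror_site URL: run wget with the long form of -U -c -r -np -k -L -p.
# WGET is a hypothetical override hook (not a wget feature);
# set WGET=echo to print the arguments instead of downloading.
mirror_site() {
    "${WGET:-wget}" \
        --user-agent="$UA" \
        --continue \
        --recursive \
        --no-parent \
        --convert-links \
        --relative \
        --page-requisites \
        "$1"
}
```

Calling mirror_site with the start URL runs the download. With WGET=echo in front of the call, the function only prints the arguments it would pass, which is a cheap way to check the command before touching the network.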

Problems encountered:
1. Without the -U option, the site refuses the download:
Resolving www.delorie.com (www.delorie.com)... 65.175.133.15
Connecting to www.delorie.com (www.delorie.com)|65.175.133.15|:80... connected.
HTTP request sent, awaiting response... 403 Bulk download prohibited due to recursion abuse. If you promise to be careful, use wget -U
-U sets the user agent that wget reports to the server. In this case an empty agent string also works; no specific agent name is needed.
-U, --user-agent=AGENT      identify as AGENT instead of Wget/VERSION.

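2. Using the -A option incorrectly deletes every file that is not on the list
-A, --accept=LIST               comma-separated list of accepted extensions.
In the example, -A c,h says that only .c and .h files are wanted, so each HTML file is deleted right after it is downloaded. (wget still fetches the HTML pages because it needs them to find further links.) Had the pages not been downloaded at all, the problem would have been even harder to notice.
So it is best not to use options you do not understand. The defaults generally suit most people.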
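
To see why the HTML pages vanished, here is a simplified model of the accept list. It assumes each list entry is compared against the end of the file name and ignores wildcard patterns. The function name accepted is hypothetical, and this is not wget's actual code.

```shell
# accepted NAME LIST: succeed if NAME ends with one of the
# comma-separated suffixes in LIST (simplified model, no wildcards).
accepted() {
    name=$1
    old_ifs=$IFS
    IFS=,                      # split LIST on commas
    for suffix in $2; do
        case $name in
            *"$suffix") IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# main.c and defs.h are made-up file names for comparison.
for f in emacs-lisp-intro_1.html main.c defs.h; do
    if accepted "$f" "c,h"; then
        echo "$f: kept"
    else
        echo "$f: removed after download"
    fi
done
```

With the list c,h the HTML page is reported as removed while the two source files are kept. In this model a list such as html,c,h would keep the page.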

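In addition, I used wkhtmltopdf to convert the pages to PDF. The pages have to be converted one at a time, so I wrote a batch script. However, it does not seem to support links that jump between pages.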
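
The original batch script is not shown, so the following is only a guess at what such a loop might look like. It assumes the downloaded pages sit in one directory and relies on the basic wkhtmltopdf form of input file followed by output file. The names convert_pages and WKHTMLTOPDF are hypothetical.

```shell
# convert_pages DIR: convert every .html file in DIR to a .pdf beside it.
# WKHTMLTOPDF is a hypothetical override hook; set it to echo for a dry run.
convert_pages() {
    for page in "$1"/*.html; do
        [ -e "$page" ] || continue    # no match: the glob stays literal
        "${WKHTMLTOPDF:-wkhtmltopdf}" "$page" "${page%.html}.pdf"
    done
}
```

Each page is converted independently. That fits the observation that links between pages stop working once every page is a separate PDF.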

Then I used gs to join the PDFs into one file. Note that the output file must not have the same name as any input file.
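
A minimal sketch of the merge step, assuming Ghostscript's pdfwrite device is used. The exact gs command from the original setup is not shown, and the names merge_pdfs and GS are hypothetical. The function refuses to run when the output name also appears among the inputs.

```shell
# merge_pdfs OUT IN...: concatenate the input PDFs into OUT with Ghostscript.
# GS is a hypothetical override hook; set GS=echo for a dry run.
merge_pdfs() {
    out=$1
    shift
    for input in "$@"; do
        # Plain string comparison: a.pdf and ./a.pdf are not detected as equal.
        if [ "$input" = "$out" ]; then
            echo "merge_pdfs: output '$out' is also an input" >&2
            return 1
        fi
    done
    "${GS:-gs}" -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
        -sOutputFile="$out" "$@"
}
```

The inputs are merged in the order given. When passing a glob such as *.pdf, keep in mind that the shell sorts names as text, so a page numbered _10 comes before _2.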

The result is usable, but it falls short of what I had hoped for.

0 0