Java crawler: fetch a web page's source code directly from its URL

Source: Internet; Editor: 程序博客网; Time: 2024/05/21 14:53

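All I really wanted was the full source code of a web page. The Java library jsoup can fetch it directly; a link to download the source code (with the jsoup jar included) is at the end of this post.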

Input: web page URL

Output: web page source code

The code is fairly simple, and the explanations are in the comments:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) {
        // List every URL you want to fetch here
        String[] urlPath = new String[]{
                "http://daily.zhihu.com/"
        };
        for (String anUrlPath : urlPath) {
            try {
                Document document = Jsoup.connect(anUrlPath)
                        .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)")
                        .get();
                // document.html() now holds the whole page; print it to the console if you want some visible output
                // Use the URL as the file name: drop the scheme, then remove characters that are illegal in file names
                String pathname = anUrlPath.replaceFirst("^https?://", "")
                        .replaceAll("[/\\\\:*?\"<>| ]", "") + ".txt";
                // Save the content locally (append mode); try-with-resources closes the stream
                try (FileOutputStream os = new FileOutputStream(pathname, true)) {
                    // Write the page URL on the first line so later programs can identify where the content came from
                    os.write(anUrlPath.getBytes(StandardCharsets.UTF_8));
                    os.write("\r\n".getBytes(StandardCharsets.UTF_8));
                    os.write(document.html().getBytes(StandardCharsets.UTF_8));
                }
            } catch (Exception e) {
                // On errors such as a DNS failure or access denied, log to exception.txt and keep the program running
                try (FileOutputStream os = new FileOutputStream("exception.txt", true)) {
                    os.write(e.toString().getBytes(StandardCharsets.UTF_8));
                    os.write("\r\n".getBytes(StandardCharsets.UTF_8));
                    System.out.println(e);
                } catch (Exception e1) {
                    System.out.println(e1);
                }
            }
        }
    }
}

Just put the URLs you want to fetch into urlPath and run the program. Following the crowd, I used the Zhihu Daily (知乎日报) URL as the example.

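Afterwards a file holding the page source appears in the current directory: daily.zhihu.com.txt. Exceptions raised while fetching or saving a page do not stop the program; the error message is written to exception.txt instead. Both files are opened in append mode, so running the program again adds to them rather than overwriting them.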
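The file-name rule described above can be pulled out into a small helper. What follows is only a minimal sketch: the class name FileNames and the method toFileName are hypothetical and do not appear in the original code. It assumes you also want to strip https:// and the characters *, ? and ", which Windows rejects in file names as well.

```java
final class FileNames {
    // Hypothetical helper: turns a URL into a flat file name ending in ".txt".
    static String toFileName(String url) {
        // Drop the scheme (http:// or https://) at the start of the URL
        String name = url.replaceFirst("^https?://", "");
        // Remove characters that are illegal in file names, plus any whitespace
        name = name.replaceAll("[/\\\\:*?\"<>|\\s]", "");
        return name + ".txt";
    }

    public static void main(String[] args) {
        // prints "daily.zhihu.com.txt"
        System.out.println(toFileName("http://daily.zhihu.com/"));
    }
}
```

Because every slash is removed, two different paths on the same host can collapse into the same file name; that is acceptable for a handful of URLs but worth keeping in mind for larger lists.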

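Once you have a site's source code, you can extract information from the page in whatever way you like. If I find the time, I will write another post on crawling an entire site: give it several links, save the results into a folder, and pull down the whole site's code.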
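As a first step toward that kind of processing, a saved file can be read back and split into its URL line and its HTML body. The sketch below is one possible way to do it, not part of the original program: the class SavedPage and its methods are hypothetical names. It assumes the file holds a single page (the crawler appends, so repeated runs put several pages in one file) and that the URL line ends with "\r\n", as written by the code above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SavedPage {
    final String url;   // first line of the file
    final String html;  // everything after the first line break

    SavedPage(String url, String html) {
        this.url = url;
        this.html = html;
    }

    // Splits saved content into URL and HTML at the first "\r\n".
    static SavedPage parse(String content) {
        int lineEnd = content.indexOf("\r\n");
        if (lineEnd < 0) {
            // No line break found: treat the whole content as the URL, with an empty body
            return new SavedPage(content, "");
        }
        return new SavedPage(content.substring(0, lineEnd), content.substring(lineEnd + 2));
    }

    // Reads a file written by the crawler and decodes it as UTF-8.
    static SavedPage read(Path file) throws IOException {
        return parse(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Assumed default file name; pass another path as the first argument
        SavedPage page = read(Paths.get(args.length > 0 ? args[0] : "daily.zhihu.com.txt"));
        System.out.println(page.url + " -> " + page.html.length() + " characters of HTML");
    }
}
```

The html string can then be handed to jsoup's Jsoup.parse(html, url) if you want selector-based extraction, with the stored URL serving as the base for resolving relative links.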

If you are interested, you can download my source code, which includes the jsoup jar: http://download.csdn.net/download/weixin_35757704/10013327
