Java Crawler (Part 1)

Source: Internet | Published by: Harbin Network Radio and Television | Editor: Programmer Blog Network | Time: 2024/06/04 18:51

The basic principle of a web crawler is to parse the content it fetches by pattern matching, usually implemented with regular expressions. With that in mind, let's build a web crawler in Java step by step.

Step 1: write a regular expression to match a string (this expression is not the only possible one; it is for reference only)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test02 {
    public static void main(String[] args) {
        // String to match against
        String str = "1@qq.com2@qq.com3@qq.com4@qq.com";
        // Set the matching rule
        String regexp = "\\w+(\\.\\w)*@\\w+(\\.\\w{2,3}){1,3}";
        // Compile the rule
        Pattern regpattern = Pattern.compile(regexp);
        // Match
        Matcher matcher = regpattern.matcher(str);
        // Print every match found
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
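One limitation worth noting: the `(\.\w)*` group allows only a single character after each dot, so a local part like `john.doe` matches only partially. A minimal sketch comparing the article's pattern with a slightly more permissive variant (the `\.\w+` tweak is my own suggestion, not from the original):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexVariant {
    // Collect every match of the given pattern in the input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "john.doe@qq.com";
        // Article's pattern: only one character allowed after each dot in the local part,
        // so the match starts after "john." and drops part of the address.
        System.out.println(findAll("\\w+(\\.\\w)*@\\w+(\\.\\w{2,3}){1,3}", input));   // [doe@qq.com]
        // With \\w+ after the dot, multi-character segments are accepted.
        System.out.println(findAll("\\w+(\\.\\w+)*@\\w+(\\.\\w{2,3}){1,3}", input));  // [john.doe@qq.com]
    }
}
```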
Step 2: save the web page as a local file, then read the file with an input stream and parse it. Page link: https://www.douban.com/event/14146775/discussion/40108760/

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test03 {
    public static void main(String[] args) throws IOException {
        // Open the saved page file
        File file = new File("E:/tmp/在这里留下邮箱_.html");
        // Set the regex rule
        String regStr = "\\w+(\\.\\w)*@\\w+(\\.\\w{2,3}){1,3}";
        Pattern pattern = Pattern.compile(regStr);
        // Create the input stream
        FileReader fr = new FileReader(file);
        BufferedReader br = new BufferedReader(fr);
        // String builder to accumulate the file contents
        StringBuilder strBlder = new StringBuilder();
        String content = br.readLine();
        // Read line by line into strBlder
        while (content != null) {
            strBlder.append(content);
            content = br.readLine();
        }
        // Match and print the results
        Matcher matcher = pattern.matcher(strBlder);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
        // Close the stream
        br.close();
    }
}
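On Java 11 and later, the whole FileReader/BufferedReader read loop can be collapsed into a single `Files.readString` call. A self-contained sketch (it writes a tiny stand-in page to a temp file so it runs anywhere; in the article this would be the downloaded HTML file):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadStringSketch {
    // Extract every address the article's pattern finds in the given text.
    static List<String> extractEmails(String content) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+(\\.\\w)*@\\w+(\\.\\w{2,3}){1,3}").matcher(content);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the saved page so the demo is self-contained.
        Path page = Files.createTempFile("page", ".html");
        Files.writeString(page, "<p>contact: 1@qq.com and 2@qq.com</p>", StandardCharsets.UTF_8);

        // Files.readString (Java 11+) replaces the manual read-append loop
        // and decodes with an explicit charset instead of the platform default.
        String content = Files.readString(page, StandardCharsets.UTF_8);
        System.out.println(extractEmails(content));  // prints [1@qq.com, 2@qq.com]

        Files.deleteIfExists(page);
    }
}
```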

Step 3: pick a URL and crawl the page content directly, using the URL class to open the connection

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test04 {
    public static void main(String[] args) throws IOException {
        // Open a connection to the target URL
        URL url = new URL("https://www.douban.com/event/14146775/discussion/40108760/");
        URLConnection uc = url.openConnection();
        // Get the input stream
        InputStream is = uc.getInputStream();
        // Wrap it as a character stream
        InputStreamReader isr = new InputStreamReader(is);
        // Buffer it for line-by-line reading
        BufferedReader br = new BufferedReader(isr);
        // String builder to accumulate the page content
        StringBuilder strBlder = new StringBuilder();
        String content = br.readLine();
        // Read the page line by line into strBlder
        while (content != null) {
            strBlder.append(content);
            content = br.readLine();
        }
        // Set the regex rule
        String regStr = "\\w+(\\.\\w)*@\\w+(\\.\\w{2,3}){1,3}";
        Pattern pattern = Pattern.compile(regStr);
        // Match and print every result
        Matcher matcher = pattern.matcher(strBlder);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
        // Close the stream
        br.close();
    }
}
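Two practical caveats with this approach: it decodes the response with the platform default charset, and it sends Java's default User-Agent, which some sites reject. A minimal hardened sketch (the header value and timeout numbers are illustrative assumptions, not from the original):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchSketch {
    // Read an entire stream as UTF-8 text, one line per iteration.
    static String readAll(InputStream in) throws IOException {
        StringBuilder page = new StringBuilder();
        // Decode explicitly as UTF-8 rather than the platform default.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Download a page as text, sending a browser-like User-Agent.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");  // illustrative value
        conn.setConnectTimeout(5000);  // fail fast instead of hanging
        conn.setReadTimeout(5000);
        return readAll(conn.getInputStream());
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("https://www.douban.com/event/14146775/discussion/40108760/"));
    }
}
```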

Summary: having worked through these three examples, we now have an entry-level understanding of crawlers. Later posts will dig deeper into how crawlers work and how to implement them.

Example 1 keeps the text to match in a plain String: since nothing is built up in a loop, the literal is allocated only once (in the method area's string constant pool) for the whole run. Examples 2 and 3 switch to the non-thread-safe StringBuilder to improve efficiency. Note: when assembling strings inside a loop, prefer StringBuffer or StringBuilder; the latter is faster but not thread-safe.
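The note above can be illustrated with a toy comparison. Both methods produce the same string, but `+=` on a String compiles to a fresh hidden StringBuilder plus a new String on every iteration, while the explicit StringBuilder reuses one internal buffer across the loop:

```java
public class ConcatDemo {
    // String += in a loop: each iteration copies everything accumulated so far.
    static String concatWithString(int n) {
        String s = "";
        for (int i = 0; i < n; i++) {
            s += i;  // compiles to roughly: new StringBuilder(s).append(i).toString()
        }
        return s;
    }

    // One StringBuilder reused across the loop: amortized constant cost per append.
    static String concatWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same result either way; the difference is allocation behavior, not output.
        System.out.println(concatWithString(5));   // prints 01234
        System.out.println(concatWithBuilder(5));  // prints 01234
    }
}
```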
