Crawler Summary && Some Regex Matching

Source: Internet | Published by: 淘宝注册资金 | Editor: 程序博客网 | Time: 2024/06/08 04:51

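The workflow is roughly:

First, use multiple threads to connect to the target site over HTTP and fetch the HTML as a string. The utility classes in the java.net package work for this, as do other open-source packages.
Next, parse the HTML tags with regular expressions. There are plenty of examples online, or you can use an open-source parser.
That gives you a basic crawler. The remaining problems are how to avoid crawling the same page twice, how to avoid following links to other resources, and how to handle crawl directories.
All of these are well covered online. Search Google with keywords such as htmlparser, httpclient, and crawler depth (爬虫层级).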
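The two "avoid" problems above can be sketched in a few lines. This is a minimal sketch, not a full crawler: the class name CrawlFrontier, the method shouldVisit, and the example hosts are all made up for illustration, and fetching is left out. It keeps a thread-safe set of pages already seen and rejects links that point to another host.

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: decides whether a discovered link should be crawled.
class CrawlFrontier {
    private final String allowedHost;                                // only this host is crawled
    private final Set<String> seen = ConcurrentHashMap.newKeySet();  // thread-safe visited set

    CrawlFrontier(String allowedHost) {
        this.allowedHost = allowedHost;
    }

    // Returns true only the first time an in-scope URL is offered.
    boolean shouldVisit(String url) {
        URI uri;
        try {
            uri = URI.create(url).normalize();
        } catch (IllegalArgumentException e) {
            return false;                                            // malformed link
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(allowedHost)) {
            return false;                                            // relative link or another site
        }
        // Leave the fragment out of the key so page#a and page#b count as one page.
        String key = uri.getScheme() + "://" + host.toLowerCase() + uri.getRawPath()
                + (uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery());
        return seen.add(key);                                        // false if already present
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier("www.example.com");
        System.out.println(frontier.shouldVisit("http://www.example.com/a.html"));     // true
        System.out.println(frontier.shouldVisit("http://www.example.com/a.html#top")); // false
        System.out.println(frontier.shouldVisit("http://other.example.org/b.html"));   // false
    }
}
```

Because the set comes from ConcurrentHashMap.newKeySet(), several fetch threads can share one frontier without extra locking.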

Search term for Google or Baidu: java html解析器 (Java HTML parser)

// Match a URL

(\w+(-\w+)*)(\.(\w+(-\w+)*))*(\?\S*)?$
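To see what this expression accepts, here is a small sketch in Java. The class and method names are made up for the example. As written, the expression has no part for a scheme such as http:// and none for a path, so it describes dot-separated names with an optional query string.

```java
import java.util.regex.Pattern;

class UrlPatternDemo {
    // The expression from the text; each backslash is doubled for the Java literal.
    static final Pattern HOST_AND_QUERY =
            Pattern.compile("(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\S*)?$");

    // True when the whole input has the shape the pattern describes.
    static boolean looksLikeHost(String s) {
        return HOST_AND_QUERY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHost("www.example-site.com?x=1")); // true
        System.out.println(looksLikeHost("http://www.example.com/a")); // false
    }
}
```

The second call fails because neither the colon nor the slashes are covered by the pattern. Accepting full URLs would need an extra prefix and path part.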

To match Chinese characters in a regular expression, put this range inside a character class:

[\u4e00-\u9fa5]
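A quick way to try the range is to pull every run of Chinese characters out of a mixed string. The sketch below assumes Java and uses made-up names (HanRunDemo, hanRuns). The sample text is written with Unicode escapes so it does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HanRunDemo {
    // One or more characters in the range U+4E00..U+9FA5.
    static final Pattern HAN_RUN = Pattern.compile("[\\u4e00-\\u9fa5]+");

    // Collects every maximal run of characters in that range.
    static List<String> hanRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = HAN_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The literal spells "abc", then U+5E74 U+9F84, then "123", then U+5C81.
        System.out.println(hanRuns("abc\u5e74\u9f84123\u5c81"));
    }
}
```

For that input the method returns two runs: 年龄 and 岁.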

We know that \d matches a digit, \w matches a word character, and \n matches a newline. But what matches a double quote (")?

A double quote in a regular expression:

\u0022
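In Java the escape is handy because it avoids a literal quote inside the pattern string. A small sketch (names are made up) that captures the text between the first pair of double quotes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Quote, then anything that is not a quote, then the closing quote.
    static final Pattern QUOTED = Pattern.compile("\\u0022([^\\u0022]*)\\u0022");

    // Returns the first quoted value, or null when the text has no quoted part.
    static String firstQuoted(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("<span id=\"webspan\">")); // webspan
    }
}
```

The backslash has to be doubled in the string literal. With a single backslash the Java compiler would turn the escape into a real quote character before the regex engine ever sees it, which ends the string early.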

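Matching the title:

<title>([^<]*)

Note that a class such as [^</title>] excludes each of the characters <, /, t, i, l, e and > individually, not the string </title>, so it would cut the title off at its first t, i, l or e. [^<]* stops only at the next tag.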
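A character class always works on single characters, which matters for title extraction. The sketch below (Java, made-up names and sample HTML) compares a capture that stops at the next < with one that uses the class [^</title>].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleDemo {
    // Stops at the next tag.
    static final Pattern UP_TO_TAG = Pattern.compile("<title>([^<]*)");
    // Stops at the first of the characters < / t i l e >.
    static final Pattern PER_CHAR = Pattern.compile("<title>([^</title>]*)");

    // Returns group 1 of the first match, or null when nothing matches.
    static String capture(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Crawler notes</title></head>";
        System.out.println(capture(UP_TO_TAG, html)); // Crawler notes
        System.out.println(capture(PER_CHAR, html));  // Craw
    }
}
```

The second pattern only looks correct when the title happens to contain none of those letters, for example a title made of digits.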

Given this HTML:

<span name="shangcode" id="shangcode">0501010320</span>

use the following regular expression to capture exactly the text inside the span.
Regex (backslashes):

<span\sname=\u0022shangcode\u0022\sid=\u0022shangcode\u0022>([^<]*)

Java regex (every backslash is doubled inside a Java string literal):

<span\\sname=\\u0022shangcode\\u0022\\sid=\\u0022shangcode\\u0022>([^<]*)
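In Java source the escapes need backslashes, and each one is doubled inside the string literal. Here is a minimal sketch of the shangcode extraction under that assumption. The class name, method name, and the surrounding td tags in the sample are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShangcodeDemo {
    // Opening span tag with both attributes, then the text up to the next tag.
    static final Pattern CODE = Pattern.compile(
            "<span\\sname=\\u0022shangcode\\u0022\\sid=\\u0022shangcode\\u0022>([^<]*)");

    // Returns the span text, or null when the span is not present.
    static String extractCode(String html) {
        Matcher m = CODE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><span name=\"shangcode\" id=\"shangcode\">0501010320</span></td>";
        System.out.println(extractCode(html)); // 0501010320
    }
}
```

The pattern is strict about attribute order and spacing, so it only fits pages generated from the same template.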

Given this HTML:

<span class="s2" id="webspan"> 209.00 </span>

use the following regular expression to capture exactly the price.
Regex (backslashes):

<span\sclass=\u0022s2\u0022\sid=\u0022webspan\u0022>([^<]*)

Java regex (every backslash is doubled inside a Java string literal):

<span\\sclass=\\u0022s2\\u0022\\sid=\\u0022webspan\\u0022>([^<]*)
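The captured price still carries the spaces around it, so a crawler would normally trim and parse it. A sketch in Java with made-up names, where BigDecimal is my own choice for the numeric type:

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceDemo {
    // Opening span tag for the price, then the text up to the next tag.
    static final Pattern PRICE = Pattern.compile(
            "<span\\sclass=\\u0022s2\\u0022\\sid=\\u0022webspan\\u0022>([^<]*)");

    // Returns the price, or null when the span is missing.
    // Throws NumberFormatException if the span text is not a number.
    static BigDecimal extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new BigDecimal(m.group(1).trim()); // " 209.00 " becomes 209.00
    }

    public static void main(String[] args) {
        System.out.println(extractPrice("<span class=\"s2\" id=\"webspan\"> 209.00 </span>")); // 209.00
    }
}
```

BigDecimal keeps the two decimal places exactly, which a double would not guarantee for money values.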

Given this HTML:

<span title="0-1岁">0-1岁</span><span title="6-12个月">6-12个月</span>

use the following regular expression to capture exactly the age value.

Regex (backslashes):

<span\stitle=\u00220-1\u5c81\u0022>([^<]*)

Java regex (doubled backslashes):

<span\\stitle=\\u00220-1\\u5c81\\u0022>([^<]*)
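The expression above is tied to one specific title value. The sketch below runs it in Java next to a looser variant that accepts any title. The variant and all names are my own assumptions, not part of the original notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpanTitleDemo {
    // The expression from the text: only the span whose title is "0-1" + U+5C81.
    static final Pattern ONE_AGE =
            Pattern.compile("<span\\stitle=\\u00220-1\\u5c81\\u0022>([^<]*)");

    // Assumed generalisation: any title value, capturing the span text.
    static final Pattern ANY_TITLE =
            Pattern.compile("<span\\stitle=\\u0022[^\\u0022]*\\u0022>([^<]*)");

    // Collects group 1 of every match.
    static List<String> captures(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<span title=\"0-1\u5c81\">0-1\u5c81</span>"
                + "<span title=\"6-12\u4e2a\u6708\">6-12\u4e2a\u6708</span>";
        System.out.println(captures(ONE_AGE, html).size());   // 1
        System.out.println(captures(ANY_TITLE, html).size()); // 2
    }
}
```

The specific pattern is the safer one when a page has many titled spans and only one of them is wanted.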

Given this HTML (年龄 means "age"):

年龄：</td>

the matching regular expression is:

\u5e74\u9f84\uff1a</td>

Given this HTML:

适合年龄：</td><td width="631" bgcolor="#FFFFFF">

candidate expressions are:

\u5e74\u9f84[^*]{20,58}([^\u0022>]*)
\u5e74\u9f84\uff1a</td>[^\s]{20,22}
\u5e74\u9f84\uff1a</td>[^*]{1,}

Here [^*] means any character other than an asterisk, so unlike . it also matches line breaks.
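The expressions above find the label and then skip a fixed number of characters. Another option, shown here only as an assumption about what the goal is, is to anchor on the next td tag and capture its text. The cell content in the sample is invented, since the HTML in the notes stops at the opening tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AgeCellDemo {
    // Label U+5E74 U+9F84 U+FF1A, the closing </td>, then the next cell's text.
    static final Pattern AGE_CELL = Pattern.compile(
            "\\u5e74\\u9f84\\uff1a</td>\\s*<td[^>]*>([^<]*)");

    // Returns the trimmed text of the cell after the label, or null if absent.
    static String ageCell(String html) {
        Matcher m = AGE_CELL.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // The cell text after the opening <td ...> is made up for this example.
        String html = "\u9002\u5408\u5e74\u9f84\uff1a</td>"
                + "<td width=\"631\" bgcolor=\"#FFFFFF\">0-1\u5c81</td>";
        System.out.println(ageCell(html));
    }
}
```

Anchoring on the tag rather than on a character count keeps working when the attributes of the cell change length.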

Given this HTML:

<a id="bighref" href="http://www.***.com/images/product/8b/b0/8bb05984b23b470593694b7d4d1da2b5_1_l.jpg" class="MagicZoom">

use the following regular expression to capture exactly the href value.

Regex (backslashes):

<a\sid=\u0022bighref\u0022\shref=\u0022([^\u0022]*)

Java regex (doubled backslashes):

<a\\sid=\\u0022bighref\\u0022\\shref=\\u0022([^\\u0022]*)
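Put to work in Java, the image link extraction could look like the sketch below. The class and method names are made up, and the masked host in the sample URL is kept exactly as it appears in the notes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageLinkDemo {
    // The anchor with id "bighref", capturing everything inside the href quotes.
    static final Pattern BIG_HREF = Pattern.compile(
            "<a\\sid=\\u0022bighref\\u0022\\shref=\\u0022([^\\u0022]*)");

    // Returns the href value, or null when the anchor is not present.
    static String extractHref(String html) {
        Matcher m = BIG_HREF.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a id=\"bighref\" href=\"http://www.***.com/images/product/8b/b0/"
                + "8bb05984b23b470593694b7d4d1da2b5_1_l.jpg\" class=\"MagicZoom\">";
        System.out.println(extractHref(html));
    }
}
```

Stopping at the next double quote is what keeps the class attribute out of the captured link.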

Lazy matching with .*? stops at the first place where the rest of the pattern can match:

\u5e74\u9f84\uff1a.*?\u0022([^\u0022>]*)
love.*?you
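The difference between .* and .*? is easiest to see side by side. A small Java sketch with a made-up sentence and made-up names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "I love you, and you love me";
        System.out.println(firstMatch("love.*you", text));  // love you, and you
        System.out.println(firstMatch("love.*?you", text)); // love you
    }
}
```

The greedy form runs to the last "you" on the line, while the lazy form stops at the first one. That is why the lazy form is the one to use for "up to the first quote" style extractions.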

年龄： followed by exactly 50 characters, then everything up to the first double quote:

\u5e74\u9f84\uff1a[^*]{50}.*?\u0022([^\u0022>]*)

For the HTML

<div class="product-heading">

the same approach skips 200 characters and then captures the text after the next >:

<div\sclass=\u0022product-heading\u0022>[^*]{200}.*?>([^</>]*)
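The skip count is measured in characters, line breaks included, and the lazy .*? after it does not cross a line break by default. The Java sketch below makes that visible with a much smaller skip. The method name, the skip values, and the newline between the two cells are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipDemo {
    // Label, exactly `skip` characters of any kind, then the first quoted value on that line.
    static String valueAfter(String text, int skip) {
        Pattern p = Pattern.compile(
                "\\u5e74\\u9f84\\uff1a[^*]{" + skip + "}.*?\\u0022([^\\u0022>]*)");
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed layout: the label cell and the value cell sit on separate lines.
        String html = "\u5e74\u9f84\uff1a</td>\n<td width=\"631\" bgcolor=\"#FFFFFF\">";
        System.out.println(valueAfter(html, 6)); // 631  (the skip covers "</td>" and the newline)
        System.out.println(valueAfter(html, 5)); // null (the dot cannot step over the newline)
    }
}
```

Fixed skips like this are fragile. They only hold while the page template stays byte-for-byte the same between the label and the value.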
