Matching img tags in an HTML page with PHP regular expressions

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 06:42

1. When scraping a web page, pulling the img tags out of the HTML calls for preg_match_all and a regular expression.

<?php
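// Fetch the page; file_get_contents() returns false on failure.
$con = file_get_contents("http://www.jb51.net/news/jb-1.html");
// <img ...> tags, case-insensitive; group 1 captures a src ending in .gif, .jpg, or .png.
// [^>]* is used instead of .* so that a tag may span several lines.
$pattern = '/<img\b[^>]*?\ssrc\s*=\s*[\'"]\s*([^\'"]*?\.(?:gif|jpg|png))\s*[\'"][^>]*>/i';
preg_match_all($pattern, $con, $match);
print_r($match);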
?>
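
A pitfall worth keeping in mind when writing patterns like this one: square brackets form a character class, so `[img|IMG]` matches a single character out of i, m, g, |, I, M, G, not the word "img". The sketch below contrasts it with a real alternation; the sample strings are made up for illustration.

```php
<?php
// [img|IMG] is a character class: exactly one of i, m, g, |, I, M, G.
$class = '/<[img|IMG]/';
// (?:img|IMG) is an alternation between two whole words.
$alternation = '/<(?:img|IMG)\b/';

var_dump(preg_match($class, '<i>italic</i>'));           // int(1): "<i" satisfies the class
var_dump(preg_match($alternation, '<i>italic</i>'));     // int(0)
var_dump(preg_match($alternation, '<IMG src="a.png">')); // int(1)
```

The same applies to `[\.gif|\.jpg|\.png]`, which matches one character rather than one of three extensions; `\.(?:gif|jpg|png)` is the grouped form.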


----------------------------------------------------------------

2. preg_match_all plus a regex: stripping unwanted content from an HTML page


Someone else's approach: http://www.jb51.net/article/16146.htm


Mine:
<?php
 
 $str='<p> <!-- ImageReady Slices (czh.jpg) -->
 <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0"
 width="980" height="1000"> <tbody>
 <tr> <td> <img alt=""
 src="/images/upload/Image/4a092116c4afbbb3b67613a43e6c9d1a.jpg"
 width="980" height="100" /> </td> </tr> <tr> <td> <img alt=""
 src="/images/upload/Image/e855b4925115a604acb97389e52d37c0.jpg"
 width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="
 /images/upload/Image/99f62b9d7314498a937d48385e035560.jpg" width="980"
  height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/2ef61fb51a4550f2019487b47385052a.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/f29d0f73f69a4afb78999c7fa3fc9a11.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/acaeeb9353dabf906463344f8dbafe24.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/71d27a7f5dc455d115c51081106aff07.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/ce602c16ec6c05cac8b6ebd55aebbd67.jpg" width="980" height="100" /> </td> </tr> <tr> <td>
   <img alt="" src="/images/upload/Image/e1307d3dfb0b8cccfa550c8db32adcc7.jpg"
   width="980" height="100" /> </td> </tr> <tr> <td>
   <img alt="" src="/images/upload/Image/dac21ae8a398fd56cec307e158b7873c.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/9a4c483b7352742833c7a7efb1308aa6.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/bf38486ccc0e030542536fcc75454a56.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/6734650fee1bc6086cf5ef1ec68b9bed.jpg" width="980" height="100" /> </td> </tr> <tr> <td>
   <img alt="" src="/images/upload/Image/6e487d676e4fa21645658dcb77f59838.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/3157746b72a02f0d83a79f97aaaa2306.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/b3280ef6b142d171a88cbeafab2f5987.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/2b3894d3306b49fc224b55652c266a0c.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/38776cd41480ddf27e94d7a4cbd39ebd.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/837cbac49a0db916249f6d164ecd7abe.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/ae30a7ec07428176239cd487d7b455cb.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/e8731b1eefc612c41c33276f784dc8a2.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/f1bcc2002d92a8f53df6f679c903caae.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/e893dbe22070dd52d1c9070f37ca36e4.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/050525021e38695d2b13c26fec1296f9.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/c39dda4b709135c15f4671fc04da8a61.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/18bddfaa59cf004a5f46415390b35a12.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/1558b379b034095dbe5c1239704fbe29.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" 
src="/images/upload/Image/d2189c1ba90be3570341bca4855ac6f3.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/f5445b66d9774881f1b020a079739f8a.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/b77e702ec70bee66205e185d7338d4e5.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/a78872204e74390f29cf0b028fd18d75.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/8412bf629df21c8b3421395a128f4fec.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/d85a3d494b525e6bdc9f089a5ecce368.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/6ec0ba1e92827a6f7200aaa0ef0d4992.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/1c9fc2ef3ee1e9c6272c2cd7847d14ea.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/f48f9064ea02faa05cb8b6cd22fafa12.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/72efc64476a037b3f9848525a84820b0.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/c101ad2d4ce0d3c30431b87590c7dce4.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/4ff8fad35694f646bbb9fdabd14f7e89.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/59ff9d00eed67eebac933a78af5490c4.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" 
src="/images/upload/Image/630e438a3f951b14e12dc47da248f917.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/68a59d7d703a7b8156e8e81218cf56f5.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/7ea529908bb392bf399f5990b6bad59d.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/4632ef6fd1422d34b32dbb4a0560aa43.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/b78be28849087272d937c42d2339b8e8.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/2af322042f6369ac12a478fdf16db310.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/2b0fece44db65cf805a5415481752acb.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/b33ef324f6f591ec646b8959ed5d6174.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/c7639faa5f186c0226d5f7f9e80ada3e.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/cfa2372d8098946dd70d5f7a019e3bfc.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/45b9e1774bc2c1efd845daaa57bd26b8.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/dc6e197554c5e659b3d1602d0f6830f1.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/e710278b0984b52131b8e59de8ba6134.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/9c0fa692d8e2c1aa442be360d51a8256.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/d73dd7d18d369a331050329eb05c7716.jpg" 
width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/904c3353f5d2236f5b6d71126a783251.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/fd1cff6978dca197401a4ba965a993f9.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/805f23a3b35dc0e2d3c37349a8aaee9f.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/0a4e4bf84e96a20f11309769b38a0349.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/ebc48c96f0936d7b90406b3bb0ba0178.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="1000"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/183f0b516b13fecf4cf196f9b7a01b89.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/ff2097cd9aa76bcb45a51bedf6233895.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/ddf4ab5686b9cc9f64aca3eeb6873704.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/27c93c046c6a9e640e421d0e1947a16c.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/6fe66075b816c30a6f0763695bc6b380.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/adf1aec05811fecf0a180c4c0541ae47.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/52435bb0c8cd8de759dfcc61954234c4.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/e9a5a5bb269015436ce5b737e906bf9f.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/a32feac2c6cf4521b4ecef2a8bf6808a.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" 
src="/images/upload/Image/dc550a9c08d0d53b441c820ae0c2e2e5.jpg" width="980" height="100" /> </td> </tr> </tbody> </table> <!-- ImageReady Slices (czh.jpg) --> <table id="__01" class="ke-zeroborder" border="0" cellspacing="0" cellpadding="0" width="980" height="570"> <tbody> <tr> <td> <img alt="" src="/images/upload/Image/84d1fb84c6e16b6ab5c1200dca515ed6.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/0969515313ada2e639f0d5d6cb918450.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/1313f752ef089d143b95a392d7a08aaa.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/abf1522c5b1d68069abd68950c83be24.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/8326b90093be0098e7dfb9fdfc6c6264.jpg" width="980" height="100" /> </td> </tr> <tr> <td> <img alt="" src="/images/upload/Image/30ed3c3ba587ad2a8112b5ae0a25d782.jpg" width="980" height="70" /> </td> </tr> </tbody> </table> <!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --><!-- End ImageReady Slices --> <p> <br /> </p>" } ';



//$str= file_get_contents('http://item.jd.com/245431.html');
 //var_dump($str);die;
 
 
 
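// <img ...> tags, case-insensitive; group 1 captures a src ending in .gif, .jpg, or .png.
// [^>]* is used instead of .* so that a tag may span several lines, as it does in $str.
$pattern = '/<img\b[^>]*?\ssrc\s*=\s*[\'"]\s*([^\'"]*?\.(?:gif|jpg|png))\s*[\'"][^>]*>/i';
preg_match_all($pattern, $str, $match);
print_r($match);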

/* The approach used in the project: strip the HTML comments */
/*$pattern="/<!--.*-->/isU";

preg_match_all($pattern,$str,$match);
var_dump($match);
for($i=0;$i<count($match[0]);$i++)
{
    $str=str_replace($match[0][$i],'',$str);
}


echo $str;*/
?>
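
The commented-out approach collects every HTML comment with preg_match_all() and then removes each one with str_replace(). A shorter route to the same result, sketched here with a made-up `$html` sample, is a single preg_replace() call; the lazy `.*?` with the `s` modifier plays the role of `.*` with `isU`.

```php
<?php
// Hypothetical sample input; any HTML string works here.
$html = "<p>keep</p><!-- ImageReady Slices (czh.jpg) --><p>this</p><!-- End\nImageReady Slices -->";

// s lets . cross newlines; the lazy .*? stops at the first closing -->.
$clean = preg_replace('/<!--.*?-->/s', '', $html);

echo $clean, "\n"; // <p>keep</p><p>this</p>
```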


3. An introduction to PHP regex pattern modifiers

--------------------------------------------------------

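Pattern modifiers
Pattern modifiers: the modifiers that can be used in regular expression patterns
Description
The modifiers currently available in PCRE are listed below. The names in parentheses are the internal PCRE names for these modifiers. Spaces and newlines in the modifiers are ignored; any other character causes an error.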



i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper-case and lower-case letters.

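m (PCRE_MULTILINE)
By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains newlines). The "start of line" metacharacter (^) matches only at the start of the string, and the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, "start of line" and "end of line" also match immediately after and immediately before any newline in the subject string, in addition to the very start and end. This is equivalent to Perl's /m modifier. If the subject string contains no "\n" characters, or the pattern contains no ^ or $, setting this modifier has no effect.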
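
A small sketch of the difference, using a made-up three-line string:

```php
<?php
$text = "first\nsecond\nthird";

// Without m, ^ anchors only at the very start of the subject.
preg_match_all('/^\w+/', $text, $single);
// With m, ^ also matches right after each newline.
preg_match_all('/^\w+/m', $text, $multi);

echo implode(', ', $single[0]), "\n"; // first
echo implode(', ', $multi[0]), "\n";  // first, second, third
```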

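s (PCRE_DOTALL)
If this modifier is set, the dot metacharacter (.) in the pattern matches all characters, including newlines. Without it, newlines are excluded. This is equivalent to Perl's /s modifier. A negated character class such as [^a] always matches a newline, whether or not this modifier is set.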
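
This matters for HTML, where a tag or cell often spans several lines. A minimal illustration with a made-up snippet:

```php
<?php
$html = "<td>\n<img src=\"a.jpg\">\n</td>";

var_dump(preg_match('/<td>.*<\/td>/', $html));  // int(0): . stops at the newline
var_dump(preg_match('/<td>.*<\/td>/s', $html)); // int(1): . now matches "\n" too
var_dump(preg_match('/<td>[^<]*<img/', $html)); // int(1): [^<] matches "\n" with or without s
```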

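x (PCRE_EXTENDED)
If this modifier is set, whitespace in the pattern is ignored entirely, except when escaped or inside a character class, and everything from an unescaped # outside a character class up to and including the next newline is ignored as well. This is equivalent to Perl's /x modifier and makes it possible to put comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace may never appear inside a special character sequence in the pattern, for example inside the sequence (?( that introduces a conditional subpattern.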
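
As an illustration, an img-matching pattern can be laid out over several commented lines. This is only a sketch with a made-up input tag; note that the comments must not contain the pattern delimiter (/ here).

```php
<?php
$pattern = '/
    <img\b [^>]*?            # opening of the tag and any attributes before src
    \s src \s* = \s*         # the src attribute name and equals sign
    ["\'] ([^"\']*) ["\']    # quoted value, captured as group 1
/xi';

preg_match($pattern, '<IMG alt="" src="/images/a.jpg" />', $m);
echo $m[1], "\n"; // /images/a.jpg
```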

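e
If this modifier is set, preg_replace() performs the normal substitution of backreferences in the replacement string, evaluates the result as PHP code, and uses the outcome to replace the matched text.

Only preg_replace() uses this modifier; the other PCRE functions ignore it.

Note: this modifier is not available in PHP 3. It was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; use preg_replace_callback() instead.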
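
On PHP 7 and later, where e is gone, the same effect comes from preg_replace_callback(). A minimal sketch, with a made-up input string, that doubles every number:

```php
<?php
$text = 'width=980 height=100';

// The callback receives the match array for each hit and returns the replacement.
$result = preg_replace_callback(
    '/\d+/',
    function (array $m): string {
        return (string) ((int) $m[0] * 2);
    },
    $text
);

echo $result, "\n"; // width=1960 height=200
```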


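A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the subject string. The same effect can be achieved by suitable constructs in the pattern itself (the only way to do it in Perl).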

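D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without it, a dollar also matches immediately before the final character if that character is a newline (but not before any other newline). This option is ignored if the m modifier is set. Perl has no equivalent.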
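
Both A and D can be checked with a few one-liners; the strings are made up for illustration:

```php
<?php
// A: the match must begin at offset 0 of the subject.
var_dump(preg_match('/bc/', 'abc'));  // int(1): found at offset 1
var_dump(preg_match('/bc/A', 'abc')); // int(0): "bc" is not at the start

// D: $ no longer matches before a trailing newline.
var_dump(preg_match('/abc$/', "abc\n"));  // int(1)
var_dump(preg_match('/abc$/D', "abc\n")); // int(0)
var_dump(preg_match('/abc$/D', 'abc'));   // int(1)
```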

S
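When a pattern is going to be used several times, it is worth analyzing it first in order to speed up matching. If this modifier is set, that extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character. As of PHP 7.3.0 this modifier has no effect.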

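U (PCRE_UNGREEDY)
This modifier inverts the greediness of quantifiers: they are no longer greedy by default, and become greedy only when followed by "?". This is not compatible with Perl. The option can also be turned on inside the pattern with (?U), and a single quantifier can be made non-greedy by putting a question mark after it (as in .*?).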
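
The three variants side by side, on a made-up string:

```php
<?php
$html = '<b>one</b><b>two</b>';

preg_match('/<b>.*<\/b>/', $html, $greedy);    // default: greedy
preg_match('/<b>.*<\/b>/U', $html, $ungreedy); // U: quantifiers lazy by default
preg_match('/<b>.*?<\/b>/', $html, $lazy);     // lazy for this one quantifier only

echo $greedy[0], "\n";   // <b>one</b><b>two</b>
echo $ungreedy[0], "\n"; // <b>one</b>
echo $lazy[0], "\n";     // <b>one</b>
```

Because U flips the meaning of every quantifier in the pattern, the per-quantifier `?` form is usually easier to read.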

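X (PCRE_EXTRA)
This modifier turns on an extra feature of PCRE that is incompatible with Perl. Any backslash in the pattern that is followed by a letter with no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as that letter itself. No other features are currently controlled by this modifier.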

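u (PCRE_UTF8)
This modifier turns on an extra feature of PCRE that is incompatible with Perl. The pattern string is treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.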
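
The practical effect is easiest to see by counting what the dot matches. The sketch below assumes PHP 7 or later for the `\u{...}` string escapes; the two code points are arbitrary three-byte UTF-8 characters.

```php
<?php
$word = "\u{6B63}\u{5219}"; // two characters, six bytes in UTF-8

preg_match_all('/./', $word, $bytes);  // without u: . matches one byte at a time
preg_match_all('/./u', $word, $chars); // with u: . matches one UTF-8 character

echo count($bytes[0]), "\n"; // 6
echo count($chars[0]), "\n"; // 2
```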

--------------------------------------------------------

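Finally, a note on something I just came across: replacing & with preg_replace().

An & shown on a web page is quite likely &amp; or &#38; in the source,

so a plain string replacement is not enough.

$amp = '&#38;AA';
echo preg_replace("/&(amp|#38);/i", '', $amp); // replace the & entity with an empty string; prints AA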
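
The same pattern can also turn either spelling back into a literal ampersand instead of deleting it. A small sketch with made-up sample strings:

```php
<?php
$samples = ['Tom &amp; Jerry', 'Tom &#38; Jerry', 'Tom &AMP; Jerry'];

foreach ($samples as $s) {
    // Either entity spelling, in any letter case, becomes a plain &.
    echo preg_replace('/&(amp|#38);/i', '&', $s), "\n"; // Tom & Jerry
}
```

For decoding entities in general rather than just this one, PHP's built-in html_entity_decode() is the more usual tool.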


Regex matching modes in JS

------------------------------------------------------------------

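Single-line mode lets the dot (.) match any character, including the newline (\n) (per Regex Match Tracer).

According to the chm documentation "Microsoft Windows 脚本技术" (Microsoft Windows Script Technologies):

The dot (.) matches any single character except "\n". To match any character including '\n', use a pattern such as '[.\n]'.

The regular expression object has only the following three flags:

g (global search for all occurrences of the pattern)
i (ignore case)
m (multiline search)

In other words, there is no single-line mode. (This describes the JScript covered by that documentation; ECMAScript 2018 later added an s flag, dotAll, which does make the dot match newlines.)

However, the chm documentation's advice to use a pattern such as '[.\n]' to match any character including '\n' is wrong.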


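(The following is excerpted from: 关于正则表达式:如何匹配所有字符, "About regular expressions: how to match all characters")

  Once a dot (.) is placed inside [], it becomes a literal dot (\.)

  If you don't believe it, test it like this:
  asdfasdf<span   style="font-size:   22px">asdfasdf</span>asdfasdf
  Regular expression:
  <span   style=\"font-size\:   22px\">[^.]+</span>
  This matches successfully, which shows that [^.] means "any character except a literal dot".

The pattern to use in the end is (.|\n); a character class such as [\s\S] also works.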
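
The rule that a dot inside a character class is literal holds in PCRE as well, so it can be checked from PHP. This is only a sketch of the same point in PHP, not the JavaScript behaviour itself:

```php
<?php
// Inside [], the dot is literal: [.\n] means "a dot or a newline", nothing else.
var_dump(preg_match('/a[.\n]b/', 'axb'));     // int(0)
var_dump(preg_match('/a[.\n]b/', 'a.b'));     // int(1)

// (.|\n) and [\s\S] both accept any character, newline included.
var_dump(preg_match('/a(.|\n)b/', 'axb'));    // int(1)
var_dump(preg_match('/a(.|\n)b/', "a\nb"));   // int(1)
var_dump(preg_match('/a[\s\S]b/', "a\nb"));   // int(1)
```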

















