Matching an image URL in a web page with Python's re module
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 16:20
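I recently wrote a Python program that grabs the background image of the Bing search home page (http://cn.bing.com/) and sets it as my desktop wallpaper. While matching the image URL with a regular expression, I ran into a match failure.
The image URL to grab is shown in the figure:
First, I used this pattern: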
reg = re.compile('.*g_img={url: "(http.*?jpg)"')
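No matter what I tried, it never matched. I then saved the page source and opened it in notepad++, where notepad++'s regular expression search found the URL with no trouble, as shown in the figure:
Next I wrote a test script that stores just the line containing the image URL in a string, and that matched immediately. In the code below, data does not match, while line does.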
# -*- coding: utf-8 -*-
# Python 2 test script
import os
import re

f = open('bing.html', 'r')
# The single line of the page source that contains the image URL
line = r'''Bnp.Internal.Close(0,0,60056); } });;g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/BingHalloween_BkgImg.jpg",id:'bgDiv',d:'200',cN'''
# The whole page source, including newlines
data = f.read().decode('utf-8', 'ignore').encode('gbk', 'ignore')
print " "
reg = re.compile('.*g_img={url: "(http.*?jpg)"')
if re.match(reg, data):
    m1 = reg.findall(data)
    print m1[0]
else:
    print("data Not match .")
print 20*'-'
#print line
if re.match(reg, line):
    m2 = reg.findall(line)
    print m2[0]
else:
    print("line Not match .")
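So line and data must differ in some way. The difference is that data spans multiple lines and contains newline characters, while line is a single line with none. To confirm this, I added a newline to the string line, and line no longer matched either.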
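The difference can be reproduced without the Bing page at all. Below is a minimal sketch in Python 3 (the test script above is Python 2); `single_line` and `multi_line` are made-up stand-ins for `line` and `data`, and the URL is a placeholder.

```python
import re

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

# Hypothetical stand-ins for `line` and `data`; the URL is a placeholder.
single_line = 'x=1;g_img={url: "http://example.com/bg.jpg",id:\'bgDiv\''
multi_line = '<html>\n<script>\n' + single_line + '\n</script>'

# match() anchors at position 0, and by default '.' stops at the first
# newline, so the leading '.*' cannot reach g_img on the third line.
single_match = reg.match(single_line)
multi_match = reg.match(multi_line)

print(single_match.group(1) if single_match else 'no match')
print(multi_match.group(1) if multi_match else 'no match')
```

The pattern and the text around `g_img` are identical in both strings; only the newlines before it differ.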
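At this point the cause is clear. It lies in the statement re.compile('.*g_img={url: "(http.*?jpg)"'). Looking through the Python documentation, I found that re.compile() takes an optional second parameter, flags. Its value is built from constants defined in the re module, which are listed below: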
re.DEBUG Display debug information about compiled expression.
re.I
re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.
re.L
re.LOCALE Make \w, \W, \b, \B, \s and \S dependent on the current locale.
re.M
re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
re.S
re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
re.U
re.UNICODE Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database. New in version 2.0.
re.X
re.VERBOSE This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
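re.M and re.S are easy to confuse, since both have to do with newlines. The following sketch (Python 3, with a made-up two-line string `text`) shows that re.M only changes where '^' and '$' match, while re.S changes what '.' matches.

```python
import re

text = 'first line\nsecond line'

# Default: '.' stops at the newline, and '^' matches only at the
# start of the string.
default_dot = re.findall(r'.+', text)
default_caret = re.findall(r'^\w+', text)

# re.M: '^' also matches immediately after each newline.
multiline_caret = re.findall(r'^\w+', text, re.M)

# re.S: '.' also matches the newline, so one run spans both lines.
dotall_dot = re.findall(r'.+', text, re.S)

print(default_dot)
print(default_caret)
print(multiline_caret)
print(dotall_dot)
```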
What we need here is re.S, which makes '.' match every character, including newlines. The regular expression becomes:
reg = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)
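As a rough check of this fix, the sketch below (Python 3) runs the re.S pattern against a made-up multi-line string `page`, which stands in for the real bing.html. It also shows one possible alternative that is not from the original program: since search() scans the whole string, the leading '.*' and the flag can both be dropped as long as the URL itself sits on a single line.

```python
import re

# Hypothetical page source; the real bing.html is not reproduced here.
page = (
    '<html>\n<head>\n<script>\n'
    'var a=1;g_img={url: "http://example.com/bg.jpg",id:\'bgDiv\'};\n'
    '</script>\n</head>\n</html>'
)

# The fix from the text: re.S lets the leading '.*' cross newlines.
reg_dotall = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)
m = reg_dotall.match(page)
url_from_match = m.group(1) if m else None

# Alternative (an assumption, not the author's code): search() looks for
# the pattern at every position, so no leading '.*' or flag is needed.
reg_search = re.compile(r'g_img=\{url: "(http.*?jpg)"')
s = reg_search.search(page)
url_from_search = s.group(1) if s else None

print(url_from_match)
print(url_from_search)
```

With re.S the greedy '.*' first consumes the entire page and then backtracks, so on a large page the search() form may do less work. If g_img appeared more than once, the greedy version would pick the last occurrence and search() the first.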