Regular Expression Crawler 1

Source: Internet  Published by: 360医药软件  Editor: 程序博客网  Time: 2024/05/17 22:41

A small regular expression example

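import re

# \A anchors the match at the very start of the string
li = 'hellonihaohello'
a = re.search(r'\Ahello', li)
print(a.group())

# \Z anchors the match at the very end of the string
b = re.search(r'hello\Z', li)
print(b.group())

# \b is a word boundary, so only the whole word "have" matches
li = 'i have a dream'
c = re.search(r'\bhave\b', li)
print(c.group())

# Find QQ mail addresses whose account number has 6 to 11 digits
content = 'i have a 34332589@qq.com dream one day ... 280000089@qq.com money neau'
data = re.findall(r'\d{6,11}@qq\.com', content)
print(data)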
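One thing the example does not show is what happens when a pattern fails to match: `re.search` returns `None`, and calling `.group()` on it raises `AttributeError`. The sketch below wraps the same patterns in two small helpers; the names `find_qq_emails` and `first_match` are made up for illustration and are not part of the `re` module.

```python
import re

def find_qq_emails(text):
    """Return all QQ mail addresses whose account number has 6 to 11 digits."""
    # \d{6,11} matches the account number; \. matches a literal dot.
    return re.findall(r'\d{6,11}@qq\.com', text)

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m is not None else None

sample = 'i have a 34332589@qq.com dream one day ... 280000089@qq.com money neau'
emails = find_qq_emails(sample)
print(emails)

# "hello" is not at the start of this string, so \A prevents a match.
print(first_match(r'\Ahello', 'say hello'))
```

Checking for `None` before calling `.group()` keeps a script from crashing on input that does not contain the expected text.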

Scraping Qiushibaike images with regular expressions

"""Download Qiushibaike images using a regular expression."""
import io
import re

import requests

# URL of the page to download
url = 'https://www.qiushibaike.com/imgrank/'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'
}
# Send the request
response = requests.get(url, headers=headers)
# Decode the response body
html = response.content.decode('utf8')
# print(html)
# with io.open('d:/python/pachong/jiandan.txt', 'w', encoding='utf8') as f:
#     f.write(html)

# Extract the image URLs with a regular expression.
# (?!/|static) is a negative lookahead: it consumes nothing and rejects
# src values whose path starts with "static".
reStr = r'<img src="/{1,2}(?!/|static)(.*?)"'
image_urls = re.findall(reStr, html)
# print(image_urls)
for item in image_urls:
    if not item.startswith(("http:", "https:")):
        item = "http://" + item
    print(item)
    response = requests.get(item)
    image_data = response.content
    # The file name is the last path segment of the URL
    imageName = item.split("/")[-1]
    # Only save .jpg files
    if re.search(r'\.jpg$', imageName) is not None:
        print(imageName)
        with io.open('d:/python/pachong/' + imageName, 'wb') as f:
            f.write(image_data)
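When filtering image URLs, it is easy to confuse a character class with a negative lookahead. `[^(static)]` matches exactly one character that is not `(`, `s`, `t`, `a`, `i`, `c` or `)`, while `(?!static)` consumes nothing and rejects the literal text `static`. The sketch below compares the two on a made-up HTML fragment; the `example.com` hosts and paths are assumptions for illustration, not real Qiushibaike markup.

```python
import re

# Hypothetical HTML fragment, made up for illustration only.
sample_html = (
    '<img src="//pic.example.com/system/pictures/1/medium/app1.jpg" alt="a">'
    '<img src="//static.example.com/images/logo.png" alt="logo">'
)

# Character class: consumes one character after the slashes, so the first
# character of the host is lost from the captured group.
char_class = re.findall(r'<img src="/{1,2}[^(static)](.*?)"', sample_html)

# Negative lookahead: consumes nothing. The extra "/" alternative stops the
# regex engine from backtracking to a single slash to dodge the check.
lookahead = re.findall(r'<img src="/{1,2}(?!/|static)(.*?)"', sample_html)

print(char_class)
print(lookahead)
```

On this sample, the character-class version drops the leading `p` of the first host and still lets the `static` URL through, because the engine backtracks to a single slash. The lookahead version returns only the first image URL, intact.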