Scraping HTML pages with Python

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 06:29

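These notes record problems I ran into while learning Python regular expressions, for future reference.
For example, use Python regular expressions to scrape the titles and URLs of the latest posts on FreeBuf.
Approach: the page source shows that every latest-post title and URL sits inside the two classes "news-info" and "news-img", so the scrape works by matching the content of both classes in a single pass.
The main difficulty was that I did not know how to construct the regular expression. After some study I could write one, but it is rather cumbersome.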

# coding=utf-8
import re
import requests

# Fetch the FreeBuf home page.
contents = requests.get('http://www.freebuf.com/').text

# Group 1 captures the URL inside "news-img"; group 2 captures the title inside "news-info".
# re.S lets "." match newlines, so one pattern can span several lines of HTML.
pattern = re.compile(r'<div class="news-img.*?<a target="_blank" href="(.*?)">.*?</a>.*?<div class="news-info.*?<dl>.*?<dt>.*?<a.*?>(.*?)</a>', re.S)
items = re.findall(pattern, contents)
for item in items:
    print(item[1].strip() + '\n' + item[0])
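To make the pattern easier to follow, here is a minimal offline sketch that runs the same pattern against a hand-written HTML fragment instead of the live site. The fragment (SAMPLE_HTML) and the helper name extract_items are assumptions made up for illustration, and the real FreeBuf markup may differ.

```python
import re

# Hypothetical fragment imitating the structure described above;
# the real page markup may differ.
SAMPLE_HTML = """
<div class="news-img">
  <a target="_blank" href="http://www.freebuf.com/articles/1.html">
    <img class="img-responsive" title="First post">
  </a>
</div>
<div class="news-info">
  <dl>
    <dt>
      <a target="_blank" href="http://www.freebuf.com/articles/1.html">
        First post
      </a>
    </dt>
  </dl>
</div>
"""

# Group 1 is the href inside "news-img"; group 2 is the link text inside "news-info".
PATTERN = re.compile(
    r'<div class="news-img.*?<a target="_blank" href="(.*?)">.*?</a>'
    r'.*?<div class="news-info.*?<dl>.*?<dt>.*?<a.*?>(.*?)</a>',
    re.S,  # let "." match newlines so the pattern can span lines
)


def extract_items(html):
    """Return a list of (title, url) pairs found in html."""
    # findall yields one (url, title) tuple per match, in group order.
    return [(title.strip(), url) for url, title in PATTERN.findall(html)]


for title, url in extract_items(SAMPLE_HTML):
    print(title + '\n' + url)
```

Because every ".*?" is non-greedy, each piece stops at the first place the rest of the pattern can match, which is what keeps one match from swallowing several posts at once.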

Using BeautifulSoup is simpler. The code is as follows:

# coding=utf-8
import requests
from bs4 import BeautifulSoup

# Fetch the FreeBuf home page and parse it.
contents = requests.get('http://www.freebuf.com/').text
soup = BeautifulSoup(contents, "html.parser")

# Each element with class "news-img" holds one post's image and link.
for tag in soup.select('.news-img'):
    name = tag.find('img', class_='img-responsive').get('title')
    url = tag.find('a').get('href')
    print(name + '\n' + url)

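First select the ".news-img" elements, then read the title attribute from the img tag matched by class_='img-responsive'. The href is read from the a tag in the same way.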
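The lookup described above assumes that every ".news-img" block contains both an img with class img-responsive and an a tag. find returns None when a tag is missing, and calling .get on None raises AttributeError. Below is a minimal sketch of a more defensive variant, run against a hand-written fragment. SAMPLE_HTML and extract_items are illustrative names, not part of the original code, and the real FreeBuf markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second block deliberately lacks an image and link.
SAMPLE_HTML = """
<div class="news-img">
  <a target="_blank" href="http://www.freebuf.com/articles/1.html">
    <img class="img-responsive" title="First post">
  </a>
</div>
<div class="news-img">
  <span>block without an image or link</span>
</div>
"""


def extract_items(html):
    """Return (title, url) pairs, skipping blocks that lack either value."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for tag in soup.select('.news-img'):
        img = tag.find('img', class_='img-responsive')
        link = tag.find('a')
        # find() returns None when nothing matches, so guard before .get().
        title = img.get('title') if img is not None else None
        href = link.get('href') if link is not None else None
        if title and href:
            items.append((title, href))
    return items


for name, url in extract_items(SAMPLE_HTML):
    print(name + '\n' + url)
```

Skipping incomplete blocks is only one possible choice. Logging them instead might be preferable when debugging a change in the page layout.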
