Web Crawler Lesson 4 (Scraping a News Site with RegEx)

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 11:52

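import requests
import re


def crawler163():
    # Download the 163.com home page.
    content = requests.get('http://www.163.com/').text
    # Stage 1: cut out each news tab block, up to its first closing </ul>.
    pattern1 = re.compile(r'<div class="tab_main clearfix".*?</ul>', re.S)
    results_part = re.findall(pattern1, content)
    # Stage 2: capture the link and title of every list item in those blocks.
    # Join the blocks instead of calling str() on the list, which would turn
    # real newlines into literal "\n" text that \s cannot strip.
    pattern2 = re.compile(r'<li.*?href="(.*?)">(.*?)</a>', re.S)
    results_filter = re.findall(pattern2, '\n'.join(results_part))
    for result in results_filter:
        http, title = result
        # Strip whitespace left over from the page markup.
        http = re.sub(r'\s', '', http)
        title = re.sub(r'\s', '', title)
        print(http, title)


if __name__ == "__main__":
    crawler163()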
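
The two-stage matching used by the crawler can be tried without touching the network. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup (the real 163.com page layout may differ), and `extract_links`, `BLOCK_RE` and `LINK_RE` are hypothetical names, not part of the original lesson. Unlike the crawler, it collapses whitespace in titles to a single space rather than deleting it, so the English sample titles stay readable.

```python
import re

# Hypothetical sample markup; the real 163.com page layout may differ.
SAMPLE_HTML = """
<div class="tab_main clearfix">
  <ul>
    <li><a href="http://example.com/a">First
      headline</a></li>
    <li><a href="http://example.com/b">Second headline</a></li>
  </ul>
</div>
<ul><li><a href="http://example.com/c">Outside the block</a></li></ul>
"""

# Stage 1: isolate each news block (same pattern as the crawler).
BLOCK_RE = re.compile(r'<div class="tab_main clearfix".*?</ul>', re.S)
# Stage 2: capture (url, title) pairs inside a block.
LINK_RE = re.compile(r'<li.*?href="(.*?)">(.*?)</a>', re.S)


def extract_links(html):
    """Return (url, title) tuples found inside the tab_main blocks."""
    links = []
    for block in BLOCK_RE.findall(html):
        for url, title in LINK_RE.findall(block):
            url = re.sub(r'\s', '', url)                  # URLs contain no whitespace
            title = re.sub(r'\s+', ' ', title).strip()    # assumption: collapse, not delete
            links.append((url, title))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, title)
```

Searching each block separately keeps links outside the `tab_main` container (the third link in the sample) out of the result, which is the point of running the narrower pattern second.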