BS4模块爬取第一财经练手

来源:互联网 发布:天尚网络机顶盒 编辑:程序博客网 时间:2024/05/09 06:29
采用requests包和beautiful soup
注意requests get请求得到的html要encode为‘UTF-8’
得到的数据结果为新闻标题+url
# coding=utf-8from bs4 import BeautifulSoupimport requestsimport timeh1={    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}o_g=['113.200.159.155:9999']pro={'http': o_g[0],             }html=requests.get('http://www.yicai.com/news/cbndata/',timeout=20,headers=h1,proxies=pro)html.encoding = "utf-8"html = html.textsoup = BeautifulSoup(html,'lxml')con=soup.find_all('h3',class_="f-ff1 f-fwn f-fs22")for item in con:    print item.get_text()    print item.a.attrs['href']