Scraping the Douban Movie Top 250 with requests and BeautifulSoup: code and problems encountered

The initial code was as follows:
# -*-coding:utf8-*-
import requests
from bs4 import BeautifulSoup
url='http://movie.douban.com/top250'
html=requests.get(url)
soup=BeautifulSoup(html)
print soup.title
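Running it raised an error, plus a warning (omitted).
I later changed the code as follows, which solved the problem: BeautifulSoup is now given the page text (html.text) instead of the Response object returned by requests.get, and the parser ("lxml") is named explicitly.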
# -*-coding:utf8-*-
import requests
from bs4 import BeautifulSoup
url='http://movie.douban.com/top250'
html=requests.get(url)
soup=BeautifulSoup(html.text,"lxml")
print soup.title
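As a rough illustration of why the fix works: a parser consumes markup text, not the object that wraps it. The sketch below shows the same distinction using only the standard library's html.parser.HTMLParser, swapped in for BeautifulSoup so that it runs without third-party packages or network access. It is written for Python 3, unlike the Python 2 code in this post. FakeResponse, SAMPLE_HTML, TitleParser and extract_title are hypothetical names invented for this example; they are not part of requests or bs4.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = "<html><head><title>Douban Top 250</title></head><body></body></html>"


class FakeResponse:
    """Hypothetical stand-in for a requests Response: the markup lives in .text."""

    def __init__(self, text):
        self.text = text


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(markup):
    """Feed markup to the parser and return the page title."""
    parser = TitleParser()
    parser.feed(markup)
    parser.close()
    return parser.title.strip()


response = FakeResponse(SAMPLE_HTML)

# Passing the text works.
page_title = extract_title(response.text)

# Passing the wrapper object itself is rejected by the parser.
try:
    extract_title(response)
    wrapper_rejected = False
except TypeError:
    wrapper_rejected = True
```

The exact error raised by BeautifulSoup may differ from the one raised here, but the cause is the same kind of mistake: the wrapper object was passed where markup text was expected.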
The final program is as follows:
#!/usr/bin/env python
# -*-coding:utf8-*-
import requests
import sys
from bs4 import BeautifulSoup
reload(sys)
sys.setdefaultencoding("utf-8")

# Get the movie titles
def get_movie(soup,name):
    titles=soup.find_all(class_="title")
    for title in titles:
        if title.string[1]!='/':                   # skip alternate titles of the same movie
            name.append(title.string)
    return name

# Get the movie ranks and scores
def get_number_score(soup,number,score):
    number_score=soup.find_all('em')
    for i in range(len(number_score)):
        if i%2==0:
            number.append(number_score[i].string)
        else:
            score.append(number_score[i].string)
    return number,score

name=[];number=[];score=[]                         # initialize the result lists
f=open('movie.txt','w')

# Fetch the Douban Top 250 movies
for i in range(10):
    url='http://movie.douban.com/top250?start=%s&filter=&type=' %(i*25)
    html=requests.get(url).text
    soup=BeautifulSoup(html,"lxml")
    name=get_movie(soup,name)
    (number,score)=get_number_score(soup, number, score)

# Write the results to the file
for j in range(len(name)):
    title_str='%s %s %s' %(number[j],name[j],score[j])
    f.writelines(title_str+'\n')
f.close()
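The paging arithmetic and the even/odd pairing used in the program can be sketched as small pure functions, which makes them easy to check without touching the network. This is a minimal Python 3 sketch, not the program itself: page_urls, split_rank_score and format_rows are hypothetical helper names, and the second sample row is a made-up placeholder. It also assumes, as the program does, that the em elements on each page alternate between rank and score; that depends on the page markup at the time of writing.

```python
BASE_URL = "http://movie.douban.com/top250"
PAGE_SIZE = 25  # the program steps the start parameter by 25 per page


def page_urls(total=250, page_size=PAGE_SIZE):
    """Build one URL per page: start=0, 25, ..., 225 for the defaults."""
    return [
        "%s?start=%d&filter=&type=" % (BASE_URL, start)
        for start in range(0, total, page_size)
    ]


def split_rank_score(em_texts):
    """Split an alternating [rank, score, rank, score, ...] list in two.

    Assumption: even positions hold ranks and odd positions hold scores.
    """
    return em_texts[0::2], em_texts[1::2]


def format_rows(ranks, names, scores):
    """Join rank, title and score into 'rank title score' lines."""
    return ["%s %s %s" % row for row in zip(ranks, names, scores)]


urls = page_urls()

# Sample values: the first row matches the output shown in this post,
# the second row is a made-up placeholder.
ranks, scores = split_rank_score(["1", "9.6", "2", "0.0"])
rows = format_rows(ranks, ["肖申克的救赎", "Placeholder Title"], scores)
```

Using zip rather than indexing by len(name) also means a mismatch in list lengths truncates the output instead of raising an IndexError, which may or may not be what you want.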

Finally, the Douban Top 250 movies are exported in the following format:
1 肖申克的救赎 9.6
x xxxxxxx x.x
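If the exported file needs to be read back, each line can be split into its three fields. The sketch below is one possible way to do it in Python 3, assuming the "rank title score" layout shown above; MovieRow and parse_row are hypothetical names introduced for this example. It takes the first token as the rank and the last token as the score, so titles that contain spaces stay intact.

```python
from typing import NamedTuple


class MovieRow(NamedTuple):
    rank: int
    title: str
    score: float


def parse_row(line):
    """Parse one 'rank title score' line into a MovieRow."""
    # The rank is everything before the first space.
    rank_text, rest = line.strip().split(" ", 1)
    # The score is everything after the last space; the title is in between.
    title, score_text = rest.rsplit(" ", 1)
    return MovieRow(int(rank_text), title, float(score_text))


row = parse_row("1 肖申克的救赎 9.6\n")
```

To process the whole file, parse_row could be applied to each line of open('movie.txt', encoding='utf-8'), assuming the file was written as UTF-8.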