Writing a news crawler in Python with IDLE

Source: Internet | Editor: 程序博客网 | Time: 2024/05/05 19:04

1. When reading or writing files in IDLE, take care to write the file path in a valid form.

Example:

    import os
    import urllib2

    url = 'http://biz.finance.sina.com.cn/usstock/usstock_news.php?pageIndex=1&symbol=AA'
    try:
        content = urllib2.urlopen(url).read()
        file_name = 'G:\\新闻爬取\\A.txt'
        fp = open(file_name, 'w')
        fp.write(content)
        fp.close()
    except:
        print 'Unable to fetch page content'

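The path in file_name above should not be written as file_name='G:\新闻爬取\A.txt'. In an ordinary string literal a single backslash starts an escape sequence, so a folder or file name that begins with a letter such as n or t would be read as \n (newline) or \t (tab) and the path would be silently changed.

If you really want to write single backslashes, use a raw string: file_name=r'G:\新闻爬取\A.txt', with the prefix character 'r' in front of the opening quote.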
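To make the difference concrete, here is a small sketch comparing the three spellings. It is written for Python 3 (the post itself uses Python 2, where the escaping rules shown here are the same), and the path G:\news\table.txt is a hypothetical stand-in chosen because both of its components start with an escape letter.

```python
# Hypothetical Windows path, used only for illustration.
escaped = 'G:\\news\\table.txt'    # every backslash doubled
raw = r'G:\news\table.txt'         # raw string: backslashes kept literally
unescaped = 'G:\news\table.txt'    # '\n' and '\t' are read as escape sequences

# The two safe spellings produce exactly the same string.
print(escaped == raw)

# The unescaped spelling now contains a newline and a tab instead of backslashes.
print('\n' in unescaped, '\t' in unescaped)
print(repr(unescaped))
```

Printing repr() of a path before passing it to open() is a quick way to check in IDLE whether the string is what you intended.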

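When saving a file, it is best to wrap the write in exception handling. Otherwise an uncaught IOError (Errno 2 or Errno 22) may abort the script.

Corrected version:

    import os
    import urllib2

    url = 'http://biz.finance.sina.com.cn/usstock/usstock_news.php?pageIndex=1&symbol=AA'
    content = urllib2.urlopen(url).read()
    file_name = 'G:\\新闻爬取\\A.txt'
    try:
        # "with" closes the file automatically, and only if open() succeeded
        with open(file_name, 'w') as fp:
            fp.write(content)
    except IOError:
        print 'IOError: unable to write file'
        # Record the URL of the page that could not be saved.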
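The idea of remembering which URLs could not be saved can be sketched as follows. This is only an illustration under a few assumptions: it targets Python 3 (where IOError is an alias of OSError) rather than the Python 2 used in the post, and the names save_page and failed_urls, the example.com URLs, and the temporary directory are all hypothetical.

```python
import os
import tempfile


def save_page(content, file_name, url, failed_urls):
    """Write content to file_name; on failure remember url and return False."""
    try:
        with open(file_name, 'wb') as fp:
            fp.write(content)
        return True
    except OSError as err:  # IOError is an alias of OSError in Python 3
        print('Could not write %s: %s' % (file_name, err))
        failed_urls.append(url)  # keep the URL so it can be retried later
        return False


failed_urls = []
work_dir = tempfile.mkdtemp()
page = b'<html>example</html>'  # stand-in for urlopen(url).read()

# This write succeeds because work_dir exists.
ok = save_page(page, os.path.join(work_dir, 'A.txt'),
               'http://example.com/a', failed_urls)

# This write fails because the parent folder "missing_dir" does not exist.
bad = save_page(page, os.path.join(work_dir, 'missing_dir', 'B.txt'),
                'http://example.com/b', failed_urls)

print(ok, bad, failed_urls)
```

Returning a flag instead of letting the exception escape lets a crawl loop keep going and write out failed_urls at the end.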

2. To collect the URLs from a news list page, my approach is to create a list named links:

links=[]

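and then append every URL to links. When designing the loop, make sure it cannot run forever: each iteration must move the search position forward.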

My version:

    def Get_url_quene(self, list_url):
        # Collect the URLs of all news pages linked from one list page.
        links = []  # collected URLs
        content = urllib2.urlopen(list_url).read()
        temp_begin = content.find('<tr><td>')
        temp_end = content.find('</a></td><th>')
        while temp_end != -1 and temp_begin != -1:
            pos1 = content.find('<a href="', temp_begin)
            if pos1 == -1:
                break  # no further link in the page
            pos2 = pos1 + len('<a href="')  # first character of the URL
            pos3 = content.find('" target=', pos2)
            url_string = content[pos2:pos3]
            # print url_string
            links.append(url_string)
            # Continue from the end of the current row so the search always advances.
            temp_begin = content.find('<tr><td>', temp_end)
            temp_end = content.find('</a></td><th>', temp_begin)
        return links

The test URL for this function is the one given in section 1.
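The termination argument can be checked offline with a simplified sketch that works on a string instead of a live page. It is written for Python 3, and the function name extract_links and the sample HTML are hypothetical; the markers '<tr><td>', '<a href="' and '" target=' are the ones used in the post. Each pass resumes the search after the link it just consumed, so the position either grows or becomes -1 and the loop must end.

```python
def extract_links(content):
    """Collect href values from rows shaped like <tr><td><a href="..." target=...>."""
    links = []
    pos = content.find('<tr><td>')
    while pos != -1:
        start = content.find('<a href="', pos)
        if start == -1:
            break  # no more links
        start += len('<a href="')
        end = content.find('" target=', start)
        if end == -1:
            break  # malformed row: stop rather than slice with -1
        links.append(content[start:end])
        # Resume after the link just consumed, so pos strictly increases.
        pos = content.find('<tr><td>', end)
    return links


# Hypothetical list page with two rows.
sample = (
    '<table>'
    '<tr><td><a href="http://example.com/1" target="_blank">One</a></td><th>05-01</th></tr>'
    '<tr><td><a href="http://example.com/2" target="_blank">Two</a></td><th>05-02</th></tr>'
    '</table>'
)

print(extract_links(sample))
```

For anything beyond a quick script, an HTML parser is usually less fragile than string searching, but the sketch stays close to the approach described above.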

3. How to redirect the output of Python code running in IDLE to a file.

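    import sys

    origin = sys.stdout        # remember the original stdout
    f = open('file.txt', 'w')
    sys.stdout = f             # print now writes to file.txt

    # ---------- code that uses print goes here ----------

    sys.stdout = origin        # restore the original stdout
    f.close()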
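One weakness of the pattern above is that sys.stdout stays redirected if the code in the middle raises an exception. A minimal sketch of a safer variant, assuming Python 3 and using hypothetical names (run_with_stdout_to_file, report) and a temporary output path, wraps the redirected section in try/finally:

```python
import os
import sys
import tempfile


def run_with_stdout_to_file(path, func):
    """Call func() while print output goes to path, then restore sys.stdout."""
    origin = sys.stdout
    with open(path, 'w') as f:
        sys.stdout = f
        try:
            func()
        finally:
            sys.stdout = origin  # restored even if func() raises


def report():
    # Stand-in for the code that uses print.
    print('first line')
    print('second line')


out_path = os.path.join(tempfile.mkdtemp(), 'file.txt')
run_with_stdout_to_file(out_path, report)

# Read the file back to confirm the output was captured.
with open(out_path) as f:
    captured = f.read()
print(repr(captured))
```

In Python 3.4 and later, contextlib.redirect_stdout from the standard library offers the same behaviour as a context manager.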