Scraping the Chinese Medicinal Herb Atlas Pages of Yaozh (药智网) with Python

Source: Internet | Editor: 程序博客网 | Time: 2024/04/27 01:02

This time I learned Python's BeautifulSoup module and used it to scrape Yaozh. URL: http://db.yaozh.com/tupu?p=
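The trailing p= in the URL is the page number parameter. As a minimal sketch, written for Python 3's urllib.request rather than the Python 2 urllib2 used in this post, the request for one page could be built like this without sending it. The helper name build_request is a hypothetical name chosen for illustration.

```python
from urllib.request import Request

# URL from the post; the page number is appended to it.
BASE_URL = 'http://db.yaozh.com/tupu?p='


def build_request(page_index):
    """Build (but do not send) the request for one listing page.

    Hypothetical helper for illustration only.
    """
    # The HTTP header name is 'User-Agent', spelled with a hyphen.
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    return Request(BASE_URL + str(page_index), headers=headers)


req = build_request(3)
print(req.full_url)
```

Passing req to urllib.request.urlopen() would then actually fetch the page.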

First, install BeautifulSoup, and pay attention to the version: beautifulsoup4-4.4.1 would not install for me at first, but switching to beautifulsoup4-4.2.0 worked.

Here is the code I wrote to scrape Yaozh (Python 2, since it uses urllib2):

#coding=utf-8
from bs4 import BeautifulSoup
import urllib2


class ZYC:
    def __init__(self):
        self.user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
        # The header name is 'User-Agent' (hyphen, not underscore)
        self.headers = {'User-Agent': self.user_agent}

    # Fetch the page source
    def getHtml(self, pageIndex):
        try:
            url = 'http://db.yaozh.com/tupu?p=' + str(pageIndex)
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            html = response.read()
            return html
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u'loading error', e.reason
            return None

    # Collect the rows from every page
    def getPage(self):
        # Write the column titles first
        # (No. | herb name | atlas source | page | view atlas)
        f = open('zyc.txt', 'a+')
        f.write("序号|中药材名称|图谱来源|页码|查看图谱|" + "\n")
        for i in range(1, 11):
            html = self.getHtml(i)
            if html is None:
                # Skip pages that failed to load
                continue
            soup = BeautifulSoup(html)
            SJ = soup.find_all('tr')
            # Drop the header row repeated on every page; slicing is safe
            # even on an empty list, so no len() check is needed first
            SJ = SJ[1:]
            for item in SJ:
                f.write(item.get_text('|', strip=True).encode('utf-8') + '\n')
        f.close()


spider = ZYC()
spider.getPage()

This uses BeautifulSoup's find_all() and get_text().
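To make those two calls concrete, here is a small sketch on a made-up table. The HTML and its values are invented for illustration and are not taken from the site; Python 3 with bs4 and the built-in html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for one listing page.
html = """
<table>
  <tr><th>No.</th><th>Name</th><th>Page</th></tr>
  <tr><td>1</td><td>Herb A</td><td>12</td></tr>
  <tr><td>2</td><td>Herb B</td><td>34</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all('tr') returns every <tr> element, header row included.
rows = soup.find_all('tr')

# get_text('|', strip=True) joins the stripped text pieces of a row with '|'.
# rows[1:] skips the header row.
lines = [row.get_text('|', strip=True) for row in rows[1:]]
print(lines)
```

Each entry in lines is one table row flattened into a single pipe-separated string, which is the form written to zyc.txt.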

The following version is adapted from (link):

    def getSJ(self):
        for s in range(1,11):
            html=self.getHtml(s)
            soup=BeautifulSoup(html)
            pageSJ=soup.find_all('tr')
            # Write the header row once, from the first page only
            if s==1:
                for item in pageSJ[0]:
                    if item not in ['\n','\t',' ']:
                        with open('SJ.txt','a') as f:
                            f.write(item.get_text(strip=True).encode('utf-8')+'|')
            f=open('SJ.txt','a')
            # Write the data rows cell by cell, one line per row
            for i in pageSJ[1:]:
                f.write('\n')
                for item in i:
                    if item not in ['\n','\t',' ']:
                        # if item==None:  # fill empty cells with "None"
                        #     f.write('None'+'|')
                        # else:
                        f.write(item.get_text(strip=True).encode('utf-8')+'|')
            f.close()

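The page I scraped has no empty cells, so my code works. If a row did contain an empty cell, the second version would be the better choice: it writes one field per cell, whereas calling get_text('|', strip=True) on the whole row drops empty cells and shifts the remaining columns. I am still working on this problem.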
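The difference shows up as soon as a cell is empty. Here is a hedged sketch on invented markup, assuming Python 3 with bs4. The row_to_fields helper and the 'None' placeholder are assumptions for illustration and are not part of either script above.

```python
from bs4 import BeautifulSoup

# Invented row whose middle cell is empty.
html = "<table><tr><td>1</td><td></td><td>12</td></tr></table>"
row = BeautifulSoup(html, 'html.parser').find('tr')

# Whole-row extraction: the empty cell contributes no text, so it vanishes.
joined = row.get_text('|', strip=True)


def row_to_fields(tr, placeholder='None'):
    """Return one field per cell, substituting a placeholder for empty cells.

    Hypothetical helper for illustration only.
    """
    cells = tr.find_all(['td', 'th'])
    return [cell.get_text(strip=True) or placeholder for cell in cells]


fields = row_to_fields(row)
print(joined)            # two fields: the empty cell was skipped
print('|'.join(fields))  # three fields: the empty cell is kept as 'None'
```

Going cell by cell keeps the column count stable, which matters if the output file is later split on the '|' separator.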

0 0