A simple Python example: scraping string content from a web page and saving it

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 09:41

I recently wanted to try out Python's crawling libraries, so I picked a page that contains nothing but text to scrape. The URL is:

http://mobilecdn.kugou.com/api/v3/special/song?plat=0&page=1&pagesize=-1&version=7993&with_res_tag=1&specialid=26430

Opening it shows song names, hashes, and other information. The goal is to store each entry in a file as hash|filename. Here is the code first:


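# coding=utf-8
# Written for Python 3: urllib.request.urlopen replaces Python 2's urllib.urlopen.
import os
import re
import urllib.request


def getHtml(url):
    # Download the page and decode the response body into a str.
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    return html


def getHash(html):
    # First pass: write every hash to a temporary file, one "hash|" per line.
    reg = r'"hash":"(.+?)",'
    has = re.compile(reg)
    hashlist = re.findall(has, html)
    with open('1.txt', 'w', encoding='utf-8', newline='') as f:
        for has in hashlist:
            f.write(has + "|" + "\r\n")


def getName(html):
    # Second pass: append the n-th filename to the n-th "hash|" line.
    reg = r'"filename":"(.+?)",'
    name = re.compile(reg)
    namelist = re.findall(name, html)
    with open('1.txt', 'r', encoding='utf-8', newline='') as fr:
        with open('2.txt', 'w', encoding='utf-8', newline='') as fw:
            for name in namelist:
                # Read exactly one line from 1.txt for each filename.
                for l in fr:
                    fw.write(l.replace('\r\n', name + '\r\n'))
                    break


html = getHtml("http://mobilecdn.kugou.com/api/v3/special/song?plat=0&page=1&pagesize=-1&version=7993&with_res_tag=1&specialid=26430")
getHash(html)
getName(html)
os.remove('1.txt')  # Remove the temporary file; the result is in 2.txt.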

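It is fairly simple: fetch the page, run two regular-expression passes over it (one for the hashes, one for the filenames), and save the results to a txt file.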
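The two regex passes can also be paired in memory, which avoids the intermediate 1.txt file. The following is only a sketch of that idea, not the approach used above: `pair_lines`, `save_lines`, and the `SAMPLE` string are names and data made up for illustration, and the sample only imitates the two fields the regexes target rather than reproducing the real response. It assumes hashes and filenames appear in the same order and in equal numbers.

```python
import re

# Hypothetical sample shaped like the fields the regexes look for;
# the real response from the URL is not reproduced here.
SAMPLE = (
    '{"hash":"abc123","filename":"Artist A - Song One","size":1},'
    '{"hash":"def456","filename":"Artist B - Song Two","size":2}'
)

# Same patterns as in the script above.
HASH_RE = re.compile(r'"hash":"(.+?)",')
NAME_RE = re.compile(r'"filename":"(.+?)",')


def pair_lines(html):
    """Return 'hash|filename' strings, pairing the two match lists in order."""
    hashes = HASH_RE.findall(html)
    names = NAME_RE.findall(html)
    # zip stops at the shorter list, so unmatched trailing items are dropped.
    return [h + "|" + n for h, n in zip(hashes, names)]


def save_lines(lines, path):
    """Write one record per line with a CRLF ending, as the script above does."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for line in lines:
            f.write(line + "\r\n")


if __name__ == "__main__":
    for line in pair_lines(SAMPLE):
        print(line)
```

With the downloaded text in `html`, calling `save_lines(pair_lines(html), '2.txt')` would produce the same kind of file in one step, under the ordering assumption stated above.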
