Python Day 2: Web Crawler

Source: Internet  Publisher: 网络接入设备  Editor: 程序博客网  Time: 2024/06/01 08:20

Day two of learning Python. The material I'm following is from http://blog.csdn.net/lyjamare/article/details/17006027

# -*- coding: cp936 -*-
#http://movie.douban.com/tag/%E5%8A%A8%E4%BD%9C?start=0&type=T
import urllib2
import re
import sys

# Get the current system's filesystem encoding
type = sys.getfilesystemencoding()
j = 0
url = 'http://tieba.baidu.com/f?kw=%D1%F8%D5%FD%D6%D0%D1%A7'
content = urllib2.urlopen(url).read()
match = re.findall(r' <a .*?class="j_th_tit">(.*?)</a>', content)
for i in range(0,2000):  # raises IndexError once i reaches len(match)
    print match[i]
print len(match)
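
The loop in the script indexes match for every i up to 2000, which fails as soon as the list holds fewer items. A minimal sketch of two ways around that, written for Python 3 rather than the Python 2.7 used in this post: iterate over the list directly, or cap the count with a slice. The sample HTML and the extract_titles helper are made up for illustration, and the pattern is a slightly stricter variant of the one in the script.

```python
import re


def extract_titles(html):
    """Return the text of every link that carries the j_th_tit class."""
    return re.findall(r'<a [^>]*class="j_th_tit"[^>]*>(.*?)</a>', html)


# Hypothetical stand-in for the downloaded page.
sample_html = (
    '<a href="/p/1" class="j_th_tit">Pinned post</a>\n'
    '<a href="/p/2" class="j_th_tit">Another post</a>\n'
)

match = extract_titles(sample_html)

# Iterating over the list itself can never run past its end.
for title in match:
    print(title)

# A slice caps the count at 2000 without raising when there are fewer items.
first_titles = match[:2000]
print(len(first_titles))
```

Either form keeps working whether the page yields one title or several thousand.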

I typed the tutorial code out myself, and that gave me the idea of fetching the post titles from a Tieba forum.

In the end, though, I only got the title of the pinned post.

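Looking into it, the likely cause is that the URL is fetched once and never reassigned, so only a single page is ever read. On top of that, the loop indexes match up to 2000 no matter how many titles were found, which is what raises the IndexError in the run below. I'll keep working on it today.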
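
The idea of reassigning the URL can be sketched as a loop that builds one URL per page and collects the titles from each. This is a Python 3 sketch, not the tutorial's code. It assumes Tieba paginates through a pn query parameter in steps of 50, which this post does not confirm; page_urls, crawl, and fake_fetch are hypothetical names, and fake_fetch replaces the real download so the example runs offline.

```python
import re

BASE_URL = 'http://tieba.baidu.com/f?kw=%D1%F8%D5%FD%D6%D0%D1%A7'
# Same pattern as in the script from this post.
TITLE_RE = re.compile(r' <a .*?class="j_th_tit">(.*?)</a>')
POSTS_PER_PAGE = 50  # assumption: the pn offset advances by 50 per page


def page_urls(base_url, pages):
    """Yield one URL per page, with a new pn offset each time."""
    for page in range(pages):
        yield '%s&pn=%d' % (base_url, page * POSTS_PER_PAGE)


def crawl(fetch, base_url, pages):
    """Collect titles from several pages; fetch(url) must return HTML text."""
    titles = []
    for url in page_urls(base_url, pages):
        titles.extend(TITLE_RE.findall(fetch(url)))
    return titles


def fake_fetch(url):
    """Offline stand-in for the download: one fake post per page."""
    offset = url.rsplit('=', 1)[1]
    return '<li> <a href="/p/1" class="j_th_tit">post at offset %s</a></li>' % offset


titles = crawl(fake_fetch, BASE_URL, 3)
for title in titles:
    print(title)
```

For a real run, fetch could wrap urllib.request.urlopen(url).read() plus a decode step; the page encoding would need to be checked first.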

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
★★▂▃▄↑ 养正中学吧欢迎你 ↑▄▃▂ ★★

Traceback (most recent call last):
  File "G:\pythonCode\crawler1.0.py", line 14, in <module>
    print match[i]
IndexError: list index out of range
>>>