A simple little web crawler in Python

Source: Internet  Editor: 程序博客网  Time: 2024/04/28 13:15

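It took me a bit over an hour this evening to finally get this working... I ran into countless problems along the way, mostly because of the Python version: a lot of the crawler tutorials online are written for Python 2, and Python 3 feels like it changed a great deal compared with Python 2 =.=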


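In Python 3, urllib2 no longer goes by that name. Its functionality now lives in urllib.request (with the exception classes in urllib.error).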

Also, urlopen(...).read() no longer returns a string but bytes, so you need to append decode('utf-8') to the result. This one took me a long time to track down =.=

import urllib.request

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()            # bytes in Python 3
    return html.decode('utf-8')   # convert to str
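As a sketch of a slightly more forgiving variant of the function above: instead of hard-coding 'utf-8', it asks the response headers for the charset and only falls back to UTF-8 when none is declared. The names decode_body and get_html are made up for this illustration, and errors="replace" is an assumption about how you would want undecodable bytes handled.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes to str.

    Falls back to UTF-8 when no charset is given, and replaces
    undecodable bytes instead of raising UnicodeDecodeError.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def get_html(url):
    """Fetch a page and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as page:
        # The charset from the Content-Type header, or None if absent.
        charset = page.headers.get_content_charset()
        return decode_body(page.read(), charset)
```

Splitting the decoding into its own function keeps the bytes-to-str step testable without touching the network.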

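Next came the regular expression. I only meant to scrape the images from a Tieba thread, but the regex took a lot of tweaking before it was even passable.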

The main problem: I only wanted images ending in .jpg, and I did not expect to run into a special case like this one... what a pain.


http://tb1.bdstatic.com/tb/cms/img/tieba_index_banner960x90.png"/></a></div>        <div id="container" class="l_container  "><div class="content clearfix"><div class="card_top_wrap clearfix card_top_theme2 " ><div class="card_top_right">    <div class="sign_mod_bright" id="sign_mod"><div class="sign_tip_container"><div class="j_succ_info sign_succ1" style="display:none"><div class="sign_tip_bdwrap clearfix"><div class="sign_tip_bd_arr"></div><div class="sign_tip_main"><div class="sign_succ_calendar"><div class="sign_succ_calendar_title"><div class="calendar_title_month clearfix"><div class="calendar_month_next j_calendar_month_next"> </div><div class="calendar_month_prev j_calendar_month_prev"> </div><div class="calendar_month_span j_calendar_month"> </div></div></div><table class="sign_succ_table "  ><thead align="center"><tr class="sign_succ_canlerdar_head"><td>日</td><td>一</td><td>二</td><td>三</td><td>四</td><td>五</td><td>六</td></tr></thead><tbody align="center" class="sign_succ_canlerdar_days j_canlerdar_days"><tr><td class="j_1_0"> </td><td class="j_1_1"> </td><td class="j_1_2"> </td><td class="j_1_3"> </td><td class="j_1_4"> </td><td class="j_1_5"> </td><td class="j_1_6"> </td></tr><tr><td class="j_2_0"> </td><td class="j_2_1"> </td><td class="j_2_2"> </td><td class="j_2_3"> </td><td class="j_2_4"> </td><td class="j_2_5"> </td><td class="j_2_6"> </td></tr><tr><td class="j_3_0"> </td><td class="j_3_1"> </td><td class="j_3_2"> </td><td class="j_3_3"> </td><td class="j_3_4"> </td><td class="j_3_5"> </td><td class="j_3_6"> </td></tr><tr><td class="j_4_0"> </td><td class="j_4_1"> </td><td class="j_4_2"> </td><td class="j_4_3"> </td><td class="j_4_4"> </td><td class="j_4_5"> </td><td class="j_4_6"> </td></tr><tr class="j_5" style="display:none"><td class="j_5_0"> </td><td class="j_5_1"> </td><td class="j_5_2"> </td><td class="j_5_3"> </td><td class="j_5_4"> </td><td class="j_5_5"> </td><td class="j_5_6"> </td></tr><tr class="j_6" style="display:none"><td 
class="j_6_0"> </td><td class="j_6_1"> </td><td class="j_6_2"> </td><td class="j_6_3"> </td><td class="j_6_4"> </td><td class="j_6_5"> </td><td class="j_6_6"> </td></tr></tbody></table></div><div class="sign_tip_boards"><div class="sign_tip_board sign_tip_board_urank j_sign_ad_mobi"><div class="sign_tip_board_ico"></div><p>签到排名:今日本吧第<span class="sign_index_num j_signin_index"></span>个签到,</p><p><span class="j_succ_text">本吧因你更精彩,明天继续来努力!</span></p></div><div class="sign_tip_board sign_tip_board_barrank"><div class="sign_tip_board_ico"></div>                        <p>本吧签到人数:0</p></div></div></div><div class="sign_tip_aside">                <div class="sign_tip_sbox sign_tip_sbox_first sign_tip_sbox_1key"><div class="sign_tip_sbox_hd">一键签到</div><div class="sign_tip_sbox_bd"><div class="sign_tip_sbox_cnt"><a class="sign_tip_sbox_card j_sign_tip_1key_icon sign_tip_sbox_card_pencil" href="/tbmall/tshow?tab=detail" target="_blank"></a><div class="sign_tip_sbox_txt">可签<span class="orange_text">7</span>级以上的吧<span class="orange_text">50</span>个</div><div class="sign_tip_sbox_btn"><a href="/home/main?id=#stipsign" target="_blank" class="ui_btn ui_btn_sub_s"><span><em><b class="sign_crown sign_crown_pencil" title="无瑕的T秀勋章"></b>一键签到</em></span></a></div></div></div></div>                <div class="sign_tip_sbox sign_tip_sbox_fixsign"><div class="sign_tip_sbox_hd sign_tip_sbox_hd_inf j_need_rpln_wrap">本月漏签<span class="j_lack_sign_monthly_count sign_num">0</span>次!</div><div class="sign_tip_sbox_bd"><div class="sign_tip_sbox_cnt"><a href="/tbmall/propslist?category=108" class="sign_tip_sbox_card" target="_blank"><span class="sign_num"><span class="j_rpln_card_count">0</span></span></a><div class="sign_tip_sbox_txt">成为超级会员,赠 送8张补签卡</div><div class="sign_tip_sbox_btn"><a href="#" class="ui_btn ui_btn_sub_s j_lack_sign_monthly_help" target="_blank"><span><em>如何使用?</em></span></a><div class="lack_sign_monthly_tip_wrap"><div class="ui_card_wrap lack_sign_monthly_tip_card 
j_lack_sign_monthly_tip_card" style="display:none;"><div class="ui_card_content "><div class="time_gift_tip">点击日历上漏签日期,即可进行<span class="strongerText">补签</span>。</div></div><span class="arrow ui_white_down" style="left:48%;"></span></div></div></div></div></div></div><div class="sign_tip_sbox sign_tip_sbox_chainsign"><div class="sign_tip_sbox_hd sign_tip_sbox_hd_inf">连续签到:<span class="sign_num j_sign_succ_keep"></span>天  累计签到:<span class="sign_num j_sign_succ_count"></span>天</div><div class="sign_tip_sbox_bd"><div class="sign_tip_sbox_cnt"><a href="/tbmall/propslist?category=108" class="sign_tip_sbox_card" target="_blank"><span class="sign_num"><span class="j_sign_chainsign_num">0</span></span></a><div class="sign_tip_sbox_txt">超级会员单次开通12个月以上,赠送连续签到卡3张</div><div class="sign_tip_sbox_btn"><a href="#" class="ui_btn ui_btn_sub_s j_cont_sign_card" target="_blank"><span><em>使用连续签到卡</em></span></a></div></div></div></div><div class="sign_tip_sbox sign_tip_sbox_last sign_tip_sbox_rights"><div class="sign_tip_sbox_bd j_sign_rights"><div class="sign_rights_display clearfix"><div class="sign_rights_icon j_sign_rights_icon rights_1"></div><div class="sign_rights_icon j_sign_rights_icon rights_2"></div><div class="sign_rights_icon j_sign_rights_icon rights_3"></div><div class="sign_rights_icon j_sign_rights_icon rights_4"></div><div class="sign_rights_icon j_sign_rights_icon rights_5"></div><span class="split_line"></span><a href="/f/like/level?kw=%E6%81%B6%E9%AD%94&ie=utf-8&lv_t=lv_nav_who" class="balv_help" title="签到规则" target="_blank"></a></div></div></div></div>            </div></div></div><div id="signstar_wrapper" class="j_sign_box sign_box_bright" ><a href="#" onclick="return false" data-dw="5" tabindex="3" title="签到" class="j_signbtn sign_btn_bright" ><span class="sign_today_date">02月05日</span><span class="sign_month_lack_days">漏签<span class="j_sign_month_lack_days">0</span>天</span></a></div>        </div></div><div class="card_top  clearfix">        <div 
class="card_head "><a href="/f?kw=%E6%81%B6%E9%AD%94&ie=utf-8">            <img class="card_head_img" src="http://m.tiebaimg.com/timg?wapp&quality=80&size=b150_150&subsize=20480&cut_x=0&cut_w=0&cut_y=0&cut_h=0&sec=1369815402&srctrace&di=8e019dfa8f5118e54889723078bf8ddc&wh_rate=null&src=http%3A%2F%2Fimgsrc.baidu.com%2Fforum%2Fpic%2Fitem%2F9922720e0cf3d7caf096bf27f31fbe096b63a97b.jpg

Damn, it really does start with src= and end with jpg...

This is the original regular expression:

r = r'src="(.*?\.jpg)"'
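A small sketch of why this pattern can swallow a huge chunk of HTML: `.*?` is lazy, but it is still allowed to run past the closing quote of one src attribute and keep going until the first `.jpg"` it finds. The HTML snippet and the example.com URLs below are made up for illustration, and the stricter `[^"]*?` pattern is one possible fix, not the one used in this post.

```python
import re

# A made-up page: the first image is a .png, the second a .jpg.
html = (
    '<img src="http://example.com/banner.png"/>'
    '<div class="x">'
    '<img src="http://example.com/photo.jpg">'
)

# The original pattern: starts at the first src=" and lazily expands
# until it reaches .jpg", crossing tags and attributes on the way.
old_matches = re.findall(r'src="(.*?\.jpg)"', html)

# A stricter pattern: [^"] cannot cross a quote, so each match stays
# inside a single attribute value.
strict_matches = re.findall(r'src="([^"]*?\.jpg)"', html)

print(old_matches)
print(strict_matches)
```

With this input the old pattern returns a single string that begins at banner.png and ends at photo.jpg, while the stricter one returns only the photo.jpg URL.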

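After the change it is just about acceptable. The one blemish is that what comes back is two-dimensional: findall returns a list of tuples, where the first element of each tuple is the image URL and the second is the extension (jpg, gif or png).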
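The tuples appear because the pattern has two capturing groups, and re.findall returns one tuple per match in that case. A sketch of the difference, using a made-up snippet: turning the inner group into a non-capturing group `(?:...)` gives back plain strings. The `[^"]*?` part and the example.com URLs are assumptions for this illustration.

```python
import re

html = (
    '<img src="http://example.com/a.jpg">'
    '<img src="http://example.com/b.png">'
)

# Two capturing groups: each result is a (url, extension) tuple.
with_groups = re.findall(r'src="([^"]*?\.(jpg|png|gif))"', html)

# Inner group made non-capturing: each result is just the URL string.
flat = re.findall(r'src="([^"]*?\.(?:jpg|png|gif))"', html)

print(with_groups)  # [('http://example.com/a.jpg', 'jpg'), ('http://example.com/b.png', 'png')]
print(flat)         # ['http://example.com/a.jpg', 'http://example.com/b.png']
```

Keeping the second group is not wrong, though: the captured extension is handy if you want to save each file under its real type.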

Full code:

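import re
import urllib.request

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html.decode('utf-8')

url = "http://tieba.baidu.com/p/3665883057"
r = r'src="(.*?\.(jpg|png|gif))"'
com = re.compile(r)
html = getHtml(url)
ans = com.findall(html)
num = 1
for img in ans:
    # img is (image URL, extension). Download it; the second argument must
    # include a file name, not just a directory, or an error is raised.
    # The captured extension is used so png/gif files are not saved as .jpg.
    urllib.request.urlretrieve(img[0], 'G:/lala/%d.%s' % (num, img[1]))
    num += 1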
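A sketch of a slightly more defensive download step, under a few assumptions: it creates the target directory if it is missing, takes the file extension from the URL itself (defaulting to .jpg), and skips an image that fails to download instead of stopping the whole run. The names target_path and download_all are hypothetical helpers, not part of the script above.

```python
import os
import urllib.error
import urllib.request
from urllib.parse import urlparse


def target_path(directory, index, url):
    """Build e.g. <directory>/3.png from the URL's own extension."""
    # Only the path part of the URL is inspected, so a query string
    # such as ?x=1 does not end up in the extension.
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return os.path.join(directory, "%d%s" % (index, ext))


def download_all(urls, directory):
    """Download each URL into directory; return the paths that were saved."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        path = target_path(directory, index, url)
        try:
            urllib.request.urlretrieve(url, path)
        except (urllib.error.URLError, OSError, ValueError) as exc:
            # Network failure, unwritable path, or a malformed URL.
            print("skipped %s: %s" % (url, exc))
            continue
        saved.append(path)
    return saved
```

For the script above, a call could look like download_all([img[0] for img in ans], 'G:/lala').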

