Scraping content from the ONE website with Python

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 12:17

1. Setting up the Python environment

Install Homebrew

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install pip

First install easy_install:
curl https://bootstrap.pypa.io/ez_setup.py -o - | sudo python
Then:
sudo easy_install pip

Install virtualenv

pip install virtualenv

Install requests and beautifulsoup4

pip install requests beautifulsoup4
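
To confirm that both packages landed in the interpreter you intend to use, a quick check along these lines can help. This is a minimal sketch assuming Python 3; `is_installed` is a hypothetical helper name, and `bs4` is the import name of the beautifulsoup4 package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == '__main__':
    # 'bs4' is the import name of the beautifulsoup4 package.
    for name in ('requests', 'bs4'):
        print(name, 'OK' if is_installed(name) else 'missing')
```

If either name is reported as missing, the pip command was probably run against a different Python installation or outside the active virtualenv.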

2. Page analysis

For the page analysis, please see the source article; this post follows the original.
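
Since the analysis itself lives in the original article, here is only a rough sketch of the three fields that the code in section 3 reads from each page: the title text, the description meta tag, and the second img element. It assumes Python 3 and uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. `SAMPLE_HTML`, `PageFields` and `extract` are hypothetical names, and the sample markup is hand-written, not copied from the real site.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written sample; the real page markup may differ.
SAMPLE_HTML = """
<html><head>
<title>VOL.100 - ONE</title>
<meta name="description" content="A sample quote.">
</head><body>
<img src="/logo.png">
<img src="/photo-100.jpg">
</body></html>
"""


class PageFields(HTMLParser):
    """Collect the title text, the meta description and every img src."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.description = None
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('name') == 'description':
            self.description = attrs.get('content')
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept.
        if self._in_title:
            self.title += data


def extract(html):
    """Return the same three fields the scraper stores for one page."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {
        'title': parser.title,
        'content': parser.description,
        # Second image on the page; None when fewer than two are present.
        'imgUrl': parser.images[1] if len(parser.images) > 1 else None,
    }


if __name__ == '__main__':
    print(extract(SAMPLE_HTML))
```

Picking the second image by position is fragile: if the site adds or removes an image above the picture, the index has to change.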

3. Python code

import json
import time
from multiprocessing import Pool

import bs4
import requests

root_url = 'http://wufazhuce.com'


def get_url(num):
    return root_url + '/one/' + str(num)


def get_urls(num):
    # Issue pages 100 .. 100+num-1
    return [get_url(n) for n in range(100, 100 + num)]


def get_data(url):
    dataList = {}
    response = requests.get(url)
    if response.status_code != 200:
        return {'noValue': 'noValue'}
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    print(soup.title.string)
    # Characters 4-6 of the title are taken as the issue index
    dataList['index'] = soup.title.string[4:7]
    for meta in soup.select('meta'):
        if meta.get('name') == 'description':
            dataList['content'] = meta.get('content')
    # The second <img> on the page is used as the picture URL
    dataList['imgUrl'] = soup.find_all('img')[1]['src']
    return dataList


if __name__ == '__main__':
    pool = Pool(4)
    urls = get_urls(10)
    start = time.time()
    # Fetch the pages in parallel with 4 worker processes
    dataList = pool.map(get_data, urls)
    end = time.time()
    print('use:%.2f s' % (end - start))
    # Write the results to data.txt as a single JSON object
    with open('data.txt', 'w') as outfile:
        json.dump({'data': dataList}, outfile)
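
The script builds a list of issue URLs, hands it to a pool of workers, and gets back one dict per URL in the same order. The sketch below shows that pattern without touching the network. It assumes Python 3; `build_urls`, `fake_fetch` and `run` are hypothetical names, `fake_fetch` stands in for `get_data`, and a thread-backed pool from `multiprocessing.dummy` is swapped in for the process pool so the example runs anywhere.

```python
from multiprocessing.dummy import Pool  # thread-backed pool with the same map() API

ROOT_URL = 'http://wufazhuce.com'


def build_urls(start, count):
    """Return the issue URLs start .. start+count-1, mirroring get_urls()."""
    return [ROOT_URL + '/one/' + str(n) for n in range(start, start + count)]


def fake_fetch(url):
    """Hypothetical stand-in for get_data(): no network, just echoes the issue id."""
    return {'index': url.rsplit('/', 1)[-1], 'url': url}


def run(count=3, workers=2):
    """Map fake_fetch over the URLs; results come back in input order."""
    pool = Pool(workers)
    try:
        return pool.map(fake_fetch, build_urls(100, count))
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    for item in run():
        print(item)
```

Because `map` preserves input order, the n-th result always belongs to the n-th URL, regardless of which worker finishes first.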