Using Python

Source: Internet. Published by: 网络发展阶段. Editor: 程序博客网. Time: 2024/05/22 12:54

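Running commands in the Windows cmd prompt

For example, pip (or pip3) and python.

This requires that the PATH environment variable include the directory of the current Python installation.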
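The PATH requirement above can be checked from Python itself. The following is a minimal sketch using the standard library's shutil.which; the helper name find_on_path is made up for illustration.

```python
import shutil


def find_on_path(commands=("python", "pip", "pip3")):
    """Map each command name to its resolved executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, location in find_on_path().items():
        print(f"{name}: {location or 'not found on PATH'}")
```

If a command is reported as not found, the Python installation directory (and its Scripts subdirectory on Windows, where pip lives) is probably missing from PATH.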

Installing flask with pip3

Direct install

pip3 install flask

Using a proxy

pip --proxy <proxy> install <module>

(sudo) pip --proxy http://proxy.hell:3128 install flask

Upgrading pip

C:\Users\wwcheng>python -m pip --proxy http://cn-proxy.jp.oracle.com install --upgrade pip
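The same pip invocations can be assembled from a script. This is a rough sketch, not the only way to do it: the helper names build_pip_command and run_pip are invented for illustration, and the proxy URL in the usage comment is the placeholder from the example above. Using sys.executable with -m pip runs the pip that belongs to the interpreter executing the script.

```python
import subprocess
import sys


def build_pip_command(args, proxy=None):
    """Return the argument list for `python -m pip [--proxy URL] <args...>`."""
    command = [sys.executable, "-m", "pip"]
    if proxy:
        # --proxy is a general pip option, so it goes before the subcommand.
        command += ["--proxy", proxy]
    return command + list(args)


def run_pip(args, proxy=None):
    """Run pip in a child process; raises CalledProcessError if pip fails."""
    subprocess.run(build_pip_command(args, proxy), check=True)


# Usage (not executed here):
# run_pip(["install", "flask"], proxy="http://proxy.hell:3128")
# run_pip(["install", "--upgrade", "pip"])
```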

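Processing web pages with Beautiful Soup

Extracting data from a web page

Extracting the link address from <p class="seller"><a href="XXXXXX">name</a></p>

Suppose we have the following HTML and want to extract the URLs in it.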

<p class="seller">
    <a href="http://www.changjia1.com">厂家名字1</a>
</p>
<p class="seller">
    <a href="http://www.changjia2.com">厂家名字2</a>
</p>
<p class="seller">
    <a href="http://www.changjia3.com">厂家名字3</a>
</p>

Reference examples

1. http://stackoverflow.com/questions/25277517/using-beautiful-soup-4-to-scrape-urls-within-a-p-class-postbody-tag-and-save

2. http://stackoverflow.com/questions/21581147/extracting-scraping-text-from-a-href-inside-p-inside-div

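Python code

# snippet is a BeautifulSoup object built from the HTML above; logfile is a file opened for writing
for link in snippet.select('p.seller a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")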
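For completeness, here is a self-contained version of the loop above. It assumes the bs4 package is installed and uses the built-in html.parser backend. The function name extract_seller_links and the constant HTML are chosen for illustration.

```python
from bs4 import BeautifulSoup

HTML = """
<p class="seller"><a href="http://www.changjia1.com">厂家名字1</a></p>
<p class="seller"><a href="http://www.changjia2.com">厂家名字2</a></p>
<p class="seller"><a href="http://www.changjia3.com">厂家名字3</a></p>
"""


def extract_seller_links(html):
    """Return the href of every <a> nested inside a <p class="seller">."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector 'p.seller a' matches <a> descendants of <p class="seller">.
    return [a.get("href") for a in soup.select("p.seller a") if a.get("href")]


if __name__ == "__main__":
    for link in extract_seller_links(HTML):
        print(link)
```

To save the links instead of printing them, open a file with open(path, "w", encoding="utf-8") and write one link per line, as the original loop does with logfile.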

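Collecting merchant information from Meituan

Collecting merchants from category navigation results

Taking the Beijing automotive services category as an example, the merchant list is rendered lazily: entries are loaded progressively only as the user scrolls down the page.

The loaded data comes from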

data-async-params=

specifically

poiidList

So reading the ID list it contains yields the merchant URLs:

http://bj.meituan.com/shop/{id}
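Turning the ID list into merchant URLs might look like the sketch below. The exact layout of the data-async-params value is not shown here, so the code assumes only that it is JSON (possibly with JSON-encoded strings nested inside) containing a poiidList key somewhere. The sample value and the helper names find_poi_ids and shop_urls are hypothetical.

```python
import json

SHOP_URL_TEMPLATE = "http://bj.meituan.com/shop/{id}"


def find_poi_ids(value):
    """Search decoded JSON recursively for a 'poiidList' key and return its list."""
    if isinstance(value, str):
        # Assumption: nested payloads may themselves be JSON-encoded strings.
        try:
            decoded = json.loads(value)
        except ValueError:
            return []
        return find_poi_ids(decoded)
    if isinstance(value, dict):
        if isinstance(value.get("poiidList"), list):
            return value["poiidList"]
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return []
    for child in children:
        found = find_poi_ids(child)
        if found:
            return found
    return []


def shop_urls(poi_ids):
    """Build one merchant page URL per ID."""
    return [SHOP_URL_TEMPLATE.format(id=poi_id) for poi_id in poi_ids]


# Hypothetical attribute value, made up to exercise the helpers.
sample_params = json.dumps({"data": json.dumps({"poiidList": [1001, 1002]})})

if __name__ == "__main__":
    for url in shop_urls(find_poi_ids(sample_params)):
        print(url)
```

With Beautiful Soup, the attribute value itself can be read from a matching tag as tag["data-async-params"] before being passed to find_poi_ids.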

From there we can open an individual merchant's main page and collect the merchant's name, address and contact details.

Collecting merchant search results

The search URL looks like http://bj.meituan.com/shops/?w=<keyword>

The returned results page contains
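A Chinese keyword has to be percent-encoded before it goes into the query string. A minimal sketch with the standard library follows; the function name build_search_url is chosen for illustration.

```python
from urllib.parse import urlencode

SEARCH_BASE = "http://bj.meituan.com/shops/"


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8 in the 'w' parameter."""
    return SEARCH_BASE + "?" + urlencode({"w": keyword})


if __name__ == "__main__":
    print(build_search_url("汽车"))
```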

<div data-mtnode="G1"

<span class="shop-meta__address"

<span class="shop-meta__phone">

Collect these fields.

Pagination information
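Pulling the address and phone number out of a results page could be sketched as follows, assuming bs4 is installed. Only the data-mtnode="G1" container and the two span classes come from the page. The surrounding markup in SAMPLE_HTML, and the idea that each merchant's two spans share a parent element, are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only data-mtnode="G1" and the two span classes are from the real page.
SAMPLE_HTML = """
<div data-mtnode="G1">
  <div class="shop">
    <span class="shop-meta__address">Example Road 1</span>
    <span class="shop-meta__phone">010-00000000</span>
  </div>
  <div class="shop">
    <span class="shop-meta__address">Example Road 2</span>
  </div>
</div>
"""


def parse_search_results(html):
    """Return one dict per address span, paired with the phone span under the same parent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one('div[data-mtnode="G1"]')
    if container is None:
        container = soup  # fall back to searching the whole page
    results = []
    for address in container.select("span.shop-meta__address"):
        phone = address.parent.select_one("span.shop-meta__phone")
        results.append({
            "address": address.get_text(strip=True),
            "phone": phone.get_text(strip=True) if phone else None,
        })
    return results


if __name__ == "__main__":
    for shop in parse_search_results(SAMPLE_HTML):
        print(shop)
```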

<div class="paginator-wrapper" data-mod="zd">
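Following the pagination could start from a sketch like this one, assuming bs4 is installed. It assumes the paginator block contains ordinary <a href> links to the other result pages, which is not confirmed here. The sample markup and the function name extract_page_urls are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical paginator content; only the wrapper <div> attributes are from the real page.
SAMPLE_PAGINATOR = """
<div class="paginator-wrapper" data-mod="zd">
  <a href="/shops/page2?w=car">2</a>
  <a href="/shops/page3?w=car">3</a>
  <a href="/shops/page2?w=car">next</a>
</div>
"""


def extract_page_urls(html, base_url):
    """Return absolute URLs of the links inside the paginator, without duplicates, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for anchor in soup.select("div.paginator-wrapper a[href]"):
        url = urljoin(base_url, anchor["href"])  # resolve relative links
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    for page in extract_page_urls(SAMPLE_PAGINATOR, "http://bj.meituan.com/shops/?w=car"):
        print(page)
```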

0 0