Python 3.x crawler in practice: targeted scraping of supplier information from the Alibaba website
Source: Internet · Editor: 程序博客网 · Date: 2024/04/30 07:16
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import time

import pymongo
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

b = webdriver.Chrome()
wait = WebDriverWait(b, 10)
KEY_WORD = "建筑"  # search keyword: "construction"
URL = "https://www.XXXXXXXXX.com"
MONGO_URL = 'localhost'
MONGO_DB = 'albbs'
MONGO_TABLE = 'supplier'
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]


def search():
    """Open the home page, switch the search type to supplier, and submit the keyword."""
    try:
        b.get(URL)
        ul = b.find_element(By.CSS_SELECTOR,
                            "#masthead > div.ali-search.fd-right > div.searchtypeContainer > ul")
        ul.click()
        b.find_element(By.XPATH, "//*[@id='masthead']/div[2]/div[1]/ul/li[2]").click()
        keyword_input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#alisearch-keywords")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#alisearch-submit")))
        keyword_input.send_keys(KEY_WORD)
        submit.click()
        total = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#sw_mod_pagination_form > div > span")))
        get_url()
        time.sleep(2)
        return total.text
    except TimeoutException:
        return search()


def next_page(page_number):
    """Jump to the given result page and harvest the item links on it."""
    try:
        page_input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#jumpto")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#jump-sub")))
        page_input.clear()
        page_input.send_keys(page_number)
        submit.click()
        time.sleep(5)
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, "#sw_mod_pagination_content > div > span.page-cur"), str(page_number)))
        get_url()
        time.sleep(8)
    except TimeoutException:
        next_page(page_number)


def get_url():
    """Collect the detail-page links from the current result page."""
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#sw_mod_searchlist")))
    html = b.page_source
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all("a", class_="list-item-title-text"):
        try:
            get_information_url(a.attrs['href'])
        except Exception:
            continue


def save_to_mongo(results):
    try:
        if db[MONGO_TABLE].insert_one(results):
            print('存储成功')  # stored successfully
    except Exception:
        print('存储异常')  # storage error


def get_information_url(url):
    """Open a detail page, switch to the contact tab, and extract the contact information."""
    try:
        b.get(url)
        contactinfo = b.find_element(By.LINK_TEXT, "联系方式")  # "contact information" tab
        contactinfo.click()
        time.sleep(1)
        wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, "#site_content > div.grid-main > div > div > div > div.m-content")))
        html = b.page_source
        soup = BeautifulSoup(html, 'html.parser')
        company_name = soup.find("h4").text
        contactinfo_name = soup.find("a", class_="membername").text
        mobile_text = soup.find("dl", class_="m-mobilephone").text
        # Keep the phone number as a string; converting to int would drop formatting.
        mobile_phone = re.search(r'(\d+)', mobile_text).group(1)
        addr = soup.find("dd", class_="address").text
        print(company_name)
        print(contactinfo_name)
        print(mobile_phone)
        print(addr)
        contactinfos = {"公司名": company_name, "联系人": contactinfo_name,
                        "手机号码": mobile_phone, "地址": addr}
        save_to_mongo(contactinfos)
        time.sleep(8)
    except TimeoutException:
        get_information_url(url)


def main():
    total = search()
    total = int(re.search(r'(\d+)', total).group(1))
    print(total)
    for i in range(2, total + 1):
        next_page(i)


if __name__ == '__main__':
    main()

These are my own study notes.
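The parsing steps in `get_information_url` and `main` can be exercised without a browser or a MongoDB instance. Below is a minimal sketch that applies the same BeautifulSoup lookups and regexes to a hard-coded HTML fragment; the fragment and both helper functions are invented for illustration and only mirror the class names the crawler expects (`membername`, `m-mobilephone`, `address`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical contact-page fragment mirroring the class names the crawler selects on.
SAMPLE_HTML = """
<div class="m-content">
  <h4>某建筑材料有限公司</h4>
  <a class="membername">张三</a>
  <dl class="m-mobilephone"><dt>手机:</dt><dd>13800138000</dd></dl>
  <dd class="address">浙江省杭州市</dd>
</div>
"""

def parse_contact(html):
    """Extract company, contact name, phone, and address, as get_information_url does."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "公司名": soup.find("h4").text,
        "联系人": soup.find("a", class_="membername").text,
        "手机号码": re.search(r"(\d+)", soup.find("dl", class_="m-mobilephone").text).group(1),
        "地址": soup.find("dd", class_="address").text,
    }

def parse_total_pages(text):
    """Pull the page count out of pagination text such as '共100页', as main() does."""
    return int(re.search(r"(\d+)", text).group(1))

print(parse_contact(SAMPLE_HTML))
print(parse_total_pages("共100页"))
```

Testing the extraction logic on a static fragment like this makes selector changes on the live site much easier to diagnose than debugging through Selenium.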