Data Collection with Python by Example: crawl_and_parse
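This is an introduction to collecting data with Python, written as Jupyter notebooks. It covers crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material comes from an earlier course and has been ported to Python 3 on Windows. The code has been uploaded as a CSDN resource: ABC of data_collection.
For easier reading, the code is split across four blog posts.
Below is the Markdown file exported from the Jupyter notebook.
1. crawl_and_parse
Crawling and parsing HTML with Beautiful Soup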
- 寒小阳(hanxiaoyang.ml@gmail.com)
- 2016-08

    # Load modules
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

### Create a dataframe and display it, in preparation for the crawl below

    # Build a dictionary
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                'age': [42, 52, 36, 24, 73],
                'preTestScore': [4, 24, 31, 2, 3],
                'postTestScore': [25, 94, 57, 62, 70]}
    # Create the dataframe
    raw_df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
    # Display it
    raw_df
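
The parsing step further down looks for an element with class `dataframe`. That is the class pandas puts on the HTML table it generates for a DataFrame, which is also what a notebook displays. A minimal sketch to see this, using a shortened two-row frame (the reduced data is only for illustration):

```python
import pandas as pd

# A shortened version of the frame above, for illustration only.
sample_df = pd.DataFrame({"first_name": ["Jason", "Molly"], "age": [42, 52]})

# to_html() returns the same kind of <table> markup a notebook renders.
html = sample_df.to_html()

# The opening tag carries class="dataframe", which the scraper searches for.
print(html.splitlines()[0])
```

The header is one `<tr>` and each data row is another, which is why the scraping loop skips the first `<tr>`.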

    # URL
    url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'
    # Fetch the page content with requests
    r = requests.get(url)
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")
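
The request above does not check whether the download succeeded, and the `"lxml"` parser needs the separate lxml package. A hedged sketch of a slightly more defensive version follows; the helper names `fetch_html` and `make_soup` and the 10-second timeout are assumptions for illustration, not part of the original notebook:

```python
import requests
from bs4 import BeautifulSoup, FeatureNotFound


def fetch_html(url, timeout=10):
    """Download a page and fail loudly on HTTP errors (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return r.text


def make_soup(html):
    """Parse with lxml if it is installed, else fall back to the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


# Offline demonstration on an inline document, so no network access is needed.
sample = "<html><body><p class='greeting'>hello</p></body></html>"
soup_demo = make_soup(sample)
print(soup_demo.find(class_="greeting").string)
```

With a real page, `make_soup(fetch_html(url))` would replace the two calls in the cell above.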

### Parse the Beautiful Soup structure

    # Create five lists to store the scraped data in
    first_name = []
    last_name = []
    age = []
    preTestScore = []
    postTestScore = []

    # Get the first element that has class=dataframe
    table = soup.find(class_='dataframe')

    # Find all the <tr> tag pairs, skip the first one, then for each.
    for row in table.find_all('tr')[1:]:
        # Create a variable of all the <td> tag pairs in each <tr> tag pair,
        col = row.find_all('td')

        # Create a variable of the string inside 1st <td> tag pair,
        column_1 = col[0].string.strip()
        # and append it to first_name variable
        first_name.append(column_1)

        # Create a variable of the string inside 2nd <td> tag pair,
        column_2 = col[1].string.strip()
        # and append it to last_name variable
        last_name.append(column_2)

        # Create a variable of the string inside 3rd <td> tag pair,
        column_3 = col[2].string.strip()
        # and append it to age variable
        age.append(column_3)

        # Create a variable of the string inside 4th <td> tag pair,
        column_4 = col[3].string.strip()
        # and append it to preTestScore variable
        preTestScore.append(column_4)

        # Create a variable of the string inside 5th <td> tag pair,
        column_5 = col[4].string.strip()
        # and append it to postTestScore variable
        postTestScore.append(column_5)

    # Create a dictionary mapping column names to the scraped lists
    columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

    # Create a dataframe from the columns variable
    df = pd.DataFrame(columns)
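
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    in ()
         10
         11 # Find all the <tr> tag pairs, skip the first one, then for each.
    ---> 12 for row in table.find_all('tr')[1:]:
         13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
         14     col = row.find_all('td')

    AttributeError: 'NoneType' object has no attribute 'find_all'

The error means that `soup.find(class_='dataframe')` returned `None`: the page fetched from the URL contained no element with class `dataframe` when this cell was run, so `table` had no `find_all` method to call.

    # View the dataframe
    df
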
5 rows × 5 columns
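
The `AttributeError` shown above can be turned into a clearer failure by checking the result of `find` before looping. Here is a minimal sketch of that idea, run on HTML generated locally with `DataFrame.to_html()` so it does not depend on the remote page; the function name `table_to_dataframe` and its parameters are hypothetical, and `html.parser` is used only to avoid requiring lxml:

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html, columns, css_class="dataframe"):
    """Parse the first <table> with the given class into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        # find() returns None when nothing matches; report that explicitly.
        raise ValueError(f"no <table class={css_class!r}> found in the page")

    rows = []
    # Skip the header row, as the original loop does.
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Keep only rows that have one cell per expected column.
        if len(cells) == len(columns):
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


# Local stand-in for the remote page: a small frame rendered to HTML.
columns = ["first_name", "last_name", "age"]
source = pd.DataFrame(
    {"first_name": ["Jason", "Molly"],
     "last_name": ["Miller", "Jacobson"],
     "age": [42, 52]},
    columns=columns,
)
parsed = table_to_dataframe(source.to_html(), columns)
print(parsed)
```

The scraped values are strings, so a numeric column such as `age` would still need a conversion (for example with `pd.to_numeric`) before any arithmetic.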