Data collection with Python by example: crawl_and_parse


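These are Jupyter notebooks covering the basics of data collection with Python: crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material comes from an earlier course, and I ported it to Python 3 on Windows. The code has been uploaded to CSDN resources as "ABC of data_collection".
For easier reading, the code is split across four blog posts.
Below is the Markdown file exported from the Jupyter notebook.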

1.crawl_and_parse

Crawling and parsing HTML with Beautiful Soup

寒小阳(hanxiaoyang.ml@gmail.com)
2016-08

    # Load modules
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

### Create a dataframe and display it, in preparation for the crawl below

    # Build a dictionary
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                'age': [42, 52, 36, 24, 73],
                'preTestScore': [4, 24, 31, 2, 3],
                'postTestScore': [25, 94, 57, 62, 70]}
    # Create the dataframe
    raw_df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
    # Take a look at the result
    raw_df

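| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |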

### Download the HTML and create a Beautiful Soup object

    # URL to crawl
    url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'
    # Fetch the page content with requests
    r = requests.get(url)
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")
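Everything downstream depends on this request returning a usable page, so it can help to fail early when it does not. The following is a minimal sketch only: the helper name `fetch_html`, its `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    The name, the optional `session` argument, and the 10-second default
    are assumptions made for this sketch.
    """
    # Reuse a caller-supplied session if given, otherwise create one.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With such a helper, `BeautifulSoup(fetch_html(url), "lxml")` could stand in for the `requests.get(url)` call, leaving the rest of the notebook unchanged.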

### Parse the Beautiful Soup object

    # Create five lists to store the scraped data in
    first_name = []
    last_name = []
    age = []
    preTestScore = []
    postTestScore = []

    # Find the first element whose class is "dataframe"
    table = soup.find(class_='dataframe')

    # Find all the <tr> tag pairs, skip the first one, then for each.
    for row in table.find_all('tr')[1:]:
        # Create a variable of all the <td> tag pairs in each <tr> tag pair,
        col = row.find_all('td')

        # Take the string inside the 1st <td> and append it to first_name
        column_1 = col[0].string.strip()
        first_name.append(column_1)

        # Take the string inside the 2nd <td> and append it to last_name
        column_2 = col[1].string.strip()
        last_name.append(column_2)

        # Take the string inside the 3rd <td> and append it to age
        column_3 = col[2].string.strip()
        age.append(column_3)

        # Take the string inside the 4th <td> and append it to preTestScore
        column_4 = col[3].string.strip()
        preTestScore.append(column_4)

        # Take the string inside the 5th <td> and append it to postTestScore
        column_5 = col[4].string.strip()
        postTestScore.append(column_5)

    # Collect the lists into a dictionary keyed by column name
    columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}

    # Create a dataframe from the columns dictionary
    df = pd.DataFrame(columns)

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
     in ()
         10
         11 # Find all the <tr> tag pairs, skip the first one, then for each.
    ---> 12 for row in table.find_all('tr')[1:]:
         13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
         14     col = row.find_all('td')

    AttributeError: 'NoneType' object has no attribute 'find_all'
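The traceback indicates that `soup.find(class_='dataframe')` returned `None`, meaning the downloaded page contained no element with class `dataframe`. One way to surface that more clearly is to check the result before iterating over it. The sketch below runs on a small inline HTML string rather than the live page; the helper name `parse_table`, the sample HTML, and the choice of the built-in `html.parser` backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input standing in for the downloaded page.
SAMPLE_HTML = """
<table class="dataframe">
  <tr><th></th><th>first_name</th><th>age</th></tr>
  <tr><th>0</th><td> Jason </td><td>42</td></tr>
  <tr><th>1</th><td>Molly</td><td>52</td></tr>
</table>
"""


def parse_table(html, css_class="dataframe"):
    """Return the <td> cells of each data row as a list of stripped strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        # Fail with a clear message instead of an AttributeError later on.
        raise ValueError(f"no <table class={css_class!r}> found in the page")
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        rows.append([cell.get_text(strip=True) for cell in row.find_all("td")])
    return rows


rows = parse_table(SAMPLE_HTML)
```

Collecting each row as a list also avoids one hand-written variable per column, at the cost of having to attach column names afterwards.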

    # View the dataframe
    df

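| | age | first_name | last_name | postTestScore | preTestScore |
|---|---|---|---|---|---|
| 0 | 42 | Jason | Miller | 25 | 4 |
| 1 | 52 | Molly | Jacobson | 94 | 24 |
| 2 | 36 | Tina | Ali | 57 | 31 |
| 3 | 24 | Jake | Milner | 62 | 2 |
| 4 | 73 | Amy | Cooze | 70 | 3 |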

5 rows × 5 columns
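Values scraped this way are all strings, because `.string.strip()` returns text. If the numeric columns are needed as numbers, one option is `pd.to_numeric`. This is a sketch on a two-row hypothetical stand-in for the scraped dataframe; the name `scraped` and its contents are assumptions, not output from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the scraped dataframe: every value is a string.
scraped = pd.DataFrame({
    "first_name": ["Jason", "Molly"],
    "age": ["42", "52"],
    "preTestScore": ["4", "24"],
    "postTestScore": ["25", "94"],
})

numeric_cols = ["age", "preTestScore", "postTestScore"]
# Convert each listed column from strings to numbers;
# a value that cannot be parsed raises an error by default.
scraped[numeric_cols] = scraped[numeric_cols].apply(pd.to_numeric)
```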

0 0