

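This is a Jupyter notebook covering the basics of collecting data with Python, including crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material was handed out in an earlier course; I have ported it to Python 3 on Windows. The code has been uploaded to CSDN resources: ABC of data_collection.
Below is the Markdown file exported from the Jupyter notebook code.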


Crawling and parsing HTML with Beautiful Soup

  • 寒小阳(hanxiaoyang.ml@gmail.com)
  • 2016-08

    # Load the modules
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

### Create a dataframe and display it, in preparation for the scraping step

    # Build a dictionary
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                'age': [42, 52, 36, 24, 73],
                'preTestScore': [4, 24, 31, 2, 3],
                'postTestScore': [25, 94, 57, 62, 70]}
    # Create the dataframe
    raw_df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
    # Take a look at the result
    raw_df

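
|   | first_name | last_name | age | preTestScore | postTestScore |
|---|------------|-----------|-----|--------------|---------------|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
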
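
The parsing step further down looks for an element with `class="dataframe"`. That class is what pandas puts on the `<table>` tag when it renders a dataframe as HTML, which is how a notebook displays it. A minimal sketch to see this, using a shortened two-row frame (the name `small_df` is made up for illustration):

```python
import pandas as pd

# A shortened version of the frame above; `small_df` is a hypothetical name.
small_df = pd.DataFrame(
    {"first_name": ["Jason", "Molly"], "age": [42, 52]},
    columns=["first_name", "age"],
)

# to_html() renders the frame as an HTML <table>; by default pandas
# adds class="dataframe" to the <table> tag.
html = small_df.to_html()

# Print the opening <table> tag to inspect its attributes.
print(html.splitlines()[0])
```

Whether a given web page keeps that class in the HTML it serves depends on the site, so it is worth checking the downloaded text before relying on it.
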
### Download the HTML and create a Beautiful Soup object

    # URL of the page to scrape
    url = 'http://nbviewer.ipython.org/github/HanXiaoyang/python_and_data_easy_examples/crawl_and_parse.ipynb'
    # Fetch the page content with requests
    r = requests.get(url)
    # Parse it with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")

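
The download above assumes the request succeeds and returns promptly. A slightly more defensive sketch is shown here; `fetch_soup` is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and the parser is swapped to the standard-library `"html.parser"` so the example does not depend on lxml:

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return a BeautifulSoup object (hypothetical helper)."""
    # The timeout stops the call from hanging forever on an unresponsive server.
    r = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page.
    r.raise_for_status()
    # "html.parser" ships with Python; the notebook itself uses "lxml".
    return BeautifulSoup(r.text, "html.parser")
```

It would be called as `soup = fetch_soup(url)` in place of the two separate steps.
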
### Parse the Beautiful Soup object

    # Create five lists to store the scraped data in
    first_name = []
    last_name = []
    age = []
    preTestScore = []
    postTestScore = []

    # Find the first element that has class="dataframe"
    table = soup.find(class_='dataframe')

    # Find all the <tr> tag pairs, skip the first one, then for each.
    for row in table.find_all('tr')[1:]:
        # Create a variable of all the <td> tag pairs in each <tr> tag pair,
        col = row.find_all('td')
        # Create a variable of the string inside 1st <td> tag pair,
        column_1 = col[0].string.strip()
        # and append it to first_name variable
        first_name.append(column_1)
        # Create a variable of the string inside 2nd <td> tag pair,
        column_2 = col[1].string.strip()
        # and append it to last_name variable
        last_name.append(column_2)
        # Create a variable of the string inside 3rd <td> tag pair,
        column_3 = col[2].string.strip()
        # and append it to age variable
        age.append(column_3)
        # Create a variable of the string inside 4th <td> tag pair,
        column_4 = col[3].string.strip()
        # and append it to preTestScore variable
        preTestScore.append(column_4)
        # Create a variable of the string inside 5th <td> tag pair,
        column_5 = col[4].string.strip()
        # and append it to postTestScore variable
        postTestScore.append(column_5)

    # Gather the lists into a dictionary keyed by column name
    columns = {'first_name': first_name, 'last_name': last_name, 'age': age, 'preTestScore': preTestScore, 'postTestScore': postTestScore}
    # Create a dataframe from the columns variable
    df = pd.DataFrame(columns)

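
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
         10 
         11 # Find all the <tr> tag pairs, skip the first one, then for each.
    ---> 12 for row in table.find_all('tr')[1:]:
         13     # Create a variable of all the <td> tag pairs in each <tr> tag pair,
         14     col = row.find_all('td')

    AttributeError: 'NoneType' object has no attribute 'find_all'

This error means that `soup.find(class_='dataframe')` returned `None`: the HTML that came back from `requests.get` contained no element with `class="dataframe"`, so there was no table to call `find_all` on. The dataframe displayed next presumably comes from a run in which the table was found.
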
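
One way to avoid the opaque `AttributeError` is to check the result of `find` before using it. The sketch below is an illustration, not the notebook's own code: `table_to_df` and `sample_html` are hypothetical names, the inline HTML stands in for a downloaded page, and `"html.parser"` is used instead of `"lxml"` to keep it dependency-light.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Inline sample HTML standing in for a downloaded page (an assumption for
# illustration; the real page may be structured differently).
sample_html = """
<table class="dataframe">
  <tr><th></th><th>first_name</th><th>age</th></tr>
  <tr><th>0</th><td>Jason</td><td>42</td></tr>
  <tr><th>1</th><td>Molly</td><td>52</td></tr>
</table>
"""


def table_to_df(html, column_names):
    """Parse the first class="dataframe" element in `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(class_="dataframe")
    if table is None:
        # find() returns None when nothing matches; fail with a clear message.
        raise ValueError('no element with class="dataframe" found')
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Keep only rows that have one cell per expected column.
        if len(cells) == len(column_names):
            rows.append(cells)
    return pd.DataFrame(rows, columns=column_names)


df_small = table_to_df(sample_html, ["first_name", "age"])
print(df_small)
```

Collecting each row as a list also replaces the five parallel `append` calls with a single loop, at the cost of naming the columns up front.
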
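
    # View the dataframe
    df

|   | age | first_name | last_name | postTestScore | preTestScore |
|---|-----|------------|-----------|---------------|--------------|
| 0 | 42 | Jason | Miller | 25 | 4 |
| 1 | 52 | Molly | Jacobson | 94 | 24 |
| 2 | 36 | Tina | Ali | 57 | 31 |
| 3 | 24 | Jake | Milner | 62 | 2 |
| 4 | 73 | Amy | Cooze | 70 | 3 |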

5 rows × 5 columns
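
Values read out of `<td>` tags are strings, so columns such as `age` arrive as text even though they look numeric. A minimal sketch of converting them, assuming a frame shaped like the one above (the name `scraped` is made up, and the sample values are copied from the first two rows):

```python
import pandas as pd

# Two rows copied from the output above, stored as strings the way the
# scraping loop collects them.
scraped = pd.DataFrame({
    "first_name": ["Jason", "Molly"],
    "age": ["42", "52"],
    "preTestScore": ["4", "24"],
    "postTestScore": ["25", "94"],
})

numeric_cols = ["age", "preTestScore", "postTestScore"]
# pd.to_numeric parses each string column into a numeric dtype.
scraped[numeric_cols] = scraped[numeric_cols].apply(pd.to_numeric)

# Inspect the column types, then use a numeric operation on one of them.
print(scraped.dtypes)
print(scraped["age"].mean())
```
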
