CS109 Lecture 7

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 13:22

Data Scraping

Sources

  • From a website
  • With an API

Copyright and Permission

  • Be careful and polite
  • Give credit
  • Care about media law
  • Don’t be evil
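One concrete way to be "careful and polite" is to honor a site's robots.txt before fetching pages. The following is a minimal sketch using the standard-library `urllib.robotparser`; the robots.txt content, the user-agent name `cs109-notes-bot`, and the example.com URLs are all made up for illustration, and a real scraper would download the file from the target site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it
# from the site's /robots.txt before fetching any pages.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS_TXT)
agent = "cs109-notes-bot"  # hypothetical user-agent name

# Ask whether each page may be fetched, and how long to wait between requests.
print(rules.can_fetch(agent, "http://example.com/index.html"))
print(rules.can_fetch(agent, "http://example.com/private/data.html"))
print(rules.crawl_delay(agent))
```

If `crawl_delay` returns a number, sleeping that many seconds between requests (for example with `time.sleep`) keeps the load on the server low.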

Useful Tags

<h1></h1>
<p></p>
<br>
<a href='url'>Link</a>

Useful Libraries for Scraping

  • urllib
  • BeautifulSoup
  • Pattern
  • lxml

Get Data From Website

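import bs4
from urllib.request import urlopen  # urllib2.urlopen in Python 2

url = 'url'  # placeholder: the address of the page to scrape
source = urlopen(url).read()
soup = bs4.BeautifulSoup(source, 'html.parser')

soup.find_all('a')       # find every <a></a> tag
tag = soup.find('a')     # first <a> tag only
tag.get('href')          # the link target of that tag

C = soup.find_all('p', {'class': 'Event'})  # <p> tags with class "Event"
t = C[0]
t.find_next_siblings()   # elements that follow t at the same level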
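To see what the two lookups above do (collecting `href` values from `<a>` tags and selecting `<p>` tags with class `Event`), here is a small sketch that runs without a network connection or third-party packages. It swaps BeautifulSoup for the standard-library `html.parser.HTMLParser`, so it is an illustration of the idea rather than the lecture's method. The sample HTML, the example.com links, and the class name `LinkAndEventParser` are made up.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for the downloaded source.
SAMPLE_HTML = """
<h1>Events</h1>
<p class="Event">Talk: <a href="http://example.com/talk">Details</a></p>
<p class="Event">Lab: <a href="http://example.com/lab">Details</a></p>
<p>Footer text</p>
"""


class LinkAndEventParser(HTMLParser):
    """Collect href values of <a> tags and the text of <p class="Event">."""

    def __init__(self):
        super().__init__()
        self.links = []         # every href found on an <a> tag
        self.events = []        # text content of each <p class="Event">
        self._in_event = False  # True while inside an Event paragraph

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        if tag == "p" and attrs.get("class") == "Event":
            self._in_event = True
            self.events.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_event = False

    def handle_data(self, data):
        # Text inside nested tags (such as the <a>) is included too.
        if self._in_event:
            self.events[-1] += data


parser = LinkAndEventParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.links)
print(parser.events)
```

BeautifulSoup expresses the same two queries in one call each, which is why it is usually the more convenient choice for real pages.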

Get Data With An API

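import json  # JavaScript Object Notation
import requests  # alternative to urlopen: requests.get(url).text
from urllib.request import urlopen  # urllib2.urlopen in Python 2

api_key = 'mykey'
url = 'url' + api_key  # placeholder: API endpoint followed by the key
source = urlopen(url).read()

# --- simple example ---
a = {'a': 1, 'b': 2}
s = json.dumps(a)    # dict -> JSON string
a2 = json.loads(s)   # JSON string -> dict
# ----------------------

dataDict = json.loads(source)  # parse the API response into a dict
dataDict.keys()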
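The API workflow above can be tried offline with a canned response. This sketch builds a request URL with `urllib.parse.urlencode` and parses a JSON string with `json.loads`. The endpoint `http://api.example.com/movies.json`, the `apikey` and `q` parameter names, the helper `build_url`, and the response contents are all hypothetical; a real API documents its own URL format and fields.

```python
import json
from urllib.parse import urlencode


def build_url(base_url, api_key, **params):
    """Append the API key and any extra parameters as a query string."""
    query = {"apikey": api_key}  # hypothetical parameter name
    query.update(params)
    return base_url + "?" + urlencode(query)


# Hypothetical JSON text standing in for what the API would return.
SAMPLE_RESPONSE = (
    '{"total": 2, "movies": ['
    '{"title": "A", "year": 2012}, '
    '{"title": "B", "year": 2013}]}'
)

url = build_url("http://api.example.com/movies.json", "mykey", q="toy")
data_dict = json.loads(SAMPLE_RESPONSE)  # JSON string -> dict
titles = [movie["title"] for movie in data_dict["movies"]]

print(url)
print(sorted(data_dict.keys()))
print(titles)
```

Using `urlencode` instead of plain string concatenation escapes special characters in parameter values, which matters once search terms contain spaces or symbols.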